Vetted Talent

Srinivas Sukka

Total 10+ years of experience developing enterprise data warehousing applications, with a focus on implementing end-to-end Snowflake data warehouse solutions.
  • Role

    Lead Data Engineer

  • Years of Experience

    10 years

Skillsets

  • Python - 2 Years
  • Data Modeling - 10 Years
  • ELT - 2 Years
  • Azure - 2 Years
  • S3 - 2 Years
  • ETL - 10 Years
  • NoSQL - 4 Years
  • SQL - 10 Years
  • ETL/ELT - 10 Years
  • dbt - 1 Year
  • Complex SQL Queries - 10 Years
  • Snowflake - 4 Years
  • Data Visualization - 2 Years
  • AWS - 2 Years
  • Technical Requirements
  • Data Integration
  • RBAC
  • Snowflake solutions
  • Performance Tuning
  • Team Leadership
  • Data Governance
  • NoSQL databases
  • Data Warehousing
  • SQL Queries

Vetted For

9 Skills
  • Roles & Skills
  • Results
  • Details
  • Senior Data Engineer with Snowflake (Remote) AI Screening
  • 73%
  • Skills assessed: Azure Synapse, Communication Skills, DevOps, CI/CD, ELT, Snowflake, Snowflake SQL, Azure Data Factory, Data Modelling
  • Score: 66/90

Professional Summary

10 Years
  • Lead Data Engineer

    UST
  • Platform Engineer

    UST

Applications & Tools Known

  • Snowflake
  • SQL Server
  • Oracle
  • AWS
  • Azure
  • Python
  • SQL
  • Unix
  • Informatica
  • Airflow
  • Tableau
  • Streamlit
  • SVN

Work History

10 Years

Lead Data Engineer

UST
    Led the IDS database migration using Azure Data Factory and Snowflake: built a configuration-based ETL solution, ADF pipelines, and JavaScript stored procedures.

Platform Engineer

UST
    Built infrastructure for the Snowflake migration: Snowflake domain creation, RBAC models, row access policies, and metadata archiving.

Achievements

  • All India Snowflake Hackathon Winner, January 2024

Major Projects

2 Projects

ESG Database Migration

    Migrated the existing ESG database from SQL Server to Snowflake using dbt, Airflow, and Snowflake.

Invest Data Store Migration

    Designed the process for migrating the IDS database from SQL Server to Snowflake using Azure Data Factory.

Education

  • Bachelor's degree, Computer Science

    Amrita Vishwa Vidyapeetham (2014)

Certifications

  • Oracle Database 12c: Advanced SQL

  • SQL Advanced

AI-interview Questions & Answers

Hello. My name is Srinivas, and I'm currently working as a Senior Data Engineer at UST Global. I have 10 years of experience building and developing ETL solutions for enterprises. I work with clients like Schroders, Equifax, and Logitech, helping them migrate their on-prem databases to Snowflake. Thank you.

So, to tune the performance of a query in Snowflake: a common mistake is running a small query on an extra-large warehouse. An extra-small warehouse can already load several files in parallel, so if your data isn't that big and you're using an extra-large warehouse, you're wasting capacity. The first technique I'd use is to make sure the warehouse is properly sized for the workload, while confirming loads still complete on time and meet their deadlines. For query optimization, I'd look at Snowflake's internal tables, the ACCOUNT_USAGE views such as QUERY_HISTORY and COPY_HISTORY, to understand how long a query takes and identify the blockers consuming the most time. Based on that, we can either improve the query, enable auto-scaling if the query is complex, or add more clusters when critical queries are queued because of bandwidth. Those account usage tables give a clear picture of how a query is performing and help a lot in identifying issues and improving query performance. Thank you.
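The account-usage lookup described above might look like this (`QUERY_HISTORY` is a real Snowflake view; the 7-day window and row limit are illustrative choices):

```sql
-- Find the slowest queries over the last 7 days, with the warehouse
-- size they ran on, to spot small queries on oversized warehouses.
SELECT query_id,
       warehouse_name,
       warehouse_size,
       total_elapsed_time / 1000 AS elapsed_seconds,
       bytes_scanned,
       query_text
FROM   snowflake.account_usage.query_history
WHERE  start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
ORDER  BY total_elapsed_time DESC
LIMIT  20;
```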

Yeah. So in my project, which is for asset management, we use a tool called dbt (data build tool). We can write singular tests or generic tests in SQL to check the data quality of the source and make sure the data refreshes complete within the expected timelines. So we write a lot of tests on the incoming data and verify everything is working as expected using dbt. And the good thing is that dbt is an ELT tool: it doesn't use external warehouses or servers, it uses the Snowflake warehouse itself to run your data quality tests, which are written in SQL, and makes sure all your tests pass. Thank you.
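A dbt singular test is just a SQL file under `tests/` that returns the failing rows; a minimal sketch of the refresh-timeliness check described above (the source and column names here are hypothetical):

```sql
-- tests/assert_recent_refresh.sql
-- dbt singular test: the test fails if this query returns any rows.
-- Here: flag source rows whose load timestamp is older than one day,
-- i.e. the refresh did not land within the expected timeline.
SELECT *
FROM   {{ source('raw', 'trades') }}
WHERE  loaded_at < DATEADD(day, -1, CURRENT_TIMESTAMP())
```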

Sure, there is a specific use case where we were migrating SQL Server data into Snowflake. There were about 15 billion records to load from on-premise to Snowflake. If you generate a single file for 15 billion records and move it to Snowflake, it takes a very long time because the size is huge; it was originally in JSON format, and we observed the file was also growing rapidly. The first thing we did was chunk the files into smaller pieces, so each file could be moved to Snowflake as soon as it was generated, which gave us parallel processing. On the Azure Data Factory side, the optimization we did was to avoid putting any conversions in ADF: instead we generated SQL already in line with the Snowflake table, so ADF just dumps the data into CSV files and doesn't do any validation on the raw data. That helped reduce the load times considerably. We also tried the same exercise with Parquet format, which works well with Snowflake, and that helped us both generate files faster and load them faster into Snowflake.
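The chunked Parquet load into Snowflake can be sketched as follows (the stage and table names are hypothetical; `MATCH_BY_COLUMN_NAME` is the real COPY option for loading Parquet by column name):

```sql
-- Load many small Parquet chunks in parallel from a stage.
-- Snowflake parallelizes COPY across files, so splitting the 15B
-- rows into many files lets the warehouse load them concurrently.
COPY INTO ids_raw.orders
FROM @ids_stage/orders/
FILE_FORMAT = (TYPE = PARQUET)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
ON_ERROR = 'ABORT_STATEMENT';
```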

Yeah. So there is no definitive guide to error handling in Snowflake SQL. People use tools such as dbt or SQLMesh to execute their SQL. But in general, what I feel is: if it's a simple query, you can write it directly as plain SQL, and that should be okay. If it's a complex query, put it inside a procedure, in JavaScript or Python, apply a proper try/catch, and log any errors to a table. Then if an error happens in the SQL in the future, you have a table where you can identify what the error was and in which procedure, which improves the error handling mechanism in your projects. That's one best practice I'd follow. The second one is to do a lot of data quality tests before loading into the integration layer: apply singular tests from dbt, make sure everything is properly loaded in the raw layer of your ETL pipeline, and do not run small queries on large warehouses. Mostly, you have to try to put everything in a try/catch; and if you can configure telemetry for your Snowflake procedures that gathers all the logs into a single central table, it can be queried very easily instead of keeping separate tables for each procedure. That would be another approach to error handling in Snowflake.
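The try/catch-and-log pattern described above could look like this in a Snowflake JavaScript procedure (the table, schema, and procedure names are hypothetical):

```sql
-- A Snowflake JavaScript procedure that wraps a statement in
-- try/catch and writes failures to a central error-log table.
CREATE OR REPLACE PROCEDURE load_orders()
RETURNS STRING
LANGUAGE JAVASCRIPT
AS
$$
try {
    snowflake.execute({sqlText:
        "INSERT INTO int_layer.orders SELECT * FROM raw_layer.orders"});
    return "OK";
} catch (err) {
    // Log the failure centrally instead of one table per procedure.
    snowflake.execute({
        sqlText: "INSERT INTO ops.proc_error_log(proc_name, error_msg, logged_at) " +
                 "SELECT ?, ?, CURRENT_TIMESTAMP()",
        binds: ["LOAD_ORDERS", err.message]
    });
    return "FAILED: " + err.message;
}
$$;
```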

So, monitoring in ADF: we set up a centralized logging mechanism. All the logs from the ADF pipelines are passed to Elasticsearch, and when something fails, an alert rule in Elasticsearch raises a ticket in ServiceNow. Basically, this is a one-time setup for the entire project: the logs from all the Data Factory pipelines are pushed into a central logging system, and you run a simple select (not a select star) on those logs in Elasticsearch to see if there were any failures for the current day, raising an alert if it finds any failure or error in the log files. That is the mechanism we followed in our project.

By the way, this question looks ambiguous, but I'll give it a try. It starts from total customer orders, and the business logic to perform the aggregation is supposed to be here. So: SELECT COUNT(*) INTO customers FROM customer. There is no such INTO here. We could instead assign a variable, say total_customer_count: wrap the SELECT statement in an EXECUTE IMMEDIATE, execute it, capture the result set into the variable, and pass that variable to the RETURN statement. The INTO here looks like a leftover from a SQL Server migration: SELECT ... INTO is a SQL Server keyword for materializing results into a new table, and that form is not present in Snowflake.
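As a point of reference, and a nuance to the answer above: Snowflake Scripting does support `SELECT ... INTO :variable` for capturing scalars; it is SQL Server's `SELECT ... INTO new_table` form that Snowflake lacks. A sketch of the count procedure, assuming a hypothetical `customer` table:

```sql
-- Snowflake Scripting: capture a scalar into a variable and return it.
CREATE OR REPLACE PROCEDURE total_customer_count()
RETURNS NUMBER
LANGUAGE SQL
AS
$$
DECLARE
    total NUMBER;
BEGIN
    -- SELECT ... INTO :variable works for scalar captures in
    -- Snowflake Scripting (unlike SQL Server's table-creating INTO).
    SELECT COUNT(*) INTO :total FROM customer;
    RETURN total;
END;
$$;
```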

So this model is a view: SELECT * FROM ref('raw_data') WHERE status = 'active' AND created_at is within the last 30 days. He's reading the raw data that has status active for the last 30 days and creating a view on top of it in Snowflake. And where could this code fail? I'm not sure how raw_data looks here, but instead of ref(), I think the raw data should be referred to through source(). And since he's using a star, when a new column comes in the future this might give trouble and fail at that point. So I'd suggest referencing it from the source, select star from {{ source(...) }} for raw_data: define it as a source and control your input data from there. If you do that, the only thing you have to change is the source definition, not the downstream models like this one. In general it should work, but because of the star it might fail in the future.
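The suggestion above, using `source()` plus explicit columns instead of `ref()` with a star, could be sketched as a dbt model like this (the source, model, and column names are hypothetical):

```sql
-- models/active_raw_last_30_days.sql
-- Reference the input through source() and list columns explicitly,
-- so new upstream columns can't silently break downstream views.
SELECT id,
       status,
       created_at
FROM   {{ source('landing', 'raw_data') }}
WHERE  status = 'active'
  AND  created_at >= DATEADD(day, -30, CURRENT_DATE())
```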

So, generally, as you know, Snowflake is a centralized resource which can be used by multiple teams. If you do not have proper boundaries between the databases or data marts across teams, there is a possible chance of security issues, where data can be accessed by people who shouldn't see it. The general solution is to implement RBAC, role-based access control, which is like a virtual boundary around a team: we create a database and apply proper roles to it, and only the owner or the people who are part of those roles can access that database; no one outside it can. So, basically, you clearly define the RBAC policies. Also, we should apply row access policies for Snowflake secure shares, where there's a common table for the entire Snowflake account: if a person queries that table, they might see data they shouldn't, so we have to apply row access policies as well. So the first one is RBAC and the second is row access policies. Also, if the data has sensitive information like PII, we can use tags, which is a feature in Snowflake: we map a tag to a table or column saying it holds sensitive information, and when exposed to tools like Collibra, it raises alerts on top of it so this sensitive data isn't directly exposed to other teams. I think RBAC, row access policies, and centralized tagging policies should provide proper security, boundaries, and privacy for data within Azure Data Factory and Snowflake. Thank you.
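The RBAC-plus-row-access-policy setup described above could be sketched as follows (the role, database, table, and column names are hypothetical; the policy maps rows to teams via the current role purely for illustration):

```sql
-- Role-based access: only TEAM_A_ROLE can use this database.
CREATE ROLE IF NOT EXISTS team_a_role;
CREATE DATABASE IF NOT EXISTS team_a_db;
GRANT USAGE ON DATABASE team_a_db TO ROLE team_a_role;

-- Row access policy: in a shared table, each team sees only
-- the rows whose owning_team matches its active role.
CREATE OR REPLACE ROW ACCESS POLICY team_filter
AS (team_name STRING) RETURNS BOOLEAN ->
    team_name = CURRENT_ROLE();

ALTER TABLE shared.accounts
    ADD ROW ACCESS POLICY team_filter ON (owning_team);
```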

In my project, most of the data comes from outside the system, outside our company's network. Our team developed a producer-consumer application that is responsible for calling the APIs outside Schroders' network and putting those files onto an SFTP server, from where our jobs begin. In general I have less experience calling external APIs directly, but I have done it: for example, in one of my hackathons I called publicly available APIs using security keys; we put those keys in Azure Key Vault and authenticated the API calls with them. And how do we handle failures? In general, every API has a rate limit, so the ADF pipeline should be designed accordingly: for example, put the call in a loop with a limit, so that after 3 consecutive API failures we fail the ADF pipeline instead of retrying continuously, and log the error into a centralized logging mechanism which can be tracked later. Also, these days Snowflake accepts external API calls from Snowflake itself, so we don't need an external tool for this; we can use Snowflake's capability to call APIs within Snowflake and collect the data.
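The Snowflake-native API call mentioned above uses Snowflake's external network access feature; a sketch, where the host, URL, and object names are all hypothetical:

```sql
-- Allow egress to a specific external host.
CREATE OR REPLACE NETWORK RULE api_rule
  MODE = EGRESS
  TYPE = HOST_PORT
  VALUE_LIST = ('api.example.com');

CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION api_integration
  ALLOWED_NETWORK_RULES = (api_rule)
  ENABLED = TRUE;

-- A Python UDF that calls the external API from inside Snowflake.
CREATE OR REPLACE FUNCTION fetch_data()
  RETURNS STRING
  LANGUAGE PYTHON
  RUNTIME_VERSION = 3.10
  HANDLER = 'run'
  EXTERNAL_ACCESS_INTEGRATIONS = (api_integration)
  PACKAGES = ('requests')
AS
$$
import requests

def run():
    # Hypothetical endpoint; a timeout bounds each call.
    return requests.get('https://api.example.com/data', timeout=10).text
$$;
```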

Yeah. In general, this is a common problem when a shared repository is accessed by multiple team members. For example, there is one Git repo, and 10 people are working on it, all merging features into master. Before you merge your changes to master, someone may already have changed the same file you are working on. In this kind of scenario, what I usually do is clone the repository locally and constantly check whether there are any updates on master, especially to the files I'm working on. If there are, I immediately pull them into my local branch so I always have the latest file. This isn't a complete solution, but it keeps the merge conflicts from the Git repository manageable. Alternatively, we can work on the code separately and test it properly, and once everything works, take the latest copy and merge our changes into it. There is no single answer to this question, but we can handle it in multiple ways; I would keep checking master for updates and pull them as long as they don't clash with my local code.