Sunkara Yasasvi

Vetted Talent
Experienced IT professional with a proven track record of four years, adept in Snowflake, legacy data systems, SQL, data warehousing, big data technologies, and ETL processes. Seeking avenues to further enrich my expertise by engaging with cutting-edge data technologies.
  • Role

    Software Engineer

  • Years of Experience

    4 years

Skillsets

  • Git
  • Teradata
  • Snowflake
  • PL/SQL
  • MongoDB
  • Informatica
  • Excel
  • Control-M
  • Bamboo
  • Appworx
  • Jira
  • SQL - 4 Years
  • Databricks
  • Java
  • Oracle
  • Confluence
  • PySpark
  • Bitbucket
  • Shell Scripting - 4 Years
  • Python - 2 Years

Vetted For

9 Skills
  • Senior Data Engineer With Snowflake (Remote) - AI Screening
  • Result: 66%
  • Skills assessed: Azure Synapse, Communication Skills, DevOps, CI/CD, ELT, Snowflake, Snowflake SQL, Azure Data Factory, Data Modelling
  • Score: 59/90

Professional Summary

4 Years
  • Oct, 2022 - Present (3 yr 4 months)

    Software Engineer

    Impetus Technologies
  • Jan, 2020 - Sep, 2022 (2 yr 8 months)

    Senior Software Engineer

    Infosys

Applications & Tools Known

  • Snowflake
  • Teradata
  • Informatica
  • Oracle
  • Control-M
  • AppWorx

Work History

4 Years

Software Engineer

Impetus Technologies
Oct, 2022 - Present (3 yr 4 months)
    As a Snowflake Developer, actively involved in a data migration project, transferring Enterprise Data Warehouses to modern cloud and big data platforms. Focused on Informatica to Snowflake migration, emphasizing Teradata script migration to Snowflake. Responsibilities include validating and optimizing code, creating Snowflake stored procedures, translating Oracle and Teradata DDLs, managing data ingestion using Qlik and Azure, transforming shell scripts into Snowflake SQL scripts, creating wrappers for stored procedures, utilizing task automation in Snowflake, maintaining version control through Git, and tracking tasks using Jira. Implemented change data capture (CDC) using Qlik.

Senior Software Engineer

Infosys
Jan, 2020 - Sep, 2022 (2 yr 8 months)
    Streamlined asset transfer procedure using PL/SQL, reducing load and extraction times by 98% while processing at least 500,000 records. Deployed error-free code with 90%+ coverage in production environments and worked in an agile setup using Scrum and JIRA. Reduced average wait time for batch processes by 50% through automation and file alert mechanisms. Established NDM file transmissions for 15 clients, automating file transfers and decreasing late file escalations by 99%. Migrated 7 applications to IaaS servers using Salt scripting. Managed production support team, reducing support tickets and resolution time. Trained team members on 7 applications and provided knowledge transfer. Resolved bugs and vulnerabilities as part of pen testing and Qualys findings for trading applications.

Education

  • Master of Technology: Advanced Manufacturing

    SASTRA University (2019)

Certifications

  • Databricks Certified Associate Developer for Apache Spark 3.0

    Databricks (Sep, 2023)

AI-interview Questions & Answers

Hello. As a brief introduction about myself: I started working as a data engineer in 2022 on Snowflake migration projects. As part of that work, we migrated an existing Teradata data warehouse, which was handled through Informatica, to Snowflake. We handled the process end to end, from development through deployment, and that is where my experience with Snowflake and Azure comes from. Before that, I was at Infosys, part of a team handling finance-related applications. I graduated from SASTRA University with an integrated B.Tech and M.Tech in Mechanical Engineering and Advanced Manufacturing. A bit more about my data engineering work at Impetus: I worked with Snowflake SQL, Snowflake stored procedures, tasks, and SnowSQL for deployment purposes. I also used AppWorx as a job orchestration tool, and we used ADF for landing static data, so if we had to ingest CSVs or other flat files, we would bring them in through ADF and integrate that with Snowflake.

How can you use transaction control to manage data consistency during concurrent ETL processes in Snowflake? One way we maintained transactional control was by creating views. Say we have a table T1 that receives execution logs; the same logs can also be written to T2 and T3. If there is ever a lock on T1, the data is immediately written to T2, and if T2 is also locked, then to T3. All three act as base tables for a view, T, which exposes all the execution logs ordered as per its definition. This way none of the logs is affected, and if a lock does occur it is handled through exception handling by writing into one of the other identical tables. That was admittedly an improvised approach; beyond it, we could also rely on explicit transaction control and access grants.
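
A minimal sketch of the logging pattern described above, assuming hypothetical table, column, and view names (etl_log_t1/t2/t3, run_id, job_name, logged_at); the lock-detection and fallback logic is assumed to live in the calling procedure's exception handler.

    -- Hypothetical log tables; names and columns are illustrative.
    CREATE TABLE IF NOT EXISTS etl_log_t1 (run_id NUMBER, job_name STRING, logged_at TIMESTAMP_NTZ);
    CREATE TABLE IF NOT EXISTS etl_log_t2 LIKE etl_log_t1;
    CREATE TABLE IF NOT EXISTS etl_log_t3 LIKE etl_log_t1;

    -- Single view exposing all execution logs, regardless of which base table received them.
    CREATE OR REPLACE VIEW etl_log_t AS
        SELECT * FROM etl_log_t1
        UNION ALL
        SELECT * FROM etl_log_t2
        UNION ALL
        SELECT * FROM etl_log_t3;

    -- Readers always query the view, ordered by log time.
    SELECT * FROM etl_log_t ORDER BY logged_at;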

What architectural considerations would you keep in mind when designing a high-volume data processing pipeline using Snowflake and ADF? When using ADF and Snowflake for a data processing pipeline, the first consideration, at least from my point of view, is how we ingest the data. Once the data is in Snowflake and flowing through a bronze, silver, gold architecture, that part is manageable; the main thing I would focus on is how the raw data gets from ADF into the bronze layer. Say we have jobs that run on a daily basis: I would set up a watermarking process so that each run ingests only a small increment of data. Depending on the schedule and how the watermark is maintained, we would ingest the data in small batches, which avoids the initial high-volume hit from the raw layer. To carry that data from bronze through to gold, I would create stage and target layers where data is handled based on the watermark, with reporting built on top of the target layer. Even though the target table would hold a lot of records, new records would only be entered based on the watermark.
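
A sketch of the watermark-driven incremental load described above, assuming hypothetical schemas, tables, and columns (bronze.orders_raw, silver.orders, etl.watermarks, load_ts); the real pipeline and naming would differ.

    -- Load only rows newer than the stored watermark for this table.
    MERGE INTO silver.orders AS tgt
    USING (
        SELECT *
        FROM bronze.orders_raw
        WHERE load_ts > (SELECT COALESCE(MAX(watermark_ts), '1900-01-01'::TIMESTAMP_NTZ)
                         FROM etl.watermarks WHERE table_name = 'ORDERS')
    ) AS src
    ON tgt.order_id = src.order_id
    WHEN MATCHED THEN UPDATE SET tgt.status = src.status, tgt.updated_at = src.load_ts
    WHEN NOT MATCHED THEN INSERT (order_id, status, updated_at)
        VALUES (src.order_id, src.status, src.load_ts);

    -- Advance the watermark only after a successful batch.
    UPDATE etl.watermarks
    SET watermark_ts = (SELECT MAX(load_ts) FROM bronze.orders_raw)
    WHERE table_name = 'ORDERS';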

What method would you use to ensure data quality and accuracy within Azure Data Factory pipelines? To handle data quality and accuracy in a pipeline, my main go-to would be a duplication check or a watermark check, combined with something like an SCD approach. The reasoning is that there are two kinds of records: a newly entered record, which simply lands in the target and the ETL ends there, and a record that updates a previous one, in which case we update a last_updated column or similar. Whenever a record is updated or deleted, we keep a record of it, either through soft-delete flags or through audit columns such as last_updated. This way we can maintain both the quality and the accuracy of the data.
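
A sketch of the duplicate check plus SCD-style upsert with soft deletes described above, using hypothetical staging and target tables and columns (staging.customers, target.customers, is_deleted, last_updated).

    MERGE INTO target.customers AS tgt
    USING (
        SELECT *
        FROM staging.customers
        -- Duplicate check: keep only the latest row per business key.
        QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY extracted_at DESC) = 1
    ) AS src
    ON tgt.customer_id = src.customer_id
    WHEN MATCHED AND src.is_deleted THEN
        -- Soft delete: flag the record instead of physically removing it.
        UPDATE SET tgt.is_deleted = TRUE, tgt.last_updated = CURRENT_TIMESTAMP()
    WHEN MATCHED THEN
        UPDATE SET tgt.email = src.email, tgt.last_updated = CURRENT_TIMESTAMP()
    WHEN NOT MATCHED THEN
        INSERT (customer_id, email, is_deleted, last_updated)
        VALUES (src.customer_id, src.email, FALSE, CURRENT_TIMESTAMP());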

I was not sure of the term at first, but my understanding of idempotency here is about how we ingest the data: when we ingest data, reprocessing it should not change the outcome. If a new record comes in, it should be entered exactly once; if a change comes in, it should find its corresponding record in the target and be updated accordingly. For that I would implement streams and tasks: we would implement SCD-style logic using a stream and then apply it to the target with a task.
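
A sketch of the streams-and-tasks pattern mentioned above, with hypothetical object names (bronze.orders_raw, silver.orders, etl_wh); the MERGE keys and schedule are assumptions.

    -- Stream captures only new/changed rows, so reruns do not reprocess old data.
    CREATE OR REPLACE STREAM orders_stream ON TABLE bronze.orders_raw;

    -- Task applies the changes to the target; MERGE keeps the apply idempotent per key.
    CREATE OR REPLACE TASK merge_orders_task
        WAREHOUSE = etl_wh
        SCHEDULE = '15 MINUTE'
    WHEN SYSTEM$STREAM_HAS_DATA('ORDERS_STREAM')
    AS
    MERGE INTO silver.orders AS tgt
    USING orders_stream AS src
    ON tgt.order_id = src.order_id
    WHEN MATCHED THEN UPDATE SET tgt.status = src.status, tgt.last_updated = CURRENT_TIMESTAMP()
    WHEN NOT MATCHED THEN INSERT (order_id, status, last_updated)
        VALUES (src.order_id, src.status, CURRENT_TIMESTAMP());

    ALTER TASK merge_orders_task RESUME;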

Do you leverage Snowflake's Time Travel and zero-copy cloning features to enhance data recovery and testing? Say we have four environments, DEV, SIT, UAT, and production, with Business Critical edition for production and Enterprise for the rest. In that case we would set the Time Travel retention for production to up to 90 days, with the 7-day Fail-safe on top of that. This ensures we are able to recover data whenever we need to, which also helps in case of outages: if a production job fails, we can recover any data lost in that situation, assuming a fail-safe has not already been built into the ETL itself. During testing especially, we can compare a table before and after an ETL run, which makes it easy to see which records changed and which were added. That is mainly how I used the Time Travel feature, especially with timestamps: if we have execution logs recording when something was executed, we can query the table before and after a specific query ID was executed.
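
A few illustrative statements for the retention, before/after comparison, and cloning points above; the table names, timestamp, and query ID placeholder are hypothetical, and 90-day retention assumes Enterprise edition or higher.

    -- Extend Time Travel retention on a production table (Enterprise edition or above).
    ALTER TABLE prod.orders SET DATA_RETENTION_TIME_IN_DAYS = 90;

    -- Compare the table before and after a specific ETL run.
    SELECT COUNT(*) FROM prod.orders AT (TIMESTAMP => '2024-01-15 02:00:00'::TIMESTAMP_LTZ);
    SELECT COUNT(*) FROM prod.orders BEFORE (STATEMENT => '<query_id>');

    -- Zero-copy clone for testing against production-shaped data without duplicating storage.
    CREATE OR REPLACE TABLE uat.orders_clone CLONE prod.orders;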

Based on the provided SQL snippet, can you explain what the issue is with the current stored procedure and how it might affect data processing? Reading the snippet, it creates a procedure that begins and then runs an UPDATE, setting a status column where the status equals a given value. One thing that seems odd to me: I am excluding the missing COMMIT statement, because when we execute the procedure the work will be committed, so that shouldn't be the problem. The main issue, I feel, is the backslashes (escape characters), which could affect how the data is entered into the table.
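
The original snippet is not reproduced in this document; purely for reference, a minimal Snowflake Scripting procedure with explicit transaction handling might look like the sketch below (the table, status values, and procedure name are hypothetical, not the interview's snippet).

    CREATE OR REPLACE PROCEDURE update_order_status()
    RETURNS STRING
    LANGUAGE SQL
    AS
    $$
    BEGIN
        BEGIN TRANSACTION;
        -- Hypothetical update; the real snippet targets a different table/status.
        UPDATE orders SET status = 'PROCESSED' WHERE status = 'NEW';
        COMMIT;
        RETURN 'ok';
    EXCEPTION
        WHEN OTHER THEN
            ROLLBACK;
            RAISE;
    END;
    $$;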

In this code snippet a dbt model is configured, and there is an error that affects the model deployment. Can you spot and explain the error in the configuration of this model? The model is materialized as a table, with a hook that creates an index on user_id and a grant of SELECT to the analyst role. The way the pre- and post-hooks are set up, once the index is created we get access to the model, and then the index is dropped. My first gripe with this model is that we cannot create an index in the first place: if we are using this dbt model against Snowflake, indexes are not a thing in Snowflake, even through dbt. That is what the problem looks like to me. Maybe we could create a clustering key instead.
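
A hypothetical dbt-on-Snowflake model showing the fix suggested above: drop the index hooks and use a clustering key instead, keeping the access grant as a post-hook. The model name, columns, referenced model, and the analyst role are assumptions, not the interview's actual configuration.

    -- models/user_events.sql (hypothetical)
    {{ config(
        materialized='table',
        cluster_by=['user_id'],                                   -- Snowflake has no indexes; cluster instead
        post_hook="GRANT SELECT ON {{ this }} TO ROLE analyst"    -- keep the access grant as a post-hook
    ) }}

    SELECT user_id, event_type, event_ts
    FROM {{ ref('raw_events') }}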

How would you build a machine learning data pipeline in Snowflake and ensure it is updatable as new data becomes available? While I haven't worked on a machine learning data pipeline specifically, my go-to would be Snowpark, possibly along with Snowpipe. Snowpark is commonly used for machine learning pipelines in Snowflake, so I would set up the transformations using Snowpark, and use Snowpipe in conjunction with streams and tasks to make sure data ingestion keeps up as new data arrives. Alternatively, we could use ADF for the ingestion piece. This is a rough outline of how the pipeline would stay up to date.
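
A sketch of the continuous-ingestion side mentioned above (Snowpipe landing data, with a stream and task refreshing a feature table); the stage, tables, warehouse, and the aggregation itself are hypothetical, and the Snowpark training code is not shown.

    -- Auto-ingest files landed in an external stage.
    CREATE OR REPLACE PIPE ml_raw_pipe AUTO_INGEST = TRUE AS
        COPY INTO ml.raw_events
        FROM @ml.raw_stage
        FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

    -- Keep a feature table current as new data arrives.
    CREATE OR REPLACE STREAM ml_raw_stream ON TABLE ml.raw_events;

    CREATE OR REPLACE TASK refresh_features_task
        WAREHOUSE = ml_wh
        SCHEDULE = '30 MINUTE'
    WHEN SYSTEM$STREAM_HAS_DATA('ML_RAW_STREAM')
    AS
    INSERT INTO ml.features (user_id, total_amount, computed_at)
    SELECT user_id, SUM(amount), CURRENT_TIMESTAMP()
    FROM ml_raw_stream
    GROUP BY user_id;

    ALTER TASK refresh_features_task RESUME;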

Optimize data retrieval time in Snowflake while dealing with large semi-structured JSON datasets using Snowflake SQL. When we need to optimize data retrieval time in Snowflake, the first thing that comes to mind is caching, so we would look for ways to leverage result caching. When dealing with JSON datasets where we repeatedly need a particular part of the data, say certain keys we know are queried on a daily basis and are static, we can make sure those queries benefit from the result cache, which makes them quicker to serve. We could also implement clustering of sorts; speaking specifically of Snowflake SQL, we could mainly rely on clustering, use result caching, and review how the queries themselves are structured.
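
Two illustrative statements for the clustering and hot-key ideas above; the table, the JSON paths, and the materialized view (which itself requires Enterprise edition) are hypothetical.

    -- Cluster the table on a frequently filtered JSON key.
    ALTER TABLE raw.events CLUSTER BY (payload:customer_id::STRING);

    -- Materialize frequently used keys into typed columns for faster retrieval.
    CREATE OR REPLACE MATERIALIZED VIEW raw.events_hot_keys AS
        SELECT
            payload:customer_id::STRING        AS customer_id,
            payload:event_type::STRING         AS event_type,
            payload:created_at::TIMESTAMP_NTZ  AS created_at
        FROM raw.events;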

Regarding applying DevOps practices: I'm not especially knowledgeable about DevOps practices, but my idea would center on deployment. When we use SnowSQL to deploy and we have the migration scripts, we could structure the DevOps pipeline so that whenever a change is made to a migration script, the script executes automatically and the change is deployed immediately. On top of that, for any redundant testing, we could set up the pipeline so that once a procedure is deployed, the job related to it is triggered immediately, letting us test it in SIT and the other environments.