
Oriented Snowflake Developer with a passion for leveraging data to drive business decisions. Seeking to utilize my 3 years of experience in Snowflake development, advanced SQL proficiency, and expertise in ETL processes to contribute to the success of a dynamic organization. Dedicated to delivering scalable and efficient data solutions that meet and exceed business objectives.
Snowflake Developer, TCS
Junior ETL Developer, TCS
Snowflake

Advanced SQL

Talend

Python

Azure Data Factory

Unix
Jenkins

GitHub

SQL Server

Tableau

PuTTY
Designed, implemented, and orchestrated scalable, robust Snowflake data warehouses. Crafted schema designs and data loading strategies to ensure optimal performance and scalability of Snowflake environments. Loaded data from Azure Data Factory into Snowflake using pipelines. Developed SQL scripts.
Role: Snowflake Developer
Client: US
Language: SQL Server
Tools: Snowflake, Qlik Compose, SQL Server
Created, tested, and implemented enterprise-level applications with Snowflake. Solved performance and scalability issues in the system. Built, monitored, and optimized ETL and ELT processes with data models. Engineered efficient ETL processes using Snowflake tasks and streams, automating data ingestion and transformation workflows.
Role: Junior ETL Developer
Client: US
Language: SQL Server
Tools: IBM DataStage, Talend, AQT, Unix, ServiceNow
Designed, developed, tested, and documented ETL jobs, and monitored the application that processes data using SQL and Unix. Proficient in writing and debugging complex SQL queries. Patched and tested data in production on a monthly and quarterly basis. Reports were viewed and processed in Tableau. Implemented Extract, Transform, Load processes and the supporting technological infrastructure. Debugged and fixed issues during development and implementation of the given requirements.
Hi, I'm Pallavi, based out of Bangalore. I have been working with TCS for the past 2.8 years, where I started my journey as a junior ETL developer in a BFSI unit. As an ETL developer, I worked with several ETL tools, such as Informatica PowerCenter and IBM DataStage, along with technologies like SQL and PostgreSQL. My daily task was to design, develop, and implement small jobs that dealt with real-time transactional data coming in every day. I also dealt with header handling and managed tasks that ran on a daily routine. That was my work as a junior ETL developer.

Currently, I've been working as a Snowflake developer for the past 1.6 years, improving and developing Snowflake data warehouses and writing advanced SQL scripts. I've been creating complex use cases, working with window functions, CTEs, joins, sorting, and filtering. As a Snowflake developer, I have been involved in creating pipelines, handling ETL and ELT processes, and loading and unloading data in and out of data warehouses. We've been using cloud-based platforms, including Snowflake, Azure Data Factory, and Azure Data Lake, to load the data permanently. My daily tasks include developing SQL scripts, loading data from sources to targets, and designing new pipelines and workflows for the same. The tools I use include Azure Data Factory, Azure Data Lake, SQL, Snowflake SQL, and other related tools. This is my current project. In my previous project, I worked as a junior Snowflake developer for six months, where I used ClickHouse as my main tool. Overall, my Snowflake development journey has made me confident in doing and debugging things by myself.

I have also worked with the testing team, where I learned how to work on my errors and go back to recheck my faults, which has helped me grow and learn a lot in this journey. So I'm looking forward to using my capabilities in your organization. Thank you.
Methods for implementing idempotency in Snowflake data ingestion. So, for data ingestion, we have primarily been using Azure Data Factory, where we load our data into our Azure Data Lake. To do this, we create pipelines, which give us a seamless, serverless, no-code process for loading data. When we do this, we have to understand the steps to take. I look at the quality of the data, the volume, the size, and the frequency at which it is coming in. Based on that, we understand the type of ingestion: is it a bulk load, a continuous data load, or real-time streaming data? Based on this, we create the pipelines. If we are running the pipelines manually, we don't have to set up any scheduling. But if we are triggering the pipelines based on streaming data, we have to give the metadata entities there, based on the data we've been receiving. So, for data ingestion, we primarily have to focus on the type of data coming in and how the ETL process is going on. We also have to focus on the source: is the data coming from databases, from APIs, or from files? The blockers for data ingestion could be that we have not given the proper target where the data should land, or that the target is already loaded, in which case we have to truncate and then load the data. These are the challenges we face when we are doing data ingestion.
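The truncate-and-then-load step mentioned above is one simple way to make a load idempotent: rerunning the same batch leaves the target in the same state. The sketch below is only an illustration under stated assumptions. It uses Python's built-in sqlite3 as a stand-in for Snowflake, and the table name target_orders and its columns are hypothetical.

```python
import sqlite3

def truncate_and_load(conn, rows):
    """Replace the target table's contents with `rows` in a single transaction."""
    with conn:  # commits on success, rolls back if any statement fails
        conn.execute("DELETE FROM target_orders")  # SQLite has no TRUNCATE statement
        conn.executemany(
            "INSERT INTO target_orders (order_id, amount) VALUES (?, ?)", rows
        )

def row_count(conn):
    """Return the number of rows currently in the target table."""
    return conn.execute("SELECT COUNT(*) FROM target_orders").fetchone()[0]

# In-memory database standing in for the real target (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE target_orders (order_id INTEGER PRIMARY KEY, amount REAL)"
)

batch = [(1, 10.0), (2, 25.5), (3, 7.25)]
truncate_and_load(conn, batch)
truncate_and_load(conn, batch)  # rerunning the same batch adds no duplicates
print(row_count(conn))  # 3
```

Because the delete and the inserts share one transaction, a failure partway through leaves the previous contents in place rather than an empty or half-loaded table.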
What ways would you leverage data modeling techniques to enhance query performance in Snowflake?
How would you design a resilient data pipeline in Azure Data Factory to handle intermittent source data availability? Data ingestion plays a major role in designing a data pipeline, because it is the first step. To do data ingestion, I should understand the data and how it is coming in. Is it coming from databases, from APIs, from SAP as raw data, or from files? Once I clarify this, I will understand how the data is being provided and how I want to load it. Then I will decide the method of data ingestion: is it bulk loading, a continuous data load, or real-time streaming? Once I understand this, I will be able to create a data pipeline. I would also consider the format of the data, the size of the data, and the frequency at which it is coming in. These are the three factors I would consider before creating a data pipeline.

When we are dealing with Snowflake and Azure Data Factory, we load the data in batches into Azure Data Lake using data pipelines. I have worked both ways: I have manually triggered the pipelines I created, and I have also used an automated method to trigger them. When I load manually, I provide the required metadata details myself. For automated triggering, I specify the interval, for example every 5 or 10 minutes, at which I want the pipeline to run. This is what I would consider when creating a resilient data pipeline. Once the pipeline is created, I make sure it is running fine.

Before I run it, I test the pipeline. That is, I run it in debug mode so that I understand where it is failing and what issues it is going through, and I fix them. This is how I have learned to bring in a proper data pipeline. So I do the testing first, and then I bring in my real-time, bulk, or continuous data and load it into the required target. So, to design a resilient data pipeline that handles intermittent source data availability, I would test the pipeline after creating it. Testing has saved me from a lot of the errors and issues I faced before, and the pipelines have been working fine afterwards.
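The answer does not say how a trigger that fires every 5 or 10 minutes should behave when the source has nothing available. One possible reading, sketched below as an assumption rather than as the method used in the project, is to check availability on each scheduled tick and skip the load when the source is down. The function names and the availability check are hypothetical and are not Azure Data Factory APIs.

```python
from datetime import datetime, timedelta

def scheduled_runs(start, interval_minutes, count):
    """Return `count` trigger times spaced `interval_minutes` apart."""
    return [start + timedelta(minutes=interval_minutes * i) for i in range(count)]

def run_schedule(ticks, source_available, load):
    """Call `load` on each tick where the source is available; record skipped ticks."""
    loaded, skipped = [], []
    for tick in ticks:
        if source_available(tick):
            load(tick)
            loaded.append(tick)
        else:
            skipped.append(tick)  # the next tick gets another chance at the data
    return loaded, skipped

start = datetime(2024, 1, 1, 9, 0)
ticks = scheduled_runs(start, 10, 4)  # 09:00, 09:10, 09:20, 09:30
outage = {ticks[1]}                   # hypothetical outage at 09:10
loads = []
loaded, skipped = run_schedule(ticks, lambda t: t not in outage, loads.append)
print(len(loaded), len(skipped))  # 3 1
```

Keeping the skipped ticks makes it possible to report or backfill the gaps later instead of failing the whole schedule on one missed window.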
When designing a high-volume data processing pipeline using Snowflake, what are the architectural considerations? What are the architectural considerations that I would keep in mind for a data processing pipeline using things like Azure Data Factory? As a data engineer, I have understood that Snowflake has its own way of dealing with huge volumes of data. It has a storage layer, a compute layer of virtual warehouses, and a cloud services layer, which together make processing high volumes of data hassle-free. If we are loading high volumes of data from Azure Data Factory, the storage layer compresses and encrypts the data, which reduces the storage footprint. So when I'm storing high volumes of data, my data is encrypted and compressed, storage is kept separate from compute, and I pay only for what I use. I can also scale up and down. The virtual warehouses can be scaled up and down based on the data coming in, so processing stays fast even when I am getting large volumes of data. If there is a huge amount of data, I can increase the size of the virtual warehouse from small to medium to large to speed up processing. I also don't have to worry about data loss, because we have Time Travel, and we have metadata and micro-partitions. With Time Travel, when I process data with a SQL query, earlier versions of the data are retained as the changes are stored.

The metadata is stored in the cloud services layer. It holds micro-partition details, account details, the max and min values, the format, and the size of the data. Because all of this information is stored in the cloud services layer, I don't have to worry about data loss. With Time Travel, I can set my data retention period accordingly. For Enterprise Edition, it can be up to 90 days. So even if I lose data, I can go back and restore it. This is one way of using the architecture of Snowflake. So, what are the architectural considerations that I would keep in mind when designing a high-volume data processing pipeline? With so many features available in Snowflake across the storage layer, the virtual warehouses, and the cloud services layer, I would keep in mind the following considerations:
1. Data compression and encryption to reduce the storage footprint.
2. Scaling virtual warehouses up and down based on incoming data.
3. Time Travel, together with metadata and micro-partitions, to protect against data loss.
4. The data retention period, which can be up to 90 days for Enterprise Edition.
5. Using the cloud services layer to store metadata and micro-partition details.
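As a rough illustration of the scaling and retention considerations, the sketch below builds the Snowflake statements one might issue. The volume thresholds, the warehouse name load_wh, and the table name are assumptions made for the example, and the code only produces SQL strings; it does not connect to Snowflake.

```python
def warehouse_size_for(gb_per_load):
    """Pick a warehouse size from the expected load volume (thresholds are hypothetical)."""
    if gb_per_load < 10:
        return "SMALL"
    if gb_per_load < 100:
        return "MEDIUM"
    return "LARGE"

def resize_statement(warehouse, gb_per_load):
    """Build an ALTER WAREHOUSE statement that sets the size for the expected volume."""
    size = warehouse_size_for(gb_per_load)
    return f"ALTER WAREHOUSE {warehouse} SET WAREHOUSE_SIZE = '{size}'"

def retention_statement(table, days):
    """Build an ALTER TABLE statement that sets the Time Travel retention period."""
    if not 0 <= days <= 90:
        raise ValueError("retention must be between 0 and 90 days")
    return f"ALTER TABLE {table} SET DATA_RETENTION_TIME_IN_DAYS = {days}"

print(resize_statement("load_wh", 250))
# ALTER WAREHOUSE load_wh SET WAREHOUSE_SIZE = 'LARGE'
print(retention_statement("sales.orders", 90))
# ALTER TABLE sales.orders SET DATA_RETENTION_TIME_IN_DAYS = 90
```

In practice the size would be chosen from observed query times and queueing rather than from a fixed volume table, so the thresholds here are placeholders.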
Azure Data Factory is mainly used to load data into our Azure storage. That is, I've been using Azure Data Lake, which can store structured or unstructured data and scales dynamically based on workload demands. For example, suppose I am getting huge volumes of data at a time. I can load this bulk data using pipelines. I can load multiple forms of data with one pipeline, because in my experience one pipeline in Azure Data Factory can hold up to 40 activities. If that is not enough, I can bring in another pipeline as a child pipeline and run the two together, so two pipelines at a time bring in my bulk data.

If I am dealing with a continuous data load, say 1,000,000 records every 10 minutes, I work from the frequency at which the data arrives. If I've been told that data arrives every 10 minutes, I create a pipeline with a scheduled trigger. I create a stream and a task and set the schedule to every 10 minutes. Every 10 minutes the task triggers and the pipeline brings the data into my external stage. This is how I deal with continuous data load.

If I have real-time streaming data, that is, data arriving every 1 or 2 minutes, I use the same technique of scheduled tasks and streams. I create a stream because I have to capture the changes: what kind of data is coming in and what changes are happening to my table when I load it. For this, too, I use the triggering method to store the data batch-wise in my target. So either way, we can deal with it.

Whether the data is huge or medium-sized, we can handle it using Azure Data Factory. This is my experience of loading data using Azure Data Factory.
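The idea of spilling extra work into a child pipeline can be sketched as a simple split of an activity list. This is only an illustration: the limit of 40 is the figure quoted in the answer and should be checked against current Azure Data Factory limits, and the activity names are hypothetical.

```python
def split_into_pipelines(activities, max_per_pipeline=40):
    """Split a list of activities into consecutive groups of at most `max_per_pipeline`."""
    if max_per_pipeline < 1:
        raise ValueError("max_per_pipeline must be at least 1")
    return [
        activities[i:i + max_per_pipeline]
        for i in range(0, len(activities), max_per_pipeline)
    ]

# 65 hypothetical copy activities: one parent pipeline plus one child pipeline.
activities = [f"copy_table_{n}" for n in range(65)]
pipelines = split_into_pipelines(activities)
print([len(p) for p in pipelines])  # [40, 25]
```

The first group would form the parent pipeline and each later group a child pipeline that the parent invokes.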
Here we're using dbt (data build tool). Looking at the code, I think the index creation should be done in the pre-hook, not in the post-hook. If you are creating a model, you would want the indexing to be done in the pre step rather than the post-hook step. Doing the indexing in the pre step will help improve performance and speed. So indexing is not needed in the post-hook; that is the error. The create statement should go in the pre-hook, along with the drop index if exists. Also, when you're granting any kind of access, you have to give the table details. Here the grant is just on the role, not on the table or any other object, entity, or attribute.
Suppose I come across this section within a CI/CD pipeline configuration using Azure Data Factory or Azure's identity service: identify the possible oversight. So, the deployment process in my project, and what should be followed here, is that when you build something, you first test it in QA (the test environment) and in UAT. Only once the testing has been successful there can you move it to prod, the production environment. When you move it to production directly, you have a chance of seeing a lot of errors there. I think the deployment should first be done in the test environment, then in UAT, and then in production. This is the safest way, because you will catch a lot of bugs or errors when you test in your test environment, where you do one or two iterations, then move to UAT, and then to production. Production is the final place where you deploy things. I think prior testing should be done before deploying to production. That's what is missing here: you've built it, and then you're deploying it to production directly. That step is missing.
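The test, then UAT, then production order described above can be expressed as a small promotion gate. The sketch below is a generic illustration and not an Azure DevOps or Azure Data Factory API; the stage names and function names are assumptions.

```python
STAGES = ["test", "uat", "production"]

def can_deploy(target, passed):
    """Allow a deploy to `target` only if every earlier stage has already passed."""
    index = STAGES.index(target)  # raises ValueError for an unknown stage
    return all(stage in passed for stage in STAGES[:index])

def next_stage(passed):
    """Return the first stage the build has not passed yet, or None if all are done."""
    for stage in STAGES:
        if stage not in passed:
            return stage
    return None

print(can_deploy("production", set()))            # False
print(can_deploy("production", {"test", "uat"}))  # True
print(next_stage({"test"}))                       # uat
```

A pipeline that deploys straight from the build to production corresponds to calling can_deploy("production", set()), which this gate rejects.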
Machine learning data pipeline in Snowflake, and ensuring it's updatable as new data becomes available. If I'm building a data pipeline using machine learning, I would first build on the data I've been continuously dealing with, that is, the same routine data coming in. I know how that data behaves, I know the metadata, and I know where the data has been loading. So I will understand the dataset properly, and then I'll create the data pipeline for Snowflake using machine learning. Once I've trained the model, it will start giving output accordingly. Understanding the frequency, format, and size of the data, and the attributes, entities, and constraints present in it, I will list them out and then create a model. Then I will build a data pipeline using these entities. With this, I will be able to create a data pipeline from my data using data modeling techniques. Then I will run the pipeline and use automation so that, as routine data comes in, the pipeline also runs routinely.

To make new data available in Snowflake: once I create the data pipeline, the data lands in the data lake. I will bring in new databases or external tables so that I can pull it in and use it as new data. This is one way. Or, if I'm doing it manually, I can trigger the pipeline I have created by giving the entities I have learned from the patterns of my previous routine data. So I will create a data model from the same, and I will test the pipeline first in debug mode to understand what the errors are.

Once I have fixed any errors found while debugging, I will run the pipeline, land the data in the data lake, and then pull the same data using an external table, giving the URL path of the particular storage, be it Azure Blob Storage, AWS storage, or GCP storage. The data sits directly in the external storage and is available as completely new data. I'll then start using it for further querying.
Apply DevOps practices to improve collaboration and reduce lead time in Snowflake data operations. So, we have been using Azure DevOps for our day-to-day planning. We have used it to plan our project, manage code, build and test the code, deploy or release it into production, and monitor it. I have seen how it has supported collaboration between the developers and the testers. The designing, developing, and planning of the project go on our Azure DevOps dashboards, and we have scrums and reporting tools available there. Once this is done, the building and testing part goes on alongside code management. We can bring in our code, and we have version control, so each developer can bring in a new version. If someone wants to refer to the code, they can go back and look it up in the version history. We have Azure Repos, where we upload our code. After uploading, our testing team can plan and run their test plans there. If they find problems during testing, they can raise bugs and assign them to the particular developer. So if there is a count mismatch or any other bug found in testing, the developer can be added there. Bugs are tracked sprint-wise, so the developer can fix a bug within that sprint and come back. This way, collaboration has been going well, and the tool has helped us collaborate seamlessly. We can also do continuous integration and continuous deployment using Azure DevOps, where we have Jenkins to deploy things. We also have GitHub and other tools available.

For deploying, we can use different environments, such as test, UAT, and production. We also run unit tests, which are uploaded by developers on Azure DevOps. This has helped us interact with developers and do live coding between two developers. So it has helped in planning, managing the code, building and testing it, deploying and releasing it into production, and monitoring. We have a variety of things in Azure DevOps, such as dashboards, reporting tools, scrums, sprints, and boards. On the boards, we can see and plan the projects. We can also see the user stories uploaded for each sprint, where we can upload our code and verify things.
Optimize data retrieval time in Snowflake while dealing with large semi-structured data sets using Snowflake SQL. If we're dealing with large semi-structured JSON data, we have the option to load the JSON file into a VARIANT column and then convert it into rows and columns within Snowflake. This is not a very large task for us. Once the JSON is in a VARIANT, it is easy to structure it into the required format. After converting the VARIANT into rows and columns, it is very easy to process them using SQL. Once the data is in a table format of rows and columns, it is easy to process, clean, transform, or manipulate. So optimizing data retrieval time in Snowflake can be done in this way: convert the semi-structured JSON file into the VARIANT type, and then convert that VARIANT into rows and columns. Using this format for processing or cleaning the data helps us reduce the time spent on a large semi-structured JSON file. We can also process the table of columns and rows faster, which improves performance.
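To show what converting nested JSON into rows and columns means, here is a small sketch in plain Python. It stands in for what Snowflake does with a VARIANT column and its FLATTEN table function; the document shape and the field names (orders, items, sku, qty) are assumptions made for the example.

```python
import json

def flatten_orders(raw_json):
    """Turn one nested JSON document into flat (order_id, customer, sku, qty) rows."""
    doc = json.loads(raw_json)
    rows = []
    for order in doc["orders"]:
        for item in order["items"]:  # one output row per nested array element
            rows.append(
                (order["id"], order["customer"]["name"], item["sku"], item["qty"])
            )
    return rows

raw = """
{"orders": [
  {"id": 1, "customer": {"name": "Asha"},
   "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]},
  {"id": 2, "customer": {"name": "Ravi"},
   "items": [{"sku": "A1", "qty": 5}]}
]}
"""

for row in flatten_orders(raw):
    print(row)
# (1, 'Asha', 'A1', 2)
# (1, 'Asha', 'B7', 1)
# (2, 'Ravi', 'A1', 5)
```

Once the data is in this flat shape, ordinary SQL filters, joins, and aggregations apply directly, which is the benefit the answer describes.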