
Experienced Data Engineer with a passion for transforming complex data into actionable insights. With a proven track record of 3 years, I specialize in designing and implementing robust data pipelines that handle large datasets efficiently. Proficient in programming languages like Python and Scala, I have hands-on experience with Hadoop and Spark ecosystems, enabling me to develop and optimize data processing workflows.
During my journey, I've successfully contributed to projects that improved data accuracy and reduced processing times. As the big data landscape evolves, I stay at the forefront of emerging technologies and best practices to ensure that my skills are always aligned with industry trends.
Business Technology Solution Associate, ZS
Software Engineer Analytics - Data Engineer, Sagility
Associate Software Engineer, Hinduja Global Solutions Digital
Azure Data Factory

Azure Databricks

Azure Synapse

Azure DevOps

Amazon Redshift

AWS S3

AWS Lambda

AWS DynamoDB

Hadoop
Apache Spark

Apache Sqoop

Hive

CICD

Cloudera

Eclipse

Git

Linux
Pune, Maharashtra, India (Hybrid)
Hi, this is Srivatsar. I am currently working as an Azure data engineer, and I completed my graduation from Oxford College of Engineering. I am currently located in Bangalore. As a data engineer, I work with Azure technologies like Azure Data Factory and Azure Databricks. At the moment I am handling two projects: one is the Amplify coaching tool, and the other is the MyTime application, which is a payroll processing application. For the payroll application we use Azure Databricks, where we receive data every 15 days and process it using PySpark. That is a brief introduction about myself.
To implement a CDC solution in Azure Data Factory, I will follow a few steps. First, I will identify the source and destination, for example a database as the source from which we capture changes and a data warehouse as the destination where we load the changed data. Second, I will choose a CDC mechanism; the common approach is to use the CDC features provided by the source system itself, such as SQL Server CDC, SQL database change tracking, or Oracle LogMiner. Third, I will set up source data extraction using the appropriate connectors, for example the SQL Server connector if I am using SQL Server CDC; I will configure the connector and then retrieve the changed data through the CDC mechanism. Next, I will define the incremental load strategy, either by using timestamp or change tracking columns or by using CDC-specific metadata, for example the LSN in SQL Server CDC, so that only rows changed since the last execution are picked up. I will implement the detection logic using activities like Lookup or Stored Procedure in ADF to detect changes based on my strategy, comparing the last extracted timestamp with the latest data to identify new records. To handle deleted records, I will implement logic to detect and process deletes based on CDC metadata or a comparison with previous data snapshots. Finally, I will schedule the CDC pipeline in ADF to run at regular intervals based on my data freshness requirements, and monitor pipeline execution and performance to ensure efficient capture and loading of the incremental data.
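For illustration only, a minimal Python sketch of the watermark pattern described above (the Lookup, Copy, and update-watermark steps an ADF pipeline would perform). The connection string, the table dbo.Orders, the column LastModified, and the etl.Watermark control table are hypothetical placeholders, and autocommit is assumed to be off.

import pyodbc

src = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=src;DATABASE=sales;Trusted_Connection=yes")
cur = src.cursor()

# 1. Look up the last watermark (what an ADF Lookup activity would do).
cur.execute("SELECT LastWatermark FROM etl.Watermark WHERE TableName = ?", "dbo.Orders")
last_watermark = cur.fetchone()[0]

# 2. Extract only the rows changed since the last run (the Copy activity's source query).
cur.execute("SELECT * FROM dbo.Orders WHERE LastModified > ?", last_watermark)
changed_rows = cur.fetchall()

# 3. After loading changed_rows into the destination, advance the watermark to the
#    highest LastModified value just extracted (what a Stored Procedure activity would do).
if changed_rows:
    new_watermark = max(row.LastModified for row in changed_rows)
    cur.execute(
        "UPDATE etl.Watermark SET LastWatermark = ? WHERE TableName = ?",
        new_watermark, "dbo.Orders",
    )
src.commit()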
I will take the following steps. The first step is to analyze the current data model, including the schemas, tables, indexes, and dependencies, and identify data migration requirements such as data type compatibility between the source and Snowflake. Then I will set up Snowflake and create the target tables and structures. I will extract the data from the source system using SnowSQL, data integration tools, or third-party ETL tools, perform data profiling and cleansing as needed to ensure data quality before migration, and then load the extracted data into Snowflake using Snowflake's COPY INTO command, bulk loading utilities, or an ETL process. For incremental updates, if the source data continues to change during the migration, I will use a CDC mechanism or delta processing techniques to capture and migrate incremental changes without interrupting ongoing operations. Then I will do testing and validation to ensure accuracy, completeness, and integrity, and plan the cutover timing carefully to minimize downtime and business impact. Finally, in post-migration validation tests, I will verify that data processing, reporting, and analytics functions work as expected.
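As an illustrative sketch of the bulk-loading step with the Snowflake Python connector, assuming a CSV export of the source table already exists locally; the account, credentials, warehouse, and object names are hypothetical placeholders.

import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="***",
    warehouse="LOAD_WH", database="ANALYTICS", schema="STAGING",
)
cur = conn.cursor()

# Create the target table to match the source schema.
cur.execute("""
    CREATE TABLE IF NOT EXISTS CUSTOMERS (
        CUSTOMER_ID NUMBER,
        NAME        VARCHAR,
        CREATED_AT  TIMESTAMP_NTZ
    )
""")

# Stage the extracted file in the table stage, then bulk load it with COPY INTO.
cur.execute("PUT file:///tmp/customers.csv @%CUSTOMERS")
cur.execute("""
    COPY INTO CUSTOMERS
    FROM @%CUSTOMERS
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
""")
conn.close()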
I will follow some key considerations. One is data partitioning and distribution: I will partition large datasets on relevant columns to distribute the processing load across nodes in Snowflake, and use Snowflake clustering keys to physically group related data together, which improves query performance and minimizes I/O. I will configure ADF activities to run in parallel to maximize resource utilization and processing speed, and leverage Snowflake's automatic concurrency scaling to handle concurrent queries and workloads efficiently. I will implement incremental data processing strategies, processing only the data changed since the last run, and utilize Snowflake's Time Travel or CDC features for tracking changes and maintaining data history. I will use efficient compression techniques to reduce the storage footprint and improve data transfer speeds into Snowflake, and optimize storage configurations in Snowflake, such as choosing appropriate clustering keys and storage policies based on access patterns. For data movement and integration, I will use ADF data flow activities within the cloud environment to minimize data movement between the source and Snowflake, and use Snowflake's native connectors and integration capabilities to ingest data directly from various sources, reducing latency and complexity. I will set up monitoring for performance using alerts and metrics in ADF to track pipeline performance, and continuously monitor and tune Snowflake warehouse configurations, query optimization, and clustering strategies to improve overall performance. For fault tolerance, I will implement retry mechanisms, error handling, and checkpointing in ADF to handle failures gracefully and ensure data integrity, and leverage Snowflake's data replication and backup features.
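For illustration, a small sketch of the Snowflake-side tuning statements mentioned above, run through the Python connector; the table, warehouse, and size values are hypothetical placeholders, and multi-cluster scaling assumes an edition that supports it.

import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user", password="***")
cur = conn.cursor()

# Cluster a large fact table on the columns most queries filter on.
cur.execute("ALTER TABLE ANALYTICS.MART.FACT_SALES CLUSTER BY (SALE_DATE, REGION_ID)")

# Let the load warehouse scale out for concurrent ADF workloads and suspend when idle.
cur.execute("""
    ALTER WAREHOUSE LOAD_WH SET
        WAREHOUSE_SIZE = 'MEDIUM'
        MIN_CLUSTER_COUNT = 1
        MAX_CLUSTER_COUNT = 3
        AUTO_SUSPEND = 60
""")
conn.close()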
To ensure data quality and accuracy, the first method I will use is data profiling: using data profiling activities in ADF to analyze the structure, completeness, uniqueness, and distribution of data in my datasets, and identifying anomalies, missing values, duplicates, outliers, and other data quality issues through that profiling. I will implement data validation checks within ADF pipelines using activities such as Lookup, If Condition, or Conditional Split to verify data integrity and correctness, and validate the data against predefined business rules, reference datasets, or expected patterns. I will use data cleansing transformations in ADF data flows to standardize, cleanse, and enrich data, removing or correcting invalid data, outliers, duplicates, and inconsistencies using cleansing functions and logic. I will implement error handling mechanisms in ADF pipelines to capture and handle data quality issues, exceptions, and failures, use the logging and monitoring features in ADF to track data quality metrics, error counts, and processing statistics, and maintain metadata repositories or catalogs to track data lineage, data quality rules, transformations, and mappings. Finally, I will perform data reconciliation between source and target datasets or systems to ensure data consistency and accuracy, and monitor data quality trends, metrics, and KPIs to track improvements and address recurring issues.
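As an illustrative sketch, a minimal PySpark version of the kind of validation rules described above, as it might run in a Databricks step of the pipeline; the dataset path, column names, and thresholds are hypothetical placeholders.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq_checks").getOrCreate()
df = spark.read.parquet("/mnt/staging/orders")

# Completeness: count nulls in mandatory columns.
null_counts = df.select(
    *[F.sum(F.col(c).isNull().cast("int")).alias(c) for c in ["order_id", "customer_id", "amount"]]
).first().asDict()

# Uniqueness: detect duplicate business keys.
duplicate_keys = df.groupBy("order_id").count().filter(F.col("count") > 1).count()

# Validity: flag rows violating a simple business rule.
invalid_amounts = df.filter(F.col("amount") < 0).count()

# Fail the run if any rule is violated, so the pipeline's error handling takes over.
if any(null_counts.values()) or duplicate_keys or invalid_amounts:
    raise ValueError(f"Data quality failed: nulls={null_counts}, dups={duplicate_keys}, invalid={invalid_amounts}")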
The first step I will take is to identify the dimension tables representing business entities such as customer, product, or time, and identify the fact table containing the numeric measures and foreign keys to the dimension tables. I will then design the dimension tables in Snowflake using CREATE TABLE statements, define the columns for the attributes of each dimension (for example customer ID, name, address, product ID, and so on), set appropriate data types, constraints, and default values for the dimension table columns, and define surrogate keys on the dimension tables for improved performance and historical tracking. Second, I will design the fact table: define the columns for the numeric measures (for example sales amount or quantity sold) and the foreign keys to the dimension tables, determine the grain of the fact table based on the business requirements, and set appropriate data types, constraints, and default values for the fact table. Third, I will define primary key constraints on the dimension tables and foreign key constraints on the fact table to establish the relationships, using ALTER TABLE statements or the Snowflake UI to add constraints as needed. Then I will load data into the dimension tables from the source systems or files using Snowflake data loading tools such as COPY INTO or the Snowpipe data ingestion service, applying any transformations needed during the loading process, and afterwards load the data into the fact table. I will define clustering keys on the fact table based on frequently used dimensions, especially frequently accessed columns, to optimize storage and query execution, and finally test and validate the data model. These are the approaches I would take to implement a star schema data model in Snowflake.
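For illustration, a minimal star-schema DDL sketch executed through the Snowflake Python connector; the database, table, and column names are hypothetical placeholders, and note that Snowflake records primary and foreign key constraints but does not enforce them.

import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user", password="***",
                                    database="ANALYTICS", schema="MART", warehouse="LOAD_WH")
cur = conn.cursor()

# Dimension table with a surrogate key.
cur.execute("""
    CREATE TABLE IF NOT EXISTS DIM_CUSTOMER (
        CUSTOMER_SK  NUMBER AUTOINCREMENT PRIMARY KEY,
        CUSTOMER_ID  VARCHAR NOT NULL,
        NAME         VARCHAR,
        ADDRESS      VARCHAR
    )
""")

# Fact table at the grain of one row per sale, with a foreign key to the dimension,
# clustered on the most frequently filtered column.
cur.execute("""
    CREATE TABLE IF NOT EXISTS FACT_SALES (
        SALE_DATE     DATE,
        CUSTOMER_SK   NUMBER REFERENCES DIM_CUSTOMER (CUSTOMER_SK),
        SALES_AMOUNT  NUMBER(12,2),
        QUANTITY_SOLD NUMBER
    )
    CLUSTER BY (SALE_DATE)
""")
conn.close()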
Regarding possible oversights and their potential impact on the deployment process: first, incomplete testing, that is, not performing comprehensive testing of pipelines and data integration workflows before deployment; the potential impact is the deployment of faulty or incorrect pipelines, causing data inconsistencies, processing errors, and operational disruptions. Second, lack of version control, that is, not utilizing a version control system such as Git for managing ADF or Synapse artifacts and configurations; the impact is difficulty in tracking changes, managing code collaboration, and rolling back to previous versions in case of issues or regressions. Third, inadequate deployment automation: relying solely on manual deployments without automated deployment pipelines, which increases deployment time, human error, and inconsistency between development, staging, and production environments. Then limited monitoring and logging: neglecting to configure monitoring, alerts, logging, and performance metrics for deployed pipelines, with the potential impact of being unable to detect and troubleshoot issues promptly, leading to prolonged downtime, data loss, and degraded performance. There are also insufficient security practices, such as not implementing proper access controls, encryption, and data protection measures in CI/CD pipelines; the impact is the risk of unauthorized access, data breaches, compliance violations, and loss of sensitive information. Finally, ignoring rollback strategies, that is, failing to define and test rollback strategies for failed deployments or production issues, which makes it difficult to revert changes and can lead to data corruption.
There are a couple of issues in the code snippet; the main one is that the COMMIT keyword is missing. We should include a COMMIT statement to commit the transaction and make the changes permanent in the database; that COMMIT statement is missing from the current snippet.
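The original snippet is not reproduced here, so the following is only a hypothetical Python illustration of the point: without an explicit commit, the changes are never made permanent. The DSN, table, and values are placeholders, and autocommit is assumed to be off.

import pyodbc

conn = pyodbc.connect("DSN=warehouse", autocommit=False)
cur = conn.cursor()
try:
    cur.execute("UPDATE dbo.Orders SET status = 'SHIPPED' WHERE order_id = ?", 1001)
    conn.commit()      # the missing step: persist the transaction
except Exception:
    conn.rollback()    # undo partial changes on failure
    raise
finally:
    conn.close()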
To design the integration of real-time streaming data with batch-processed historical data, first I will identify the data sources: the real-time streaming sources, for example Kafka, Azure Event Hubs, or AWS Kinesis, and the batch historical sources, for example databases or data lakes, that I want to integrate. Then I will choose the integration tooling: I will use Snowflake native connectors or third-party tools with Snowflake to ingest both the real-time and the batch data, and for real-time streaming data I will consider Snowpipe for continuous loading from the streaming source into Snowflake. Next I will define the data ingestion patterns by determining the frequency and volume of the real-time updates and batch loads, and choose the appropriate pattern, such as micro-batching or continuous streaming for real-time data and scheduled batch processing for historical data. I will use Snowflake's data transformation capabilities, such as streams and tasks or external functions, to transform and enrich the incoming data, and join the real-time and historical data on common keys or time frames to create comprehensive datasets. I will implement CDC, utilizing Snowflake Time Travel and its CDC features or a custom CDC mechanism to capture incremental changes in the real-time data and apply them to the historical data. Then I will design the storage structures in Snowflake, such as tables and schemas, to accommodate both real-time and historical data, using partitioning strategies based on time intervals (for example daily or hourly) or business keys for efficient data storage and querying. Finally, I will implement data governance practices such as access control, encryption, and data masking to ensure data security and compliance, and use Snowflake's native SQL capabilities, including window functions, aggregations, and joins, for processing and analyzing the integrated data.
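As a rough sketch of the Snowflake objects involved, a pipe for continuous loading of streaming files, a stream to capture the changes, and a task that moves them into the historical table; the stage, tables, columns, and warehouse are hypothetical placeholders, and AUTO_INGEST assumes cloud event notifications are configured on the stage.

import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user", password="***",
                                    database="ANALYTICS", schema="RAW")
cur = conn.cursor()

# Continuous loading of event files landed by Event Hubs/Kafka into a stage.
cur.execute("""
    CREATE PIPE IF NOT EXISTS EVENTS_PIPE AUTO_INGEST = TRUE AS
    COPY INTO RT_EVENTS FROM @EVENTS_STAGE
    FILE_FORMAT = (TYPE = JSON) MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
""")

# Capture incremental changes on the real-time landing table.
cur.execute("CREATE STREAM IF NOT EXISTS RT_EVENTS_STREAM ON TABLE RT_EVENTS")

# Periodically apply the captured changes to the historical table.
cur.execute("""
    CREATE TASK IF NOT EXISTS MERGE_EVENTS
      WAREHOUSE = LOAD_WH
      SCHEDULE = '15 MINUTE'
    AS
      INSERT INTO HIST_EVENTS (EVENT_ID, PAYLOAD, EVENT_TS)
      SELECT EVENT_ID, PAYLOAD, EVENT_TS
      FROM RT_EVENTS_STREAM
      WHERE METADATA$ACTION = 'INSERT'
""")
cur.execute("ALTER TASK MERGE_EVENTS RESUME")
conn.close()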
To apply DevOps practices, improve collaboration, and reduce wait times, first I will use version control: a version control system, for example Git, to manage Snowflake objects such as databases, schemas, tables, and views. Then I will implement infrastructure as code (IaC) practices by defining the Snowflake objects and configurations in code using SQL scripts or Snowflake Scripting. I will set up CI pipelines to automate the validation and integration of code changes into the Snowflake environment, using CI tools and frameworks to trigger automated builds, tests, and deployments of Snowflake objects. Then I will implement CD pipelines to automate the deployment of validated code changes to the Snowflake environments, and define deployment strategies that minimize downtime and ensure a smooth rollout of changes. I will use configuration management tools to manage Snowflake object configurations, environment variables, and connection settings, ensuring consistency and traceability of configurations across the different Snowflake environments. I will develop automated tests for Snowflake objects, for example SQL scripts and stored procedures, to validate functionality, performance, and data quality, and include unit tests, integration tests, and regression tests in the CI pipelines to catch issues early in the development cycle. I will set up monitoring and alerting mechanisms in Snowflake to track performance metrics such as resource usage and data pipeline health, use monitoring tools and dashboards to proactively identify and address potential issues in the Snowflake data operations, and maintain comprehensive documentation for Snowflake objects, data pipelines, configurations, and the deployment process. I will also conduct regular knowledge sharing sessions and workshops to educate my team members on best practices, tools, and techniques.
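For illustration, a small sketch of one automated test a CI pipeline could run against a Snowflake development environment, written pytest-style with the Python connector; the environment variables, warehouse, and table names are hypothetical placeholders, with credentials expected to come from CI secrets.

import os
import snowflake.connector

def get_connection():
    return snowflake.connector.connect(
        account=os.environ["SF_ACCOUNT"],
        user=os.environ["SF_USER"],
        password=os.environ["SF_PASSWORD"],
        database="ANALYTICS", schema="MART", warehouse="CI_WH",
    )

def test_fact_sales_has_no_orphan_customers():
    # Regression test: every fact row must join to a customer dimension row.
    conn = get_connection()
    try:
        cur = conn.cursor()
        cur.execute("""
            SELECT COUNT(*)
            FROM FACT_SALES f
            LEFT JOIN DIM_CUSTOMER d ON f.CUSTOMER_SK = d.CUSTOMER_SK
            WHERE d.CUSTOMER_SK IS NULL
        """)
        assert cur.fetchone()[0] == 0
    finally:
        conn.close()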
Regarding the version control mechanism for CI/CD data pipeline deployment: first, I will use source control to store the ADF artifacts, that is the pipelines, datasets, and linked services, in a version control system, and use branches to manage the different development stages, for example development, testing, and production, and isolate changes until they are ready for deployment. Then I will export ARM (Azure Resource Manager) templates for my ADF resources from the Azure portal or using the Azure CLI, and store these ARM templates in the version control system to track changes and manage deployments. I will automate deployment with CI/CD: CI pipelines to automatically build and validate changes to the ADF artifacts stored in version control, and CD pipelines to automate the deployment of validated changes to the ADF environments. I will implement release management practices to control the deployment of changes to the different environments, for example QA, staging, and production, in a controlled manner, and define approval workflows and gates to ensure that only approved changes are deployed to the production environment. I will use versioning and tagging in the version control system to track releases, milestones, and important changes to the ADF artifacts, applying semantic versioning or a similar scheme to clearly indicate the significance of each release. I will regularly back up the ADF resources and configurations, including ARM templates, pipelines, and datasets, and implement procedures for restoring ADF resources from backup in case of data loss or deployment issues. Finally, I will monitor deployments and track changes to the ADF artifacts using the logging, auditing, and monitoring tools provided by Azure, and implement alert mechanisms to notify the relevant stakeholders of deployment failures or issues.
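As a hedged sketch of the CD step, a small Python wrapper a CI/CD job could use to deploy the exported ADF ARM template to a target environment via the Azure CLI; the resource group naming, file paths, and environment name are hypothetical placeholders, and it assumes the Azure CLI is installed and already authenticated on the build agent.

import subprocess

def deploy_adf(environment: str) -> None:
    # Deploy the versioned ARM template for the factory to the target resource group.
    subprocess.run(
        [
            "az", "deployment", "group", "create",
            "--resource-group", f"rg-dataplatform-{environment}",
            "--template-file", "adf/ARMTemplateForFactory.json",
            "--parameters", f"adf/ARMTemplateParametersForFactory.{environment}.json",
        ],
        check=True,  # fail the pipeline run if the deployment fails
    )

if __name__ == "__main__":
    deploy_adf("qa")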