Vetted Talent

Shreevatsa Ramanatha Bhat

Experienced Data Engineer with a passion for transforming complex data into actionable insights. With a proven track record of 3 years, I specialize in designing and implementing robust data pipelines that handle large datasets efficiently. Proficient in programming languages like Python and Scala, I have hands-on experience with Hadoop and Spark ecosystems, enabling me to develop and optimize data processing workflows.

During my journey, I've contributed to projects that improved data accuracy and reduced processing times. As the big data landscape evolves, I stay at the forefront of emerging technologies and best practices so that my skills remain aligned with industry trends.

  • Role

    Software Engineer Analytics - Data Engineer

  • Years of Experience

    4.5 years

Skillsets

  • Python
  • SQL
  • Data Processing
  • Azure
  • PySpark
  • ETL
  • Data Ingestion
  • AWS Cloud
  • Data crunching

Vetted For

9 Skills
  • Senior Data Engineer With Snowflake (Remote) - AI Screening
  • 77%
  • Skills assessed: Azure Synapse, Communication Skills, DevOps, CI/CD, ELT, Snowflake, Snowflake SQL, Azure Data Factory, Data Modelling
  • Score: 69/90

Professional Summary

4.5 Years
  • Jul, 2024 - Present (1 yr 11 months)

    Business Technology Solution Associate

    ZS
  • Oct, 2022 - Jul, 2024 (1 yr 9 months)

    Software Engineer Analytics

    Sagility
  • Oct, 2021 - Oct, 2022 (1 yr)

    Associate Software Engineer

    Sagility

Applications & Tools Known

  • Azure Data Factory
  • Azure Databricks
  • Azure Synapse
  • Azure DevOps
  • Amazon Redshift
  • AWS S3
  • AWS Lambda
  • AWS DynamoDB
  • Hadoop
  • Apache Spark
  • Apache Sqoop
  • Hive
  • CI/CD
  • Cloudera
  • Eclipse
  • Git
  • Linux

Work History

4.5 Years

Business Technology Solution Associate

ZS
Jul, 2024 - Present (1 yr 11 months)

    Pune, Maharashtra, India (Hybrid)

Software Engineer Analytics

Sagility
Oct, 2022 - Jul, 2024 (1 yr 9 months)
    Reduced redundant activities across departments by building a data warehouse and handling data modelling and data ingestion.

Associate Software Engineer

Sagility
Oct, 2021 - Oct, 2022 (1 yr)

Achievements

  • Reduced redundant activities by building a data warehouse and handling data modelling and data ingestion.
  • Contributed to upgrading enterprise Time, Attendance, and Payroll processing applications.
  • Implemented a streamlined pipeline architecture with 100% efficiency in delivery.
  • Processed 1 TB of data daily with 90% accuracy, boosting efficiency by 30%.
  • Achieved a 15% improvement in data processing through PySpark code optimization.
  • Created and queried Hive tables with 100% consistency in retrieving valuable analytical information.
  • Saved 50 man-hours per activity with partitioning and bucketing in Hive.

Education

  • BE - Information Science Engineering

    The Oxford College of Engineering (2021)

Certifications

  • DP-203: Data Engineering on Microsoft Azure

  • AZ-900: Azure Fundamentals

AI-interview Questions & Answers

Hi, this is Shreevatsa. I'm not bad. Currently, I am working as an Azure data engineer, and I completed my graduation at The Oxford College of Engineering. I'm located in Bangalore. As a data engineer, I work with Azure technologies such as Azure Data Factory and Azure Databricks. I am currently handling two projects: one is the Amplify coaching tool, and the other is the MyTime application, which is a payroll processing application. For the payroll application, we process data every 15 days in Azure Databricks using PySpark. That is a brief introduction about myself.

I will follow a few steps to capture changes and implement a CDC solution in Azure Data Factory. First, I will identify the source and the destination: for example, the database or data warehouse from which changes are captured and the destination where the changed data is loaded. Second, I will choose a CDC mechanism. The common approach is to use the CDC features provided by the source system, such as SQL Server CDC, SQL Database change tracking, or Oracle LogMiner. Third, I will set up the source data extraction, using the appropriate connector to extract data from the source system. For example, if I am using SQL Server CDC, I will use the SQL Server connector, configure it, and retrieve the changed data based on the CDC mechanism. Next, I will customize the logic and define the incremental load strategy. I can use timestamp columns, change tracking columns, or CDC-specific metadata, for example the LSN in SQL Server CDC, and based on that identify the rows updated since the last execution. I will implement the detection logic using activities such as Lookup or Stored Procedure in ADF, comparing the last extracted timestamp with the latest data to identify new records. To handle deleted records, I will implement logic to detect and process deletes based on CDC metadata or a comparison with previous data snapshots. Finally, I will run the CDC pipeline in ADF at regular intervals based on the data freshness requirements, and monitor pipeline execution and performance to ensure efficient capture and loading of incremental data.
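The timestamp-based incremental strategy described in this answer can be sketched in a few lines. This is only an illustration, not the ADF pipeline itself: it uses an in-memory SQLite database as a stand-in source, and the table `source_orders`, its columns, and the function `load_incremental` are hypothetical names chosen for the example.

```python
import sqlite3


def load_incremental(conn, last_watermark):
    """Return rows modified after last_watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, name, modified_at FROM source_orders "
        "WHERE modified_at > ? ORDER BY modified_at",
        (last_watermark,),
    ).fetchall()
    # If nothing changed, keep the old watermark so the next run starts there.
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark


# Hypothetical source table with ISO-8601 timestamps (they sort as plain text).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE source_orders (id INTEGER PRIMARY KEY, name TEXT, modified_at TEXT)"
)
conn.executemany(
    "INSERT INTO source_orders VALUES (?, ?, ?)",
    [
        (1, "a", "2024-01-01T00:00:00"),
        (2, "b", "2024-01-02T00:00:00"),
        (3, "c", "2024-01-03T00:00:00"),
    ],
)
conn.commit()

# First run picks up everything after the stored watermark.
changed, watermark = load_incremental(conn, "2024-01-01T00:00:00")
# A second run with the new watermark finds nothing left to load.
changed_again, watermark_again = load_incremental(conn, watermark)
```

In a real pipeline the watermark would be persisted between runs, for instance in a control table that a Lookup activity reads at the start of each execution.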

I will take the following steps. First, I will analyze the current data model, including the schema, tables, indexes, and dependencies. I will identify data migration requirements, such as data type compatibility between the source and Snowflake. I will prepare by setting up the Snowflake environment and creating the tables and structures. Then I will extract the data from the source system using Snowflake's SnowSQL, data integration tools, or third-party ETL tools. I will perform data profiling and cleansing as needed to ensure data quality before migration. Then, I will load the extracted data into Snowflake using Snowflake's COPY INTO command, bulk loading utilities, or an ETL process. For incremental updates, if the source data continues to change during the migration, I will use a CDC mechanism or delta processing techniques to capture and migrate incremental changes without interrupting ongoing operations. I will carry out testing and validation to ensure accuracy, completeness, and integrity. I will plan the cut-over timing carefully to minimize downtime and business impact. Then, as post-migration validation, I will verify that data processing, reporting, and analytics functions work as expected.
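One way to make the testing and validation step concrete is to compare a row count and an order-independent checksum for each table on both sides. The sketch below assumes the rows have already been fetched from the source and from Snowflake into Python tuples; `table_fingerprint` and `migration_matches` are hypothetical helper names.

```python
import hashlib


def table_fingerprint(rows):
    """Return (row_count, checksum); the checksum ignores row order."""
    digests = sorted(
        hashlib.sha256(repr(tuple(row)).encode("utf-8")).hexdigest()
        for row in rows
    )
    combined = hashlib.sha256("".join(digests).encode("utf-8")).hexdigest()
    return len(digests), combined


def migration_matches(source_rows, target_rows):
    """True when both sides hold the same rows, regardless of order."""
    return table_fingerprint(source_rows) == table_fingerprint(target_rows)


# Illustrative data only: the same rows in a different order should match,
# while a changed value should be detected.
source_rows = [(1, "alice", 10.5), (2, "bob", 20.0)]
target_same = [(2, "bob", 20.0), (1, "alice", 10.5)]
target_drifted = [(1, "alice", 10.5), (2, "bob", 21.0)]

same_result = migration_matches(source_rows, target_same)
drifted_result = migration_matches(source_rows, target_drifted)
```

This check is strict about types and formatting, so values should be normalized (for example, decimals and timestamps) before hashing when the two systems represent them differently.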

I will follow some key considerations. One is data partitioning and distribution: I will partition large datasets on relevant columns to distribute the processing load across multiple nodes in Snowflake, and I will use Snowflake clustering keys to physically group related data together, which improves query performance and minimizes input/output. I'll configure ADF activities to run in parallel to maximize resource utilization and processing speed, and leverage Snowflake's automatic concurrency scaling to handle concurrent queries and workloads efficiently. For incremental data processing, I will use change data capture, and I will utilize Snowflake's Time Travel or CDC features to track changing data and maintain data history. I'll use efficient compression techniques to reduce the storage footprint and improve data transfer speeds in Snowflake. I will also optimize storage configurations in Snowflake, such as using appropriate clustering keys and storage policies based on access patterns. For data movement and integration, I'll use ADF data flow activities to process data within the cloud environment, minimizing data movement between the source and Snowflake. I'll use Snowflake's native connectors and integration capabilities to ingest data directly from various sources, reducing latency and complexity. I will set up monitoring with alerts and metrics in ADF to track pipeline performance, and I will continuously monitor and tune Snowflake warehouse configurations, query optimization, and indexing strategies to improve overall performance. For fault tolerance, I'll implement retry mechanisms, error handling, and checkpointing in ADF to handle failures gracefully and ensure data integrity, and I will leverage Snowflake's data replication and backup features.
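The retry mechanism mentioned for fault tolerance follows a familiar pattern: try the step, wait a little longer after each failure, and give up after a fixed number of attempts. ADF activities expose retry settings of their own; the sketch below only illustrates the idea in plain Python, and `run_with_retry` and `flaky_copy` are hypothetical names.

```python
import time


def run_with_retry(task, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run task(), retrying with exponential backoff on any exception."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, then 4x, and so on.
            sleep(base_delay * 2 ** (attempt - 1))


# Illustrative task that fails twice before succeeding.
calls = {"count": 0}


def flaky_copy():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


# Record the delays instead of actually sleeping, to keep the example fast.
delays = []
result = run_with_retry(flaky_copy, max_attempts=3, base_delay=1, sleep=delays.append)
```

Passing the sleep function as a parameter keeps the helper easy to test; in production it would default to `time.sleep`.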

To ensure data quality and accuracy, I will first use data profiling. Using data profiling activities in ADF, I will analyze the structure, completeness, uniqueness, and distribution of data in my datasets, and identify anomalies, missing values, duplicates, outliers, and other data quality issues. I'll implement data validation checks within ADF pipelines using activities such as lookup, edge test, or conditional split to verify data integrity and correctness. I will validate the data against predefined business rules, reference datasets, or expected patterns. I will use cleansing and transformation activities in ADF data flows to standardize, cleanse, and enrich data, and to remove or correct invalid data, outliers, duplicates, and inconsistencies using cleansing functions and logic. I'll implement error handling mechanisms in ADF pipelines to capture and handle data quality issues, unexpected exceptions, and failures, using the logging and monitoring features in ADF to track data quality metrics, error counts, and processing statistics. I will maintain metadata repositories or catalogs to track data lineage, data quality rules, transformations, and mappings. I will perform data reconciliation between source and target datasets or systems to ensure consistency and accuracy, monitor data quality trends and metrics, and track improvements to address recurring issues.
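The profiling and rule-based validation described here can be sketched without any ADF dependency. The example assumes records arrive as Python dictionaries; the field names `employee_id` and `hours`, the 0 to 24 hour rule, and the helpers `profile` and `split_by_rule` are all hypothetical.

```python
def profile(rows, key, required):
    """Count missing required values and find duplicated key values."""
    missing = {
        col: sum(1 for r in rows if r.get(col) in (None, ""))
        for col in required
    }
    seen, duplicates = set(), set()
    for r in rows:
        k = r.get(key)
        if k in seen:
            duplicates.add(k)
        seen.add(k)
    return {
        "row_count": len(rows),
        "missing": missing,
        "duplicate_keys": sorted(duplicates),
    }


def split_by_rule(rows, rule):
    """Split rows into (valid, rejected), like a conditional split."""
    valid = [r for r in rows if rule(r)]
    rejected = [r for r in rows if not rule(r)]
    return valid, rejected


# Illustrative timesheet records with one missing value, one duplicate key,
# and one out-of-range value.
rows = [
    {"employee_id": 1, "hours": 8},
    {"employee_id": 2, "hours": None},
    {"employee_id": 2, "hours": 30},
    {"employee_id": 3, "hours": 7.5},
]

report = profile(rows, key="employee_id", required=["hours"])
valid, rejected = split_by_rule(
    rows, lambda r: r["hours"] is not None and 0 <= r["hours"] <= 24
)
```

Rejected rows would typically be written to a quarantine location together with the rule they failed, so that recurring issues can be tracked over time.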

First, I will identify the dimension tables representing business entities such as customer, product, time, and so on, and I will identify the fact table containing the numeric measures and the foreign keys to the dimension tables. I will design the dimension tables in Snowflake using the CREATE TABLE statement, define the columns for the attributes of each dimension, for example customer ID, name, address, or product ID, and set appropriate data types, constraints, and default values. I will define surrogate keys on the dimension tables for improved performance and historical tracking. Second, I will design the fact table: define the columns for numeric measures, for example sales amount or quantity sold, and the foreign keys to the dimension tables, determine the grain of the fact table based on the business requirements, and set appropriate data types, constraints, and default values. Third, I will define primary key constraints on the dimension tables and foreign key constraints on the fact table to establish the relationships, using SQL statements or the Snowflake UI to add constraints as needed. I will then use Snowflake data loading tools, such as COPY INTO or Snowflake's data ingestion service, to load data into the dimension tables from source systems or files, applying transformations as needed during loading. Then I will load the data into the fact table. I will create indexes and clustering keys to improve query performance, especially for frequently accessed columns, and define the clustering keys on fact tables based on frequently accessed dimensions to optimize storage and query execution. Finally, I will test and validate the data model. These are the steps I would take to implement a star schema data model in Snowflake.
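A small star schema makes the dimension and fact split easier to see. The sketch below uses an in-memory SQLite database as a stand-in for Snowflake, so the DDL is SQLite syntax rather than Snowflake syntax, and all table and column names are hypothetical. One difference worth keeping in mind: on standard Snowflake tables, primary and foreign key constraints are recorded as metadata but not enforced (only NOT NULL is), so referential integrity has to be checked in the load process.

```python
import sqlite3

DDL = """
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,   -- surrogate key
    customer_id  TEXT NOT NULL,         -- business key from the source
    name         TEXT NOT NULL
);
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,    -- surrogate key
    product_id  TEXT NOT NULL,
    name        TEXT NOT NULL
);
CREATE TABLE fact_sales (               -- grain: one row per sale line
    customer_key  INTEGER NOT NULL REFERENCES dim_customer (customer_key),
    product_key   INTEGER NOT NULL REFERENCES dim_product (product_key),
    quantity_sold INTEGER NOT NULL,
    sales_amount  REAL NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)

# Load dimensions first, then the fact table that references them.
conn.executemany(
    "INSERT INTO dim_customer VALUES (?, ?, ?)",
    [(1, "C-100", "Asha"), (2, "C-200", "Ravi")],
)
conn.executemany(
    "INSERT INTO dim_product VALUES (?, ?, ?)",
    [(1, "P-1", "Widget"), (2, "P-2", "Gadget")],
)
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
    [(1, 1, 2, 20.0), (1, 2, 1, 15.0), (2, 1, 5, 50.0)],
)
conn.commit()

# Typical star-schema query: aggregate a measure by a dimension attribute.
sales_by_customer = conn.execute(
    """
    SELECT c.name, SUM(f.sales_amount)
    FROM fact_sales AS f
    JOIN dim_customer AS c ON c.customer_key = f.customer_key
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
```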

Here are the possible oversights and their potential impact on the deployment process. The first is incomplete testing: not performing comprehensive testing of pipelines and data integration workflows before deployment. This can lead to the deployment of faulty or incorrect pipelines, causing data inconsistencies, processing errors, and operational disruptions. The second is not using a version control system, for example Git, to manage ADF or Synapse artifacts and configurations. The potential impact is difficulty in tracking changes, managing code collaboration, and rolling back to previous versions in case of issues or regressions. The third is inadequate deployment automation: relying solely on manual deployments without implementing automated deployment pipelines. The impact is increased deployment time, human error, and inconsistencies between development, staging, and production environments. The fourth is limited monitoring and logging: neglecting to configure monitoring alerts, logging, and performance metrics for deployed pipelines. The potential impact is an inability to detect and troubleshoot issues properly, leading to prolonged downtime, data loss, and degraded performance. The fifth is insufficient security practices, such as not implementing proper access control, encryption, and data protection measures in CI/CD pipelines, which risks unauthorized access, data breaches, compliance violations, and loss of sensitive information. The last is ignoring rollback strategies: failing to define and test rollback strategies for failed deployments or production issues. The impact is difficulty in identifying and reverting changes, data corruption, and prolonged downtime.

There are a couple of issues. In the code snippet here, the COMMIT statement is missing. We should include a COMMIT statement to commit the transaction and make the changes permanent in the database. So, the COMMIT statement is missing in the current snippet.

To integrate real-time streaming data with batch-processed historical data, I will first identify the data sources: the real-time streaming sources, for example Kafka, Azure Event Hubs, or AWS Kinesis, and the batch-processed historical sources, for example databases and data lakes, that I want to integrate. Then I will choose the integration tool. I will use Snowflake's native connectors or third-party tools to ingest real-time and batch data into Snowflake. For real-time stream processing, I'll consider a new pipeline for continuous data loading from the stream source into Snowflake. Then I'll define the data ingestion patterns by determining the frequency and volume of real-time updates and batch loads, and choose the appropriate pattern, such as micro-batching or continuous streaming for real-time data and scheduled batch processing for historical data. Then I will use data transformation capabilities, such as data flows or external functions, to transform and enrich the incoming data, and join the real-time and historical data on common keys or time frames to create comprehensive datasets. I'll implement CDC: I will utilize Snowflake's Time Travel and CDC features, or a custom CDC mechanism, to capture incremental changes in real-time data and apply them to historical data. Then I will design the storage structures in Snowflake, such as tables and schemas, to accommodate both real-time and historical data, and use partitioning strategies based on time intervals, for example daily or hourly, or on business keys, for efficient storage and querying. I will implement data governance practices, such as access control, encryption, and data masking, for data security and compliance. Finally, I will use Snowflake's native SQL capabilities, including window functions, aggregations, and joins, to process and analyze the real-time and historical data.
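The micro-batching pattern mentioned in this answer can be shown with plain Python: events are grouped into small batches, and each batch is merged into existing historical totals on a common key. The store names, amounts, and the helpers `micro_batches` and `merge_batch` are hypothetical, and a list stands in for the stream.

```python
from itertools import islice


def micro_batches(events, batch_size):
    """Yield successive lists of at most batch_size events."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def merge_batch(totals, batch):
    """Add each (key, amount) event to the running total for its key."""
    for key, amount in batch:
        totals[key] = totals.get(key, 0) + amount
    return totals


# Historical totals from batch processing, keyed by a common business key.
historical_totals = {"store_a": 100, "store_b": 50}

# Stand-in for a real-time stream of (key, amount) events.
stream = [
    ("store_a", 5),
    ("store_c", 7),
    ("store_b", 1),
    ("store_a", 2),
    ("store_c", 3),
]

batch_sizes = []
for batch in micro_batches(stream, batch_size=2):
    batch_sizes.append(len(batch))
    merge_batch(historical_totals, batch)
```

The batch size trades latency against overhead: smaller batches surface new data sooner, while larger ones reduce the number of load operations.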

To improve collaboration and reduce wait time, I will first set up version control. I'll use a version control system, for example Git, to manage Snowflake objects such as databases, schemas, tables, and views. Then I'll implement infrastructure as code (IaC) practices by defining the Snowflake objects and configurations in code, using SQL scripts or the Snowflake scripting language. I will set up a CI pipeline to automate the validation and integration of code changes into the Snowflake environment, and use CI tools and frameworks to trigger the automated build, test, and deployment of Snowflake objects. Then I'll implement CD pipelines to automate the deployment of validated code changes to Snowflake environments, and define deployment strategies that minimize downtime and ensure a smooth rollout of changes. I will use configuration management tools to manage Snowflake objects, configurations, environment variables, and connection settings, ensuring consistency and traceability of configurations across different Snowflake environments. I will develop automated tests for Snowflake objects, for example SQL scripts and stored procedures, to validate functionality, performance, and data quality, and I will include unit tests, integration tests, and regression tests in the CI pipelines to catch issues early in the development cycle. I'll set up monitoring and alerting mechanisms in Snowflake to track performance metrics, such as resource usage and data pipeline health, using monitoring tools and dashboards to proactively identify and address potential issues in Snowflake data operations. I'll maintain comprehensive documentation for Snowflake objects, data pipelines, configurations, and deployment processes, and conduct regular knowledge-sharing sessions and workshops to educate my team members on best practices, tools, and techniques for Snowflake.
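A common way to combine version-controlled SQL scripts with automated deployment is a small migration runner that applies each script once and records what it has applied. The sketch below shows the idea against an in-memory SQLite database as a stand-in for Snowflake; the `change_history` table, the version labels, and `apply_migrations` are hypothetical, and dedicated tools exist that do this more robustly.

```python
import sqlite3


def apply_migrations(conn, migrations):
    """Apply (version, sql) pairs in version order, skipping applied ones."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS change_history (version TEXT PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT version FROM change_history")}
    newly_applied = []
    # Versions are zero-padded so that plain string sorting gives the right order.
    for version, sql in sorted(migrations):
        if version in applied:
            continue
        conn.executescript(sql)
        conn.execute("INSERT INTO change_history (version) VALUES (?)", (version,))
        conn.commit()
        newly_applied.append(version)
    return newly_applied


# In practice these would be SQL files stored in Git; order here is shuffled
# on purpose to show that the runner sorts by version.
migrations = [
    ("V002", "ALTER TABLE customers ADD COLUMN email TEXT;"),
    ("V001", "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);"),
]

conn = sqlite3.connect(":memory:")
first_run = apply_migrations(conn, migrations)
second_run = apply_migrations(conn, migrations)  # nothing left to apply
columns = [row[1] for row in conn.execute("PRAGMA table_info(customers)")]
```

Because the runner is idempotent, a CD pipeline can execute it on every deployment without reapplying earlier changes.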

For a version control mechanism for data pipeline deployment, I will first use source control to store the ADF artifacts, such as pipelines, datasets, and linked services, in a version control system. I'll use branches to manage different development stages, for example development, testing, or production, and isolate changes until they are ready for deployment. Then I will export the ARM templates, which are Azure Resource Manager templates, for my ADF resources from the Azure portal or using the Azure CLI. I will store these ARM templates in my version control system to track changes and manage deployments, and I will automate the deployment with CI/CD. I'll set up CI pipelines to automatically build and validate changes to the ADF artifacts stored in version control, and use CD pipelines to automate the deployment of validated changes to the ADF environment. I will implement release management practices to promote changes to different environments, for example QA or production, in a controlled manner, and define approval flows and gates to ensure that only approved changes are deployed to production. I'll use versioning and tagging in my version control system to track releases, milestones, and important changes to ADF artifacts, applying semantic versioning or a similar scheme to clearly indicate the significance of each release. I'll regularly back up my ADF resources and configurations, including ARM templates, pipelines, and datasets, and implement procedures for restoring ADF resources from backup in case of data loss or deployment issues. I'll monitor deployments and track changes to our ADF artifacts using the logging, auditing, and monitoring tools provided by Azure, and implement alert mechanisms to notify relevant stakeholders of deployment failures or issues.
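Semantic versioning, mentioned here for tagging releases, uses a MAJOR.MINOR.PATCH number where the part that changes signals the significance of the release. A minimal sketch of the bump rule follows; `bump_version` is a hypothetical helper, and it ignores pre-release and build suffixes.

```python
def bump_version(version, change):
    """Return the next MAJOR.MINOR.PATCH version for the given change type."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "major":  # breaking change: reset minor and patch
        return f"{major + 1}.0.0"
    if change == "minor":  # backward-compatible feature: reset patch
        return f"{major}.{minor + 1}.0"
    if change == "patch":  # backward-compatible fix
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change}")


# Illustrative release history for a set of pipeline artifacts.
next_fix = bump_version("1.4.2", "patch")
next_feature = bump_version("1.4.2", "minor")
next_breaking = bump_version("1.4.2", "major")
```

The resulting string can be used as a Git tag on the commit that holds the exported templates, so that each deployment maps back to an exact revision.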