profile-pic
Vetted Talent

Siddharth Sujir

Vetted Talent
With over a decade of experience in the field, I have honed my skills in roles that require expertise in SQL, Apache Spark, AWS, and Snowflake. My journey in the industry has equipped me with a deep understanding of database management, data processing, cloud computing, and data warehousing. I have successfully leveraged these skills to drive business insights, optimize data workflows, and create scalable solutions for organizations. Through continuous learning and hands-on experience, I am well-equipped to tackle complex data challenges and deliver valuable outcomes.
  • Role

    Sr. Consultant - Data Engineer

  • Years of Experience

    10 years

  • Professional Portfolio

    View here

Skillsets

  • Apache Kafka
  • SparkSQL
  • PostgreSQL
  • dbt
  • Microsoft SSIS
  • Sqoop
  • Scala
  • Redshift
  • Hive
  • MS SQL Server
  • Spring Boot
  • Hadoop
  • Oracle
  • Jenkins
  • SQL - 10 Years
  • MySQL - 4 Years
  • MySQL - 4 Years
  • Python - 4 Years
  • Python - 4 Years
  • Java - 10 Years
  • Java - 10 Years
  • Snowflake - 1.5 Years
  • Snowflake - 1 Years
  • AWS - 2 Years
  • AWS - 2 Years
  • Apache Spark - 6 Years
  • Apache Spark - 6 Years
  • SQL - 10 Years

Vetted For

9Skills
  • Roles & Skills
  • Results
  • Details
  • icon-skill_image
    Data Engineer (Remote)AI Screening
  • 67%
    icon-arrow-down
  • Skills assessed :Team Collaboration, Data Modeling, ETL, Snowflake, Postgre SQL, Problem Solving Attitude, Python, SQL, Strong Attention to Detail
  • Score: 60/90

Professional Summary

10Years
  • Feb, 2023 - Present3 yr 4 months

    Sr. Consultant - Data Engineer

    Visa Inc.
  • Jun, 2022 - Feb, 2023 8 months

    Data Engineer II

    Amazon Music
  • May, 2021 - Jun, 20221 yr 1 month

    Data Engineer

    Zoom Video Communications
  • Nov, 2011 - Apr, 20142 yr 5 months

    Senior Systems Engineer

    Infosys Limited
  • Jun, 2016 - Jun, 20193 yr

    Sr. Software Engineer

    Visa Inc.
  • Jun, 2019 - Apr, 20211 yr 10 months

    Staff Software Engineer

    Visa Inc.

Applications & Tools Known

  • icon-tool

    Kafka

  • icon-tool

    Airflow

  • icon-tool

    Apache Spark

  • icon-tool

    AWS

  • icon-tool

    Snowflake

  • icon-tool

    Redshift

  • icon-tool

    Sqoop

  • icon-tool

    Hive

  • icon-tool

    Hadoop

  • icon-tool

    Spring Boot

  • icon-tool

    Airflow

  • icon-tool

    AWS

  • icon-tool

    Redshift

  • icon-tool

    Hive

  • icon-tool

    Sqoop

  • icon-tool

    Microsoft SQL Server

  • icon-tool

    Jenkins

  • icon-tool

    PostgreSQL

  • icon-tool

    MySQL

  • icon-tool

    Oracle

  • icon-tool

    Apache Kafka

  • icon-tool

    Hive

  • icon-tool

    AWS

Work History

10Years

Sr. Consultant - Data Engineer

Visa Inc.
Feb, 2023 - Present3 yr 4 months
    Leading a data engineering team in transitioning batch data processing pipeline to a streaming pipeline using Kafka, Apache Spark, resulting in a significant reduction of data delivery time. Designed and developed Kafka consumer microservices using Spring boot to consume data from Kafka topic and push updates to downstream applications in real time.

Data Engineer II

Amazon Music
Jun, 2022 - Feb, 2023 8 months
    Developed and maintained events data ETL pipeline using Airflow, Apache Spark and AWS tools. Redesigned data model for legacy datasets to adhere to data best practices and address customer pain points.

Data Engineer

Zoom Video Communications
May, 2021 - Jun, 20221 yr 1 month
    Developed data pipelines using Airflow, Apache Spark and DBT to import data and transform it for analytics and critical business decisions. Collaborated closely with data scientists to address their data needs by ingesting from multiple sources and loading transformed data into data warehouses such as Snowflake and Redshift.

Staff Software Engineer

Visa Inc.
Jun, 2019 - Apr, 20211 yr 10 months
    Designed and developed rule engine using Apache Spark, Sqoop and Hive for complex transformation and aggregation of card transactions. Developed REST APIs for reporting application dashboards and optimized ETL pipelines resulting in 40% performance improvement.

Sr. Software Engineer

Visa Inc.
Jun, 2016 - Jun, 20193 yr
    Developed ETL pipelines using Apache Spark and Hive for data transformation and analytics. Worked on building a recommendation engine using Spark MLlib for transaction recommendations.

Senior Systems Engineer

Infosys Limited
Nov, 2011 - Apr, 20142 yr 5 months
    Created design documents, test plans, and developed features using Java, Microsoft SSIS, and SSRS. Designed data warehouses using star-schema model, and built ETL solutions with Microsoft SSIS.

Achievements

  • Rewrote batch data processing to streaming pipeline
  • Reduced data delivery timeline from 48 hours to under 24 hours
  • Developed Kafka consumer microservices
  • Participated in data model overhaul adhering to best practices
  • Developed rule engine for complex calculations on card transactions
  • Recommended transactions using machine-learning algorithms
  • Evaluated indexing features for faster analytics

Major Projects

2Projects

Recommendation System using Apache Spark

    Implemented recommendation system using collaborative filtering approach on Apache Spark with large datasets stored on Hadoop cluster.

Geo-Spatial Operations on Apache Spark

    Developed Geo-Spatial operations for convex hull generation, farthest/closest point calculations, spatial joins, and range queries on high volumes of data using Apache Spark and Hadoop.

Education

  • Master of Computer Science

    Arizona State University (2016)
  • Bachelor of Engineering, Computer Science and Engineering

    Anna University (2011)

AI-interview Questions & Answers

I have about 10 years of experience in industry where I have worked on multiple data engineering projects. So, I have a master's degree in computer science from Arizona State University. So, after my master's degree, I joined Visa as a senior software engineer. I worked with Visa for about 5 years on the Visa Commercial Services team. So, I have experience working on multiple applications within Visa, where I primarily worked as a data engineer. So, I started my initial project by maintaining an ETL pipeline that was written in the Microsoft SQL tech stack, and then I had an opportunity to work on some big data projects using Spark. So, I also worked on developing some of the back-end services using Java. So, after working at Visa for about 5 years, I moved to Zoom. With Zoom, I was primarily part of the data science team. I worked with data scientists to develop data pipelines to extract data from different sources, transform it, and load it into the final data variables. The tech stack I worked on included the AWS tech stack, I used S3, I used Redshift, and I used Airflow as an orchestration tool. So, after working at Zoom for about a year, I moved to Amazon. In Amazon, I worked with the Amazon music team. So, I worked primarily on pulling streaming data or events-based data, and the tech stack included the AWS tech stack to build data pipelines. I used PySpark, which was running on an EMR instance. Airflow was the primary orchestration tool here. The programming language was Python, and the data variables I used here were Redshift, and I also used a little bit of RDS as well. And after working at Amazon for close to a year, I again had an opportunity to join Visa. In Visa, I joined back the same team, which was the data services team. The project that I'm currently working on is like this: I'm leading a team to rewrite some of the batch jobs into a streaming job. The tech stack includes Spark streaming, Apache Kafka as the streaming tool where we write the data into topics, which are consumed by downstream consumers. The tech stack here includes Spark streaming and Hive as a primary data source, and I also have an opportunity to work on microservices using Java Spring Boot. This is a consumer application, which reads from the same Kafka topic, which our streaming job writes into, and it just prepares the data and then pushes it to a third-party consumer. So, this is about my overall experience.

In terms of detecting data skews and snowflake data warps, primarily what I would be doing is the snowflake would be running on top of any cloud storage, which would be streamed in case of AWS or GCS in case of GCP cloud platform. So, the data would be partitioned on the SOFistry bucket, and it can be based on a location or a region or a marketplace, or it can be based on the event timestamp. I would be checking for every partition if it is partitioned by region, and I would be checking the volume of data in each of these regions. Are there any chances where a specific region has a larger volume of data compared to another region, for example, the US region has a larger volume of data compared to the Asia-Pacific region? I would be dividing the buckets and the partitions into sub-partitions, which can be in terms of event process time, so that processing it based on event process time can be distributed uniformly. Alternatively, I could sub-partition into a region or a state or any other sub-partition to handle the data appropriately. Another thing I would do is when I'm dumping the data into the final data warehouse, I would make sure that the data is uniformly distributed, possibly through an alternate key, for example, if the region partition has a huge volume of data, I would create another key on top of it by applying some techniques like sorting to distribute the data uniformly.

Some of the complex, some of the secret techniques that can be used in data manipulation include making use of Windows functions or scans to calculate running averages on top of incoming data. So, you can do real-time running averages. If you're doing aggregates on top of incoming streaming data, you can apply a Windows function to do a running average or running application on top of the data. That's one technique I can think of. And, the data manipulation techniques I can think of is making use of joins. If there's incoming data that we have to screen with some of the lookup tables that exist within the Postgres table, we can have the data joined with some of the smaller metadata tables to do an investigation.

Okay, so in terms of error logging mechanism, so what we can do is we can have an error logging mechanism at every stage of the ETA pipeline, starting from the source, the transformation layer, and as well as at the loading layer. At the source, I would have logs like the partitions that we are reading from, the source location that we are reading from, and at the completion of the reading, I would log messages saying that the message has been read successfully or if it runs into exceptions, I would log the exception messages accordingly. Likewise, the same with the transformation layer, where I would be getting a lot of details about the data, the data model that we are reading it from, and what kind of transformation that we are applying, such as multiple layers of transformation like data cleansing, data aggregation, and all those things. Each of these steps, I would be logging the appropriate logs. The same goes in terms of data dumping into the data warehouse as well, with the number of records that are getting inserted into the final data warehouse and what kind of data table it is writing into. These are some of the logs that I can think of. In terms of retry mechanism at every layer, I would be using an orchestration tool like Airflow, which has an inbuilt retry mechanism. We can configure the retry mechanism of every component to retry in x number of times, and if at all x number of attempts has been failed, I would log the appropriate message about the failure. We could also have alerts that can be enabled if the retry of the particular layer has run into issues and even after retry, it is failing at every layer, we could send out an alert to the respective team, which can be the dev team or support team, so that they could take a look at the logs and they could take appropriate actions.

So some of the best practices in Python for reducing memory usage and processing large datasets. One option that I can think of was using yield instead of a return. So, what yield does is yield this kind of a generator or just a function called a generator within Python. Doing a return accumulates the results in memory. Once all the results are available, it returns the entire object, which can be a list or a tuple or whatever data structure that we are working on. However, when it comes to yield, rather than storing the entire data of your object in memory, it keeps passing the data to the caller record by record basis. So thereby, even if we are processing a large volume of data, it doesn't store them in memory and doesn't run into any memory issues in Python. Another best practice is to read the data in batches rather than reading an entire dataset and putting it into memory. Just divide the data into patches with a batch size so that you don't.

So, in terms of automating the data quality checks post ETL and Snowflake, what I would think of is having a batch of, which is scheduled using Airflow. So, after a data is loaded, or after a schedule of ETL is complete, so I would have a batch of triggered in post ETL which would do some kind of data quality checks. Some of the data quality checks I would do is checking for any duplicates that have duplicate data and checking any null values in the final data source. And also in terms of data validation, if there are any bad data that is passed into the final data warehouse, it is something which we can check. And also in terms of the business values, if there is any value which doesn't fit into the business logic, for example, if we are considering about age-specific domain of a school database, having a child's age less than 5 years doesn't make sense. So, we can have these checks enabled as a part of the data check job that we are running. And there is something which we can schedule using Airflow. After the final ETL is completed, we can have a post ETL job which can be triggered to do this kind of data quality checks and enable triggers accordingly.

So, one of the performance bottlenecks in this query is, you see, there is an order by clause that is being used, okay, order by descending, this one is just to fetch the queries, fetching the top 10 average salaries of employees, so instead of, you see, group by is being used and order by is being used, so a better option that I can think of is using some window function, maybe use a rank function partitioned by name and ordered by salary and once we have the rank of all the employees based on the average salary, we can just filter the query to get the top 10 rank from the employee table. So, yeah, using a window function instead of group by, it would reduce the performance bottleneck on top of the last dataset.

So what I'm saying is that I think it is violating is that, the code is violating the single responsibility principle. The data process class has way too many methods which are doing too many things. It is doing reading of the data, it is doing the processing of data within the same class, it is writing the data and it is blocking the error, which violates the single responsibility principle of the SOLID. So, what we can do to avoid this is, we could refer to the code for better meta-analytic practices, like put each of these methods in a separate class. So that each of these classes have a separate responsibility. And that would be easier to maintain the code. And also, like,

In terms of ETL workflow, that can have required schema evolution. So, what we can do is have this if we have these submissions, mostly tables designed in such a way that we can have a table divided into two types of fields. One is some of the key fields, which don't change on a regular basis. And the other type of field, which can evolve over time. So, that is something which can be put into some sort of data map. A data map would store data in adjacent format where any new fields can be added or removed over time. So, and also, having that structure of the table will remain the same. In terms of having a detailed workflow for it, what we can do is, so we can have if data is sourced from S3, we can have the data. Let's say there are three data in the S3 stored in a JSON format, so we can identify the key fields from this particular JSON object. We can have that mapped to the key fields in this notebook table. And in terms of the other fields, that are available in this S3, it can be opened as an adjacent object, so that even if the fields that are added on a frequent basis or fields that are depending on the data that reduces on a day-to-day basis, the structure won't impact the structure of the table. And it can be easy to write into the data model project as well. And Smoke Lake provides support to work with adjacent. And whenever we are doing, we can design it in such a way that whenever we're quitting any specific fields, we can identify whether that field is available or not, and it provides query support for that.