
Siddharth Sujir

Vetted Talent
With over a decade of experience in the field, I have honed my skills in roles that require expertise in SQL, Apache Spark, AWS, and Snowflake. My journey in the industry has equipped me with a deep understanding of database management, data processing, cloud computing, and data warehousing. I have successfully leveraged these skills to drive business insights, optimize data workflows, and create scalable solutions for organizations. Through continuous learning and hands-on experience, I am well-equipped to tackle complex data challenges and deliver valuable outcomes.
  • Role

    Sr. Consultant - Data Engineer

  • Years of Experience

    10 years

  • Professional Portfolio

    View here

Skillsets

  • Apache Kafka
  • SparkSQL
  • PostgreSQL
  • dbt
  • Microsoft SSIS
  • Sqoop
  • Scala
  • Redshift
  • Hive
  • MS SQL Server
  • Spring Boot
  • Hadoop
  • Oracle
  • Jenkins
  • SQL - 10 Years
  • Java - 10 Years
  • Apache Spark - 6 Years
  • MySQL - 4 Years
  • Python - 4 Years
  • AWS - 2 Years
  • Snowflake - 1.5 Years

Vetted For

9 Skills
  • Role: Data Engineer (Remote) - AI Screening
  • Result: 67%
  • Skills assessed: Team Collaboration, Data Modeling, ETL, Snowflake, PostgreSQL, Problem Solving Attitude, Python, SQL, Strong Attention to Detail
  • Score: 60/90

Professional Summary

10 Years
  • Feb, 2023 - Present (2 yr 10 months)

    Sr. Consultant - Data Engineer

    Visa Inc.
  • Jun, 2022 - Feb, 2023 (8 months)

    Data Engineer II

    Amazon Music
  • May, 2021 - Jun, 2022 (1 yr 1 month)

    Data Engineer

    Zoom Video Communications
  • Jun, 2019 - Apr, 2021 (1 yr 10 months)

    Staff Software Engineer

    Visa Inc.
  • Jun, 2016 - Jun, 2019 (3 yr)

    Sr. Software Engineer

    Visa Inc.
  • Nov, 2011 - Apr, 2014 (2 yr 5 months)

    Senior Systems Engineer

    Infosys Limited

Applications & Tools Known

  • Apache Kafka
  • Airflow
  • Apache Spark
  • AWS
  • Snowflake
  • Redshift
  • Sqoop
  • Hive
  • Hadoop
  • Spring Boot
  • Microsoft SQL Server
  • Jenkins
  • PostgreSQL
  • MySQL
  • Oracle

Work History

10 Years

Sr. Consultant - Data Engineer

Visa Inc.
Feb, 2023 - Present (2 yr 10 months)
    Leading a data engineering team in transitioning a batch data processing pipeline to a streaming pipeline using Kafka and Apache Spark, resulting in a significant reduction in data delivery time. Designed and developed Kafka consumer microservices using Spring Boot to consume data from Kafka topics and push updates to downstream applications in real time.
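A minimal sketch of the kind of streaming job described above, assuming PySpark Structured Streaming; the broker, topic, schema, and output paths are placeholders, not the actual Visa setup, and the Kafka source requires the spark-sql-kafka package on the classpath.

    # Hypothetical sketch: read transaction events from a Kafka topic with
    # Spark Structured Streaming and land them in micro-batches.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("txn-streaming").getOrCreate()

    schema = StructType([
        StructField("txn_id", StringType()),
        StructField("merchant", StringType()),
        StructField("amount", DoubleType()),
    ])

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
           .option("subscribe", "transactions")                # placeholder topic
           .load())

    parsed = (raw.select(from_json(col("value").cast("string"), schema).alias("txn"))
                 .select("txn.*"))

    query = (parsed.writeStream
             .format("parquet")
             .option("path", "/data/transactions")             # placeholder output path
             .option("checkpointLocation", "/chk/transactions")
             .start())
    query.awaitTermination()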

Data Engineer II

Amazon Music
Jun, 2022 - Feb, 2023 (8 months)
    Developed and maintained an events data ETL pipeline using Airflow, Apache Spark, and AWS tools. Redesigned the data model for legacy datasets to adhere to data best practices and address customer pain points.

Data Engineer

Zoom Video Communications
May, 2021 - Jun, 2022 (1 yr 1 month)
    Developed data pipelines using Airflow, Apache Spark, and dbt to import data and transform it for analytics and critical business decisions. Collaborated closely with data scientists to address their data needs by ingesting from multiple sources and loading transformed data into data warehouses such as Snowflake and Redshift.

Staff Software Engineer

Visa Inc.
Jun, 2019 - Apr, 2021 (1 yr 10 months)
    Designed and developed a rule engine using Apache Spark, Sqoop, and Hive for complex transformation and aggregation of card transactions. Developed REST APIs for reporting application dashboards and optimized ETL pipelines, resulting in a 40% performance improvement.

Sr. Software Engineer

Visa Inc.
Jun, 2016 - Jun, 2019 (3 yr)
    Developed ETL pipelines using Apache Spark and Hive for data transformation and analytics. Worked on building a recommendation engine using Spark MLlib for transaction recommendations.

Senior Systems Engineer

Infosys Limited
Nov, 2011 - Apr, 2014 (2 yr 5 months)
    Created design documents, test plans, and developed features using Java, Microsoft SSIS, and SSRS. Designed data warehouses using star-schema model, and built ETL solutions with Microsoft SSIS.

Achievements

  • Rewrote batch data processing to streaming pipeline
  • Reduced data delivery timeline from 48 hours to under 24 hours
  • Developed Kafka consumer microservices
  • Participated in data model overhaul adhering to best practices
  • Developed rule engine for complex calculations on card transactions
  • Recommended transactions using machine-learning algorithms
  • Evaluated indexing features for faster analytics

Major Projects

2 Projects

Recommendation System using Apache Spark

    Implemented recommendation system using collaborative filtering approach on Apache Spark with large datasets stored on Hadoop cluster.
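A minimal sketch of collaborative filtering with Spark MLlib's ALS, assuming a ratings dataset with numeric user_id, item_id, and rating columns; the path and column names are illustrative.

    # Illustrative ALS collaborative-filtering sketch (column names are assumptions).
    from pyspark.sql import SparkSession
    from pyspark.ml.recommendation import ALS

    spark = SparkSession.builder.appName("recs").getOrCreate()
    # Placeholder input: user_id and item_id must be numeric for ALS.
    ratings = spark.read.parquet("/data/ratings")

    als = ALS(userCol="user_id", itemCol="item_id", ratingCol="rating",
              rank=10, maxIter=10, regParam=0.1, coldStartStrategy="drop")
    model = als.fit(ratings)

    # Top-10 item recommendations per user
    recs = model.recommendForAllUsers(10)
    recs.show(truncate=False)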

Geo-Spatial Operations on Apache Spark

    Developed Geo-Spatial operations for convex hull generation, farthest/closest point calculations, spatial joins, and range queries on high volumes of data using Apache Spark and Hadoop.
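A rough sketch of one of these operations, computing a convex hull per key with Andrew's monotone-chain algorithm inside a Spark job; the region keys and point tuples are assumptions about the data layout.

    # Hypothetical sketch: per-key convex hulls on an RDD of (region, (x, y)) points.
    from pyspark.sql import SparkSession

    def convex_hull(points):
        """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
        pts = sorted(set(points))
        if len(pts) <= 2:
            return pts
        def cross(o, a, b):
            return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
        hull = []
        for seq in (pts, list(reversed(pts))):   # lower hull, then upper hull
            part = []
            for p in seq:
                while len(part) >= 2 and cross(part[-2], part[-1], p) <= 0:
                    part.pop()
                part.append(p)
            hull.extend(part[:-1])
        return hull

    spark = SparkSession.builder.appName("geo").getOrCreate()
    points = spark.sparkContext.parallelize(
        [("us-west", (0.0, 0.0)), ("us-west", (1.0, 0.0)),
         ("us-west", (1.0, 1.0)), ("us-west", (0.6, 0.3))])   # interior point drops out
    hulls = points.groupByKey().mapValues(lambda ps: convex_hull(list(ps)))
    print(hulls.collect())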

Education

  • Master of Computer Science

    Arizona State University (2016)
  • Bachelor of Engineering, Computer Science and Engineering

    Anna University (2011)

AI-interview Questions & Answers

I have about 10 years of experience in the industry, where I have worked on multiple data engineering projects. I have a master's degree in computer science from Arizona State University. After my master's, I joined Visa as a senior software engineer and worked there for about five years with the Visa Commercial Services team. I was part of multiple applications within Visa, primarily working as a data engineer. I started by maintaining an ETL pipeline written on the Microsoft SQL tech stack, then had the opportunity to work on big data projects using Spark, and I also developed some back-end services using Java. After about five years at Visa, I moved to Zoom, where I was part of the data science team. I worked with data scientists to develop data pipelines that extract data from different sources, transform it, and load it into the final data warehouses. The tech stack was AWS: I used S3 and Redshift, with Airflow as the orchestration tool. After about a year at Zoom, I moved to Amazon, where I worked with the Amazon Music team, primarily on pulling streaming and events-based data. The tech stack was again AWS: PySpark running on EMR, Airflow as the orchestration tool, Python as the programming language, and Redshift as the data warehouse, along with a bit of RDS. After close to a year at Amazon, I had the opportunity to rejoin Visa on the same data services team. The project I'm currently working on involves leading a team to rewrite some of our batch jobs as streaming jobs. The tech stack includes Spark Streaming, with Apache Kafka as the streaming platform where we write data into topics that are consumed by downstream consumers, and Hive as the primary data source. I also work on microservices using Java and Spring Boot: a consumer application that reads from the same Kafka topic our streaming job writes into, prepares the data, and pushes it to a third-party consumer. That is my overall experience.

In terms of detecting data skew in a Snowflake data warehouse: Snowflake runs on top of cloud storage, which would be S3 in the case of AWS or GCS in the case of GCP. The data would typically be partitioned in the S3 bucket, for example by location, region, or marketplace, or by event timestamp. What I would do is check the volume of data in each partition: if the data is partitioned by region, I would check whether any specific region has a much larger volume of data than another, say the US region holding far more data than the Asia-Pacific region. In that case, I would divide the skewed partitions into sub-partitions, for example by event processing time, where processing based on event time tends to be distributed more uniformly, or sub-partition by state or another attribute so the data can be handled appropriately. The other thing I would do, when loading the data into the final data warehouse, is make sure it is uniformly distributed, perhaps through an alternate key: if the region partition has a huge volume of data, I would create another key on top of it, or apply techniques such as sorting, to distribute the data uniformly.
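A small illustrative sketch, assuming PySpark and a hypothetical events dataset partitioned by region, of how the skew check and the "alternate key" (salting) idea might look:

    # Hypothetical skew check and salting sketch; paths and column names are assumptions.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("skew-check").getOrCreate()
    events = spark.read.parquet("/data/events")  # placeholder path with a 'region' column

    # 1. Detect skew: row counts per region partition, largest first
    events.groupBy("region").count().orderBy(F.desc("count")).show()

    # 2. Mitigate skew: add a salt so a hot region is split across many sub-partitions
    salted = (events.withColumn("salt", (F.rand() * 16).cast("int"))
                    .repartition("region", "salt"))
    salted.write.mode("overwrite").partitionBy("region").parquet("/data/events_balanced")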

Some of the SQL techniques that can be used in data manipulation include window functions, for example to compute running averages on top of incoming data. If we are doing aggregates on top of incoming streaming data, we can apply a window function to compute running averages or running aggregations over the data in near real time. That is one technique I can think of. Another is making use of joins: if there is incoming data that we have to enrich against lookup tables that exist in Postgres, we can join the data with some of the smaller metadata tables to do that enrichment.
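A brief sketch of the running-average idea using a window function, written here with PySpark; the account/amount columns and sample rows are illustrative.

    # Illustrative running average over incoming transactions using a window function.
    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("running-avg").getOrCreate()
    txns = spark.createDataFrame(
        [("acct-1", "2024-01-01", 10.0), ("acct-1", "2024-01-02", 30.0),
         ("acct-2", "2024-01-01", 5.0)],
        ["account_id", "txn_date", "amount"])

    # Running average of amount per account, ordered by date
    w = (Window.partitionBy("account_id").orderBy("txn_date")
               .rowsBetween(Window.unboundedPreceding, Window.currentRow))
    txns.withColumn("running_avg", F.avg("amount").over(w)).show()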

In terms of an error logging mechanism: if we are building an ETL pipeline, there are multiple stages, for example reading the data from the source, doing the data transformation on top of it, and finally loading the data into the final data warehouse. I would have a logging mechanism at every stage of this pipeline: the source layer, the transformation layer, and the loading layer. At the source, I would log details such as the partitions and source locations we are reading from, and on completion of the read I would log a message saying the data has been read successfully, or log the exception messages if it runs into errors. The same applies at the transformation layer, where we would log details about the data, the data model we are reading from, and the kinds of transformations being applied; there are multiple layers of transformation, such as data cleansing and data aggregation, and at each step we would log the appropriate messages. The same goes for loading into the data warehouse: the number of records being inserted and which table they are written into. In terms of a retry mechanism, at every layer I would use an orchestration tool like Airflow, which has a built-in retry mechanism, so we can configure each component to retry x number of times. If all x attempts fail, we would log the appropriate failure message, and we could also enable alerts so that if a particular layer keeps failing even after retries, an alert is sent to the respective team, whether that is the dev team or the support team, so they can look at the logs and take appropriate action.
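A minimal Airflow sketch of the retry and alerting configuration described above, assuming Airflow 2.4+; the task bodies, schedule, and alert email are placeholders.

    # Minimal Airflow DAG sketch: per-task logging plus built-in retries and failure alerts.
    import logging
    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    log = logging.getLogger(__name__)

    def extract():
        log.info("Reading partitions from source location ...")      # source-layer logging

    def transform():
        log.info("Applying cleansing and aggregation steps ...")     # transform-layer logging

    def load():
        log.info("Loading records into the target warehouse table ...")  # load-layer logging

    default_args = {
        "retries": 3,                          # retry each task up to 3 times
        "retry_delay": timedelta(minutes=5),
        "email": ["data-oncall@example.com"],  # placeholder alert recipient
        "email_on_failure": True,
    }

    with DAG("etl_with_retries", start_date=datetime(2024, 1, 1),
             schedule="@daily", catchup=False, default_args=default_args) as dag:
        (PythonOperator(task_id="extract", python_callable=extract)
         >> PythonOperator(task_id="transform", python_callable=transform)
         >> PythonOperator(task_id="load", python_callable=load))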

Some of the best practices in Python for reducing memory usage when processing large datasets: one option I can think of is using yield instead of return. A function that uses yield is a generator in Python. With return, the function accumulates its results in memory and, once all the results are available, returns the entire object, which could be a list, a tuple, or whatever data structure we are working with. With yield, rather than storing the entire object in memory, the function passes records to the caller one at a time, so even if we are processing a large volume of data, it is not all held in memory and we don't run into memory issues in Python. That is one best practice I can think of. The other is reading the data in batches rather than reading the entire dataset into memory: divide the data into batches of a fixed size so that memory usage stays bounded.
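A short sketch of both ideas, using a generator to stream records and a fixed batch size; the file name, parsing logic, and batch size are arbitrary.

    # Generator-based reading: records are yielded one at a time instead of being
    # accumulated into a list, and then processed in fixed-size batches.
    from itertools import islice

    def read_records(path):
        """Yield one parsed record at a time instead of returning a full list."""
        with open(path) as f:
            for line in f:
                yield line.rstrip("\n").split(",")   # placeholder parsing logic

    def batches(records, batch_size=10_000):
        """Group a record iterator into fixed-size batches."""
        it = iter(records)
        while True:
            batch = list(islice(it, batch_size))
            if not batch:
                return
            yield batch

    for batch in batches(read_records("events.csv")):  # placeholder file name
        pass  # process/load one bounded batch at a time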

In terms of automating data quality checks post-ETL in Snowflake, what I would think of is having a batch job scheduled with Airflow. After the data is loaded, or after a scheduled ETL run completes, a post-ETL job is triggered that performs data quality checks. Some of the checks I would do are checking for duplicate data and checking for null values in the final data source. In terms of data validation, if any bad data has been passed into the final data warehouse, that is something we can check for as well. We can also validate business rules: if there is a value that doesn't fit the business logic, for example, in a school database a student age of less than 5 years doesn't make sense, so we can enable that kind of check as part of the data quality job. All of this can be scheduled with Airflow: after the final ETL completes, the post-ETL job runs these data quality checks and raises alerts accordingly.
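A small sketch of what such post-ETL checks might look like, assuming a hypothetical run_query helper that executes a SQL statement against the warehouse and returns a single count; the table and column names are illustrative.

    # Illustrative post-ETL data quality checks; run_query is a hypothetical helper.
    CHECKS = {
        "duplicates": """
            SELECT COUNT(*) FROM (
                SELECT student_id FROM students GROUP BY student_id HAVING COUNT(*) > 1
            )""",
        "null_keys": "SELECT COUNT(*) FROM students WHERE student_id IS NULL",
        "bad_ages":  "SELECT COUNT(*) FROM students WHERE age < 5",  # business-rule check
    }

    def run_quality_checks(run_query):
        failures = {}
        for name, sql in CHECKS.items():
            bad_rows = run_query(sql)          # expected to return a single count
            if bad_rows and bad_rows > 0:
                failures[name] = bad_rows
        if failures:
            # In practice this would fail the Airflow task or trigger an alert.
            raise ValueError(f"Data quality checks failed: {failures}")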

One of the performance bottlenecks in this query is the ORDER BY clause: the query uses GROUP BY and then ORDER BY descending just to fetch the top 10 average salaries of employees. Instead of combining GROUP BY with an ORDER BY over the full result, a better option I can think of is using a window function, for example a RANK function ordered by the average salary; once we have the rank of all employees based on average salary, we can simply filter for the top 10 ranks from the employee table. Using a window function instead of sorting the entire grouped result would reduce the performance bottleneck on a large dataset.
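The screening query itself isn't shown here, but assuming it was a GROUP BY with ORDER BY ... LIMIT 10 over employee salaries, the window-function variant described in the answer might look roughly like this; the table and column names are assumptions.

    # Hypothetical rewrite: rank by average salary with a window function and filter,
    # instead of sorting the full grouped result.
    TOP_10_AVG_SALARIES = """
        WITH avg_salaries AS (
            SELECT name,
                   AVG(salary)                             AS avg_salary,
                   RANK() OVER (ORDER BY AVG(salary) DESC) AS salary_rank
            FROM   employees
            GROUP  BY name
        )
        SELECT name, avg_salary
        FROM   avg_salaries
        WHERE  salary_rank <= 10
    """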

What I think it is violating is the single responsibility principle. The data processor class has too many methods and is doing too many things: it is reading the data, processing the data, writing the data, and logging the errors, all within the same class, which violates the single responsibility principle of SOLID. What we can do to avoid this is refactor the code for better maintainability by putting each of these responsibilities into a separate class, so that each class has a single responsibility and the code is easier to maintain.
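A compact sketch of that refactoring, with hypothetical class and method names, splitting one do-everything class into single-responsibility collaborators composed by a thin pipeline.

    # Hypothetical refactor: one reader, one transformer, one writer, composed by a
    # small pipeline class, instead of a single class that does all four jobs.
    import logging

    log = logging.getLogger(__name__)

    class CsvReader:
        def read(self, path):
            with open(path) as f:
                return [line.rstrip("\n").split(",") for line in f]

    class RecordTransformer:
        def transform(self, rows):
            return [[col.strip().lower() for col in row] for row in rows]

    class FileWriter:
        def write(self, rows, path):
            with open(path, "w") as f:
                f.writelines(",".join(row) + "\n" for row in rows)

    class EtlPipeline:
        """Orchestrates the steps; error logging lives here, not in each component."""
        def __init__(self, reader, transformer, writer):
            self.reader, self.transformer, self.writer = reader, transformer, writer

        def run(self, src, dst):
            try:
                self.writer.write(self.transformer.transform(self.reader.read(src)), dst)
            except Exception:
                log.exception("Pipeline failed for %s", src)
                raise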

In terms of an ETL workflow that has to handle schema evolution, what we can do is design the destination table so that it is divided into two types of fields: key fields, which don't change on a regular basis, and fields that can evolve over time. The evolving fields can go into a kind of data map stored in JSON format, where new fields can be added or removed over time while the structure of the table itself stays the same. For the ETL workflow, let's say the data is sourced from an S3 bucket and stored in JSON format: we can identify the key fields from the JSON objects and map them to the key fields of the Snowflake table, while the remaining fields available in S3 are loaded as a separate JSON object. That way, even if fields are added frequently or vary from day to day depending on the incoming data, it won't impact the structure of the table, and it remains easy to write into the data warehouse as well. Snowflake provides support for working with JSON, so we can design it such that whenever we query a specific field, we can first check whether that field is available, since Snowflake provides query support for semi-structured data.
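A short sketch of that table design using Snowflake's semi-structured support, with assumed table and field names: the stable key fields stay as typed columns and everything that can evolve lands in a VARIANT column.

    # Hypothetical Snowflake DDL/queries for a schema-evolution-friendly table:
    # stable key fields as typed columns, evolving fields in a VARIANT "data map".
    CREATE_TABLE = """
        CREATE TABLE events (
            event_id   STRING,
            event_ts   TIMESTAMP_NTZ,
            source     STRING,
            payload    VARIANT          -- evolving fields stored as JSON
        )
    """

    # New fields appear inside payload without altering the table structure.
    QUERY_NEW_FIELD = """
        SELECT event_id,
               payload:device_type::STRING AS device_type   -- NULL if the field is absent
        FROM   events
        WHERE  payload:device_type IS NOT NULL
    """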