
Sr. Consultant - Data Engineer
Visa Inc.Data Engineer II
Amazon MusicData Engineer
Zoom Video CommunicationsSenior Systems Engineer
Infosys LimitedSr. Software Engineer
Visa Inc.Staff Software Engineer
Visa Inc.
Kafka

Airflow
.png)
Apache Spark

AWS
Snowflake

Redshift

Sqoop

Hive

Hadoop

Spring Boot

Airflow

AWS

Redshift

Hive

Sqoop

Microsoft SQL Server
.png)
Jenkins

PostgreSQL

MySQL

Oracle

Apache Kafka

Hive

AWS
I have about 10 years of experience in industry where I have worked on multiple data engineering projects. So, I have a master's degree in computer science from Arizona State University. So, after my master's degree, I joined Visa as a senior software engineer. I worked with Visa for about 5 years on the Visa Commercial Services team. So, I have experience working on multiple applications within Visa, where I primarily worked as a data engineer. So, I started my initial project by maintaining an ETL pipeline that was written in the Microsoft SQL tech stack, and then I had an opportunity to work on some big data projects using Spark. So, I also worked on developing some of the back-end services using Java. So, after working at Visa for about 5 years, I moved to Zoom. With Zoom, I was primarily part of the data science team. I worked with data scientists to develop data pipelines to extract data from different sources, transform it, and load it into the final data variables. The tech stack I worked on included the AWS tech stack, I used S3, I used Redshift, and I used Airflow as an orchestration tool. So, after working at Zoom for about a year, I moved to Amazon. In Amazon, I worked with the Amazon music team. So, I worked primarily on pulling streaming data or events-based data, and the tech stack included the AWS tech stack to build data pipelines. I used PySpark, which was running on an EMR instance. Airflow was the primary orchestration tool here. The programming language was Python, and the data variables I used here were Redshift, and I also used a little bit of RDS as well. And after working at Amazon for close to a year, I again had an opportunity to join Visa. In Visa, I joined back the same team, which was the data services team. The project that I'm currently working on is like this: I'm leading a team to rewrite some of the batch jobs into a streaming job. The tech stack includes Spark streaming, Apache Kafka as the streaming tool where we write the data into topics, which are consumed by downstream consumers. The tech stack here includes Spark streaming and Hive as a primary data source, and I also have an opportunity to work on microservices using Java Spring Boot. This is a consumer application, which reads from the same Kafka topic, which our streaming job writes into, and it just prepares the data and then pushes it to a third-party consumer. So, this is about my overall experience.
In terms of detecting data skews and snowflake data warps, primarily what I would be doing is the snowflake would be running on top of any cloud storage, which would be streamed in case of AWS or GCS in case of GCP cloud platform. So, the data would be partitioned on the SOFistry bucket, and it can be based on a location or a region or a marketplace, or it can be based on the event timestamp. I would be checking for every partition if it is partitioned by region, and I would be checking the volume of data in each of these regions. Are there any chances where a specific region has a larger volume of data compared to another region, for example, the US region has a larger volume of data compared to the Asia-Pacific region? I would be dividing the buckets and the partitions into sub-partitions, which can be in terms of event process time, so that processing it based on event process time can be distributed uniformly. Alternatively, I could sub-partition into a region or a state or any other sub-partition to handle the data appropriately. Another thing I would do is when I'm dumping the data into the final data warehouse, I would make sure that the data is uniformly distributed, possibly through an alternate key, for example, if the region partition has a huge volume of data, I would create another key on top of it by applying some techniques like sorting to distribute the data uniformly.
Some of the complex, some of the secret techniques that can be used in data manipulation include making use of Windows functions or scans to calculate running averages on top of incoming data. So, you can do real-time running averages. If you're doing aggregates on top of incoming streaming data, you can apply a Windows function to do a running average or running application on top of the data. That's one technique I can think of. And, the data manipulation techniques I can think of is making use of joins. If there's incoming data that we have to screen with some of the lookup tables that exist within the Postgres table, we can have the data joined with some of the smaller metadata tables to do an investigation.
Okay, so in terms of error logging mechanism, so what we can do is we can have an error logging mechanism at every stage of the ETA pipeline, starting from the source, the transformation layer, and as well as at the loading layer. At the source, I would have logs like the partitions that we are reading from, the source location that we are reading from, and at the completion of the reading, I would log messages saying that the message has been read successfully or if it runs into exceptions, I would log the exception messages accordingly. Likewise, the same with the transformation layer, where I would be getting a lot of details about the data, the data model that we are reading it from, and what kind of transformation that we are applying, such as multiple layers of transformation like data cleansing, data aggregation, and all those things. Each of these steps, I would be logging the appropriate logs. The same goes in terms of data dumping into the data warehouse as well, with the number of records that are getting inserted into the final data warehouse and what kind of data table it is writing into. These are some of the logs that I can think of. In terms of retry mechanism at every layer, I would be using an orchestration tool like Airflow, which has an inbuilt retry mechanism. We can configure the retry mechanism of every component to retry in x number of times, and if at all x number of attempts has been failed, I would log the appropriate message about the failure. We could also have alerts that can be enabled if the retry of the particular layer has run into issues and even after retry, it is failing at every layer, we could send out an alert to the respective team, which can be the dev team or support team, so that they could take a look at the logs and they could take appropriate actions.
So some of the best practices in Python for reducing memory usage and processing large datasets. One option that I can think of was using yield instead of a return. So, what yield does is yield this kind of a generator or just a function called a generator within Python. Doing a return accumulates the results in memory. Once all the results are available, it returns the entire object, which can be a list or a tuple or whatever data structure that we are working on. However, when it comes to yield, rather than storing the entire data of your object in memory, it keeps passing the data to the caller record by record basis. So thereby, even if we are processing a large volume of data, it doesn't store them in memory and doesn't run into any memory issues in Python. Another best practice is to read the data in batches rather than reading an entire dataset and putting it into memory. Just divide the data into patches with a batch size so that you don't.
So, in terms of automating the data quality checks post ETL and Snowflake, what I would think of is having a batch of, which is scheduled using Airflow. So, after a data is loaded, or after a schedule of ETL is complete, so I would have a batch of triggered in post ETL which would do some kind of data quality checks. Some of the data quality checks I would do is checking for any duplicates that have duplicate data and checking any null values in the final data source. And also in terms of data validation, if there are any bad data that is passed into the final data warehouse, it is something which we can check. And also in terms of the business values, if there is any value which doesn't fit into the business logic, for example, if we are considering about age-specific domain of a school database, having a child's age less than 5 years doesn't make sense. So, we can have these checks enabled as a part of the data check job that we are running. And there is something which we can schedule using Airflow. After the final ETL is completed, we can have a post ETL job which can be triggered to do this kind of data quality checks and enable triggers accordingly.
So, one of the performance bottlenecks in this query is, you see, there is an order by clause that is being used, okay, order by descending, this one is just to fetch the queries, fetching the top 10 average salaries of employees, so instead of, you see, group by is being used and order by is being used, so a better option that I can think of is using some window function, maybe use a rank function partitioned by name and ordered by salary and once we have the rank of all the employees based on the average salary, we can just filter the query to get the top 10 rank from the employee table. So, yeah, using a window function instead of group by, it would reduce the performance bottleneck on top of the last dataset.
So what I'm saying is that I think it is violating is that, the code is violating the single responsibility principle. The data process class has way too many methods which are doing too many things. It is doing reading of the data, it is doing the processing of data within the same class, it is writing the data and it is blocking the error, which violates the single responsibility principle of the SOLID. So, what we can do to avoid this is, we could refer to the code for better meta-analytic practices, like put each of these methods in a separate class. So that each of these classes have a separate responsibility. And that would be easier to maintain the code. And also, like,
In terms of ETL workflow, that can have required schema evolution. So, what we can do is have this if we have these submissions, mostly tables designed in such a way that we can have a table divided into two types of fields. One is some of the key fields, which don't change on a regular basis. And the other type of field, which can evolve over time. So, that is something which can be put into some sort of data map. A data map would store data in adjacent format where any new fields can be added or removed over time. So, and also, having that structure of the table will remain the same. In terms of having a detailed workflow for it, what we can do is, so we can have if data is sourced from S3, we can have the data. Let's say there are three data in the S3 stored in a JSON format, so we can identify the key fields from this particular JSON object. We can have that mapped to the key fields in this notebook table. And in terms of the other fields, that are available in this S3, it can be opened as an adjacent object, so that even if the fields that are added on a frequent basis or fields that are depending on the data that reduces on a day-to-day basis, the structure won't impact the structure of the table. And it can be easy to write into the data model project as well. And Smoke Lake provides support to work with adjacent. And whenever we are doing, we can design it in such a way that whenever we're quitting any specific fields, we can identify whether that field is available or not, and it provides query support for that.