
Diligent engineer with 12+ years of experience spanning data science and engineering; development of software frameworks, platforms, and applications; and customer interaction with multilingual and multicultural clients. An effective team player, well versed in a variety of platforms, programming languages, and databases, with extensive experience in all phases of software development under both waterfall and agile project life cycles.
Consultant | Data Science Engineer
Consultant | Data Science Engineer, Sinergia Media Labs
IT Consultant, AI Rawahy Technical Services
Software Engineer, Huawei
Airflow
PyCharm
Jupyter Notebook
Eclipse
Visual Studio
GitHub
SVN
I have 11 years of experience working in multiple domains, including ecommerce, media and entertainment, pharmaceutical, and telecom. Currently, I'm working in ecommerce; before that, I worked in media and entertainment for 2.5 years and in pharmaceutical for some time. Prior to that, I worked in telecom, creating security solutions for server applications. I enjoy solving technical problems and am a quick learner of new tools and technical documentation. I'm skilled at creating proof-of-concepts and exploring the feasibility of new tools in our current environment. I'm also passionate about mentoring juniors and creating a seamless work atmosphere. As a team player, I've been working remotely for 3 years and find it to be a positive experience. Before going remote, I worked in an office, where I appreciated the opportunity to connect with colleagues, share knowledge, and receive assistance when needed.
Implementing a data quality framework using PySpark to ensure the integrity of data in the ETL process is very much required for the downstream processes. Mainly, we use PySpark for big data processing. To ensure quality, we have to make sure all the mandatory fields that are used downstream are populated properly, with null checks and so on. We can enable these null checks and more in PySpark for the incoming columns. Also, we can enforce a schema while reading the data, which ensures that the data in each column has the type we expect. Additionally, if we need some mandatory checks to be conducted, those can also be accommodated in the schema, as in the sketch below.
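A minimal sketch of that idea, assuming hypothetical paths and column names (order_id, amount, event_date): the explicit schema enforces data types on read, and a simple aggregation flags mandatory columns that arrive with nulls.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, DateType

spark = SparkSession.builder.appName("dq-checks").getOrCreate()

# Explicit schema so each incoming column is parsed with the expected type.
schema = StructType([
    StructField("order_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_date", DateType()),
])

# FAILFAST makes the read error out on rows that do not match the schema.
df = spark.read.csv("s3://bucket/incoming/orders/", header=True,
                    schema=schema, mode="FAILFAST")

# Null checks on the columns the downstream process depends on.
mandatory = ["order_id", "amount"]
null_counts = df.select(
    [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in mandatory]
).collect()[0].asDict()

bad = {c: n for c, n in null_counts.items() if n}
if bad:
    raise ValueError(f"Mandatory columns contain nulls: {bad}")
```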
How do you perform deduplication on a dataset in Snowflake that has been ingested from an ETL pipeline incorrectly multiple times? This was actually something we handled in my workplace. We performed incremental loads into Snowflake, and the corresponding tables have timestamp columns that help identify the duplicate data. We performed a count analysis of how much data was ingested and when it was ingested into the table, since a load would be expected to complete at a specific time. If there were any discrepancies, we would detect them and delete the duplicated data from the Snowflake table. One way to express that dedup step is sketched below.
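A hedged sketch of one way to do the dedup itself, assuming hypothetical names (an orders table with an order_id business key and a load_ts load timestamp) and the snowflake-connector-python package; it keeps the most recently loaded copy of each row and swaps the deduplicated table back in.

```python
import snowflake.connector

# Connection parameters are placeholders.
conn = snowflake.connector.connect(
    account="my_account", user="etl_user", password="***",
    warehouse="ETL_WH", database="ANALYTICS", schema="PUBLIC",
)

dedup_sql = """
CREATE OR REPLACE TABLE orders_dedup AS
SELECT *
FROM orders
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY order_id      -- business key that identifies a logical row
    ORDER BY load_ts DESC      -- keep the most recently loaded copy
) = 1
"""

cur = conn.cursor()
try:
    cur.execute(dedup_sql)
    # Atomically replace the original table with the deduplicated copy.
    cur.execute("ALTER TABLE orders_dedup SWAP WITH orders")
finally:
    cur.close()
    conn.close()
```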
If I were to design an ideal pipeline that handles time series data, which design patterns would I implement and why? To be frank, I haven't handled any time series data until now. However, one thing that comes to mind is that we would need specific checks to ensure that when the data is populated, the scheduler runs accordingly. I'm not sure if I'm hitting the exact point of the question, but this is what comes to mind when I read it. There should be mechanisms to check that the data is loaded at specific time points, and the scheduler should run at the appropriate times. The data should be loaded into the chosen lake tool, whether it's Databricks, Delta Lake, or Snowflake, and it should carry a timestamp so we can confirm the data is populated at the right timings. In terms of design patterns, this is the pattern I would suggest for handling time series data. If there are any discrepancies, such as a failed load at a particular point in time, alerts should be posted to the appropriate channels so stakeholders are notified that the data failed to load. A rough orchestration sketch follows this answer.
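A rough Airflow sketch of that pattern, assuming hypothetical task names and a placeholder partition_exists() helper: the DAG runs on the cadence the data is expected to arrive, checks that the expected slice actually landed, and fires an alert callback when a load fails.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def partition_exists(partition):
    # Hypothetical helper: query the lake/warehouse metadata for the slice.
    return True


def check_data_arrived(**context):
    # Verify that the expected time slice was actually loaded.
    expected = context["ds"]
    if not partition_exists(expected):
        raise ValueError(f"No data loaded for {expected}")


def notify_failure(context):
    # Placeholder alert: in practice, post to Slack/email for stakeholders.
    print(f"Time series load failed for run {context['ds']}")


with DAG(
    dag_id="timeseries_hourly_load",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",   # scheduler aligned with data arrival
    catchup=False,
    default_args={
        "retries": 2,
        "retry_delay": timedelta(minutes=5),
        "on_failure_callback": notify_failure,
    },
) as dag:
    PythonOperator(task_id="check_data_arrived",
                   python_callable=check_data_arrived)
```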
How do you detect and handle skewness in a large dataset when performing data transformation using PySpark? Skewness, in terms of large datasets, means the data is not evenly distributed across partitions. I suppose that when the data is received and stored, how it is partitioned while saving might cause the skew. If it is not partitioned properly, a few partitions might hold a much larger amount of data than the others, which causes high latency for the tasks handling those larger partitions. If we encounter such issues while reading the data, where a few partitions are much bigger than the rest, then we should pick the partitioning columns appropriately. In Databricks, with Delta, we can choose the columns (around 30 or so) for which statistics are tracked, and the data can be ordered based on the values of those columns. When the data is fetched, that ordering helps a lot: for example, if we are filtering on a value, the engine can use the metadata to go directly to the files that contain it instead of scanning everything. That is how I might handle data skewness with my current knowledge; a rough sketch is below.
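A minimal PySpark sketch of how the skew could be detected and mitigated, assuming a hypothetical events dataset with a customer_id key: row counts per partition and per key expose the imbalance, and salting spreads the hot keys across more partitions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("skew-check").getOrCreate()

# Hypothetical input path and key column.
df = spark.read.parquet("s3://bucket/events/")

# Detect skew: row counts per Spark partition and per key value.
rows_per_partition = df.rdd.mapPartitions(lambda it: [sum(1 for _ in it)]).collect()
print("min/max rows per partition:", min(rows_per_partition), max(rows_per_partition))

key_counts = df.groupBy("customer_id").count().orderBy(F.desc("count"))
key_counts.show(10)  # a few heavily repeated keys indicate skew

# Mitigate skew: add a random salt so rows for hot keys spread out.
num_salts = 16
salted = df.withColumn("salt", (F.rand() * num_salts).cast("int"))
repartitioned = salted.repartition("customer_id", "salt")
```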
When optimizing a SQL query for reporting purposes in Snowflake: Snowflake stores data in micro-partitions. Basically, whatever filter conditions we have should be applied first in the inner query rather than in the outer query. So, when we have subqueries, we have to apply the filter conditions as early as we can. Because if you put the filter conditions only on the outermost query, what happens is that the query takes up more resources, all that data gets loaded into memory for processing, and the warehouse gets throttled. The best practice is to push as many filter conditions as possible into the subqueries, so that only the computation happens in the outer query. In this way, we can ensure that only the relevant data is actually fetched for the computation; a small before/after example follows.
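A small illustration of that filter pushdown, using a hypothetical sales table and placeholder column names; the second query lets Snowflake prune micro-partitions before aggregating.

```python
# Less efficient: the inner query scans and aggregates the whole table,
# and the filters are only applied to the already-computed result.
query_filtered_late = """
SELECT *
FROM (
    SELECT region, order_date, SUM(amount) AS revenue
    FROM sales
    GROUP BY region, order_date
)
WHERE order_date >= '2024-01-01' AND region = 'EMEA'
"""

# Better: filters pushed into the inner query, so only the relevant
# micro-partitions are read and aggregated.
query_filtered_early = """
SELECT region, order_date, SUM(amount) AS revenue
FROM sales
WHERE order_date >= '2024-01-01' AND region = 'EMEA'
GROUP BY region, order_date
"""
```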
A PySpark job that has to join 2 large datasets. Okay, I'm not sure if my approach is correct, but what can be done is to perform the join and then write the result as Parquet, and that Parquet can be copied into the corresponding Snowflake table; in that way, the processing will be faster. For the join of the two large datasets, the filter conditions should obviously be applied to the appropriate datasets first, and then the join should happen on those filtered datasets, on the partition columns, basically. All the filters should happen on the partition columns so that the data is fetched faster, because in the case of Databricks Delta, if you are fetching data based on the partition columns, the statistics of the data are already available, so the job doesn't need to scan the data in the other partitions; it can go directly to the corresponding partition, fetch the data, and apply the computational logic. The sketch below shows the idea.
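A minimal sketch of that flow under assumed names (two Delta tables partitioned by event_date, joined on order_id): filter on the partition column before joining, then write Parquet that can later be loaded into Snowflake.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("large-join").getOrCreate()

# Hypothetical tables, both partitioned by event_date.
orders = spark.read.format("delta").load("s3://bucket/orders/")
payments = spark.read.format("delta").load("s3://bucket/payments/")

# Filter on the partition column *before* joining so only the relevant
# partitions are scanned (partition pruning / data skipping).
recent_orders = orders.filter(F.col("event_date") >= "2024-01-01")
recent_payments = payments.filter(F.col("event_date") >= "2024-01-01")

joined = recent_orders.join(recent_payments, on=["order_id", "event_date"])

# Write the result as Parquet; those files can then be loaded into the
# corresponding Snowflake table (for example with a COPY INTO statement).
joined.write.mode("overwrite").parquet("s3://bucket/exports/orders_payments/")
```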
AWS Managed Airflow (MWAA) can act as a full orchestration service, picking data up from multiple data sources and storing it to multiple endpoints; we can leverage it end to end, which is what I am currently doing in my project. Lambda functions can also fit into the architecture. Lambda functions are basically used when we have to do some computation or transformation on the data rather than just loading it into a table. So we have some source data, we do some processing on top of it, and we have to store the result in another location; that is when Lambda functions come into the picture. They can be invoked from an API call or by a trigger: when data becomes available in a particular location, the function is triggered so that the data gets processed appropriately and stored in the right place. Lambda functions can also spin up EC2 machines or any other AWS services needed for the processing. Once the result is available in the target location, the downstream systems can use it. A minimal trigger-based example is sketched below.
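A minimal sketch of such a triggered Lambda, assuming a hypothetical S3 put-event trigger, JSON input files, and placeholder bucket names: it reads the new object, applies a small transformation, and writes the result where downstream systems pick it up.

```python
import json

import boto3

s3 = boto3.client("s3")


def handler(event, context):
    """Triggered when a file lands in the raw bucket (S3 event notification)."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        rows = json.loads(body)

        # Placeholder transformation: keep only the fields downstream needs.
        cleaned = [{"id": r["id"], "amount": r["amount"]} for r in rows]

        s3.put_object(
            Bucket="processed-data-bucket",     # placeholder target bucket
            Key=f"cleaned/{key}",
            Body=json.dumps(cleaned).encode("utf-8"),
        )
```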
What strategy would you use to migrate Python ETL scripts running on legacy systems to utilize PySpark for enhanced parallel processing capabilities? Okay, this was one of the things we did on the project for the media and entertainment team. We had Python scripts running in a SnapLogic tool, which was scheduling them and triggering EMR jobs in the background. What we did was bring Databricks into place and leverage its parallel processing capability. We coded in PySpark and used Spark SQL. It's a fairly seamless transition: even though the scripts are written in Python, internally everything becomes a Spark DataFrame, which does all the processing in parallel in the background. We could then save the data to a lakehouse or in Parquet format, as needed. This can be achieved using Databricks, and we've already done that. A small before/after illustration is below.
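A small before/after illustration of the kind of change involved, under assumed file paths and column names; the pandas step runs on one machine, while the PySpark equivalent distributes the same logic across the cluster.

```python
# Legacy-style step (single machine, pandas):
import pandas as pd

pdf = pd.read_csv("/data/transactions.csv")
daily_pd = pdf[pdf["status"] == "complete"].groupby("day")["amount"].sum()

# Migrated step (distributed, PySpark):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("migrated-etl").getOrCreate()
sdf = spark.read.csv("s3://bucket/transactions/", header=True, inferSchema=True)
daily = (
    sdf.filter(F.col("status") == "complete")
       .groupBy("day")
       .agg(F.sum("amount").alias("amount"))
)
daily.write.mode("overwrite").parquet("s3://bucket/curated/daily_amounts/")
```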
What key metrics would you use to measure and improve the performance of an ETL pipeline that frequently handles JSON and CSV data files? From a performance perspective, multiple things come into consideration: how fast the pipeline is able to process the data, and what its failure rate is. When JSON files come in and we encounter some new element, does it fail? Or in CSV, if we encounter a new column, does it fail? We have generalized this. The main thing is to have a schema in place with the appropriate data types specified, so that we only process the data we expect, with the expected types, from the source. This helps improve both the quality and the performance of the ETL pipeline; a small schema-on-read example is sketched below.
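A small schema-on-read sketch under assumed paths and columns: the explicit schema keeps unexpected elements from breaking the load, and counting rows captured in _corrupt_record gives a simple failure-rate metric (the DataFrame is cached because Spark requires it before querying that column).

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("etl-metrics").getOrCreate()

# Hypothetical expected schema; _corrupt_record captures rows that do not fit.
schema = StructType([
    StructField("id", StringType()),
    StructField("amount", DoubleType()),
    StructField("_corrupt_record", StringType()),
])

events = spark.read.json(
    "s3://bucket/landing/events/",
    schema=schema,
    mode="PERMISSIVE",
    columnNameOfCorruptRecord="_corrupt_record",
).cache()

# Simple metrics: rows processed vs. rows that failed schema validation.
total = events.count()
malformed = events.filter(F.col("_corrupt_record").isNotNull()).count()
print(f"processed={total}, malformed={malformed}, "
      f"failure_rate={malformed / max(total, 1):.2%}")
```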
Can you propose a method for real-time data processing using AWS Lambda and Kinesis for a data-driven application? Yeah, as I said before, AWS Lambda is something I have not used hands-on myself. But the project I worked on had a data science aspect to it: we had all the models ready and deployed, and a Lambda function would run on a scheduled basis, taking in data, generating results, and storing them into the target table. That target table would then be used for predictions, such as the best products a customer might buy. As for Kinesis, it was used for iTrouble: iTrouble sends data in batches, and Kinesis was configured to receive that data for processing, with the processed data then stored in our S3 location. A rough sketch of a Lambda consuming Kinesis records is below.
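A rough sketch of a Lambda consuming a Kinesis stream, under assumed names (a placeholder landing bucket and JSON payloads): each record is decoded, lightly transformed, and the batch is persisted to S3 for downstream use.

```python
import base64
import json

import boto3

s3 = boto3.client("s3")


def handler(event, context):
    """Invoked by a Kinesis event source mapping with a batch of records."""
    processed = []
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        processed.append(json.loads(payload))

    if processed:
        s3.put_object(
            Bucket="realtime-landing-bucket",   # placeholder bucket name
            Key=f"kinesis/{context.aws_request_id}.json",
            Body=json.dumps(processed).encode("utf-8"),
        )
```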