Diligent engineer with 12+ years of experience spanning data science and engineering, development of software frameworks, platforms, and applications, and customer interaction with multilingual and multicultural clients. An effective team player, well versed in multiple platforms, programming languages, and databases, with extensive experience in all phases of software development under both waterfall and agile project life cycles.
Consultant | Data Science Engineer
Sinergia Media Labs

IT Consultant
AI Rawahy Technical Services

Software Engineer
Huawei

Airflow
PyCharm
Jupyter Notebook
Eclipse
Visual Studio
GitHub
SVN
Could you help me understand more about your background by giving a brief introduction about yourself? Sure. I'm currently working as a data engineer, and I work remotely. I have a total of 11 years of experience spanning multiple domains. The current domain is ecommerce; before this I worked in media and entertainment for close to 2.5 years, and prior to that with a pharmaceutical client, all on the data engineering and data science side. Earlier I worked in the telecom domain, building security solutions for their server applications. I love solving problems on the technical side, I pick up new tools quickly, and I can work through technical documentation and get things working. I enjoy doing POCs, exploring new tools and their feasibility in our current environment. I am also passionate about guiding juniors and keeping a seamless work atmosphere, and I am a strong team player. I have been working remotely for the past 3 years and I am very happy with it; compared with my earlier time in the office, it has been no different. I make a point of connecting with colleagues and building rapport just as I did in the office, which adds a lot of value through knowledge sharing and getting assistance when running into issues. Thank you.
How do you implement a data quality framework using PySpark to ensure the integrity of ETL process data? I would start by saying that data quality is essential for the downstream processes, and we mainly use PySpark to process big data. To ensure quality, all the mandatory fields used downstream must be validated, for example with null checks. Using PySpark we can implement these null checks on the incoming columns, and we can also enforce a schema while reading the data, which ensures that the data type of each column is appropriate. Any additional mandatory checks can likewise be accommodated through the schema.
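A minimal PySpark sketch of the checks described above, assuming a hypothetical orders feed with id, amount, and event_ts as the mandatory columns (paths and names are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("data-quality-checks").getOrCreate()

# Enforce an explicit schema while reading, instead of relying on inference.
schema = StructType([
    StructField("id", StringType(), nullable=False),
    StructField("amount", DoubleType(), nullable=True),
    StructField("event_ts", TimestampType(), nullable=True),
])

df = spark.read.schema(schema).json("s3://my-bucket/incoming/orders/")  # hypothetical path

# Null checks on the mandatory columns used by downstream processes.
mandatory = ["id", "amount", "event_ts"]
bad = df.filter(" OR ".join(f"{c} IS NULL" for c in mandatory))
good = df.dropna(subset=mandatory)

# Quarantine records that fail the mandatory checks instead of passing them downstream.
if bad.count() > 0:
    bad.write.mode("append").parquet("s3://my-bucket/quarantine/orders/")  # hypothetical path
```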
How do you perform deduplication on a dataset in Snowflake that has been ingested from an ETL pipeline incorrectly multiple times? This is something we handled several times at my workplace. Almost all of our incremental loads into Snowflake go into tables that carry a timestamp column, which is what we use to identify duplicate data. We would perform a count analysis on how much data was ingested and when it was ingested into the table, since each load is expected to complete at a specific time. If there were any discrepancies, we would detect them and delete the duplicated data from the Snowflake table.
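One way to express that cleanup, assuming a snowflake-connector-python connection and a hypothetical orders table with order_id as the business key and load_ts as the load timestamp; the swap-based pattern below is a sketch, not the exact process used:

```python
import snowflake.connector  # assumes the snowflake-connector-python package

# Connection parameters are placeholders.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="my_wh", database="my_db", schema="my_schema",
)
cur = conn.cursor()

# Build a deduplicated copy, keeping only the latest row per business key
# based on the load timestamp (table and column names are hypothetical).
cur.execute("""
CREATE OR REPLACE TABLE orders_dedup AS
SELECT *
FROM orders
QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY load_ts DESC) = 1
""")

# Swap the deduplicated copy in place of the original table, then drop the leftover.
cur.execute("ALTER TABLE orders SWAP WITH orders_dedup")
cur.execute("DROP TABLE orders_dedup")
```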
If you were to design an ideal pipeline that handles time series data, which design patterns would you implement and why? To be frank, I have not handled time series data as such until now, and I am not sure I am hitting the exact point of the question, but this is what comes to mind. We would need specific checks on when the data is expected to be populated, so there should be handles to verify that the data is loaded at those specific time points, and the scheduler should run at the appropriate times. The data should be loaded to the lake, whichever lake tool we have, whether Databricks Delta Lake or Snowflake, and each load needs a timestamp or count that confirms the data was populated at the right time. That is the one pattern I would suggest, and if a load fails at a particular point in time, alerts should be posted to the appropriate channels so the stakeholders know the data has failed to load.
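A rough sketch of such a scheduled freshness check and failure alert as an Airflow DAG (the DAG name, threshold, and freshness helper are all assumptions):

```python
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def notify_failure(context):
    # Placeholder alert: in practice this would post to the stakeholders'
    # Slack channel or send an email.
    print(f"Time series load check failed for run {context['ds']}")

def get_latest_load_ts():
    # Assumed helper: a real implementation would query the lake table
    # (Delta Lake, Snowflake, etc.) for the max load timestamp.
    return datetime.utcnow() - timedelta(minutes=10)

def check_load_freshness(**context):
    # Fail the task (and trigger the alert) if data has not landed on time.
    if datetime.utcnow() - get_latest_load_ts() > timedelta(hours=1):
        raise ValueError("Time series data did not land within the expected window")

with DAG(
    dag_id="timeseries_load_check",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",  # run at the expected load cadence
    catchup=False,
    default_args={"on_failure_callback": notify_failure},
) as dag:
    PythonOperator(task_id="check_freshness", python_callable=check_load_freshness)
```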
How do you detect and handle skewness in a large dataset when performing data transformations using PySpark? I am not that familiar with the term skewness in data engineering, but in data science we say data is skewed when it is not distributed appropriately. My understanding is that the way data is partitioned when it is received and saved is what can cause skew. If it is not partitioned properly, a few partitions end up holding a much larger amount of data than the others, and reading those larger partitions later has higher latency. If we run into that, where a few partitions are much larger than the rest, we should repartition the data appropriately. In Databricks there is a feature for this whose name I could not recall at the time, I believe it is Z-ordering: statistics are tracked for a set of columns (roughly the first 30 or so by default), and the data is ordered based on the values of those columns. When we filter on one of those values, the engine uses this metadata to go directly to the right location and fetch the data instead of scanning everything. With my current knowledge, that is how I would handle data skewness.
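Beyond the partitioning and Z-ordering angle above, one common PySpark-side remedy for join skew is key salting, which is not mentioned in the answer but widely used. A minimal sketch with hypothetical datasets and paths:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("skew-handling").getOrCreate()

# Hypothetical datasets: a large fact table and a dimension table joined on "customer_id".
facts = spark.read.parquet("s3://my-bucket/facts/")  # assumed path
dims = spark.read.parquet("s3://my-bucket/dims/")    # assumed path

# Detect skew: inspect the row count per join key to find hot keys.
facts.groupBy("customer_id").count().orderBy(F.desc("count")).show(10)

# Handle skew with key salting: spread rows for each key across N buckets on the
# fact side and replicate the dimension rows across the same buckets.
N = 16
salted_facts = facts.withColumn("salt", (F.rand() * N).cast("int"))
salted_dims = dims.crossJoin(spark.range(N).withColumnRenamed("id", "salt"))

joined = salted_facts.join(salted_dims, on=["customer_id", "salt"]).drop("salt")
```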
When optimizing SQL queries for reporting purposes in Snowflake, what best practices would you follow? Snowflake stores data in micro-partitions. For the queries we write, the filter conditions should be applied as early as possible, in the inner queries rather than only in the outermost query. If you do the filtering only in the outermost query, it takes up more resources: all the data gets loaded into memory for processing and the warehouse can get throttled. So the best practice is to push the maximum filter conditions into the subqueries and keep only the computation in the outer query. In this way we can ensure that only the relevant data is actually fetched for the computation.
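A small illustration of the filter-pushdown habit described above, against a hypothetical sales table. Snowflake's optimizer can often push simple predicates down on its own, so treat this as a sketch of the practice rather than a guaranteed speedup:

```python
# Less efficient pattern: the filters sit only on the outer query, so the
# subquery may aggregate far more micro-partitions than necessary.
slow_sql = """
SELECT *
FROM (
    SELECT region, order_date, SUM(amount) AS total
    FROM sales
    GROUP BY region, order_date
)
WHERE order_date >= '2024-01-01' AND region = 'EMEA'
"""

# Preferred pattern: predicates inside the inner query so micro-partitions are
# pruned before aggregation, leaving only the computation at the outer level.
fast_sql = """
SELECT region, order_date, SUM(amount) AS total
FROM sales
WHERE order_date >= '2024-01-01' AND region = 'EMEA'
GROUP BY region, order_date
"""
```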
How would you optimize a PySpark job that has to join 2 large datasets and write the output to a Snowflake table? I am not sure if my approach is correct, but what can be done is to perform the join, write the result as Parquet, and then copy that Parquet into the corresponding Snowflake table; that way the load is faster. For the join of the two large datasets, the filter conditions should be applied on the appropriate subsets first, and the join should happen on the partition columns. All the filters should be on the partition columns so the data is fetched faster, because in the case of Databricks Delta, if you fetch data based on the partition columns, the statistics of the data are already available, so it does not need to scan the entire data of the other partitions; it can go directly to the corresponding partition, fetch the data, and apply the computational logic.
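A sketch of the filter-then-join-then-COPY approach described above, assuming hypothetical datasets partitioned by event_date and an external S3 stage already defined in Snowflake:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("large-join-to-snowflake").getOrCreate()

# Hypothetical inputs, both partitioned by "event_date".
orders = spark.read.parquet("s3://my-bucket/orders/")
payments = spark.read.parquet("s3://my-bucket/payments/")

# Filter on the partition column first so only the relevant partitions are scanned.
orders_f = orders.filter(F.col("event_date") >= "2024-01-01")
payments_f = payments.filter(F.col("event_date") >= "2024-01-01")

# Join the filtered inputs, then stage the result as Parquet.
result = orders_f.join(payments_f, on=["order_id", "event_date"])
result.write.mode("overwrite").parquet("s3://my-bucket/stage/order_payments/")

# Load from the external stage into Snowflake; stage and table names are assumptions.
copy_sql = """
COPY INTO order_payments
FROM @my_s3_stage/order_payments/
FILE_FORMAT = (TYPE = PARQUET)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
"""
# Executed with, e.g., snowflake-connector-python: conn.cursor().execute(copy_sql)
```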
What AWS services would you leverage for constructing a serverless data processing pipeline? And how would Lambda functions fit in the architecture? One option is AWS managed Airflow, which is a fully managed service that can pick up data from multiple data sources and store it to multiple endpoints, so we can leverage it end to end; I am currently using it in my project. As for how Lambda functions fit into the architecture: Lambda functions come into the picture when we have to do some computation or transformation on the data, independent of loading data into a table. We have some source data, we do some processing on it, and we have to store the result in another location. The function can be called from an API or be a triggered Lambda, for example triggered when data becomes available in a particular location, so that the data gets processed and stored in the appropriate place. I have not personally used Lambda functions, but I suppose they can very well spin up EC2 machines or call whichever AWS services they need for processing, and once the result is available in the other location, the downstream systems can use it.
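A minimal example of the triggered-Lambda idea, assuming an S3 event source and hypothetical incoming/ and processed/ prefixes:

```python
import json
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    """Hypothetical S3-triggered Lambda: read the new object, transform it,
    and write the result to a processed/ prefix for downstream systems."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        rows = json.loads(body)

        # Placeholder transformation; real processing logic would go here.
        transformed = [{**row, "processed": True} for row in rows]

        s3.put_object(
            Bucket=bucket,
            Key=key.replace("incoming/", "processed/"),
            Body=json.dumps(transformed).encode("utf-8"),
        )
```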
What strategy would you use to migrate Python ETL scripts running on legacy systems to utilize PySpark for enhanced parallel processing capabilities? Yes, this is something we did for the project with our media and entertainment client. We had Python scripts running in a SnapLogic tool, which handled the scheduling and at the back end triggered EMR jobs. What we did was bring Databricks into place, leverage its parallel big data processing capability, and recode the logic in PySpark, using Spark SQL as well. The transition is fairly seamless: even if it starts as a Python script, internally it becomes a Spark DataFrame, and all the processing happens in the background in parallel. The result can then be loaded into the lakehouse, saved in Parquet format, or whatever form you want. This can very well be achieved using Databricks, and we have done it already.
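A small before/after sketch of such a migration, with a hypothetical single-node pandas step rewritten as PySpark (paths are placeholders):

```python
# Hypothetical "before": a legacy single-node pandas ETL step.
# import pandas as pd
# df = pd.read_csv("/data/sales.csv")
# daily = df.groupby("order_date")["amount"].sum().reset_index()
# daily.to_csv("/data/daily_sales.csv", index=False)

# "After": the same step in PySpark, so it runs in parallel on Databricks.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("legacy-etl-migration").getOrCreate()

df = spark.read.option("header", True).csv("s3://my-bucket/sales.csv")  # assumed path
daily = df.groupBy("order_date").agg(F.sum("amount").alias("amount"))

# Write to the lakehouse (Delta) or Parquet, whichever downstream needs.
daily.write.format("delta").mode("overwrite").save("s3://my-bucket/lake/daily_sales")
```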
What key metrics would you use to measure and improve the performance of an ETL pipeline that frequently handles JSON and CSV data files? From a performance perspective, multiple things come into consideration: how fast the pipeline is able to process the data, what the failure rate is, and how well we have generalized it, for example whether it fails when JSON files arrive with new elements or a CSV comes in with new columns. The main thing we can have is a schema in place with the appropriate data types, which makes sure we only process data of the expected types from the source. That helps ensure both the quality and the performance of the ETL pipeline.
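A sketch of schema enforcement plus simple throughput and failure-rate metrics in PySpark, using the permissive corrupt-record mode; paths and column names are assumptions:

```python
import time
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("etl-metrics").getOrCreate()

# Explicit schema, plus a column that captures records that do not match it.
schema = StructType([
    StructField("id", StringType()),
    StructField("amount", DoubleType()),
    StructField("_corrupt_record", StringType()),
])

start = time.time()
df = (spark.read
      .schema(schema)
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .json("s3://my-bucket/incoming/*.json"))  # assumed path
df.cache()  # required before filtering on the corrupt-record column alone

# Simple metrics: row throughput and the rate of malformed records.
total = df.count()
bad = df.filter(df["_corrupt_record"].isNotNull()).count()
elapsed = time.time() - start
print(f"rows={total}, malformed={bad}, "
      f"failure_rate={bad / max(total, 1):.2%}, rows_per_sec={total / elapsed:.0f}")
```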
Can you propose a method for real-time data processing using AWS Lambda and Kinesis for a data-driven application? As I said before, AWS Lambda is something I have not used directly, but the project I worked on had a data science aspect to it. What was done there is that we had the models ready and deployed, and the Lambda functions, scheduled on a regular basis, would take in the data, generate the results, and store them into a target table. That target table was subsequently used to show customers the best products they could buy. As for Kinesis, it was used for iTrouble, which sends the data batch-wise; Kinesis was configured so that we received the data through it for processing, and that data would land in our S3 location.
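A minimal sketch of a Kinesis-triggered Lambda that decodes stream records and lands them in S3 for downstream processing (bucket name and key layout are assumptions):

```python
import base64
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "my-landing-bucket"  # assumed bucket

def lambda_handler(event, context):
    """Hypothetical Kinesis-triggered Lambda: decode each record from the stream
    and write the batch to S3 for downstream processing."""
    records = []
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        records.append(json.loads(payload))

    if records:
        key = f"kinesis/batch-{context.aws_request_id}.json"
        s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(records).encode("utf-8"))
    return {"processed": len(records)}
```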