
Anmol Mishra

Vetted Talent
Adaptable Data Engineer and Analyst with a unique blend of technical expertise in data engineering and analytical skills. Proficient in designing and implementing data pipelines, executing ETL processes, and managing databases. Experienced in leveraging statistical analysis and data visualization to uncover valuable insights. Skilled in SQL, Python, data modeling, and BI tools.
  • Role

    Senior Data Engineer & Neo4j Developer

  • Years of Experience

    5.1 years

Skillsets

  • Fivetran
  • Vertica
  • Snowflake
  • Preset
  • PDI
  • Neo4j
  • MicroStrategy
  • Looker
  • Linux
  • SQL - 5 Years
  • dbt
  • Tableau
  • Redshift
  • MySQL
  • Oracle
  • Python - 5 Years

Vetted For

9 Skills

  • Data Engineer (Remote) - AI Screening
  • Result: 52%
  • Skills assessed: Team Collaboration, Data Modeling, ETL, Snowflake, PostgreSQL, Problem Solving Attitude, Python, SQL, Strong Attention to Detail
  • Score: 47/90

Professional Summary

5.1 Years
  • Oct 2021 - Present (4 yr 7 months)

    Senior Data Engineer

    5x Pte Ltd.
  • Jul 2018 - Sep 2021 (3 yr 2 months)

    Software Engineer

    BizAcuity Solutions

Work History

5.1 Years

Senior Data Engineer

5x Pte Ltd.
Oct 2021 - Present (4 yr 7 months)
    Results-driven modern data stack engineer with expertise in designing and implementing client-centric data solutions. Proficient in leveraging tools such as Fivetran, dbt, Looker, and Preset to extract, transform, and visualize data for actionable insights. Skilled in seamless data integration, robust transformation workflows, and developing intuitive dashboards and reports. Shortened the time from the initiation of client onboarding to the delivery of actionable insights to around one week.

Software Engineer

BizAcuity Solutions
Jul 2018 - Sep 2021 (3 yr 2 months)
  • Data Analyst: Developed SQL queries to efficiently fetch loads containing many customer-centric metrics from the client's OLAP data warehouse, later used to predict player behaviour and helping the business identify churning users early.
  • Product Backend Engineer: Designed and developed the backend flow for a BI-service product using Python scripts to handle data validation and enrichment, working with Vertica and MySQL databases.
  • Neo4j Developer: Developed scripts to build a recommender system, consuming daily loads from multiple sources into a Neo4j environment and generating system-level and personal-level recommendations for users.
  • Database Admin and Architect: Configured a multi-node K-safe Vertica data warehouse, designed its schema objects, and configured the corresponding resource pools to meet the data pipeline's requirements.
  • Linux System Admin: Set up and configured Linux servers to install different services; configured the firewall, IP whitelisting, and crontab; and monitored system health and performance.

Achievements

  • Ranked among the top 1% in IIT-JEE and secured a top-100 rank in the GATE examination

Education

  • B.Tech.

    Indian Institute of Technology (BHU) (2018)

AI-interview Questions & Answers

Alright. So I graduated from IIT BHU in 2018, and since then I've been working as a data engineer and data analyst; I have a total experience of 5.5 years now. My proficiency lies in writing SQL queries and Python scripts, and I have good experience with Linux shell scripts as well. I have been working with different reporting tools like Tableau, Sigma, Preset, and recently Looker. I have plenty of experience with different databases too: good experience with Snowflake and BigQuery, and I have also worked with the Vertica warehouse.

In terms of the roles and responsibilities I've taken on, I have worked with clients from different domains, including the retail industry and the online casino and online gaming industry. My primary role has been to understand the client's use cases, leverage the underlying data they generate, whether from their production database or their various marketing tools, and answer the key questions their stakeholders have. For that I have used modern data stack tools like dbt and Fivetran, played an essential role in designing ETL pipelines with those tools, and designed various dashboards for questions coming from the marketing, finance, and inventory teams.

In one of my previous organizations, I also led a team of around 10 to 11 developers and data engineers, where we developed an in-house application that encapsulated our whole data warehousing and analytics service. We ingested data coming in daily from our clients in the online casino and online gaming industries, performed transformations using Linux and Python scripts, and designed dashboards for them using Tableau. That is all about my experience so far. Thank you.

What approach would you use to validate data integrity after an ETL process impacting multiple tables in PostgreSQL? So the exact implementation will depend on the kind of ETL tool we are using, whether it's an in-house pipeline we have built or one of the available ETL tools in the market, but the key methodology is the same. First, have a basic understanding of the ETL process: if we have raw tables with millions of rows and we are doing transformations to get aggregated data into the final layer, then we will obviously check that the expected number of rows arrives in the final layer, which we can validate and cross-check. We can also cross-check by running aggregations on top of our final layer against our staging layer or source tables to make sure the high-level numbers match, and we can compare day-level or month-level aggregates of the top KPIs to ensure nothing is wrong with our ETL processes and we are getting the right numbers.

Apart from this, we should keep an eye on the ETL process itself. The first thing is alerting: we have to have an alert system in place so that we get Slack messages or emails on time whenever ETL processes fail. Once we have that, we go through the ETL logs to check which of our tables is not getting populated, or which job the error is coming from, and then we dig into the individual failed jobs and backtrack the error that caused them. And even without any errors, to make sure data integrity holds, we can have tests in place after the daily ETL jobs have run that check the aggregate-level numbers and make sure there is no anomaly in the number of records coming in daily to the final tables. That is my answer. Thank you.
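A minimal sketch of the row-count and aggregate cross-checks described above, assuming hypothetical table names (`staging.orders` and `final.orders_daily` with amount columns) and a local psycopg2 connection:

```python
import psycopg2

# Hypothetical connection; real credentials would come from config/secrets.
conn = psycopg2.connect("dbname=warehouse")

def scalar(query):
    """Run a query that returns a single value and return that value."""
    with conn.cursor() as cur:
        cur.execute(query)
        return cur.fetchone()[0]

# Row-count check: make sure the final layer received the expected volume.
staging_rows = scalar("SELECT COUNT(*) FROM staging.orders")
final_rows = scalar("SELECT COUNT(*) FROM final.orders_daily")
print(f"staging={staging_rows} rows, final={final_rows} rows")

# Aggregate check: high-level KPIs should match across layers.
staging_total = scalar("SELECT SUM(amount) FROM staging.orders")
final_total = scalar("SELECT SUM(total_amount) FROM final.orders_daily")
if staging_total != final_total:
    # In practice this would go to Slack/email via the alerting system.
    raise ValueError(f"KPI mismatch: staging={staging_total}, final={final_total}")
```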

What advanced SQL techniques can be used in Postgres for complex data manipulation in a real-time streaming ETL pipeline? I do not have that much familiarity with the Postgres relational database, I have primarily worked with Snowflake, BigQuery, and Vertica, but I'll try to answer this. In a real-time streaming pipeline, the first thing to understand is the layout: we will have one table holding the archive data, and a staging layer receiving real-time data every 15 or 30 minutes or so. We have to append the records arriving in each of those micro-batches to the archive table, which holds the historical data. That is the first part to manage: we can have a WHERE clause or a condition in our query that makes sure we only pick up data newer than what is already in the archive.

Once we have done that, the archive layer has all the data we need, and we can apply a level of partitioning to it so the further layers of ETL run smoothly and stay optimized. The partitioning can be done on the staging or archive layer, probably on a created or updated timestamp, whatever field we have, at a daily or weekly level. Once we have that, we can run whatever transformations our use cases need on that final layer.

The crux of my answer is that we have to be very careful that the real-time streaming data arriving every 5, 10, or 15 minutes gets appended to the much larger historical table in a very smooth way, by taking care of WHERE clauses with the right timestamp comparisons.
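A sketch of the timestamp-guarded append described above, assuming hypothetical `staging_events` and `archive_events` tables with an `updated_at` column:

```python
import psycopg2

conn = psycopg2.connect("dbname=warehouse")

# Append only rows newer than what the archive already holds, so the same
# 15/30-minute micro-batch is never inserted twice.
APPEND_SQL = """
INSERT INTO archive_events
SELECT s.*
FROM staging_events AS s
WHERE s.updated_at > (
    SELECT COALESCE(MAX(updated_at), TIMESTAMP '1970-01-01')
    FROM archive_events
);
"""

with conn, conn.cursor() as cur:  # `with conn` commits on success
    cur.execute(APPEND_SQL)
```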

How would you apply Postgres' partitioning features to improve ETL pipeline efficiency for larger datasets? This is an extension of my previous answer. I'm not sure exactly how partitioning is done in Postgres, but I have done partitioning for our BigQuery tables, so I have a general idea of how partitioning works and how it helps improve the efficiency of queries on top of those tables.

The main idea of partitioning is that if you have a very large table, say hundreds of millions of records, you can partition it so the data gets stored in segments at a monthly or weekly granularity. Whether this helps depends on the nature of the queries that run on those tables: if the ETL queries or BI-layer queries on top of them have date ranges in the WHERE clause, then it's very prudent to apply partitioning. Once we have it, queries run much faster because a full table scan is reduced to a scan of just the segment of the table where that date range lies. That is how we can improve efficiency when querying large datasets.

Apart from partitioning by chronology, we can also partition on other dimensions; that again depends on the nature of the queries we'll run on top of the table. Generally I have found it is mostly done on timestamp or date fields. So, yeah, that was my answer.
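For reference, Postgres does support this natively through declarative partitioning. A sketch with a hypothetical `events` table range-partitioned by month on its timestamp column:

```python
import psycopg2

conn = psycopg2.connect("dbname=warehouse")

# Declarative range partitioning on a timestamp column; monthly partitions.
DDL = [
    """
    CREATE TABLE events (
        event_id   bigint,
        created_at timestamp NOT NULL,
        payload    jsonb
    ) PARTITION BY RANGE (created_at);
    """,
    """
    CREATE TABLE events_2024_01 PARTITION OF events
        FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
    """,
    """
    CREATE TABLE events_2024_02 PARTITION OF events
        FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
    """,
]

with conn, conn.cursor() as cur:
    for statement in DDL:
        cur.execute(statement)

# A query with created_at in the WHERE clause now scans only the matching
# partition instead of the whole table, e.g.:
#   SELECT COUNT(*) FROM events
#   WHERE created_at >= '2024-01-01' AND created_at < '2024-02-01';
```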

What method would you use in Python to efficiently transform nested JSON data for insertion into a PostgreSQL database? So there has to be a library in Python that helps us interact with the Postgres database, and where a database has a specific library, we generally use it. For example, for Snowflake we can use the connector library's functionality to load JSON data or CSV files directly and materialize them into a Snowflake table. We would use a library like psycopg2 for Postgres or snowflake-connector-python for Snowflake to interact with the database.
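A minimal sketch of that approach for Postgres, flattening nested JSON with `pandas.json_normalize` and batch-inserting with psycopg2's `execute_values`; the payload and the `transactions` table are hypothetical:

```python
import json
import pandas as pd
import psycopg2
from psycopg2.extras import execute_values

# Hypothetical nested payload, e.g. from an API export.
raw = json.loads("""
[{"id": 1, "user": {"name": "a", "country": "SG"}, "amount": 10.5},
 {"id": 2, "user": {"name": "b", "country": "IN"}, "amount": 7.0}]
""")

# Flatten the nested "user" object into top-level columns:
# user.name -> user_name, user.country -> user_country.
df = pd.json_normalize(raw, sep="_")

conn = psycopg2.connect("dbname=warehouse")
with conn, conn.cursor() as cur:
    # execute_values batches rows into a single INSERT statement, which is
    # far faster than inserting row by row.
    execute_values(
        cur,
        "INSERT INTO transactions (id, user_name, user_country, amount) VALUES %s",
        df[["id", "user_name", "user_country", "amount"]].itertuples(index=False),
    )
```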

We're going to use a Python script to interact with the Postgres database. We'll use the library to package one of our SQL queries, something like a `SELECT table_name, ddl FROM ...` against the information schema equivalent in Postgres; we have an information schema in Snowflake, and I think it's called something slightly different in Postgres. We can use that to get the DDL of the underlying table we're looking at. Once we have the DDL, we know the fields we have, what kind of values they hold, and the nature of the underlying fields. Then we run a similar query against our production database table, compare the two, and finalize the differences we see. We'll probably package one more SQL query, an ALTER query, which will alter the table we have in the production database and adjust its fields according to what we found in the earlier comparison. I'm not sure if there's a ready-made, more advanced Python library that can handle all of these tasks; I might be trying to reinvent the wheel here because I'm not aware of such a library as of now, so I'm just taking a first-principles approach to the problem.
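A first-principles sketch of that comparison (Postgres does expose `information_schema.columns`, which serves the purpose here); the `orders_source`/`orders_target` table names are hypothetical, and only additive changes are handled:

```python
import psycopg2

conn = psycopg2.connect("dbname=warehouse")

COLUMNS_SQL = """
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = %s
ORDER BY ordinal_position;
"""

def columns_of(table):
    """Return {column_name: data_type} for the given table."""
    with conn.cursor() as cur:
        cur.execute(COLUMNS_SQL, (table,))
        return dict(cur.fetchall())

source = columns_of("orders_source")
target = columns_of("orders_target")

# Columns present in the source but missing from the target become
# ALTER TABLE statements.
with conn, conn.cursor() as cur:
    for name, dtype in source.items():
        if name not in target:
            cur.execute(f'ALTER TABLE orders_target ADD COLUMN "{name}" {dtype}')
```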

```python
data = pd.read_csv('data.csv')
df = pd.read_csv('data.csv')
df['new column'] = df['existing column'] * 2
df.to_csv('data.csv', header=False, index=False)
```

The issues with this code snippet are:

1. The line `df = pd.read_csv('data.csv')` reads the CSV file into a new data frame, but the line `df['new column'] = df['existing column'] * 2` is trying to modify the original data frame. Since the original data frame is already overwritten by `df = pd.read_csv('data.csv')`, this will not have the desired effect. It should be `data = pd.read_csv('data.csv')` instead.
2. The line `df.to_csv('data.csv', header=False, index=False)` saves the modified data frame back to the same CSV file. This overwrites the original file and is not good practice; it should be saved to a different file.
3. The same line saves the data frame with no header: the `header=False` parameter suppresses the column names in the output file, which might not be the desired behavior. It should be `header=True` instead.

Here is the corrected code snippet:

```python
data = pd.read_csv('data.csv')
data['new column'] = data['existing column'] * 2
data.to_csv('data_two.csv', header=True, index=False)
```

This is a performance bottleneck when running against large datasets. Can you explain what might be causing the slowdown and how it could be addressed for better optimization? So we're selecting name and the average of salary from the employees table, grouping by name, ordering by average salary in a descending manner, and limiting to 10, and it's being run against a very large dataset. It's a very plain query, but run against a very large table, so what might be causing the slowdown? We'll have to understand the nature of the underlying dataset. It could be that we have multiple duplicate values in the name field, and the ideal way is to group against an employee ID or something similar: generally queries run much faster against IDs than against string values.

What else can we do? Instead of taking an average directly, we can have a first layer, a CTE, that takes each name with a count of its occurrences and the sum of salary against it; then in a second layer we compute the sum of salary divided by the number of occurrences to get the average salary. Generally COUNT(*) is much faster, and SUM would be faster than doing an AVG; that's just my hunch right now.

How can we address it for better optimization? We can put further conditions on this table if needed, for example a condition that the salary payment time is within a certain time frame, to make our table scan much smaller, because right now it's going over the whole table. We can add a region filter or other filters as well to further bring down the total scan of the table.
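A sketch of the CTE rewrite described above, with a hypothetical `paid_at` column for the scan-reducing filter; whether SUM/COUNT actually beats AVG depends on the engine, so treat this as an illustration of the idea rather than a guaranteed optimization:

```python
# Rewritten query: aggregate once in a CTE, derive the average in a second
# layer, and filter on a date range so the scan covers less of the table.
QUERY = """
WITH per_name AS (
    SELECT name,
           SUM(salary) AS total_salary,
           COUNT(*)    AS occurrences
    FROM employees
    WHERE paid_at >= CURRENT_DATE - INTERVAL '1 year'   -- shrink the scan
    GROUP BY name
)
SELECT name,
       total_salary::numeric / occurrences AS avg_salary
FROM per_name
ORDER BY avg_salary DESC
LIMIT 10;
"""
```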

How would you design a resilient Python-based ETL workflow that can handle frequent schema evolution in Snowflake without manual intervention? So we are talking about an ETL workflow where multiple schema changes can happen over the timeline of the project. For this, we first create an object holding the different field names we expect, and before we try to append data to the target table, we compare that list of field names against the target table, making sure the number and nature of the fields we're expecting match what is present in the underlying target table. If not, we update our reference DDL. That reference DDL can be saved as a seeded file that can be used later on, and that seeded file keeps evolving as our Snowflake table schema evolves.

So first we compare the seeded DDL file's content against the current DDL of the existing table, then we incorporate whatever changes we found in that comparison into the DML query we run, and the DML queries then run fine. At the same time, those new fields or changes have to be written back to the seeded DDL file as well, so that later on our reference DDL stays updated with the latest changes. That is my response. Thank you.
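A sketch of the seeded-DDL comparison described above, using snowflake-connector-python; the `ORDERS` table, the JSON seed file, and the additive-only handling are all assumptions for illustration:

```python
import json
import os
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",                       # hypothetical account
    user="etl_user",
    password=os.environ["SNOWFLAKE_PASSWORD"],
    warehouse="ETL_WH", database="ANALYTICS", schema="PUBLIC",
)

SEED_FILE = "seeds/orders_ddl.json"  # the evolving reference DDL

def live_columns(table):
    """Fetch the current column list from Snowflake's information schema."""
    cur = conn.cursor()
    cur.execute(
        "SELECT column_name, data_type FROM information_schema.columns "
        "WHERE table_name = %s ORDER BY ordinal_position",
        (table,),
    )
    return dict(cur.fetchall())

with open(SEED_FILE) as f:
    expected = json.load(f)   # {column_name: data_type} from the last run

live = live_columns("ORDERS")

# Add any expected columns the live table lacks before running the DML.
for name, dtype in expected.items():
    if name not in live:
        conn.cursor().execute(f"ALTER TABLE ORDERS ADD COLUMN {name} {dtype}")

# Write back the merged view so the seed keeps evolving with the schema.
with open(SEED_FILE, "w") as f:
    json.dump({**live, **expected}, f, indent=2)
```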

What are some methods to incorporate machine learning algorithms within Python ETL pipelines for data enrichment? So we are not talking about applying machine learning algorithms for analytics on top of the ETL, but about how we can use machine learning algorithms within the ETL pipelines themselves for data enrichment. A couple of things come to mind.

One is time series forecasting on the historical number of records in the final layer of ETL tables, which is our raw data. The forecast is built using the historical record counts, and if we observe a very different number of records in our ETL layer than the time series predicts, we can raise an alert as a warning to whoever is responsible for those ETL enrichment jobs, so they can check what's causing the difference. We can use the historical record counts to predict how many records we should ideally be getting on a Sunday versus a regular weekday, and create a warning system from that.

Another area is the enrichment layer itself, where we do a lot of cleanup: getting rid of stray delimiters and making sure we use the right delimiter for the nature of the data, filling null values, and making sure the number of records and fields we receive is consistent. That's basically housekeeping, making sure the records we get are consistent with what we expect. Beyond the time series idea, one more thing that comes to mind is using machine learning algorithms to generate artificial samples, artificial customers and their behavior, from the 20,000 customers we have over two years. That can increase our customer sample size from, say, 20,000 to 50,000, which might be enough for our LTV models later on to make good predictions.
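A minimal sketch of that record-count warning idea, using a simple rolling baseline in place of a full forecasting model; the counts and threshold are made up for illustration:

```python
import pandas as pd

# Hypothetical daily record counts for a final-layer table, e.g. from
# SELECT load_date, COUNT(*) FROM final_events GROUP BY load_date.
counts = pd.Series(
    [10_250, 9_980, 10_400, 10_100, 4_300],   # today looks suspiciously low
    index=pd.date_range("2024-01-01", periods=5),
)

history, today = counts.iloc[:-1], counts.iloc[-1]
mean, std = history.mean(), history.std()

# Flag today's load if it falls outside 3 standard deviations of recent
# history; a real system might use a seasonal time-series forecast instead.
if abs(today - mean) > 3 * std:
    print(f"ALERT: {today:.0f} records today vs typical {mean:.0f} +/- {std:.0f}")
```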

How can you use Postgres' write-ahead logging for recovery purposes during a failed ETL job? To be honest, I'm not aware of the write-ahead logging feature of Postgres, so I won't be able to answer how it could be used for recovery during a failed ETL job.