
Anmol Mishra

Vetted Talent
Adaptable Data Engineer and Analyst with a unique blend of technical expertise in data engineering and analytical skills. Proficient in designing and implementing data pipelines, executing ETL processes, and managing databases. Experienced in leveraging statistical analysis and data visualization to uncover valuable insights. Skilled in SQL, Python, data modeling, and BI tools.
  • Role

    Senior Data Engineer & Neo4j Developer

  • Years of Experience

    5.1 years

Skillsets

  • Fivetran
  • Vertica
  • Snowflake
  • Preset
  • PDI
  • Neo4j
  • MicroStrategy
  • Looker
  • Linux
  • SQL - 5 Years
  • dbt
  • Tableau
  • Redshift
  • MySQL
  • Oracle
  • Python - 5 Years

Vetted For

9 Skills

  • Data Engineer (Remote) - AI Screening
  • Result: 52%
  • Skills assessed: Team Collaboration, Data Modeling, ETL, Snowflake, PostgreSQL, Problem Solving Attitude, Python, SQL, Strong Attention to Detail
  • Score: 47/90

Professional Summary

5.1 Years
  • Oct 2021 - Present (4 yr 7 months)

    Senior Data Engineer

    5x Pte Ltd.
  • Jul 2018 - Sep 2021 (3 yr 2 months)

    Software Engineer

    BizAcuity Solutions

Work History

5.1 Years

Senior Data Engineer

5x Pte Ltd.
Oct 2021 - Present (4 yr 7 months)
    Results-driven modern data stack engineer with expertise in designing and implementing client-centric data solutions. Proficient in leveraging tools such as Fivetran, dbt, Looker, and Preset to extract, transform, and visualize data for actionable insights. Skilled in seamless data integration, robust transformation workflows, and developing intuitive dashboards and reports. Shortened the time from the initiation of client onboarding to the delivery of actionable insights to around one week.

Software Engineer

BizAcuity Solutions
Jul 2018 - Sep 2021 (3 yr 2 months)
  • Data Analyst: Developed SQL queries to efficiently fetch loads containing many customer-centric metrics from the client's OLAP data warehouse, later used to predict player behaviour and helping the business identify churning users early.
  • Product Backend Engineer: Designed and developed the backend flow for a BI-service product using Python scripts to handle data validation and enrichment, working with Vertica and MySQL databases.
  • Neo4j Developer: Developed scripts to build a recommender system, consuming daily loads from multiple sources into a Neo4j environment and generating system-level and personal-level recommendations for users.
  • Database Admin and Architect: Configured a multi-node K-safe Vertica data warehouse, designed its schema objects, and configured the corresponding resource pools to meet the data pipeline's requirements.
  • Linux System Admin: Set up and configured Linux servers to install different services; configured the firewall, IP whitelisting, and crontab; and monitored system health and performance.

Achievements

  • Ranked among the top 1% in IIT-JEE and secured a top-100 rank in the GATE examination

Education

  • B.Tech.

    Indian Institute of Technology (BHU) (2018)

AI-interview Questions & Answers

Alright. So I graduated from IIT BHU in 2018, and since then I've been working as a data engineer and data analyst; I have a total experience of 5.5 years now. My proficiency lies in writing SQL queries and Python scripts, and I have good experience with Linux shell scripts as well. I have been working with different reporting tools like Tableau, Sigma, Preset, and recently Looker. I have plenty of experience with different databases too: good experience with Snowflake and BigQuery, and I have also worked with the Vertica warehouse.

In terms of the roles and responsibilities I've taken on, I have worked with clients from different domains, including the retail industry and the online casino and online gaming industry. My primary role has been to understand the client's use cases, leverage the underlying data they generate, whether from their production database or their various marketing tools, and answer the key questions their stakeholders have. For that I have used modern data stack tools like dbt and Fivetran, played an essential role in designing ETL pipelines with those tools, and designed various dashboards for questions coming from the marketing, finance, and inventory teams.

In one of my previous organizations, I also led a team of around 10 to 11 developers and data engineers, where we developed an in-house application that encapsulated our whole data warehousing and analytics service. We ingested data coming in daily from our clients in the online casino and online gaming industries, performed transformations using Linux and Python scripts, and designed dashboards for them using Tableau. That is all about my experience so far. Thank you.

What approach would you use to validate data integrity after an ETL process impacting multiple tables in PostgreSQL? So the exact implementation will depend on the kind of ETL tool we are using, whether it's an in-house pipeline we have built or one of the available ETL tools in the market, but the key methodology is the same. First, have a basic understanding of the ETL process: if we have raw tables with millions of rows and we are doing transformations to get aggregated data into the final layer, then we will obviously check that the expected number of rows arrives in the final layer, which we can validate and cross-check. We can also cross-check by running aggregations on top of our final layer against our staging layer or source tables to make sure the high-level numbers match, and we can compare day-level or month-level aggregates of the top KPIs to ensure nothing is wrong with our ETL processes and we are getting the right numbers.

Apart from this, we should keep an eye on the ETL process itself. The first thing is alerting: we have to have an alert system in place so that we get Slack messages or emails on time whenever ETL processes fail. Once we have that, we go through the ETL logs to check which of our tables is not getting populated, or which job the error is coming from, and then we dig into the individual failed jobs and backtrack the error that caused them. And even without any errors, to make sure data integrity holds, we can have tests in place after the daily ETL jobs have run that check the aggregate-level numbers and make sure there is no anomaly in the number of records coming in daily to the final tables. That is my answer. Thank you.
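A minimal sketch of the row-count and aggregate cross-checks described above, assuming hypothetical table names (`staging.orders` and `final.orders_daily` with amount columns) and a local psycopg2 connection:

```python
import psycopg2

# Hypothetical connection; real credentials would come from config/secrets.
conn = psycopg2.connect("dbname=warehouse")

def scalar(query):
    """Run a query that returns a single value and return that value."""
    with conn.cursor() as cur:
        cur.execute(query)
        return cur.fetchone()[0]

# Row-count check: make sure the final layer received the expected volume.
staging_rows = scalar("SELECT COUNT(*) FROM staging.orders")
final_rows = scalar("SELECT COUNT(*) FROM final.orders_daily")
print(f"staging={staging_rows} rows, final={final_rows} rows")

# Aggregate check: high-level KPIs should match across layers.
staging_total = scalar("SELECT SUM(amount) FROM staging.orders")
final_total = scalar("SELECT SUM(total_amount) FROM final.orders_daily")
if staging_total != final_total:
    # In practice this would go to Slack/email via the alerting system.
    raise ValueError(f"KPI mismatch: staging={staging_total}, final={final_total}")
```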

What advanced SQL techniques can be used in Postgres for complex data manipulation in a real-time streaming ETL pipeline? I do not have that much familiarity with the Postgres relational database, I have primarily worked with Snowflake, BigQuery, and Vertica, but I'll try to answer this. In a real-time streaming pipeline, the first thing to understand is the layout: we will have one table holding the archive data, and a staging layer receiving real-time data every 15 or 30 minutes or so. We have to append the records arriving in each of those micro-batches to the archive table, which holds the historical data. That is the first part to manage: we can have a WHERE clause or a condition in our query that makes sure we only pick up data newer than what is already in the archive.

Once we have done that, the archive layer has all the data we need, and we can apply a level of partitioning to it so the further layers of ETL run smoothly and stay optimized. The partitioning can be done on the staging or archive layer, probably on a created or updated timestamp, whatever field we have, at a daily or weekly level. Once we have that, we can run whatever transformations our use cases need on that final layer.

The crux of my answer is that we have to be very careful that the real-time streaming data arriving every 5, 10, or 15 minutes gets appended to the much larger historical table in a very smooth way, by taking care of WHERE clauses with the right timestamp comparisons.
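A sketch of the timestamp-guarded append described above, assuming hypothetical `staging_events` and `archive_events` tables with an `updated_at` column:

```python
import psycopg2

conn = psycopg2.connect("dbname=warehouse")

# Append only rows newer than what the archive already holds, so the same
# 15/30-minute micro-batch is never inserted twice.
APPEND_SQL = """
INSERT INTO archive_events
SELECT s.*
FROM staging_events AS s
WHERE s.updated_at > (
    SELECT COALESCE(MAX(updated_at), TIMESTAMP '1970-01-01')
    FROM archive_events
);
"""

with conn, conn.cursor() as cur:  # `with conn` commits on success
    cur.execute(APPEND_SQL)
```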

How would you apply Postgres' partitioning features to improve ETL pipeline efficiency for larger datasets? This is an extension of my previous answer. I'm not sure exactly how partitioning is done in Postgres, but I have done partitioning for our BigQuery tables, so I have a general idea of how partitioning works and how it helps improve the efficiency of queries on top of those tables.

The main idea of partitioning is that if you have a very large table, say hundreds of millions of records, you can partition it so the data gets stored in segments at a monthly or weekly granularity. Whether this helps depends on the nature of the queries that run on those tables: if the ETL queries or BI-layer queries on top of them have date ranges in the WHERE clause, then it's very prudent to apply partitioning. Once we have it, queries run much faster because a full table scan is reduced to a scan of just the segment of the table where that date range lies. That is how we can improve efficiency when querying large datasets.

Apart from partitioning by chronology, we can also partition on other dimensions; that again depends on the nature of the queries we'll run on top of the table. Generally I have found it is mostly done on timestamp or date fields. So, yeah, that was my answer.
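For reference, Postgres does support this natively through declarative partitioning. A sketch with a hypothetical `events` table range-partitioned by month on its timestamp column:

```python
import psycopg2

conn = psycopg2.connect("dbname=warehouse")

# Declarative range partitioning on a timestamp column; monthly partitions.
DDL = [
    """
    CREATE TABLE events (
        event_id   bigint,
        created_at timestamp NOT NULL,
        payload    jsonb
    ) PARTITION BY RANGE (created_at);
    """,
    """
    CREATE TABLE events_2024_01 PARTITION OF events
        FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
    """,
    """
    CREATE TABLE events_2024_02 PARTITION OF events
        FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
    """,
]

with conn, conn.cursor() as cur:
    for statement in DDL:
        cur.execute(statement)

# A query with created_at in the WHERE clause now scans only the matching
# partition instead of the whole table, e.g.:
#   SELECT COUNT(*) FROM events
#   WHERE created_at >= '2024-01-01' AND created_at < '2024-02-01';
```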

What method would you use in Python to efficiently transform nested JSON data for insertion into a PostgreSQL database? So there has to be a library in Python that helps us interact with the Postgres database, and where a database has a specific library, we generally use it. For example, for Snowflake we can use the connector library's functionality to load JSON data or CSV files directly and materialize them into a Snowflake table. We would use a library like psycopg2 for Postgres or snowflake-connector-python for Snowflake to interact with the database.
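A minimal sketch of that approach for Postgres, flattening nested JSON with `pandas.json_normalize` and batch-inserting with psycopg2's `execute_values`; the payload and the `transactions` table are hypothetical:

```python
import json
import pandas as pd
import psycopg2
from psycopg2.extras import execute_values

# Hypothetical nested payload, e.g. from an API export.
raw = json.loads("""
[{"id": 1, "user": {"name": "a", "country": "SG"}, "amount": 10.5},
 {"id": 2, "user": {"name": "b", "country": "IN"}, "amount": 7.0}]
""")

# Flatten the nested "user" object into top-level columns:
# user.name -> user_name, user.country -> user_country.
df = pd.json_normalize(raw, sep="_")

conn = psycopg2.connect("dbname=warehouse")
with conn, conn.cursor() as cur:
    # execute_values batches rows into a single INSERT statement, which is
    # far faster than inserting row by row.
    execute_values(
        cur,
        "INSERT INTO transactions (id, user_name, user_country, amount) VALUES %s",
        df[["id", "user_name", "user_country", "amount"]].itertuples(index=False),
    )
```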

We're going to use a Python script to interact with the Postgres database. We'll use the library to package one of our SQL queries, something like a `SELECT table_name, ddl FROM ...` against the information schema equivalent in Postgres; we have an information schema in Snowflake, and I think it's called something slightly different in Postgres. We can use that to get the DDL of the underlying table we're looking at. Once we have the DDL, we know the fields we have, what kind of values they hold, and the nature of the underlying fields. Then we run a similar query against our production database table, compare the two, and finalize the differences we see. We'll probably package one more SQL query, an ALTER query, which will alter the table we have in the production database and adjust its fields according to what we found in the earlier comparison. I'm not sure if there's a ready-made, more advanced Python library that can handle all of these tasks; I might be trying to reinvent the wheel here because I'm not aware of such a library as of now, so I'm just taking a first-principles approach to the problem.
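A first-principles sketch of that comparison (Postgres does expose `information_schema.columns`, which serves the purpose here); the `orders_source`/`orders_target` table names are hypothetical, and only additive changes are handled:

```python
import psycopg2

conn = psycopg2.connect("dbname=warehouse")

COLUMNS_SQL = """
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = %s
ORDER BY ordinal_position;
"""

def columns_of(table):
    """Return {column_name: data_type} for the given table."""
    with conn.cursor() as cur:
        cur.execute(COLUMNS_SQL, (table,))
        return dict(cur.fetchall())

source = columns_of("orders_source")
target = columns_of("orders_target")

# Columns present in the source but missing from the target become
# ALTER TABLE statements.
with conn, conn.cursor() as cur:
    for name, dtype in source.items():
        if name not in target:
            cur.execute(f'ALTER TABLE orders_target ADD COLUMN "{name}" {dtype}')
```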

```python
data = pd.read_csv('data.csv')
df = pd.read_csv('data.csv')
df['new column'] = df['existing column'] * 2
df.to_csv('data.csv', header=False, index=False)
```

The issues with this code snippet are:

1. The line `df = pd.read_csv('data.csv')` reads the CSV file into a new data frame, but the line `df['new column'] = df['existing column'] * 2` is trying to modify the original data frame. Since the original data frame is already overwritten by `df = pd.read_csv('data.csv')`, this will not have the desired effect. It should be `data = pd.read_csv('data.csv')` instead.
2. The line `df.to_csv('data.csv', header=False, index=False)` saves the modified data frame back to the same CSV file. This overwrites the original file and is not good practice; it should be saved to a different file.
3. The same line saves the data frame with no header: the `header=False` parameter suppresses the column names in the output file, which might not be the desired behavior. It should be `header=True` instead.

Here is the corrected code snippet:

```python
data = pd.read_csv('data.csv')
data['new column'] = data['existing column'] * 2
data.to_csv('data_two.csv', header=True, index=False)
```

This is a performance bottleneck when running against large datasets. Can you explain what might be causing the slowdown and how it could be addressed for better optimization? So we're selecting name and the average of salary from the employees table, grouping by name, ordering by average salary in a descending manner, and limiting to 10, and it's being run against a very large dataset. It's a very plain query, but run against a very large table, so what might be causing the slowdown? We'll have to understand the nature of the underlying dataset. It could be that we have multiple duplicate values in the name field, and the ideal way is to group against an employee ID or something similar: generally queries run much faster against IDs than against string values.

What else can we do? Instead of taking an average directly, we can have a first layer, a CTE, that takes each name with a count of its occurrences and the sum of salary against it; then in a second layer we compute the sum of salary divided by the number of occurrences to get the average salary. Generally COUNT(*) is much faster, and SUM would be faster than doing an AVG; that's just my hunch right now.

How can we address it for better optimization? We can put further conditions on this table if needed, for example a condition that the salary payment time is within a certain time frame, to make our table scan much smaller, because right now it's going over the whole table. We can add a region filter or other filters as well to further bring down the total scan of the table.
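A sketch of the CTE rewrite described above, with a hypothetical `paid_at` column for the scan-reducing filter; whether SUM/COUNT actually beats AVG depends on the engine, so treat this as an illustration of the idea rather than a guaranteed optimization:

```python
# Rewritten query: aggregate once in a CTE, derive the average in a second
# layer, and filter on a date range so the scan covers less of the table.
QUERY = """
WITH per_name AS (
    SELECT name,
           SUM(salary) AS total_salary,
           COUNT(*)    AS occurrences
    FROM employees
    WHERE paid_at >= CURRENT_DATE - INTERVAL '1 year'   -- shrink the scan
    GROUP BY name
)
SELECT name,
       total_salary::numeric / occurrences AS avg_salary
FROM per_name
ORDER BY avg_salary DESC
LIMIT 10;
"""
```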

How would you design a resilient Python-based ETL workflow that can handle frequent schema evolution in Snowflake without manual intervention? So we are talking about an ETL workflow where multiple schema changes can happen over the timeline of the project. For this, we first create an object holding the different field names we expect, and before we try to append data to the target table, we compare that list of field names against the target table, making sure the number and nature of the fields we're expecting match what is present in the underlying target table. If not, we update our reference DDL. That reference DDL can be saved as a seeded file that can be used later on, and that seeded file keeps evolving as our Snowflake table schema evolves.

So first we compare the seeded DDL file's content against the current DDL of the existing table, then we incorporate whatever changes we found in that comparison into the DML query we run, and the DML queries then run fine. At the same time, those new fields or changes have to be written back to the seeded DDL file as well, so that later on our reference DDL stays updated with the latest changes. That is my response. Thank you.
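A sketch of the seeded-DDL comparison described above, using snowflake-connector-python; the `ORDERS` table, the JSON seed file, and the additive-only handling are all assumptions for illustration:

```python
import json
import os
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",                       # hypothetical account
    user="etl_user",
    password=os.environ["SNOWFLAKE_PASSWORD"],
    warehouse="ETL_WH", database="ANALYTICS", schema="PUBLIC",
)

SEED_FILE = "seeds/orders_ddl.json"  # the evolving reference DDL

def live_columns(table):
    """Fetch the current column list from Snowflake's information schema."""
    cur = conn.cursor()
    cur.execute(
        "SELECT column_name, data_type FROM information_schema.columns "
        "WHERE table_name = %s ORDER BY ordinal_position",
        (table,),
    )
    return dict(cur.fetchall())

with open(SEED_FILE) as f:
    expected = json.load(f)   # {column_name: data_type} from the last run

live = live_columns("ORDERS")

# Add any expected columns the live table lacks before running the DML.
for name, dtype in expected.items():
    if name not in live:
        conn.cursor().execute(f"ALTER TABLE ORDERS ADD COLUMN {name} {dtype}")

# Write back the merged view so the seed keeps evolving with the schema.
with open(SEED_FILE, "w") as f:
    json.dump({**live, **expected}, f, indent=2)
```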

What are some methods to incorporate machine learning algorithms within Python ETL pipelines for data enrichment? So we are not talking about applying machine learning algorithms for analytics on top of the ETL, but about how we can use machine learning algorithms within the ETL pipelines themselves for data enrichment. A couple of things come to mind.

One is time series forecasting on the historical number of records in the final layer of ETL tables, which is our raw data. The forecast is built using the historical record counts, and if we observe a very different number of records in our ETL layer than the time series predicts, we can raise an alert as a warning to whoever is responsible for those ETL enrichment jobs, so they can check what's causing the difference. We can use the historical record counts to predict how many records we should ideally be getting on a Sunday versus a regular weekday, and create a warning system from that.

Another area is the enrichment layer itself, where we do a lot of cleanup: getting rid of stray delimiters and making sure we use the right delimiter for the nature of the data, filling null values, and making sure the number of records and fields we receive is consistent. That's basically housekeeping, making sure the records we get are consistent with what we expect. Beyond the time series idea, one more thing that comes to mind is using machine learning algorithms to generate artificial samples, artificial customers and their behavior, from the 20,000 customers we have over two years. That can increase our customer sample size from, say, 20,000 to 50,000, which might be enough for our LTV models later on to make good predictions.
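A minimal sketch of that record-count warning idea, using a simple rolling baseline in place of a full forecasting model; the counts and threshold are made up for illustration:

```python
import pandas as pd

# Hypothetical daily record counts for a final-layer table, e.g. from
# SELECT load_date, COUNT(*) FROM final_events GROUP BY load_date.
counts = pd.Series(
    [10_250, 9_980, 10_400, 10_100, 4_300],   # today looks suspiciously low
    index=pd.date_range("2024-01-01", periods=5),
)

history, today = counts.iloc[:-1], counts.iloc[-1]
mean, std = history.mean(), history.std()

# Flag today's load if it falls outside 3 standard deviations of recent
# history; a real system might use a seasonal time-series forecast instead.
if abs(today - mean) > 3 * std:
    print(f"ALERT: {today:.0f} records today vs typical {mean:.0f} +/- {std:.0f}")
```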

How can you use Postgres' write-ahead logging for recovery purposes during a failed ETL job? To be honest, I'm not aware of the write-ahead logging feature of Postgres, so I won't be able to answer how it could be used for recovery during a failed ETL job.