
Madhan Lingareddy

Vetted Talent

Data Engineer with 8 years of experience working with various types of data and a range of ETL and BI tools. Technically strong in Python, Apache Spark (PySpark), and SQL. Responsive expert experienced in monitoring database performance, troubleshooting issues, and optimising database environments. Possesses strong analytical skills, excellent problem-solving abilities, and a deep understanding of database technologies and systems.

  • Role

    Senior Data Engineer

  • Years of Experience

    8 years

Skillsets

  • BI development - 3 Years
  • Python - 5 Years
  • SQL - 7 Years
  • Data Modeling
  • Git
  • Serverless
  • Streaming
  • GCP - 3 Years
  • ETL - 5 Years
  • Snowflake - 2 Years

Vetted For

13 Skills
  • Data Engineer || (Remote), AI Screening: 78%
  • Skills assessed: Airflow, Data Governance, Machine Learning and Data Science, BigQuery, ETL processes, Hive, Relational DB, Snowflake, Hadoop, Java, PostgreSQL, Python, SQL
  • Score: 70/90

Professional Summary

8 Years
  • Jun, 2024 - Present (1 yr 4 months)

    Senior Data Engineer

    Procore Technologies Inc.
  • Apr, 2022 - May, 2024 (2 yr 1 month)

    Sr Data Engineer

    Pasarpolis
  • Apr, 2021 - Mar, 2022 (11 months)

    Sr Data Engineer

    ADF Data Science
  • Feb, 2018 - Mar, 2021 (3 yr 1 month)

    BI Developer

    Wipro

Applications & Tools Known

  • Python
  • SQL
  • Apache Spark
  • AWS
  • GCP
  • Airflow
  • Amazon Redshift
  • Snowflake
  • BigQuery
  • AWS Glue
  • AWS Lake Formation
  • Tableau
  • QuickSight
  • Cloud Composer
  • GitLab
  • AWS S3
  • Pentaho
  • Spline

Work History

8 Years

Senior Data Engineer

Procore Technologies Inc.
Jun, 2024 - Present (1 yr 4 months)
    Built low-latency data streaming workflows, re-engineered existing ETL workflows, and designed a data quality framework.

Sr Data Engineer

Pasarpolis
Apr, 2022 - May, 2024 (2 yr 1 month)
    Designed and implemented the ETL architecture on GCP, created Airflow workflows, and set up a real-time data streaming pipeline.

Sr Data Engineer

ADF Data Science
Apr, 2021 - Mar, 2022 (11 months)
    Orchestrated data pipelines, authored scripts for data migration, and designed metadata.

BI Developer

Wipro
Feb, 2018 - Mar, 2021 (3 yr 1 month)
    Worked as a BI Developer and Data Engineer, built data models and ETL workflows, and developed dashboards.

Achievements

  • Captain of the winning cricket team at the University Sports Meet
  • Customer Charmer Award, Wipro: 2021

Education

  • Bachelor of Technology: Electronics And Communications Engineering

    IIIT, RGUKT - India (2017)

Certifications

  • Incorta Fundamentals for Admins and Developers: 2019

  • Python (Intermediate), HackerRank: 2020

  • AWS CDA, Udemy: 2020

Interests

  • Cricket
  • Watching Movies
  • Driving
  • Badminton

AI-Interview Questions & Answers

    Sure. I've been a data engineer; I started as a BI developer in my first year, working with tools like Tableau and Incorta and doing data modeling. From there I switched to data engineering, working with Python, SQL, and a lot of relational databases such as Oracle, PostgreSQL, and MySQL. I started building data warehouses, most of them on Oracle, some file-based, and some on cloud platforms like AWS and GCP. Throughout my career I've worked with multiple ETL tools such as Informatica and Pentaho, and more recently AWS Glue. I've also worked with PySpark, which was my first introduction to big data, so I have solid hands-on experience there, including installing and maintaining clusters for our organisation. On Google Cloud I've worked hands-on with BigQuery as a data warehouse, built and maintained data workflows using Airflow, and helped clients build their own data ecosystems so they can take advantage of their in-house data and build their own analytics. Along the way I've also used various BI tools like QuickSight, Google Data Studio, Tableau, Power BI, and Incorta. That's pretty much it.

    How would I design a system to monitor the health and performance of an ETL pipeline processing large volumes of data daily? In my most recent experience I worked with an organisation that deals with payments, so the data is huge, somewhere around petabyte scale. We used AWS Glue as our ETL tool; it can process large volumes of data depending on how much capacity we configure, and it can scale. Our storage was on S3, which is also scalable, so there is no practical limit there. We build workflows and monitor everything through CloudWatch logs, and we set up alerts using SNS topics that ping us directly on our team's channel. We do regular monitoring and performance tuning, and we implement data validation checks: for example, when a key column isn't populated, when a foreign key value isn't available in the referenced table, or when a column comes up blank for most records in a large load. These checks are automated and post to our Slack or Teams channels, so it's all integrated. Deployments are also automated with GitLab CI/CD pipelines, and we maintain most of our infrastructure as serverless code, so it's easy to deploy and maintain.
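
    As a rough illustration of the validation-and-alert checks described above, the sketch below flags a key column with too many nulls and publishes an SNS notification, which could be wired to a Slack or Teams channel. The topic ARN, column name, and threshold are hypothetical.

```python
import boto3
import pandas as pd

SNS_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:etl-alerts"  # hypothetical topic

def check_key_column(df: pd.DataFrame, column: str, threshold: float = 0.05) -> None:
    """Publish an SNS alert if more than `threshold` of `column` is null."""
    null_ratio = df[column].isna().mean()
    if null_ratio > threshold:
        boto3.client("sns").publish(
            TopicArn=SNS_TOPIC_ARN,
            Subject=f"Data quality alert: {column}",
            Message=(
                f"{null_ratio:.1%} of '{column}' values are null "
                f"(threshold {threshold:.0%})."
            ),
        )
```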

    Data consistency: we get our data from our RDS instance, but we never query the actual production system; we hit the read replica and run our checks and validations against that. We build a data lake on S3, as I said before, using an open-source format called Apache Hudi, which supports ACID transactions by managing a log-based file system. On top of that we build aggregated and presentation layers; in total we maintain three layers of warehouses. The first is an exact copy of prod, the second is an aggregated layer containing denormalised and aggregated versions of tables, and from there we pick tables up to build a final presentation layer that most of the reporting and end users consume. At every step of the journey, from the source to the presentation and reporting layer, we run validation checks, do performance tuning, monitor continuously, and tweak whatever can be changed to achieve better performance and better trust in the data. I think that's it.
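
    For context, here is a minimal sketch of an upsert into the S3 raw layer with Apache Hudi, assuming the Hudi Spark bundle is available to the Spark session; the table name, key columns, and bucket paths are hypothetical placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert").getOrCreate()

# Incremental batch pulled from the read replica and staged as Parquet (hypothetical path).
df = spark.read.parquet("s3a://my-staging-bucket/payments_delta/")

hudi_options = {
    "hoodie.table.name": "payments",                           # hypothetical table
    "hoodie.datasource.write.recordkey.field": "payment_id",   # record key for upserts
    "hoodie.datasource.write.precombine.field": "updated_at",  # keep the latest version
    "hoodie.datasource.write.operation": "upsert",
}

(
    df.write.format("hudi")
      .options(**hudi_options)
      .mode("append")
      .save("s3a://my-data-lake/raw/payments")                 # hypothetical lake path
)
```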

    Tweaking JSON data to load it into a relational database: a relational database is structured, so a table has a fixed set of columns and each row is inserted with values for all of them. JSON data can be semi-structured, so one record might come with three keys and the next record with four. Our goal is to produce a dataset with, say, four columns, and wherever a key is missing we populate a blank or a null. In my experience we can do this with plain Python using pandas: the json_normalize function splits or explodes the JSON fields into separate columns. Or we can use PySpark to achieve a very similar effect: we explode the data and evolve the schema based on how many columns we have, populate blanks for missing columns, and then write it out over JDBC. There can also be cases where the source dataset contains more columns than the target relational table; in those cases we need to alter the target table to accommodate the new columns, or drop the incoming columns.
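
    A small sketch of the pandas route described above: json_normalize unions the differing keys into one set of columns, missing keys become nulls, and the result is loaded into a relational table. The records, connection string, and table name are made up for illustration.

```python
import pandas as pd
from sqlalchemy import create_engine

# Two semi-structured records with differing keys (hypothetical data).
records = [
    {"id": 1, "name": "a", "city": "Pune"},
    {"id": 2, "name": "b", "city": "Jakarta", "country": "ID"},  # extra key
]

# json_normalize unions all keys into columns; missing keys become NaN/NULL.
df = pd.json_normalize(records)

# Load into a relational table (connection string and table name are hypothetical).
engine = create_engine("postgresql+psycopg2://user:password@host:5432/db")
df.to_sql("customers", engine, if_exists="append", index=False)
```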

    How do I optimise data storage in a relational database for a data-intensive application? First off, if we can avoid using a relational database for a data-intensive application, at least for analytics purposes, we try to use warehouses like BigQuery or Redshift instead. If not, we have to understand the requirements. Say the application queries on a particular key column: it's ideal to cluster or partition on that column and build indexes on it, which speeds up querying. When working with data warehouses that support partitioning and clustering, it's ideal to partition on the column you filter or aggregate on and to cluster or sort on the key you tend to join on. In these situations, clustering, partitioning, and building indexes are the first things that come to mind.
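
    As a minimal illustration of the indexing advice above, the sketch below creates an index on the key column an application filters and joins on; the connection string, table, and column names are hypothetical.

```python
from sqlalchemy import create_engine, text

# Connection string, table, and column are hypothetical.
engine = create_engine("postgresql+psycopg2://user:password@host:5432/db")

with engine.begin() as conn:
    # Index the key column the application filters and joins on,
    # so lookups avoid full table scans.
    conn.execute(text(
        "CREATE INDEX IF NOT EXISTS idx_orders_customer_id ON orders (customer_id)"
    ))
```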

    Migrating an existing data process from an on-premise Hadoop cluster to BigQuery: the first step would be to export all the existing data from the Hadoop system to scalable cloud storage, which on GCP is GCS. You can write a Spark application that reads from Hadoop and pushes the data to GCS in CSV or Parquet format, preferably Parquet. Then there is the option of reading from external tables in BigQuery, which means you can point directly at GCS, read the data from there, and expose it as an external table; or you can build a native table that keeps its storage in the BigQuery layer. Once that's in place, processing incremental data is straightforward: BigQuery has a concept of connections that can reach your relational databases or file systems, and you can query them in SQL style to bring the incremental data in and merge it with the native table you built from the full load. From there it's easy: you either write SQL, or use a tool like Spark or Airflow to run these incremental loads for you.
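
    A possible shape of the first two migration steps described above, assuming a Spark session with the GCS connector configured and the google-cloud-bigquery client installed; all paths, project, dataset, and table names are hypothetical.

```python
from pyspark.sql import SparkSession
from google.cloud import bigquery

# Step 1: read from the on-prem Hadoop cluster and land Parquet on GCS.
spark = SparkSession.builder.appName("hdfs-to-gcs").getOrCreate()
(
    spark.read.parquet("hdfs:///warehouse/orders")            # hypothetical HDFS path
         .write.mode("overwrite")
         .parquet("gs://my-migration-bucket/orders/")         # hypothetical GCS bucket
)

# Step 2: expose the Parquet files in BigQuery as an external table.
client = bigquery.Client()
table = bigquery.Table("my-project.analytics.orders_ext")     # hypothetical table ID
external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = ["gs://my-migration-bucket/orders/*.parquet"]
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)
```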

    The only issue I see here is the generic except Exception as e clause, so we don't really know what's actually happening inside. It's always better to handle specific exceptions before falling back to a default exception case; that's what I would change. I would also add a more readable error message before raising the issue: the current one just says "data loading failed", so we may want to describe the specific error scenario. So: deal with specific exceptions first, write messages that are easy to understand and keep the traceback, and only then fall back to the very generic default case.
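
    The reviewed snippet itself is not reproduced here, but the sketch below shows the general pattern being suggested: specific exceptions handled first with descriptive messages and the traceback preserved via exception chaining, with the generic handler kept as a last resort.

```python
import logging

logger = logging.getLogger(__name__)

def load_data(path: str) -> list[str]:
    """Load raw records from `path`, raising a descriptive error on failure."""
    try:
        with open(path) as f:
            return f.readlines()
    except FileNotFoundError as exc:
        logger.error("Input file %s does not exist", path)
        raise RuntimeError(f"Data loading failed: missing input file {path}") from exc
    except PermissionError as exc:
        logger.error("No permission to read %s", path)
        raise RuntimeError(f"Data loading failed: cannot read {path}") from exc
    except Exception:
        # Last-resort catch-all, kept after the specific handlers on purpose.
        logger.exception("Unexpected error while loading %s", path)
        raise
```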

    This function is supposed to fetch data from the database. Looking at it, one thing I would tweak is that instead of using a while True loop, we should check whether the cursor actually has a next row and only then process it, rather than entering the loop and fetching blindly. In the current version, once the data has been completely processed, we will still hit an error saying the next row does not exist.
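
    The reviewed function is not shown here; the sketch below is the idiomatic DB-API version of the fix being described, stopping when fetchone() returns no more rows instead of looping unconditionally. The database and query are hypothetical.

```python
import sqlite3  # stand-in for whatever DB-API driver the reviewed code used

conn = sqlite3.connect("example.db")           # hypothetical database
cursor = conn.cursor()
cursor.execute("SELECT id, payload FROM events")

# Fetch until the cursor is exhausted; fetchone() returns None when there are
# no more rows, so we never over-fetch past the end of the result set.
while (row := cursor.fetchone()) is not None:
    print(row)  # placeholder for the real row-processing logic

conn.close()
```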

    On BigQuery the data is pretty spread out, so we need to understand whether, if we're doing incremental loads based on the key we're updating, the table is actually clustered on that column, because that can prove very valuable when doing updates. It's also always better to partition on a particular date column. Say we're pulling the incremental data based on an updated_at or created_at column: it's ideal to partition the table on that, which means that when updating or even just querying the data, the queries don't have to scan the entire table, only the partitions that need updating. If the table is too big, we can create clustered tables and see if we can run parallel queries to update it. Beyond that: filter the data before processing it and select only what you need, aggregate before joining if you can, and filter before joining if you can. That's pretty much it.
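
    As a sketch of the incremental-update pattern above, assume a target table partitioned on an order_date column and clustered on order_id (all names hypothetical); the MERGE below restricts the scan to recent partitions rather than the whole table.

```python
from google.cloud import bigquery

client = bigquery.Client()
client.query(
    """
    MERGE analytics.orders AS t
    USING staging.orders_delta AS s
    ON  t.order_id = s.order_id
        -- restrict the scan to recent partitions instead of the whole table
        AND t.order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)
    WHEN MATCHED THEN
        UPDATE SET status = s.status, updated_at = s.updated_at
    WHEN NOT MATCHED THEN
        INSERT ROW
    """
).result()
```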

    First up, I would understand what the requirements are: what is the source, what is the target, is it okay to go with functional programming or do I need object-oriented programming here? Then I start writing functions and classes and build the code for reusability, supporting both incremental and full loads for flexibility, and giving enough room for user input without cluttering it, so the code can be tweaked for specific use cases. I use docstrings and type hinting, implement unit test cases, include data validation scenarios, and add enough comments to make clear what's really happening inside. Then I push the code, get it reviewed, and deploy it. Since we work with cloud systems most of the time, it's also ideal to include a Serverless or Terraform file that deploys things for you, and probably a shell script that automates steps and makes it easier to build the CI/CD pipeline on GitLab.
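
    A bare-bones illustration of the structure described above: a typed, documented function that supports both full and incremental loads with a simple validation hook. The function name, columns, and Parquet source are hypothetical.

```python
from datetime import date
from typing import Optional

import pandas as pd

def extract_orders(source_path: str, since: Optional[date] = None) -> pd.DataFrame:
    """Read orders from `source_path`.

    If `since` is given, run an incremental load; otherwise run a full load.
    """
    df = pd.read_parquet(source_path)
    if since is not None:
        df = df[df["updated_at"].dt.date >= since]
    # Basic validation hook: fail fast if the key column is missing or empty.
    if "order_id" not in df.columns or df["order_id"].isna().any():
        raise ValueError("order_id must be present and non-null")
    return df
```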

    Yes, that's an important one, I believe. In AWS we use the Glue Data Catalog combined with Lake Formation to manage data governance most of the time, and it's pretty easy there. We build roles around users and assign roles to those users, and based on which role a user has, they can access a specific set of data. We can select the databases and tables they can reach, filter the rows they can access, and choose the columns they can access, and it's all fairly visual the way it's done in Lake Formation. We also have to recognise that there can be sensitive data that you don't want to expose, and all of that has to be taken into consideration. You might want to do count checks and key-column comparisons to make sure the data is trustworthy before opening it up to end users, write down the test cases, monitor, and get user confirmation before things are deployed and access is granted. Thorough testing: everything comes into play.
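
    As an illustrative sketch of the role- and column-level grants described above, the snippet below uses the boto3 Lake Formation client; the IAM role ARN, database, table, and column names are hypothetical placeholders.

```python
import boto3

lf = boto3.client("lakeformation")

# Grant column-level SELECT on a table to an IAM role so users assigned that
# role only see the approved, non-sensitive columns.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/analyst"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "payments",
            "Name": "transactions",
            "ColumnNames": ["txn_id", "amount", "created_at"],  # sensitive columns excluded
        }
    },
    Permissions=["SELECT"],
)
```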