profile-pic
Vetted Talent

Deependra Singh Rathore

Vetted Talent
Experienced Data Engineer and Analyst with over years of hands-on leadership, guiding team of 10+ developers through diverse project phases, including Data Warehousing, Modeling, Analytics, and ETL processes. Proficient in SQL scripting, Data Visualization, and navigating Legacy Systems like SAP and FTP Servers. A proven track record of completing 3+ projects and client POCs, consistently driving organizational profitability through commitment to excellence in data engineering and analytics.
  • Role

    Data & Databricks Engineer

  • Years of Experience

    4.9 years

Skillsets

  • Python
  • ETL
  • fabrics
  • Github
  • Glue
  • Jenkins
  • Jira
  • Jupyter
  • KPIs
  • Lakehouse architecture
  • Power BI
  • EMR
  • Redshift
  • Row-Level Security
  • S3
  • SLAS
  • SQL
  • Streaming pipelines
  • synapse
  • Talend
  • Madallian architecture
  • Data Factory
  • Data Governance
  • ADLS
  • Athena
  • AWS
  • Azure
  • Azure Boards
  • Azure DevOps
  • Compliance
  • Control-M
  • Data Warehousing
  • data lineage
  • Data Modeling
  • data quality
  • data reconciliation
  • Data Security
  • Databricks
  • Dbeaver
  • Delta Lake
  • ELT

Vetted For

9Skills
  • Roles & Skills
  • Results
  • Details
  • icon-skill_image
    Data Engineer (Remote)AI Screening
  • 53%
    icon-arrow-down
  • Skills assessed :Team Collaboration, Data Modeling, ETL, Snowflake, Postgre SQL, Problem Solving Attitude, Python, SQL, Strong Attention to Detail
  • Score: 48/90

Professional Summary

4.9Years
  • Dec, 2021 - Present4 yr 6 months

    Data Engineer

    Celebal Technologies
  • Sep, 2021 - Nov, 2021 2 months

    Associate Data Engineer

    Celebal Technologies
  • May, 2021 - Aug, 2021 3 months

    Associate Data Engineer

    Celebal Technologies
  • May, 2018 - Jul, 2018 2 months

    SQL Developer

Applications & Tools Known

  • icon-tool

    Azure Synapse Analytics

  • icon-tool

    Azure SQL Database

  • icon-tool

    Oracle

  • icon-tool

    AWS Glue

  • icon-tool

    AWS S3

  • icon-tool

    Azure AD

  • icon-tool

    Microsoft Power BI

  • icon-tool

    Python Libraries

  • icon-tool

    Databricks

  • icon-tool

    Spark

  • icon-tool

    Azure

  • icon-tool

    AWS

  • icon-tool

    Microsoft Excel

  • icon-tool

    Jupyter

  • icon-tool

    Visual Studio

  • icon-tool

    Google Data Studio

Work History

4.9Years

Data Engineer

Celebal Technologies
Dec, 2021 - Present4 yr 6 months

Associate Data Engineer

Celebal Technologies
Sep, 2021 - Nov, 2021 2 months

Associate Data Engineer

Celebal Technologies
May, 2021 - Aug, 2021 3 months
    Migrated multiple legacy on-premises warehouses (SAP BW, SQL Server, Oracle) to cloud-native platforms (AWS Redshift, Azure Synapse, Databricks), consolidating 15+ heterogeneous sources and achieving 50%+ cost reduction with up to 3x query performance improvement. Designed and implemented an enterprise data governance framework for 30+ data assets, including data lineage, security policies, SLAs, and KPI alignment, reducing audit cycles from weeks to hours and improving compliance readiness. Built and maintained 5+ production ETL/ELT pipelines using Python, SQL, and Spark on Databricks and cloud services to process GBs of daily data from APIs, ERP systems, databases, and on-prem sources, achieving 99%+ pipeline reliability and sub-hour data freshness. Engineered a streaming analytics platform on Azure Databricks and Delta Lake processing 23 TB/day of TV ratings and Ad-VOD data, implementing TVR computation models and sub-second API queries, reducing platform costs by 26% and enabling 10+ new analytics use cases. Developed a Common Data Model (CDM) framework integrating SAP ECC, SAP BW, and Salesforce via dynamic attribute mapping and runtime schema handling in ADF and SQL Views, reducing manual reconciliation effort by 70% and standardizing consumption for 30+ Power BI users. Built an automated Row-Level Security (RLS) replication framework to migrate SAP BW custom security to Databricks Unity Catalog using Azure Data Factory and parameterized Databricks notebooks, eliminating manual admin effort and cutting dataset onboarding from 35 days to 23 hours. Implemented data quality checks, control tables, and monitoring dashboards across 30+ critical pipelines, supporting 100+ downstream analytics users with reliable, auditable datasets. Trained and mentored junior engineers on cloud platforms (AWS, Azure, Databricks), data modeling, and pipeline optimization, while authoring documentation and runbooks for 30+ data assets and improving onboarding speed by 60%.

SQL Developer

May, 2018 - Jul, 2018 2 months

Achievements

  • Designed and executed 150+ data pipelines
  • Optimized Traditional Data Driven Approaches by 65%
  • Developed 3+ dashboards
  • Achieved remarkable 90% reduction in processing time by optimizing data modeling scripts

Major Projects

5Projects

DIA Pricing Analytics Data Lake & ETL

    Automated Talend and Control-M ETL pipelines ingesting 33M+ pricing records from legacy TED systems into AWS Redshift with conformance layers and certified datasets, integrating SAP ECC/S4, PROS CPQ, and Salesforce to enable near real-time pricing analytics for global teams.

Streaming Analytics Platform for TV Ratings & Ad-VOD

    Designed a unified streaming architecture on Azure Databricks using Spark Structured Streaming, Delta Lake, and Event Hubs to process 23 TB/day of viewer and click-stream data, with TVR computation logic and Azure Functions APIs that reduced costs by 26% and met strict latency SLAs.

SAP BW Security Replication on Databricks Unity Catalog

    Engineered an automated framework using Azure Data Factory and Databricks to replicate SAP BW security into Unity Catalog with dynamic RLS generation and automated group/user assignment via SCIM/API, standardizing security governance and shrinking onboarding from days to hours.

Common Data Model (CDM) Implementation & Multi-Source Migration

    Built ADF pipelines and SQL Views to integrate SAP ECC, SAP BW, and Salesforce into standardized CDM entities with dynamic attribute mapping and automated sink-table creation, cutting manual reconciliation by 70% and enabling consistent analytics across Power BI and Dynamics 365.

CallMiner Multilingual Data Ingestion & Processing

    Developed an end-to-end Databricks pipeline orchestrated via Control-M and AWS S3 to ingest multilingual (English, Mandarin) call interaction data from CallMiner Bulk APIs, with validation, Bronze/Silver/Gold layering, and retry mechanisms achieving near 100% job success rate and a scalable base for AI-driven insights.

Education

  • B.Tech in Computer Science and Engineering

    Kautilya Institute of Technology and Engineering (2020)

Certifications

  • Microsoft data fundamentals

  • Databricks developer foundation & essentials

  • Databricks solution architect essentials

AI-interview Questions & Answers

Yeah. Hi. My name is, and I have joined CLL Technologies since May 2021. I have joined with my primary skills, Python and SQL. After getting into SQL, I have learned more technologies and enhanced my primary skills. I have learned the technologies like Azure Databricks, Azure Data Factories, and other Azure services. Apart from that, I have also learned the visualization tools, which is Power BI. I have been working with real-time streaming data. Apart from that, I also process data in batch processing. After getting all that data, I am well-versed in the analytical process and the visualization process. As I talk about, I have completed five plus projects and ten plus POCs in different domains for clients during my tenure at Syllable Technologies. Apart from that, I am leading one of the projects with ten plus developers. We have successfully completed the projects for the particular client. With this project, which is based on data warehousing migrations and applying security and governance to the warehousing on cloud platforms. Apart from that, I'm currently learning real-time streaming data and how to handle those data in databases.

Okay, so the use of Python is that I'm using Python in the Databricks platform. And I'm using one of the famous Python frameworks, which is PySpark, which is the combination of the Python and Spark technologies. I'm doing the transformation based on that. And from those transformations, I'm collecting the data from the Snowflake data warehousing. Suppose the client has different source systems like Snowflake and SAP, apart from that, we're collecting those data into a consolidated storage system, which we call the delta, where we use delta lake and do the entire middle architecture workflow, which is called the bronze layer. And apart from that, we can do some kind of transformations and then some kind of cleaning process as we store into that server layer. After that, we're implementing all those aggregations, functions, and all kinds of stuff. There's the final layer which we conclude, which we transfer to the particular layer. And then the particular Power Query concludes it and takes the data from the folder and does the process. So, in the Python usage, we use Python to manage all those three layers of architecture. And apart from that, we're pushing that data into the tagging system, which could be our Snowflake data warehousing and maintaining the data out there.

Okay. So for the error log mechanism and the retrying mechanism, what we can do is build out user-defined functional usage, which we call UDFs. By defining user-defining functions, we could maintain error logging based on audit log tables. We are, for example, defining 10 functions out there. And now we have to collect if some of the functions could fail and some of the functions could run successfully. Suppose I build up the audit log tables and I have collected all the metadata information about the functions out there. Now, as my program runs, I will collect if the sum of the functions may be the 5 functions. And the sum of 3 functions run successfully, and the 4th function makes an error. So what I could do is audit that error function to my audit log table, and I will just retry that function again. If I have built up if-else conditions out there. Suppose if the function if the program has done successfully, then it will go to the next step. If the program has failed, then it will retry. And I will do the retry mechanism based on that 4 times or 5 times. Suppose if retries have failed in the fifth time, then it will note down the particular error into the tables and the particular process into the next program. And it will get a notification to the particular user using emails and other instances so that the program has failed and kindly look into the particular program. That's why we can implement this mechanism into the particular program for making error logging and the retry mechanism successful.

So the issue of dot data skew or some kind of skew in further steps can be handled by making the partitioning and indexing particularly tailored to the scenarios. For doing the partitioning, it can be done in further ways. The most common way to do the partitioning is based on date-based columns. Suppose the data is coming from the source side, and it's been coming on a daily basis. From the data that's been loading, like for the particular processes, suppose 15 GBs of data is coming out there. In a single word, it couldn't be possible to load it directly into the system. It can create a problem in our system. To resolve that problem, we can try partitioning based on the date system. Suppose that data is coming, like 15 GBs, on a particular late day basis, and the data is coming in at 1 GB per day. We can load that data into the system on a day-by-day basis. If the day basis has also collected a large amount of data, we can further divide it into a timely basis or into 4-hour and 6-hour time frames using the timestamp column or date column. And using that, we can handle that kind of data skew into the data warehousing system like Snowflake. And for particularly fast execution processes, we can use indexing processes to make the execution faster.

So for those things, what we can do is build out programs generating dynamic SQL queries. Suppose I have 10 queries that need to be processed, but the parameters are the same and repeating. I can store the parameter values in an audit table or metadata table, and then build dynamic queries based on those values. I will make the dynamic queries where the data can be transferred from the tables where I have stored the parameter values in the metadata tables. The parameter values get from there and I will run the dynamic queries, which will make my performance faster. For further processes, I can make parallel processing using a thread pool. I will send the dynamic queries to the thread pool, and I will use the ETL process. In the ETL process, suppose the five parameters are getting into there, and the third parameter values are coming from a list. I will send that value directly from the list, and it will do the parallel processing using the thread pool executions. Once the process of the thread pool is complete, I will start my thread pool executions, then run my program, and then close my thread pool executions.

So the partitioning features could be applied in the basis of support. I have talked before using the date columns. So there's a categorized column that we can do partitioning. Suppose data is coming, there is a date column, created date or the updated date. So I can do the other partitioning features based on the date column. And suppose there's some tables with no date columns, then I can create one header table with the date column. I will put down the joining function between the header table and the data table. Then I will put the joining conditions, and I will load the particular table based on the dead joining condition and the header table and the data table. If those tables are also not present, I can use partitioning based on a categorized column, which could be region wise, city wise, and any other kind of column. But there is one drawback. There's a unique column, like IDs column and some other kind of column that cannot be included in the partitioning column because it contains unique values. The sum of containing unique values cannot be performed to the partition columns.

This network of the hypothetical data pipeline processes job or other solutes and solute principles. The principles are to invalidate it and this is how the effect functions. Yeah. It can be violated. Like, log error functions couldn't be violated because, like, so they here, we are, like, firstly, we are reading the data, then we are processing the data, then we are writing the data, and then the log error will be generated. If an error is not generated, then the log error cannot be recorded. But for better performance, what we can do is put the if-else condition, like try and except conditions. So what we can do is just maintain these blocks, read data, process data, and write data, under the try condition. If the try condition fails, then the error log will be generated. Otherwise, the error log will not be generated. This is the effect.

In the scale code, there is a performance issue. Let me explain. What might be causing the slowdown and how it could be addressed? Okay. So the further slowdown, particularly, is caused by the group by and order by statements, and everything will also be calculated. Like, suppose there are 50,000 records, and this will do the group by name. First, it will do the group by, then afterwards, it will go to the order by of the particular average. The average is calculated twice here. Firstly, the average is calculated based on the selected statement. Secondly, the average is calculated in the order by statement. So for better performance, we can do the city performance here. City implementation here. What the CT will do is create the particular objects of select areas and the name wise, and it will do the group by. Then once the objects have been created, afterwards, we will do the order by statement and then into the city. So it will be a better performing city per city implementation. It will get better performance out.

In what way is the latest Python meta-program within the field to adapt dynamic schema images? Okay. For dynamic schema changes, what we can do is compare the source table and the target table. So, suppose I have built up the source table and the target table, and we will introduce an intermediary table, which we could call a temporary table. Suppose I have loaded data from the source systems and put the data into a staging table. Then, after that, we will compare our staging table with the target table. If some schema changes have been done, then those changes will be performed onto the staging table, and the implementation could be done. And if a column has been added particularly to the target table, or a column has been deleted from the source systems, such as name, city, and gender, and name, city, and gender. And in the next time, if the source system has deleted the city column, then what I will do is compare my staging table with the target table. And if some column exchanges and those changes could be implemented into the target system. And every time the staging table will be recreated from the source systems, it will be in override mode, so it will implement the same system as the source table. And some changes have been done, then it will be implemented into the target table.

So I have never worked on machine learning algorithms and all kinds of stuff. So I'm not expected to get the best answer out there. What are the machine learning algorithm patterns for ETL for data enrichment? But we have incorporated some machine learning, not the machine learning algorithm, but the automation algorithm. What are the patterns we have implemented? We have implemented some automation into our data warehousing migration project. We have implemented automations, limiting the schedule trigger, and other stuff, which do the better processing and a smoother process for the ETL pipelines. We have implemented automation techniques.

What settings would you employ into the Python dynamic risk retail workers in response to the increasing data volumes? Okay, so for that, we can configure particularly, suppose we are using the Databricks platform for the ETL processes. We can just configure the ETL cluster configurations. In normal scenarios or systems, we can also implement it. We can dynamically configure particularly clusters, with the number of driver nodes, the number of worker nodes, and the run times, processors, and memory out there. We can maintain our metadata table for the particular scenario. Suppose there are five scenarios: small, large, extra large, and ultra large. We can categorize those cluster configurations into those categories and implement them based on the data volumes. And the other things we would do to implement auto scaling. Suppose we have provided a range of two to ten worker nodes. It will auto scale by itself according to the data volumes, making it suitable for those kinds of systems.