
I am a strategic Big Data leader with 11+ years driving measurable business outcomes through innovative data solutions. I specialize in transforming operational challenges into competitive advantages across insurance, banking, retail, and telecommunications sectors.
My expertise lies in architecting enterprise-grade data ecosystems using Azure and GCP technologies that directly enhance decision-making velocity, reduce operational costs, and unlock revenue opportunities. I excel at translating complex business requirements into scalable data platforms that drive ROI and sustainable growth.
Core Capabilities:
Cloud-Native Solutions: Azure Databricks, Synapse Analytics, GCP BigQuery
Real-Time Analytics: Apache Spark, Kafka, streaming data architectures
Modern Data Stack: DBT, Snowflake, Delta Lake, NoSQL implementations
I don't just build data pipelines—I architect strategic capabilities that align technical innovation with business objectives, ensuring every solution delivers measurable value and positions organizations for data-driven excellence.
Senior Data Engineer, Vanderlande
Senior Data Engineer, Lifepal
Senior Bigdata Developer, Barclays
Spark Developer, Tech Mahindra
Senior Technical Associate, Teradata
Technical Services Specialist, IBM
Hadoop Developer, Sears IT and Management Services India
Python
Delta Lake
Apache Spark
Scala
Kafka
PySpark
Apache Hive
Sqoop
Apache Superset
Metabase
Snowflake
Redis
Neo4j
Apache NiFi
SQL
GCP
Control-M
Eclipse
IntelliJ
PyCharm
MySQL
Teradata
GitLab
SVN
Git
Jenkins
Jira
Bitbucket
Bamboo
Received the Best Performer of the Month award from the client.
Okay, so I have 10 years of experience, and I've worked on many big data stacks, including Hadoop, Hive, Pig, and Spark. I've been working on Spark for the past 7 years, and I've also used Kafka. I've worked on many cloud projects on Azure; the Azure services I've used include Azure Data Lake, Data Factory, Synapse, Databricks, and Delta tables. I've done many projects on real-time streaming as well as batch processing, and the programming languages I've worked with are Scala, Python, and Java. Currently, I'm working on a project where we get data from IoT Edge. We have some modules built in Java, and the data lands in blob storage; from there we push it into Databricks. We've written a Databricks notebook that consumes the data, and we use Auto Loader, so whenever a file is uploaded to the blob the data gets consumed. We process the data through Databricks and store it in Delta tables; specifically, we use Delta Live Tables, which are streaming tables, and all our data sits in them. We created a pipeline on top of that, and the running pipeline stores the data into the Delta Live Tables. On top of that we built a dashboard that consumes the data from the Delta Live Tables. That's one project. I've also done many real-time streaming projects built on Kafka, which is where my Kafka exposure comes from. On the Azure side I've also used Azure Functions: we created APIs in Spring Boot and deployed them as Azure Functions. So that's my experience in brief.
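To make that ingestion flow concrete, here is a minimal sketch of a Databricks Auto Loader notebook that picks up files landing in blob storage and writes them to a Delta table. The mount paths, source format, and table name are illustrative assumptions, and in a Delta Live Tables pipeline the same logic would sit inside a pipeline-managed table definition.

```python
# Minimal sketch of an Auto Loader ingestion notebook (Databricks).
# Paths, schema location, and table names are illustrative placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Incrementally pick up new files as they land in the blob container
raw_stream = (
    spark.readStream
    .format("cloudFiles")                      # Databricks Auto Loader
    .option("cloudFiles.format", "json")       # assumed source format
    .option("cloudFiles.schemaLocation", "/mnt/landing/_schemas/iot")
    .load("/mnt/landing/iot/")                 # assumed blob mount path
)

# Write the processed stream into a Delta table
(
    raw_stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/iot_ingest")
    .trigger(availableNow=True)
    .toTable("iot_bronze")                     # illustrative target table
)
```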
Okay, so we mainly use Spark. In one project, a telecom project, we receive incident data in real time. Whenever a cable cut happens, an incident is generated and pushed to a Kafka topic. We have separate Kafka topics based on the status of the incident: when an incident is created it's in a queued state, then it moves to in-progress, and after that it can be in an active, deferred, closed, or cancelled state. We consume the data from those topics with Spark Streaming, which returns a DStream, a collection of RDDs, and we process it. Each record carries the incident ID, the incident status, and the timestamp at which the incident was created or updated. Using the incident ID we make a REST call that returns a large JSON response containing incident data, customer data, and ticket data, and we parse that JSON. We store the intermediate data, and the final data goes into MongoDB, where we have different collections: an incident collection for incident-related data and a ticket collection holding the ticket ID, the ticket status, and the technician working on that ticket. We update MongoDB in real time so the customer always has a complete picture of the ticket and can see its status on the UI. The job runs continuously on a three-node Spark cluster, and we receive roughly 80 to 100 incidents per second.
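The flow above can be sketched roughly as follows. This version uses Structured Streaming rather than the DStream API, and the topic names, REST endpoint, MongoDB URI, and collection names are illustrative assumptions; the Kafka connector and the pymongo driver are assumed to be available on the cluster.

```python
# Sketch of the incident-streaming flow described above, written with
# Structured Streaming instead of DStreams. All endpoints are illustrative.
import requests
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.getOrCreate()

incident_schema = StructType([
    StructField("incident_id", StringType()),
    StructField("incident_status", StringType()),
    StructField("updated_at", TimestampType()),
])

# Consume the status-specific incident topics from Kafka
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # assumed brokers
    .option("subscribe", "incident-queued,incident-inprogress,incident-closed")
    .load()
    .select(from_json(col("value").cast("string"), incident_schema).alias("e"))
    .select("e.*")
)

def enrich_and_store(batch_df, batch_id):
    # For each incident, fetch the full JSON via REST and upsert into MongoDB
    from pymongo import MongoClient                      # assumed driver
    client = MongoClient("mongodb://mongo:27017")        # assumed URI
    incidents = client["telecom"]["incidents"]
    for row in batch_df.collect():                       # fine at ~100 events/sec
        detail = requests.get(
            f"https://itsm.example.com/incidents/{row.incident_id}"  # assumed API
        ).json()
        incidents.update_one({"incident_id": row.incident_id},
                             {"$set": detail}, upsert=True)

(
    events.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/incidents")
    .foreachBatch(enrich_and_store)
    .start()
    .awaitTermination()
)
```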
What matters for data quality is proper filtration of the data: don't store unnecessary data, and normalize the data while storing it so that only clean, well-structured data lands in the target. While storing, partition the data properly, create buckets where appropriate, and take care of duplicates; no duplicate data should be stored in the target. That is how you maintain data quality and data integrity. The data should be normalized: you don't need to keep everything in one large table, you separate out the tables according to your scenarios and store the data in a distributed way across multiple tables based on your use case, with proper partitioning. For example, if you're dealing with retail data, store it year-wise, then month-wise, then date-wise. That gives you a complete picture of the data, and when you query it for a particular period you don't need to scan everything; you put a filter on a particular partition and fetch only that partition's data, which improves query performance. Also filter the data at the start and remove junk characters; Spark has different read modes, such as failfast and permissive, that help you maintain data quality in your ETL process. So don't store unnecessary data in the target, and handle duplicates while reading the data.
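A small PySpark sketch of the partitioning and deduplication idea above, assuming illustrative column names (order_id, sale_date) and paths:

```python
# Hedged sketch: deduplicate, then partition retail data year/month/day.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, year, month, dayofmonth

spark = SparkSession.builder.getOrCreate()

sales = spark.read.parquet("/data/raw/retail_sales")      # assumed source path

cleaned = (
    sales.dropDuplicates(["order_id"])                     # no duplicate rows in the target
    .withColumn("year", year(col("sale_date")))
    .withColumn("month", month(col("sale_date")))
    .withColumn("day", dayofmonth(col("sale_date")))
)

# Partition year-wise, then month-wise, then date-wise so queries can prune
(
    cleaned.write
    .mode("overwrite")
    .partitionBy("year", "month", "day")
    .parquet("/data/curated/retail_sales")
)

# Querying one period scans only the matching partitions, not the whole table
march_2024 = spark.read.parquet("/data/curated/retail_sales").where(
    (col("year") == 2024) & (col("month") == 3)
)
```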
You can perform many data validation checks using Python. For example, if you want the data stored with a particular date format, you can validate the dates against that format; similarly, if you're loading the data into a target like Snowflake, you can validate it before loading. The data should not be duplicated, the data format should be correct, and the data types should be correct. You can also check whether the relationships in the data are maintained or not. So you perform validations based on your use case and your scenarios, and you can create a UDF in Spark and run the data validation through that.
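A hedged sketch of those checks in PySpark, with an illustrative date-format UDF, a type cast, and deduplication; the column names and paths are assumptions, and the validated output could just as well be loaded into Snowflake.

```python
# Minimal sketch of the validation checks mentioned above:
# date format, data types, and duplicates. Column names are illustrative.
from datetime import datetime
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/data/staging/orders")            # assumed source

@udf(BooleanType())
def is_valid_date(value):
    # Accept only the expected yyyy-MM-dd format
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except (TypeError, ValueError):
        return False

validated = (
    df.dropDuplicates(["order_id"])                         # no duplicate records
    .withColumn("amount", col("amount").cast("double"))     # enforce expected type
    .filter(is_valid_date(col("order_date")))               # keep well-formed dates
)

validated.write.mode("append").parquet("/data/validated/orders")
```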
I'm not logged on for this.
So instead of plain Python code, you can use PySpark, where you can create a workflow that reads data from multiple sources. Spark provides different built-in connectors that let you read from multiple sources, perform the transformations you want, and store the data into the target. You first need to decide which format you want to store the data in and what format is present at the source; then you do the reading, apply the transformations, and write the data to the target.
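A brief PySpark sketch of that kind of multi-source workflow, with illustrative connection details (a CSV file plus a JDBC table) and a Parquet target standing in for whatever the real target is; the matching JDBC driver is assumed to be on the classpath.

```python
# Hedged sketch: read from two sources, transform, write to a target.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Source 1: CSV files
customers = spark.read.option("header", True).csv("/data/in/customers.csv")

# Source 2: a JDBC table (assumed MySQL connection details)
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://db-host:3306/shop")
    .option("dbtable", "orders")
    .option("user", "etl").option("password", "***")
    .load()
)

# Transformation: join and keep only the columns the target expects
result = (
    orders.join(customers, "customer_id")
    .select("order_id", "customer_id", "order_total", "country")
    .filter(col("order_total") > 0)
)

# Target: Parquet here, but this could equally be Delta, Snowflake, etc.
result.write.mode("overwrite").parquet("/data/out/orders_enriched")
```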
So in this code, the load is failing because the transformed data is not correct: while loading the data into the target, it is unable to load due to a data type issue, where the data types don't match the data types of the target. Suppose you are inserting into a table and a column's data type is different; you need to perform data validation while loading the data and check that the columns are correct for your table.
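One way to guard against that kind of type mismatch is to align the transformed data with the target table's schema before loading. The sketch below assumes an existing catalog table named analytics.orders purely for illustration.

```python
# Sketch of a pre-load check: cast the transformed data to the target
# table's schema and verify the columns line up before loading.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

transformed = spark.read.parquet("/data/staging/transformed")   # assumed path
target_schema = spark.table("analytics.orders").schema          # target table's schema

# Fail fast if the transformed data is missing columns the target expects
missing = set(target_schema.fieldNames()) - set(transformed.columns)
if missing:
    raise ValueError(f"Missing columns for target table: {missing}")

# Cast each column to the target's data type so the load does not fail
aligned = transformed.select(
    [col(f.name).cast(f.dataType) for f in target_schema.fields]
)

aligned.write.mode("append").insertInto("analytics.orders")
```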
You need to pass the DB connection. I'm not sure.
Okay, so for the process of tuning a data processing pipeline in BigQuery, you can use BigQuery's built-in features.
I'm not certain.