Ankesh Jain

Vetted Talent
I am a strategic Big Data leader with 11+ years driving measurable business outcomes through innovative data solutions. I specialize in transforming operational challenges into competitive advantages across insurance, banking, retail, and telecommunications sectors.

My expertise lies in architecting enterprise-grade data ecosystems using Azure and GCP technologies that directly enhance decision-making velocity, reduce operational costs, and unlock revenue opportunities. I excel at translating complex business requirements into scalable data platforms that drive ROI and sustainable growth.

Core Capabilities:

Cloud-Native Solutions: Azure Databricks, Synapse Analytics, GCP BigQuery

Real-Time Analytics: Apache Spark, Kafka, streaming data architectures

Modern Data Stack: DBT, Snowflake, Delta Lake, NoSQL implementations

I don't just build data pipelines—I architect strategic capabilities that align technical innovation with business objectives, ensuring every solution delivers measurable value and positions organizations for data-driven excellence.

  • Role

    Sr Data & Hive Engineer

  • Years of Experience

    11 years

Skillsets

  • dbt
  • Superset
  • MongoDB
  • GitLab
  • ETL
  • Data Warehousing
  • Big Data
  • Azure Databricks
  • Azure Data Lake
  • Jenkins
  • Pig
  • Sqoop
  • Python - 3 Years
  • Neo4j
  • Hive
  • Snowflake
  • Kafka
  • Scala - 3 Years
  • Redis
  • Azure
  • Shell Scripting
  • GCP
  • Apache Spark
  • SQL - 8 Years

Vetted For

13 Skills

  • Data Engineer II (Remote) AI Screening
  • Result: 44%
  • Skills assessed: Airflow, Data Governance, Machine Learning and Data Science, BigQuery, ETL Processes, Hive, Relational DB, Snowflake, Hadoop, Java, PostgreSQL, Python, SQL
  • Score: 40/90

Professional Summary

11 Years
  • Mar, 2023 - Present (2 yr 8 months)

    Senior Data Engineer

    Vanderlande India Pvt Ltd
  • May, 2022 - Feb, 2023 (9 months)

    Senior Data Engineer

    Lifepal
  • Aug, 2021 - May, 2022 (9 months)

    Senior Developer

    Barclays
  • May, 2019 - Jul, 2021 (2 yr 2 months)

    Developer

    IBM India Pvt Ltd
  • Apr, 2018 - May, 2019 (1 yr 1 month)

    Data Engineer

    Teradata India Pvt Ltd
  • Jan, 2017 - Apr, 2018 (1 yr 3 months)

    Spark Developer

    Tech Mahindra
  • Apr, 2014 - Jan, 2017 (2 yr 9 months)

    Hadoop Developer

    Sears IT and Management Services India Pvt Ltd

Applications & Tools Known

  • Python
  • Delta Lake
  • Apache Spark
  • Scala
  • Kafka
  • PySpark
  • Apache Hive
  • Sqoop
  • Apache Superset
  • Metabase
  • Snowflake
  • Redis
  • Neo4j
  • Apache NiFi
  • SQL
  • GCP
  • Control-M
  • Eclipse
  • IntelliJ
  • PyCharm
  • MySQL
  • Teradata
  • GitLab
  • SVN
  • Git
  • Jenkins
  • Jira
  • Bitbucket
  • Bamboo

Work History

11 Years

Senior Data Engineer

Vanderlande India Pvt Ltd
Mar, 2023 - Present (2 yr 8 months)
    Designed and implemented a configuration-driven approach for managing Blob to Delta ingestion in Databricks. Utilized Databricks Autoloader to ingest data into Delta Lake. Developed a system to read log line data (telegrams) from Kafka in real-time.
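
A minimal sketch of what such a configuration-driven Auto Loader ingestion can look like in PySpark; the storage paths, table names, and the shape of the config are illustrative assumptions, not the actual Vanderlande setup.

```python
# Hypothetical sketch of configuration-driven Blob-to-Delta ingestion with
# Databricks Auto Loader. Paths, table names, and the config shape are
# illustrative assumptions; on Databricks, `spark` is provided by the runtime.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each entry drives one ingestion stream: source folder, format, and target table.
INGESTION_CONFIG = [
    {"source": "abfss://landing@storageacct.dfs.core.windows.net/telegrams/",
     "format": "json",
     "target": "bronze.telegrams",
     "checkpoint": "/mnt/checkpoints/telegrams"},
    {"source": "abfss://landing@storageacct.dfs.core.windows.net/orders/",
     "format": "csv",
     "target": "bronze.orders",
     "checkpoint": "/mnt/checkpoints/orders"},
]

def start_ingestion(cfg: dict):
    """Start one Auto Loader stream from Blob storage into a Delta table."""
    stream = (
        spark.readStream.format("cloudFiles")                  # Auto Loader
        .option("cloudFiles.format", cfg["format"])
        .option("cloudFiles.schemaLocation", cfg["checkpoint"] + "/schema")
        .load(cfg["source"])
    )
    return (
        stream.writeStream.format("delta")
        .option("checkpointLocation", cfg["checkpoint"])
        .trigger(availableNow=True)                            # incremental batch
        .toTable(cfg["target"])
    )

# One streaming query per configured source; a new source only needs a config entry.
queries = [start_ingestion(cfg) for cfg in INGESTION_CONFIG]
```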

Senior Data Engineer

Lifepal
May, 2022 - Feb, 2023 (9 months)
    Designed and implemented a real-time streaming pipeline to capture data from a source PostgreSQL database. Developed a system to process voice recordings for speech-to-text in real-time.

Senior Developer

Barclays
Aug, 2021 - May, 2022 (9 months)
    Processed data from the Falcon system to identify transactions marked as fraud. Generated curated datasets and delivered actionable insights.

Developer

IBM India Pvt Ltd
May, 2019 - Jul, 2021 (2 yr 2 months)
    Designed and implemented an ETL pipeline to extract and transform data stored in Hive tables. Ensured workflow efficiency and optimized data processing.

Data Engineer

Teradata India Pvt Ltd
Apr, 2018 - May, 2019 (1 yr 1 month)
    Developed data pipelines using Apache Spark and Scala. Orchestrated pipelines using Kylo for automated pipeline execution and reporting.

Spark Developer

Tech Mahindra
Jan, 2017 - Apr, 2018 (1 yr 3 months)
    Designed and developed a centralized data store to track network outages. Created interconnected data representations using graph databases.

Hadoop Developer

Sears IT and Management Services India Pvt Ltd
Apr, 2014 - Jan, 2017 (2 yr 9 months)
    Developed the Open Item Batch system for centralized product information management and ensured accurate merchandising across downstream systems.

Achievements

  • Received a "Pat on the Back" award for developing the SSOT application and deploying it successfully to production
  • Received an Applause award and a team award for development of the Open Item system at Sears Holdings
  • Microsoft Certified - Microsoft Azure Fundamentals AZ-900
  • Teradata Database Certified Associate

Testimonial

Sears Holdings

Received Best Performer of the Month from the client.

Major Projects

3 Projects

Configure to Order

    Designed a Blob-to-Delta ingestion system using configuration-driven approaches.

Telegram Parser

    Developed a system for real-time log data parsing and classification, implementing DLQ handling.
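
A rough sketch of this pattern with Spark Structured Streaming: parse telegram lines from a Kafka topic and route records that fail parsing to a dead-letter table. The topic name, telegram schema, and paths are assumptions for illustration.

```python
# Illustrative sketch of parsing log lines ("telegrams") from Kafka and routing
# records that fail parsing to a dead-letter queue (DLQ). Topic names, the
# telegram schema, and broker addresses are assumptions for the example.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("telegram-parser").getOrCreate()

telegram_schema = StructType([
    StructField("device_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_time", TimestampType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "telegrams-raw")
    .load()
    .selectExpr("CAST(value AS STRING) AS raw_value")
)

# from_json returns NULL for unparseable strings, which gives a clean split
# between good records and DLQ records.
parsed = raw.withColumn("data", F.from_json("raw_value", telegram_schema))

# Records whose JSON parsed cleanly go to the curated sink ...
good = parsed.filter(F.col("data").isNotNull()).select("data.*")
good_query = (
    good.writeStream.format("delta")
    .option("checkpointLocation", "/chk/telegrams_good")
    .toTable("silver.telegrams")
)

# ... and malformed records are kept verbatim in a DLQ table for inspection/replay.
dlq = parsed.filter(F.col("data").isNull()).select("raw_value")
dlq_query = (
    dlq.writeStream.format("delta")
    .option("checkpointLocation", "/chk/telegrams_dlq")
    .toTable("silver.telegrams_dlq")
)
```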

Migration to Snowflake

    Designed real-time streaming pipelines to capture and transform PostgreSQL data to Snowflake.
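
A hedged sketch of one way such a pipeline can be wired up, assuming the PostgreSQL changes arrive on a Kafka topic (e.g. from Debezium) and are written to Snowflake with the Spark Snowflake connector via foreachBatch; all connection settings, topic names, and the payload schema are placeholders.

```python
# Rough sketch of a streaming pipeline that picks up PostgreSQL change events
# from Kafka (e.g. a Debezium topic) and lands them in Snowflake through
# foreachBatch and the Spark Snowflake connector. All connection settings,
# topic names, and the payload schema are placeholder assumptions.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("pg-to-snowflake").getOrCreate()

change_schema = StructType([
    StructField("id", LongType()),
    StructField("status", StringType()),
    StructField("updated_at", StringType()),
])

sf_options = {                      # placeholder Snowflake connection options
    "sfURL": "myaccount.snowflakecomputing.com",
    "sfUser": "etl_user",
    "sfPassword": "***",
    "sfDatabase": "ANALYTICS",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "ETL_WH",
}

changes = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "pg.public.orders")
    .load()
    .select(F.from_json(F.col("value").cast("string"), change_schema).alias("row"))
    .select("row.*")
)

def write_to_snowflake(batch_df, batch_id):
    # The Snowflake connector is batch-oriented, so each micro-batch is appended.
    (batch_df.write.format("net.snowflake.spark.snowflake")
        .options(**sf_options)
        .option("dbtable", "ORDERS_STAGING")
        .mode("append")
        .save())

query = (
    changes.writeStream.foreachBatch(write_to_snowflake)
    .option("checkpointLocation", "/chk/pg_to_snowflake")
    .start()
)
```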

Education

  • B. Tech

    Rajiv Gandhi Technical University (2012)

Certifications

  • AWS

    Microsoft (Mar, 2020)
  • Microsoft Certified: Azure Fundamentals (AZ-900)

  • Teradata Database Certified Associate

Interests

  • Long Rides
  • Driving

AI-Interview Questions & Answers

    Okay. So I have a total of 10 years of experience, and I have worked on much of the big data stack, like Hadoop, Hive, Pig, and Spark. I have been working on Spark for the last 7 years, and I have used Kafka. I have worked on many cloud projects on Azure, using services like Azure Databricks, Data Factory, Synapse, Azure Data Lake, and Delta tables, and I have done many projects on real-time streaming as well as batch processing. The programming languages I have worked with are Scala, Python, and Java. Currently I am working on a project where we get data from IoT Edge devices. We have some modules built in Java, and the data lands in Blob storage. From there we consume it in Databricks: we have written a notebook that uses Auto Loader, so whenever a file is uploaded to the blob, the data gets consumed. We process it in Databricks and store it in Delta Live Tables, which are essentially streaming tables; we created a pipeline on top of them, and the pipeline runs and stores the data into the Delta Live Tables. On top of that we built a dashboard that consumes data from the Delta Live Tables. That is one project. I have also done many real-time streaming projects based on Kafka itself, which is my exposure to Kafka. On the cloud side, in Azure we have also used Azure Functions: we created APIs in Spring Boot and deployed them as Azure Functions. So that is my experience.
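
As a rough illustration of the Auto Loader plus Delta Live Tables flow described above, a minimal DLT notebook might look like the sketch below; the landing path, table names, and columns are assumptions, and the code runs only inside a Databricks DLT pipeline.

```python
# A minimal Delta Live Tables sketch along the lines described above: Auto Loader
# reads files that IoT modules drop into Blob storage, and a streaming live table
# feeds the downstream dashboard. Paths, table names, and columns are assumptions.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw telegrams landed from Blob storage via Auto Loader")
def telegrams_raw():
    return (
        spark.readStream.format("cloudFiles")       # `spark` is provided by DLT
        .option("cloudFiles.format", "json")
        .load("abfss://landing@storageacct.dfs.core.windows.net/iot/")
    )

@dlt.table(comment="Cleaned telegrams consumed by the dashboard")
def telegrams_clean():
    return (
        dlt.read_stream("telegrams_raw")
        .withColumn("ingested_at", F.current_timestamp())
        .dropDuplicates(["device_id", "event_time"])  # assumed business key
    )
```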

    Okay, so we mainly used Spark. I did one project where we were getting data through real-time streaming. It was a telecom project where we received incident-related data: whenever a cable cut happens, an incident gets generated and pushed to a Kafka topic. We had multiple Kafka topics based on the status of the incident: when an incident is created it is in the queued state, then it moves to in progress, and after that to active, deferred, closed, or cancelled. We consumed the data from those topics with Spark Streaming, which returns a DStream, and a DStream is a collection of RDDs. Each record carried an incident ID, the incident status, and the timestamp at which the incident was created or updated. On the basis of the incident ID we made a REST call, which returned a huge JSON response containing the incident data, the customer data, and the ticket data. We processed that JSON, stored the intermediate results, and stored the final data in MongoDB across different collections: an incident collection for incident-related data and a ticket collection for ticket-related data such as ticket ID, ticket status, and the technician working on the ticket. Updating that data in MongoDB in real time meant the customer always had a complete picture and could see the status of their ticket on the UI. The job ran continuously on a three-node Spark cluster, handling roughly 80 to 100 incidents per second.
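
A hedged sketch of that incident flow, written with Structured Streaming rather than the older DStream API: consume incident events from Kafka, enrich each incident with a REST call, and upsert the result into MongoDB so the UI stays current. The topic pattern, REST endpoint, and collection names are assumptions.

```python
# Sketch of the incident pipeline described above, using Structured Streaming:
# consume incident events from Kafka, enrich each incident via a REST call, and
# upsert into MongoDB collections. Topic pattern, endpoint, and names are assumed.
import requests
from pymongo import MongoClient
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("incident-stream").getOrCreate()

incident_schema = StructType([
    StructField("incident_id", StringType()),
    StructField("status", StringType()),
    StructField("updated_at", TimestampType()),
])

incidents = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribePattern", "incidents-.*")      # one topic per incident status
    .load()
    .select(F.from_json(F.col("value").cast("string"), incident_schema).alias("i"))
    .select("i.*")
)

def upsert_batch(batch_df, batch_id):
    """Enrich each incident over REST and keep MongoDB current for the UI."""
    mongo = MongoClient("mongodb://mongo-host:27017")
    db = mongo["network_ops"]
    for row in batch_df.collect():                   # micro-batches are small here
        detail = requests.get(
            f"https://itsm.example.com/incidents/{row.incident_id}", timeout=10
        ).json()
        db.incidents.update_one(
            {"incident_id": row.incident_id},
            {"$set": {"status": row.status,
                      "updated_at": row.updated_at,
                      "detail": detail}},
            upsert=True,
        )
    mongo.close()

query = (
    incidents.writeStream.foreachBatch(upsert_batch)
    .option("checkpointLocation", "/chk/incidents")
    .start()
)
```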

    What matters for data quality is proper filtration of the data: you should not store unnecessary data, and you can normalize the data while storing it. While storing, you should partition the data properly and create buckets, and you have to take care of duplicates so that no duplicate data is stored in the target. That is how you maintain data quality and integrity. The data should be normalized: you don't store everything in one large table, you separate it into multiple tables based on your use case and store it in a distributed way, with proper partitions. Suppose it is retail data: store it partitioned by year, then month, then date, so you have a complete picture of the data. Then, when you query the data for a particular period, you don't need to scan everything; you hit only the relevant partition, so query performance improves. You should also filter the data at the start and drop junk characters. Spark has different read modes for this, like a fail-fast mode (I am not able to recall all of them), that help you maintain data quality in your ETL process. So don't store unnecessary data in the target, handle duplicates while reading the data, and that's it.
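
A small example of the partitioning and deduplication points above: drop duplicates on the business key, then write retail data partitioned by year, month, and day so a query on one day prunes to a single partition. Paths and column names are assumptions.

```python
# Small illustration of the partitioning and deduplication points above:
# deduplicate on the business key, then write retail sales partitioned by
# year/month/day so queries on one day scan only that partition.
# Paths and column names are assumptions for the example.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-example").getOrCreate()

sales = spark.read.parquet("/data/raw/sales")        # placeholder source path

curated = (
    sales.dropDuplicates(["order_id"])               # no duplicate rows in target
    .withColumn("year", F.year("order_ts"))
    .withColumn("month", F.month("order_ts"))
    .withColumn("day", F.dayofmonth("order_ts"))
)

(curated.write.mode("overwrite")
    .partitionBy("year", "month", "day")             # enables partition pruning
    .parquet("/data/curated/sales"))

# A query filtered on the partition columns reads only the matching folders:
one_day = spark.read.parquet("/data/curated/sales").filter(
    (F.col("year") == 2024) & (F.col("month") == 6) & (F.col("day") == 15)
)
```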

    You can perform many data validation checks using Python. Suppose you want to store data in a particular date format: you can validate that the data matches that format. If you are loading the data into Snowflake as the target, you can also check that the data is not duplicated, that the data format is correct, and that the data types are correct. You can also check that the relationships in the data are maintained. The validation really depends on your use case and scenarios. You can also create a UDF in Spark and perform the data validation on top of that.
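
A sketch of the kinds of checks described, applied before loading to a target such as Snowflake: date-format validation, missing-key checks, and a duplicate check. The column names and rules are illustrative assumptions.

```python
# Sketch of pre-load validation checks: date-format validation, missing-key
# checks, and a duplicate check. Column names and rules are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("validation-example").getOrCreate()

df = spark.read.parquet("/data/staging/customers")   # placeholder input

# to_date returns NULL when the string does not match the expected format,
# which makes format violations easy to flag.
checked = df.withColumn("valid_dob", F.to_date("date_of_birth", "yyyy-MM-dd"))

bad_dates = checked.filter(
    F.col("valid_dob").isNull() & F.col("date_of_birth").isNotNull()
)
missing_keys = checked.filter(F.col("customer_id").isNull())
duplicate_keys = checked.groupBy("customer_id").count().filter(F.col("count") > 1)

if bad_dates.count() or missing_keys.count() or duplicate_keys.count():
    raise ValueError("Validation failed: bad dates, missing keys, or duplicates found")

# Only rows that pass all checks continue to the load step.
clean = checked.filter(F.col("valid_dob").isNotNull()).drop("valid_dob")
```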

    I'm not logged on for this.

    Instead of writing plain custom code for this, you can use PySpark, where you can create a workflow and read the data from multiple sources. Spark provides different connectors that let you read from multiple sources, perform whatever transformations you want, and store the data into the target. First you need to decide which format you want to store the data in and what format is present in the source; once you have that picture, you do the reading part, then the transformations, and then store the data into the target.
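
A short sketch of that multi-source pattern: read one source over JDBC and another from files using Spark's built-in connectors, join and transform, and write to a single target. The connection details, paths, and column names are placeholders.

```python
# Sketch of the multi-source pattern described above: read from a JDBC source
# and from files with Spark's built-in connectors, transform, and write to one
# target. URLs, credentials, and table/column names are placeholder assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("multi-source-etl").getOrCreate()

orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://pg-host:5432/shop")
    .option("dbtable", "public.orders")
    .option("user", "etl_user")
    .option("password", "***")
    .load()
)

customers = spark.read.option("header", True).csv("/data/landing/customers.csv")

# Decide the output format up front (Parquet here), then transform and load.
enriched = (
    orders.join(customers, "customer_id", "left")
    .withColumn("order_date", F.to_date("order_ts"))
)

enriched.write.mode("append").partitionBy("order_date").parquet("/data/curated/orders")
```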

    So in this code, the problem is in the loading part, because the transformed data is not correct. While you are loading the data into the target, it is unable to load due to a data type issue: the data types do not match the data types of the target. Suppose you are putting the data into a table and the data type is different; you need to perform that data validation while loading, and check that the columns are correct according to the target table.
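
A small illustration of that kind of pre-load check: align the transformed DataFrame with the target table's schema (column presence and types) before writing, so type mismatches surface before the load. Table and column names are assumptions.

```python
# Illustration of the fix described above: before loading, cast the transformed
# DataFrame to the target table's schema and confirm the column sets match, so
# type mismatches surface before the write. Names here are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("load-validation").getOrCreate()

transformed = spark.read.parquet("/data/transformed/payments")   # placeholder
target_schema = spark.table("analytics.payments").schema         # existing target table

missing = set(f.name for f in target_schema) - set(transformed.columns)
if missing:
    raise ValueError(f"Transformed data is missing target columns: {missing}")

# Cast each column to the type the target expects; a failed cast shows up as
# NULLs that can be checked before the actual load.
aligned = transformed.select(
    [F.col(f.name).cast(f.dataType).alias(f.name) for f in target_schema]
)

aligned.write.mode("append").insertInto("analytics.payments")
```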

    You need to pass the DB connection. I'm not sure.

    Okay. So, for the process of tuning a data processing pipeline in BigQuery, you can... BigQuery.

    Not sure.