Vetted Talent

Ankesh Jain


I am a strategic Big Data leader with 11+ years driving measurable business outcomes through innovative data solutions. I specialize in transforming operational challenges into competitive advantages across insurance, banking, retail, and telecommunications sectors.

My expertise lies in architecting enterprise-grade data ecosystems using Azure and GCP technologies that directly enhance decision-making velocity, reduce operational costs, and unlock revenue opportunities. I excel at translating complex business requirements into scalable data platforms that drive ROI and sustainable growth.

Core Capabilities:

Cloud-Native Solutions: Azure Databricks, Synapse Analytics, GCP BigQuery

Real-Time Analytics: Apache Spark, Kafka, streaming data architectures

Modern Data Stack: DBT, Snowflake, Delta Lake, NoSQL implementations

I don't just build data pipelines; I architect strategic capabilities that align technical innovation with business objectives, ensuring every solution delivers measurable value and positions organizations for data-driven excellence.

  • Role

    Senior Data Engineer

  • Years of Experience

    11.83 years

Skillsets

  • Kylo
  • Big Data
  • BigQuery
  • Bitbucket
  • Cloud Function
  • Control-M
  • Debezium
  • Delta Lake
  • Eclipse
  • Git
  • Hadoop
  • IntelliJ
  • Java
  • Jira
  • Bamboo
  • Metabase
  • MongoDB
  • MySQL
  • NiFi
  • Oozie
  • PostgreSQL
  • PyCharm
  • Scala IDE
  • ServiceNow
  • Spark
  • SVN
  • Teradata
  • Jenkins
  • SQL - 08 Years
  • GCP
  • Shell Scripting
  • Azure
  • Redis
  • Scala - 03 Years
  • Kafka
  • Snowflake
  • Hive
  • Neo4j
  • dbt
  • Sqoop
  • Pig
  • Python - 03 Years
  • Azure Data Lake
  • Azure DataBricks
  • GitLab
  • Superset
  • Azure Data Factory
  • Azure Event Hub
  • Azure Functions
  • Azure Key Vault
  • Azure Spring Apps
  • Azure SQL
  • Azure Storage
  • Azure Synapse Analytics

Vetted For

13 Skills
  • Data Engineer II (Remote) - AI Screening
  • 44%
  • Skills assessed: Airflow, Data Governance, machine learning and data science, BigQuery, ETL processes, Hive, Relational DB, Snowflake, Hadoop, Java, PostgreSQL, Python, SQL
  • Score: 40/90

Professional Summary

11.83 Years
  • Mar, 2023 - Present · 3 yr 2 months

    Senior Data Engineer

    Vanderlande
  • May, 2022 - Feb, 2023 · 9 months

    Senior Data Engineer

    Lifepal
  • Aug, 2021 - May, 2022 · 9 months

    Senior Bigdata Developer

    Barclays
  • May, 2019 - Jul, 2021 · 2 yr 2 months

    Technical Services Specialist

    IBM
  • Apr, 2018 - May, 2019 · 1 yr 1 month

    Senior Technical Associate

    Teradata
  • Jan, 2017 - Apr, 2018 · 1 yr 3 months

    Spark Developer

    Tech Mahindra
  • Apr, 2014 - Jan, 2017 · 2 yr 9 months

    Hadoop Developer

    Sears IT and Management Services India

Applications & Tools Known

  • Python
  • Delta Lake
  • Apache Spark
  • Scala
  • Kafka
  • PySpark
  • Apache Hive
  • Sqoop
  • Apache Superset
  • Metabase
  • Snowflake
  • Redis
  • Neo4j
  • Apache NiFi
  • SQL
  • GCP
  • Control-M
  • Eclipse
  • IntelliJ
  • PyCharm
  • MySQL
  • Teradata
  • GitLab
  • SVN
  • Git
  • Jenkins
  • Jira
  • Bitbucket
  • Bamboo

Work History

11.83 Years

Senior Data Engineer

Vanderlande
Mar, 2023 - Present · 3 yr 2 months
  • Designed and implemented a configuration-driven approach for managing Blob-to-Delta ingestion in Databricks.
  • Used Databricks Auto Loader to ingest data into Delta Lake, eliminating the need for redeployment when parameters change.
  • Stored all configuration parameters in a Delta Lake-backed configuration table to enable dynamic retrieval at runtime.
  • Implemented a mechanism to read the configuration table on job startup, dynamically initializing Auto Loader streams from the parameters, so configuration changes require no code redeployment.
  • Developed a system to read log-line data (telegrams) from Kafka in real time, parsing them with regular expressions and sending the parsed data to another Kafka topic for downstream processing.
  • Implemented a Dead Letter Queue (DLQ) mechanism to route unparsed data to a separate Kafka topic for further investigation.
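The telegram parsing with DLQ routing described above might look like the following minimal sketch. Plain Python stands in for the Spark/Kafka job, and the telegram format, regex, and field names are invented for illustration; the real format is not shown in this profile.

```python
import re

# Illustrative telegram format: "2024-01-01T10:00:00|MOVE|carrier=123|dest=B4"
# (an assumption -- the actual log-line layout is not given here)
TELEGRAM_RE = re.compile(
    r"^(?P<ts>[^|]+)\|(?P<event>[A-Z_]+)\|carrier=(?P<carrier>\d+)\|dest=(?P<dest>\w+)$"
)

def parse_telegram(line: str):
    """Return (parsed_dict, None) on success, or (None, line) to route to the DLQ."""
    m = TELEGRAM_RE.match(line)
    if m is None:
        return None, line          # unparsed -> Dead Letter Queue topic
    return m.groupdict(), None     # parsed -> downstream Kafka topic

def route(lines):
    """Split a batch of raw lines into parsed records and DLQ candidates."""
    parsed, dlq = [], []
    for line in lines:
        rec, bad = parse_telegram(line)
        if rec is not None:
            parsed.append(rec)
        else:
            dlq.append(bad)
    return parsed, dlq
```

In the real pipeline, `parsed` and `dlq` would each be published to their own Kafka topic rather than collected in lists.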

Senior Data Engineer

Lifepal
May, 2022 - Feb, 2023 · 9 months
  • Designed and implemented a real-time streaming pipeline to capture data from a source PostgreSQL database.
  • Used Debezium to stream PostgreSQL change events into Kafka for processing.
  • Ingested and stored the streamed data in Snowflake as the persistence layer.
  • Leveraged DBT to transform and enrich the data within Snowflake.
  • Updated Snowflake with the processed data and integrated it into real-time dashboards using Superset.
  • Developed a system to tag voice recordings uploaded to a cloud storage bucket as either voicemail or conversation.
  • Used Cloud Functions with the Speech-to-Text library to automatically process new recordings, converting Indonesian-language audio into text for analysis.
  • Inserted the transcribed records into a BigQuery table for storage and querying.
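The Debezium-to-Snowflake flow above consumes change events from Kafka. A hedged sketch of how such events could be applied as upserts and deletes (a plain dict stands in for the Snowflake staging table; the `op`/`before`/`after` fields follow Debezium's documented event shape, while the row columns are invented):

```python
import json

def apply_change(event_json: str, table: dict) -> dict:
    """Apply one Debezium-style change event to an in-memory 'table'."""
    event = json.loads(event_json)
    op = event["op"]                      # c=create, u=update, d=delete, r=snapshot read
    before, after = event.get("before"), event.get("after")
    if op in ("c", "u", "r"):             # create/update/snapshot -> upsert by primary key
        table[after["id"]] = after
    elif op == "d":                       # delete -> remove the row
        table.pop(before["id"], None)
    return table
```

A real consumer would read these envelopes from the Kafka topic Debezium writes to and issue MERGE statements against Snowflake instead of mutating a dict.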

Senior Bigdata Developer

Barclays
Aug, 2021 - May, 2022 · 9 months
  • Processed data from the Falcon system to identify transactions marked as fraud.
  • Applied business rules and logic to generate a curated dataset highlighting fraudulent transactions.
  • Reconciled the generated dataset against the previous month's data to ensure consistency and accuracy.
  • Delivered actionable insights from the data, supporting fraud detection and reconciliation processes.
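The month-over-month reconciliation step could be sketched as a keyed comparison of the two datasets. This is a minimal illustration only: the `txn_id` key and row fields are assumptions, as the Falcon schema is not shown here.

```python
def reconcile(current: list, previous: list, key: str = "txn_id") -> dict:
    """Compare this month's curated fraud rows with last month's by key."""
    cur = {r[key]: r for r in current}
    prev = {r[key]: r for r in previous}
    return {
        "new": sorted(cur.keys() - prev.keys()),       # flagged this month only
        "dropped": sorted(prev.keys() - cur.keys()),   # no longer flagged
        "changed": sorted(k for k in cur.keys() & prev.keys() if cur[k] != prev[k]),
    }
```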

Technical Services Specialist

IBM
May, 2019 - Jul, 2021 · 2 yr 2 months
  • Designed and implemented an ETL pipeline to extract data from multiple tables in Zone1.
  • Applied business-logic transformations to the data based on predefined rules.
  • Stored the processed data in Zone3 (Hive) in Parquet format for optimized performance.
  • Ensured efficient data processing and transformation workflows within the pipeline.

Senior Technical Associate

Teradata
Apr, 2018 - May, 2019 · 1 yr 1 month
  • Developed data pipelines to ingest and transform data from multiple heterogeneous sources.
  • Applied business-specific transformation logic using Apache Spark and Scala.
  • Designed efficient workflows and orchestrated them using Kylo, enabling automated and repeatable pipeline execution.
  • Ensured processed data was stored in Hive in optimized formats for downstream consumption and reporting.
  • Worked with Azure services for scalable and secure data processing.

Spark Developer

Tech Mahindra
Jan, 2017 - Apr, 2018 · 1 yr 3 months
  • Designed and developed a centralized data store to track network outage incidents and their impact on customers across multiple interconnected data sources.
  • Captured incident data, including affected customers, whenever a cable cut or network failure occurred, and continuously tracked incident status updates in real time.
  • Integrated incident and customer data from various systems into a unified source of truth, enabling faster root-cause analysis and resolution tracking.
  • Provided graph-based visualization of interconnected data sources to understand downstream impact.
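The "downstream impact" idea above amounts to a graph traversal: given which network elements feed which, a breadth-first walk from the failed element yields everything affected. A minimal sketch with an invented adjacency map (the real topology store is not described here):

```python
from collections import deque

def impacted(graph: dict, failed: str) -> set:
    """BFS from the failed element over a node -> downstream-nodes map."""
    seen, queue = {failed}, deque([failed])
    while queue:
        for nxt in graph.get(queue.popleft(), ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {failed}    # everything downstream of the failure
```

A graph database such as Neo4j (listed in the skills above) would answer the same question with a path query rather than hand-rolled BFS.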

Hadoop Developer

Sears IT and Management Services India
Apr, 2014 - Jan, 2017 · 2 yr 9 months
  • Developed and maintained the Open Item Batch system, designed for centralized item and product information management.
  • Managed product metadata including Division, Category, Store Location, and Selling Price (Retail) for accurate merchandising.
  • Ensured consistent product information across systems to support accurate stock identification and customer purchases.
  • Extracted item and pricing data from Teradata and the Enterprise Data Hub (EDH) to ensure consistency across multiple downstream systems.
  • Enabled seamless data consumption by critical systems such as Local Publishing and Dynamic Pricing.

Achievements

  • Received a "Pat on the Back" award for developing the SSOT application and deploying it successfully to production
  • Received an Applause award and a team award for development on Open Item at Sears Holdings
  • Microsoft Certified - Microsoft Azure Fundamentals AZ-900
  • Teradata Database Certified Associate

Testimonial

Sears Holdings

Received "Best Performer of the Month" recognition from the client.

Major Projects

3 Projects

Configure to Order

    Designed a Blob-to-Delta ingestion system using configuration-driven approaches.
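The configuration-driven ingestion named here could be sketched as follows: stream definitions live in a config table (a list of dicts stands in for the Delta-backed table), and reader specs are built from it at job startup, so a config change needs no redeployment. Column names and paths are illustrative assumptions; only `cloudFiles.format` is a real Auto Loader option.

```python
# Rows that would normally come from a Delta Lake-backed configuration table.
CONFIG_ROWS = [
    {"source_path": "/mnt/blob/orders", "target_table": "bronze.orders", "format": "json"},
    {"source_path": "/mnt/blob/events", "target_table": "bronze.events", "format": "csv"},
]

def build_stream_specs(config_rows: list) -> list:
    """Turn config rows into the options an Auto Loader-style reader would take."""
    return [
        {
            "options": {"cloudFiles.format": row["format"], "path": row["source_path"]},
            "target": row["target_table"],
        }
        for row in config_rows
    ]
```

On job startup, each spec would be passed to `spark.readStream` to initialize one stream per config row; adding a source then means inserting a row, not redeploying code.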

Telegram Parser

    Developed a system for real-time log data parsing and classification, implementing DLQ handling.

Migration to Snowflake

    Designed real-time streaming pipelines to capture and transform PostgreSQL data to Snowflake.

Education

  • B. Tech

    Rajiv Gandhi Technical University

Certifications

  • Microsoft Azure Fundamentals (AZ-900), Microsoft (Mar, 2020)

  • Teradata Database Certified Associate

Interests

  • Long Rides
  • Driving
AI-Interview Questions & Answers

    Okay, so I have 10 years of experience. And I've worked on many big data stacks, including Hadoop, Hive, Pig, and Spark. I've been working on Spark for the past 7 years. I've also used Kafka. I've worked on many cloud projects, including Azure, and the Azure services I've worked on include Azure Data Lake, Data Factory, Synapse, Databricks, and Delta Tables. So I've done many projects on real-time streaming as well as batch processing. Mainly, the programming languages I've worked on are Scala, Python, and Java. Currently, I'm working on a project where we're getting data from IoT Edge. After that, we have some modules built in Java, and the data gets stored in blobs. Then we push the data to Databricks. We have written a notebook that consumes the data; we use Auto Loader, so whenever any file is uploaded to the blob, the data gets consumed. We process the data through Databricks and store it into Delta Tables. We basically use Delta Live Tables, which is a kind of streaming table we've created; all our data is in Delta Live Tables. We created a pipeline on top of that, and this pipeline runs and stores the data into Delta Live Tables. After that, we created a dashboard on top of that, where the data is consumed from Delta Live Tables. So that's one project. I've done many projects on real-time streaming, basically on Kafka itself; that's my exposure to Kafka. We also worked on cloud services. I've used Azure, and we've used Azure Functions as well: we created APIs in Spring Boot and uploaded them as Azure Functions in Azure. So this is kind of my experience.

    Okay, so we mainly use Spark. I did one of the projects where we are getting data in real-time streaming. So basically, it's a telecom-based project where we get data based on incident-related data. So whenever any cable cut happens, an incident gets generated, and the incident pushes to the Kafka topic. We have a Kafka topic based on the status of the incident, like if the incident is created, it's in a queued state. Then after that, the state would change, and it's in an in-progress state. After that, the incident status is active, deferred, closed, and canceled state. We have multiple Kafka topics based on the status of the incident. So we get the data on that particular Kafka topic, and after that, we consume the data from a Spark streaming. We basically use Spark streaming. After consuming the data, it returns a DStream. We process the DStream, which is a collection of RDDs. In the DStream, we have an incident ID, incident status, and the updated timestamp at which time the incident got created. On the basis of the incident ID, we make a REST call and get a huge JSON response. We have a huge JSON response, so we compute that JSON incident data. We have incident data, customer data, and ticket data. We store the incident data in MongoDB. Before that, this is a huge JSON response, so we compute that JSON. We store the intermediate data, and the final data we store into MongoDB. We have different collections in MongoDB. One is the incident collection where we store incident-related data, and we have a ticket collection. We store ticket-related data, which contains the ticket ID, ticket status, and who's the technician working on that ticket. So that's kind of a ticket collection. We update the real-time data in MongoDB so that the customer has a complete picture of the status of the ticket. The customer can see the status of their ticket on the UI. We have real-time updates on the MongoDB. 
    And after that, this is how the job runs on the Spark cluster, and we use a three-node cluster. We have a continuously running job, and we get 80 to 100 incidents per second.
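The MongoDB status-update logic described in this answer might look like the following sketch: keep only the latest state per incident, keyed by incident id and ordered by the update timestamp. A dict stands in for the MongoDB incident collection, and the field names are invented.

```python
def upsert_incident(collection: dict, event: dict) -> dict:
    """Upsert the latest incident state, ignoring out-of-order (stale) updates."""
    incident_id, ts = event["incident_id"], event["updated_at"]
    current = collection.get(incident_id)
    if current is None or ts >= current["updated_at"]:
        collection[incident_id] = event      # newer (or first) update wins
    return collection
```

With MongoDB itself this would be an `update_one(..., upsert=True)` guarded by the timestamp; the dict version just makes the ordering rule explicit.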

    So what matters is data quality: you have to filter the data properly, and you don't need to store unnecessary data. You can normalize the data while storing it, so you store clean data. While storing, partition the data properly and create buckets, and take care of duplicates; duplicate data should not be stored in the target. That's how you maintain data quality and data integrity. Data should be normalized: you don't need to store all the data in one large table; separate the tables out per your scenarios, storing the data in a distributed way across multiple tables based on your use case, and create proper partitions. Suppose you were dealing with retail data: store the data year-wise, then month-wise, then date-wise, so you have a complete picture of the data. When you are querying, if you want to fetch data for a particular period, you don't need to scan all the data; you just put a filter on a particular partition and fetch only that partition, which improves your query performance. Filter out junk characters at the start. Spark also has read modes, like failfast mode, that help you maintain data quality in your ETL process, so you don't store unnecessary data at the target; just handle duplicates while reading the data.
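The year/month/date partitioning and dedupe rules in this answer can be sketched concretely. The Hive-style `key=value` path layout is standard for partitioned tables; the base path and business key below are invented for illustration.

```python
from datetime import date

def partition_path(base: str, d: date) -> str:
    """Build a Hive-style year/month/day partition path for a record date."""
    return f"{base}/year={d.year}/month={d.month:02d}/day={d.day:02d}"

def dedupe(rows: list, key: str) -> list:
    """Drop later duplicates on a business key before writing to the target."""
    seen, out = set(), []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            out.append(row)
    return out
```

Querying with a filter on `year`/`month`/`day` then prunes to one partition directory instead of scanning the whole table, which is the performance point made above.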

    You can perform many data validation checks using Python. Suppose you want to store data in a particular date format: you can validate the data against that format. And if you want to store the data in Snowflake as the target, you can validate that too. The data should not be duplicated, the data format should be correct, and the data types should be correct. This is how you can perform data validation, and you can also check whether the data maintains its relationships. So you perform validation based on your use case and your scenarios; in Spark, you can create a UDF and perform the data validation with it.
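The checks named in this answer (date format, type, duplicates) could look like the sketch below in plain Python. The expected columns and the YYYY-MM-DD format are assumptions; in Spark this logic would typically sit inside a UDF or a filter, as the answer suggests.

```python
import re

DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")   # assumed YYYY-MM-DD format

def validate_row(row: dict, errors: list) -> bool:
    """Check one row's date format and numeric type; record failures."""
    if not DATE_RE.match(str(row.get("order_date", ""))):
        errors.append((row, "bad date format"))
        return False
    if not isinstance(row.get("amount"), (int, float)):
        errors.append((row, "amount is not numeric"))
        return False
    return True

def validate_batch(rows: list):
    """Return (valid_rows, errors), also rejecting duplicate ids."""
    errors, seen, valid = [], set(), []
    for row in rows:
        if row.get("id") in seen:
            errors.append((row, "duplicate id"))
        elif validate_row(row, errors):
            seen.add(row["id"])
            valid.append(row)
    return valid, errors
```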

    I'm not logged on for this.

    So instead of hand-written Python code, you can just use PySpark, where you can create a workflow and read the data from multiple sources. Spark provides different built-in connectors where you can read the data from multiple sources, perform the transformations you want, and store the data into the target. You first need to decide which format you want to store the data in, and what the format at the source is. Think about that, then do the reading part, perform the transformation, and store the data into the target.

    So in this code, the loading failed because the transformed data is not correct. While you are loading the data into the target, it is unable to load due to a data type issue: a data type that doesn't match the target's data type. Suppose you are putting data into a table and the data type is different; you need to perform data validation while loading the data, and check that the columns are correct for your table.
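The pre-load check discussed in this answer can be made explicit: compare each row against the target table's column types before loading, so mismatches surface as clear errors instead of load failures. The schema below is purely illustrative.

```python
# Assumed target schema; a real one would be read from the warehouse catalog.
TARGET_SCHEMA = {"id": int, "name": str, "price": float}

def type_mismatches(row: dict, schema: dict = TARGET_SCHEMA) -> list:
    """List every column that is missing or has the wrong Python type."""
    problems = []
    for col, expected in schema.items():
        if col not in row:
            problems.append(f"missing column: {col}")
        elif not isinstance(row[col], expected):
            problems.append(
                f"{col}: expected {expected.__name__}, got {type(row[col]).__name__}"
            )
    return problems
```

Running this over a batch before the load step turns a cryptic target-side failure into a per-row, per-column error report.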

    You need to pass the DB connection. I'm not sure.

    Okay, so the process of tuning a data processing pipeline in BigQuery is you can use BigQuery.

    I'm not certain.