
Ambuj Kumar

Senior Data Engineer with 9+ years of experience building data-intensive applications, tackling challenging architecture and scalability problems, and managing data repositories for efficient visualization across a wide range of products. Highly analytical team player with an aptitude for prioritizing needs and risks. Constantly works to streamline processes and experiments with optimizing and benchmarking solutions. Creative troubleshooter and problem-solver who loves challenges. Experienced in implementing ML algorithms and CI/CD in production using the distributed paradigms of Spark/Flink on Azure Databricks, AWS SageMaker, and MLflow. Experienced in shaping and implementing Big Data architectures for the medical devices, retail, banking, games, and transport logistics (IoT) domains.
  • Role

    Senior Data Engineer

  • Years of Experience

    9 years

Skillsets

  • Airflow
  • Akka
  • AWS
  • Azure
  • Azure Data Lake
  • Cassandra
  • ClickHouse
  • Cosmos DB
  • Databricks Delta
  • dbt
  • Docker
  • DVC
  • Flask
  • Flink
  • GCP
  • GitFlow
  • Hive
  • Java
  • Kafka Streams
  • Kubernetes
  • Looker
  • Luigi
  • MLflow
  • MLlib
  • MongoDB
  • MQTT
  • Neo4j
  • Oozie
  • pandas
  • Pig
  • PostgreSQL
  • Python
  • Redshift
  • S3
  • SageMaker
  • Scala
  • scikit-learn
  • Spark
  • SQL
  • Structured Streaming
  • Tableau
  • TensorFlow
  • Terraform

Professional Summary

9 Years
  • Aug 2022 - Present (3 yr 4 months)

    Senior Data Engineer

    British Petroleum
  • Mar 2021 - Jun 2022 (1 yr 3 months)

    Senior Software Engineer

    StrongArmTech
  • Feb 2019 - Dec 2020 (1 yr 10 months)

    Senior Data Engineer Advanced

    Jones Lang LaSalle Technologies
  • Oct 2017 - Dec 2018 (1 yr 2 months)

    Senior Data Engineer

    Robert Bosch Engineering Solutions
  • Jun 2014 - Oct 2017 (3 yr 4 months)

    Software Developer

    General Electric Corp

Applications & Tools Known

  • Spark
  • Flink
  • PostgreSQL
  • Cassandra
  • MongoDB
  • Redshift
  • ClickHouse
  • Snowflake
  • Airflow
  • Luigi
  • Looker
  • Tableau
  • Azure Data Lake
  • S3
  • AWS
  • Azure
  • GCP
  • Databricks
  • Docker
  • Kubernetes
  • Terraform
  • GitFlow
  • MLflow
  • DVC
  • SageMaker

Work History

9 Years

Senior Data Engineer

British Petroleum
Aug 2022 - Present (3 yr 4 months)

    • Worked on a real-time streaming and batch lambda-architecture pipeline ingesting blockchain events and populating KPIs/dashboards in Delta Lake.
    • Created batch and streaming analytics jobs for the lambda architecture as Airflow-managed periodic PySpark jobs writing to Delta Lake (see the sketch below).
    • Modeled a data warehouse for KPI tracking on Snowflake (OLAP) and Databricks Delta.
    • Created and managed dbt models for extensive data-quality enforcement on dbt Cloud.
    • Modeled and maintained update pipelines for a Neo4j knowledge graph backing an end-user data-relationship-management product.
    • Used GitHub Actions, Docker, Kubernetes, and Terraform for CI/CD operations.
    • Ensured a GDPR- and CCPA-compliant data platform.
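For illustration, a minimal sketch of one such Airflow-triggered PySpark batch job, rolling raw blockchain events up into a daily KPI table on Delta Lake. The paths, table names, and columns (raw_events, kpi_daily_tx, event_ts) are hypothetical placeholders, not the production code; the sketch assumes the delta-spark package is available.

```python
# Hypothetical KPI batch job: aggregate blockchain events into a daily
# transaction-count table on Delta Lake. Paths and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("kpi-batch")
    # Delta Lake extensions; assumes delta-spark is on the classpath.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Read the raw event table and roll up transaction counts per day.
events = spark.read.format("delta").load("/mnt/delta/raw_events")
kpis = (
    events
    .withColumn("day", F.to_date("event_ts"))
    .groupBy("day")
    .agg(F.count("*").alias("tx_count"))
)

# Overwrite the KPI table; downstream dashboards read from this path.
kpis.write.format("delta").mode("overwrite").save("/mnt/delta/kpi_daily_tx")
```

In an Airflow-managed setup, a job like this would typically be submitted on a schedule (e.g. via a Databricks or spark-submit operator) rather than run directly.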

Senior Software Engineer

StrongArmTech
Mar 2021 - Jun 2022 (1 yr 3 months)

    • Created streaming pipelines to ingest sensor data and process it in real time, populating dashboards and the warehouse.
    • Built pipelines in which sensor data published to Kinesis (and to S3 for fail-safe reprocessing) was ingested by a Databricks job and written into Azure Delta tables and ClickHouse (GCP earlier).
    • Worked on Looker and SQL Analytics dashboards over ClickHouse/GCP.
    • Performed data-quality testing and improvement via periodic comparison jobs (sketched below).
    • Built pipelines as part of a SOLID-principled ML codebase, including ad hoc time-bound backruns, API CDC jobs for metadata entities, and production-optimized MLlib code, all in Python (including the pandas API).
    • Designed and integrated product entities using Databricks Delta (Parquet Delta Lake) and ClickHouse.
    • Used Terraform and GitHub Actions for DevOps, infrastructure, and CI/CD.
    • Ensured a GDPR-, CCPA-, and HIPAA-compliant data platform.
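A hedged sketch of the kind of periodic comparison job mentioned above: checking row-count agreement between a Delta table and its ClickHouse mirror. The table names, Delta path, and ClickHouse host are hypothetical, and the sketch assumes the clickhouse-driver package.

```python
# Hypothetical data-quality check: compare row counts between a Delta table
# and its ClickHouse copy. Names and hosts are placeholders.
from clickhouse_driver import Client
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dq-compare").getOrCreate()

delta_count = spark.read.format("delta").load("/mnt/delta/sensor_events").count()

ch = Client(host="clickhouse.internal")  # assumed internal hostname
(ch_count,) = ch.execute("SELECT count() FROM sensor_events")[0]

if delta_count != ch_count:
    # A real job would alert (Slack, PagerDuty, etc.) rather than raise.
    raise ValueError(f"Row-count drift: delta={delta_count} clickhouse={ch_count}")
```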

Senior Data Engineer Advanced

Jones Lang LaSalle Technologies
Feb 2019 - Dec 2020 (1 yr 10 months)

    • Worked on ingestion from multiple API sources, dump-schema creation, and entity modelling using Cosmos DB and Scala Azure Functions.
    • Worked on global multi-region sources and the associated rule-based, region-specific ETL pipelines driven by Spark notebooks on Azure Databricks.
    • Integrated entities in the property domain using Azure Cosmos Graph and Azure Databricks notebooks, exposed through Scala web-service APIs deployed on Azure HDInsight for fast search.
    • Worked on the streaming element of the pipeline, detecting data refreshes.
    • Designed per-table schema handling, ingestion, and implementation of a data warehouse for KPI tracking, including all components of a fully fledged reporting warehouse.
    • Created Airflow-scheduled Spark jobs to update the warehouses with daily data from Mongo, MySQL, Postgres, and folder dumps (see the DAG sketch below).
    • Managed scaled ingestion from public competitor APIs to track relevant parameters in an analytics warehouse on Redshift.
    • Wrote complex custom reporting logic in Spark that drove marketing strategy.
    • Benchmarked the real-time elements of the solution against Kafka Streams.
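A minimal Airflow DAG sketch for the daily warehouse refresh described above: one task per upstream source. It assumes Airflow 2.x; the DAG id, schedule, and the ingest callable are illustrative placeholders standing in for the real Spark ingestion jobs.

```python
# Hypothetical daily-refresh DAG: one ingestion task per source feeding the
# KPI warehouse. The ingest() body is a placeholder for the real Spark jobs.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest(source: str) -> None:
    # Placeholder: each source (Mongo, MySQL, Postgres, folder dumps)
    # had its own Spark ingestion job in the real pipeline.
    print(f"ingesting {source}")


with DAG(
    dag_id="daily_warehouse_refresh",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    tasks = [
        PythonOperator(
            task_id=f"ingest_{source}",
            python_callable=ingest,
            op_kwargs={"source": source},
        )
        for source in ("mongo", "mysql", "postgres", "dumps")
    ]
```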

Senior Data Engineer

Robert Bosch Engineering Solutions
Oct 2017 - Dec 2018 (1 yr 2 months)

    • Created Spark batch jobs deriving outputs from the incoming data model via a productionised ML model with associated business logic.
    • Implemented the Flask API layer and a simulator for the application; tested the end-to-end pipeline and the DevOps log monitoring of each component.
    • Led overall design and development of the lambda architecture: an MQTT-based Kafka/Spark pipeline for data ingestion and alert detection (see the MQTT-to-Kafka sketch below); developed it as a cloud-agnostic framework.
    • Created a Scala Flink complex-event-processing pipeline detecting events from the incoming data model with business logic.
    • Implemented the API layer in Akka and a data simulator; tested the end-to-end pipeline and component log monitoring on AWS.
    • Designed and developed, around the data format, an MQTT-based Kafka/Flink pipeline with RDBMS and Cassandra sinks for data ingestion and event/milestone detection.
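A hedged sketch of the MQTT-to-Kafka ingestion edge of such a lambda architecture: device telemetry arrives over MQTT and is forwarded to a Kafka topic for the Spark/Flink consumers. It assumes paho-mqtt 1.x and kafka-python; broker addresses and topic names are hypothetical.

```python
# Hypothetical MQTT-to-Kafka bridge: subscribe to device telemetry topics
# and forward raw payloads to Kafka for downstream stream processors.
import paho.mqtt.client as mqtt
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="kafka.internal:9092")


def on_message(client, userdata, msg):
    # Forward the raw payload; downstream jobs parse and validate it.
    producer.send("device-telemetry", key=msg.topic.encode(), value=msg.payload)


client = mqtt.Client()
client.on_message = on_message
client.connect("mqtt.internal", 1883)
client.subscribe("devices/+/telemetry")
client.loop_forever()
```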

Software Developer

General Electric Corp
Jun 2014 - Oct 2017 (3 yr 4 months)

    • GE Healthcare device-monitoring product: deployed and maintained the Azure-based cluster (DevOps), alongside pipeline design and data-handling constraints using a data-virtualization tool.
    • Implemented detection algorithms for various respiration and lung parameters, plus accumulation algorithms for case-end aggregation requirements.
    • Modeled data in Cassandra for real-time storage and case-end aggregation (a schema sketch follows), and modeled data for warehousing and UI-based consumption.
    • Company log-data analytics: wrote Pig scripts against the Hive database, staging data for processing before loading it into the final Hadoop tables.
    • Built Oozie workflows executing Java, Pig, and Hive actions based on decision nodes; scheduled Oozie workflow and coordinator jobs.
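As a sketch of the Cassandra time-series modeling pattern referenced above: partitioning by device and day keeps each case's readings in a single partition so case-end aggregation is one partition scan. The keyspace, table, and columns are hypothetical, and the sketch assumes the cassandra-driver package.

```python
# Hypothetical Cassandra schema for real-time device readings, partitioned
# by (device_id, day) so case-end aggregation scans one partition.
from cassandra.cluster import Cluster

cluster = Cluster(["cassandra.internal"])  # assumed contact point
session = cluster.connect("monitoring")    # assumed keyspace

session.execute("""
    CREATE TABLE IF NOT EXISTS respiration_readings (
        device_id text,
        day date,
        reading_ts timestamp,
        value double,
        PRIMARY KEY ((device_id, day), reading_ts)
    ) WITH CLUSTERING ORDER BY (reading_ts DESC)
""")
```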

Education

  • Bachelor of Engineering

    Syb University (2014)

Certifications

  • ConsenSys Certified Blockchain Developer

  • Oracle Certified Associate, Java SE 7 Programmer

  • Oracle Certified, Oracle Database 11g Advanced SQL