
Suman Khatua

Data Engineer with over 6 years of IT industry experience and technical skills across the full software development life cycle, including requirement gathering, architecture design, and project planning. Manages project execution from development through production release and maintains production releases.


Have been responsible for the design and development of multiple applications involving data integration from ~10 operational data stores into an enterprise data warehouse, applying business logic and requirements.


Good working knowledge of NoSQL databases, including MongoDB and HBase.


Good coding skills in Python, PySpark, dbt, and SQL, with a good understanding of algorithms and their efficient implementation. Basic knowledge of AWS and Azure cloud services; have worked with Sqoop, Flume, Redshift, S3, EMR, etc.


Experienced in ingesting data from multiple data sources and deriving meaningful insights.


  • Role

    Senior Data Engineer

  • Years of Experience

    6.3 years

Skillsets

  • Hive
  • Shell-script
  • Scala
  • RDS
  • PyCharm
  • PowerBI
  • PostgreSQL
  • pandas
  • NumPy
  • NoSQL
  • MS Fabric
  • MS Excel
  • MongoDB
  • MapReduce
  • Jupyter
  • Jira
  • AWS - 2 Years
  • Databricks
  • Data lake
  • CosmosDB
  • Bitbucket
  • Apache
  • Redshift - 2 Years
  • Spark
  • Python - 5 Years
  • PySpark
  • MySQL
  • HBase
  • Hadoop
  • dbt
  • Azure

Professional Summary

6.3 Years
  • Nov 2024 - Present (1 yr)

    Senior Data Engineer (Senior Associate - Analytics)

    Altudo
  • Oct 2022 - Nov 2024 (2 yr 1 mo)

    Senior Data Engineer (Assistant Manager)

    Enquero
  • May 2022 - Sep 2022 (4 mo)

    Data Engineer (Senior Systems Engineer)

    Infosys
  • Jul 2019 - May 2022 (2 yr 10 mo)

    Data Engineer (System Engineer)

    Tata Consultancy Services

Applications & Tools Known

  • Jupyter
  • PyCharm
  • Jira
  • Bitbucket
  • Spark
  • Hive
  • Hadoop
  • Apache
  • Databricks
  • Data Lake
  • Microsoft Azure
  • MySQL
  • PostgreSQL
  • MongoDB
  • Redshift
  • RDS
  • CosmosDB
  • MS Excel

Work History

6.3 Years

Senior Data Engineer (Senior Associate - Analytics)

Altudo
Nov 2024 - Present (1 yr)
    Building a medallion architecture in MS Fabric. Built a metadata-driven medallion architecture with three layers (Bronze, Silver, Gold), performing CDC in the Gold layer and complex transformations in the Silver and Gold layers. Work with the client to understand the business, support report building, and resolve data mismatches against their old system. Built automated test cases for the pipelines and monitor them with minimal manual effort, fixing failures seamlessly. Migrated historical data from multiple sources such as NAV, Salesforce, SFTP, and Blueridge.
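
    A minimal sketch of the metadata-driven CDC merge pattern described above, assuming Delta tables in Fabric and illustrative table, key, and watermark names (not the actual project metadata):

        # Metadata-driven CDC upsert into a Gold Delta table (SCD Type 1 style).
        from delta.tables import DeltaTable
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.getOrCreate()

        # Hypothetical metadata entry describing one Gold entity.
        entity_meta = {
            "silver_table": "silver.sales_orders",   # assumed source table
            "gold_table": "gold.sales_orders",       # assumed target table
            "keys": ["order_id"],                    # assumed business key
            "watermark_col": "modified_at",          # assumed change-tracking column
        }

        def merge_cdc(meta: dict) -> None:
            """Upsert changed Silver rows into the Gold layer."""
            updates = spark.table(meta["silver_table"])
            target = DeltaTable.forName(spark, meta["gold_table"])
            cond = " AND ".join(f"t.{k} = s.{k}" for k in meta["keys"])
            (target.alias("t")
                   .merge(updates.alias("s"), cond)
                   .whenMatchedUpdateAll(
                       condition=f"s.{meta['watermark_col']} > t.{meta['watermark_col']}")
                   .whenNotMatchedInsertAll()
                   .execute())

        merge_cdc(entity_meta)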

Senior Data Engineer (Assistant Manager)

Enquero
Oct 2022 - Nov 2024 (2 yr 1 mo)
    Building end-to-end pipelines. Refined and enriched trend insights by developing full pipelines over global transactional data, and built dashboards on the enriched data using Python, SQL, and shell scripts. Worked with different source connections such as FTP, shared drives, and AWS to pull data and import it into the required tables for further transformation. Used the data build tool (dbt) to create macros for different use cases of data test building and data transformation. Constructed full pipelines in a four-layer structure (staging, intermediate1, intermediate2, and core), and ran, tested, and debugged them. Implemented multiple test cases on large data sets, such as not-null, unique, foreign-key relationship, range, and accepted-values checks. Created more than 30 Python functions to validate different types of data inconsistency, and validated most of the source data with 99.9% of validations meeting the requirements for building machine learning models. Databricks migration: migrated the on-premises pipelines into Databricks across Bronze, Silver, and Gold layers, orchestrated the pipelines to optimize storage and compute usage, and used Databricks features such as versioning, external tables, internal tables, and views to streamline the data pipelines.
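
    An illustrative sketch of the kind of reusable validation functions mentioned above (not-null, unique, accepted values), written in PySpark with placeholder table and column names rather than the actual project code:

        from pyspark.sql import DataFrame, functions as F

        def check_not_null(df: DataFrame, col: str) -> bool:
            """Pass if the column contains no NULLs."""
            return df.filter(F.col(col).isNull()).limit(1).count() == 0

        def check_unique(df: DataFrame, cols: list) -> bool:
            """Pass if the column combination has no duplicate rows."""
            return df.groupBy(*cols).count().filter("count > 1").limit(1).count() == 0

        def check_accepted_values(df: DataFrame, col: str, allowed: list) -> bool:
            """Pass if every value of the column is in the allowed set."""
            return df.filter(~F.col(col).isin(allowed)).limit(1).count() == 0

        # Example usage against a hypothetical staging table:
        # orders = spark.table("staging.orders")
        # assert check_not_null(orders, "order_id")
        # assert check_unique(orders, ["order_id"])
        # assert check_accepted_values(orders, "status", ["OPEN", "SHIPPED", "CLOSED"])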

Data Engineer (Senior Systems Engineer)

Infosys
May 2022 - Sep 2022 (4 mo)
    Building AWS Lambda and Step Functions. Developed and enhanced the infrastructure required for extraction, optimal transformation, and loading of data from a wide variety of source applications using AWS data services, Spark SQL, and Apache Airflow. Built Scala scripts to generate data flows and templates for the data extraction interface and load the metadata details into MongoDB. Built multiple AWS Lambda functions that consume the generated metadata from MongoDB and trigger AWS Step Functions to execute the extraction jobs and ingest data into the AWS data lake.
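
    A hedged sketch of a Lambda handler that starts a Step Functions extraction job, as outlined above; the state machine ARN environment variable and the event fields are assumptions for illustration:

        import json
        import os
        import boto3

        sfn = boto3.client("stepfunctions")

        def lambda_handler(event, context):
            """Start the extraction state machine for the table named in the event."""
            execution_input = {
                "source_table": event.get("source_table"),  # assumed event field
                "load_date": event.get("load_date"),        # assumed event field
            }
            response = sfn.start_execution(
                stateMachineArn=os.environ["STATE_MACHINE_ARN"],  # assumed env var
                input=json.dumps(execution_input),
            )
            return {"statusCode": 200, "executionArn": response["executionArn"]}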

Data Engineer (System Engineer)

Tata Consultancy Services
Jul 2019 - May 2022 (2 yr 10 mo)
    Building ETL pipelines. Built ETL pipelines with Azure Data Factory to load data from different on-prem servers into the raw layer of Azure Data Lake, then cleansed the data with Azure Databricks and moved it to the cleansed layer for ADRM data modelling. Applied SCD Type 1 and Type 2 transformations to the cleansed data and broke source tables into separate entities in 3NF, as per the ADRM model, to make the data available in the stage layer. Developed daily pipelines to migrate millions of records from the stage-layer ADRM data to Azure Cosmos DB using the spark-cosmos connector, with incremental logic, model-driven transformations, optimization, and partitioning in Spark to move the data efficiently; this helped the client reduce costs by 40%. Implemented a near-real-time batch pipeline in Azure Data Factory and Spark that incrementally loads one transactional table into Cosmos DB after proper cleansing and transformation.
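
    An illustrative sketch of an incremental Spark write to Cosmos DB; the endpoint, key, table, and container names are placeholders, and the option names assume the Azure Cosmos DB Spark 3 OLTP connector rather than the exact connector version used on the project:

        from pyspark.sql import SparkSession, functions as F

        spark = SparkSession.builder.getOrCreate()

        # Hypothetical incremental slice: only rows changed since the last run.
        last_watermark = "2022-01-01T00:00:00"              # assumed watermark value
        stage = (spark.table("stage.adrm_transactions")     # assumed stage-layer table
                      .filter(F.col("modified_at") > last_watermark)
                      .repartition(32, "customer_id"))      # partition before the write

        (stage.write
              .format("cosmos.oltp")
              .option("spark.cosmos.accountEndpoint", "https://<account>.documents.azure.com:443/")
              .option("spark.cosmos.accountKey", "<key>")
              .option("spark.cosmos.database", "sales")              # assumed database
              .option("spark.cosmos.container", "transactions")      # assumed container
              .option("spark.cosmos.write.strategy", "ItemOverwrite")  # upsert semantics
              .mode("append")
              .save())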

Achievements

  • Secured 3.7/4 GPA
  • Secured 75%
  • Ranked among top 5% in ECE Batch
  • 5-star Gold badge in SQL on the HackerRank platform
  • 5-star Gold badge in Python on the HackerRank platform
  • 2-star Bronze badge in Problem Solving on the HackerRank platform

Major Projects

1 Project

ETL and Data Analysis

    Extracted transactional data from a MySQL RDS server to HDFS (on EC2) using Sqoop, transformed the data with PySpark, and loaded it into an S3 bucket. Created Redshift tables and loaded the data from S3 into Redshift for analytical queries.
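
    A rough sketch of this flow under assumed paths, bucket, and table names: read the Sqoop-imported files from HDFS, transform them with PySpark, land Parquet on S3, then COPY into Redshift.

        from pyspark.sql import SparkSession, functions as F

        spark = SparkSession.builder.appName("etl-and-analysis").getOrCreate()

        # Sqoop has already landed the MySQL table as delimited files on HDFS, e.g.
        #   sqoop import --connect jdbc:mysql://<rds-endpoint>/shop --table orders \
        #         --target-dir /data/raw/orders
        raw = (spark.read.csv("hdfs:///data/raw/orders")                        # assumed path
                    .toDF("order_id", "order_date", "quantity", "unit_price"))  # assumed columns

        # Basic cleansing and a derived column.
        clean = (raw.dropDuplicates(["order_id"])
                    .withColumn("order_date", F.to_date("order_date"))
                    .withColumn("order_total",
                                F.col("quantity").cast("double") * F.col("unit_price").cast("double")))

        # Write Parquet to S3 for Redshift to ingest.
        clean.write.mode("overwrite").parquet("s3a://example-bucket/curated/orders/")  # assumed bucket

        # Load into Redshift (run against the cluster, e.g. via a SQL client):
        copy_sql = """
        COPY analytics.orders
        FROM 's3://example-bucket/curated/orders/'
        IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-role>'
        FORMAT AS PARQUET;
        """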

Education

  • Post Graduate Diploma in Data Engineering

    IIIT Bangalore (2022)
  • Bachelor of Technology in Electronics and Communication Engineering

    SOE, Cochin University of Science and Technology (2019)

Certifications

  • Databricks Certified Associate Data Engineer

  • AWS Certified Cloud Practitioner

  • 3x Microsoft Azure Certified

  • Advanced SQL for Data Science

  • Python Data Science Certification

  • Infosys Machine Learning Certified

  • dbt Fundamentals