
Suman Khatua

Data Engineer with over 6 years of IT industry experience and technical skills across the full software development life cycle, including requirement gathering, architecture design, and project planning. Manages project execution from development through production release and maintains production releases.


Have been responsible for the design and development of multiple applications involving data integration from ~10 operational data stores into an enterprise data warehouse, applying business logic and requirements.


Good working knowledge of NoSQL databases, including MongoDB and HBase.


Good coding skills in Python, PySpark, dbt, and SQL, with a good understanding of algorithms and their efficient implementation. Basic knowledge of AWS and Azure cloud services; have worked with Sqoop, Flume, Redshift, S3, EMR, etc.


Experienced in ingesting data from multiple data sources and deriving meaningful insights.


  • Role

    Senior Data Engineer

  • Years of Experience

    6.3 years

Skillsets

  • Hive
  • Shell-script
  • Scala
  • RDS
  • PyCharm
  • PowerBI
  • PostgreSQL
  • pandas
  • NumPy
  • NoSQL
  • MS Fabric
  • MS Excel
  • MongoDB
  • MapReduce
  • Jupyter
  • Jira
  • AWS - 2 Years
  • Databricks
  • Data lake
  • CosmosDB
  • Bitbucket
  • Apache
  • Redshift - 2 Years
  • Spark
  • Python - 5 Years
  • PySpark
  • MySQL
  • HBase
  • Hadoop
  • dbt
  • Azure

Professional Summary

6.3 Years
  • Nov 2024 - Present (1 yr)

    Senior Data Engineer (Senior Associate - Analytics)

    Altudo
  • Oct 2022 - Nov 2024 (2 yr 1 mo)

    Senior Data Engineer (Assistant Manager)

    Enquero
  • May 2022 - Sep 2022 (4 mo)

    Data Engineer (Senior Systems Engineer)

    Infosys
  • Jul 2019 - May 2022 (2 yr 10 mo)

    Data Engineer (System Engineer)

    Tata Consultancy Services

Applications & Tools Known

  • Jupyter
  • PyCharm
  • Jira
  • Bitbucket
  • Spark
  • Hive
  • Hadoop
  • Apache
  • Databricks
  • Data Lake
  • Microsoft Azure
  • MySQL
  • PostgreSQL
  • MongoDB
  • Redshift
  • RDS
  • CosmosDB
  • MS Excel

Work History

6.3 Years

Senior Data Engineer (Senior Associate - Analytics)

Altudo
Nov 2024 - Present (1 yr)
    Building a medallion architecture in MS Fabric. Built a metadata-driven medallion architecture with three layers (Bronze, Silver, Gold), performing CDC in the Gold layer and complex transformations in the Silver and Gold layers. Work with the client to understand the business, support report building, and resolve data mismatches against their old system. Built automated test cases for the pipelines and monitor them with minimal manual effort, fixing failures seamlessly. Migrated historical data from multiple sources such as NAV, Salesforce, SFTP, and Blueridge.
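
    A minimal sketch of the metadata-driven CDC merge pattern described above, assuming Delta tables in Fabric and illustrative table, key, and watermark names (not the actual project metadata):

        # Metadata-driven CDC upsert into a Gold Delta table (SCD Type 1 style).
        from delta.tables import DeltaTable
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.getOrCreate()

        # Hypothetical metadata entry describing one Gold entity.
        entity_meta = {
            "silver_table": "silver.sales_orders",   # assumed source table
            "gold_table": "gold.sales_orders",       # assumed target table
            "keys": ["order_id"],                    # assumed business key
            "watermark_col": "modified_at",          # assumed change-tracking column
        }

        def merge_cdc(meta: dict) -> None:
            """Upsert changed Silver rows into the Gold layer."""
            updates = spark.table(meta["silver_table"])
            target = DeltaTable.forName(spark, meta["gold_table"])
            cond = " AND ".join(f"t.{k} = s.{k}" for k in meta["keys"])
            (target.alias("t")
                   .merge(updates.alias("s"), cond)
                   .whenMatchedUpdateAll(
                       condition=f"s.{meta['watermark_col']} > t.{meta['watermark_col']}")
                   .whenNotMatchedInsertAll()
                   .execute())

        merge_cdc(entity_meta)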

Senior Data Engineer (Assistant Manager)

Enquero
Oct 2022 - Nov 2024 (2 yr 1 mo)
    Building end-to-end pipelines. Refined and enriched trend insights by developing full pipelines over global transactional data, and built dashboards on the enriched data using Python, SQL, and shell scripts. Worked with different source connections such as FTP, shared drives, and AWS to pull data and import it into the required tables for further transformation. Used the data build tool (dbt) to create macros for different use cases of data test building and data transformation. Constructed full pipelines in a four-layer structure (staging, intermediate1, intermediate2, and core), and ran, tested, and debugged them. Implemented multiple test cases on large data sets, such as not-null, unique, foreign-key relationship, range, and accepted-values checks. Created more than 30 Python functions to validate different types of data inconsistency, and validated most of the source data with 99.9% of validations meeting the requirements for building machine learning models. Databricks migration: migrated the on-premises pipelines into Databricks across Bronze, Silver, and Gold layers, orchestrated the pipelines to optimize storage and compute usage, and used Databricks features such as versioning, external tables, internal tables, and views to streamline the data pipelines.
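
    An illustrative sketch of the kind of reusable validation functions mentioned above (not-null, unique, accepted values), written in PySpark with placeholder table and column names rather than the actual project code:

        from pyspark.sql import DataFrame, functions as F

        def check_not_null(df: DataFrame, col: str) -> bool:
            """Pass if the column contains no NULLs."""
            return df.filter(F.col(col).isNull()).limit(1).count() == 0

        def check_unique(df: DataFrame, cols: list) -> bool:
            """Pass if the column combination has no duplicate rows."""
            return df.groupBy(*cols).count().filter("count > 1").limit(1).count() == 0

        def check_accepted_values(df: DataFrame, col: str, allowed: list) -> bool:
            """Pass if every value of the column is in the allowed set."""
            return df.filter(~F.col(col).isin(allowed)).limit(1).count() == 0

        # Example usage against a hypothetical staging table:
        # orders = spark.table("staging.orders")
        # assert check_not_null(orders, "order_id")
        # assert check_unique(orders, ["order_id"])
        # assert check_accepted_values(orders, "status", ["OPEN", "SHIPPED", "CLOSED"])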

Data Engineer (Senior Systems Engineer)

Infosys
May 2022 - Sep 2022 (4 mo)
    Building AWS Lambda and Step Functions. Developed and enhanced the infrastructure required for extraction, optimal transformation, and loading of data from a wide variety of source applications using AWS data services, Spark SQL, and Apache Airflow. Built Scala scripts to generate data flows and templates for the data extraction interface and load the metadata details into MongoDB. Built multiple AWS Lambda functions that consume the generated metadata from MongoDB and trigger AWS Step Functions to execute the extraction jobs and ingest data into the AWS data lake.
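
    A hedged sketch of a Lambda handler that starts a Step Functions extraction job, as outlined above; the state machine ARN environment variable and the event fields are assumptions for illustration:

        import json
        import os
        import boto3

        sfn = boto3.client("stepfunctions")

        def lambda_handler(event, context):
            """Start the extraction state machine for the table named in the event."""
            execution_input = {
                "source_table": event.get("source_table"),  # assumed event field
                "load_date": event.get("load_date"),        # assumed event field
            }
            response = sfn.start_execution(
                stateMachineArn=os.environ["STATE_MACHINE_ARN"],  # assumed env var
                input=json.dumps(execution_input),
            )
            return {"statusCode": 200, "executionArn": response["executionArn"]}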

Data Engineer (System Engineer)

Tata Consultancy Services
Jul 2019 - May 2022 (2 yr 10 mo)
    Building ETL pipelines. Built ETL pipelines with Azure Data Factory to load data from different on-prem servers into the raw layer of Azure Data Lake, then cleansed the data with Azure Databricks and moved it to the cleansed layer for ADRM data modelling. Applied SCD Type 1 and Type 2 transformations to the cleansed data and broke source tables into separate entities in 3NF, as per the ADRM model, to make the data available in the stage layer. Developed daily pipelines to migrate millions of records from the stage-layer ADRM data to Azure Cosmos DB using the spark-cosmos connector, with incremental logic, model-driven transformations, optimization, and partitioning in Spark to move the data efficiently; this helped the client reduce costs by 40%. Implemented a near-real-time batch pipeline in Azure Data Factory and Spark that incrementally loads one transactional table into Cosmos DB after proper cleansing and transformation.
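
    An illustrative sketch of an incremental Spark write to Cosmos DB; the endpoint, key, table, and container names are placeholders, and the option names assume the Azure Cosmos DB Spark 3 OLTP connector rather than the exact connector version used on the project:

        from pyspark.sql import SparkSession, functions as F

        spark = SparkSession.builder.getOrCreate()

        # Hypothetical incremental slice: only rows changed since the last run.
        last_watermark = "2022-01-01T00:00:00"              # assumed watermark value
        stage = (spark.table("stage.adrm_transactions")     # assumed stage-layer table
                      .filter(F.col("modified_at") > last_watermark)
                      .repartition(32, "customer_id"))      # partition before the write

        (stage.write
              .format("cosmos.oltp")
              .option("spark.cosmos.accountEndpoint", "https://<account>.documents.azure.com:443/")
              .option("spark.cosmos.accountKey", "<key>")
              .option("spark.cosmos.database", "sales")              # assumed database
              .option("spark.cosmos.container", "transactions")      # assumed container
              .option("spark.cosmos.write.strategy", "ItemOverwrite")  # upsert semantics
              .mode("append")
              .save())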

Achievements

  • Secured 3.7/4 GPA
  • Secured 75%
  • Ranked among top 5% in ECE Batch
  • 5-star Gold badge in SQL on the HackerRank platform
  • 5-star Gold badge in Python on the HackerRank platform
  • 2-star Bronze badge in Problem Solving on the HackerRank platform

Major Projects

1 Project

ETL and Data Analysis

    Extracted transactional data from a MySQL RDS server to HDFS (on EC2) using Sqoop, transformed the data with PySpark, and loaded it into an S3 bucket. Created Redshift tables and loaded the data from S3 into Redshift for analytical queries.
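
    A rough sketch of this flow under assumed paths, bucket, and table names: read the Sqoop-imported files from HDFS, transform them with PySpark, land Parquet on S3, then COPY into Redshift.

        from pyspark.sql import SparkSession, functions as F

        spark = SparkSession.builder.appName("etl-and-analysis").getOrCreate()

        # Sqoop has already landed the MySQL table as delimited files on HDFS, e.g.
        #   sqoop import --connect jdbc:mysql://<rds-endpoint>/shop --table orders \
        #         --target-dir /data/raw/orders
        raw = (spark.read.csv("hdfs:///data/raw/orders")                        # assumed path
                    .toDF("order_id", "order_date", "quantity", "unit_price"))  # assumed columns

        # Basic cleansing and a derived column.
        clean = (raw.dropDuplicates(["order_id"])
                    .withColumn("order_date", F.to_date("order_date"))
                    .withColumn("order_total",
                                F.col("quantity").cast("double") * F.col("unit_price").cast("double")))

        # Write Parquet to S3 for Redshift to ingest.
        clean.write.mode("overwrite").parquet("s3a://example-bucket/curated/orders/")  # assumed bucket

        # Load into Redshift (run against the cluster, e.g. via a SQL client):
        copy_sql = """
        COPY analytics.orders
        FROM 's3://example-bucket/curated/orders/'
        IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-role>'
        FORMAT AS PARQUET;
        """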

Education

  • Post Graduate Diploma in Data Engineering

    IIIT Bangalore (2022)
  • Bachelor of Technology in Electronics and Communication Engineering

    SOE, Cochin University of Science and Technology (2019)

Certifications

  • Databricks Certified Associate Data Engineer

  • AWS Certified Cloud Practitioner

  • 3x Microsoft Azure Certified

  • Advanced SQL for Data Science

  • Python Data Science Certification

  • Infosys Machine Learning Certified

  • dbt Fundamentals