Skilled in Data Warehousing, Data Mart, Data Modeling, and ETL/ELT practices for large-scale data management.
Built batch & streaming pipelines on the Spark & Hadoop ecosystem, handling terabytes of data on a day-to-day basis.
Worked on cloud migration of thousands of pipelines (on-premises to GCP).
Part of the Data Management team, involved in Automation & Support, Platform Engineering, and Data Governance activities.
Built & managed data pipeline frameworks (Python, Scala) used for data ingestion activities across teams/markets.
Expert in writing optimized SQL queries for OLTP & OLAP workloads using indexes, joins, aggregates, window functions, CTEs, and more.
Worked in the Retail, Banking, Financial Services & Insurance domains.
Collaborated on quarterly roadmaps, optimizing resource allocation and ensuring clear stakeholder communication.
Led sprint planning, prioritizing backlog, managing team capacity, and providing technical guidance for timely & high-quality delivery.
Senior Data Engineer
Walmart Global Tech India
Data Engineer - III
Walmart Global Tech India
Data Engineer - Associate
Infosys Limited
Spark
Hadoop
GCP
SQL
Python
Scala
Java
C
C++
Bash scripting
Hive
HDFS
Sqoop
Kafka
Kafka Connect
Oracle
MSSQL
MySQL
Dremio
PostgreSQL
Snowflake
Dataproc
Batch
IAM
BigQuery
Airflow
Looker
Power BI
Tableau
Apache Superset
Docker
Jenkins
Git
Maven
Excel
Led a team of 6 engineers in maintaining & enhancing in-house data pipeline frameworks (Scala & Python) for streaming and batch data ingestion across teams and domains (used by 6000+ pipelines), achieving a 70% reduction in pipeline creation effort.
Led a team of 25 data lake support personnel (L1), empowering them to conduct independent initial failure analysis and due diligence. This enabled self-sufficiency in monitoring and supporting 6,000+ big data pipelines across markets.
Partnered with stakeholders to define and implement technical solutions, whether at the framework or platform level, for various data needs.
Worked on Airflow migration (Kubernetes to Celery executor), version upgrades, and optimization as the pipeline count grew.
Established and managed CI/CD components (Git, Maven, Jenkins, Docker) to support the growing number of pipelines and framework changes.
Established data lake best practices and conducted Proof-of-Concepts (POCs) for new tools & technologies, driving adoption and improving data lake functionality.
Mentored junior engineers, fostered collaboration, and ensured knowledge transfer for successful project execution.
Built a scalable Scala Spark application for e-commerce fraud detection, processing diverse data streams (sales, refunds, cancellations, etc.) to identify fraudulent activities including employee/associate/partner collusion, frequent returns, and linked accounts (a representative streaming sketch follows this section).
Collaborated with the UI/UX team on UI integration & UAT, ensuring data integrity and accuracy on dashboards.
Collaborated with the Data Science team, providing data for model creation and integrating models into fraud-detection pipelines.
Provided interactive dashboards to analysts by generating dynamic SQL queries with Apache FreeMarker based on dashboard interactions.
Migrated a Scala-based on-premises application to GCP and optimized it (caching, profiling, tuning Spark parameters, etc.), achieving cost savings of $4,500 per month and a 250% improvement in execution time.
Migrated pipelines from Automic to Apache Airflow, applying optimizations that kept task queuing to a minimum.
Automated various data management tasks, such as bucket creation and refreshing BigQuery tables with updated metadata from GCS buckets (a sketch of the refresh step follows below).
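A minimal PySpark Structured Streaming sketch of the fraud-detection ingestion pattern described above; the broker, topic, schema, and flagging rule are hypothetical placeholders, and the production application was written in Scala.

```python
# Minimal sketch of the fraud-detection ingestion pattern: read refund events
# from Kafka and flag accounts with unusually many refunds handled by one
# associate. All names and thresholds are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("fraud-detection-sketch").getOrCreate()

# Hypothetical schema for refund events.
refund_schema = StructType([
    StructField("order_id", StringType()),
    StructField("account_id", StringType()),
    StructField("associate_id", StringType()),
    StructField("amount", DoubleType()),
])

refunds = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("subscribe", "refund-events")              # placeholder topic
    .load()
    .select(F.from_json(F.col("value").cast("string"), refund_schema).alias("e"))
    .select("e.*")
)

# Toy rule: many refunds for the same account handled by the same associate.
suspicious = (
    refunds.groupBy("account_id", "associate_id")
    .agg(F.count("*").alias("refund_count"), F.sum("amount").alias("refund_total"))
    .where(F.col("refund_count") >= 5)
)

query = (
    suspicious.writeStream
    .outputMode("complete")
    .format("console")  # in practice this would land in a reviewed sink
    .start()
)
query.awaitTermination()
```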
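A hedged sketch of the BigQuery table refresh automation mentioned above, using the google-cloud-bigquery client; the project, dataset, table, and bucket paths are illustrative placeholders.

```python
# Refresh a BigQuery table from Parquet files landed in a GCS bucket.
# Project, dataset, table, and bucket names are illustrative placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    autodetect=True,  # pick up the updated schema/metadata from the files
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/exports/sales/*.parquet",  # placeholder GCS path
    "my-project.analytics.sales",              # placeholder destination table
    job_config=job_config,
)
load_job.result()  # wait for the load to complete
print(f"Loaded {load_job.output_rows} rows")
```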
Leveraged Erwin Data Modeler to streamline data modeling and transformation processes.
Utilized Sqoop & JDBC to ingest data into the data lake from various RDBMS, Cosmos DB, and Cassandra sources.
Performed data cleaning activities and loaded data into the catalog zone using Hive SQL, Spark SQL, and PySpark.
Optimized pipelines using partitioning and Spark parameter tuning, resulting in 40% to 350% performance improvements (see the tuning sketch after this list).
Used Autosys for workflow management and orchestration of Spark jobs.
Created the technical design, data model, and documentation for the solution.
Experienced in consuming data from REST API endpoints using Python libraries such as requests, pycurl, and urllib3.
Used Python multi-threading and multi-processing to achieve order-of-magnitude performance improvements (a concurrent fetch sketch follows below).
Worked with various data file formats, including XML, YAML, JSON, HOCON, CSV, TXT, ORC, Avro, and Parquet.
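An illustrative PySpark snippet of the partitioning and parameter tuning referenced in the optimization bullet above; paths, column names, and settings are placeholder assumptions, since the right values depend on data volume and cluster size.

```python
# Sketch of the optimization pattern: tune shuffle parallelism and executor
# memory, repartition on the write key, and write date-partitioned output so
# downstream reads can prune. Paths and values are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("pipeline-optimization-sketch")
    .config("spark.sql.shuffle.partitions", "400")  # match actual data volume
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.executor.memory", "8g")
    .getOrCreate()
)

events = spark.read.parquet("gs://my-bucket/raw/events/")  # placeholder path

(
    events.repartition(200, "event_date")  # avoid skewed tasks on the write key
    .write.mode("overwrite")
    .partitionBy("event_date")             # enable partition pruning downstream
    .parquet("gs://my-bucket/catalog/events/")
)
```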
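A small sketch combining the REST ingestion and Python concurrency bullets above: fetching several endpoints in parallel with requests and a thread pool. The URLs and worker count are hypothetical; for CPU-bound work, ProcessPoolExecutor is the analogous choice.

```python
# Fetch a set of REST endpoints in parallel with requests and a thread pool.
# URLs and worker count are placeholders.
from concurrent.futures import ThreadPoolExecutor, as_completed
import requests

ENDPOINTS = [
    "https://api.example.com/orders?page=1",
    "https://api.example.com/orders?page=2",
    "https://api.example.com/orders?page=3",
]

def fetch(url: str) -> dict:
    """Fetch one endpoint and return its JSON payload."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()

results = []
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = {pool.submit(fetch, url): url for url in ENDPOINTS}
    for future in as_completed(futures):
        try:
            results.append(future.result())
        except requests.RequestException as exc:
            print(f"Failed to fetch {futures[future]}: {exc}")

print(f"Fetched {len(results)} pages")
```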