Associate Data Engineer
Accenture
Sep, 2021 - Dec, 2022
1 yr 3 months
Designed and developed a scalable PySpark-based data ingestion framework on Apache Spark, AWS S3, AWS EMR, and Hadoop, capable of ingesting terabyte-scale datasets.
Built pre-ingestion data validation checks on PySpark DataFrames to enforce data quality and consistency.
Reduced processing time by 30% by optimizing Spark job execution, partitioning strategy, and I/O operations.
Integrated the framework with Apache Airflow for pipeline scheduling and with AWS Glue for metadata and catalog management, supporting pipeline governance.
Followed data governance, scalability, and security best practices in a distributed data processing environment.