Vetted Talent

Rameshkumar Shanmugam

I manage, design, and implement data solutions for various business domains and use cases. I have over 18 years of IT industry experience, with a dedicated focus on data engineering for the past 7 years. My core competencies lie in data management activities, including data ingestion, integration, visualization, and data quality. I excel in the implementation of both OLTP and OLAP systems and have successfully led small to medium data analytics teams. I leverage my skills in AWS cloud computing to deliver scalable, reliable, and efficient data solutions. I hold a Data Architect Nanodegree from Udacity and multiple certifications, and I keep learning modern technologies and best practices in the data analytics space.

  • Role

    Python Cloud ETL Engineer

  • Years of Experience

    20.25 years

Skillsets

  • Database Administration
  • ETL - 5 Years
  • Big Data - 1 Year
  • Apache Airflow
  • Apache Spark
  • AWS Cloud Services
  • Data Architecture
  • Data Governance
  • Data Modeling
  • PowerBI - 2 Years
  • MongoDB
  • MS SQL
  • PostgreSQL
  • Shell Scripting
  • Snowflake
  • Unix
  • Flask API
  • Data Profiling
  • PySpark - 1 Year
  • Python - 6 Years
  • MySQL - 3 Years
  • Data Engineering - 7 Years
  • SQL - 4 Years
  • SQL - 6 Years
  • Data Processing - 3 Years
  • Python - 4 Years
  • Cloud - 1 Year
  • AWS - 1 Year
  • Databricks - 1 Year
  • API - 3 Years
  • Airflow - 2 Years

Vetted For

9 Skills
  • Python Cloud ETL Engineer (Remote), AI Screening
  • 64%
  • Skills assessed: SFMC, Streamlit, API, AWS, ETL, JavaScript, Python, React Js, SQL
  • Score: 58/90

Professional Summary

20.25 Years
  • Apr, 2025 - Present 1 yr 1 month

    Databricks senior consultant

    Cognizant
  • Nov, 2024 - Mar, 2025 4 months

    Senior Data Engineer

  • Apr, 2024 - Oct, 2024 6 months

    Senior Software Engineer | AI Research and development

    Turing
  • Oct, 2017 - Jul, 2021 3 yr 9 months

    Data Engineer

    Freelancer (Upwork)
  • Jul, 2021 - Jun, 2022 11 months

    Senior Data Engineer

    Singular Intelligence
  • Jul, 2022 - Apr, 2024 1 yr 9 months

    Senior Data Engineer

    Bosch Global Software Technologies
  • Jun, 2013 - Sep, 2016 3 yr 3 months

    Manager - Projects

    Cognizant
  • Sep, 2004 - Jun, 2013 8 yr 9 months

    Programmer Analyst, Associate, Senior Associate

    Cognizant

Applications & Tools Known

  • MySQL
  • Python
  • Apache Airflow
  • PySpark
  • Snowflake
  • Microsoft Power BI
  • draw.io
  • Databricks
  • AWS
  • PostgreSQL
  • Azure Databricks
  • Flask
  • SQL
  • Git
  • Jenkins
  • AWS S3
  • AWS Redshift
  • Power BI
  • MongoDB
  • MSSQL
  • Shell scripting
  • Unix
  • Glue

Work History

20.25 Years

Databricks senior consultant

Cognizant
Apr, 2025 - Present 1 yr 1 month

Senior Data Engineer

Nov, 2024 - Mar, 2025 4 months

Senior Software Engineer | AI Research and development

Turing
Apr, 2024 - Oct, 2024 6 months
    • Collaborated with a leading AI research organization to enhance the capabilities of a state-of-the-art Large Language Model.
    • Gathered, cleaned, and annotated datasets suitable for training LLMs.
    • Focused on improving LLM performance through SFT, DPO, and RLHF techniques.
    • Validated the data quality of LLM responses to minimize bias and maximize LLM efficiency and accuracy.
    • Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats.
    • Analyzed the transformed data to uncover insights into customer usage patterns.
    • Optimized PySpark applications on Databricks to reduce significant utilization costs.
    • Used Python, PostgreSQL, MySQL, Apache Spark (PySpark), Databricks and AWS

Senior Data Engineer

Bosch Global Software Technologies
Jul, 2022 - Apr, 2024 1 yr 9 months
    • Developed an in-house Python ETL (Extract, Transform, Load) library for aggregating and analyzing extensive test logs from reliability test centers across diverse locations.
    • Integrated MySQL as the database system to efficiently store and manage the extracted data.
    • Implemented robust data extraction processes using Python, ensuring seamless data flow from different test center locations to the MySQL database.
    • Employed Git for version control, facilitating collaborative development and tracking changes in the codebase.
    • Applied data analysis techniques to derive valuable insights from the aggregated test logs.
    • Collaborated with the ECP Data Engineering team to streamline the data engineering processes.
    • Demonstrated proficiency in continuous integration and continuous deployment (CI/CD) practices for efficient development and deployment workflows.
    • Leveraged expertise in relational database management systems (RDBMS) to optimize data storage and retrieval.
    • Used Flask, SQL, Python, Git, MySQL, Data Analysis, CI/CD, RDBMS, Data Engineering, Apache Airflow.

Senior Data Engineer

Singular Intelligence
Jul, 2021 - Jun, 2022 11 months
    • Engineered a sales forecasting and analysis platform, Catman AI, tailored for consumer retail products.
    • Implemented predictive models considering factors such as mobility, GDP, product distribution, seasonality, and weather.
    • Developed Python-based ETL jobs for efficient feature engineering and to support AI/ML activities.
    • Established Flask API calls to seamlessly integrate the backend with Catman AI's frontend.
    • Leveraged Python and SQL for data manipulation and storage, ensuring a robust foundation for analytics.
    • Utilized AWS for scalable and reliable infrastructure, enhancing the platform's performance and capabilities.
    • Applied Git for version control, facilitating collaborative development and codebase management.
    • Demonstrated proficiency in Machine Learning techniques to enhance forecasting accuracy.
    • Used Python, Apache Airflow, SQL, AWS, Flask, Git

Data Engineer

Freelancer (Upwork)
Oct, 2017 - Jul, 2021 3 yr 9 months
    • Collaborated with clients to understand business needs for BI solutions and translated those needs into actionable reports in Power BI, saving 40 hours of manual effort every month.
    • Built an API integration between the SendGrid and HubSpot CRM applications to synchronize prospects' details for marketing.
    • Designed and developed 30+ workflows in Apache Airflow.
    • Performed data quality analysis of Power BI reports to check the accuracy of various metrics.
    • Developed a Python program that extracts data from multiple Excel files and from JSON API calls and stores the consolidated data in MySQL, reducing manual effort by 80%.
    • Used Python, SQL, Apache Airflow, MySQL, Snowflake, Databricks, AWS S3, and Data Modeling

Manager - Projects

Cognizant
Jun, 2013 - Sep, 2016 3 yr 3 months
    • Managed a team of ten performance test resources for a leading healthcare company in the US.
    • Performed end-to-end ETL/batch performance testing activities, including identifying and prioritizing business scope for batch testing, input file setup, and server monitoring using UNIX shell scripts.
    • Monitored and analyzed 100+ workflows/sessions using the Informatica Workflow Monitor tool.
    • Identified the root causes of 10+ long-running batch jobs and provided findings and recommendations.

Programmer Analyst, Associate, Senior Associate

Cognizant
Sep, 2004 - Jun, 2013 8 yr 9 months
    • Collaborated with the ETO performance testing team at a leading insurance company.
    • Worked on various complex projects to evaluate and identify performance and scalability issues.
    • Performed data complexity and production log analysis to derive production-like workload scenarios for load test execution.
    • Created and maintained test artifacts from requirement gathering to test report sign-off.
    • Conducted weekly meetings with project stakeholders to update the progress of projects.

Achievements

  • Received the Cost Innovation award for the idea of an AI/ML-based Data Insight platform in the XC-CT domain.
  • Maintained Top Rated Freelancer badge for more than one year.
  • Received Star Performer of the Quarter for quality of work and on-time delivery.
  • Cost Innovation award for AI/ML-based Data Insight platform development (2023)
  • Star Performer of the Quarter (2014) at Cognizant Technology Solutions

Testimonial

Bosch Global Software Technologies Private Ltd

Thank you for your contribution towards "Cost Innovation & Efficiency Campaign"

Major Projects

3 Projects

Data insight platform

Bosch Global Software Technologies Private Limited
Jul, 2022 - Present 3 yr 10 months
    • Data Insight is a web-based framework consisting of an ETL tool and AI/ML for building end-to-end data analytics solutions.
    • It ingests, processes, and loads data into the backend through batch processing and makes the processed data available to end users through a web analytics interface.
    • It is a scalable platform with full-text search and analytics capabilities, which makes log analysis and troubleshooting more efficient.

CATMAN AI

Singular Intelligence
Jul, 2021 - Jul, 2022 1 yr
    • Catman AI is an AI-augmented, always-on, shopper-centric category management tool.
    • It builds focused strategies for channels, products, regions, consumer segments, pricing, and promotion using consumer, market, competition, and environment factors.

Prospects Dashboard

MinIO (Upwork)
Sep, 2019 - May, 2021 1 yr 8 months
    • This project created an analytics dashboard for the MinIO sales team to analyze customer behavior, access patterns, session time, and visit count.
    • The daily, weekly, and monthly reports help the team identify potential new prospects and improve sales.

Education

  • Master of Science (Software Engineering)

    Birla Institute of Technology and Science - Pilani (2008)
  • Master of Science (Applied Mathematics)

    PSG College of Technology, Coimbatore (2004)
  • Bachelor of Science (Mathematics)

    NGM College, Pollachi (2002)

Certifications

  • Data Architect Nanodegree from Udacity

    (Dec, 2023)
  • Snowflake - The Complete Masterclass from Udemy

  • IBM Data Science Professional Certificate from Coursera

  • AWS Technical Essentials from Simplilearn

AI-interview Questions & Answers

Hi, good morning. My name is Ramesh. I have a total of 18 years of experience in the IT industry, with a dedicated focus on data engineering for the past 7 years. I started my data engineering journey with the MinIO data analytics team, where I learned and implemented end-to-end data pipelines using Python and Apache Airflow and created data visualization dashboards using Power BI. This was my first data engineering project. It helped me learn the modern data technology stack, and I implemented data modeling and performed data quality checks, among other data engineering activities. In 2021, I moved to the Singular Intelligence data engineering team, which was developing a product called CATMAN AI that forecasts sales of retail consumer products. We extracted data from different sources, such as COVID, mobility, weather, sales forecasting, and Nielsen data, integrated it, and provided a single dataset to the machine learning engineers for prediction. We implemented this solution on AWS using Amazon S3, managed Apache Airflow, and Redshift. I also developed APIs to support the CATMAN AI product. In 2022, I joined Bosch Global Software Technologies, where we are developing a product called the Data Insight platform. It provides aggregated and analyzed results of the logs generated by test centers located across the world. It is a completely in-house data solution hosted on premises. I'm good at implementing both OLTP and OLAP systems. I recently completed the Data Architect Nanodegree from Udacity, and I continue learning modern technology stacks and the best practices followed across the industry.

AWS Lambda is a serverless, scalable platform. You can write the data extraction, transformation, and loading logic as a Python script (or even use PySpark for the ETL code) and deploy it to AWS Lambda. It performs all of those activities, so you don't need to manage any servers or provision any infrastructure yourself; AWS manages that. It then loads the data into the designated target system, such as a data warehouse like Redshift or a database on AWS like MySQL or MariaDB.
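
As a rough illustration of the Lambda approach described above, the following Python sketch shows an S3-triggered handler that extracts CSV rows, transforms them, and hands them to a loader. The CSV columns (`order_id`, `amount`), the `read_object` and `load` parameters, and the filtering rule are assumptions made for this example, not details from the interview. A real `load` would write to the target system, such as Redshift or MySQL.

```python
import csv
import io
import json
from urllib.parse import unquote_plus


def _read_from_s3(bucket, key):
    """Default reader: fetch an object body from S3 as text."""
    import boto3  # imported lazily; bundled with the Lambda Python runtime

    obj = boto3.client("s3").get_object(Bucket=bucket, Key=key)
    return obj["Body"].read().decode("utf-8")


def extract(event, read_object):
    """Collect CSV rows from every object named in an S3 event."""
    rows = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        rows.extend(csv.DictReader(io.StringIO(read_object(bucket, key))))
    return rows


def transform(rows):
    """Hypothetical rule: keep rows with a positive amount, cast to float."""
    cleaned = []
    for row in rows:
        amount = float(row["amount"])
        if amount > 0:
            cleaned.append({"order_id": row["order_id"], "amount": amount})
    return cleaned


def lambda_handler(event, context, read_object=_read_from_s3, load=print):
    """Entry point; `load` is a stand-in for the warehouse writer."""
    rows = transform(extract(event, read_object))
    load(rows)
    return {"statusCode": 200, "body": json.dumps({"loaded": len(rows)})}
```

Passing the reader and loader as parameters keeps the handler testable locally without AWS credentials; Lambda itself only supplies `event` and `context`, so the defaults apply in production.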

For version control of Python ETL scripts, we can use Git or GitHub to store the scripts. We keep the dev branch or feature branches separate from the production line. Based on the business requirements and on the production and implementation branches available in the project or at the customer, we can set up the versioning. The ETL scripts themselves can also be stored in a GitHub repository.

To scale with the volume of data, there are multiple options. One is parallel processing: based on the volume, we can partition the incoming data and process the partitions in parallel to achieve the performance needed to complete the ETL process within the expected time. The second approach is to scale out the cluster: based on the volume of data being loaded, you can increase the number of servers in the cluster that does the processing and loads the data into the system.
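
The partition-and-process-in-parallel idea can be sketched in plain Python as follows. This is a minimal sketch using a thread pool, which suits I/O-bound loads; the row fields (`id`, `qty`, `price`) and the function names are hypothetical, and CPU-bound transformations would more likely use processes or a Spark cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n_parts):
    """Split rows into n_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(rows), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks


def process_chunk(chunk):
    """Hypothetical per-partition transformation."""
    return [{"id": r["id"], "total": r["qty"] * r["price"]} for r in chunk]


def run_parallel(rows, n_parts=4):
    """Process every partition concurrently and keep the input order."""
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        results = list(pool.map(process_chunk, partition(rows, n_parts)))
    return [row for chunk in results for row in chunk]
```

Because `pool.map` returns results in submission order, the combined output preserves the order of the input rows even though the chunks run concurrently.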

For processing large datasets, we can use SQL. In Python ETL scripts, if you process the data row by row, or do the transformations or enrichment row by row, it takes more time. To optimize that, we can load the data into a staging area and then use SQL techniques, like aggregation or window functions, so the transformation logic runs as a SQL query. After the transformation is completed, we can load the data into the designated system. That way, the overall ETL process improves compared with doing the entire transformation logic in Python ETL scripts. One more thing: reading and writing through Python scripts takes more network time to retrieve the data from the database and load it back. To avoid that, we can do the work directly in the database through SQL stored procedures and functions to improve the processing time.
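
A small sketch of the stage-then-transform-in-SQL pattern, using Python's built-in `sqlite3` module as a stand-in for the real database. The table names (`stage_sales`, `region_totals`) and columns are assumptions for illustration only.

```python
import sqlite3


def load_and_aggregate(rows):
    """Bulk-load rows into a staging table, then transform with one SQL statement."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE stage_sales (region TEXT, amount REAL)")
        conn.execute(
            "CREATE TABLE region_totals "
            "(region TEXT PRIMARY KEY, total REAL, orders INTEGER)"
        )
        # One bulk insert instead of per-row transformation in Python.
        conn.executemany("INSERT INTO stage_sales VALUES (?, ?)", rows)
        # Set-based transformation executed inside the database engine.
        conn.execute(
            """
            INSERT INTO region_totals (region, total, orders)
            SELECT region, SUM(amount), COUNT(*)
            FROM stage_sales
            GROUP BY region
            """
        )
        conn.commit()
        return conn.execute(
            "SELECT region, total, orders FROM region_totals ORDER BY region"
        ).fetchall()
    finally:
        conn.close()
```

The aggregation never leaves the database, so only the raw load and the final result cross the connection, which is the network saving the answer refers to.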

Among AWS services, for ETL processing I would use either Lambda, which is a serverless, scalable platform, or AWS Glue for the transformation. AWS Glue has multiple sub-services: you can use the AWS Glue Data Catalog to store metadata for data processing, and you can use Glue DataBrew for no-code ETL transformations. We can choose between them based on the business requirements and the complexity of the data pipeline.

Okay, I see a couple of issues with that code. The first is that it processes the items one by one. It reads the data from the API response, which is in JSON format, iterates over each item in the JSON, stores it in a list of transformed data, and then converts that list into a data frame. At a high level, I don't see any major issues with the current code, but we need to do some data validation. For example, what if the quantity is 0? We are still multiplying that value by the price. So some quality checks are needed before processing the data. We also need to check whether we are getting a valid response: what if the data we receive contains null values, or there is no response from the API request? So these are the two major things to check: validate the response after the request, and check the data quality during the transformation, before loading the data into the Python data frame.
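
The code under review is not part of this transcript, so the following is only a sketch of the two checks mentioned: validating the response shape first, then validating each item before computing quantity times price. The payload layout (an `items` list with `id`, `quantity`, and `price` fields) and the choice to reject non-positive quantities are assumptions.

```python
def _is_number(value):
    """True for int/float values, excluding booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_and_transform(payload):
    """Return (clean_rows, rejected_items) for a decoded JSON payload."""
    # Check 1: the response itself must be usable.
    if not isinstance(payload, dict) or not isinstance(payload.get("items"), list):
        raise ValueError("API response is missing or has no 'items' list")

    clean, rejected = [], []
    for item in payload["items"]:
        if not isinstance(item, dict):
            rejected.append(item)
            continue
        qty, price = item.get("quantity"), item.get("price")
        # Check 2: per-item data quality (nulls, wrong types, zero quantity).
        if not (_is_number(qty) and _is_number(price)) or qty <= 0 or price < 0:
            rejected.append(item)
            continue
        clean.append({"id": item.get("id"), "total": qty * price})
    return clean, rejected
```

The `clean` list can then be passed to `pandas.DataFrame(clean)`, while `rejected` can be logged for follow-up rather than silently dropped.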

Actually, in this SQL query, we are trying to use the window function in the WHERE clause, so it will not work; it will fail. The requirement is to find where the revenue is higher than the previous month's revenue. So what we have to do is compute the window function in a subquery (a nested query) that returns the current revenue and the previous revenue, and then compare them in the outer query so that it works. When you try to add the window function directly to the WHERE clause, it will not give the expected results.
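
The query from the interview is not included in the transcript, so this is a sketch of the described fix on a hypothetical `monthly_revenue(month, revenue)` table: `LAG` is computed in a subquery and the comparison happens in the outer `WHERE`. It runs on Python's built-in `sqlite3`, assuming the bundled SQLite is version 3.25 or later (the first release with window functions).

```python
import sqlite3

GROWTH_QUERY = """
SELECT month, revenue, prev_revenue
FROM (
    SELECT month,
           revenue,
           LAG(revenue) OVER (ORDER BY month) AS prev_revenue
    FROM monthly_revenue
) AS with_prev
WHERE revenue > prev_revenue
ORDER BY month
"""


def months_with_growth(rows):
    """Return months whose revenue exceeds the previous month's revenue."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE monthly_revenue (month TEXT, revenue REAL)")
        conn.executemany("INSERT INTO monthly_revenue VALUES (?, ?)", rows)
        return conn.execute(GROWTH_QUERY).fetchall()
    finally:
        conn.close()
```

The first month has no predecessor, so its `prev_revenue` is NULL and the comparison filters it out without any special handling.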

In Python-based ETL scripts, I would recommend having a JSON configuration file, stored in a restricted location, that holds all sensitive information like API usernames and passwords. Only the user executing the Python code has access to it, with sufficient privileges to read the JSON configuration file and process it. So, instead of storing sensitive information in the Python script, we can go with a JSON-based configuration and extract the values at runtime to run the Python ETL. That would be the right approach.
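
A minimal sketch of the restricted JSON configuration idea, assuming a POSIX file system. The function name `load_secrets` and the rule that the file must not be accessible to group or others are assumptions for illustration, not something stated in the interview.

```python
import json
import os
import stat


def load_secrets(path):
    """Load a JSON secrets file, refusing files readable by group or others."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        raise PermissionError(
            f"{path} has mode {oct(mode)}; expected owner-only access such as 0o600"
        )
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

An ETL script would call `load_secrets` once at startup and pass the values to its API client, so no credential appears in the script or in version control.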

I don't have much experience with React applications. I do have experience with Streamlit, where I built a couple of data visualizations to provide a dashboard for one of my clients.

As I mentioned earlier, I don't have much experience with React, so I'm not sure how to optimize web application load times. One of the best approaches is to keep it simple and not do more than necessary in the React application, since heavy logic adds client-side load and takes more time to process. I would recommend moving the business logic into the middleware and the API, which handle the business logic and the data extraction and get the data from the back end. We can use React only for the simple UI. That would help with faster responses and faster loading times in the React application.