
Rameshkumar Shanmugam

Vetted Talent

I manage, design, and implement data solutions for various business domains and use cases. I have over 18 years of IT industry experience, with a dedicated focus on data engineering for the past 7 years. My core competencies lie in data management activities, including data ingestion, integration, visualization, and data quality. I excel at implementing both OLTP and OLAP systems and have successfully led small to medium data analytics teams. I leverage my skills in AWS cloud computing to deliver scalable, reliable, and efficient data solutions. I hold a Data Architect Nanodegree from Udacity and multiple certifications, and I keep learning modern technologies and best practices in the data analytics space.

  • Role

    Python Cloud ETL Engineer

  • Years of Experience

    18.00 years

Skillsets

  • Database Administration
  • ETL - 5 Years
  • Big Data - 1 Year
  • Apache Airflow - 2 Years
  • Apache Spark
  • AWS Cloud Services
  • Data Architecture
  • Data Governance
  • Data Modeling
  • Power BI - 2 Years
  • MongoDB
  • MS SQL
  • PostgreSQL
  • Shell Scripting
  • Snowflake
  • Unix
  • Flask API
  • Data Profiling
  • PySpark - 1 Year
  • Python - 6 Years
  • MySQL - 3 Years
  • Data Engineering - 7 Years
  • SQL - 6 Years
  • Data Processing - 3 Years
  • Cloud - 1 Year
  • AWS - 1 Year
  • Databricks - 1 Year
  • API - 3 Years

Vetted For

9 Skills
  • Python Cloud ETL Engineer (Remote) - AI Screening
  • 64%
  • Skills assessed: SFMC, Streamlit, API, AWS, ETL, JavaScript, Python, React Js, SQL
  • Score: 58/90

Professional Summary

18.00 Years
  • Apr, 2024 - Present (1 yr 10 months)

    Senior Data Engineer

    Turing
  • Jul, 2022 - Present (3 yr 7 months)

    Senior Data Engineer

    Bosch Global Software Technology Private Ltd
  • Jul, 2021 - Jul, 2022 (1 yr)

    Data Engineer

    Singular Intelligence
  • Sep, 2017 - Jul, 2021 (3 yr 10 months)

    Data Engineer

    Freelancer (Upwork)
  • Jun, 2013 - Sep, 2016 (3 yr 3 months)

    Manager - Projects (Test Manager)

    Cognizant
  • Sep, 2004 - Jun, 2013 (8 yr 9 months)

    Programmer Analyst, Associate, Senior Associate

    Cognizant

Applications & Tools Known

  • MySQL
  • Python
  • Apache Airflow
  • PySpark
  • Snowflake
  • Microsoft Power BI
  • draw.io
  • Databricks
  • Azure Databricks
  • AWS
  • PostgreSQL
  • Flask
  • SQL
  • Git
  • Jenkins
  • AWS S3
  • AWS Redshift
  • MongoDB
  • MSSQL
  • Shell scripting
  • Unix
  • AWS Glue

Work History

18.00 Years

Senior Data Engineer

Turing
Apr, 2024 - Present (1 yr 10 months)
    • Collaborated with a leading AI research organization to enhance the capabilities of a state-of-the-art Large Language Model.
    • Gathered, cleaned, and annotated datasets suitable for training the LLM models.
    • Focused on improving LLM model performance through SFT, DPO, and RLHF techniques.
    • Validated the data quality of LLM model responses to minimize bias and maximize LLM efficiency and accuracy.
    • Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats.
    • Analyzed the transformed data to uncover insights into customer usage patterns.
    • Optimized PySpark applications on Databricks to significantly reduce utilization costs.
    • Used Python, PostgreSQL, MySQL, Apache Spark (PySpark), Databricks, and AWS.

Senior Data Engineer

Bosch Global Software Technology Private Ltd
Jul, 2022 - Present (3 yr 7 months)
    • Developed an in-house Python ETL (Extract, Transform, Load) library for aggregating and analyzing extensive test logs from reliability test centers across diverse locations.
    • Integrated MySQL as the database system to efficiently store and manage the extracted data.
    • Implemented robust data extraction processes using Python, ensuring seamless data flow from different test center locations to the MySQL database.
    • Employed Git for version control, facilitating collaborative development and tracking changes in the codebase.
    • Applied data analysis techniques to derive valuable insights from the aggregated test logs.
    • Collaborated with the ECP Data Engineering team to streamline the data engineering processes.
    • Demonstrated proficiency in continuous integration and continuous deployment (CI/CD) practices for efficient development and deployment workflows.
    • Leveraged expertise in relational database management systems (RDBMS) to optimize data storage and retrieval.
    • Used Flask, SQL, Python, Git, MySQL, Data Analysis, CI/CD, RDBMS, Data Engineering, Apache Airflow.

Data Engineer

Singular Intelligence
Jul, 2021 - Jul, 2022 (1 yr)
    • Engineered a sales forecasting and analysis platform, Catman AI, tailored for consumer retail products.
    • Implemented predictive models considering factors such as mobility, GDP, product distribution, seasonality, and weather.
    • Developed Python-based ETL jobs for efficient feature engineering and to support AI/ML activities.
    • Established Flask API calls to seamlessly integrate the backend with Catman AI's frontend.
    • Leveraged Python and SQL for data manipulation and storage, ensuring a robust foundation for analytics.
    • Utilized AWS for scalable and reliable infrastructure, enhancing the platform's performance and capabilities.
    • Applied Git for version control, facilitating collaborative development and codebase management.
    • Demonstrated proficiency in Machine Learning techniques to enhance forecasting accuracy.
    • Used Python, Apache Airflow, SQL, AWS, Flask, Git.

Data Engineer

Freelancer (Upwork)
Sep, 2017 - Jul, 2021 (3 yr 10 months)
    • Collaborated with clients to understand business needs for BI solutions and translated them into actionable Power BI reports, saving 40 hours of manual effort every month.
    • Integrated the SendGrid and HubSpot CRM applications via API to synchronize prospect details for marketing.
    • Designed and developed 30+ workflows in Apache Airflow.
    • Performed data quality analysis of Power BI reports to verify the accuracy of various metrics.
    • Developed a Python program to extract data from multiple Excel files and from JSON API calls and store the consolidated data in MySQL, reducing manual effort by 80%.
    • Prepared detailed reports interpreting website ranking based on the scoring model.
    • Used Python, SQL, Apache Airflow, MySQL, Snowflake, Databricks, AWS S3, and Data Modeling.

Manager - Projects (Test Manager)

Cognizant
Jun, 2013 - Sep, 2016 (3 yr 3 months)
    • Managed a team of ten performance test resources for a leading healthcare company in the US.
    • Performed end-to-end ETL/batch performance testing activities, including identifying and prioritizing the business scope for batch testing, input file setup, and server monitoring using UNIX shell scripts.
    • Monitored and analyzed 100+ workflows/sessions using the Informatica Workflow Monitor tool.
    • Identified the root cause of 10+ long-running batch jobs and provided findings and recommendations.

Programmer Analyst, Associate, Senior Associate

Cognizant
Sep, 2004 - Jun, 2013 (8 yr 9 months)
    • Collaborated with the ETO performance testing team at a leading insurance company.
    • Involved in various complex projects to evaluate and identify performance and scalability issues.
    • Performed data complexity and production log analysis to derive production-like workload scenarios for load test execution.
    • Created and maintained test artifacts from requirement gathering to test report sign-off.
    • Conducted weekly meetings with project stakeholders to update them on project progress.

Achievements

  • Received the Cost Innovation Award (2023) for the idea of an AI/ML-based Data Insight platform in the XC-CT domain.
  • Maintained the Top Rated Freelancer badge for more than one year.
  • Received Star Performer of the Quarter (2014) at Cognizant Technology Solutions for quality of work and on-time delivery.

Testimonial

Bosch Global software Technology private Ltd

Thank you for your contribution towards the "Cost Innovation & Efficiency Campaign".

Major Projects

3 Projects

Data insight platform

Bosch Global Software Technology Private Ltd
Jul, 2022 - Present (3 yr 7 months)
    • Data Insight is a web-based framework consisting of an ETL tool and AI/ML components for building end-to-end data analytics solutions.
    • It ingests, processes, and loads data into the backend through batch processing and makes the processed data available to end users through a web analytics interface.
    • It is a scalable platform with full-text search and analytics capabilities, enabling a more efficient log analysis and troubleshooting process.

CATMAN AI

Singular Intelligence
Jul, 2021 - Jul, 2022 (1 yr)
    • CATMAN AI is an AI-augmented, always-on, shopper-centric category management tool.
    • It builds focused strategies for channels, products, regions, consumer segments, pricing, and promotion using consumer, market, competition, and environmental factors.

Prospects Dashboard

MinIO (Upwork)
Sep, 2019 - May, 2021 (1 yr 8 months)
    • This project created an analytics dashboard for the MinIO sales team to analyze customer behavior, access patterns, session time, and visit count.
    • The daily, weekly, and monthly reports help them identify potential new prospects and improve sales.

Education

  • Master of Science (Software Engineering)

    Birla Institute of Technology and Science, Pilani (2008)
  • Master of Science (Applied Mathematics)

    PSG College of Technology, Coimbatore (2004)
  • Bachelor of Science (Mathematics)

    NGM College, Pollachi (2002)

Certifications

  • Data Architect Nanodegree, Udacity

    (Dec, 2023)
  • Snowflake - The Complete Masterclass, Udemy

  • IBM Data Science Professional Certificate, Coursera

  • AWS Technical Essentials, Simplilearn

AI-interview Questions & Answers

Hi, good morning. My full name is Ramesh. I have a total of 18 years of experience in the IT industry, with a dedicated focus on the data engineering domain for the past 7 years. I started my data engineering journey with the MinIO data analytics team, where I learned and implemented end-to-end data pipelines using Python and Apache Airflow and created data visualization dashboards using Power BI. That was my first data engineering project, and it helped me learn the modern data technology stack, data modeling, and data quality checks across the various activities in the data engineering domain. In 2021 I moved to the Singular Intelligence data engineering team, which develops a product called CATMAN AI that helps forecast sales of retail consumer products. As part of the data engineering team, I extracted data from different sources such as COVID, mobility, weather, and Nielsen sales data, integrated them, and provided a single dataset to the machine learning engineers to perform the predictions. We implemented this solution on AWS cloud services, using Amazon S3, EC2, Amazon Managed Workflows for Apache Airflow, and Redshift, and I developed the APIs that support the CATMAN AI product. Then in 2022 I joined Bosch Global Software, where we are developing a product called the Data Insight platform; it provides aggregated analysis results of the logs generated from test centers located across the world, and it is an in-house data solution hosted completely on premises. I am good at implementing both OLTP and OLAP systems. I recently completed the Data Architect Nanodegree from Udacity, and I keep learning modern technology stacks and the best practices followed across the data analytics space.

AWS Lambda is a serverless, scalable platform. You can deploy your data extraction, transformation, and loading logic as a Python script (or even use PySpark for the ETL code) and run it in the Lambda service. You don't need to manage any server or provision any infrastructure to do that activity; AWS manages it, and the function loads the data into the designated target system, whether that is a data warehouse like Redshift or a database such as MySQL or MariaDB.
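
As a rough sketch only (not taken from the original answer), a Lambda-based ETL step might look like the following; the S3 trigger event shape is standard, but the bucket layout, field names, and "processed/" prefix are hypothetical:

```python
import json
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Triggered by an S3 upload: read the new JSON object, transform it,
    # and write the result back under a hypothetical "processed/" prefix.
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    key = event["Records"][0]["s3"]["object"]["key"]

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    records = json.loads(body)

    # Transform: keep only rows with usable values and add a derived field.
    transformed = [
        {**r, "total": r["quantity"] * r["price"]}
        for r in records
        if r.get("quantity") and r.get("price")
    ]

    s3.put_object(
        Bucket=bucket,
        Key=f"processed/{key}",
        Body=json.dumps(transformed).encode("utf-8"),
    )
    return {"rows_processed": len(transformed)}
```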

For version control of Python ETL scripts, we can use Git, with a GitHub repository, to store and version the scripts. We can also separate branches, such as a dev branch, feature branches, and a production branch, and, based on the business requirements and what is currently live in production for the project or customer, we can manage versioning across those branches. The database (DDL) scripts can also be stored and versioned in the same GitHub repository.

In order to scale for larger data volumes, there are multiple ways to do it. One approach is parallel processing: based on the volume of data, we can partition the incoming data and process the partitions in parallel, so the ETL process completes within the expected time. The second approach is to increase the number of nodes in the cluster: when you do the data load, based on the volume of data, we can increase the number of servers available in the cluster that does the processing and loads the data into the target system.
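
A small PySpark sketch of that partitioning idea, offered as an assumption rather than a description of a specific project; the input path, column names, and partition count are hypothetical and would be tuned to the actual cluster:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("partitioned-etl").getOrCreate()

# Hypothetical source path; in a real job this would be an S3 prefix or a table.
orders = spark.read.parquet("s3://example-bucket/raw/orders/")

# Repartition so the transformation runs in parallel across the cluster;
# 200 is only an illustrative number and should match cluster capacity.
orders = orders.repartition(200, "order_date")

daily_revenue = (
    orders
    .withColumn("revenue", F.col("quantity") * F.col("price"))
    .groupBy("order_date")
    .agg(F.sum("revenue").alias("total_revenue"))
)

# Write the result partitioned by date so downstream reads stay parallel too.
daily_revenue.write.mode("overwrite").partitionBy("order_date").parquet(
    "s3://example-bucket/processed/daily_revenue/"
)
```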

For processing large datasets with Python ETL scripts, if you are processing the data row by row, doing the transformations or enrichment row by row, it will take more time. To optimize that, we can load the data into a staging table and then use SQL techniques such as aggregations and window functions, so the transformation logic is done through a SQL query. After the transformation is completed, we load the data into the designated target system. That way the overall ETL process improves compared with doing the entire transformation logic through Python ETL scripts. One more thing: when you do the reads and writes from a Python script, it takes more network time to retrieve the data from the database and then load it back again. To avoid that, we can do the work directly through stored procedures and SQL functions to improve the processing time.
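
A toy, self-contained illustration of that trade-off; it uses sqlite3 only so the example runs anywhere, and the table and columns are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stage_orders (id INTEGER PRIMARY KEY, quantity REAL, price REAL, total REAL)"
)
conn.executemany(
    "INSERT INTO stage_orders (id, quantity, price) VALUES (?, ?, ?)",
    [(i, i % 5 + 1, 9.99) for i in range(10_000)],
)

# Slow path: row-by-row enrichment, one round trip per row through Python.
rows = conn.execute("SELECT id, quantity, price FROM stage_orders").fetchall()
for row_id, qty, price in rows:
    conn.execute("UPDATE stage_orders SET total = ? WHERE id = ?", (qty * price, row_id))

# Fast path: the same transformation pushed down as a single set-based statement.
conn.execute("UPDATE stage_orders SET total = quantity * price")
conn.commit()
```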

In AWS, for ETL processing I would use either Lambda, which is a serverless, scalable platform, or AWS Glue for the transformations. Within AWS Glue there are multiple sub-services available: you can use the AWS Glue Data Catalog for storing metadata about the data being processed, or Glue DataBrew for no-code ETL transformations. We can choose among them based on the business requirements and the complexity of the data pipeline.
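
For illustration only, a minimal AWS Glue Spark job under assumed Data Catalog names (sales_db / raw_orders) and an assumed output bucket; the awsglue modules are available only inside the Glue job runtime:

```python
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Hypothetical Data Catalog database and table, e.g. created by a Glue crawler.
source = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders"
)

# Rename/cast columns, then write the result to a hypothetical processed bucket.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("order_id", "string", "order_id", "string"),
        ("qty", "long", "quantity", "long"),
        ("price", "double", "price", "double"),
    ],
)

glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/processed/orders/"},
    format="parquet",
)
job.commit()
```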

I would guess there are a couple of issues with that code. It processes the items one by one: it reads the data from the API response, which is in JSON format, iterates over each item in the JSON, stores it in a list during the transformation, and then converts the transformed data into a data frame. At a high level I don't see any major issues with the current code, but we do need some data validation. For example, what if the quantity is zero or missing while we are multiplying that value by the price? Some quality checks are needed before processing the data. We also need to check whether the response we receive is valid: what if the payload is null, or there is no response at all from the API request? So we need to validate the response first, and while doing the transformation we need to check the data quality as well. Those are the two main things to check before requesting and processing the data and before loading it into the data frame.
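
A hedged sketch of the validation being described, with a hypothetical endpoint and an assumed item schema (quantity and price fields):

```python
import requests
import pandas as pd

def fetch_and_transform(url: str) -> pd.DataFrame:
    # Validate the response itself before touching the payload.
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    items = response.json()
    if not items:
        raise ValueError("API returned an empty or null payload")

    transformed = []
    for item in items:
        quantity = item.get("quantity")
        price = item.get("price")
        # Data-quality check: skip rows with missing or non-positive values
        # instead of multiplying through bad data.
        if quantity is None or price is None or quantity <= 0 or price <= 0:
            continue
        transformed.append({**item, "total": quantity * price})

    return pd.DataFrame(transformed)

# Hypothetical usage:
# df = fetch_and_transform("https://api.example.com/orders")
```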

In this SQL query we are trying to use a window function directly in the WHERE clause, and that will not work; it will fail. The requirement is to find the rows where the revenue is higher than the previous month's revenue, so what we can do is compute the window function in a subquery or nested query, getting the current revenue and the previous revenue there, and then compare them in the outer query; that way it will work. If you add the window function directly to the WHERE clause, it will not give the expected results.
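
A small self-contained illustration of that fix, using sqlite3 and a made-up monthly_revenue table: the window function is computed in a CTE, and the comparison happens in the outer query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_revenue (month TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO monthly_revenue VALUES (?, ?)",
    [("2024-01", 100.0), ("2024-02", 120.0), ("2024-03", 90.0), ("2024-04", 150.0)],
)

# Invalid: window functions are not allowed in WHERE, so this raises an error.
#   SELECT month FROM monthly_revenue
#   WHERE revenue > LAG(revenue) OVER (ORDER BY month);

# Valid: compute LAG() in a CTE, then filter in the outer query.
query = """
WITH with_prev AS (
    SELECT month,
           revenue,
           LAG(revenue) OVER (ORDER BY month) AS prev_revenue
    FROM monthly_revenue
)
SELECT month, revenue, prev_revenue
FROM with_prev
WHERE prev_revenue IS NOT NULL AND revenue > prev_revenue
"""
print(conn.execute(query).fetchall())  # [('2024-02', 120.0, 100.0), ('2024-04', 150.0, 90.0)]
```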

For Python-based ETL scripts, I would recommend not storing sensitive information such as API keys, usernames, and passwords in the Python script itself. We can keep them in a JSON configuration file stored in a restricted location, where only the user that executes the Python ETL code base has sufficient privileges to read the file and make the database connections. Alternatively, we can use IAM roles or a secrets store. Either way, we store the sensitive information there and extract it at run time when the Python ETL runs. That would be the right approach.
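
A minimal sketch of that restricted-configuration approach, assuming a hypothetical file path and key names; a managed secrets store (for example AWS Secrets Manager) could be swapped in for the local file:

```python
import json
import os
import stat

# Hypothetical restricted location, e.g. owned by the ETL user with chmod 600.
CONFIG_PATH = "/etc/etl/secrets.json"

def load_secrets(path: str = CONFIG_PATH) -> dict:
    # Refuse to run if the file is readable by group or others.
    mode = os.stat(path).st_mode
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        raise PermissionError(f"{path} must only be readable by the ETL user")
    with open(path) as f:
        return json.load(f)

# Usage (assumed key names):
#   secrets = load_secrets()
#   conn = connect(user=secrets["db_user"], password=secrets["db_password"])
```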

I don't have much experience with React applications. I do have experience with Streamlit, where I have built a couple of data visualization dashboards for one of my clients.

Yes, as I mentioned earlier, I don't have much experience with React, so I'm not sure exactly how to optimize web application load times. One of the best approaches is to keep it simple: we don't want to put too much business logic in the React application, because that has more impact on the client side and takes more time to process. I would recommend moving the business logic into the middleware and the API, so it does all the business logic and data extraction and gets the data from the backend, and using React mainly for the simple UI design alone. That would help with faster responses and faster loading times in the React application.