I manage, design, and implement data solutions for various business domains and use cases. I have over 18 years of IT industry experience, with a dedicated focus on data engineering for the past 7 years. My core competencies lie in data management activities, including data ingestion, integration, visualization, and data quality. I excel in the implementation of both OLTP and OLAP systems and have successfully led small to medium data analytics teams. I leverage my skills in AWS cloud computing to deliver scalable, reliable, and efficient data solutions. I hold a Data Architect Nanodegree from Udacity and multiple certifications, and I keep learning modern technologies and best practices in the data analytics space.
Databricks senior consultant
Cognizant | Senior Data Engineer
Senior Software Engineer | AI Research and development
Turing | Data Engineer
Freelancer (Upwork) | Senior Data Engineer
Singular Intelligence | Senior Data Engineer
Bosch Global Software Technologies | Manager - Projects
Cognizant | Programmer Analyst, Associate, Senior Associate
Cognizant
MySQL

Python

Apache Airflow
PySpark
Snowflake

Microsoft Power BI

draw.io

Databricks

AWS

PostgreSQL

Azure Databricks
Flask

SQL

Git
Jenkins

AWS S3

AWS Redshift

Power BI

MongoDB

MSSQL

Shell scripting

Unix

Glue
Thank you for your contribution towards the "Cost Innovation & Efficiency Campaign"
Hi, good morning. My full name is Ramesh. I have a total of 18 years of experience in the IT industry, with a dedicated focus on data engineering for the past 7 years. I started my data engineering journey with the Minio data analytics team, where I learned and implemented end-to-end data pipelines using Python and Apache Airflow and created data visualization dashboards using Power BI. This was my first data engineering project. It helped me learn the modern data technology stack, and I implemented data modeling and performed data quality checks, among other data engineering activities. In 2021, I moved to the Singular Intelligence data engineering team, where they are developing a product called CATMAN AI. It helps forecast retail consumer products. As part of the data engineering team, we extracted data from different sources such as COVID, mobility, weather, sales forecasting, and Nielsen data, integrated it, and provided a single dataset to the machine learning engineers for prediction. We implemented this solution on AWS using Amazon S3, managed Apache Airflow, and Redshift. I also developed APIs to support the CATMAN AI product. In 2022, I joined Bosch Global Software Technologies, where we are developing a product called Data Insight Platform. It provides aggregated and analyzed results from the logs generated by various towers located across the world. It's a completely in-house data solution, hosted entirely on premises. I'm good at implementing both OLTP and OLAP systems. I recently completed the Data Architect Nanodegree from Udacity, and I continue to learn modern technology stacks and the best practices followed across the industry.
AWS Lambda is a serverless, scalable platform. You can write the data extraction, transformation, and loading logic as a Python script, or even use PySpark for the ETL code, and deploy it to AWS Lambda. You don't need to manage any servers or provision any infrastructure for it; AWS manages that itself. The function then loads the data into the designated target system, such as a data warehouse like Redshift or a database like MySQL or MariaDB.
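To make the Lambda answer above concrete, here is a minimal sketch of what such a function could look like. The `lambda_handler(event, context)` entry point is the standard shape for a Python Lambda function; everything else (the `records` key in the event, the field names, and the `load` placeholder) is an assumption for illustration and not taken from the interview.

```python
import json


def transform(record):
    """Normalize one raw record (field names are hypothetical)."""
    return {
        "id": int(record["id"]),
        "name": record["name"].strip().title(),
        "amount": float(record["amount"]),
    }


def load(rows):
    """Placeholder for the load step.

    A real function would write the rows to the target system
    (for example Redshift or MySQL). Here it only returns the row count.
    """
    return len(rows)


def lambda_handler(event, context):
    """Entry point that AWS Lambda calls with the event payload and a context object."""
    raw_records = event.get("records", [])           # extract (assumed event shape)
    rows = [transform(r) for r in raw_records]       # transform
    loaded = load(rows)                              # load
    return {"statusCode": 200, "body": json.dumps({"loaded": loaded})}
```

The handler can be exercised locally by calling `lambda_handler({"records": [...]}, None)`, which is a convenient way to test the transformation logic before deploying.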
For version control of Python ETL scripts, we can use Git with GitHub to store the scripts. We use branches to separate the dev branch or feature branches from the production line. The branching setup depends on the business requirements and on the production and implementation branches that the project or customer maintains, and we version the scripts accordingly. We can also store the DDL scripts in the GitHub repository.
To scale with the volume of data, there are multiple options. One is parallel processing: based on the volume, we can partition the incoming data and process the partitions in parallel, so the ETL process completes within the expected time. The second approach is to scale out the cluster: as the data load grows, you can increase the number of servers in the cluster that processes and loads the data into the system.
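The partition-and-parallelize idea described above can be sketched with the standard library. This is a minimal illustration under stated assumptions: `process_partition` is a hypothetical stand-in for the real transformation, and a thread pool is used to keep the example simple. For CPU-bound transformations, a process pool or a distributed engine such as Spark would be the more likely choice.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, num_partitions):
    """Split rows into at most num_partitions contiguous chunks of similar size."""
    size = max(1, -(-len(rows) // num_partitions))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def process_partition(chunk):
    """Hypothetical per-partition transform: double each value."""
    return [value * 2 for value in chunk]


def run_parallel(rows, num_partitions=4):
    """Process the partitions concurrently and return results in input order."""
    chunks = partition(rows, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        # Executor.map yields results in the same order as the input chunks.
        results = list(pool.map(process_partition, chunks))
    return [value for chunk in results for value in chunk]
```

Increasing `num_partitions` here plays the same role as adding workers or nodes in a cluster: more chunks can be in flight at the same time.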
For processing large datasets, we can use SQL alongside the Python ETL scripts. If you process the data row by row in Python, doing the transformations or enrichment one row at a time, it takes more time. To optimize that, we can load the data into a staging area and then use SQL techniques such as aggregation or window functions, so the transformation logic runs as a SQL query. After the transformation is complete, we load the data into the designated target system. That way, the overall ETL process is faster than doing the entire transformation logic in Python ETL scripts. One more thing: reading and writing through Python scripts adds network time to retrieve the data from the database and load it back. To avoid that, we can do the work directly in the database through SQL stored procedures and functions, which improves the processing time.
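The staging-then-SQL pattern from the answer above can be illustrated as follows. This is only a sketch: it uses an in-memory SQLite database as a stand-in for the real warehouse, and the table names `stg_sales` and `sales_summary` and their columns are hypothetical.

```python
import sqlite3


def load_and_aggregate(rows):
    """Bulk-load raw rows into a staging table, then transform them with one SQL statement."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stg_sales (region TEXT, amount REAL)")
    conn.execute(
        "CREATE TABLE sales_summary "
        "(region TEXT, total_amount REAL, order_count INTEGER)"
    )

    # Step 1: load the raw data into the stage in one batch.
    conn.executemany("INSERT INTO stg_sales VALUES (?, ?)", rows)

    # Step 2: set-based transformation inside the database,
    # instead of looping over the rows in Python.
    conn.execute(
        """
        INSERT INTO sales_summary
        SELECT region, SUM(amount), COUNT(*)
        FROM stg_sales
        GROUP BY region
        """
    )

    result = conn.execute(
        "SELECT region, total_amount, order_count "
        "FROM sales_summary ORDER BY region"
    ).fetchall()
    conn.close()
    return result
```

The point of the design is that the rows never travel back to Python for the transformation; only the final summary is read out.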
Among AWS services, for ETL processing I would use either Lambda, which is a serverless, scalable platform, or AWS Glue for the transformations. AWS Glue includes multiple sub-services: you can use the AWS Glue Data Catalog to store metadata for data processing, and AWS Glue DataBrew for no-code ETL transformations. We can choose between them based on the business requirements and the complexity of the data pipeline.
Okay, I see a couple of issues with that code. The first is that it processes the items one by one. It reads the data from the API response, which is in JSON format, iterates over each item in the JSON and stores it in a list, the transformed data, and then converts that list into a data frame. At a high level, I don't see any major issues with the current code, but we need some data validation. For example, what if the quantity is 0? We are still multiplying that value by the price. So some data quality checks are needed before processing. We also need to check whether we are getting a valid response: what if the data we receive contains null values, or the API request returns no response at all? So these are the two major things to check: validate the response before processing it, and check the data quality during the transformation, before loading the data into the data frame.
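The code under review is not part of this transcript, so the following is only a sketch of the two checks mentioned above, written in plain Python. The `quantity` and `price` field names come from the answer; the function name, the `total` field, and the choice to reject a zero quantity are assumptions.

```python
def validate_items(items):
    """Split API items into valid rows and rejected rows with a reason.

    Raises ValueError when the response is empty or missing.
    """
    if not items:
        raise ValueError("empty or missing API response")

    valid, rejected = [], []
    for item in items:
        quantity = item.get("quantity")
        price = item.get("price")
        if quantity is None or price is None:
            rejected.append((item, "missing quantity or price"))
        elif quantity <= 0 or price < 0:
            # Assumption: a zero quantity is treated as a data quality failure.
            rejected.append((item, "non-positive quantity or negative price"))
        else:
            valid.append({**item, "total": quantity * price})
    return valid, rejected
```

Only the `valid` list would then be converted into a data frame, while `rejected` can be logged or written to a quarantine location for follow-up.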
In this SQL query, the window function is used directly in the WHERE clause, so it will not work; the query will fail. The requirement is to find the rows where the revenue is higher than the previous month's revenue. So what you have to do is compute the window function first: use a subquery (a nested query) to get the current revenue and the previous month's revenue, and then compare them in the outer query. That will work. When you put the window function directly in the WHERE clause, it will not give the expected results.
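The original query is not shown in the transcript, so this is a sketch of the fix described above, run against an in-memory SQLite database (window functions need SQLite 3.25 or newer). The `monthly_revenue` table and its columns are hypothetical, and a common table expression is used here as an equivalent form of the nested query.

```python
import sqlite3

# The window function is computed in the inner query; the outer query filters on its result.
QUERY = """
WITH with_previous AS (
    SELECT month,
           revenue,
           LAG(revenue) OVER (ORDER BY month) AS prev_revenue
    FROM monthly_revenue
)
SELECT month, revenue, prev_revenue
FROM with_previous
WHERE revenue > prev_revenue
ORDER BY month
"""


def months_with_growth(rows):
    """Return the months whose revenue is higher than the previous month's revenue."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE monthly_revenue (month TEXT, revenue INTEGER)")
    conn.executemany("INSERT INTO monthly_revenue VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result
```

The first month has no previous row, so its `prev_revenue` is NULL and the comparison excludes it without any special handling.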
In Python-based ETL scripts, I would recommend a JSON configuration file that stores all sensitive information, like API usernames and passwords, in a restricted location. Only the user executing the Python code has access to it, with sufficient privileges to read and process the JSON configuration file. So instead of storing sensitive information in the Python script, we keep it in the JSON-based configuration and read it at runtime when the Python ETL runs. That would be the right approach.
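A minimal sketch of the approach described above, assuming a POSIX system where "restricted location" means a file readable only by its owner. The key names `api_username` and `api_password` and the function name `load_secrets` are hypothetical.

```python
import json
import os
import stat

REQUIRED_KEYS = ("api_username", "api_password")  # hypothetical key names


def load_secrets(path):
    """Read credentials from a JSON file that only its owner may access."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & 0o077:
        # Any group/other permission bit means the file is not restricted enough.
        raise PermissionError(
            f"{path} is accessible by group or others (mode {oct(mode)})"
        )

    with open(path, encoding="utf-8") as handle:
        config = json.load(handle)

    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"missing config keys: {missing}")
    return config
```

The ETL script would call `load_secrets` once at startup and pass the values to the API client, so no credentials appear in the script or in version control.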
I don't have much experience with React applications. I do have experience with Streamlit, where I have built a couple of data visualizations to provide a dashboard to one of my clients.
Yes, as I mentioned earlier, I don't have much experience with React, so I'm not sure how best to optimize web application load times. One good approach is to keep it simple and not do more than necessary in the React application, because heavy logic there increases the client-side load and takes more time to process. I would recommend moving the business logic into the middleware and the API, which handle the business logic and the data extraction and get the data from the back end. React is then used mainly for the UI alone. That would help with faster responses and faster load times in the React application.