I manage, design, and implement data solutions for various business domains and use cases. I have over 18 years of IT industry experience, with a dedicated focus on data engineering for the past 7 years. My core competencies lie in data management activities, including data ingestion, integration, visualization, and data quality. I excel in the implementation of both OLTP and OLAP systems and have successfully led small to medium data analytics teams. I leverage my skills in AWS cloud computing to deliver scalable, reliable, and efficient data solutions. I hold a Data Architect Nanodegree from Udacity and multiple certifications, and I keep learning modern technologies and best practices in the data analytics space.
Senior Data Engineer at Turing
Senior Data Engineer at Bosch Global Software Technologies Private Ltd
Data Engineer at Singular Intelligence
Test Manager at Cognizant
Data Engineer at Freelancer (Upwork)
Data Engineer at Freelancer (Upwork)
Manager - Projects at Cognizant
Programmer Analyst, Associate, Senior Associate at Cognizant
MySQL
Python
Apache Airflow
PySpark
Snowflake
Microsoft Power BI
draw.io
Databricks
AWS
PostgreSQL
Azure Databricks
Flask
SQL
Git
Jenkins
AWS S3
AWS Redshift
Power BI
MongoDB
MSSQL
Shell scripting
Unix
Glue
Thank you for your contribution towards the "Cost Innovation & Efficiency Campaign"
Hi, good morning. My full name is Ramesh. I have a total of 18 years of experience in the IT industry, with a dedicated focus on the data engineering domain for the past 7 years. I started my data engineering journey with the Minio data analytics team, where I learned and implemented end-to-end data pipelines using Python and Apache Airflow and created data visualization dashboards using Power BI. That was my first data engineering project, and it helped me learn the modern data technology stack, data modeling, data quality checks, and the various other activities in the data engineering domain. In 2021, I moved to the Singular Intelligence data engineering team, which was developing a product called CATMAN AI that helps forecast sales of retail consumer products. As part of the data engineering team, I extracted data from different sources such as COVID, mobility, weather, and Nielsen sales data, integrated them, and provided a single dataset to the machine learning engineers to run their predictions. We implemented this solution on AWS, using Amazon S3, EC2, Amazon Managed Workflows for Apache Airflow, and Redshift, and I also developed the APIs that support the CATMAN AI product. Then, in 2022, I joined Bosch Global Software Technologies Private Limited, where we are developing a product called Data Insight Platform; it provides aggregated analysis results of the logs generated from various towers located across the world, and it is an in-house data solution hosted completely on premises. I am good at implementing both OLTP and OLAP systems, I recently completed the Data Architect Nanodegree from Udacity, and I keep learning modern technology stacks and the best practices followed across the data analytics space.
AWS Lambda is a serverless, scalable compute platform. You can write all the data extraction, transformation, and loading logic as a Python script, or even use PySpark for the ETL code, and deploy it as a Lambda function in AWS. Lambda then performs all of those activities without you having to manage a server or provision any infrastructure for the job; AWS manages that for you, and the function loads the data into the designated target system, whether that is a data warehouse like Redshift or a database such as MySQL or MariaDB.
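To make that concrete, here is a minimal sketch of the kind of Lambda handler I am describing; the API endpoint, table name, and environment variables are hypothetical rather than from a real project:

```python
# Hypothetical sketch: a Lambda handler that extracts JSON records from an API,
# applies a small transformation, and loads them into Redshift via the Data API.
import json
import os
import urllib.request

import boto3

redshift = boto3.client("redshift-data")

def handler(event, context):
    # Extract: pull a batch of records from a (hypothetical) REST endpoint.
    with urllib.request.urlopen(os.environ["SOURCE_API_URL"]) as resp:
        records = json.loads(resp.read())

    # Transform: keep only the fields we need and derive a total per record.
    rows = [
        (r["order_id"], r["quantity"], r["price"], r["quantity"] * r["price"])
        for r in records
    ]

    # Load: issue one parameterized INSERT per row through the Redshift Data API
    # (a real pipeline would stage to S3 and COPY for large volumes).
    for order_id, qty, price, total in rows:
        redshift.execute_statement(
            WorkgroupName=os.environ["REDSHIFT_WORKGROUP"],
            Database=os.environ["REDSHIFT_DB"],
            Sql="INSERT INTO orders (order_id, quantity, price, total) "
                "VALUES (:id, :qty, :price, :total)",
            Parameters=[
                {"name": "id", "value": str(order_id)},
                {"name": "qty", "value": str(qty)},
                {"name": "price", "value": str(price)},
                {"name": "total", "value": str(total)},
            ],
        )
    return {"rows_loaded": len(rows)}
```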
For version controlling Python ETL scripts, we can use Git, with GitHub as the remote repository where the scripts are stored. On top of that, we can separate the work into branches, such as a dev branch, feature branches, and the production branch, and, depending on the business requirements and on what is currently live in production for that project or customer, we can manage the versions accordingly. The DDL and DML scripts can also be kept under version control in the same GitHub repository.
Okay, in order to scale with the volume of data, there are multiple ways to do it. One solution is parallel processing: based on the data volume, we can partition the incoming data and process the partitions in parallel, so that the ETL process still completes within the expected time. The second approach is to increase the number of nodes in the cluster: when you do the data load, based on the volume, we can increase the number of servers available in that cluster to do the processing and load the data into the target system.
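As a minimal sketch of the partitioning idea, assuming a Spark cluster is available and using made-up S3 paths and column names:

```python
# Hypothetical PySpark sketch: partition incoming data so the transformation
# and the write both run in parallel across the cluster.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("scaled-etl").getOrCreate()

# Read the raw data (path and schema are illustrative).
orders = spark.read.parquet("s3://example-bucket/raw/orders/")

# Repartition by a well-distributed key so each executor gets a balanced slice.
orders = orders.repartition(200, "order_date")

# The transformation runs partition-by-partition in parallel.
daily_revenue = (
    orders
    .withColumn("revenue", F.col("quantity") * F.col("price"))
    .groupBy("order_date")
    .agg(F.sum("revenue").alias("total_revenue"))
)

# Partitioned write, again parallel across the cluster.
(daily_revenue.write
    .mode("overwrite")
    .partitionBy("order_date")
    .parquet("s3://example-bucket/curated/daily_revenue/"))
```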
Okay, for processing large datasets: if the Python ETL script is processing the data row by row, doing the transformations or enrichment record by record, it will take much more time. To optimize that, we can load the data into a staging table and then use SQL techniques such as aggregations or window functions to apply the transformation logic as set-based SQL queries; once the transformation is complete, we load the data into the designated target system. Done that way, the overall ETL process improves considerably compared with doing the entire transformation logic through the Python ETL script. One more thing: when you do the reads and writes through the Python script, it takes extra network time to retrieve the data from the database and then load it back again. To avoid that, we can run the transformation directly in the database through stored procedures and SQL functions, which further improves the processing time.
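A small self-contained sketch of that contrast, using SQLite purely as a stand-in database with made-up table and column names: the first function enriches rows one by one in Python, the second pushes the same logic down as a single set-based SQL statement.

```python
# Hypothetical sketch contrasting row-by-row processing in Python with a
# set-based SQL transformation. SQLite is used only as a stand-in database.
import sqlite3

def enrich_row_by_row(conn: sqlite3.Connection) -> None:
    # Slow pattern: fetch every row, compute in Python, write back one by one.
    cur = conn.execute("SELECT order_id, quantity, price FROM staging_orders")
    for order_id, quantity, price in cur.fetchall():
        conn.execute(
            "INSERT INTO orders_enriched (order_id, revenue) VALUES (?, ?)",
            (order_id, quantity * price),
        )
    conn.commit()

def enrich_set_based(conn: sqlite3.Connection) -> None:
    # Faster pattern: one SQL statement does the same work inside the database.
    conn.execute(
        """
        INSERT INTO orders_enriched (order_id, revenue)
        SELECT order_id, quantity * price
        FROM staging_orders
        """
    )
    conn.commit()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE staging_orders (order_id INT, quantity INT, price REAL)")
    conn.execute("CREATE TABLE orders_enriched (order_id INT, revenue REAL)")
    conn.executemany(
        "INSERT INTO staging_orders VALUES (?, ?, ?)",
        [(1, 2, 9.5), (2, 1, 20.0), (3, 4, 3.25)],
    )
    enrich_set_based(conn)
    print(conn.execute("SELECT * FROM orders_enriched").fetchall())
```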
Okay, in AWS, for ETL processing I would use either Lambda, which is a serverless, scalable platform, or AWS Glue for the transformations. Within AWS Glue there are multiple sub-services available: you can use the Glue Data Catalog to store the metadata used during data processing, or Glue DataBrew to build no-code ETL transformations. We can choose either one of them based on the business requirements and the complexity of the data pipeline.
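For illustration, a minimal Glue job script might look like the following; it is meant to run inside Glue rather than locally, and the catalog database, table, and S3 path are hypothetical:

```python
# Hypothetical AWS Glue job sketch: reads a table registered in the Glue Data
# Catalog, renames/casts a few columns, and writes the result to S3 as Parquet.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Source table discovered via the Glue Data Catalog (names are made up).
raw = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders"
)

# Simple declarative transformation: select, rename, and cast columns.
mapped = ApplyMapping.apply(
    frame=raw,
    mappings=[
        ("order_id", "string", "order_id", "string"),
        ("qty", "long", "quantity", "long"),
        ("unit_price", "double", "price", "double"),
    ],
)

# Write the curated output back to S3 in Parquet format.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/curated/orders/"},
    format="parquet",
)
job.commit()
```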
Okay, I see a couple of possible issues with that code. It processes the items one by one: it reads the data from the API response, which is in JSON format, iterates over each item in the JSON, appends it to a list in the transform step, and then converts that list into a DataFrame. At a high level I don't see any major issues with the current code, but there are two things I would add. First, data validation: for example, if the quantity is zero or missing and we multiply it by the price, we should check that field before processing. Second, we need to verify that the API response itself is valid; what if the response is null or empty? So we should validate the response first and apply data quality checks during the transformation. Those are the two main checks I would add before requesting and processing the data and loading it into the DataFrame.
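Roughly, the checks I have in mind look like this; it is only a sketch with assumed field names such as quantity and price, since the exact schema comes from the original code:

```python
# Hypothetical sketch of the two checks: validate the API response, then
# validate each record before deriving total = quantity * price.
import pandas as pd
import requests

def extract(url: str) -> list[dict]:
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()          # fail fast on HTTP errors
    data = resp.json()
    if not data:                     # empty or null response
        raise ValueError("API returned no data")
    return data

def transform(items: list[dict]) -> pd.DataFrame:
    rows = []
    for item in items:
        quantity = item.get("quantity")
        price = item.get("price")
        # Data quality check: skip records with missing or invalid values.
        if quantity is None or price is None or quantity <= 0 or price < 0:
            continue
        rows.append({"id": item.get("id"),
                     "quantity": quantity,
                     "price": price,
                     "total": quantity * price})
    return pd.DataFrame(rows)

if __name__ == "__main__":
    df = transform([{"id": 1, "quantity": 2, "price": 5.0},
                    {"id": 2, "quantity": 0, "price": 5.0},    # filtered out
                    {"id": 3, "quantity": 1, "price": None}])  # filtered out
    print(df)
```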
In this SQL query, we are trying to use a window function directly in the WHERE clause, and that will not work; the query will fail, because window functions are evaluated after the WHERE clause. The requirement is to return the months where the revenue is higher than the previous month's revenue, so what we can do is compute the window function in a subquery or nested query, getting the current revenue and the previous month's revenue, and then compare the two in the outer query so that it works. Adding the window function directly to the WHERE clause will not give the expected results.
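A self-contained sketch of that fix, using SQLite as a stand-in engine and an assumed monthly_revenue table: the window function moves into a CTE and the comparison happens in the outer query.

```python
# Hypothetical sketch: compute LAG() in a CTE, then filter in the outer query.
# A window function placed directly in WHERE would raise an error instead.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_revenue (month TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO monthly_revenue VALUES (?, ?)",
    [("2024-01", 100.0), ("2024-02", 120.0), ("2024-03", 90.0), ("2024-04", 150.0)],
)

query = """
WITH with_prev AS (
    SELECT
        month,
        revenue,
        LAG(revenue) OVER (ORDER BY month) AS prev_revenue
    FROM monthly_revenue
)
SELECT month, revenue, prev_revenue
FROM with_prev
WHERE prev_revenue IS NOT NULL
  AND revenue > prev_revenue
"""

for row in conn.execute(query):
    print(row)   # ('2024-02', 120.0, 100.0) and ('2024-04', 150.0, 90.0)
```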
For Python-based ETL scripts, I would recommend not storing sensitive information such as API keys, usernames, and passwords in the Python script itself. Instead, we can keep them in a JSON configuration file stored in a restricted location, where only the user that executes the Python ETL code has sufficient privileges to read the file and make the database connections. Alternatively, we can use AWS IAM. Either way, the sensitive values live outside the code and are retrieved at run time when the Python ETL runs. That would be the right approach.
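As a small sketch of the JSON-configuration option, with a hypothetical file path and permission check:

```python
# Hypothetical sketch: load DB credentials from a restricted JSON config file
# at run time instead of hard-coding them in the ETL script.
import json
import os
import stat

CONFIG_PATH = "/etc/etl/secrets.json"   # example location, readable only by the ETL user

def load_config(path: str = CONFIG_PATH) -> dict:
    mode = os.stat(path).st_mode
    # Refuse to run if the file is readable by group or others.
    if mode & (stat.S_IRGRP | stat.S_IROTH):
        raise PermissionError(f"{path} must be readable only by the ETL user")
    with open(path) as fh:
        return json.load(fh)

if __name__ == "__main__":
    cfg = load_config()
    # Credentials are only held in memory for the duration of the run, e.g.:
    # conn = psycopg2.connect(host=cfg["db_host"], user=cfg["db_user"],
    #                         password=cfg["db_password"], dbname=cfg["db_name"])
    print("loaded keys:", sorted(cfg))
```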
I don't have much experience with React applications. I do have experience with Streamlit, where I have built a couple of data visualization dashboards for one of my clients.
Yes, as I mentioned earlier, I don't have much experience with React, so I'm not sure of every way to optimize web application load times. But one good approach is to keep it simple: we don't want to put too much business logic into the React application itself, because that increases the load on the client side and takes more time to process. I would recommend moving the business logic into the middleware, behind an API, so it handles the business logic and the data extraction and fetches the data from the back end, and React is used mainly for the simple UI rendering alone. That would help achieve a faster response and faster loading times in the React application.
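As a rough sketch of that split, with Flask standing in for the middleware/API layer and a made-up aggregation: the heavy work happens server-side, and the React client only fetches and renders the JSON.

```python
# Hypothetical sketch: keep the business logic in a Flask API so the React
# client only fetches pre-aggregated JSON and renders it.
from flask import Flask, jsonify

app = Flask(__name__)

# Stand-in for data pulled from the back-end database or warehouse.
ORDERS = [
    {"region": "EU", "revenue": 120.0},
    {"region": "EU", "revenue": 80.0},
    {"region": "US", "revenue": 200.0},
]

@app.route("/api/revenue-by-region")
def revenue_by_region():
    # All aggregation happens server-side; the client just renders the result.
    totals: dict[str, float] = {}
    for order in ORDERS:
        totals[order["region"]] = totals.get(order["region"], 0.0) + order["revenue"]
    return jsonify(totals)

if __name__ == "__main__":
    app.run(port=5000)
```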