Vetted Talent

Raj Gottipati

Five years of total experience: four years as a GCP Data Engineer and one year as an Ab Initio Developer, specializing in developing and maintaining data warehouse applications. Worked on migration projects in both the banking and retail domains. Involved in creating, managing, and orchestrating data pipelines.

  • Role

    Senior Software Engineer

  • Years of Experience

    5 years

Skillsets

  • Data Processing
  • Big Data
  • Data Pipeline
  • Data Governance
  • Cloud Technologies
  • Performance Tuning
  • Workflow Management
  • Batch Processing
  • Data Serialization

Vetted For

9 Skills
  • Python Cloud ETL Engineer (Remote), AI Screening
  • 57%
  • Skills assessed: SFMC, Streamlit, API, AWS, ETL, JavaScript, Python, React Js, SQL
  • Score: 51/90

Professional Summary

5 Years
  • Aug, 2023 - Jan, 2024 (5 months)

    Senior Software Engineer

    Best Buy
  • Mar, 2022 - Aug, 2023 (1 yr 5 months)

    Data Engineer

    ETSY
  • Sep, 2021 - Mar, 2022 (6 months)

    SR Software Engineer

    NedBank
  • Apr, 2018 - May, 2020 (2 yr 1 month)

    Software Engineer

    KOHL'S
  • Apr, 2021 - Oct, 2021 (6 months)

    Ab Initio Developer

    Discover

Applications & Tools Known

  • BigQuery
  • Cloud Composer
  • Pub/Sub
  • Kafka
  • GitHub
  • Python
  • Tableau
  • HDFS
  • Spark

Work History

5 Years

Senior Software Engineer

Best Buy
Aug, 2023 - Jan, 2024 5 months

  • Planned and executed the migration of data warehouses from Teradata to Google Cloud Platform (GCP) using the Qlik Replicate tool, ensuring minimal downtime and data integrity.
  • Managed ETL pipelines using Qlik Replicate for real-time data integration and batch processing, ensuring data consistency and availability across platforms.
  • Performed extensive performance tuning of BigQuery to optimize query execution times and reduce processing costs, leveraging GCP's data processing capabilities.
  • Collaborated closely with cross-functional teams, including data analysts, data scientists, and business stakeholders, to ensure the migration aligned with organizational goals and data needs.
  • Utilized Google Cloud Composer for workflow management, scheduling, and orchestration of data pipeline jobs, enhancing operational efficiency.

Data Engineer

ETSY
Mar, 2022 - Aug, 2023 (1 yr 5 months)

  • Gathered requirements from the business team, designed the data model, developed the design document, and implemented the ETL pipeline.
  • Responsible for creating GCS buckets, datasets, and BigQuery tables in the different layers of the BigQuery projects.
  • Implemented Kafka for real-time data streaming to improve system efficiency.
  • Developed data pipelines to read data from sources and load it into BigQuery as part of the GCP migration process.
  • Used GitHub for source code management (SCM).

SR Software Engineer

NedBank
Sep, 2021 - Mar, 2022 6 months

  • Involved in the design and implementation of extract, transform, and load processes using Ab Initio as the ETL tool.
  • Performed data onboarding in M-Hub, taking tables into the reservoir through to successful runs in all environments.
  • Experience with the Ab Initio tool suite, such as Express>It, Control Center, and IBM Scheduler; able to work on UNIX scripting (medium complexity).
  • Hands-on development experience with various Ab Initio components such as Rollup, Scan, Join, Partition by Key, Partition by Round-robin, Gather, and Merge.
  • Used file management commands such as m_ls, m_wc, m_mkfs, m_cp, and m_dump extensively to operate on multifiles.

AB Initio Developer

Discover
Apr, 2021 - Oct, 2021 6 months

  • Involved in this project, which aims to provide permanent fixes for frequent production issues and to understand the job flows and their interdependencies.
  • Extracted data from various sources, such as flat files and DB2, and loaded it into the staging area.
  • Analyzed production issues in detail to identify their exact cause.
  • Prepared design and testing documents to ensure that the technical logic implemented in the various graphs is clearly understood by everyone.
  • Performed data insertion and deletion in SF from the backend, as well as in Amazon S3 buckets.
  • Prepared reports and visuals in Tableau for analytics on a daily basis.

Software Engineer

KOHL'S
Apr, 2018 - May, 2020 (2 yr 1 month)

  • Developed DDL scripts to load data into BigQuery from GCS buckets.
  • Developed producer scripts to read CSV files from source systems and transfer the data to Kafka.
  • Involved in Kafka and PySpark integration for real-time data processing and to enable advanced analytics.
  • Experience in handling JSON, CSV, and pipe-delimited files.
  • Developed audit tables for reconciliation and metadata tracking.

Education

  • Master's: Information Systems

    STRATFORD UNIVERSITY

Certifications

  • Databricks

    Databricks (Dec, 2023)

Interests

  • Long Rides
  • Watching Movies
  • Chess
  • Cricket
  • Listening to Music

AI-interview Questions & Answers

    Hi. I have a total of 5 years of experience in data engineering, of which 4 years have been completely in GCP. Along with the migration work, I have very hands-on experience with Python, SQL, BigQuery, and Cloud Composer, and a bit of Dataflow. These are the GCP services I've had a chance to use. I've worked on 4 different projects in total: 2 of them in the banking domain and 2 of them in the retail domain. The first is Kohl's, the second is Discover, the third is Nedbank, from South Africa, and the fourth is Etsy, which is an ecommerce website. So let me start with my most recent project, which is Etsy, like I mentioned. The complete operations are run from the USA itself. They have 2 different departments, D2C (direct to the customer) and marketplace. For marketplace, we used to have all the customer-related data, artist data, and third-party delivery service data. All this data comes combined in the form of CSV files from the upstream people. They push the data to my local drive, and from my local drive I push the data to GCS before processing it further in BigQuery. I look at processing, transformations, and validations, and I apply all types of transformations based upon the requirements. After that, I generate the views and hand them out to the downstream people, like the ML and AI people. For this whole process, we use Cloud Composer for orchestration. So this is one project. Another project I was involved in targeted GCP migration, where we are using a migration tool, Replicate, which sits between the two systems. There is a change data capture feature available in the tool, so whatever data is updated in the table, along with the historical data and the updated data, we can automatically replicate the same tables into BigQuery in GCP completely. So this is about my last couple of projects. In this project, we mainly focus on the fact that if the table is huge, let's suppose 15 GB or more, we try to divide the table into small chunks and load it in different runs, not a single run. That way, we can make the complete process faster compared to sending one 15 GB table at a time. Apart from this, I have very good knowledge of Dataflow and Pub/Sub as well. Dataflow is for ETL, and Pub/Sub is a messaging tool that we use in the pipelines to get a message whenever the data is available in a particular bucket or location, so that we can check it with the various operators. That tells us when the data lands in the particular bucket and what to expect on a daily basis for a particular run, whether it should run a couple of times per day or more than that, for scheduling purposes. And apart from this, I have a little bit of knowledge of Dataproc, and also Data Fusion and Dataprep. So yeah, that's all about my recent experience and projects. Thank you.

    So strategy to handle exceptions in Python while loading the data into a SQL database. Usually, I know, I dealt with a lot of files, including CSV and JSON files. These files, when I'm trying to read them and perform transformations, I use a Python script. However, in the scenario where I used to write SQL queries initially before loading into the SQL database, I'm trying to write SQL queries into the SQL operator itself. That is one way I'm trying to make it, but the strategies for exception handling involve various scenarios. Other than this, there is no exception handling. It seems like we'll go with try-except blocks. That is one scenario we can use. Like, try-except blocks are also one thing. And specific exception handling, we need to choose based upon the SQL database. Either it is a PostgreSQL database or completely using a SQL connector, such as MySQL, if I'm not wrong. So this is the one we need to make when we are writing the Python query. Like, we need to initially import the necessary libraries and everything. And transaction management is also one more thing. We need to make sure it's for data integrity and everything, in case if there's any errors occurring in the data while we are doing, like, pushing data from I mean, pushing data into the SQL database. And through logging also, we can able to check the logs for all the errors and everything like us. Because we can able to see all the errors through the logs itself, mostly, in most scenarios, like, what are the errors we are able to get, like, warnings and, like, what are the incoming messages, like, informational messages. So these kind of things. And also, like, you know, cleanup actions is the one more way that can we can able to make it, like, with the statement wise. So if you want to say an example, like, we can use, like, try. We can start the statement with the try, and we need to define the database operators instead of that statement, particularly. 
Like, you know, with the handle expression, that is the expression we need to give it. And finally, like, you know, we need to close the connection as well, like, database connection, whatever the connection we opened it previously. So that is the one way. I have, like, you know, these are the ways so far, you know. And, yeah, exception handling.
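
The strategy described above (try-except blocks, database-specific exceptions, transaction management, logging, and closing the connection at the end) might look roughly like the sketch below. It is only an illustration: it uses the standard-library `sqlite3` module as a stand-in for PostgreSQL or MySQL, and the `orders` table and the `load_rows` function are hypothetical names, not taken from any project mentioned here.

```python
import logging
import sqlite3

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("loader")


def load_rows(db_path, rows):
    """Insert rows in one transaction; roll back the whole batch on any failure."""
    conn = sqlite3.connect(db_path)
    try:
        # Hypothetical target table, created here so the sketch is self-contained.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS orders "
            "(id INTEGER PRIMARY KEY, amount REAL NOT NULL)"
        )
        conn.commit()
        # The connection context manager commits on success and rolls back on error.
        with conn:
            conn.executemany("INSERT INTO orders (id, amount) VALUES (?, ?)", rows)
        logger.info("Loaded %d rows", len(rows))
        return True
    except sqlite3.IntegrityError as exc:
        # Specific handler: duplicate keys, NOT NULL violations, and similar.
        logger.error("Constraint violation, batch rolled back: %s", exc)
        return False
    except sqlite3.Error as exc:
        # Broader handler for any other database error.
        logger.error("Database error, batch rolled back: %s", exc)
        return False
    finally:
        # Cleanup action: always release the connection.
        conn.close()
```

With another database, the same shape applies, but the exception classes would come from that database's driver instead of `sqlite3`.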

    So data integrity, performing transformations in Python, it will process. So data integrity, basically, let's assume you have a need to integrate different kinds of files. It might be a CSV file, like multiple files. But initially, we're taking an example. I'm using URLs so that I can fetch the data. And after that, I download the data into the form of a CSV file so that I can read that completely. And after that, I can perform validations on top of that file. Integrity techniques, like performing transformations, in my scenario, we can check with version control. That's one way. And also, we need to test the framework to write unit tests and everything, including test cases, and doing unit tests and everything for the transformations to fix bugs. So before we get an effect and complete whatever we're reading in Python. And after that, we can cut the complete process for the transformations we've done previously. And we need to debug the train transformations if there are any errors or something happened in that particular log. And for validation purposes, we're able to check inside the library, like before the data was transformed. So these kinds of things we can check. And also, for lineage, the main thing we need to make it clearly, for historical data to check from the origin itself. So this is where we can understand how data was transformed completely from one point to another. And data governance plays a key role, in my opinion. We need to see where the policies work, and also, we can see the framework process, and also we can do this kind of thing.

    When we request APIs from Python, we usually do it in this kind of format. In my scenario, we communicate with other serverless architecture domains, and we use Python and Spark code for transformations. A page number, an offset, and a limit are some ways to deal with API pagination. Token-based pagination is also one thing we need to consider when dealing with APIs. With the basic approach, we write code that handles the scenario with a page number and a limit. We can write the Python code from scratch using requests, fetch a particular page by its page number, and read the response code for that page, all in Python itself. After that, we need to handle the rate limit by importing time and backoff in Python, along with exception handling, so that we're able to set the maximum wait time, which is 60 seconds. For efficiency, we can consider parallel requests and adjusting the page size as well. As I mentioned previously, token-based optimization, when we are dealing with APIs, So, this is also
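
A minimal sketch of the page-number-and-limit approach with rate-limit handling, under stated assumptions: `get_page` is a hypothetical callable that returns a status code and a list of items (in practice it could wrap a `requests.get` call), HTTP 429 is assumed to signal rate limiting, and the 60-second cap follows the maximum wait mentioned above.

```python
import time


def fetch_all(get_page, limit=100, max_retries=5, max_wait=60.0, sleep=time.sleep):
    """Collect items page by page, backing off when the server returns HTTP 429.

    get_page(page, limit) is a hypothetical callable returning (status_code, items).
    """
    items, page = [], 1
    while True:
        for attempt in range(max_retries):
            status, batch = get_page(page, limit)
            if status != 429:
                break
            # Exponential backoff, capped at max_wait seconds.
            sleep(min(2 ** attempt, max_wait))
        else:
            raise RuntimeError(f"page {page}: still rate limited after {max_retries} tries")
        if status != 200:
            raise RuntimeError(f"page {page}: unexpected status {status}")
        items.extend(batch)
        if len(batch) < limit:  # a short page means nothing is left to fetch
            return items
        page += 1
```

Passing the page fetcher and the `sleep` function in as parameters keeps the loop independent of any particular HTTP library and makes it easy to exercise without a network.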

    Asynchronous programming. So, basically, if we go step by step, it's about switching tasks during waiting times, which is what we get when using async and await in Python to write asynchronous code. For that, we need to set up an async environment. We need to make sure that we're using Python version 3.7 or higher, since this code is supported only on Python 3.7 or higher versions. Along with the HTTP client libraries, we need aiohttp, which we need to install. The libraries we need to import are both of them, asyncio and aiohttp. Then we need to define the fetch function asynchronously. After that, we need to assemble the tasks so that we run them completely concurrently. We need to execute the total end-to-end ETL process with all the listed URLs and endpoints, and we need to call the main function for that. So, these are the scenarios we need to keep in mind before processing the data. To process the data, we need to follow a few considerations here. One is the rate limit, another is concurrency versus parallelism, and also error handling. So, these are the things that, in my opinion, we need to consider.
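
The steps in this answer (an async fetch function, a set of tasks run concurrently, and a main function started from the top level) can be sketched as below. This is an assumption-laden illustration: the network call is simulated with `asyncio.sleep` so the example runs without an HTTP client installed, and the semaphore is one possible way to respect a rate limit.

```python
import asyncio


async def fetch(url, semaphore):
    """Stand-in for an HTTP call; a real version would await an async HTTP client."""
    async with semaphore:          # cap the number of in-flight requests
        await asyncio.sleep(0.01)  # simulated network wait
        return {"url": url, "status": 200}


async def main(urls, max_concurrency=5):
    semaphore = asyncio.Semaphore(max_concurrency)
    tasks = [fetch(url, semaphore) for url in urls]
    # return_exceptions=True returns failures as values instead of raising the first one.
    return await asyncio.gather(*tasks, return_exceptions=True)


def run(urls):
    # asyncio.run requires Python 3.7 or higher.
    return asyncio.run(main(urls))
```

`asyncio.gather` returns results in the same order as the input tasks, so each result can be matched back to its URL.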

    We need to either go with multiple ETL tools like a data flow. This kind of tool allows us to go with it. For modern architecture, if we consider, we need to go with separate details. Any kind of tool we can choose, apart from data flow, particularly from GCP services, like data storage, these kinds of tools. And also, we need to go with a microservices approach as well for large-scale systems since you mentioned for large-scale systems for huge data volumes in the extraction, transformation, and loading architecture. So for parallel processing, we send usually we go with asynchronous programming, replacing synchronous programming and concurrency for handling input/output like a base, I mean, bounded tasks by calling the APIs and for efficiency. With growing CPU-bound tasks and parallelism. These kinds of mechanisms, like you know, for scalable infrastructure, we go with cloud services, like Google Cloud or AWS or Azure. So we can choose it from AWS Lambda or Google Cloud Functions for continuous scheduling. And apart from this, like you know, for containerization, we go with either Kubernetes. So that is mostly used for containerization these days apart from Docker containers. So we can also use Docker containers as well for orchestration along with Kubernetes. Like you know, it's up to completely depending upon our situation and efficient data processing as well. Like you know, for both batch processing and stream processing, for large datasets, like you know, we need to implement dividing the data into chunks and reducing the memory usage. Let's so that we can process more data, and we can improve the processing speed for batch processing. And stream processing, like you know, we need to take frameworks along with Apache Kafka and also Apache Flink, and AWS kinds of ones to process the data. Database and optimization will go with Cassandra. Most SQL databases use it. MongoDB for, because these databases, obviously, they don't follow the DBMS. Right? 
So to access a particular pattern, basic files. And also, we get multiple options like scalable, distributed data, I mean, distributed databases because these are distributed databases. So that we can easily implement partitioning and also indexing strategies for optimizing localities. And for caching the data through data lake, we use various mechanisms to reduce the database load so that we can maximize the performance of the database to improve the response time as well. So that the database is not continuously running on the same query again and again for a long run. And also for data lakes, for fully architecture, data lake, we can utilize it for storing a vast amount of data compared to the data warehouse. We're able to store raw data before we are doing the transformation and everything so that we can perform on the data.
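
The point about dividing large datasets into chunks to reduce memory usage could be sketched like this. The `chunked` helper is a hypothetical name; it is a generic illustration rather than code from any system described above.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield lists of at most `size` items so only one chunk is held in memory."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk
```

Each chunk can then be transformed and loaded on its own, for example `for batch in chunked(rows, 10_000): load(batch)`, where `load` is whatever loading step the pipeline uses.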

    The API response is directly calling the JSON. So, on the response, a request dot get exactly, without checking if the request was successful. If the API request calls due to a network issue, I'm assuming it might be the server returning a request as well, like, maybe some kind of incident issue in the status code. It might be another issue here. We need to consider it because, I'm assuming this is a one scenario where the code will not work according to them. I'm assuming the response structure. Since the API response is in the form of dictionaries, the transformed data can be seen with ID, item, and everything, quantity, and price. The API response is actually different or missing one item particularly. Then, a list comprehension will raise a key error. So, that's another thing I observed from here. And, data transformation logic, it may be either quantity or price missing or not a numeric data type, like, an item value. Item quantity here, we can see data type or maybe it will be the string or null. Maybe this is a type error here. So, I'm assuming. It's a safe approach to make it initially ID along with that item in the square brackets. You can declare it. Initially, we need to declare the API response code request like this before that. And where we can call the value with the float minimum with float of item dot get with quantity comma 0. And after that, again, we can get the float along with that same item dot get. So, that we can rectify this code and make it successful. And, like a data frame by using the data frame from the dictionary, we can solve this one. So, yeah. I think these are the changes I observed and how we can able to.
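
A defensive version of the transformation discussed in this answer might look like the sketch below. The original snippet under review is not shown in the transcript, so the field names (`id`, `quantity`, `price`) are taken from the answer and the helper names are assumptions. The status check on the response (for example `response.raise_for_status()` with requests) would happen before this step.

```python
def to_float(value, default=0.0):
    """Convert to float, falling back to a default for None or non-numeric values."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return default


def transform(items):
    """Build id/total rows, tolerating missing or badly typed fields."""
    rows = []
    for item in items:
        if "id" not in item:
            continue  # skip records that cannot be identified
        quantity = to_float(item.get("quantity"))
        price = to_float(item.get("price"))
        rows.append({"id": item["id"], "total": quantity * price})
    return rows
```

Using `item.get(...)` avoids the `KeyError` from a missing field, and the float conversion avoids the `TypeError` when a quantity or price arrives as a string or null.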

    We decided to send a batch of messages to an AWS Lambda function, as shown below. There appears to be an oversight that could lead to errors or unexpected behavior. So the potential issue might be syntax errors or type errors. Initially, I'm thinking about messages_to_send, which should be a list in its entirety. Here we have the id and the message with the f-string message of i, but with no for loop over a range of 10; presumably this should be a list comprehension. Also, the function parameter in the Lambda client invoke call ends with mismatched punctuation, I mean in the braces, with the quotes at the start and at the end. So I think that is the missed one, after messages_to_send. That is what I observed. We need to fix the code block initialization. Assuming the indentation was correct, we create a list of dictionaries. If I'm not wrong, the correct initialization should look like messages_to_send, where each entry has an id and a message built with an f-string, with the variable i declared over a range of 10. Here, there is also incorrect payload formatting. We need to correct this by importing the json module along with boto3. Inside the loop, we need to set response equal to lambda_client.invoke, with the function name of the process_message function. Along with that, the invocation type is 'Event', and we need to pass the payload by using the json.dumps method, encoded as utf-8. And then, like that, we can able to
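
The corrections described in this answer could be put together roughly as follows. This is a sketch, not the original snippet, which is not shown in the transcript. The function name `process_message_function` is a hypothetical placeholder, and `lambda_client` is assumed to be a boto3 Lambda client (for example `boto3.client("lambda")`), passed in as a parameter so the logic can be tried without AWS access. Note that boto3 expects the invocation type to be spelled `"Event"`.

```python
import json


def build_messages(count=10):
    """Build the batch with a list comprehension over range(count)."""
    return [{"id": i, "message": f"message {i}"} for i in range(count)]


def send_batch(lambda_client, messages, function_name="process_message_function"):
    """Invoke the Lambda function asynchronously, once per message."""
    responses = []
    for message in messages:
        response = lambda_client.invoke(
            FunctionName=function_name,   # hypothetical function name
            InvocationType="Event",       # asynchronous invocation
            Payload=json.dumps(message).encode("utf-8"),  # JSON bytes, not a dict
        )
        responses.append(response)
    return responses
```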

    An effective way to debug a Python application that's experiencing performance issues during complex SQL data transformations is to profile the Python application. First we need to reproduce the issue, and then we go with code profiling. After that, we need to analyze the SQL queries and how they were run. When we optimize queries directly, we start by reading the queries to see how efficient they are: whether they use a lot of joins or unnecessary statements, along with the optimization techniques they used. Based on that optimization, we need to look at data access patterns as well. When we are dealing with batch processing, we can go with caching, and with lazy loading if we are using an ORM or dealing with relationships in the data. Resource allocation is one more thing we need to make sure of: whether it is Dataproc, any Python environment, or a database, we need to have sufficient resources in terms of CPU and memory, and we need to configure them properly for the particular workload. Similarly, for application resources, we need enough resources to handle that particular application when we deploy the code, considering whether it uses multi-threading and whether its tasks are CPU-bound or I/O-bound. We also need to continuously monitor and test the complete environment by using tools like Grafana or Datadog, so that we can make sure everything stays on track. And we need to perform load testing as well. By load testing the application, we can check real-world usage patterns and find the performance issues the application is hitting.

    How Streamlit works. We integrated React components into Streamlit. Streamlit works like with the React components. Streamlit. Click on the React components. Python application. For the integration, initially, we need to set up a development environment, and we need to create a new Streamlit component by initializing and installing necessary dependencies, such as `npx react-scripts install`. And we need to create a Streamlit link wrapper as well. To integrate this complete process, we need to set up a Python package so that we can write the wrapper. I mean, and the Python model itself, we need to import Streamlit components to declare particular locations where it can interact with React components, frontend assets, and other components. After that, we can build on the React component inside of that environment. So that we can use the component in Streamlit. For the integration, I think we need to distribute that component to iterate through the integration process.