
Raj Gottipati

Vetted Talent

Total of 5 years of experience as a GCP Data Engineer (4 years) and Ab Initio Developer (1 year), specializing in developing and maintaining data warehouse applications. Worked on migration projects in both the banking and retail domains, including creating, managing, and orchestrating data pipelines.

  • Role

    Senior Software Engineer

  • Years of Experience

    5 years

  • Professional Portfolio

    View here

Skillsets

  • Data Processing
  • Big Data
  • Data Pipeline
  • Data Governance
  • Cloud Technologies
  • Performance Tuning
  • Workflow Management
  • Batch Processing
  • Data Serialization

Vetted For

9 Skills
  • Python Cloud ETL Engineer (Remote) AI Screening: 57%
  • Skills assessed: SFMC, Streamlit, API, AWS, ETL, JavaScript, Python, React Js, SQL
  • Score: 51/90

Professional Summary

5 Years
  • Aug, 2023 - Jan, 2024 5 months

    Senior Software Engineer

    Best Buy
  • Mar, 2022 - Aug, 2023 1 yr 5 months

    Data Engineer

    ETSY
  • Sep, 2021 - Mar, 2022 6 months

    Sr. Software Engineer

    NedBank
  • Apr, 2021 - Oct, 2021 6 months

    AB Initio Developer

    Discover
  • Apr, 2018 - May, 2020 2 yr 1 month

    Software Engineer

    KOHL'S

Applications & Tools Known

  • BigQuery
  • Cloud Composer
  • Pub/Sub
  • Kafka
  • GitHub
  • Python
  • Tableau
  • HDFS
  • Spark

Work History

5 Years

Senior Software Engineer

Best Buy
Aug, 2023 - Jan, 2024 5 months

  • Planned and executed the migration of data warehouses from Teradata to Google Cloud Platform (GCP) using the Qlik Replicate tool, ensuring minimal downtime and data integrity.
  • Managed ETL pipelines using Qlik Replicate for real-time data integration and batch processing, ensuring data consistency and availability across platforms.
  • Performed extensive performance tuning of BigQuery to optimize query execution times and reduce processing costs, leveraging GCP's data processing capabilities.
  • Collaborated closely with cross-functional teams, including data analysts, data scientists, and business stakeholders, to ensure the migration aligned with organizational goals and data needs.
  • Utilized Google Cloud Composer for workflow management, scheduling, and orchestration of data pipeline jobs, enhancing operational efficiency.

Data Engineer

ETSY
Mar, 2022 - Aug, 2023 1 yr 5 months

  • Gathered requirements from the business team, designed the data model, developed the design document, and implemented the ETL pipeline.
  • Responsible for creating the GCS buckets, datasets, and BigQuery tables in different layers of the BQ projects.
  • Implemented Kafka for real-time data streaming to improve system efficiency.
  • Developed data pipelines to read data from sources and load it into BigQuery as part of the GCP migration process.
  • Used GitHub for source code management (SCM).

Sr. Software Engineer

NedBank
Sep, 2021 - Mar, 2022 6 months

  • Involved in the design and implementation of extract, transform, and load processes using Ab Initio as the ETL tool.
  • Performed data onboarding in M-Hub, taking tables through the reservoir to successful runs in all environments.
  • Experience with the Ab Initio tool suite, including Express>It, Control Center, and IBM Scheduler; able to work on UNIX scripting (medium complexity).
  • Hands-on development experience with various Ab Initio components such as Rollup, Scan, Join, Partition by Key, Partition by Round Robin, Gather, and Merge.
  • Used multifile management commands such as m_ls, m_wc, m_mkfs, m_cp, and m_dump extensively to operate on multifiles.

AB Initio Developer

Discover
Apr, 2021 - Oct, 2021 6 months

  • Involved in a project aimed at providing permanent fixes for frequent production issues and at understanding job flows and their interrelated dependencies.
  • Extracted data from various sources such as flat files and DB2 and loaded it into the staging area.
  • Analyzed production issues in detail to understand the exact root cause.
  • Prepared various documents, such as design and testing documents, to ensure the technical logic implemented in the graphs is clearly understood by everyone.
  • Performed data insertion and deletion in SF from the backend as well as in the Amazon S3 bucket.
  • Prepared reports and visuals in Tableau for analytics purposes on a daily basis.

Software Engineer

KOHL'S
Apr, 2018 - May, 2020 2 yr 1 month

  • Developed DDL scripts to load data into BigQuery from GCS buckets.
  • Developed producer scripts to read CSV files from source systems and transfer data to Kafka.
  • Involved in Kafka and PySpark integration for real-time data processing and to enable advanced analytics.
  • Experience in handling JSON, CSV, and pipe-delimited files.
  • Developed audit tables for reconciliation and metadata tracking purposes.

Education

  • Master's: Information Systems

    STRATFORD UNIVERSITY

Certifications

  • Databricks

    Databricks (Dec, 2023)

Interests

  • Long Rides
  • Watching Movies
  • Chess
  • Cricket
  • Listening to Music

AI-Interview Questions & Answers

    Hi. I have a total of 5 years of experience in data engineering, of which 4 years are completely on GCP, including migration work. Apart from that, I am very hands-on with Python, SQL, BigQuery, and Cloud Composer, and I have also used a bit of Dataflow; these are the GCP services I have had a chance to use. I have worked on four different projects in total: two fully in the banking domain and two in the retail domain. The first was Kohl's, the second Discover, the third Nedbank from South Africa, and the fourth Etsy, which is an e-commerce website.

    Let me start with my most recent project, Etsy, whose operations run entirely from the USA. They have two departments, D2C (direct to customer) and marketplace, and we handled all the customer-related data: artist data, customer data, and third-party delivery service data. The upstream teams combined this data into CSV files and pushed it to my local drive, and from there I pushed it to GCS before processing it further into BigQuery. I took care of the processing, transformations, and validations, applied whatever transformations the requirements called for, and then generated views and handed them off to downstream teams such as the ML and AI people. The whole process was orchestrated with Cloud Composer.

    Another project I was involved in was targeted at a GCP migration, where we used Qlik Replicate as the migration tool sitting between the two platforms. It has a change data capture feature, so whatever data is inserted or updated in a source table is automatically replicated to the corresponding BigQuery table in GCP. In that project we mainly focused on large tables: if a table was huge, say 15 GB or more, we divided it into smaller chunks and ran them as separate loads rather than a single run, which made the complete process faster than sending the whole 15 GB table at once.

    Apart from this, I have very good knowledge of Dataflow and Pub/Sub. Dataflow is an ETL service, and Pub/Sub is a messaging service we use in pipelines to get a notification whenever data lands in a particular bucket or location, so we can check it with various operators and decide on scheduling, for example whether a particular run should happen a couple of times per day or more. I also have a little knowledge of Dataproc, Data Fusion, and Dataprep. That's all about my recent experience and projects. Thank you.
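
As a rough sketch of the chunked-load idea described above (in the project itself the splitting was handled by Qlik Replicate, so this is only an illustration; the dataset, table, and column names are made up):

```python
# Hypothetical illustration of migrating a large table in date-range chunks
# instead of one 15+ GB run. Table and column names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

SOURCE = "staging_dataset.orders_landing"
TARGET = "warehouse_dataset.orders"
BOUNDARIES = ["2023-06-01", "2023-07-01", "2023-08-01", "2023-09-01"]

# Copy one month at a time; each chunk runs as its own BigQuery job.
for start, end in zip(BOUNDARIES[:-1], BOUNDARIES[1:]):
    sql = f"""
        INSERT INTO `{TARGET}`
        SELECT * FROM `{SOURCE}`
        WHERE order_date >= DATE('{start}') AND order_date < DATE('{end}')
    """
    job = client.query(sql)
    job.result()  # wait for this chunk before starting the next
    print(f"Loaded chunk {start} -> {end}")
```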

    On strategies for handling exceptions in Python while loading data into a SQL database: so far I have dealt with a lot of files, mostly CSV and JSON. I read those files and perform the transformations with a Python script, often inside the DAG itself, and I write the SQL queries in a SQL operator before loading into the SQL database; that is one way I set it up. For exception handling itself, the main strategy is try/except blocks. We also need specific exception handling chosen based on the SQL database, for example a Postgres driver or a MySQL connector, which means importing the necessary libraries when we write the Python code. Transaction management is another thing we need to get right, for data integrity, in case any errors occur while pushing data into the SQL database. Logging matters too: we can check the logs for errors, warnings, and informational messages, since in most scenarios that is where we see what went wrong. Finally, cleanup actions are one more mechanism, for example with a with statement: we start with try, put the database operations inside it, handle the exception in the except block, and in finally we close the database connection we opened earlier. These are the approaches I have used so far for exception handling.
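
A minimal sketch of the try/except, transaction-management, logging, and cleanup pattern described above, using sqlite3 from the standard library as a stand-in for a Postgres or MySQL driver (the table and CSV columns are made-up examples):

```python
import csv
import logging
import sqlite3

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def load_csv_to_sql(csv_path: str, db_path: str = "example.db") -> None:
    conn = None
    try:
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER, amount REAL)")
        with open(csv_path, newline="") as f:
            rows = [(int(r["id"]), float(r["amount"])) for r in csv.DictReader(f)]
        conn.executemany("INSERT INTO orders (id, amount) VALUES (?, ?)", rows)
        conn.commit()                      # transaction management: commit only on success
        logger.info("Loaded %d rows", len(rows))
    except (ValueError, KeyError) as exc:  # bad or missing values in the file
        logger.error("Data error: %s", exc)
        if conn:
            conn.rollback()                # keep the table consistent
    except sqlite3.DatabaseError as exc:   # driver-specific errors
        logger.error("Database error: %s", exc)
        if conn:
            conn.rollback()
    finally:
        if conn:
            conn.close()                   # cleanup action: always release the connection
```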

    On data integrity while performing transformations in Python: let's assume we need to integrate different kinds of files, say multiple CSV files. As an example, I might fetch the data from URLs, download it as CSV files, read those files completely, and then perform validations on top of them. For integrity during transformations, version control is one technique. We also need a test framework so we can write unit tests for the transformations and fix bugs before they take effect, and so we can verify the complete process for everything we read with Python. We should debug the transformations if any errors show up in the logs, and run validation checks, for example with a library, before and after the data is transformed. Data lineage is another key thing: we need it so we can trace historical data back to its origin and understand how the data was transformed end to end. And data governance plays a key role, in my opinion: we need to see that the policies are applied and look at the governance framework and processes around the whole thing.
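
A small sketch of the unit-testing idea mentioned above; clean_prices and its columns are hypothetical, not taken from any of the projects described here:

```python
import unittest

def clean_prices(rows):
    """Drop rows with missing prices and cast price to float."""
    cleaned = []
    for row in rows:
        if row.get("price") in (None, ""):
            continue                      # integrity rule: no null prices downstream
        cleaned.append({"sku": row["sku"], "price": float(row["price"])})
    return cleaned

class CleanPricesTest(unittest.TestCase):
    def test_drops_missing_and_casts(self):
        raw = [{"sku": "A1", "price": "10.5"}, {"sku": "B2", "price": ""}]
        out = clean_prices(raw)
        self.assertEqual(out, [{"sku": "A1", "price": 10.5}])

if __name__ == "__main__":
    unittest.main()
```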

    When requesting data in Python, usually with the requests library and similar APIs, we can communicate with other services, for example serverless components, in the same scenarios where we write Python and PySpark code for the transformations. For paginated APIs there are a few approaches to understand: page-number or offset-and-limit pagination, and token-based pagination, which we also need to consider when dealing with APIs. With offset and limit we decide in code how we are going to walk through the pages: import requests, fetch a particular page by passing the page number, check the response code, and read up to the limit. We need to handle rate limits by importing time and backing off between retries, and set a maximum timeout, for example 60 seconds. For efficiency we can also consider parallel requests and adjusting the page size. And as I mentioned, token-based pagination is another optimization to keep in mind when dealing with APIs.
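
A minimal sketch of offset/limit pagination with requests, including the backoff and 60-second timeout mentioned above; the endpoint and parameter names are placeholders:

```python
import time
import requests

def fetch_all(base_url: str, limit: int = 100, max_retries: int = 3):
    records, offset = [], 0
    while True:
        for attempt in range(max_retries):
            try:
                resp = requests.get(
                    base_url,
                    params={"offset": offset, "limit": limit},
                    timeout=60,                      # cap each request at 60 seconds
                )
                resp.raise_for_status()
                break
            except requests.RequestException:
                time.sleep(2 ** attempt)             # simple exponential backoff
        else:
            raise RuntimeError("API kept failing after retries")
        page = resp.json()
        if not page:                                 # empty page means we are done
            return records
        records.extend(page)
        offset += limit

# Example (hypothetical endpoint):
# data = fetch_all("https://api.example.com/items")
```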

    On asynchronous programming: instead of going strictly step by step, we switch between tasks during waiting times, using asyncio, which is the Python library that enables writing asynchronous code. For that we need to set up an async environment, and we need to make sure we are using Python 3.7 or higher, since this style of code is supported from 3.7 onwards, along with an HTTP client library such as aiohttp, which we install alongside it. In the code we import both asyncio and aiohttp, define the fetch function as asynchronous, and then gather the tasks so they all run concurrently. We execute the total end-to-end ETL process over all the listed URLs and endpoints and call the main function. Before processing the data there are a few considerations to keep in mind: one is rate limits, another is concurrency versus parallelism, and error handling is one more thing we need to make sure of. These are the things, in my opinion, we need to take care of.
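
A minimal sketch of the asyncio plus aiohttp pattern described above (Python 3.7+, with aiohttp installed separately); the URLs are placeholders:

```python
import asyncio
import aiohttp

URLS = [
    "https://api.example.com/items?page=1",
    "https://api.example.com/items?page=2",
]

async def fetch(session: aiohttp.ClientSession, url: str):
    # awaiting here lets the event loop switch to another task while waiting
    async with session.get(url, timeout=aiohttp.ClientTimeout(total=60)) as resp:
        resp.raise_for_status()
        return await resp.json()

async def main():
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in URLS]
        # gather runs all fetches concurrently; return_exceptions keeps one
        # failure from cancelling the rest (simple error handling)
        return await asyncio.gather(*tasks, return_exceptions=True)

if __name__ == "__main__":
    print(asyncio.run(main()))
```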

    For large-scale systems with huge data volumes in the extraction, transformation, and loading architecture, we can go with ETL tools such as Dataflow, or any comparable GCP service, and for a modern architecture we should also take a microservices approach. For parallel processing we usually replace synchronous programming with asynchronous code and concurrency to handle I/O-bound tasks, such as calling APIs, efficiently, and use parallelism for CPU-bound tasks. For scalable infrastructure we go with cloud services such as Google Cloud, AWS, or Azure, and we can choose AWS Lambda or Google Cloud Functions for continuous scheduling. For containerization we go with Kubernetes, which is the most widely used these days, along with Docker containers for orchestration; it depends completely on the situation. For efficient data processing, both batch and stream: with large datasets in batch processing we divide the data into chunks to reduce memory usage, so we can process more data and improve processing speed, and for stream processing we take frameworks such as Apache Kafka, Apache Flink, and AWS Kinesis. For database optimization we go with NoSQL databases, mostly Cassandra or MongoDB, because they do not follow the RDBMS model and are distributed databases suited to particular access patterns, so we can easily implement partitioning and indexing strategies for optimization. For caching, we use various mechanisms to reduce the database load, maximize database performance, and improve response time, so the database is not running the same query again and again on long runs. And for data lakes, in a full architecture we can use a data lake to store vast amounts of raw data, compared to a data warehouse, before we do the transformations, so we can perform them on it later.
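
As a small illustration of the chunked batch-processing point above (pandas here, with a made-up file and column name):

```python
import pandas as pd

def process_in_chunks(path: str, chunksize: int = 100_000) -> float:
    total = 0.0
    # Each iteration holds only `chunksize` rows in memory instead of the whole file.
    for chunk in pd.read_csv(path, chunksize=chunksize):
        chunk = chunk.dropna(subset=["amount"])   # per-chunk transformation
        total += chunk["amount"].sum()
    return total

# Example (hypothetical file):
# print(process_in_chunks("sales_15gb.csv"))
```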

    Okay. The Python script in question is supposed to extract JSON data from an API, transform it, and load it into a DataFrame, getting the response with requests. What I observe is that the code reads the JSON directly from the requests.get() response without checking whether the request was successful. If the API call fails, due to a network issue, or if the server returns an error status code, that is one scenario where the code will not work as intended. I am also making an assumption about the response structure: the API response is a list of dictionaries, and the transformed data expects fields like id, item, quantity, and price. If the actual response is structured differently, or any item is missing one of those keys, the list comprehension will raise a KeyError; that is another thing I observed. In the transformation logic, quantity or price might be missing or not a numeric data type, for example a string or null, which would be a TypeError. A safer approach is to declare the API response from the request first, check it, and then build each record with item.get(), for example float(item.get("quantity", 0)) and float(item.get("price", 0)), and construct the DataFrame from that list of dictionaries. I think these are the changes I observed and how we can fix the code.
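
Since the original script is not reproduced here, this is only a hedged sketch of the safer version described in the answer, using the field names it mentions (id, item, quantity, price) and a placeholder URL:

```python
import requests
import pandas as pd

def extract_and_transform(url: str) -> pd.DataFrame:
    response = requests.get(url, timeout=30)
    response.raise_for_status()                 # fail loudly on HTTP errors
    payload = response.json()

    transformed = []
    for item in payload:
        transformed.append({
            "id": item.get("id"),
            "item": item.get("item"),
            # .get() with defaults avoids KeyError; float() guards the math below
            "quantity": float(item.get("quantity") or 0),
            "price": float(item.get("price") or 0),
        })
    df = pd.DataFrame(transformed)
    df["total"] = df["quantity"] * df["price"]  # example downstream transformation
    return df
```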

    The code block sends a batch of messages to an AWS Lambda function, and there appears to be an oversight that could lead to errors or unexpected behavior. The potential issue, I am assuming, is a syntax or type error. First, there is the messages_to_send list: the dictionary with an id and an f-string message built in a for loop over range(10) should presumably be a list comprehension. Also, the parameters in the lambda_client.invoke() call end with mismatched braces and quotes at the start and end, so I think something was missed around messages_to_send. The correct initialization should build messages_to_send as a list of dictionaries, each with an id and an f"message {i}" for i in range(10). There is also incorrect payload formatting: we need to import json in addition to boto3, and inside the loop call response = lambda_client.invoke() with the function name of the message-processing function, InvocationType "Event", and a Payload produced with json.dumps() and encoded as UTF-8. With those changes we can make it work.
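
A minimal sketch of the corrected pattern described above; the original snippet is not shown, so the function name process_message_function and the message shape are assumptions taken from the answer:

```python
import json
import boto3

lambda_client = boto3.client("lambda")

# Build the batch as a list of dictionaries (list comprehension).
messages_to_send = [{"id": i, "message": f"message {i}"} for i in range(10)]

for msg in messages_to_send:
    response = lambda_client.invoke(
        FunctionName="process_message_function",
        InvocationType="Event",                       # asynchronous invocation
        Payload=json.dumps(msg).encode("utf-8"),      # JSON payload, UTF-8 encoded
    )
    # 202 indicates the event was accepted for async processing
    print(msg["id"], response["StatusCode"])
```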

    An effective way to debug a Python application that is experiencing performance issues during complex SQL data transformations is, first, to profile the application; in my scenario I would start with code profiling. After that we need to analyze the SQL queries: how they run, whether they are optimized, whether they use a lot of joins or unnecessary statements, and what optimization techniques were used. Based on that, we should look at data access patterns as well, for example batch processing, caching, or lazy loading if we are working with an ORM or dealing with relationships in the data. Resource allocation is one more thing to check: whether it is Dataproc, the Python application, or the database, we need to make sure there are sufficient CPU and memory resources configured properly for that particular workload, and similarly that the application has enough resources to handle the workload where the code is deployed, considering multithreading or other bound tasks. We should also continuously monitor and test the complete environment using tools like Grafana or Datadog, so we can make sure everything is staying on track, and perform load testing on the application with realistic, real-world usage patterns to surface any performance issues the application may have.
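
A small sketch of the code-profiling step mentioned above, using the standard-library cProfile; run_transformations is a hypothetical stand-in for the real transformation routine:

```python
import cProfile
import pstats

def run_transformations():
    # placeholder for the expensive SQL / transformation work being debugged
    return sum(i * i for i in range(1_000_000))

profiler = cProfile.Profile()
profiler.enable()
run_transformations()
profiler.disable()

# Print the 10 slowest calls by cumulative time to see where time is spent.
stats = pstats.Stats(profiler).sort_stats("cumulative")
stats.print_stats(10)
```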

    On how Streamlit integrates React components into a Python application: for the integration, initially we need to set up a development environment and create a new Streamlit component by initializing it, then install the necessary dependencies in the React front end, something like running npm install. We also need to create a Streamlit wrapper to tie the whole process together: we set up a Python package so we can write the wrapper, and in the Python module itself we import the Streamlit components module to declare the component and the location where the React component's front-end assets are stored. After that we build the React component and serve it on that port so the component can be used in Streamlit. To finish the integration, I think we need to package and distribute the component.
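
A minimal sketch of the Python-side wrapper described above, using Streamlit's streamlit.components.v1 API; the component name, dev-server URL, and the greeting argument are placeholder assumptions:

```python
import streamlit as st
import streamlit.components.v1 as components

# During development, point at the React dev server started with `npm start`;
# for a release build you would pass path="frontend/build" instead of url=.
my_component = components.declare_component(
    "my_component",
    url="http://localhost:3001",
)

# Calling the wrapper renders the React component and returns whatever value
# the front end sends back to Streamlit.
value = my_component(greeting="Hello from Python", default=0)
st.write("Component returned:", value)
```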