Rajagopal Bojanapalli

Vetted Talent
With over a decade of experience in the field, I have developed strong proficiency in MySQL, Java, Python, and Hadoop. This expertise has enabled me to deliver a wide range of projects, from designing and optimizing databases to building complex software solutions, and to drive results and innovation in an ever-evolving technology landscape.
  • Role

    Software Engineer

  • Years of Experience

    10 years

Skillsets

  • Apache Kafka
  • Pinot
  • Dataproc
  • MapReduce
  • Looker
  • Presto
  • Dataflow
  • Airflow
  • Apache Flink
  • BigQuery
  • Scala
  • MySQL
  • Terraform
  • PySpark
  • Kubernetes
  • Elasticsearch
  • GCP
  • Apache Spark
  • Hadoop
  • Python
  • Java

Vetted For

11 Skills
  • Senior Data Engineer (AI Screening)
  • 72%
  • Skills assessed: BigQuery, AWS, Big Data Technology, ETL, NoSQL, PySpark, Snowflake, Embedded Linux, Problem Solving Attitude, Python, SQL
  • Score: 65/90

Professional Summary

10 Years
  • Feb, 2019 - Present (6 yr 8 months)

    Software Engineer

    Uber
  • Jan, 2017 - Feb, 2019 (2 yr 1 month)

    Software Engineer

    Walmart Labs
  • May, 2016 - Dec, 2016 (7 months)

    Big Data Intern

    Walmart Labs
  • Jan, 2015 - May, 2016 (1 yr 4 months)

    Data Scientist

    Information Sciences Institute
  • Jul, 2013 - Dec, 2014 (1 yr 5 months)

    Developer

    Fiorano Software Ltd
  • Jan, 2013 - Jul, 2013 (6 months)

    Software Intern

    Flipkart Online Services

Applications & Tools Known

  • Apache Spark
  • Apache Kafka
  • Pinot
  • Looker
  • GCP
  • BigQuery
  • Presto
  • Airflow
  • Kubernetes
  • Hadoop
  • Dataflow
  • Dataproc
  • Terraform

Work History

10 Years

Software Engineer

Uber
Feb, 2019 - Present (6 yr 8 months)
    Founding member of CO Data Engineering. Managed the central warehouse, created ETLs, and improved data quality and freshness for crucial datasets. Built anomaly detection systems that saved $500K and data validation utilities that cut manual effort. Implemented access controls on datasets and led governance projects for data management. Collaborated with ML teams on feature building, managed GDPR-compliant pipelines, built real-time queue analytics systems, and optimized dataset performance.

Software Engineer

Walmart Labs
Jan, 2017 - Feb, 2019 (2 yr 1 month)
    Developed pipelines for audience estimation, campaign management, and propensity audiences; contributed to a $10M increase in ad revenue; improved deployment infrastructure.

Big Data Intern

Walmart Labs
May, 2016 - Dec, 2016 (7 months)
    Created application to analyze log messages/alerts; enhanced algorithm efficiency and optimized disk usage.

Data Scientist

Information Sciences Institute
Jan, 2015 - May, 2016 (1 yr 4 months)
    Implemented clustering and graph algorithms to identify human traffickers; work featured in Forbes.

Developer

Fiorano Software Ltd
Jul, 2013 - Dec, 2014 (1 yr 5 months)
    Developed front-end web applications for API management, tracking metrics, and managing restrictions.

Software Intern

Flipkart Online Services
Jan, 2013 - Jul, 2013 (6 months)
    Built sales dashboards for analytics on product popularity, filtered by category/location/time.

Achievements

  • Founding member of CO Data Engineering
  • Built anomaly detection systems
  • Built data validation utility
  • Built access controls
  • Led data governance project
  • Built features for ML models like R&A issuance
  • Built Real-time Queue Analytics dashboard

Major Projects

6 Projects

Building Knowledge Graph to Combat Human Trafficking

    Featured on Forbes, implemented clustering and graph algorithms.

Real-time Audience Size Estimation

    Real-time big data processing with less than 1.5% error, with results available in minutes.

Modelled Data as Graph using Apache Spark GraphX

    Graph analytics on human trafficking ads, achieved significant results in clustering and identification.

Click-Through Rate Prediction

    Predicted ad click probability using classifiers such as decision trees, random forests, and SVM; results demonstrated with ROC curves.

Recommendation System

    Used the Audioscrobbler dataset to recommend artists via a latent factor model and collaborative filtering; also applied the approach to an IMDB movie dataset.

Clustering, Graph Analytics on Human Trafficking

    Grouped similar ads, modeled the data as a graph using Spark GraphX, and identified connected groups; work featured in Forbes.

Education

  • B.E. (Hons.)

    BITS-Pilani, Pilani campus (2013)
  • M.S. Data Informatics

    University of Southern California (2016)

AI-interview Questions & Answers

Hello. Yeah, so I'm Raj. I have close to 10 years of work experience, mostly in data engineering. I did my master's in data mining / big data distributed systems, and I've worked at companies like Walmart and Uber as a data engineer, on both batch analytics and real-time analytics. My tech stack is Apache Spark; I've worked with different flavors of Spark like PySpark, Scala Spark, Java Spark, and Spark SQL. For real-time analytics my stack is Apache Kafka, Flink, and an OLAP database like Pinot. For orchestration I've used Airflow, I've worked with Tableau for visualization, and on the cloud side I've worked with AWS and GCP. I've worked on large-scale systems crossing terabyte and petabyte scale, and I've done quite a number of optimizations: query optimizations, Spark optimizations, tuning Kafka configurations, and so on. So I'm pretty familiar with the whole data engineering tech stack. Thank you.

Sure, so handling skewness in a large dataset in PySpark. This is a pretty common data issue; most natural datasets you encounter have skew, right? One particular city has a lot of data, or one particular country, like India. The way we handle this: whenever we execute a Spark job, we can see how the job is performing in the Spark UI. Spark creates a DAG with the jobs, task dependencies, and everything, so if you look at how the tasks perform during job execution, some tasks finish faster, but one or two stay pending and become the bottleneck that keeps the job running while the rest have completed. That's a symptom of skewed data. Based on this finding we can do some exploratory data analysis: check how the group-bys and joins are done and on what keys, and do some volume analysis on the join keys and group-by keys; that will tell us the skewness. Sometimes it's on the actual key where the join is happening, but sometimes it's just null values: bad data coming in, and if you're grouping by a null value, a lot of data lands there. So that's how to detect the skewness. The way to fix it is a technique called salting: for the key that is skewed, add a random secondary key, so the key value becomes, say, key1 plus some random value. Then you do the grouping and joins on this composite key, which distributes the data across the tasks again. This technique should handle the skewness.
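
A minimal PySpark sketch of the salting fix described in this answer. The DataFrames, the hot key "mumbai", and the bucket count are illustrative assumptions, not from the original.

```python
# Salting: append a random salt to the skewed side's key, and replicate the
# small side once per salt value, so one hot key spreads across many tasks.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("salting-demo").getOrCreate()

SALT_BUCKETS = 16  # tune to the observed skew

# Hypothetical data: "mumbai" stands in for the hot key.
skewed = spark.createDataFrame(
    [("mumbai", 1), ("mumbai", 2), ("pune", 3)], ["city", "order_id"])
dim = spark.createDataFrame(
    [("mumbai", "MH"), ("pune", "MH")], ["city", "state"])

# Skewed side: append a random salt to the join key.
rand_salt = (F.rand() * SALT_BUCKETS).cast("int").cast("string")
salted_left = skewed.withColumn("salted_city", F.concat_ws("_", "city", rand_salt))

# Small side: replicate each row once per possible salt value.
salts = spark.range(SALT_BUCKETS).select(F.col("id").cast("string").alias("salt"))
salted_right = dim.crossJoin(salts).withColumn(
    "salted_city", F.concat_ws("_", "city", "salt"))

# Join on the composite (salted) key, then drop the helper columns.
joined = (salted_left
          .join(salted_right.drop("city", "salt"), on="salted_city")
          .drop("salted_city"))
joined.show()
```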

Okay, let me think. All right, so essentially we want to address two things: one is fault tolerance, and the other is data consistency. Fault tolerance is basically redundancy and replication. So when we are adding new data to, let's say, a NoSQL database, we make sure replication and sharding are done as expected: implement data replication and sharding in the NoSQL database, distributing the data across multiple nodes for fault tolerance and scalability, making sure data is not lost in the case of node failures, and using sharding to partition the data into smaller chunks spread evenly across nodes to improve performance. So that takes care of replication and sharding. For the consistency model, we can choose what the NoSQL database offers: some databases offer eventual consistency, some offer strong consistency, so based on the requirement we can configure that on the NoSQL database end, and when the ETL writes, the database will enforce the consistency we configured. One more thing we can do is transaction management: let's say we want strong consistency, we can use the transactional approach, start the transaction, write the data, and commit the transaction; that is much stronger. For error handling, we can do retries, adding a retry mechanism in case of any failures, using exponential backoff. And also adding monitoring and alerting for the ETL pipelines, monitoring key metrics like replication lag, node status, throughput, all those things.
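
As a small illustration of the exponential-backoff retries mentioned above, here is a hedged Python sketch; write_batch is a hypothetical stand-in for the actual NoSQL write.

```python
# Retry with exponential backoff plus jitter around a transient-failure-prone write.
import random
import time


def with_retries(fn, max_attempts=5, base_delay=1.0, max_delay=30.0):
    """Run fn(), retrying on failure with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:  # in practice, catch only the store's transient errors
            if attempt == max_attempts:
                raise  # retries exhausted: surface the error to the pipeline
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(delay + random.uniform(0, delay / 2))  # jitter avoids thundering herd


def write_batch():
    ...  # hypothetical: issue the batched write to the NoSQL database


with_retries(write_batch)
```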

Okay, so the question is asking how to deduplicate a dataset in Snowflake. There could be various methods; let me structure the thought. The simplest framing is: identify the duplicates in the Snowflake dataset, read the data, and write it back deduplicated. Deduplication can be done in different ways, like GROUP BY, DISTINCT, or ROW_NUMBER; GROUP BY is usually efficient. Whatever you do, first you need to identify the duplicate key: is it one column, or multiple columns? Based on that you can do the GROUP BY, and ROW_NUMBER works as well. DISTINCT is usually expensive, so ROW_NUMBER or GROUP BY is the way to go. And to avoid this in the future, we can add data quality checks to the ETL so that we don't write duplicates again.
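
A sketch of the ROW_NUMBER-style deduplication described above, run through the snowflake-connector-python client. The table orders, key order_id, and timestamp column loaded_at are hypothetical; fill in real credentials before running.

```python
# Rewrite the table keeping one row per duplicate key, via ROW_NUMBER + QUALIFY.
import snowflake.connector

DEDUP_SQL = """
CREATE OR REPLACE TABLE orders AS
SELECT *
FROM orders
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY order_id        -- the duplicate key (one or more columns)
    ORDER BY loaded_at DESC      -- keep the most recently loaded copy
) = 1
"""

conn = snowflake.connector.connect(account="...", user="...", password="...")
conn.cursor().execute(DEDUP_SQL)
```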

So it would depend on the use case here; it could be a batch or a real-time pipeline. I guess the context is more about real-time, but I'll answer in both contexts. Let's say this is a real-time pipeline. We could use some kind of stream processing technology, like Apache Flink, Kafka Streams, or Spark Streaming. Usually the time-series data is coming from a distributed queue like Kafka, and all of these stream processing engines have integration with Kafka, so I can implement my transformations and aggregations using Flink or Spark Streaming. They support a bunch of extensive features: transformations like map and reduce, windowed aggregations, state management, and all that. So this is how I would handle the real-time case. In the case of batch processing, I could use any execution engine like Spark or Hive to process the data; in batch it's pretty straightforward, I could do it like any other pipeline, it doesn't have to be special because it's time series, and Apache Spark has pretty good features for that. Once I do the processing, I write the data to my target table, choosing the right partitioning and retention. While partitioning, I'll make sure the data is distributed evenly for efficient processing, be it an hourly partition or a daily partition or things like that. And I'll add configuration for retention so the data won't keep growing forever. And yeah, data governance and auditing are pretty straightforward.
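
A minimal Spark Structured Streaming sketch of the real-time path described above: read hypothetical sensor events from Kafka and compute windowed averages. The topic name, schema, and broker address are assumptions, and the spark-sql-kafka connector must be on the classpath.

```python
# Windowed aggregation over a Kafka time-series stream with Structured Streaming.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("ts-stream-demo").getOrCreate()

schema = (StructType()
          .add("sensor_id", StringType())
          .add("value", DoubleType())
          .add("event_time", TimestampType()))

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker
          .option("subscribe", "sensor-events")                 # assumed topic
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# 5-minute tumbling-window averages per sensor, tolerating 10 minutes of lateness.
agg = (events
       .withWatermark("event_time", "10 minutes")
       .groupBy(F.window("event_time", "5 minutes"), "sensor_id")
       .agg(F.avg("value").alias("avg_value")))

query = agg.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```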

Okay, using broadcast joins. So, in Spark, for broadcasting there is a construct called broadcast; using that construct you can broadcast a DataFrame. Let's say df is the DataFrame, I can use broadcast(df), and that will broadcast it. But we have to be mindful when doing it, considering the size of the DataFrame: if we are broadcasting a very large DataFrame, it will cause out-of-memory issues. Usually Spark does it by default when the size is less than 10 MB. So depending on the DataFrame size, I would broadcast: if it is greater than 10 MB but maybe less than 50 MB, something like that, I would broadcast, but if it is more than 500 MB, I wouldn't broadcast it.
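
A short PySpark sketch of the broadcast() hint described above, with hypothetical stand-ins for the large and small DataFrames.

```python
# Broadcast join: ship the small side to every executor instead of shuffling both.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

orders = spark.range(10_000_000).withColumnRenamed("id", "country_id")  # large side
countries = spark.createDataFrame(
    [(0, "IN"), (1, "US")], ["country_id", "name"])                     # small side

# Spark also broadcasts automatically below spark.sql.autoBroadcastJoinThreshold
# (10 MB by default); the hint forces it for sizes above that.
joined = orders.join(broadcast(countries), on="country_id")
joined.explain()  # physical plan should show a BroadcastHashJoin
```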

Okay, what are the best practices for optimizing Snowflake SQL queries? The first thing is using the right data types, which minimizes the storage requirement; for example, not using BIGINT where INT is enough, things like that. So choosing the proper data type, and minimizing data movement by filtering data as early as possible; shuffles are very expensive, and that would increase the cloud cost. Then leveraging clustered tables, clustering on frequently used columns to improve query performance, and adding indexes wisely, proper indexes on frequently used columns. Optimizing joins, using appropriate join algorithms like hash or merge based on the sizes of the tables, and using partitioned tables. Avoiding SELECT *, specifying only the columns that are needed. And I can monitor the query execution plan, make use of query profiling tools to see how the query can be optimized, and also make use of the caching features. Yeah, I think these are some things that would help in optimizing a Snowflake SQL query.
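
A small before/after illustration of two of these practices, column pruning and early filtering; the table and column names are hypothetical.

```python
# Anti-pattern: scans every column and filters late, in an outer query.
BAD_SQL = """
SELECT * FROM (SELECT * FROM events) WHERE event_date = '2024-01-01'
"""

# Better: prune columns and push the filter to the base table. On a table
# clustered by event_date, the filter also prunes micro-partitions.
GOOD_SQL = """
SELECT user_id, event_type
FROM events
WHERE event_date = '2024-01-01'
"""
```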

Yeah. Mm hmm. Yeah. So, this is about idempotency, okay. This one is a tricky question. All right. Just to give some background, idempotency means generating the same consistent output every time the ETL runs: no matter whether the order of ingestion from the SQL sources changes or the data sources change, we want the same output. A few ways to handle this. One is making sure there is a unique identifier for each record from the different sources: for each record in the SQL source, assign or generate a unique identifier, so that during the ingestion process we can identify the record, and it will be unique for the data source. The other way is maintaining state and tracking the processed data: we could use some kind of metadata table, or a coordination service like ZooKeeper, to maintain the state of what we have processed, so that we are not re-ingesting already-processed data. Also, utilize upsert operations: let's say the same data is coming in, instead of appending it again we can upsert. And develop idempotency logic in the ETL itself, making the ETL robust enough to handle duplicate data gracefully: if a record already exists, skip the insertion, things like that. Add retry mechanisms to handle errors like network issues, and add monitoring on top of the data ingestion process, tracking the number of records processed, inserted, et cetera.
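
A sketch of the upsert idea above as a Snowflake-style MERGE keyed on a unique record identifier, so re-running the load does not duplicate rows; the table and column names are hypothetical.

```python
# Idempotent load: MERGE updates existing records and inserts only new ones,
# so replaying the same batch leaves the target unchanged.
MERGE_SQL = """
MERGE INTO target t
USING staging s
  ON t.record_id = s.record_id        -- the unique identifier per source record
WHEN MATCHED THEN
  UPDATE SET t.payload = s.payload, t.updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (record_id, payload, updated_at)
  VALUES (s.record_id, s.payload, s.updated_at)
"""
```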

So okay, all right. The first thing is choosing the right Linux distribution, one that is well optimized for performance; popular ones are Ubuntu, Red Hat Enterprise Linux, CentOS, things like that. The next one is installing the required dependencies for the big data stack, like Spark, Hadoop, Kafka; we can use package managers like apt or yum for installing these dependencies and the Python environment. It is best to have a virtual environment so that dependencies stay isolated, so use a virtual environment to manage Python dependencies for the big data applications. The next one is optimizing system resources: configure the system to allocate sufficient resources, CPU, memory, disk space, to your big data applications; adjust kernel parameters such as maximum open files and network buffers; and optimize disk I/O performance by using SSDs or tuning file system settings, like XFS or ext4. The next thing is the networking configuration: configuring the network so that we have optimal communication between the nodes inside the cluster, and configuring firewall and security settings. Then configuring memory management for optimal memory usage, using sysctl settings like vm.overcommit_memory. I think these are good things to start with. Once we have this, use monitoring and logging tools to track system metrics like CPU utilization, memory consumption, all that.
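
A hedged sketch of the kernel tuning mentioned above, applied through sysctl from Python; the specific parameters and values are illustrative, not universal recommendations.

```python
# Apply a few kernel settings commonly raised for big data daemons (requires root).
import subprocess

SETTINGS = {
    "fs.file-max": "1000000",         # raise the maximum number of open files
    "vm.overcommit_memory": "1",      # allow memory overcommit
    "net.core.rmem_max": "134217728", # enlarge network receive buffers
}

for key, value in SETTINGS.items():
    subprocess.run(["sysctl", "-w", f"{key}={value}"], check=True)
```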

We want to have a high-availability system, meaning if one node goes down, there should always be another node serving our queries; essentially that is what we are targeting. One good practice is a multi-data-center deployment: nodes in the same data center tend to share network issues or power failures, so deploying across data centers helps. Then there is load balancing: using load balancers to distribute traffic across multiple instances, so that if one node is down, the load balancer can route the request to another node. And auto-scaling: if the traffic is increasing, use an automatic scaling system. There is also stateless architecture: designing the applications and services to be stateless wherever possible, so that any node can serve any request. Other things are replication, sharding, all that: we can have our database replicated across multiple regions and zones, and the cloud databases, Cloud SQL, Azure SQL, already provide this out of the box.
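
A toy Python sketch of the failover behavior described above: round-robin across replicas, skipping nodes that fail a health check. All names here are hypothetical.

```python
# Minimal load-balancer failover: rotate through nodes, route to the first healthy one.
import itertools


class LoadBalancer:
    def __init__(self, nodes, is_healthy):
        self._ring = itertools.cycle(nodes)   # round-robin over replicas
        self._is_healthy = is_healthy         # health-check callback per node
        self._size = len(nodes)

    def route(self, request):
        # Try each node at most once; the first healthy one serves the request.
        for _ in range(self._size):
            node = next(self._ring)
            if self._is_healthy(node):
                return node.handle(request)   # hypothetical node API
        raise RuntimeError("no healthy replicas available")
```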