
A collaborative engineering professional with substantial experience in designing solutions for and solving complex business problems involving large-scale data. Experienced in building highly scalable, robust, and fault-tolerant systems.
Tech stack I have worked with:
• Big Data Ecosystem: HDFS, YARN, MapReduce, Apache Spark 2.2, Hive, Oozie
• SQL Technologies: Teradata, DB2, SQL Server, Oracle 10g
• NoSQL Technologies: MongoDB
• Programming Languages: C, C++, Java, Python, C#
• ETL Tools: SAS Data Integration
• Real Time Data processing: Apache Spark Streaming
• Operating Systems: Red Hat, Ubuntu, MS Windows
• Cloud Services: AWS EC2, S3, Kinesis; Azure Web Apps, Event Hubs
Lead Data Engineer, Databricks
Analyst Data Engineer, JPMorgan Chase and Co
Application Development Analyst, Accenture
Skills: Databricks, Apache Spark, Apache Kafka, AWS, JSON, WebAPI, Teradata, Hadoop, Spark SQL, Hive, Impala, Scala, Java, JVM, Git, Jenkins
Hi there, sorry about the camera. My name is Shraddha Bhattacharya, and I work as a lead data engineer with Databricks. I take care of use cases like JPMorgan's migration projects from cloud or on-prem clusters to Databricks, and we also develop features and use cases specifically on the Databricks platform. I have a total of 8 years of experience in the data engineering space. I started off at Accenture, where I was a consultant for Nordea for 3 years. We ingested customer analytics data from platforms like SAS, through the backend tables populated to DB2, along with campaign management data and customer data, and fed it to our random forest models built with SAS E-Miner. Then I moved to JPMorgan Chase, where I was again a data engineer. We worked on risk data for the firm, ingesting from various catalogs; obligations data was one of the main components I worked with. We also built CI/CD pipelines for easy deployment and automation. Now at Databricks, I serve customer use cases where we jump in as Spark specialists and help the customer achieve their goals with the resources they commission. We are also a Delta-first company, so we definitely promote Delta as the framework, because the customer will not be locked into the platform even if they want to migrate in the near future. Thank you.
You can enable a broadcast join by using broadcast hints. There's a .hint() you can use to tell AQE to trigger the broadcast, and you can also play around with the broadcast threshold to enable it; in my experience, the hint works better. Broadcasting happens on the smaller side of the join, and the threshold can be raised to a maximum of 8 GB. We should avoid broadcasting a large table, and also watch out for broadcast nested loop joins, because IO is a costly operation, so if data is transmitted multiple times it will have a negative impact on performance. Those are definitely things to keep in consideration before using broadcast. It's a good fit if you have small reference data that you want to broadcast to the executors; it does speed up the join, but there are some downsides.
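A minimal sketch of the options just described, assuming two hypothetical tables, a large sales_fact and a small country_dim (all names and values are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

fact_df = spark.table("sales_fact")    # large table (placeholder name)
dim_df = spark.table("country_dim")    # small reference table (placeholder)

# Option 1: explicitly broadcast the smaller side.
joined = fact_df.join(broadcast(dim_df), on="country_id")

# Option 2: a join hint, which the optimizer/AQE will honor.
joined_hinted = fact_df.join(dim_df.hint("broadcast"), on="country_id")

# Option 3: raise the automatic threshold (default 10 MB; hard cap 8 GB).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(100 * 1024 * 1024))
```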
Given an ETL pipeline, identify a strategy for handling incremental data, with Snowflake as the destination. Okay, so incremental data. We have something known as Auto Loader, where we consume data as a stream, and Auto Loader can also work in a batch mode; it's a simple switch. For incremental loads there is file-notification mode: whenever a file lands in the directory or bucket, it is automatically picked up and processed by the pipeline. Auto Loader is a streaming use case, so it works in micro-batches, and whatever transformation we want to do on the micro-batch can be done; then we can output directly to the destination tables in Snowflake using the Snowflake connector. That's how we handle incremental data. How Auto Loader keeps track is through RocksDB or a cloud state provider; the checkpoint state can be HDFS-backed or RocksDB-backed. RocksDB is the key-value store used to track these changes, and it contains the names of the files to be ingested and the times they came in. You can inspect that state with the cloud_files_state SQL function by giving it the checkpoint location.
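A hedged sketch of such a pipeline on Databricks; the paths, credentials, and Snowflake options below are placeholders, and the Snowflake Spark connector is assumed to be available on the cluster:

```python
# Auto Loader reads new files incrementally; file-notification mode avoids
# repeatedly listing the directory. All paths and options are placeholders.
events = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "json")
          .option("cloudFiles.useNotifications", "true")
          .option("cloudFiles.schemaLocation", "s3://bucket/schemas/events")
          .load("s3://bucket/landing/events"))

def write_to_snowflake(batch_df, batch_id):
    # Each micro-batch is pushed through the Snowflake Spark connector.
    (batch_df.write
     .format("snowflake")
     .option("sfURL", "<account>.snowflakecomputing.com")
     .option("sfUser", "<user>")
     .option("sfPassword", "<secret>")      # use a secret scope in practice
     .option("sfDatabase", "<database>")
     .option("sfSchema", "<schema>")
     .option("sfWarehouse", "<warehouse>")
     .option("dbtable", "EVENTS")
     .mode("append")
     .save())

(events.writeStream
 .foreachBatch(write_to_snowflake)
 .option("checkpointLocation", "s3://bucket/checkpoints/events")
 .trigger(availableNow=True)    # batch-style run; remove for continuous mode
 .start())
```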
Okay, so in AWS there is an auto-backup feature you can enable. What we suggest is setting up a separate DR (disaster recovery) environment where the tables and a snapshot of the data are already present, and the processes, the jobs, are already deployed. We do this for customers, like banking customers, who need quick recovery times. Then it's just a simple switch: you point the load balancer URLs to the DR environment, and that automatically routes traffic to the standby. That way disaster recovery is easily managed.
Okay, so skewness can be detected very easily. If a stage is running for a very long time, you go into the stage view in the Spark UI and sort the running tasks. Then you look at the input: if it's an input read, you see how much data the task is reading; if it's a shuffle read, you look at the shuffle read size. Skew on the write side shows up in the write operation, how much the task is writing, or in the shuffle write size if it's writing shuffle files. If the job has already completed, you can check the quantile metrics and see whether tasks at the median and above took much longer; that can tell you if there is skewness. Now, how do you handle it? You have to evenly distribute your data, so you can repartition the data using some key or columns. If there is no single suitable column, you can use a salting technique: concatenate a salt and then repartition on that column, and that should do the trick. Or you can use skew hints to tell AQE to use skew joins in case any skewness is present during a join phase. That's how you address skewness in a dataset.
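A sketch of the salting technique, assuming a hypothetical skewed join key customer_id and a tunable number of salt buckets:

```python
from pyspark.sql import functions as F

SALT_BUCKETS = 16   # tune per workload

# Add a random salt on the large, skewed side so hot keys spread out.
salted_large = large_df.withColumn(
    "salt", (F.rand() * SALT_BUCKETS).cast("int"))

# Explode the small side across all salt values so every row still matches.
salted_small = small_df.withColumn(
    "salt", F.explode(F.array([F.lit(i) for i in range(SALT_BUCKETS)])))

joined = (salted_large
          .join(salted_small, on=["customer_id", "salt"])
          .drop("salt"))

# Alternatively, a Databricks skew hint lets the engine split hot keys:
# large_df.hint("skew", "customer_id").join(small_df, "customer_id")
```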
Okay, so Snowflake is a compute platform; I personally have just used it for testing. But if you want me to deduplicate data, I usually use upsert logic. An upsert can never ingest duplicate records: even if the ETL pipeline has run multiple times, the upsert will never insert those records again; they'll be updated or dropped, which also gives you a boost in performance. Next, if you already have duplicates, you can use ROW_NUMBER() with PARTITION BY on a composite key to detect them, and just take rank one, the first occurrence. I used to order by timestamp, keep the snapshot where the rank equals 1, and populate the table from that. That's the simplest way to remove duplicates. To prevent them, go for upsert logic: you can use the MERGE operation in Databricks, which automatically takes care of this scenario, or use a dedupe transformation. But just be mindful that dedupe is actually a stateful operator, so if you are using streaming, you have to set a watermark to bound how long that state is held.
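A sketch of both steps; the table names, composite key, and timestamp column are illustrative:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# 1) Clean existing duplicates: keep the latest row per composite key.
w = Window.partitionBy("order_id", "line_id").orderBy(F.col("event_ts").desc())
deduped = (df.withColumn("rn", F.row_number().over(w))
             .filter("rn = 1")
             .drop("rn"))

# 2) Prevent new duplicates with an upsert (Delta MERGE on Databricks).
deduped.createOrReplaceTempView("updates")
spark.sql("""
    MERGE INTO target t
    USING updates u
    ON t.order_id = u.order_id AND t.line_id = u.line_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# In streaming, dropDuplicates is stateful; bound the state with a watermark:
# df.withWatermark("event_ts", "10 minutes").dropDuplicates(["order_id", "event_ts"])
```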
Now coming to tuning for memory management. What we generally do is check how much of the heap is being consumed. There are two sides to it. If the bottleneck is on the driver, we check whether the driver's heap memory is overflowing. That scenario occurs when the driver is trying to accomplish many tasks at once, so we can try increasing the number of driver cores, which takes some load off. If a lot of data is being brought to the driver, that is not recommended; you should not use collect() in a production job. Rather, you should write to a table, because table writes are handled by the workers. That is the general recommendation. If there is memory pressure due to a huge query plan, we can increase the driver size to the next level, which can handle it, or we can break the plan down into sizable chunks and consume the data that way. Next is the worker level. At the executor level, sometimes OOMs happen and the worker gets killed; this is mainly due to data skew. If there is not enough memory, we commission additional workers and also increase the parallelism. We also use spark.task.cpus to increase the resources available to a single task. These are the common steps we usually implement for memory management.
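An illustrative set of knobs for the steps above; the values are placeholders to adapt, not recommendations:

```python
# Runtime setting: raise shuffle parallelism so partitions stay small.
spark.conf.set("spark.sql.shuffle.partitions", "400")

# These must be set at cluster launch (spark-submit / cluster Spark config),
# not at runtime:
#   spark.driver.memory    16g   # headroom for large query plans
#   spark.driver.cores     4     # more cores when the driver juggles many jobs
#   spark.executor.memory  32g
#   spark.task.cpus        2     # more resources per individual task

# Avoid collect() in production; let the workers write the result instead.
result_df.write.mode("overwrite").saveAsTable("analytics.results")
```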
How do you ensure fault tolerance? Usually, fault tolerance can be achieved by keeping multiple copies of the data, and data consistency we can achieve through data governance: we can onboard our databases through Unity Catalog. Unity Catalog tracks consistency issues in the data and flags them in the lakehouse dashboard. That feature is very useful and can definitely be implemented.
What strategies would you use to migrate Python scripts running in a legacy system to PySpark for parallel processing? Okay, so if there are legacy systems, during migration the things we keep in mind are the volume of data coming in and the source of the data; if that is changing, some refactoring will definitely have to happen. I have worked on so many projects, and some sort of refactoring always comes in, so we should keep that in mind and take care of it. Beyond that, we can more or less translate the Python code directly: if it's pure Python and can be handled by the out-of-the-box capabilities of PySpark, we convert it directly; otherwise we can use UDFs to achieve the same thing, UDFs and UDAFs for aggregate functions. And to achieve parallel processing, we should utilize all the worker cores. Those are the things to keep in mind for the migration.
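A hedged sketch of wrapping legacy Python logic in a pandas UDF so it runs in parallel across worker cores; legacy_score and the column names stand in for the existing code:

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

def legacy_score(amount: float, weight: float) -> float:
    # Stand-in for the untouched legacy Python function.
    return amount * weight

@pandas_udf(DoubleType())
def score_udf(amount: pd.Series, weight: pd.Series) -> pd.Series:
    # Executes on the workers over Arrow batches, one chunk per task.
    return pd.Series(
        [legacy_score(a, w) for a, w in zip(amount, weight)])

scored = df.withColumn("score", score_udf("amount", "weight"))
```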
Can you propose a method for real-time data processing? Okay, so Kinesis is a Kafka-like framework from AWS. I have used Kinesis as a source, ingesting data from Kinesis and bringing it into our data lake. But you can also use AWS Lambda, where an event can trigger a workload or end-of-pipeline processing based on the incoming data. You can use the Kinesis connectors that are readily available in AWS, connect to Lambda, and do the processing. One example, if I have to give one, is email triggering: if some topic has data indicating a deviation in DQ (data quality), or some other deviation, or there's a queue where broken records are pushed, then for those notifications we can use AWS Lambda connected with AWS email services. That's a good use case of Kinesis with Lambda. Apart from that, for real-time applications we can utilize any of the streaming services in real time.
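A sketch of ingesting a Kinesis stream on Databricks (the "kinesis" source is Databricks-specific); the stream name, region, and sink are placeholders:

```python
# Read the stream; records arrive with a binary `data` column.
raw = (spark.readStream
       .format("kinesis")
       .option("streamName", "events-stream")
       .option("region", "us-east-1")
       .option("initialPosition", "latest")
       .load())

events = raw.selectExpr(
    "cast(data as string) as payload",
    "approximateArrivalTimestamp")

# Land the decoded records in the lake as a Delta table.
(events.writeStream
 .option("checkpointLocation", "s3://bucket/checkpoints/kinesis")
 .toTable("bronze.kinesis_events"))
```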
What process would you recommend for migrating from on-prem to AWS S3? So, for the migration from on-prem to S3: firstly, identify PII data. We should not take PII data to the cloud; banks are usually hesitant about that. Other than that, we can utilize a VPC for securing the connections, like a private VPC. Then, if it's a Parquet table, we can move the Parquet files directly and create tables on top of them. And AWS offers migration solutions we can use to migrate all the data from the on-prem Hadoop cluster. Thank you for the time.
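A small sketch of that last step, exposing migrated Parquet files in S3 as a table; the path and table name are placeholders:

```python
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.customers
    USING PARQUET
    LOCATION 's3://bucket/migrated/customers/'
""")
```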