
A collaborative engineering professional with substantial experience in designing solutions for and solving complex business problems involving large-scale data. Experienced in building highly scalable, robust, and fault-tolerant systems.
Tech stack I have worked with:
• Big Data Ecosystem: HDFS, YARN, MapReduce, Apache Spark 2.2, Hive, Oozie
• SQL Technologies: Teradata, DB2, SQL Server, Oracle 10g
• NoSQL Technologies: MongoDB
• Programming Languages: C, C++, Java, Python, C#
• ETL Tools: SAS Data Integration
• Real Time Data processing: Apache Spark Streaming
• Operating Systems: Red Hat, Ubuntu, MS Windows
• Cloud Services: AWS EC2, S3, Kinesis; Azure Web Apps, Event Hubs
Lead Data Engineer, Databricks
Analyst Data Engineer, JPMorgan Chase and Co
Application Development Analyst, Accenture
Skills: Databricks, Apache Spark, Apache Kafka, AWS, JSON, WebAPI, Teradata, Hadoop, Spark SQL, Hive, Impala, Scala, Java, JVM, Git, Jenkins
Hi there, sorry about the camera. My name is Shraddha Bhattacharya, and I work as a lead data engineer with Databricks. I take care of use cases like JPMorgan's migration projects from cloud or on-prem clusters to Databricks, and we also develop features and use cases specifically on the Databricks platform. I have a total of 8 years of experience in the data engineering space. I started off at Accenture, where I was a consultant for Nordea for 3 years. We ingested customer analytics data from platforms like SAS, through the backend tables populated to DB2, along with campaign management data and customer data, and fed it to our random forest models built with SAS E-Miner. Then I moved to JPMorgan Chase, where I was again a data engineer. We worked on risk data for the firm, ingesting from various catalogs; obligations data was one of the main components I worked with. We also built CI/CD pipelines for easy deployment and automation. Now at Databricks, I serve customer use cases where we jump in as Spark specialists and help the customer achieve their goals with the resources they commission. We are also a Delta-first company, so we definitely promote Delta as the framework, because the customer will not be locked into the platform even if they want to migrate in the near future. Thank you.
You can enable a broadcast join by using broadcast hints. There's a .hint() you can use to tell AQE to trigger the broadcast, and you can also play around with the broadcast threshold to enable it; in my experience, the hint works better. Broadcasting happens on the smaller side of the join, and the threshold can be raised to a maximum of 8 GB. We should avoid broadcasting a large table, and also watch out for broadcast nested loop joins, because IO is a costly operation, so if data is transmitted multiple times it will have a negative impact on performance. Those are definitely things to keep in consideration before using broadcast. It's a good fit if you have small reference data that you want to broadcast to the executors; it does speed up the join, but there are some downsides.
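A minimal sketch of the options just described, assuming two hypothetical tables, a large sales_fact and a small country_dim (all names and values are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

fact_df = spark.table("sales_fact")    # large table (placeholder name)
dim_df = spark.table("country_dim")    # small reference table (placeholder)

# Option 1: explicitly broadcast the smaller side.
joined = fact_df.join(broadcast(dim_df), on="country_id")

# Option 2: a join hint, which the optimizer/AQE will honor.
joined_hinted = fact_df.join(dim_df.hint("broadcast"), on="country_id")

# Option 3: raise the automatic threshold (default 10 MB; hard cap 8 GB).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(100 * 1024 * 1024))
```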
Given an ETL pipeline, identify a strategy for handling incremental data, with Snowflake as the destination. Okay, so incremental data. We have something known as Auto Loader, where we consume data as a stream, and Auto Loader can also work in a batch mode; it's a simple switch. For incremental loads there is file-notification mode: whenever a file lands in the directory or bucket, it is automatically picked up and processed by the pipeline. Auto Loader is a streaming use case, so it works in micro-batches, and whatever transformation we want to do on the micro-batch can be done; then we can output directly to the destination tables in Snowflake using the Snowflake connector. That's how we handle incremental data. How Auto Loader keeps track is through RocksDB or a cloud state provider; the checkpoint state can be HDFS-backed or RocksDB-backed. RocksDB is the key-value store used to track these changes, and it contains the names of the files to be ingested and the times they came in. You can inspect that state with the cloud_files_state SQL function by giving it the checkpoint location.
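A hedged sketch of such a pipeline on Databricks; the paths, credentials, and Snowflake options below are placeholders, and the Snowflake Spark connector is assumed to be available on the cluster:

```python
# Auto Loader reads new files incrementally; file-notification mode avoids
# repeatedly listing the directory. All paths and options are placeholders.
events = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "json")
          .option("cloudFiles.useNotifications", "true")
          .option("cloudFiles.schemaLocation", "s3://bucket/schemas/events")
          .load("s3://bucket/landing/events"))

def write_to_snowflake(batch_df, batch_id):
    # Each micro-batch is pushed through the Snowflake Spark connector.
    (batch_df.write
     .format("snowflake")
     .option("sfURL", "<account>.snowflakecomputing.com")
     .option("sfUser", "<user>")
     .option("sfPassword", "<secret>")      # use a secret scope in practice
     .option("sfDatabase", "<database>")
     .option("sfSchema", "<schema>")
     .option("sfWarehouse", "<warehouse>")
     .option("dbtable", "EVENTS")
     .mode("append")
     .save())

(events.writeStream
 .foreachBatch(write_to_snowflake)
 .option("checkpointLocation", "s3://bucket/checkpoints/events")
 .trigger(availableNow=True)    # batch-style run; remove for continuous mode
 .start())
```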
Okay, so in AWS there is an auto-backup feature you can enable. What we suggest is setting up a separate DR (disaster recovery) environment where the tables and a snapshot of the data are already present, and the processes, the jobs, are already deployed. We do this for customers, like banking customers, who need quick recovery times. Then it's just a simple switch: you point the load balancer URLs to the DR environment, and that automatically routes traffic to the standby. That way disaster recovery is easily managed.
Okay, so skewness can be detected very easily. If a stage is running for a very long time, you go into the stage view in the Spark UI and sort the running tasks. Then you look at the input: if it's an input read, you see how much data the task is reading; if it's a shuffle read, you look at the shuffle read size. Skew on the write side shows up in the write operation, how much the task is writing, or in the shuffle write size if it's writing shuffle files. If the job has already completed, you can check the quantile metrics and see whether tasks at the median and above took much longer; that can tell you if there is skewness. Now, how do you handle it? You have to evenly distribute your data, so you can repartition the data using some key or columns. If there is no single suitable column, you can use a salting technique: concatenate a salt and then repartition on that column, and that should do the trick. Or you can use skew hints to tell AQE to use skew joins in case any skewness is present during a join phase. That's how you address skewness in a dataset.
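A sketch of the salting technique, assuming a hypothetical skewed join key customer_id and a tunable number of salt buckets:

```python
from pyspark.sql import functions as F

SALT_BUCKETS = 16   # tune per workload

# Add a random salt on the large, skewed side so hot keys spread out.
salted_large = large_df.withColumn(
    "salt", (F.rand() * SALT_BUCKETS).cast("int"))

# Explode the small side across all salt values so every row still matches.
salted_small = small_df.withColumn(
    "salt", F.explode(F.array([F.lit(i) for i in range(SALT_BUCKETS)])))

joined = (salted_large
          .join(salted_small, on=["customer_id", "salt"])
          .drop("salt"))

# Alternatively, a Databricks skew hint lets the engine split hot keys:
# large_df.hint("skew", "customer_id").join(small_df, "customer_id")
```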
Okay, so Snowflake is a compute platform; I personally have just used it for testing. But if you want me to deduplicate data, I usually use upsert logic. An upsert can never ingest duplicate records: even if the ETL pipeline has run multiple times, the upsert will never insert those records again; they'll be updated or dropped, which also gives you a boost in performance. Next, if you already have duplicates, you can use ROW_NUMBER() with PARTITION BY on a composite key to detect them, and just take rank one, the first occurrence. I used to order by timestamp, keep the snapshot where the rank equals 1, and populate the table from that. That's the simplest way to remove duplicates. To prevent them, go for upsert logic: you can use the MERGE operation in Databricks, which automatically takes care of this scenario, or use a dedupe transformation. But just be mindful that dedupe is actually a stateful operator, so if you are using streaming, you have to set a watermark to bound how long that state is held.
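A sketch of both steps; the table names, composite key, and timestamp column are illustrative:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# 1) Clean existing duplicates: keep the latest row per composite key.
w = Window.partitionBy("order_id", "line_id").orderBy(F.col("event_ts").desc())
deduped = (df.withColumn("rn", F.row_number().over(w))
             .filter("rn = 1")
             .drop("rn"))

# 2) Prevent new duplicates with an upsert (Delta MERGE on Databricks).
deduped.createOrReplaceTempView("updates")
spark.sql("""
    MERGE INTO target t
    USING updates u
    ON t.order_id = u.order_id AND t.line_id = u.line_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# In streaming, dropDuplicates is stateful; bound the state with a watermark:
# df.withWatermark("event_ts", "10 minutes").dropDuplicates(["order_id", "event_ts"])
```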
Now coming to tuning for memory management. What we generally do is check how much of the heap is being consumed. There are two sides to it. If the bottleneck is on the driver, we check whether the driver's heap memory is overflowing. That scenario occurs when the driver is trying to accomplish many tasks at once, so we can try increasing the number of driver cores, which takes some load off. If a lot of data is being brought to the driver, that is not recommended; you should not use collect() in a production job. Rather, you should write to a table, because table writes are handled by the workers. That is the general recommendation. If there is memory pressure due to a huge query plan, we can increase the driver size to the next level, which can handle it, or we can break the plan down into sizable chunks and consume the data that way. Next is the worker level. At the executor level, sometimes OOMs happen and the worker gets killed; this is mainly due to data skew. If there is not enough memory, we commission additional workers and also increase the parallelism. We also use spark.task.cpus to increase the resources available to a single task. These are the common steps we usually implement for memory management.
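An illustrative set of knobs for the steps above; the values are placeholders to adapt, not recommendations:

```python
# Runtime setting: raise shuffle parallelism so partitions stay small.
spark.conf.set("spark.sql.shuffle.partitions", "400")

# These must be set at cluster launch (spark-submit / cluster Spark config),
# not at runtime:
#   spark.driver.memory    16g   # headroom for large query plans
#   spark.driver.cores     4     # more cores when the driver juggles many jobs
#   spark.executor.memory  32g
#   spark.task.cpus        2     # more resources per individual task

# Avoid collect() in production; let the workers write the result instead.
result_df.write.mode("overwrite").saveAsTable("analytics.results")
```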
How do you ensure fault tolerance? Usually, fault tolerance can be achieved by keeping multiple copies of the data, and data consistency we can achieve through data governance: we can onboard our databases through Unity Catalog. Unity Catalog tracks consistency issues in the data and flags them in the lakehouse dashboard. That feature is very useful and can definitely be implemented.
What strategies would you use to migrate Python scripts running in a legacy system to PySpark for parallel processing? Okay, so if there are legacy systems, during migration the things we keep in mind are the volume of data coming in and the source of the data; if that is changing, some refactoring will definitely have to happen. I have worked on so many projects, and some sort of refactoring always comes in, so we should keep that in mind and take care of it. Beyond that, we can more or less translate the Python code directly: if it's pure Python and can be handled by the out-of-the-box capabilities of PySpark, we convert it directly; otherwise we can use UDFs to achieve the same thing, UDFs and UDAFs for aggregate functions. And to achieve parallel processing, we should utilize all the worker cores. Those are the things to keep in mind for the migration.
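A hedged sketch of wrapping legacy Python logic in a pandas UDF so it runs in parallel across worker cores; legacy_score and the column names stand in for the existing code:

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

def legacy_score(amount: float, weight: float) -> float:
    # Stand-in for the untouched legacy Python function.
    return amount * weight

@pandas_udf(DoubleType())
def score_udf(amount: pd.Series, weight: pd.Series) -> pd.Series:
    # Executes on the workers over Arrow batches, one chunk per task.
    return pd.Series(
        [legacy_score(a, w) for a, w in zip(amount, weight)])

scored = df.withColumn("score", score_udf("amount", "weight"))
```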
Can you propose a method for real-time data processing? Okay, so Kinesis is a Kafka-like framework from AWS. I have used Kinesis as a source, ingesting data from Kinesis and bringing it into our data lake. But you can also use AWS Lambda, where an event can trigger a workload or end-of-pipeline processing based on the incoming data. You can use the Kinesis connectors that are readily available in AWS, connect to Lambda, and do the processing. One example, if I have to give one, is email triggering: if some topic has data indicating a deviation in DQ (data quality), or some other deviation, or there's a queue where broken records are pushed, then for those notifications we can use AWS Lambda connected with AWS email services. That's a good use case of Kinesis with Lambda. Apart from that, for real-time applications we can utilize any of the streaming services in real time.
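A sketch of ingesting a Kinesis stream on Databricks (the "kinesis" source is Databricks-specific); the stream name, region, and sink are placeholders:

```python
# Read the stream; records arrive with a binary `data` column.
raw = (spark.readStream
       .format("kinesis")
       .option("streamName", "events-stream")
       .option("region", "us-east-1")
       .option("initialPosition", "latest")
       .load())

events = raw.selectExpr(
    "cast(data as string) as payload",
    "approximateArrivalTimestamp")

# Land the decoded records in the lake as a Delta table.
(events.writeStream
 .option("checkpointLocation", "s3://bucket/checkpoints/kinesis")
 .toTable("bronze.kinesis_events"))
```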
What process would you recommend for migrating from on-prem to AWS S3? So, for the migration from on-prem to S3: firstly, identify PII data. We should not take PII data to the cloud; banks are usually hesitant about that. Other than that, we can utilize a VPC for securing the connections, like a private VPC. Then, if it's a Parquet table, we can move the Parquet files directly and create tables on top of them. And AWS offers migration solutions we can use to migrate all the data from the on-prem Hadoop cluster. Thank you for the time.
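A small sketch of that last step, exposing migrated Parquet files in S3 as a table; the path and table name are placeholders:

```python
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.customers
    USING PARQUET
    LOCATION 's3://bucket/migrated/customers/'
""")
```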