
I am a strategic Big Data leader with 11+ years driving measurable business outcomes through innovative data solutions. I specialize in transforming operational challenges into competitive advantages across insurance, banking, retail, and telecommunications sectors.
My expertise lies in architecting enterprise-grade data ecosystems using Azure and GCP technologies that directly enhance decision-making velocity, reduce operational costs, and unlock revenue opportunities. I excel at translating complex business requirements into scalable data platforms that drive ROI and sustainable growth.
Core Capabilities:
Cloud-Native Solutions: Azure Databricks, Synapse Analytics, GCP BigQuery
Real-Time Analytics: Apache Spark, Kafka, streaming data architectures
Modern Data Stack: DBT, Snowflake, Delta Lake, NoSQL implementations
I don't just build data pipelines—I architect strategic capabilities that align technical innovation with business objectives, ensuring every solution delivers measurable value and positions organizations for data-driven excellence.
Senior Data Engineer, Vanderlande
Senior Data Engineer, Lifepal
Senior Bigdata Developer, Barclays
Spark Developer, Tech Mahindra
Senior Technical Associate, Teradata
Technical Services Specialist, IBM
Hadoop Developer, Sears IT and Management Services India
Python
Delta Lake
Apache Spark
Scala
Kafka
PySpark
Apache Hive
Sqoop
Apache Superset
Metabase
Snowflake
Redis
Neo4j
Apache NiFi
SQL
GCP
Control-M
Eclipse
IntelliJ
PyCharm
MySQL
Teradata
GitLab
SVN
Git
Jenkins
Jira
Bitbucket
Bamboo
Received the Best Performer of the Month award from the client.
Okay, so I have 10 years of experience, and I've worked on many big data stacks, including Hadoop, Hive, Pig, and Spark. I've been working on Spark for the past 7 years, and I've also used Kafka. I've worked on many cloud projects on Azure; the Azure services I've used include Azure Data Lake, Data Factory, Synapse, Databricks, and Delta tables. I've done many projects on real-time streaming as well as batch processing, and the programming languages I've worked with are Scala, Python, and Java. Currently, I'm working on a project where we get data from IoT Edge. We have some modules built in Java, and the data lands in blob storage; from there we push it into Databricks. We've written a Databricks notebook that consumes the data, and we use Auto Loader, so whenever a file is uploaded to the blob the data gets consumed. We process the data through Databricks and store it in Delta tables; specifically, we use Delta Live Tables, which are streaming tables, and all our data sits in them. We created a pipeline on top of that, and the running pipeline stores the data into the Delta Live Tables. On top of that we built a dashboard that consumes the data from the Delta Live Tables. That's one project. I've also done many real-time streaming projects built on Kafka, which is where my Kafka exposure comes from. On the Azure side I've also used Azure Functions: we created APIs in Spring Boot and deployed them as Azure Functions. So that's my experience in brief.
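To make that ingestion flow concrete, here is a minimal sketch of a Databricks Auto Loader notebook that picks up files landing in blob storage and writes them to a Delta table. The mount paths, source format, and table name are illustrative assumptions, and in a Delta Live Tables pipeline the same logic would sit inside a pipeline-managed table definition.

```python
# Minimal sketch of an Auto Loader ingestion notebook (Databricks).
# Paths, schema location, and table names are illustrative placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Incrementally pick up new files as they land in the blob container
raw_stream = (
    spark.readStream
    .format("cloudFiles")                      # Databricks Auto Loader
    .option("cloudFiles.format", "json")       # assumed source format
    .option("cloudFiles.schemaLocation", "/mnt/landing/_schemas/iot")
    .load("/mnt/landing/iot/")                 # assumed blob mount path
)

# Write the processed stream into a Delta table
(
    raw_stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/iot_ingest")
    .trigger(availableNow=True)
    .toTable("iot_bronze")                     # illustrative target table
)
```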
Okay, so we mainly use Spark. In one project, a telecom project, we receive incident data in real time. Whenever a cable cut happens, an incident is generated and pushed to a Kafka topic. We have separate Kafka topics based on the status of the incident: when an incident is created it's in a queued state, then it moves to in-progress, and after that it can be in an active, deferred, closed, or cancelled state. We consume the data from those topics with Spark Streaming, which returns a DStream, a collection of RDDs, and we process it. Each record carries the incident ID, the incident status, and the timestamp at which the incident was created or updated. Using the incident ID we make a REST call that returns a large JSON response containing incident data, customer data, and ticket data, and we parse that JSON. We store the intermediate data, and the final data goes into MongoDB, where we have different collections: an incident collection for incident-related data and a ticket collection holding the ticket ID, the ticket status, and the technician working on that ticket. We update MongoDB in real time so the customer always has a complete picture of the ticket and can see its status on the UI. The job runs continuously on a three-node Spark cluster, and we receive roughly 80 to 100 incidents per second.
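The flow above can be sketched roughly as follows. This version uses Structured Streaming rather than the DStream API, and the topic names, REST endpoint, MongoDB URI, and collection names are illustrative assumptions; the Kafka connector and the pymongo driver are assumed to be available on the cluster.

```python
# Sketch of the incident-streaming flow described above, written with
# Structured Streaming instead of DStreams. All endpoints are illustrative.
import requests
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.getOrCreate()

incident_schema = StructType([
    StructField("incident_id", StringType()),
    StructField("incident_status", StringType()),
    StructField("updated_at", TimestampType()),
])

# Consume the status-specific incident topics from Kafka
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # assumed brokers
    .option("subscribe", "incident-queued,incident-inprogress,incident-closed")
    .load()
    .select(from_json(col("value").cast("string"), incident_schema).alias("e"))
    .select("e.*")
)

def enrich_and_store(batch_df, batch_id):
    # For each incident, fetch the full JSON via REST and upsert into MongoDB
    from pymongo import MongoClient                      # assumed driver
    client = MongoClient("mongodb://mongo:27017")        # assumed URI
    incidents = client["telecom"]["incidents"]
    for row in batch_df.collect():                       # fine at ~100 events/sec
        detail = requests.get(
            f"https://itsm.example.com/incidents/{row.incident_id}"  # assumed API
        ).json()
        incidents.update_one({"incident_id": row.incident_id},
                             {"$set": detail}, upsert=True)

(
    events.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/incidents")
    .foreachBatch(enrich_and_store)
    .start()
    .awaitTermination()
)
```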
What matters for data quality is proper filtration of the data: don't store unnecessary data, and normalize the data while storing it so that only clean, well-structured data lands in the target. While storing, partition the data properly, create buckets where appropriate, and take care of duplicates; no duplicate data should be stored in the target. That is how you maintain data quality and data integrity. The data should be normalized: you don't need to keep everything in one large table, you separate out the tables according to your scenarios and store the data in a distributed way across multiple tables based on your use case, with proper partitioning. For example, if you're dealing with retail data, store it year-wise, then month-wise, then date-wise. That gives you a complete picture of the data, and when you query it for a particular period you don't need to scan everything; you put a filter on a particular partition and fetch only that partition's data, which improves query performance. Also filter the data at the start and remove junk characters; Spark has different read modes, such as failfast and permissive, that help you maintain data quality in your ETL process. So don't store unnecessary data in the target, and handle duplicates while reading the data.
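A small PySpark sketch of the partitioning and deduplication idea above, assuming illustrative column names (order_id, sale_date) and paths:

```python
# Hedged sketch: deduplicate, then partition retail data year/month/day.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, year, month, dayofmonth

spark = SparkSession.builder.getOrCreate()

sales = spark.read.parquet("/data/raw/retail_sales")      # assumed source path

cleaned = (
    sales.dropDuplicates(["order_id"])                     # no duplicate rows in the target
    .withColumn("year", year(col("sale_date")))
    .withColumn("month", month(col("sale_date")))
    .withColumn("day", dayofmonth(col("sale_date")))
)

# Partition year-wise, then month-wise, then date-wise so queries can prune
(
    cleaned.write
    .mode("overwrite")
    .partitionBy("year", "month", "day")
    .parquet("/data/curated/retail_sales")
)

# Querying one period scans only the matching partitions, not the whole table
march_2024 = spark.read.parquet("/data/curated/retail_sales").where(
    (col("year") == 2024) & (col("month") == 3)
)
```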
You can perform many data validation checks using Python. For example, if you want the data stored with a particular date format, you can validate the dates against that format; similarly, if you're loading the data into a target like Snowflake, you can validate it before loading. The data should not be duplicated, the data format should be correct, and the data types should be correct. You can also check whether the relationships in the data are maintained or not. So you perform validations based on your use case and your scenarios, and you can create a UDF in Spark and run the data validation through that.
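A hedged sketch of those checks in PySpark, with an illustrative date-format UDF, a type cast, and deduplication; the column names and paths are assumptions, and the validated output could just as well be loaded into Snowflake.

```python
# Minimal sketch of the validation checks mentioned above:
# date format, data types, and duplicates. Column names are illustrative.
from datetime import datetime
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/data/staging/orders")            # assumed source

@udf(BooleanType())
def is_valid_date(value):
    # Accept only the expected yyyy-MM-dd format
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except (TypeError, ValueError):
        return False

validated = (
    df.dropDuplicates(["order_id"])                         # no duplicate records
    .withColumn("amount", col("amount").cast("double"))     # enforce expected type
    .filter(is_valid_date(col("order_date")))               # keep well-formed dates
)

validated.write.mode("append").parquet("/data/validated/orders")
```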
I'm not logged on for this.
So instead of plain Python code, you can use PySpark, where you can create a workflow that reads data from multiple sources. Spark provides different built-in connectors that let you read from multiple sources, perform the transformations you want, and store the data into the target. You first need to decide which format you want to store the data in and what format is present at the source; then you do the reading, apply the transformations, and write the data to the target.
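A brief PySpark sketch of that kind of multi-source workflow, with illustrative connection details (a CSV file plus a JDBC table) and a Parquet target standing in for whatever the real target is; the matching JDBC driver is assumed to be on the classpath.

```python
# Hedged sketch: read from two sources, transform, write to a target.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Source 1: CSV files
customers = spark.read.option("header", True).csv("/data/in/customers.csv")

# Source 2: a JDBC table (assumed MySQL connection details)
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://db-host:3306/shop")
    .option("dbtable", "orders")
    .option("user", "etl").option("password", "***")
    .load()
)

# Transformation: join and keep only the columns the target expects
result = (
    orders.join(customers, "customer_id")
    .select("order_id", "customer_id", "order_total", "country")
    .filter(col("order_total") > 0)
)

# Target: Parquet here, but this could equally be Delta, Snowflake, etc.
result.write.mode("overwrite").parquet("/data/out/orders_enriched")
```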
So in this code, the load is failing because the transformed data is not correct: while loading the data into the target, it is unable to load due to a data type issue, where the data types don't match the data types of the target. Suppose you are inserting into a table and a column's data type is different; you need to perform data validation while loading the data and check that the columns are correct for your table.
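One way to guard against that kind of type mismatch is to align the transformed data with the target table's schema before loading. The sketch below assumes an existing catalog table named analytics.orders purely for illustration.

```python
# Sketch of a pre-load check: cast the transformed data to the target
# table's schema and verify the columns line up before loading.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

transformed = spark.read.parquet("/data/staging/transformed")   # assumed path
target_schema = spark.table("analytics.orders").schema          # target table's schema

# Fail fast if the transformed data is missing columns the target expects
missing = set(target_schema.fieldNames()) - set(transformed.columns)
if missing:
    raise ValueError(f"Missing columns for target table: {missing}")

# Cast each column to the target's data type so the load does not fail
aligned = transformed.select(
    [col(f.name).cast(f.dataType) for f in target_schema.fields]
)

aligned.write.mode("append").insertInto("analytics.orders")
```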
You need to pass the DB connection. I'm not sure.
Okay, so for the process of tuning a data processing pipeline in BigQuery, you can use BigQuery's built-in features.
I'm not certain.