
I am a strategic Big Data leader with 11+ years driving measurable business outcomes through innovative data solutions. I specialize in transforming operational challenges into competitive advantages across insurance, banking, retail, and telecommunications sectors.
My expertise lies in architecting enterprise-grade data ecosystems using Azure and GCP technologies that directly enhance decision-making velocity, reduce operational costs, and unlock revenue opportunities. I excel at translating complex business requirements into scalable data platforms that drive ROI and sustainable growth.
Core Capabilities:
Cloud-Native Solutions: Azure Databricks, Synapse Analytics, GCP BigQuery
Real-Time Analytics: Apache Spark, Kafka, streaming data architectures
Modern Data Stack: DBT, Snowflake, Delta Lake, NoSQL implementations
I don't just build data pipelines—I architect strategic capabilities that align technical innovation with business objectives, ensuring every solution delivers measurable value and positions organizations for data-driven excellence.
Experience:
Senior Data Engineer, Vanderlande India Pvt Ltd
Senior Data Engineer, Lifepal
Senior Developer, Barclays
Spark Developer, Tech Mahindra
Data Engineer, Teradata India Pvt Ltd
Developer, IBM India Pvt Ltd
Hadoop Developer, Sears IT and Management Services India Pvt Ltd
Skills:
Python
Delta Lake
Apache Spark
Scala
Kafka
PySpark
Apache Hive
Sqoop
Apache Superset
Metabase
Snowflake
Redis
Neo4j
Apache NiFi
SQL
GCP
Control-M
Eclipse
IntelliJ
PyCharm
MySQL
Teradata
GitLab
SVN
Git
Jenkins
Jira
Bitbucket
Bamboo
Received Best Performer of the Month recognition from the client.
Okay, so I have a total of 10 years of experience. I have worked on much of the big data stack, like Hadoop, Hive, Pig, and Spark, and I have been working with Spark for the last 7 years. I have used Kafka, and I have worked on many cloud projects on Azure, with services like Azure Databricks, Data Factory, Synapse, Delta Lake, and Delta tables. I have done many projects on real-time streaming as well as batch processing, and the programming languages I have worked with are mainly Scala, Python, and Java. Currently I am working on a project where we get data from IoT Edge. We have some modules built in Java, and the data is stored in blob storage. From there we consume the data in Databricks: we have written a notebook that uses Auto Loader, so whenever a file is uploaded to the blob it gets consumed automatically. We process the data in Databricks and store it into Delta Live Tables, which are essentially streaming tables. We created a pipeline on top of those tables, and that pipeline runs and loads the data into the Delta Live Tables. On top of that we built a dashboard that reads from the Delta Live Tables. That is one project. I have also done many real-time streaming projects based on Kafka itself, so that is my exposure to Kafka. On the cloud side, in Azure we have also used Azure Functions: we created APIs in Spring Boot and deployed them as Azure Functions. So that is my experience.
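A minimal sketch of the Auto Loader plus Delta Live Tables flow described above, assuming a Databricks DLT pipeline in Python; the storage path, table names, file format, and the event_id key are illustrative placeholders, not the actual project code.

    # Runs inside a Delta Live Tables pipeline, where the `spark` session is provided.
    import dlt
    from pyspark.sql import functions as F

    @dlt.table(name="iot_events_bronze", comment="Raw IoT files picked up by Auto Loader")
    def iot_events_bronze():
        return (
            spark.readStream.format("cloudFiles")               # Auto Loader incremental ingestion
            .option("cloudFiles.format", "json")                # assumed file format
            .load("abfss://landing@storageaccount.dfs.core.windows.net/iot/")  # placeholder path
        )

    @dlt.table(name="iot_events_silver", comment="Cleaned events feeding the dashboard")
    def iot_events_silver():
        return (
            dlt.read_stream("iot_events_bronze")
            .withColumn("ingested_at", F.current_timestamp())
            .dropDuplicates(["event_id"])                       # hypothetical key column
        )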
Okay, so we mainly used Spark. I did one project where we were getting data in real-time streaming. It is a telecom project where we receive incident data: whenever a cable cut happens, an incident is generated and pushed to a Kafka topic. We have multiple Kafka topics based on the status of the incident: when an incident is created it is in the queued state, then it moves to the in-progress state, and after that it can be in the active, deferred, closed, or cancelled state. We consume the data from those topics with Spark Streaming, which returns a DStream, and a DStream is a collection of RDDs. Each record has an incident ID, the incident status, and the timestamp at which the incident was created or updated. Based on the incident ID we make a REST call and get a large JSON response, which we parse into incident data, customer data, and ticket data. We store the intermediate results, and the final data goes into MongoDB, where we have different collections: an incident collection for incident-related data and a ticket collection for ticket-related data such as the ticket ID, the ticket status, and which technician is working on the ticket. We update MongoDB in real time so the customer always has a complete picture and can see the status of their ticket on the UI. The job runs continuously on a three-node Spark cluster, and we receive around 80 to 100 incidents per second.
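A minimal sketch of the Kafka consumption step, written with Structured Streaming rather than the DStream API used in the project; the broker address, topic names, and incident schema are assumptions for illustration.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession.builder.appName("incident-stream").getOrCreate()

    incident_schema = StructType([
        StructField("incident_id", StringType()),
        StructField("incident_status", StringType()),
        StructField("updated_at", TimestampType()),
    ])

    incidents = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")              # placeholder broker
        .option("subscribe", "incident-created,incident-inprogress")   # one topic per status
        .load()
        .select(F.from_json(F.col("value").cast("string"), incident_schema).alias("i"))
        .select("i.*")
    )

    # Each micro-batch would then call the incident REST API and upsert the enriched
    # result into the MongoDB collections; foreachBatch is the usual hook for that.
    query = (
        incidents.writeStream
        .foreachBatch(lambda batch_df, _id: batch_df.show())   # stand-in for the REST + MongoDB step
        .start()
    )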
So to maintain data quality, you have to do proper filtration of the data: you don't need to store unnecessary data, and you can normalize the data while storing it. While storing, you should partition the data properly, create buckets, and take care of duplicates, so that no duplicate data is stored in the target. That is how you maintain data quality and data integrity. The data should be normalized: you don't need to keep everything in one large table, you should separate it into multiple tables as per your scenarios and store it in a distributed way based on your use case, with proper partitioning. Suppose it is retail data: store it partitioned year-wise, then month-wise, then date-wise. That way you have a complete picture of the data, and when you query it you don't need to scan everything; if you want to fetch data for a particular period you query that particular partition only, so your query performance improves. Also filter the data at the start and drop junk characters. Spark has different modes for handling bad records while reading (FAILFAST is one of them, I am not able to recall the others) that help you maintain data quality in your ETL process. So don't store unnecessary data in the target, and handle duplicates while reading the data; that's it.
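A minimal sketch of the partitioning and de-duplication points above; the source path, the order_id key, and the order_date column are illustrative.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("retail-load").getOrCreate()

    sales = (
        spark.read.parquet("/landing/retail/sales")        # placeholder source
        .dropDuplicates(["order_id"])                      # no duplicate rows in the target
        .withColumn("year", F.year("order_date"))
        .withColumn("month", F.month("order_date"))
        .withColumn("day", F.dayofmonth("order_date"))
    )

    # With year/month/day partitions, a query for one date reads one partition
    # instead of scanning the whole table.
    (sales.write.mode("overwrite")
          .partitionBy("year", "month", "day")
          .parquet("/warehouse/retail/sales"))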
You can perform many data validation checks using Python. Suppose you want to store the data in a particular date format: you can validate that the data matches that format. Also, if you are loading the data into Snowflake or another target, you can check that the data is not duplicated, that the format is correct, and that the data types are correct. You can also check that the relationships in the data are maintained. So the validations depend on your use case and your scenarios. In Spark you can create a UDF and perform the data validation on top of that.
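A minimal sketch of the UDF-based validation mentioned above; the yyyy-MM-dd date format and the column names are assumptions.

    from datetime import datetime
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import BooleanType

    spark = SparkSession.builder.appName("validation").getOrCreate()

    @F.udf(returnType=BooleanType())
    def is_valid_date(value):
        # Accept only the expected yyyy-MM-dd format.
        try:
            datetime.strptime(value, "%Y-%m-%d")
            return True
        except (TypeError, ValueError):
            return False

    df = spark.createDataFrame(
        [("1", "2024-01-15"), ("2", "15/01/2024"), ("2", "15/01/2024")],
        ["id", "order_date"],
    )

    validated = (
        df.dropDuplicates(["id"])                          # no duplicate keys in the target
          .withColumn("date_ok", is_valid_date("order_date"))
    )
    validated.show()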
I have not worked on this.
So instead of doing it in plain Python, you can use PySpark, where you can create a workflow and read the data from multiple sources. Spark provides different connectors that let you read from multiple sources, perform whatever transformations you want, and store the data into the target. First you need to decide in which format you want to store the data and what format is present in the source. Once you have that picture, you do the reading part, then the transformations, and then you store the data into the target.
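A minimal sketch of the multi-source read, transform, and write flow described above; the file path, JDBC URL, credentials, and table names are placeholders (a matching JDBC driver would need to be on the cluster).

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("multi-source-etl").getOrCreate()

    # Read from two different sources using Spark's built-in connectors.
    orders = spark.read.option("header", "true").csv("/landing/orders/")   # file source
    customers = (
        spark.read.format("jdbc")
        .option("url", "jdbc:mysql://db-host:3306/crm")    # placeholder connection
        .option("dbtable", "customers")
        .option("user", "etl_user")
        .option("password", "***")
        .load()
    )

    # Join, transform, and write in the format chosen for the target.
    result = orders.join(customers, on="customer_id", how="left")
    result.write.mode("overwrite").parquet("/warehouse/orders_enriched")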
So in this code, the loading part fails because the transformed data is not correct: while loading the data into the target, it is unable to load due to a data type issue, where the data type does not match the data type of the target. Suppose you are putting data into a table and the data type is different; you need to perform that data validation while loading the data and check whether the columns are correct according to the target table.
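A minimal sketch of checking column types against the target before the load step discussed above; the expected schema here is an assumption.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import IntegerType, StringType, DateType

    spark = SparkSession.builder.appName("load-validation").getOrCreate()

    df = spark.read.option("header", "true").csv("/landing/orders/")   # everything arrives as string

    # Expected types of the target table (illustrative).
    expected = {"order_id": IntegerType(), "customer": StringType(), "order_date": DateType()}

    # Cast to the target types; rows that fail the cast become null and can be
    # routed to a rejects table instead of breaking the load.
    casted = df.select([F.col(c).cast(t).alias(c) for c, t in expected.items()])
    rejects = casted.filter(F.col("order_id").isNull() | F.col("order_date").isNull())
    clean = casted.filter(F.col("order_id").isNotNull() & F.col("order_date").isNotNull())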
You need to pass the DB connection. I'm not sure.
Okay. So, the process of tuning a data processing pipeline in BigQuery...
Not sure.