Accomplished Data Scientist with over 13 years of experience conceptualizing, architecting, and maintaining scalable Machine Learning models with automated training, validation, monitoring, and reporting. Proficient across advanced domains including GenAI, Natural Language Processing, Pattern Recognition, Sequence Analysis, Time Series, and Prediction. Demonstrated track record of leading and delivering more than 20 end-to-end machine learning and AI initiatives. Seeking a challenging Data Scientist position where I can leverage my expertise in developing and deploying complex data science and analytics projects to support the commercial organization.
Staff Data Analytics Engineer (Architect), Avalara
Senior Data Scientist, Visa
Senior Data Engineer (Lead), Maersk
Software Engineer (Data Analytics), Tech Mahindra
Application Development Analyst (Data Analytics), Accenture
Hive
Spark
Teradata
Informatica
MSBI
Microsoft Azure
SQL DB
Power BI
Azure ML
Hadoop
Sqoop
Tableau
AWS
Oracle
Kafka
Airflow
SAP BO
Postman
SAS
Ab Initio
GCP
Docker
Kubernetes
Jenkins
DevOps
Kubeflow
MLflow
Hi, I'm Jemin. I have over 12 years of experience in the data and analytics space. I have worked across data warehousing, data engineering, and data science, and I have also worked extensively on migration projects: on-prem to on-prem, on-prem to cloud, cloud to cloud, and cloud to on-prem. I have worked with technologies such as Microsoft Azure, AWS, Teradata, Informatica, SAP S/4HANA, and SAP BI/BW. On the big data stack, I have worked on Hadoop, Hive, Pig, Airflow, Spark, and Kafka. On the machine learning front, I have built classification, regression, and time series models, deep learning models such as neural networks and transformers, and AI applications spanning natural language processing, large language models, and various other generative AI use cases. In terms of domains, I have worked in finance, insurance, banking, payments, transportation and logistics, and telecommunications.
The approach we would take to secure sensitive financial data during migration from SAP to Azure is to build encryption into the movement itself: while data is moving from on-prem SAP systems to Azure, it should be encrypted in flight with a strong encryption algorithm so that it is difficult to decrypt while in motion. Once the data has landed in Azure, decryption logic can transform it back to its original form. If there is a requirement that the sensitive financial data never be displayed in the clear, the encrypted data can continue to flow from SAP into Azure, and within Azure as well, to protect it from unwanted access and to ensure that even if there is unauthorized access, no sensitive or personalized information is exposed.
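As a rough illustration of the field-level encryption idea, here is a minimal sketch assuming a Python extraction step and the cryptography package; SAP and Azure tooling may provide their own in-flight encryption, and the key handling and field names below are simplified, hypothetical placeholders.

```python
# Minimal sketch: symmetric field-level encryption before the data leaves the
# source side, using the `cryptography` package (an assumption). In practice
# the key would live in a vault such as Azure Key Vault, not in the script.
from cryptography.fernet import Fernet

key = Fernet.generate_key()   # store securely, e.g. in a key vault
cipher = Fernet(key)

def encrypt_field(value: str) -> str:
    """Encrypt a single sensitive field before it is written to the transfer layer."""
    return cipher.encrypt(value.encode("utf-8")).decode("utf-8")

def decrypt_field(token: str) -> str:
    """Decrypt the field once it has landed in Azure (only where allowed)."""
    return cipher.decrypt(token.encode("utf-8")).decode("utf-8")

# Hypothetical record being staged for migration
record = {"invoice_id": 1001, "iban": "DE89370400440532013000"}
record["iban"] = encrypt_field(record["iban"])
```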
We would generally normalize data when you have a wide array of dimensions, each holding its own metrics; to keep the number of joins manageable and to make sure the data is unique within each of those dimensions, we end up normalizing the data model. On the other hand, when a wide variety of data is coming in and there is a need to access it constantly, and you are willing to accept the redundancy, duplication, and overhead it brings to the system, you can keep it in denormalized form. When you want all of the data accessible in one place and do not want to repeat the same joins over and over again at multiple points in the data pipeline, a denormalized data model is the better choice.
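To make the contrast concrete, here is an illustrative PySpark sketch; the table and column names are hypothetical and the write path is a placeholder.

```python
# Contrast between the normalized and denormalized shapes described above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("normalization-demo").getOrCreate()

customers = spark.createDataFrame(
    [(1, "Acme Corp"), (2, "Globex")], ["customer_id", "customer_name"])
orders = spark.createDataFrame(
    [(10, 1, 250.0), (11, 2, 90.0)], ["order_id", "customer_id", "amount"])

# Normalized: each dimension is stored once; queries join as needed.
normalized_view = orders.join(customers, "customer_id")

# Denormalized: the join is paid once up front and the wide table is reused,
# accepting redundancy in exchange for simpler, repeated reads downstream.
denormalized = normalized_view.select(
    "order_id", "customer_id", "customer_name", "amount")
denormalized.write.mode("overwrite").parquet("/tmp/orders_denormalized")
```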
When migrating large datasets from S/4HANA to Azure or any other cloud, the general approach people follow is to move the data over the network. The challenge with that is that, depending on your subscription and the services you have opted for from the cloud provider, bandwidth can be limited, and the file type, data volume, and sheer width of the data all matter: a table can have 10 fields, 100 fields, or 1,000 fields, and moving a 10-field table is very different from moving a 1,000-field table. These are the key aspects to keep in mind when moving from on-prem S/4HANA to the cloud over a network. The preferred approach is to create a separate sandbox that is directly connected to Azure, with the necessary connectors and drivers installed, and have the data read directly from that sandbox into the Azure network or the Azure Data Lake. The advantage this brings is total isolation of the data migration and full availability of both servers, S/4HANA for pulling the data and ADLS for receiving it. This is how you would address performance bottlenecks in real-time data replication; we achieved this in one of my previous organizations when we migrated from on-prem Teradata to Microsoft Azure's cloud stack.
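As a sketch of the landing step from such a sandbox, assuming a Spark session already configured with credentials for an ADLS Gen2 account; the account, container, and path names below are hypothetical.

```python
# Land a staged S/4HANA extract into ADLS Gen2 as partitioned Parquet.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s4hana-extract-load").getOrCreate()

# Hypothetical extract staged by the sandbox (e.g. as Parquet files).
extract_df = spark.read.parquet("/sandbox/exports/s4hana/fi_documents/")

# Partition by load date so each run lands in its own partition and large
# tables can be moved and validated in manageable chunks.
(extract_df
    .write
    .mode("append")
    .partitionBy("load_date")
    .parquet("abfss://raw@examplelake.dfs.core.windows.net/s4hana/fi_documents/"))
```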
Whenever you are moving data across multiple data warehouses and data lakes, one way to maintain data lineage is the manual method: whenever any change is implemented, appropriate documentation is maintained that gives complete information about the lineage, and that documentation is updated along with the change. If that is not available, you can use the data lineage tools provided by various vendors, which can also connect to Spark if you are on the big data stack. Teradata has a built-in mechanism itself: you can query the DBC dictionary tables and see the whole lineage, including how the data flows and which table is written or used in which script, macro, or stored procedure. When it comes to data lakes, with big data components such as Hadoop, Hive, Spark, or Kafka, there are dedicated tools that help you achieve this; if not, you build data pipelines to process the data and keep track of lineage through those pipelines. In Azure, for example, you have Azure Data Factory, where you configure the steps, what should run, when, and where, and on that basis you can keep track of how the data has been flowing and trace it back when the time comes.
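A hedged sketch of the Teradata dictionary approach, assuming the teradatasql driver and read access to the DBC.TablesV view; the host, credentials, and the object being traced are placeholders.

```python
# Mine Teradata's data dictionary for lineage clues: which objects reference
# a given table in their defining SQL.
import teradatasql

query = """
SELECT  DataBaseName,
        TableName,
        TableKind,          -- T = table, V = view, M = macro, P = procedure
        RequestText         -- the defining SQL text, useful for tracing references
FROM    DBC.TablesV
WHERE   RequestText LIKE '%FACT_SALES%'   -- hypothetical object being traced
"""

with teradatasql.connect(host="td-prod.example.com", user="lineage_ro", password="***") as con:
    with con.cursor() as cur:
        cur.execute(query)
        for database, table, kind, request_text in cur.fetchall():
            print(database, table, kind)
```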
To load incremental data between S/4HANA and a target warehouse, since we are talking about a warehouse, we can implement SCDs, slowly changing dimensions, and maintain the target in SCD Type 2 format: if a record does not exist in the target table it is inserted, and if it already exists and any of the key metrics or parameters within that record have changed, it is updated with history preserved. The other way to do it is to feed the data into the target warehouse using a partitioning mechanism, where the partitions are created by date, daily, or at whatever frequency you intend to load, so that each new dataset lands only in its own partitions. The drawback of the partitioning approach is that if a record that has already been processed changes, you end up duplicating it; you can add a flag of sorts that identifies the latest record in case of duplication. Otherwise, maintain it as SCD Type 2 to avoid any confusion, so that as the data arrives it is either inserted or updated based on the changes.
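Here is a minimal sketch of the insert-or-update step, assuming the target is a Delta Lake table (an assumption, since the source does not name the storage format); paths and key columns are hypothetical. This shows the simplest Type 1 style merge; a full SCD Type 2 would additionally close the old row and insert a new one with effective-date and current-flag columns.

```python
# Incremental upsert into the target table using a MERGE.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("incremental-load").getOrCreate()

incoming = spark.read.parquet("abfss://raw@examplelake.dfs.core.windows.net/s4hana/customers/latest/")
target = DeltaTable.forPath(spark, "abfss://curated@examplelake.dfs.core.windows.net/dw/dim_customer/")

(target.alias("t")
    .merge(incoming.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()      # existing record with changed attributes -> update
    .whenNotMatchedInsertAll()   # new record -> insert
    .execute())
```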
Since the data has been passed as an array of arrays, more like a nested data dictionary, it would throw an error because it would fail to understand which field maps to what. It is advisable to break them out separately, for example id = [1, 2, 3], name = ['John Doe', 'Jane Doe', 'Mike Brown'], and salary = [12000, 15000, None]. That way it understands what each of these fields is, and it would create the DataFrame without an issue.
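A minimal sketch of the fix described above, assuming pandas; the original question's exact input is not shown here, so the values simply mirror the answer (the same dict-of-lists shape can also be handed to a Spark session via spark.createDataFrame on the pandas DataFrame).

```python
# Build the DataFrame from one flat list per column instead of nested arrays.
import pandas as pd

data = {
    "id": [1, 2, 3],
    "name": ["John Doe", "Jane Doe", "Mike Brown"],
    "salary": [12000, 15000, None],   # the blank salary becomes None/NaN
}

df = pd.DataFrame(data)   # one column per key, equal-length lists -> no error
print(df)
```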
Since a long chain of transformations has been written directly in the return statement, a DataFrame.withColumn chain, when the return tries to hand back the DataFrame it is simultaneously trying to perform all the transformations written in that return expression. The better way is to build the DataFrame before the return statement and then return the final DataFrame; that way all the necessary transformations are done and cached first, and when the return is invoked it simply hands the result back to the calling function, which can then do whatever it has been asked to do. So it is advisable not to put such a long list of transformations in the return statement, and to keep it simple by returning either the DataFrame or the final, fully transformed DataFrame.
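A sketch of that restructuring, assuming PySpark; the column names and transformations are hypothetical stand-ins.

```python
# Build the transformed DataFrame step by step, then return the finished result.
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def enrich_orders(orders: DataFrame) -> DataFrame:
    # Apply the transformations before the return statement.
    enriched = (orders
                .withColumn("amount_usd", F.col("amount") * F.col("fx_rate"))
                .withColumn("load_ts", F.current_timestamp()))
    final_df = enriched.filter(F.col("amount_usd") > 0)
    # Return the finished DataFrame rather than a long chain inside `return`.
    return final_df
```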
Whenever you have to prioritize migration tasks, you first need to understand the entire application and how it has been designed in the SAP system, and you also have to be clear on the final objective of that particular application or flow. If it is a critical financial application whose output goes to, say, exchanges or securities regulators, you first have to identify all the key dependencies leading into it. By key dependencies I mean dependencies on multiple data sources along the data pipeline, as well as dependencies on user inputs, any manual inputs that may exist. So when planning migration tasks for a project involving an SAP module: first, identify all the source systems feeding into that application or data pipeline. Second, identify all the stakeholders responsible for any manual inputs or manual activities performed throughout the flow; it could be something as simple as passing in an exchange rate or currency value, but regardless, identify all the key manual steps involved and what it would take to automate them while migrating to the next platform. Third, break the data pipeline down into critical, medium-critical, and less critical items, meaning all the critical business transformations responsible for your final reports, calculations, and metrics should be separated out and worked on first, while simultaneously seeing whether the low- and medium-critical items can be progressed as well, so that in parallel we cover all the ground we need to.
While you are using Azure Data Factory, the best way to handle errors is, first, by writing custom error handling during the configuration stage, where you define the Azure Data Factory pipelines. Second, whenever there is a job failure, the job fails and we can get alerts; but if it is a non-critical step, or if something can wait while the next steps run, we can pass conditions saying that although this activity has failed, since nothing depends on it the subsequent steps should still execute, and we can address the failure afterwards.
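A hedged sketch of that dependency pattern, expressed as the Data Factory pipeline JSON held in Python dicts; the activity names are hypothetical. Setting the dependency condition to "Completed" (rather than the default "Succeeded") lets the downstream step run whether or not the non-critical activity failed, and a separate branch on "Failed" can raise the alert.

```python
# Downstream copy runs regardless of whether the optional lookup refresh failed.
non_critical_then_continue = {
    "name": "LoadMainFactTable",
    "type": "Copy",
    "dependsOn": [
        {
            "activity": "RefreshOptionalLookup",   # non-critical step
            "dependencyConditions": ["Completed"]  # run on success or failure
        }
    ],
}

# A separate alerting branch hangs off the failure path of the same activity.
alert_on_failure = {
    "name": "SendFailureAlert",
    "type": "WebActivity",
    "dependsOn": [
        {"activity": "RefreshOptionalLookup", "dependencyConditions": ["Failed"]}
    ],
}
```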
When you are trying to balance immediate data availability against resource optimization, the approach should generally favor resource optimization, because it is not just a question of one application or the particular data pipeline in question; it is the entire node or network on which all of the data processing happens. Even if there is scope for immediate data availability, if it consumes a high volume of resources and thereby impacts other teams and applications on the network, it is advisable to focus more on resource optimization. It is possible to optimize resources and still get near-immediate data availability, but you will have to optimize the entire data pipeline and data flow and set the right configurations. For example, if you are joining against a small dataset, you can use a broadcast join so that the smaller table is not shuffled in and out of memory over and over again, but is instead kept on each executor until the processing has completed.
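A minimal sketch of the broadcast-join optimization mentioned above, assuming PySpark; the table names and paths are hypothetical.

```python
# Broadcast the small dimension so the large fact table is not shuffled for the join.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-demo").getOrCreate()

transactions = spark.read.parquet("/data/transactions/")     # large fact table
currency_dim = spark.read.parquet("/data/currency_codes/")   # small dimension

# Broadcasting ships the small dimension to every executor once, avoiding a
# full shuffle of the large table during the join.
joined = transactions.join(broadcast(currency_dim), "currency_code")
```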