
Experienced Azure Data Engineer with a proven track record in the IT industry. Over two years of hands-on experience in designing, implementing, and managing data solutions on Azure. Specialized in data ingestion, transformation, storage, and analytics using Azure services like Data Factory, SQL Database, Databricks, and Synapse Analytics. Skilled at collaborating with cross-functional teams to gather requirements, architect data solutions, and deliver high-quality outcomes.
Azure Data Engineer
SLB
Azure Data Factory

Microsoft Power BI

Data Warehouse

SQL

Azure Databricks

Logic Apps

Azure Virtual Machines

MicroStrategy
Data integration and data transformation (ETL):
Data Modelling and data visualization:
Financial report covering all the actual, plan, forecast and DSO data of the company across the globe focusing on automation & standardization.
Benefits: Single version of truth, simplified and timely data availability.
Tools: Azure Data Factory, Azure Data Lake Service, Azure Databricks, Azure SQL Warehouse, Azure SQL DB, Azure Virtual Machines, Azure Logic Apps, Power BI, MicroStrategy, SQL.
Created an OLAP cube by employing the below processes:
Provides visibility into the global daily and month-end GL cash and bank balances across multiple bank accounts worldwide for the Treasury team.
Tools: Azure Data Factory, Azure Data Flows, Azure Data Lake Service, Azure SQL Warehouse, Azure SQL DB, Logic Apps, SQL, Automation runbook (PowerShell), AAS, Power BI.
Created an OLAP cube by employing the below processes:
Worked on an automated framework for 20 projects with failure and data-delay alerts covering end-to-end ETL through to the OLAP cube refresh, along with a consolidated Power BI dashboard that tracks real-time and historical CIM progress, reducing manual intervention by 90%.
Tools: Azure Data Factory, Azure Logic Apps, Microsoft Power BI.
Hi, I'm Malvika. I've been working as a data engineer at Schlumberger for the last two years and seven months. During this time, I've explored a lot of data engineering on Azure, which has given me broad experience and exposure. Apart from this, I also take part in other activities in the organization and make sure that the company culture is always prioritized. I'm very enthusiastic and excited to learn new things. Starting with my journey: after my studies, I joined Schlumberger, which is now called SLB. I joined the data engineering team, where I was initially involved in implementing automation. Previously, the monitoring team had someone go through each data factory and each corporate information model (the Azure Analysis Services models) and work through a manual checklist to ensure everything was running fine. Instead of that, we created a mechanism that automatically updates the status of the different flows in a Power BI report, using backend tables and web activities. We also automated it so that as soon as the fact and dimension tables are loaded, it moves on to loading the Azure Analysis Services model, which is an OLAP cube. After that, I was involved in different development projects. My first project was related to SLB's bank accounts and related information. We built a model based on the business requirements, and I used data flows there for the ETL transformations. The next project I was involved in was the financial reporting dashboard, which is also built on an Azure Analysis Services cube. Here, I was exposed to Databricks, where we used PySpark and SQL for ETL. This helped me learn a lot more about the different technologies under Azure.
I'm also upskilling myself day by day to make sure I can contribute better in my roles at the company. Right now, I'm looking for other opportunities so that I can explore more and contribute much better with the knowledge I've acquired so far and am still acquiring. I'm pretty sure I'll be able to adapt to different teams and different technologies that involve data engineering techniques like data modeling and ETL. Yeah, that's it from my end. Thank you so much for listening.
What we do in this case is make sure that whenever a pipeline runs, it's recorded somewhere, say in a configuration table. Before a pipeline starts, we check that table to see whether that pipeline has already run for a particular run ID. If not, we run it and add a new entry saying that, for this pipeline in this data factory, the current run has happened. If it has already run, we skip it. So in case of intermittent issues, there's no repeated run, which saves resources, and only one run happens for a particular run ID even if there is a failure. With respect to data availability, before each run we can check the maximum timestamp of the data available in the source and compare it with the maximum timestamp available in our Azure data warehouse staging tables. Only if new data is available do we let the load happen; otherwise, we don't reload older data that's already present. This makes sure that when we restart the process after a failure, the pipelines that already ran are skipped, only new data is picked up, and no unnecessary runs happen.
In the time travel mechanism, the different versions of the data are stored, and we can use a previous version to restore the data in case of a failure or data loss. We can also have multiple checks based on the approximate amount of data pushed during each incremental load. If that amount is not met, we can restore the previous version. The same applies to dimension tables, which usually hold much less data than fact tables, though it varies. We know the approximate count sent in an incremental load and the approximate count already existing in the dimension table. If the minimum is not met, we can restore the previous version of the dimension table or the fact table and not make the current load available to the users. We can also send a notification email indicating that there is some data loss or data mismatch, meaning the quantity of data doesn't match the threshold. Those emails help us go and check what has happened, and they tell us that the data has already been recovered by the process itself. This helps make sure that the current data is available to the users without any data loss.
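A minimal sketch of the restore-on-threshold idea, assuming a toy in-memory table that keeps every version. `VersionedTable` and `load_with_guard` are hypothetical names used only for illustration; a real implementation would rely on the platform's own time travel feature rather than this class.

```python
class VersionedTable:
    """Toy stand-in for a table with time travel: every write keeps a snapshot."""

    def __init__(self, rows):
        self.versions = [list(rows)]

    @property
    def current(self):
        return self.versions[-1]

    def write(self, rows):
        self.versions.append(list(rows))

    def restore(self, version):
        # Restoring appends the old snapshot as a new version, so history is kept.
        self.versions.append(list(self.versions[version]))


def load_with_guard(table, new_rows, expected_min_rows, notify):
    """Apply an incremental load; restore the previous version if too few rows arrived."""
    previous_version = len(table.versions) - 1
    table.write(table.current + list(new_rows))
    if len(new_rows) < expected_min_rows:
        table.restore(previous_version)
        notify(
            f"Load rolled back: got {len(new_rows)} rows, "
            f"expected at least {expected_min_rows}"
        )
        return False
    return True


messages = []
dim_table = VersionedTable(["a", "b", "c"])
ok = load_with_guard(dim_table, ["d", "e"], expected_min_rows=2, notify=messages.append)
bad = load_with_guard(dim_table, ["f"], expected_min_rows=2, notify=messages.append)
```

The `notify` callback stands in for the email alert mentioned in the answer; the suspicious load stays in the version history but is not what users see.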
We can do this by comparing different versions, with a limit on how many versions are compared. The amount of data change during each run can be tracked in a Power BI report with a threshold. Say there is suddenly a huge data increase or a sudden data loss: such things can be highlighted in the report, and we can use it to track the historic changes. The data factory can be designed so that the number of rows inserted and the number of updates made in each run are calculated and stored in a separate table, and a Power BI report can be connected to that table. Depending on the change in the amount of data, the number of columns, or the updates that happened, we can detect these changes across historical runs. If a change goes beyond the threshold, we can send an alert, or we can use the record to see what change in data has occurred for auditing purposes.
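As a rough illustration of the threshold check on per-run row counts, the sketch below flags any run whose row count moved by more than a chosen fraction relative to the previous run. The function name `flag_anomalous_runs` and the 50% default threshold are assumptions made for this example.

```python
def flag_anomalous_runs(row_counts, threshold=0.5):
    """Return (run_index, relative_change) for runs that exceed the threshold.

    row_counts: rows loaded in each run, oldest first.
    threshold: maximum allowed relative change versus the previous run.
    """
    flagged = []
    for i in range(1, len(row_counts)):
        previous, current = row_counts[i - 1], row_counts[i]
        if previous == 0:
            continue  # no baseline to compare against
        change = (current - previous) / previous
        if abs(change) > threshold:
            flagged.append((i, change))
    return flagged


history = [1000, 1050, 2500, 2400, 300]
alerts = flag_anomalous_runs(history)
flagged_runs = [index for index, _ in alerts]
```

With this sample history, the sudden increase at run 2 and the sudden drop at run 4 are flagged, while the small fluctuations in between are not.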
In case of high-volume data, one thing we have to consider is the number of partitions we're using while the data is being processed, and whether the data volume can be reduced by using incremental loads instead of loading the whole data. We have to make sure that processing happens in parallel rather than sequentially. For example, if we have data partitioned by month, we can process all the months in parallel instead of one after another. Another thing to consider is Kryo serialization, where the data is converted into a compact binary form before it's transferred from the source into our staging tables, so that it's processed much sooner. For now, these are the options coming to mind: making sure the data is divided and processed in parallel across partitions, and building parallel pipelines rather than sequential ones. We can also use two different tables and do a table switch, so that no one hits the table that's being loaded. The data is loaded and transformed in a temporary table while the online table stays available for other processes, because the transformation obviously takes a lot of time. After the load, we switch the online table and the temporary table, so that no table lock occurs and the process doesn't slow down. And then we serialize the data, making sure it's converted into binary before we transfer it. These are the few things at the top of my mind right now.
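Two of the ideas in this answer, processing monthly partitions in parallel and loading into a temporary table before switching it with the online table, could be sketched as below. This uses Python threads and `sqlite3` purely as stand-ins; the table names `sales` and `sales_staging` and the doubling transformation are hypothetical placeholders.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def transform_month(month, amounts):
    """Placeholder transformation for one monthly partition."""
    return [(month, amount * 2) for amount in amounts]


def transform_in_parallel(partitions, max_workers=4):
    """Submit every monthly partition at once instead of one after another."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(transform_month, month, amounts)
            for month, amounts in partitions.items()
        ]
        # Results are collected in submission order, so output is deterministic.
        return [row for future in futures for row in future.result()]


def load_and_swap(conn, rows):
    """Load into a staging table, then switch it with the online table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (month TEXT, amount INTEGER)")
    conn.execute("DROP TABLE IF EXISTS sales_staging")
    conn.execute("CREATE TABLE sales_staging (month TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales_staging VALUES (?, ?)", rows)
    conn.commit()
    # The slow load is done; the switch itself is only a pair of renames.
    conn.execute("ALTER TABLE sales RENAME TO sales_old")
    conn.execute("ALTER TABLE sales_staging RENAME TO sales")
    conn.execute("DROP TABLE sales_old")
    conn.commit()


conn = sqlite3.connect(":memory:")
rows = transform_in_parallel({"2024-01": [1, 2], "2024-02": [3]})
load_and_swap(conn, rows)
row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```

Readers keep querying `sales` while `sales_staging` is being filled, and only the short rename step touches the online table. In a distributed engine the parallelism would come from partitions across workers rather than from Python threads.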
Okay. So migrate an existing data model to Snowflake, ensuring minimal downtime. Obviously, when we're doing the migration, we do it in the lower environments. And while switching from a different data model to Snowflake, we make sure it's been properly tested in the lower environments. And we also send proper email alerts to the users. One thing that we can do is initially start running both models simultaneously. We have to make sure the relationship between the different tables is fine and the data load is happening properly. All these checks can be done. But I can learn more about this after this because I don't have an in-depth understanding about the migration. I would like to explore more on this.
Here, what I can see is that, usually, in CI/CD we make sure a change is first deployed to UAT and only then deployed to production. If it moves straight to production, that might cause a lot of issues.
Here we are just going to the order table and updating the status to Dispatched wherever the status is Placed. There's no further filter, so the whole dataset is taken into consideration: every order that is Placed is changed to Dispatched. And we are not committing these changes, which means they are not actually being persisted. And if we run the statement while data is being processed, without the table being locked, incorrect data might get copied. So, that is one impact that I can see.
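The statement under discussion is not reproduced in the transcript, so the following is only an assumed reconstruction of its shape, run against an in-memory `sqlite3` database. It shows that an uncommitted update is visible inside the open transaction but disappears once the transaction is rolled back. The `orders` table and its rows are invented for the example.

```python
import sqlite3


def demo_uncommitted_update():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
    conn.executemany(
        "INSERT INTO orders (status) VALUES (?)",
        [("Placed",), ("Placed",), ("Cancelled",)],
    )
    conn.commit()

    def dispatched_count():
        return conn.execute(
            "SELECT COUNT(*) FROM orders WHERE status = 'Dispatched'"
        ).fetchone()[0]

    # Every Placed order is updated; the change is not committed.
    conn.execute("UPDATE orders SET status = 'Dispatched' WHERE status = 'Placed'")
    inside_transaction = dispatched_count()

    # Without a commit, ending the transaction discards the change.
    conn.rollback()
    after_rollback = dispatched_count()
    conn.close()
    return inside_transaction, after_rollback


inside_transaction, after_rollback = demo_uncommitted_update()
```

Anything that reads the two updated rows while the transaction is still open may see status values that never become permanent, which is the kind of incorrect copy the answer warns about.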
We can have different pipelines. One takes real-time data from Kafka or a similar source, and batch processing can also happen. To integrate the two, we can have one pipeline that takes data at all times and another that triggers only on a schedule. We can merge the data together with one extra column, a flag column that indicates whether a row came from batch-processed data or real-time data. First, the historic data can be moved into the table that receives the real-time streaming data. Then, on top of that existing data, we can load the real-time streaming data.
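The flag-column merge can be illustrated with plain Python dictionaries standing in for table rows. The column name `source` and the values `"batch"` and `"stream"` are assumptions for this sketch.

```python
def tag_rows(rows, source):
    """Return copies of the rows with a flag column recording their origin."""
    return [{**row, "source": source} for row in rows]


def merge_batch_and_stream(batch_rows, stream_rows):
    """Load historic batch rows first, then append streaming rows on top."""
    return tag_rows(batch_rows, "batch") + tag_rows(stream_rows, "stream")


historic = [{"order_id": 1, "amount": 40}, {"order_id": 2, "amount": 15}]
live = [{"order_id": 3, "amount": 99}]
combined = merge_batch_and_stream(historic, live)
stream_only = [row for row in combined if row["source"] == "stream"]
```

Keeping the origin in a column lets downstream queries filter or reconcile the two feeds without needing separate tables.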
Implement version control mechanisms in your data pipeline deployments with Azure Data Factory to prevent data loss. Yeah. So, initially, while the deployment is happening, we have changes in our local branch. We push them to the integration branch and then to the master branch, and those changes are then pushed to UAT and then to production. This branching strategy allows us to restore or revert the changes. Maintaining this CI/CD repository ensures that the previous versions are also saved, and we can restore them. Another thing we can do is take a backup of the existing master branch before making changes to our branch, so that when something goes wrong, we can restore from the previous master backup, or we can fall back to the changes that were pushed to the upper environments before the particular change I'm making. So, that mechanism can be used in this case.