Vetted Talent

Malavika M

Experienced Azure Data Engineer with a proven track record in the IT industry. Over two years of hands-on experience in designing, implementing, and managing data solutions on Azure. Specialized in data ingestion, transformation, storage, and analytics using Azure services like Data Factory, SQL Database, Databricks, and Synapse Analytics. Skilled at collaborating with cross-functional teams to gather requirements, architect data solutions, and deliver high-quality outcomes.

  • Role

    Data Engineer

  • Years of Experience

    3.8 years

Skillsets

  • Leadership
  • Data Modelling
  • ETL
  • Effective Communication
  • Adaptability
  • Data Visualization
  • Data Integration
  • Data Transformation
  • OLAP Cube
  • Creative Thinking
  • Analytical Problem Solving
  • Curiosity
  • SQL - 3.5 Years

Vetted For

9 Skills
  • Senior Data Engineer With Snowflake (Remote) AI Screening
  • 66%
  • Skills assessed: Azure Synapse, Communication Skills, DevOps, CI/CD, ELT, Snowflake, Snowflake SQL, Azure Data Factory, Data Modelling
  • Score: 59/90

Professional Summary

3.8 Years
  • Jul, 2021 - Present 4 yr 10 months

    Azure Data Engineer

    SLB

Applications & Tools Known

  • Azure Data Factory
  • Microsoft Power BI
  • Data Warehouse
  • SQL
  • Azure Databricks
  • Logic Apps
  • Azure Virtual Machines
  • MicroStrategy

Work History

3.8 Years

Azure Data Engineer

SLB
Jul, 2021 - Present 4 yr 10 months

    Data integration and data transformation (ETL):

    • Responsible for designing and implementing data integration solutions using Azure Data Factory.
    • Developed data transformation (ETL) pipelines to extract, transform, and load data from various sources into a centralized data warehouse, using PySpark and SQL in Azure Databricks and Azure Dataflows.
    • Implemented data freshness checks and automation of monitoring within the ETL pipelines to ensure data accuracy and reliability.
    • Collaborated with cross-functional teams on data integration.
    • Collaborated with business stakeholders to understand requirements.
    • Ensured timely delivery of quality solutions.
    • Designed efficient data integration processes.

    Data Modelling and data visualization:

    • Created and maintained OLAP cubes using Azure Analysis Services (AAS) to provide efficient and accurate data analysis for business stakeholders.
    • Created multiple reports for data integrity and data quality checks using Microsoft Power BI.
    • Integrated Azure Analysis Services with Azure Data Factory and other data sources to automate data refreshes and ensure the availability of up-to-date information for reporting and analysis.
    • Implemented row-level security (RLS) and dynamic security filters in AAS to restrict access to sensitive data based on user roles and permissions.

Achievements

  • Provided continuous training and mentorship to the Associate Data Engineers of the year 2022 and 2023.
  • Completed essential courses and mini projects, leading to grade promotion from G08 to G09.
  • Received Reward of Excellence for 2022 as 'Emerging New Comer'

Major Projects

3 Projects

Financial Reporting OLAP cube with ETL using Databricks

Schlumberger
Jan, 2023 - Present 3 yr 4 months

    Financial report covering all the actual, plan, forecast and DSO data of the company across the globe focusing on automation & standardization.

    Benefits: Single version of truth; simplified and timely data availability.

    Tools: Azure Data Factory, Azure Data Lake Service, Azure Databricks, Azure SQL Warehouse, Azure SQL DB, Azure Virtual Machines, Azure Logic Apps, Power BI, MicroStrategy, SQL.

    Created an OLAP cube by employing the below processes:

    1. Implemented a robust stage flow to efficiently extract data from diverse sources including SharePoint, MicroStrategy, SQL databases, and Data Lake, orchestrating the loading process into Azure Data Warehouse for centralized storage with the help of the data orchestration tool Azure Data Factory.
    2. Executed Extract, Transform, Load (ETL) processes in Azure Databricks, using PySpark and SQL to harmonize and integrate the extracted data. Leveraged Databricks for scalable, parallelized data processing, ensuring consistency and accuracy by applying the relevant business logic during the transformation phase.
    3. Segregated processed data into dimension tables and fact tables, optimizing data organization to facilitate efficient query performance and analysis. Leveraged Azure Databricks for efficient data partitioning and distribution strategies, enhancing query performance and reducing processing times.
    4. Applied advanced data modelling techniques to enhance the structure and integrity of dimension tables and fact tables, ensuring alignment with business requirements and analytical objectives.
    5. Leveraged Azure Analysis Services to construct an OLAP cube, facilitating analysis and enabling exploration of data insights.
    6. Implemented data quality checks and validation procedures throughout the process to maintain data integrity and reliability using Microsoft Power BI.
    7. Collaborated closely with stakeholders to understand analytical requirements and iteratively refine the OLAP cube design to meet evolving business needs.
    8. Documented the entire process, including data sources, transformations, and cube design, to ensure transparency, reproducibility, and knowledge transfer within the team.

Global Cash Balance OLAP cube with ETL using Dataflow

Schlumberger
Mar, 2022 - Dec, 2022 9 months

    Provides visibility into the global daily and month-end GL cash and bank balances across multiple bank accounts worldwide for the Treasury team.

    Tools: Azure Data Factory, Azure Data Flows, Azure Data Lake Service, Azure SQL Warehouse, Azure SQL DB, Logic Apps, SQL, Automation runbook (PowerShell), AAS, Power BI.

    Created an OLAP cube by employing the below processes:

    1. Implemented a robust stage flow to efficiently extract data from diverse sources including SharePoint, Oracle, SQL databases, and Data Lake, orchestrating the loading process into Azure Data Warehouse for centralized storage with the help of the data orchestration tool Azure Data Factory.
    2. Executed Extract, Transform, Load (ETL) processes with Azure Dataflows to harmonize and integrate the extracted data, applying the relevant business logic to ensure consistency and accuracy.
    3. Segregated processed data into dimension tables and fact tables, optimizing data organization and facilitating efficient query performance.
    4. Applied advanced data modeling techniques to enhance the structure and integrity of dimension tables and fact tables, ensuring alignment with business requirements and analytical objectives.
    5. Leveraged Azure Analysis Services to construct an OLAP cube, facilitating analysis and enabling exploration of data insights.
    6. Implemented data quality checks and validation procedures throughout the process to maintain data integrity and reliability using Microsoft Power BI.
    7. Collaborated closely with stakeholders to understand analytical requirements and iteratively refine the OLAP cube design to meet evolving business needs.
    8. Documented the entire process, including data sources, transformations, and cube design, to ensure transparency, reproducibility, and knowledge transfer within the team.

Common Framework for a unified monitoring dashboard

SLB
Jul, 2021 - Dec, 2021 5 months

    Worked on an automated framework covering 20 projects, with failure and data-delay alerts for the end-to-end flow from ETL through OLAP cube refresh, along with a consolidated Power BI dashboard that tracks real-time and historical CIM progress, reducing manual intervention by 90%.

    Tools: Azure Data Factory, Azure Logic Apps, Microsoft Power BI.

Education

  • Computer Science Engineering

    Amrita School Of Engineering

Certifications

  • AZ-900 (Microsoft Certified: Azure Fundamentals)

  • DP-900 (Microsoft Certified: Azure Data Fundamentals)

  • PL-300 (Microsoft Certified: Power BI Data Analyst Associate)

Interests

  • Dance
  • Drawing

AI-interview Questions & Answers

    Hi, I'm Malavika. I've been working as a data engineer at Schlumberger for the last two years and seven months. During this time, I've explored a lot of data engineering on Azure, which has given me broad experience and exposure. Apart from this, I also take part in other activities in the organization and make sure that the company culture is always prioritized. I'm very enthusiastic and excited to learn new things. Starting with my journey: after my studies, I joined Schlumberger, which is now called SLB. I joined the data engineering team, where I was initially involved in implementing automation. Previously, the monitoring team had someone go through each data factory and each corporate information model (the Azure Analysis Services models) and do a lot of manual work against a checklist to ensure everything was running fine. Instead of that, we created a mechanism that automatically updates the status of the different flows in a Power BI report, using a set of backend tables and web activities. We also automated it so that as soon as the fact and dimension tables are loaded, the process moves on to refreshing the Azure Analysis Services model, which is the OLAP cube. After that, I was involved in different development projects. My first project was related to SLB's bank accounts and related information. We built a model based on the business requirements, and I used data flows there for the ETL transformations. The next project I was involved in was the financial reporting dashboard, which is also built on an Azure Analysis Services cube. Here, I was exposed to Databricks, where we used PySpark and SQL for ETL. This helped me learn a lot more about the different technologies under Azure. I'm also upskilling myself day by day so that I can contribute better in my roles in the company. Right now, I'm looking for other opportunities so that I can explore more and contribute better with the knowledge I've acquired so far and am still acquiring. I'm pretty sure I'll be able to adapt to different teams and technologies that involve data engineering techniques like data modeling and ETL. That's it from my end. Thank you so much for listening.

    What we do in this case is make sure that whenever a pipeline runs, it's recorded somewhere, say, in a configuration table. Before a pipeline starts, we check that table for an entry showing whether the pipeline has already run for that particular run ID. If it hasn't, we run it and add a new entry saying that, for this pipeline and this data factory, the current run has happened. If it has, we skip it. So in case of intermittent issues, there's no repeated run, which saves resources, and only one run happens for a particular run ID even if there is a failure. With respect to data availability, before each run we can check the maximum timestamp of the data available in the source and compare it with the maximum timestamp already present in our Azure data warehouse staging tables. Only if new data is available do we let the load happen; otherwise, we don't reload older data that's already present. This makes sure that when we restart the process after a failure, pipelines that have already run are skipped, only new data is picked up, and no unnecessary runs happen.

    In the time travel mechanism, the different versions of the data are stored, and we can use a previous version to restore the data in case of a failure or data loss. We can also have checks based on the approximate amount of data pushed during each incremental load: if that amount is not met, we can restore the previous version. The same applies to the dimension tables, which usually hold much less data than the fact tables, although it varies. We know the approximate count that arrives in an incremental load and the approximate count that already exists in the dimension table. If the minimum is not met, we can restore the previous version of the dimension table or the fact table and not make the current data available to the users. We can also send a notification email indicating that there is some data loss or data mismatch, that is, that the quantity of data doesn't match the threshold. Those emails prompt us to go and check what has happened, and we know that the data has already been recovered by the process itself. This helps make sure that correct data is available to the users without any data loss coming into the picture.
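
    As a rough illustration of the count check with automatic restore, here is a small sketch. The `VersionedTable` class is a toy stand-in for a table with time travel (it is not any real platform API), and `expected_increment` and `tolerance` are assumed parameters.

```python
class VersionedTable:
    """Toy stand-in for a table with time travel: each write keeps a snapshot."""

    def __init__(self, rows):
        self.versions = [list(rows)]

    @property
    def current(self):
        return self.versions[-1]

    def write(self, rows):
        self.versions.append(list(rows))

    def restore_previous(self):
        # Re-publish the snapshot from before the latest write as a new version.
        self.versions.append(list(self.versions[-2]))


def load_with_guard(table, new_rows, expected_increment, tolerance=0.5):
    """Append new_rows; roll back and return an alert if too few rows arrived."""
    before = len(table.current)
    table.write(table.current + list(new_rows))
    added = len(table.current) - before
    minimum = expected_increment * (1 - tolerance)
    if added < minimum:
        table.restore_previous()
        # In a real pipeline this message would be sent as an email alert.
        return f"ALERT: expected about {expected_increment} rows, got {added}; restored previous version"
    return None
```

    The same guard could be applied separately to dimension and fact tables, each with its own expected count.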

    We can do this by comparing different versions, with a limit on how many versions are compared. The amount of data change during each run can be tracked via a Power BI report with a threshold on it. Say there is a sudden huge increase in data or a sudden data loss: such things can be highlighted in the report, and using that report we can track the historical changes. The data factory can be designed so that the number of rows inserted and the number of updates made in each run are calculated and stored in a separate table, and a Power BI report can be connected to that table. Depending on the change in the amount of data, the number of columns, or the updates that happened, we can detect these changes across historical runs. If a change goes beyond the threshold, we can send an alert saying that this has happened, or we can use the record to see what change in data occurred, for auditing purposes.
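
    The threshold check on per-run row counts could look something like the sketch below. It assumes the audit table has already been read into a list of `(run_id, row_count)` pairs; the function name and the 50% default threshold are illustrative choices, not values from the original setup.

```python
def flag_anomalies(run_counts, threshold=0.5):
    """Return (run_id, relative_change) for runs whose row count jumped or
    dropped by more than `threshold` compared with the previous run."""
    alerts = []
    for (_, prev_n), (run_id, n) in zip(run_counts, run_counts[1:]):
        if prev_n == 0:
            continue  # no baseline to compare against
        change = (n - prev_n) / prev_n
        if abs(change) > threshold:
            alerts.append((run_id, round(change, 2)))
    return alerts


history = [("r1", 1000), ("r2", 1000), ("r3", 250), ("r4", 260)]
alerts = flag_anomalies(history)  # only the drop at r3 exceeds the threshold
```

    The same list can feed both the alerting step and the report used for auditing.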

    In case of high-volume data, one thing we have to consider is the number of partitions we're using while the data is being processed, and whether the data volume can be reduced by using an incremental load instead of loading the whole data. We have to make sure that processing happens in parallel and not sequentially. For example, if the data is organized by month, we can process all the months in parallel instead of one after another. Another thing to consider is Kryo serialization, where the data is converted into a compact binary form before being sent from the source into our staging tables, so that it is processed sooner because it's in smaller chunks. Processing in partitions also works well with Kryo serialization. For now, these are the options that come to mind. First, make sure the work is divided and processed in parallel across partitions, using parallel pipelines rather than sequential ones. Second, we can use two tables and do a table switch, so that no one hits the table that's being loaded. We load the data into a temporary table while the online table stays available for other processes, because the transformation obviously takes a lot of time. After all the transformations are done and the temporary table is loaded, we switch the online table and the temporary table, so that no table lock occurs and the process isn't slowed down. And third, we serialize the data, making sure it's converted into binary before we transfer it. These are the few things at the top of my mind right now.
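
    A small sketch of two of the ideas above, parallel processing of monthly partitions and loading into a temporary table followed by a switch, is shown here. It uses a thread pool and SQLite purely for illustration; the table names `sales` and `sales_tmp` and the per-month aggregation are assumptions, and a warehouse or Spark job would use its own partitioning and swap mechanisms.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

def transform_month(item):
    # Placeholder transformation: total the amounts for one month.
    month, amounts = item
    return (month, sum(amounts))

def load_and_swap(conn, partitions):
    # 1. Transform the monthly partitions in parallel, not one by one.
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(transform_month, partitions.items()))

    # 2. Load into a temporary table; the online table stays readable.
    conn.execute("DROP TABLE IF EXISTS sales_tmp")
    conn.execute("CREATE TABLE sales_tmp (month TEXT PRIMARY KEY, total REAL)")
    conn.executemany("INSERT INTO sales_tmp VALUES (?, ?)", results)

    # 3. Switch the tables in one transaction, so readers see either
    #    the old data or the new data, never a half-loaded table.
    with conn:
        conn.execute("ALTER TABLE sales RENAME TO sales_old")
        conn.execute("ALTER TABLE sales_tmp RENAME TO sales")
        conn.execute("DROP TABLE sales_old")
```

    The slow part (transformation and load) happens entirely against the temporary table, so the only work done on the online table is the quick rename.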

    Okay. So, migrating an existing data model to Snowflake while ensuring minimal downtime. Obviously, when we're doing the migration, we do it in the lower environments first. While switching from a different data model to Snowflake, we make sure it's been properly tested in the lower environments, and we also send proper email alerts to the users. One thing that we can do is initially run both models simultaneously. We have to make sure the relationships between the different tables are fine and the data load is happening properly. All these checks can be done. But I can learn more about this afterward, because I don't have an in-depth understanding of migration. I would like to explore this more.

    Here, as I see it, usually in CI/CD we made sure that a change is first deployed to UAT and only then deployed to production. If it moves directly to production, that might cause a lot of issues.

    Here, we are just going to the order table and updating the status to 'Dispatched' wherever the status is 'Placed'. In this case the whole dataset is taken into consideration: everywhere the order status is 'Placed', it's changed to 'Dispatched'. And we are not committing these changes, which means the changes are not actually being persisted. Also, if we run the statement while the data is being processed, then without the table being locked, incorrect data might get copied. That is one impact that I can see.
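
    The statement under discussion is not reproduced in the transcript, so the following is only a guess at its shape, run against SQLite with a hypothetical `orders` table. It illustrates the two points made above: the update touches every 'Placed' row, and without a commit the change does not persist.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO orders (status) VALUES (?)",
    [("Placed",), ("Placed",), ("Cancelled",)],
)
conn.commit()

# The update has no narrower filter, so it hits every 'Placed' order.
cur = conn.execute("UPDATE orders SET status = 'Dispatched' WHERE status = 'Placed'")
touched = cur.rowcount

# The change sits in an open transaction until it is committed.
in_txn = conn.in_transaction

# Without a commit, a rollback (or a dropped connection) discards it.
conn.rollback()
statuses = [row[0] for row in conn.execute("SELECT status FROM orders ORDER BY id")]
```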

    So we can have different pipelines. One takes real-time data from Kafka or something, and batch processing can also happen. But if we have to integrate these two, then we can have different pipelines, one that takes data at all times and the other that triggers only depending on the time slot. That's something that can be done. We can merge the data together, with one column as an extra column, a flag column that indicates if it's from batch-processed data or real-time data. The historic data can be moved to the table that we're going to take real-time streaming data from. First, we can do that. Then, on that existing data, we can load the real-time streaming data.
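
    The flag-column idea can be sketched in a few lines. This is a plain-Python illustration with lists of dictionaries standing in for the batch table and the stream; the column name `source` and the function names are hypothetical.

```python
def tag(rows, source):
    # Add the extra flag column that says where each row came from.
    return [{**row, "source": source} for row in rows]

def merge_batch_and_stream(batch_rows, stream_rows):
    # Historical (batch) data goes into the target first,
    # then the real-time rows are loaded on top of it.
    merged = tag(batch_rows, "batch")
    merged.extend(tag(stream_rows, "stream"))
    return merged


batch = [{"order_id": 1, "amount": 10.0}]
stream = [{"order_id": 2, "amount": 7.5}]
merged = merge_batch_and_stream(batch, stream)
```

    Keeping the flag on every row makes it possible to filter or reconcile the two feeds later without a separate lookup.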

    Implement version control mechanisms in your data pipeline deployments with Azure Data Factory to prevent data loss. Yeah. So, initially, while the deployment is happening, we have changes in our local branch. We push them to the integration branch and then to the master branch, and those changes are then pushed to UAT and then to production. This branching strategy allows us to restore or revert changes. If we maintain this CI/CD repository, previous versions are also saved, and we can restore them. Another thing we can do is take a backup of the existing master branch before making changes to our branch, so that when something goes wrong, we can restore from the previous master backup, or we can use the previous state of the branch that was pushed to the upper environments before the particular change I'm making. Those changes can be reverted. So, that mechanism can be used in this case.