
Harshita Mathur

Vetted Talent
Seeking a challenging role as a Data Engineer where I can apply my expertise in developing and optimizing ETL pipelines, executing advanced SQL queries, and collaborating with cross-functional teams to design and implement scalable data architectures.
  • Role

    Data & Databricks Engineer

  • Years of Experience

    7.75 years

Skillsets

  • Python - 3.7 Years
  • SQL - 3.7 Years
  • Azure Data Factory
  • Azure Data Lake
  • Azure Synapse
  • Databricks
  • Delta Live Tables
  • PySpark
  • Spark SQL
  • Unity Catalog
  • Workflows

Vetted For

9 Skills
  • Roles & Skills: Data Engineer (Remote) - AI Screening
  • Results: 76%
  • Details: Skills assessed - Data Visualization, Azure SQL Server, Web Scraping, .NET Framework, Azure, Azure Data Factory, PHP, Python, SQL
  • Score: 68/90

Professional Summary

7.75 Years
  • Apr, 2025 - Present (1 yr)

    Business Technology Solutions - Associate Consultant

    ZS
  • Jul, 2024 - Apr, 2025 (9 months)

    Senior Data Engineer

    Pratham Software (PSI)
  • Jun, 2021 - Jun, 2024 (3 yr)

    Associate Consultant

    Celebal Technologies
  • Jul, 2019 - Jun, 2021 (1 yr 11 months)

    Trainee

    IIHT Ltd
  • May, 2020 - Jun, 2021 (1 yr 1 month)

    Certification

    Coursera

Applications & Tools Known

  • Databricks

  • Spark SQL

  • Azure

  • Data Lake

  • Power BI

Work History

7.75 Years

Business Technology Solutions - Associate Consultant

ZS
Apr, 2025 - Present (1 yr)

Senior Data Engineer

Pratham Software (PSI)
Jul, 2024 - Apr, 2025 (9 months)

Associate Consultant

Celebal Technologies
Jun, 2021 - Jun, 2024 (3 yr)
    Worked on real-time projects to perform Dimension Modeling and automated data loads.

Certification

Coursera
May, 2020 - Jun, 2021 (1 yr 1 month)

Trainee

IIHT Ltd
Jul, 2019 - Jun, 2021 (1 yr 11 months)

Achievements

  • Achieved 40% reduction in data processing time through strategic optimization initiatives
  • Reduced data load times by 60% for complex dimension modeling across multiple projects
  • Achieved 30% improvement in data storage and retrieval capabilities with scalable Data Lake architecture on Azure platform
  • Improved data management and accessibility by 30% by migrating tables from Hive Metastore to Unity Catalog
  • Enhanced data processing efficiency and scalability by migrating Pentaho jobs to Databricks
  • Increased data reliability in the banking sector by 25%
  • Improved query performance and data processing speed by 40% by converting queries to Spark SQL
  • Enhanced data analysis capabilities by 30% by converting Oracle queries into Synapse
  • Drove a 20% increase in data-driven decision-making by integrating views into Power BI

Major Projects

4 Projects

UC Migration Project - US Client

Jan, 2024 - Present (2 yr 3 months)
    Successfully migrated tables from Hive Metastore to Unity Catalog, improving data management and accessibility by 30%. Created multiple Catalogs and managed both internal and external tables corresponding to each catalog, resulting in a 25% enhancement in data organization and usability.

BFSI Project - Banking, Financial Services and Insurance

Mar, 2023 - Oct, 2023 (7 months)
    Successfully migrated Pentaho jobs to Databricks, enhancing data processing efficiency and scalability. Converted JavaScript codes to PySpark and Spark SQL, optimizing data transformation processes and improving overall workflow efficiency by 30%.

DATA ANALYTICS SOLUTIONS PROJECT

Nov, 2021 - Mar, 2022 (4 months)
    Successfully converted Oracle queries into Synapse, improving query performance and enhancing data analysis capabilities by 30%. Developed stored procedures and views in Synapse, streamlining data retrieval and manipulation processes.

WEB BASED HR RECRUITMENT SYSTEM

Jan, 2021 - Mar, 2021 (2 months)
    Developed a user-friendly web-based Recruitment Process System with comprehensive features: creating vacancies, storing applicant data, initiating interview process, scheduling interviews, storing interview results, and hiring applicants.

Education

  • Bachelor Of Technology - Information Technology

    Swami Keshvanand Institute of Technology (2021)

Certifications

  • Data Engineer Professional - Databricks

  • Data Engineer Associate - Databricks

  • Developer Essential Badge - Databricks

  • Microsoft Azure DP-900

  • Microsoft Azure DP-203

  • Developer Foundation - Databricks

  • Microsoft Azure AZ-900

  • Python Programming - Coursera

  • Responsive Web Design - Coursera

AI-interview Questions & Answers

My name is Harshita Mathur, and I have around three years of experience at my current company, which I joined as an intern at Celebal Technologies. I have worked on multiple migration projects in which I converted Oracle and SQL queries into PySpark and Spark SQL. I also hold some Azure and Databricks certifications: AZ-900, DP-900, and DP-203 on the Azure side, and on the Databricks side the Data Engineer Associate and Data Engineer Professional certifications. To give some background on one of my projects: it was a BFSI project built around a Pentaho tool containing multiple transformations written in JavaScript and SQL. The client's requirement was driven by the fact that running those transformations on the Pentaho side took two to three days to process all of the bureau data, which holds customer names and other basic customer details. We migrated all of the JavaScript and SQL code into PySpark and Spark SQL on Databricks so that the same processing takes much less time to execute.
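To illustrate the kind of conversion described above, here is a minimal, hypothetical PySpark sketch in which logic that previously lived in a Pentaho JavaScript step is re-expressed as Spark SQL on Databricks; the table and column names are placeholders, not the actual project objects.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bureau-migration-sketch").getOrCreate()

    # Source table name is illustrative; in the real project this came from the bureau feed.
    bureau_raw = spark.read.table("raw.bureau_customers")
    bureau_raw.createOrReplaceTempView("bureau_customers")

    # Logic formerly implemented in a Pentaho JavaScript step, rewritten as Spark SQL:
    # standardise the customer name and keep only records with a valid customer id.
    bureau_clean = spark.sql("""
        SELECT customer_id,
               trim(upper(customer_name)) AS customer_name,
               load_date
        FROM bureau_customers
        WHERE customer_id IS NOT NULL
    """)

    # Write the result as a Delta table so downstream jobs read the curated copy.
    bureau_clean.write.format("delta").mode("overwrite").saveAsTable("curated.bureau_customers")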

In a .NET application we have to check whether our data is consistent or not, and to ensure consistency between the .NET application and the database we follow several practices. First, transactional consistency: all database operations that need to succeed or fail together are wrapped in a transaction. Then stored procedures and batch processing: ETL and ELT processes handle the data in batches, and Azure Data Factory can orchestrate these jobs. Next, concurrency control, for example optimistic concurrency, gives us a mechanism to detect conflicting updates. Then data validation and integrity, where validation rules and constraints are applied to the data. Finally, an event-driven architecture with asynchronous processing, using message queues and background services, so that the data stays synchronized across systems. These are the strategies we use to ensure our data is consistent. Thank you.
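A minimal sketch of the transactional-consistency point above, written in Python with pyodbc against an Azure SQL database; the connection string, tables, and columns are assumed placeholders.

    import pyodbc

    # Placeholder Azure SQL connection string; real values would come from secure configuration.
    conn_str = (
        "Driver={ODBC Driver 18 for SQL Server};"
        "Server=tcp:<server>.database.windows.net,1433;"
        "Database=<db>;Uid=<user>;Pwd=<password>;Encrypt=yes;"
    )

    conn = pyodbc.connect(conn_str, autocommit=False)  # explicit transaction control
    cursor = conn.cursor()
    try:
        # Both statements must succeed together, so they run in one transaction.
        cursor.execute(
            "INSERT INTO Orders (OrderID, CustomerID, OrderDate) VALUES (?, ?, GETDATE())",
            10248, 1,
        )
        cursor.execute(
            "UPDATE Customers SET LastOrderID = ? WHERE CustomerID = ?", 10248, 1
        )
        conn.commit()      # make both changes visible atomically
    except Exception:
        conn.rollback()    # undo everything if any statement fails
        raise
    finally:
        conn.close()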

There are many ways to optimize a SQL query. When we are dealing with bulk data and large result sets, we can use clustered indexes and partitioning. To identify the queries that are running slowly, we can use SQL Server tools such as Extended Events and the Query Store. Then index optimization: create the appropriate indexes and do regular index maintenance, rebuilding and reorganizing indexes. Next, refine the query design: avoid SELECT *, specify only the columns that are needed, and filter as early as possible in the query. For join optimization we can use CTEs and temporary tables, along with partitioning and query hints and options; query hints guide the SQL query optimizer toward choosing a better execution plan. We can also leverage Azure-specific features such as elastic pools and read scale-out. We should monitor and adjust continuously, using Azure SQL Analytics, Query Performance Insight, and other monitoring tools. Finally, we can review the database design: keep the data properly normalized to avoid redundancy, and in some cases denormalize where necessary for read-heavy workloads to improve performance; use proper data types; and use caching and pre-aggregation. These are the steps we can take to optimize our SQL queries. Thank you.
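A small illustration of two of the points above (a supporting index, and a refined query that avoids SELECT * and filters with a parameterized predicate), sketched in Python with pyodbc; all object names and the connection string are assumptions.

    import pyodbc

    conn_str = "Driver={ODBC Driver 18 for SQL Server};Server=...;Database=...;Uid=...;Pwd=..."  # placeholder
    conn = pyodbc.connect(conn_str)
    cursor = conn.cursor()

    # Supporting index for the filter column (run once, not on every call).
    cursor.execute(
        "CREATE INDEX IX_Orders_CustomerID ON Orders (CustomerID) INCLUDE (OrderDate)"
    )
    conn.commit()

    # Refined query design: no SELECT *, only the needed columns,
    # filtering as early as possible with a parameterized predicate.
    cursor.execute(
        "SELECT OrderID, OrderDate FROM Orders WHERE CustomerID = ? AND OrderDate >= ?",
        1, "2024-01-01",
    )
    for order_id, order_date in cursor.fetchall():
        print(order_id, order_date)

    conn.close()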

To optimize .NET application performance for a high volume of data retrieval and manipulation within an Azure SQL database, we have to follow some steps. Use efficient queries, including parameterized queries, which prevent SQL injection and let query plans be reused. Connection management is also important: use connection pooling to minimize the overhead of opening and closing database connections. For data access patterns, use asynchronous programming so that threads are not blocked during database operations. We can also apply caching strategies, for example Azure Cache for Redis, to reduce load on the database for frequently accessed data. Batch processing helps as well: group operations into a single transaction or batch to reduce the number of round trips to the database. On the database side, configure the Azure SQL performance level, that is the DTUs or vCores, to match the workload, and use built-in features such as automatic tuning and performance recommendations. Finally, monitoring and profiling, and scalability: design for scalability by partitioning the data and using features such as elastic pools if needed. In this way we can optimize the .NET application. Thank you.
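The question concerns a .NET application, but the batching and caching ideas can be sketched in Python, which this profile centres on: fast_executemany reduces round trips for bulk inserts, and an in-process cache stands in for a shared cache such as Azure Cache for Redis. Table and column names are placeholders.

    import pyodbc
    from functools import lru_cache

    conn_str = "Driver={ODBC Driver 18 for SQL Server};Server=...;Database=...;Uid=...;Pwd=..."  # placeholder
    conn = pyodbc.connect(conn_str)
    cursor = conn.cursor()
    cursor.fast_executemany = True  # send inserts to Azure SQL in bulk, fewer round trips

    rows = [(i, f"customer-{i}") for i in range(10_000)]  # illustrative high-volume load
    cursor.executemany(
        "INSERT INTO Staging_Customers (CustomerID, CustomerName) VALUES (?, ?)", rows
    )
    conn.commit()

    # Simple in-process cache for frequently accessed reference data,
    # standing in for a shared cache such as Azure Cache for Redis.
    @lru_cache(maxsize=1024)
    def country_name(country_code: str) -> str:
        cursor.execute("SELECT Name FROM Countries WHERE Code = ?", country_code)
        row = cursor.fetchone()
        return row[0] if row else "unknown"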

There are multiple points in designing a .NET application that interacts with Azure SQL. First of all, project setup: create a new .NET project and add Entity Framework Core by installing the necessary packages via NuGet. Then configure the database context: create a DbContext class; this class manages the database connection and tracks changes. Configure the connection string for Azure SQL, storing it securely in configuration, for example an appsettings JSON file, rather than in code. Then ensure data integrity: use data annotations and Fluent API configuration to define constraints such as indexes, unique keys, required fields, and relationships. Next, migrations: use Entity Framework migrations to create and evolve the database schema. Use transactions to keep the data consistent for operations that affect multiple entities. Then concurrency control: implement concurrency tokens. Add model validation, and for security store the connection string securely and always use parameterized queries to prevent injection attacks. These are the basic steps we have to follow. Thank you.
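The answer above describes Entity Framework Core; a comparable ORM setup can be sketched in Python with SQLAlchemy to show the same ideas (model classes with constraints and relationships, schema creation, and a transaction spanning related rows). The entities and credentials are placeholders.

    from sqlalchemy import Column, Integer, String, ForeignKey, create_engine
    from sqlalchemy.orm import declarative_base, relationship, sessionmaker

    Base = declarative_base()

    class Customer(Base):
        __tablename__ = "Customers"
        customer_id = Column(Integer, primary_key=True)       # unique key constraint
        customer_name = Column(String(100), nullable=False)   # required field
        orders = relationship("Order", back_populates="customer")

    class Order(Base):
        __tablename__ = "Orders"
        order_id = Column(Integer, primary_key=True)
        customer_id = Column(Integer, ForeignKey("Customers.customer_id"), nullable=False)
        customer = relationship("Customer", back_populates="orders")

    # Connection string would be read from secure configuration in a real application.
    engine = create_engine(
        "mssql+pyodbc://<user>:<password>@<server>.database.windows.net/<db>"
        "?driver=ODBC+Driver+18+for+SQL+Server"
    )
    Base.metadata.create_all(engine)  # analogous to applying a first migration

    Session = sessionmaker(bind=engine)
    with Session.begin() as session:  # transaction: both rows commit together or not at all
        customer = Customer(customer_id=1, customer_name="Cardinal")
        session.add(customer)
        session.add(Order(order_id=10248, customer=customer))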

To refactor a critical section of Python code, there are some basic steps. Modularize the code: break the monolithic code into smaller, self-contained modules. Use design patterns such as factory, singleton, and strategy to enhance code reusability. Adopt clean-code principles, such as meaningful naming, avoiding magic numbers, and writing clear, concise comments; that helps make the code easier to understand and maintain. Use the Azure SDKs for Python to interact with Azure data services; these SDKs are designed to simplify the integration and handle many low-level details. Implement dependency injection to manage service dependencies, which also allows for better testing. Add error handling and logging, optimize the data access patterns, and where appropriate utilize Azure Functions and Logic Apps. Refactor alongside unit tests, and document the code base and its integration points with the Azure data services; this documentation helps with understanding the code structure and facilitates future maintenance. In this way we can refactor a critical section of Python code, reducing complexity while ensuring seamless integration.
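A short, hypothetical sketch of the modularization, factory/strategy, and dependency-injection points above; the class and function names are invented for illustration.

    from abc import ABC, abstractmethod

    class LoadStrategy(ABC):
        """Strategy interface: each concrete class knows how to load one kind of source."""
        @abstractmethod
        def load(self, path: str) -> list[dict]: ...

    class CsvLoadStrategy(LoadStrategy):
        def load(self, path: str) -> list[dict]:
            import csv
            with open(path, newline="") as f:
                return list(csv.DictReader(f))

    class JsonLoadStrategy(LoadStrategy):
        def load(self, path: str) -> list[dict]:
            import json
            with open(path) as f:
                return json.load(f)

    def strategy_for(fmt: str) -> LoadStrategy:
        """Factory: pick the strategy from configuration instead of inline if/else blocks."""
        return {"csv": CsvLoadStrategy(), "json": JsonLoadStrategy()}[fmt]

    class Pipeline:
        """The critical section receives its dependency instead of constructing it (DI),
        which keeps it small and easy to unit test with a fake strategy."""
        def __init__(self, loader: LoadStrategy):
            self.loader = loader

        def run(self, path: str) -> int:
            records = self.loader.load(path)
            return len(records)  # placeholder for the real transformation logic

    pipeline = Pipeline(strategy_for("csv"))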

The issue in the code is with the escape sequence in the echo statement. In PHP, the correct way to include a newline character in a string is the escape sequence \n inside a double-quoted string; in a single-quoted string it is printed literally. This change ensures that the newline character is properly recognized and the error message is displayed correctly. So the corrected pattern is: try { $data = getDataFromAPI(); } catch (Exception $e) { echo 'Caught exception: ' . $e->getMessage() . "\n"; }. In addition to fixing the newline character, make sure that the exception class you are catching matches the type of exception that getDataFromAPI() might throw; if the API call throws a different exception type, for example a custom exception, then you need to catch that specific type as well. Thank you.

The issue in the T-SQL snippet lies within the INSERT INTO Orders statement; specifically, the problem is with the CustomerID value being inserted. It is a data type mismatch: in the Orders table, CustomerID is likely defined as an integer, since it is used as a foreign key referencing CustomerID in the Customers table, which is an integer. In the insert statement, the value 'Cardinal' is being inserted into the CustomerID column, and 'Cardinal' is a string, not an integer. This mismatch causes an error because you are trying to insert a string into a column that expects an integer. To fix it, the CustomerID value in the Orders insert should match a CustomerID from the Customers table, which is an integer. So first insert into Customers (CustomerID, CustomerName, ContactName, Country) the values 1, 'Cardinal', 'Tom B. Erichsen', 'Denmark', assuming Denmark is the intended country. Then, within a transaction, insert into Orders (OrderID, CustomerID, OrderDate) the values 10248, 1, and the order date, and commit the transaction. The order's CustomerID of 1 now matches the CustomerID from the Customers table, and OrderDate can use GETDATE() to insert the correct date and time. So this is a data type issue. Thank you.
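A sketch of the corrected inserts, executed from Python with pyodbc and parameterized values, using the schema implied by the question (CustomerID as an integer key); the connection string is a placeholder.

    import pyodbc

    conn_str = "Driver={ODBC Driver 18 for SQL Server};Server=...;Database=...;Uid=...;Pwd=..."  # placeholder
    conn = pyodbc.connect(conn_str, autocommit=False)
    cursor = conn.cursor()

    # Parent row first: CustomerID is the integer key the foreign key will reference.
    cursor.execute(
        "INSERT INTO Customers (CustomerID, CustomerName, ContactName, Country) VALUES (?, ?, ?, ?)",
        1, "Cardinal", "Tom B. Erichsen", "Denmark",
    )

    # Child row: CustomerID is now the integer 1, not the string 'Cardinal',
    # so it matches the column's data type and the referenced key.
    cursor.execute(
        "INSERT INTO Orders (OrderID, CustomerID, OrderDate) VALUES (?, ?, GETDATE())",
        10248, 1,
    )

    conn.commit()
    conn.close()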

If we talk about dimension modeling updates, the goal is to design a SQL database schema that handles changes efficiently during dimensional modeling, and we use several techniques for that. Star schema: utilize a star schema with a central fact table surrounded by dimension tables; this simplifies queries and allows for efficient updates. Partitioning: partition the fact table to manage data growth efficiently and enable faster data loading. Indexes: implement appropriate clustered and non-clustered indexes to enhance query performance and facilitate updates, and consider columnstore indexes for large fact tables to improve analytical query performance. Temporal tables: leverage temporal tables to track historical changes in the dimension tables, which enables easy rollback and supports auditing. Data compression is also available: use compression techniques to optimize storage and improve query performance. We can also use PolyBase if we want to integrate external data sources, so data can be loaded efficiently from various sources. With these steps we can design SQL schemas that support efficient changes during dimensional modeling updates on Azure-based services. Thank you.
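The answer above is framed around Azure SQL and Synapse features; as a rough illustration of the same star-schema and partitioning ideas in the Databricks/PySpark stack this profile emphasizes, here is a hypothetical sketch with assumed table and column names.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("star-schema-sketch").getOrCreate()

    # Dimension table: one row per customer, with a surrogate key for the fact table to reference.
    dim_customer = (
        spark.read.table("raw.customers")
        .select("customer_id", "customer_name", "country")
        .withColumn("customer_sk", F.monotonically_increasing_id())
    )
    dim_customer.write.format("delta").mode("overwrite").saveAsTable("dw.dim_customer")

    # Fact table: measures plus foreign keys to the dimensions, partitioned by date
    # so incremental loads and date-bounded queries touch only the relevant partitions.
    fact_orders = (
        spark.read.table("raw.orders")
        .join(dim_customer, "customer_id")
        .select("order_id", "customer_sk", "order_amount", "order_date")
    )
    (
        fact_orders.write.format("delta")
        .mode("overwrite")
        .partitionBy("order_date")
        .saveAsTable("dw.fact_orders")
    )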

To enhance the data analytics capabilities of a .NET application using Azure Cognitive Services, we can utilize services such as Azure Text Analytics for sentiment analysis and language detection, Azure Computer Vision for image analysis, Azure Speech Services for speech-to-text and text-to-speech capabilities, and Azure Translator for language translation. These services can be integrated into your application through the Azure SDKs or the REST APIs, allowing you to extract insights from text, image, and speech data and enrich the data analysis process. Additionally, we can leverage Azure Machine Learning for custom model training and deployment to further enhance the application's analysis capabilities. So my approach is to utilize Azure Cognitive Services, and that is how we can enhance the data analysis capability in the .NET application. Thank you.
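A minimal Python sketch of the Text Analytics (sentiment) integration mentioned above, using the azure-ai-textanalytics package; the endpoint, key, and sample document are placeholders.

    from azure.core.credentials import AzureKeyCredential
    from azure.ai.textanalytics import TextAnalyticsClient

    # Endpoint and key are placeholders; real values come from the Cognitive Services resource.
    client = TextAnalyticsClient(
        endpoint="https://<resource>.cognitiveservices.azure.com/",
        credential=AzureKeyCredential("<key>"),
    )

    documents = ["The onboarding experience was smooth and the support team was helpful."]
    results = client.analyze_sentiment(documents=documents)

    for doc in results:
        if not doc.is_error:
            # Overall sentiment label plus per-class confidence scores for the document.
            print(doc.sentiment, doc.confidence_scores)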

To integrate Azure Machine Learning services into a Python-based data processing workflow, we first set up an Azure Machine Learning workspace to manage the machine learning resources, and install the Azure ML SDK for Python using pip. Then we authenticate with the subscription credentials to access the workspace, and prepare the data using Python libraries such as pandas, NumPy, and SciPy. We define an experiment in Azure ML to encapsulate the data flow, model training, and evaluation, and create a compute target in Azure Machine Learning, such as a compute cluster or Azure Databricks, to run the experiment. Then we train the machine learning model on that compute target, deploy the model, and finally monitor and manage it, continuously monitoring the deployed model using the Azure Machine Learning monitoring and management tools. Thank you.
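A brief sketch of the workspace and experiment steps above using the azureml-core (v1) Python SDK; the experiment name and logged metrics are illustrative assumptions, and the workspace details are expected to come from a downloaded config.json.

    # pip install azureml-core
    from azureml.core import Workspace, Experiment

    ws = Workspace.from_config()  # authenticate and attach to the workspace via config.json
    experiment = Experiment(workspace=ws, name="demand-forecast-sketch")  # hypothetical name

    run = experiment.start_logging()  # interactive run to track a local training step
    run.log("rows_trained", 10_000)   # illustrative metrics only
    run.log("rmse", 0.42)
    run.complete()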