
As a Senior Data Engineer at EY, I lead and ensure the delivery of high-quality data solutions to our clients, collaborating with cross-functional teams and leveraging Azure Databricks, Spark, Python, and Azure Cloud technologies. I have saved time and money by implementing automations for data integrity validations and infrastructure management. In my role as an Azure Data Engineer at EY, I engineered a new analytic platform with a central marketplace for analytics discovery and a robust engine for deploying repeatable data solutions.
I have a Bachelor of Technology in Computer Science from APJ Abdul Kalam Technological University, where I gained a solid foundation in data engineering, data integration, data transformation, and machine learning algorithms. I am also certified in Python, Cloud, and Databricks. I am passionate about learning and exploring new tech opportunities, and I am always eager to expand my skillset and knowledge.
Senior Analyst (Data Engineer)
Ernst & Young Analytic Hub - Actionable Insights at scale
Associate Analyst (Data Engineer)
EY GDS
Python

SQL

ETL

Azure Databricks

Machine Learning

Azure DevOps

Azure Data Factory

Scikit-Learn

Azure Key Vault

Pandas

OpenCV

GitHub

BigData

Spark

Power BI

Matplotlib

PowerShell
My name is Mohammed. I'm a data scientist, and I have been working with data since 2020. I started with a machine learning internship that lasted 6 months, and in April 2021 I joined as a data engineer. Since then, I have been working as an Azure data engineer for around 3 years, handling data engineering operations end to end: receiving the data, transforming it, and delivering the insights to the analysts. I have mostly worked on batch requirements. I have handled various clients, and the requirements were mostly on Azure cloud, involving technologies such as Spark SQL, Azure Data Factory, Databricks, Key Vault, and Functions.
If we are designing an ETL system, the design depends on what sort of ETL we need. If we are planning an OLAP system that will be consumed by Power BI or another reporting system, the requirement is mostly at the following level. The input data arrives from some set of sources and is put into a landing container or landing area. From there, we can follow the standard medallion architecture, with bronze, silver, and gold layers. The data landed in the containers is kept as the source of truth. In the next layer, we can clean and enrich the data, and in the final layer, we make the data available for consumption. As a design, we can utilize the data lake or blob storage and Data Factory along with SQL Server, where SQL Server serves the end-consuming process. So we keep the raw data in the landing containers, keep it clean in the silver layer, maintain the gold layer as well, and from there push the transformed data into SQL Server.
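A minimal sketch of the layering described above, under loud assumptions: the sample file, the column names, and every function name are hypothetical, and an in-memory SQLite database stands in for SQL Server so the example runs anywhere.

```python
import csv
import io
import sqlite3

# Hypothetical raw file as it would land in the bronze (landing) layer.
RAW_CSV = """order_id,customer,amount
1, alice ,10.50
2,bob,not_a_number
3,alice,4.50
"""

def load_bronze(raw_text):
    """Bronze: keep the rows exactly as received, as the source of truth."""
    return list(csv.DictReader(io.StringIO(raw_text)))

def to_silver(bronze_rows):
    """Silver: trim strings, cast types, and drop rows that fail the cast."""
    silver = []
    for row in bronze_rows:
        try:
            silver.append({
                "order_id": int(row["order_id"]),
                "customer": row["customer"].strip(),
                "amount": float(row["amount"]),
            })
        except ValueError:
            continue  # a real pipeline would route this row to a reject area
    return silver

def to_gold(silver_rows):
    """Gold: aggregate to the shape the report needs (total per customer)."""
    totals = {}
    for row in silver_rows:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return totals

def publish(gold, conn):
    """Push the gold data into the serving database (SQLite stands in for SQL Server)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer_totals (customer TEXT PRIMARY KEY, total REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO customer_totals VALUES (?, ?)", gold.items())
    conn.commit()
```

Calling load_bronze, to_silver, to_gold, and publish in that order mirrors the landing, silver, gold, and SQL Server steps. In an actual Azure setup, each function would be a Data Factory activity or a Databricks notebook rather than plain Python.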
If we want a high-throughput web scraping service, we basically need a high-availability service. We may have to think of the queue-related services on the Azure side, or of a Kafka sort of component, which will help us with high availability and real-time capabilities. So if we want a high-throughput, concurrent application, this is the sort of design we can utilize. On top of it, we can have multiple servers running different scripts, which together form a multi-scraper system.
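The queue-plus-workers idea can be sketched on a single machine as follows. This is only an illustration: Python's in-process queue stands in for an Azure queue service or Kafka, threads stand in for the multiple servers, and fake_fetch is a hypothetical placeholder for the real HTTP request and parsing.

```python
import queue
import threading

def fake_fetch(url):
    """Hypothetical stand-in for the real HTTP request and parsing step."""
    return {"url": url, "length": len(url)}

def run_scrapers(urls, worker_count=4, fetch=fake_fetch):
    """Fan a list of URLs out to several workers through a shared queue."""
    tasks = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            url = tasks.get()
            if url is None:  # sentinel: no more work for this worker
                tasks.task_done()
                return
            try:
                page = fetch(url)
                with lock:
                    results.append(page)
            except Exception:
                pass  # a real service would retry or dead-letter this URL
            finally:
                tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for url in urls:
        tasks.put(url)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    tasks.join()
    for t in threads:
        t.join()
    return results
```

The queue decouples producing URLs from scraping them, so throughput scales by adding workers. A durable broker adds what this sketch lacks: work that survives a crashed worker.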
Suppose the data is coming from the scraping system and we are expecting files to be extracted as part of the web scraping. How we handle them depends on the nature of the files. If we receive the scraped files as CSV, JSON, or XML, we can have them stored in an SFTP location and picked up in a batch fashion. From there, the files are taken into a landing layer, and we do the cleaning on top of that. So, basically, we can utilize ADF with its copy capability, and Databricks, if we have that component, to clean up the data. We utilize linked services and datasets to move the data from the SFTP location into landing, and then from landing into a silver layer by utilizing Databricks through its own linked service. After the cleaning and transformation, the Databricks script itself can move the data into a SQL database. That's one of the approaches. If we don't want Databricks in between, considering the cost or anything else, we can utilize data flow activities, which help us do a basic level of transformation and cleanup. At the sink level, we have the SQL database, which will most probably load the scraped data incrementally. So the components are mostly these: the SFTP location, then a time-based trigger that loads the files into our landing area, which can be a data lake or blob storage, then the ADF copy activity, and then the data flow activity, which takes its source from blob storage and has the SQL database as its sink.
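The incremental load at the sink can be sketched with a simple watermark. The table name, the column names, and the ISO-8601 timestamp format are assumptions made for the illustration, and SQLite stands in for the Azure SQL database.

```python
import sqlite3

def incremental_load(conn, rows):
    """Upsert scraped rows whose scraped_at is newer than the stored watermark.

    rows: iterable of (item_id, price, scraped_at) with ISO-8601 timestamps.
    Returns the number of rows written.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scraped_items ("
        "item_id TEXT PRIMARY KEY, price REAL, scraped_at TEXT)"
    )
    # The watermark is the newest timestamp already present in the sink.
    watermark = conn.execute("SELECT MAX(scraped_at) FROM scraped_items").fetchone()[0]
    new_rows = [r for r in rows if watermark is None or r[2] > watermark]
    conn.executemany(
        "INSERT OR REPLACE INTO scraped_items (item_id, price, scraped_at) VALUES (?, ?, ?)",
        new_rows,
    )
    conn.commit()
    return len(new_rows)
```

Only rows newer than the watermark are written, so re-delivering an old file does not reload it, and a newer version of an existing item replaces the old row.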
In DevOps, if we want to deploy any repo-configured item, we can set up a bunch of PowerShell code. Since we have the CI/CD build, we define a set of stages. If we have three environments, let's say dev, test, and prod, we can set up these three stages, and for them we can have three pipelines that take the SQL schema from the artifacts. For the branching design, let's keep the main branch for prod, a test branch for the testing environment, and a develop branch for the dev environment, and from develop we can create the feature branches. The artifacts are taken from these three branches, and they can be deployed with the Azure container deployment activity. By utilizing this activity, we can containerize all these SQL scripts and deploy them into our Azure SQL infra. So the PowerShell activity that deploys this will have three stages. If we raise a PR from a feature branch into develop, the merge can act as the trigger, so it automatically deploys into dev. For the testing and prod environments, we can add a security step as well: we can put some approval criteria in place, to be completed by the leads or someone similar. We configure those rules, and the pipeline is set up in that way.
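The branch and approval rules above can be written down as a small decision table. This is a plain-Python illustration of the logic only, not an Azure DevOps API: STAGES and plan_deployment are hypothetical names, and in practice these rules live in the pipeline definition and its environment approvals.

```python
# Hypothetical branch-to-environment rules matching the strategy described above.
STAGES = {
    "develop": {"environment": "dev", "needs_approval": False},
    "test": {"environment": "test", "needs_approval": True},
    "main": {"environment": "prod", "needs_approval": True},
}

def plan_deployment(target_branch, approved_by=None):
    """Decide what a merge into target_branch should trigger."""
    stage = STAGES.get(target_branch)
    if stage is None:
        # Feature branches and anything else never deploy on their own.
        return {"deploy": False, "reason": "no stage for branch " + target_branch}
    if stage["needs_approval"] and not approved_by:
        return {
            "deploy": False,
            "reason": "waiting for approval",
            "environment": stage["environment"],
        }
    return {"deploy": True, "environment": stage["environment"]}
```

A merge into develop deploys straight to dev, while merges into test or main wait until an approver is recorded.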
The SQL query frequently times out when we execute it, and a timeout can have different sorts of reasons. First of all, we can perform a time-based analysis and a storage-based analysis. The volume of data processed by the query can be one reason, and the complexity of the calculation can be another. So the nature of the query we have written is the first place to start the root cause analysis. If we have complex queries, such as complex nested queries, we have to identify which part of the query is taking the time. With nested queries, we should ideally follow the basic principles: select only the necessary columns, avoid subqueries where possible, and if we do use a subquery, make sure it returns a limited set of data. We should have proper join conditions and proper indexing. If an index is missing, we should add it, and we should join on the indexed column so that the join can use the index. This sort of basic evaluation shows where the query is spending its time. We can also run the same query in the SQL editor and check the execution plan to see which part of the query takes too much time and what can be done to optimize it further. If a large amount of data is being scanned, and that bottleneck is causing the timeouts, then we have to figure out a way to avoid the large scan. Most likely, we have to check the granularity and restrict the query to a level where we scan less data.
What I intended to say is this: if the query runs at the sales order item level, we may have to consider going back to the sales order level. If a sales order has 10 items, then let's not scan all the item rows. That is the sort of direction we have to think in and go deeper into.
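Checking the plan before and after adding an index can be sketched like this. SQLite and its EXPLAIN QUERY PLAN statement are used only because they run anywhere; SQL Server exposes execution plans through its own tooling, and the table and index names here are hypothetical.

```python
import sqlite3

def query_plan(conn, sql, params=()):
    """Return the plan steps SQLite reports for a query, joined into one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)  # the last column holds the description

def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales_order_item (order_id INTEGER, item_no INTEGER, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO sales_order_item VALUES (?, ?, ?)",
        [(o, i, 1.0) for o in range(1000) for i in range(10)],
    )
    sql = "SELECT item_no, amount FROM sales_order_item WHERE order_id = ?"
    before = query_plan(conn, sql, (42,))  # no index yet: the whole table is scanned
    conn.execute("CREATE INDEX idx_item_order ON sales_order_item (order_id)")
    after = query_plan(conn, sql, (42,))   # the filter can now use the index
    return before, after
```

Printing the two strings returned by demo() shows the plan changing from a full table scan to a search that names idx_item_order. The same before-and-after comparison applies when reading a SQL Server plan.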
First of all, the basic thing: SELECT * is not the best way, and we can select only the necessary columns. Second, I see that the WHERE clause has been placed at the end, with the filters on the sell start date and the product number, so these two filters got pushed to the end. I'm expecting the product number and the product model to come from the product table. That means both filters, on product number and on sell start date, apply to the product table. So we can easily apply these filters before joining, so that we take less data into the join. These two are the main findings: select the necessary columns instead of the star, and push the filters to the top. I can also see that the query already uses a left join, which should perform fine. Beyond that, we can think of other things to incorporate if we want, but these two should definitely be done.
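Since the reviewed query is not reproduced here, the two findings can only be illustrated on a stand-in. The schema, the sample rows, and the filter values below are all hypothetical, and SQLite is used so the comparison is runnable.

```python
import sqlite3

def build_db():
    """Create a tiny product / product_model pair of tables with sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE product (product_id INTEGER PRIMARY KEY, product_number TEXT,
                              sell_start_date TEXT, model_id INTEGER);
        CREATE TABLE product_model (model_id INTEGER PRIMARY KEY, model_name TEXT);
        INSERT INTO product VALUES (1, 'BK-1', '2012-05-01', 10),
                                   (2, 'BK-2', '2009-01-01', 10),
                                   (3, 'HL-1', '2013-01-01', NULL);
        INSERT INTO product_model VALUES (10, 'Road Bike');
    """)
    return conn

# Filters sit in the WHERE clause, after the join, and every column is selected.
LATE_FILTER = """
    SELECT * FROM product p
    LEFT JOIN product_model m ON m.model_id = p.model_id
    WHERE p.sell_start_date > '2010-01-01' AND p.product_number LIKE 'BK%'
"""

# Filters run on product first, and only the needed columns are selected.
EARLY_FILTER = """
    SELECT p.product_number, m.model_name
    FROM (SELECT product_number, model_id FROM product
          WHERE sell_start_date > '2010-01-01' AND product_number LIKE 'BK%') p
    LEFT JOIN product_model m ON m.model_id = p.model_id
"""
```

Both queries select the same product here, because the filters apply to the left table of the left join. Many engines push such filters down on their own, so the actual gain should be confirmed in the execution plan rather than assumed.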
In the Azure Data Factory copy activity, we have a delimited text source on Azure Blob Storage, with the path container.dataset.csv and the text format settings: the column delimiter, treat empty as null, so we're keeping empty values as null, and the skip line count, so we're skipping the first line. The sink is Azure SQL, table dbo.import, and the translator maps name as string, age as int, and start date as date/time. The reported issue is with the input fields, due to incorrect data type mapping. In the column mapping, these three fields look good to me overall. Start date should be no issue, since it's date/time only. Age can most probably cause a problem if people record it with months or something similar alongside the years. At this level of detail I can't get much more insight, and it's better if we can see the data as well, because that will give us the proper insight easily. So the possible causes are things like age holding a floating-point value, or start date holding just a date, in which case we may not need date/time when no timestamp details are coming in. This sort of issue can be there, but looking at the data will help us figure it out very easily.
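Looking at the data can be as simple as running each value through the type the mapping expects. A minimal sketch follows; the header names, the date/time format, and the comma delimiter are assumptions, because the actual file is not shown.

```python
import csv
import io
from datetime import datetime

def parse_datetime(text):
    # Assumed format; the real source format is not shown in the configuration.
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")

# Column -> converter, mirroring name: string, age: int, start date: date/time.
# The header names themselves are hypothetical.
MAPPING = {"name": str, "age": int, "startDate": parse_datetime}

def find_mapping_errors(csv_text, delimiter=","):
    """Return (line_number, column, raw_value) for every value that fails its converter."""
    errors = []
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        for column, convert in MAPPING.items():
            raw = row[column]
            if raw == "":
                continue  # empty values are treated as NULL
            try:
                convert(raw)
            except ValueError:
                errors.append((line_number, column, raw))
    return errors
```

Running this over a sample of the source file points at the exact rows and columns that break the mapping, such as an age of 31.5 or a start date without a time part.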
Multi-regional data replication inconsistency. So we want to implement multi-regional replication on a SQL database. This question needs a bit of clarity for me: are we expecting to replicate the data in the SQL DB into multiple regions, or are we expecting data from multiple regions to come into the SQL database? If data is coming from multiple regions into the SQL database and we have to handle it in the database, then the approach should be this: we keep a flag column that identifies which region each row is coming from. If we have sales order data coming from the Northeast, East Asia, and the Middle East, the flag column lets us tell very easily whether a row came from the Middle East or from East Asia. This single column can be populated from a flag in the data itself, or it can be derived from the file names or from timestamp details. That's the way I think we should handle the data consistently in the database.
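Deriving the flag column from the file name might look like the sketch below. The naming convention sales_<region>_<yyyymmdd>.csv is purely an assumption for the example, as are the function and column names.

```python
import re

# Hypothetical naming convention: sales_<region>_<yyyymmdd>.csv
FILE_PATTERN = re.compile(r"^sales_(?P<region>[a-z_]+)_(?P<date>\d{8})\.csv$")

def tag_rows_with_region(file_name, rows):
    """Add a 'region' key to every row, derived from the source file name."""
    match = FILE_PATTERN.match(file_name)
    if match is None:
        # Failing loudly is safer than loading rows with an unknown region.
        raise ValueError("cannot derive region from file name: " + file_name)
    region = match.group("region")
    return [dict(row, region=region) for row in rows]
```

Rejecting files that do not match the convention keeps untagged rows out of the table, which is what makes the flag column reliable for later filtering.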
Okay. If we have a PHP system, then I think it is, first of all, an on-premise system. If the data is available on-premise, we can utilize batch activities to pull it by setting up a linked service and a connection to the PHP system. The dataset can then pull the data directly into SQL Server for the first load. The copy option here is lift and shift, and this can help with that. Or else we may need to utilize the Azure file service itself, which will lift and shift the whole data very easily. We have to figure out which of these two works, or explore whether any other options are available that may help us. So these are the three approaches: ADF, the Azure file store, or exploring further on the Azure stack.
If we want to implement CI/CD for ADF, it's pretty easy in DevOps. Basically, we set up a branching strategy in ADF, backed by Azure Repos. Once Azure Repos is available and set up, we will have two branches by default: the adf_publish branch and the main branch. Then we can create a pipeline that takes the ARM template and deploys it. That's the approach. If we go deeper into the deployment, we create two artifacts, one from the main branch and one from the adf_publish branch. The adf_publish artifact provides the ARM template for the ARM template deployment activity, where we select the ARM template file and the ARM template parameter file. Then we connect this to whichever data factory we intend to deploy to, and configure the branch and the trigger. The main branch artifact is needed because Azure recommends stopping the triggers before the deployment and restarting them after it. So let's keep a PowerShell script there: before the ARM template deployment activity, the script stops the triggers, and after the ARM template deployment activity, the same PowerShell activity runs with different parameters to restart them. We can utilize the boilerplate code that Azure provides on its website and supply the parameters. On top of that, we set the branch for the dev stage, so that whatever we push to dev gets deployed.
For the test environment, we should have another stage underneath that deploys based on the testing branch, and the main branch is obviously for prod. So this is what I can think of.
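The ordering of the three release steps can be sketched as follows. This is not the Azure-provided script: all three callables are hypothetical wrappers that the caller supplies, for example around the PowerShell script and the ARM deployment task, and restarting the triggers even after a failed deployment is a design choice of this sketch.

```python
def deploy_data_factory(stop_triggers, deploy_arm_template, start_triggers):
    """Run the three release steps in order and always try to restart triggers.

    stop_triggers() must return the names of the triggers it stopped.
    deploy_arm_template() performs the deployment and may raise on failure.
    start_triggers(names) restarts exactly the triggers that were stopped.
    """
    stopped = stop_triggers()
    try:
        deploy_arm_template()
    finally:
        # Restart only what was running before, even if the deployment failed.
        start_triggers(stopped)
    return stopped
```

Keeping the list of stopped triggers and passing it to the restart step avoids starting triggers that were deliberately disabled before the release.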