I am a working professional with 5 years of overall experience across domains such as e-commerce and non-banking institutions.
Senior Data Engineer
Wipro
Data Analyst
Embibe
Data Analyst
Skill-Lync
Risk Analyst
Amazon
Business Analyst
Shriram Finance Limited
Senior Business Analyst
Shriram Finance Limited
MySQL

Python

Tableau

Power BI

Advanced Excel

Microsoft Azure SQL Database
Azure Data Lake Storage Gen2 (ADLS)

Azure Synapse
I am a certified data analytics professional with more than 6 years of experience in data science and analytics, and I have worked across multiple domains. I started my career in e-commerce at Amazon, as a business analyst. My role was to detect fraudulent transactions and protect the company from large losses. I would write SQL scripts to fetch the transactional data, manually check the high-value orders, and take action based on what I found. Next, I went to Shriram Transport Finance Company as a risk analyst, where my role was to identify fraud patterns and, where necessary, optimize the portfolio. I also generated insights from the portfolio on customer acquisition through referrals, and worked with the B2B team and the GTM team to prepare weekly and monthly reports. Next, I went to Skill-Lync as a data analyst, where my technical role came into the picture. I worked with large-scale data there, using AWS for the database and AWS Lambda for writing scripts. We helped management generate more revenue by analyzing the engagement activity of users and leads. I built the weekly, monthly, and quarterly reports on revenue and people's performance. I also tracked user behavior: which courses they were searching for, how long and where they spent their time on the website, and, if they dropped out, what the pain point behind the dropout was. From that I built a dropout recommendation. To keep growing, I then went to Embibe, again as a data analyst, where I worked closely with the data engineering team. There I helped the engineering team build and maintain the cloud services; the cloud platform I used there was Azure.
I helped the organization build many dashboards. The main one was a business optimization dashboard with more than 50 KPIs, showing trends from past data and helping stakeholders generate insights for forecasting: for example, what patterns users followed when coming onto the platform and how long they kept coming. We also implemented A/B testing: for example, if we placed certain tags on the website, how they performed, how the different events performed, and which panels did best. I created a sign-up channel analysis dashboard, which showed the organization in real time which banners were performing best and what the parameters of those banners were, so that we could optimize our courses according to those click rates. It also fed a drop-off recommendation. Apart from that, I prepared the daily, weekly, and monthly reports for the GTM team, helping to identify customers' pain points.
The customer segmentation model is one of the most important models in which I have used clustering, mainly to optimize marketing campaigns. Clustering is the key component of how we build it. We use the scikit-learn package in Python and import KMeans from it. K-means is an unsupervised clustering algorithm: it groups data points by how similar they are, in this case based on customer usage. For example, a customer is from a certain location, uses a certain amount of data, visits at a certain time of day, and visits certain channels. When customers visit our website, they create a pattern: some visit in the morning, some in the evening; some view content A, some view content B. We also aggregate other details, such as personal fields like gender, usage duration, take rates, how often they visit, and whether they scroll the page consistently. Accumulating all these data points, we create small segments of customers. So, basically, we use the scikit-learn package in Python with the k-means algorithm to get the clusters.
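A minimal sketch of the clustering step described above, assuming scikit-learn and NumPy are installed. The three features (visit hour, session minutes, clicks) and the synthetic customer groups are made up for illustration and are not the actual campaign data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)


def make_segment(n, hour, minutes, clicks):
    """Synthetic customers centred on a typical visit hour, session length and click count."""
    return np.column_stack([
        rng.normal(hour, 1.0, n),
        rng.normal(minutes, 2.0, n),
        rng.normal(clicks, 1.5, n),
    ])


# Three hypothetical behaviour groups: morning, afternoon and evening visitors.
X = np.vstack([
    make_segment(50, hour=8, minutes=5, clicks=3),
    make_segment(50, hour=14, minutes=20, clicks=10),
    make_segment(50, hour=20, minutes=45, clicks=25),
])

# Scale first so that no single feature dominates the Euclidean distance.
X_scaled = StandardScaler().fit_transform(X)

# The number of clusters is chosen up front; here it is 3 by construction.
model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = model.fit_predict(X_scaled)

# Average raw feature values per cluster, to describe each segment.
for cluster in sorted(set(labels.tolist())):
    hour, minutes, clicks = X[labels == cluster].mean(axis=0)
    print(f"cluster {cluster}: hour={hour:.1f} minutes={minutes:.1f} clicks={clicks:.1f}")
```

Scaling before clustering is a design choice in this sketch: k-means uses distances, so a feature measured in large units would otherwise outweigh the others.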
To validate data accuracy and consistency in SQL, I first look at data ingestion: where the data is ingested from, since it comes from multiple sources, and how the extraction, transformation, and loading are done. In Azure we usually do this with Azure Data Factory. Is there any inconsistency or discrepancy in the data types? That is the first check for accuracy. Next, to ensure consistency, we check for null values. For example, say there are 50,000 records in an employee table that holds a joining date, an employee ID, and an employee name. Suppose the joining date is missing for 100 employees, and their employee IDs are old. How do we recognize that they are old? The IDs are very low, like 101 or 102, while the series is now in the 500s or 600s. So we first find where the null values are, and then we can replace them with the minimum date in the entire table. To repeat the order: first we check the ingestion for any discrepancy. Once the data fields are accurate and the extraction and transformations are done correctly, we update the records accordingly. Sometimes the front end passes different events that are really duplicates of the same event. We can reconcile those duplicates and collapse them into one with CASE statements, and then update the entire table. So, first, the ETL. Second, wherever there are null values, we impute them appropriately. Third, duplicate category values can be collapsed into one: for example, one value is "goal" and another is "Goal", and a CASE statement maps both to a single value.
Similarly, other fields may have duplicate values. We check each distinct value of the categorical field individually and update the database after removing the duplicates. This is how we can do it.
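The null check, the minimum-date imputation, and the CASE-based collapse of case variants can be sketched as follows. This uses Python's built-in sqlite3 module as a stand-in for the Azure SQL database; the employee table, its category column, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (
    employee_id   INTEGER PRIMARY KEY,
    employee_name TEXT,
    joining_date  TEXT,   -- ISO dates, so MIN() on the text sorts correctly
    category      TEXT
);
INSERT INTO employee VALUES
    (101, 'Asha',  NULL,         'goal'),
    (102, 'Ravi',  NULL,         'Goal'),
    (501, 'Meena', '2021-03-15', 'Goal'),
    (502, 'Kiran', '2019-07-01', 'goal');
""")

# Step 1: count the missing joining dates.
null_count = conn.execute(
    "SELECT COUNT(*) FROM employee WHERE joining_date IS NULL"
).fetchone()[0]

# Step 2: impute the missing dates with the earliest date in the table.
conn.execute("""
    UPDATE employee
    SET joining_date = (SELECT MIN(joining_date) FROM employee)
    WHERE joining_date IS NULL
""")

# Step 3: collapse case variants of the same category into one value.
conn.execute("""
    UPDATE employee
    SET category = CASE WHEN LOWER(category) = 'goal' THEN 'Goal' ELSE category END
""")

rows = conn.execute(
    "SELECT employee_id, joining_date, category FROM employee ORDER BY employee_id"
).fetchall()
distinct_categories = [r[0] for r in conn.execute("SELECT DISTINCT category FROM employee")]
print(null_count, rows, distinct_categories)
```

Whether the minimum date is a sensible fill value depends on the business meaning of the column; the sketch only mirrors the approach described in the answer.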
When we analyze data from high-velocity, high-volume sources, meaning tables with a vast number of records, millions or even trillions, the first thing we do is optimize the query. We cannot pull all the data at once. Suppose we are working on a live database in a cloud service; I work in Azure Synapse Studio. First, we look at the top 100 records to understand the shape of the table. Once we have a good picture of how the tables look, we select only the columns that the business stakeholders need. Then we filter accordingly. If we need 6 months of data, we apply a date filter. If we need a particular value of a field, say the source should be SEO, we filter on that. The point is that we cannot fetch the entire 50 lakh records in one go, so the very first thing we need to do is optimize the query and then run it, so that we get only what we require. Next, we need to do normalization, because without it, if there are many records, we cannot store everything in one table. We create different IDs. Say we have a user ID and we want to store user patterns, including the timestamps. We can create one visitor ID per user per day and store under that visitor ID the timestamps of each visit, that is, how many times the user visited the website on that particular day. This reduces the dataset and makes it more scalable and usable for other teams.
This is how we should think about it: optimizing the query, using normalization, and connecting the different tables by ID with inner joins or left joins, whichever the business purpose requires. That is how to handle and analyze data from high-velocity, high-volume sources.
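The preview, column selection, filtering, and normalized layout described above can be sketched like this, using Python's built-in sqlite3 module in place of Azure Synapse. The table names (users, visits, visit_events) and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, source TEXT);
-- One row per user per day.
CREATE TABLE visits (
    visit_id   INTEGER PRIMARY KEY,
    user_id    INTEGER REFERENCES users(user_id),
    visit_date TEXT,
    UNIQUE (user_id, visit_date)
);
-- Individual timestamps hang off the daily visit row.
CREATE TABLE visit_events (
    visit_id INTEGER REFERENCES visits(visit_id),
    event_ts TEXT
);
INSERT INTO users VALUES (1, 'SEO'), (2, 'Paid'), (3, 'SEO');
INSERT INTO visits VALUES
    (10, 1, '2024-05-01'), (11, 1, '2023-01-01'),
    (12, 2, '2024-05-01'), (13, 3, '2024-06-10');
INSERT INTO visit_events VALUES
    (10, '2024-05-01 09:00'), (10, '2024-05-01 18:30'),
    (11, '2023-01-01 10:00'),
    (12, '2024-05-01 11:15'),
    (13, '2024-06-10 08:45');
""")

# Preview at most 100 rows to understand the table before writing the real query.
preview = conn.execute("SELECT * FROM visits LIMIT 100").fetchall()

# Fetch only the needed columns, for one source and a six-month window.
query = """
    SELECT u.user_id, v.visit_date, COUNT(e.event_ts) AS views
    FROM users AS u
    INNER JOIN visits AS v       ON v.user_id = u.user_id
    LEFT JOIN  visit_events AS e ON e.visit_id = v.visit_id
    WHERE u.source = ?
      AND v.visit_date >= ? AND v.visit_date < ?
    GROUP BY u.user_id, v.visit_date
    ORDER BY u.user_id, v.visit_date
"""
result = conn.execute(query, ("SEO", "2024-01-01", "2024-07-01")).fetchall()
print(result)
```

The left join keeps a daily visit row even if no timestamps were recorded for it, which is one reason to choose it over an inner join here.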
Generally, in any BI tool, for example Tableau or Power BI, you can create a dynamic and interactive dashboard from multiple sources. Instead of importing the data, I can connect Tableau to my Azure SQL Server through Synapse Studio and use the direct query method. With a direct query, and the refresh interval set to, say, every 5 or 10 minutes, the tool queries the source directly and gets the data instantly whenever the dashboard is used, rather than importing the whole dataset in one go. We still need to optimize the query. Whatever analysis we build on that dataset becomes interactive when we apply filters, drill-down steps, and action buttons. In Power BI, we can also drive different kinds of graphs with DAX queries. All these things make it dynamic and interactive. For real-time data monitoring, then, we use the direct query method instead of importing the data. This is how I usually create real-time monitoring dashboards.
Moving data like this is usually called a migration: we migrate data from one platform to another. For that, we need to connect Tableau to the Power BI source, so there is a chain of links. Power BI is connected to the data warehousing system, and Tableau is connected to Power BI. If we want all of the data imported into Tableau, we can do that. If we want it on a real-time basis, we can carry the same query syntax over to the Tableau dashboard and make the transition that way. I generally use Power BI a lot. I have used Tableau to create internal dashboards, but we have done very few such transitions in my client organization. So this is the process as I understand it, but it is something we would need to explore to make it work.
The main problem here is n_clusters set to "auto". If you do not define the number of clusters, the k-means algorithm, or the KMeans package, cannot determine the optimal number of clusters to form from the data points. Remember the multiple data points we saw earlier. If we ask for 5 clusters, it will split the data into 5 groups; if we ask for 8, it will split it into 8. So if you do not define the number of clusters, the code snippet will not function correctly. That is the main thing.
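Since scikit-learn's KMeans expects an integer for n_clusters, one common way to pick that integer is to try several values and compare a quality score. The sketch below uses the silhouette score on synthetic data; the data, the candidate range of 2 to 6, and the choice of silhouette as the criterion are all assumptions for illustration, not part of the snippet under review.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (made up for illustration).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

scores = {}
for k in range(2, 7):
    # n_clusters is always passed as an explicit integer.
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Keep the candidate with the highest silhouette score.
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

The elbow method on inertia is another frequently used criterion; either way, the number of clusters is decided outside the algorithm and passed in.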
And what should be corrected here? Okay. Here you are using module ID to aggregate the scores, but you are grouping by student ID. This is an error and should be rectified, because whenever we aggregate a table by module ID, we should also group by module ID to get accurate data. We can use the student ID to filter, and we can use COUNT(student_id) to count students. But if you want module-wise average scores, each student ID has multiple modules, so we should group the entire dataset by module ID.
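A small sketch of the corrected aggregation, run through Python's built-in sqlite3 module. The scores table and its rows are hypothetical, since the reviewed query itself is not reproduced in the transcript.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE scores (student_id INTEGER, module_id TEXT, score REAL);
INSERT INTO scores VALUES
    (1, 'M1', 80), (1, 'M2', 60),
    (2, 'M1', 90), (2, 'M2', 70),
    (3, 'M1', 70);
""")

# Group by the same column that is selected alongside the aggregates.
per_module = conn.execute("""
    SELECT module_id,
           AVG(score)                 AS avg_score,
           COUNT(DISTINCT student_id) AS students
    FROM scores
    GROUP BY module_id
    ORDER BY module_id
""").fetchall()
print(per_module)
```

Grouping by student_id while selecting module_id would return one row per student instead of one per module, which is the mismatch the answer points out.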
In cloud services, what happens when we get data? Data is generally captured from multiple sources; we do web scraping, and a lot of data comes in from different places. Since I am used to Azure, I will give the example with Azure. In Azure Data Factory we do the ETL process, now also using Azure Databricks. What happens when the data is very large in scale? The first thing in the transformation process is to capture the correct data. Then we aggregate the data and store it. If you store each record with its timestamp, as I described earlier, you end up with a huge dataset, say 50 million records in a particular container. Retrieving that data, writing a query script and fetching it, then becomes very difficult to scale. So, number one, we can aggregate the data and then store it. Second, we can normalize: store the data under different IDs, create different tables, and link them to each other. Third, we can create lookup dictionaries and link to them. Say each record has a path name. We can create an ID for each specific path: if a user is viewing a particular piece of content on the website, that content has an ID, and we map the ID to the path name and store only the ID. That reduces the size of the dataset. So the first thing we try is aggregation. If aggregation is not possible, for example because there are multiple categorical variables, we normalize the dataset.
In this way, data quality and efficiency are maintained in SQL.
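The path dictionary and the aggregate-before-storing idea can be sketched together as follows, using Python's built-in sqlite3 module rather than an Azure container. The table names (raw_events, paths, daily_views) and the sample paths are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Raw events repeat the full path string on every row.
CREATE TABLE raw_events (user_id INTEGER, path TEXT, event_ts TEXT);
INSERT INTO raw_events VALUES
    (1, '/courses/data-science/intro', '2024-05-01 09:00'),
    (1, '/courses/data-science/intro', '2024-05-01 09:05'),
    (2, '/courses/data-science/intro', '2024-05-01 10:00'),
    (2, '/courses/cad/basics',         '2024-05-02 11:00');

-- Lookup table: each distinct path gets a small integer ID.
CREATE TABLE paths (path_id INTEGER PRIMARY KEY, path TEXT UNIQUE);
INSERT INTO paths (path) SELECT DISTINCT path FROM raw_events ORDER BY path;

-- Aggregated table: one row per user, path and day instead of one per event.
CREATE TABLE daily_views AS
SELECT r.user_id, p.path_id, DATE(r.event_ts) AS view_date, COUNT(*) AS views
FROM raw_events AS r
INNER JOIN paths AS p ON p.path = r.path
GROUP BY r.user_id, p.path_id, DATE(r.event_ts);
""")

raw_count = conn.execute("SELECT COUNT(*) FROM raw_events").fetchone()[0]
daily_count = conn.execute("SELECT COUNT(*) FROM daily_views").fetchone()[0]

# Join back to the lookup table when the readable path is needed in a report.
report = conn.execute("""
    SELECT d.user_id, p.path, d.view_date, d.views
    FROM daily_views AS d
    INNER JOIN paths AS p ON p.path_id = d.path_id
    ORDER BY d.user_id, p.path
""").fetchall()
print(raw_count, daily_count, report)
```

The saving is small on four rows, but the stored table holds an integer per row instead of a repeated string, and one row per day instead of one per event.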
We have two ways to do this. Say you have already imported the data and built a dashboard that is interactive and dynamic, meeting your stakeholders' requirements with action filters, graphs, and DAX measures. Once you publish it to the service, you can set a scheduled refresh at a short interval, so that the dashboard is refreshed every 4 to 5 minutes. This is the most efficient way of keeping the dashboard refreshed. The second way is, again, live queries: instead of importing the whole dataset, we use a direct query. But what happens with a direct query? There are two kinds of pools, a dedicated pool and a serverless pool. If I run the query on a serverless pool, the dashboard refreshes every 5 minutes. Depending on your organization's demand, multiple teams may be working in SQL. If 50 to 100 people are working on the same serverless pool, pulling tables and writing scripts, refreshing your dashboard can take a long time for you and for the team. So instead we can use ADX, Azure Data Explorer, with Kusto Query Language, and link the data to live dashboards that way. The dashboard can then be refreshed without interruption, because ADX is designed to handle large-scale datasets, and we can do a direct query and get the data without importing it.
This can be answered in different ways. First, what is an advanced SQL window function? We can use a window function inside a Common Table Expression (CTE), which is a temporary named result set that can be called wherever we need it in a script. Say I want an aggregation, but not over the entire table: I want the top performers among 10 categories, each with 5 subcategories, so 10 times 5 is 50 groups. Column A holds the categories A, B, C, D, and so on, up to 10. Column B holds the subcategories as Roman numerals I, II, III, IV, V. So A has 5 Roman numerals, B has 5, and so on, and each category and subcategory combination has marks. I don't want everything aggregated; I want only the top performers. So we can use the RANK or DENSE_RANK function to rank the entire dataset in one go. You can put that ranking in a CTE or a subquery so that the whole thing runs as a single query: rank the rows, keep the top performers with rank 1, then call the CTE and apply the aggregate function. This way we can handle complex data aggregation. Because CTEs are temporary, we can call them wherever needed within the query. Instead of doing multiple aggregations at once, we can split the aggregations into separate steps, up to 10 if needed, merge them into one result, and present it in our report to the stakeholders.
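The rank-then-aggregate pattern can be sketched as a single query with a CTE and DENSE_RANK. It runs here through Python's built-in sqlite3 module, which assumes the bundled SQLite is version 3.25 or later (needed for window functions). The marks table, its columns, and the sample rows are hypothetical and use 2 categories with 2 subcategories each rather than the 10 by 5 layout in the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE marks (category TEXT, subcategory TEXT, performer TEXT, marks INTEGER);
INSERT INTO marks VALUES
    ('A', 'I',  'p1', 90), ('A', 'I',  'p2', 75),
    ('A', 'II', 'p3', 60), ('A', 'II', 'p4', 85),
    ('B', 'I',  'p5', 70), ('B', 'I',  'p6', 95),
    ('B', 'II', 'p7', 88), ('B', 'II', 'p8', 88);
""")

top_by_category = conn.execute("""
    WITH ranked AS (
        -- Rank performers inside each category and subcategory combination.
        SELECT category, subcategory, performer, marks,
               DENSE_RANK() OVER (
                   PARTITION BY category, subcategory
                   ORDER BY marks DESC
               ) AS rnk
        FROM marks
    )
    -- Keep only rank 1, then aggregate the top performers per category.
    SELECT category, COUNT(*) AS top_performers, AVG(marks) AS avg_top_marks
    FROM ranked
    WHERE rnk = 1
    GROUP BY category
    ORDER BY category
""").fetchall()
print(top_by_category)
```

DENSE_RANK keeps tied performers together at rank 1, so category B counts three top performers here; whether ties should be kept or broken is a business decision.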