
Data Engineer
Celebal TechnologiesAssociate Data Engineer
Celebal TechnologiesAssociate Data Engineer
Celebal TechnologiesSQL Developer

Azure Synapse Analytics

Azure SQL Database

Oracle

AWS Glue

AWS S3

Azure AD

Microsoft Power BI

Python Libraries
.png)
Databricks

Spark
Azure

AWS
Microsoft Excel

Jupyter

Visual Studio

Google Data Studio
Yeah. Hi. My name is, and I have joined CLL Technologies since May 2021. I have joined with my primary skills, Python and SQL. After getting into SQL, I have learned more technologies and enhanced my primary skills. I have learned the technologies like Azure Databricks, Azure Data Factories, and other Azure services. Apart from that, I have also learned the visualization tools, which is Power BI. I have been working with real-time streaming data. Apart from that, I also process data in batch processing. After getting all that data, I am well-versed in the analytical process and the visualization process. As I talk about, I have completed five plus projects and ten plus POCs in different domains for clients during my tenure at Syllable Technologies. Apart from that, I am leading one of the projects with ten plus developers. We have successfully completed the projects for the particular client. With this project, which is based on data warehousing migrations and applying security and governance to the warehousing on cloud platforms. Apart from that, I'm currently learning real-time streaming data and how to handle those data in databases.
Okay, so the use of Python is that I'm using Python in the Databricks platform. And I'm using one of the famous Python frameworks, which is PySpark, which is the combination of the Python and Spark technologies. I'm doing the transformation based on that. And from those transformations, I'm collecting the data from the Snowflake data warehousing. Suppose the client has different source systems like Snowflake and SAP, apart from that, we're collecting those data into a consolidated storage system, which we call the delta, where we use delta lake and do the entire middle architecture workflow, which is called the bronze layer. And apart from that, we can do some kind of transformations and then some kind of cleaning process as we store into that server layer. After that, we're implementing all those aggregations, functions, and all kinds of stuff. There's the final layer which we conclude, which we transfer to the particular layer. And then the particular Power Query concludes it and takes the data from the folder and does the process. So, in the Python usage, we use Python to manage all those three layers of architecture. And apart from that, we're pushing that data into the tagging system, which could be our Snowflake data warehousing and maintaining the data out there.
Okay. So for the error log mechanism and the retrying mechanism, what we can do is build out user-defined functional usage, which we call UDFs. By defining user-defining functions, we could maintain error logging based on audit log tables. We are, for example, defining 10 functions out there. And now we have to collect if some of the functions could fail and some of the functions could run successfully. Suppose I build up the audit log tables and I have collected all the metadata information about the functions out there. Now, as my program runs, I will collect if the sum of the functions may be the 5 functions. And the sum of 3 functions run successfully, and the 4th function makes an error. So what I could do is audit that error function to my audit log table, and I will just retry that function again. If I have built up if-else conditions out there. Suppose if the function if the program has done successfully, then it will go to the next step. If the program has failed, then it will retry. And I will do the retry mechanism based on that 4 times or 5 times. Suppose if retries have failed in the fifth time, then it will note down the particular error into the tables and the particular process into the next program. And it will get a notification to the particular user using emails and other instances so that the program has failed and kindly look into the particular program. That's why we can implement this mechanism into the particular program for making error logging and the retry mechanism successful.
So the issue of dot data skew or some kind of skew in further steps can be handled by making the partitioning and indexing particularly tailored to the scenarios. For doing the partitioning, it can be done in further ways. The most common way to do the partitioning is based on date-based columns. Suppose the data is coming from the source side, and it's been coming on a daily basis. From the data that's been loading, like for the particular processes, suppose 15 GBs of data is coming out there. In a single word, it couldn't be possible to load it directly into the system. It can create a problem in our system. To resolve that problem, we can try partitioning based on the date system. Suppose that data is coming, like 15 GBs, on a particular late day basis, and the data is coming in at 1 GB per day. We can load that data into the system on a day-by-day basis. If the day basis has also collected a large amount of data, we can further divide it into a timely basis or into 4-hour and 6-hour time frames using the timestamp column or date column. And using that, we can handle that kind of data skew into the data warehousing system like Snowflake. And for particularly fast execution processes, we can use indexing processes to make the execution faster.
So for those things, what we can do is build out programs generating dynamic SQL queries. Suppose I have 10 queries that need to be processed, but the parameters are the same and repeating. I can store the parameter values in an audit table or metadata table, and then build dynamic queries based on those values. I will make the dynamic queries where the data can be transferred from the tables where I have stored the parameter values in the metadata tables. The parameter values get from there and I will run the dynamic queries, which will make my performance faster. For further processes, I can make parallel processing using a thread pool. I will send the dynamic queries to the thread pool, and I will use the ETL process. In the ETL process, suppose the five parameters are getting into there, and the third parameter values are coming from a list. I will send that value directly from the list, and it will do the parallel processing using the thread pool executions. Once the process of the thread pool is complete, I will start my thread pool executions, then run my program, and then close my thread pool executions.
So the partitioning features could be applied in the basis of support. I have talked before using the date columns. So there's a categorized column that we can do partitioning. Suppose data is coming, there is a date column, created date or the updated date. So I can do the other partitioning features based on the date column. And suppose there's some tables with no date columns, then I can create one header table with the date column. I will put down the joining function between the header table and the data table. Then I will put the joining conditions, and I will load the particular table based on the dead joining condition and the header table and the data table. If those tables are also not present, I can use partitioning based on a categorized column, which could be region wise, city wise, and any other kind of column. But there is one drawback. There's a unique column, like IDs column and some other kind of column that cannot be included in the partitioning column because it contains unique values. The sum of containing unique values cannot be performed to the partition columns.
This network of the hypothetical data pipeline processes job or other solutes and solute principles. The principles are to invalidate it and this is how the effect functions. Yeah. It can be violated. Like, log error functions couldn't be violated because, like, so they here, we are, like, firstly, we are reading the data, then we are processing the data, then we are writing the data, and then the log error will be generated. If an error is not generated, then the log error cannot be recorded. But for better performance, what we can do is put the if-else condition, like try and except conditions. So what we can do is just maintain these blocks, read data, process data, and write data, under the try condition. If the try condition fails, then the error log will be generated. Otherwise, the error log will not be generated. This is the effect.
In the scale code, there is a performance issue. Let me explain. What might be causing the slowdown and how it could be addressed? Okay. So the further slowdown, particularly, is caused by the group by and order by statements, and everything will also be calculated. Like, suppose there are 50,000 records, and this will do the group by name. First, it will do the group by, then afterwards, it will go to the order by of the particular average. The average is calculated twice here. Firstly, the average is calculated based on the selected statement. Secondly, the average is calculated in the order by statement. So for better performance, we can do the city performance here. City implementation here. What the CT will do is create the particular objects of select areas and the name wise, and it will do the group by. Then once the objects have been created, afterwards, we will do the order by statement and then into the city. So it will be a better performing city per city implementation. It will get better performance out.
In what way is the latest Python meta-program within the field to adapt dynamic schema images? Okay. For dynamic schema changes, what we can do is compare the source table and the target table. So, suppose I have built up the source table and the target table, and we will introduce an intermediary table, which we could call a temporary table. Suppose I have loaded data from the source systems and put the data into a staging table. Then, after that, we will compare our staging table with the target table. If some schema changes have been done, then those changes will be performed onto the staging table, and the implementation could be done. And if a column has been added particularly to the target table, or a column has been deleted from the source systems, such as name, city, and gender, and name, city, and gender. And in the next time, if the source system has deleted the city column, then what I will do is compare my staging table with the target table. And if some column exchanges and those changes could be implemented into the target system. And every time the staging table will be recreated from the source systems, it will be in override mode, so it will implement the same system as the source table. And some changes have been done, then it will be implemented into the target table.
So I have never worked on machine learning algorithms and all kinds of stuff. So I'm not expected to get the best answer out there. What are the machine learning algorithm patterns for ETL for data enrichment? But we have incorporated some machine learning, not the machine learning algorithm, but the automation algorithm. What are the patterns we have implemented? We have implemented some automation into our data warehousing migration project. We have implemented automations, limiting the schedule trigger, and other stuff, which do the better processing and a smoother process for the ETL pipelines. We have implemented automation techniques.
What settings would you employ into the Python dynamic risk retail workers in response to the increasing data volumes? Okay, so for that, we can configure particularly, suppose we are using the Databricks platform for the ETL processes. We can just configure the ETL cluster configurations. In normal scenarios or systems, we can also implement it. We can dynamically configure particularly clusters, with the number of driver nodes, the number of worker nodes, and the run times, processors, and memory out there. We can maintain our metadata table for the particular scenario. Suppose there are five scenarios: small, large, extra large, and ultra large. We can categorize those cluster configurations into those categories and implement them based on the data volumes. And the other things we would do to implement auto scaling. Suppose we have provided a range of two to ten worker nodes. It will auto scale by itself according to the data volumes, making it suitable for those kinds of systems.