Data Engineer with nearly 4 years of experience designing and implementing scalable data platforms for both on-prem and cloud solutions. Hands-on experience designing and developing complex data pipelines using Databricks, Azure Data Factory, and Azure Synapse to meet business and data requirements. Good experience working on cloud platforms such as AWS and Azure with technologies like Databricks, Azure Synapse, Azure Function App, and Snowflake. Experienced in handling project issues, proactive client communication, project management, leadership, and client relations.
Data Engineer, Mastek
Data Engineer, New Balance
Data Engineer, Avasoft

Databricks
Azure Synapse Analytics
Azure Data Factory
Snowflake
AWS S3
GitLab
MS SQL
Azure DevOps
I want to recognize the efforts that Thanveer has put in for
SSDB since he started with Goddard in June/July of this year.
He has been very diligent and thorough in documenting all data sources, integrations and touch points since he has joined.
He has been very instrumental in implementing the SSDB objects on time, giving us the necessary traction when needed, and testing the different flows across all scenarios to make sure they meet our performance criteria.
He goes above and beyond to make sure our environments and Git are in sync, and brings any discrepancies to the table for discussion.
He does the data analysis, creating reports for us to understand the anomalies as they come up.
Last but not least, he is ready to stretch when needed, putting in long hours whether he is in the office or at home.
Thank you Thanveer, your efforts are truly appreciated and very commendable!
Hi, I'm Mohammed Tanvir. I have been working at Avasoft for around the past three years as a data engineer. I'm primarily involved in designing data models and building complex data pipelines for various business requirements. The tools I've worked with include Python, PySpark, and Spark SQL on Databricks, and I have worked with different cloud providers like AWS and Azure. I also have working experience with RDBMS systems such as MS SQL Server, Postgres, and Oracle, and with cloud data warehouses like Snowflake, the dedicated SQL pool, and the Databricks Lakehouse platform. These are the tools I've used in my previous projects, and I also have strong client communication skills.

The most recent project I worked on was for a logistics client, where the requirement was to extract data from two different sources, Samsara and Trackvia. I developed the extraction framework for both sources. We implemented the medallion architecture, meaning the bronze, silver, and gold zones, with incremental processing, and we used a metadata-driven approach to process the data. The cloud provider was AWS, we used Databricks as the ETL tool, and we utilized the Lakehouse platform within Databricks as the cloud data warehouse. We used a dedicated SQL pool to pull data from the Lakehouse platform, and Power BI fetched data from there to populate the reports.

In the bronze zone we simply extract and stage the data. In the silver zone we perform deduplication and standard processing. In the gold zone we build the dimension and fact tables based on the business requirements. I worked on building various complex fact tables; one was to compare vehicle information between the two applications, Trackvia and Samsara, which was one of the more complex scenarios I've handled. I also worked closely with the reporting team to build the necessary metrics so they could avoid performing complex transformations in Power BI, which would cost a lot of time and compute power. That's an overview of my experience. Thank you.
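Below is a minimal PySpark sketch of the bronze-to-silver deduplication step described above. The Delta paths, the vehicle_id business key, and the ingested_at column are hypothetical placeholders; the actual framework was metadata-driven and incremental, which this sketch does not attempt to reproduce.

```python
# Minimal sketch of a bronze -> silver step on Databricks (PySpark + Delta).
# Paths, column names, and the "vehicle_id" business key are hypothetical.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("bronze_to_silver").getOrCreate()

# Bronze zone: raw data staged as-is from the source extraction framework.
bronze_df = spark.read.format("delta").load("/mnt/bronze/samsara/vehicles")

# Silver zone: keep only the latest record per business key (deduplication).
w = Window.partitionBy("vehicle_id").orderBy(F.col("ingested_at").desc())
silver_df = (
    bronze_df
    .withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .drop("rn")
)

(silver_df.write
    .format("delta")
    .mode("overwrite")          # an incremental MERGE would be used in practice
    .save("/mnt/silver/samsara/vehicles"))
```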
For a join between the orders and product tables, we need to implement indexing on the column used to join the two tables. A clustered index will already exist on the orders and product tables if they have primary keys, but to improve query performance we need to add a non-clustered index on the columns used to join the orders and product tables. Implementing the non-clustered index will improve the performance of the join and also optimize queries that run frequently. So yes, implementing a non-clustered index would increase the performance.
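A small, self-contained illustration of the idea in this answer, using sqlite3 so it runs anywhere; the table layout is made up, and in SQL Server the index would be created as a non-clustered index instead.

```python
# Demo: index the join column on orders so lookups by product avoid a full scan.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders  (order_id INTEGER PRIMARY KEY, product_id INTEGER, amount REAL);

    -- Index the column used to join orders back to product.
    CREATE INDEX ix_orders_product_id ON orders(product_id);
""")

# The plan should show "SEARCH ... USING INDEX ix_orders_product_id" for the join.
plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT o.order_id, o.amount
    FROM product p
    JOIN orders o ON o.product_id = p.product_id
    WHERE p.product_id = 42
""").fetchall()
for row in plan:
    print(row)
```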
We can use table hints like NOLOCK at the end of the table reference, so that when there is a transaction currently ongoing we can still read the data from that particular table; the NOLOCK option is very helpful there. We also have other locks, such as shared locks, that we can apply on top of the tables, so that when a transaction is ongoing on a particular table and it is already locked, another transaction either waits for that transaction to complete or, depending on the lock, can still write to that table. Based on these locks we can manage the isolation level within any database, such as MySQL. Concurrent transactions can be managed using the transaction control language, that is START TRANSACTION, COMMIT, and ROLLBACK in case there is an error or a concurrency issue when updating the same data. So the transaction control language plays a major role there.
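A runnable sketch of the transaction-control part of the answer, using sqlite3 and placeholder tables; the NOLOCK hint and SQL Server specific isolation levels are not shown here.

```python
# Wrap related updates in an explicit transaction and roll back on error.
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # manage transactions manually
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL NOT NULL)")
conn.execute("INSERT INTO account VALUES (1, 100.0), (2, 50.0)")

try:
    conn.execute("BEGIN")                                     # START TRANSACTION
    conn.execute("UPDATE account SET balance = balance - 30 WHERE id = 1")
    conn.execute("UPDATE account SET balance = balance + 30 WHERE id = 2")
    conn.execute("COMMIT")                                    # both updates become visible together
except sqlite3.Error:
    conn.execute("ROLLBACK")                                  # undo partial work on any failure
    raise
```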
To improve performance, we need to use indexing on the columns used to join the different tables and on the columns used for filtering. For all the columns involved in the filters and the joins, we need to implement indexing so that the data is rearranged within the underlying structure; when we try to retrieve the data, it can be located quickly, even when the query involves several tables. So the first thing is to check the join conditions, and then implement indexing on those columns to retrieve the data quickly.
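A quick, hedged demonstration of the effect an index on a filter column can have, again using sqlite3 with made-up table and column names; the same principle applies to the join columns discussed above.

```python
# Rough timing comparison of a filter lookup before and after adding an index.
import sqlite3, time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, city TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events (city, amount) VALUES (?, ?)",
    [(f"city_{i % 500}", i * 1.0) for i in range(200_000)],
)

def timed_lookup() -> float:
    start = time.perf_counter()
    conn.execute("SELECT COUNT(*) FROM events WHERE city = 'city_42'").fetchone()
    return time.perf_counter() - start

before = timed_lookup()                       # full table scan
conn.execute("CREATE INDEX ix_events_city ON events(city)")
after = timed_lookup()                        # index seek on the filter column
print(f"without index: {before:.4f}s, with index: {after:.4f}s")
```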
For a small e-commerce application, to ensure data consistency and harmonization, at the first level we need to understand the business requirement and frame the conceptual data model, then move to the logical data model, and then to the physical data model. In the conceptual data model we identify the moving parts, the entities within the e-commerce business, and make sure the relationships are defined in a normalized manner; to avoid redundancy we apply normalization, so the first check is whether the model is normalized. So the first step, the conceptual model, is to identify the entities and define the relationships.

In the logical model we introduce the columns, the possible key columns within each entity. Taking the e-commerce example, and building on the previous question, consider the product and orders tables as two different tables in the database. The product table will be there, I will have the orders table, and I will define the relationship, whether it is one-to-many or one-to-one. If it is many-to-many, I will introduce a new bridge table to avoid data redundancy and keep the data consistent across the tables. That is the second step, the logical data model with the appropriate relationships, one-to-one or one-to-many.

At the end we form the physical data model, the database schema for the small-scale application. Depending on the module, we can introduce separate schemas. For example, keeping products alone in a different schema is possible: we can create a product schema and keep all the product-related entities within that particular schema.
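A possible physical-model sketch for the product/orders example above, including a bridge table for the many-to-many case. Table and column names are illustrative; sqlite3 is used so the DDL is runnable, and the separate product schema mentioned above would apply in engines such as SQL Server or Postgres rather than SQLite.

```python
# Physical model: product, orders, and a bridge table for the many-to-many link.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (
        product_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        price      REAL NOT NULL
    );

    CREATE TABLE orders (
        order_id   INTEGER PRIMARY KEY,
        ordered_at TEXT NOT NULL
    );

    -- Bridge table: one order can hold many products and one product can appear
    -- in many orders, so the many-to-many relationship gets its own table.
    CREATE TABLE order_item (
        order_id   INTEGER NOT NULL REFERENCES orders(order_id),
        product_id INTEGER NOT NULL REFERENCES product(product_id),
        quantity   INTEGER NOT NULL,
        PRIMARY KEY (order_id, product_id)
    );
""")
```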
To automate the execution of recurring complex SQL operations, considering both efficiency and error handling: instead of simply retrying on error, we need to build a pipeline that can self-heal and process the data. At the first level we obviously need to implement an error-handling mechanism, a try block around the SQL statement. If there is any error within that code block, control passes to the except block, where we can log that particular error into a table. From there we can build a pipeline that sends an email to the respective stakeholders saying that this particular SQL command failed, so that the stakeholder or developer can look into the SQL command, update the query, and fix the bug, instead of rerunning the failing SQL query again and again, which is not a best practice and is not efficient for any database system. So introducing the try/catch mechanism along with email notification would help us automate the execution of the complex SQL operations.
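A minimal Python sketch of the try/except, log-and-notify pattern described above. The log table, SQL statement, and SMTP settings are all placeholders, not the actual pipeline.

```python
# On failure: log the error to a table and email the owners instead of blind retries.
import sqlite3, smtplib, traceback
from email.message import EmailMessage

conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS etl_error_log (run_at TEXT, step TEXT, error TEXT)")
conn.commit()

def notify(step: str, error: str) -> None:
    # Hypothetical SMTP host and addresses; real alerting might use a pipeline or webhook.
    msg = EmailMessage()
    msg["Subject"] = f"SQL step failed: {step}"
    msg["From"] = "etl@example.com"
    msg["To"] = "data-team@example.com"
    msg.set_content(error)
    with smtplib.SMTP("smtp.example.com") as smtp:
        smtp.send_message(msg)

def run_step(step: str, statement: str) -> None:
    try:
        conn.execute(statement)
        conn.commit()
    except sqlite3.Error:
        err = traceback.format_exc()
        conn.execute("INSERT INTO etl_error_log VALUES (datetime('now'), ?, ?)", (step, err))
        conn.commit()
        notify(step, err)     # alert the owner rather than retrying in a failed state
        raise

# Example (placeholder statement):
# run_step("load_orders", "INSERT INTO orders_stage SELECT * FROM orders_raw")
```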
Since I haven't worked with Node.js, I was not aware of this particular code, but from the module.exports and router.post it looks like it is just posting a particular record from the request. I haven't worked with the Node.js part; if it were Python, yes, I could have explained this, but since it is not in my scope, thank you.
Looking at this SQL query for returning records, the issue I can spot is that the column list is not specified. The transactions table could have more than a hundred columns, and even though the limit is 10 here, since the columns are not specified, that could be a problem: anyone trying to execute this query on a large dataset should definitely list the names of the columns they are interested in. I can also see that the filter is amount greater than 1,000 and the order is by date descending, meaning the latest data first. If they know the requirement exactly, they could add the date to the filter clause as well, for example the past month, the past three months, or the past year, instead of relying only on the ORDER BY. So let's assume the business requirement is: give me the top 10 transactions with an amount greater than 1,000, ordered from the latest record to the oldest. In that case they can mention the date range in the WHERE clause, and specifying the columns is mandatory when it comes to a large dataset.
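Based only on the details mentioned in this answer (SELECT * on a transactions table, amount > 1000, ORDER BY date DESC, LIMIT 10), here is one way the tightened query could look, with explicit columns and a date filter; the table definition and column names are made up for the demo.

```python
# Revised query sketch: explicit column list plus a date-range filter.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE transactions (
        transaction_id INTEGER PRIMARY KEY,
        customer_id    INTEGER,
        amount         REAL,
        txn_date       TEXT          -- assumed ISO format (YYYY-MM-DD)
    )
""")

rows = conn.execute("""
    SELECT transaction_id, customer_id, amount, txn_date   -- explicit columns, no SELECT *
    FROM transactions
    WHERE amount > 1000
      AND txn_date >= date('now', '-3 months')             -- narrow the window up front
    ORDER BY txn_date DESC
    LIMIT 10
""").fetchall()
print(rows)
```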
Partitioning the data within the database is key. Let's take an example of how the data is stored underneath any particular server. Say I have a table stored as a file of 100 records. Without partitioning, all 100 records sit in one file, so when I query that file for a specific record, the database scans all 100 records and fetches just the one. With partitioning, when I partition the records based on a certain column or key, the files are split up. Say those 100 records have a column called location containing city names such as Chennai, Bangalore, Mumbai, and Hyderabad. If I partition the table on the city column, separate files are generated per city, so the 100 records are split into multiple chunks and loaded into different files. Now when I filter for a specific record, the database checks only the specific file that holds those records. If I query for the records with city Chennai, it retrieves the data from that one file instead of scanning every file. This has a big impact on performance with larger datasets, which is why it is a best practice to implement partitioning and indexing on large-volume tables: whenever someone queries the table, it can retrieve the data immediately and return it to the user. So partitioning is storing the data in separate chunks of files, and while querying, the database checks only the relevant file instead of checking all the files.
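A PySpark sketch of the city-partitioning example above: writing with partitionBy creates one directory per city, so a filter on city touches only that directory. The output path is a placeholder.

```python
# Partition 100 sample records by city and rely on partition pruning when filtering.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition_demo").getOrCreate()

df = spark.createDataFrame(
    [(i, ["Chennai", "Bangalore", "Mumbai", "Hyderabad"][i % 4], i * 10.0)
     for i in range(100)],
    ["record_id", "city", "amount"],
)

# Hypothetical output path; each city becomes its own partition directory.
df.write.mode("overwrite").partitionBy("city").parquet("/tmp/partition_demo/records")

# Partition pruning: only the city=Chennai directory is scanned for this filter.
chennai = spark.read.parquet("/tmp/partition_demo/records").filter("city = 'Chennai'")
print(chennai.count())
```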
I haven't worked on the Node.js side, so I can't really compare the two. I have only worked with Python; I wrote Python code for the Azure Function App, and most of the projects I have worked on involve Python, so I'm pretty strong in Python. I haven't worked with Node.js at all, so I couldn't answer this question. Maybe I could look into Node.js and spend some time learning it.
On whether I utilized AWS services in a previous project: I haven't utilized any AWS services for a client-level project, but for an internal project I used Azure Functions to extract data from one particular database and load it into a file system. So yes, I used Azure services, meaning Azure Functions, for the back-end data processing, the extract, load, and transform (ETL) process, in that internal project. It was not a client project, but yes, I have worked on that.