
Lead Data Engineer
USTPlatform Engineer
USTSnowflake

SQL Server

Oracle

AWS
Azure

Python

SQL

Unix
.png)
Informatica

Airflow

Tableau

Streamlit

SVN
Hello. My name is. I'm currently working as a senior data engineer at UST Global. So I have 10 years of experience in building and developing ETL solutions for enterprises. I work with clients like Schroders, Equifax, and Logitech to help them migrate their on-prem databases to Snowflake.
so to tune the performance of a query in Snowflake, basically, we could use general best practices. Most people make a common mistake in Snowflake by running a small query in an extra-large warehouse. Imagine an extra small warehouse can run five to eight files in parallel. And if you have data that's not that big and you're using an extra-large warehouse, basically, you're saying there's no point in running it in an extra-large warehouse, right? So, the techniques I would use are to make sure your warehouse is properly sized based on your workloads and ensure it meets your deadlines and loads are complete on time. Also, for query optimization, we could use Snowflake's internal tables, such as Snowflake secure views, to understand how much time the query is taking. Using those tables, we could identify the blockers that are taking much time. We could either auto-scale if the query is complex or add more servers and if there are a lot of critical queries that are blocked due to bandwidth issues. For the query optimization part as well, we just have to look at Snowflake's internal tables, such as Snowflake query history, copy history, and all these Snowflake account usage tables, which would give an idea of how your query is performing. That will help you a lot to identify the issues in your query and it will help us improve the query performance.
Yeah, so in my project, which is for asset management, I have a huge trip tool called DBT, a table tool. We could write singular tests or functional tests using SQLs to identify the data quality of the source and make sure the data refresh times are within the regular timeline. So we could write a lot of stress on incoming data and make sure everything is working as expected using the Databill tool. The good thing is the Databill tool is an ETL tool. It doesn't use external warehouses or servers. It uses Snowflake warehouse to run your data quality tests, which are written in SQL, and ensure all your tests pass. And it's there for your data. Thank you.
So better efficiency, yeah, there is a specific use case wherein we are migrating SQL server data into Snowflake. So what happened there is that we had 15 billion records that needed to be loaded from on-premise to Snowflake. Fifteen billion records was taking a long time because the size was too huge, and if you generated a file, it's 15 billion records and if you move that to Snowflake, it was taking a lot of time. But earlier it was in JSON format, and we observed that the file was also randomly increasing rapidly. The first thing we did was try to chunk the files into smaller bits so that whenever the file is generated, it can be immediately moved to Snowflake, and there is parallel processing happening there. In the Azure Data Factory side, the kind of optimization we did is that we tried not to put any conversions in the ADF. Instead, we try to generate a proper SQL which is in line with Snowflake's table, and ADF is just dumping the data into a CSV file and does not do any kind of validation on the raw data. That was helping to reduce the time better. And also, we have tried a similar exercise with the Parquet format, which is user-friendly with Snowflake. That was also helping us generate files faster and load faster in Snowflake.
So there is no definitive guide of error handling in NoSQL databases. People are using them. People are using tools to execute their SQL such as DBT or SQL mesh or some other tools. But in general, what I feel is that if it's a simple query, you can directly write it in this NoSQL line. That should be okay. But if it's a complex query, try to put that inside a procedure rather than turning to JavaScript or Python and writing a proper try and catch method and catching those errors and logging that into a table. So in future, if any error happens in the SQL, you'll have a table where you can identify the error in the SQL, the procedure, which will improve the error handling mechanism in your projects. That is one best practice I would follow. The second one is if your queries, try to put a lot of data quality tests before loading into the integration layer. Apply a lot of tests from dbt. Make sure everything is properly loaded in the raw layer of your ETL pipeline, and do not run smaller queries in larger warehouse. And command error handling and stuff like that. What else I could think of? Yeah. I think mostly, you have to try and put everything in a try and catch method. And if you can configure a telemetry for your Snowflake table, that will gather all your procedures' logs and put into a singular or central table. It can be referred very easily instead of separate tables for each procedure. That could be another approach for handling errors in Snowflake.
What I want done is, we have set up a centralized log mechanism in ADF. So all the logs are passed to ADF. All the logs of ADF are passed to Elasticsearch. And it is so when something fails, there is an alert returned in Elasticsearch to raise a ticket to ServiceNow via Elasticsearch. So, basically, this is like a one-time setup for your entire project, wherein all your logs have to be gathered from all the data factory pipelines and pushed into a central log system, and you can write a simple select query on these logs from Elasticsearch and see if there is any failure for the current day and raise an alert if it finds any failure or error in the log files. So that is the mechanism we have followed in our project.
could be wrong with the course okay and in case I can't start from customer from orders business logic to perform again supposed to be here the moment you get a customer order customers business logic to perform aggregation it's supposed to be here by the way this question looks ambiguous I couldn't understand what it is but I will give a try I want to start from total customers orders business logic to perform aggregation is supposed to be here business logic to perform aggregation is supposed to be here customers talking orders okay no yeah so let's select counter star it's there is no counter star into counter star customers from customer that's there is no such thing as counter star we could directly assign a variable where customer total customer count equals to we can open SQL execute and put this select statement inside that and we can execute that and capture that result set into a variable and pass that variable to the return statement so there is no into over here into is for SQL server migrations but into doesn't work in snowflake so that is a looks like format I mean into keyword is not present in snowflake it's actually a SQL server's keyword
so middle list is view let's find select * from ref raw data where state is equal to 'active' and a difference created at a date less than 30 so he is trying to read the raw data which status equals to active and which is for the last month or the last 30 days and create a view with this raw data so it's basically pulling all the raw data for last 30 days which has status active and create a view on top in the snowflake and what and explain where this code may fail I'm not sure how raw data looks like in here but select * from raw data maybe instead of referring I think raw data should be referred from the source since he's using a star in general if in future when you column comes in this might give a trouble and it might fail at that point yeah so I would suggest either refer from the source where select * from open parenthesis source of raw data define it as a source and control your input data from there and if you do this and even that only thing you have to change is in the source file not in the in the source SQL not in the downstream pipelines like this one yeah but I think in future because of the star it might fail but in general it should work but yeah
So generally, Soulflake is a centralized resource which can be used by multiple teams. If you do not have proper boundaries between the databases or data across teams, there is a possible chance of security issues wherein unnecessary data can be accessed by other people. So the general solution would be to implement an RPAC, which is role-based access control, like a virtual layer around the team. And we can create a database and apply proper roles to it. Only the owner or people who are part of these roles could access this database. No one else can access it outside of it. So, basically, you could clearly define the RBAC policies. And we should also apply row access policies for the Snowflake secure share sheds, which is a common table for the entire Snowflake account. But if a person is querying that table, they might see unnecessary data. So we have to apply role access policies as well. And, yeah, the first one is RAPAC and the second one is role access policies. And if the data has sensitive information, like PII data, we could use tags, which is a feature in Snowflake. We can map a tag to the table or column, indicating it has sensitive information. And when exposed to tools like Collibra or other tools, it will have some alerts on top of it, which will not directly expose this sensitive data to other teams. And what else we can do about security? Yep. I think our backups, Nuvola's policies, and also centralized log tagging policies should help with proper security and boundaries and privacy for data within Azure Data Factory and Snowflake Webex.
In my project, most of the data comes from outside the system, which is outside our company's network. So, our team has developed an API. Not just is it a producer-consumer application, wherein they are responsible for calling the APIs, which are outside Shorla's network, and put that data into a subtype server, where from there, our job will begin. But in general, I have less experience in calling external APIs, but I have it. For example, in one of my hackathons, I called publicly available APIs and used keys, security keys, to authenticate via them, put those keys in Azure key vault, and authenticated the API calls via the security keys, which were generated earlier. And how do we handle it? So, in general, every API has a limitation. The ADF pipeline should be designed in such a way that you can put in a for loop, and set a limit on the for loop, saying after three failures, or three consecutive failures in the API, we could fail the ADF pipeline instead of continuously running and allow this error into a centralized logging mechanism, which can be tracked later. So, basically, also, in these days, Snowflake is, by default, accepting external API calls from Snowflake itself. We don't need an external tool for doing this. So, we could use Snowflake's capability to call APIs within Snowflake, and data can be collected.
In general, this is a general problem when a common repository is accessed by multiple team members. For example, there is one Git repository, and 10 people are working on the same Git repository. And people are merging features into the master. And before you merge your changes to the master, someone might have done the changes, that someone might have changed the same file which you are working on. In this kind of scenario, what I usually do is, I take a copy, clone the repository into my local, and I constantly check if there are any updates from the master before checking the code into the repository and seeing if any updates are happening on the same file. If there is one, I would immediately pull it into my local branch, and I would have the latest file. This is not a solution, but it will resolve the managed conflicts we get from the Git repositories. Or we can work on the code separately and test it properly. Once everything works, I usually take the latest copy and try to merge my changes into it. That is another way, but there is no single answer for this question. However, we could handle it in multiple ways. I would use a Git repository and constantly check if there are updates in the master and pull them if they don't replace my local code.