
Lead Data Scientist / Gen AI Engineer - Quotient
Data Scientist - Wichita State University
Data Scientist / AI Developer - GDIT
Jr Data Scientist - Xcelvations
Jr Data Scientist - Data Factz
Data Scientist / AI Engineer - Merck
Python
Microsoft Azure SQL Database
Azure Machine Learning Studio
BigQuery
AWS (Amazon Web Services)
Microsoft Power BI
Tableau Prep
Tableau CRM
MongoDB
Amazon DocumentDB
Azure Data Lake Storage Gen2 (ADLS)
Azure Data Factory
Databricks
Amazon SageMaker
Snowflake
So, about myself: I'm Rajan Patel, and I have more than 8.5 years of experience in data science, machine learning, artificial intelligence, deep learning, and related areas such as natural language processing and generative AI. I have experience with Python and other programming languages, and beyond that I work with various database technologies: SQL databases such as Oracle and MySQL, and NoSQL databases such as MongoDB. I also have experience with business intelligence tools such as Tableau, Power BI, and Alteryx, and with Python libraries such as Matplotlib and Seaborn. Coming to cloud technologies, I have strong experience in Azure, AWS, and GCP. On Azure I have worked with services such as Azure SQL Database, Azure Data Factory, Azure Logic Apps, Azure Data Lake Storage, Blob Storage, and Azure Machine Learning Studio. On AWS I have experience with SageMaker, EC2, S3, Step Functions, Lambda, and so on. On GCP I have used BigQuery and the Google AI Platform. I have also worked across several domains: finance, marketing, advertising, e-commerce, healthcare, and manufacturing. So that's a little about me.
So how do you handle schema changes in Snowflake with an always-on ETL pipeline?

That's a great question. Handling schema changes in a Snowflake environment while maintaining an always-on ETL (extract, transform, load) pipeline involves several strategies to ensure data integrity, consistency, and minimal downtime. Snowflake supports schema evolution, allowing you to add new columns to tables without impacting existing queries or the ETL process, which accommodates changes in the data sources. Use Snowflake Streams to capture insert, update, and delete operations on a table; this lets you process incremental changes, making the ETL pipeline more efficient, and Snowflake Tasks can be scheduled to process those changes regularly. Before applying changes to the production schema, use Snowflake's zero-copy cloning feature to clone the data and schema, which allows you to test changes without impacting the production environment. Time Travel lets you access historical data within a defined retention period, so if a schema change leads to an issue, you can revert to a previous state of the data. Maintain version control of your ETL scripts and data models so you can roll back to a previous version in case of any issue with the new schema, and use automation tools to apply schema changes and monitor their impact; a continuous integration / continuous deployment pipeline can be beneficial here.

Instead of large, infrequent updates, process the data in smaller, more frequent batches; this reduces the risk and impact of schema changes. After a schema change, validate the data to ensure the ETL processes are functioning correctly and data integrity is maintained. Keep all stakeholders informed about schema changes and maintain comprehensive documentation, which helps with understanding the impact and troubleshooting issues. More broadly, handle changes carefully and design the ETL process to be flexible and adaptable to schema changes; this might involve dynamic SQL generation and ETL tools that can handle schema drift.
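The schema-drift handling described above can be sketched in a few lines. This is a minimal illustration, not production code: the table, columns, and naive type inference are all invented for the example, and a real pipeline would use a proper type mapping (or VARIANT columns) before issuing DDL.

```python
# Hypothetical sketch: detect new columns arriving in a source feed and
# emit Snowflake-style ALTER TABLE statements so the ETL keeps running.
# Table/column names here are illustrative, not from a real system.

def schema_drift_statements(table, known_columns, incoming_record):
    """Compare an incoming record against the known schema and return
    ALTER TABLE ... ADD COLUMN statements for any new fields."""
    statements = []
    for column, value in incoming_record.items():
        if column not in known_columns:
            # Naive type inference for the sketch only.
            col_type = "NUMBER" if isinstance(value, (int, float)) else "VARCHAR"
            statements.append(f"ALTER TABLE {table} ADD COLUMN {column} {col_type}")
    return statements

stmts = schema_drift_statements(
    "orders",
    {"order_id", "amount"},
    {"order_id": 1, "amount": 9.99, "coupon_code": "SAVE10"},
)
```

In practice these statements would be reviewed or applied against a zero-copy clone first, as discussed above.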
Given Python code implementing a simplified version of a recommendation system, what is the potential flaw in the logic when generating personalized recommendations? Explain why using only the top scores to provide recommendations might not be sufficient in a real-world application, and discuss what an enhanced recommendation logic would look like. (The code imports NumPy as np and defines a recommend function.)

Sure. Using only the top scores to provide recommendations might not be sufficient because user preferences are complex and not well captured by a simple vector-space model. This approach also might not scale with a large number of items and users, because it computes scores for all items, which can be computationally expensive. The logic does not account for changes in user preferences over time, which can lead to stale recommendations. The method does not consider similarity between items, which would be useful for recommending items similar to those a user has liked in the past. And if the user profile or item matrix is sparse, the dot product might not accurately represent the user's preferences. To enhance the recommendation logic, I would integrate user feedback to update the recommendation model in real time, allowing the system to learn from user interactions. I could use methods like singular value decomposition (SVD) or alternating least squares (ALS) to handle sparse data better and uncover latent factors, and include item metadata to make content-based recommendations alongside collaborative filtering.

I would also combine collaborative and content-based filtering to use the strengths of both models, and apply more complex machine learning models that can capture nonlinear relationships and interactions between user and item features. Item metadata can be used to provide recommendations for unknown (cold-start) users and items until enough interaction data has been collected, and I would include algorithms that ensure diversity in the recommendations so users are exposed to new discoveries. With these approaches, I would enhance this kind of scenario.
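For reference, the naive approach being critiqued can be sketched as a dot-product scorer over all items followed by a top-k cut. The matrices below are toy data, not the original exercise's code; they only illustrate why this scores every item and ignores freshness, similarity, and sparsity.

```python
import numpy as np

# Minimal sketch of the naive approach under discussion: score every item
# with a dot product against the user profile and take the top-k.

def recommend(user_profile, item_matrix, k=2):
    scores = item_matrix @ user_profile          # one score per item
    return np.argsort(scores)[::-1][:k]          # indices of the k best

user = np.array([1.0, 0.0, 2.0])
items = np.array([
    [0.0, 1.0, 0.0],   # item 0: no overlap with the user's interests
    [1.0, 0.0, 1.0],   # item 1: dot product 3.0
    [0.5, 0.0, 2.0],   # item 2: dot product 4.5
])
top = recommend(user, items)
```

Note the full matrix-vector product over every item on each call, which is exactly the scalability concern raised in the answer.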
Assume you are reviewing the following SQL query, which retrieves user interaction data for the recommendation module. Reading the query, identify and explain the potential issue with how the ORDER BY clause is used here, considering SQL semantics and best practices. (The query is roughly: SELECT user_id, COUNT(click) AS num_clicks FROM interactions WHERE event_date > '2023-01-01' GROUP BY user_id ORDER BY COUNT(click).)

Yes, that's a great question. The potential issue with the ORDER BY clause in this query is that it uses the aggregate function COUNT(click) directly. Per SQL best practices, it is better to order by the alias of the aggregate function, both for clarity and to avoid potential errors and performance issues. Some SQL engines may not allow ordering by aggregate functions directly, and it leaves ambiguity if there is more than one aggregate in the SELECT list. A better approach would be to use the alias in the ORDER BY: SELECT user_id, COUNT(click) AS num_clicks FROM interactions WHERE event_date > '2023-01-01' GROUP BY user_id ORDER BY num_clicks. Using the alias makes the query easier to read and maintain and ensures compatibility with most SQL database systems. It can also help with performance optimization, since the SQL engine does not have to recompute the aggregate function for sorting but can directly reuse the result from the select list.
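The corrected query described above can be demonstrated end to end. Here SQLite stands in for the real database, and the table schema and rows are invented for illustration; the point is ordering by the alias rather than repeating the aggregate expression.

```python
import sqlite3

# In-memory stand-in for the `interactions` table (made-up data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE interactions (user_id INTEGER, click INTEGER, event_date TEXT)"
)
conn.executemany(
    "INSERT INTO interactions VALUES (?, ?, ?)",
    [(1, 1, "2023-02-01"), (1, 1, "2023-03-05"), (2, 1, "2023-02-10")],
)

# Order by the alias instead of repeating COUNT(click) in ORDER BY.
rows = conn.execute(
    """
    SELECT user_id, COUNT(click) AS num_clicks
    FROM interactions
    WHERE event_date > '2023-01-01'
    GROUP BY user_id
    ORDER BY num_clicks DESC
    """
).fetchall()
```

Most engines, SQLite included, resolve the select-list alias in ORDER BY, which keeps the sort expression in one place.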
Without writing any actual code, could you explain whether there are any potential issues with this approach to tuning the model's hyperparameters, and what aspects of the initial model-evaluation setup you would look at? (The function is tune_model(model, parameters, data), with best_score initialized to minus infinity.)

Okay, understood. The potential issues with this particular approach: first, the snippet does not mention a separate validation set, or the use of a train/validation/test split, which is essential to ensure the model does not overfit the training data. It's unclear which cross-validation method is being used, for example k-fold or stratified k-fold, and that choice can significantly impact the reliability of the average score, especially with an imbalanced dataset. It does not specify which scoring metric is being used; different problems require different metrics, such as accuracy, F1 score, AUC, or mean squared error. There is no indication of the range of parameters being tested, and a poor choice can lead to suboptimal tuning. There is no mechanism to control model complexity to ensure generalization. Model tuning can be computationally expensive, and the snippet does not suggest any parallel processing to speed up the search. To address these aspects: without an early-stopping mechanism, the model may spend time evaluating parameter settings that are clearly not optimal, and no mention is made of time or compute budgets, which matter when the parameter space is large or the model is complex.

Also, the snippet does not specify the search strategy, such as grid search, random search, or Bayesian optimization, which affects the efficiency of finding the best parameters.
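To make the gaps concrete, here is a hedged sketch of what the missing pieces might look like: an explicit k-fold split, a named metric (mean squared error), and a stated parameter grid. Ridge regression is used as a stand-in model; none of these names come from the original snippet.

```python
import numpy as np

def kfold_indices(n, k):
    # Explicit k-fold split (the original snippet left the CV method unstated).
    return np.array_split(np.arange(n), k)

def ridge_fit(X, y, lam):
    # Closed-form ridge solution: (X'X + lam*I)^-1 X'y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def tune_ridge(X, y, lambdas, k=3):
    best_lam, best_mse = None, np.inf
    folds = kfold_indices(len(y), k)
    for lam in lambdas:                      # explicit parameter grid
        mses = []
        for i in range(k):                   # explicit k-fold CV
            val = folds[i]
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            w = ridge_fit(X[train], y[train], lam)
            mses.append(np.mean((X[val] @ w - y[val]) ** 2))  # named metric: MSE
        avg = float(np.mean(mses))
        if avg < best_mse:
            best_mse, best_lam = avg, lam
    return best_lam, best_mse

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=60)
lam, mse = tune_ridge(X, y, lambdas=[0.01, 0.1, 1.0, 10.0])
```

A fuller version would also parallelize the grid, add early stopping, and report the metric per fold, per the critique above.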
A machine learning engineer is designing a feature transformer in PyTorch, a module that normalizes input data. Reviewing this piece of code, what could be improved in the implementation? (The class is a Normalize module with parameters self.mean and self.std, and two methods, __init__ and forward.)

Okay. Here is what could be improved in this implementation. First, vectorization: the mean and standard deviation are stored as scalars, which implies the input is treated as a single value. In practice, input data is usually multi-dimensional, like images or a feature set, and normalization is typically done element-wise for each feature; the code should accommodate this. The mean and standard deviation should be vectors of the same length as the number of features in the input. When subtracting self.mean and dividing by self.std, there should be a check on shapes to ensure broadcasting happens correctly and does not produce unintended results. The normalization operation is done in place, which can be problematic in PyTorch when dealing with the computation graph; to ensure the original data is not modified and gradients can be properly computed, it is better to avoid in-place operations. Also, in PyTorch it is good practice to register the mean and standard deviation as buffers if they are not meant to be updated during training, which is done using self.register_buffer.

That way, they are properly moved along with the model, for example when moving it to a GPU. And if the input's data type is not float32, the normalization might not work as expected; to reduce type mismatches, the class should be written to support inputs of various data types. So, that's my approach.
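The improvements above can be sketched as follows. This assumes the original module stored scalar statistics and normalized in place; the rewrite uses per-feature buffers, out-of-place arithmetic, and a float32 cast.

```python
import torch
import torch.nn as nn

class Normalize(nn.Module):
    """Per-feature normalization with registered (non-trainable) buffers."""

    def __init__(self, mean, std):
        super().__init__()
        # Buffers move with the model (.to(device)) but are not trained.
        self.register_buffer("mean", torch.as_tensor(mean, dtype=torch.float32))
        self.register_buffer("std", torch.as_tensor(std, dtype=torch.float32))

    def forward(self, x):
        x = x.to(torch.float32)  # guard against dtype mismatch
        # Out-of-place ops keep the input intact and autograd-friendly;
        # broadcasting applies per-feature stats across the batch dimension.
        return (x - self.mean) / self.std

norm = Normalize(mean=[1.0, 2.0], std=[2.0, 4.0])
out = norm(torch.tensor([[3.0, 6.0], [1.0, 2.0]]))
```

Because mean and std are buffers rather than plain attributes, they are saved in the state dict and follow the module across devices.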
What techniques could you use to optimize complex SQL queries for faster processing in Snowflake?

Right. Optimizing complex SQL queries in Snowflake for faster processing involves several techniques. Use clustering keys, choosing appropriate keys for your tables to co-locate related data and reduce the amount of data scanned during queries. Use an appropriate virtual warehouse size for the workload; larger warehouses can process queries faster, but at a higher cost. Structure queries and tables so that Snowflake can automatically exclude irrelevant micro-partitions from query processing (partition pruning), for example by using partition-friendly WHERE clauses. Use Snowflake's automatic result caching to avoid re-executing the same query, and create materialized views to pre-aggregate and store complex calculations that can be reused across multiple queries. Write queries to minimize inter-node communication, since data movement can be a bottleneck. Use the query history and the query profile to understand performance and identify bottlenecks, and write efficient queries: select only the required columns and avoid SELECT *, use WHERE clauses to limit the data scanned, and structure joins to reduce the amount of data being joined, preferring equi-joins whenever possible. While Snowflake does not use traditional indexing, consider the search optimization service for frequently searched large tables to speed up selective filters on queries. And use the correct, smallest data types to reduce the amount of data processed.
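The "narrow projection plus selective filter" pattern mentioned above can be shown on any SQL engine; here SQLite stands in for Snowflake, and the table and data are invented for the example. The same rewrite is what lets Snowflake prune partitions and scan fewer columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (sale_id INTEGER, region TEXT, amount REAL, notes TEXT)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [(1, "east", 100.0, "x"), (2, "west", 50.0, "y"), (3, "east", 75.0, "z")],
)

# Instead of `SELECT * FROM sales`: project only the needed columns and
# filter early, so less data is scanned and moved.
rows = conn.execute(
    "SELECT sale_id, amount FROM sales WHERE region = 'east'"
).fetchall()
```

In Snowflake specifically, a filter on a clustering-key column (here, hypothetically, `region`) is what makes the pruning effective.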
What are the strategies for handling imbalanced classes in large NLP datasets?

That's a great question. For handling imbalanced classes in large NLP datasets: first, oversampling the minority class, increasing the number of instances in the minority class by duplicating existing instances or generating new synthetic instances using techniques like SMOTE. Second, undersampling the majority class, reducing the number of instances in the majority class; this can lead to a loss of information, so it should be done carefully. You can also combine over- and undersampling to create a more balanced dataset, and use augmentation techniques like back-translation, synonym replacement, and random insertion or deletion of words to generate new samples for the minority class. You can assign a higher misclassification cost to the minority class and use algorithms that inherently account for different class weights, or treat the minority class as anomalies and use anomaly detection algorithms. Then I would use bagging techniques, where each model in the ensemble may focus on different aspects of the data, and apply boosting algorithms that focus on examples that are harder to classify, which often belong to the minority class. Different models can be combined, leveraging those performing better on the minority class, using stacking techniques. I can also use pretrained models and fine-tune them on our dataset; these models have been trained on large corpora and may generalize better even with smaller minority-class datasets. And I would use evaluation metrics that give a better picture of model performance

on imbalanced datasets, like F1 score, precision-recall, and the ROC curve. I can also adjust the decision threshold for the minority class to increase sensitivity, and use active learning, self-labeling, and curriculum learning. With these approaches, I can handle imbalanced classes.
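The cost-sensitive idea above is often implemented as class weights inversely proportional to class frequency, which are then passed to a classifier's loss function. A minimal sketch with toy labels:

```python
import numpy as np

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * count_c), the common
    'balanced' heuristic; rarer classes get proportionally larger weights."""
    classes, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

labels = np.array([0] * 90 + [1] * 10)   # 9:1 imbalance (toy data)
weights = balanced_class_weights(labels)
```

With a 9:1 split, the minority class receives roughly nine times the weight of the majority class, which is what shifts the loss toward the rare class.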
How do you architect a system that automatically adapts a machine learning model to changing data distributions?

Architecting a system that automatically adapts a machine learning model to changing data distributions involves creating a pipeline capable of continuous monitoring, evaluation, and updating. The first piece is data monitoring: implement monitoring to detect changes in the data distributions, and continuously evaluate the model's performance on the latest data. If performance drops below a certain threshold, that triggers the retraining process. Then create a pipeline that can retrain the model automatically with the new data; this pipeline must handle data preprocessing, feature extraction, model training, and validation. Use model versioning to keep track of the different versions of the model and their performance. Before fully replacing the existing model, use A/B testing to compare the performance of the new model against the old one on real-time data. Implement a feature store to manage and reuse features across different model versions, ensuring consistency. Use workflow orchestration tools like Apache Airflow, Kubeflow Pipelines, or AWS Step Functions to manage the retraining and deployment pipelines. Have a rollback mechanism in place in case the new model performs unexpectedly after deployment, and a human review process to validate model updates when necessary, especially for critical applications.

Deploy the model in a way that supports dynamic updating without downtime, using techniques like canary releases, blue-green deployments, and shadow mode. I would use services like Kubernetes for scalable, flexible infrastructure that can dynamically allocate resources for retraining and deploying models. And implement comprehensive logging and audit trails of the system's decisions, which is useful for debugging and compliance. With these approaches, I can automatically adapt the machine learning model to changing data distributions.
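The drift-monitoring step above is often implemented with the Population Stability Index (PSI) between a reference (training-time) sample and a production sample. This is a sketch; the bin count and the usual 0.1 / 0.25 alert thresholds are conventional choices, not from the original answer.

```python
import numpy as np

def psi(reference, production, bins=10):
    """Population Stability Index between two samples of one feature."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    prod_pct = np.histogram(production, bins=edges)[0] / len(production)
    # Floor the proportions to avoid log(0) on empty bins.
    ref_pct = np.clip(ref_pct, 1e-6, None)
    prod_pct = np.clip(prod_pct, 1e-6, None)
    return float(np.sum((prod_pct - ref_pct) * np.log(prod_pct / ref_pct)))

rng = np.random.default_rng(1)
baseline = rng.normal(0, 1, 5000)        # training-time distribution
stable = rng.normal(0, 1, 5000)          # fresh data, same distribution
shifted = rng.normal(1.0, 1, 5000)       # the distribution has drifted

low, high = psi(baseline, stable), psi(baseline, shifted)
```

In the architecture described, a PSI above the alert threshold for key features is one signal that would trigger the automatic retraining pipeline.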
You need to implement features for an existing system where an LLM generates personalized travel itineraries. Describe your implementation plan and the metrics you would monitor to evaluate the system's success.

To implement features for an existing system that uses a large language model to generate personalized travel itineraries, first, the implementation plan. It starts with requirements gathering: conducting user interviews and surveys to understand the features users want for their travel activities, and analyzing competitors' offerings for feature insights. Then feature design: prioritizing features based on user needs, business value, and technical feasibility, and designing interactive features such as user inputs for preferences and constraints like budget, duration, and interests, plus real-time customization options. Then ensure access to up-to-date travel databases and APIs for destinations, accommodations, activities, transportation, and user reviews, and establish partnerships with travel service providers for real-time data access. Then fine-tune the LLM on travel-related datasets so it understands domain-specific language and user queries, and incorporate user preference data to personalize the model. Then integrate into the existing system with a focus on a seamless user experience, implement APIs for real-time data exchange with the travel service providers, and develop a user-friendly interface that allows easy input of travel preferences and displays the itineraries.

Also, create visualization tools for itinerary review; then conduct unit tests, integration tests, and user acceptance tests to ensure the system works as expected, and perform A/B testing to compare the new features against the baseline. Roll out the features incrementally using feature flags and canary releases, and monitor system performance and user feedback closely during the initial deployment. Implement mechanisms to collect feedback on the generated itineraries, use that feedback to continuously improve the LLM's performance, and establish processes to regularly update the travel databases and the LLM. Coming to the metrics to monitor: track metrics like daily active users, session length, and the number of itineraries generated; how often suggested itineraries convert to bookings; user satisfaction with the level of personalization; and the relevance of recommendations relative to user preferences. Use precision and recall metrics to evaluate the accuracy of the items suggested, measure the conversion rate of itineraries leading to bookings, and track retention rates to see if users return to generate new itineraries. Monitor response time to ensure itinerary generation stays within acceptable limits, qualitatively analyze user feedback for insights into feature improvements, and use sentiment analysis to gauge user satisfaction. Track revenue metrics if the service is monetized, such as average revenue per user (ARPU) and lifetime value (LTV). And keep an eye on the error rate of itinerary generation failures and the feature adoption rate.
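The precision/recall metrics mentioned for the itinerary suggestions can be computed as precision@k and recall@k over the suggested items versus those the user actually booked. The item IDs below are invented for the example.

```python
def precision_recall_at_k(suggested, relevant, k):
    """precision@k: share of the top-k suggestions that were relevant;
    recall@k: share of relevant items recovered in the top-k."""
    top_k = suggested[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

suggested = ["rome_tour", "beach_day", "museum_pass", "wine_tasting"]
booked = {"museum_pass", "beach_day"}     # what the user actually booked
p, r = precision_recall_at_k(suggested, booked, k=3)
```

Tracked over time, a drop in precision@k for the generated itineraries would be one of the quantitative signals feeding the improvement loop described above.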