Roles:
Data Scientist, Data Society
Data Analyst/Scientist Mentor, Digikull
Data Science Intern, Ineuron.ai
SAP Analyst, Tata Consultancy Services Ltd

Skills: Python, GCP, AI, Git, AWS (Amazon Web Services), Docker, NLP, MongoDB, CodeClouds, Tableau CRM, Azure, Flask
Yeah, sure, I can give my background. My name is Sachin Mishra, and I have been working as a data scientist at Data Society for the last two years. Before that, I worked at TCS as a data analyst and SAP analyst. Overall, I have three years of experience in the data field, and I am passionate about solving data problems. That's all about me.
How would you set up an experiment to compare the effectiveness of different embedding techniques for NLP? We have many different embedding techniques in NLP: TF-IDF, bag of words, and modern techniques such as OpenAI embeddings. To set up the experiment, I would start with a basic comparison, and the setup also depends on the problem statement. For example, if I have to solve a basic classification problem, say sentiment analysis, I could create embeddings with TF-IDF or bag of words, train the classifier, and then check how much accuracy and what F1 score I get with each technique. But if the problem statement involves classifying a large collection of documents, where the meaning cannot be captured well by TF-IDF or bag-of-words features, then I would go ahead and try state-of-the-art embeddings such as OpenAI embeddings or open-source embedding models. By evaluating each technique on the same task with the same metrics, I would set up the experiment and compare the effectiveness of the different embedding techniques.
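A minimal sketch of that kind of comparison, assuming scikit-learn is available; the toy texts, labels, and model choice here are made up purely for illustration, and a real experiment would use a proper dataset and more folds:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Tiny stand-in sentiment dataset (illustrative only).
texts = ["great movie", "terrible film", "loved it", "hated it",
         "really great acting", "really terrible plot"]
labels = [1, 0, 1, 0, 1, 0]

# Same task, same model, same metric -- only the embedding technique varies.
results = {}
for name, vectorizer in [("bag_of_words", CountVectorizer()),
                         ("tfidf", TfidfVectorizer())]:
    X = vectorizer.fit_transform(texts)
    scores = cross_val_score(LogisticRegression(), X, labels,
                             cv=3, scoring="f1")
    results[name] = scores.mean()

for name, f1 in results.items():
    print(f"{name}: mean F1 = {f1:.3f}")
```

The same loop extends to dense embeddings (e.g. an OpenAI or open-source embedding model) by swapping in a function that returns a document-vector matrix in place of the vectorizer.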
Explain your approach to implementing a hybrid recommendation system combining collaborative filtering and content-based methods. Yeah, I have worked on a recommendation system. One project I worked on was related to suggesting movies, like Netflix does; I was trying to build a replica of that. I used the collaborative filtering method rather than content-based methods. In collaborative filtering, what we usually do is group similar kinds of users, and then recommend content to a user based on the cluster that user belongs to. Content-based methods require data about the user's own preferences, so when we initially launch this kind of recommendation system we don't have any data and have to wait until we collect some. Once we have the data on what a user likes, what kind of taste the user has, maybe emotional movies, or dramas, or whatever genre we can pick up, then we can provide recommendations based on the content, not only collaborative filtering. So by combining collaborative filtering and content-based filtering, we can build a very good hybrid recommendation system that can work in production.
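One simple way to combine the two signals is a weighted blend that leans on content-based scores for new users and shifts toward collaborative filtering as rating history accumulates. The `hybrid_score` helper, the blending weight, and the cold-start threshold below are assumptions for illustration, not a standard formula:

```python
def hybrid_score(collab_score, content_score, n_ratings,
                 cold_start_threshold=5):
    """Blend collaborative and content-based scores for one (user, item) pair.

    With few ratings (cold start) the content-based score dominates;
    once the user has enough history, collaborative filtering takes over.
    """
    alpha = min(n_ratings / cold_start_threshold, 1.0)
    return alpha * collab_score + (1 - alpha) * content_score

# Brand-new user: no ratings yet, so only the content-based signal is used.
print(hybrid_score(0.9, 0.4, n_ratings=0))   # 0.4

# Established user: enough history, so collaborative filtering dominates.
print(hybrid_score(0.9, 0.4, n_ratings=10))  # 0.9
```

The blend addresses exactly the cold-start problem described above: the system produces recommendations from day one and improves as user data arrives.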
How do convolutional neural networks handle image data differently from fully connected neural networks? Yeah, CNNs are specialized neural networks designed mainly for images, and sometimes video, but they mostly work with image data. The way they differ from a normal, fully connected neural network is that we can feed images directly into a CNN, and it has different kinds of filters which it uses to identify patterns within the images; the filters learn on their own what the image contains. On the other hand, with a fully connected neural network we have to convert the image into pixels and flatten it before passing it through the layers, and there is no concept of filters. That's how they are different.
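A small back-of-the-envelope comparison makes the difference concrete: a fully connected layer connects every flattened pixel to every hidden unit, while a convolutional layer shares one small filter across all spatial positions. The input size and layer widths below are assumed for illustration:

```python
# Assumed input: a 28x28 grayscale image.
h, w, channels = 28, 28, 1
hidden_units = 128

# Fully connected: every pixel connects to every hidden unit, plus biases.
fc_params = (h * w * channels) * hidden_units + hidden_units
print(fc_params)    # 100480

# Convolutional: 32 filters of size 3x3, weights shared across the image,
# plus one bias per filter. Parameter count is independent of image size.
filters, k = 32, 3
conv_params = filters * (k * k * channels) + filters
print(conv_params)  # 320
```

The weight sharing is why CNNs scale to large images and also why they pick up local patterns (edges, textures) that a flattened dense layer has no built-in way to exploit.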
What techniques would you use to handle an imbalanced dataset in a supervised learning context? Yeah, there are several techniques to handle an imbalanced dataset. SMOTE is one technique we can use. Then we have techniques like downsampling and upsampling: if we have two classes, 1 and 0, that we want to classify, and the majority of examples belong to class 1, then we can either downsample class 1 or upsample class 0. Those are some techniques for handling an imbalanced dataset in supervised learning.
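The resampling half of that answer can be sketched with the standard library alone; this is plain random resampling, not SMOTE itself (SMOTE synthesizes new minority points by interpolation and is provided by the `imbalanced-learn` package). The toy rows and 100:10 imbalance here are made up:

```python
import random

random.seed(0)  # reproducible illustration

# Toy imbalanced dataset: 100 majority-class rows, 10 minority-class rows.
majority = [(f"row{i}", 1) for i in range(100)]
minority = [(f"row{i}", 0) for i in range(10)]

# Upsampling: draw minority rows with replacement until the classes match.
upsampled_minority = random.choices(minority, k=len(majority))
balanced_up = majority + upsampled_minority

# Downsampling: keep only as many majority rows as there are minority rows.
downsampled_majority = random.sample(majority, k=len(minority))
balanced_down = downsampled_majority + minority

print(len(balanced_up), len(balanced_down))  # 200 20
```

Downsampling discards data and upsampling duplicates it, which is the trade-off that motivates synthetic approaches like SMOTE in the first place.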
What strategy would you use to streamline a deep learning model to run efficiently on a mobile device? Okay. Honestly, I have never worked on machine learning on edge devices in production; that hasn't come up in my three years of experience. But I did a mini project in which I had to build a deep learning model that takes a picture of plant leaves and classifies whether the leaves have some kind of problem or are healthy. For that, I used a TensorFlow Lite (TFLite) model and created an Android app, and the model worked fine. So TFLite is one of the frameworks we can use to deploy our models on mobile devices, and that's what I used in my mini project.
Can you explain what the following Python function is intended to do, and identify any potential error that would prevent it from functioning correctly? Yeah, let me read it. It defines calculate_precision(true_positive, false_positive), sets precision equal to true_positive divided by true_positive plus false_positive, and returns precision. Then, inside a try block, it calls calculate_precision(42, 0) and prints the result, with an except clause for ZeroDivisionError. So, looking at this function, it basically helps us calculate precision, and the formula is true positives divided by true positives plus false positives. In this case we have true_positive as 42 and false_positive as 0, so the denominator is 42 and we will get a precision of 1, because the false positive count is 0. The except ZeroDivisionError branch can only trigger if both true_positive and false_positive are 0; in that case the denominator is 0, we get a ZeroDivisionError, control goes to the except block, and it raises the message about a division-by-zero error while calculating precision. That is the only potential error: when both true_positive and false_positive are 0.
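Reconstructing the snippet as described in the answer (the exact variable names and the text of the error message are assumptions, since the original was dictated aloud):

```python
def calculate_precision(true_positive, false_positive):
    # Precision = TP / (TP + FP)
    precision = true_positive / (true_positive + false_positive)
    return precision

try:
    precision_result = calculate_precision(42, 0)
    print(precision_result)  # 1.0 -- FP is 0, so all predictions are correct
except ZeroDivisionError:
    print("Error calculating precision: division by zero")

# The failure case identified above: both counts zero makes the denominator 0.
try:
    calculate_precision(0, 0)
except ZeroDivisionError as exc:
    print("raised:", exc)
```

A common hardening is to return 0.0 (or raise a clearer error) when `true_positive + false_positive == 0`, so callers don't have to wrap every call in try/except.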
In this Python code snippet, the goal is to sort a dictionary by its values. Summarize the code and explain any issue that might arise with the current implementation. Okay, we have import operator, then d = {'apple': 50, 'banana': 30, 'cherry': 20}, then sorted_d = sorted(d.items(), key=operator.itemgetter(1), reverse=True), and then it prints sorted_d. So the goal is to sort the dictionary by its values. Let me check: we want to sort by the values 20, 30, and 50, so in ascending order the answer would be cherry, then banana, then apple. sorted(d.items()) will sort it, the key is operator.itemgetter(1), and reverse=True will reverse the order, sorting the entries from highest to lowest value. I'm not able to recall exactly what itemgetter(1) does off the top of my head, so I would want to write this in a code environment and check. Overall the code looks fine for the purpose of sorting the dictionary by its values, but I would have to run it before I can say whether itemgetter(1) causes any kind of problem.
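Running the snippet as reconstructed resolves the uncertainty: `operator.itemgetter(1)` selects element 1 of each `(key, value)` tuple, i.e. the value, so the sort works as intended. The one genuine caveat is that `sorted()` returns a list of tuples, not a dictionary:

```python
import operator

d = {"apple": 50, "banana": 30, "cherry": 20}

# itemgetter(1) picks the value out of each (key, value) pair;
# reverse=True orders from highest value to lowest.
sorted_d = sorted(d.items(), key=operator.itemgetter(1), reverse=True)
print(sorted_d)  # [('apple', 50), ('banana', 30), ('cherry', 20)]

# sorted() returns a list of tuples; wrap in dict() if a mapping is needed.
# Python dicts preserve insertion order, so the result stays sorted.
sorted_dict = dict(sorted_d)
print(sorted_dict)  # {'apple': 50, 'banana': 30, 'cherry': 20}
```

An equivalent key without the `operator` module is `key=lambda item: item[1]`.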
What method would you use to scale feature extraction for millions of images efficiently in a distributed computing environment? Yeah, I can go and use PySpark. I would write a Python function that extracts features from the images, and perhaps package it as a Lambda function or a cron job. Then, in the distributed computing environment, all the images can be continuously streamed or fed in, the Python code extracts the features, and maybe I convert those features into a pandas DataFrame or a Spark DataFrame, whatever output format is required. That's what I can think of right now.
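The core pattern here is a pure feature-extraction function mapped over a partitioned collection of images. Shown at toy scale with the standard library's thread pool; in PySpark the same function would be handed to `rdd.map` (or a pandas UDF) and executed across the cluster. The `extract_features` body and the fake "images" are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor

def extract_features(image):
    """Placeholder extractor: summary statistics of pixel values.

    A real pipeline would compute embeddings or handcrafted features here.
    """
    return (sum(image) / len(image), min(image), max(image))

# Stand-in for millions of decoded images: 8 tiny pixel lists.
images = [[i, i + 1, i + 2] for i in range(8)]

# Map the pure function over the collection in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    features = list(pool.map(extract_features, images))

print(features[0])  # (1.0, 0, 2)
```

Because `extract_features` takes one image and returns one feature row with no shared state, the same code parallelizes cleanly whether the executor is a local pool, a Spark cluster, or a fleet of Lambda invocations.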
What are your preferred tools for automating the deployment of machine learning models and ensuring version control? Yeah, I have several tools. For version control, I would use GitHub, or whatever code versioning system the company is using; in my current organization we are using Bitbucket, so that is what I would use for version control of the code. For automating the deployment of a machine learning model, we can automate it via a Jenkins pipeline, or through GCP, whatever tool we are using on Google Cloud Platform. Another tool is MLflow: we can use MLflow for automating the deployment of a machine learning model, and it is one of the tools we are using right now in my current company. Or we can go for Terraform: once we have the Dockerfile and everything ready, we can use Terraform to automate the entire process. Those are the tools I would definitely be using.
How would you approach data versioning when working with large datasets in machine learning experiments? So, just as we do code versioning and model versioning, we should go for dataset versioning as well. For data versioning, I would use DagsHub together with MLflow; that is my go-to combination. Using those, I ensure that whatever model I am building, and whatever experiment I have run, is tied to the dataset it belongs to. If the dataset changes and my model changes, then using DagsHub and MLflow experiment tracking I link, say, dataset version 2 to model version 2. In that way, I make sure I have the proper data version as well as the proper model version for every experiment.
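DagsHub, DVC, and MLflow do the heavy lifting in practice, but the core idea, tying each experiment to a content-derived dataset version, can be sketched with the standard library. The `run_log` layout and the toy CSV bytes below are made up for illustration, not any tool's actual format:

```python
import hashlib
import json

def dataset_version(data: bytes) -> str:
    """Content-addressed version id: same bytes -> same id, any change -> new id.

    This is the same hashing idea DVC uses to track large files.
    """
    return hashlib.md5(data).hexdigest()[:8]

# Two revisions of a toy dataset; v2 adds one row.
data_v1 = b"user_id,rating\n1,5\n2,3\n"
data_v2 = b"user_id,rating\n1,5\n2,3\n3,4\n"

# Each experiment records exactly which data version produced which model.
run_log = {
    "experiment_1": {"data_version": dataset_version(data_v1), "model": "v1"},
    "experiment_2": {"data_version": dataset_version(data_v2), "model": "v2"},
}
print(json.dumps(run_log, indent=2))
```

Because the id is derived from the content, re-running an experiment on unchanged data reproduces the same version id, which is what makes the experiment-to-dataset link trustworthy.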