Data Scientist, Data Society
Data Analyst/Scientist Mentor, Digikull
Data Science Intern, Ineuron.ai
SAP Analyst, Tata Consultancy Services Ltd
Python

GCP

AI

Git
AWS (Amazon Web Services)
Docker

NLP

MongoDB

CodeClouds

Tableau CRM
Azure
Flask
My name is Sachin Mishra, and I have been working as a data scientist at Data Society for the last two years. Before that, I worked at TCS as an SAP analyst, which was a data analyst role. So I have three years of experience overall in the data field, and I'm passionate about solving data problems. That's all about me.
We have different kinds of embedding techniques in NLP. We have TF-IDF and bag of words, and we have modern techniques such as OpenAI embeddings. To set up an experiment comparing their effectiveness, I would start with a basic comparison, and the design also depends on the problem statement. For example, for a classification problem such as sentiment analysis, I would create embeddings using TF-IDF or bag of words, train the same classifier on each, and see how much accuracy and what F1 score I get. Based on those results, I can judge the effectiveness of each technique. However, if the problem involves classifying long documents, I need techniques that can handle the length of the documents and capture their meaning. In that case, I can try state-of-the-art options like OpenAI embeddings or embeddings from open-source models. In this way, I would set up the experiment and compare the effectiveness of the different embedding techniques.
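The comparison described above can be sketched as a small harness in which only the embedding changes while the classifier and metrics stay fixed. This is a minimal illustration, not the speaker's actual setup: the toy dataset, the nearest-centroid classifier, and all function names are assumptions, and the two embeddings are hand-rolled in plain Python rather than taken from a library.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy sentiment data: 1 = positive, 0 = negative.
TRAIN = [
    ("great movie loved it", 1),
    ("loved the acting great fun", 1),
    ("terrible movie hated it", 0),
    ("boring plot hated the acting", 0),
]
TEST = [("great fun loved it", 1), ("terrible boring plot", 0)]


def fit_bow(train_texts):
    """Bag of words: each text becomes a dict of raw term counts."""
    return lambda text: dict(Counter(text.split()))


def fit_tfidf(train_texts):
    """TF-IDF: term counts weighted by smoothed inverse document frequency."""
    n_docs = len(train_texts)
    doc_freq = Counter(w for t in train_texts for w in set(t.split()))
    idf = {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}
    # Words never seen in training get weight 0.
    return lambda text: {w: c * idf.get(w, 0.0) for w, c in Counter(text.split()).items()}


def cosine(a, b):
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    mean = defaultdict(float)
    for vec in vectors:
        for k, v in vec.items():
            mean[k] += v / len(vectors)
    return dict(mean)


def f1_score(y_true, y_pred):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def evaluate(fit, train, test):
    """Fit one embedding, classify by nearest class centroid, return (accuracy, F1)."""
    embed = fit([text for text, _ in train])
    centroids = {
        label: centroid([embed(text) for text, y in train if y == label])
        for label in sorted({y for _, y in train})
    }
    y_true = [y for _, y in test]
    y_pred = [
        max(centroids, key=lambda label: cosine(embed(text), centroids[label]))
        for text, _ in test
    ]
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return accuracy, f1_score(y_true, y_pred)


for name, fit in [("bag of words", fit_bow), ("TF-IDF", fit_tfidf)]:
    accuracy, f1 = evaluate(fit, TRAIN, TEST)
    print(f"{name}: accuracy={accuracy:.2f}, F1={f1:.2f}")
```

A model-based embedding could be compared the same way by adding another `fit_*` function that returns a text-to-vector callable, so every technique is scored on the same split with the same metrics.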
Explain your approach to implementing a hybrid recommendation system combining collaborative filtering and content-based methods. One of the projects I worked on was a movie recommendation system, similar to what Netflix does. I attempted to build a replica of this system using collaborative filtering and not content-based methods. In collaborative filtering, we usually group similar users together and recommend content that users in the same cluster have liked. Content-based methods, in contrast, need data about the individual user. When launching a recommendation system, we initially don't have any such data, so we have to wait until we collect some. Once we have the data, we can determine a user's preferences, such as their taste in movies. We might find that a user enjoys emotional movies, drama movies, or other genres. With this data, we can provide recommendations based on content, in addition to collaborative filtering. By combining both methods, we get a hybrid recommendation system that can work in a production environment.
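One simple way to combine the two methods is a weighted blend of their scores, with the content-based weight growing as more is known about the user's taste. The sketch below assumes each method already produces a score per movie in the range 0 to 1. The function names, the thresholds, and the example scores are all hypothetical.

```python
def content_weight(n_user_ratings, full_weight_at=20, max_weight=0.5):
    """Weight given to content-based scores; grows as we learn the user's taste."""
    return max_weight * min(n_user_ratings / full_weight_at, 1.0)


def hybrid_scores(collab_scores, content_scores, weight):
    """Blend two dicts of item -> score; a missing item counts as 0."""
    items = set(collab_scores) | set(content_scores)
    return {
        item: (1 - weight) * collab_scores.get(item, 0.0)
        + weight * content_scores.get(item, 0.0)
        for item in items
    }


def recommend(scores, k=2):
    """Return the k highest-scoring items, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical scores for one user from each method.
collab = {"Inception": 0.9, "Titanic": 0.2}
content = {"Titanic": 0.8, "The Notebook": 0.7}

# Brand-new user: no taste data yet, so only collaborative scores count.
print(recommend(hybrid_scores(collab, content, content_weight(0))))
# Established user: content-based scores get their full weight.
print(recommend(hybrid_scores(collab, content, content_weight(40))))
```

A linear blend is only one option. Switching between methods or feeding both scores into a learned ranker are common alternatives.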
How do convolutional neural networks handle image data differently from fully connected neural networks? So CNNs are specialized neural networks designed for images and videos. They mostly work with image data. The way they differ from normal, fully connected neural networks is that we can feed images directly into a CNN, and it has a different kind of layer, built from filters, that identifies patterns within the image. In contrast, a fully connected neural network requires us to flatten the image's pixel values into a single vector and then pass that vector through layers where the network has to learn on its own what the image contains. Fully connected networks don't have layers designed specifically for image data, such as convolutional filters. So that's how they are different.
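The difference can be made concrete with two small calculations. The first slides one filter across a tiny image to pick out a vertical edge. The second compares how many parameters a dense layer on the flattened image needs versus a convolutional layer. The image, the filter values, and the layer sizes are illustrative assumptions.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b] for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def dense_params(n_inputs, n_units):
    """One weight per input per unit, plus one bias per unit."""
    return n_inputs * n_units + n_units


def conv2d_params(kernel_h, kernel_w, in_channels, n_filters):
    """Each filter is reused at every image position, so size is independent of the image."""
    return kernel_h * kernel_w * in_channels * n_filters + n_filters


# 4x4 image: dark left half, bright right half.
image = [[0, 0, 1, 1] for _ in range(4)]
vertical_edge = [[-1, 1], [-1, 1]]
print(conv2d(image, vertical_edge))  # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]

# 28x28 grayscale image: flatten + 128-unit dense layer vs 32 filters of 3x3.
print(dense_params(28 * 28 * 1, 128))  # 100480
print(conv2d_params(3, 3, 1, 32))      # 320
```

The filter responds only where the brightness changes, wherever that happens in the image, which is the kind of spatial pattern a flattened input makes harder to learn.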
What techniques do we use to handle imbalanced datasets in a supervised learning context? There are several techniques for handling an imbalanced dataset. We can use SMOTE, which is one of the standard techniques. Then we have downsampling and upsampling. So if we have two classes, 1 and 0, that we want to classify, and 1 is the majority class, we can either downsample the 1s or upsample the 0s. Those are some of the techniques we can use to handle an imbalanced dataset in supervised learning.
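As a minimal sketch of the upsampling idea, the function below duplicates randomly chosen minority-class rows until every class matches the majority count. Note that this is plain random oversampling, not SMOTE, which instead creates new synthetic points by interpolating between minority neighbours. The function name and the toy data are assumptions.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate random minority-class rows until all classes are the same size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        pool = [row for row, y in zip(rows, labels) if y == label]
        extra = [rng.choice(pool) for _ in range(target - count)]
        out_rows.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_rows, out_labels


# Hypothetical imbalanced data: eight rows of class 1, two rows of class 0.
rows = [[i] for i in range(10)]
labels = [1] * 8 + [0] * 2

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

Resampling is usually applied to the training split only, so that the test set keeps the real class distribution.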
What strategy would you use to streamline a deep learning model to run efficiently on a mobile device? I would use a strategy that involves optimizing the model's architecture, reducing its size, and using a framework that supports mobile deployment. To achieve this, I would first consider a framework like TensorFlow Lite, which I have used successfully in a mini project. This framework lets us deploy models on mobile devices efficiently. Honestly, I have never worked on machine learning on edge devices, but my experience with TensorFlow Lite has given me a good understanding of how to optimize models for mobile deployment. In that project, I had to build a deep learning model that classified plant leaves as either healthy or unhealthy. I used a TensorFlow Lite model, created an Android app, and the model worked fine. So TensorFlow Lite is one of the frameworks we can use to deploy models on mobile devices.
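One common way to reduce model size is to store weights as 8-bit integers instead of 32-bit floats, which TensorFlow Lite supports through post-training quantization. The sketch below shows only the underlying arithmetic in plain Python, using a symmetric scheme chosen for illustration. It is not TensorFlow Lite's implementation, and the function names and weight values are assumptions.

```python
def quantize_int8(weights):
    """Symmetric 8-bit quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]


# Hypothetical layer weights.
weights = [0.82, -1.27, 0.003, 0.5]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized)    # integers that each fit in one byte
print(worst_error)  # at most half a quantization step (scale / 2)
```

Each weight then takes one byte instead of four, at the cost of a small rounding error per weight, so accuracy should be re-checked after quantizing.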
True positives divided by true positives plus false positives is the formula for precision.

Precision = (True Positives) / (True Positives + False Positives)

The function is intended to calculate precision based on the given true positives and false positives.

```python
def calculate_precision(true_positives, false_positives):
    if true_positives + false_positives == 0:
        raise ValueError("Cannot calculate precision when both true positives and false positives are 0")
    return true_positives / (true_positives + false_positives)


try:
    precision = calculate_precision(42, 0)
    print(precision)
except ValueError as e:
    print(e)
```

The potential error that would prevent the function from functioning correctly is when both true positives and false positives are 0. In this case, the function will raise a ValueError because division by 0 is undefined.
import operator

d = {'apple': 50, 'banana': 13, 'cherry': 20}
sorted_d = sorted(d.items())
comma_key = operator.itemgetter(1)
reverse = True
print(sorted_d)

# The current implementation is close to achieving the goal of sorting a dictionary by its values.
# However, there are a few issues with it.
# 1. The itemgetter is never passed to sorted(), so the items are sorted by key.
#    itemgetter specifies which item to take from each (key, value) tuple returned by
#    d.items(). To sort by values, operator.itemgetter(1) (the second item, the value)
#    must be supplied as the key argument.
# 2. sorted() returns a new sorted list from the elements of any sequence.
#    It does not modify the original dictionary. To get a dictionary back,
#    wrap the result in dict() and assign it to a name.
# 3. The key argument of sorted() determines the sort order. To sort by values,
#    we can use a lambda that returns the second item in the tuple (the value),
#    which is equivalent to operator.itemgetter(1).
# 4. The reverse argument of sorted() controls descending order. Here reverse = True
#    is only assigned to a variable, so it has no effect. It must be passed to sorted().

# Here's the corrected code:
d = {'apple': 50, 'banana': 13, 'cherry': 20}
sorted_d = sorted(d.items(), key=lambda x: x[1], reverse=True)
print(sorted_d)
What method would you use to scale feature extraction for millions of images efficiently in a distributed computing environment? Okay, so to scale feature extraction in a distributed computing environment, I can use PySpark. I could also set it up as a Lambda function or a cron job. I would write a Python function that extracts features from an image. Then the images can be continuously streamed or fed in on different machines, and the Python code would extract the features from each one. After that, I would collect the features into a pandas DataFrame or a Spark DataFrame, whichever output format is required. That's what I can think of right now.
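The pattern described here is to write one extraction function, split the images into partitions, and process the partitions in parallel. It can be sketched locally with a thread pool standing in for the cluster. In PySpark, the per-partition function would be handed to the cluster instead. The summary-statistic "features", the function names, and the fake pixel data are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_features(image_id, pixels):
    """Toy feature extractor: summary statistics of the pixel values."""
    return {
        "image_id": image_id,
        "mean": sum(pixels) / len(pixels),
        "min": min(pixels),
        "max": max(pixels),
    }


def process_partition(partition):
    """Work done by one worker: extract features for every image in its partition."""
    return [extract_features(image_id, pixels) for image_id, pixels in partition]


def partitioned(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def extract_all(images, partition_size=2, workers=4):
    """Process partitions in parallel and flatten the results, preserving order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(process_partition, partitioned(images, partition_size))
        return [row for part in results for row in part]


# Hypothetical images as (id, flat list of pixel values).
images = [("img0", [0, 10, 20]), ("img1", [5, 5, 5]), ("img2", [1, 2, 3, 4])]
for row in extract_all(images):
    print(row)
```

The resulting list of dictionaries has one row per image, which is the shape that a pandas or Spark DataFrame would be built from.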
So, I have several preferred tools for automating the deployment of machine learning models and ensuring version control. For version control, I would use GitHub or the company's code versioning system, which in my current organization is Bitbucket. So I would use Bitbucket for version control of the code. For automating the deployment of the machine learning model, we can use a Jenkins pipeline or Google Cloud Platform, or whatever tool the team is already using. One of the tools I have used is MLflow. We currently use MLflow in our company, and we can use it for automating the deployment of a machine learning model. Additionally, we can use workflow tooling and Terraform. Once we have the Dockerfile and everything else ready, we can use Terraform to automate the entire process.
I would approach data versioning for large datasets in machine learning experiments by combining code versioning, model versioning, and dataset versioning. I would use tools like Dagit and MLflow for data versioning. Using these tools, I would make sure that each dataset version is tied to a specific experiment. If the dataset changes and my model changes, I would record the particular dataset version together with the corresponding ML model version. This way, I can be sure that I have the proper data version as well as the proper model version.
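The core idea of tying a dataset version to a model version can be illustrated with a content hash. The sketch below is a hand-rolled illustration and not the API of MLflow or any other tool. The function names, the record fields, and the in-memory registry are assumptions.

```python
import hashlib
import json


def dataset_fingerprint(rows):
    """Short content hash of the dataset: any change to the data changes the version."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(json.dumps(row, sort_keys=True).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()[:12]


def log_experiment(registry, run_name, rows, model_version):
    """Record which data version and which model version belong to a run."""
    record = {
        "run": run_name,
        "data_version": dataset_fingerprint(rows),
        "model_version": model_version,
    }
    registry.append(record)
    return record


# Hypothetical dataset before and after a label correction.
data_v1 = [{"text": "good", "label": 1}, {"text": "bad", "label": 0}]
data_v2 = [{"text": "good", "label": 1}, {"text": "bad", "label": 1}]

registry = []
log_experiment(registry, "baseline", data_v1, model_version="model-1")
log_experiment(registry, "relabelled", data_v2, model_version="model-2")
for record in registry:
    print(record)
```

For large datasets, the same idea is usually applied to files on disk by hashing them in chunks, with only the hash stored alongside the code and model metadata.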