Vetted Talent

Sachin Mishra

Experienced Data Scientist and Mentor with a strong background in Machine Learning, NLP, and Computer Vision. With over 2.5 years of hands-on experience developing and implementing machine learning solutions, I have led a team of junior data scientists and analysts, providing guidance and mentorship. With a proven track record of using data-driven insights to solve complex problems, I bring a combination of technical expertise and leadership skills. Seeking opportunities to contribute my skills and knowledge in a dynamic and challenging environment.
  • Role

    Data Scientist

  • Years of Experience

    3 years


Skillsets

  • GCP
  • automation
  • AI
  • Streamlit
  • Flask
  • On
  • Github
  • SAP
  • Azure
  • APIs
  • MLOps
  • LinkedIn
  • Cloud
  • Troubleshooting
  • Tableau
  • Leadership
  • CI/CD
  • UI
  • Windows
  • Random Forest
  • Database management
  • Matplotlib
  • Git
  • Training
  • Docker
  • Python - 3 Years
  • Database
  • Statistics
  • NLP
  • MongoDB
  • Deep Learning
  • Scrum
  • ML
  • R
  • AWS
  • Python Programming
  • C
  • Communication
  • API
  • PowerBI
  • Code Review
  • Computer Vision
  • SQL
  • MySQL
  • Agile
  • FastAPI

Vetted For

12 Skills
  • Data Scientist (Remote) AI Screening
  • 71%
  • Skills assessed: Communication Skills, Jira, Retrieval-Augmented Generation, Computer Vision, Deep Learning, PyTorch, TensorFlow, GitLab, Machine Learning, NLP, NoSQL, Python
  • Score: 64/90

Professional Summary

3 Years
  • Jul, 2022 - Present 3 yr 11 months

    Data Scientist

    Data Society
  • Jun, 2022 - Present 4 yr

    Data Analyst/Scientist Mentor

    Digikull
  • Feb, 2022 - Oct, 2022 8 months

    Data Science Intern

    Ineuron.ai
  • Jul, 2021 - Jul, 2022 1 yr

    SAP Analyst

    Tata Consultancy Services Ltd

Applications & Tools Known

  • Python
  • GCP
  • AI
  • Git
  • AWS (Amazon Web Services)
  • Docker
  • NLP
  • MongoDB
  • CodeClouds
  • Tableau CRM
  • Azure
  • Flask

Work History

3 Years

Data Scientist

Data Society
Jul, 2022 - Present 3 yr 11 months
    Completed diverse training projects covering Python, NLP-based clustering, computer vision, and web scraping, delivering all projects on time. Recognized with an Efficient Employee Award for consistently meeting project goals. Worked on a CNN project to classify fruit images: wrote production-grade code for a convolutional neural network (CNN) using transfer learning on the pretrained VGG-16 and MobileNet models. Worked on an NLP project performing sentiment analysis on company policy documents. Created a Python package called assesment-creator to automate a task within the organization, reducing manual working hours. Built Tableau dashboards for the North Carolina government, using the Tableau Prep flow builder to create the data flows that feed the dashboards.

Data Analyst/Scientist Mentor

Digikull
Jun, 2022 - Present 4 yr
    Served as a mentor and instructor, delivering lessons with practical examples on Python programming, machine learning, statistics, and Tableau. Developed and implemented a comprehensive curriculum for these subjects, catering to students with diverse backgrounds and skill levels. Mentored and coached junior data analysts on best practices, troubleshooting techniques, and code review to support their professional growth. Guided students individually, providing support and feedback to help them grasp complex concepts and apply them effectively. Designed and conducted hands-on coding exercises, projects, and assessments to evaluate students' understanding and proficiency. Introduced and implemented MLOps methodology for the first time in the organization using MLflow.

Data Science Intern

Ineuron.ai
Feb, 2022 - Oct, 2022 8 months
    Completed general training in MySQL, Python, Statistics, Tableau, and Machine Learning. Cleaned and formatted Big Mart Sales data with over 8,524 rows and 12 columns to make it ready for analysis. Developed interactive dashboards to visualize Key Performance Indicators (KPIs) and provided business recommendations. Built and compared machine learning models such as Linear Regression, Lasso Regression, SVM, and Random Forest to predict sales for the different Big Mart stores.

SAP Analyst

Tata Consultancy Services Ltd
Jul, 2021 - Jul, 2022 1 yr
    Created and maintained client data day to day in the SAP HANA database across the Production, Development, and Quality systems to keep business functions running smoothly. Prepared a BI dashboard in Tableau to track monthly incidents, tasks, and change requests. This improved tracking of the tasks assigned to different teams and reduced SLA time by 30%. Worked closely with the engineering and business teams using Scrum/Agile methodology. Created SAP BI objects, provided the necessary roles and authorizations to clients, and monitored process chains.

Achievements

  • Completed diverse training projects covering Python, NLP-based clustering, computer vision, and web scraping.
  • Recognized with an Efficient Employee Award for consistently meeting project goals.

Major Projects

13 Projects

Sports Celebrity and Data Scientist

Image Classification

Credit Score Classification

Pet-Image Classification using CNN

Python Pypi Package

NLP Emotion Detection

Deployed

Atliq Hardware Sales/Pro Dashboard

Aadhar Card Masking & Information Retrieval

NLP Food Order App

NLP-Text Summarization Using Pegasus Model

Content Based Movie Recommender Engine

Community Sessions:

Education

  • BACHELOR OF ENGINEERING, Electronics and Telecommunication

    Thakur College of Engineering & Technology, Mumbai, Maharashtra (2021)

AI-interview Questions & Answers

My name is Sachin Mishra, and I have been working as a data scientist at Data Society for the last two years. Before that, I worked at TCS as a data analyst or SAP analyst. So I have three years of overall experience in the data field, and I'm passionate about solving data problems. That's all about me.

We have different kinds of embedding techniques in NLP. We have TF-IDF, we have bag of words, and we have modern techniques, like OpenAI embeddings. To set up an experiment to compare the effectiveness of different embedding techniques, I would start with a basic comparison, which also depends on the problem statement. For example, if I have to solve a classification problem, such as sentiment analysis, I would create embeddings using TF-IDF or bag of words and see how much accuracy I get and what F1 score I get. Based on those results, I can see the effectiveness of those techniques. However, if the problem statement involves long documents that I want to classify, then I need techniques that can handle the number of words and capture the meaning of the documents. In that case, I can try state-of-the-art options like OpenAI embeddings or embeddings from an open-source model. In this way, I would set up an experiment and compare the effectiveness of different embedding techniques.
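The comparison described in this answer can be sketched as a small experiment harness. This is a minimal illustration, not the speaker's actual setup: the toy sentences, the nearest-centroid classifier, and all function names are assumptions, and it uses only the standard library so that the two embeddings are scored under identical conditions.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def fit_idf(docs):
    """Smoothed inverse document frequency learned from the training documents."""
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + df)) + 1.0 for term, df in doc_freq.items()}

def bow_embed(doc, idf):
    """Bag of words: raw term counts (the idf argument is ignored)."""
    return dict(Counter(tokenize(doc)))

def tfidf_embed(doc, idf):
    """TF-IDF: term counts weighted by IDF; terms unseen in training are dropped."""
    return {t: c * idf[t] for t, c in Counter(tokenize(doc)).items() if t in idf}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def centroid(vectors):
    """Average of sparse vectors stored as dicts."""
    total = Counter()
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}

def accuracy(embed, train, test):
    """Embed with `embed`, classify by nearest class centroid, return accuracy."""
    idf = fit_idf([text for text, _ in train])
    by_label = {}
    for text, label in train:
        by_label.setdefault(label, []).append(embed(text, idf))
    centroids = {label: centroid(vecs) for label, vecs in by_label.items()}
    correct = 0
    for text, label in test:
        vec = embed(text, idf)
        predicted = max(centroids, key=lambda lab: cosine(vec, centroids[lab]))
        correct += predicted == label
    return correct / len(test)

# Hypothetical toy data; a real experiment would use a held-out labelled set.
train = [
    ("great product loved it", "pos"),
    ("really great service", "pos"),
    ("terrible product hated it", "neg"),
    ("really terrible service", "neg"),
]
test = [("loved the great service", "pos"), ("hated the terrible product", "neg")]

results = {name: accuracy(fn, train, test)
           for name, fn in [("bow", bow_embed), ("tfidf", tfidf_embed)]}
print(results)
```

The important design point is that only the embedding function changes between runs; the data split, classifier, and metric stay fixed, so any difference in the score can be attributed to the embedding.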

I'll explain my approach to implementing a hybrid recommendation system that combines collaborative filtering and content-based methods. One of the projects I worked on was a recommendation system that suggested movies, similar to what Netflix does. I attempted to build a replica of this system using collaborative filtering and not content-based methods. In collaborative filtering, we usually group similar users together and provide content to users that belong to a particular cluster. In contrast, content-based methods require user data. When launching a recommendation system, we initially don't have any data, so we have to wait until we get some. Once we have the data, we can determine a user's preferences, such as their taste in movies. We might find that a user enjoys emotional movies, drama movies, or other genres. With this data, we can provide recommendations based on content, in addition to collaborative filtering. By combining both methods, we can achieve a hybrid recommendation system that can work in a production environment.
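One common way to combine the two methods is a weighted blend of their per-item scores. The sketch below assumes each recommender already produces a score per movie; the function names, the example scores, and the blending weight `alpha` are hypothetical and not taken from the project described above.

```python
def hybrid_scores(collab, content, alpha=0.5):
    """Blend two score dicts: alpha weights collaborative filtering,
    (1 - alpha) weights the content-based scores. Missing items score 0."""
    items = set(collab) | set(content)
    return {
        item: alpha * collab.get(item, 0.0) + (1 - alpha) * content.get(item, 0.0)
        for item in items
    }

def top_n(scores, n=2):
    """Item names ordered by descending blended score."""
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Hypothetical scores in [0, 1] from the two underlying recommenders.
collab = {"Inception": 0.9, "Titanic": 0.2}
content = {"Inception": 0.4, "Titanic": 0.8, "Up": 0.6}

scores = hybrid_scores(collab, content, alpha=0.5)
print(top_n(scores))
```

Setting `alpha` to 0 or 1 falls back to a single method, which is one way to handle a phase where only one of the two signals is available.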

How do convolutional neural networks handle image data differently from fully connected neural networks? Yeah. So CNNs are specialized neural networks designed for images and videos. They mostly work with image data. The way they differ from normal, fully connected neural networks is that we can feed the images directly into the CNN, and it has a different kind of layer with which it tries to identify the patterns within the images. In contrast, fully connected neural networks require us to convert the image into pixel values, flatten them, and then pass them through layers that learn on their own what the image contains. But fully connected neural networks don't have layers specifically designed for image data, such as filters. So that's how they are different.
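One concrete consequence of this difference is the number of weights each layer type needs. The sketch below counts parameters for a dense layer applied to a flattened image and for a convolutional layer whose filters are shared across all positions; the 224x224 RGB input, 128 units, and 3x3 kernels are assumed values chosen only for illustration.

```python
def dense_params(height, width, channels, units):
    """Weights + biases of a fully connected layer on a flattened image."""
    inputs = height * width * channels
    return inputs * units + units

def conv_params(kernel, in_channels, filters):
    """Weights + biases of a conv layer: each filter is reused at every position."""
    return kernel * kernel * in_channels * filters + filters

dense = dense_params(224, 224, 3, 128)
conv = conv_params(3, 3, 128)
print(dense, conv)
```

Because a filter's weights do not depend on the image size, the convolutional layer stays small while the dense layer grows with every extra pixel.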

What techniques do we use to handle imbalanced datasets in a supervised learning context? Yeah. There are several techniques to handle an imbalanced dataset. We can go for SMOTE, which is one of the techniques used for this. Then we have techniques like downsampling and upsampling. So if we have two classes, 1 and 0, that we want to classify, and class 1 is the majority, then we can either downsample class 1 or upsample class 0. So those are some techniques we can use to handle an imbalanced dataset in supervised learning.
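The upsampling and downsampling options can be sketched with plain random resampling. Note that this is not SMOTE, which synthesizes new minority samples by interpolation; here minority rows are simply duplicated or majority rows dropped. The function name, the `strategy` parameter, and the toy data are assumptions for illustration.

```python
import random
from collections import Counter

def resample(rows, labels, strategy="over", seed=0):
    """Balance classes by random oversampling ("over") or undersampling ("under")."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    sizes = [len(group) for group in by_class.values()]
    target = max(sizes) if strategy == "over" else min(sizes)

    out_rows, out_labels = [], []
    for label, group in by_class.items():
        if len(group) >= target:
            # Enough rows: draw `target` of them without replacement.
            picked = rng.sample(group, target)
        else:
            # Too few rows: keep all and add random duplicates.
            picked = group + rng.choices(group, k=target - len(group))
        out_rows.extend(picked)
        out_labels.extend([label] * target)
    return out_rows, out_labels

# Toy dataset: class 1 is the majority (8 rows), class 0 the minority (2 rows).
rows = list(range(10))
labels = [1] * 8 + [0] * 2

_, over_labels = resample(rows, labels, strategy="over")
_, under_labels = resample(rows, labels, strategy="under")
print(Counter(over_labels), Counter(under_labels))
```

Resampling is normally applied to the training split only, so that the evaluation data keeps the original class distribution.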

What strategy would you use to streamline a deep learning model to run efficiently on a mobile device? I would use a strategy that involves optimizing the model's architecture, reducing its size, and using a framework that supports mobile deployment. To achieve this, I would first consider a framework like TensorFlow Lite, which I have experience with and which worked well in my mini project. This framework allows us to deploy our models on mobile devices efficiently. Honestly, I have never worked on machine learning on edge devices professionally, but my experience with TensorFlow Lite has given me a good understanding of how to optimize models for mobile deployment. In my previous project, I had to build a deep learning model that could classify plant leaves as either healthy or unhealthy. I used a TensorFlow Lite model, created an Android app, and the model worked fine. TensorFlow Lite is one of the frameworks we can use to deploy our models on mobile devices. That's what I used in my mini project.
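One way to reduce model size, as mentioned in the answer, is weight quantization, which TensorFlow Lite supports as post-training quantization. The answer does not say quantization was used in the mini project, so the sketch below is only an illustration of the idea in plain Python, not TensorFlow Lite's implementation: floats are stored as 8-bit integer codes plus one shared scale. The function names and example weights are assumptions.

```python
def quantize_int8(weights):
    """Symmetric quantization: map floats to integer codes in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # fall back if all weights are 0
    codes = [round(w / scale) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]

weights = [-0.8, -0.25, 0.0, 0.5, 1.2]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, max_error)
```

Each weight then needs one byte instead of four, at the cost of a rounding error of at most half the scale per weight.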

True positives divided by (true positives plus false positives) is the formula for precision:

Precision = (True Positives) / (True Positives + False Positives)

The function is intended to calculate precision based on the given true positives and false positives.

```python
def calculate_precision(true_positives, false_positives):
    if true_positives + false_positives == 0:
        raise ValueError("Cannot calculate precision when both true positives and false positives are 0")
    return true_positives / (true_positives + false_positives)

try:
    precision = calculate_precision(42, 0)
    print(precision)
except ValueError as e:
    print(e)
```

The potential error that would prevent the function from working correctly occurs when both true positives and false positives are 0. In this case, the function raises a ValueError because division by 0 is undefined.

d = {'apple': 50, 'banana': 13, 'cherry': 20}
sorted_d = sorted(d.items())
comma_key = operator.itemgetter(1)
reverse = True
print(sorted_d)

# The current implementation is close to achieving the goal of sorting a dictionary by its values.
# However, there are a couple of issues with the current implementation.
# 1. The itemgetter function is used incorrectly. The itemgetter function is used to
#    specify which item from the tuple returned by the dictionary's items() method to
#    return. In this case, since we want to sort by the values, we should use
#    operator.itemgetter(1) to return the second item in the tuple (the value).
# 2. The sorted function returns a new sorted list from the elements of any sequence.
#    It does not modify the original dictionary. So, if we want to sort the dictionary
#    in-place, we need to assign the result back to the original dictionary.
# 3. The key argument in the sorted function is used to determine the sort order.
#    In this case, we want to sort by the values, so we should use a lambda function
#    that returns the second item in the tuple (the value).
# 4. The reverse argument in the sorted function is used to sort in descending order.
#    In this case, we want to sort in descending order, so we should set reverse to True.

# Here's the corrected code:
d = {'apple': 50, 'banana': 13, 'cherry': 20}
sorted_d = sorted(d.items(), key=lambda x: x[1], reverse=True)
print(sorted_d)

What method would you use to scale feature extraction for millions of images efficiently in a distributed computing environment? Okay, what method would I use to scale feature extraction in a distributed computing environment? I can use PySpark. I could write a function, or maybe create a Lambda function or a cron job. So I would write a Python function that extracts features from the images. Then, on a different machine, all the images can be continuously streamed or fed in. The Python code would extract the features, and I would convert them into a pandas DataFrame or a Spark DataFrame, whichever output format is required. That's what I can think of right now.
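The pattern described here, one extraction function applied independently to partitions of the image set, can be sketched locally. This example uses the standard library's `ThreadPoolExecutor` as a stand-in for a cluster framework such as PySpark; the histogram feature, the synthetic "images", and all names are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def extract_features(pixels, bins=4):
    """Toy feature extractor: normalized intensity histogram of 0-255 pixel values."""
    hist = [0] * bins
    for p in pixels:
        hist[min(p * bins // 256, bins - 1)] += 1
    total = len(pixels) or 1
    return [count / total for count in hist]

def extract_batch(batch):
    """Process one partition of (image_id, pixels) pairs independently."""
    return [(image_id, extract_features(pixels)) for image_id, pixels in batch]

def partition(items, size):
    """Split the work list into fixed-size partitions."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# Synthetic stand-ins for images: 10 "images" of 64 grayscale pixels each.
images = [(f"img_{i}", [(i * 37 + j * 11) % 256 for j in range(64)]) for i in range(10)]

with ThreadPoolExecutor(max_workers=4) as pool:
    # map() preserves partition order, so results line up with the input.
    results = [row for batch in pool.map(extract_batch, partition(images, 3))
               for row in batch]

print(results[0])
```

Because `extract_batch` depends only on its own partition, the same function could in principle be handed to a distributed map over partitions without changing its logic.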

So, I have several preferred tools for automating the deployment of machine learning models and ensuring version control. For version control, I would use GitHub or the company's code versioning system, which in my current organization is Bitbucket. For automating the deployment of the machine learning model, we can use a Jenkins pipeline or whatever cloud tooling is in place, such as Google Cloud Platform. One of the tools I have used is MLflow, which we currently use in our company and which can help automate the deployment of a machine learning model. Additionally, we can also use workflow and Terraform. Once we have the Dockerfile and everything ready, we can use Terraform to automate this entire process.

I would approach data versioning when working with large datasets in machine learning experiments by using code versioning, model versioning, and also dataset versioning. I would use tools like Dagit and MLflow for data versioning. Using these tools, I would ensure that each dataset version is tied to a specific experiment. If the dataset changes and my model changes, I would record the particular dataset version together with the corresponding ML model version. This way, I can be sure that I have the proper data version as well as the proper model version.
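The core idea, tying each experiment to an exact dataset version and model version, can be sketched without any particular tool. The example below derives a version id from a content hash of the data and stores it next to the model version; it does not use the API of any tool named above, and the function names, record fields, and sample rows are hypothetical.

```python
import hashlib
import json

def dataset_version(rows):
    """Content hash of the dataset: identical rows always give the same id."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(json.dumps(row, sort_keys=True).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()[:12]

def log_experiment(registry, model_version, rows, metrics):
    """Record which dataset version a given model version was trained on."""
    record = {
        "model_version": model_version,
        "data_version": dataset_version(rows),
        "metrics": metrics,
    }
    registry.append(record)
    return record

registry = []
data_v1 = [{"text": "good", "label": 1}, {"text": "bad", "label": 0}]
data_v2 = data_v1 + [{"text": "okay", "label": 1}]

run_1 = log_experiment(registry, "model-1", data_v1, {"accuracy": 0.80})
run_2 = log_experiment(registry, "model-2", data_v2, {"accuracy": 0.85})
print(run_1["data_version"], run_2["data_version"])
```

With a record like this, any reported metric can be traced back to the dataset contents it was computed on, and a changed dataset shows up as a changed version id.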