Vetted Talent

Bharath Shroff

Results-driven professional with 5+ years of experience in AI, data science, and software engineering, consistently leveraging cutting-edge technologies to drive innovation. Proven expertise in automating financial and data processes, building scalable solutions, and delivering actionable insights for global stakeholders. Skilled in AI/ML, Python, RAG, Next.js, and cloud platforms like Databricks and Azure. Adept at enhancing decision-making through advanced analytics, end-to-end application development, and agile methodologies, with a strong foundation in project management and client-focused solutions.
  • Role

    Data Scientist

  • Years of Experience

    6 years

  • Professional Portfolio

    View here

Skillsets

  • REST API - 2 Years
  • React Js - 2 Years
  • react - 2 Years
  • Scala - 1 Years
  • Next Js - 1 Years
  • Next Js
  • Selenium - 2 Years
  • MLOps - 1 Years
  • LLMs - 1 Years
  • K-Means - 1 Years
  • Backend - 2 Years
  • Financial reports - 1 Years
  • Node Js - 1 Years
  • PowerBI - 2 Years
  • MySQL - 5 Years
  • Git - 4 Years
  • rag
  • Data engineering and manipulation
  • Tableau - 1 Years
  • Reporting - 3 Years
  • Relational Database - 5 Years
  • PyTorch - 1 Years
  • Python - 6 Years
  • SQL - 5 Years
  • PySpark - 5 Years
  • Cloud - 1 Years
  • Databricks - 5 Years
  • Odoo
  • Big Data - 5 Years
  • Data Engineering - 5 Years
  • MLFlow - 1 Years
  • JavaScript - 4 Years
  • React Native - 1 Years
  • Databricks cloud
  • Finance - 1 Years
  • Restful APIs - 5 Years
  • LLM - 1 Years
  • AI - 3 Years
  • Data Engineer - 5 Years
  • Data warehouse - 5 Years
  • Azure - 2 Years
  • API - 3 Years

Vetted For

10 Skills
  • Python Developer (AI/ML & Cloud Services) - Remote, AI Screening
  • 66%
  • Skills assessed: GCP/Azure, Microservices, Django/Flask, Neo4j, RESTful APIs, AWS, Docker, Kubernetes, Machine Learning, Python
  • Score: 59/90

Professional Summary

6 Years
  • Aug, 2024 - Present 1 yr 10 months

    Contract Data Scientist

    MCSquared AI
  • Aug, 2024 - Oct, 2024 2 months

    AI Innovation Specialist - Finance

    Trilogy
  • May, 2022 - Jul, 2024 2 yr 2 months

    Full Time Data Scientist

    MCSquared AI
  • May, 2018 - Jul, 2018 2 months

    RnD Intern

    DELL EMC
  • Jun, 2019 - Jul, 2021 2 yr 1 month

    Associate IT Consultant

    ITC Infotech
  • Aug, 2021 - Apr, 2022 8 months

    Full Stack Developer Volunteer

    Isha Foundation
  • May, 2016 - Jul, 2016 2 months

    RnD Intern

    Computer Institute of Japan

Applications & Tools Known

  • Odoo
  • Apache
  • NumPy
  • WordPress
  • Palantir Foundry
  • Databricks
  • Azure Data Factory
  • Power BI
  • Next JS
  • LangChain
  • React Native
  • Git
  • DevOps
  • Selenium
  • PowerShell
  • Scala
  • Kaggle
  • Scrapy
  • SVM
  • Naive Bayes
  • Tkinter

Work History

6 Years

Contract Data Scientist

MCSquared AI
Aug, 2024 - Present 1 yr 10 months
    Led the team building a Databricks pipeline that feeds a map-view dashboard of proximity hotspots of leads around business-provided site locations, leveraging the Bing Maps API and third-party real-world data sources such as Citeline, Health Verity, and IQVIA.

AI Innovation Specialist - Finance

Trilogy
Aug, 2024 - Oct, 2024 2 months
    Built an LLM chatbot for deriving financial insights, with React on the frontend and Express JS on the backend; it updated the RAG vector DB whenever new files were uploaded, reducing manual analysis time by an hour.

Full Time Data Scientist

MCSquared AI
May, 2022 - Jul, 2024 2 yr 2 months
    Deployed a machine learning survival model to production on Databricks, replacing the previous XGBoost model and using the medallion architecture. The model retrains itself every month on new data and is automatically archived or promoted to production after comparison with the champion model, using MLflow for model versioning and the C-score to evaluate performance.

Full Stack Developer Volunteer

Isha Foundation
Aug, 2021 - Apr, 2022 8 months
    Developed a web application on the open-source, Python-based Odoo framework, streamlining processes and digitizing multiple forms that hundreds of visitors previously filled in by hand, saving hours of work for both visitors and staff.

Associate IT Consultant

ITC Infotech
Jun, 2019 - Jul, 2021 2 yr 1 month
    Deployed end-to-end modules using Git DevOps for Continuous Deployment across the 4 stages (DEV->QA->UAT->PROD), ensuring seamless transitions and operational efficiency for MLOps.

RnD Intern

DELL EMC
May, 2018 - Jul, 2018 2 months
    Developed Python scripts for automated reporting, flagging approximately 100 high-priority reports daily, enhancing efficiency in report management.

RnD Intern

Computer Institute of Japan
May, 2016 - Jul, 2016 2 months
    Helped improve the accuracy of multi-class email classification, achieving 70%+ accuracy.

Achievements

  • Football Secretary (IIT Hyderabad)
  • Inter IIT Football Captain
  • Participated in Table Tennis Inter-Departmental / Inter-Year Tournaments

Major Projects

7 Projects

Melanoma Classification

    Achieved 85% AUC score in Identifying Melanoma using Convolutional Neural Network (CNN) models.

Network traffic analysis ITC Infotech

Oct, 2020 - Oct, 2020
    Extracted insights by transforming Apache access logs and visualizing them in plots showing that traffic originated from 10 different countries. Processed 6 million+ rows of open-source Apache server logs. Done as part of PySpark training.

Network traffic analysis

Oct, 2020 - Oct, 2020
    Extracted insights by transforming 6 million+ Apache server log rows and visualizing them in plots showing that traffic originated from 10 different countries.

Machine Learning Library from scratch

Aug, 2020 - Aug, 2020
    Implemented several ML algorithms using only NumPy, with the aim of developing a deep understanding of them: 3 regression models and 3 classification models, with no libraries other than NumPy. Also implemented 9 normalization algorithms for data standardization, in an effort to understand them.

Image classification of fruits

May, 2020 - Jul, 2020 2 months
    Multi-class classification of fruits from images, using a Kaggle dataset with 90,380 annotated images. Leveraged pretrained models such as VGG, ResNet, AlexNet, and MobileNet to obtain a mobile-deployable model.

Tic-tac-toe Extended 2player

Apr, 2019 - Apr, 2019
    Implemented an advanced, two-layered version of Tic-Tac-Toe in Python. I learnt the game from a friend; we used to play it on the back of our notebooks. Built as a side project during college, it is currently played manually by 2 people, with ML opponents as an ambitious future scope.

IITH Main Website

Jan, 2019 - Mar, 2019 2 months
    Built our college website from scratch using WordPress templating, integrating content from over 10 departments.

Education

  • Bachelor of Technology, Mechanical Engineering

    Indian Institute of Technology (IIT) Hyderabad (2019)

Certifications

  • Microsoft Certified: Azure Data Engineer Associate (DP-200, DP-201), Microsoft, 2021

AI-interview Questions & Answers

Hi, my name is Bharath Shroff and I'm from Bangalore, Karnataka. I started my career as an associate IT consultant, where my responsibilities were those of a data engineering role, and I worked with two clients. For the first client, I helped build an event-driven pipeline orchestrated in Azure Data Factory: every day a file would be uploaded, triggering a pipeline of notebooks that took the raw data, applied transformations, generated analytics, and pushed the results to Power BI and Synapse Analytics for downstream stakeholders. The second engagement was mainly on Azure Databricks, creating a similar data pipeline. After that, I worked at Isha Foundation for a considerable amount of time, where I built the website that helped digitize their process. It was a very manual process: every time a person came to the Isha Yoga Center, they had to fill in a handwritten form, which took hours of work from the team and the participants. So we created a digital profile that stored all that information and integrated the APIs for different activities, such as accommodation and other programs, into a common website where a visitor could simply book. For this, I used Python and Odoo. Odoo is an open-source framework, so I got exposed to a lot of full-stack development, building both the backend and the frontend. Then I moved to MCSquared AI, where I worked as a data scientist, again with two clients. The first client had their own data platform, Palantir, where I worked on preparing visualizations, essentially a POC on the visualizations stakeholders would be interested in. That also involved health checks on the data, such as data monitoring, data drift monitoring, and similar KPIs.
For the second client, the work was again on Azure Databricks, but it involved identifying data vendors from which we could buy data, and combining that with the client's proprietary data for competitor analysis and other analyses that would help grow their business. In my latest project, the one I'm currently working on, we've built an LLM agent that you can ask questions; it creates SQL queries and fetches data from the required database. So, yeah, it's been a good journey with very varied experiences and tech stacks. Thank you.

I would instrument and improve the reliability of a distributed task using AWS Step Functions, which is roughly the AWS equivalent of Azure Data Factory. This service helps orchestrate pipelines, and I can use AWS SageMaker to automate the machine learning steps, with the data processing logic in notebooks written in AWS Glue. These notebooks contain the actual Python code.

Redis is one of the industry-standard caches here, and it would drastically improve the performance of a cloud platform. Edge caching goes further by storing relevant data on edge devices, with near real-time retrieval. And if the AI model itself is small enough to be hosted on the edge device, latency drops considerably, because each query no longer has to travel to the server, wait for the server to run the model, and travel back. So minimizing the model size so that it fits on an edge device reduces both server load and per-query latency.
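The caching idea in this answer can be sketched as a cache-aside lookup. This is a minimal illustration, not a Redis client: `TTLCache` is a hypothetical in-memory stand-in for a Redis-style store with per-key expiry, and `cached_predict` and its `predict` argument are assumed names for the model call being cached.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """Hypothetical in-memory stand-in for a Redis-style cache with expiry."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            # Entry is stale: drop it and report a miss.
            del self._store[key]
            return None
        return value

    def set(self, key: str, value: str) -> None:
        self._store[key] = (self.clock() + self.ttl, value)


def cached_predict(cache: TTLCache, key: str,
                   predict: Callable[[str], str]) -> str:
    """Cache-aside: serve from the cache on a hit, else compute and store."""
    value = cache.get(key)
    if value is None:
        value = predict(key)
        cache.set(key, value)
    return value
```

With a real Redis deployment the `get` and `set` calls would go over the network, but the hit, miss, and expiry logic would have the same shape.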

When designing a low-latency API that serves machine learning predictions, the user experience perspective is important: what matters most is the perceived delay. So, as the response starts arriving, show each word as it comes, and format the output once the whole response has been generated. I think that's what the major UIs and other low-latency systems do. Using vector databases also helps speed up the process.

To structure a Python codebase for an ML project with SOLID principles in mind, it's important to accommodate flexibility in the data, in training the model, and in retraining as the data is updated. What I've used is the medallion architecture of bronze, silver, and gold layers: the bronze layer contains the raw data; the silver layer contains feature engineering or feature extraction, basically all the features we want to feed into a machine learning model; and the gold layer has the filtered data just before it goes into the model, and that's also where the predictions are created. Beyond that, we'd want a retraining process. I'd use MLflow, though I may be a bit biased, but Apache Airflow or similar tools would also work. We retrain the model on new data and run a champion model comparison on metrics relevant to the use case, then either archive the previous model or keep the champion, depending on which performs better. All of this helps build a self-sustaining pipeline that maintains the data and the quality of predictions, and accuracy generally improves as the model gets more data.
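The champion comparison step described above can be sketched in plain Python. This is only an illustration of the decision rule, not MLflow's API: `ModelVersion`, `promote_or_archive`, the stage names, and the `min_gain` threshold are all assumptions introduced here, and the metric is assumed to be "higher is better".

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ModelVersion:
    """Hypothetical record of a trained model and its holdout metric."""
    name: str
    metric: float            # assumed higher-is-better evaluation score
    stage: str = "candidate"


def promote_or_archive(champion: ModelVersion, challenger: ModelVersion,
                       min_gain: float = 0.0) -> Tuple[ModelVersion, ModelVersion]:
    """Compare a retrained challenger against the current champion.

    Returns (model_in_production, archived_model). The challenger is
    promoted only if it beats the champion by more than `min_gain`.
    """
    if challenger.metric > champion.metric + min_gain:
        challenger.stage = "production"
        champion.stage = "archived"
        return challenger, champion
    # Challenger did not improve enough: keep the champion in place.
    champion.stage = "production"
    challenger.stage = "archived"
    return champion, challenger
```

In a monthly retraining job, this function would run after evaluating both models on the same holdout data, and a model registry would then record the resulting stages.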

To optimise a Python application's interaction with S3, which is one of the most time-consuming operations, the access should not block anything else, which is essential for user experience. S3 buckets support parallel access by default, so multiprocessing or multithreading in the Python application would work: each prediction, not just each user, gets its own thread, and that thread accesses the S3 bucket independently and in parallel. By default a Python application runs sequentially, so parallelising it takes advantage of S3's native support for parallel reads and writes and significantly improves throughput.
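The multithreaded access pattern from this answer can be sketched with the standard library's `ThreadPoolExecutor`. The `fetch` callable is a hypothetical stand-in for whatever performs the actual S3 read (for example, a wrapper around an S3 client's object download); no AWS call is made in this sketch, and `fetch_all` and `max_workers=8` are illustrative choices.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def fetch_all(keys: Iterable[str], fetch: Callable[[str], bytes],
              max_workers: int = 8) -> Dict[str, bytes]:
    """Fetch many objects concurrently, one worker thread per in-flight key.

    `fetch` is assumed to be an I/O-bound function, so threads can overlap
    the time each call spends waiting on the network.
    """
    key_list = list(keys)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input keys.
        bodies = list(pool.map(fetch, key_list))
    return dict(zip(key_list, bodies))
```

Threads suit this case because the work is dominated by waiting on the network; multiprocessing would matter more if each object also needed heavy CPU-bound processing.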

I have worked mainly with SQL, so I don't know much about graph queries. But this query, with ?property and ?value, is not a valid SQL query at least. The backslash before the quotes doesn't make sense either; it's not correct Python syntax. We don't need that backslash; just three quotes would do. As for the query itself, I'm not sure we should be using commas, and I don't see the WHERE condition or what the condition should be exactly. So this query doesn't look right to me.

Neo4j is a graph database, so it fits any use case that involves maintaining relationships in a node-and-edge representation, like a social network where you have friends, friends of friends, and so on. Each node is connected to other nodes, just as your friend is connected to another friend. This setup is ideal for those scenarios, and a machine learning model built on it can use the relationships directly: it can find similar nodes not only by their individual attributes but also through their relationships. With the usual table structure, integrating the relationship aspect would require additional work, because explaining how one row relates to another is not straightforward to teach an ML model in a tabular or columnar structure.

This can enhance the ML prediction capabilities of a system designed this way. Neo4j, like I said before, is a graph database, so building or implementing a knowledge graph would be very straightforward. Assuming the use case is well suited to a graph, Neo4j natively supports nodes and relationships, so the machine learning model can capture them easily, learn how the knowledge graph is structured, and then leverage that for predictions.

On scikit-learn: the project I worked on initially used XGBoost with scikit-learn, but based on the use case, a survival model was a much better fit. There is another library in the scikit ecosystem called scikit-survival, which we implemented to tailor-fit our use case. That made more sense than using traditional machine learning algorithms, which are mainly good for classification problems or, of course, regression.
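The work history above names a C-score as the evaluation metric for the survival model. Assuming that refers to the concordance index commonly used for survival models, the idea can be sketched from scratch as follows. This is a simplified version of Harrell's C that ignores tied event times; scikit-survival ships its own implementation, which is not used here, and the function and argument names below are illustrative.

```python
from typing import Sequence


def concordance_index(times: Sequence[float], events: Sequence[bool],
                      risks: Sequence[float]) -> float:
    """Fraction of comparable pairs that the risk scores order correctly.

    A pair (i, j) is comparable when subject i had an observed event and
    did so strictly before time j. It is concordant when the model gave
    subject i the higher risk. Ties in risk count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 1.0 means every comparable pair is ranked correctly, 0.5 is what random scores would give on average, and a monthly champion comparison could use this number as its metric.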

FastAPI, which I've worked with, natively supports asynchronous programming, although there is a tricky part: if you declare an endpoint with async def and then call blocking code inside it, it effectively runs sequentially. I think this was a major topic of confusion, not debate, and it was clarified in a PyCon talk, I believe in Ireland, where the speaker explained exactly how to use it for asynchronous purposes. Basically, for blocking code you define the function as is, without manually specifying async, and FastAPI runs it in a thread pool so that requests are still handled concurrently. It's essential to keep any API asynchronous so that one user's query doesn't block another's, and to optimize server load and compute by reducing CPU idle time.
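The pitfall described in this answer can be reproduced with plain `asyncio`, without FastAPI itself. In this sketch, a coroutine that makes a blocking call holds the event loop, so concurrent requests run one after another, while offloading the same call to a worker thread lets them overlap. The handler names and the `time.sleep` stand-in for blocking work are assumptions for illustration; `asyncio.to_thread` requires Python 3.9 or later.

```python
import asyncio
import time


def blocking_work(delay: float) -> str:
    # Stand-in for a blocking call, e.g. a synchronous database query.
    time.sleep(delay)
    return "done"


async def handler_blocking(delay: float) -> str:
    # Blocking call inside a coroutine: the event loop cannot switch
    # to other tasks until this returns.
    return blocking_work(delay)


async def handler_offloaded(delay: float) -> str:
    # Run the blocking call in a worker thread so the loop stays free.
    return await asyncio.to_thread(blocking_work, delay)


async def timed(handler, n: int, delay: float) -> float:
    """Run n concurrent 'requests' and return the total wall-clock time."""
    start = time.perf_counter()
    await asyncio.gather(*(handler(delay) for _ in range(n)))
    return time.perf_counter() - start


if __name__ == "__main__":
    print("blocking :", asyncio.run(timed(handler_blocking, 3, 0.1)))
    print("offloaded:", asyncio.run(timed(handler_offloaded, 3, 0.1)))
```

The blocking variant takes roughly the sum of the delays, while the offloaded variant takes roughly one delay. Running plain `def` endpoints in a thread pool is the same idea applied by the framework.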