
Contract Data Scientist, MCSquared AI
AI Innovation Specialist - Finance, Trilogy
Full Time Data Scientist, MCSquared AI
RnD Intern, DELL EMC
Associate IT Consultant, ITC Infotech
Full Stack Developer Volunteer, Isha Foundation
RnD Intern, Computer Institute of Japan
Odoo

Apache

NumPy

WordPress

Palantir Foundry

Databricks

Azure Data Factory

Power BI

Next JS

LangChain

React Native

Git

DevOps

Selenium

PowerShell

Scala

Kaggle

Scrapy

SVM

Naive Bayes

Tkinter
Hi, my name is Bharat Shroff and I'm from Bangalore, Karnataka. I started my career as an associate IT consultant, where my responsibilities were those of a data engineering role, and I worked with two clients. For the first client, I helped build an Azure Data Factory setup in which we orchestrated an event-driven pipeline: every day a file would be uploaded, triggering a pipeline of notebooks that took the raw data, applied transformations, generated analytics, and pushed the results to Power BI and Synapse Analytics, where they were consumed by further stakeholders. The second client's work was mainly on Azure Databricks, creating a similar data pipeline. After that, I worked at Isha Foundation for a considerable amount of time, where I built the website that helped digitize their process. It was a very manual process: every time a person came to the Isha Yoga Center, they had to fill in a handwritten form, which took hours of work from the team and the participants as well. So we created a digital profile that stored all that information, and we integrated the APIs for different activities, like accommodation and other programs, into a common website where a visitor could simply come and book. For this, I used Python and Odoo. Odoo is an open-source framework, so I got exposed to a lot of full-stack development, building both the backend and the frontend. Then I moved back to MCSquared, where I worked as a data scientist. There, I also worked with two clients. The first client's data platform was Palantir Foundry, where I worked on preparing visualizations, essentially a POC of the visualizations that stakeholders would be interested in. That also involved health checks on the data, data drift monitoring, and similar KPIs.
For the second client I worked with, it was again on Azure Databricks, but this involved identifying data vendors from which we could buy data, and using the client's proprietary data to do competitor analysis and other analyses that would essentially help grow their business. My latest project, the one I'm currently working on, is LLM-based: we've built an agent that you can ask questions, which creates SQL queries and fetches data from the required database. So, yeah, it's been a good journey with very varied experiences and tech stacks. Thank you.
To instrument and improve the reliability of a distributed task, I would use AWS Step Functions, which plays roughly the same orchestration role in AWS that Azure Data Factory plays in Azure. This service helps orchestrate pipelines, and I can use AWS SageMaker to automate the machine learning and data processing logic within notebooks written in AWS Glue. These notebooks contain the actual Python code.
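As a minimal sketch of how a Step Functions workflow can make a task more reliable, the snippet below builds an Amazon States Language definition that retries a Glue job with backoff and routes unrecoverable errors to a failure state. The state names, the job name, and the retry numbers are illustrative assumptions, not details from the answer above.

```python
import json


def build_state_machine(glue_job_name: str) -> dict:
    """Return an Amazon States Language definition with retry and a failure path."""
    return {
        "Comment": "Run a Glue job with retries (illustrative sketch)",
        "StartAt": "RunTransform",  # hypothetical state name
        "States": {
            "RunTransform": {
                "Type": "Task",
                # Service integration that starts a Glue job and waits for it
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": glue_job_name},
                # Retry any error up to 3 times, doubling the wait each time
                "Retry": [
                    {
                        "ErrorEquals": ["States.ALL"],
                        "IntervalSeconds": 30,
                        "MaxAttempts": 3,
                        "BackoffRate": 2.0,
                    }
                ],
                # If retries are exhausted, move to an explicit failure state
                "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
                "End": True,
            },
            "NotifyFailure": {
                "Type": "Fail",
                "Error": "TransformFailed",
                "Cause": "Glue job failed after all retries",
            },
        },
    }


definition = build_state_machine("daily-transform")  # hypothetical job name
definition_json = json.dumps(definition, indent=2)
```

The resulting JSON string is what would be supplied as the state machine definition; the retry and catch blocks are where the reliability behaviour lives.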
Redis is one of the industry-standard caches here, and it would help us drastically improve the performance of a cloud platform. Edge caching helps as well: it stores relevant data on the edge devices, with near real-time retrieval speed. And if the AI model itself is small enough to be hosted on the edge device, we avoid the round trip in which each query goes to the server, the server uses the AI model to generate the response, and the response is served back. Minimizing the model size so that it can be hosted on an edge device would greatly reduce both that latency and the server load.
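The caching idea can be sketched as a cache-aside lookup. To keep the example self-contained, an in-memory dictionary with expiry stands in for Redis; the class and function names are hypothetical, and a real deployment would replace `TTLCache` with a Redis client.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """In-memory stand-in for a cache such as Redis, with per-entry expiry."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl = ttl_seconds
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # drop the stale entry
            return None
        return value

    def set(self, key: str, value: str) -> None:
        self._store[key] = (time.monotonic() + self.ttl, value)


def cached_predict(cache: TTLCache, query: str, predict: Callable[[str], str]) -> str:
    """Cache-aside: return a cached answer if present, otherwise compute and store it."""
    hit = cache.get(query)
    if hit is not None:
        return hit
    value = predict(query)  # the slow path, e.g. a model call
    cache.set(query, value)
    return value
```

Repeated queries within the expiry window skip the model call entirely, which is where the latency saving comes from.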
When designing a low-latency API that serves machine learning predictions, the user interface and user experience perspective is important. What matters most is the perceived delay, so reducing it directly improves the user experience. As we start getting the response, we show each word as it arrives, and once the whole response is generated, we format it. I think that's what major UIs and other low-latency systems do. Using vector databases can also help speed up the process, for example when the system relies on similarity search.
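The show-words-as-they-arrive idea can be sketched with a plain generator. Here `fake_model_stream` is a hypothetical stand-in for a model that emits tokens one at a time, and the final formatting step (adding a full stop) is only an illustrative assumption.

```python
from typing import Iterable, Iterator, List, Tuple


def fake_model_stream(text: str) -> Iterator[str]:
    """Hypothetical stand-in for a model that yields one token at a time."""
    for token in text.split():
        yield token


def render_incrementally(tokens: Iterable[str]) -> Tuple[List[str], str]:
    """Return every partial frame a UI would paint, plus the final formatted text."""
    words: List[str] = []
    frames: List[str] = []
    for token in tokens:
        words.append(token)
        frames.append(" ".join(words))  # what the user sees so far
    final = " ".join(words).strip()
    if final and not final.endswith("."):
        final += "."  # formatting is applied only once the response is complete
    return frames, final


frames, final = render_incrementally(fake_model_stream("the answer is ready"))
```

The user sees the first frame as soon as the first token exists, so the perceived wait is the time to the first token rather than the time to the full response.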
Now, on structuring a Python codebase with SOLID principles in mind. In an ML project, it's important to accommodate flexibility in the data and flexibility in training a model and retraining it as the data is updated. What I've used is the medallion architecture of bronze, silver, and gold layers, where the bronze layer contains the raw data, and the silver layer contains the feature engineering or feature extraction, basically all the features we want to feed into a machine learning model. The gold layer has the data that's filtered, just before it goes into the machine learning model, and the gold layer is also where the predictions are created. Beyond that, we would obviously want a retraining process. I'd use MLflow for that, though I may be a bit biased, and Apache Airflow or similar tools would also work. We retrain the model on new data and compare the new model with the current champion on metrics relevant to the particular use case. Depending on which one performs better, we either archive the previous model or continue with the champion. All of this helps build a self-sustaining pipeline that maintains the data as well as the quality of predictions, and accuracy would generally improve, because the more data an ML model has, the better its accuracy tends to be.
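The champion comparison step can be sketched as a small function. The record type, the single higher-is-better metric, and the `min_gain` threshold are assumptions made for illustration; a real pipeline would read these values from its model registry.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ModelRecord:
    name: str
    metric: float  # assumed higher-is-better, e.g. validation AUC


def select_champion(
    champion: ModelRecord, challenger: ModelRecord, min_gain: float = 0.0
) -> Tuple[ModelRecord, ModelRecord]:
    """Return (model to serve, model to archive).

    The challenger replaces the champion only if it beats it by more than
    min_gain, which guards against promoting a model on noise alone.
    """
    if challenger.metric > champion.metric + min_gain:
        return challenger, champion
    return champion, challenger


current = ModelRecord("model_v1", 0.81)    # hypothetical metric values
retrained = ModelRecord("model_v2", 0.84)
serving, archived = select_champion(current, retrained, min_gain=0.01)
```

Keeping the decision in one pure function makes it easy to test and to swap the metric without touching the training or data-layer code.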
To optimise a Python application's interaction with S3: this is one of the more expensive operations, and because it is I/O-bound it must not block the rest of the application, which is essential for user experience. S3 natively supports parallel reads and writes, so multithreading, or multiprocessing, works well here. The Python application can use a separate thread for each user, or even for each prediction, and that thread can access the S3 bucket independently and in parallel. By default Python code runs sequentially, so parallelising these calls significantly improves throughput, since S3 itself handles concurrent access.
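A minimal sketch of the threaded approach, using only the standard library. `fetch_object` is a hypothetical stand-in that sleeps instead of calling S3; in a real application its body would be the actual S3 read, and the worker count would be tuned to the workload.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable


def fetch_object(key: str) -> bytes:
    """Hypothetical stand-in for an S3 read; the sleep simulates network wait."""
    time.sleep(0.05)
    return f"contents of {key}".encode()


def fetch_all(keys: Iterable[str], max_workers: int = 8) -> Dict[str, bytes]:
    """Fetch many objects concurrently; threads suit I/O-bound waits like this."""
    key_list = list(keys)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so keys and results line up
        return dict(zip(key_list, pool.map(fetch_object, key_list)))


objects = fetch_all(["a.csv", "b.csv", "c.csv"])
```

Because each thread spends most of its time waiting on the network, the total time approaches that of the slowest single fetch rather than the sum of all fetches.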
I have mainly worked with SQL, so I can't speak to graph query languages. But this query, with its question-mark property and question-mark value, is at least not a valid SQL query. The backslash before the quote doesn't make sense either; it's not correct Python syntax. We don't need that backslash, just three quotes would do. As for the query itself, I don't know if we should be using commas, and it's not clear what exactly the WHERE condition should be. So this query doesn't look right to me.
Neo4j is basically a graph database, so it suits any use case that involves maintaining relationships in a node-and-edge representation, like a social media network where you have friends, friends of friends, and so on. In a graph, one node is connected to another node, the way you are connected to a friend who is connected to another friend. This setup is ideal for these kinds of scenarios, and a machine learning model built on it has direct access to those relationships. It can compare nodes not only by their individual attributes but by their relationships as well, which helps the model learn from them. With the usual table structure, integrating the relationship aspect would require additional work. Explaining how one row is related to another row isn't straightforward to teach an ML model using a tabular or columnar structure.
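The friends-of-friends example can be sketched with a plain adjacency mapping. This is not Neo4j code; the names and the `mutual_friends` feature are hypothetical and only illustrate how relationship information can become model input.

```python
from typing import Dict, Set

# Hypothetical social graph: each person maps to their direct friends
friends: Dict[str, Set[str]] = {
    "asha": {"ben", "chen"},
    "ben": {"asha", "dev"},
    "chen": {"asha", "dev", "eli"},
    "dev": {"ben", "chen"},
    "eli": {"chen"},
}


def friends_of_friends(graph: Dict[str, Set[str]], person: str) -> Set[str]:
    """People two hops away who are not already direct friends."""
    direct = graph.get(person, set())
    two_hops: Set[str] = set()
    for friend in direct:
        two_hops |= graph.get(friend, set())
    return two_hops - direct - {person}


def mutual_friends(graph: Dict[str, Set[str]], a: str, b: str) -> int:
    """A relationship-based feature: how many friends two people share."""
    return len(graph.get(a, set()) & graph.get(b, set()))
```

A count such as `mutual_friends(friends, "asha", "dev")` is the kind of relationship-derived value that is awkward to express row by row in a table but falls out directly from the graph.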
A knowledge graph can enhance ML prediction capabilities for a system designed this way. Neo4j, like I said before, is a graph database, so building or implementing a knowledge graph in it is fairly straightforward. Assuming the use case is well suited to a graph, Neo4j natively supports nodes and relationships, so the structure of the knowledge graph is available to the machine learning model directly, and the model can learn from how the graph is connected.
On scikit-learn: the project I worked on initially involved using XGBoost with scikit-learn, but for our use case a survival model was a much better fit. There is a related library, scikit-survival, built on top of scikit-learn, which we used to tailor the model to our use case. That made more sense than traditional machine learning algorithms, which are mainly suited to classification or, of course, regression problems.
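To illustrate what a survival model handles that a plain classifier does not, here is a hand-rolled Kaplan-Meier estimate that accounts for censored observations (subjects whose event had not happened when observation stopped). It is a teaching sketch with made-up data, not the scikit-survival API and not the model used in the project.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(
    durations: Sequence[float], observed: Sequence[bool]
) -> List[Tuple[float, float]]:
    """Return (time, survival probability) at each time where an event occurred.

    observed[i] is False when subject i was censored at durations[i].
    """
    data = sorted(zip(durations, observed))
    at_risk = len(data)
    survival = 1.0
    curve: List[Tuple[float, float]] = []
    i = 0
    while i < len(data):
        t = data[i][0]
        events = 0
        leaving = 0
        # Group all subjects that share the same time
        while i < len(data) and data[i][0] == t:
            events += 1 if data[i][1] else 0
            leaving += 1
            i += 1
        if events:
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        # Censored subjects leave the risk set without counting as events
        at_risk -= leaving
    return curve


# Made-up example: five subjects, two of them censored
curve = kaplan_meier([2, 3, 3, 5, 8], [True, True, False, True, False])
```

Censored subjects still count toward the number at risk until they drop out, which is information a standard classification setup would either discard or mislabel.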
I've worked with FastAPI, and it natively supports asynchronous programming, although there is one tricky part: if you declare an endpoint with async def and then call blocking code inside it, it blocks the event loop, and requests are effectively handled sequentially. I think this was a major topic of confusion, not debate, and it was clarified in a PyCon talk, I believe in Ireland, where the speaker explained exactly how to use this for asynchronous purposes. Basically, when the endpoint does blocking work, you define the function as is, without async, and FastAPI runs it in a thread pool so that it doesn't block other requests. It's essential to keep any API non-blocking to prevent one user's query from blocking another user's query and to optimize server load and compute, reducing idle time for the CPU.
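The blocking pitfall can be reproduced with the standard library alone. The sketch below is an asyncio analogue rather than FastAPI code: `handler_blocking` mimics an `async def` endpoint that calls blocking code, and `handler_offloaded` mimics what happens when blocking work is pushed to a thread pool. The function names and delays are assumptions for illustration, and `asyncio.to_thread` requires Python 3.9 or later.

```python
import asyncio
import time
from typing import Awaitable, Callable


def blocking_io(delay: float) -> None:
    """Simulates a blocking call such as a synchronous database query."""
    time.sleep(delay)


async def handler_blocking(delay: float) -> None:
    # Blocking call inside a coroutine: the event loop cannot run anything else
    blocking_io(delay)


async def handler_offloaded(delay: float) -> None:
    # The blocking call runs in a worker thread, so the event loop stays free
    await asyncio.to_thread(blocking_io, delay)


async def _time_concurrent(
    handler: Callable[[float], Awaitable[None]], n: int, delay: float
) -> float:
    start = time.perf_counter()
    await asyncio.gather(*(handler(delay) for _ in range(n)))
    return time.perf_counter() - start


def measure(handler: Callable[[float], Awaitable[None]], n: int = 3, delay: float = 0.2) -> float:
    """Elapsed seconds for n concurrent calls to the given handler."""
    return asyncio.run(_time_concurrent(handler, n, delay))
```

With three concurrent calls, the blocking handler takes roughly the sum of the delays while the offloaded one takes roughly a single delay, which is the difference between requests queuing behind each other and being served in parallel.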