
Experienced in fine-tuning and prompt engineering of LLMs such as GPT-3.5, Llama-2, and Mistral, including RAG systems. Proven expertise in Generative AI, LangChain, OpenAI models, LlamaIndex, RAG, Hugging Face, and LLM fine-tuning. My journey in AI started with a strong foundation in Electrical and Electronics Engineering from MIT College of Engineering, Pune, which has been instrumental in developing my analytical and problem-solving skills. With a focus on RAG and chatbot technologies, I have built intelligent systems that have significantly improved client interaction and service delivery. My commitment to innovation and a collaborative approach have been key to delivering projects that not only meet but exceed client expectations, fostering a culture of excellence and continuous improvement within our organization.
AI/ML/LLM Engineer, Applied AI Consulting
Programming Analyst Trainee, Cognizant
ML Engineer, UshaiTechLabs Pvt Ltd
Python
MATLAB
Django
Docker
Kubernetes
Terraform
Elasticsearch
Kibana
Kafka
RabbitMQ
Datadog
Argo CD
Cucumber
Smartsheets
LangChain
PostgreSQL
AWS (Amazon Web Services)
FastAPI
Implemented changes according to social media platforms' API changelogs, maintaining compatibility with evolving APIs and ensuring seamless integration with those platforms.
Monitored Datadog dashboard for system performance and discrepancies, ensuring 99.9% uptime and proactively identifying potential issues.
Automated Smartsheets data updates, reducing manual effort by 80% and ensuring data accuracy and consistency.
Okay, let me give you more background. I started working at Applied AI in August 2019. Initially I was working on Python and AWS; my first piece of work was turning a desktop app into a serverless app. I used AWS and the Serverless Framework to convert that desktop app into a serverless app with multiple APIs, using serverless AWS services such as S3, DynamoDB, Lambda, SQS, and others. Later I worked mostly on serverless APIs on AWS, and I created a tool where the user can generate APIs for a microservice, which I call a serverless API generator; that was developed completely by me. After that I was placed on a customer project, a data analytics platform where the customer can track the progress of their social media handles as well as their competitors'. We collected analytics from Facebook, Instagram, Twitter, and LinkedIn and showed them to the customer in a UI. Keeping that app up, and backfilling data whenever an issue occurred, was handled by me.

Later on I was part of a project where I used ChatGPT and OpenAI APIs for a health analytics platform. That company was building an IVR agent: the bot would call an IVR number, listen to the call, press in the numbers automatically, and collect all the data required for a particular patient. AWS Transcribe was used for speech to text, and Comprehend was used to identify the entities. After the advent of ChatGPT, I mostly worked on internal products. I built an ML testing platform where users can come in with their ML models and upload sample data; from that data we generate synthetic data in different categories to test the robustness of the model, to run exploratory tests, and to stress test it before it goes back into production. Testing meant running all the generated data against the model and producing a report along with the relevant test metrics. I also built a chatbot for internal company FAQs: previously the turnaround time for getting issues resolved was too long, even though the answers could be found directly in website data or other relevant material. That material was integrated as a RAG app where the sources can be a website, PDFs, or YouTube videos; from that, a chatbot was developed, and users can come in with their questions and get answers. On similar lines, I then designed an automated test case generation platform where the user can upload a Selenium recording, and from that recording we generate the automation tests. So, yeah, that's pretty much it.
The next one is about implementing a FastAPI service that interfaces with an LLM. Yes, I've used FastAPI in one of my projects; recently, for one of the clients, we had to design an Excel-to-DB service. There were too many Excel sheets, and they wanted that data in a database so it could be queried. I used FastAPI to design the APIs for that: the client comes in with different queries, and the answers come from the Excel sheets that have been converted into a database, including designing the schema. Now, when we want a FastAPI service to be interfaced with LLMs, the FastAPI app would expose some endpoints, for example for a chatbot where the user asks a question through the API. That API integrates with the LLM, and the integration can use LangChain as a backend layer, with the FastAPI handlers calling LangChain methods and functions. LangChain is widely used nowadays for chatbots built with a RAG-based approach, and it supports most of the top LLMs, whether OpenAI or open-source models, so LangChain can handle that integration.
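As a rough illustration of that wiring, here is a minimal sketch of a FastAPI endpoint that hands a user question to an LLM through LangChain; it assumes the langchain-openai package and an OPENAI_API_KEY in the environment, and the route, model name, and request schema are placeholders rather than anything from a specific project.

```python
# Minimal sketch: a FastAPI endpoint that forwards a user question to an LLM
# through LangChain. Assumes `pip install fastapi langchain-openai` and an
# OPENAI_API_KEY in the environment; route and model names are placeholders.
from fastapi import FastAPI
from pydantic import BaseModel
from langchain_openai import ChatOpenAI

app = FastAPI()
llm = ChatOpenAI(model="gpt-3.5-turbo")  # any LangChain-supported chat model

class Question(BaseModel):
    text: str

@app.post("/chat")
def chat(question: Question):
    # LangChain's invoke() sends the prompt to the model and returns a message
    answer = llm.invoke(question.text)
    return {"answer": answer.content}
```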
The next question is about proposing a testing strategy for evaluating the coherence of responses from a dialogue system. What I understand is that the dialogue system can be assumed to be a chatbot, maybe powered by ChatGPT, where the user comes in, a conversation gets built up, the user asks a question, gets an answer, and then asks follow-up questions. For evaluating the coherence of a response: any LLM giving out responses can hallucinate at times, so we need a testing strategy for the output so that it is checked against all the important criteria. It should not be biased, it should be fair, and the response should actually be relevant to the current question, so all of these need to be tested. For this there are multiple evaluation libraries for LLMs which report, for example, how biased a response is, a fairness score, and a relevance score that measures the similarity between the question asked and the response, which tells you how relevant the response is to the question.
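One lightweight way to score that question-to-answer relevance is embedding similarity; below is a small sketch using sentence-transformers, with the model name and threshold chosen purely for illustration.

```python
# Sketch: score how relevant an answer is to the question via cosine
# similarity of sentence embeddings. Model name and threshold are assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def relevance(question: str, answer: str) -> float:
    q_emb, a_emb = model.encode([question, answer], convert_to_tensor=True)
    return util.cos_sim(q_emb, a_emb).item()

score = relevance("Where is the Eiffel Tower?", "The Eiffel Tower is in Paris.")
print(score)          # closer to 1.0 means the answer stays on topic
print(score > 0.5)    # example threshold for flagging off-topic responses
```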
In what ways can multithreading be leveraged in a Python-based LLM application to improve performance? If a Python-based LLM application has multiple users coming in at the same time, questions are being asked concurrently, so there can be multiple questions hitting the LLM at once. We can apply threading there, where multiple threads invoke the backend LLM model at the same time, so the answers are fetched for all the questions in parallel rather than one by one. That is the most important part for the user experience, since users don't have to wait too long, and it's the primary place where I'd use multithreading. Apart from that, when new data comes in and we want to re-index a RAG-based pipeline, or fetch data from different sources, or handle multiple clients if the app is deployed that way, that can also be done with multithreading: the indexing step uses multiple threads to pull data from several sources at once. That way we save time instead of sequentially fetching all the data, indexing it, and saving it into the vector database.
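A minimal sketch of that first use case, fanning concurrent user questions out over a thread pool; call_llm is a hypothetical wrapper around whichever LLM client is actually in use, and threads help mainly because those calls are I/O-bound network requests.

```python
# Minimal sketch: answering several user questions concurrently with a thread
# pool. call_llm is a hypothetical placeholder for the real LLM client call.
from concurrent.futures import ThreadPoolExecutor

def call_llm(question: str) -> str:
    # placeholder for the real API call (OpenAI, LangChain, etc.)
    return f"(answer to: {question})"

questions = ["What is RAG?", "Summarize this document.", "Translate to French."]

with ThreadPoolExecutor(max_workers=8) as pool:
    # map() dispatches all questions at once and preserves input order
    answers = list(pool.map(call_llm, questions))

print(answers)
```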
The next one is about the protocol I would implement for a live-updating leaderboard of prompt performance metrics. For this, I guess a RabbitMQ-style setup would work; any messaging queue service or protocol can be used. If it's on AWS we can have SQS, or open-source options like a Kafka or RabbitMQ queue, where the application continuously sends the prompt metrics to the queue, and the queue is integrated with the leaderboard. I did this in one of my projects, where a Kafka queue was integrated with Datadog. Datadog was the dashboard where all the metrics for the data analytics platform, from Facebook and the other social media platforms, were continuously being sent. That dashboard was the central point where we kept an eye on things: if a metric dropped under a threshold, around 97-98%, its box turned red so we could spot it, and we also got an email, so we could then debug the issue with that retrieval or that service. So we can have a RabbitMQ or Kafka queue integrated with the application or the model, continuously sending prompt metrics to a Datadog-style dashboard that updates live.
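For the RabbitMQ variant, a minimal publisher sketch with pika could look like the following; the broker address, queue name, and metric fields are placeholders, and a separate consumer would read from the queue to update the leaderboard.

```python
# Minimal sketch: publishing prompt performance metrics to a RabbitMQ queue
# with pika, so a downstream consumer can keep a leaderboard/dashboard live.
# Assumes a local RabbitMQ broker; queue and metric names are placeholders.
import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="prompt_metrics")

metric = {"prompt_id": "faq-v2", "latency_ms": 850, "relevance": 0.93}
channel.basic_publish(exchange="",
                      routing_key="prompt_metrics",
                      body=json.dumps(metric))
connection.close()
```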
The next question is about caching mechanisms in Python for frequently requested elements. In the chatbot I built, apart from fetching answers from the vector DBs, there was also an option for whether to save the conversation to history. When history is enabled, the questions the user asks and the responses they get are kept, so the next time they come in they have that history, similar to ChatGPT; it is stored in the DB along with the user details, the question, and the response. For server-side caching, I work mostly in AWS, and the application is usually served through CloudFront. CloudFront has multiple edge locations, so for someone sitting in Mumbai while the server is in, say, North Virginia, the latency would otherwise be too high; caching is implemented in CloudFront itself so that the next time the user asks, the response is served from the nearest edge location and not from the origin. So one option is persisting in a database, and the other is using CloudFront as a cache.
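For the in-process side of "frequently requested elements", a small sketch with functools.lru_cache is below; ask_llm is a hypothetical stand-in for the real vector-DB or LLM call, and across multiple servers a shared cache such as Redis would replace it.

```python
# Sketch: cache answers for repeated questions in-process so identical prompts
# skip the expensive call. ask_llm is a hypothetical placeholder.
from functools import lru_cache

def ask_llm(question: str) -> str:
    # placeholder for the real vector-DB lookup / LLM call
    return f"(answer for: {question})"

@lru_cache(maxsize=1024)
def cached_answer(question: str) -> str:
    return ask_llm(question)

print(cached_answer("What is our refund policy?"))  # computed once
print(cached_answer("What is our refund policy?"))  # served from the cache
```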
The next one is about why some images are receiving incomplete descriptions from the captioning pipeline. The pipeline has been defined, the image path is passed to a function, and that function invokes the pipeline with the image. But max_length is set to 50, and that is the parameter capping the output description, so that's why some images are receiving incomplete descriptions: if the description is longer, it gets clipped once it hits that limit of 50. How would we debug this? Partly a trial-and-error approach on max_length: we'd look at what kind of images we're testing on, and based on those images or the complete dataset we can define a max length that suffices for most of them. Fifty is quite short for a description, so we'd pick a more standard length, or a higher value if we want each description to be a lengthy one.
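A hedged sketch of the fix, assuming a Hugging Face image-captioning pipeline along the lines of what the question describes; the model name and file path are assumptions, and the point is simply to raise the generation limit rather than capping it at 50.

```python
# Sketch, assuming a Hugging Face image-to-text pipeline; the model name and
# image path are placeholders. Raising the generation limit avoids the clipped
# descriptions caused by a small max_length.
from transformers import pipeline

captioner = pipeline("image-to-text",
                     model="Salesforce/blip-image-captioning-base")

result = captioner("photo.jpg", generate_kwargs={"max_new_tokens": 100})
print(result[0]["generated_text"])
```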
Here's Python code for creating a prompt for a large language model, and there's an error causing a runtime exception; I need to find what's causing it. The prompts are "Translate English to French" and "Summarize the following text", and the prompt-generation function is given a task index along with the text. In this example we're sending task = 2 with the text "Where is the Eiffel Tower?". The prompt is taken as prompts[task], but the prompts list only has two elements and Python uses 0-based indexing, so prompts[2] is out of range. If you want the summarizing task it should be 1, and if you want English to French it should be 0, but 2 causes the error at the line prompt = prompts[task]; that's the list-index-out-of-range exception causing the issue. And since the question is "Where is the Eiffel Tower?", there isn't much to summarize there: if the intent is to translate English to French, then instead of 2 it should be 0, and even for summarization it should be 1.
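A hedged reconstruction of the snippet being discussed (the exact names in the original exercise may differ), showing why task = 2 raises an IndexError and what the corrected call looks like:

```python
# Hedged reconstruction of the exercise described above; names may differ from
# the original. prompts has only indices 0 and 1, so task=2 raises IndexError
# ("list index out of range").
prompts = [
    "Translate English to French: {}",
    "Summarize the following text: {}",
]

def generate_prompt(task: int, text: str) -> str:
    return prompts[task].format(text)

# Buggy call: index 2 does not exist.
# generate_prompt(2, "Where is the Eiffel Tower?")

# Fix: use 0 for translation (or 1 for summarization).
print(generate_prompt(0, "Where is the Eiffel Tower?"))
```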
Okay, now the procedure to transition a synchronous LLM API to an asynchronous one while maintaining API contract integrity. Assuming this is in Python and the service uses FastAPI, FastAPI supports asynchronous calls, so the usual synchronous handlers can be converted to asynchronous ones using the async and await keywords; Python has async invocation built in, and the same can be done with Flask or Django as well. Since the question mentions API contract integrity, the API structure and the tech stack should not be changed; they should be retained, so I wouldn't think about completely swapping the stack for a better asynchronous setup. The routes and the request and response shapes stay the same; only the handlers and the LLM client calls become async/await.
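A minimal sketch of what that conversion could look like while keeping the contract intact: same route, same request and response shape, only the handler and the (hypothetical) LLM call become awaitable.

```python
# Minimal sketch: the same /chat contract served by an async handler. The route
# and response schema are unchanged; llm_complete_async is a hypothetical
# async client call.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    question: str

async def llm_complete_async(prompt: str) -> str:
    # placeholder: a real async SDK call would be awaited here
    return f"(answer to: {prompt})"

@app.post("/chat")
async def chat(req: ChatRequest):
    answer = await llm_complete_async(req.question)
    return {"answer": answer}   # same shape as the old synchronous endpoint
```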
Next is sketching a high-level diagram showing how I'd incorporate edge computing principles to distribute LLM inference. Suppose there is an LLM-based app with a front end and a back end; for inference I mostly go for serverless apps, so assuming I deploy this on AWS once the code on both sides is ready: the URL is served through Route 53, which routes requests to the app; from Route 53 it goes to CloudFront for distribution; from CloudFront the request goes to an Application Load Balancer so the application isn't disrupted when a high number of users come in; the ALB points at a target group, and the back-end code lives as a Docker image deployed on ECS, which is also serverless, so we aren't worried about scaling. Large numbers of requests are handled by the ALB and ECS. On the UI side, CloudFront is what provides the edge distribution: responses are cached at the user's nearest edge location rather than the origin. For the LLM inference itself, the model can be hosted as an inference endpoint, say on SageMaker if it was trained there; SageMaker endpoints can auto-scale based on the number of requests coming in, so latency isn't a problem, and serverless inference can be used. And if edge computing here means small edge devices, then we would mostly use small LLMs that aren't big in size but don't give up too much quality, and if we want to train such a model we'd use LoRA and PEFT (parameter-efficient fine-tuning) so that the number of trainable parameters is small enough for the model to be deployed on an edge device.
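For the SageMaker piece, a minimal sketch of the back end calling an inference endpoint with boto3; the endpoint name and payload shape are assumptions that depend on how the model was actually deployed.

```python
# Minimal sketch: calling a SageMaker inference endpoint with boto3, as in the
# architecture described above. Endpoint name and payload format are
# assumptions; they depend on the deployed model's serving container.
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

payload = {"inputs": "Summarize the uploaded report in two sentences."}
response = runtime.invoke_endpoint(
    EndpointName="llm-serverless-endpoint",   # hypothetical endpoint name
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(json.loads(response["Body"].read()))
```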
The last one is about optimizing an existing Python code base to streamline interactions with DALL·E for image generation. So an existing code base has some functionality that is currently working, and we want to add interaction with DALL·E for image generation. For DALL·E we have the OpenAI APIs, so those API calls get integrated into the Python code base wherever we want the interaction: when the user asks to generate an image, the code base calls the OpenAI API with the image description and whatever other parameters are needed, and we get the image back from DALL·E. Until the image comes back, we can show some kind of loading bar or other UI element so the user knows the image is being generated. I think that should cover it.
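A minimal sketch of that DALL·E call with the OpenAI Python SDK, assuming an OPENAI_API_KEY in the environment; the model, prompt, and size values are just illustrative.

```python
# Minimal sketch of the DALL·E call described above, using the OpenAI Python
# SDK (assumes `pip install openai` and OPENAI_API_KEY in the environment).
from openai import OpenAI

client = OpenAI()

result = client.images.generate(
    model="dall-e-3",
    prompt="A watercolor sketch of the Eiffel Tower at sunrise",
    size="1024x1024",
    n=1,
)
print(result.data[0].url)   # URL of the generated image
```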