
Experience in fine-tuning and prompt engineering of LLMs such as GPT-3.5, Llama-2, and Mistral, including RAG models. Proven expertise in Generative AI, LangChain, OpenAI models, LlamaIndex, RAG, Hugging Face, and LLM fine-tuning. My journey in AI started with a strong foundation in Electrical and Electronics Engineering from MIT College of Engineering, Pune, which has been instrumental in developing my analytical and problem-solving skills. With a focus on RAG and chatbot technologies, I have built intelligent systems that have significantly improved client interaction and service delivery. My commitment to innovation and my collaborative approach have been key to delivering projects that not only meet but exceed client expectations, fostering a culture of excellence and continuous improvement within the organization.
AI/ML/LLM Engineer
Applied AI Consulting
Programming Analyst Trainee
Cognizant
ML Engineer
UshaiTechLabs Pvt Ltd
Python

MATLAB

Django
Docker

Kubernetes

Terraform

Elasticsearch

Kibana

Kafka

RabbitMQ
Datadog

Argo CD
Cucumber

Smartsheets

LangChain

PostgreSQL
AWS (Amazon Web Services)
FastAPI
Implemented changes in line with social media platforms' API changelogs, maintaining compatibility with evolving APIs and ensuring seamless integration with those platforms.
Monitored Datadog dashboard for system performance and discrepancies, ensuring 99.9% uptime and proactively identifying potential issues.
Automated Smartsheets data updates, reducing manual effort by 80% and ensuring data accuracy and consistency.
Okay, let me give some more background. So, about myself: I started working at Applied AI in August 2019. Initially, I was working on Python and AWS, and my first task was to turn a desktop app into a serverless app. I used AWS and the Serverless Framework to convert it into an app with multiple APIs, built on serverless AWS services such as S3, DynamoDB, Lambda, and SQS. So, yeah, that was a product. Later on, I kept working on serverless APIs and AWS, and I created a tool where the user can generate APIs for a microservice, which I call a serverless API generator. That was developed completely by me. After that, I was placed on a customer project for a data analytics platform, where the customer can track the progress of their social media handles as well as their competitors'. We collected the analytics from Facebook, Instagram, Twitter, and LinkedIn, and all of these analytics were shown to the customer in a UI. Keeping that app up and running, handling any issue that occurred, filling in the bad data, and all those things were handled by me. Then I was part of a project where I used the ChatGPT (OpenAI) APIs for a health analytics platform. The company was trying to build an IVR bot: the bot calls an IVR number, listens to the call, presses the numbers automatically, and gets all the data required for a particular patient. There, AWS Transcribe was used for speech-to-text, and Comprehend was used to identify the entities. After the advent of ChatGPT, I mostly worked on internal products. I built a machine learning testing platform where users can come in with their ML models and upload sample data.
From that data, synthetic data is generated for testing the robustness of the model, exploratory testing of the model, and stress testing of the model. So we generated the synthetic data in different categories. This synthetic data can then be used to test the ML model before it goes into production: the test runs all the generated data against the model and produces a report along with the test-related metrics. Also, I built a chatbot for internal company FAQs. Previously, users had too long a turnaround time for getting issues resolved that could be answered directly from the website data or other relevant materials. So that material was integrated into a RAG app where the sources are defined: a source can be a website, a PDF, or YouTube videos. From that, a chatbot was developed, and users can come in, ask their questions to this chatbot, and get the answer. I also designed an automated test case generation platform where the user can upload a Selenium recording, and from that recording we generate the automation tests. So, yeah, that's pretty much it.
So, the next one: implementing a FastAPI service that interfaces with an LLM. So, yeah, FastAPI: I used it in one of my projects, recently, for one of the clients, where we had to design an Excel-to-DB service. There were too many Excel files, and they wanted that data to be part of a DB so that it could be queried. I used FastAPI to design the APIs for that, where different queries can be sent to the API, and the answers come from those Excel sheets, which had been converted to a database format, including designing the schema and all that. So, when we want a FastAPI service to interface with LLMs, basically FastAPI would expose some APIs, for example for a chatbot, where a user can come in and ask a question to that API, and that API would be interacting with the LLM. The integration of FastAPI with the LLM can use LangChain as a backend layer, where FastAPI calls the LangChain methods and functions. LangChain is widely used nowadays for building chatbots with a RAG-based approach. It supports most of the top LLMs, whether that is OpenAI or an open-source LLM. So I think LangChain can handle the integration for that.
Propose a testing strategy for evaluating the coherence of responses from a dialogue system. Okay. So what I understand is that the dialogue system can be assumed to be a chatbot powered by ChatGPT, where the user comes in and a conversation builds up: the user asks a question, gets an answer, then asks a follow-up question. Right? So, for evaluating the coherence of a response: any LLM giving out responses can hallucinate at times, so we need a testing strategy where the output is tested against all of these things. It should not be biased; it should be fair. All these things need to be tested, along with whether the current response is coherent or not. For this, there are multiple libraries. One is an evaluation library for LLMs, which tells you whether the response is biased or not, its fairness coefficient, and how coherent it is, by measuring the similarity between the question asked and the response.
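As a rough illustration of the similarity check mentioned in this answer, here is a minimal sketch that scores each question and response pair with bag-of-words cosine similarity and flags low-scoring turns. The function names and the 0.1 threshold are assumptions made for this example, not part of any specific evaluation library; a real setup would more likely use embeddings, and would test bias and fairness separately.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the word-count vectors of two texts."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    # An empty text has no direction, so treat it as completely dissimilar.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def flag_incoherent(turns: list[tuple[str, str]], threshold: float = 0.1) -> list[int]:
    """Return the indices of (question, response) turns scoring below threshold."""
    return [i for i, (q, r) in enumerate(turns) if cosine_similarity(q, r) < threshold]
```

Word overlap is a crude proxy: a correct answer that uses different vocabulary from the question would be flagged, so the flagged turns are candidates for review rather than confirmed failures.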
In what ways can multithreading be leveraged in a Python-based LLM application to improve performance? So, if there is a Python-based LLM application, multithreading is mostly useful when multiple users come in at the same time, so that multiple questions are being asked to the LLM at a similar time. We can apply threading there, where multiple threads invoke the LLM in the backend at the same time, taking the questions from all users at once, so the answers to all those questions are fetched at the same time. That's one use for multithreading, and it's the most important part for the user experience, because users would not have to wait too long. That would be one of the primary places where we can use multithreading. Apart from that, I think it also helps when new data comes in: if you want to reindex for the LLM in a RAG-based pipeline because new resources are coming in, or if we have to train the LLM again. For data from different sources, or for multiple clients if it's deployed that way, the training or indexing part can use multithreading to get the data from multiple sources at once. In that way, we save time compared with a sequential way of getting all the data, indexing it, and saving it into a vector database.
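The concurrent-request case from this answer can be sketched with the standard library's thread pool. `call_llm` is a hypothetical stand-in for a blocking LLM API call, with a sleep simulating network latency. Threads help here because the work is I/O-bound: each thread spends its time waiting on the network, not holding the interpreter.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def call_llm(question: str) -> str:
    """Hypothetical stand-in for a blocking, network-bound LLM call."""
    time.sleep(0.2)  # simulate waiting on the remote model
    return f"answer to: {question}"


def answer_all(questions: list[str], max_workers: int = 8) -> list[str]:
    """Send all questions concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(call_llm, questions))
```

The same pattern would apply to the indexing case: map a fetch-and-index function over the list of sources instead of over the list of questions.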
What protocol would you implement to live-update a leaderboard of prompt performance metrics? So, live updating the leaderboard of prompts. We can have a RabbitMQ queue, or really any messaging queue service or protocol. If it's in AWS, we can have SQS, right? So SQS, or open-source messaging queues like Kafka or RabbitMQ, where the application continuously sends the prompt metrics to this queue, and this queue is integrated with the leaderboard. One of the projects I did involved integrating a Kafka messaging queue with Datadog. Datadog was the dashboard where all these metrics were collected for the data analytics platform, from Facebook and all the other social media platforms. The metrics are continuously sent to that Datadog dashboard, so the dashboard was the central point where we kept an eye on things. If a metric drops under a threshold, like 97 or 98%, the color of that box turns red, so we can identify it, and we also get an email when it drops, so we can then debug what the issue is with that retrieval or service. So we can have this RabbitMQ or Kafka messaging queue integrated with the application or model, which continuously sends the prompt metrics to the Datadog dashboard, where they are updated live.
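The producer, queue, and leaderboard flow described here can be sketched in-process. In this sketch `queue.Queue` stands in for RabbitMQ, Kafka, or SQS, and the `Leaderboard` class stands in for the dashboard; both are assumptions for illustration, as are the message fields `prompt_id` and `score`.

```python
import queue
import threading
from collections import defaultdict


class Leaderboard:
    """Keeps a running mean score per prompt and ranks prompts by it."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._totals = defaultdict(float)
        self._counts = defaultdict(int)

    def update(self, prompt_id: str, score: float) -> None:
        with self._lock:
            self._totals[prompt_id] += score
            self._counts[prompt_id] += 1

    def ranking(self) -> list[tuple[str, float]]:
        """Prompts sorted by mean score, best first."""
        with self._lock:
            means = {p: self._totals[p] / self._counts[p] for p in self._totals}
        return sorted(means.items(), key=lambda kv: kv[1], reverse=True)


def consume(q: queue.Queue, board: Leaderboard) -> None:
    """Apply metric messages to the leaderboard until a None sentinel arrives."""
    while True:
        msg = q.get()
        if msg is None:
            break
        board.update(msg["prompt_id"], msg["score"])


def process(messages: list[dict]) -> list[tuple[str, float]]:
    """Publish messages to the queue, let the consumer drain it, return the ranking."""
    q, board = queue.Queue(), Leaderboard()
    worker = threading.Thread(target=consume, args=(q, board))
    worker.start()
    for msg in messages:
        q.put(msg)
    q.put(None)  # sentinel: no more metrics
    worker.join()
    return board.ranking()
```

With a real broker, the producer and consumer would run in separate processes and the consumer would never stop; the sentinel exists only so this sketch terminates.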
The question is about server-side caching mechanisms in Python for frequent LLM queries. Yeah. So, the chatbot that I had built had different mechanisms apart from getting the answers from the vector DBs. It also had an option for whether to save the conversation in the history or not. When that history is being saved, along with the questions asked and the answers given, we keep the user history so that the next time the user comes in, they have that history of what questions they asked, similar to ChatGPT. That is saved into the DB along with the user details, the question, and the response they got. And for the server side, I mostly work in AWS. In AWS, when server-side caching is implemented, the application is mostly served using CloudFront. CloudFront is an AWS service that has multiple edge locations, so that for a person sitting in Mumbai, even if the server is in North Virginia, the request from Mumbai has less latency. The caching is implemented in CloudFront itself, so that the next time the user asks a question, it is served from the nearest edge location and not from the original server location. So that is one thing. So, in short: one option is saving a persistent state in a database, and the other is using CloudFront for caching.
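Besides the database history and CloudFront options in this answer, a third common option is an in-process cache in the Python service itself. The sketch below is an assumption-labeled illustration: `TTLCache`, `normalize`, and `cached_answer` are hypothetical names, and `llm` is any callable that takes a question and returns an answer.

```python
from __future__ import annotations

import time
from typing import Callable


class TTLCache:
    """Small in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float = 300.0) -> None:
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, str]] = {}

    def get(self, key: str) -> str | None:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # drop the stale entry
            return None
        return value

    def set(self, key: str, value: str) -> None:
        self._store[key] = (time.monotonic() + self.ttl, value)


def normalize(question: str) -> str:
    """Collapse case and whitespace so trivially different questions share a key."""
    return " ".join(question.lower().split())


def cached_answer(question: str, llm: Callable[[str], str], cache: TTLCache) -> str:
    """Return a cached answer when available; otherwise call the LLM and store it."""
    key = normalize(question)
    hit = cache.get(key)
    if hit is not None:
        return hit
    answer = llm(question)
    cache.set(key, answer)
    return answer
```

This only matches questions that are identical after normalization. If several server processes need to share the cache, the same get and set logic could sit in front of an external store instead of a dictionary.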
The question here is why some images receive incomplete descriptions. Okay. So the pipeline is defined with the image path. Yeah. The pipeline has been defined, the image path is passed to the function, and that function invokes the pipeline with the image path. But the max length is 50, and that is the parameter that limits the output: the description cannot be longer than 50 characters. That is the reason why some of the images receive incomplete descriptions. If the description is too long, it will be clipped, or cut off at around the 50-character length. So that is what I found. As for how to debug this, we would take a trial approach with max length. We'll see what kind of images we want to test on, and based on those images or the complete dataset, we can define a max length that would suffice for almost all of the images, which surely is not 50. For descriptions, 50 characters is far too little. So we pick some standard length for a description, or a larger one if you want each description to be lengthy. In that way, we can define that. So, yeah.
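One way to make the trial approach concrete is to pick the limit from a sample of reference descriptions instead of guessing. This is a minimal sketch under assumptions: `suggest_max_length` is a hypothetical helper, and lengths are counted in characters as in the answer. If the real parameter counts tokens, the same calculation would apply to token counts.

```python
import math


def suggest_max_length(descriptions: list[str], coverage: float = 0.95) -> int:
    """Smallest length that fits at least `coverage` of the samples untruncated."""
    if not descriptions:
        raise ValueError("need at least one sample description")
    lengths = sorted(len(d) for d in descriptions)
    # Index of the sample at the requested coverage quantile.
    index = max(0, math.ceil(coverage * len(lengths)) - 1)
    return lengths[index]
```

Setting `coverage=1.0` returns the longest sample length, which avoids truncating any of the sampled descriptions at the cost of a larger limit.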
Generate prompts. So the function is generate prompt, and a task is sent in along with the text. For this example, we are sending task as 2 and the text "Where is the Eiffel Tower?". Prompts is a list of prompt texts, but the length of prompts is 2, and we are sending in 2 as the index. So prompts[2] will be out of range, because Python uses 0-based indexing. If you want the second task, summarizing, it should be 1. If you want English to French, it should be 0. But 2 causes the error at prompts[task]: an index out of range exception. That is what is causing the issue. Looking at the question, "Where is the Eiffel Tower?", I don't think there is much to summarize here. So if you want to translate English to French, it should be 0, and if you do want to summarize, it should be 1.
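The original function is not shown, so the following is a hypothetical reconstruction of the bug and a defensive fix. The names `PROMPTS` and `generate_prompt` and the template wording are assumptions; the point is that a two-item list only accepts indices 0 and 1, and that validating the index gives a clearer error than a bare `IndexError`.

```python
# Hypothetical templates: index 0 translates, index 1 summarizes.
PROMPTS = [
    "Translate English to French: {text}",
    "Summarize: {text}",
]


def generate_prompt(task: int, text: str) -> str:
    """Build the prompt for a 0-based task index, rejecting out-of-range values."""
    if not 0 <= task < len(PROMPTS):
        raise ValueError(
            f"task must be between 0 and {len(PROMPTS) - 1}, got {task}"
        )
    return PROMPTS[task].format(text=text)
```

Calling `generate_prompt(0, "Where is the Eiffel Tower?")` selects the translation template, while passing 2 now fails with a message that names the valid range.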
Okay. The question is the procedure to transition a synchronous LLM API to an asynchronous LLM API while maintaining API contract integrity. So, a synchronous LLM API. Assuming this is mostly in Python, we can use Python's built-in asynchronous support: the usual synchronous calls can be converted to asynchronous ones by using the async and await keywords. FastAPI also supports asynchronous calls, so that should not be an issue, and it can be done as well if we are using Flask or Django. So, from the code perspective, you can do that. Since the question is about API contract integrity, the API's structure and the tech stack should be retained, not changed. That's why I would not think about it from a different perspective, such as completely changing the tech for better asynchronous calls. We have async and await from Python itself, so we can use those for the asynchronous architecture.
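A minimal sketch of that transition, assuming the existing blocking call can stay as it is: the async version wraps it with `asyncio.to_thread` (Python 3.9+) and keeps the same arguments and the same response shape, which is what preserves the contract. `complete_sync` and its response fields are hypothetical stand-ins for the real API.

```python
import asyncio
import time


def complete_sync(prompt: str) -> dict:
    """Hypothetical existing blocking call; its return shape is the API contract."""
    time.sleep(0.1)  # simulate a slow LLM request
    return {"prompt": prompt, "completion": f"echo: {prompt}"}


async def complete_async(prompt: str) -> dict:
    """Same arguments and return shape, but the event loop is not blocked."""
    return await asyncio.to_thread(complete_sync, prompt)


async def complete_many(prompts: list[str]) -> list[dict]:
    """Run several completions concurrently, preserving input order."""
    return await asyncio.gather(*(complete_async(p) for p in prompts))
```

Running the old and new paths against the same inputs and comparing the responses is a simple contract check during the migration. If the underlying client offers a native async method, awaiting that directly would avoid the worker thread.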
So, I mostly go for serverless apps. Assuming I would deploy this on AWS, right: once the app is ready code-wise, the front end is ready and the back end is ready, the front-end code will be deployed. Suppose the code resides somewhere, maybe as a package in an AWS Lambda function or some other way. The URL will be served using Route 53, which routes the URL to the actual app. Then, from Route 53, the request goes to the CDN for content distribution, and for distribution we would use CloudFront. From CloudFront, it goes to the application load balancer, so that if a high number of users come in, the application won't be disrupted too much. From there, we have the target group specified. The back-end code for this app will reside as a Docker image and will be deployed on ECS, because ECS is also serverless, so we are not worried about scaling. ECS is the target of the ALB: from the ALB, the request goes to ECS, so a large number of requests is handled by the ALB and ECS together. And from the UI side, as I said, the request goes to Route 53 and then CloudFront. CloudFront is basically responsible for distributing the UI, so that the requests coming in are served from the cache at the user's nearest edge location, right? For the LLM inference, the model can be hosted as an inference endpoint, say if the model is trained using SageMaker. SageMaker has serverless endpoints, which can auto-scale based on the number of requests coming in, so latency is not a problem there, and we can use serverless inference for this. And as for edge computing, assuming that means small devices:
For that, we would mostly use small LLMs, which are small in size without giving up too much quality, so that they can be deployed on edge devices. And if we want to train that model, we would use LoRA and PEFT, parameter-efficient fine-tuning, so that the number of trainable parameters is very small and the model stays small enough to be deployed on an edge computing device.
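To make the parameter saving from LoRA concrete: for a frozen weight matrix of shape d_out × d_in, LoRA trains two low-rank matrices of shapes d_out × r and r × d_in, so it trains r · (d_out + d_in) parameters instead of d_out · d_in. The short calculation below uses a 4096 × 4096 layer with rank 8 as assumed example values, not figures from any particular model.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when fine-tuning the full weight matrix."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for LoRA factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


# Assumed example sizes: one 4096 x 4096 projection, LoRA rank 8.
full = full_params(4096, 4096)
lora = lora_params(4096, 4096, rank=8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
```

This saving applies to training. The base model's weights are still needed at inference time, so fitting on an edge device still depends on choosing a small base model.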
Optimizing an existing code base to streamline interactions with DALL-E involves several steps. For an existing Python code base, one approach is to integrate the OpenAI APIs into the existing functionality while adding the necessary interactions for image generation. The existing code base already has working functionality defined, and we want to add interactions with DALL-E for image generation on top of it. For DALL-E, we have the OpenAI APIs, which are integrated into the Python code base for those interactions. When the user asks for an image, the Python code base calls the OpenAI API with the description of the image and the other parameters to generate it. Until we get the image back from DALL-E, we can show a loading bar or some other UI element to tell the user that the image is being generated.