
I didn’t start my career trying to work in artificial intelligence. I started by trying to make systems work reliably under pressure, at scale, and in the real world.
Over nearly six years, that mindset has shaped my journey as a senior engineer building production-grade intelligent systems. I have worked across startups and enterprise environments, repeatedly taking ideas from research and experimentation into reliable platforms that real users depend on every day.
What I enjoy most sits at the boundary between research and engineering: turning uncertain, complex ideas into systems that are observable, scalable, and economically viable. That has meant building and operating large language model–based platforms, retrieval-driven question answering systems, and high-performance inference infrastructure where latency, cost, and correctness are not theoretical concerns but daily production constraints.
In practice, my work has involved designing complete systems end to end: from data ingestion and retrieval, to ranking and response generation, to serving, monitoring, and continuous evaluation. I have spent significant time optimizing performance under real traffic, operating containerized production systems, and building automated evaluation pipelines to measure correctness, reasoning quality, and regression risk over time.
Senior Software Engineer, MLOps / Backend — Avaamo
Founding AI Engineer / Co-Founder — HedwigAI
Deep Learning Engineer — RadiusAI
Software Developer, FPGA Video Analytics — LightSpeedAI Labs
Hi, I'm Neville. I did my master's in physics at the Indian Institute of Science Education and Research, Mohali. During my five years of integrated bachelor's and master's study I held several internships across India, including at leading astronomy institutes, where I worked hands-on with large volumes of image data, learned data pre- and post-processing techniques, and built up my Python skills. That experience got me excited about computer science and deep learning. Once I graduated, I wanted to build things, so I joined a startup, LightSpeedAI Labs, my first company, where I learned most of my applied AI skills through hands-on projects. I started by building two algorithms: a face mask detection algorithm for Bosch, and a fraud detection project that won second prize and a $2,000 pool award, two achievements that helped the company get started without external funding. I then worked on the software engineering needed to move those models into real applications. Once I felt I had plateaued there, I moved to RadiusAI in 2022. There I built a number of deep learning systems, starting with 3D pose estimation: I trained models and optimized them for edge AI deployment, which is where I learned frameworks like NVIDIA TensorRT, deployment on Triton Inference Server, and building custom AI models to fit specific needs. Of the two major projects there, one was the pose work, where I designed a custom model architecture and reduced the model's latency by at least 15x.
After that, I worked on multi-camera tracking, associating and following people across three cameras. As the field evolved and generative AI took off, I teamed up with an ex-colleague to start HedwigAI, where I was responsible for building the platform from scratch. I built the search and RAG pipelines for image, video, audio, and document retrieval and augmentation, worked with small language models to generate text outputs, and built pipelines for generating multi-view images from a single image using diffusion models. Altogether, I have several years of experience across this field.
How would you deploy a machine learning model with CI/CD practices? One approach uses two Docker containers. The first container runs the inference script; its Dockerfile grants access to the GPU and defines the steps for both deployment and execution, so the script loads the model onto the GPU and serves inference. This inference container then coordinates with a second container running Triton Inference Server. If the model is converted from PyTorch to TorchScript, then to ONNX, and then to TensorRT, it can be deployed on the GPU in a far more optimized fashion. TensorRT is a heavily hardware-optimized framework: it works close to the bare metal, at the level of the matrix multiplications, and compiles the model weights into a graph that maps directly onto the GPU hardware. Since the hardware contains specialized NVIDIA cores for FP16 and INT8, the quantization scheme also determines how we load the model.
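The two-container setup described above can be sketched as a deployment script. The image names, tags, paths, and the `smoke_test.py` gate are illustrative assumptions, not a verified pipeline:

```shell
# Sketch of the two-container layout (illustrative names and tags).

# 1. Serve the converted model (TorchScript -> ONNX -> TensorRT plan placed
#    under ./model_repository/<model_name>/1/) with Triton Inference Server:
docker run --rm --gpus all \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v "$PWD/model_repository:/models" \
  nvcr.io/nvidia/tritonserver:24.05-py3 \
  tritonserver --model-repository=/models

# 2. Build and run the inference-client container, which calls Triton over
#    HTTP/gRPC; in CI, this build plus a smoke test gates every deployment:
docker build -t inference-client -f Dockerfile.client .
docker run --rm --network host inference-client python smoke_test.py
```

In a CI/CD pipeline, the model conversion, the image builds, and the smoke test would each be a stage, so a failed conversion or regression blocks the rollout.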
How would you implement a transformer model for a real-time translation service? For real-time translation we can use an autoregressive transformer, for example a decoder-only architecture. The model takes the words of the source language as input and converts them into tokens using a pre-trained tokenizer; those tokens are then mapped into a vector embedding space. The vector embeddings are combined with position embeddings, which encode where each token sits in the input sequence; this positional information is essential because self-attention by itself is order-invariant. Self-attention computes how strongly each word attends to, or depends on, every other word in the sequence. The transformer stacks multiple decoder blocks, and each block contains two main components: a self-attention layer and a feed-forward network, with a softmax layer and token predictor at the output, followed by the tokenizer's post-processing step. Concretely, say we start with k words and tokenize them into k tokens; after adding the position embeddings we have k vectors of, say, dimension 768 each. These pass through the attention layer and feed-forward network of the first decoder block, and the output flows on through the remaining decoder layers. Finally, the output layer passes through a softmax to produce the output tokens, which the post-processing tokenizer decodes into text in the target language. That is how a transformer model can be implemented for a real-time translation service.
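The self-attention step described above can be sketched in plain Python. This is a toy, single-head version; the two input vectors and the dimension of 4 are made-up illustrative values:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for a single head.

    queries/keys/values: lists of equal-length float vectors.
    Returns one output per query: a softmax-weighted mix of the values.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Two toy token embeddings of dimension 4; self-attention uses Q = K = V = x.
x = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, 1.0]]
out = attention(x, x, x)
```

Each output row mixes the value vectors according to how similar the tokens are, which is exactly the "how correlated is each word with every other word" computation described above.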
How would you build a content-based recommendation system using machine learning? Suppose we start with a set of newspaper articles. Each article gets a feature set describing its content, and we can apply TF-IDF modeling to estimate, for each document, how important each term is: the frequency of a word within the document, weighted against how common that word is across the whole corpus. The TF-IDF scores for each document form a vector. Given a new piece of content, we compute its TF-IDF vector the same way, take the cosine similarity between that input vector and the precomputed vectors of the catalog, retrieve the top-k most similar items, and recommend those. That is a content-based recommendation system.
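The TF-IDF plus cosine-similarity pipeline above can be sketched with the standard library alone; the three tiny "articles" are made-up toy data:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build TF-IDF vectors for a list of tokenized documents."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))   # document frequency
    vocab = sorted(df)
    idf = {t: math.log(n / df[t]) for t in vocab}
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append([tf[t] / len(doc) * idf[t] for t in vocab])
    return vocab, vecs

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def recommend(query_vec, vecs, k=2):
    """Indices of the top-k catalog items most similar to the query."""
    ranked = sorted(range(len(vecs)),
                    key=lambda i: cosine(query_vec, vecs[i]),
                    reverse=True)
    return ranked[:k]

docs = [["stocks", "market", "rally"],
        ["election", "votes", "poll"],
        ["market", "crash", "stocks"]]
vocab, vecs = tfidf_vectors(docs)
top = recommend(vecs[0], vecs, k=2)  # article 0 itself, then article 2
```

A production system would swap the hand-rolled TF-IDF for a library vectorizer and an approximate nearest-neighbor index, but the ranking logic is the same.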
How would you handle imbalanced datasets when training a machine learning model? One option is to adjust the training schedule: a cosine or step-function learning-rate schedule can reduce the impact the imbalance has on the loss function. Second, we can rebalance the data itself: oversample the under-represented classes, or generate synthetic data matching the distribution of the classes where examples are missing, to remove the imbalance. Or, if the dataset is large, we can subsample the input dataset, start from a pretrained model, and fine-tune it on the rebalanced data.
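Two of the rebalancing ideas above, weighting the loss by inverse class frequency and oversampling the minority class, can be sketched in plain Python; the ten fraud/ok labels are made-up toy data:

```python
import random
from collections import Counter

def class_weights(labels):
    """Inverse-frequency class weights: rarer classes get larger weights,
    so they contribute proportionally to the loss."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

def oversample(samples, labels, seed=0):
    """Randomly duplicate minority-class samples until all classes match
    the size of the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(samples), list(labels)
    for cls, cnt in counts.items():
        pool = [x for x, y in zip(samples, labels) if y == cls]
        for _ in range(target - cnt):
            out_x.append(rng.choice(pool))
            out_y.append(cls)
    return out_x, out_y

labels = ["fraud"] * 2 + ["ok"] * 8          # 2:8 imbalance
weights = class_weights(labels)               # fraud weighted 4x heavier
_, balanced = oversample(list(range(10)), labels)
```

In a real pipeline the weights would be passed to the loss function (most frameworks accept per-class weights directly), and synthetic generation would replace naive duplication when duplicates risk overfitting.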
How would you reduce the inference time of a deep learning model without significantly affecting its accuracy? If we start with raw PyTorch weights, inference time is relatively high, but we can convert those weights to ONNX and then to TensorRT. TensorRT-optimized weights run much faster because they are hardware-optimized and can be loaded directly onto the GPU; during the conversion we can also use the training dataset for calibration, so the model's accuracy does not take a hit while the inference time drops. The second approach is quantization-aware training: if the weights are in FP32 or FP16, we can convert them to INT8. Quantization-aware training takes a subset of the training data and, while quantizing the FP32 weights down to INT8, recalibrates the weights against that subset. INT8 operations are substantially faster than FP32 operations because the cost of the matrix-multiplication operations drops significantly. Using either of these techniques, converting the PyTorch or TensorFlow model to TensorRT, or using a quantization-aware training approach, can significantly reduce inference time without affecting accuracy.
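The core idea behind INT8 quantization, mapping a float range onto 8-bit integers via a single scale factor, can be shown in a stdlib-only sketch. This is a toy illustration of symmetric post-training quantization, not a TensorRT calibration routine, and the four weights are made-up values:

```python
def quantize_int8(weights):
    """Symmetric quantization of float weights to INT8.

    Maps [-max_abs, max_abs] onto [-127, 127]; the scale factor is all
    that is needed to dequantize later.
    """
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the INT8 values."""
    return [qi * scale for qi in q]

w = [0.51, -1.27, 0.02, 1.0]
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))  # bounded by scale/2
```

Quantization-aware training goes further: it simulates this rounding during fine-tuning so the weights adapt to the quantization error instead of merely tolerating it.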
Reviewing this function: the first issue is that the categorical variables are converted to numerical encodings using a unique identifier per category. That works, but the categories may overlap semantically, so rather than simple ordinal IDs we can use a sentence-embedding model such as Instructor embeddings, which let you condition the embedding on the task at hand, say clustering, category discovery, or querying, by specifying an instruction. The Instructor model is a text-to-vector model that embeds the categories in a much more expressive and maintainable way. The second drawback is that if the order of the categories changes, or categories are added or removed, the encoding changes immediately and the model must be retrained to account for it, since the category-to-ID mapping has shifted. So it is generally better to use a stable text-to-vector embedding model for categorical variables like these.
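The brittleness of order-dependent IDs can be demonstrated directly. As a stand-in for a real text-embedding model, the sketch below hashes the category name into a small fixed vector; this only illustrates the stability property, not semantic similarity, and `dim=8` is an arbitrary choice:

```python
import hashlib

def ordinal_encode(categories):
    """Order-dependent IDs: the mapping silently changes whenever the
    category list is reordered or a new category is inserted."""
    return {c: i for i, c in enumerate(categories)}

def stable_encode(category, dim=8):
    """Order-independent toy encoding: hash the category name into a
    fixed-size vector. A real system would use a text-embedding model
    (e.g. Instructor, as described above) to also capture meaning."""
    digest = hashlib.sha256(category.encode()).digest()
    return [b / 255.0 for b in digest[:dim]]

v1 = ordinal_encode(["red", "green", "blue"])
v2 = ordinal_encode(["blue", "red", "green"])  # same categories, new order
s1 = stable_encode("green")                    # unaffected by list order
s2 = stable_encode("green")
```

With ordinal IDs, "green" gets a different code in `v1` and `v2` even though nothing about the category changed; the name-derived encoding is identical in both cases.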
Reviewing this feature-scaling function: the important issue is that if the max and min values are equal, or if there are already NaN values in the column, then applying the scaling will make the data frame column values blow up or propagate NaNs, and it will no longer be a valid feature scaling. So we need to filter out the NaN values, and test that the max and min values actually differ, before applying min-max scaling.
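A corrected scaler with both guards can be sketched in plain Python (the choice to map a constant column to zeros, rather than drop it, is one reasonable convention among several):

```python
import math

def minmax_scale(values):
    """Min-max scale to [0, 1], guarding against the two failure modes
    noted above: NaNs in the column and a zero (max - min) range."""
    clean = [v for v in values if not math.isnan(v)]
    if not clean:
        return []
    lo, hi = min(clean), max(clean)
    if hi == lo:                      # constant column: no information
        return [0.0] * len(clean)     # (or drop the feature entirely)
    return [(v - lo) / (hi - lo) for v in clean]

scaled = minmax_scale([2.0, float("nan"), 4.0, 8.0])  # NaN dropped
constant = minmax_scale([5.0, 5.0])                   # no division by zero
```

In a pandas pipeline the same guards correspond to imputing or dropping NaNs first and checking the column range before dividing.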
How would you build an image classifier using TensorFlow? One approach is to write the full architecture by hand: define a stack of CNN layers that encode the image, and end with a softmax layer sized to the number of classes, which predicts the category each image belongs to. The second approach is transfer learning: start from a pretrained model such as ResNet-50, which is a strong image-encoding backbone trained on ImageNet, add a classifier layer on top of it to predict our classes, and fine-tune the model on the custom dataset. Since the backbone has already been trained on ImageNet, transfer learning lets us reach good classification accuracy on the custom dataset with far less data and training time.
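The transfer-learning idea, freeze the pretrained backbone and train only a small classifier head, can be illustrated framework-free: below, fixed feature vectors stand in for pooled ResNet-50 embeddings, and a logistic-regression head is trained on top of them. The features and labels are made-up toy data:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_head(features, labels, lr=0.5, epochs=200):
    """Train a logistic-regression head on frozen feature vectors
    with plain stochastic gradient descent on the log loss."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y                              # dLoss/dLogit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Pretend these are pooled backbone embeddings for two visual classes.
feats = [[1.0, 0.1], [0.9, 0.0], [0.1, 1.0], [0.0, 0.9]]
labels = [1, 1, 0, 0]
w, b = train_head(feats, labels)
```

In Keras the equivalent is loading the backbone with `include_top=False`, setting `trainable=False` on it, and fitting only the new dense layer; the gradient flow is confined to the head exactly as here.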
How would you handle audio time-series data for a machine learning model? Audio data is essentially amplitude as a function of time. We can convert any audio signal into a spectrogram: a 2D, image-like plot of frequency versus time, where each pixel represents the amplitude at that particular time for that particular frequency. Taking a fast Fourier transform of the audio signal over sliding windows gives the amplitude of the signal at each frequency in each window; arranging those frequency-amplitude slices over time defines the spectrogram, a 2D image map. That image map can then be passed to a standard convolutional neural network and used to embed the audio signal as well.
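The spectrogram construction above can be sketched with the standard library. For brevity this uses a naive O(n²) DFT and no windowing function; real code would use an FFT (e.g. `numpy.fft`) with a Hann window, and the pure test tone is made-up data:

```python
import math

def dft_magnitudes(frame):
    """Magnitude of the discrete Fourier transform of one frame
    (naive O(n^2) version, fine for a sketch)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(-2 * math.pi * k * t / n)
                 for t, x in enumerate(frame))
        im = sum(x * math.sin(-2 * math.pi * k * t / n)
                 for t, x in enumerate(frame))
        mags.append(math.hypot(re, im))
    return mags

def spectrogram(signal, frame_size=64, hop=32):
    """Slide a window over the signal and stack the per-frame spectra
    into a 2D time-frequency map (rows = time, columns = frequency)."""
    return [dft_magnitudes(signal[i:i + frame_size])
            for i in range(0, len(signal) - frame_size + 1, hop)]

# A pure tone at 8 cycles per 64-sample frame: energy peaks at bin 8.
tone = [math.sin(2 * math.pi * 8 * t / 64) for t in range(256)]
spec = spectrogram(tone, frame_size=64, hop=32)
```

Each row of `spec` is one time step and each column one frequency bin, which is exactly the 2D map a CNN can consume after log-scaling and normalization.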
What do you consider when developing machine learning models for edge devices? Several factors. First is the inference time; second is the model size; third is the required input image size and the RAM needed to process the input and load it into the GPU. These three factors are critical because on edge devices you typically want an inference latency of a few milliseconds, and if the model is extremely heavy you simply cannot achieve that. Second, if the model is too large to fit in the GPU's memory, that is a problem in itself. Third, if the input image is too large, say full HD quality, it is extremely difficult to fit that kind of data on an edge device, so the model needs to take a smaller input size that fits within the GPU or RAM of the edge device.
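The size budgeting described above reduces to simple arithmetic that is worth doing before any deployment. A minimal sketch, using a ResNet-50-sized parameter count (~25.6M) as the example and deliberately ignoring activations, workspace, and framework overhead:

```python
BYTES_PER_DTYPE = {"fp32": 4, "fp16": 2, "int8": 1}

def model_footprint_mb(num_params, dtype):
    """Rough weight-memory footprint: parameter count x bytes per value.
    (Ignores activations, workspace, and framework overhead.)"""
    return num_params * BYTES_PER_DTYPE[dtype] / (1024 ** 2)

def input_footprint_mb(width, height, channels=3, dtype="fp32"):
    """Memory needed to hold a single input image tensor."""
    return width * height * channels * BYTES_PER_DTYPE[dtype] / (1024 ** 2)

# A ResNet-50-sized model (~25.6M parameters) at different precisions:
fp32_mb = model_footprint_mb(25_600_000, "fp32")   # ~98 MB
int8_mb = model_footprint_mb(25_600_000, "int8")   # ~24 MB after INT8
hd_input_mb = input_footprint_mb(1920, 1080)       # one full-HD FP32 frame
```

Even this crude estimate shows why INT8 quantization and a reduced input resolution are usually the first two levers pulled when targeting a memory-constrained edge GPU.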