Vetted Talent

Akash rajpuria

Vetted Talent

Senior NLP Scientist | Proven Leader in Driving Business Impact with AI

I am a highly motivated Senior NLP Scientist with 7+ years of experience in leveraging cutting-edge Natural Language Processing (NLP) techniques to solve complex business problems and generate significant ROI. My expertise lies in building and deploying scalable NLP solutions across various industries, including healthcare, e-commerce, and pharmaceuticals.

Throughout my career, I have consistently demonstrated a strong ability to:

Lead and mentor high-performing data science teams. I have experience spearheading teams of data scientists and engineers in designing, developing, and deploying NLP solutions.

Deliver impactful results through NLP innovation. I have a proven track record of building NLP models that have increased user engagement by 20%, reduced customer pain points by 300,000, and boosted revenue by 3.7%.

Bridge the gap between technical expertise and business needs. I excel at translating complex NLP concepts into actionable insights for non-technical stakeholders, ensuring data-driven decision making.

Embrace new technologies and stay ahead of the curve. I am passionate about staying current with the latest advancements in NLP, actively utilizing tools like Hugging Face, OpenAI, and PySpark to tackle Big Data challenges.

I am confident that my leadership skills, technical depth, and business acumen make me a valuable asset to any team looking to leverage the power of NLP for real-world impact.

Role
Sr. NLP Scientist
Years of Experience
17.8 years

Skillsets

Dimensionality Reduction
Reinforcement Learning
NLU
Speech Recognition
Neural Network
Rnn
LSTM
Transformer
Recommendation System
LangChain
Cnn
TF-IDF
Bow
CRF
Fastext
Unsupervised Learning
Time Series
conversational AI
API
Apache Spark
PySpark
Docker
OCR
Computer Vision
LLM
Artificial Intelligence
Deep Learning
Statistics
Big Data
GPT
BERT
Natural Language Processing
Image Processing
OpenAI
Generative AI

Vetted For

10Skills

Roles & Skills
Results
Details

Machine Learning Engineer IIAI Screening
38%

Skills assessed :A/B Test Design, ChatGPT, Complex SQL Queries, ETL pipeline, llm prompt engineering, Natural Language Processing, Snowflake, Spark, machine_learning, Python
Score: 34/90

Professional Summary

17.8Years

Jun, 2024 - Present2 yr 1 month
Senior Data Scientist
EPAM Systems
Jan, 2023 - Jun, 20241 yr 5 months
Lead Data Scientist - NLP- PFIZER
MResult
Mar, 2022 - Jan, 2023 10 months
Lead NLP SCIENTIST
Navana.ai
Aug, 2019 - Jan, 20211 yr 5 months
Machine Learning Specialist
Jio Platforms Limited (JPL)
Teacher
Jan, 2021 - Mar, 20221 yr 2 months
Founding Member NLP
Navana.ai
Jun, 2019 - Aug, 20201 yr 2 months
Data Science Instructor
upGrad
Aug, 2018 - Jan, 20212 yr 5 months
Subject Matter Expert (Deep Learning)
upGrad
Jun, 2018 - Aug, 20191 yr 2 months
Data Scientist
Jio Platforms Limited (JPL)

Applications & Tools Known

ChatGPT
Hugging Face
Dialogflow
GCP
AWS
Git
Azure
Tableau
Keras
Tensorflow
NLTK
Sklearn
Pandas
Flask
Kaldi

Work History

17.8Years

Senior Data Scientist

EPAM Systems

Jun, 2024 - Present2 yr 1 month

Lead Data Scientist - NLP- PFIZER

MResult

Jan, 2023 - Jun, 20241 yr 5 months

Lead Data Scientist working on GENAI projects, Medical Compendium Generation, Clinical Document Classification, Health Recommendation Engine, Adverse Drug Event Prediction System, Data Creation and Redaction, Laptop Recommendation App, and automated various Outlook tasks.

Lead NLP SCIENTIST

Navana.ai

Mar, 2022 - Jan, 2023 10 months

Founding Member NLP

Navana.ai

Jan, 2021 - Mar, 20221 yr 2 months

Founding Team member who dealt with multilingual Ordering Solution and Text Classification system.

Teacher

I have taught students at no cost , student which cannot pay

Machine Learning Specialist

Jio Platforms Limited (JPL)

Aug, 2019 - Jan, 20211 yr 5 months

Data Science Instructor

upGrad

Jun, 2019 - Aug, 20201 yr 2 months

Subject Matter Expert (Deep Learning)

upGrad

Aug, 2018 - Jan, 20212 yr 5 months

Data Scientist

Jio Platforms Limited (JPL)

Jun, 2018 - Aug, 20191 yr 2 months

Designed noise-robust Speech Recognition tool, implemented Ranking Algorithm and Growth Analysis Algorithm.

Achievements

Top Scorer in DBA 800/DBA 801
11% top scorer for Data Science in MTECH by BITS across pan INDIA
Youngest guest faculty for Data Science in MUMBAI

Education

Doctorate of Data Science
Golden Gate University (2025)
MTech. Data Science
Birla Institute of Technology and Science, Pilani (2021)
BTech in Electronics
RCOEM (2018)

Certifications

Top instructor from more than top institutes
Nlp expert by upgrad
Speech recognition by microsoft

AI-interview Questions & Answers

Could you help me understand more about the background by giving a brief introduction about it?

Describe any project you have extensively used in production. What is the achievement you need using LLM, and what are the challenges you are facing? So the project which I have worked with LLM is right now the redaction project. So the redaction project is from Pfizer. How it works is you have the redaction project using LLM. The task was we have very sensitive medical data, and this was used in production. For example, let's say a name of an employee is or let's say name of a patient is Raju. Raju is suffering from acidity, and he is taking this medicine this many times. This is his phone number. This is his email address. So the complete information is stored in your database. Now this information is very sensitive, and if you want to use it for any purpose, let's say, machine learning purpose or developing any model or you want to use it internally, anywhere, you can't use this data directly. It does not follow all the guidelines. So what you need to do is you need to immediately make it redacted. Redacted means that all the crucial information is hidden. Now this hidden information, for example, Raju's name is now changed with some other name so that the real information with the dummy name, his age, his name, his phone number, his email address is redacted with x is written or some dummy name is written so that it can be used for modeling purpose also, and it is not linked. So to do this purpose, we have used large language models. Now inside that, first, identifying the name because the major challenge here is why we can't do this with machine learning? Because the people's name, let's say the people's name is Helen, and there is a medicine name, which is also, let's say, Helen 10 20 mg. Now identifying this, whether it is a name or a medicine name, this is very important because we don't want to redact the medicine name. We want the dosages so that we can train a model later on. To identify them, we use 11 models here. Now 11 models can understand the context very well, and even if you give some name which LLM has not seen, it can recognize that. So what we did here was we used the LLM models, and we tried to redact the names. We also fine-tuned the model here. Now to fine-tune the model, to improve the performance and accuracy, this is a medical data. It's very important. What we did was we created a dummy data with the help of LLM only, like, using prompt engineering. We created the data. After creating the data, we redacted that also, and the same thing we used to fine-tune it also. So we used the OpenAI model to fine-tune it, and we used Azure's OpenAI here, and we've trained that model. After that, we got a very good accuracy here around 97.3 percent, on this redaction where phone numbers, email addresses, and the name of a person was redacted very perfectly with a 97.3% F1 score.

Okay. What steps would you take to mitigate Snowflake compute cost while running complex JSON aggregation. So I have not worked with Snowflake, so I would not be able to answer this one.

Imagine you have a complex SQL query consisting of multiple joints that is running slower than expected. What areas of component, query, or data involved would you look into 1st to optimize the performance? So, that's a school of multiple joints and the operations that is running slower than expected. What are the areas of component or data involved would you be? Okay. So we are using here c dot star a dot star and b dot value on table a, table b, and we are status. Okay. We want to optimize this. So I think we are using here a dot star. Instead of using star, we can take a particular value. That is one optimization we can do. Then we are using here 2 joins. B is joined on a dot ID and b dot ID. Okay. And join table c on c dotid and a dotid. Okay. Then we'll see status as active. We can use having clause also here, to make it faster.

Would you how would you leverage different elements to create a WhatsApp chatbot for travel inquiries?

I think this platform is not very smooth. I have missed that question. A lot of questions I have missed. How do you design a scalable slot to handle degradation? I have not scaled it and stuff. I have not designed Spark, but I have used Spark jobs by Spark. I have used it where we can take a larger volume of data, and we can apply map reduce on top of it, and we can make it super fast. So, I have worked on that but not designing.

Suppose there is a tie PyTorch model for image classification that is yielding low accuracy than expected. What steps would you examine within the machine learning model evaluating process to debug and improve the model performance? Okay, so we have a model classifier with entropies. We've used item optimizer for a box and image labels class. We are using an optimizer with a gradient, and we're then evaluating the model. Okay? While evaluating the model, we found that it's yielding low accuracy. If it's yielding low accuracy, we can try to change the loss function. We can try to change different optimizers, such as the Adam optimizer used there. We can try to use different optimizers and see if we can change the learning rate. The first thing that will come to mind is changing the hyperparameters, specifically the learning rate. That's okay? Then we can add some more processes to improve the accuracy, like early stopping and checkpoints. We can do that. We can check whether it's overfit or underfit. If it's overfit, we can apply techniques like changing the learning rates. We can also try to use regularization to prevent overfitting. And if it's underfit, we'll try to augment the data more, increase more data because it's an image classifier, so we can do that.

You are given with the following as well query that is intended to calculate the average rate of what it is missing a clause, and it's currently returning an incorrect result. Identify the missing clause that is

import pandas as pd # Create the first DataFrame df1 = pd.DataFrame({ 'id': [1, 2, 3], 'score': [10, 20, 30] }) # Create the second DataFrame df2 = pd.DataFrame({ 'id': [1, 2, 3], 'value': [40, 50, 60] }) # Merge the two DataFrames on the 'id' column merged_df = pd.merge(df1, df2, on='id') # Rename the 'score' column in the merged DataFrame to 'score' merged_df = merged_df.rename(columns={'score_x': 'score', 'score_y': 'value'}) # Compute the mean of the 'score' column mean_score = merged_df['score'].mean() print(mean_score)