Results-driven and adaptable Data Scientist / Machine Learning Engineer with a successful track record of managing multiple priorities and delivering high-quality solutions. Proficient in NLP and machine learning techniques, with a focus on automating the processing of large volumes of text data. Recognized for expertise in using ML to analyze and extract insights from complex documents, such as bonds uploaded to the London Stock Exchange. Adept at developing and deploying unified models for classification and extraction, using SageMaker pipelines and hyperparameter tuning for optimal performance. Skilled in intelligence gathering, statistical analysis, and data mining, with a strong emphasis on attention to detail and written communication. A proactive problem solver with a passion for leveraging generative AI for information warfare. Experienced as a technical-entry officer in the Armed Forces, integrating the latest technology into existing frameworks and using ML to automate troop movement and supply management. Holds an M.Tech in Data Science and a B.Tech in Computer Science, along with certifications in machine learning, deep learning, cloud services, and more. Actively engaged in data science projects, including Eurobonds analysis at the London Stock Exchange and machine learning projects on Kaggle. A self-motivated professional with excellent organizational and time management skills.
Manager - AI and Automation, EY
Data Scientist Manager, Affine
Data Scientist Lead, Affine
Application Developer, Oracle India
Data Scientist / Technical Officer, Indian Army
Data Scientist, London Stock Exchange Group
System Engineer, TCS
AWS Cloud
Amazon SageMaker
Azure
Keras
PyTorch
Hugging Face
TensorFlow
neural network architectures
Python
LLM
LLAMA
Generative AI
Machine Learning
ETL pipelines
ChatGPT
Snowflake
Spark
My name is YK Tripathi, and I am a data scientist. I completed my M.Tech in data science at BITS Pilani. Before that, I worked with the Indian Army for 6 years, prior to that with Oracle, and prior to that with TCS. I hold a B.Tech in computer science and engineering from IERT, Allahabad. Presently, I'm working with London Stock Exchange Group as an NLP data scientist. My role at LSEG is to extract meaningful information from the vast number of PDFs and other documents uploaded to the stock exchange, which arrive in a variety of formats: sometimes structured, sometimes semi-structured, sometimes unstructured. We have to process them and make sense of them so that meaningful data can be extracted and used by the company. Apart from that, I hold the Azure Fundamentals (AZ-900) certification and a Python for Data Science certification, and I have extensive work experience with AWS SageMaker, which is a data science platform.
Okay, so for significant changes and trends within a dataset: if the dataset contains numerical data, we can perform statistical analysis to identify them. Significant changes are often outliers, so if we do a box plot of any numerical column, the outlier data points can be indicative of significant changes. We can identify trends with various plots, such as seaborn's lmplot, where you see a trend line corresponding to the latest trend; if the data is a time series, we can identify the trend there as well. There are also automated tools that can be used directly, for example Tableau, where you just drag and drop the target column onto the desktop view or dashboard and see whichever trend lines you want. Significant changes can also be detected programmatically: you create a pandas DataFrame and write some code, including common-sense checks. For example, if we're talking about age, a human age cannot extend to 200 years. So these are the three ways you can analyze significant changes and trends.
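A minimal sketch of the programmatic checks described above, assuming a pandas DataFrame with a numeric "age" column and a "day"/"sales" pair for the trend; the file and column names are illustrative only.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")  # hypothetical input file

# Box plot: points beyond the whiskers are candidate "significant changes".
sns.boxplot(x=df["age"])

# The same outliers, flagged programmatically with the 1.5 * IQR rule.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]

# Common-sense check: a human age should never approach 200 years.
impossible_ages = df[df["age"] > 120]

# Trend line: lmplot fits a regression line over the scatter (x must be numeric).
sns.lmplot(data=df, x="day", y="sales")
plt.show()
```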
Okay. To analyze large volumes of data, first and foremost we need to use cloud services in some form: AWS, Azure, Google Cloud, and so on. Data can be stored in a data lake, or sometimes the previous implementation has a relational database management system capable of handling a large amount of data; data warehouses and data lakes can also be used. Apart from that, there is stream processing and analytics, where the large amount of incoming data is processed on the go and then presented to the user, using tools like Kafka and Spark. There is also the big data ecosystem, such as Hadoop, but I would rather use Kafka and Spark. Another technique for handling a large volume of data is to break it down into chunks: you take a small amount of data first and keep going, compartmentalizing the data. That is also a good option when the restriction is on spending, because cloud resources are costly. So that is how I would handle large volumes of data.
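As a small illustration of the chunking approach (rather than the Kafka/Spark route), here is a sketch that aggregates a large CSV in fixed-size slices with pandas; the file and column names are hypothetical.

```python
import pandas as pd

totals = {}
# Read 100,000 rows at a time instead of loading the whole file into memory.
for chunk in pd.read_csv("transactions_large.csv", chunksize=100_000):
    partial = chunk.groupby("customer_id")["amount"].sum()
    for customer, amount in partial.items():
        totals[customer] = totals.get(customer, 0) + amount

# Merge the partial aggregates into one result.
result = pd.Series(totals).sort_values(ascending=False)
print(result.head())
```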
Sure. In tailoring a data presentation to different audience types, I would go from the top. Say you have the business people, who need to understand the business value of your solution: for them, we first present the business value and the supporting graphs. For example, if I want to shift my workload to the cloud, I need to explain how the cloud helps in the longer run and how OpEx and CapEx will reduce over time, because they are mostly interested in the money aspect. If I want to present something to technical people, say I built a model and I'm presenting it to the DevOps team that will implement it, then I need to cover the changes they need in the technical description, the types of instances, and the nitty-gritty of the code they will want to look at. And if I'm presenting to a client, it's the end-to-end experience: how they send an input, how they get an output, how the output performs, and what the output means for them. That is how I think about tailoring a data presentation to different audiences.
Okay. In order to effectively communicate insights to stakeholders, in the first phase I need to define what the problem is and how the particular solution is going to solve it. Basically, I first make them understand the problem statement, and then how the solution would be implemented and how it would address that issue. For example, say we analyze data and find that a company's product sells well to people aged 19 to 30 but sells much less to people aged 60 to 70; then we need to target that audience more, and maybe differentiate the strategy for how we promote that product in the market. Or say there are products that men generally buy but women tend not to; then we need to work out how to change the marketing strategy to target that audience. So to communicate insights effectively, we first formalize the problem and then explain how we are going to solve it.
For data representation, I generally prefer coding; I'm a coder at the core, so I prefer Matplotlib, Seaborn, and a very good browser-based library called Plotly. They are very good tools that can be used directly in Python code: they take in a pandas DataFrame and you can manipulate it on the go. That is what I prefer the most, and the basic reason is that, because of the coding aspect, I can modify things on the fly; I'm not restricted by the formats or other constraints of the automated tools, and I can have whatever representation I want. Ultimately, whenever you use a tool, there is code running in the back end that somebody has productionized, so we can just skip to the chase and write our own code. That said, in some cases there is definitely a benefit to automated tools like Tableau and Power BI, for instance when you have a huge amount of data and it's highly improbable that you can make sense of it by hand. There we can use Tableau or Power BI, but in most cases the Python libraries will do. So that is my preferred data representation.
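A small sketch of that workflow: a pandas transformation followed by an interactive Plotly Express chart. The file and column names are hypothetical.

```python
import pandas as pd
import plotly.express as px

df = pd.read_csv("sales.csv")

# Any pandas manipulation can happen on the fly before plotting.
monthly = df.groupby("month", as_index=False)["revenue"].sum()

# Plotly renders an interactive chart in the notebook or browser.
fig = px.bar(monthly, x="month", y="revenue", title="Monthly revenue")
fig.show()
```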
Yes, definitely. We need to analyze the marketing data based on ad performance, website traffic, and social media metrics such as sentiment analysis of posts: how many people are talking about it, how a new product launch is being received, and how we can make it trend. If we look at graph analytics, we can identify which nodes are the influential ones and pay those influential nodes so that our product or marketing message permeates effectively throughout the graph; I'm treating the network as a graph, and a social networking site is naturally a graph, so we can use that. Apart from that, we need to understand what exactly in our website or ads connects with users the most, and look for the influential points that are most effective in the bigger picture. We should also consider how we acquire new users; there is a technique called geofencing, where a shop can detect registered users in close proximity and send them tailored messages or offers, which may improve the performance of the overall system.
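A minimal sketch of the graph-analytics idea, using networkx (one common choice) to rank influential nodes; the edge list here is a toy example, not real data.

```python
import networkx as nx

# Each edge means "user A interacts with user B" on the social network.
edges = [("alice", "bob"), ("bob", "carol"), ("carol", "alice"),
         ("dave", "alice"), ("erin", "alice")]
G = nx.Graph(edges)

# PageRank (degree or betweenness centrality would also work) scores influence.
scores = nx.pagerank(G)

# The top-scoring users are the candidates worth targeting with the campaign.
top_influencers = sorted(scores, key=scores.get, reverse=True)[:3]
print(top_influencers)
```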
Okay, let me start from the very beginning, when I was working with TCS. I single-handedly migrated a project that was heavily dependent on Oracle's web server and an Oracle 9i database; I modified it so that you only need to change a property file specifying which web server and which database services you are going to use, and the application shifts completely to them. That was the first major application of my problem-solving abilities, and I kept on doing that; there are many examples I could quote from the Army. Presently, at LSEG as a data scientist, there was an issue with PDFs: we were getting the PDFs but not clean sentences, and since we use sentence classification, we need well-formed sentences. But how can we get them when the data is unstructured? That was a big problem, and I solved it with a library called PyMuPDF; my sentence splitter has been implemented, gone into production, and is being used across all of the projects running now. Apart from that, there were issues with few-shot learning using the SetFit model, where we needed to combine multiple fields in a single model due to cost constraints. What I did was freeze the shared model and fit a logistic regression for every field on top of it, and that solved it.
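A simplified sketch of the PyMuPDF extraction step mentioned above; the real production sentence splitter is more involved, and the file name here is hypothetical.

```python
import fitz  # PyMuPDF

doc = fitz.open("bond_prospectus.pdf")

# Collect the plain text of every page in reading order.
text = ""
for page in doc:
    text += page.get_text("text")

# Naive sentence split, standing in for the actual splitter logic.
sentences = [s.strip() for s in text.replace("\n", " ").split(". ") if s.strip()]
print(len(sentences), "candidate sentences")
```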
Okay. To optimize data storage and retrieval, first and foremost we need to get a hold on how many connections to the data storage we have: we should use connection pooling so that there are not many open connections hogging the bandwidth and making the entire system slow. Second, we need to design a redundancy strategy and normalize our data so that we are not storing redundant information again and again when it can be avoided. Apart from normalization, for retrieval we need proper indexing so that lookups are fast, and we need to implement caching so that frequently accessed data is not searched for again and again. Then we need auto-scaling: if more requests come in, we don't want to stop the services and redeploy; auto-scaling should be in place beforehand. All of these are things I would use to optimize data storage and retrieval.
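A minimal sketch of three of those optimizations (pooling, indexing, caching), assuming a PostgreSQL database accessed through SQLAlchemy; the connection URL, table, and column names are hypothetical.

```python
from functools import lru_cache
from sqlalchemy import create_engine, text

# Connection pooling: reuse a fixed set of connections instead of opening a new
# one per request.
engine = create_engine(
    "postgresql://user:password@localhost/appdb",
    pool_size=10,
    max_overflow=5,
)

# Indexing: make the frequently searched column fast to look up.
with engine.begin() as conn:
    conn.execute(text(
        "CREATE INDEX IF NOT EXISTS idx_orders_customer ON orders (customer_id)"
    ))

# Caching: keep frequently requested aggregates in memory.
@lru_cache(maxsize=1024)
def total_for_customer(customer_id: int) -> float:
    with engine.connect() as conn:
        row = conn.execute(
            text("SELECT COALESCE(SUM(amount), 0) FROM orders WHERE customer_id = :cid"),
            {"cid": customer_id},
        ).fetchone()
    return float(row[0])
```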
Yeah. As I said before, the data visualization stack I generally use is Python-based, which makes it much easier for me to manipulate whatever data I want to represent; I don't have to bend over backwards just to present something. Until now, my work has been focused mostly on predictions, so the best way for me to present that is through Python visualization tools like Matplotlib, Seaborn, and Plotly. To present digital marketing data, bar charts can be a good help, as can something like seaborn's lmplot, where you see a trend line directly, along with time series plots. Apart from that, you can use box plots, in which you can clearly see the outliers, where the median lies, and the other measures of centrality and dispersion (standard deviation and so on) that you're looking for. So that is what I believe.
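A brief sketch of how such marketing metrics could be plotted with the Python stack mentioned above; the file and column names are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("marketing_daily.csv", parse_dates=["date"])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Time series of website traffic with a 7-day rolling mean as the trend line.
ax1.plot(df["date"], df["visits"], alpha=0.4, label="daily visits")
ax1.plot(df["date"], df["visits"].rolling(7).mean(), label="7-day trend")
ax1.legend()

# Bar chart comparing total conversions per acquisition channel.
by_channel = df.groupby("channel")["conversions"].sum()
ax2.bar(by_channel.index, by_channel.values)
ax2.set_ylabel("conversions")

plt.tight_layout()
plt.show()
```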