Vetted Talent

YK Tripathi

Results-driven and adaptable Data Scientist and Machine Learning professional with a successful track record of managing multiple priorities and delivering high-quality solutions. Proficient in NLP and machine learning techniques, with a focus on automating the processing of vast amounts of text data. Recognized for expertise in using ML to analyze and extract insights from complex documents, such as bonds uploaded to the London Stock Exchange. Adept at developing and deploying unified models for classification and extraction, using SageMaker pipelines and hyperparameter tuning for optimal performance. Skilled in intelligence gathering, statistical analysis, and data mining, with a strong emphasis on attention to detail and written communication. A proactive problem solver with a passion for leveraging generative AI for Information Warfare. Joined the Armed Forces through technical entry, integrating the latest technology into existing frameworks and using ML to automate troop movement and supply management. Holds an M.Tech in Data Science and a B.Tech in Computer Science, along with certifications in machine learning, deep learning, cloud services, and more. Actively engaged in data science projects, including Eurobonds analysis at the London Stock Exchange and machine learning projects on Kaggle. A self-motivated professional with excellent organizational and time management skills.

  • Role

    Manager - AI and Automation

  • Years of Experience

    13.2 years

Skillsets

  • Glue
  • Quantum Computing
  • Agentic AI
  • Automation
  • AWS
  • Azure
  • Data Science
  • Docker
  • GCP
  • Generative AI
  • Git
  • GitLab
  • Database management
  • Google Cloud
  • Java
  • Lambda
  • LangGraph
  • LangChain
  • LangSmith
  • Machine Learning
  • MLOps
  • Monitoring
  • Reinforcement Learning
  • Transformers
  • Data Engineering
  • NLP - 7 Years
  • Deep Learning - 5 Years
  • Python - 7 Years
  • Statistical analysis
  • SageMaker - 2 Years
  • SQL - 6 Years
  • A/B testing - 6 Years
  • Artificial Intelligence
  • BigQuery
  • Cloud Services
  • Clustering
  • Python - 5 Years
  • Data Visualization
  • Decision Trees
  • Feature Engineering
  • Mathematics
  • Product Management
  • Statistical Modeling
  • Statistics
  • Web Analytics

Vetted For

  • Digital Data Scientist (AI Screening)
  • Score: 76/100

Professional Summary

13.2 Years
  • Apr, 2025 - Present 1 yr 2 months

    Manager

    EY
  • Manager - AI and Automation

    EY
  • Jun, 2024 - Jan, 2025 7 months

    Project Manager

    Affine
  • Apr, 2017 - Mar, 2023 5 yr 11 months

    Data Scientist

    Government of India
  • Mar, 2023 - Mar, 2024 1 yr

    Data Scientist

    LSEG
  • Mar, 2024 - Nov, 2024 8 months

    Consultant Data Scientist

    Affine
  • Mar, 2016 - Dec, 2016 9 months

    Application Developer

    Oracle
  • Jan, 2014 - Feb, 2016 2 yr 1 month

    System Engineer

    TCS

Applications & Tools Known

  • AWS Cloud
  • Amazon SageMaker
  • Azure
  • Keras
  • PyTorch
  • Hugging Face
  • TensorFlow
  • Neural network architectures
  • Python
  • LLM
  • LLaMA
  • Generative AI
  • Machine Learning
  • ETL pipelines
  • ChatGPT
  • Snowflake
  • Spark

Work History

13.2 Years

Manager

EY
Apr, 2025 - Present 1 yr 2 months

Manager - AI and Automation

EY
    Works with agentic AI frameworks, MCP, and other state-of-the-art solutions. Improved system efficiency by 15% by monitoring the solution with LangSmith.

Project Manager

Affine
Jun, 2024 - Jan, 2025 7 months
    Increased project delivery efficiency by 45% for Shutterstock by managing offshore teams. Managed 4 different projects for the same client. Delivered high-priority insights and played a pivotal role in the design of several solutions.

Consultant Data Scientist

Affine
Mar, 2024 - Nov, 2024 8 months
    Led the KHC project, resulting in a budget increase of $110,000. Raised operational efficiency by 70% through automation of freight. Improved prediction accuracy by using GPT models to read notes. Played a critical role in designing the generative AI solution.

Data Scientist

LSEG
Mar, 2023 - Mar, 2024 1 yr
    Automated the processing of vast amounts of text data in various forms as an NLP data scientist. Improved data extraction accuracy by 60% by designing ML-based PDF retrieval systems. Developed SageMaker Pipelines for classification model deployment and testing.

Data Scientist

Government of India
Apr, 2017 - Mar, 2023 5 yr 11 months
    Increased operational efficiency by 25% by integrating advanced technology into military systems. Conducted satellite and radar data analysis using ML techniques. Researched the automation of troop movement and generative AI applications.

Application Developer

Oracle
Mar, 2016 - Dec, 2016 9 months
    Improved deployment efficiency by 25% by optimizing the design and planning workflow. Enhanced operational efficiency, reducing downtime by 30% through expert collaboration in troubleshooting and debugging. Developed application scripts and updated technical documentation.

System Engineer

TCS
Jan, 2014 - Feb, 2016 2 yr 1 month
    Amplified system efficiency by 40% by resolving 150 technical issues and proposing 5 design solutions. Improved IT compliance efficiency by 30% by developing comprehensive IT policies.

Achievements

  • Awarded the Gold Award at LSEG for contributions to the Eurobonds project
  • Spearheading a GenAI-centric project for a global MNC client
  • Automating and processing the vast amount of text data using NLP techniques
  • NLP Automation Success
  • Model Optimization
  • Tech Integration Leader
  • ML Competition Ranking

Major Projects

3 Projects

Generative AI - Advanced GAN with CelebA Dataset

Nov, 2023 - Dec, 2023 1 month
    Generative AI project using GANs for image generation. Augmented datasets by creating 10,000 high-quality images with a GAN model from the CelebA dataset.

Titanic - Machine Learning from Disaster

Nov, 2022 - Dec, 2022 1 month
    Machine learning project predicting survival based on dataset analysis. Improved prediction accuracy by building predictive models to analyze survival rates using passenger data.

NLP with Disaster Tweets

Nov, 2021 - Dec, 2021 1 month
    NLP project for monitoring Twitter communications during emergencies. Boosted disaster response efficiency by developing a machine learning model for classifying disaster-related Tweets.

Education

  • M.Tech in Data Science

    BITS Pilani (2024)
  • B.Tech (Computer Science)

    IERT (2013)
  • PG Diploma

    Savitri Bai Phule University (2019)

Certifications

  • Deep Learning

    Kaggle (Oct, 2010)
  • Python for data science

    Udemy (Sep, 2009)
  • Java

  • Feature engineering

  • Intermediate machine learning

  • AZ-900

  • SQL

  • Intro to GIS

AI-interview Questions & Answers

I am YK Tripathi, and I am a data scientist. I have completed my MTech in data science from BITS Pilani. Before that, I worked with the Indian Army for six years. Prior to that, I worked with Oracle, and prior to that, with TCS. I have a BTech in computer science and engineering from IERT, Allahabad. I'm currently working with the London Stock Exchange Group as an NLP data scientist. Basically, my role at LSEG is to extract meaningful information from the vast number of PDFs and other types of data that are uploaded to the stock exchange in various formats: sometimes structured, sometimes semi-structured, sometimes unstructured. We have to process it and make sense of it so that meaningful data can be extracted from it and used by the company. Apart from that, I am an Azure Fundamentals (AZ-900) certified professional, and I am certified in Python for data science. I also have extensive work experience with AWS SageMaker, which is a data science platform.

Okay, so for significant changes and trends within the dataset, if the data is numerical in nature, we can perform many statistical analyses to identify them. Basically, significant changes typically show up as outliers. If we do a box plot of any numerical data, the outliers can be indicative of significant changes, and we can identify trends with various plots, like line plots, where you would see a trend line corresponding to the latest trends. If the data is time-series data, then we can identify trends over time. There are also automated tools which can be used directly. Take Tableau, for example: you just drag and drop the target column into the Tableau desktop view or dashboard, and there you can see whichever trend lines you want. Significant changes can also be detected programmatically, where you create a pandas DataFrame and then write some code, such as common-sense checks. Let's say we are talking about age. Age cannot extend to 200 years for humans, right? So you can analyze significant changes and trends in these three ways: plots, automated tools, and programmatic checks.
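
The programmatic route mentioned above can be sketched with the standard library alone. This is a minimal illustration, not the speaker's actual code: the function names, the `k=1.5` whisker multiplier (the usual box-plot convention), and the `max_age=120` threshold are all assumptions chosen for the example.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR], the usual box-plot whisker rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def failed_age_check(ages, max_age=120):
    """Common-sense check: flag ages that are negative or implausibly large.

    max_age is a hypothetical threshold picked for this sketch.
    """
    return [a for a in ages if a < 0 or a > max_age]

ages = [23, 31, 27, 45, 38, 29, 200, 34]  # made-up sample with one bad record
print(iqr_outliers(ages))      # values a box plot would draw as outlier points
print(failed_age_check(ages))  # values that fail the sanity rule
```

Both checks flag the 200-year entry here, but they answer different questions: the IQR rule is relative to the data's spread, while the sanity check encodes domain knowledge and works even when most of the column is wrong.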

To analyze large volumes of data, first and foremost, we need to use cloud services such as AWS, Azure, Google Cloud, or Adobe Cloud. Data can be stored in a data lake or in relational database management systems, which are capable of handling large amounts of data. Data warehouses, data lakes, and similar systems can all be used. Apart from that, there is a newer concept called SPA, stream processing and analytics, in which large amounts of data are processed directly and presented to the user. We can use Kafka, Spark, and similar tools. Additionally, big data can be handled using Hadoop, but I would prefer Kafka and Spark. This is how I think we can handle large volumes of data. Another technique is to break the data down into chunks. We take a small amount of data first, process it, and then continue with the next piece, compartmentalizing the data. This is one way to handle large volumes of data when there are restrictions on spending, as cloud resources are costly.
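
The chunking idea can be illustrated with a small generator-based sketch. It assumes the records arrive as any Python iterable (a file reader, a database cursor, and so on); `iter_chunks`, `chunked_mean`, and the chunk size are hypothetical names and values used only for this example.

```python
from itertools import islice

def iter_chunks(records, chunk_size):
    """Yield successive lists of at most chunk_size items from any iterable."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk

def chunked_mean(records, chunk_size=1000):
    """Compute a mean while holding only one chunk in memory at a time."""
    total, count = 0.0, 0
    for chunk in iter_chunks(records, chunk_size):
        total += sum(chunk)
        count += len(chunk)
    return total / count if count else 0.0

# A generator stands in for a large file or table scan.
print(chunked_mean((x for x in range(1, 10_001)), chunk_size=1000))
```

Only aggregates that can be combined across chunks (sums, counts, minima, maxima) work this simply; statistics such as an exact median need a different approach.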

Yeah. So in the process of tailoring a data presentation to different audience types, I would go from the top. Let's say you have business people, the people who need to understand the business value of your solution. So first, we need to present the business value of that solution. We need to present the graphs. Let's say I want to shift my work to the cloud. I need to explain how the cloud is going to help in the longer run, how my OpEx and CapEx are going to reduce over a period of time, or whatever the changes are. They will be interested in the money aspect. Now let's say I want to present to technical people. For example, I have built a model and I want to present it to the DevOps team who will implement it. In that case, I need to look at what technical details they would need: what type of instances, what nitty-gritty of the code they would want to look at. And let's say I want to present it to a client. Then the question is what exactly the end-to-end experience is. How would he send an input? How would he get an output? How would the output perform, and what would it mean for him? So that is how I think about tailoring a data presentation to different audience types.

In order to effectively communicate insights to stakeholders, it's first and foremost necessary to define exactly what the problem is and how a particular solution is going to solve it. So, first, I need to make them understand the problem statement, how the solution would be implemented, and how that would solve this particular issue. For example, suppose we analyze the data and see that a company selling a product to people between 19 and 30 years old sells much less of it to the 60 to 70 age group. We need to target that audience more, and we may need different strategies for how we promote that product in the market. Or let's say there are products generally bought by men but not by women. We need to understand how we'll change our marketing strategy to target that particular audience. So, to communicate an insight effectively, we first formalize the problem and then determine how we're going to solve it.

So for data representation, I generally prefer coding. I'm a coder at core, so I prefer Matplotlib and Seaborn (sns). There is also a very good library called Plotly, which uses the internet. These are very good tools that are directly usable in Python code. They can take in a pandas DataFrame, and you can manipulate it on the go. That is what I prefer the most, and the basic reason is that, because of the coding aspect, I am able to modify things on the fly. I'm not restricted by the format or other constraints of the automated tools. I can have whatever representation I want. Ultimately, there is code running in the backend of those tools too, code which has been automated and productionized by some people. So we can cut to the chase and write our own code. But, yeah, definitely, in some cases there is a benefit to using automated tools. Power BI and Tableau are useful when you have a large amount of data and it's highly improbable that you can make sense of it by analyzing it by hand. But in most cases we can use Python libraries. So that is my preferred approach to data representation.

We need to perform a thorough analysis of the marketing data based on ad performance, website traffic, and social media metrics, such as sentiment analysis of posts, how many people are talking about the product, and how the new product is being received. We need to determine how to make it trend. To do this, we can look into graph analytics, which can help us identify influential nodes, and we can pay those nodes to spread our product or marketing message throughout the graph. I'm using "graph" loosely here; it can be anything. A social networking site is a graph, so we can use that. We also need to understand what on our website or in our ads is connecting with users the most. To do this, we need to look for the influential points that are most effective when we consider the bigger picture. Apart from that, if we're acquiring new users, we can use geofencing. For example, if I have a shop that can detect registered users in close proximity, I can send them tailored messages or offers to increase overall performance.
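
One simple way to find "influential nodes" is degree centrality, sketched below. The answer above does not name a specific measure, so this choice is an assumption, and the edge list and the names in it are invented for illustration.

```python
from collections import defaultdict

def degree_centrality(edges):
    """Map each node to its degree divided by (number of nodes - 1)."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    n = len(neighbours)
    if n < 2:
        return {node: 0.0 for node in neighbours}
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}

def top_influencers(edges, k=1):
    """Return the k highest-scoring nodes; ties are broken alphabetically."""
    scores = degree_centrality(edges)
    return sorted(scores, key=lambda node: (-scores[node], node))[:k]

# Hypothetical undirected "who interacts with whom" edges.
follows = [("ana", "bo"), ("ana", "cy"), ("ana", "di"), ("bo", "cy"), ("di", "eli")]
print(top_influencers(follows, k=2))
```

Degree only counts direct connections. On a real social graph, measures such as PageRank or betweenness centrality may rank nodes quite differently.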

So first and foremost, I started working with TCS. There, I single-handedly migrated a project that was heavily dependent on Oracle Web Server and the Oracle 9i database. I modified it so that you only need to change a property file, in which you specify which web server and database services you're going to use, and the application shifts over to them completely. That was the first real application of my problem-solving abilities. After that, I kept on doing that. There are many examples I could quote from the army, but presently, at LSEG as a data scientist, there was an issue with PDFs. We were getting the PDFs, but we were not getting exact sentences, and we use sentence classification, so we need clean sentences. But how can we get them when the data is unstructured? That was a big problem, and I solved it using the library called PyMuPDF. My sentence splitter has been implemented, has gone into production, and is now being used across all of those projects. Apart from that, there were some issues with a few-shot learning model called SetFit, in which we needed to modify the SetFit model itself. We needed to combine multiple fields in a single model due to cost constraints. So what I did was freeze the head and apply a logistic regression model to every field in the head, and that solved it.
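
To make the sentence problem concrete, here is a rough sketch of the kind of clean-up that text pulled from a PDF usually needs before sentence classification. It is not the production splitter described above, whose details are not given. It assumes the raw page text has already been extracted (for example with PyMuPDF), and the regular expressions and sample text are illustrative only.

```python
import re

def split_sentences(raw_text):
    """Rejoin hard-wrapped lines, then split on sentence-ending punctuation."""
    # Undo hyphenation across line breaks: "pay-\nable" -> "payable".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw_text)
    # Collapse remaining newlines and runs of whitespace into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    # Split after ., ! or ? when followed by whitespace and an uppercase letter or digit.
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z0-9])", text)
    return [p for p in parts if p]

# Made-up page text with the line breaks a PDF extractor typically leaves behind.
page_text = (
    "The notes mature in 2030. Interest is pay-\n"
    "able semi-annually\n"
    "in arrear. The issuer may redeem early."
)
for sentence in split_sentences(page_text):
    print(sentence)
```

A rule this simple will mis-split on abbreviations such as "No. 5" and will drop the hyphen from a genuinely hyphenated word that happens to break at a line end, which is part of why sentence extraction from unstructured PDFs is hard.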

To optimize data storage and retrieval, first and foremost, we need to get a hold on how many connections we have to the data store. We need to pool them. We need to use connection pooling so that there aren't many open connections hogging the bandwidth and making the entire system slow. Second, we need to design a redundancy strategy and normalize our data so that we're not storing the same information again and again. Apart from normalization, we need proper indexing so that retrieval is fast. We need to implement caching so that data which is accessed frequently is not searched for again and again. Then we need auto scaling, so that if the number of requests grows, we don't have to stop the services, reconfigure, and start them again. Auto scaling has to be implemented beforehand. So these are all the things I would use to optimize data storage and retrieval.
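
Two of the points above, indexing and caching, can be shown in a few lines with SQLite and `functools.lru_cache`. This is a sketch only: the table, column names, and rows are invented, and connection pooling and auto scaling are not covered because they depend on the database and platform in use.

```python
import sqlite3
from functools import lru_cache

# In-memory database standing in for a real data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO users (email, city) VALUES (?, ?)",
    [("a@example.com", "Pune"), ("b@example.com", "Delhi"), ("c@example.com", "Pune")],
)
# Index the column used in WHERE clauses so lookups need not scan the whole table.
conn.execute("CREATE INDEX idx_users_city ON users (city)")

@lru_cache(maxsize=128)
def emails_in_city(city):
    """Cached lookup: repeated calls for the same city skip the database."""
    rows = conn.execute(
        "SELECT email FROM users WHERE city = ? ORDER BY email", (city,)
    )
    return tuple(email for (email,) in rows)

print(emails_in_city("Pune"))
print(emails_in_city("Pune"))        # second call is answered from the cache
print(emails_in_city.cache_info())   # hit and miss counters
```

A cache like this returns stale results once the underlying rows change, so real systems pair it with an invalidation or expiry policy (here, `emails_in_city.cache_clear()` after writes would be the blunt option).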

So as I told you before, the data visualization that I generally use is from Python, which makes it much easier for me to manipulate whatever data I want to represent. I don't have to bend over backwards just to represent something. Basically, till now, my work has been focused on predictions and things like that. So the best way for me to present that would be through the data visualization tools of Python, like Matplotlib, Seaborn, and Plotly. To present digital marketing data, bar charts can be a good help, as can something like lm plots, where you see a trend line directly. Apart from lm plots, trend lines, and time series plots, you can have box plots, in which you can clearly see what the outliers are, what exactly your median, mode, and the other measures of centrality are, and what numbers you're looking at for dispersion, I mean, standard deviation. So, that is what I believe.
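
The trend line that an lm plot draws is, in the simplest case, an ordinary least squares fit. The sketch below computes that line by hand so the idea is visible without any plotting library. The weekly click figures are made up, and `fit_trend_line` is a hypothetical helper, not a Seaborn function.

```python
def fit_trend_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

weeks = [1, 2, 3, 4, 5]
clicks = [120, 135, 150, 160, 185]  # made-up weekly ad clicks
slope, intercept = fit_trend_line(weeks, clicks)
print(f"clicks = {slope:.1f} * week + {intercept:.1f}")
```

A positive slope means the metric is trending upward week over week. In a plotting library, the same fit would be overlaid on the scatter of the raw points.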