
YK Tripathi

Vetted Talent

Results-driven and adaptable Data Scientist / Machine Learning professional with a successful track record of managing multiple priorities and delivering high-quality solutions. Proficient in NLP and machine learning techniques, with a focus on automating the processing of vast amounts of text data, and recognized for using ML to analyze and extract insights from complex documents such as bonds uploaded to the London Stock Exchange. Adept at developing and deploying unified models for classification and extraction, using SageMaker pipelines and hyperparameter tuning for optimal performance. Skilled in intelligence gathering, statistical analysis, and data mining, with a strong emphasis on attention to detail and written communication; a proactive problem solver with a passion for leveraging generative AI for information warfare. Joined the Armed Forces through technical entry, integrating the latest technology into existing frameworks and using ML to automate troop movement and supply management. Holds an M.Tech in Data Science and a B.Tech in Computer Science, along with certifications in machine learning, deep learning, cloud services, and more. Actively engaged in data science projects, including Eurobonds analysis at the London Stock Exchange and machine learning competitions on Kaggle. Self-motivated, with excellent organizational and time management skills.

  • Role

    Manager - AI and Automation

  • Years of Experience

    11 years

Skillsets

  • Glue
  • Quantum Computing
  • Agentic AI
  • Automation
  • AWS
  • Azure
  • Data Science
  • Docker
  • GCP
  • Generative AI
  • Git
  • GitLab
  • Database management
  • Google Cloud
  • Java
  • Lambda
  • LangGraph
  • LangChain
  • LangSmith
  • Machine Learning
  • MLOps
  • Monitoring
  • Reinforcement Learning
  • Transformers
  • Data Engineering
  • NLP - 7 Years
  • Deep Learning - 5 Years
  • Python - 7 Years
  • Statistical analysis
  • Sagemaker - 2 Years
  • SQL - 6 Years
  • A/B testing - 6 Years
  • Artificial Intelligence
  • BigQuery
  • Cloud Services
  • Clustering
  • Data Visualization
  • Decision Trees
  • Feature Engineering
  • Mathematics
  • Product Management
  • Statistical Modeling
  • Statistics
  • Web Analytics

Vetted For

  • Digital Data Scientist (AI Screening)
  • Score: 76/100

Professional Summary

11 Years
  • Jan, 2025 - Present 8 months

    Manager - AI and Automation

    EY
  • Jun, 2024 - Jan, 2025 7 months

    Data Scientist Manager

    Affine
  • Mar, 2024 - Jun, 2024 3 months

    Data Scientist Lead

    Affine
  • Mar, 2023 - Mar, 2024 1 yr

    Data Scientist

    London Stock Exchange Group
  • Apr, 2017 - Mar, 2023 5 yr 11 months

    Data Scientist - Technical Officer

    Indian Army
  • Mar, 2016 - Dec, 2016 9 months

    Application Developer

    Oracle India
  • Jan, 2014 - Feb, 2016 2 yr 1 month

    System Engineer

    TCS

Applications & Tools Known

  • AWS Cloud
  • Amazon SageMaker
  • Azure
  • Keras
  • PyTorch
  • Hugging Face
  • TensorFlow
  • Neural network architectures
  • Python
  • LLMs
  • LLaMA
  • Generative AI
  • Machine Learning
  • ETL pipelines
  • ChatGPT
  • Snowflake
  • Spark

Work History

11 Years

Manager - AI and Automation

EY
Jan, 2025 - Present 8 months
    Delivered solutions using agentic AI frameworks, MCP, and other state-of-the-art approaches. Improved system efficiency by 15% by monitoring the solution with LangSmith.

Data Scientist Manager

Affine
Jun, 2024 - Jan, 2025 7 months
    Increased project delivery efficiency by 45% for Shutterstock by managing offshore teams. Managed 4 different projects for the same client. Delivered high-priority insights and was pivotal in the design of several solutions.

Data Scientist Lead

Affine
Mar, 2024 - Jun, 2024 3 months
    Led the KHC project, resulting in a budget increase of $110,000. Raised operational efficiency by 70% through automation of freight operations. Improved prediction accuracy for note reading using GPT models. Played a critical role in designing the generative AI solution.

Data Scientist

London Stock Exchange Group
Mar, 2023 - Mar, 2024 1 yr
    Automated and processed vast amounts of text data in various forms as an NLP data scientist. Improved data extraction accuracy by 60% through designing ML-based PDF retrieval systems. Developed Sagemaker Pipelines for classification model deployment and testing.

Data Scientist-Technical Officer

Indian Army
Apr, 2017 - Mar, 2023 5 yr 11 months
    Amplified operational efficiency by 25% by integrating advanced technology into military systems. Conducted satellite and radar data analysis using ML techniques. Researched automation of troop movement and generative AI applications.

Application Developer

Oracle India
Mar, 2016 - Dec, 2016 9 months
    Improved deployment efficiency by 25% by optimizing the design and planning workflow. Enhanced operational efficiency, reducing downtime by 30% through expert collaboration in troubleshooting and debugging. Developed application scripts and updated technical documentation.

System Engineer

TCS
Jan, 2014 - Feb, 2016 2 yr 1 month
    Amplified system efficiency by 40% by resolving 150 technical issues and proposing 5 design solutions. Improved IT compliance efficiency by 30% by developing comprehensive IT policies.

Achievements

  • Awarded the GOLD Award at LSEG for contributions to the Eurobonds project
  • Spearheading a GenAI-centric project for a global MNC client
  • Automated and processed vast amounts of text data using NLP techniques
  • NLP Automation Success
  • Model Optimization
  • Tech Integration Leader
  • ML Competition Ranking

Major Projects

3 Projects

Generative AI - Advanced GAN with CelebA Dataset

Nov, 2023 - Dec, 2023 1 month
    Generative AI project utilizing GANs for image generation. Augmented datasets by generating 10,000 high-quality images with a GAN model trained on the CelebA dataset.

Titanic - Machine Learning from Disaster

Nov, 2022 - Dec, 2022 1 month
    Machine learning project predicting survival based on dataset analysis. Improved prediction accuracy by building predictive models to analyze survival rates using passenger data.

NLP with disaster tweets

Nov, 2021 - Dec, 2021 1 month
    NLP project for monitoring Twitter communications during emergencies. Boosted disaster response efficiency by developing a machine learning model for classifying disaster-related Tweets.

Education

  • M.Tech in Data Science

    BITS Pilani (2024)
  • B.Tech (Computer Science)

    IERT (2013)
  • PG Diploma

    Savitri Bai Phule University (2019)

Certifications

  • Deep Learning

    Kaggle (Oct, 2010)
  • Python for Data Science

    Udemy (Sep, 2009)
  • Java

  • Feature Engineering

  • Deep Learning

  • Intermediate Machine Learning

  • AZ-900

  • Python for Data Science

  • SQL

  • Intro to GIS

AI-interview Questions & Answers

My name is YK Tripathi, and I am a data scientist. I have completed my M.Tech in data science from BITS Pilani. Before that, I worked with the Indian Army for 6 years, prior to that with Oracle, and prior to that with TCS. I hold a B.Tech in computer science and engineering from IERT, Allahabad. Presently, I am working with the London Stock Exchange Group as an NLP data scientist. My role at LSEG is to extract meaningful information from a vast amount of PDFs and other data uploaded to the stock exchange, which arrives in a variety of formats: sometimes structured, sometimes semi-structured, sometimes unstructured. We have to process it and make sense of it so that meaningful data can be extracted from it and used by the company. Apart from that, I am certified in Azure fundamentals (AZ-900) and in Python for data science, and I have extensive work experience with AWS SageMaker, which is a data science platform.

For significant changes and trends within a dataset: if the dataset contains numerical data, we can perform statistical analysis to identify them. Significant changes often show up as outliers, so if we draw a box plot of any numerical column, the outlier points can be indicative of significant changes. We can identify trends with various plots, such as an lmplot, where you see a trend line corresponding to the latest direction of the data; if the data is a time series, we can identify the trend there as well. There are also automated tools that can be used directly, for example Tableau, where you drag and drop the target column onto the desktop view or dashboard and see whichever trend lines you want. Significant changes can also be detected programmatically: you create a pandas DataFrame and then apply common-sense checks. For example, if we are talking about age, a human age cannot extend to 200 years. So these are the three ways you can analyze significant changes and trends.
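
A minimal sketch of those three checks in Python: a box plot for outliers, a rolling-mean trend line, and a common-sense rule. The DataFrame and column names are made up for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data: a daily metric plus an "age" column with one impossible value.
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=100, freq="D"),
    "age": [25] * 98 + [230, 31],
    "value": range(100),
})

# 1. Box plot: points beyond the whiskers are candidate outliers / significant changes.
df.boxplot(column="age")
plt.show()

# 2. Trend: a rolling mean over the time series makes the direction visible.
df.set_index("date")["value"].rolling(window=7).mean().plot(title="7-day rolling trend")
plt.show()

# 3. Common-sense check: flag rows that violate domain rules (human age > 120).
print(df[df["age"] > 120])
```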

To analyze large volumes of data, first and foremost we need to use cloud services in some form: AWS, Azure, even Google Cloud; anything can work. The data can be stored in a data lake, or sometimes the previous implementation has a relational database management system capable of handling a large amount of data; beyond that, a data warehouse or data lake can be used. There is also stream processing and analytics, where the large amount of data coming in is processed on the go and then presented to the user; there we can use Kafka, Spark, and similar tools. Big data frameworks such as Hadoop can also be used, but I would rather prefer Kafka and Spark. One more technique for handling a large volume of data is to break it down into chunks: you take a small amount of data first and keep going, compartmentalizing the data. That is useful when the restriction is on spending, because all the cloud resources are costly. So that is how I would handle large volumes of data.
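
A minimal sketch of that chunked approach with pandas, assuming a hypothetical large CSV file events.csv with a numeric amount column:

```python
import pandas as pd

total = 0.0
rows = 0
# Read the file in 100k-row chunks so memory stays bounded; aggregate incrementally
# instead of loading the whole dataset at once.
for chunk in pd.read_csv("events.csv", chunksize=100_000):
    total += chunk["amount"].sum()
    rows += len(chunk)

print(f"mean amount over {rows} rows: {total / rows:.2f}")
```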

Sure. In tailoring a data presentation to different audience types, I would go from the top. Say you have the business people, who need to understand the business value of your solution: for them, I first present the business value with the relevant graphs. For example, if I want to shift my work to the cloud, I need to explain how the cloud will help in the longer run and how OpEx and CapEx are going to reduce over a period of time; they are mostly interested in the money aspect. If I want to present to a more technical audience — say I built a model and want to present it to the DevOps team who will implement it — then I focus on the changes they would need to make, what type of instances to use, and the nitty-gritty details of the code they would want to look at. And if I present to a client, the focus is the end-to-end experience: how they would send an input, how they would get an output, what that output means for them, and how it would perform. That is how I think about tailoring a data presentation to different audiences.

To effectively communicate insights to stakeholders, in the first phase I need to define what the problem is — what exactly the problem statement is and how the particular solution is going to solve it. First, I make them understand: this is the problem, this is how the solution would be implemented, and this is the issue it would resolve. For example, say we analyze data in which a company sells a product well to people aged 19 to 30, but the same product sells less when the age group is 60 to 70; then we may need to target that audience more, or differentiate the strategy for how we promote that product in the market. Or say there are products generally bought by men that women tend not to buy; then we need to understand how to change our marketing strategy to target that audience. So to communicate an insight effectively, we first formalize the problem and then explain how we are going to solve it.
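
To make the age-group example concrete, here is a small hypothetical pandas sketch; the transaction data and column names are invented purely for illustration.

```python
import pandas as pd

# Hypothetical transactions with customer age and units sold.
sales = pd.DataFrame({
    "age": [22, 27, 29, 64, 67, 70, 25, 63],
    "units_sold": [5, 7, 6, 1, 0, 1, 8, 2],
})

# Bucket customers into the age groups from the example and compare average demand.
sales["age_group"] = pd.cut(
    sales["age"], bins=[18, 30, 59, 70], labels=["19-30", "31-59", "60-70"]
)
print(sales.groupby("age_group", observed=True)["units_sold"].mean())
```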

For data representation, I generally prefer coding — I am a coder at the core — so I prefer Matplotlib, Seaborn, and a very good interactive library called Plotly. These are tools that can be used directly in Python code: they can take in a pandas DataFrame, and you can manipulate it on the go. That is what I prefer the most, and the basic reason is that, because of the coding aspect, I can modify it on the fly. I am not restricted by the format or other constraints of automated tools; I can have whatever representation I want. Ultimately, if you use any tool, there is code running in the back end — code that someone has automated and productionized — so we can skip to the chase and write our own code. That said, in some cases there is a benefit to using automated tools like Tableau and Power BI, where you have a very large amount of data and it is highly improbable that you can make sense of it by hand; there, those automated tools are useful. But in most cases, we can use the Python libraries any day. That is my preferred data representation.
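
A minimal sketch of that Plotly-from-pandas workflow; the DataFrame is hypothetical, and the OLS trend-line option assumes statsmodels is installed.

```python
import pandas as pd
import plotly.express as px

# Hypothetical ad spend vs. conversions.
df = pd.DataFrame({
    "spend": [100, 200, 300, 400, 500],
    "conversions": [12, 20, 35, 41, 55],
})

# Interactive scatter with an OLS trend line; because it is driven from the
# DataFrame, the chart can be modified in code rather than in a drag-and-drop tool.
fig = px.scatter(df, x="spend", y="conversions", trendline="ols")
fig.show()
```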

Yes. We need to analyze the marketing data based on ad performance, website traffic, and social media metrics such as sentiment analysis of the posts: how many people are talking about it, how well a newly launched product is being received, and how we can make it trending. If we look at graph analytics, we can identify which the influential nodes are, and we can pay those influential nodes so that our product or marketing message permeates effectively throughout the graph — I am modeling it as a graph, because anything is a graph; a social networking site is a graph. Apart from that, we need to understand what exactly in our website or ads is connecting with users the most, and look for those influential points that are more effective when we consider the bigger picture. And if we are acquiring new users, there is a technique called geofencing: say I have a shop that can detect registered users in close proximity and send them tailored or customized messages or offers, which may increase the performance of the overall system.
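
As a rough sketch of the influential-nodes idea, here is a toy NetworkX example using degree centrality as a simple proxy for influence; the follower graph is invented for illustration.

```python
import networkx as nx

# Hypothetical follower/interaction graph; in practice the edges would come
# from the social platform's data.
edges = [("ana", "bo"), ("ana", "cy"), ("bo", "cy"), ("cy", "di"),
         ("cy", "ed"), ("di", "ed"), ("ed", "fi")]
g = nx.Graph(edges)

# Degree centrality: the fraction of other users each node is connected to.
centrality = nx.degree_centrality(g)
top = sorted(centrality.items(), key=lambda kv: kv[1], reverse=True)[:3]
print("most influential users:", top)
```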

Let me start from the very beginning, when I was working with TCS. I single-handedly migrated a project that was heavily dependent on the Oracle web server and the Oracle 9i database; I modified it so that you only need to change a property file specifying which web server and which database services to use, and it would shift completely to those. That was an early application of my problem-solving ability, and I kept on doing that — there are many examples I could quote from the Army. Presently, what I am doing at LSEG as a data scientist: we receive PDFs, but we were not getting clean sentences, and since we use sentence classification we need well-formed sentences. The structure of the data is unstructured, so this was a big problem, and I solved it with a library called PyMuPDF. My sentence splitter has been implemented, has gone into production, and is being used across all of the projects running now. Apart from that, there were issues with few-shot learning using the SetFit model, where we needed to combine multiple fields in a single model due to cost constraints. What I did was freeze the head and apply a logistic regression model for every field in the head, and that solved it.
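
A minimal sketch of the PyMuPDF-based sentence extraction described here, assuming a hypothetical file name; the production sentence splitter mentioned above is certainly more involved than this simple regex split.

```python
import re
import fitz  # PyMuPDF

# Hypothetical input file; real prospectuses need much more careful handling.
doc = fitz.open("bond_prospectus.pdf")
text = " ".join(page.get_text() for page in doc)

# Collapse layout line breaks, then split on sentence-ending punctuation.
text = re.sub(r"\s+", " ", text)
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

for sentence in sentences[:5]:
    print(sentence)
```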

To optimize data storage and retrieval, first and foremost we need to get a handle on how many connections to the data store we have: we should use connection pooling so that there are not many open connections hogging the bandwidth and making the entire system slow. Second, we need to design a redundancy strategy and normalize our data so that we are not storing redundant information again and again where it can be avoided. Beyond normalization, at retrieval time we need proper indexing so that retrieval is fast, and we need to implement caching so that data which is accessed frequently is not searched for again and again. Then we need auto-scaling: if the number of requests grows, we should not have to stop the services and redeploy; auto-scaling should be in place beforehand. These are the things I would use to optimize data storage and retrieval.
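
A minimal sketch of two of those techniques — connection pooling and caching — using SQLAlchemy and functools.lru_cache; the database URL, table, and query are hypothetical.

```python
from functools import lru_cache
from sqlalchemy import create_engine, text

# Connection pool: a bounded set of connections is kept open and reused
# instead of opening a new one per request (URL and schema are hypothetical).
engine = create_engine(
    "postgresql://user:password@localhost/sales", pool_size=5, max_overflow=2
)

@lru_cache(maxsize=1024)
def lookup_customer(customer_id: int) -> str:
    """Serve frequently requested customers from an in-process cache."""
    with engine.connect() as conn:
        row = conn.execute(
            text("SELECT name FROM customers WHERE id = :id"), {"id": customer_id}
        ).fetchone()
    return row[0] if row else ""
```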

As I said before, the data visualization I generally use comes from Python, which makes it super easy for me to manipulate whatever data I want to represent; I do not have to bend over backwards just to show some data. Until now my work has been focused mostly on predictions, so the best way for me to present that is through Python's visualization tools: Matplotlib, Seaborn, and Plotly. To present digital marketing data, bar charts can be a good help, and something like an lmplot where you can see a trend line directly. So lmplots, trend lines, and time-series plotting; apart from that, you can use box plots, in which you can clearly see the outliers, where the median and mode lie, and what the numbers are for the standard deviation and all the measures of centrality. That is what I believe.