Results-driven and adaptable Data Scientist / Machine Learning Engineer with a successful track record of managing multiple priorities and delivering high-quality solutions. Proficient in NLP and machine learning techniques, with a focus on automating the processing of large volumes of text data. Recognized for expertise in using ML to analyze and extract insights from complex documents, such as bonds uploaded to the London Stock Exchange. Adept at developing and deploying unified models for classification and extraction, using SageMaker pipelines and hyperparameter tuning for optimal performance. Skilled in intelligence gathering, statistical analysis, and data mining, with a strong emphasis on attention to detail and written communication. A proactive problem solver with a passion for leveraging generative AI for information warfare. Experienced as a technical-entry officer in the Armed Forces, integrating the latest technology into existing frameworks and using ML to automate troop movement and supply management. Holds an M.Tech in Data Science and a B.Tech in Computer Science, along with certifications in machine learning, deep learning, cloud services, and more. Actively engaged in data science projects, including Eurobonds analysis at the London Stock Exchange and machine learning projects on Kaggle. A self-motivated professional with excellent organizational and time management skills.
Manager - AI and Automation, EY
Data Scientist Manager, Affine
Data Scientist Lead, Affine
Application Developer, Oracle India
Data Scientist / Technical Officer, Indian Army
Data Scientist, London Stock Exchange Group
System Engineer, TCS
AWS Cloud
Amazon SageMaker
Azure
Keras
PyTorch
Hugging Face
TensorFlow
neural network architectures
Python
LLM
LLAMA
Generative AI
Machine Learning
ETL pipelines
ChatGPT
Snowflake
Spark
My name is YK Tripathi, and I am a data scientist. I completed my M.Tech in data science at BITS Pilani. Before that, I worked with the Indian Army for 6 years, prior to that with Oracle, and prior to that with TCS. I hold a B.Tech in computer science and engineering from IERT, Allahabad. Presently, I'm working with London Stock Exchange Group as an NLP data scientist. My role at LSEG is to extract meaningful information from the vast number of PDFs and other documents uploaded to the stock exchange, which arrive in a variety of formats: sometimes structured, sometimes semi-structured, sometimes unstructured. We have to process them and make sense of them so that meaningful data can be extracted and used by the company. Apart from that, I hold the Azure Fundamentals (AZ-900) certification and a Python for Data Science certification, and I have extensive work experience with AWS SageMaker, which is a data science platform.
Okay, so for significant changes and trends within a dataset: if the dataset contains numerical data, we can perform statistical analysis to identify them. Significant changes are often outliers, so if we do a box plot of any numerical column, the outlier data points can be indicative of significant changes. We can identify trends with various plots, such as seaborn's lmplot, where you see a trend line corresponding to the latest trend; if the data is a time series, we can identify the trend there as well. There are also automated tools that can be used directly, for example Tableau, where you just drag and drop the target column onto the desktop view or dashboard and see whichever trend lines you want. Significant changes can also be detected programmatically: you create a pandas DataFrame and write some code, including common-sense checks. For example, if we're talking about age, a human age cannot extend to 200 years. So these are the three ways you can analyze significant changes and trends.
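A minimal sketch of the programmatic checks described above, assuming a pandas DataFrame with a numeric "age" column and a "day"/"sales" pair for the trend; the file and column names are illustrative only.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")  # hypothetical input file

# Box plot: points beyond the whiskers are candidate "significant changes".
sns.boxplot(x=df["age"])

# The same outliers, flagged programmatically with the 1.5 * IQR rule.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]

# Common-sense check: a human age should never approach 200 years.
impossible_ages = df[df["age"] > 120]

# Trend line: lmplot fits a regression line over the scatter (x must be numeric).
sns.lmplot(data=df, x="day", y="sales")
plt.show()
```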
Okay. To analyze large volumes of data, first and foremost we need to use cloud services in some form: AWS, Azure, Google Cloud, and so on. Data can be stored in a data lake, or sometimes the previous implementation has a relational database management system capable of handling a large amount of data; data warehouses and data lakes can also be used. Apart from that, there is stream processing and analytics, where the large amount of incoming data is processed on the go and then presented to the user, using tools like Kafka and Spark. There is also the big data ecosystem, such as Hadoop, but I would rather use Kafka and Spark. Another technique for handling a large volume of data is to break it down into chunks: you take a small amount of data first and keep going, compartmentalizing the data. That is also a good option when the restriction is on spending, because cloud resources are costly. So that is how I would handle large volumes of data.
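As a small illustration of the chunking approach (rather than the Kafka/Spark route), here is a sketch that aggregates a large CSV in fixed-size slices with pandas; the file and column names are hypothetical.

```python
import pandas as pd

totals = {}
# Read 100,000 rows at a time instead of loading the whole file into memory.
for chunk in pd.read_csv("transactions_large.csv", chunksize=100_000):
    partial = chunk.groupby("customer_id")["amount"].sum()
    for customer, amount in partial.items():
        totals[customer] = totals.get(customer, 0) + amount

# Merge the partial aggregates into one result.
result = pd.Series(totals).sort_values(ascending=False)
print(result.head())
```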
Sure. In tailoring a data presentation to different audience types, I would go from the top. Say you have the business people, who need to understand the business value of your solution: for them, we first present the business value and the supporting graphs. For example, if I want to shift my workload to the cloud, I need to explain how the cloud helps in the longer run and how OpEx and CapEx will reduce over time, because they are mostly interested in the money aspect. If I want to present something to technical people, say I built a model and I'm presenting it to the DevOps team that will implement it, then I need to cover the changes they need in the technical description, the types of instances, and the nitty-gritty of the code they will want to look at. And if I'm presenting to a client, it's the end-to-end experience: how they send an input, how they get an output, how the output performs, and what the output means for them. That is how I think about tailoring a data presentation to different audiences.
Okay. In order to effectively communicate insights to stakeholders, in the first phase I need to define what the problem is and how the particular solution is going to solve it. Basically, I first make them understand the problem statement, and then how the solution would be implemented and how it would address that issue. For example, say we analyze data and find that a company's product sells well to people aged 19 to 30 but sells much less to people aged 60 to 70; then we need to target that audience more, and maybe differentiate the strategy for how we promote that product in the market. Or say there are products that men generally buy but women tend not to; then we need to work out how to change the marketing strategy to target that audience. So to communicate insights effectively, we first formalize the problem and then explain how we are going to solve it.
For data representation, I generally prefer coding; I'm a coder at the core, so I prefer Matplotlib, Seaborn, and a very good browser-based library called Plotly. They are very good tools that can be used directly in Python code: they take in a pandas DataFrame and you can manipulate it on the go. That is what I prefer the most, and the basic reason is that, because of the coding aspect, I can modify things on the fly; I'm not restricted by the formats or other constraints of the automated tools, and I can have whatever representation I want. Ultimately, whenever you use a tool, there is code running in the back end that somebody has productionized, so we can just skip to the chase and write our own code. That said, in some cases there is definitely a benefit to automated tools like Tableau and Power BI, for instance when you have a huge amount of data and it's highly improbable that you can make sense of it by hand. There we can use Tableau or Power BI, but in most cases the Python libraries will do. So that is my preferred data representation.
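A small sketch of that workflow: a pandas transformation followed by an interactive Plotly Express chart. The file and column names are hypothetical.

```python
import pandas as pd
import plotly.express as px

df = pd.read_csv("sales.csv")

# Any pandas manipulation can happen on the fly before plotting.
monthly = df.groupby("month", as_index=False)["revenue"].sum()

# Plotly renders an interactive chart in the notebook or browser.
fig = px.bar(monthly, x="month", y="revenue", title="Monthly revenue")
fig.show()
```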
Yes, definitely. We need to analyze the marketing data based on ad performance, website traffic, and social media metrics such as sentiment analysis of posts: how many people are talking about it, how a new product launch is being received, and how we can make it trend. If we look at graph analytics, we can identify which nodes are the influential ones and pay those influential nodes so that our product or marketing message permeates effectively throughout the graph; I'm treating the network as a graph, and a social networking site is naturally a graph, so we can use that. Apart from that, we need to understand what exactly in our website or ads connects with users the most, and look for the influential points that are most effective in the bigger picture. We should also consider how we acquire new users; there is a technique called geofencing, where a shop can detect registered users in close proximity and send them tailored messages or offers, which may improve the performance of the overall system.
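A minimal sketch of the graph-analytics idea, using networkx (one common choice) to rank influential nodes; the edge list here is a toy example, not real data.

```python
import networkx as nx

# Each edge means "user A interacts with user B" on the social network.
edges = [("alice", "bob"), ("bob", "carol"), ("carol", "alice"),
         ("dave", "alice"), ("erin", "alice")]
G = nx.Graph(edges)

# PageRank (degree or betweenness centrality would also work) scores influence.
scores = nx.pagerank(G)

# The top-scoring users are the candidates worth targeting with the campaign.
top_influencers = sorted(scores, key=scores.get, reverse=True)[:3]
print(top_influencers)
```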
Okay, let me start from the very beginning, when I was working with TCS. I single-handedly migrated a project that was heavily dependent on Oracle's web server and an Oracle 9i database; I modified it so that you only need to change a property file specifying which web server and which database services you are going to use, and the application shifts completely to them. That was the first major application of my problem-solving abilities, and I kept on doing that; there are many examples I could quote from the Army. Presently, at LSEG as a data scientist, there was an issue with PDFs: we were getting the PDFs but not clean sentences, and since we use sentence classification, we need well-formed sentences. But how can we get them when the data is unstructured? That was a big problem, and I solved it with a library called PyMuPDF; my sentence splitter has been implemented, gone into production, and is being used across all of the projects running now. Apart from that, there were issues with few-shot learning using the SetFit model, where we needed to combine multiple fields in a single model due to cost constraints. What I did was freeze the shared model and fit a logistic regression for every field on top of it, and that solved it.
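A simplified sketch of the PyMuPDF extraction step mentioned above; the real production sentence splitter is more involved, and the file name here is hypothetical.

```python
import fitz  # PyMuPDF

doc = fitz.open("bond_prospectus.pdf")

# Collect the plain text of every page in reading order.
text = ""
for page in doc:
    text += page.get_text("text")

# Naive sentence split, standing in for the actual splitter logic.
sentences = [s.strip() for s in text.replace("\n", " ").split(". ") if s.strip()]
print(len(sentences), "candidate sentences")
```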
Okay. To optimize data storage and retrieval, first and foremost we need to get a hold on how many connections to the data storage we have: we should use connection pooling so that there are not many open connections hogging the bandwidth and making the entire system slow. Second, we need to design a redundancy strategy and normalize our data so that we are not storing redundant information again and again when it can be avoided. Apart from normalization, for retrieval we need proper indexing so that lookups are fast, and we need to implement caching so that frequently accessed data is not searched for again and again. Then we need auto-scaling: if more requests come in, we don't want to stop the services and redeploy; auto-scaling should be in place beforehand. All of these are things I would use to optimize data storage and retrieval.
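A minimal sketch of three of those optimizations (pooling, indexing, caching), assuming a PostgreSQL database accessed through SQLAlchemy; the connection URL, table, and column names are hypothetical.

```python
from functools import lru_cache
from sqlalchemy import create_engine, text

# Connection pooling: reuse a fixed set of connections instead of opening a new
# one per request.
engine = create_engine(
    "postgresql://user:password@localhost/appdb",
    pool_size=10,
    max_overflow=5,
)

# Indexing: make the frequently searched column fast to look up.
with engine.begin() as conn:
    conn.execute(text(
        "CREATE INDEX IF NOT EXISTS idx_orders_customer ON orders (customer_id)"
    ))

# Caching: keep frequently requested aggregates in memory.
@lru_cache(maxsize=1024)
def total_for_customer(customer_id: int) -> float:
    with engine.connect() as conn:
        row = conn.execute(
            text("SELECT COALESCE(SUM(amount), 0) FROM orders WHERE customer_id = :cid"),
            {"cid": customer_id},
        ).fetchone()
    return float(row[0])
```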
Yeah. As I said before, the data visualization stack I generally use is Python-based, which makes it much easier for me to manipulate whatever data I want to represent; I don't have to bend over backwards just to present something. Until now, my work has been focused mostly on predictions, so the best way for me to present that is through Python visualization tools like Matplotlib, Seaborn, and Plotly. To present digital marketing data, bar charts can be a good help, as can something like seaborn's lmplot, where you see a trend line directly, along with time series plots. Apart from that, you can use box plots, in which you can clearly see the outliers, where the median lies, and the other measures of centrality and dispersion (standard deviation and so on) that you're looking for. So that is what I believe.
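A brief sketch of how such marketing metrics could be plotted with the Python stack mentioned above; the file and column names are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("marketing_daily.csv", parse_dates=["date"])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Time series of website traffic with a 7-day rolling mean as the trend line.
ax1.plot(df["date"], df["visits"], alpha=0.4, label="daily visits")
ax1.plot(df["date"], df["visits"].rolling(7).mean(), label="7-day trend")
ax1.legend()

# Bar chart comparing total conversions per acquisition channel.
by_channel = df.groupby("channel")["conversions"].sum()
ax2.bar(by_channel.index, by_channel.values)
ax2.set_ylabel("conversions")

plt.tight_layout()
plt.show()
```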