Vetted Talent

Gelli Tarun

A skilled machine learning engineer passionate about solving real-world problems, eager to apply this cutting-edge technology to help organizations develop and integrate new products. Collaborated with multidisciplinary product development teams to integrate trained models and gauge performance improvements. Planned, researched, and developed SOTA deep learning models for semantic segmentation, object detection, and classification. Developed data analysis and data preparation pipelines.
  • Role

    Data Scientist

  • Years of Experience

    4.3 years

Skillsets

  • MLFlow
  • Azure AI Fabric
  • XGBoost
  • Vector databases
  • Transformers
  • Time Series
  • TensorFlow
  • SQL
  • Snowflake
  • Reinforcement Learning
  • R
  • PyTorch
  • PySpark
  • Power BI
  • OpenCV
  • Deep Learning
  • LLMs
  • LayoutLM
  • LangChain
  • Keras
  • Informatica
  • Explainable AI
  • Databricks
  • BigQuery
  • Azure ML Studio
  • Azure
  • AWS
  • Python - 5 Years
  • NLP

Vetted For

10 Skills
  • Machine Learning Scientist II (Places) - Remote AI Screening
  • 66%
  • Skills assessed: Large POI Database, Text Embeddings Generation, ETL pipeline, LLM, Machine Learning Model, NLP, Problem Solving Attitude, Python, R, SQL
  • Score: 59/90

Professional Summary

4.3 Years
  • Jul, 2024 - Present 1 yr 11 months

    Data Scientist

    The Family Office Company Bsc
  • Aug, 2021 - Jul, 2024 2 yr 11 months

    AI ML Engineer

    Biomed Informatics
  • Jan, 2021 - Jul, 2021 6 months

    Data Science Internship

    Biomed Informatics

Applications & Tools Known

  • R Programming

Work History

4.3 Years

Data Scientist

The Family Office Company Bsc
Jul, 2024 - Present 1 yr 11 months
    Building and optimizing ML models for credit risk, fraud detection, and customer segmentation. Designed and deployed end-to-end AI/ML solutions by engineering scalable PySpark data pipelines, building predictive models, and delivering actionable insights through advanced analytics. Implementing end-to-end ML pipelines, including data preprocessing, feature engineering, and model training. Monitoring and maintaining models in production, retraining as needed for performance. Collaborating with engineering teams to integrate models into scalable systems. Utilizing SQL for extracting and transforming large datasets from financial databases to support model development. Optimizing PySpark notebooks and SQL queries to improve data processing efficiency and ensure seamless integration with ML pipelines. Built working models of: PRIME AI, Sales Process Hubspot Funnel, Seamless ML Pipelines Migration with AutoML for Drift Detection and Monitoring, Studio Looker, Top Up Scoring Model, Digital Footprint.

AI ML Engineer

Biomed Informatics
Aug, 2021 - Jul, 2024 2 yr 11 months
    Building AI models. Built working models using deep learning (neural networks and ANNs). Explaining the usefulness of the AI models to a wide range of individuals within the organization, including stakeholders and product managers. Developing infrastructure for data transformation and ingestion. Applied data science techniques, such as machine learning and statistical modeling. Experienced project manager with a track record of successful planning, execution, and team collaboration, adept at risk management and maintaining rigorous quality assurance processes. Built working models of: Chatbot Using Generative AI, Health diseases (heart attack prediction), Health of Lung Infection (CNN with PyTorch and OpenCV), Predictive Modeling (US betting firm), Health Diabetic Retinopathy (CNN with OpenCV and TensorFlow Serving), Stock Price Prediction (RNN + LSTM, AI Pipeline).

Data Science Internship

Biomed Informatics
Jan, 2021 - Jul, 2021 6 months

Achievements

  • Building AI models
  • Built working models using Deep Learning
  • Developing infrastructures for data transformation and ingestion
  • Applied data science techniques

Major Projects

24 Projects

Detecting Diabetic Retinopathy

    Built a CNN model for detecting diabetic retinopathy and deployed it using TensorFlow Serving.

Stock Price Prediction Using Deep Q-Learning

    Prepared an agent by implementing Deep Q-Learning that can trade stocks without supervision. The aim of this project is to train an agent that uses Q-learning and neural networks to predict profit or loss, by building a model and running it on a dataset that is available for evaluation.

Health diseases Cardiovascular diseases

    Cardiovascular diseases are the leading cause of death globally. It is therefore necessary to identify the causes, so I developed a system to predict heart attacks effectively.

Stock Price Prediction Using Deep Learning

    Prepared an agent by implementing deep learning that can trade stocks without supervision. The aim of this project is to train an agent that uses deep learning and neural network models like RNNs and LSTMs to predict profit or loss, by building a model and running it on a dataset that is available for evaluation.

Health Diabetic Retinopathy

    I built a CNN model using distributed training to detect diabetic retinopathy and deployed it using TensorFlow Serving.

Predictive Modeling

    I built a machine learning model for a US client that predicts the runs a batsman will score and the number of wickets a bowler will take in T20 matches.

Health of Lung Infection

    I built a model using a convolutional neural network that can classify lung infection in a person using medical imagery.

Health diseases

    Cardiovascular diseases are the leading cause of death globally. It is therefore necessary to identify the causes, so I developed a system to predict heart attacks effectively.

Chatbot Using Generative AI

    I developed a real-time chatbot using LLMs and LayoutLM (OpenAI's GPT-3, Whisper, Microsoft T-5) for sequencing, with Whisper for speech-to-text processing, to engage with customers and boost their business growth using NLP and speech recognition. We built the web application with Flask and deployed it on Microsoft Azure. The chatbot is very helpful for its 24/7 presence and ability to reply instantly.

Studio Looker

    This is a solution for our Relationship Managers, so that before contacting a client they can easily understand the client and their likes and dislikes, using data aggregated across all features, from demographics to text to investments.

Top Up Model

    We developed a model that uses textual data from our clients to prioritize which clients to contact, based on predicted probabilities. We used meeting notes, call notes, and email data, along with features engineered from that data.

Digital Footprint

    It is a dashboard we created that uses client and prospect data, from emails to calls to meeting counts, along with their digital activities, and uses an XGBoost model to predict which prospect is low-hanging fruit likely to become a client, or which existing client is likely to do a top-up. Based on their activities, it makes it easy for our RMs to understand whom to contact and in what order.

Chatbot (SPEECH TO TEXT FOR CUSTOMER SUPPORT)

    I developed a real-time chatbot that engages with customers through voice commands and resolves their queries, using NLP and speech recognition, in order to boost their business growth. The chatbot is very helpful for its 24/7 presence and ability to reply instantly.

Detection of Lung Infection

    Built a CNN model to classify lung infections in patients using medical imagery.

Health Care/Cardiovascular diseases

    Developed a system to predict heart attacks effectively, addressing a leading global cause of death.

Chatbot Development

    Developed a real-time chatbot with NLP and Speech Recognition to engage with customers and enhance business growth.

Facial Recognition

    Used a deep convolutional neural network (CNN) built with Keras to perform facial recognition.

Emotion Recognition

    Future customizations, such as understanding human emotions, could lead to a range of advancements, such as determining whether a person likes a specific statement, item, product, or food, or how they are feeling in a particular circumstance. I built a model using a convolutional neural network that can classify a person's emotion.

Lending Loan Data Analysis

    For some companies, correctly predicting whether or not a loan will default is very important. In this project, using historical data, I built a deep learning model to predict the chance of default for future loans.

Prepared an agent by implementing Deep Q-Learning that can trade stocks without supervision.

Stock Price Prediction

Health Informatics/Detecting Diabetic Retinopathy

Health Informatics/Detection of Lung Infection

Health Informatics/Cardiovascular diseases

Education

  • PGP in AI/ML Engineer

    Purdue University (2023)
  • B. Tech in Computer Science

    Sreenidhi Institute of Science and Technology (2022)

Certifications

  • PGP in AI/ML Engineer

AI-interview Questions & Answers

Gelli Tarun. I'm currently working as an AI engineer at Biomed Informatics. My background is in computer science engineering, where I earned my bachelor's degree. I then completed a postgraduate diploma in AI and ML. After that, I started working with Biomed Informatics as an AI engineer. Here, our focus is mostly on developing AI and ML models for the healthcare sector. We use computer vision to build predictive models that analyze CT scans and X-ray images to detect tumors or fractures a patient may be suffering from. We also do predictive analysis for clients based on their data. For example, we recently worked on a project predicting IPL run scores for batsmen and wickets for bowlers based on the client's data. We also develop chatbots integrated with LLMs, implementing APIs and fine-tuning models for specific use cases. We then quantize the models so they can be deployed easily on hardware with minimal resources. These are the main aspects of my education and work experience.

So, basically, ETL pipelines are extract, transform, and load pipelines, which we use to transform raw data into a form the model can digest. Geospatial data can contain diverse kinds of values: text, numbers, or categorical values. To develop the pipeline, we first need to convert the categorical values, such as weather being sunny, into vectors, so the model receives them as numbers. Then we have other steps, such as scaling the data. In this case, we use the MinMax scaler, the standard scaler, or the robust scaler, depending on the kind of data and its values. We scale so that all the values fall into a particular range, so the model doesn't overfit or underfit. To set up the ETL pipeline, we use the Pipeline class from scikit-learn. In it, we include the scaling methods and the model we want to use, and we decide which metrics to check for the model. So this is a basic, high-level procedure for implementing an ETL pipeline to handle geospatial data. Using the Pipeline from scikit-learn, we can create an ETL pipeline that extracts, transforms, and loads the data, whether it's geospatial or any other kind of data. It can handle all kinds of preprocessing techniques.
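The encode-and-scale step described in this answer can be sketched without any library, to show what the transformation does to each row. This is only an illustration: the field names (`weather`, `lat`, `lon`) and the sample rows are hypothetical, and in scikit-learn the same work would normally be done by `OneHotEncoder` and `MinMaxScaler` inside a `Pipeline`.

```python
from typing import Dict, List, Sequence


def one_hot(values: Sequence[str]) -> List[List[int]]:
    """Encode each category as a 0/1 vector (columns in sorted category order)."""
    categories = sorted(set(values))
    return [[1 if value == cat else 0 for cat in categories] for value in values]


def min_max_scale(values: Sequence[float]) -> List[float]:
    """Rescale values linearly into [0, 1]; a constant column maps to 0.0."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


def transform(rows: List[Dict[str, object]]) -> List[list]:
    """Transform step: one-hot encode the category, scale the coordinates."""
    weather = one_hot([str(row["weather"]) for row in rows])
    lat = min_max_scale([float(row["lat"]) for row in rows])
    lon = min_max_scale([float(row["lon"]) for row in rows])
    return [w + [la, lg] for w, la, lg in zip(weather, lat, lon)]


# Hypothetical sample rows, for illustration only.
rows = [
    {"weather": "sunny", "lat": 10.0, "lon": 100.0},
    {"weather": "rainy", "lat": 20.0, "lon": 110.0},
    {"weather": "sunny", "lat": 30.0, "lon": 120.0},
]
features = transform(rows)
```

Each output row holds the one-hot columns (here `rainy`, `sunny`) followed by the scaled latitude and longitude.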

So, basically, the system is designed to automate the recognition and flagging of outdated POI listings. I am not familiar with this kind of system, but I have a bit of knowledge of automated recognition. A POI listing is a technique that can leverage the pipeline of a process. It's an important process where we can... sorry, I'm not able to recall exactly.

So an LLM is the best choice to enhance an existing NLP-based system because NLP-based systems are typically built using LSTMs or encoder-decoder models that are not trained on a huge amount of data, so they are not able to give good responses to users. If, instead of the NLP-based system, we use an LLM on the data we want, it makes our chatbot much more effective: users will like it, use it more, and be more satisfied with the model. It is also easy to train and leverage, because an existing LLM, which is trained on a huge amount of data with billions of parameters, can build further knowledge of our data after we convert the data into vectors. So an LLM can not only generate good responses but also respond like a human instead of a chatbot. It also needs fewer training hours, so it is very cost-effective to train. Instead of using hundreds of GPUs, we can use a single GPU to train a model, even a multi-billion-parameter multimodal model, which can be more effective than the existing NLP-based system.

Scaling an R script to handle a 10-fold increase in data size. So, basically, in our R code we can handle the data efficiently, for example by using data.table instead of a data frame for faster data manipulation. For memory management, we clean up unused objects using R's rm() and gc() (garbage collector) functions to free up memory. With vectorization, we replace loops with vectorized operations whenever possible. R also supports parallel processing, using packages like parallel and foreach for parallelized tasks, and distributed computing can also be used: we can consider using sparklyr to interface with Hadoop for distributed data processing. Then there is algorithm optimization, like choosing algorithms that scale well with the data size, such as stochastic gradient descent for linear models, and sampling techniques. Then we have batch processing, for algorithms that support processing data in batches instead of loading all the data. There are also efficient data storage and database integration with PostgreSQL and MySQL, and hardware utilization, like using machines with more RAM and CPU, cloud solutions like AWS for scalability, and GPU acceleration for machine learning using packages like TensorFlow and Keras. For benchmarking and profiling, we can use the profvis package, and for cloud and distributed computing, we can use H2O for scalable machine learning with its Hadoop interface. So we can conclude that scaling an R script for a 10-fold increase in data size requires a combination of good coding practices, leveraging parallel and distributed computing, and potentially utilizing more powerful hardware or cloud services. By systematically applying these strategies, we can ensure that our script remains performant even with significantly larger datasets.
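The batch-processing idea in this answer does not depend on the language. The sketch below uses Python rather than R, purely as an illustration, to compute a mean one batch at a time so the full dataset never has to sit in memory. The helper names `batches` and `streaming_mean` are hypothetical.

```python
from typing import Iterable, Iterator, List


def batches(items: Iterable[float], size: int) -> Iterator[List[float]]:
    """Yield consecutive lists of at most `size` items."""
    batch: List[float] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # final, possibly shorter batch
        yield batch


def streaming_mean(items: Iterable[float], size: int = 1000) -> float:
    """Compute the mean while holding only one batch in memory at a time."""
    total = 0.0
    count = 0
    for batch in batches(items, size):
        total += sum(batch)
        count += len(batch)
    return total / count if count else 0.0


result = streaming_mean(range(1, 11), size=3)
```

Only the running total and count persist between batches, so memory use is bounded by the batch size rather than the dataset size.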

So, basically, firstly, we can look at the system architecture: using real-time data streams, like Apache Kafka, Amazon Kinesis, or Google Cloud, and maintaining a database of POI information, which could be stored in a relational database like PostgreSQL or MySQL. Then, in the data processing layer, we have stream processing using frameworks like Apache Flink, Spark Streaming, or Kafka Streams. Then we implement microservices for different tasks such as data ingestion, data processing, and data enrichment. For the POI table, we store the information in a relational table with appropriate indexing, and we also have real-time data tables. Then comes enriching and processing: data ingestion, data parsing, and geospatial enrichment. We process our data and integrate it with the POI information. For the real-time streaming data, we can use geospatial queries using the PostGIS extension for PostgreSQL. Then we have monitoring and scaling, where we implement monitoring using tools that track performance and detect bottlenecks, and auto-scaling, using Kubernetes or cloud-based solutions, to handle varying loads. So, by leveraging real-time stream processing frameworks, an efficient database, the geospatial capabilities of PostGIS, and microservices, we can build a robust SQL-based solution for real-time POI data enrichment. And, of course, we need to monitor and scale our system as needed to handle growing data volumes.
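The enrichment step can be sketched in plain Python: each incoming event is matched to the nearest POI within a radius. This is a simplified stand-in, not the architecture described in the answer. The POI records, the event fields, and the 1 km radius are hypothetical, and a real deployment would more likely push the distance test into the database (for example with PostGIS `ST_DWithin`) than loop in application code.

```python
import math
from typing import Dict, List, Optional

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two latitude/longitude points, in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def enrich(event: Dict, pois: List[Dict], max_km: float = 1.0) -> Dict:
    """Attach the name of the nearest POI within max_km, or None if none is close."""
    best: Optional[Dict] = None
    best_km = max_km
    for poi in pois:
        d = haversine_km(event["lat"], event["lon"], poi["lat"], poi["lon"])
        if d <= best_km:
            best, best_km = poi, d
    enriched = dict(event)  # do not mutate the incoming event
    enriched["poi_name"] = best["name"] if best else None
    return enriched


# Hypothetical POI table and stream events.
pois = [
    {"name": "Cafe", "lat": 40.0, "lon": -74.0},
    {"name": "Hotel", "lat": 41.0, "lon": -74.0},
]
near = enrich({"id": 1, "lat": 40.001, "lon": -74.0}, pois)
far = enrich({"id": 2, "lat": 45.0, "lon": -74.0}, pois)
```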

The question is what this query is trying to accomplish: it selects from the POI table with certain attributes. So, basically, looking at the query, I can see that we are selecting the name, location, and category from POIs, which is a table. Then we filter by category: it attempts to keep the POIs where the category is either hotel or restaurant. Then, with the non-null location condition, it filters out the POIs where the location field is null. Then it sorts by name length: the results are intended to be ordered by the length of the name field in descending order. Then we limit the output to the top ten results. I think there are a few syntax errors in that select statement. So this is what the SQL is trying to do.
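The query under discussion is not reproduced in the transcript, so the following is only a reconstruction of the intent described in the answer, run against an in-memory SQLite database. The table name `pois`, the column names, and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pois (name TEXT, location TEXT, category TEXT)")
conn.executemany(
    "INSERT INTO pois VALUES (?, ?, ?)",
    [
        ("Grand Plaza Hotel", "12.97,77.59", "hotel"),
        ("Cafe", "12.98,77.60", "restaurant"),
        ("City Museum", "12.99,77.61", "museum"),  # dropped: wrong category
        ("Hidden Inn", None, "hotel"),             # dropped: NULL location
    ],
)

# Filter by category, drop NULL locations, order by name length (longest
# first), and keep at most ten rows.
query = """
    SELECT name, location, category
    FROM pois
    WHERE category IN ('hotel', 'restaurant')
      AND location IS NOT NULL
    ORDER BY LENGTH(name) DESC
    LIMIT 10
"""
rows = conn.execute(query).fetchall()
names = [row[0] for row in rows]
```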

So we are examining this Python code for parsing geospatial data, because there is code that could lead to unhandled exceptions. Yes, I think there are a few issues in the code. Firstly, there's an indentation error: the code is inside the try block, but it's not properly indented, so we fix the indentation. Then, the import statement imports GeoPandas as 'gdp', but the code later uses 'gpd', which is inconsistent; the alias should be consistent throughout. The read_file function call has mismatched quotes around the file name. Then, in the exception handling, the string inside the first print statement has mismatched quotes and is missing a closing double quote and a closing parenthesis. The print statement in the second, general exception handler is missing its opening and closing parentheses. So these are the errors I can see in the code.
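The code being reviewed is not included in the transcript. As a rough sketch of what a version free of the listed problems might look like, the hypothetical function `load_geodata` below uses one alias throughout, consistent quoting, complete `print` calls, and a properly indented try block. The import sits inside the try block only so that the sketch still runs where geopandas is not installed.

```python
from typing import Any, Optional


def load_geodata(path: str) -> Optional[Any]:
    """Return a GeoDataFrame read from `path`, or None if it cannot be read."""
    try:
        # A single alias, used consistently below.
        import geopandas as gpd

        return gpd.read_file(path)
    except ImportError:
        print("geopandas is not installed")
    except FileNotFoundError:
        print(f"File not found: {path}")
    except Exception as exc:  # driver and parse errors raised by the reader
        print(f"Could not parse {path}: {exc}")
    return None


# A path that is assumed not to exist, to exercise the error handling.
result = load_geodata("does_not_exist.geojson")
```

Every failure path prints a message and returns `None`, so the caller never sees an unhandled exception.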

So, basically, extending an ETL pipeline to integrate third-party POI data feeds involves several steps. In the extraction phase, we can use API integration, like the RESTful APIs or web services provided by the third-party POI data providers, using tools like requests in Python for fetching data, and then schedule extraction by implementing cron jobs or scheduling tools like Apache Airflow to regularly pull data. Then, in the transforming phase, we have data cleansing and normalization, schema mapping, like mapping the third-party data schema to our internal schema, and data validation, like validating data types, removing duplicates, and handling missing data. Then we have the loading phase, where we have transactional loading and staging tables: we use database transactions to ensure ACID properties, and we load data into staging tables first, to validate it before merging it into the main tables. Then, ensuring ACID properties means ensuring that the ETL step is atomic: in case of failure, the system should revert to the previous consistent state. For consistency, we use constraints, triggers, and validation rules in the database to maintain integrity. For isolation, we implement proper transaction isolation levels to ensure that transactions do not interfere with each other. And for durability, we ensure that once the transaction is committed, it is stored permanently, so we use reliable storage solutions and regular backups. So this is the process where we can integrate third-party POI data feeds while maintaining
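The staging-table and transaction pattern described in this answer can be sketched with SQLite from the Python standard library. The table names (`pois`, `staging_pois`), the validation rule, and the function `load_feed` are hypothetical. The point is only that a batch is validated in staging and merged inside one transaction, so a bad batch leaves the main table untouched.

```python
import sqlite3
from typing import List, Tuple


def load_feed(conn: sqlite3.Connection, records: List[Tuple]) -> bool:
    """Load one third-party batch via a staging table; commit all rows or none."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("DELETE FROM staging_pois")
            conn.executemany(
                "INSERT INTO staging_pois (id, name, category) VALUES (?, ?, ?)",
                records,
            )
            bad = conn.execute(
                "SELECT COUNT(*) FROM staging_pois WHERE name IS NULL OR name = ''"
            ).fetchone()[0]
            if bad:
                raise ValueError(f"{bad} staged rows failed validation")
            # Merge validated rows into the main table.
            conn.execute(
                "INSERT OR REPLACE INTO pois (id, name, category) "
                "SELECT id, name, category FROM staging_pois"
            )
        return True
    except (ValueError, sqlite3.Error):
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pois (id INTEGER PRIMARY KEY, name TEXT NOT NULL, category TEXT)")
conn.execute("CREATE TABLE staging_pois (id INTEGER, name TEXT, category TEXT)")
conn.commit()

ok = load_feed(conn, [(1, "Cafe Uno", "restaurant"), (2, "Sea View", "hotel")])
rejected = load_feed(conn, [(3, "New Place", "hotel"), (4, None, "hotel")])
count = conn.execute("SELECT COUNT(*) FROM pois").fetchone()[0]
```

The second batch contains one invalid row, so the whole batch is rolled back, including the valid row with id 3.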

So, basically, the goal is to optimize the SQL queries for greater efficiency without compromising data integrity. In this case, we had the extracting phase using API integration with RESTful APIs or web services. Then we had the transforming phase, which included schema mapping and data validation, and the loading phase, which included transactional loading and staging tables. We then ensured the ACID properties. So we can plan our optimization on top of that process, similar to the process in the previous question, where we had the ACID properties. First, we can refactor queries: this involves analyzing existing queries to identify bottlenecks. Then we use indexes: we identify columns frequently used in the WHERE clause and in join conditions. We limit the use of DISTINCT to minimize its impact on query performance, checking whether it is really necessary for specific views, and we use UNION ALL instead of UNION. We also normalize the data model, and we consider denormalizing tables for read-heavy operations to reduce join operations and improve query performance. Then there is query execution plan analysis, in particular for join operations: we need to choose appropriate join algorithms, such as hash, merge, and index-based joins, depending on the size of the table and the available indexes. Then, instead of embedding values directly into SQL queries, we use parameterized queries, which also avoids SQL injection. Then we have regular maintenance and testing. That is how we plan and optimize the SQL queries used in the ETL process for greater efficiency without sacrificing data integrity.
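Three of the techniques named in this answer (an index on a filtered column, a parameterized query, and inspection of the execution plan) can be shown together in a minimal SQLite sketch. The schema, the index name `idx_pois_category`, and the data are hypothetical, and the exact plan text varies by SQLite version, so it is captured but not interpreted here.

```python
import sqlite3
from typing import List

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pois (id INTEGER PRIMARY KEY, name TEXT, category TEXT)")
conn.executemany(
    "INSERT INTO pois (name, category) VALUES (?, ?)",
    [(f"poi-{i}", "hotel" if i % 2 else "restaurant") for i in range(100)],
)

# Index the column that appears in the WHERE clause.
conn.execute("CREATE INDEX idx_pois_category ON pois (category)")


def names_in_category(conn: sqlite3.Connection, category: str) -> List[str]:
    """Parameterized query: the value is bound, never spliced into the SQL text."""
    rows = conn.execute(
        "SELECT name FROM pois WHERE category = ? ORDER BY id", (category,)
    ).fetchall()
    return [row[0] for row in rows]


# Ask SQLite how it would execute the filter; the last column is the plan detail.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM pois WHERE category = ?", ("hotel",)
).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan)

hotels = names_in_category(conn, "hotel")
# An injection attempt is treated as an ordinary (non-matching) string.
injected = names_in_category(conn, "hotel' OR '1'='1")
```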

Integrating an LLM into an existing Python-based ETL pipeline for enhanced NLP processing. So, basically, we can break this down into steps. Firstly, choosing the model, then data preprocessing, and then model inference, running the LLM on the input data depending on the use case. Then we post-process the output generated by the LLM as needed. This may include decoding, formatting, and further analysis, and then error handling. We can then do testing and validation. After that, we have to optimize the LLM integration for performance by leveraging techniques like batch processing and parallelization, and then comes documentation and training: documenting the integration process and providing training and support for the team members working with the integrated LLM. We then use CI/CD pipelines to continuously monitor and evaluate the performance of the LLM integration. This is the main process. In other cases, we can also use A/B testing, prompt tuning, re-tuning, prefix tuning, or fine-tuning the models on the data we have, which can improve the models' ROUGE scores. For model evaluation, we can check ROUGE metrics or use an LLM alone. Alongside these steps, we have Weights and Biases, also known as W&B, which can also be used in the ETL pipeline. It regularly monitors the model's performance and sends an email if the model is lagging behind the scores we have benchmarked. This is a process to enhance NLP processing.
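The preprocess, batched inference, and post-process steps can be sketched as a single transform function. No real model is called here: `fake_llm` is a hypothetical stand-in that only upper-cases its input, and a real integration would replace it with a call to whichever model or API is chosen. The record layout is also assumed.

```python
from typing import Callable, Dict, Iterator, List


def fake_llm(prompts: List[str]) -> List[str]:
    """Hypothetical stand-in for a batched LLM call; it only upper-cases text."""
    return [prompt.upper() for prompt in prompts]


def chunked(items: List[Dict], size: int) -> Iterator[List[Dict]]:
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def enrich_records(
    records: List[Dict],
    model: Callable[[List[str]], List[str]],
    batch_size: int = 2,
) -> List[Dict]:
    """Transform step: preprocess, run batched inference, post-process."""
    enriched: List[Dict] = []
    for batch in chunked(records, batch_size):
        texts = [record["text"].strip() for record in batch]  # pre-processing
        outputs = model(texts)                                # batched inference
        if len(outputs) != len(batch):                        # basic validation
            raise ValueError("model returned a different number of outputs")
        for record, output in zip(batch, outputs):            # post-processing
            enriched.append({**record, "llm_output": output})
    return enriched


records = [{"id": 1, "text": " cafe near park "}, {"id": 2, "text": "hotel"}, {"id": 3, "text": "museum"}]
result = enrich_records(records, fake_llm)
```

Passing the model in as a callable keeps the pipeline testable with a stub and lets the real model be swapped in without changing the transform code.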