Vetted Talent

Bhavarth Bhangdia

As a Data Science Intern at Alphaa AI, I created impactful solutions using Python, Kaggle notebooks, and mathematical and statistical principles. I am proficient in creating diverse datasets, architecting robust pipelines, and optimizing ETL processes for greater efficiency and better data handling. I also convey complex ideas through compelling data stories, demonstrating my communication and visualization skills.


I am pursuing my Bachelor of Technology in Electronics and Communication Engineering at the Indian Institute of Information Technology Allahabad, with coursework in Data Structures, Operating Systems, Distributed Systems, and Machine Learning. I have skills in front-end and back-end development, using languages such as C++, JavaScript, and SQL, and frameworks and tools like ReactJS, NodeJS, and MongoDB. I have spearheaded the development and launch of dynamic websites and applications, such as Filmpire CineVerse and Media Mimic, that enhance user engagement and streamline content discovery. I have solved more than 500 challenging problems on platforms like LeetCode, InterviewBit, and Code Studio, reflecting my dedication to honing my problem-solving skills.


I am driven by a quest for excellence, constantly seeking to stay updated with the latest industry trends and best practices. I am eager to bring my technical expertise, passion for innovation, and collaborative spirit to a forward-thinking team.

  • Role

    ML ENGINEER

  • Years of Experience

    1.9 years

Skillsets

  • JavaScript
  • SQL
  • Python
  • AWS
  • Diffusion Models
  • Git
  • JAX
  • Keras
  • OpenCV
  • PyTorch
  • Ray Serve
  • TensorFlow
  • Torch
  • Transformer

Vetted For

10 Skills

  • Machine Learning Scientist II (Places) - Remote (AI Screening)
  • Result: 62%
  • Skills assessed: Large POI Database, Text Embeddings Generation, ETL Pipeline, LLM, Machine Learning Model, NLP, Problem Solving Attitude, Python, R, SQL
  • Score: 56/90

Professional Summary

1.9 Years
  • Aug, 2024 - Present (1 yr 1 month)

    ML ENGINEER

    PIXLR
  • Dec, 2023 - Mar, 2024 (3 months)

    AI CODER

    SCALE AI
  • Sep, 2023 - Dec, 2023 (3 months)

    DATA SCIENCE INTERN

    ALPHAA AI
  • May, 2023 - Jul, 2023 (2 months)

    SOFTWARE DEVELOPER INTERN

    AITA

Applications & Tools Known

  • JavaScript
  • React
  • Node.js
  • Next.js
  • Express.js
  • Python
  • PyTorch
  • BigPanda
  • MySQL
  • Git
  • Docker
  • OpenShift
  • Kubernetes
  • Azure DevOps
  • Postman

Work History

1.9 Years

ML ENGINEER

PIXLR
Aug, 2024 - Present (1 yr 1 month)
    Led research and development of AI systems in image processing using LoRA fine-tuning. Designed inference pipelines, optimizing runtime by 70%. Implemented MLOps pipelines with Ray Serve, improving model-serving stability by 120%. Reduced text-to-image generation latency by 75%.

AI CODER

SCALE AI
Dec, 2023 - Mar, 2024 (3 months)
    Evaluated AI-generated code quality, improving readability and maintainability by 92%. Resolved coding problems, optimized code performance, addressed 80% of bottlenecks, and developed high-quality source code.

DATA SCIENCE INTERN

ALPHAA AI
Sep, 2023 - Dec, 2023 (3 months)
    Engineered a random forest classifier for customer churn prediction, achieving 85% accuracy. Modeled and analyzed risk tools like VaR and stress tests. Engaged with trading teams for risk and performance studies.

SOFTWARE DEVELOPER INTERN

AITA
May, 2023 - Jul, 2023 (2 months)
    Engineered an online enrollment system, facilitating registration for over 55 students nationwide.

Achievements

  • 92% improvement in AI-generated code readability
  • 85% accuracy rate on customer churn prediction
  • 20% reduction in maintenance costs
  • 15% increase in machinery uptime
  • 30% boost in customer satisfaction scores
  • Mean Absolute Error (MAE) of less than 2% in stock price forecasting

Major Projects

4 Projects

Predictive Maintenance for Industrial Machinery

    Developed a predictive maintenance solution using LSTM neural networks.

Sentiment Analysis for Customer Feedback

    Implemented Bidirectional LSTM and CNN models for sentiment analysis of movie reviews. Achieved up to 88.61% accuracy with BiLSTM using dropout regularization.

Stock Price Forecasting using Machine Learning

    Applied Random Forest and Gradient Boosting for accurate stock price prediction.

Smart Power Allocation System

    Designed an actor-critic reinforcement learning framework improving power supply reliability by 89%. Achieved 85% faster convergence through novel optimization techniques.

Education

  • Bachelor of Technology in ELECTRONICS AND COMMUNICATION ENGINEERING

    INDIAN INSTITUTE OF INFORMATION TECHNOLOGY ALLAHABAD (2024)
  • CBSE CLASS XII

    ST. PAUL HIGH SECONDARY (2019)

Certifications

  • Microsoft Technology Associate (MTA)

  • Supervised Machine Learning

  • Advanced in Data Science

  • Advanced in Machine Learning

  • Microsoft Technology

AI-interview Questions & Answers

My name is Bhavarth Bhangdia, and I recently completed my bachelor's degree in Electronics. A system designed to automate the recognition and flagging of outdated POI listings can be built on a multi-component architecture. The first component is data ingestion: the system ingests POI data from various sources such as databases, document repositories, and web scraping. Next, an ETL (extract, transform, load) pipeline ensures the data is cleaned, standardized, and loaded into a central repository. Then an NLP (natural language processing) engine processes the textual data of each listing, extracting key metadata such as dates, renewal periods, and version numbers. Machine learning models then interpret the text and context of each listing to identify indicators of obsolescence or pending updates. Finally, I would treat this as a continuous process with continuous learning, which keeps the recognition and flagging of outdated POI listings automated over time.
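
As a rough illustration of the metadata-extraction step mentioned above, here is a minimal sketch in Python; the regular expression, the field name, and the sample text are illustrative assumptions rather than anything taken from the original answer.

    # Hypothetical sketch: pull simple "last updated" dates out of a POI listing's
    # free text, the kind of metadata the NLP stage above would extract.
    import re

    DATE_PATTERN = re.compile(r"(?:last updated|verified)[:\s]+(\d{4}-\d{2}-\d{2})", re.IGNORECASE)

    def extract_listing_metadata(text: str) -> dict:
        """Return any last-updated / verified dates found in the listing text."""
        return {"dates_found": DATE_PATTERN.findall(text)}

    print(extract_listing_metadata("Joe's Diner. Last updated: 2021-03-04. Hours may vary."))
    # -> {'dates_found': ['2021-03-04']}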

To suggest a method to automate data verification in an ETL pipeline for consistency, the first requirement is to define data quality rules. There are two or three characteristics to keep in mind. The first is a consistency rule, which ensures the data conforms to the expected formats and values. The next is a uniqueness rule, which ensures that no duplicate records are present in the data. Then I would perform initial data profiling to understand the characteristics of the data, using a tool such as Apache Griffin to analyze data patterns and anomalies, and check data quality before it is loaded by the ETL process; this includes schema validation, data type checks, and the initial data profiling. Next, I would implement an automated testing framework such as Deequ to continuously test data as it moves through the ETL process. Finally, I would trace data lineage to understand the flow of data through the pipeline; in practice that means implementing logging and auditing to maintain a history of data changes and transformations, with a tool like Apache Atlas for lineage. That would be my approach to automating data verification in an ETL pipeline for consistency.
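
As a minimal sketch of the consistency and uniqueness rules described above, the snippet below uses plain pandas rather than a dedicated framework like Deequ or Apache Griffin; the column names and the toy batch are hypothetical.

    # Hypothetical batch-level checks: a uniqueness rule and two consistency rules.
    import pandas as pd

    def verify_batch(df: pd.DataFrame) -> list:
        issues = []
        if df["order_id"].duplicated().any():        # uniqueness rule
            issues.append("duplicate order_id values")
        if df["order_id"].isna().any():              # consistency: mandatory key present
            issues.append("missing order_id values")
        if (df["amount"] < 0).any():                 # consistency: expected value range
            issues.append("negative amounts")
        return issues

    batch = pd.DataFrame({"order_id": [1, 2, 2], "amount": [10.0, -5.0, 3.5]})
    print(verify_batch(batch))   # ['duplicate order_id values', 'negative amounts']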

To devise a strategy for implementing SQL-based real-time data validation: to ensure the data adheres to the expected format, the very first things to keep in mind are data types, ranges, and values; then check for missing values and for mandatory fields that must be populated. The next thing is to ensure that no duplicate records are present in the data. Then I would choose an ETL tool that supports SQL-based operations, for example Apache Airflow, and a robust RDBMS, for example MySQL, to manage and query the data. I would load raw data into staging tables for initial validation, embed the validation and transformation logic in SQL scripts according to how the data should be processed, and then write SQL queries to validate the data after it has been loaded into the target tables. Finally, I would set up real-time monitoring so that stakeholders are notified if any data quality issue occurs. So the workflow is: first, load raw data into the staging tables; second, perform pre-ETL validations using SQL queries; third, apply transformations using SQL scripts, with real-time data quality checks built into the transformation queries; fourth, load the transformed data into the target tables and perform post-ETL validations using SQL queries; and finally, configure alerts to notify if any anomaly occurs.
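
Below is a small sketch of the post-load validation step from the workflow above. The checks are kept as SQL strings and run against an in-memory SQLite table purely to stay self-contained; the staging table, its columns, and the two rules are assumptions.

    import sqlite3

    # Hypothetical post-ETL checks; each query returns a count of violations.
    VALIDATION_QUERIES = {
        "duplicate_ids": "SELECT COUNT(*) - COUNT(DISTINCT poi_id) FROM staging_poi",
        "missing_names": "SELECT COUNT(*) FROM staging_poi WHERE name IS NULL OR name = ''",
    }

    def run_validations(conn) -> dict:
        failures = {}
        for check, sql in VALIDATION_QUERIES.items():
            (count,) = conn.execute(sql).fetchone()
            if count:
                failures[check] = count          # would feed the alerting step
        return failures

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE staging_poi (poi_id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO staging_poi VALUES (?, ?)", [(1, "Cafe"), (1, "Cafe"), (2, "")])
    print(run_validations(conn))   # {'duplicate_ids': 1, 'missing_names': 1}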

For a technique incorporating vector database technology into a POI matching algorithm: first, determine the type of data we are dealing with, which is the most basic consideration, and then define the matching criteria, for example cosine similarity for embeddings or Euclidean distance for feature vectors. Next, select a vector database that supports high-dimensional data and efficient similarity search. For a managed service such as Pinecone, follow the cloud provider's setup instructions; for an open-source solution such as Milvus, deploy it on our preferred infrastructure. Then create an index in the vector database to enable efficient similarity matching across the data. In the next phase, extract the raw data from the source systems, generate feature vectors with a model suitable for the items we want to match, and transform the raw data into vector representations; this can be done with pretrained models or models trained on the specific data. Load the generated vectors into the vector database, for example into Pinecone, and implement the matching algorithm. The implementation should include real-time checks to ensure data quality and consistency in the vector representations, plus monitoring to track the performance and health of the vector database. Finally, review the matching results, refine the feature selection mechanism, and update the ETL pipeline and vector database configuration as needed to improve performance. In summary: extract; transform the raw data into vector representations; load the vectors; perform similarity search and matching using the vector database; retrieve and process the matching results; and monitor the vector database and the ETL pipeline, with alerts configured for any issues in data processing or match accuracy. That is how I would propose incorporating a vector database into a POI matching algorithm.
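
A minimal sketch of the cosine-similarity matching step described above, with a plain NumPy dictionary standing in for a managed vector database such as Pinecone or Milvus; the toy 4-dimensional vectors and the 0.9 threshold are assumptions.

    import numpy as np

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def best_match(query: np.ndarray, index: dict, threshold: float = 0.9):
        """Return the closest POI name, or None if nothing clears the threshold."""
        scored = {name: cosine_similarity(query, vec) for name, vec in index.items()}
        name, score = max(scored.items(), key=lambda kv: kv[1])
        return (name, score) if score >= threshold else (None, score)

    # Toy "index" of POI embeddings; a real system would query the vector database.
    index = {
        "Central Park":      np.array([0.9, 0.1, 0.0, 0.3]),
        "Central Perk Cafe": np.array([0.2, 0.8, 0.5, 0.1]),
    }
    print(best_match(np.array([0.88, 0.12, 0.05, 0.28]), index))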

To ensure the freshness of the points-of-interest data within our dataset, the first method that comes to mind is timestamp tracking, so that each POI (point of interest) record carries an associated timestamp indicating its last update or verification. We then implement a process to update or verify that data regularly and define the criteria for fresh data, which could be based on the last-updated timestamp. Next, we write a script or stored procedure to check the freshness of each POI record, for example an SQL query, and set up alerts for records that do not meet the freshness criteria, such as sending an email whenever stale data is detected. We also use dashboards to monitor freshness; we can build one with Tableau, or integrate charting libraries with Flask or Django, so the data is easy to understand and visualize. Finally, we develop a strategy to refresh the stale data, which could involve automated updates from external data sources or manual updates; for example, integrating with an external API that provides updated POI information. So the workflow is: timestamp tracking; freshness-check scripts; alerts; monitoring dashboards; and the data refresh itself, automated by integrating with external POI data sources.
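
A sketch of the freshness check and alert from the workflow above, written as a MySQL-style query plus a small alert hook that expects a standard DB-API cursor; the poi_records table, its columns, and the 90-day window are assumptions.

    # Hypothetical freshness rule: a POI record is stale if not updated in 90 days.
    FRESHNESS_SQL = """
    SELECT poi_id, name, last_updated
    FROM poi_records
    WHERE last_updated < NOW() - INTERVAL 90 DAY
    """

    def alert_on_stale(cursor) -> int:
        """Run the freshness query and report how many stale records were found."""
        cursor.execute(FRESHNESS_SQL)
        stale = cursor.fetchall()
        if stale:
            # Stand-in for the email / dashboard alert described above.
            print(f"ALERT: {len(stale)} POI records have not been updated in 90 days")
        return len(stale)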

For the Python function that is meant to match POI names, the logical error that might cause incorrect matching is in how the conditions are combined. First we compare poi_a.lower() with poi_b.lower() for equality, and then we check the lengths, requiring one name's length to be greater than 5 or the other's to be greater than 2. Rather than the or operator we should use the and operator, so that the expression is true only when both length conditions hold, and false otherwise. So the error I would highlight in this sample code is that the length check should use an and operation rather than or.
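
The function from the interview question is not reproduced in the transcript, so the snippet below is a hypothetical reconstruction of the bug being described, with illustrative names and length thresholds.

    def match_poi_buggy(poi_a: str, poi_b: str) -> bool:
        # Buggy: `or` lets a match through when only one length condition holds.
        return poi_a.lower() == poi_b.lower() and (len(poi_a) > 5 or len(poi_b) > 2)

    def match_poi_fixed(poi_a: str, poi_b: str) -> bool:
        # Fixed as suggested in the answer: `and` requires both length conditions.
        return poi_a.lower() == poi_b.lower() and (len(poi_a) > 5 and len(poi_b) > 2)

    print(match_poi_buggy("Cafe", "CAFE"))   # True, even though "Cafe" fails the > 5 check
    print(match_poi_fixed("Cafe", "CAFE"))   # False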

For this Python code that parses geospatial data, the task is to determine whether there is a bug that could lead to an unhandled exception. In the code, we import GeoPandas, read a file, and then print the rows. In the exception handling, a FileNotFoundError handler catches that specific error, and a general Exception handler catches any unexpected error with a generic error message. To make sure GeoPandas is available, first run pip install geopandas, and then replace the sample path with our own file. By running it we can reproduce any error and modify the code based on what the interpreter reports, since the code already handles the file-not-found case and catches unexpected errors through the generic except Exception as e block.
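
The exact snippet from the question is not shown in the transcript; the code below is a hedged reconstruction of the pattern being described, reading a geospatial file with GeoPandas and catching both the specific and the generic failure cases. Depending on the GeoPandas I/O backend, a missing file may surface as a backend-specific error rather than FileNotFoundError, which is why the catch-all handler matters; the file path is hypothetical.

    import geopandas as gpd   # requires `pip install geopandas`, as noted above

    def load_geodata(path: str):
        try:
            gdf = gpd.read_file(path)      # may fail for missing or malformed files
            print(gdf.head())
            return gdf
        except FileNotFoundError:
            print(f"File not found: {path}")
        except Exception as e:             # generic handler for backend/parse errors
            print(f"Unexpected error occurred: {e}")
        return None

    load_geodata("data/places.shp")        # hypothetical path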

To formulate an approach for creating a high-accuracy ML model in R to predict POI popularity, let's start with the basics. We define the problem statement, which is to predict the popularity or attractiveness of a POI, and the objective, which is to develop an ML model that predicts this as accurately as possible to support decision making, for example over a two-year horizon. We first collect the data and perform data cleaning, handling missing values. After cleaning, we use feature selection to decide which features matter for popularity, and derive new features from existing ones if they capture additional information the model may need. Then we preprocess: convert categorical variables into numerical representations with techniques such as label encoding, and normalize the features so they have similar ranges and magnitudes; normalization is one of the key factors to keep in mind. Next we select an appropriate ML model for the regression or classification task. For regression we can use linear regression, random forest regression, or gradient boosting regression; for classification, logistic regression or a random forest classifier. We should also consider ensemble methods such as random forests and gradient boosting and weigh their pros and cons. Then we split the data into training, testing, and validation sets and use techniques like k-fold cross-validation to assess the model's generalization and prevent overfitting. We choose evaluation metrics appropriate to the problem type, for example F1 score, accuracy, precision, and recall, and assess the model's performance on the validation dataset against those metrics. Finally, we fine-tune the model's hyperparameters using techniques like grid search to optimize performance, analyze feature importance to understand which factors contribute most to POI popularity, and deploy the final model. In summary: process the data, select the best model, add and encode any additional features, normalize, split the data for training, testing, and validation, use k-fold cross-validation to determine the model's effectiveness, and hyper-tune the model as required. That is how I would formulate an approach to create a high-accuracy ML model to predict POI popularity from various factors.
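
Although the question is framed around R, the sketch below uses Python's scikit-learn to keep every example in this document in one language; the same steps (scaling, k-fold cross-validation, grid search) map directly onto R packages such as caret or tidymodels. The synthetic data and the parameter grid are assumptions.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Synthetic stand-in for engineered POI features and a popularity label.
    X, y = make_classification(n_samples=500, n_features=8, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Normalization + model in one pipeline, tuned with k-fold CV and grid search.
    pipeline = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=0))
    grid = GridSearchCV(
        pipeline,
        {"randomforestclassifier__n_estimators": [100, 300],
         "randomforestclassifier__max_depth": [None, 10]},
        cv=5,
        scoring="f1",
    )
    grid.fit(X_train, y_train)
    print(grid.best_params_, grid.score(X_test, y_test))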

To optimize SQL queries used in ETL processes for greater efficiency without sacrificing data integrity: first, use database monitoring tools to identify queries with high execution time, and use a tool like EXPLAIN in MySQL to understand the query execution plan and identify potential bottlenecks. Then make sure the database schema is properly normalized to reduce redundancy and improve query efficiency, and consider denormalizing certain tables if that improves query performance, especially for frequently accessed data. The next strategy is indexing: identify the columns used in WHERE clauses and join conditions and create appropriate indexes on those columns to speed up data retrieval. The main caution to keep in mind is not to create too many indexes, since that can hurt insert and update performance. Then, in query optimization, use WHERE clauses efficiently so rows are filtered on indexed columns first, select only the necessary columns to reduce data-transfer overhead (avoiding SELECT *), and use appropriate join types with efficient join conditions. Allocate sufficient memory to the database server to reduce disk I/O and improve caching, and increase the buffer pool size so frequently accessed data is cached in memory and served faster. Another option is parallel processing in the database, which reduces the time required to load the data. Putting these methods together as a scenario: use monitoring tools to identify queries with high execution time; use EXPLAIN to analyze the execution plans of the identified queries; create indexes on the columns in WHERE clauses and join conditions to improve query performance; rewrite queries to use efficient WHERE clauses, avoid unnecessary subqueries, and optimize join operations such as inner and outer joins; configure the database server to allocate sufficient memory, optimize the disk layout, and adjust the buffer pool size; and finally tune the CPU so it can be used for parallel processing tasks. That is my plan to optimize SQL.
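
An illustrative before/after of the indexing and column-selection advice above, written as MySQL-style statements kept in Python strings; the poi table and its columns are assumptions.

    # Before: SELECT * plus a function wrapped around the filtered column,
    # which prevents an index from being used and transfers unneeded columns.
    SLOW_QUERY = "SELECT * FROM poi WHERE LOWER(city) = 'berlin'"

    # 1. Inspect the execution plan to confirm the full table scan.
    EXPLAIN_QUERY = "EXPLAIN " + SLOW_QUERY

    # 2. Index the column used in the WHERE clause (sparingly, as noted above).
    CREATE_INDEX = "CREATE INDEX idx_poi_city ON poi (city)"

    # After: filter directly on the indexed column and select only needed columns.
    FAST_QUERY = "SELECT poi_id, name, category FROM poi WHERE city = 'Berlin'"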

First, let's understand NLP, natural language processing. We determine the specific NLP task we want to perform. We ensure the textual data is extracted from the relevant sources as part of the ETL process, and assess the volume of textual data being processed to determine the scalability requirements for the NLP stage. Then we perform text cleaning, such as removing special characters, stop words, and punctuation. For sentiment analysis, the task is to determine sentiment; for named entity recognition, it is to identify entities like persons and organizations. We can use NLP libraries such as spaCy or Hugging Face Transformers for this processing. We then extract the relevant features from the text for downstream analysis, convert the text into numerical representations using word embeddings, for example BERT embeddings, and select the most important features for downstream tasks like ML analysis. We incorporate the NLP processing steps into the existing ETL pipeline as a separate transformation stage; keeping it as a separate transformation stage is the most important point. We use parallel processing techniques to scale the NLP stage to the large volumes of text data we have, and we maintain data consistency and integrity through the ETL pipeline by validating the NLP outputs against the source data. Then we train ML models on the NLP-processed data for tasks like sentiment analysis, evaluate model performance with metrics such as F1 score, recall, and precision, and continuously improve the NLP models based on performance feedback and domain-specific requirements. We deploy the models into the production environment for real-time processing, with batch processing where the system requires it, monitor performance and data quality to ensure smooth operation, and finally establish a feedback loop to collect user feedback and update the models so the pipeline improves accordingly. To summarize: data extraction; data cleaning to remove unwanted characters; preprocessing; applying NLP techniques; feature engineering; training ML models on the NLP-processed data; deploying the NLP models; and monitoring performance, all implemented through a separate transformation stage in the pipeline. That is how I would approach it.
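
As a sketch of the separate transformation stage described above: a cleaning step plus a sentiment step applied to a batch of records inside an ETL job. It assumes the Hugging Face transformers library is installed and that its default sentiment-analysis model is acceptable; the review_text field name is hypothetical.

    import re
    from transformers import pipeline

    sentiment = pipeline("sentiment-analysis")   # downloads a default model on first use

    def clean_text(text: str) -> str:
        """Remove special characters and collapse whitespace, as described above."""
        text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
        return re.sub(r"\s+", " ", text).strip()

    def nlp_transform_stage(records: list) -> list:
        """Enrich each record with cleaned text and a sentiment label."""
        for record in records:
            cleaned = clean_text(record["review_text"])
            record["cleaned_text"] = cleaned
            record["sentiment"] = sentiment(cleaned)[0]["label"]
        return records

    print(nlp_transform_stage([{"review_text": "Great place!! Totally worth it :)"}]))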