Vetted Talent

Kalpesh Anil Dahake

Looking for a challenging role in a reputable organization where I can use my technical and management skills to help the organization grow and deepen my knowledge of new and emerging trends.
  • Role

    Senior Associate II, Data Science

  • Years of Experience

    5 years

  • Professional Portfolio

    View here

Skillsets

  • Dlib
  • LLM
  • Machine Learning
  • RAG
  • AWS Bedrock
  • Azure Machine Learning
  • Azure OpenAI
  • OCI AI Services
  • BERT
  • GPT-4o
  • Microsoft Azure
  • OpenCV
  • Prompt Engineering
  • Robot Framework
  • SDXL LoRA
  • ServiceNow
  • Tableau
  • Agentic AI
  • NLP - 3 Years
  • Oracle - 1 Year
  • GPT-4 - 6 Years
  • GitHub - 3 Years
  • Python - 3.8 Years
  • Deep Learning - 4.0 Years
  • SQL - 2 Years
  • Generative AI - 3.0 Years
  • AI - 3 Years
  • AWS
  • Computer Vision
  • Docker
  • Gemini 1.5 Flash
  • Gemini 1.5 Pro
  • GPT-3.5
  • GPT-4 Turbo

Vetted For

8 Skills
  • Junior Data Scientist (Onsite), AI Screening
  • 63%
  • Skills assessed: Data Modelling, Data Science, Data Visualisation, Machine Learning, Problem Solving Attitude, Python, R Python, SQL
  • Score: 57/90

Professional Summary

5 Years
  • Jul, 2024 - Present 1 yr 10 months

    Senior Associate II, Data Science

    Kyndryl Solutions
  • Apr, 2023 - Sep, 2023 5 months

    Software Engineer II

    Terralogic Software Solutions
  • Sep, 2021 - Jan, 2023 1 yr 4 months

    System Engineer

    Huawei Technologies
  • Oct, 2018 - Mar, 2019 5 months

    Project Engineer Intern

    MD Electricals
  • Nov, 2019 - Dec, 2020 1 yr

    Associate Engineer

    Tata Communications
  • Dec, 2020 - Sep, 2021 9 months

    Customer Service Executive

    Tata Communications Transformations Services

Applications & Tools Known

  • MySQL
  • Oracle Cloud
  • Azure Machine Learning Studio
  • GitHub

Work History

5 Years

Senior Associate II, Data Science

Kyndryl Solutions
Jul, 2024 - Present 1 yr 10 months
    Leading the AI team on an automotive AI project in the "Manufacturing Operation" department, ensuring delivery within timelines. Worked with GPT-4 Turbo, GPT-4o, Gemini 1.5 Pro, and Gemini 1.5 Flash for various AI tasks. Used SDXL LoRA for vision-related projects, including image analysis and car plant layout generation. Deployed and managed AI models using Docker for scalable and efficient execution. Developed and deployed multiple chatbots: Talk to Doc Chatbot, Talk to DB Chatbot, and Contextual Understanding Chatbot. Applied ML to prediction and classification tasks. Built a Data Anonymizer model using GPT-4o and LLM-Guard. Built a Text-to-SQL chatbot for query generation. Gained expertise in Agentic AI concepts and applications.

Software Engineer II

Terralogic Software Solutions
Apr, 2023 - Sep, 2023 5 months
    Developed and implemented AI algorithms and models to solve complex problems. Researched and analyzed data to identify trends and patterns. Designed and developed machine learning models. Developed and maintained AI-based databases. Analyzed and improved the performance of a GPT-3.5-powered chatbot by identifying and resolving bias in its responses. Collaborated with a team of engineers to build and deploy a production-ready chatbot using GPT-4.

System Engineer

Huawei Technologies
Sep, 2021 - Jan, 2023 1 yr 4 months
    Engaged directly with customers in the design and development of solutions according to functional requirements. Performed debugging, troubleshooting, modification, and unit testing of integration solutions. Supported, monitored, and executed production application jobs and processes. Participated in the development of documentation, technical procedures, and user support guides. Designed and built advanced NLP models, algorithms, and systems that process, understand, and generate human language with high accuracy and efficiency. Documented all phases of the NLP development process, including design decisions, methodologies, and results, to ensure knowledge sharing and maintain high-quality standards.

Customer Service Executive

Tata Communications Transformations Services
Dec, 2020 - Sep, 2021 9 months
    Evaluated functional and non-functional requirements for testability, and test cases and use cases for suitability for automation. Developed, coded, and executed test case and test script frameworks using automated tools such as Rational Functional Tester and other Rational suite tools. Gained experience with performance monitoring tools and visualizations. Multi-tasked across the different applications required for a release, working under pressure and delivering within tight deadlines. Designed, developed, and enhanced chatbot solutions, integrating NLP techniques and participating in testing and debugging to improve the user experience.

Associate Engineer

Tata Communications
Nov, 2019 - Dec, 2020 1 yr
    Developed, tested, and maintained Python applications. Worked with other developers, designers, and stakeholders to meet project needs. Troubleshot and debugged code. Developed and maintained code and application documentation. Participated in code reviews and contributed to the team's best practices. Kept current with emerging technologies, programming languages, and software development techniques. Applied NLP techniques to analyze text data, including sentiment analysis, named entity recognition, topic modelling, and text categorization.

Project Engineer Intern

MD Electricals
Oct, 2018 - Mar, 2019 5 months
    Worked as a Project Engineer Intern under Automation Department. Assisted in automation projects and contributed to engineering tasks.

Major Projects

6 Projects

Autonomous Robot Controller

    Obstacle avoiding robot using infrared object detector for navigation. The robot turns left or right when an obstacle is detected and moves forward when the path is clear.

Smart Robot to Rescue Child from Borewell using Raspberry Pi

    Developed a wireless robot for surveillance and rescue operations in borewells, controlled via an Android application.

Restaurant Reviews Sentiment Analysis using NLP

    Implemented NLP-based sentiment analysis on restaurant reviews, achieving 73% accuracy in sentiment prediction after data preprocessing and classifier training.

Zomato Restaurant Rating Predictor

    Built a machine learning model in Python to predict restaurant ratings based on customer reviews and features, providing insights for the food industry.

FaceVision using OpenCV

    Developed a real-time face detection system using Haar cascades and Dlib's HOG detector for accurate detection in images and videos.

Resume Parser using NLP

    Created Python scripts using regular expressions and NLP to extract key information from resumes, including names, contact details, skills, and education.

Education

  • Bachelor of Engineering

    Savitribai Phule Pune University (2019)
  • Diploma in Engineering

    Maharashtra State Board Of Technical Education (2016)
  • HSC

    Pune University (2013)
  • SSC

    Pune University (2011)

Certifications

  • OCI AI Foundation Associate (1Z0-1122-23)

    Oracle (Dec, 2023)
  • Databricks Accredited Generative AI Fundamentals

    Databricks (Dec, 2023)

Interests

  • Exploring Places
  • Technology Research
  • Reading

AI-interview Questions & Answers

    Good morning. I have more than three years of experience in data science and AI, including machine learning, deep learning, and deployment. I graduated in engineering in the 2019 batch. I then started my career at Tata Communications as an Associate Engineer. There, I worked on the chatbot development team and on NLP data analysis, applying NLP techniques such as text summarization, machine translation, and NER. I then moved to Tata Communications Transformations Services, a subsidiary of Tata Communications, as a Customer Service Executive coordinating with the project delivery department. I made sure that our data architect team was running smoothly and that no issues arose, and I worked to detect and fix the bugs that came up during project delivery. Then I switched to Huawei Technologies, where I worked on a system that could predict alarms. The product was a network management system: when a network fiber goes down, an alarm should be raised. Based on this, we created a prediction model using our video streaming data and past data. My last company was Terralogic Software Solutions, where I was a Software Engineer. I handled AI-related tasks and analyzed their HRM data. I built machine learning models to improve their HRM assistant and worked on the chatbot for the customer voice system. Starting from a small data set, we deployed the chatbot using NLP techniques and used models such as GPT-3.5 and GPT-4. We were working to improve the chatbot and grow our model. That is my background.

    Regularization techniques are implemented in neural networks to combat overfitting and improve generalization. There are several techniques. The first are L1 and L2 regularization. L1 regularization, also known as Lasso, adds the sum of the absolute values of the model's weights to the loss function. This encourages sparsity by driving some weights to 0, effectively reducing the size of the model. L2 regularization, on the other hand, adds the sum of the squared values of the model's weights to the loss function. This pushes the weights towards 0 without eliminating them entirely. That is the main difference between L1 and L2 regularization. For implementation, we have different types of penalty regularization, and we have to choose the right hyperparameter, such as lambda, for the regularization strength. Another technique is dropout. Dropout randomly drops neurons during training, forcing the network to rely less on individual features and learn more robust representations. It is implemented by setting the dropout parameter, for example dropout = 0.2, and we choose the dropout rate depending on the problem and network complexity. There is also early stopping. For that, we monitor performance on the validation set during training and stop training if the validation performance doesn't improve for a defined period. This prevents the model from memorizing the training data and helps it generalize better. With early stopping, we choose an appropriate patience value and a validation metric such as accuracy or loss. With data augmentation, we artificially increase the diversity of the training data through techniques like image flipping, cropping, or adding noise. This makes the model more robust to variations in unseen data. We also use weight decay. This gradually reduces the magnitude of the weights during training, similar to L2 regularization, but it is often implemented separately. It is implemented by setting the weight decay parameter in the optimizer; for example, if you use the Adam optimizer, you set its weight decay parameter. We choose the weight decay rate based on the data and network complexity. In the end, we have to find the optimal combination of techniques and hyperparameters.
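
The L1 and L2 penalties and the weight decay step described in this answer can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a framework implementation: the function names, the example weights, and the values of `lam`, `lr`, and `weight_decay` are all hypothetical.

```python
def l1_penalty(weights, lam):
    """L1 (Lasso) term: lam times the sum of absolute weight values."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """L2 term: lam times the sum of squared weight values."""
    return lam * sum(w * w for w in weights)


def sgd_step_with_weight_decay(weights, grads, lr, weight_decay):
    """One plain SGD update where each weight is also shrunk towards 0.

    Assumption: decay is added to the gradient (w <- w - lr * (g + wd * w)),
    which for plain SGD matches an L2 penalty of strength wd / 2.
    """
    return [w - lr * (g + weight_decay * w) for w, g in zip(weights, grads)]


# Hypothetical example values.
weights = [0.5, -2.0, 0.0]
data_loss = 1.0
lam = 0.1

loss_l1 = data_loss + l1_penalty(weights, lam)  # penalty on |w|
loss_l2 = data_loss + l2_penalty(weights, lam)  # penalty on w^2

# With zero gradients, the update only applies the decay: w <- 0.95 * w.
decayed = sgd_step_with_weight_decay([1.0, -2.0], [0.0, 0.0], lr=0.1, weight_decay=0.5)
```

Note how the L2 term penalizes the large weight (-2.0) much more heavily than the small one, while the L1 term grows linearly with each weight's magnitude.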

    You can reduce overfitting in a machine learning model using Python, which offers a toolkit for addressing overfitting. There are some common approaches. Among the data-centric techniques, data augmentation artificially expands your dataset with random transformations like rotation, flipping, and cropping, and feature engineering and selection means analyzing your features to identify the relevant ones. Then there is early stopping: you train your model with a separate validation set and stop training when validation performance starts to decline. On the model side, there is regularization. This involves penalizing large parameter values in your model to prevent overfitting on specific aspects of the data. Regularization is a common approach and is easily implemented with libraries like Scikit-Learn. Another technique is dropout. Dropout randomly deactivates neurons during training, forcing the model to learn robust representations that don't rely on specific features. Another is model complexity reduction: we use simpler models with fewer layers or neurons, starting with basic models and gradually increasing complexity if necessary. Then there are ensemble techniques. Ensembling combines the predictions from multiple models that are trained differently, for example with different hyperparameters or training splits, to get better generalization. Bagging and boosting techniques also provide tools for avoiding overfitting in machine learning models. However, it's important to remember that the best approach depends on the specific dataset and model. So we can experiment with different techniques and evaluate their impact on our model performance to find the optimal approach.
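
Early stopping, as described in this answer, can be sketched as a small patience-based check over validation losses. This is only an illustration: the function name `early_stopping_epoch`, the loss values, and the patience of 2 are assumptions, and a real training loop would compute the losses epoch by epoch instead of reading them from a list.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once the loss has failed to improve for `patience`
    consecutive epochs; otherwise it runs to the last epoch.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# Hypothetical validation losses: improvement stalls after epoch 2.
val_losses = [0.9, 0.7, 0.6, 0.65, 0.7, 0.5]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=2)
print(f"stop at epoch {stop_epoch}, restore weights from epoch {best_epoch}")
```

The last loss (0.5) is never seen because training has already stopped, which is the trade-off a small patience value makes.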

    As far as I know, normalizing data directly within SQL queries for machine learning purposes isn't a good first approach. It can lead to issues. Why? Because of data leakage, inefficiency, limited techniques and functionality, and data integrity problems. But there are some approaches we can use. The first is pre-processing the data outside SQL, at the Python level: we normalize the training and testing sets separately, using parameters computed from the training data, before feeding them into the machine learning pipeline. This ensures data integrity and allows for flexible and optimized calculations. The second applies when performance is critical: consider pre-calculating and storing normalized data in separate tables for faster access during model training and evaluation. However, remember to keep track of the normalization parameters. The third is database extensions: some databases offer extensions or stored procedures specifically designed for such data tasks, but options and capabilities vary. Machine learning emphasizes careful data handling and avoiding leakage. Normalizing within SQL queries might appear convenient, but it can compromise this principle, ultimately leading to suboptimal model performance.
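
A minimal sketch of the leakage-safe normalization described in this answer, using only the standard library. The helper names `fit_scaler` and `transform` and the sample values are hypothetical; the point is that the mean and standard deviation come from the training split only and are then reused for the test split.

```python
from statistics import mean, pstdev


def fit_scaler(train_values):
    """Compute normalization parameters from the training data only."""
    mu = mean(train_values)
    sigma = pstdev(train_values)
    # Guard against a constant column, where the standard deviation is 0.
    return mu, (sigma if sigma else 1.0)


def transform(values, params):
    """Apply previously fitted parameters; never refit on test data."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]


# Hypothetical feature values.
train = [2.0, 4.0, 6.0, 8.0]
test = [10.0]

params = fit_scaler(train)              # parameters to keep track of
train_scaled = transform(train, params)
test_scaled = transform(test, params)   # uses the training mean and std
```

Storing `params` alongside the model is what makes it possible to normalize new data the same way at prediction time.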

    As per my knowledge, SQL window functions are powerful tools for performing calculations across related rows. The first step is defining the window. The OVER clause specifies the window the function operates on. That involves partitioning, which defines how to group rows logically using the PARTITION BY clause, for example to calculate moving averages within each customer, and ordering, which specifies the order of rows within each partition using ORDER BY. This determines the direction of the window, like chronological order for time series data. The next step is choosing the function. Aggregate functions such as SUM, AVG, MIN, and MAX calculate aggregate values within the window for metrics like total sales or average price. ROW_NUMBER, DENSE_RANK, and RANK assign sequential ranks or positions to rows based on order or values within the window. LEAD and LAG access values from specific positions ahead of or behind the current row, which is useful for calculating differences or lags. A cumulative sum or cumulative average calculates running totals or averages from the beginning of the window up to the current row. To specify the range, the ROWS or RANGE clause defines the extent of the window relative to the current row. The bound n PRECEDING includes the n rows before the current row. The bound n FOLLOWING includes the n rows after the current row. ROWS BETWEEN m PRECEDING AND n FOLLOWING combines both bounds. RANGE is similar to ROWS, but specifies offsets based on the ordering value, for example 2 days preceding. To take one example: SELECT customer_id, order_date, product_price, AVG(product_price) OVER (PARTITION BY customer_id ORDER BY order_date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_average_price FROM orders. This query calculates the moving average product price for each customer, considering the current order and the 2 preceding orders within the same customer group. Window functions can be complex, so make sure you understand their logic and the impact of the different functions and frame clauses for your specific purpose. It is also common to use a common table expression to precalculate window results for readability.
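
The moving-average query from this answer can be tried end to end with Python's built-in `sqlite3` module. This is a sketch under assumptions: the `orders` table layout and the sample rows are made up for illustration, and the bundled SQLite must be version 3.25 or newer for window functions to be available.

```python
import sqlite3

# In-memory database with a hypothetical orders table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (customer_id INTEGER, order_date TEXT, product_price REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "2023-01-01", 10.0),
        (1, "2023-01-02", 20.0),
        (1, "2023-01-03", 30.0),
        (1, "2023-01-04", 40.0),
        (2, "2023-01-01", 100.0),
    ],
)

# Average over the current row and the 2 preceding rows, per customer.
query = """
SELECT customer_id,
       order_date,
       product_price,
       AVG(product_price) OVER (
           PARTITION BY customer_id
           ORDER BY order_date
           ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
       ) AS moving_average_price
FROM orders
ORDER BY customer_id, order_date
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
conn.close()
```

For the first rows of each customer the frame simply contains fewer than three rows, so the average is taken over whatever rows exist so far.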

    Handling variable correlations when developing a multivariate linear regression model is crucial because it can significantly impact the validity and interpretability of our model. First, we have to identify correlated variables. We calculate the correlation matrix to see the pairwise correlation between all our variables and look for high correlations, typically above 0.8 or 0.9, which might indicate redundancy. Scatter plots are also important: we plot pairs of variables to graphically assess the nature of the correlation, whether it is linear or not. Then we have to understand the impact of correlation, particularly multicollinearity, where high correlation can lead to unstable estimates, inflated standard errors, and difficulty interpreting individual variable coefficients. Additionally, suppression effects can occur, where one variable masks the true relationship of another with the dependent variable. We then have to apply some strategies. If two variables are highly correlated and offer similar information, we consider removing one based on domain knowledge or feature importance analysis. We can also combine variables by creating new variables out of highly correlated ones if it makes logical sense in the context. We can use techniques like principal component analysis to reduce the number of variables and obtain uncorrelated components. We can also use regularization techniques, such as L1 and L2 in regression models, to penalize large coefficients and reduce the effect of multicollinearity. Another approach is to interpret results with caution, especially in highly correlated settings, and monitor variance inflation factors (VIF) to quantitatively assess the severity. We also have to evaluate the impact of the different strategies on our model performance. Furthermore, we can use visualization techniques like partial dependence plots to understand the combined effect of multiple variables on the dependent variable. Handling variable correlations requires understanding the potential issues and applying appropriate strategies based on our specific task, data, and research.
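
The correlation check and the variance inflation factor mentioned in this answer can be sketched for the simplest case of two predictors, where the VIF reduces to 1 / (1 - r^2) with r the Pearson correlation between them. The function names and sample data are hypothetical; with more than two predictors, each VIF would instead come from regressing one predictor on all the others.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r * r)


# Hypothetical predictor columns.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 4, 5, 4, 5]

r = pearson(x1, x2)
vif = vif_two_predictors(x1, x2)
highly_correlated = abs(r) > 0.8  # threshold taken from the answer above
print(f"r = {r:.3f}, VIF = {vif:.2f}, flag = {highly_correlated}")
```

Here the correlation is moderate, so the pair falls just below the 0.8 threshold and would not be flagged for removal.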

    The LEFT JOIN clause ensures that all customers are included in the result, even if they have no sales in 2023. The condition that a.signup_date is less than or equal to the sale date filters for customers who signed up on or before the sales date. The output of the Python code shows that the customer with ID 1 has sales of 100 on 2023-01-01, and their sign-up date is 2021 1, which is equal to the sales date.
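
Since the query under discussion is not part of this transcript, the following is only a guess at its general shape: a LEFT JOIN that keeps every customer and matches 2023 sales made on or after the sign-up date. The table names, column names, and rows are all hypothetical, and the example runs on Python's built-in `sqlite3`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, signup_date TEXT)")
conn.execute("CREATE TABLE sales (customer_id INTEGER, sale_date TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "2021-01-01"), (2, "2022-06-01"), (3, "2023-05-01")],
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, "2023-01-01", 100), (3, "2023-02-01", 50)],
)

# All conditions sit in the ON clause, so unmatched customers are kept
# with NULL sale columns instead of being dropped from the result.
query = """
SELECT a.id, s.sale_date, s.amount
FROM customers AS a
LEFT JOIN sales AS s
       ON s.customer_id = a.id
      AND s.sale_date BETWEEN '2023-01-01' AND '2023-12-31'
      AND a.signup_date <= s.sale_date
ORDER BY a.id
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
conn.close()
```

Customer 2 has no sales and customer 3's only sale predates the sign-up date, so both still appear, with empty sale columns. Moving the date conditions into a WHERE clause would remove those rows instead.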

    When we deploy a predictive model in a production environment, we have to choose the right state management approach, and that depends on several factors. One is model complexity: simple models have few parameters and might not need complex state management. Another factor is the data and the update frequency: frequently updated models necessitate reactive state management. Another is scalability and high availability: consider distributed solutions for large-scale deployments. The last is integration with existing infrastructure: we should leverage existing tools and frameworks whenever possible. There are some common state management approaches, each with its own applications. Key-value stores, such as Redis and Memcached, are fast, highly available, and simple for static model parameters. They're suitable for storing model coefficients, intermediate calculations, and user-specific state. Distributed file systems, such as PFS and TaskFS, are scalable, durable, and handle large data sets. They're suitable for model checkpoints but may not be optimized for frequent updates, which can introduce latency. They're also suitable for storing large models, training data, and historical predictions. Database management systems, such as SQL databases, offer structured storage with querying capabilities. However, they're less performant for frequent updates than key-value stores. We use them for storing model metadata, user-specific context, and model training logs. We also use model-serving frameworks, such as TensorFlow Serving. These frameworks are designed for model serving, handle different model formats, and are optimized for performance. However, they can be complex to manage and require specific expertise. We use them for deploying the model itself, managing model versions, handling inference requests, and tracking model performance. We also use state management libraries, such as DBC. These libraries are used for versioning experiments, tracking model performance, debugging, and deployment pipelines. With them we track model performance and deployment history and manage different model versions. A combination of these approaches can be used. Key-value stores are suitable for fast-access state, distributed file systems are suitable for storing models and large datasets, databases are suitable for structured data and model logs, and model-serving frameworks are suitable for deploying and serving the model. State management libraries are used for tracking and deployment history. Apart from that, we need to implement additional security measures, such as robust authorization mechanisms for model state. We also need to monitor data updates, model performance, and resource usage. Disaster recovery is also essential.
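
The key-value and versioning ideas in this answer can be illustrated with a tiny in-memory store. This is a toy sketch, not a production design: `ModelStateStore` is a hypothetical class that stands in for a real key-value store such as Redis, and the model name and coefficients are made up.

```python
class ModelStateStore:
    """Toy in-memory key-value store that keeps every saved version."""

    def __init__(self):
        # Maps model name -> list of parameter snapshots (version 1 first).
        self._versions = {}

    def save(self, model_name, params):
        """Store a copy of the parameters and return the new version number."""
        snapshots = self._versions.setdefault(model_name, [])
        snapshots.append(dict(params))
        return len(snapshots)

    def load(self, model_name, version=None):
        """Return a copy of the requested version, or the latest if None."""
        snapshots = self._versions[model_name]
        if version is None:
            version = len(snapshots)
        if not 1 <= version <= len(snapshots):
            raise KeyError(f"{model_name} has no version {version}")
        return dict(snapshots[version - 1])


# Hypothetical usage: two successive sets of model coefficients.
store = ModelStateStore()
v1 = store.save("churn_model", {"intercept": 0.1, "tenure": -0.3})
v2 = store.save("churn_model", {"intercept": 0.2, "tenure": -0.25})

latest = store.load("churn_model")
rollback = store.load("churn_model", version=1)
```

Keeping old snapshots is what makes a rollback possible when a newly deployed version performs worse than the previous one.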

    There are several approaches we can take to dynamically adapt a data visualization based on user data selection in Python, each with its own strengths and weaknesses. One way is to use interactive frameworks and libraries like Dash, Plotly, and Bokeh. There's also server-side rendering and streaming with Django, FastAPI, and Flask, which suits more complex cases but requires an understanding of server-side programming. Another approach is to use pre-built dashboard tools, such as Tableau, Power BI, and Google Data Studio. They are easy to use and have a wide range of features. In Python itself, we get libraries like Matplotlib, Seaborn, and Plotly, and we can use those as well. If the data updates frequently, consider server-side rendering or event-driven approaches in web frameworks. We also have to consider the complexity of the visualization. For simple charts, we can use Matplotlib. More complex views call for interactive libraries such as Plotly. There are also some low-level JavaScript libraries, such as Vega and D3.js, but they can't be used directly from Python. Plotly in Python is also powerful for interactive visualization, with extensive chart types, documentation, and a community, and it supports integration with data sources. It allows custom layouts and interaction. Vega-Lite can also be used for building visualizations with concise code. We can use pandas for data manipulation and cleaning, and NumPy for numerical computation, alongside basic static visualization. Our users can choose an interactive

    To enhance our machine learning model, we can leverage Azure's ML services in several ways. The first is data management and storage. We use Databricks to collaborate on data exploration, analysis, and model training in a unified platform. Then we perform training and experimentation. For training, we use the Azure Machine Learning service. It trains and manages machine learning workflows in the cloud to find the best-performing model on our dataset. For experimentation, with Azure Machine Learning experimentation, we track, compare, and reproduce machine learning experiments to enable efficient model development. For deployment and serving, we use Azure Functions, deploying lightweight models that respond to events or triggers in real time. Sometimes we use the Kubernetes service to deploy our trained model as containerized microservices for scalable and secure inference. We mostly use Azure Functions. Then we evaluate performance in Azure Machine Learning studio, where we view model performance and health metrics in our inference environment. We also examine how our model makes predictions, using explanations and insights. Those are basically the services we use, but the choice depends on the machine learning model and our specific requirements.

    We can integrate Power BI visualization into a Python-based data analytics workflow through several methods. The first is the Power BI REST API. Power BI provides a REST API that allows us to embed Power BI reports and dashboards into custom applications, including Python ones, so we can use the API to access Power BI resources, including datasets and dashboards, programmatically. We can also use Power BI Embedded. It allows us to embed Power BI reports and dashboards directly into our applications, and we can use the Power BI Embedded Python SDK to integrate these visualizations into our Python-based applications. Then there is Power BI Desktop. While Power BI Desktop itself is not directly accessible from Python, you can export data from Power BI Desktop to various file formats, for example CSV or Excel, and then import and analyze that data using Python libraries such as NumPy and pandas. There are also powerful data visualization libraries in Python, like Plotly. We can extract data from Power BI, either through exporting or using the APIs, and then visualize it using these Python libraries within our Python-based data analytics workflow. We can use Azure Databricks as well. Power BI dataflows allow you to prepare and transform data, and then you can use Azure Databricks, which supports Python, for advanced analytics and machine learning tasks. We can then visualize the results back in Power BI. We can also use Jupyter Notebooks for embedded analytics: we can embed Power BI directly into Jupyter Notebooks and use Python code. The right method depends on our specific requirements, considering factors such as real-time updates, licensing, and integration complexity. Based on those parameters, we can choose the method that we want.