11.5+ years of experience in Data Science, Product Pricing and Engineering, currently working as an independent consultant. Worked in a technical lead cum project management role, managing 8-10 resources in my previous roles.
Technical Expertise:
ML ALGORITHMS: NLP, Text Classification, Linear & Logistic Regression, KNN, Decision Trees, Random Forests, Clustering (K-means), Naive Bayes, Principal Component Analysis, Market Basket Analysis, LSTM/FBProphet, ARIMA, GBM, Recommendation Systems, etc.
SOFTWARE SKILLS: Python, SQL, Databricks, AWS, GCP, Advanced SAS (Base SAS Certified), VBA, R, Tableau, Power BI, EMBLEM, RADAR & MS Office, Azure Cloud.
DATA OPERATIONS: Data Architecture, Azure Data Factory, AWS Model Deployment, Databricks, MLflow and MLOps
Independent Consultant, Contractual Engagements
Data Architect - ML/Technical Lead, Koantek AI Cloud Pvt. Ltd
Manager - Data Science and Analytics, PWC India
Data Analytics Assistant, RSA Insurance
Senior Consultant - Data Analytics, Allianz Insurance
Actuarial Consultant, AXA Insurance
Python

NLP

SQL

SAS

RStudio

Microsoft Power BI

Tableau CRM

Looker

Oracle SQL Developer
Actuarial Consultant/Manager, AInsurCo Insurance, UK
-Working on Actuarial Pricing models and Data Science Projects in Insurance
-Power BI Dashboards for Claims analytics and Reserving
-LLMs: Used LangChain and OpenAI to read through PDFs of e-books on finance and insurance and to answer queries relevant to the analysis.
Lead Data Scientist, Tecxar Pvt. Ltd., Banking, India
-Working on ML models and dashboards to increase loan recoveries.
-Brainstorm and develop ML use cases for improving recoveries for loan accounts.
-LLMs: Used generative AI to produce information based on text search over data, using ChatGPT with packages such as OpenAI and Hugging Face.
Project- Reverse Logistics, Market Mix Models, Anomaly Detection, Stock Price Prediction Model
Working in a tech lead role managing a team of data scientists and engineers, responsible for designing the data architecture, pricing, data management and model development on cloud using AWS/Azure/Databricks. Working on AI/ML models for US-based clients, mainly from the BFSI and Retail domains. Developing models in Python/SQL on the Databricks cloud. Developing business presentations and BI visualizations (Power BI/Tableau) to showcase results and business outcomes.
Responsibilities-
Retail: Reverse Logistics for clothing (UK) - Designed the architecture for model deployment in AWS and predicted clothing returns by brand, region, colour, material, cost, etc.
Pharma/Life Sciences: Marketing Analytics (Korea) - Built ARIMA models to predict future sales for different products. Automated and processed nearly 130 ARIMA models using common Python code and calculated the monthly carry-over effect.
Finance: Stock Price Prediction Model (US) - Designed the architecture, set up the ETL and predicted stock prices for the next 60 days, testing the predictions with different algorithms: Long Short-Term Memory (LSTM), ARIMA/Auto-ARIMA and FBProphet.
Successfully managed 3 data engineering projects: an E2 Migration PoC (migrating the client's data from one workspace to the Databricks workspace), building an ELT process, and developing a data pipeline on the Databricks workspace.
Project- AI Strategy Plan (Dubai), NLP Models, Fraud Taxation analytics, Computer Vision, Clustering
Working in a manager cum technical consultant role, managing a team of 5-6 resources on client sites. Planned and developed ML models and dashboards for taxation and PM rural development schemes. Carried out fraud analytics using clustering to identify taxation loopholes for retail merchants. Developed an AI Strategy Plan for a client in Dubai over 6 months as a team of 4. This project involved benchmarking and assessing the organization, identifying the bottlenecks and recommending the AI plan for the next 5 years, along with developing 5 PoCs to showcase the capabilities of AI. Developed a couple of those PoCs: an NLP model for HR analytics and a computer vision model to analyse and understand traffic movement at expected bottlenecks in Dubai.
You can find feedback on my LinkedIn profile under work samples as a PDF for contractual engagements and testimonials from the global leaders for whom I have worked in the industry.
https://www.linkedin.com/in/mohit-bansal-7227bb34/
I have 11 years of experience in data science and analytics. I worked for different insurance companies for 7 years. Then I worked with PwC Consulting as a data science manager. Since last year, I've been working as an independent consultant. Before that, I also worked in data architect and data scientist roles, leading teams on the consulting side. In terms of projects, I've done machine learning and data science projects and dashboards, developed them hands-on, and managed teams of 8 to 10 people. Apart from that, I've spent 2 years working onshore internationally with clients directly. In terms of education, I've done a master's in Applied Operational Research from Delhi University. I also have a bachelor's in Mathematics (Honours) from Ramjas College at Delhi University. I'm based out of Gurgaon in India right now. Currently, I'm working as a freelancer on contractual roles. One of my contracts is ending, so I'm looking for a full-time role or a full-time contract. That's the summary of what I have right now.
How do you utilize Python to automate the retrieval and processing of data from multiple SQL databases, ensuring data consistency across sources? First, I link Python to the databases through SQL queries, using SQL or PySpark as a function. That gives me a pipeline where I can write SQL queries directly. The second point is that we have different SQL data sources, so how do I combine them? I first create a connection to each data source in the code itself. Suppose there are five datasets from five different sources: I can create a view for each, or create a pipeline dataset directly. A view lets us read the data before we import it. Once those five connections are in place, we collate the datasets based on their headers and the data we have. Before importing, we can summarize the data and run a basic analysis to look for anomalies and check consistency: missing percentages, outliers and so on. Then, when bringing the data in, we can use a data lake structure. The first layer is bronze, which holds all the raw datasets. The second layer is silver, where we summarize the data from the bronze layer: if there are five raw tables, we combine them into one and keep only the legitimate data, based on the missing-value and outlier analysis. The third is the gold stage, the pure analytics layer, which holds summaries of the data as required by the business. Later on, we can also add batch processing if required to improve efficiency.
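The layered flow described in this answer can be sketched roughly as follows. This is a minimal illustration, not the pipeline actually used: the `orders` table, its columns and the two in-memory SQLite databases standing in for separate SQL sources are all hypothetical.

```python
import sqlite3

def make_source(rows):
    """Create an in-memory database standing in for one SQL source (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    conn.commit()
    return conn

def load_bronze(sources):
    """Bronze layer: pull raw rows from every source, tagging each row with its origin."""
    bronze = []
    for name, conn in sources.items():
        for order_id, amount in conn.execute("SELECT order_id, amount FROM orders"):
            bronze.append({"source": name, "order_id": order_id, "amount": amount})
    return bronze

def missing_pct(rows, column):
    """Consistency check: percentage of rows where `column` is missing (None)."""
    if not rows:
        return 0.0
    return 100.0 * sum(r[column] is None for r in rows) / len(rows)

def build_silver(bronze):
    """Silver layer: drop rows with a missing amount and keep one row per order_id."""
    seen, silver = set(), []
    for row in bronze:
        if row["amount"] is None or row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        silver.append(row)
    return silver

sources = {
    "db_a": make_source([(1, 10.0), (2, None)]),
    "db_b": make_source([(2, 20.0), (3, 30.0), (3, 30.0)]),
}
bronze = load_bronze(sources)
silver = build_silver(bronze)
gold_total = sum(r["amount"] for r in silver)  # Gold layer: a business-level summary
```

In a real setup each source would have its own connection string, and the layers would typically be tables in a lake rather than Python lists; the point here is only the order of the steps.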
For data-heavy Python applications, what are the best practices for handling transactions and maintaining ACID properties in a SQL database? When an application is too data-heavy, Python cannot handle it on a local system, so the first thing is to move to the cloud. AWS, Snowflake or any other cloud platform can be used. The storage has to be on the cloud so that we get cloud storage and a good cloud backup. Secondly, internet speed is definitely a factor. Beyond that, the database has to be hosted in a way that the processors are up to date and processing speed is high when dealing with heavy data. We can always use cloud instances and clusters to get that speed. As I mentioned in the earlier answer, batch processing is something we do to handle those transactions and the data coming in. It can be a SQL database or any other database for that matter, but for a SQL database we can definitely do similar things. I have done 10 to 15 TB of SQL Server processing as well, John, if that helps.
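As a small illustration of combining batch processing with transactions, the sketch below inserts rows in batches and treats each batch as one atomic unit. It uses the standard-library `sqlite3` module purely as a stand-in for a production SQL database; the `messages` table and the batch size are assumptions made for the example.

```python
import sqlite3

def insert_in_batches(conn, rows, batch_size=2):
    """Insert rows in batches; each batch commits or rolls back as a single transaction."""
    committed = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        try:
            with conn:  # commits on success, rolls back if an exception is raised
                conn.executemany(
                    "INSERT INTO messages (id, body) VALUES (?, ?)", batch
                )
            committed += len(batch)
        except sqlite3.IntegrityError:
            # The failed batch is rolled back as a whole; earlier batches stay committed.
            continue
    return committed

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, body TEXT NOT NULL)")
rows = [(1, "a"), (2, "b"), (3, "c"), (3, "duplicate id"), (4, "d")]
committed = insert_in_batches(conn, rows)
stored = conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
```

Because the second batch contains a duplicate primary key, neither of its rows is stored, which is the atomicity part of ACID in miniature.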
So, the Python code is intended to filter out messages from a dataset that are marked as spam, then calculate the percentage of spam messages. However, it throws a zero division error instead of executing. I'm going to debug this and explain the cause of the error. The zero division error comes from dividing by the length of the data when that length is zero, or when the data is missing altogether. The code calculates a percentage by categorizing messages as spam based on the label in the data. If a message doesn't have the label 'spam', it isn't counted as a spam message. The percentage of spam is then calculated by dividing the number of spam messages by the length of the data. The issue arises when the data is empty: its length is zero, so the division raises a zero division error. To handle this, we can add an if condition that checks that the length of the data is not equal to zero before doing the calculation. This way, the code either produces a percentage or skips the calculation. We can also add an error message using an if-else condition: in the else branch, the code can report that the user has supplied empty or missing data.
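The original code under discussion is not shown in the transcript, so the following is only a sketch of the guard described in the answer. The message structure (a list of dicts with a `label` key) and the function name are assumptions.

```python
def spam_percentage(data):
    """Return the percentage of messages labelled 'spam', or None if there is no data."""
    if not data:  # covers both an empty list and None, avoiding division by zero
        print("No data supplied: cannot compute a spam percentage.")
        return None
    spam = [m for m in data if m.get("label") == "spam"]
    return 100.0 * len(spam) / len(data)

sample = [{"label": "spam"}, {"label": "ham"}, {"label": "ham"}, {"label": "spam"}]
```

Returning `None` rather than `0.0` for empty input keeps "no data" distinguishable from "no spam found".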
A company is launching a new feature for its messaging service. How would you use Python's equality to derive a model that predicts user adoption rate after launch, and what key performance indicators (KPIs) would you track? Okay. So, the first thing is the data. Whenever we look at any problem, we need to understand the data. I think this is more of a logistic model, where we estimate what percentage of users adopt the new feature of the messaging service. First, what data do we have? I need to go back and see how much there is: one month of data, or one year? Because it's a new feature, we might not have any data. In that case, we can look at competitors' data or market data through a market survey; that's a separate research exercise. So the first point is collecting and gathering the data. The second point is checking whether the data we have is sufficient to cover the important columns and characteristics. For a messaging service: when was the message sent? Was it delivered? Was it read? If yes, was any action required? These are the characteristics of the messages: what time it was sent and what time it was read. That is important for understanding how much impact the feature has on the system. Then there is the user base, the customers you're sending messages to, and how they react. That covers user characteristics such as age, occupation and financial information, if we have it. If we don't, that's fine. These are the KPIs from users which we can collect. Then there are the company's characteristics: are they charging anything for this messaging service?
What is the feature they are looking at? These characteristics can always be used. Next, we do the analysis and model development, which generally means statistical analysis, summarizing the data, looking at correlations and building the model. We can fit a logistic model, or use decision trees or gradient boosting methods, depending on what we are trying to do. But in the end it will be a logistic, probabilistic model that gives you the adoption rate. For example, say we send out 100 messages, 50 of them are read and 30 of those users actually adopt the feature. In that case the adoption rate is 30%. Coming to the KPIs: if this is an AI/ML messaging service, I assume it will be a largely automated service as well. So I would track how many messages were read within 30 minutes of installation, and how many were read within 24 or 48 hours, to see how much time it takes users to adopt the service.
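The arithmetic in this answer (100 messages sent, 50 read, 30 adopted) and the time-to-read KPIs can be worked through in a few lines. This sketch computes the KPIs only, not the logistic model itself; the record fields `sent_at`, `read_at` and `adopted` are hypothetical names.

```python
from datetime import datetime, timedelta

def adoption_kpis(messages):
    """Compute read rate, adoption rate and time-to-read counts for a list of messages."""
    sent = len(messages)
    if sent == 0:
        return None  # no messages, so no rates can be computed
    read = [m for m in messages if m["read_at"] is not None]
    adopted = sum(1 for m in messages if m["adopted"])

    def read_within(limit):
        # Count read messages whose delay between sending and reading is within `limit`.
        return sum(1 for m in read if m["read_at"] - m["sent_at"] <= limit)

    return {
        "read_rate": 100.0 * len(read) / sent,
        "adoption_rate": 100.0 * adopted / sent,
        "read_within_30_min": read_within(timedelta(minutes=30)),
        "read_within_24_h": read_within(timedelta(hours=24)),
    }

t0 = datetime(2024, 1, 1, 9, 0)
messages = (
    [{"sent_at": t0, "read_at": t0 + timedelta(minutes=10), "adopted": True}] * 30
    + [{"sent_at": t0, "read_at": t0 + timedelta(hours=5), "adopted": False}] * 20
    + [{"sent_at": t0, "read_at": None, "adopted": False}] * 50
)
kpis = adoption_kpis(messages)
```

With this sample data the adoption rate comes out at 30% of messages sent, matching the example in the answer.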
I think you took a lot of time there. What is your approach to minimize the number of SQL queries in a Python application for database interaction? Basically, if we have to minimize the SQL queries, I try to interact with the tables directly. I generally use subqueries, so that I can handle multiple tables within a single query. We merge directly in the database on unique keys and primary keys. And then, in any Python application, you can call those SQL queries directly; that's not a problem. Then there are no-code platforms which you can use with Python. Those can always be leveraged.
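One way to picture "handling multiple tables within a single query" is to compare a loop that issues one query per customer with a single join on the key. The sketch below uses `sqlite3` and made-up `customers` and `messages` tables; it illustrates the idea and is not code from any project mentioned here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE messages (message_id INTEGER PRIMARY KEY, customer_id INTEGER, status TEXT);
    INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben'), (3, 'Chen');
    INSERT INTO messages VALUES (10, 1, 'sent'), (11, 1, 'failed'), (12, 2, 'sent');
""")

def counts_one_query_per_customer(conn):
    """N+1 pattern: one query for the customer list, then one more per customer."""
    queries = 1
    result = {}
    customers = conn.execute("SELECT customer_id, name FROM customers").fetchall()
    for customer_id, name in customers:
        (n,) = conn.execute(
            "SELECT COUNT(*) FROM messages WHERE customer_id = ?", (customer_id,)
        ).fetchone()
        queries += 1
        result[name] = n
    return result, queries

def counts_single_query(conn):
    """Same result from one round trip, joining on the primary key."""
    rows = conn.execute("""
        SELECT c.name, COUNT(m.message_id)
        FROM customers AS c
        LEFT JOIN messages AS m ON m.customer_id = c.customer_id
        GROUP BY c.customer_id, c.name
    """).fetchall()
    return dict(rows), 1

looped, looped_queries = counts_one_query_per_customer(conn)
joined, joined_queries = counts_single_query(conn)
```

Both functions return the same counts, but the looped version issues one extra query for every customer.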
When optimizing a SQL query used in a Python data science application, which factors would you consider to improve the execution time? It depends on how we are running the queries and how we are trying to execute them. For example, say we are writing subqueries, and within those subqueries we are merging dataset A and dataset B. If B is a big dataset, the merge will take time. What we can do is use WHERE conditions, which filter A and B down to whatever is needed, and that improves performance. The same applies to other datasets. If there are 5 datasets to combine in one scenario, we can create a view with dataset A as the base and the others as left-joined datasets, matched only under a WHERE condition, so that rows merge only if a certain condition is present. That filters the datasets before merging. Otherwise, once they merge, we end up with many multiplied rows, and we don't want to end up in that scenario. That's one thing. Another thing, when optimizing SQL queries, is how we create summaries. For bigger datasets, we can first build optimized summaries inside the databases themselves and then merge those. That's another part we can handle. Unique summaries can be handled the same way. Duplicates should also be handled; these are anomalies which have to be dealt with. Then we can use batch processing to reduce the operating time.
If we are operating on large datasets, we can use multiple clusters and batch processing as well, and sort out other operational issues. The third part is definitely the coding part. When writing the code, take only the specific columns which are needed for the processing. Do not pick all 200 columns if they are not required. Just pick your 20 columns while processing and merging the data.
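The three ideas in this answer (filter before merging, summarize the large table first, select only the needed columns) can be combined in one query. The tables `a` and `b`, their columns and the `region` filter below are hypothetical, and `sqlite3` is used only so that the sketch runs anywhere.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, region TEXT, extra1 TEXT, extra2 TEXT);
    CREATE TABLE b (id INTEGER, amount REAL, note TEXT);
    INSERT INTO a VALUES (1, 'UK', 'x', 'y'), (2, 'US', 'x', 'y'), (3, 'UK', 'x', 'y');
    INSERT INTO b VALUES (1, 5.0, 'n'), (1, 7.0, 'n'), (2, 9.0, 'n'), (3, 1.0, 'n');
""")

query = """
    SELECT fa.id, fa.region, s.total_amount      -- only the columns that are needed
    FROM (
        SELECT id, region FROM a WHERE region = ?             -- filter A before merging
    ) AS fa
    JOIN (
        SELECT id, SUM(amount) AS total_amount FROM b GROUP BY id   -- summarize B first
    ) AS s ON s.id = fa.id
    ORDER BY fa.id
"""
rows = conn.execute(query, ("UK",)).fetchall()
```

Summarizing `b` to one row per `id` before the join is what prevents the row multiplication mentioned in the answer.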
Okay. You're trying to query a database using SQL within Python. The query should return the count of messages that failed to send for each customer, but it's returning an incorrect count. What could be the reason that the count is incorrect? Please list potential bugs and assumptions in the SQL or database setup that you would check to troubleshoot this. So there is a connection to the database, then the query: select customer ID, count of message ID as failed messages, from the table where status is 'failed', grouped by customer ID. So this is failed messages by customer ID. One customer can have multiple messages, and it should return the count of failed messages for each customer. Then c.execute(query), and the result is c.fetchall(). Looking at the group by customer ID, one thing could be that we don't have a distinct, so it should be a distinct count of message ID. There could be repeated message IDs where we haven't removed duplicates for failed messages. Secondly, I would debug individual customer IDs. I would pick 2 customer IDs and check the actual count against what appears in the output. I think that's how I would debug it, and it's easier to do it that way; we can easily find out what is causing the problem. Generally, it's duplication, or some values which are being missed. Sometimes it can be case sensitivity. We have written where status is 'failed', but it could be that in the data it's written with a capital F, which causes this issue. So these are all issues which we can debug easily.
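The two suspects named in this answer, duplicate message IDs and inconsistent capitalisation of the status value, can be reproduced with a small example. The table name `messages` and the sample rows are assumptions, since the transcript does not give the original table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE messages (message_id INTEGER, customer_id INTEGER, status TEXT);
    INSERT INTO messages VALUES
        (1, 100, 'failed'),
        (1, 100, 'failed'),   -- duplicate row for the same message
        (1, 100, 'failed'),   -- another duplicate
        (2, 100, 'Failed'),   -- capital F
        (3, 100, 'sent'),
        (4, 200, 'FAILED');   -- all caps
""")

naive_query = """
    SELECT customer_id, COUNT(message_id) AS failed_messages
    FROM messages
    WHERE status = 'failed'
    GROUP BY customer_id
    ORDER BY customer_id
"""

robust_query = """
    SELECT customer_id, COUNT(DISTINCT message_id) AS failed_messages
    FROM messages
    WHERE LOWER(status) = 'failed'
    GROUP BY customer_id
    ORDER BY customer_id
"""

naive = conn.execute(naive_query).fetchall()
robust = conn.execute(robust_query).fetchall()
```

Here the naive query over-counts customer 100 because of the duplicates and misses customer 200 entirely because of the capitalisation, which is the kind of mismatch that checking two customer IDs by hand would reveal.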
While improving the query performance of an existing system that manipulates data from large datasets, how would you optimize data handling methods without disrupting ongoing performance? I've done a similar thing for Postgres, where we were trying to improve the performance of the data processing. One option, as I mentioned, is a delta lake, if you're working on the cloud and dealing with heavy data. Creating a delta lake means you can bring in large-scale data from multiple sources, keep whatever you need in raw format, create data summaries out of it, and then filter and operate on the data itself. So that's one option: creating a delta lake with different procedures run through batch processing. Secondly, optimization often happens at the base level. If you're looking at one particular SQL query or database, optimize on the columns. Don't take everything into account; take only the columns which are required, and that also reduces your dataset. The next part is the application. At times, the server is not able to take the load from the data, and fetching those datasets has to be quick enough. For that, we might need to increase the RAM we are operating on, which could mean the clusters, since the clusters provide the RAM and efficiency for the processing. That is another factor, because a slow server can also hurt the website experience for the customer. So these are the things we need to check. As I said, the SQL queries should be efficient. The batch processing also has to be efficient enough for the servers and the databases being queried to handle. If it isn't, we can either increase the CPU available to the clusters or use optimized summaries. We can always summarize the data.
Imagine you have a pandas DataFrame, customer_messages. You need to group messages by customer. One of the column names has a typo, causing the group by operation to fail. Without seeing the DataFrame, what would be your first step in debugging this issue, considering common causes for such an error? Okay. So, basically, we are just doing a group by here for the count. The code runs but raises an error at the group by statement. There are 2 possibilities: either the syntax is wrong, or the column which has been picked is wrong. If I assume the syntax is correct, then it's probably a different column name which we need to pick. customer_messages, I think, is the DataFrame, so we can use customer_messages.columns or customer_messages.info(). That produces a list of all the column names in the dataset. From that list, we can pick the right column to use for the count, so the typo won't be there. And always try to copy and paste the column names rather than typing them in, because this is a very common mistake which happens during, you know, your
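The "list the columns first" step can be turned into a small check. The helper below is hypothetical and uses only the standard library, so it takes a plain list of names; with pandas one would pass `list(customer_messages.columns)`. The sample column names are assumptions.

```python
import difflib

def check_group_column(columns, requested):
    """Return `requested` if it exists; otherwise raise KeyError naming the closest column."""
    columns = list(columns)
    if requested in columns:
        return requested
    close = difflib.get_close_matches(requested, columns, n=1)
    hint = f" Did you mean {close[0]!r}?" if close else ""
    raise KeyError(f"Column {requested!r} not found in {columns}.{hint}")

# Hypothetical column names; with pandas: columns = list(customer_messages.columns)
columns = ["customer_id", "message_id", "status"]

try:
    check_group_column(columns, "custmer_id")  # typo in the column name
    error_text = ""
except KeyError as exc:
    error_text = str(exc)
```

Running the check before the group by turns an opaque failure into a message that shows the available names and the likely intended one.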
What strategies have you employed in the past to build reporting and BI features that cater to both? Okay. Great. For reporting, we mainly develop the models, but we use dashboards and reports as the front end for a non-technical audience. In our profile, we are generally presenting our insights to the company's higher management, so we have to be non-technical while showcasing the results and explaining how they impact the business and operations. The first option is visualizations and dashboards, using Power BI, Tableau, Excel, VBA, etcetera, or Google Analytics for that matter. The second option is reports. There are a lot of monthly reports which are used for monitoring and tracking. These reports basically show you the basic numbers and trends. They can be tabular, just numbers showing the percentage increases and decreases, and the same can be shown through dashboards. If we are not using these, we can also use PPTs. At times, I create presentations for the clients, and that's how I present my insights. A dashboard is fine for operational purposes, if they want to play around with the data and see how one variable affects another. But for customer insights, I generally prefer PPT, because presentations help you capture everything which is going on in the background and communicate to the management how this is impacting the business and how we can improve it.
What feature would you look for in a dashboard creation tool like Looker or Tableau that would enhance the accessibility of data insights for nontechnical users? I think visualization is something which we definitely showcase. Generally, I try to build a story around the dashboard itself. First, we create a summary, then the first level of detail, and then the second level of detail. Say it's a growing retail company. First, I want to show the profile: what are the overall numbers we are looking at? That gives a summary of things like customer retention and revenue, and how the overall picture looks for the management. Then there's the first level of detail: how many products were there? How many customers? What types of products are we looking at? And what are the details for each product: the top selling products, the top selling locations, and customers in specific age groups or categories which we might want to look at. Those are the things which come in the first level of detail. Then there is a second level of detail, where I want to drill down to specific customers and see the list of those customers. In retail, for example, how many of them purchased coffee, or how many purchased clothing online, and in which location. For that I can provide a drill-down facility to show, say, the 1,000 customers who ordered a given item today. I say this because retail data is large; otherwise the view will run into lakhs or millions of rows.
Accessibility can also mean adding action buttons for queries. Depending on any specific requirements from the clients, we can provide direct query processing buttons, which run the query and return the drill-down data for them.