11.5+ years of experience as an independent consultant providing services in Data Science, product pricing and engineering. Worked in a combined technical lead and project management role, managing teams of 8-10 in previous positions.
Technical Expertise:
ML ALGORITHMS: NLP, Text Classification, Linear & Logistic Regression, KNN, Decision Trees, Random Forests, Clustering (K-means), Naive Bayes, Principal Component Analysis, Market Basket Analysis, LSTM/FBProphet, ARIMA, GBM, Recommendation Systems, etc.
SOFTWARE SKILLS: Python, SQL, Databricks, AWS, GCP, Azure Cloud, Advanced SAS (Base SAS Certified), VBA, R, Tableau, Power BI, EMBLEM, RADAR, MS Office.
DATA OPERATIONS: Data Architecture, Azure Data Factory, AWS model deployment, Databricks, MLflow and MLOps
Independent Consultant, Contractual Engagements
Data Architect - ML/Technical Lead, Koantek AI Cloud Pvt. Ltd
Manager - Data Science and Analytics, PWC India
Data Analytics Assistant, RSA Insurance
Senior Consultant - Data Analytics, Allianz Insurance
Actuarial Consultant, AXA Insurance
Python
NLP
SQL
SAS
RStudio
Microsoft Power BI
Tableau CRM
Looker
Oracle SQL Developer
Actuarial Consultant/Manager, AInsurCo Insurance, UK
-Working on Actuarial Pricing models and Data Science Projects in Insurance
-Power BI Dashboards for Claims analytics and Reserving
-LLMs: Used LangChain and OpenAI to read through PDFs of finance and insurance e-books and answer queries with responses relevant to the analysis (sketched below).
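As a rough illustration of this kind of PDF question-answering pipeline, here is a minimal sketch. It assumes a recent install of langchain, langchain-community, langchain-openai and faiss-cpu plus an OPENAI_API_KEY in the environment; the file name, model name, question and chunking parameters are placeholders, not the project's actual configuration, and the package layout differs across LangChain versions.

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains import RetrievalQA

# Load and chunk the e-book PDF (the path is a placeholder).
docs = PyPDFLoader("insurance_handbook.pdf").load()
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100
).split_documents(docs)

# Embed the chunks and index them for retrieval.
index = FAISS.from_documents(chunks, OpenAIEmbeddings())

# Answer questions grounded in the retrieved chunks.
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),
    retriever=index.as_retriever(),
)
print(qa.invoke({"query": "How is claims reserving explained in this book?"})["result"])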
Lead Data Scientist, Tecxar Pvt. Ltd., Banking, India
-Working on ML models and dashboards to increase recoveries on loan accounts.
-Brainstorm and develop ML use cases for improving recoveries for loan accounts.
-LLMs: Using generative AI to produce information based on text search over the data, using ChatGPT with packages like OpenAI and Hugging Face.
Project- Reverse Logistics, Market Mix Models, Anomaly Detection, Stock Price Prediction Model
Working in a tech lead role managing a team of data scientists and engineers, responsible for designing the data architecture, pricing, data management, and model development on the cloud using AWS/Azure/Databricks. Working on AI/ML models for US-based clients, mainly from the BFSI and retail domains. Developing models in Python/SQL on Databricks. Developing business presentations and BI visualizations (Power BI/Tableau) to showcase results and business outcomes.
Responsibilities-
Retail: Reverse Logistics for clothing (UK) - Designed the architecture for model deployment in AWS and predicted clothing returns by brand, region, colour, material, cost, etc.
Pharma/Life Sciences: Marketing Analytics (Korea) - Built ARIMA models to predict future sales for different products. Automated and processed nearly 130 ARIMA models using common Python code and calculated the monthly carry-over effect (see the sketch after this list).
Finance: Stock Price Prediction Model (US) - Designed the architecture, set up ETL, and predicted stock prices for the next 60 days, testing the forecasts across different algorithms: Long Short-Term Memory (LSTM), ARIMA/Auto-ARIMA and FBProphet.
Successfully managed 3 data engineering projects, including an E2 Migration PoC to migrate the client's data from an existing workspace to a Databricks workspace, and building an ELT process and data pipeline on the Databricks workspace.
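The following is a minimal sketch of the ARIMA automation mentioned above, i.e. one common piece of code looped over many product series. The DataFrame columns, ARIMA order and forecast horizon are assumptions for illustration, not the actual project settings.

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

def forecast_all_products(sales: pd.DataFrame, horizon: int = 12) -> pd.DataFrame:
    """Fit one ARIMA model per product and stack the forecasts."""
    results = []
    for product_id, grp in sales.groupby("product_id"):
        series = (grp.sort_values("month")
                     .set_index("month")["units"]
                     .asfreq("MS"))              # monthly frequency
        fitted = ARIMA(series, order=(1, 1, 1)).fit()   # order tuned per series in practice
        fc = fitted.forecast(steps=horizon)
        results.append(pd.DataFrame({
            "product_id": product_id,
            "month": fc.index,
            "forecast_units": fc.values,
        }))
    return pd.concat(results, ignore_index=True)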
Project- AI Strategy Plan (Dubai), NLP Models, Fraud Taxation analytics, Computer Vision, Clustering
Working in a combined manager and technical consultant role managing a team of 5-6 resources at client sites. Planned and developed ML models and dashboards for taxation and PM rural development schemes, including fraud analytics using clustering to identify loopholes for retail merchants in taxation. Developed an AI Strategy Plan for a client in Dubai over 6 months as part of a team of 4; the project involved benchmarking and assessing the organization, identifying bottlenecks, and recommending an AI plan for the next 5 years, along with 5 PoCs to showcase AI capabilities. Personally developed a couple of the PoCs: an NLP model for HR analytics and a computer vision model to analyse and understand traffic movement at expected bottlenecks in Dubai.
You can find feedback on my LinkedIn profile under work samples as a PDF for contractual engagements and testimonials from the global leaders for whom I have worked in the industry.
https://www.linkedin.com/in/mohit-bansal-7227bb34/
I have about 11 years of experience in data science and analytics. I worked for different insurance brands for 7 years, and then I worked with PwC Consulting as a data science manager. For the last year I have been working as an independent consultant, and previously I have also worked in data architect and data scientist roles, leading teams on the consulting side. In terms of projects, I have developed machine learning and data science projects and dashboards hands on, and managed teams of 8 to 10 people. Apart from that, I have spent 2 years onshore internationally, working with clients directly. In terms of education, I have done a master's in Applied Operational Research from Delhi University, and a bachelor's in Mathematics (Honours) from Ramjas College, Delhi University. I am based out of Gurgaon in India. Right now I am working on contractual roles as a freelancer; one of my contracts is getting over, so that is the reason I am looking for a full-time role or a full-time contract. That is the summary of where I am right now.
How do you utilize Python to automate the retrieval and processing of data from multiple SQL databases, ensuring data consistency across sources? The first thing is Python itself: I link Python to the databases through SQL queries, using Spark SQL or PySpark as the interface, which lets me create a pipeline where I can write SQL queries directly. The second thing is that we have several different data sources, so how do I combine them? First, I create a connection for each of the different sources in the code itself. Suppose I connect to 5 datasets; I can create views, or create pipeline datasets directly, and a view helps inspect the data before we import it. Once the connections for the 5 datasets from 5 different sources are in place, we can collate them based on the headers and the data we have. In the meanwhile, if I am using Databricks, I will do a basic summary analysis on the data before importing it, to look at anomalies and the consistency of the data: missing percentages, outliers, and so on. Then, when bringing the data in, we can use a Delta (medallion) structure: the first layer is bronze, holding all the raw datasets; the second layer is silver, where we summarize the data from the bronze layer, so if there are 5 raw tables we consolidate them into one and keep only the legitimate data based on the missing-value and outlier analysis; and the third is the gold stage, the pure analytics layer where we have the summaries the business requires. Those are some of the processes, and we can also do batch processing later on to improve efficiency.
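A minimal sketch of the kind of pipeline described in this answer, assuming a Databricks/PySpark environment with JDBC access to the source databases and existing bronze/silver schemas; the connection strings, table names, credentials and deduplication key are placeholders.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

sources = {
    "orders_eu": "jdbc:postgresql://eu-host:5432/sales",
    "orders_us": "jdbc:sqlserver://us-host:1433;databaseName=sales",
}

# Bronze layer: land each source as a raw Delta table.
for name, url in sources.items():
    df = (spark.read.format("jdbc")
              .option("url", url)
              .option("dbtable", "orders")
              .option("user", "<user>").option("password", "<password>")
              .load())
    df.write.format("delta").mode("overwrite").saveAsTable(f"bronze.{name}")

# Consistency checks before combining: row counts and null percentages per column.
for name in sources:
    df = spark.table(f"bronze.{name}")
    total = df.count()
    nulls = df.select([(F.sum(F.col(c).isNull().cast("int")) / total).alias(c)
                       for c in df.columns])
    print(name, total)
    nulls.show()

# Silver layer: align schemas, union the sources and deduplicate into one cleaned table.
silver = (spark.table("bronze.orders_eu")
              .unionByName(spark.table("bronze.orders_us"), allowMissingColumns=True)
              .dropDuplicates(["order_id"]))
silver.write.format("delta").mode("overwrite").saveAsTable("silver.orders")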
For a data-heavy Python application, what are the best practices for handling transactions and maintaining ACID properties in a SQL database? When it is very data heavy, Python cannot handle everything on the local system, so the first thing is that we need a cloud connection: AWS, Azure, Databricks, Snowflake, any cloud platform we can use. The storage has to be on the cloud so we can rely on cloud backup and cloud storage. Secondly, beyond network speed, the database needs to be stored in a way that the processors are up to date and the processing speed is high when we are dealing with heavy data, so we can leverage cloud instances and clusters for that speed. Then, as I mentioned in the earlier answer, batch processing is something we do to handle those transactions and the data coming in. It can be a SQL database or any other database for that matter, but if it is a SQL database we can do similar things; I have done 10 to 15 TB of SQL Server processing as well, if that helps.
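To make the transaction side of the question concrete, here is a minimal sketch of atomic, all-or-nothing writes from Python, using the standard-library sqlite3 driver and an in-memory database purely for illustration; the same pattern applies with psycopg2 or SQLAlchemy against a production SQL database, and the table names here are invented.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging_orders (order_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE load_audit (batch TEXT, loaded INTEGER)")
conn.execute("INSERT INTO load_audit VALUES ('daily', 0)")

rows = [(1, 120.0), (2, 75.5)]
try:
    with conn:  # one transaction: commits on success, rolls back if anything raises
        conn.executemany(
            "INSERT INTO staging_orders (order_id, amount) VALUES (?, ?)", rows
        )
        conn.execute(
            "UPDATE load_audit SET loaded = loaded + ? WHERE batch = 'daily'",
            (len(rows),),
        )
except sqlite3.Error as exc:
    print("Batch rolled back:", exc)

print(conn.execute("SELECT * FROM load_audit").fetchall())  # [('daily', 2)]
conn.close()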
Your Python code is intended to filter out messages from a dataset that are marked as spam, then calculate the percentage of spam messages. However, it throws a ZeroDivisionError when executed. Explain how you would debug this situation and what could be the cause of the error. The snippet reads the data with pandas, filters the rows where the label column equals 'spam', and computes the percentage as the number of spam messages divided by the length of the data, multiplied by 100. So wherever it finds a message labelled spam it categorizes it as spam, and the percentage is len(spam_messages) divided by len(data) times 100. Whenever the length of the data is zero, that division raises the ZeroDivisionError: if the data is missing or empty, len(data) is 0, and that is the cause of the error. If we add an if condition that runs the calculation only when len(data) is not zero, this is easily handled: it will either produce a percentage or skip the calculation. We can also catch the ZeroDivisionError and print a message saying that the dataset is empty.
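A short sketch of the guarded calculation being described; the file name and the 'label' column holding the value 'spam' are assumptions about the original snippet.

import pandas as pd

data = pd.read_csv("messages.csv")
spam_messages = data[data["label"] == "spam"]

if len(data) > 0:
    spam_pct = len(spam_messages) / len(data) * 100
    print(f"Spam percentage: {spam_pct:.1f}%")
else:
    # An empty (or missing) dataset is what triggers the ZeroDivisionError,
    # so report it instead of dividing. A try/except ZeroDivisionError block
    # is the alternative defensive option mentioned in the answer.
    print("Dataset is empty: no spam percentage can be calculated.")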
The company is launching a new feature for its messaging service. How would you use Python and SQL to derive a model that predicts the user adoption rate after launch, and what KPIs would you track? The first thing, whenever we look at any problem, is the data. Predicting a user adoption rate is usually a logistic model, where we estimate what percentage of users adopt the new messaging feature or not. So first, what data do we have: one month, one year? Because it is a new feature we might not have data, in which case we can look at competitor or market data through a market survey; that is the research part. So the first point is collecting and gathering the data. The second point is checking whether the data we have is sufficient to cover the important columns and characteristics of the dataset. For a messaging service: when was the message sent, was it delivered, was it read, and if so, was any action taken? Those message-level characteristics, including the times it was sent and read, help us understand how much impact the feature is having. Then there is the user base we are sending messages to and how those customers react, so user characteristics like age, occupation, and financial information if we have it (it is fine if we do not). And then the company's characteristics: are they charging anything for this messaging service, and what exactly is the feature? After that we do the usual analysis and model development: missing-value and outlier analysis, summarizing the data, looking into correlations, and then building the model. We would fit a logistic model, or it could be a decision tree or gradient boosting method depending on what we are trying to do, but in the end it is a probabilistic model that gives you an adoption rate: if we send out 100 messages, 50 are read and 30 of those users actually adopt, the adoption rate is 30%. Coming to the KPIs: for this particular feature, if it is an AI/ML messaging service I assume it is largely automated, so I would track how many messages were read within 30 minutes of being sent, and how many within 24 or 48 hours, to see how much time users needed to adopt the service.
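A rough sketch of the kind of logistic adoption model described, using scikit-learn; the CSV file, feature names and target column are placeholders rather than real project data.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# One row per user; placeholder feature and target names.
df = pd.read_csv("feature_usage.csv")
features = ["messages_sent", "messages_read", "minutes_to_first_read", "user_age"]
X = df[features].fillna(0)
y = df["adopted_new_feature"]          # 1 if the user adopted the new feature

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# KPI-style outputs: ranking quality and the predicted adoption rate.
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
print("Predicted adoption rate:", model.predict(X_test).mean())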
I think I took a lot of time there. What is your approach to minimizing the number of SQL queries in a Python application for database interaction? Basically, if we have to minimize the SQL queries, I try to interact with the tables directly. Generally I use subqueries, so I can handle multiple tables within a single query, and I join directly in the database using unique keys and primary keys. In any Python application you can call those SQL queries directly, so that is not a problem. There are also no-code platforms you can use with Python, and those can always be leveraged.
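As a small illustration of collapsing many per-customer round trips into one query, here is a sketch using the standard-library sqlite3 driver and an in-memory database; the table and column names are invented for the example.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ravi');
    INSERT INTO orders VALUES (10, 1, 250.0), (11, 1, 90.0);
""")

# Instead of issuing one query per customer in a Python loop, fetch everything
# needed in a single joined, aggregated round trip.
query = """
    SELECT c.customer_id,
           c.name,
           COUNT(o.order_id)          AS order_count,
           COALESCE(SUM(o.amount), 0) AS total_spent
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name
"""
for row in conn.execute(query):
    print(row)   # (1, 'Asha', 2, 340.0) then (2, 'Ravi', 0, 0)
conn.close()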
When optimizing a SQL query used in a Python data science application, which factors would you consider to improve the execution time? Frankly, it depends on how we are running the queries and what we are trying to execute. For example, if we are writing subqueries and merging dataset A with dataset B, and B is a big dataset, the merge will take time. What we can do is use WHERE conditions that filter A and B down to only what is needed before the merge, and optimize performance there; similarly for the other datasets. So if there are 5 datasets operated on in one scenario, we can create a view in such a way that dataset A is the base and the others are left joins matched on the WHERE condition, so rows merge only when the condition is present. That filters the datasets before the merge; once everything is merged you end up with many more rows, and we do not want to be in that scenario. Another thing when optimizing SQL queries is creating summaries: for bigger datasets, we can aggregate them in their source databases first, work with those optimized summaries, and then merge them. Duplicates and similar anomalies also have to be handled. Then we can use batch processing to improve the run time: if we are operating on certain datasets we can use multiple clusters and batch processing, which sorts out the operational side. The third part is the coding itself: select only the specific columns needed for the processing. Do not pick all 200 columns if you only need 20; pick your 20 columns while processing and merging the data.
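To make the filter-before-join and column-pruning points concrete, here is a sketch contrasting an unfiltered SELECT * join with a pre-filtered, column-pruned version; the tables, columns and filter values are illustrative only.

# Slow pattern: every column of every row from both tables is merged first.
slow_query = """
    SELECT *
    FROM transactions t
    JOIN customers c ON c.customer_id = t.customer_id
"""

# Faster pattern: filter and prune columns before the join so far fewer rows
# and columns take part in the merge.
fast_query = """
    SELECT t.transaction_id, t.amount, c.segment
    FROM (SELECT transaction_id, amount, customer_id
          FROM transactions
          WHERE txn_date >= '2024-01-01') t
    JOIN (SELECT customer_id, segment
          FROM customers
          WHERE active = 1) c
      ON c.customer_id = t.customer_id
"""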
You are querying a database using SQL within Python. The query should return the count of messages that failed to send for each customer, but it is returning an incorrect count. What could be the reason, and what potential bugs or assumptions in the SQL or database setup would you check to troubleshoot this? The query is roughly: SELECT customer_id, COUNT(message_id) AS failed_messages FROM messages WHERE status = 'failed' GROUP BY customer_id, executed through the cursor with the results collected via fetchall(). One thing could be that we are not counting distinct values, so it should be a distinct count of message_id; there could be repeated message IDs where duplicates of failed messages have not been removed. Secondly, I would debug individual customer IDs: pick 2 customers and check the actual count in the raw data versus what appears in the output. That is the easiest way to find what is causing the problem. Generally it is duplication, or some values being missed. Sometimes it can be case sensitivity: we have written WHERE status = 'failed', but in the data it might be stored with a capital F, which would cause this issue. All of these can be debugged easily.
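A sketch of the corrected query and the spot-check described above, run against an in-memory sqlite3 database; the 'messages' table, status values and sample customer ID are assumptions about the original setup.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE messages (message_id INTEGER, customer_id TEXT, status TEXT);
    INSERT INTO messages VALUES
        (1, 'CUST-001', 'failed'),
        (1, 'CUST-001', 'failed'),   -- duplicate logging of the same failure
        (2, 'CUST-001', 'Failed'),   -- case mismatch a strict filter would miss
        (3, 'CUST-002', 'sent');
""")

# Count distinct message IDs and normalise case so duplicates and 'Failed'
# rows do not distort the per-customer totals.
query = """
    SELECT customer_id,
           COUNT(DISTINCT message_id) AS failed_messages
    FROM messages
    WHERE LOWER(status) = 'failed'
    GROUP BY customer_id
"""
print(conn.execute(query).fetchall())   # [('CUST-001', 2)]

# Spot-check one customer against the raw rows to validate the count.
print(conn.execute(
    "SELECT * FROM messages WHERE customer_id = ? AND LOWER(status) = 'failed'",
    ("CUST-001",),
).fetchall())
conn.close()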
While improving the query performance of an existing application that manipulates data from large tables, how would you optimize the data handling method without disrupting ongoing operations? I have done a similar thing for Postgres, where we were trying to improve the performance of the data processing. One option, as I mentioned, is a delta lake if you are working on the cloud with heavy data: multiple sources feed large-scale databases, you keep whatever is required in raw format, create data summaries out of it, and then filter and operate on that data, building the delta lake through batch processing. Secondly, optimization often happens at the base level: for one particular SQL query or database, optimize on the columns, do not pull everything, only take what is required, and that trims the dataset as well. Another part is the application side: at times the server cannot take the load from the data, and the fetching of those datasets has to be quick enough; for that we may need to improve the RAM or the clusters we are operating on, since the clusters provide the RAM and efficiency for processing, and a slow backend can affect the website's responsiveness for the customer as well. So those are the things to check. As I said earlier, the SQL queries should be efficient, and the batch processing has to be efficient enough for the servers and the query databases to handle. If it is not, we can either increase the CPU allocated to the clusters, or otherwise use optimized summaries, since we can always summarize the data.
Imagine you have a Python DataFrame of customer messages and you need to group messages by customer, but one of the column names has a typo, causing the group-by operation to fail. Without seeing the DataFrame, what would be your first step in debugging this issue, considering common causes for such an error? Basically we are just doing a group-by here for the count, and the code runs up to the group-by statement and then raises an error. There are two possibilities: either the syntax is wrong, or the column that has been picked is wrong. If I assume the syntax is correct, then it is probably a different column name we need to pick. Since customer_messages is a DataFrame, we can print customer_messages.columns or call customer_messages.info(), which lists all the column names in the dataset we are looking at. From that we can pick the right column to group and count on, and the typo error goes away. And always try to copy and paste column names rather than typing them in, because this is a very common mistake during development.
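A small sketch of that first debugging step with an invented DataFrame: inspect the real column names before the group-by, then group on the verified name.

import pandas as pd

# Invented stand-in for customer_messages; the real column is assumed to be
# misspelled in the failing group-by call.
customer_messages = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "message_txt": ["hi", "offer", "need help"],
})

# First debugging step: list the actual column names before grouping.
print(customer_messages.columns.tolist())   # ['customer_id', 'message_txt']
customer_messages.info()

# Grouping on the verified name avoids the KeyError raised by the typo
# (e.g. counting a non-existent 'message_text' column would fail).
counts = customer_messages.groupby("customer_id")["message_txt"].count()
print(counts)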
What strategies have you employed in the past to build reporting and BI features that cater to both technical and non-technical audiences? For reporting features, we develop the models, but we use dashboards and reports as the front end for the non-technical audience, because in our profile we are generally presenting insights to the company's senior management. We have to stay non-technical while showcasing the results, explaining the business impact and how it affects the operations side. First, we use visualizations and dashboards, built with Power BI, Tableau, Excel/VBA, or Google Analytics for that matter. Second, we use reports: a lot of monthly reports used for monitoring and tracking, which essentially show basic numbers and trends; they can be tabular, showing percentage increases and decreases, and the same can be shown through dashboards. If we are not using those, we can also use presentations: at times I create PPTs for clients to present my insights. A dashboard is fine for operational purposes, if they want to play around with the data and see how one variable affects another, but for customer insights I generally prefer a presentation, because it lets you capture and share everything going on in the background and communicate to management how we can improve their business and how this is impacting it.
What features would you look for in a dashboard creation tool like Looker or Tableau that would enhance the accessibility of data insights for non-technical users? Visualization is definitely the main thing we showcase, and I generally try to build a story around the dashboard itself: first a summary, then a first level of detail, then a second level of detail. Say it is a growing retail company: first I want to show the overall profile and the overall numbers we are looking at, a summary of customer retention, revenue, and the overall picture for management. Then the first level of detail: how many products, how many customers, what types of products, and the details for each product, such as the top-selling products, the top-selling locations, and whether there are specific age groups or customer categories we want to look at. Then there is a second level of detail, where I want to drill down to specific customers and see the list: in retail, how many of them purchased coffee, or how many purchased clothing online, and in which locations. So I can provide a drill-down facility to see, for example, the 1,000 customers who ordered today, and I say 'today' because retail data is large; otherwise it runs into lakhs and millions of records. Accessibility can also mean putting in action buttons for queries, so depending on specific client requirements, we can provide direct query buttons that process and return the drill-down data for them.