Results-oriented Data Scientist with around 2 years of experience in Data Science. Proficient in leveraging AI/ML algorithms to collect, analyze, and transform complex data sets into actionable insights. Skilled in SQL and in programming languages such as Python for developing machine learning and deep learning models. A self-motivated quick learner, I am dedicated to optimizing business performance through innovative AI/ML solutions and to fostering a culture of inclusion.
Generative AI Data Scientist Associate
PwC
Jr. Data Scientist
Mirafra Technologies
Machine Learning Engineer
Datafoundry
Git
Docker
I am currently working as a junior data scientist at Mirafra Technologies. I joined Mirafra in 2022, and I have a little over two years of experience overall. At Mirafra, my responsibilities are mainly development work. One project I worked on was human fall detection using CNN models: I detected the human posture, extracted features from it, and classified each video as a fall or not a fall. I also worked on an automatic resume parser built on NLP techniques, mainly n-grams and hash maps, so that content can be extracted from a resume easily. Previously, I worked at Datafoundry as a machine learning engineer intern. There, I worked on a project called legal entity detection. The client was a lawyer with their own legal application, where they upload legal documents going back a very long time that had been scanned manually. To extract data from those scanned images, my part was to examine each scanned PDF or document and identify whether each page was suitable for OCR. Later on, we also did the OCR part. Before that, I earned a master's in data science from Amrita University; my thesis was on emotion analysis, specifically sentiment analysis of memes. I also hold a bachelor's degree in computer science from R Institute of Technology, Vijayawada.
Yes. For different datasets, we can use different tools. For numerical data, we can use pandas together with a visualization library; these are the most used tools for understanding the distribution of the data. If we want to find the correlation between features, or find the relevant features in a dataset, we can compute it with pandas and then visualize the result. For example, if two features are correlated, we can show this with a heat map. Besides pandas, we can also use NetworkX and Plotly. These are the different tools we can use.
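To make the correlation step concrete, here is a minimal sketch using pandas. The data is synthetic and the column names (`height_cm`, `weight_kg`, `noise`) and the helper `correlated_pairs` are hypothetical, chosen only for illustration; they are not from the interview.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only; the column names are hypothetical.
rng = np.random.default_rng(seed=0)
height = rng.normal(170, 10, size=200)
df = pd.DataFrame({
    "height_cm": height,
    "weight_kg": 0.9 * height - 80 + rng.normal(0, 5, size=200),
    "noise": rng.normal(0, 1, size=200),
})

# Summary statistics give a first look at each feature's distribution.
print(df.describe())

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr()
print(corr.round(2))


def correlated_pairs(corr_matrix, threshold=0.7):
    """Return column pairs whose absolute correlation meets the threshold."""
    cols = list(corr_matrix.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if abs(corr_matrix.loc[a, b]) >= threshold:
                pairs.append((a, b))
    return pairs


strong = correlated_pairs(corr)
print(strong)
```

The `corr` matrix is what a heat map would display; it can be passed to a plotting function such as `plotly.express.imshow` to draw one.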
Yes. As I mentioned earlier, I worked on a meme sentiment analysis project, which was my M.Tech thesis. At that time, I did not have proper resources for it. There was no proper GPU support, so I did the project on my own system, which had minimal GPU support. With such a large dataset, it became very difficult to train a model. So first, I converted the high-resolution images to a lower resolution, and I made sure that I did not remove the important features. That's one approach. I also generated documentation at the same resolution I had changed to. I also used a repeated training approach. For example, with a dataset of 100 samples, I trained on one part of the data at a time and repeated the training on the next part. Because only part of the data is in memory at once, we can optimize memory use. These are the approaches I followed.
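The two memory-saving ideas described above, lowering image resolution and training on one part of the data at a time, can be sketched roughly as follows. This is an illustration under assumptions, not the code used in the thesis: `downsample` and `iter_batches` are hypothetical helpers, block averaging is only one of several ways to reduce resolution, and the actual training call is left out.

```python
import numpy as np


def downsample(image, factor):
    """Shrink a 2-D image by averaging non-overlapping factor x factor blocks."""
    h, w = image.shape
    # Crop so both dimensions divide evenly by the factor.
    h, w = h - h % factor, w - w % factor
    blocks = image[:h, :w].reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))


def iter_batches(samples, batch_size):
    """Yield consecutive slices so only one batch is processed at a time."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


# An 8x8 stand-in "image" reduced to 4x4.
image = np.arange(64, dtype=float).reshape(8, 8)
small = downsample(image, 2)
print(small.shape)

# 100 stand-in samples processed in batches of 32.
batches = list(iter_batches(list(range(100)), 32))
print([len(b) for b in batches])
```

In a real training loop, each batch from `iter_batches` would be loaded, passed to the model's update step, and released before the next one is read.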
Yes. To implement a dictionary in Python, there is a library called scikit-learn. It is one of the most popular libraries in Python, and it gives us many machine learning classifiers and models. It is very useful and also very easy to use. So scikit-learn is the main modeling library most people use.
Yes. One of the projects I worked on was human fall detection, where I had to detect a person in the frame and determine whether the person had fallen. The use case is elderly people or infants staying alone at home, where it is very difficult to keep a caretaker 24/7. Here we can install a CCTV camera and easily detect if the person has fallen. There are many approaches to tracking the human body, but in this use case I need to detect the fall. There is one more scenario to handle: if the person is in a sleeping position, that may be mistaken for a fall. In an actual fall, the person starts standing upright and then drops, so there is a shift in movement from standing to dropping. If we consider the distance between the head and the top of the y-axis, that distance definitely reduces. The hip angle and knee angle also change. Once we detect the pose structure of the body, we can easily extract the features. These features are the angles, the distances, and the rate of change of these angles. Using these features, my purpose is solved: fall detection is solved. In NLP, I worked on a project for meme text analysis. Text on the Internet is not always a proper, grammatical sentence. Some people write with different spellings, half words, and so on. To handle these kinds of words, I used the word embedding model called fastText, which copes with ungrammatical text and half words. So for different scenarios, we can use different customized functions and Python-based algorithms.
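As a rough illustration of the pose features mentioned here (joint angles and their rate of change), the sketch below computes the angle at a joint from three 2-D keypoints. The keypoint coordinates, the frame rate, and the function names `joint_angle` and `rate_of_change` are assumptions made for the example; the interview does not say how the features were actually computed.

```python
import math


def joint_angle(a, b, c):
    """Angle in degrees at point b, formed by the segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    if norm == 0:
        return 0.0  # degenerate case: two keypoints coincide
    # Clamp to [-1, 1] to guard against floating-point drift.
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cosine))


def rate_of_change(values, fps):
    """Per-second change between consecutive per-frame values."""
    return [(later - earlier) * fps for earlier, later in zip(values, values[1:])]


# Hypothetical (x, y) keypoints: shoulder, hip, knee.
standing = joint_angle((0, 0), (0, 1), (0, 2))  # body in a straight line
bent = joint_angle((1, 1), (0, 1), (0, 2))      # torso at a right angle to the leg
print(standing, bent)

# Hip angle over three frames at an assumed 10 frames per second.
print(rate_of_change([180.0, 150.0, 90.0], fps=10))
```

A large, fast drop in the hip or knee angle is the kind of signal that can separate a fall from someone lying down slowly.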
Yes. I worked on projects related to analysis. For example, I worked on a cricket analysis project. In cricket data, there are several features about a cricketer: name, age, debut date, last match played, average, strike rate, and many more. In that case, we should consider only the specific metrics for a specific task. For example, in a particular match, a spin bowler is bowling. If we consider all previous performances together, that is a bad step. We should select the statistics against spin, and also distinguish whether the spinner bowls off spin or leg spin. Within that, we should look at the average for right-hand and left-hand spinners separately, and also at batting first or batting second. So there are different things we have to consider. This is not only for cricket. There are many other projects where I have to consider what kind of text analysis and what kind of text I should use for different things. So these are all things we should consider.
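The idea of selecting only the statistics that match the situation can be sketched as a simple filter. The records, field names, and the helper `average_runs` below are invented for illustration and do not come from a real cricket dataset.

```python
from statistics import mean

# Made-up innings records for one batter; all fields are hypothetical.
innings = [
    {"bowler_type": "off spin", "batting_first": True, "runs": 40},
    {"bowler_type": "off spin", "batting_first": False, "runs": 20},
    {"bowler_type": "leg spin", "batting_first": True, "runs": 10},
    {"bowler_type": "leg spin", "batting_first": False, "runs": 30},
    {"bowler_type": "pace", "batting_first": True, "runs": 60},
]


def average_runs(records, **conditions):
    """Average runs over the records that match every given condition."""
    matching = [
        r["runs"]
        for r in records
        if all(r.get(key) == value for key, value in conditions.items())
    ]
    return mean(matching) if matching else None


print(average_runs(innings))                                            # all records
print(average_runs(innings, bowler_type="off spin"))                    # off spin only
print(average_runs(innings, bowler_type="off spin", batting_first=True))
```

The overall average hides the difference between the off spin and leg spin records, which is why the narrower, task-specific slice is the more useful feature.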
Yes. There is an issue in this Python code. In the data normalization function, they are calculating the mean with np.mean. However, nowhere in the code are they using that mean. So there is definitely a memory issue: the mean variable is not used anywhere. Computing it also takes some time, and with a huge dataset that alone takes a lot of time. If you remove that line, the code works well. Alternatively, we can put the mean to use, because in most cases people normalize using the mean. First we calculate the mean of the whole data, and then for each sample we subtract the mean, which is basic normalization. Using only the mean instead of the standard deviation is also an approach, and then we can remove the standard deviation calculation. So we can do it either way.
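The code under discussion is not shown in the transcript, so the following is only a sketch of the two options described: a normalization that actually uses the computed mean together with the standard deviation, and a mean-only version. The function names `normalize` and `mean_center` are hypothetical.

```python
import numpy as np


def normalize(data):
    """Standardize each column: subtract its mean, divide by its standard deviation."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    std = data.std(axis=0)
    # Constant columns have zero spread; divide by 1 instead to avoid NaNs.
    std = np.where(std == 0, 1.0, std)
    return (data - mean) / std


def mean_center(data):
    """Mean-only variant: subtract the mean and skip the standard deviation."""
    data = np.asarray(data, dtype=float)
    return data - data.mean(axis=0)


print(normalize([[1, 2], [3, 2]]))
print(mean_center([1, 2, 3]))
```

In both versions every computed value is used, so there is no leftover variable of the kind pointed out in the answer.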
Yeah, the filtering model, yes. So, no. Most pre-trained machine learning models are saved with extensions like PKG files or PKL files. So we should choose the proper load function for the type of pre-trained model. In this case, it may work or it may not work, so we should choose the correct load function for the model.
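One way to picture choosing the load function by file type is the sketch below, which uses only the standard library. The `load_model` helper is hypothetical, the "model" is a plain dictionary standing in for a trained estimator, and only pickle files are handled; other formats would need their own loaders.

```python
import os
import pickle
import tempfile


def load_model(path):
    """Pick a loader from the file extension; only pickle files are handled here."""
    ext = os.path.splitext(path)[1].lower()
    if ext in {".pkl", ".pickle"}:
        with open(path, "rb") as f:
            return pickle.load(f)
    raise ValueError(f"No loader registered for {ext!r} files")


# A plain dict stands in for a trained model in this illustration.
saved = {"weights": [0.5, -1.2], "bias": 0.1}

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "model.pkl")
    with open(model_path, "wb") as f:
        pickle.dump(saved, f)
    restored = load_model(model_path)

print(restored)
```

Pickle files can execute arbitrary code when loaded, so they should only be loaded from trusted sources.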
scikit-learn is one of the most powerful and useful packages in Python for working with machine learning models. We can access different machine learning models and also perform different tasks on the dataset, such as splitting it. We can also build a reliable ensemble learning system using scikit-learn. That is a very good approach.
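A minimal sketch of splitting a dataset and building an ensemble with scikit-learn might look like this. The iris dataset, the two base models, and the soft-voting setup are choices made for the example, not details from the interview.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Example dataset bundled with scikit-learn.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the data for evaluation, keeping class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Soft voting averages the predicted class probabilities of the base models.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)

accuracy = ensemble.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```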
Yeah, in my real-world experience, I don't have much experience in debugging. But when I worked as a Python backend developer, I debugged with pdb, the Python debugger. If we put the Python debugger at a particular point and run the code, execution stops there. Then we can see the values of the variables and decide how to go forward. So that is one thing I have done, and also breakpoints. That is one of the options.
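For readers unfamiliar with pdb, here is a small sketch of where a pause point goes. The function `running_total` and its `debug` flag are hypothetical; the flag is only there so the example runs without stopping unless debugging is requested.

```python
import pdb


def running_total(values, debug=False):
    """Sum the values, optionally pausing in the debugger on each iteration."""
    total = 0
    for value in values:
        if debug:
            # Execution stops here and opens the (Pdb) prompt.
            # Useful commands: p total (print), n (next line), c (continue).
            pdb.set_trace()
        total += value
    return total


# Runs straight through; pass debug=True to step through interactively.
result = running_total([1, 2, 3])
print(result)
```

Since Python 3.7, the built-in `breakpoint()` can be used in place of `pdb.set_trace()`; by default it starts the same debugger.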
Yes. For a given dataset, first we can choose different machine learning models and train them on the data. We can also fine-tune them if the score is not up to the mark. That way we can see which models perform well and choose among the models that we have.
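The process described here, training several candidate models, comparing their scores, and tuning when a score falls short, can be sketched with scikit-learn as follows. The dataset, the candidate models, and the parameter grid are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate models to compare on the same data.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "knn": KNeighborsClassifier(),
}

# Mean 5-fold cross-validated accuracy for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}
best_name = max(scores, key=scores.get)
print(scores)
print("best:", best_name)

# If a score is not good enough, tune that model's hyperparameters.
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [3, 5, 7]}, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```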