Data Science has been my major point of interest for a year now, and my zest lies mainly in Machine Learning. I find coding in Python easy, but I never limit myself to learning Python just for Data Science.
Analyst - Data Science, Adidas India Pvt.
Data Scientist, Kreate Energy Pvt Ltd
Deep Learning Intern, ResoluteAI.in
Skills: Apache Kafka, Databricks, Project Jupyter, Jenkins, Bitbucket, NLP
Hello, my name is Anujya Saath. I completed my B.Tech in Computer Science from Turbine Technical University, and I have a total of 2.5 years of experience in data science. Currently, I am working as a Data Scientist at Adidas. I have worked on multiple projects using machine learning, deep learning, statistical techniques, and NLP techniques, so most of my experience and exposure is in Python, NLP, deep learning, and machine learning. Thank you.
Designing an algorithm to detect trends in time series data involves several steps, so let me go through it step by step. The first step is data preprocessing, where we do the cleaning and normalization. After that comes decomposition: we decompose the time series into its components, such as trend, seasonality, and residuals, using techniques like seasonal decomposition of time series or moving-average methods. Then we can detect the trend with a moving average: calculate the moving average over a specific window to smooth the data and identify the trend. We can also do it through linear regression, fitting a linear model to the time series data to determine the trend component, or through polynomial fitting and exponential smoothing. After that we can run some statistical tests to check whether there is a trend or seasonality, for example the augmented Dickey-Fuller test, which checks for stationarity and helps identify a trend. There are also machine learning approaches through which we can find trends, like ARIMA models and LSTMs. Finally, we can implement change-point detection algorithms to identify the points where the statistical properties of the time series change significantly. With all these steps we can find the trend, see how the model is performing, and find out whatever else is in the data.
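To make those steps concrete, here is a minimal sketch, assuming pandas and statsmodels are available; the monthly `sales` series is a hypothetical stand-in, and it shows the moving-average smoothing, a linear trend fit, and the augmented Dickey-Fuller test mentioned above.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller

# Hypothetical monthly series with an upward trend, for illustration only
idx = pd.date_range("2020-01-01", periods=48, freq="MS")
rng = np.random.default_rng(0)
sales = pd.Series(100 + 2 * np.arange(48) + rng.normal(0, 5, 48), index=idx)

# 1. Smooth with a moving average over a 12-month window to expose the trend
trend_ma = sales.rolling(window=12, center=True).mean()

# 2. Decompose into trend / seasonal / residual components
decomposition = seasonal_decompose(sales, model="additive", period=12)

# 3. Fit a linear model in time: the slope estimates the trend
t = np.arange(len(sales))
slope, intercept = np.polyfit(t, sales.values, deg=1)

# 4. Augmented Dickey-Fuller test: a high p-value suggests a non-stationary (trending) series
adf_stat, p_value, *_ = adfuller(sales)
print(f"linear trend slope: {slope:.2f}, ADF p-value: {p_value:.3f}")
```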
For feature selection, this is the process I would follow for a predictive model. First, we understand the data, checking whether the data is right for the problem, and do some EDA. After that we do data preprocessing, where we handle missing values, check whether the data is normalized, and encode categorical variables. Then, as an initial filter, we remove irrelevant features and redundant features. After that we do univariate selection: if there is a univariate relationship with the target, we find it through measures like chi-square, ANOVA, and mutual information scores. Next comes model-based selection. First we look at feature importance: we train models with a few different machine learning algorithms and use the feature importance scores to select the top features. We can also use recursive feature elimination, where we iteratively train the model and remove the least important features. After that we use regularization methods like Lasso or Ridge, which can shrink the coefficients of the least important features. We can also apply dimensionality reduction, where we reduce the dimensionality and keep the most informative components, for example through principal component analysis or linear discriminant analysis. Then we do cross-validation, selecting the feature set with the best cross-validation score. Finally, feature selection is an iterative process: we continuously refine the features based on model performance and insights. This is what we can do to select the features for the predictive model.
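As a rough illustration of the selection techniques mentioned above, here is a minimal scikit-learn sketch; the synthetic dataset is an assumption standing in for real features.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import Lasso

# Synthetic data standing in for the real feature matrix and target
X, y = make_regression(n_samples=200, n_features=10, n_informative=4, random_state=0)

# Univariate selection with an ANOVA F-test (chi-square / mutual information work similarly)
univariate = SelectKBest(score_func=f_regression, k=5).fit(X, y)

# Model-based importance scores from a tree ensemble
forest = RandomForestRegressor(random_state=0).fit(X, y)
top_by_importance = forest.feature_importances_.argsort()[::-1][:5]

# Recursive feature elimination: iteratively drop the least important feature
rfe = RFE(estimator=Lasso(alpha=0.1), n_features_to_select=5).fit(X, y)

# Lasso regularization shrinks the coefficients of weak features toward zero
lasso = Lasso(alpha=0.1).fit(X, y)

print("univariate keep:", univariate.get_support().nonzero()[0])
print("forest top 5:   ", top_by_importance)
print("RFE keep:       ", rfe.support_.nonzero()[0])
print("lasso nonzero:  ", lasso.coef_.nonzero()[0])
```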
For this question, ensuring reproducibility in the data analysis workflow is a procedure for verifying the results and facilitating collaboration. Here are the key steps I would follow to ensure reproducibility. First, use version control: utilize Git for version control of code, scripts, and configuration files, and store the project in a repository like GitHub, GitLab, or Bitbucket to track changes and collaborate. After that, documentation: we add comments in our code so that it is easier to follow when someone reads it, and we also write a README file, where we generally put the installation information and usage guidelines, and we keep our notebooks organized. Then we have to manage the environment files and configurations. Next, we take care of data management, like raw data, data provenance, and data storage. After that come consistent standards, then automation and scripting. Finally, we do testing and validation so that we get accurate, exact analysis and ensure that the data analysis workflow is reproducible.
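As a small, hedged sketch of what the automation and environment side of this can look like in Python, here is one way to fix random seeds and record run metadata alongside the outputs; the file name and keys are assumptions for illustration.

```python
import json
import platform
import random
import sys

import numpy as np

# Fix random seeds so reruns of the analysis give identical results
SEED = 42
random.seed(SEED)
np.random.seed(SEED)

# Record the runtime environment alongside the outputs for provenance
run_metadata = {
    "python_version": sys.version,
    "platform": platform.platform(),
    "numpy_version": np.__version__,
    "seed": SEED,
}

with open("run_metadata.json", "w") as f:
    json.dump(run_metadata, f, indent=2)
```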
Let me first talk about the problem. The problem was predicting the sales performance of a retail store. As I already told you, I am working at Adidas, so the task was predicting the sales performance of retail stores based on various factors like advertising expenditure, seasonal effects, and competitor activity, and for that I ended up using polynomial regression rather than linear regression. During the project, aimed at predicting monthly sales for the retailer, we initially started with linear regression to establish a relationship between advertising expenditure and sales. The data included monthly sales over several years, with features like advertising spend, number of promotions, and competitor prices. For the initial analysis I first applied EDA and then looked at the linear regression results. On top of that, I tried polynomial regression as an alternative approach. By using both linear and polynomial regression, I got better results with polynomial regression: the relationships between the features and sales were not purely linear, which is why polynomial regression gave a better fit for our data. I also evaluated the R-squared metric as well as RMSE for the polynomial regression model against the linear regression model and observed a significant improvement in the fit, with higher R-squared and lower RMSE. So overall, the polynomial regression model provided a much better fit to the data compared to linear regression.
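Here is a minimal sketch of the linear-versus-polynomial comparison described above, assuming scikit-learn; the quadratic synthetic data is an assumption standing in for the real sales figures.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic "advertising spend -> sales" data with a non-linear component
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = 50 + 5 * X[:, 0] + 1.5 * X[:, 0] ** 2 + rng.normal(0, 10, 300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear = LinearRegression().fit(X_train, y_train)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X_train, y_train)

# Compare fit quality on held-out data with R-squared and RMSE
for name, model in [("linear", linear), ("polynomial", poly)]:
    pred = model.predict(X_test)
    rmse = mean_squared_error(y_test, pred) ** 0.5
    print(f"{name}: R^2={r2_score(y_test, pred):.3f}, RMSE={rmse:.2f}")
```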
Yes, there is an issue with how the threading is handled in this code. The main problem is that we are calling thread.run() instead of thread.start(). thread.run() executes the run method on the current thread, the main thread, rather than starting a new thread, so what we should do is call thread.start() instead. To explain thread.run() versus thread.start(): thread.run() calls the run method directly on the current thread (the main thread in this case), so the code inside run executes on the main thread; thread.start() creates a new thread and calls the run method on that new thread, allowing run to execute concurrently with the main thread. By using thread.start(), the output should be as expected, with the worker's message printed from the new thread and the main thread's message printed from the main thread. The corrected code ensures that the worker class implements the Runnable interface correctly and that the threading is handled properly.
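The original Java snippet is not reproduced here, but the same run() versus start() distinction exists in Python's threading module, so here is a small Python sketch of the behaviour described above.

```python
import threading

def worker():
    print(f"working, from thread: {threading.current_thread().name}")

t = threading.Thread(target=worker, name="worker-thread")

# t.run() would execute worker() synchronously on the calling (main) thread.
# t.start() spawns the new thread so worker() runs concurrently with main.
t.start()
t.join()

print(f"done, from thread: {threading.current_thread().name}")
```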
Inside this code there are a few issues with the provided function for calculating the standard deviation of a vector, so I will explain the corrected version. The main issue is the assignment operator: it should be the R assignment operator <- instead of ==. After that, there are variable naming and syntax errors: the spacing and operators are wrong, so for example the mean should be assigned as mean_val <- mean(x), keeping the underscore in the name, and the deviations computed as x - mean_val. Then the variance calculation itself: the variance formula as written is incorrect. Also, R is case sensitive, so variance and Variance are different names, and the function name used in the usage example does not match the defined function name, so the call has to use exactly the name the function was defined with. These are the corrections: some are syntax errors and some are assignment errors that we need to fix.
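The original R snippet is not included here, so as a rough illustration only, this is the corrected sample standard deviation logic the answer describes, written in Python with an assumed function name.

```python
import math

def std_dev(x):
    """Sample standard deviation of a sequence of numbers."""
    mean_val = sum(x) / len(x)
    # Sample variance divides by (n - 1)
    variance = sum((v - mean_val) ** 2 for v in x) / (len(x) - 1)
    return math.sqrt(variance)

# Usage example: the call must match the defined function name exactly
print(std_dev([2, 4, 4, 4, 5, 5, 7, 9]))
```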
Here is an algorithm for using regression analysis to predict future sales based on multidimensional historical data; I will go through it step by step. First, data collection and preparation: we handle missing data, check whether the data is in the correct format, take in everything we need such as seasonal effects and competitor prices, and then normalize the data so all features are on the same scale. After that we do EDA: we check distributions, correlations, and trends, and visualize the relationships between the features and sales. Next we perform feature selection, keeping the most important features, and then split the data into training and test sets. Then comes model selection: choose a regression model, for example linear regression, polynomial regression, or more advanced regression models like random forest or gradient boosting, and consider trying multiple models and comparing their performance. Then we do model training, training the regression model on the training data. After that we check the performance in the model evaluation step, evaluating the model on the test set using metrics like mean absolute error (MAE), MSE, and R-squared. Then we do hyperparameter tuning to improve the performance of the model, and finally we use the trained model to predict future sales based on new data. If we perform all these steps, we can ensure that the algorithm is working.
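A compact sketch of this end-to-end flow, assuming scikit-learn; the synthetic regression data stands in for the real multidimensional historical sales data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic multidimensional historical data standing in for the real features
X, y = make_regression(n_samples=500, n_features=8, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Compare a simple baseline with a more flexible model
models = {
    "linear": LinearRegression(),
    "random_forest": GridSearchCV(  # hyperparameter tuning step
        RandomForestRegressor(random_state=0),
        param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
        cv=3,
    ),
}

# Train on the training split, evaluate on the held-out test split
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(f"{name}: MAE={mean_absolute_error(y_test, pred):.2f}, R^2={r2_score(y_test, pred):.3f}")
```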
Okay, let's start with this one. Automating the data cleaning process while minimizing the loss of critical data requires a systematic approach, so I will explain the approach. First, we understand the data and identify the critical data elements that must be preserved. Then we define the cleaning rules and start testing them, covering missing values, outlier detection, duplicates, data types, and normalization. After that we create reusable cleaning pipelines and implement the automation: here we use libraries such as pandas in Python for efficient data manipulation and employ tools like scikit-learn's preprocessing module for the transformations. Then come validation and monitoring, which include validation steps to check the integrity and quality of the cleaned data and monitoring of the cleaning process to catch issues early. Finally, we document all of this, which gives us a proper data cleaning process without the loss of any critical data.
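As an illustration of the reusable pipeline mentioned above, here is a minimal pandas sketch; the column names and raw data are assumptions for the example.

```python
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Reusable cleaning step: duplicates, types, missing values, gentle outlier capping."""
    out = df.drop_duplicates().copy()

    # Enforce expected dtypes; bad values become NaN/NaT instead of raising
    out["order_date"] = pd.to_datetime(out["order_date"], errors="coerce")
    out["sales"] = pd.to_numeric(out["sales"], errors="coerce")

    # Impute rather than drop, so critical rows are not lost
    out["sales"] = out["sales"].fillna(out["sales"].median())

    # Cap extreme outliers at the 1st/99th percentiles instead of deleting them
    low, high = out["sales"].quantile([0.01, 0.99])
    out["sales"] = out["sales"].clip(low, high)
    return out

# Hypothetical raw data for the example
raw = pd.DataFrame({
    "order_date": ["2024-01-01", "2024-01-02", "2024-01-02", "bad-date"],
    "sales": ["100", "250", "250", None],
})
print(clean(raw))
```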
Validating and testing a newly developed tool for data analysis before deployment involves several kinds of testing. First, we do unit testing, so we write unit tests, and then we look at test coverage to ensure comprehensive coverage of edge cases and error handling. After that, integration testing, combining the components, and then end-to-end testing of the whole data analysis workflow to validate the tool from data input to outputs. Then we check performance, with load testing and benchmarking. Then we do UAT, that is, user acceptance testing, working with the actual users. After that we do data validation, checking the accuracy of the results and running consistency checks. These are the kinds of testing we do to ensure that the newly developed data analysis tool is correct before it is deployed. Thank you.
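A small sketch of the unit-testing step, assuming pytest and a hypothetical `summarize` function inside the tool; the function itself is only an example.

```python
import math

import pytest

def summarize(values):
    """Hypothetical tool function: returns count and mean of numeric values."""
    if not values:
        raise ValueError("values must not be empty")
    return {"count": len(values), "mean": sum(values) / len(values)}

def test_summarize_basic():
    result = summarize([1, 2, 3, 4])
    assert result["count"] == 4
    assert math.isclose(result["mean"], 2.5)

def test_summarize_empty_input_is_an_error():
    # Edge case / error handling coverage
    with pytest.raises(ValueError):
        summarize([])
```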