
Senior Associate II, Data Science - Kyndryl Solutions
Software Engineer II - Terralogic Software Solutions
System Engineer - Huawei Technologies
Project Engineer Intern - MD Electricals
Associate Engineer - Tata Communications
Customer Service Executive - Tata Communications Transformations Services
MySQL

Oracle Cloud

Azure Machine Learning Studio

GitHub
Good morning. I have three-plus years of experience in data science and AI, including machine learning, deep learning, and deployment. I graduated from University of Engineering and am currently passing the 2,000 90 batch. Then I started my career at Tata Communications as an associate engineer. There, I worked on the chatbot development team and looked at NLP data analysis. I analyzed data using NLP techniques such as text summarization, machine translation, and NER. Then I shifted to TCTS, a subsidiary of Tata, as a customer service executive coordinating with the project delivery department. I made sure that our data architect team was doing well and that no issues arose. I worked to detect and fix the bugs that came up in project delivery. Then I switched to Huawei Technologies. There, I worked on a system that could predict alarms. The product was a network management system: when the network fiber goes down, an alarm should be raised. Based on this, we created a prediction model using our video streaming data and past data. My last company was Terralogic Software Solutions. There, I was a software engineer. I handled AI-related tasks and analyzed their HRM data. I developed some machine learning models to improve their HRM assistant and worked on the customer voice chatbot system. Based on a small data set, we deployed the chatbot using NLP techniques and utilized some algorithms, such as G53 and G54. We're trying to improve the chatbot and grow our model. That is my background.
We implement regularization techniques in neural networks to combat overfitting and improve model generalizability. There are several techniques. The first are L1 and L2 regularization. L1 regularization is also known as the Lasso. It adds the sum of the absolute values of the model's weights to the loss function. This encourages sparsity by driving some weights to 0, effectively reducing the model's complexity. L2 regularization, on the other hand, adds the sum of the squared values of the model's weights to the loss function. This pushes the weights towards 0 without eliminating them entirely. That is the main difference between L1 and L2 regularization. On the implementation side, there are different types of regularization, and there is a process of choosing the right hyperparameter, like lambda for the regularization strength. Then there is dropout. Dropout randomly drops neurons during training, forcing the network to rely less on individual features and learn more robust representations. This is implemented by setting the dropout parameter, for example, dropout = 0.2. We choose the dropout rate depending on the problem and network complexity. Then there's early stopping. For that, we monitor the performance on the validation set during training and stop training if the validation performance doesn't improve for a defined period. This prevents the model from memorizing the training data and helps it generalize better. With early stopping, we choose an appropriate patience value for a validation metric like accuracy or loss. We also use data augmentation: we artificially increase the diversity of the training data through techniques like image flipping, cropping, or adding noise. This makes the model more robust to variations in unseen data. We also work with weight decay. This gradually reduces the magnitude of the weights during training, similar to L2 regularization, but it is often implemented separately.
This is implemented by setting the weight decay parameter in the optimizer; for example, if you use the Adam optimizer, you set its weight decay parameter. We choose the weight decay rate based on the data and network complexity. So, yeah, we have to find the optimal combination of techniques and hyperparameters.
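As a rough illustration of the ideas in this answer, here is a minimal sketch in plain Python of the L1 and L2 penalty terms and of an early-stopping rule with a patience value. The function names, the weights, and the validation losses are made up for the example; a real project would normally rely on the options built into its deep learning framework.

```python
def l1_penalty(weights, lam):
    """Lasso term: lam times the sum of absolute weight values."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """Ridge term: lam times the sum of squared weight values."""
    return lam * sum(w * w for w in weights)


def early_stopping_epoch(val_losses, patience):
    """Return the epoch index at which training would stop.

    Training stops once the validation loss has failed to improve on the
    best value seen so far for `patience` consecutive epochs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1  # rule never triggered; all epochs ran


# Hypothetical weights and regularization strength (lambda).
weights = [0.5, -1.5, 0.0, 2.0]
lam = 0.1
l1 = l1_penalty(weights, lam)  # added to the loss for L1 regularization
l2 = l2_penalty(weights, lam)  # added to the loss for L2 regularization

# Hypothetical validation losses, one per epoch.
val_losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.70]
stop_epoch = early_stopping_epoch(val_losses, patience=2)
```

Here the best validation loss is reached at epoch 2, and with a patience of 2 the rule stops training two epochs later.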
So, to reduce overfitting in a machine learning model using Python: Python offers a toolkit for addressing overfitting in machine learning models. There are some common data-centric techniques. One is data augmentation, which artificially expands your dataset with random transformations like rotation, flipping, and cropping. Another is feature engineering and selection, where you analyze your features to identify the ones that matter. Then there is early stopping, a technique that trains your model with a separate validation set and stops training when validation performance starts to decline. On the model side, there is regularization. This involves penalizing large parameter values in your model to prevent overfitting on specific aspects of the data. Regularization is a common approach and is easily implemented with libraries like Scikit-Learn. Another technique is dropout. Dropout randomly deactivates neurons during training, forcing the model to learn robust representations that don't rely on specific features. Then there is model complexity reduction. We use simpler models with fewer layers or neurons; we start with basic models and gradually increase complexity if necessary. Another is the ensemble technique. Ensembling combines the predictions from multiple models that are trained differently, for example, with different hyperparameters or training splits. It's used to ensure better generalization. Next, bagging and boosting techniques also provide tools for avoiding overfitting in machine learning models. However, it's very important to remember that the best approach depends on the specific dataset and model. So, we can experiment with different techniques and evaluate their impact on our model performance to find the optimal approach.
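To make the ensemble idea from this answer concrete, here is a small sketch in plain Python that trains one simple linear model per training split and averages their predictions. The toy data, the interleaved splitting scheme, and the function names are assumptions chosen for illustration, not a method described in the answer.

```python
def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept on (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def ensemble_predict(points, x_new, n_splits=5):
    """Average the predictions of models trained on different training splits."""
    predictions = []
    for fold in range(n_splits):
        # Each model leaves out a different interleaved fold of the data.
        train = [p for i, p in enumerate(points) if i % n_splits != fold]
        slope, intercept = fit_line(train)
        predictions.append(slope * x_new + intercept)
    return sum(predictions) / len(predictions)


# Toy data: y = 2x + 1 with small alternating noise.
data = [(x, 2 * x + 1 + (0.1 if x % 2 == 0 else -0.1)) for x in range(10)]
prediction = ensemble_predict(data, x_new=5)
```

Because each model sees a slightly different subset, averaging their outputs smooths out the quirks any single model picked up from its own split.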
So, as far as I know, normalizing data directly within SQL queries for machine learning purposes isn't a good first approach. It can lead to issues. Why? Because of data leakage, inefficiency, limited techniques and functionality, and data integrity problems. But there are some approaches we use. For example, pre-processing data outside SQL queries, at the Python level. There, you normalize the training and testing sets separately before feeding them into your machine learning pipeline. This ensures data integrity and allows for flexible and optimized calculations. If performance is critical, consider pre-calculating and storing normalized data in separate tables for faster access during model training and evaluation. However, remember to keep track of the normalization parameters. Then there are database extensions. Some databases offer extensions or stored procedures that are specifically designed for data tasks, but options and capabilities vary. So machine learning emphasizes careful data handling and avoiding leakage. Normalizing within SQL queries might appear convenient, but it can compromise this principle, ultimately leading to suboptimal model performance.
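One common way to avoid the leakage mentioned in this answer is to compute the normalization parameters on the training set only and then reuse them for the test set. The sketch below shows this with min-max scaling in plain Python; the function names and the numbers are illustrative assumptions, and it assumes the training values are not all identical.

```python
def fit_min_max(values):
    """Learn normalization parameters from the training data only."""
    return {"min": min(values), "max": max(values)}


def apply_min_max(values, params):
    """Scale values with previously learned parameters."""
    span = params["max"] - params["min"]  # assumes span is not zero
    return [(v - params["min"]) / span for v in values]


# Hypothetical feature values.
train = [10.0, 20.0, 30.0, 40.0]
test = [25.0, 50.0]

params = fit_min_max(train)                # parameters to keep track of
train_scaled = apply_min_max(train, params)
test_scaled = apply_min_max(test, params)  # reuses the training parameters
```

A test value larger than anything seen in training scales to above 1, which is expected: the test data had no influence on the parameters.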
So, as far as I know, SQL window functions are indeed powerful tools for performing calculations. We start by defining the window. The window clause specifies the window over which the function operates. That involves partitioning: defining how to group rows logically using the PARTITION BY clause, for example, calculating moving averages within each customer. Then ordering: we specify the order of rows within each partition using ORDER BY. This determines the direction of the window, like chronological order for time series data. Then we choose the function. There are functions like SUM, AVG, MIN, and MAX, which calculate aggregate values within the window for metrics like total sales or average price. Then ROW_NUMBER, DENSE_RANK, and RANK, which assign sequential ranks or positions to rows based on order or values within the window. Another one is LEAD or LAG, which accesses values from specific positions ahead of or behind the current row, useful for calculating differences or lags. A cumulative sum or cumulative average calculates running sums or averages from the beginning of the window up to the current row. To specify the range, the ROWS or RANGE clause defines the extent of the window relative to the current row. ROWS n PRECEDING includes the n rows before the current row. ROWS n FOLLOWING includes the n rows after the current row. ROWS BETWEEN m PRECEDING AND n FOLLOWING is similar to those, but specifies both ends of the window. With RANGE, the offsets are based on the ordering value, for example, 2 days preceding. If we take one example: SELECT customer_id, order_date, product_price, so I'll be selecting those columns from the table, then AVG(product_price) OVER (PARTITION BY customer_id ORDER BY order_date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg_price FROM orders. So, that is the example.
So, this query calculates the moving average product price for each customer, considering the current price and the 2 preceding orders within the same customer group. Window functions have some complexity, so make sure you understand their logic and the impact of the different functions on the window for your specific purpose. It is also common to use a common table expression to precalculate window results for readability, which is important.
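The moving-average query from this answer can be tried out with Python's built-in sqlite3 module. This is only a sketch: the table contents are invented for the example, and it assumes the bundled SQLite is version 3.25 or later, which is when window functions were added.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (customer_id INTEGER, order_date TEXT, product_price REAL)"
)
# Hypothetical orders for two customers.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "2023-01-01", 10.0),
        (1, "2023-01-02", 20.0),
        (1, "2023-01-03", 30.0),
        (1, "2023-01-04", 40.0),
        (2, "2023-01-01", 100.0),
        (2, "2023-01-05", 200.0),
    ],
)

query = """
SELECT customer_id, order_date, product_price,
       AVG(product_price) OVER (
           PARTITION BY customer_id          -- one window per customer
           ORDER BY order_date               -- chronological order
           ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
       ) AS moving_avg_price
FROM orders
ORDER BY customer_id, order_date
"""
rows = conn.execute(query).fetchall()
moving_avgs = [row[3] for row in rows]
conn.close()
```

For the first rows of each customer the window simply contains fewer than three orders, so the average is taken over whatever rows are available.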
So, handling variable correlations when developing a multivariate linear regression model is crucial because it can significantly impact the validity and interpretability of our model. Therefore, we have to identify correlated variables first. We calculate the correlation matrix to see the pairwise correlation between all our variables and look for high correlations, typically above 0.8 or 0.9, which might indicate redundancy. Scatter plots are important too; we plot pairs of variables to graphically assess the nature of the correlation, whether linear, etc. Then we have to understand the impact of correlation, particularly multicollinearity, where high correlation can lead to unstable estimates, inflated standard errors, and difficulty interpreting individual variable coefficients. Additionally, suppression effects can occur, where one variable might mask the true relationship of another with the dependent variable. We then have to apply some strategies. If two variables are highly correlated and offer similar information, we consider removing one based on domain knowledge or feature importance analysis. We can also combine variables by creating new variables from highly correlated ones if it makes logical sense in context. Then we can use techniques like principal component analysis to reduce the number of variables and obtain uncorrelated components. We can also use regularization techniques, such as L1 and L2 in regression models, to penalize large coefficients and reduce the effect of multicollinearity. Another approach is to interpret results with caution, especially in highly correlated settings, and monitor the variance inflation factor to quantitatively assess the severity. We also have to evaluate the impact of the different strategies on our model performance.
Furthermore, we can use visualization techniques like partial dependence plots to understand the combined effect of multiple variables on the dependent variable. Handling variable correlations requires understanding potential issues and applying appropriate strategies based on our specific task, data, and research.
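As a small illustration of the first steps in this answer, the sketch below computes pairwise Pearson correlations in plain Python, flags pairs above a 0.8 threshold, and derives a variance inflation factor for one pair. The feature values and names are invented, and the VIF shortcut 1 / (1 - r^2) used here is only valid when a predictor is regressed on a single other predictor; with more predictors the R^2 from a full regression would be needed.

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


# Hypothetical predictors: x2 is almost a multiple of x1, x3 is unrelated.
features = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [2.1, 3.9, 6.2, 7.8, 10.1],
    "x3": [3.0, 5.0, 1.0, 4.0, 2.0],
}

threshold = 0.8
flagged = [
    (n1, n2)
    for n1, n2 in combinations(features, 2)
    if abs(pearson(features[n1], features[n2])) > threshold
]

# Two-predictor special case: R^2 of x1 regressed on x2 equals r^2.
r = pearson(features["x1"], features["x2"])
vif_pair = 1.0 / (1.0 - r * r)
```

A VIF far above the commonly quoted rule-of-thumb value of 10 would suggest that one of the two variables is a candidate for removal or combination.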
So the LEFT JOIN clause ensures that all customers are included in the result, even if they have no sales in 2023. The a.signup_date <= b.sale_date condition keeps the sales for customers who signed up on or before the sales date. The output of the Python code shows that the customer with ID 1 has sales of 100 on 2023-01-01, and their sign-up date is 2021 1 which is equal to the sales date.
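The behaviour described in this answer can be reproduced with a small sqlite3 sketch. The table names, column names, and rows below are hypothetical stand-ins, since the original query and data are not shown; the point is only that putting the date conditions in the ON clause of a LEFT JOIN keeps customers without matching sales in the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, signup_date TEXT)")
conn.execute(
    "CREATE TABLE sales (customer_id INTEGER, sale_date TEXT, amount INTEGER)"
)
# Hypothetical data: customer 2 has no sales at all.
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "2023-01-01"), (2, "2022-06-15")],
)
conn.execute("INSERT INTO sales VALUES (1, '2023-01-01', 100)")

query = """
SELECT a.customer_id, a.signup_date, b.sale_date, b.amount
FROM customers AS a
LEFT JOIN sales AS b
  ON a.customer_id = b.customer_id
 AND a.signup_date <= b.sale_date                       -- signed up on or before the sale
 AND b.sale_date BETWEEN '2023-01-01' AND '2023-12-31'  -- sales in 2023 only
ORDER BY a.customer_id
"""
rows = conn.execute(query).fetchall()
conn.close()
```

Customer 2 still appears in `rows`, with `None` in the sales columns. Moving the date conditions into a WHERE clause instead would drop that customer.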
So, when we are deploying a predictive model in a production environment, we have to choose the right state management approach, and that depends on several factors. One is model complexity. Simple models with few parameters might not need complex state management. Another factor is the data and the update frequency. Frequently updated models necessitate a reactive state. Another factor is scalability and high availability. Consider distributed solutions for large-scale deployments. Another is integration with existing infrastructure: we should leverage existing tools and frameworks whenever possible. There are some common state management approaches and their potential applications. Key-value stores, such as Redis and Memcached, are fast, highly available, and simple for static model parameters. They're suitable for storing model coefficients, intermediate calculations, and user-specific states. Distributed file systems, such as PFS and TaskFS, are scalable, durable, and handle large data sets. They're suitable for model checkpoints but may not be optimized for frequent updates, which can introduce latency. They're suitable for storing large models, training data, and historical predictions. Database management systems, such as SQL databases, offer structured storage with querying capabilities. However, they're less performant for frequent updates than key-value stores. We use them for storing model metadata, user-specific context, and model training logs. We also use some model-serving frameworks, such as TensorFlow Serving. These frameworks are designed for model serving, handle different model formats, and are optimized for performance. However, they can be complex to manage and require specific expertise.
We use them when deploying the model itself, managing model versions, handling inference requests, and tracking model performance. We also use some state management libraries, such as DBC. These libraries are used for versioning experiments, tracking model performance, debugging, and deployment pipelines. We track model performance, track training and deployment history, and keep managing different model versions. A combination of these approaches can be used. Key-value stores are suitable for storing models and large datasets, databases are suitable for structured data and logs, and model-serving frameworks are suitable for deploying and serving the model. State management libraries are used for tracking and deployment history. Apart from that, we need to implement additional security measures, such as robust authorization mechanisms for model state. We also need to monitor data, updates, model performance, and resource usage. Disaster recovery is also essential.
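To give a feel for the key-value approach in this answer, here is a toy in-memory stand-in written in plain Python. It is not Redis or Memcached code; the class, the key layout, and the model name are hypothetical, chosen only to show how versioned model parameters and a "latest" pointer might be stored and read back.

```python
import json


class ModelStateStore:
    """Toy key-value store holding versioned model parameters as JSON strings."""

    def __init__(self):
        self._data = {}  # a real deployment would use an external store

    def save(self, model_name, version, params):
        """Store parameters under name:version and move the 'latest' pointer."""
        self._data[f"{model_name}:{version}"] = json.dumps(params)
        self._data[f"{model_name}:latest"] = str(version)

    def load(self, model_name, version=None):
        """Load a specific version, or the latest one if none is given."""
        if version is None:
            version = self._data[f"{model_name}:latest"]
        return json.loads(self._data[f"{model_name}:{version}"])


store = ModelStateStore()
store.save("alarm_model", 1, {"coef": [0.2, 0.8], "intercept": 0.1})
store.save("alarm_model", 2, {"coef": [0.3, 0.7], "intercept": 0.0})

latest = store.load("alarm_model")               # serves version 2
rollback = store.load("alarm_model", version=1)  # older version stays available
```

Keeping every version under its own key makes a rollback a matter of repointing "latest", which is one reason key-value stores suit small, static model parameters.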
There are several approaches we can take to dynamically adapt the data visualization based on user data selection in Python, each with its own strengths and weaknesses. One way is to use frameworks and libraries like Dash, Plotly, and Bokeh. There's also server-side rendering and streaming with Django, FastAPI, and Flask, which requires a deeper understanding of server-side programming. Another approach is to use pre-built dashboards, such as Tableau, Power BI, and Google Data Studio. They are easy to use and have a wide range of features. In Python itself, we get libraries like Matplotlib, Seaborn, and Plotly, and we can use those as well. If data updates frequently, consider server-side rendering or event-driven approaches, such as WebSockets in web frameworks. We also have to consider the complexity of the visualization. For simple charts, we can use Matplotlib. More complex views require libraries like Plotly, which you can use in Python. There are some low-level JavaScript libraries also, such as Vega and D3.js, but they can't be used directly in Python. So Plotly in Python is also powerful for interactive visualization, with extensive chart documentation and a community that supports integration with data sources. It allows custom layouts and interaction. Python can also be used with Vega-Lite for building new visualizations with concise code. We can use pandas for data manipulation and cleaning, and NumPy for numerical computation, which is meant for basic static visualization. Our users can choose an interactive
So, there are some ways we can leverage Azure's ML services to enhance our machine learning models. The first is data management and storage. We use Databricks to collaborate on data exploration, analysis, and model training in a unified platform. Then we perform the training and experimentation. For training, we're using the Azure Machine Learning service. We train and manage machine learning workflows in the cloud to find the best-performing model on our dataset. Then, for experimentation, we track, compare, and reproduce machine learning experiments to enable efficient model development. Also, we use Azure Functions for deployment and serving, deploying lightweight models that respond to events or triggers in real time. Sometimes we use the Kubernetes service to deploy our trained model as containerized microservices for scalable and secure inference. We mostly use Azure Functions. Then, we evaluate our performance in Azure Machine Learning Studio, where we view model performance and health metrics in our usual inference environment. Then, we understand how our model makes predictions with explanations and insights. Those are basically the services we use, but the choice depends on the machine learning model and our specific requirements.
So, integrating Power BI visualization into a Python-based data analytics workflow can be done through several methods. One is the Power BI REST API. Power BI provides a REST API that allows us to embed Power BI reports and dashboards into custom applications, including Python ones, so we can use the API to access Power BI resources, including data sets and dashboards, programmatically. We can use Power BI Embedded also. It allows us to embed Power BI reports and dashboards directly into our applications. We can use the Power BI Embedded Python SDK to integrate this visualization into our Python-based applications. Then there is Power BI Desktop. While Power BI Desktop itself is not directly accessible from Python, you can export data from Power BI Desktop to various file formats, for example, CSV or Excel, and then import and analyze that data using Python libraries such as NumPy and Pandas. Then there are Python visualizations. There are also powerful data visualization libraries in Python, like Plotly. We can extract data from Power BI, either through exporting or using APIs, and then visualize it using these Python libraries within our Python-based data analytics workflow. We can use Azure Databricks also. Power BI dataflows allow you to prepare and transform data, and then you can use Azure Databricks, which supports Python, for advanced analytics and machine learning tasks. We can then visualize the results back in Power BI. Also, we can use Jupyter Notebooks for embedded analytics: we can embed Power BI directly into Jupyter Notebooks and use Python code. So the choice among these methods depends on our specific requirements, considering factors such as real-time updates, licensing, and integration complexity. Based on those parameters, we can choose the method that we want.