Amith KA is a seasoned professional with 14.5 years of expertise in leadership, management, and a comprehensive skill set spanning data engineering, data science, and systems design. Proficient in engineering best practices, data architecture, and cloud platforms such as GCP, AWS, and Azure, Amith excels in full-stack development and employs tools like Terraform for efficient infrastructure management. With proficiency in Python, TensorFlow, PyTorch, and BigQuery, Amith is a specialist in machine learning, deep learning, MLOps, and Kubernetes. Additionally, expertise in areas like NLP, computer vision, and generative AI, along with a strong background in security and Docker, further underscores Amith's impact in the industry.
Head of AI & Engineering, Chryselys
Chief Architect, 66 Degrees
Principal Data Scientist, Suyati Technologies
Analytics Manager, Metro Trading Company
Data Science Manager, EY
Kubeflow
TensorFlow
AWS (Amazon Web Services)
Azure
Google Cloud Platform
Docker
Kubernetes
Terraform
CI/CD pipelines
Jenkins
Python
pandas
scikit-learn
Keras
PyTorch
Snowflake
BigQuery
Redshift
Informatica
MySQL
SQL
SSIS
PySpark
QuickSight
Looker
Power BI
Tableau
Spotfire
UiPath
Blue Prism
VBA
Word
Outlook
VB.NET
Node.js
HTML
CSS
So a bit about myself. I work in the capacity of Chief Architect for 66 Degrees, and I've been with the company for a little over three years now. I handle most of the engineering as well as AI opportunities for the company, so that means stakeholder interactions, setting up workshops, speaking to clients, and getting a hold of what the requirement on the client side is. It's consulting, essentially: driving those consulting engagements forward, identifying risks, and setting up a proper transition pipeline for all of their data to come in so that we can process it and take it forward. It's a senior leadership role, so a lot of it is stakeholder management and trying to bring in more and more business. I'm also well versed with the architectural components because, across a total tenure of about 15 years, I've worked as a data scientist at different levels, and a majority of that has been towards building pipelines. That's arguably an integral part of data science, because the data pipelines matter: if you look at a solution as a whole, those two pieces coming together, that Venn diagram, is what AI enablement is really about, at least in my view. On team size, there are different levels of teammates. A principal is kind of like a manager; we have associates at different levels reporting into me. So it's mostly managing the team, driving practices, hiring and upskilling people, and figuring out ways to strategize and move forward. Those are some of the roles and responsibilities that come within the purview of this role. I find it challenging and also exciting with everything that's happening in generative AI, and I think that does justice to what I do.
Training any sort of model on a large-scale dataset, for me, typically starts with understanding the opportunities, or rather exploring the dataset itself. First and foremost, the most important thing is to get a general sense of the data. This can be done with exploratory data analysis, but even more important is getting in touch with the team you're working with, the front runners, the people who have actually worked with the data, because that's where a lot of the insights come from, and it helps in deciding what the different focus areas are. Standard EDA means understanding the distribution of the data and the missing values: should you impute or not, should you drop those rows, or maybe it makes sense to use something like cross-validation and create different sample sets to try with different models. I would also approach the problem by defining what it is we are trying to achieve, because a clear definition sets a path towards solving the problem. Once that's done, it makes sense to take that work to your stakeholders and communicate it so they have a clear understanding. We call that a toll gate; it's a ritual, if you like, where we walk through what we will provide as a service and check whether they are in agreement. After that you can train some sort of baseline model and compare different strategies for achieving the goal. A whole lot of integration tests also matter, because if you start developing these pieces in parallel, then later, even when you get to hyperparameter tuning, which is a more complex area, you don't have everything on your plate at once; you've parallelized, or sensibly serialized, the work. I would then look at defining metrics and evaluating how the model does. For classification it's things like precision, recall, and the confusion matrix; for regression-based problems it's quite different, and for something like object detection it's different again. So defining those metrics makes sense, and then you slowly take it forward. Once that benchmark is set, the most important thing is to see whether improved versions of the model are possible, and this is probably also where I would start to devise a path towards a production-ready AI system.
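To make that "define the metrics for the problem type" step concrete, here is a minimal sketch for a classification baseline; the synthetic dataset and the logistic-regression baseline are placeholders for the real data and model:

```python
# Sketch only: synthetic data and a simple baseline to illustrate
# precision/recall and the confusion matrix for a classification problem.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

baseline = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)  # baseline model
preds = baseline.predict(X_te)

print(confusion_matrix(y_te, preds))
print(classification_report(y_te, preds))  # precision, recall, F1 per class
```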
A production-ready path makes sense because you want a pipeline that is compatible with retraining models, detecting things like data drift, and storing the resulting artifacts somewhere sensible; that's the MLOps side of it. You can use something like MLflow for tracking, and you can build the pipelines with Airflow or, if not, stick to something like Kubeflow pipelines. A whole lot of those options work, but I think that kind of sums it up.
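A minimal sketch, assuming MLflow is the tracking and artifact store mentioned above; the experiment name and the toy model are placeholders a retraining pipeline would replace:

```python
# Sketch only: log a baseline model plus its metrics to MLflow so a
# retraining pipeline can later compare candidates against it.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

mlflow.set_experiment("baseline-experiments")   # assumed experiment name
with mlflow.start_run():
    mlflow.log_param("model_type", "logistic_regression")
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, "model")    # artifact the pipeline can pick up
```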
A loss function is an important part of training any sort of model, and what counts as an efficient loss function varies quite a bit with the problem. For instance, if you're trying to separate two classes and decide which one is right, you can do that with something like binary cross-entropy, which works off a probabilistic measure: it takes your inputs and channels them through a probability, where a higher value points towards one class and a lower value points towards the other. If you have a whole lot of classes, say you're trying to identify a bunch of colours, then a softmax with categorical cross-entropy makes more sense, because you're distributing the whole of that probability across the different classes you have, and the sum adds to one. That's how I'd frame the choice of loss function, though there is a definite set of rules behind it. Technically, what happens is this: you have your dataset, maybe a bunch of rows and columns, and you need to convert it into a format a machine learning model can work with. A machine learning model is just a bunch of layers: linear layers, and convolutional layers if you're dealing with images (convolutional layers have also shown some evidence of working on sequential data). Each of those layers is doing something like a linear transformation, but a linear transformation alone isn't enough, because you also need to traverse the space of non-linearity, and that's where an activation function comes in. The activation function takes the product of your weights and biases and takes it into a different dimension altogether; that's the non-linearity. Once you've traversed all the layers, the loss function is the deciding factor: it takes the accumulated outputs and decides how you would want to readjust the weights. So if I'm predicting a dog as a cat, the loss function is what tells me I predicted a cat and that it's wrong.
Then I need to go back, re-evaluate all of my layers, and readjust those weights, and that's done using something called backpropagation, which is basically partial differentiation of all of the different functions I talked about: w·x + b is typically what's happening in each of the layers, plus you have the activations on top of it. That's what typically happens. Now, coming to loss functions, there are many of them. You have something called triplet loss, which is particularly exciting when you're doing something like Siamese networks, and for regression-style outputs you have squared residuals (mean squared error), which also makes sense.
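To make the two choices above concrete, a small PyTorch sketch; the shapes and values are arbitrary and only illustrate binary versus categorical cross-entropy:

```python
import torch
import torch.nn as nn

# Two-class case: one logit per example, binary cross-entropy on top.
logits_binary = torch.randn(4)                    # raw scores from the model
targets_binary = torch.tensor([0., 1., 1., 0.])
bce = nn.BCEWithLogitsLoss()(logits_binary, targets_binary)

# Many-class case (e.g. 5 colours): softmax + categorical cross-entropy.
logits_multi = torch.randn(4, 5)                  # one score per class
targets_multi = torch.tensor([0, 3, 1, 4])
ce = nn.CrossEntropyLoss()(logits_multi, targets_multi)  # applies softmax internally

print(bce.item(), ce.item())
```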
When refining an architecture, it mostly starts with what it is you're trying to do. When you look at the output of a neural network, or its training metrics, you can track it as a loss or as accuracy, plot it on something like a precision-recall curve, and try to make sense of it. This is one of those areas where you need to understand that a lot of parameters go into training the whole network. Playing around with something as simple as the batch size makes a lot of sense: you have a lot of data coming in, you split it into batches of a particular size, each batch passes through the network as a whole, the network tries to learn patterns from it and co-adapts in a way, and different batch sizes behave differently. You can also do a whole bunch of this with hyperparameter tuning: a randomized search or a grid search, or a Bayesian-style tuning algorithm where you throw in all of these parameters, run several iterations, and let it figure out the best-fitting configuration. How do you balance complexity? Complexity is a strange area. The safest bet is to read the papers, start from the architectures that are known to work best, and go from there, because if you start at a completely different end and try to build the whole system out, it might not work. Personally, I've tried to design my own networks and played around with things like skip connections, and it has never really worked out. So the balance typically depends on what you're trying to achieve, and it's basically optimization: you have resources, one of which is time, and you still need to get something done within that limit. Looking at it through something like an impact analysis makes sense: define what you're trying to achieve, look at the impact of introducing the extra complexity, mitigate that risk, and move forward. Those are some of the things I try. Performance, again, is just a measure of how your model does, which comes through the loss function, and it's a different loss function for a different problem. It also depends on getting the actual problem statement defined, and only then does performance come into the picture. So complexity and performance are tightly linked to one another; it's still optimization at the end of it.
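A sketch of the randomized-search idea, assuming a scikit-learn estimator; the random-forest parameter grid is purely illustrative, not a recommendation:

```python
# Sketch only: randomized hyperparameter search over an illustrative grid.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=0)

param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 5, 10],
}

search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_distributions, n_iter=10, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```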
How would I implement a continuous retraining mechanism? This is a complex problem in itself. For the sake of argument, say I have an NLP model that takes in a bunch of descriptions and predicts the sentiment: positive or negative, so essentially two classes, though there could also be a neutral sentiment, which would make it three. Typically it starts by looking at the dataset and understanding it better. The reason this matters is that if you're training on, say, two months' worth of data, it also makes sense to explore the bounds: what did the data look like at t minus three, when it all started? Is there a potential data drift happening? It's worth taking into account, because maybe in the first month the data sanity and cleaning procedures weren't in place and the data was too biased. So it makes sense to look at the month-on-month deviation in the data and use that as a kind of datum for training the model. Once you've decided that, say, two months' worth of data is the right amount to train on, then incrementally, once a new month's worth of data is available, we would trigger the retraining mechanism. The interval at which you kick off training makes a whole lot of difference; I've seen companies retrain on the data every other day, which in some cases makes sense, but in others is just a waste of compute, where you deploy a whole lot of things and realize halfway down the line that it wasn't worth it. Once you've defined that frequency, that's when it makes sense to set up the pipeline, and the pipeline can be as simple as devising a safe landing zone, which is where your data gets accumulated. Then you define an orchestration layer, Airflow would make sense, where, once the data has landed, you kick off an event that takes all of the data, does some EDA, projects those metrics onto something like an artifact store that can be connected to a dashboard so you can visualize it, and then pushes it on to the retraining pipeline. That pipeline would run the data through either one model or several candidate models, which makes it a kind of tandem selection. Once that's done, you take one metric, for instance precision/recall, compare it with what the currently deployed model achieves, and if the new model is better you can push it into production.
If not, you store the results somewhere and send out a notification, using something like a Pub/Sub queue, to the engineers saying that a new model was trained but it isn't up to the mark compared with the current production pipeline, so the process has decided to continue with the model that's currently deployed. That's probably how I would design it, though in practice it's more complex than that.
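A minimal sketch of that compare-and-promote step as an Airflow task; the DAG id, schedule, hard-coded metric values, and the print-based promotion actions are placeholders, and a real pipeline would pull metrics from the artifact store and publish to Pub/Sub instead:

```python
# Sketch only: a monthly retraining DAG whose final task compares the
# challenger model against the deployed champion and decides what to do.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def compare_and_promote(**context):
    champion_f1 = 0.82    # assumed: read from the metrics/artifact store
    challenger_f1 = 0.85  # assumed: produced by the upstream retraining task
    if challenger_f1 > champion_f1:
        print("Promoting challenger to production")   # e.g. update the serving endpoint
    else:
        print("Keeping champion; notifying engineers")  # e.g. publish to a Pub/Sub topic

with DAG(
    dag_id="monthly_sentiment_retraining",
    start_date=datetime(2024, 1, 1),
    schedule="@monthly",   # older Airflow 2.x uses schedule_interval instead
    catchup=False,
) as dag:
    promote = PythonOperator(
        task_id="compare_and_promote",
        python_callable=compare_and_promote,
    )
```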
I'd say a transformer model is much better. Why is it better? A CNN or something of that sort looks at data in a very sequential, local manner, and sequential makes sense for language, because language has always been sequential: the word you're about to utter depends on the prior words, and it's also somewhat dependent on the words that come after. Transformers make more sense because, while acknowledging that language is sequential, they look at it in a way where there is some sort of relationship between each and every word, and they work that out over the entire sentence. For instance, take a sentence like "I went to the bank to deposit a bit of money." A transformer takes the whole sentence and passes it through something like a byte-pair tokenizer, which splits every word up into a concatenation of numbers; those numbers come from a dictionary of key-value pairs, so in a sense I'm converting my text into a representative numeric format. Once that's done, it passes through an embedding layer; embeddings convert these tokens into vectors where related words sit close together, and there are different types, GloVe embeddings among others. After that comes positional encoding, and the reason it's needed is because the step that follows is self-attention (multi-head self-attention layers, followed by a feed-forward network). Self-attention takes all of your words and effectively does something like a cross product of every word with every other word, and the reason that makes sense is that at the end of the step you get a matrix of correlation, you could call it a covariance-like score, of each word with every other word. In doing so you lose the position of each word: in my sentence, "I went to the bank to deposit some money," the word "bank" sitting at the third or fourth position gets rearranged and the order is lost. But language is very sequential, and you need the model to piece things back together in order, which is why positional encoding exists; it uses something quite clever, sine and cosine functions, whose oscillatory pattern encodes what each position is. Then it passes through the multi-head attention and the feed-forward network, which produces the prediction at the end of it. So why does that make sense?
Because you get that relation of each word with every other word, which isn't really possible with a CNN, since it looks at things in a very sequential, local order, and the evidence shows that works less well for this.
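Since positional encoding does a lot of the work in that explanation, here is a small NumPy sketch of the sine/cosine scheme from "Attention Is All You Need"; the sequence length and model dimension are arbitrary:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sine/cosine positional encodings added to token embeddings."""
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    dims = np.arange(d_model)[None, :]               # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                 # (seq_len, d_model)
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angles[:, 0::2])           # even dimensions: sine
    enc[:, 1::2] = np.cos(angles[:, 1::2])           # odd dimensions: cosine
    return enc

# These encodings restore order information to the otherwise order-agnostic
# self-attention layers.
pe = sinusoidal_positional_encoding(seq_len=8, d_model=16)
print(pe.shape)  # (8, 16)
```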
I would think the problem here is with how the self-attention layer is used; it does start with self-attention, which makes sense. But one of the things that's probably missing is the embedding layer, because you typically need embeddings that learn; an embedding layer is a layer that learns, which means it needs to be part of the transformer block. If that isn't the case, if you're still using pretrained embeddings and giving no weightage to the embedding layer, then the weights of the embedding never change; they stay constant while the rest of the layers still train. So that's one potential issue. The other is that the self-attention here takes in three parameters. That could also be a problem, because the feed-forward network is going to take some vector representation of the words once they've passed through the embedding layer. It's a bit confusing, but I feel the self-attention only needs to take in x a single time rather than three separate times, and typically the self-attention output is supposed to pass through the feed-forward network; once that's computed, you only need to look at the CLS token to identify what the output is. So those are the two problems I've identified. On the embedding layer, there are two ways of doing this: you can decide not to change your embeddings at all and use pretrained embeddings, in which case this block makes sense and you're only training the rest of the block while leaving the embeddings out of it. The second problem was the self-attention input: x is just a vector, and passing it in once is enough. Within the self-attention you can define things like multi-head attention and concatenate the results of the heads. It isn't really parallelized in the complexity sense; it's quadratic by nature, so the complexity still exists, and it makes no sense to pass x in three different times to the self-attention. I think that's it.
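The block being reviewed isn't reproduced here, so this is only a generic PyTorch sketch of the shape described: a learnable embedding kept inside the block (the first gap called out above), with the single input x reused as query, key, and value, which is how frameworks typically express passing one x into self-attention:

```python
import torch
import torch.nn as nn

# Sketch only: dimensions and vocabulary size are arbitrary.
class TinyTransformerBlock(nn.Module):
    def __init__(self, vocab_size=1000, d_model=64, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)   # trained with the block
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model),
                                nn.ReLU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, token_ids):
        x = self.embed(token_ids)                        # (batch, seq, d_model)
        attn_out, _ = self.attn(x, x, x)                 # self-attention: Q = K = V = x
        x = self.norm1(x + attn_out)
        x = self.norm2(x + self.ff(x))                   # feed-forward after attention
        return x

block = TinyTransformerBlock()
out = block(torch.randint(0, 1000, (2, 10)))
print(out.shape)  # torch.Size([2, 10, 64])
```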
For a generative AI model, for any sort of text generation, it makes little sense to use something like a reduce-mean style loss, because you're typically trying to figure out what the next word is going to be. Transformer models are basically trained on two kinds of problems: one is to identify the next word in the sequence, the other is to guess the missing or masked tokens. For those two tasks a reduce-mean loss makes little sense; even if taking a mean were an important factor, how exactly would you determine what that mean is? You can look at vectors and the mean of vectors to see how close one vector is to another, but in a transformer setting that isn't the task. Your task is not to say that one word is exactly the same as another; in fact, the first word might be completely different from the next word to be predicted, yet they're still related to one another, and that can only be captured by predicting what the next word is. There are a lot of ways to do this, but softmax with something like negative sampling is one of them: instead of computing the loss over all of the words in your vocabulary, you sample out just a few negative examples alongside the one positive example and define the loss over those. So categorical cross-entropy makes more sense when actually training something like a generative AI model. Fine-tuning is a different problem again; there you might do something like multi-label classification, for instance asking whether the first word within a particular sentence is a noun or some other part of speech, and determining that exactly. So when training a generative model, using a reduce-mean loss, which is essentially doing something like squared residuals to figure out how close one word or vector is to another, doesn't actually make sense, because the objective isn't to compute a loss over how similar two words are or to reduce the distance between them. It's to capture something like a co-occurrence score, a relationship between two completely different words. So categorical cross-entropy is what fits the bill.
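A short sketch of the point about next-token prediction: the model emits a score for every word in the vocabulary at every position, and categorical cross-entropy is computed against the actual next token. The vocabulary size, batch shape, and random tensors are placeholders:

```python
import torch
import torch.nn as nn

# Sketch only: random logits stand in for a language model's output.
vocab_size, batch, seq_len = 50_000, 2, 8
logits = torch.randn(batch, seq_len, vocab_size)          # one score per vocab entry per position
targets = torch.randint(0, vocab_size, (batch, seq_len))  # the "next word" at each position

loss_fn = nn.CrossEntropyLoss()                           # softmax + negative log-likelihood
loss = loss_fn(logits.view(-1, vocab_size), targets.view(-1))
print(loss.item())
```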
Designing something like a scalable generative AI system typically starts with accumulating a whole lot of data, and with defining what sort of problem you're trying to solve with this generative model. If it's a general-purpose problem and you just want a model capable of generating large amounts of text, it makes sense to scrape much of the Internet. But if it's very specific, for instance if you only want to look at medical journals so the model learns about medical disorders, it will be quite different. Even then, I would still balance the dataset so there's a good mix between what's medically accurate and what's general, because at the end of the day I want my model to understand language as a construct, not just language as whatever sits in something like an encyclopedia. That's how I would probably start. For training the model, it typically starts with picking a transformer architecture that has worked well, and only introducing changes if they're really necessary, because a change can go a long way: you deviate from what works and get into an experimental rabbit hole that's hard to climb out of. If the goal is quick to deployment, quick to production, what makes the most sense is to train the model using the stable, accepted principles out there, based on the white papers and research material. Once you kick off training, a generative AI model is quite expensive, so it also makes sense to figure out the budget for the training: set up your GPUs and CPU time, decide how long you want the training to run, and work out exactly how much GPU compute you can allocate for it. Before pushing the whole lot into production, it makes sense to run a preliminary test, a pre-flight check of sorts, where you take not the whole dataset but just enough to show evidence that the model architecture is working: take a small subset of the data, pass it through the model, look at the loss and accuracy coming out, and see whether it's slowly learning something or whether it's stuck in some sort of loop. Once all of those checks pass, that's when you move to the full training run and deployment. You'd also define checkpoints so that, at every point, if something goes wrong you can fall back to an earlier state, kind of like a fail-safe measure. And define metrics and put them on something like TensorBoard so they keep updating every epoch; that way you can track how quickly things are progressing.
Those are some of the things I would put in place in order to do that.
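A sketch of that preliminary test, overfitting a tiny subset to confirm the loss actually goes down before committing the full GPU budget; the tiny MLP, random tensors, and checkpoint paths are placeholders for the real architecture and data:

```python
import torch
import torch.nn as nn

# Sketch only: sanity-check the training loop on a small sample and
# checkpoint along the way as a fail-safe.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x_small = torch.randn(64, 128)                 # small sample standing in for real data
y_small = torch.randint(0, 10, (64,))

for step in range(200):
    opt.zero_grad()
    loss = loss_fn(model(x_small), y_small)
    loss.backward()
    opt.step()
    if step % 50 == 0:
        print(step, loss.item())               # should trend downward if the pipeline is healthy
        torch.save({"step": step, "model": model.state_dict()},
                   f"ckpt_{step}.pt")           # fail-safe checkpoint
```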
In a multi-project environment, the performance of a generative model depends on the use case: am I going to use one particular generative AI model across different teams, or should it be curated to fit each team's demand? If it's just one, deploying it shouldn't be much of a problem, because I just need to define an endpoint and figure out the number of instances that should be available to users, which means I can stop it from throttling and ensure people have access to it. Measuring performance is the trickier problem, because it depends on what the model was trained on. If it's a generative AI model trained on the whole of the Internet, it may make sense for teams to define their own metrics and something like a safety net, because I've seen generative AI models hallucinate and give you a bunch of nonsense, so it makes sense to look at it from the perspective of how a model should behave, the polarity of the model, if you like. Performance can be measured on different scales. One is availability, which I'd tackle by understanding how many users will be accessing it and putting it behind something like an auto-scaling Kubernetes endpoint; that probably makes sense because I would also need GPUs attached to it at some point, otherwise when people try to access the model it will take a whole lot of time to respond. So performance can be tied to that, and the best way to handle it is to set up a Kubernetes cluster with GPUs attached so that it all works in tandem; that's probably the efficient way of doing it. Serverless might also work if you're looking to downsize the model, but attaching a GPU to serverless isn't the most convenient way of doing things. Consistency is a slippery slope. If my generative AI model is continuously retrained, consistency becomes a real problem, because there might be teams who designed their endpoints or code blocks around version 1 of the model; the prompt could be something like "generate the responses in JSON format," and it would have worked. If I now retrain the model into a different version on a whole new dataset, a lot of those weights get replaced. I'd argue it's possible for the model to retain some sort of eventual consistency, but not a whole lot. So it would be safer to figure out which of these teams require that sort of consistency and then devise versions of the model pinned to those tasks. Other than that, that's roughly how I would scale it and provide for it.
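A rough sketch of the "pin each team to a model version" idea; FastAPI, the registry dictionary, and the stub models here are all assumptions for illustration, standing in for a real model registry and inference backend:

```python
# Sketch only: a versioned inference endpoint so retraining doesn't silently
# break teams that depend on an older model's behaviour.
from fastapi import FastAPI, HTTPException

app = FastAPI()

MODEL_REGISTRY = {   # assumed: older versions are kept alongside the newest one
    "v1": lambda prompt: f"[v1 output for] {prompt}",
    "v2": lambda prompt: f"[v2 output for] {prompt}",
}

@app.post("/generate/{version}")
def generate(version: str, prompt: str):
    model = MODEL_REGISTRY.get(version)
    if model is None:
        raise HTTPException(status_code=404, detail="unknown model version")
    return {"version": version, "completion": model(prompt)}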
On the most appropriate Hugging Face pretrained model: I'm not quite sure of the name, since I've mostly worked with Google's models and a couple of things on the AWS side. Mistral was one of the models launched recently, and there was also a two-billion-parameter one; it's an evolving discipline, with new versions of models coming out every day, so I don't recall exactly which is the latest or best. For a chatbot, it makes sense to pick a model that's been trained on a broad sweep of Internet data but is much smaller in size; I think a version-2 small model would make the most sense. Why would I choose that? Typically, because when I'm building a chatbot I'm designing it for end users who could be from different demographics, so bias should not exist in my data. That's one reason to choose a model trained on a lot of unbiased data: I could have people typing in full sentences, and I could also have people using short forms to get through to things, and my model should be able to understand the context being specified and respond to it. A chatbot on its own might manage that, but to improve it I would pair it with something like LangChain so I could set up a RAG kind of system, retrieval-augmented generation. That makes more sense because then I'd have an added block for fact-checking my information against something like a grounded dataset, rather than relying only on the model's raw performance, hallucinations and all. So retrieval-augmented generation blocks make more sense there.
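A hedged sketch of standing up a small pretrained model from the Hugging Face hub for a chatbot; the model id "gpt2" is just a stand-in for whichever compact model is actually chosen, and the prompt format is illustrative:

```python
# Sketch only: load a small hub model and generate a chatbot-style reply.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")   # placeholder model id
reply = generator("User: How do I reset my password?\nAssistant:",
                  max_new_tokens=40, do_sample=True)
print(reply[0]["generated_text"])
```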
On an approach to fine-tuning a GPT-2 model specifically for a client: domain-specific fine-tuning is definitely possible. It makes sense to accumulate all of your data, look at it, and define something like key-value pairs. That can be done in different ways: you could provide a whole passage as something called a context, and then a question and an answer drawn from within that context. So you curate the dataset so it has a chunk of text, a question, and the specific answer to look for within that text; that's how I would tailor the dataset. Once that's done, you pass it to the model and set up training for a number of epochs over a bunch of hours. Since it's an expensive model to train, it makes sense to do this on a cloud service provider with GPUs attached, which means you'll be paying for node hours; the exact node hours depend on the provider, and they have documentation on what works for what size, so figuring out that balance matters. For fine-tuning a GPT-2 model you don't really need to disassemble the entire architecture by freezing arbitrary bunches of layers; there are reference implementations, code blocks that have shown evidence of working, so the best bet is to use one of those recipes that has shown a stable curve and continue your training from there. Because it's an expensive model, if you just experiment with it, it might not gel well with the particular problem you're trying to solve and could go sideways, which is definitely not what you want when training a model for a particular client. So, yes, those are the steps I'd take to fine-tune a GPT-2 model: rather than experimenting with it, lean towards stability, exploitation over exploration.
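A minimal, hedged sketch of domain fine-tuning GPT-2 with the Hugging Face Trainer; the "client_texts" list, the hyperparameters, and the output directory are placeholders, and a real run would use the curated context/question/answer dataset described above:

```python
# Sketch only: continue GPT-2's language-modelling training on client text.
import torch
from transformers import (GPT2LMHeadModel, GPT2TokenizerFast,
                          Trainer, TrainingArguments)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token            # GPT-2 has no pad token by default
model = GPT2LMHeadModel.from_pretrained("gpt2")

client_texts = ["Example domain document 1.", "Example domain document 2."]  # placeholder data
enc = tokenizer(client_texts, truncation=True, padding=True, return_tensors="pt")

class TextDataset(torch.utils.data.Dataset):
    def __len__(self):
        return enc["input_ids"].size(0)
    def __getitem__(self, i):
        ids = enc["input_ids"][i]
        # For brevity, pad tokens are not masked out of the labels here.
        return {"input_ids": ids,
                "attention_mask": enc["attention_mask"][i],
                "labels": ids}

args = TrainingArguments(output_dir="gpt2-client",   # placeholder output path
                         num_train_epochs=1,
                         per_device_train_batch_size=2)
Trainer(model=model, args=args, train_dataset=TextDataset()).train()
```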