
Amith KA is a seasoned professional with 14.5 years of experience in leadership and management, with a skill set spanning data engineering, data science, and systems design. Proficient in engineering best practices, data architecture, and cloud platforms such as GCP, AWS, and Azure, Amith excels in full-stack development and uses tools like Terraform for infrastructure management. With proficiency in Python, TensorFlow, PyTorch, and BigQuery, Amith specializes in machine learning, deep learning, MLOps, and Kubernetes. Expertise in NLP, computer vision, and generative AI, along with a strong background in security and Docker, further underscores Amith's impact in the industry.
Head of AI & Engineering, Chryselys
Chief Architect, 66 Degrees
Chief Data Scientist, Suyat! Technologies
Analytics Manager, Metro Trading Company
COE Leader AI/ML, EY
Kubeflow

TensorFlow
AWS (Amazon Web Services)
Azure

Google Cloud Platform
Docker

Kubernetes

Google Cloud Platform (GCP)

AWS

Terraform

CI/CD pipelines
Jenkins

Python

pandas

scikit-learn

Keras

PyTorch
Snowflake

BigQuery

Redshift
Informatica

MySQL

SQL

SSIS
PySpark

QuickSight

Looker

Power BI

Tableau

Spotfire

UiPath

Blue Prism

VBA
Word

Outlook

VB.NET
Node.js

HTML

CSS
So, a bit about myself. I work as the chief architect at 66 Degrees and have been with the company for a little over three years. I handle most of the engineering and AI opportunities for the company. That involves stakeholder interactions: setting up workshops, understanding what the requirements are on the client side, and consulting. I drive those consulting efforts forward, identify risks, and set up a proper transition pipeline for all the client's data to come in so we can process it and take it forward. It's a senior leadership role, so a lot of it is stakeholder management and bringing in more business. I'm also well versed in the architectural components because, over a career of about 15 years, I've worked as a data scientist at different levels, which included building pipelines. Data pipelines are a crucial part of data science. If you look at a solution as a whole, those two pieces, the pipelines and the data science, overlap like a Venn diagram, and that overlap is what AI enablement is all about. I personally believe that's the bigger picture, technically. Team size and levels vary, but I'd say a principal is like a manager. We have associates at different levels reporting to me, so it's mostly managing the team, driving practices, hiring and upskilling people, and strategizing how to move forward. Those are some of the responsibilities of this role, and I find it challenging but also exciting with everything happening in generative AI. And, yes, I think that does justice to what I preach.
Large-scale dataset training typically starts with exploring the dataset. First and foremost, the most important thing is to get a general sense of the data. This can be done with exploratory data analysis, but more importantly, it's crucial to get in touch with the team you're working with. These could be the front runners, people who have actually worked with the data, and that's where the insights come from. EDA is standard: understanding the data distribution and the missing values. Should you impute or not? Should you drop those records? It may also make sense to use something like cross-validation, creating different sample sets to try out with your different models. I approach the problem by first defining what we are trying to achieve, because a clear definition sets the path toward a solution. Once that's done, it makes sense to take that work to your stakeholders and communicate it so they have a clear understanding. We call that a toll gate, kind of like a ritual, where we check what we are providing them as a service and whether they are in agreement with it. It's a checkpoint of sorts. Once that's done, you can train a baseline model and compare it against different strategies. Integration tests matter too, and it helps to start developing them in parallel, so that what's left later on is just hyperparameter tuning, which is a more complex area.
If you parallelize a lot of these things early, or serialize them in a sensible order, you don't have a whole lot on your plate as you go forward. Next, I would define metrics and evaluate how the model does. For classification, you have precision, recall, and the confusion matrix. For regression-based problems, the metrics are quite different, and for something like object detection they are different again. So define those metrics and then take it forward. Once that benchmark is set, the most important thing is to see whether improved versions of the model are possible. This is also where I would start devising a path to a production-ready AI system: a pipeline that supports retraining models, detects things like data drift, and stores the resulting artifacts. You can use something like MLflow for this, or create pipelines using Airflow. If not, you can stick to something like Kubeflow pipelines. But, yeah, I think that kind of sums it up.
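As a rough illustration of the classification metrics mentioned in this answer, the sketch below computes precision and recall from confusion-matrix counts and compares a trivial baseline against a candidate model. The function names and the toy labels are assumptions made for the example, not part of the answer.

```python
from collections import Counter

def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            counts["tp"] += 1
        elif pred == positive:
            counts["fp"] += 1
        elif truth == positive:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts

def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall); each is 0.0 when its denominator is empty."""
    c = confusion_counts(y_true, y_pred, positive)
    predicted_pos = c["tp"] + c["fp"]
    actual_pos = c["tp"] + c["fn"]
    precision = c["tp"] / predicted_pos if predicted_pos else 0.0
    recall = c["tp"] / actual_pos if actual_pos else 0.0
    return precision, recall

# Toy labels (hypothetical): a baseline that always predicts the positive
# class versus a candidate model that makes one error of each kind.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
baseline = [1, 1, 1, 1, 1, 1, 1, 1]
candidate = [1, 0, 1, 0, 0, 1, 1, 0]

baseline_scores = precision_recall(y_true, baseline)
candidate_scores = precision_recall(y_true, candidate)
```

The always-positive baseline gets perfect recall but poor precision, which is why a single metric is rarely enough to set the benchmark.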
A loss function is an important part of training a model, and it varies with the problem you're trying to solve. For instance, if you're distinguishing between two classes, you can use binary cross-entropy loss, which is a probabilistic measure. The model channels the inputs into a probability: a value near 1 points toward one class, and a value near 0 points toward the other. If you have many classes, say you're trying to identify one of several colors, a softmax function makes more sense, because you take the entirety of that probability and distribute it across the classes so that it sums to 1. That's how I would define a loss function, but there is a definite set of rules for which one makes sense. Before the loss comes into play, the dataset, which is a bunch of columns and rows, has to be converted into a machine-readable format that a machine learning model can work with. When I say machine learning model, I mean a bunch of layers, like linear layers, or convolutional layers if you're dealing with images. You can also use convolutional layers on sequential data; there is some evidence that this works. All of these layers apply a linear transformation. But a linear transformation alone isn't enough, because you also need to capture nonlinearity, and that's where you bring in an activation function. An activation function takes the product of the inputs and weights, plus the biases, and maps it through a nonlinear function. So that's nonlinearity.
Once the input has traversed the different layers, the loss function is particularly important because it's the deciding factor. It compares the prediction, produced with all of the accumulated weights across the layers, against the true label, and it decides how you want to readjust those weights. If I predict a dog as a cat, the loss function is what tells me that the prediction was wrong. So I need to go back through all of my layers and readjust the weights, and that's done using backpropagation, which is basically partial differentiation, via the chain rule, of all the functions I talked about: the linear transformations in each layer plus the activations on top of them. Now, there are a lot of different loss functions. You have triplet loss, which is particularly exciting when you're working with something like Siamese networks. You also have squared-residual losses, which also make sense.
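The losses named in this answer can be written out in a few lines. This is a minimal sketch in plain Python, assuming probabilities and logits are given as lists; the function names and the `margin` default are illustrative choices, not taken from the answer.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean binary cross-entropy; p_pred holds predicted P(class = 1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(true_index, logits):
    """Negative log-probability that softmax assigns to the true class."""
    return -math.log(softmax(logits)[true_index])

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge on (anchor-positive distance) minus (anchor-negative distance)."""
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return max(dist(anchor, positive) - dist(anchor, negative) + margin, 0.0)
```

In each case a confident wrong prediction produces a larger loss than a confident right one, and that difference is the signal backpropagation uses to readjust the weights.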
When refining an architecture, it mostly starts with what you're trying to do. When you look at the output of a neural network, or at the training metrics, you can track the loss, or you can measure accuracy. You can also plot a precision-recall curve to make sense of it. This is one of those areas where you need to understand that a lot of parameters go into training the network. Playing around with something as simple as the batch size can make a lot of difference. With a lot of data coming in, you split it into batches of a particular size. The entire batch passes through the network together, and the network learns patterns from it and co-adapts in a way. Different batch sizes are worth trying, but you can also do all of this systematically with hyperparameter tuning. You can use a grid search or a randomized search. There is also Bayesian tuning, where you hand over the parameter ranges and let the tuner run several iterations to figure out the best-fitting network. And how do you balance complexity? Complexity is a strange area. The safest bet is to read the papers, find the architectures that work best, and start there. If you start from a completely different end and try to build the whole thing yourself, it might not work out. Personally, I've tried designing my own networks, and it hasn't worked out. I've played around with things like skip connections, and it never really worked out for me.
So the balance depends on what you're trying to achieve, and it's basically an optimization problem. You have limited resources, which could be time, and you still need to get something done within that limit. Submitting something like an impact analysis makes sense here: you define what you're trying to achieve, look at the impact of introducing the extra complexity, mitigate that risk, and move forward. At least those are some of the things that I try. Performance, again, is just a measure of how your model performs, which can be tracked through the loss function, and different problems have different loss functions. It also depends on getting the actual problem statement defined; only then does the concept of performance make sense. So I would say complexity and performance are tightly linked to one another. It's still optimization at the end of it. So yeah.
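A randomized search of the kind mentioned in this answer can be sketched in a few lines. The search space and `toy_validation_loss` below are hypothetical stand-ins; in practice the objective would be a real train-and-validate run, which is where the time and compute budget comes in.

```python
import random

# Hypothetical search space; the values are illustrative only.
SEARCH_SPACE = {
    "batch_size": [16, 32, 64, 128],
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3],
    "hidden_units": [64, 128, 256],
}

def toy_validation_loss(config):
    """Stand-in for training a model and measuring validation loss.

    By construction it is lowest at batch_size=64, learning_rate=1e-3,
    hidden_units=128.
    """
    return (abs(config["batch_size"] - 64) / 64
            + abs(config["learning_rate"] - 1e-3) * 100
            + abs(config["hidden_units"] - 128) / 128)

def random_search(space, objective, n_trials, seed=0):
    """Sample n_trials random configurations and keep the best one."""
    rng = random.Random(seed)
    best_config, best_loss = None, float("inf")
    for _ in range(n_trials):
        config = {name: rng.choice(values) for name, values in space.items()}
        loss = objective(config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss

best_config, best_loss = random_search(SEARCH_SPACE, toy_validation_loss, n_trials=20)
```

The `n_trials` argument is the knob that trades search quality against budget: with a fixed seed, more trials can only match or improve the best loss found.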
So how would you implement a continuous retraining mechanism? This is a complex problem in itself. For the sake of argument, let's assume we have an NLP model. It takes in a bunch of descriptions and predicts the sentiment of each: is it positive or negative? That means two classes, or three if there is also a neutral sentiment. It typically starts with looking at your dataset and understanding more about it. If you're training on about two months' worth of data, it also makes sense to explore the bounds around that window, the upper bound and the lower bound. What did the data look like at t minus 3, when it all started? Is there potential data drift? That's worth taking into account, because maybe in the first month the data sanity checks and cleaning procedures weren't in place and the data was too biased. So it makes sense to look at the month-on-month deviation in the data and use that as a reference point when you start training your model. Once the model has been trained on two months' worth of data, we go on incrementally: once a new month's worth of data is available, we trigger the retraining mechanism. The interval at which you kick off training makes a whole lot of difference. I've seen companies retrain every other day. In some cases that makes sense.
In other cases, it's just a waste of compute: you deploy a whole lot of things and then realize halfway down the line that it makes little sense. Once you've defined that frequency, that's when it makes sense to set up the pipeline. The pipeline could start with something as simple as a safe landing zone, which is where your data accumulates. Then you define an orchestration layer; Airflow would make sense. Once the data has accumulated, you kick off an event that takes all of the data, runs some EDA, and writes those metrics to an artifact store that can be connected to a dashboard, so you can visualize all of this. After that, it pushes the data into the retraining pipeline, which runs it through either one model or several different models, which makes it a tandem selection problem. Once that's done, you take a metric, for instance precision and recall, and compare it with that of the currently deployed model. If the new model is better, you can push it into production. If not, you store the results somewhere and send a notification, using something like a Pub/Sub queue, to the engineers saying that a new model was trained but isn't up to the mark compared with the current production model, so the process has decided to continue with the model that's currently deployed.
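Two decision points in this retraining flow lend themselves to a small sketch: checking month-on-month deviation, and deciding whether a retrained model replaces the deployed one. The drift measure (mean shift in units of the reference standard deviation) and the `min_gain` threshold are assumptions chosen for illustration; a real pipeline would likely use richer statistics and its own promotion policy.

```python
from statistics import mean, pstdev

def drift_score(reference, current):
    """Shift of the current mean, in units of the reference standard deviation."""
    spread = pstdev(reference) or 1.0  # fall back to 1.0 for constant data
    return abs(mean(current) - mean(reference)) / spread

def decide_promotion(candidate_metric, production_metric, min_gain=0.01):
    """Promote only if the candidate beats production by at least min_gain."""
    if candidate_metric >= production_metric + min_gain:
        return "promote"
    return "keep_production_and_notify"

# Hypothetical monthly values of one numeric feature.
month_1 = [1.0, 2.0, 3.0, 4.0]
month_2 = [5.0, 6.0, 7.0, 8.0]

needs_review = drift_score(month_1, month_2) > 1.0
decision = decide_promotion(candidate_metric=0.85, production_metric=0.80)
```

The "keep_production_and_notify" branch is where the notification to the engineers would be published; the messaging call itself is left out because its details depend on the queue in use.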
A transformer model is much better. And why is it better? An RNN or something of that sort looks at data in a very sequential manner. Sequential processing makes sense because language has always been sequential: a word you're about to utter depends on the prior words, and it's also somewhat dependent on the words that come after. Transformers take that idea and set it aside. Although language is sequential, they look at it as a set of relationships between each and every word, and they do that by taking in the entire sentence at once. For instance, take the sentence "I went to the bank to deposit a little bit of money." The model takes the entire sentence and passes it through something like a byte-pair tokenizer, which splits each word into subword pieces and maps each piece to a number. These mappings are like dictionary key-value pairs, so in a sense I'm just converting my text into a numeric format. Once that's done, the tokens pass through an embedding layer, which converts them into vectors such that related words sit closer together. There are different types of embeddings; GloVe embeddings are one example. After that comes the next layer, which is called positional encoding, and the reason it's there is the step that comes after it: self-attention. There are multi-head self-attention layers, and after those there is a feed-forward network.
Self-attention takes all of your words and computes something like a dot product of each word with every other word. The reason this makes sense is that, at the end of this step, you get a matrix of scores, which you can loosely think of as a correlation or covariance score of each word with every other word. In doing so, you lose the position of each word. If I do this with my sentence, "I went to the bank to deposit some money," the word "bank," which sits around the fifth position, is treated the same wherever it appears, so the order is lost. But language is sequential, and the model still needs to piece things back together in the right order. That's why positional encoding uses something very intelligent, sine and cosine functions, whose oscillatory patterns give every position a distinct signature. The result then passes through multi-head attention and the feed-forward network, which produces the final prediction. So why does that make sense? Because you get the relation of each word with every other word, which is hard with an RNN because it looks at things in a strictly sequential order, and the evidence shows that works less well.
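To make the point about lost word order concrete, here is a minimal sketch of sinusoidal positional encoding and single-head scaled dot-product self-attention in plain Python. It assumes identity projections (queries, keys, and values are the input vectors themselves), which a real transformer replaces with learned weight matrices; the function names are illustrative.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def softmax(scores):
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product self-attention with identity projections.

    Returns (outputs, weights); weights[i][j] is how much token i attends
    to token j, and each row of weights sums to 1.
    """
    d = len(x[0])
    outputs, weights = [], []
    for query in x:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(d) for key in x]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w[j] * x[j][c] for j in range(len(x))) for c in range(d)])
    return outputs, weights

def add_positions(x):
    """Add the positional encoding to each token vector."""
    pe = positional_encoding(len(x), len(x[0]))
    return [[v + p for v, p in zip(tok, pos)] for tok, pos in zip(x, pe)]
```

Without `add_positions`, reordering the input tokens only reorders the outputs in the same way, so the attention step by itself cannot tell one word order from another; adding the encoding first gives each position its own signature.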
I would think that the code has a self-attention layer, which makes sense; it starts with self-attention. But one of the things that's missing is the embedding layer. You typically need embeddings that learn, which means the embedding layer needs to be part of the transformer block. If that isn't the case, you're still using something like a pre-trained embedding, and the weights of the embedding do not change; they stay constant while the rest of the layers still train. So I would say that's one potential issue. Also, self-attention here takes in three parameters, and that could be another problem, because the feed-forward network is going to take a vector representation of the words once they have passed through the embedding layer. It's a bit confusing, but I feel that self-attention only needs to take in x a single time rather than three times. Typically, the self-attention output is also supposed to pass through a feed-forward network, and once the feed-forward output has been computed, the model only needs to look at the CLS token to identify the output. So these are the two problems I've identified. The first is the embedding layer. There are two different architectures here: one where the embeddings are trained, and one where you decide not to change them and use a pre-trained embedding instead. If the latter was intended, then this block makes sense, because you're training only the rest of the block and leaving the embeddings out. The second problem is the self-attention input: x is just a vector.
I think passing one vector once is more than enough, rather than passing it three times. Within self-attention, you can define multi-head attention, and the results of all the heads can be concatenated. The computation isn't really parallelized away; it's quadratic by nature, so the complexity still exists. So it makes no sense to take in x three different times in self-attention. I think that's
For generative AI models, it typically makes no sense to use something like a mean reduction, because you are trying to figure out what the next word is going to be. Transformer models are basically trained on two kinds of problems. One is identifying the next word in the sequence. The other is guessing the missing words, or masked tokens. For these two tasks, a mean reduction makes less sense, because what kind of mean would you be computing? You can look at vectors and the mean of vectors to see how close one vector is to another. But in a transformer setting, that makes no sense, because the task is not to say that one word is exactly the same as another. In fact, the first word might be completely different from the next word to be predicted, and yet they're still related to one another. That can only be captured by predicting what the next word is. There are a lot of different ways of doing this, but softmax with negative sampling is one of them. Instead of computing the loss over all of the words in your vocabulary, you sample just a few negative examples alongside the one positive example and define the loss over those. So categorical cross-entropy makes more sense when training a generative AI model. Fine-tuning is a completely different problem, because then you might be doing something like multilabel classification, for example deciding whether the first word in a particular sentence is a noun or some other part of speech.
So when training a generative AI model, it makes no sense to use something like a mean reduction. With it, what you're basically doing is something like a squared residual, figuring out how close one word, or one vector, is to another. That does not make sense, because the end objective is not to compute a loss over how similar words are or to reduce the distance between them. What fits the context is something like a co-occurrence score, a relationship between two completely different words.
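The contrast between a full-vocabulary softmax loss and negative sampling can be sketched as follows. This is an illustrative version only: negatives are drawn uniformly at random, which is an assumption made for brevity (practical implementations usually sample by word frequency), and the logits stand in for the model's scores over a tiny vocabulary.

```python
import math
import random

def full_softmax_loss(logits, target):
    """Cross-entropy over the whole vocabulary for one next-token prediction."""
    shift = max(logits)
    log_z = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_z - logits[target]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def negative_sampling_loss(logits, target, num_negatives, rng):
    """Score the true token against a few sampled wrong tokens only."""
    candidates = [i for i in range(len(logits)) if i != target]
    negatives = rng.sample(candidates, num_negatives)
    loss = -math.log(sigmoid(logits[target]))   # pull the true token's score up
    for i in negatives:
        loss += -math.log(sigmoid(-logits[i]))  # push sampled negatives down
    return loss

# Hypothetical scores over a four-word vocabulary; index 0 is the true next word.
good_logits = [4.0, -4.0, -4.0, -4.0]
bad_logits = [-4.0, 4.0, 4.0, 4.0]
loss_good = negative_sampling_loss(good_logits, 0, 2, random.Random(0))
loss_bad = negative_sampling_loss(bad_logits, 0, 2, random.Random(0))
```

The full softmax touches every vocabulary entry on each step, while the sampled version touches only `num_negatives + 1` entries, which is the saving described in the answer.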
Designing a scalable generative AI system typically starts with accumulating a whole bunch of data, and that depends on the problem you're trying to use the model for. If it's a general-purpose problem, where you just want a model capable of generating large amounts of text, it makes sense to scrape the whole of the internet. But it could be very specific: for instance, you could be looking only at medical journals so that the generative model learns more about medical disorders. In that case the data will be quite different. I would still try to balance the dataset between what's medically accurate and what's general, because in the end I want my model to understand language as a construct, and not just the language found in something like an encyclopedia. So that's how I would start. For training the model, it typically starts with picking a transformer architecture that has worked well, and introducing a change only if it's really necessary, because a change can go a long way. You're deviating from what works, and that gets you into an experimental rabbit hole that is hard to get out of. So if the goal is a quick path to deployment and production, what makes the most sense is to train the model using the stable, accepted principles from the white papers and research materials.
Training a generative AI model is quite expensive, so once you're ready to kick it off, it also makes sense to figure out the budget for the training: set up your GPU and CPU time, and decide how long you want the training to run or how quickly it should finish. Figuring out the exact GPU compute budget makes the most sense. Before launching the full run in production, it makes sense to do a preliminary test, where you use a small amount of data just to gather evidence that the model architecture is working. Take a small subset of the data, pass it through the model, and look at the loss and accuracy to see whether the model is slowly learning something or is stuck in some sort of loop. Once all of those checks pass, that's where you deploy to production. You'd also define checkpoints so that, in case something goes wrong, you can fall back to the last good point, kind of like a fail-safe measure. Define metrics and put them on a dashboard that updates every epoch, so you can track how quickly training progresses. Those are some of the things that I would design in order to do that, and this coupled with
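The preliminary check and the checkpointing idea can be illustrated with a deliberately tiny stand-in model. The sketch below trains a one-feature logistic regression on a handful of points, records the loss per epoch, keeps the best parameters as an in-memory checkpoint, and tests whether the loss is actually falling. The model, data, and `min_drop` threshold are all assumptions for the example; a real run would use the actual architecture on a small data subset and write checkpoints to durable storage.

```python
import copy
import math

def train_smoke_test(data, epochs=20, lr=0.5):
    """Run a short training loop; return per-epoch losses and the best checkpoint."""
    params = {"w": 0.0, "b": 0.0}
    history, best = [], None
    for epoch in range(epochs):
        grad_w = grad_b = loss = 0.0
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-(params["w"] * x + params["b"])))
            loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))
            grad_w += (p - y) * x
            grad_b += p - y
        n = len(data)
        loss /= n
        history.append(loss)
        # Checkpoint the parameters that produced the lowest loss so far.
        if best is None or loss < best["loss"]:
            best = {"epoch": epoch, "loss": loss, "params": copy.deepcopy(params)}
        params["w"] -= lr * grad_w / n
        params["b"] -= lr * grad_b / n
    return history, best

def is_learning(history, min_drop=0.05):
    """True if the loss fell by at least min_drop from first to last epoch."""
    return history[0] - history[-1] >= min_drop

# Small hypothetical subset: negative inputs are class 0, positive are class 1.
subset = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
history, checkpoint = train_smoke_test(subset)
```

A flat `history` is the "stuck in a loop" signal from the answer, and `checkpoint` is the fallback point if a later epoch goes wrong.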
In a multi-project environment, the performance of a generative model depends on the use case. Am I going to use one generative AI model across different teams, or should it be curated to fit each team's demands? If I'm going to use just one, deploying it shouldn't be a problem: I need to define an endpoint and figure out the number of instances that should be available to users, so that requests don't get throttled and people have access to it. Now, how do I measure performance? This is a trickier problem because it depends on what the model has been trained on. If it's a generative AI model trained on the whole of the internet, it makes sense for teams to define their own metrics and safety nets, because I've seen generative AI models hallucinate and give you nonsense. So it makes sense to look at it from the perspective of how the model should behave, or the polarity of its output. Performance can be measured on different scales. One is availability, which I mentioned. I tackle it by understanding how many different users will be accessing the model and putting it on something like an auto-scaling Kubernetes endpoint. That makes sense because I would also need a GPU attached to it; if not, requests would take a long time to complete. So performance can be tied to that. The best way is to set up a Kubernetes cluster with GPUs attached so that they work in tandem, and that's probably the efficient way of doing it. Serverless might also work if you're looking to downsize the model, but attaching a GPU to serverless isn't, I think, the most convenient way of doing things.
But those are some of the things that I would try. Consistency is a slippery slope. If my generative AI model isn't continuously trained, consistency is less of a concern. If it is continuously trained, consistency becomes a huge problem, because a bunch of teams may have designed their endpoints or code blocks around how version 1 of the model handled their inputs. If I train the model again, as a different version with a whole bunch of new data, a lot of those weights will be replaced. I would argue the model could still offer some sort of eventual consistency, but not a whole lot. So it would be safer to figure out which teams require that sort of consistency and then maintain versions of the model for them. Other than that, I think that's how I would scale and provide consistency.
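One way to picture per-team version consistency is a small registry in which teams that need stable behaviour pin a version while everyone else follows the latest one. The `ModelRegistry` class, the team names, and the lambda "models" below are hypothetical, written only to illustrate the idea; they do not correspond to any particular serving framework.

```python
class ModelRegistry:
    """Keeps several model versions and lets teams pin the one they rely on."""

    def __init__(self):
        self._versions = {}  # version name -> callable model
        self._pins = {}      # team name -> pinned version name
        self._latest = None

    def register(self, version, model):
        """Add a model version and mark it as the latest."""
        self._versions[version] = model
        self._latest = version

    def pin(self, team, version):
        """Lock a team to a specific, already registered version."""
        if version not in self._versions:
            raise KeyError(f"unknown model version: {version}")
        self._pins[team] = version

    def resolve(self, team):
        """Pinned teams get their version; everyone else gets the latest."""
        return self._pins.get(team, self._latest)

    def generate(self, team, prompt):
        return self._versions[self.resolve(team)](prompt)

registry = ModelRegistry()
registry.register("v1", lambda prompt: f"v1:{prompt}")
registry.register("v2", lambda prompt: f"v2:{prompt}")
registry.pin("billing", "v1")  # this team built its code around v1 behaviour
```

Here `registry.generate("billing", ...)` keeps returning v1 output after v2 is registered, while an unpinned team is routed to v2.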
As for the most appropriate Hugging Face pre-trained model, I'm not quite sure of the name; I've mostly worked on Google's models and a couple of things on the AWS side. I think Mistral was one of the models that launched recently. There was also the 200 million parameter model. There have been a lot of them. It's an evolving discipline, which means new versions of models come out every day, so I don't exactly recall what the latest or best model is. But for a chatbot, it makes sense to have a model that's trained on internet-scale data and is much smaller in size. I think the model version 2 would make the most sense. Why would I choose that model? When I'm designing a chatbot, my end users could be from different demographics, so bias should not exist in my data. That's one of the reasons I would choose a model that's been trained on a lot of unbiased data: I could have people typing in full sentences, and I could also have people using short forms. My model should be able to understand the context being specified and then respond to it. A plain chatbot would still be able to do that, but on its own it would not cut it. To improve it, I would pair it with something like LangChain to set up a RAG system, that is, retrieval-augmented generation.
That's one of the things I'll try, and it'll make more sense because I would then have that added block of fact-checking all of my information against something like a global dataset, rather than relying only on the model's own output, hallucinations and all. So retrieval-augmented generation blocks make more sense there.
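The retrieval step of a retrieval-augmented setup can be sketched without any framework. The example below ranks documents by bag-of-words cosine similarity and places the best match into the prompt; it is a simplified stand-in, since real systems typically use learned embeddings and a vector store, and the documents, function names, and prompt wording are all assumptions for illustration.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Lowercased word counts; a crude substitute for a learned embedding."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(documents, key=lambda d: cosine(q, bag_of_words(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, documents, k=1):
    """Ground the question in retrieved text before it reaches the model."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Hypothetical knowledge base for a support chatbot.
docs = [
    "refunds are processed within five business days",
    "our support desk is open on weekdays",
    "shipping takes two weeks for international orders",
]
prompt = build_prompt("how long do refunds take", docs)
```

The generative model then answers from the retrieved passage rather than from memory alone, which is the fact-checking block described in the answer.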
Here is an approach to fine-tuning a GPT-2 model specifically for clients. Domain-specific fine-tuning is still possible. It makes sense to accumulate all of your data, look at it, and define something like key-value pairs. This can be done in different ways. You could provide a whole passage, called a context, and then a question and an answer from within that context. So, basically, you curate the dataset so that it has a bunch of text, questions, and the specific answer to look for within that text. That's how I would tailor the dataset. Once that's done, you pass it to the model and set up epoch-based training for a number of hours. Since it's going to be an expensive model, it makes sense to do this on a cloud service provider with a GPU attached, which means you would be running the model for node hours. The exact node hours depend on the cloud service provider, which will have documentation on what works for what model size, so figuring out that balance makes the most sense. For fine-tuning a GPT model, you don't really need to disassemble the entire architecture by freezing a bunch of layers. There are reference implementations or code blocks that have shown evidence of working. So the best bet is to use one of those that has shown a stable training curve and continue your training from there. Since it's a very expensive model, pure experimentation might not sit well with the particular problem you're trying to solve. It could go sideways, and that's definitely not something you want when you're training a model for a particular client. So, yeah, those are some of the steps that I take to fine-tune a GPT-2 model.
Rather than experimenting, I think it makes sense to go with stability; it's something like the exploration-versus-exploitation trade-off.
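The dataset curation step described in this answer, with a context, a question, and an answer drawn from that context, might look like the sketch below. The record fields, the text template, and the JSON Lines output are assumptions chosen for illustration; the exact format depends on the fine-tuning tooling in use.

```python
import json

def validate(record):
    """Keep only complete records whose answer appears inside the context."""
    complete = all(record.get(key) for key in ("context", "question", "answer"))
    return complete and record["answer"] in record["context"]

def format_example(record):
    """Flatten one record into a single training string."""
    return (f"Context: {record['context']}\n"
            f"Question: {record['question']}\n"
            f"Answer: {record['answer']}")

def to_jsonl(records):
    """Serialize the valid records, one JSON object per line."""
    return "\n".join(json.dumps({"text": format_example(r)})
                     for r in records if validate(r))

# Hypothetical client records; the second one fails validation because
# its answer cannot be found in the context.
records = [
    {"context": "The warranty covers parts for two years.",
     "question": "How long are parts covered?",
     "answer": "two years"},
    {"context": "The warranty covers parts for two years.",
     "question": "Is labour covered?",
     "answer": "yes"},
]
training_file = to_jsonl(records)
```

Filtering out answers that are not grounded in their context keeps the fine-tuning data consistent before any GPU time is spent.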