Staff Software Engineer, Nume Crypto
Software Development Engineer, Amazon India Development Center, Bangalore
Senior Software Engineer, Ola Cabs
Senior Software Engineer, Mahindra Comviva
Git
AWS Lambda
AWS CloudFormation
AWS CloudWatch
Amazon EC2
Apache Airflow
Postman
REST API
PostgreSQL
MongoDB
Amazon DynamoDB
Visual Studio Code
Ethereum
AWS (Amazon Web Services)
Android
Jira
Finance Technology Group
Consumer Behaviour Analytics
Blockchain & Financial Transaction Innovations
Finance Technology Group - Seller Taxation Services
Consumer Marketing Analytics
Okay. So I have been working as a Staff Software Engineer at my current startup, Nume Crypto. I have around 11 years of experience overall. My journey started in 2012, when I began working as an Android developer at a telecom company called Mahindra Comviva, where I stayed for two years. After that, in early 2015, I moved to Ola Cabs in their very nascent phase. There I worked on their main consumer app, both Android and iOS, and on its backend APIs: location APIs, billing, pricing and discount APIs, and driver location APIs. That continued for around three years. The app was, and still is, one of the most used apps in India, so it was a big learning curve and a big responsibility: handling deployments and the release cycle, and holding regular meetups with product to ideate on requirements and turn the ideas behind those inputs into reality. We were a team of four people managing everything around that, and I stayed at Ola for around four years.

In 2019 I moved to Amazon, where I worked in two verticals. The first was related to prediction, more on the analytics and ML side, where, based on a user's current actions, we predicted long-term free cash flow for Amazon. For example, if you go to Amazon today and purchase a baby diaper or a PlayStation, it is quite likely that you will come back tomorrow and purchase an HDMI cable or a game disc, or, in the baby category, baby clothes or baby bottles. Because you made a purchase today, and if your experience with that purchase was good, there is a higher chance you will come back and make subsequent purchases. That was the basis of the model. The team was newly formed, and I was leading a small charter there where we built the entire pipeline for getting data from around 70 different teams. That data was, more or less, high-value and low-value actions, including reviews; we collected it, did all sorts of cleanup, and provided it to the model, working closely with the science folks to arrive at a proper solution. Then I moved to another newly formed team, related to taxation, where I led a group of 8 developers. The idea behind that team was to take every transaction happening on Amazon's e-commerce platform and apply local taxes, Amazon's cut, and everything around that. It covered the end-to-end cycle from tax calculation to deduction to returns filing and everything else that goes into that ecosystem. I managed that team.

After that I moved to my current startup, Nume Crypto, where everything was done from scratch by me. I was the first engineer hired there; we ended up building the entire stack from scratch, and I hired five to seven folks from an NIT and mentored them.
So the idea behind each transaction here is to maintain the highest level of isolation. In the DB we want a serializable kind of approach, specifically while a transaction is taking place. This leads to a SELECT FOR UPDATE kind of scenario. For example, if a transaction is happening right now, there is a change for two parties, party A and party B: for party A the amount will be debited from their wallet, and for party B the amount will be credited. To maintain consistency, you take locks on the rows involved in the transaction, that is, a lock on each user's row as you select it. Meanwhile, any other statement that comes in to read that row for update, or to modify it, gets queued, since the lock is held. Until the entire transaction completes, meaning A's wallet is debited and B's wallet is credited, those two locks on the two users' rows cannot be released. That is the highest isolation level, serializable, with respect to RDBMS databases, and we apply it there.

Apart from that, while the transaction is being committed to some form of DB, you obviously need some kind of redundancy to make it fault tolerant. The best mechanism for that is periodic backups, plus, when writes are taking place, making sure that whatever replicas exist also contain this transaction. There are multiple commit algorithms behind this scenario: one is Paxos and another is two-phase commit. In two-phase commit, until all the replicas have a copy of this transaction committed, meaning the changed rows for user A and user B, we don't actually mark the transaction as finished. So we should opt for two-phase commit rather than Paxos. In Paxos we take a quorum: for example, with two write replicas and, say, five read replicas, if maybe three replicas have the transaction committed, we can say go ahead. But in a scenario where payment is involved, we should be very careful: unless all the replicas have the same amount committed, we can't mark the transaction as done. So those are the two main things: row-level locking at the highest isolation level, serializable, and two-phase commit. The combination of the two helps in maintaining consistency.
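As a hedged illustration of the row-locking part only, here is a minimal Java/JDBC sketch assuming a hypothetical wallets(user_id, balance) table; it shows SELECT ... FOR UPDATE holding both rows until commit, and is not the actual production code being described.

```java
import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class WalletTransfer {

    // Moves `amount` from payer to payee inside one DB transaction. Both wallet
    // rows are locked with SELECT ... FOR UPDATE, so concurrent writers (and
    // other FOR UPDATE readers) queue until we commit or roll back.
    public static void transfer(Connection conn, long payerId, long payeeId,
                                BigDecimal amount) throws SQLException {
        conn.setAutoCommit(false);
        try {
            BigDecimal payerBalance = lockAndReadBalance(conn, payerId);
            lockAndReadBalance(conn, payeeId);

            if (payerBalance.compareTo(amount) < 0) {
                throw new SQLException("Insufficient balance for user " + payerId);
            }
            updateBalance(conn, payerId, amount.negate()); // debit A
            updateBalance(conn, payeeId, amount);          // credit B

            conn.commit();   // both row locks are released only here
        } catch (SQLException e) {
            conn.rollback(); // on any failure, neither wallet changes
            throw e;
        }
    }

    private static BigDecimal lockAndReadBalance(Connection conn, long userId)
            throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT balance FROM wallets WHERE user_id = ? FOR UPDATE")) {
            ps.setLong(1, userId);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) throw new SQLException("No wallet for user " + userId);
                return rs.getBigDecimal("balance");
            }
        }
    }

    private static void updateBalance(Connection conn, long userId, BigDecimal delta)
            throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "UPDATE wallets SET balance = balance + ? WHERE user_id = ?")) {
            ps.setBigDecimal(1, delta);
            ps.setLong(2, userId);
            ps.executeUpdate();
        }
    }
}
```

In practice the two rows would also be locked in a deterministic order (for example, ascending user_id) so that two opposite transfers cannot deadlock each other.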
So, yeah, more on the answer I gave a while back. Here, high availability and fault tolerance are close to synonyms, because first of all the system needs to be available to carry out every kind of transaction. High availability comes from the system staying available: even if one part of the system goes down, there should be a different path that still works properly. There can be multiple points of failure. The first can be the application server itself, so you need multiple nodes running the application server. That redundancy needs a load balancer in front, which directs traffic based on some algorithm: round robin, a hashing-based mechanism, whatever suits the scenario. So the first layer of redundancy is at the application server.

The second layer of redundancy is at the database. As I said, our records need to stay safe: even if one node goes down, there should be another node available in the same availability zone as well as in different availability zones. The idea behind keeping data across availability zones is that if one data center goes down entirely, the data is still replicated elsewhere and we can redirect traffic there. We may take a bit of a hit on latency, but we can't take a hit on availability and fault tolerance. So the second thing is redundancy at the database level, with databases available in multiple availability zones.

Apart from that, if transactions are coming in at really high volume, we still have to process each one. If our processing capacity is lower than the incoming rate, we can put a queuing mechanism in between so that each transaction, as it arrives, can be processed individually by the application servers. And as I stated earlier about locks and the highest DB isolation level, serializable, we won't end up corrupting transactions; consistency will be maintained. For fault tolerance, again, the basic idea is as much redundancy as possible, plus regular backups. Backups are secondary, but they matter: there have been instances of data loss in the past, and with backups we at least have a dataset the database can be pointed back to in case everything goes wrong. So redundancy, combined with backups and with placing services across different availability zones, takes care of most of it.
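A minimal sketch of the first redundancy layer, with hypothetical node names and a precomputed health flag standing in for a real health probe; in production this role is usually played by a managed load balancer (for example an AWS ELB), but the round-robin-over-healthy-nodes idea is the same.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Round-robin selection over redundant application-server nodes, skipping
// unhealthy ones so a single failed server does not break availability.
public class RoundRobinBalancer {

    public record Node(String host, int port, boolean healthy) {}

    private final List<Node> nodes;
    private final AtomicInteger counter = new AtomicInteger();

    public RoundRobinBalancer(List<Node> nodes) {
        this.nodes = List.copyOf(nodes);
    }

    public Node next() {
        for (int attempts = 0; attempts < nodes.size(); attempts++) {
            int idx = Math.floorMod(counter.getAndIncrement(), nodes.size());
            Node candidate = nodes.get(idx);
            if (candidate.healthy()) {
                return candidate;
            }
        }
        throw new IllegalStateException("No healthy application server available");
    }

    public static void main(String[] args) {
        RoundRobinBalancer lb = new RoundRobinBalancer(List.of(
                new Node("app-1.internal", 8080, true),
                new Node("app-2.internal", 8080, false),  // simulated outage
                new Node("app-3.internal", 8080, true)));
        for (int i = 0; i < 4; i++) {
            System.out.println("Routing request to " + lb.next().host());
        }
    }
}
```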
Okay. So in distributed payment transactions, there can be certain rules about whom to prioritize and whom not to. We may have hard SLAs with some vendors: if our payment gateway caters to different vendors, those can be credit card vendors, other payment gateways, or some group of highly available services where you need instant payment confirmation. Based on that, you can have different kinds of SLAs, and those SLAs become the main factor behind the priority queues: whatever transaction comes in from a given vendor or provider can be mapped to a different topic or queue, and we read transactions based on that. One thing I can think of here is a distributed queue kind of mechanism where the producing server decides the routing: the transaction first lands on some gateway server, and that gateway decides, based on the priority of the given vendor, which topic to send the data to. Then we have consumer groups, and we can scale a consumer group up, or divide the groups, based on priority. Suppose some group of transactions, say federal transactions, has to be processed within a minimum SLA; then the topic containing those transactions gets a larger number of consumers in its consumer group picking them up, while transactions that can afford a slightly relaxed SLA get a smaller consumer group. With more consumers, the transactions landing in that topic are picked up and processed first. That's how we can decide the prioritization.
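A sketch of the gateway-side routing, assuming the Kafka Java client and hypothetical topic names (payments.priority.high, payments.priority.low); the SLA-to-topic mapping, broker address, and payload are illustrative rather than taken from a real system.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SlaAwareRouter {

    // Hypothetical mapping; the real one would come from per-vendor SLA config.
    private static String topicFor(String vendorTier) {
        return switch (vendorTier) {
            case "STRICT_SLA"  -> "payments.priority.high";
            case "RELAXED_SLA" -> "payments.priority.low";
            default            -> "payments.priority.default";
        };
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Keying by transaction id keeps per-transaction ordering within a partition.
            String txnId = "txn-123";
            String payload = "{\"txnId\":\"txn-123\",\"amount\":250.00,\"vendor\":\"card-x\"}";
            producer.send(new ProducerRecord<>(topicFor("STRICT_SLA"), txnId, payload));
        }
        // The consumer group on payments.priority.high would be scaled to more
        // instances (and the topic given more partitions) than the relaxed-SLA topic.
    }
}
```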
Okay, in case of failure, we can have two mechanisms here. One is a central orchestrator that can oversee the transaction. For example, say there are three servers: the client hits server one, server one commits its part of the transaction and creates two further transactions to be committed in two downstream subsystems. Now it may turn out that inventory has run out, that in this very moment the inventory count has decreased, so you can't allocate that inventory to the user who just made the payment. So there are two options. One option is an orchestrator that creates a reverse transaction and runs it on system one: say the payment is already done, the orchestrator sees that the resource is no longer available, so it creates a reverse transaction to roll back the amount and puts it on a separate rollback queue, which is read on priority. The first microservice then picks up that transaction, processes it, and rolls it back. If we don't take the orchestrator route, we still need the rollback transaction whatever the case; it then becomes the responsibility of the downstream systems to emit it. That approach is a bit tricky, because if there are two subsystems, both might end up emitting the same rollback transaction, which doesn't make much sense. But we can have a mechanism where, once subsystem one has processed its part and subsystem two has to roll back, subsystem two creates the reverse transaction and puts it into S1's queue, and S1 picks it up, processes it, and rolls it back. That is more of a choreography way of doing it, where there is effectively no central node deciding how to proceed and the services themselves create the rollback transactions. Either way, generating a compensating rollback transaction is the most effective strategy I can think of here.
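A minimal orchestrator-style sketch of the compensating-transaction idea; PaymentService, InventoryService, and the in-memory rollback queue are hypothetical stand-ins rather than a real saga framework.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class SagaOrchestrator {

    interface PaymentService {
        void charge(String txnId, double amount);
        void refund(String txnId, double amount); // compensating action
    }

    interface InventoryService {
        boolean reserve(String txnId, String sku);
    }

    // Compensating actions waiting to be applied; read on priority.
    private final Queue<Runnable> rollbackQueue = new ArrayDeque<>();

    private final PaymentService payments;
    private final InventoryService inventory;

    SagaOrchestrator(PaymentService payments, InventoryService inventory) {
        this.payments = payments;
        this.inventory = inventory;
    }

    void placeOrder(String txnId, String sku, double amount) {
        payments.charge(txnId, amount);                    // step 1: money debited
        boolean reserved = inventory.reserve(txnId, sku);  // step 2: allocate stock
        if (!reserved) {
            // Step 2 failed after step 1 succeeded: enqueue the reverse
            // transaction so the payment service refunds the amount.
            rollbackQueue.add(() -> payments.refund(txnId, amount));
        }
    }

    void drainRollbacks() {
        while (!rollbackQueue.isEmpty()) {
            rollbackQueue.poll().run();
        }
    }

    public static void main(String[] args) {
        PaymentService pay = new PaymentService() {
            public void charge(String txnId, double amount) {
                System.out.println("charged " + amount + " for " + txnId);
            }
            public void refund(String txnId, double amount) {
                System.out.println("refunded " + amount + " for " + txnId);
            }
        };
        InventoryService stock = (txnId, sku) -> false; // simulate out-of-stock
        SagaOrchestrator saga = new SagaOrchestrator(pay, stock);
        saga.placeOrder("txn-42", "sku-9", 99.0);
        saga.drainRollbacks(); // prints the refund
    }
}
```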
So, two things here: we need secure communication at two points, right? Communication between client and server, and communication between two servers at our end. Between two servers at our end, we have more control. The first thing we can do is contain our microservices in some kind of virtual private cloud where only an allowed group of IPs, or domains, on a secure port can access them, and they are not exposed to the outside world. That greatly reduces the blast radius in case something unwanted happens. The second thing is the use of an mTLS mechanism: with mTLS we have certificates, so when a call goes from server 1 to server 2, server 2 verifies server 1's certificate and only then processes it. mTLS is a very effective way when both systems are under our control. Obviously, we'll use HTTPS whenever we do the communication, which keeps the data secure in flight, and backed by mTLS and the virtual private cloud, that makes server-to-server communication secure.

Now comes client to server, where we are facing an external entity. If there is a possibility of implementing OAuth, we should do that. With OAuth, we need some kind of user identification, so if we expose our SDK for the client to integrate, they can provide their identity to us, and whenever the client does a transaction we check their identity first. The client provides credentials, and the server ends up issuing an access token, a refresh token, and an ID token. The access token can then be used for further calls: whenever the client passes data, the token is passed in the header, and from the header we can verify the client. These tokens are typically signed, and they encode some user parameters when generated, so when the client sends the token back we can validate those claims and also check the token's validity, like how long it remains valid and everything around that. That's one effective way. Second, if the client is also under our control, we can use the mTLS mechanism there as well to verify certificates. So those are the best ways I can think of: tokens on the client side, plus VPC and mTLS.
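As a concrete illustration of the server-to-server piece, here is a minimal sketch, assuming Java 11+ and hypothetical keystore files (client-keystore.p12, internal-ca-truststore.p12), of an HTTP client configured for mutual TLS; the URL and password are placeholders, and the receiving service would additionally be configured to require client certificates.

```java
import javax.net.ssl.KeyManagerFactory;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManagerFactory;
import java.io.FileInputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.security.KeyStore;

public class MtlsClient {

    public static void main(String[] args) throws Exception {
        char[] password = "changeit".toCharArray(); // placeholder password

        // Client certificate + private key, presented to the server during the handshake.
        KeyStore clientKeys = KeyStore.getInstance("PKCS12");
        try (FileInputStream in = new FileInputStream("client-keystore.p12")) {
            clientKeys.load(in, password);
        }
        KeyManagerFactory kmf =
                KeyManagerFactory.getInstance(KeyManagerFactory.getDefaultAlgorithm());
        kmf.init(clientKeys, password);

        // Trust store holding the internal CA that signed the peer service's certificate.
        KeyStore trustedCas = KeyStore.getInstance("PKCS12");
        try (FileInputStream in = new FileInputStream("internal-ca-truststore.p12")) {
            trustedCas.load(in, password);
        }
        TrustManagerFactory tmf =
                TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
        tmf.init(trustedCas);

        SSLContext ssl = SSLContext.getInstance("TLS");
        ssl.init(kmf.getKeyManagers(), tmf.getTrustManagers(), null);

        HttpClient client = HttpClient.newBuilder().sslContext(ssl).build();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://payments.internal.example/api/v1/charge"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString("{\"txnId\":\"txn-123\"}"))
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
    }
}
```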
I think this condition is wrong: i less than or equal to number divided by 2. If you have to check whether 6 is prime, the bound becomes 6 / 2 = 3, so the loop checks i = 2 and 3. Let me trace it with 4: number / 2 becomes 2, so the loop is i = 2 with i <= 2, and it runs once, checking 4 % 2. Yeah, the problem I see is with this divide-by-2 condition; that's the only thing I can think of.
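The snippet under discussion is not included in the transcript, so as a hedged reconstruction, here is what a number / 2 bounded loop of the shape being described might look like, next to the tighter square-root bound that is usually preferred; both classify the edge case 4 discussed above as non-prime.

```java
public class PrimeCheck {

    // Hypothetical version of the loop being discussed, bounded by number / 2.
    static boolean isPrimeHalfBound(int number) {
        if (number < 2) return false;
        for (int i = 2; i <= number / 2; i++) {
            if (number % i == 0) return false;
        }
        return true;
    }

    // Tighter, idiomatic bound: a composite number always has a divisor no
    // larger than its square root, so i * i <= number is sufficient.
    static boolean isPrime(int number) {
        if (number < 2) return false;
        for (int i = 2; (long) i * i <= number; i++) {
            if (number % i == 0) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        for (int n : new int[]{1, 2, 3, 4, 6, 7, 9, 97}) {
            System.out.println(n + " -> " + isPrime(n) + " / " + isPrimeHalfBound(n));
        }
    }
}
```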
We are relying here on the fact that processTransaction will somehow initialize the logger, which is wrong. processTransaction has its own responsibility, which is processing the transaction; you can't push the logger concern onto it. Even though we are doing the null check, it doesn't make sense. The logger should be initialized globally somewhere, and only then should we go ahead. These loggers are mostly singletons, but in this case it's an implementation of some kind of logger interface, maybe a file logger or an in-memory logger. Whatever the implementation, it should be initialized first, and only then should the flow proceed. If we are relying on processTransaction to initialize the logger implementation, that's the wrong approach.
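A small sketch of the alternative being suggested, with hypothetical names (Logger, FileLogger, PaymentProcessor): the logger is wired in once at construction time, so processTransaction never has to null-check or initialize it.

```java
import java.util.Objects;

interface Logger {
    void info(String message);
}

class FileLogger implements Logger {
    @Override
    public void info(String message) {
        // A real implementation would append to a file; kept minimal here.
        System.out.println("[file] " + message);
    }
}

class PaymentProcessor {
    private final Logger logger;

    // The logger is injected once, at construction time, instead of being
    // lazily created behind a null check inside processTransaction.
    PaymentProcessor(Logger logger) {
        this.logger = Objects.requireNonNull(logger, "logger must be provided");
    }

    void processTransaction(String txnId) {
        logger.info("processing " + txnId);
        // ... actual transaction handling only; no logger lifecycle concerns here
    }
}

public class Main {
    public static void main(String[] args) {
        PaymentProcessor processor = new PaymentProcessor(new FileLogger());
        processor.processTransaction("txn-123");
    }
}
```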
We are outlining a process to dynamically adapt payment processing thresholds based on real-time analytics. As far as my understanding of this question goes, we are monitoring the traffic in real time, or listening to each transaction one by one, for example for fraudulent transactions. Let's go with traffic first. Say we have our system where data lands on the starting server, the gateway server; the gateway server puts it into some kind of distributed queue, the queue has topics, topics have partitions, and topics are listened to by multiple consumer groups. One of the consumer groups consuming this topic has a rule engine running in it. That rule engine can have multiple rules: rules for detecting fraudulent transactions, rules for detecting invalid parameters being passed, or invalid transactions on the fly. Based on those rules and that detection we can adjust the thresholds. The consumer group can also detect if a high amount of traffic is coming from a particular region; if that rule fires, we can scale up the topics and consumer groups, or the message brokers themselves, if the scale has gone really high at a given point in time. For fraudulent or invalid transactions, as I already said, we can analyze them through this mechanism on the fly and take any action: we can call a notification service, which notifies any dependent microservice, and that microservice can take action, for example increasing the processing hardware on the fly, whether that's the consumer group or the scaling group. That's what I can think of.
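A toy sketch of the rule-engine idea, with hypothetical Rule and Transaction types and an in-memory list standing in for the consumer group's stream; when a rule fires, a shared threshold is adjusted on the fly.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Predicate;

public class ThresholdAdapter {

    record Transaction(String id, String region, double amount) {}

    // Per-transaction amount threshold (in cents) that downstream approval consults.
    static final AtomicLong AMOUNT_THRESHOLD_CENTS = new AtomicLong(5_000_000);

    record Rule(String name, Predicate<Transaction> condition, Runnable action) {}

    public static void main(String[] args) {
        List<Rule> rules = List.of(
                new Rule("high-value-from-hot-region",
                        txn -> txn.region().equals("region-x") && txn.amount() > 10_000,
                        // Tighten the threshold when suspicious traffic is seen.
                        () -> AMOUNT_THRESHOLD_CENTS.set(1_000_000)),
                new Rule("non-positive-amount",
                        txn -> txn.amount() <= 0,
                        () -> System.out.println("flagging invalid transaction")));

        // Stand-in for messages read from the topic by the consumer group.
        List<Transaction> incoming = List.of(
                new Transaction("txn-1", "region-x", 12_500),
                new Transaction("txn-2", "region-y", 80));

        for (Transaction txn : incoming) {
            for (Rule rule : rules) {
                if (rule.condition().test(txn)) {
                    System.out.println(rule.name() + " fired for " + txn.id());
                    rule.action().run();
                }
            }
        }
        System.out.println("current threshold (cents): " + AMOUNT_THRESHOLD_CENTS.get());
    }
}
```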
Okay, I'm good to start. There are two or three points I can think of here. First, given that we are overhauling a payment system, we're not creating something from scratch, so we need to understand the pain points, the bottlenecks, and the areas where fixes or an alternate approach are needed. We can analyze that with the engineering team and back it up with data that can be presented to non-technical stakeholders, whether that's managers, directors, or anyone else; everything should be backed by data. Once we have that report, we should have a plan for how to address the pain points. There will be multiple pain points, some of them interdependent, and you can't do everything at once, because this system is under heavy load and you can't make heavy changes all at once while changes are being rolled out. So we divide the work into priority groups and go for slow, incremental releases. When I say slow, it's not that the timeline gets delayed; it's that we deliver good quality work rather than trying to fix everything at once and running into a further, unimaginable mess. So the idea is: understand the pain points, back them up with data, reach an agreeable point with all the stakeholders, come up with a plan with priorities, come up with the designs, high-level and low-level designs, with the technical leads, then come up with timelines and the task distribution among the specific teams. Teams with knowledge of the earlier system should be given priority, in my opinion, because again this is a critical part. That's the order. Once the plan, HLDs, and LLDs are ready and approved, you assign the developers along with timelines. Timelines should be a deeply thought-through process, broken down to the task level, and again we should focus on quality rather than on delivering everything at once. That's how I'd plan it.
So reconciliation, I believe, is an important part of the system. It's not a real-time thing; it's more of a batch job that runs at periodic hours, or maybe at the end of the day, something like that. We have two data sources here. One source I can think of is the events generated from the client: when a payment happens, the client generates events that our distributed queues read and that we store somewhere, and we use those later for reconciliation. The other source of truth is our database, where each individual transaction lands and we keep the books. A further source of truth can be the banks themselves, from which we can pull data as well. So at the end of the day, we can have a job scheduler trigger with this approach: first, the scheduler's job is to check for resources and data availability. It checks whether data is available for today's transactions, a 24-hour delta, or maybe a 48- or 72-hour delta, so that if something is missing we can still pick it up the next day. Then it pulls the logs, or data, from the sources of truth. Once the batch job triggers, we get the data, the user identities, and whatever other metadata is needed from the different subsystems, and we do the reconciliation one by one for each transaction. The batch job doesn't run all the time, so we can provision good-capacity hardware for it and let it sequentially go through all the transactions, checking whether something is missing or off the charts. Anything that is off the charts can be put as a separate line item in another report, or thrown into another notification service, where a separate mini service reads those notifications, compiles them, and builds a report out of them. Once the report is ready, it can be sent to the people responsible, and they can take corrective action or see what can be done.
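A hedged sketch of the core matching step of such a batch job, with two in-memory maps standing in for the internal ledger and the bank/client events; each discrepancy becomes a line item for the exception report described above.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class ReconciliationJob {

    record LedgerEntry(String txnId, long amountCents) {}

    public static void main(String[] args) {
        Map<String, LedgerEntry> internalLedger = Map.of(
                "txn-1", new LedgerEntry("txn-1", 10_000),
                "txn-2", new LedgerEntry("txn-2", 25_000),
                "txn-3", new LedgerEntry("txn-3", 4_500));

        Map<String, LedgerEntry> bankStatement = Map.of(
                "txn-1", new LedgerEntry("txn-1", 10_000),
                "txn-2", new LedgerEntry("txn-2", 24_000),   // amount mismatch
                "txn-4", new LedgerEntry("txn-4", 7_000));   // missing internally

        Set<String> allIds = new HashSet<>();
        allIds.addAll(internalLedger.keySet());
        allIds.addAll(bankStatement.keySet());

        // Each discrepancy would feed the exception report that a downstream
        // notification/report service compiles and distributes.
        for (String id : allIds) {
            LedgerEntry ours = internalLedger.get(id);
            LedgerEntry theirs = bankStatement.get(id);
            if (ours == null) {
                System.out.println(id + ": present at bank, missing in ledger");
            } else if (theirs == null) {
                System.out.println(id + ": present in ledger, missing at bank");
            } else if (ours.amountCents() != theirs.amountCents()) {
                System.out.println(id + ": amount mismatch " + ours.amountCents()
                        + " vs " + theirs.amountCents());
            }
        }
    }
}
```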