
Arpith

Vetted Talent

An accomplished Software Engineer with 11+ years of experience and a proven track record in leadership, I excel at driving operational improvements and enhancing customer satisfaction.

  • Role

Machine Learning Engineering Manager

  • Years of Experience

    12 years

Skillsets

  • Big Data
  • SageMaker
  • REST
  • PostgreSQL
  • ML models
  • Hadoop
  • GraphQL
  • Google Apps
  • Google App Engine
  • Druid
  • AWS - 6 Years
  • Apache Spark
  • Apache
  • Oozie
  • Azure
  • Python - 9 Years
  • Prometheus - 2 Years
  • Java - 6 Years
  • Kafka - 4 Years
  • Elasticsearch - 2 Years

Vetted For

15 Skills
  • Role: Staff Software Engineer - Payments Economics (AI Screening)
  • Result: 70%
  • Skills assessed: Collaboration, Communication, Payments systems, service-to-service communication, Stakeholder Management, Architectural Patterns, Architecture, Coding, HLD, LLD, Problem Solving, Product Strategy, SOA, Team Handling, Technical Management
  • Score: 63/90

Professional Summary

12 Years
  • Feb, 2025 - Present 7 months

    Machine Learning Engineering Manager

    Bluevine
  • Jan, 2023 - Dec, 2024 1 yr 11 months

    Consultant Engineering Manager

    Consultant Engineering Manager
  • Jan, 2021 - Dec, 2022 1 yr 11 months

    Staff Engineer

    Egnyte
  • Jan, 2018 - Dec, 2021 3 yr 11 months

    Lead Engineer

    Target
  • Jan, 2016 - Dec, 2017 1 yr 11 months

    Software Engineer

    Kahuna Inc
  • Jan, 2013 - Dec, 2016 3 yr 11 months

    Big Data Developer

    Rackspace
  • Jan, 2009 - Dec, 2011 2 yr 11 months

    Software Developer

    HCL Technologies

Applications & Tools Known

  • PostgreSQL
  • Django REST framework
  • Apache Cassandra
  • Apache HBase
  • Druid

Work History

12 Years

Machine Learning Engineering Manager

Bluevine
Feb, 2025 - Present 7 months
    Led the development of advanced ML models for fraud detection, anomaly analysis, and streaming solutions.

    • Fraud and Anomaly Detection Models: Led development of ML models for fraud and anomaly detection, including account fraud, name mismatch, and fraud mismatch, significantly improving fraud prevention and operational security.
    • Analytics Engine Development: Led development of a rollup analytics engine for efficient aggregation, filtering, and computation of entity-specific data, boosting the speed and accuracy of business reporting.
    • Change Data Capture (CDC) Integration: Managed the migration from MySQL to Kafka using Debezium, enabling real-time data streaming and ensuring consistency and high availability across distributed systems.
    • AWS SageMaker ML-Ops Framework: Architected and led an end-to-end ML-Ops framework on AWS SageMaker, overseeing model training, data processing, performance monitoring, and versioning, which streamlined ML workflows and accelerated model delivery.
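
    To make the fraud and anomaly work above concrete, here is a minimal, hypothetical sketch using scikit-learn's IsolationForest; the actual Bluevine models, features, and tooling are not described in this profile, so the feature names below are assumptions.

```python
# Minimal anomaly-detection sketch using scikit-learn's IsolationForest.
# Illustrative only; features [amount, hour_of_day, tx_per_day] are hypothetical.
import numpy as np
from sklearn.ensemble import IsolationForest

train = np.array([
    [25.0, 14, 3],
    [40.0, 10, 2],
    [18.5, 9, 4],
    [60.0, 16, 1],
])

model = IsolationForest(contamination=0.05, random_state=42)
model.fit(train)

# Score a new transaction: -1 flags a likely anomaly, 1 looks normal.
new_tx = np.array([[5000.0, 3, 40]])   # unusually large amount, odd hour, high frequency
print(model.predict(new_tx))           # e.g. [-1]
```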

Consultant Engineering Manager

Consultant Engineering Manager
Jan, 2023 - Dec, 2024 1 yr 11 months
    Led a team managing infrastructure for a GraphQL service. Responsible for infrastructure scaling of GraphQL servers behind multiple load balancers to ensure high availability. Identified and addressed API performance bottlenecks, reducing daily average costs by 60%. Enabled efficient serving of 500K concurrent users with 80% fewer instances.

Staff Engineer

Egnyte
Jan, 2021 - Dec, 2022 1 yr 11 months
    Engaged in backend engineering for content management systems. Designed and implemented the Egnyte Search connector using Apache Tika for over 100 file types, enabling search-engine indexing and content analysis. Designed and implemented a migration tool that enabled the indexing of 100TB of AutoCAD files and OCR data for efficient search. Implemented a data deduplicator using MD5 hashing to reduce costs, save storage space, and improve system performance by eliminating duplicate content.
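
    As an illustration of the MD5-based de-duplication mentioned above, here is a minimal sketch; the directory layout is hypothetical and a production system would store hashes in a database rather than in memory.

```python
# Sketch of content deduplication by MD5 hash, in the spirit of the
# de-duplicator described above. File paths here are hypothetical.
import hashlib
import os

def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large files never load fully into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_duplicates(root: str) -> dict[str, list[str]]:
    """Group files under `root` by content hash; groups with >1 path are duplicates."""
    by_hash: dict[str, list[str]] = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            by_hash.setdefault(md5_of_file(path), []).append(path)
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}
```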

Lead Engineer

Target
Jan, 2018 - Dec, 2021 3 yr 11 months
    Focused on data analytics and performance optimization. Led an engineering team to build and deploy a real-time analytics dashboard, enhancing user accessibility and data insights. Developed and optimized core features of a no-code data analysis platform, focusing on constructing complex analytical queries routed to a query federation system. Designed and developed a high-performance real-time streaming pipeline using Apache Flink and Kafka, processing 4 billion events and driving data-intensive back-end performance optimization. Used Apache Druid for efficient data storage and real-time analytics, contributing to the scalability and extensibility of the platform.
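
    The pipeline described above used Apache Flink and Kafka; as a much-simplified, hypothetical stand-in, the sketch below counts events per 60-second tumbling window with kafka-python (broker address and topic name are assumptions).

```python
# Highly simplified stand-in for the streaming aggregation described above.
# The real pipeline used Apache Flink; this just counts events per 60-second
# tumbling window with kafka-python. Broker and topic are assumptions.
import json
import time
from collections import Counter
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "events",                               # hypothetical topic
    bootstrap_servers="localhost:9092",     # hypothetical broker
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

window_start = time.time()
counts: Counter = Counter()

for msg in consumer:
    counts[msg.value.get("event_type", "unknown")] += 1
    if time.time() - window_start >= 60:
        print(dict(counts))                 # emit the window's aggregate
        counts.clear()
        window_start = time.time()
```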

Software Engineer

Kahuna Inc
Jan, 2016 - Dec, 2017 1 yr 11 months
    Worked on customer engagement solutions. Built a multi-channel customer journey visualization platform using Google App Engine. Developed a high-throughput API for ad placements on mobile devices.

Big Data Developer

Rackspace
Jan, 2013 - Dec, 2016 3 yr 11 months
    Focused on big data solutions and cloud orchestration. Deployed and administered a 20+ node Hadoop and Kafka cluster. Accelerated customer growth 2X through efficient provisioning of cloud clusters.

Software Developer

HCL Technologies
Jan, 2009 - Dec, 2011 2 yr 11 months
    Involved in software development and system performance enhancement. Led the design and development of high-throughput microservices using REST APIs in Python. Achieved a 30% improvement in overall system efficiency.

Testimonial

Target

Samrakshini

LinkedIn Recommendation

Major Projects

3 Projects

Text extraction on all types of files

Egnyte
Jan, 2022 - May, 2022 4 months

    Strengths of Apache Tika:

    • Content Extraction: Apache Tika excels at extracting content from a diverse range of file formats, providing a unified interface for content analysis.
    • Metadata Retrieval: Efficiently retrieves metadata, offering valuable information about documents, including author, creation date, and more.
    • Language Detection: Provides language detection capabilities, aiding in understanding the linguistic context of documents.
    • Extensibility: Highly extensible, allowing users to add custom parsers for specific file formats or customize existing ones.

    Drawbacks and Challenges:

    • Extraction Data Size: Faces challenges with large data extraction, where performance may degrade for extensive documents, impacting processing speed.
    • Missing MIME Type Detection: Tika may encounter difficulties in accurately detecting MIME types for certain file formats, leading to potential misclassification.
    • Language Detection Accuracy: While offering language detection, the accuracy may vary depending on the complexity of the document, potentially leading to misidentifications.
    • Resource Intensiveness: Processing resource-intensive files might strain system resources, affecting overall performance and responsiveness.

Realtime Analytics Platform

Target
Jan, 2020 - Aug, 2020 7 months

    Strengths of the Flink Project for Real-Time Analytics:

    • Agility: Flink's high-level API facilitates maintaining a single codebase for the entire search infrastructure process, and provides a framework for expressing complex business logic efficiently.
    • Consistency: Offers at-least-once semantics crucial for reflecting changes in databases, adaptable to exactly-once requirements for various use cases within the company.
    • Low Latency: Enables rapid updates in search results, ensuring timely reflection of changes like inventory availability; suitable for dynamic scenarios where low latency is essential.
    • Cost Efficiency: Handles high throughput efficiently, resulting in significant cost savings for Alibaba's data processing needs.

    Challenges Faced and Optimization Strategies:

    • External Storage Bottleneck: Identified accessing external storage like HBase as a production bottleneck; introduced Asynchronous I/O to address this issue, with plans to contribute to the community.
    • State Backends and Latency Optimization: Highlighted differences in latency when using different state backends (filesystem/hashmap vs. RocksDB), and provided insights into optimizing state backend choices based on state size and memory capacity.
    • Resource Allocation for Low Latency: Emphasized the importance of allocating enough resources to reduce latency; recommended monitoring Flink metrics and scaling up or out based on job requirements.
    • Experimental Results: Shared experimental results for the WindowingJob, showcasing latency reductions with increased parallelism, and illustrated the impact of resource allocation on reducing the 99th percentile latency.
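
    The results above track 99th-percentile latency; as a small illustrative sketch with made-up samples, p99 can be computed with a nearest-rank percentile like this (in practice the samples would come from Flink metrics).

```python
# Tiny sketch of computing the 99th-percentile latency mentioned above.
# Latency samples are made up for illustration.
def percentile(samples: list[float], p: float) -> float:
    ordered = sorted(samples)
    # nearest-rank method: index of the p-th percentile value
    k = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[k]

latencies_ms = [12, 15, 11, 14, 250, 13, 16, 12, 11, 480, 14, 13]
print("p50:", percentile(latencies_ms, 50), "ms")
print("p99:", percentile(latencies_ms, 99), "ms")
```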

Omni-Channel marketing platform

Kahuna
Jan, 2017 - May, 2017 4 months

    Strengths of Omni-Channel Marketing Platform:

    • Multi-Channel Integration: Integrates seamlessly with various channels, including Yelp, offering a unified platform for marketing efforts.
    • Enhanced Visibility: Leverages Yelp's extensive user base to enhance visibility and reach a diverse audience across multiple channels.
    • Customer Engagement: Facilitates effective customer engagement by utilizing Yelp's features, such as reviews and ratings, to build trust and credibility.
    • Data Analytics: Incorporates robust data analytics capabilities, allowing businesses to gain insights into customer behavior and preferences.
    • Personalized Marketing: Enables personalized marketing strategies by leveraging Yelp data, tailoring messages to specific customer segments.

    Challenges and Considerations:

    • Rate Limiting for Push Notifications: Faces challenges with rate limiting when sending push notifications, requiring careful management to avoid exceeding service limits and ensuring effective communication.
    • Timezone Differences in Messages: Addresses timezone differences as a challenge, necessitating strategies to ensure messages are delivered at optimal times across diverse geographical locations.
    • Coordination Across Channels: Manages coordination challenges when orchestrating marketing efforts across multiple channels, ensuring a cohesive and consistent brand message.
    • User Privacy and Permissions: Navigates the complexities of user privacy concerns and permissions, ensuring compliance with regulations and building trust among customers.
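
    Two of the challenges above, rate limiting and timezone differences, can be illustrated with a small hypothetical sketch; the send window and the per-second budget are assumptions, not details from the project.

```python
# Sketch of two concerns above: sending at a sensible local hour per user
# (timezone differences) and throttling pushes (rate limiting).
import time
from datetime import datetime
from zoneinfo import ZoneInfo

def local_send_hour_ok(user_tz: str, earliest: int = 9, latest: int = 21) -> bool:
    """Only push between 09:00 and 21:00 in the user's own timezone."""
    hour = datetime.now(ZoneInfo(user_tz)).hour
    return earliest <= hour < latest

class TokenBucket:
    """Very small token-bucket limiter: at most `rate` sends per second."""
    def __init__(self, rate: float):
        self.rate = rate
        self.tokens = rate
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=10)              # hypothetical 10 pushes/second budget
if local_send_hour_ok("America/Chicago") and bucket.allow():
    print("send push notification")
```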

Education

  • Master of Science

    The University of Texas, Dallas (2013)
  • Bachelor of Engineering

    Visvesvaraya Technological University (2009)

Certifications

  • Google Analytics

AI-interview Questions & Answers

Sure. I did both my bachelor's and my master's in computer science, with a specialization in cloud computing during my master's. My most recent role was with Egnyte as a Staff Software Engineer, where I built the platform from scratch. The goal of that product was to improve search relevancy so that end users could find their files and uploaded documents much faster. Before that, I worked at Target, where I built a team to deliver a real-time analytics platform. That platform was used internally across Target for reporting and analytics during peak critical periods such as Thanksgiving. The team built both a real-time platform and a batch-processing platform so that files could be processed much faster, in a way that anyone could use without technical knowledge. So that has been my recent experience.

Sure. There are a few statistical approaches we can use to check whether current data has drifted compared to past data. One is regression models, which I have used; beyond that, other statistical tests can be applied to verify drift. Even without dedicated ML tooling, Python libraries such as pandas and SciPy provide the statistical tests needed to check whether there is an actual drift in the data. Another approach is to persist all the data in durable storage such as AWS S3 or HDFS and run batch processing on it. Once the data is cleaned so that anyone can use it, with each field marked with relevant tags, those tags can be fed into ML models to detect dips or shifts in the data. To take a simple example: if a particular product is purchased ten times within a short window, that is not normal customer behavior and looks like possible fraud. In that case we can look at the location statistics coming in with the data and compare them against the past month of that customer's history to see whether the purchases are valid. If there have been no transactions in the past month and the regression model shows this point deviating drastically from the previous history, we can flag the payment as fraudulent and stop processing the transaction altogether.
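
As a hedged illustration of the drift check described in this answer, a two-sample Kolmogorov-Smirnov test from SciPy could be used; the data and the 0.05 threshold below are assumptions for illustration only.

```python
# Minimal drift check: compare this week's transaction amounts against last
# month's with a two-sample KS test. Data and threshold are assumptions.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
baseline = rng.normal(loc=50, scale=10, size=1000)    # "last month" amounts
current = rng.normal(loc=65, scale=12, size=200)      # "this week" amounts

stat, p_value = ks_2samp(baseline, current)
if p_value < 0.05:
    print(f"drift detected (KS={stat:.3f}, p={p_value:.4f})")
else:
    print("no significant drift")
```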

There can be multiple communication protocols; I'll take a few examples I have personally built. The first is REST: I built REST APIs so that each microservice can coordinate with the others through REST endpoints, over HTTP or HTTPS, with an authentication mechanism in front. Once the API call is authenticated, the response can be used to process the data. The second is gRPC, remote procedure calls. Say we need data that lives on a completely different system; REST has practical limits for large downloads, and if the JSON response runs into gigabytes it is not ideal to keep a socket open and keep downloading. In that case gRPC is a better fit. The third is GraphQL. The advantage of GraphQL is that callers don't have to know about the hundreds of dependent microservices. We expose a single endpoint, and internally all the dependent services are routed through that proxy or GraphQL server; once the data is collected, the response goes back to the end user without internal teams having to worry about each individual service or how to authenticate every endpoint. We can use a service key or token-based authentication, for example a JWT, to authenticate at the gateway and then let GraphQL route the queries to the different microservices.
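
A minimal sketch of the REST option described here, with a service token in the Authorization header; the URL, token, and payload are hypothetical.

```python
# Sketch of service-to-service REST communication with a bearer token.
import requests

SERVICE_TOKEN = "example-service-token"    # would come from a secret store

def get_order(order_id: str) -> dict:
    resp = requests.get(
        f"https://orders.internal.example.com/api/v1/orders/{order_id}",
        headers={"Authorization": f"Bearer {SERVICE_TOKEN}"},
        timeout=5,
    )
    resp.raise_for_status()                # surface 4xx/5xx instead of bad data
    return resp.json()
```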

Some common problems come up in processing payments. Take an example: an item is purchased and marked as purchased in one system; the purchase then has to be sent to the bank so we actually collect the money, and the next step is to send a confirmation back to the user. If we don't build idempotency into the system, then when a transaction fails in one microservice we might end up pushing the same data again and again to different services just to make sure we get an acknowledged response back. One approach to avoid that is a unique transaction ID. That ID has to be generated by some mechanism and combined with an authentication header or token, so that if a duplicate transaction enters the system, each microservice knows how to handle it. Based on the unique transaction ID we can simply ignore a duplicate, knowing we have already seen that transaction from another source; maybe the originating system processed part of the payment, crashed, and is now retrying the inventory update to mark the purchase as completed. Without idempotency keyed on the transaction ID, we can't be sure whether a call is genuine or a duplicate. We can use something like a Snowflake ID generator, combined with the service name and a timestamp, to create a transaction ID that is effectively unique; that solves the idempotency problem because we know no duplicate transactions will be processed.
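
A minimal sketch of the idempotency idea in this answer; the in-memory set below stands in for a shared store such as Redis or a database table.

```python
# Sketch: build a unique transaction key and refuse to process it twice.
import time
import uuid

_seen: set[str] = set()

def make_transaction_id(service: str) -> str:
    """Roughly Snowflake-like: service name + millisecond timestamp + random suffix."""
    return f"{service}-{int(time.time() * 1000)}-{uuid.uuid4().hex[:8]}"

def process_once(tx_id: str, handler) -> bool:
    """Run handler exactly once per transaction id; duplicates are ignored."""
    if tx_id in _seen:
        return False                       # duplicate delivery, safely ignored
    _seen.add(tx_id)
    handler()
    return True

tx = make_transaction_id("payments")
process_once(tx, lambda: print("charged card"))
process_once(tx, lambda: print("charged card"))   # second call is a no-op
```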

For any high-throughput system, we have to build the architecture so that even peak critical traffic can be handled and transactions complete reliably. First, put a load balancer in place that can accept the load and redirect it to the individual microservices; it could be a proxy such as NGINX. That way we don't have to worry about how traffic gets redirected as load grows. Second, choose the framework used to build the service-oriented architecture. Say we have to update a particular inventory item hundreds of times a day; we need a REST framework to accept and process those transactions, and in Python there are several options, such as Django REST framework, that can be used to build a high-throughput service. Once the system is running, we need alerting and monitoring in place so that if there is a discrepancy or data drift, the right alerts reach the on-call engineer and issues are handled on time. We also have to make sure that if any transaction fails, there is a retry mechanism that handles it cleanly. Next, build a decoupled system so services don't depend directly on one another; where there are dependencies between services, use a queuing system such as Kafka or a cloud queue, push messages onto it, and let each service process them independently. Finally, once we realize we are hitting the throughput limit of the system, we can scale horizontally or vertically, and that is only straightforward if the system was decoupled in the first place, which is what keeps throughput high across each of the services.
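
A small sketch of the queue-based decoupling described here, using kafka-python; the broker, topic, and payload are assumptions.

```python
# Sketch of decoupling: the API just publishes an event and a separate worker
# consumes it later, so neither side blocks the other.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",    # hypothetical broker
    acks="all",                            # wait for all in-sync replicas
    retries=5,                             # retry transient broker errors
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_inventory_update(sku: str, delta: int) -> None:
    producer.send("inventory-updates", {"sku": sku, "delta": delta})

publish_inventory_update("SKU-123", -1)
producer.flush()                           # make sure the event actually left
```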

Right. When we talk about consistency, we first have to decide whether we need eventual consistency or strong consistency; each has its own drawbacks. Since payment transactions generally need strong consistency, I would prefer strong consistency, though there are cases where eventual consistency works. Take strong consistency first. During an event like Black Friday, we need to know whether an item is available, and if it is, process it quickly and update the payment records across the system. To ensure strong consistency we can use a two-phase commit protocol, so that the overall transaction only commits after each independent system has committed its own part; something like ZooKeeper can be used to track the quorum between systems. The drawback is that the whole transaction can be delayed. If we are okay with a few milliseconds of delay, we can be sure that each step has committed locally and that we received an acknowledgment from each participating system saying its part of the work is done. With eventual consistency we don't have to worry about that coordination. Once a payment transaction is pushed to a messaging system such as Kafka, each service subscribed to the topic consumes the messages, performs its own action, and eventually reports that its task is complete. It is then the server's responsibility to verify that every step completed successfully, and only after that do we update a master record marking the task as complete. If any step fails along the way, it is again the server's responsibility to retrigger the task and make sure it completes within the time window. That is how we maintain consistency across servers and nodes and make sure the payment transaction is handled correctly.
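
A toy sketch of the two-phase-commit flow described here; real systems would use a transaction manager or ZooKeeper for coordination, and the participants below are plain objects purely for illustration.

```python
# Toy two-phase-commit coordinator: all participants must vote "yes" in the
# prepare phase before anyone commits; any "no" vote aborts everywhere.
class Participant:
    def __init__(self, name: str):
        self.name = name

    def prepare(self, tx: dict) -> bool:
        # Validate and stage the change locally; return the vote.
        return tx.get("amount", 0) > 0

    def commit(self, tx: dict) -> None:
        print(f"{self.name}: committed {tx['id']}")

    def rollback(self, tx: dict) -> None:
        print(f"{self.name}: rolled back {tx['id']}")

def two_phase_commit(tx: dict, participants: list[Participant]) -> bool:
    if all(p.prepare(tx) for p in participants):      # phase 1: collect votes
        for p in participants:                        # phase 2: commit everywhere
            p.commit(tx)
        return True
    for p in participants:
        p.rollback(tx)
    return False

two_phase_commit({"id": "tx-1", "amount": 120.0},
                 [Participant("payments-db"), Participant("ledger-db")])
```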

Right. Looking at the code, the main thing missing is that we are not performing any authentication. We read the request and, if it is a POST, we just take an amount; we don't validate the amount at all. So a couple of things are missing. First, is this a valid request from a valid server? If it is an internal server, is there a security key in place? With a security key we would know the call is coming from an internal service, and we should audit that record to confirm the authentication came from an internal source. Second, if the request comes from an external service or an end user, there is currently no way to verify it was actually made by that user. So we are not doing any authentication in the method; we just assume we received some amount and process the payment based on it. Beyond authentication, the header is never checked to decide how the payment should be processed, or whether this is a callback from a previous request carrying a token whose result we now want to retrieve. So I would say authentication is missing, along with logging and auditing. The process-payment method might have some auditing inside it, but it would be good to have an audit entry as soon as we receive the request.
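
The reviewed snippet is not included in this profile, so the following is a hypothetical version of such a handler with the missing pieces the answer points out: a token check, amount validation, and an audit log entry.

```python
# Hypothetical payment handler with the checks the review calls out.
import logging

audit_log = logging.getLogger("payments.audit")
VALID_SERVICE_TOKENS = {"internal-service-token"}     # stand-in for a secret store

def handle_payment(request: dict) -> dict:
    token = request.get("headers", {}).get("X-Service-Token")
    if token not in VALID_SERVICE_TOKENS:
        return {"status": 401, "error": "unauthenticated request"}

    amount = request.get("body", {}).get("amount")
    if not isinstance(amount, (int, float)) or amount <= 0:
        return {"status": 400, "error": "invalid amount"}

    audit_log.info("payment request accepted: amount=%s source=%s",
                   amount, request.get("source", "unknown"))
    # ... process_payment(amount) would be called here ...
    return {"status": 200, "result": "payment accepted"}
```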

Right. Initially the logger is assigned to null. If the logger is only instantiated inside the process-transaction method, then when an exception is thrown in that method the logger may never have been initialized, yet we try to log messages with it. Ideally the logger should be initialized before we even start processing the transaction. If an exception happens inside process-transaction, we certainly want to capture it, but the exception could occur before the logger was instantiated, and attempting to log without an initialized logger will itself throw an error saying the logger was never initialized.
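
A small sketch of the fix described here, with the logger initialized before any processing so that a failure can always be logged.

```python
# Initialize the logger up front, before any transaction processing.
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("payments")     # exists before any work begins

def process_transaction(tx: dict) -> None:
    if tx.get("amount", 0) <= 0:
        raise ValueError("amount must be positive")
    logger.info("processed transaction %s", tx.get("id"))

try:
    process_transaction({"id": "tx-42", "amount": -5})
except Exception:
    # Safe: the logger exists even though processing failed before doing any work.
    logger.exception("transaction failed")
```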

The way I would architect a distributed payment-processing system is this. Assume there is an app, say something like the Airbnb app, and we are capturing every click and user action inside it. Once a click is captured, I would push those events into Kafka or another messaging system, since we want eventual consistency with very high throughput. Each message carries a timestamp along with a record of which token or device it came from, the country and region, and any other required attributes. Once those events land in the messaging system, some number of microservices are built on top of it: inventory, payments, record updates, and so on. Each of these microservices still needs a shared datastore, and in this scenario I would use Cassandra, which is a high write-throughput database. Since we are writing a huge number of transactions per second, we want a very fast write path, and Cassandra works well here: each microservice relies on the same global Cassandra cluster, which is well established and performs well, so we don't have to manage local databases and then push that data back into the system. Then comes the processing side: which framework do we use to process the messages written to Kafka? For stream processing I would use Apache Flink, a real-time processing framework with a built-in Kafka connector that can pull billions of messages per minute or second and process them in real time; Flink can also run analytics on the fly with Flink SQL, which makes it my choice for real-time analytics on this data. For batch processing I would use Spark, which has a very good batch framework, and eventually store everything in HDFS for archival processing. Alongside all of this I would build auditing and alerting, for example PagerDuty plus logging to a cloud logging service. That gives a high-throughput, end-to-end design for the distributed payment pipeline.
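
A hedged sketch of the Cassandra write path mentioned in this answer, using the DataStax cassandra-driver; the host, keyspace, and table are assumptions.

```python
# Sketch of high-throughput event writes to Cassandra with prepared statements.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])           # hypothetical contact point
session = cluster.connect("payments")      # hypothetical keyspace

insert = session.prepare(
    "INSERT INTO click_events (device_id, ts, event_type, region) "
    "VALUES (?, ?, ?, ?)"
)

def store_event(device_id: str, ts: int, event_type: str, region: str) -> None:
    # Prepared statements keep per-write overhead low at high throughput.
    session.execute(insert, (device_id, ts, event_type, region))

store_event("device-123", 1718000000000, "checkout_click", "us-central")
```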

Once we build out an API, it helps to be disciplined about how we use HTTP verbs and resource naming; for example, if an endpoint returns a list, use the plural form of the resource. For rolling upgrades I would use versioning, say a v1 and a v2 of the API. Versioning every current endpoint means we can easily switch back to a previous API if we ever have to roll back to an earlier version. That is one way to make sure version compatibility can be handled cleanly.
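
A minimal sketch of URL-based versioning using Flask blueprints; the endpoints and payloads are hypothetical.

```python
# Sketch of v1/v2 URL versioning with Flask blueprints: old clients keep using
# /api/v1 while new clients move to /api/v2 with a different response shape.
from flask import Flask, Blueprint, jsonify

v1 = Blueprint("v1", __name__, url_prefix="/api/v1")
v2 = Blueprint("v2", __name__, url_prefix="/api/v2")

@v1.route("/orders")
def orders_v1():
    return jsonify([{"id": 1, "total": 42.0}])          # original response shape

@v2.route("/orders")
def orders_v2():
    # v2 changes the response shape without breaking v1 clients.
    return jsonify({"items": [{"id": 1, "total": 42.0}], "count": 1})

app = Flask(__name__)
app.register_blueprint(v1)
app.register_blueprint(v2)
```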

Right. There are a few approaches I would suggest. First, use a callback or a webhook for asynchronous communication in near real time. With a webhook, when we submit the work we indicate that we want the result back as soon as processing completes; the service then pushes the result to an endpoint we expose. With a callback, we register a handler at the time we make the API call, and we know that handler will be invoked when the response is ready. The third approach is to push messages onto a messaging queue: when a message appears on the queue we know the task has finished processing, and based on the transaction ID or another unique ID we can pull back the results. Those are the methods I would use for asynchronous communication.
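
A minimal sketch of the webhook option described here; the URLs and payload are hypothetical.

```python
# Sketch of the webhook pattern: when the long-running job finishes, the worker
# POSTs the result to the callback URL the caller registered.
import requests

def run_job_async(job_id: str, callback_url: str) -> None:
    result = {"job_id": job_id, "status": "completed", "rows_processed": 10_000}
    # Notify the caller instead of making it poll for completion.
    requests.post(callback_url, json=result, timeout=5)

run_job_async("job-7", "https://client.example.com/webhooks/job-complete")
```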