In his dynamic technical journey, Binoy has cultivated remarkable expertise in web development, with a strong focus on Django, Machine Learning (ML), and Artificial Intelligence (AI), leveraging these skills to tackle intricate challenges and drive progress. With more than five years of experience, he has demonstrated proficiency in building robust web applications with the Django framework, implementing cutting-edge ML/AI algorithms, and architecting intelligent solutions across diverse domains. His expertise extends to other web development frameworks and technologies, such as Flask and FastAPI. Additionally, Binoy possesses a deep understanding of ML/AI concepts and techniques, including libraries such as PyTorch, TensorFlow, and Pandas. Renowned for his technical problem-solving prowess and innovative mindset, Binoy continues to make significant contributions to both the web development and ML/AI communities.
Software Engineer II - Abnormal Security
Senior Software Engineer - Urban Piper
Senior Software Engineer - Crest Data Systems
Software Developer - BoTree Technologies PVT. LTD.
Django
Django REST framework
Python
FastAPI
Flask
Docker
Jira
GitHub
GitLab
Terraform
Postman
PostgreSQL
AWS Cloud
AWS (Amazon Web Services)
AWS CloudWatch
AWS Secrets Manager
Amazon CloudFront
GCP Services
Kibana
Confluence
Bamboo
Kubernetes
GoLang
MongoDB
MySQL
REST API
AWS Lambda
Apache Kafka
RabbitMQ
Redis Stack
Atlassian
Git
Google Cloud Platform
AWS
Grafana
Kafka
Pandas
Apache Airflow
PyTorch
TensorFlow
Could you help me understand more about your background by giving a brief introduction of yourself? Okay. So I have 5.5 years of experience as a senior software developer. I have worked across multiple domains such as education, e-commerce, cyber security, fintech, and email security products. I have worked with multiple technologies, including Python, the Django framework, FastAPI, Flask, and a bit of Go, Java, Node.js, and React. I also have a good amount of experience with AWS services such as CloudFront, EC2, EKS with Kubernetes, Docker Hub, Secrets Manager, and with GCP services as well. I have worked with multiple databases, including Postgres, MySQL, and MongoDB, and I have had the chance to work with queue-based mechanisms using Celery, RabbitMQ, and Redis, so I have a good amount of experience there. As for my broader experience, I have also been responsible for leading a team of three to four members from the front, developing the product architecture, and building the design of the product, so I have a good amount of experience in that as well. So yeah, that's a brief introduction about myself.
So, in Python, for a third-party REST API we usually set a rate limit via a rate-limit decorator, with an attribute specifying how many calls are allowed per minute; we say we are going to allow this many requests per minute. As for fallback procedures, once a rate limit is set and a request exceeds it, we handle that with a proper response saying the limit has been reached, with a clear message so the user understands what happened. The other part of the question is how we handle integrating with a third-party REST API. When we consume a third-party API, we usually don't control how many requests it allows or exactly which requests we will be making. So ideally we should establish a structure where we set our own limit slightly below the provider's: if the API allows 100 requests per minute, we should cap ourselves at around 90 requests per minute and set a trigger that warns us when we are approaching the threshold, so we can pace our calls accordingly.
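A minimal sketch of that idea, assuming a hypothetical third-party endpoint and the requests library, with a hand-rolled client-side limiter capped below the provider's quota and a basic 429 backoff:

```python
import time
import requests

# Hypothetical endpoint and quota: the provider allows 100 req/min,
# so we cap ourselves at 90 to stay under the threshold.
API_URL = "https://api.example.com/v1/items"
CALLS_PER_MINUTE = 90

class RateLimiter:
    """Allow at most `calls` requests per `period` seconds (sliding window)."""
    def __init__(self, calls: int, period: float = 60.0):
        self.calls = calls
        self.period = period
        self.timestamps: list[float] = []

    def wait(self) -> None:
        now = time.monotonic()
        # Drop timestamps that have left the sliding window.
        self.timestamps = [t for t in self.timestamps if now - t < self.period]
        if len(self.timestamps) >= self.calls:
            # Sleep until the oldest call falls out of the window.
            time.sleep(self.period - (now - self.timestamps[0]))
        self.timestamps.append(time.monotonic())

limiter = RateLimiter(CALLS_PER_MINUTE)

def fetch_items(page: int) -> dict:
    limiter.wait()
    response = requests.get(API_URL, params={"page": page}, timeout=10)
    if response.status_code == 429:
        # The provider says we exceeded its limit: back off using its hint, retry once.
        retry_after = int(response.headers.get("Retry-After", 60))
        time.sleep(retry_after)
        response = requests.get(API_URL, params={"page": page}, timeout=10)
    response.raise_for_status()
    return response.json()
```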
So, Terraform basically works on config files, and those config files are responsible for defining the state of the infrastructure. The Terraform config describes what infrastructure the application will have; for example, EC2 instances, S3 buckets, and so on are defined in the .tf config files. The ideal setup is to keep the shared configuration in an S3 bucket and reference that bucket from each environment, so whenever an environment is set up it refers to the same Terraform configuration. For multiple environments, instead of copying the config around, we keep it in one common place: for local development we can use it locally while avoiding committing those changes when pushing features, and for the deployed environments we can either keep it as an environment file or keep it in an S3 bucket with restricted access, accessed via the S3 URL only while it is being used. That is a way to avoid conflicts between the multiple environments. The S3 approach is good because the configuration stays common for everyone: whoever wants to access it can do so either locally with restricted permissions or in the environments via their access.
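A rough sketch of the S3 approach, assuming a hypothetical bucket and key holding the shared backend settings, fetched with boto3 before handing off to the Terraform CLI:

```python
import subprocess
import boto3

# Hypothetical bucket/key holding the shared backend settings for this environment.
CONFIG_BUCKET = "example-terraform-config"
CONFIG_KEY = "staging/backend.hcl"
LOCAL_PATH = "backend.hcl"

def init_terraform() -> None:
    # Download the shared, access-restricted backend config from S3...
    s3 = boto3.client("s3")
    s3.download_file(CONFIG_BUCKET, CONFIG_KEY, LOCAL_PATH)
    # ...and initialise Terraform against it, so every environment points at
    # the same remote configuration instead of a locally committed copy.
    subprocess.run(
        ["terraform", "init", f"-backend-config={LOCAL_PATH}"],
        check=True,
    )

if __name__ == "__main__":
    init_terraform()
```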
So here the question is how we would build a Python application and implement unit tests that ensure the API integration points are reliable. With API integrations, the responses are what really matter. When an API call succeeds, it responds with status code 200; a create API returns 201; if there is a problem with the request body, the API returns 400 Bad Request; and if there is an authentication or authorization issue, it returns 401. So the API returns response status codes, and we need to ensure we cover all of those aspects when building an API integration. To verify what we have built, we should write unit tests where we mock the request: with unittest's mock module we can mock the request body, mock the URL, and set the response of the API call, and then check what our function (or view) that does the integration returns when we pass a particular kind of body. In the unit test we prepare an input body, pass it to the call, and execute the request. When the request executes and the response comes back, we compare it with the expected output we have defined, such as the expected response status code and response body, and we assert that the expected output equals the response data we receive, along with the status code. By writing such tests for the different status codes, we ensure that our integration is reliable.
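A minimal sketch of such a test, assuming a hypothetical `fetch_user` helper that wraps a third-party endpoint with requests; `unittest.mock.patch` replaces the HTTP call so both the success and error paths can be asserted:

```python
import unittest
from unittest.mock import Mock, patch

import requests

API_URL = "https://api.example.com/users/{user_id}"  # hypothetical endpoint

def fetch_user(user_id: int) -> dict:
    """Integration point under test: returns the user payload or raises on error statuses."""
    response = requests.get(API_URL.format(user_id=user_id), timeout=10)
    response.raise_for_status()
    return response.json()

class FetchUserTests(unittest.TestCase):
    @patch("requests.get")
    def test_success_returns_payload(self, mock_get):
        # Mock a 200 response; the auto-created raise_for_status() does nothing.
        mock_get.return_value = Mock(status_code=200, json=lambda: {"id": 1, "name": "Ada"})
        self.assertEqual(fetch_user(1), {"id": 1, "name": "Ada"})

    @patch("requests.get")
    def test_client_error_raises(self, mock_get):
        # Mock a 400 response whose raise_for_status() raises, as requests would.
        error_response = Mock(status_code=400)
        error_response.raise_for_status.side_effect = requests.HTTPError("400 Bad Request")
        mock_get.return_value = error_response
        with self.assertRaises(requests.HTTPError):
            fetch_user(0)

if __name__ == "__main__":
    unittest.main()
```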
So, when we are dealing with high-concurrency operations, the ACID properties come in: atomicity, consistency, isolation, and durability. Each of them is responsible for its own guarantee. Atomicity says, for example: suppose we have a bank transfer and a transaction table for bank operations, and in that table we are moving money between savings account A and savings account B. If the money is deducted from savings account A and our process is interrupted in the middle so the money is never credited to savings account B, then we have a loss, or inconsistency, of data. Hence atomicity must be established, which states that the transaction should either succeed completely or fail and roll back; it is all or nothing, and the transaction should never complete only partially. Similarly, consistency, isolation, and durability must hold: the transaction should be durable, so even under high load it should not be disrupted halfway through. To ensure this in Python with Django, we should use the with block, which is a context manager, with transaction.atomic: inside that block we place whatever database statements we want to execute, such as inserts and updates, and this assures us that if any failure occurs inside the block, a rollback happens back to the first statement and any already-executed statements are rolled back. This is what we can rely on with a relational database: the ACID properties are maintained, which covers atomicity, consistency, isolation, and durability without hampering the operation.
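A small sketch of the Django pattern described above, assuming a hypothetical `Account` model inside an already-configured Django project; `select_for_update` adds row locking for the high-concurrency case:

```python
from decimal import Decimal

from django.db import transaction

from myapp.models import Account  # hypothetical model with `number` and `balance` fields

def transfer(from_number: str, to_number: str, amount: Decimal) -> None:
    """Move `amount` between two accounts; either both rows change or neither does."""
    with transaction.atomic():
        # Lock both rows so concurrent transfers cannot interleave (isolation).
        source = Account.objects.select_for_update().get(number=from_number)
        target = Account.objects.select_for_update().get(number=to_number)

        if source.balance < amount:
            # Raising inside the atomic block rolls back everything executed so far.
            raise ValueError("insufficient funds")

        source.balance -= amount
        target.balance += amount
        source.save()
        target.save()
```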
So, for setting up an automated CI/CD pipeline for a Python application, AWS itself provides a deployment service where we can set up a CI/CD pipeline. What does CI/CD mean? Continuous integration and continuous deployment. The idea is that any application is going to have versioning, with feature development and feature-addition tasks happening. When those tasks are done and we are ready to go to production, instead of the manual effort of deploying to production, restarting the ECS services or the instances that run the application, and checking and monitoring everything ourselves, the CI/CD pipeline comes into the picture and makes all of that autonomous. For example, if we have a GitHub repo, we can set up GitHub Actions that trigger the AWS deployment pipeline. How it works is that in the GitHub Actions we define several steps and checks to ensure that when we merge code to the production branch, everything passes: a Python linter such as pylint, the unit tests, correct versioning, and that the code works fine. If all of those checks pass, then at the end of the actions we trigger the AWS deployment pipeline (AWS CodePipeline). For example, with a master-slave architecture it deploys first to the master, the image is built there, and that image is automatically rolled out to the slave instances. That is how a CI/CD pipeline can be put in place for a Python application and keep deployments automated.
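A hedged sketch of the final step only, assuming a hypothetical pipeline name and that the checks (lint, unit tests) have already passed; boto3's CodePipeline client kicks off the AWS deployment:

```python
import boto3

PIPELINE_NAME = "python-app-deploy"  # hypothetical AWS CodePipeline name

def trigger_deployment() -> str:
    """Start the AWS CodePipeline execution once CI checks have passed."""
    client = boto3.client("codepipeline")
    response = client.start_pipeline_execution(name=PIPELINE_NAME)
    # The execution id can be logged by the CI job or polled for status.
    return response["pipelineExecutionId"]

if __name__ == "__main__":
    print(f"Started deployment: {trigger_deployment()}")
```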
So the question says: examine the Python code for automating a finance task; is there a logical error related to the handling of transaction dates, and how would you debug or fix it? Here I can see there is an import of datetime, and a process_transaction function that accepts a transaction, which appears to be a dictionary. We take datetime.date.today() and compare it with the transaction's date: if today is greater than transaction['date'], the status is set to processed and further processing steps run. Here I can see a problem: the transaction date we are accepting is a raw value, likely a string, so we need to convert it to a date object first. datetime.date.today() gives an actual date object, and we are comparing it with the greater-than operator against the raw transaction value. That is not the right way to do it, because a date object should be compared with another date object. The comparison in the if statement is itself wrong, since it is effectively a string comparison, and string comparison is not reliable here. The right way is for both sides to be date (or datetime) objects of Python, so the built-in comparison handles it correctly; otherwise transactions will not be processed correctly unless the values happen to line up. Another issue is the format: we do not know what format the transaction date will arrive in versus the format of today's value, and comparing two different formats is not correct. Ideally we standardise on a single representation: take today as a date object, parse the transaction date from its known format into a date object, and then compare the two.
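A minimal sketch of the fix described above, assuming the transaction dict carries its date as an ISO-formatted string (an assumption, since the original snippet and its date format are not shown):

```python
from datetime import datetime, date

def process_transaction(transaction: dict) -> dict:
    """Mark a transaction as processed only once its date has passed."""
    today = date.today()
    # Parse the incoming string into a real date object before comparing.
    # Assumes an ISO "YYYY-MM-DD" string; adjust the format to whatever the source sends.
    transaction_date = datetime.strptime(transaction["date"], "%Y-%m-%d").date()

    if today > transaction_date:
        transaction["status"] = "processed"
        # ... further processing steps ...
    return transaction

# Example usage:
print(process_transaction({"id": 1, "date": "2024-01-31"}))
```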
In the Python code snippet for Elasticsearch data matching, please explain the usage of the should clause in the query and how it affects the search results. OK, so from elasticsearch we import Elasticsearch and create a client object, and we have built a bool query with a should clause: match title "data", match description "automation", with minimum_should_match set to 1. Then response = es.search on the index "documents" with body equal to the query. So the query has two match clauses, saying we want to match two things: the title should be "data" and the description should be "automation". The should clause says these are the matches to apply when retrieving the data: we look at the title field and the description field and match against them. But minimum_should_match is 1, so at least one of the two must match. For example, if a record's description is "manual" but its title is "data", it still passes and we include that record. But if the title is not "data" and the description is "manual", we do not include it, because neither clause matches. So the clause ensures that at least one of them, either the title or the description, matches.
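A reconstruction of the snippet being discussed (the exact original code is not shown, so the host and the iteration over hits are assumptions):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

query = {
    "query": {
        "bool": {
            "should": [
                {"match": {"title": "data"}},              # optional clause 1
                {"match": {"description": "automation"}},  # optional clause 2
            ],
            # At least one should clause must match for a document to be returned;
            # documents matching both score higher.
            "minimum_should_match": 1,
        }
    }
}

response = es.search(index="documents", body=query)
for hit in response["hits"]["hits"]:
    print(hit["_source"])
```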
So, what approach would I take to build a resilient data pipeline that integrates multiple third-party data sources into a single database schema? When we are getting data from multiple third-party sources, there is a concept called ETL, which stands for extract, transform, load. It says that when we are dealing with multiple sources of data and we have a single schema in our database, the first step is to extract the data from the third-party sources. Say we have three third-party integrations, integration A, integration B, and integration C, all with different schemas. We build an extraction layer that fetches the data from all the sources. Then we have a transformation layer, which is responsible for converting the extracted data into the single database schema we have on our side: we have a target schema and structure, and the transformation layer maps each received source record onto that target schema. Finally, the load layer loads the transformed data into the database. For example, my data might come from a GCP bucket, an S3 bucket, and MongoDB; the buckets carry exports whose schemas we do not control, and MongoDB is a document database, so all three look different. The transformation layer receives this data and uses a prepared JSON-based mapping that maps all the received source fields onto our single database schema. The transformation happens for every source, integration A, B, and C, and then we push the result into the database. This can be a streamlined process, or we can use a queue-based mechanism: one queue receives the data, another queue is responsible for the processing task, which is the transformation, and a third queue handles loading the data. That way we establish a resilient process that handles multiple third-party data sources feeding one database schema.
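A compact sketch of that extract/transform/load flow, with hypothetical source fetchers and a JSON-style field mapping standing in for the real integrations:

```python
from typing import Callable, Iterable

# Hypothetical per-source field mappings: source field name -> target schema field name.
FIELD_MAPPINGS = {
    "integration_a": {"customerName": "customer_name", "amt": "amount"},
    "integration_b": {"client": "customer_name", "total": "amount"},
    "integration_c": {"name": "customer_name", "value": "amount"},
}

def transform(source: str, record: dict) -> dict:
    """Map one raw record from `source` onto the single target schema."""
    mapping = FIELD_MAPPINGS[source]
    return {target: record[field] for field, target in mapping.items()}

def load(rows: Iterable[dict]) -> None:
    """Placeholder loader; in practice this would insert into the target table."""
    for row in rows:
        print("INSERT", row)

def run_pipeline(extractors: dict[str, Callable[[], list[dict]]]) -> None:
    for source, extract in extractors.items():                 # extract
        raw = extract()
        load(transform(source, record) for record in raw)      # transform + load

# Example usage with stubbed extractors:
run_pipeline({
    "integration_a": lambda: [{"customerName": "Ada", "amt": 10}],
    "integration_b": lambda: [{"client": "Bob", "total": 20}],
})
```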
Okay, how would I automate data extraction from invoices using NLP in Python while ensuring high accuracy across diverse formats? When we have to deal with NLP in Python and we want accuracy across diverse formats, the main thing we can rely on here is a queue mechanism. On the queue we can do pagination-based extraction: whenever we are getting the data, we ensure we are pulling it from the data sources through the queue in batches. Batching means we do not pull everything at once; we pull the data batch by batch using an offset, and using the offset together with batches helps keep the accuracy high, because we are not spending a long time on a single request to get all the data. We make multiple requests, batch by batch, get each batch's data, and make sure we handle the diverse formats as we go. Overall this is again the ETL (extract, transform, load) process, which converts the data from the diverse formats into the required target format.
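A rough sketch of the batch-by-batch idea, with a hypothetical `fetch_invoices` source and a placeholder where the actual NLP extraction step would go:

```python
BATCH_SIZE = 50  # pull invoices in small batches rather than all at once

SAMPLE_INVOICES = [f"Invoice #{i} total 100.00" for i in range(120)]  # stand-in data

def fetch_invoices(offset: int, limit: int) -> list[str]:
    """Hypothetical source: returns up to `limit` raw invoice texts starting at `offset`."""
    return SAMPLE_INVOICES[offset:offset + limit]

def extract_fields(invoice_text: str) -> dict:
    """Placeholder for the real NLP step (e.g. a NER model pulling totals, vendors, dates)."""
    return {"raw": invoice_text}

def run_extraction() -> list[dict]:
    results = []
    offset = 0
    while True:
        batch = fetch_invoices(offset, BATCH_SIZE)
        if not batch:
            break  # no more invoices to process
        results.extend(extract_fields(text) for text in batch)
        offset += BATCH_SIZE
    return results

print(len(run_extraction()))
```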
So, the question is about methods to diagnose queries that have become too slow in a Python application. This is a common issue when we are using AWS RDS for Postgres and the queries issued from Python become slow. Why do they become slow? Often we are using a framework that does object-relational mapping, so ORM queries. ORM queries are high-level queries that are converted to SQL internally, and while the ORM query looks fairly simple, it can result in a very complex SQL query. Joins are placed when we try to join data, filters pull data from multiple tables or apply many clauses, and we might also be grouping. All of that goes into the query execution, so a query can take a long time when we have a huge amount of data. For example, if a table has a million records, say a million employee records or transaction records, and we want to filter the transaction data by customer name, then without an index on customer name the query will scan each and every record until it finds the matches, and in the worst case it takes a very long time. So the first and foremost thing is indexing: we can create an index on the customer name column, which builds a tree structure for the transaction table and ensures that the data is looked up very quickly, improving the performance of queries executed on the database. The other thing is to establish monitoring on the RDS instance: with CloudWatch logs we can set up metrics and alerts that show which query is taking how long, and based on that we know which query to improve. We can optimise it either by adding an index or by splitting it into separate queries and executing them differently. There are multiple options, but the most important point is to monitor the queries being executed against the RDS Postgres instance, track how long each one takes, and, for example, raise an alert if a query takes more than 10 seconds, so we know which queries are slow and need improvement.
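A small sketch of the diagnosis side, assuming a hypothetical transactions table and psycopg2 connection details; EXPLAIN ANALYZE shows whether the filter is doing a sequential scan, and the index targets the filtered column:

```python
import psycopg2

# Hypothetical connection details and table/column names.
conn = psycopg2.connect(host="my-rds-host", dbname="finance", user="app", password="example")

SLOW_QUERY = "SELECT * FROM transactions WHERE customer_name = %s"

with conn, conn.cursor() as cur:
    # Inspect the plan: a sequential scan over millions of rows is the usual culprit.
    cur.execute("EXPLAIN ANALYZE " + SLOW_QUERY, ("Acme Corp",))
    for (line,) in cur.fetchall():
        print(line)

    # Add an index on the filtered column so Postgres can use an index scan instead.
    cur.execute(
        "CREATE INDEX IF NOT EXISTS idx_transactions_customer_name "
        "ON transactions (customer_name)"
    )

conn.close()
```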