I have over 8 years of experience in Python development. I have worked primarily with the Django and Flask frameworks to build scalable web applications and deploy them to the cloud.
I am confident in my ability to take on complex projects and provide innovative solutions that meet the needs of clients.
Senior Software Engineer
Miratech Pvt Ltd
Senior Backend Engineer
Apisero Integration Pvt Ltd
Senior Software Engineer
Infosys Ltd
Python
Pyspark
AWS (Amazon Web Services)
Apache Airflow
Snowflake
MySQL
Docker
Kubernetes
Django
Flask
Athena
Azure Databricks
Tableau
AWS S3
Azure Blob Storage
AWS Glue
Tensorflow
Pandas
AWS RDS
AWS Fargate
Kafka
Jenkins
AWS Elastic Beanstalk
Django Rest Framework
CI/CD
Azure App Service
Databricks
AWS EMR
JDBC
React
HTML
CSS
Javascript
My team and I were tasked with identifying a viable alternative to the pre-existing Index data platform, which was built mostly on Perl and a SAP Sybase database along with several antiquated in-house tools.
Could you walk me through your background?

I'm Kisla. I have around 8.5 years of experience as a Python data engineer, and I've worked with back-end as well as some front-end technologies. Basically, I'm in charge of building data pipelines from scratch: I create cloud-native data pipelines and have worked with various data warehouses like Snowflake, Amazon Redshift, and, on occasion, IBM's cloud warehouse. Most of my work centers around creating, managing, and enhancing data pipelines. In my current project, my day-to-day activity includes building a new data port and suggesting enhancements to the existing architecture, so I'm constantly working with the enterprise architects to design a new, more robust pipeline. My actual role is Senior Software Engineer in the data department. That's a brief introduction; I hope I'll get a chance to explain it further.
What would be your strategy for migrating an existing ETL process from an in-house cluster to BigQuery?

My strategy for migrating an existing ETL solution would be to use GCS (Google Cloud Storage) as the data landing zone and an orchestrator like Apache Airflow to extract data from the source, which in our case is the GCS bucket the data lands in. From there I'd write the transformation scripts manually and run them on Google Dataproc, which I think is the most suitable tool here, so Dataproc handles the data processing. As the final ETL step I'd move the data into the BigQuery data warehouse, again scheduled through Apache Airflow. Once the data lands in BigQuery, we can easily analyze it and create visualizations using Tableau, Looker, or any other dashboarding tool. I'd also have to consider load balancing, which depends on the case, but that would be my broad approach.
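A minimal Airflow DAG sketch of the pipeline described above (PySpark transform on Dataproc, then load into BigQuery). The project, bucket, cluster, and table names are placeholders, not details from the original answer.

```python
# Hypothetical sketch: GCS landing zone -> PySpark transform on Dataproc -> load into BigQuery.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

PROJECT_ID = "my-project"   # placeholder
REGION = "us-central1"      # placeholder

with DAG(
    dag_id="gcs_dataproc_bigquery_migration",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Run the PySpark transformation script against files in the GCS landing zone.
    transform = DataprocSubmitJobOperator(
        task_id="transform_on_dataproc",
        project_id=PROJECT_ID,
        region=REGION,
        job={
            "placement": {"cluster_name": "etl-cluster"},
            "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/transform.py"},
        },
    )

    # Load the transformed Parquet output into the BigQuery warehouse.
    load = GCSToBigQueryOperator(
        task_id="load_to_bigquery",
        bucket="my-bucket",
        source_objects=["curated/*.parquet"],
        destination_project_dataset_table=f"{PROJECT_ID}.analytics.curated_table",
        source_format="PARQUET",
        write_disposition="WRITE_TRUNCATE",
    )

    transform >> load
```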
How do you use Python to develop complex ETL workflows involving multiple data sources and targets?

In my previous projects we used the PySpark API in Python to integrate various sources of data. With Apache Airflow as our scheduler and orchestrator we have a variety of data operators to work with: Airflow lets us combine plain Python code with its built-in operators, which makes integrating data very easy. Along with that, we can run our Python or PySpark code snippets on top of Dataproc clusters; Dataproc is basically Google's managed Hadoop service, comparable to AWS EMR. So multiple data sources can be handled very well using Python and the PySpark API, running over Dataproc clusters.
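A small PySpark sketch of the multi-source integration idea above, joining object-storage files with a relational table pulled over JDBC. The paths, connection details, and column names are illustrative assumptions.

```python
# Hypothetical sketch: combine two data sources with the PySpark API.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("multi_source_integration").getOrCreate()

# Source 1: Parquet files already landed in cloud object storage.
orders = spark.read.parquet("gs://my-bucket/landing/orders/")

# Source 2: a relational table pulled over JDBC.
customers = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://db-host:3306/sales")
    .option("dbtable", "customers")
    .option("user", "etl_user")
    .option("password", "etl_password")
    .load()
)

# Join, and write the enriched result back for the next pipeline stage.
enriched = orders.join(customers, on="customer_id", how="left")
enriched.write.mode("overwrite").parquet("gs://my-bucket/curated/orders_enriched/")
```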
How would you build a scalable ETL pipeline using Hadoop and Hive?

A scalable ETL pipeline here starts from the source system, which can be Hadoop cluster storage or Hive tables. In a legacy system we could process that with MapReduce, but in modern setups we use Spark and the PySpark APIs. PySpark processing is in-memory processing, so it is much faster than Hadoop's MapReduce paradigm. Processing large datasets comes down to writing PySpark programs: creating a SparkSession and then applying transformations through the DataFrame and RDD APIs. Big data processing is genuinely simple with PySpark, and the familiar DataFrame, Dataset, and RDD structures let us build transformation pipelines. A Spark cluster would be more than enough to handle large Hadoop and Hive datasets, so a scalable ETL pipeline would be built around Spark processing, with Dataproc as the managed service.
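A short sketch of that pattern, assuming the cluster exposes Hive tables; database, table, and column names are placeholders.

```python
# Minimal sketch: read a Hive table, transform it in-memory with Spark, write it back.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("hive_etl")
    .enableHiveSupport()  # lets Spark read the existing Hive tables on the cluster
    .getOrCreate()
)

# Extract: a large Hive table distributed across the Hadoop cluster.
events = spark.table("raw_db.click_events")

# Transform: DataFrame operations run in memory instead of as MapReduce jobs.
daily_counts = (
    events.filter(F.col("event_type") == "purchase")
    .groupBy("event_date", "country")
    .agg(F.count("*").alias("purchases"))
)

# Load: write the aggregate back as a partitioned table for downstream consumers.
daily_counts.write.mode("overwrite").partitionBy("event_date").saveAsTable("curated_db.daily_purchases")
```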
How would you optimize data storage in a relational database for a data-intensive application?

For a data-intensive application it has generally been observed that a columnar, compressed approach works better than row-based storage. So the first thing I would do is convert the data to a columnar format like Parquet or ORC, and my second step would be to compress the data using Snappy or another compression codec. I had a third optimization in mind, but on second thought I don't think it would work in most cases, so those two would be my answer.
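A two-line illustration of those two steps (columnar format plus Snappy compression); the file paths are placeholders, and pandas with pyarrow is assumed to be available.

```python
# Hypothetical sketch: convert a row-oriented export to compressed columnar storage.
import pandas as pd

# Assume the data currently lives in a CSV export from the relational database.
df = pd.read_csv("exports/transactions.csv")

# Step 1: columnar format (Parquet). Step 2: Snappy compression (needs pyarrow installed).
df.to_parquet("optimized/transactions.parquet", compression="snappy", index=False)
```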
What tools have you used to implement real-time data processing into a BigQuery environment?

For real-time streaming, the AWS counterpart is AWS Kinesis, which is quite similar to Apache Kafka. I'm forgetting the exact product name of the GCP counterpart, but it's GCP's streaming service, which is essentially Cloud Pub/Sub: like Kafka, it collects topics from several producers and relays those topics to the subscribers, so it is a cloud pub/sub model. We can use it to move real-time data between streaming apps within the GCP platform, stream the data into the BigQuery environment, and query it in near real time. That would be my answer.
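A minimal sketch of that flow under the Pub/Sub assumption: consume messages from a subscription and stream each one into BigQuery. Project, subscription, and table IDs are placeholders.

```python
# Hypothetical sketch: Pub/Sub subscriber that streams incoming records into BigQuery.
import json

from google.cloud import bigquery, pubsub_v1

PROJECT_ID = "my-project"                     # placeholder
SUBSCRIPTION = "orders-sub"                   # placeholder
TABLE_ID = "my-project.analytics.orders_raw"  # placeholder

bq_client = bigquery.Client(project=PROJECT_ID)
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION)

def handle_message(message):
    row = json.loads(message.data.decode("utf-8"))
    errors = bq_client.insert_rows_json(TABLE_ID, [row])  # BigQuery streaming insert
    if errors:
        message.nack()  # let Pub/Sub redeliver on failure
    else:
        message.ack()

# Blocks and processes messages as they arrive.
streaming_pull = subscriber.subscribe(subscription_path, callback=handle_message)
streaming_pull.result()
```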
Why might this code not function as expected?

Looking closely at the stream_data function, we open the cursor, but if the condition is not satisfied, that is, if the row we retrieve is None, the function exits early and we never get to close the cursor object. The most obvious fix would be to use a context manager there (a with block around the cursor), or alternatively to restructure the function with try/except/finally: the try block holds all the execution steps, the except block catches the exceptions, and the finally block runs regardless of whether we hit an exception. So in the suggested solution we'd put the cursor creation and the query in the try block, the exception handling in the except block, and cursor.close() in the finally block, so the cursor is always closed. That would be my suggested approach.
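The original snippet isn't reproduced here, so this is a hypothetical reconstruction of the fix described in the answer, using sqlite3 and a placeholder table name.

```python
# Sketch of the try/except/finally fix: the cursor is closed even on early exit or error.
import sqlite3

def stream_data(conn: sqlite3.Connection):
    cursor = conn.cursor()
    rows = []
    try:
        cursor.execute("SELECT id, payload FROM events")  # placeholder query
        row = cursor.fetchone()
        while row is not None:  # the early return on None used to skip the cleanup
            rows.append(row)
            row = cursor.fetchone()
    except sqlite3.Error as exc:
        print(f"query failed: {exc}")
    finally:
        cursor.close()  # runs whether or not an exception occurred
    return rows
```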
Coming to this one, I can see that we are raising an exception inside an except block. My first thought was that it could potentially loop, but it isn't really a loop; I just don't see the utility of raising a new exception inside the except block here, so I think that custom-exception piece is the flawed part. If we remove it, then the rest of the code looks okay to me. Thanks.
Tell me about a complex data model you've designed and how you improved it in a large-scale data environment.

In my current project I'm in charge of creating data models. I'm working for a financial asset management company (actually the biggest asset manager in the world), and I build the data schemas, or data models, for the various incoming indexes. By indexes I mean entities that have securities as sub-entities: NSE, BSE, Nasdaq, MSCI, and so on. The securities we receive are already mapped to a public identifier provided by the vendor, and what the client wanted was a data model in which we internally map that vendor-supplied public identifier to our own private enterprise identifiers, which we call QZIPs. So I had to create a logical and very exhaustive mapping between the incoming public identifiers and those internal identifiers, and I'm also in charge of several data transformations on top of that. The data model involved is very complex because there are many moving parts and we have to manage each of them: for the Brazilian and Latin American countries, for example, we have an index called NBMA that is very different from those of the Asian markets, so creating a data model that is uniform across all our client countries is exhaustive and very difficult to implement. I could explain it further if given the chance.
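A small, hypothetical sketch of what one record in such an identifier-mapping model might look like; the field and sample values are illustrative only and not taken from the actual system.

```python
# Illustrative shape of the public-to-internal identifier mapping described above.
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class IdentifierMapping:
    index_code: str            # e.g. "NSE", "BSE", "NASDAQ", "MSCI"
    public_identifier: str     # identifier supplied by the vendor feed
    internal_identifier: str   # internal enterprise identifier
    effective_from: date       # mappings change over time, so keep validity windows
    effective_to: Optional[date] = None

# A pipeline step would upsert rows like this into the mapping table:
mapping = IdentifierMapping("NSE", "VENDOR-12345", "INT-98765", date(2024, 1, 1))
```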
Given your expertise in Java, how would you use its concurrency features to build a robust solution?

Java's concurrency, unlike Python's, gives real parallelism that allows us to use multiple processor cores at once. In Python we have the concept of the global interpreter lock, which we don't have in Java, so in Java we can effectively run a program on multiple cores and use its concurrency features for real-time work. That said, I don't have much hands-on experience with Java; I have some theoretical background but haven't used it much in practice, so I wouldn't be able to give a very detailed answer to this question.
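Since the answer leans on the Python GIL comparison, here is a small Python sketch of the usual workaround: CPU-bound work scales across cores with processes rather than threads. The function and numbers are illustrative.

```python
# Sketch: bypass the GIL for CPU-bound work by using separate processes.
from multiprocessing import Pool

def cpu_heavy(n: int) -> int:
    # Deliberately CPU-bound: sum of squares up to n.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    with Pool(processes=4) as pool:  # each worker process has its own interpreter and GIL
        results = pool.map(cpu_heavy, [10_000_000] * 4)
    print(results)
```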
When integrating a Python-based ETL process with Airflow, what would you do to ensure reliability and scalability?

To handle the reliability and scalability concerns in a cloud-based environment, I would first use application load balancers so that no particular node gets overloaded with data processing. I have integrated Python-based ETL processes with Airflow in the past, and my main concern would be choosing the appropriate Airflow operator so that the data is handled reliably. Beyond that, my task would be to ensure the infrastructure is highly scalable and, if possible, serverless. By serverless I mean we are not concerned with provisioning the infrastructure ourselves; the underlying cloud service provisions it for us as and when it is needed, so when the data flow reaches a particular threshold we automatically get new capacity. That is widely available on GCP as well as on AWS, so I would use those load-balancing and autoscaling services extensively. Thanks.
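A minimal sketch of the reliability side in Airflow terms: retries, timeouts, and a cap on concurrent runs are the first-line controls on a Python ETL task. DAG and task names are placeholders.

```python
# Hypothetical DAG showing basic reliability settings for a Python ETL step.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_and_load():
    # Placeholder for the actual Python ETL logic.
    pass

default_args = {
    "retries": 3,                             # re-run transient failures automatically
    "retry_delay": timedelta(minutes=5),
    "execution_timeout": timedelta(hours=1),  # fail fast instead of hanging a worker
}

with DAG(
    dag_id="python_etl_with_reliability_controls",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
    default_args=default_args,
    max_active_runs=1,  # avoid overlapping runs overloading the workers
) as dag:
    PythonOperator(task_id="extract_and_load", python_callable=extract_and_load)
```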