
I have over 8 years of experience in Python development. I have worked primarily with the Django and Flask frameworks to create scalable web applications and deploy them on the cloud.
I am confident in my ability to take on complex projects and provide innovative solutions that meet the needs of clients.
Senior Python Developer - Data Science & Engineering, RDSolutions
Senior Backend Engineer, Miratech
Senior Lead Engineer, Apisero Integration
Senior System Engineer, Infosys
Python
Pyspark
AWS (Amazon Web Services)

Apache Airflow
Snowflake

MySQL
Docker

Kubernetes

Django
Flask

Athena

Airflow

Azure Databricks

Tableau

AWS S3

AWS

Azure Blob Storage

Azure Databricks

Tableau

AWS Glue

Tableau

Tensorflow

Pandas

Tableau

AWS RDS

AWS Fargate

Kafka
Jenkins

AWS Elastic Beanstalk

Django Rest Framework

CI/CD

AWS S3

Azure App Service

Databricks

AWS S3

AWS EMR

AWS RDS

Kafka

Azure App Service

JDBC

React

HTML

CSS

Pandas

AWS S3

Javascript
My team and I were tasked with identifying a viable alternative to the pre-existing Index data platform (mostly Perl and an SAP Sybase database, along with several antiquated in-house tools).
I'm Kisla. I have around eight and a half years of experience as a Python data engineer, and I've worked with both back-end and front-end technologies. I'm in charge of creating cloud-native data pipelines from scratch, and I've worked with various data warehouses like Snowflake, Amazon Redshift, and, on certain occasions, IBM Cloud Warehouse. Most of my work centers on creating, managing, and enhancing data pipelines. My day-to-day activity includes creating new data pipelines and enhancing, or suggesting enhancements to, the existing architecture, so I'm constantly working with the enterprise architects to design a new, more robust pipeline system. My actual role is that of a senior software engineer in the data department. So, yeah, that's a brief introduction of myself. I hope I'll get a chance to explain it further.
What's the strategy? My strategy for migrating an existing ETL process from an existing cluster to BigQuery would be as follows. I'll use GCS storage as my data landing zone, and I'll use an orchestrator like Apache Airflow to extract data from the source we are using, which is GCS storage. From there, I'll extract and transform the data. I'll write the transformation script and run it on Google Dataproc, which is the most suitable tool for this. I'll handle data processing using Dataproc, and then, as the final ETL step, I'll move the data to my BigQuery data warehouse using the Apache Airflow scheduler. Once the data lands in BigQuery, I can easily analyze it and create visualizations using Tableau, Looker, or any other visualization dashboard. I have to consider load balancing as well, but that depends on the case. That's the broad approach.
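The ordering of the steps in this answer can be sketched as a small dependency graph. This is a minimal illustration using only the standard library, not Airflow's API; the task names are hypothetical labels for the stages described above.

```python
from graphlib import TopologicalSorter

# Hypothetical task names for the GCS -> Dataproc -> BigQuery flow.
# Each key maps to the set of tasks that must finish before it starts.
PIPELINE = {
    "land_in_gcs": set(),
    "extract_from_gcs": {"land_in_gcs"},
    "transform_on_dataproc": {"extract_from_gcs"},
    "load_to_bigquery": {"transform_on_dataproc"},
    "refresh_dashboard": {"load_to_bigquery"},
}


def execution_order(pipeline):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(pipeline).static_order())


for task in execution_order(PIPELINE):
    print(task)
```

An orchestrator does the same kind of ordering, and adds scheduling, retries, and monitoring on top.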
Python is used to develop complex, detailed workflows involving multiple data sources and targets. In my previous projects, we used PySpark, the Python API for Spark, to integrate various sources of data. Apache Airflow is our scheduler and orchestrator, and it supports a variety of data operators. Airflow allows us to use Python code along with some built-in Airflow operators, and they make it very easy to integrate data. Along with that, we can run our Python code or PySpark code snippets on top of Dataproc clusters. Dataproc is a managed Hadoop service provided by Google, similar to Amazon EMR. I think multiple data sources can be handled very well using Python APIs or PySpark, and the code can be run on Dataproc clusters.
Building a scalable ETL pipeline using Hadoop involves using Hadoop storage or Hive tables as the source system. We can process them using MapReduce. In modern cases, we use Spark and the PySpark APIs for processing. PySpark processing is in-memory, so it is much faster than Hadoop's MapReduce paradigm. Processing large datasets can be achieved by writing PySpark programs: creating a Spark context and performing calculations and transformations with the DataFrame and RDD APIs. So big data processing is quite simple using PySpark. We can use a familiar DataFrame, Dataset, or RDD structure to create transformation pipelines. Actually, it will be priced per cluster to handle big Hadoop and Hive datasets. I think a scalable ETL pipeline would involve using PySpark processing on Dataproc as a service.
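For readers unfamiliar with the MapReduce paradigm mentioned here, the three phases can be sketched in plain Python. This is only an in-process illustration of the idea, not Hadoop or PySpark code; the function names are made up for the example.

```python
from collections import defaultdict


def map_phase(records):
    """Emit a (word, 1) pair for every word, as a mapper would."""
    for line in records:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(records):
    """Chain the three phases into a word count."""
    return reduce_phase(shuffle_phase(map_phase(records)))


print(word_count(["spark is fast", "Spark is in memory"]))
```

In classic Hadoop MapReduce the intermediate results between phases are written to disk, which is the main reason in-memory engines such as Spark tend to be faster for multi-step jobs.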
How would you optimize data storage in a relational database for a data-intensive application? The first thing I'll keep in mind is that, for data-intensive applications, it has generally been observed that a columnar, compressed approach is better than row-based storage. So I would optimize my data storage firstly by converting my data to formats like Parquet or ORC, and then by compressing the data using Snappy or another compression algorithm. My second step would be to use these two approaches as far as possible. My third... actually, on second thought, I think the third option wouldn't work in most cases. So I think these two would be all.
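The row-versus-column idea can be illustrated with a small standard-library sketch. It is not Parquet or ORC: JSON stands in for the file format and `zlib` stands in for Snappy, and the sample records are invented for the example. It assumes every row has the same keys.

```python
import json
import zlib


def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def to_rows(columns):
    """Rebuild the row dicts from the column lists."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


def compressed_size(obj):
    """Size in bytes of the JSON-serialised object after zlib compression."""
    return len(zlib.compress(json.dumps(obj).encode("utf-8")))


# Invented sample data with repetitive column values.
rows = [
    {"id": i, "country": "BR" if i % 2 else "IN", "status": "active"}
    for i in range(1000)
]
columns = to_columns(rows)
print("row layout:   ", compressed_size(rows), "bytes")
print("column layout:", compressed_size(columns), "bytes")
```

Grouping similar values together generally helps a compressor, but the actual sizes depend on the data, so the printed numbers should be read as an illustration rather than a benchmark.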
What would you use to implement real-time data processing in a BigQuery environment? For real-time streaming, the AWS counterpart is AWS Kinesis, which is quite similar to the Apache Kafka tool. I'm forgetting the name of the GCP counterpart to AWS Kinesis, but I think it is GCP streaming. Basically, it is similar to Apache Kafka: it collects topics from several producers and relays those topics to the subscribers. So it is a cloud Pub/Sub model. We can simply use it to transfer real-time data between streaming apps within the GCP platform. We can transfer our streaming data record by record to the BigQuery environment and query it in real time. So that would be my answer.
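The publish/subscribe model described in this answer can be sketched in memory as follows. This is not the client API of any cloud service (GCP's managed messaging service is Cloud Pub/Sub); the `Broker` class, the topic name, and the list standing in for a warehouse table are all hypothetical.

```python
from collections import defaultdict


class Broker:
    """In-memory stand-in for a publish/subscribe service."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        """Register a callback to receive every message on the topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Deliver the message to each subscriber; return how many received it."""
        for callback in self._subscribers[topic]:
            callback(message)
        return len(self._subscribers[topic])


warehouse_rows = []  # hypothetical stand-in for a warehouse table
broker = Broker()
broker.subscribe("orders", warehouse_rows.append)
broker.publish("orders", {"order_id": 1, "amount": 25.0})
print(warehouse_rows)
```

Producers only know the topic name, not who consumes it, which is what lets a streaming sink such as a warehouse loader be added without touching the producers.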
Why might the code not function properly? If we look closely at the stream data function, we are opening the cursor. But if the condition is not satisfied and the row we are retrieving is actually None, the whole process breaks: it exits the function, and we are never able to close the cursor object. I think the most obvious solution would be to use a context manager with the cursor object. The second option would be to write the function with try, except, and finally. The try block would include all our execution steps, the except block would catch our exceptions, and the finally block would execute regardless of whether we encountered an exception or not. So in the suggested solution, we create the cursor object and run the query in the try block, handle the exception in the except block, and call cursor.close() in the finally block so that the cursor object is always closed.
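The code under review is not reproduced here, so the following is only a sketch of the two fixes described, written against `sqlite3` from the standard library. The function names, table, and error types are assumptions made for the example.

```python
import sqlite3
from contextlib import closing


def fetch_first_row(conn, query):
    """try/except/finally version: the cursor is closed on every exit path."""
    cursor = conn.cursor()
    try:
        cursor.execute(query)
        row = cursor.fetchone()
        if row is None:
            raise LookupError("query returned no rows")
        return row
    except sqlite3.Error as exc:
        # Wrap database errors; the finally block still runs afterwards.
        raise RuntimeError(f"query failed: {query}") from exc
    finally:
        cursor.close()


def fetch_first_row_ctx(conn, query):
    """Context-manager version: closing() calls cursor.close() on exit."""
    with closing(conn.cursor()) as cursor:
        cursor.execute(query)
        return cursor.fetchone()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, name TEXT)")
conn.execute("INSERT INTO events VALUES (1, 'start')")
print(fetch_first_row(conn, "SELECT id, name FROM events"))
```

`contextlib.closing` is used because a `sqlite3` cursor is not itself a context manager; other database drivers may differ.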
Coming to this one, I can see that we are raising an exception inside an except block. So it can potentially go into an infinite loop, if I'm not wrong. Well, not an infinite loop, but I don't see any utility in raising an exception inside an except block. So I think the whole concept of raising exceptions within the except block is flawed. If we remove that issue, then I think the code looks okay. Yeah, then I'm pretty sure the code is okay. Thanks.
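As a point of reference for this answer, the snippet below shows what Python does when an exception is raised inside an `except` block. The code being reviewed is not shown, so `PipelineError` and `parse_amount` are hypothetical names used only for illustration.

```python
class PipelineError(Exception):
    """Hypothetical application-level error."""


def parse_amount(text):
    """Convert text to a float, translating the low-level error on failure."""
    try:
        return float(text)
    except ValueError as exc:
        # The new exception leaves this handler and propagates to the caller.
        # It is not re-caught by the same try statement, so nothing loops.
        raise PipelineError(f"bad amount: {text!r}") from exc


try:
    parse_amount("abc")
except PipelineError as err:
    print(err, "| caused by:", type(err.__cause__).__name__)
```

With `raise ... from exc`, the original exception is kept on `__cause__`, so the traceback shows both errors.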
Complex data models you've designed, and how to improve them in a large-scale data environment. In my current project, I'm in charge of creating data models. I'm currently working for a financial asset management company, which is the biggest asset manager in the world, and I'm in charge of creating data schemas or data models for various incoming indexes. By indexes, I mean entities that have sub-entities; examples would be NSE, BSE, Nasdaq, MSCI, or any other index. The securities we receive are already mapped to a public identifier provided by the vendor. What my client told me is that they want a data model in which we internally map that vendor-provided public identifier. So I have to create a logical and very exhaustive mapping between the incoming public identifiers and the internal private identifiers, which we refer to as QZIPs in our language. I'm in charge of creating those mappings, and I'm also in charge of creating several data transformations. I think the data model involved here is very complex because we have many different moving parts, and we have to manage each one of them. For example, for Brazil and other Latin American countries, we have an index called NBMA, which is very different from those of the Asian markets. So creating a data model that is uniform across all our client countries is very exhaustive and very difficult to implement. I would explain it further if given a chance.
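The kind of mapping described here, from a vendor's public identifier to an internal one, might look roughly like the sketch below. The class name, the ID format, and the sample identifiers are all invented for illustration and do not reflect the actual system.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class IdentifierMap:
    """Maps (vendor, public_id) pairs to internal identifiers (hypothetical)."""

    _ids: dict = field(default_factory=dict)
    _counter: count = field(default_factory=lambda: count(1))

    def internal_id(self, vendor, public_id):
        """Return the existing internal ID for the pair, or allocate a new one."""
        key = (vendor, public_id)
        if key not in self._ids:
            self._ids[key] = f"INT{next(self._counter):06d}"
        return self._ids[key]


mapping = IdentifierMap()
first = mapping.internal_id("MSCI", "ABC123")
again = mapping.internal_id("MSCI", "ABC123")
other = mapping.internal_id("Nasdaq", "ABC123")
print(first, again, other)
```

Keying on the vendor as well as the public identifier keeps two vendors that reuse the same code from colliding, which is one of the "moving parts" a uniform model has to absorb.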
Java's concurrency, unlike Python's, is real concurrency that allows us to use multiple processor cores at once. In Python, we have the concept of the global interpreter lock, which we don't have in Java. So in Java we can effectively run the program on multiple cores, and thereby we can use Java's concurrency features for real-time work. Actually, I don't have much experience with Java, but I have some theoretical background in it. I've not used much Java in practice, so I don't think I can give a very detailed answer to that question.
On ensuring reliability and scalability when integrating a Python-based ETL process in Airflow: I would first handle the scalability concerns by using application load balancers. I would ensure that no particular node is overloaded with data processing, so I would distribute the data processing to handle the load. There's some disturbance at my end. Actually, I have integrated Python-based ETL processes with Airflow in the past, and my main concern would be to use the appropriate Airflow operator to ensure reliable performance. I want to use an operator that allows us to handle the data effectively. My task would be to ensure that all the infrastructure is highly scalable and, if possible, serverless. By serverless, I mean that we are not concerned with provisioning infrastructure for the data. The underlying cloud service takes care of provisioning the infrastructure for us as and when it is needed. When the data flow reaches a particular threshold, we automatically get a new piece of infrastructure. That is widely available in the GCP cloud as well as the AWS cloud. So I would use the application load balancer service extensively. Yeah. Thanks.