
He has more than six years of experience working with Python, AWS services (S3, EC2, AWS Lambda, Kinesis, DynamoDB, RDS, CloudFormation, SQS, SNS, CloudWatch, IAM, API Gateway, AWS Config, etc.), deployment services, and security practices. He also has strong experience with the SFMC API, Salesforce Marketing Cloud, and DevOps tools such as Jenkins, Docker, Git, and Terraform.
He is a problem solver who designs algorithms that use technology to enhance human life.
Member of Technical Staff, Advanced Micro Devices
Cloud Solutions Architect, Cloudtech India
Software System Design Engineer, Amazon Web Services
Software Developer Intern, WiseL Keycard Technologies Pvt. Ltd.
Software Engineer, ConsultAdd Services Private LTD
Git
Python
PostgreSQL
AWS (Amazon Web Services)
https://www.linkedin.com/in/subodh-dubey/details/recommendations/
I have around five-plus years of experience working as a Python cloud engineer. During this time I have mainly worked with Python frameworks such as Flask and Django, and with both SQL and NoSQL databases. On the SQL side I have mostly worked with Oracle SQL and MySQL; on the NoSQL side, mostly DynamoDB and also MongoDB. I have experience with both clouds, AWS and GCP. In AWS I have earned five certifications: Solutions Architect Professional, Solutions Architect Associate, Developer Associate, Security Specialty, and Database Specialty. On the containerization side, I have worked a lot with Docker and Docker Compose, and in AWS with ECS and ECR for storing container images. For DevOps tooling, I have worked extensively with Git and Git-based pipelines, and with CI/CD tools such as Jenkins, GitHub Actions, and GitLab CI. So I have experience taking an application from scratch all the way to production deployment. Recently I have been working mostly on design in AWS, and I have gained a lot of experience building solutions there on a serverless stack, so I have also worked a lot with CDK, CloudFormation, and Terraform for deploying entire infrastructures to the cloud environment.
To ensure atomic transactions in a Python script executing SQL operations, we need to make sure the operations either apply fully or not at all; a script that applies its changes partially, or applies them twice, can corrupt the data behind it, so it is worth adding conditional checks before executing. We can ensure atomicity by leveraging the capabilities of the database management system together with the Python libraries used to interact with it, such as SQLAlchemy, an ORM that works with SQLite, SQL Server, MySQL, and other databases. The approach is to use database transactions, which let you group multiple SQL operations into a single unit: you begin a transaction, run your SQL statements inside it, and then commit, so either all of the statements take effect or none of them do; if anything fails in between, the transaction is rolled back. This is how we can ensure atomic transactions in a Python script.
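For example, a minimal sketch with SQLAlchemy (the SQLite file, table, and column names are just placeholders for illustration):

```python
from sqlalchemy import create_engine, text

# Hypothetical SQLite database; any SQLAlchemy-supported engine works the same way.
engine = create_engine("sqlite:///example.db")

# engine.begin() opens a transaction and commits on success; if any statement
# raises, the whole block is rolled back, so the two updates either both
# apply or neither does.
with engine.begin() as conn:
    conn.execute(text("UPDATE accounts SET balance = balance - 100 WHERE id = :src"), {"src": 1})
    conn.execute(text("UPDATE accounts SET balance = balance + 100 WHERE id = :dst"), {"dst": 2})
```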
To use Python to automate the deployment of AWS infrastructure for ETL purposes, there are multiple ways to achieve this. One good way would be to use the CDK directly, because the CDK generates CloudFormation templates in the background. The process would be to write CDK files, which act as infrastructure as code for us. Say I want to build a serverless stack with a Lambda function: I import the Lambda construct into a Python file and declare a function with Python as the runtime, 128 MB of memory, and a timeout of, say, 20 seconds. I can define all the infrastructure needed for the ETL this way and then deploy it with one command, cdk deploy. Another way would be to use Terraform or CloudFormation directly, writing the template files and deploying them; those files should describe the ETL infrastructure, such as S3 buckets, IAM roles, an EC2 instance if one is used for transformation, and the compute services like Lambda functions and Step Functions. All of these need to be included in the Terraform, CloudFormation, or CDK templates. A third way is to use the AWS SDK for Python, boto3, and create resources such as Step Functions state machines and DynamoDB tables from it directly. All of these options can be used to automate deployment of the ETL infrastructure.
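A rough sketch of the CDK approach in Python, assuming CDK v2; the stack name, handler path, and asset directory are illustrative:

```python
from aws_cdk import App, Stack, Duration
from aws_cdk import aws_lambda as _lambda
from constructs import Construct

class EtlStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Hypothetical ETL Lambda: Python runtime, 128 MB memory, 20-second timeout.
        _lambda.Function(
            self, "EtlFunction",
            runtime=_lambda.Runtime.PYTHON_3_11,
            handler="handler.main",
            code=_lambda.Code.from_asset("lambda"),
            memory_size=128,
            timeout=Duration.seconds(20),
        )

app = App()
EtlStack(app, "EtlStack")
app.synth()
```

Running `cdk deploy` then synthesizes the CloudFormation template and provisions the stack.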
To orchestrate multiple ETL jobs in AWS while ensuring data consistency, there are several options. The first is to use AWS Glue directly, which is a managed ETL tool. The second is to build custom pipelines. The third is to leverage Amazon Managed Workflows for Apache Airflow, which is also a good ETL tool where we define directed acyclic graphs (DAGs) made up of tasks that perform the ETL operations. First we define those ETL tasks and their sequence; that is the starting point. If we go with custom pipelines, this can be achieved with Step Functions, where each task is a Lambda function, and we can add steps in between, for example writing backups or archive data to S3 buckets, as states inside the state machine. With the managed approach that AWS Glue provides, covering extraction, transformation, and loading, we define a Glue job for each ETL task, specifying the source data, the transformation to apply, and the target destination. Airflow follows the same idea, with each task defined as a node in the DAG, and it also gives us a way to replay data: if a step fails we can inspect the logs, troubleshoot it, and rerun just that step instead of throwing away the whole run. This is how it can be achieved.
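As a sketch of the Airflow route (assuming Airflow 2.4+; the DAG id, schedule, and the empty extract/transform/load callables are placeholders for real ETL logic):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical step implementations; in practice these would call the real
# ETL logic (S3 reads, Glue jobs, Lambda invocations, etc.).
def extract(): ...
def transform(): ...
def load(): ...

with DAG(
    dag_id="etl_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # The ordering defines the DAG; a failed task can be retried or cleared
    # and rerun from the Airflow UI without rerunning the whole pipeline.
    t_extract >> t_transform >> t_load
```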
To troubleshoot an unsuccessful API data integration in a Python ETL process, the first thing I would do is check the logs and look at the error we are getting. If the error comes from our own code and we have mapped specific error codes, say 404 for not found, 401 for an authentication failure, and 500 for a server error, we can review those logs and see where the problem is actually coming from. So the first step is to review the error logs; if CloudWatch is integrated, I would look at those logs first and check the specific error codes, descriptions, and stack traces for what went wrong. Then I would go through the API documentation to see how the API is defined: whether I am calling the correct endpoint, passing the correct query string parameters, and, for a POST request, whether the body is in the expected format. The third thing I would check is API access: if I am calling an API without the required permission, say read access, I would verify from the logs and error codes whether I am being authenticated and authorized for that API, and also check where the API is defined and which authentication scheme it uses, whether an API key, an access token, or credentials. I would also test the API connectivity with a tool like Postman and verify whether the requests succeed; this helps isolate the problem before testing in the production environment. Finally, I would check the response data and how it comes back; for example, if the API sends JSON and I am trying to parse it as plain text, that would also create a problem. These are the things I would troubleshoot to find what is making the API call fail.
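A small sketch of how that diagnostic logging could look with the requests library (the URL, token, and query parameters are hypothetical):

```python
import logging
import requests

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("etl.api")

# Hypothetical endpoint and token; the goal is to surface the status code and
# response body so a failure can be classified (401 auth, 404 wrong endpoint,
# 5xx server side) instead of failing silently.
url = "https://api.example.com/v1/records"
headers = {"Authorization": "Bearer <token>", "Accept": "application/json"}

try:
    resp = requests.get(url, headers=headers, params={"page": 1}, timeout=10)
    logger.info("Status: %s", resp.status_code)
    resp.raise_for_status()          # raises HTTPError on 4xx/5xx
    data = resp.json()               # fails loudly if the body is not valid JSON
except requests.exceptions.HTTPError as err:
    logger.error("HTTP error: %s, body: %s", err, resp.text[:500])
except ValueError as err:
    logger.error("Response was not valid JSON: %s", err)
except requests.exceptions.RequestException as err:
    logger.error("Connection or timeout problem: %s", err)
```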
To handle exceptions in a Python script for data transformation, I would start by writing try/except blocks, because that lets me catch the specific errors that might come out of the script. Inside the try block I put the code that might raise an error, and in the except block I catch that specific exception and handle it accordingly. I would also identify the potential failure points, the areas of the data transformation code where exceptions are likely to occur: I/O operations, data parsing, database interactions, API calls, and custom data processing logic, and wrap those in try/except blocks. As I said, I would catch only the specific exceptions: for example, if a transformation step performs an arithmetic operation, I would catch ArithmeticError in the except block. That way we handle the precise errors rather than everything at once. I would also log those errors properly and have a proper fallback mechanism, that is, define what the next step should be when a given error occurs.
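A minimal sketch of that pattern; the function and its default value are made up for illustration:

```python
import logging

logger = logging.getLogger("etl.transform")

def safe_ratio(numerator, denominator, default=0.0):
    """Example transformation step with targeted exception handling."""
    try:
        return numerator / denominator
    except ZeroDivisionError:
        # Specific arithmetic failure: log it and fall back to a default value.
        logger.warning("Division by zero for %r / %r, using default", numerator, denominator)
        return default
    except TypeError as err:
        # Bad input types (e.g. a string that slipped through parsing).
        logger.error("Non-numeric input: %s", err)
        raise  # re-raise so the pipeline can decide whether to skip the record
```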
Here the caller is using pandas to read CSV data from large_dataset.csv and then calls optimize_query on that dataset. Inside the function, the author is attempting to filter the DataFrame: rows where column A is greater than 100 and column B is smaller than 200, and then copying the result into a result variable. I can see two problems. First, the columns are accessed with incorrect syntax, for example referring to the wrong DataFrame name instead of the function's own parameter, so the filter itself has to be fixed. Second, the DataFrame is copied unnecessarily: the .copy() call duplicates every matching row, which is a real performance bottleneck and adds overhead when we are dealing with a large dataset. Applying the filter conditions already creates an intermediate boolean mask that consumes additional memory, and the copy comes on top of that, so we can also face memory pressure depending on the size of the DataFrame. The better way is to return the filtered DataFrame directly, column A greater than 100 and column B smaller than 200, without calling .copy(). That avoids the unnecessary copy operation and resolves the performance bottleneck.
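A reconstruction of what the corrected function could look like, assuming the columns are literally named "A" and "B" as described:

```python
import pandas as pd

def optimize_query(df: pd.DataFrame) -> pd.DataFrame:
    # Boolean filter on both columns; each condition needs its own parentheses,
    # and no .copy() is made, so no extra full copy of the matching rows
    # is allocated.
    return df[(df["A"] > 100) & (df["B"] < 200)]

# Usage: read the CSV once and filter it.
df = pd.read_csv("large_dataset.csv")
result = optimize_query(df)
```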
The techniques I would use in Python to ensure efficient manipulation of large DataFrames: since we are dealing with a large amount of data, we want to filter early, avoid copying data unnecessarily, and keep the operations we push to the database server minimal; filtering row by row would be costly at this scale. We can use pandas chunking when dealing with large DataFrames, defining a chunk size while reading a file, say a CSV, or a SQL query, and then processing those chunks one at a time. We can also select only the columns we need: if we only want specific columns, we specify them when querying the database rather than pulling the whole dataset out and filtering it later inside pandas. That saves filtering work and a lot of the memory cost of fetching the data. Then we can optimize the column dtypes, using appropriate integer, float, and category types to reduce memory usage; for numeric data we can convert whole columns at once with pd.to_numeric instead of handling each row, and for low-cardinality columns we can use the category dtype, or sparse data structures where pandas supports them. While performing operations we can apply parallel processing, issuing the database or API calls concurrently instead of sequentially, which fetches the data in much less time. We should use the groupby and pivot_table functions that pandas provides instead of iterating over rows, and generally avoid explicit iteration, which also keeps memory usage under control during transformation.
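A short sketch combining chunked reading, column selection, and dtype optimization (the file and column names are placeholders):

```python
import pandas as pd

# Read only the columns we need, in chunks, instead of loading the whole file.
chunks = pd.read_csv(
    "large_dataset.csv",               # hypothetical file
    usecols=["A", "B", "category"],
    chunksize=100_000,
)

filtered = []
for chunk in chunks:
    # Downcast numerics and use the 'category' dtype to cut memory usage.
    chunk["A"] = pd.to_numeric(chunk["A"], downcast="integer")
    chunk["B"] = pd.to_numeric(chunk["B"], downcast="float")
    chunk["category"] = chunk["category"].astype("category")
    # Filter early so only the rows we need are kept in memory.
    filtered.append(chunk[chunk["A"] > 100])

result = pd.concat(filtered, ignore_index=True)
```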
To architect a cloud-based ETL solution that is resilient to data schema changes over time: this is genuinely complex, because the schema keeps changing, so we cannot rely on anything static that we define up front and then reuse unchanged for data transformation. We need a modular architecture: define the ETL solution so that each stage of the pipeline, extraction, transformation, and loading, is a discrete component. That allows easier maintenance and updates when the schema changes. The second thing I would do is handle schema evolution gracefully, using techniques such as schema inference, schema-on-read, and schema versioning, so the pipeline adapts to changes in the data structure without requiring immediate modifications. I would also put metadata management in place: depending on the incoming data, we need to maintain the data schema and track how it changes over time, so the mapping between old and new schemas must be stored and managed properly; for example, if a column used to be an integer and is now a float, that mapping needs to live somewhere. I would add data validation and quality checks as well, because the data is changing more frequently here and needs to be validated before it reaches the production environment. The transformation itself needs to be flexible rather than fixed, because the schema is flexible; for this we can use tools such as Apache Spark or AWS Glue, which support dynamic schema resolution during transformation. With custom pipelines we have to handle it ourselves, for example with a Lambda step in between that deals with the schema changes, but with built-in ETL tools like Spark and Glue, dynamic schema resolution covers it. We can also use versioned data stores to further minimize this overhead.
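As a very small illustration of the schema-versioning idea in a custom pipeline, here is a hypothetical mapping table that normalizes each incoming schema version onto one target schema (the version numbers, column names, and types are invented):

```python
import pandas as pd

# Hypothetical schema registry: per-version column renames and target dtypes.
SCHEMA_VERSIONS = {
    1: {"renames": {}, "types": {"amount": "int64"}},
    2: {"renames": {"amount_usd": "amount"}, "types": {"amount": "float64"}},
}

def normalize(df: pd.DataFrame, version: int) -> pd.DataFrame:
    """Map an incoming frame onto the current target schema."""
    spec = SCHEMA_VERSIONS[version]
    df = df.rename(columns=spec["renames"])
    # Cast to the target types so downstream steps always see one stable
    # schema, even as the source schema evolves over time.
    return df.astype(spec["types"])
```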
I would use AWS Cognito for securing a Python ETL system when I am dealing with APIs for the ETL operations. Those APIs can be authenticated using Cognito, and I can define there who has access to them. Cognito offers multiple options for authentication: SAML federation, OAuth-based flows, and it can handle authorization as well. The first use is authentication for the data source: if the ETL system needs access to data that requires authentication, we can use an AWS-managed service like Cognito to authenticate the ETL system before it accesses the data source. The second use is authentication for API access: if the ETL system exposes APIs for data ingestion and extraction, we can use Cognito to authorize access to those APIs, ensuring that only authenticated users reach them. We can also manage users and data consumers in the Cognito environment, control access with groups and access policies, and attach policies to specific groups of users or to specific endpoints if API Gateway sits in front. Cognito also integrates easily with other AWS services such as Lambda, S3, and DynamoDB. The steps I would take: create a user pool in the AWS Management Console, define the user attributes and password policies, integrate authentication via SAML or OAuth as I said, and then attach the whole Cognito setup either to the API or to the data source we are using.
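A rough sketch of the user-pool setup using boto3's Cognito Identity Provider client; the pool name, client name, region, and password policy values are assumptions:

```python
import boto3

cognito = boto3.client("cognito-idp", region_name="us-east-1")  # region is an assumption

# 1. Create a user pool with a password policy for the ETL API's consumers.
pool = cognito.create_user_pool(
    PoolName="etl-api-users",                                   # hypothetical name
    Policies={"PasswordPolicy": {"MinimumLength": 12, "RequireSymbols": True}},
)

# 2. Create an app client that the ETL system / data consumers authenticate against.
client = cognito.create_user_pool_client(
    UserPoolId=pool["UserPool"]["Id"],
    ClientName="etl-api-client",
    GenerateSecret=False,
)

# The resulting pool can then be attached as a Cognito authorizer on the
# API Gateway endpoints that front the ETL system.
print(pool["UserPool"]["Id"], client["UserPoolClient"]["ClientId"])
```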
To create data visualizations in a cloud environment, we can use Streamlit. It is a framework that can analyze data and build interactive charts, visualizing whatever data is coming in. First we make sure the cloud environment is set up to serve the Streamlit app: we can deploy it on a container platform such as Docker on Fargate, or install Streamlit in another environment and pull the data from the cloud to visualize it. Installing it is just a matter of using the Python package manager, pip install streamlit. The easiest route is an EC2 instance running Streamlit. Another option is to package the dependencies in a Lambda layer, which acts as a shared package container for the Lambda functions, and serve the visualization through an API call, so we send data to that API and get the rendered result back. We can then build interactive visualizations with Streamlit. The main things to consider are whether to use virtual machines, a containerized environment, or a serverless stack for the deployment.
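A minimal Streamlit sketch of what such a dashboard could look like; the CSV path and column names are placeholders, and the data could just as well come from S3 or a database:

```python
# app.py -- run with: streamlit run app.py
import pandas as pd
import streamlit as st

st.title("ETL Output Dashboard")

# Load data produced by the pipeline (hypothetical file).
df = pd.read_csv("etl_output.csv")

st.dataframe(df.head(50))                      # interactive table of the first rows
st.line_chart(df.set_index("date")["value"])   # simple time-series chart
```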