
Technical Lead / Integration Consultant, Fidelity Management & Research
Principal Software Engineer, Dell International
Senior Oracle Developer, Capital One Bank
Senior Oracle Developer, PMAM IT Services
Senior Software Engineer, CGI Information Systems & Management Consultants
Senior Oracle Developer, Union Bank
Oracle PL/SQL Developer, AREVA Transport & Distribution
Oracle Forms & Reports Developer, InterDigital Inc
TOAD

SQL Developer

Control-M
Informatica

AWS
Jenkins
Jira
Snowflake

Anypoint Studio

Python

Athena

Bitbucket

AWS Cloud

Yeah, let me briefly introduce myself. I have about 12 years of experience working with Oracle database applications and development. In my recent projects I've played versatile roles working with Oracle PL/SQL, ETL with Informatica, and AWS, which includes EC2, S3, Athena, Spark SQL, and Snowflake SQL. I've used various version control tools, and I've worked on both OLTP and OLAP applications. I worked for a few years on Oracle ERP, and I started my career with Oracle Forms and Reports for about 5 years. My domain knowledge includes insurance, transport and distribution, and telecom, and for about 6 years I've worked on BFSI projects with banking and finance institutions. Coming to my current and recent projects: Tech Systems was the payroll company, and I worked with them for almost 4 years. I had 2 projects through them, Fidelity Management & Research and Atlassian, both 18-month contracts. I was released recently, so I've been actively looking out for my next project. For the last 3 to 4 years I've been entirely on the cloud side, working mainly with Athena, Spark SQL, and various AWS cloud-based applications, querying mainly with Snowflake and Spark queries, and of course some PL/SQL coding as well. Apart from that, I've used Handymani and Anypoint Studio for monitoring, MuleSoft for integration, and whatever other third-party tools the company has. So that's a brief introduction about myself. Thank you.
How might you leverage Azure's features to construct robust error handling within a data pipeline? Yeah, for error handling we need to have exceptions in place, whether it is APIs, batch jobs, or the queries we are writing. For instance, if you have a pipeline loading or transforming data from source to target using either an ETL or ELT process, then before loading the data, have a mechanism or exception handler so the batch job or ETL job doesn't fail outright. Rather, it has to capture the missed or errored-out records and process the remaining ones. Let's say we have a million records and some 100-odd records have failed; you don't want to fail the entire process, but to proceed and load all of the good data, have an exception path where those 100 records are stored somewhere, in a table or in a file, log those transactions into a log table, and notify the team, the producer, or the consumer that these records have to be reprocessed, or manually intervened on, fixed, and reprocessed. That is one way of handling errors. The other is setting up alerts or monitoring flags: if there is some discrepancy, or a long-running job consuming more time than expected, then you may have to kill the session or reprocess it, hold the predecessor jobs, and rerun them. So, likewise.
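A minimal sketch of the capture-and-continue pattern described above, in Python, assuming an in-memory batch and a hypothetical notify_team hook in place of a real alerting integration:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("etl_load")

def notify_team(count):
    # Placeholder for an email/SNS/Slack alert to the producer or consumer team.
    logger.warning("%d records need manual review and reprocessing", count)

def load_batch(records, transform, load):
    """Process a batch; quarantine bad records instead of failing the whole job."""
    failed = []
    loaded = 0
    for rec in records:
        try:
            load(transform(rec))
            loaded += 1
        except Exception as exc:
            # Capture the errored-out record and keep processing the rest.
            failed.append({"record": rec, "error": str(exc)})
            logger.error("Failed record %r: %s", rec, exc)
    if failed:
        notify_team(len(failed))
    return loaded, failed

# Illustrative run: one record fails, the other three still load.
good, bad = load_batch(
    records=[1, 2, "x", 4],
    transform=lambda r: int(r) * 10,
    load=lambda r: None,
)
print(good, bad)
```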
Given the asynchronous nature of data, what is your approach to rolling back a transaction when operations fail? Yeah, rollback. If the transaction has failed, there are different ways to roll back. Like I said before, if you have 1,000 records being processed by a job and, say, 50 records have failed or there is a discrepancy, such as wildcard characters or missing data, then either you roll back the entire transaction for all 1,000 records, or you process the remaining records successfully and log in a transaction table that the failed records need to be addressed through a manual process. You then go back to those failed records, either through a defect or some other means of validating them, and reprocess them. The other consideration is that if you do roll back, you have to ensure data is not lost: the data in transit should be saved somewhere, because the source has already sent the data, in a file or some other format, and they may not still have it on their side. So the moment the data is in transit, keep a copy or archive the file or the data until all the records are processed successfully. These are a couple of steps to ensure failed transactions or operations are handled in a better manner.
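A rough illustration of the row-level rollback idea using SQLite savepoints (the table names and schema are made up for the example); a real pipeline would apply the same pattern against Oracle or Postgres:

```python
import sqlite3

def load_with_row_level_rollback(conn, rows):
    """Insert rows in one transaction; roll back only a failing row via savepoints."""
    cur = conn.cursor()
    cur.execute("BEGIN")
    failed = []
    for row in rows:
        cur.execute("SAVEPOINT current_row")
        try:
            cur.execute("INSERT INTO target(id, name) VALUES (?, ?)", row)
        except sqlite3.Error as exc:
            # Undo only this row; the rest of the batch is kept.
            cur.execute("ROLLBACK TO SAVEPOINT current_row")
            failed.append((repr(row), str(exc)))
        cur.execute("RELEASE SAVEPOINT current_row")
    # Log the failures so they can be fixed and reprocessed manually.
    cur.executemany("INSERT INTO error_log(payload, error) VALUES (?, ?)", failed)
    cur.execute("COMMIT")
    return failed

# Illustrative schema; isolation_level=None lets us manage the transaction explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE target(id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE error_log(payload TEXT, error TEXT)")
print(load_with_row_level_rollback(conn, [(1, "a"), (1, "duplicate id"), (2, "b")]))
```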
Outline a strategy for reducing latency and optimizing performance when dealing with large-scale data. Yeah, reducing latency and optimizing performance. Latency happens when, say, data is coming from source to target and there are 2,000,000 records being processed; yesterday it was performing well, today it is consuming more time. So how do we optimize in this scenario? The best practice would be to break the data down into chunks, or ensure the source is divided. Say there are party, product, and customer applications, a whole number of applications; instead of processing all of that data together, bifurcate them and have a dedicated source-to-target load for each, on different nodes or different streams, running all of the source loads in parallel to the target or end tables. Have intermediate staging areas for each of these and transform the data there. You might not need 100% of the data coming in from the source, so validate or filter the data before loading the target: intermediate staging areas for all the source applications, transform the data, then load it into the final target table. That is one approach. Reducing latency also depends on the process we are following to load the data from source to target: either dumping the data directly and then transforming and loading (ELT), or transforming before loading to the target (ETL), processing the data in the middle using, say, AWS Glue or Informatica, which I have used. So do the filtering and data cleansing in the middle, so the target tables hold precisely the data they need. Optimizing performance is a different thing, and there we have different ways. Check how well the SELECT clause is written, check the joins, check the WHERE clause, check the GROUP BY functions and how the subqueries are written. Try to use indexes where possible, or fine-tune the indexes. Check whether full table scans are happening on all the tables. Join on key constraints, like surrogate keys or other keys. Hints, of course, play a valid role; try using hints and see if the query performs better. And ensure that any large table is not referenced repetitively in the query, for example by using a WITH clause, precomputation, or processing in intervals (quarterly, monthly, yearly); that would definitely optimize or boost query performance.
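A small sketch of the chunk-and-parallelize approach, assuming each source's load can run independently; load_chunk_to_staging here is a stand-in for the real staging load (a COPY, an S3 put, or similar):

```python
from concurrent.futures import ThreadPoolExecutor

def chunks(rows, size):
    """Yield fixed-size chunks so one huge batch never blocks the pipeline."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

def load_chunk_to_staging(source_name, chunk):
    # Placeholder for the real per-source staging load.
    return f"{source_name}: loaded {len(chunk)} rows"

def parallel_load(sources, chunk_size=50_000, workers=4):
    """Load each source's data to its own staging area in parallel chunks."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(load_chunk_to_staging, name, chunk)
            for name, rows in sources.items()
            for chunk in chunks(rows, chunk_size)
        ]
        return [f.result() for f in futures]

# Illustrative run with fake data for three source applications.
sources = {
    "party": list(range(120_000)),
    "product": list(range(80_000)),
    "customer": list(range(30_000)),
}
print(parallel_load(sources))
```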
How did you implement automated deployment of code updates in your environment? Yeah, definitely we would go with Git, Bitbucket, and Jenkins or Bamboo for CI/CD deployment. First, have your code in a repository in Bitbucket, then clone it and push your latest changes back to Bitbucket using Git, for security reasons. Then you do the validation or approvals for that code; someone has to review it, of course. After that, you first deploy it to a staging area or UAT environment and run a dry run using recent prod data to ensure that whatever changes you have made are working fine, or do a regression or smoke test on the code changes. Then you deploy using Jenkins or whatever other DevOps tools are in place, like Bamboo. There are a number of ways to deploy, or you use different version control tools to automate and schedule the process. There are different scheduling tools for that: AWS Airflow, Control-M jobs, and Autosys can automate this process. For the deployment scripts and queries, once they are in the repository, fetch them and keep the specific latest versions in place, so the job picks the latest version of the code that has been deployed. Since the code is already deployed in UAT or the other staging areas, we'll have the latest version there. In the next environment we will of course be running some batch jobs, so we have wrapper scripts that pick the latest code in place depending on the latest date and timestamp, and only that code is picked up and deployed automatically by the scheduler we keep in place.
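A possible shape for the wrapper-script logic mentioned above, picking the latest release artifact by timestamp; the directory layout and file pattern are assumptions for illustration:

```python
from pathlib import Path

def latest_release(release_dir: str, pattern: str = "*.sql") -> Path:
    """Return the most recently modified release artifact in the directory."""
    candidates = sorted(
        Path(release_dir).glob(pattern),
        key=lambda p: p.stat().st_mtime,   # latest date/timestamp wins
        reverse=True,
    )
    if not candidates:
        raise FileNotFoundError(f"no artifacts matching {pattern} in {release_dir}")
    return candidates[0]

# Example: the scheduled batch job would execute whatever latest_release() returns.
# print(latest_release("/releases/uat"))
```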
So would you use shell scripts to automate the failure process or the replication setup? Yeah. Usually, in a shell script, to automate around a failure: let's say you're running a process with an input file coming in, either CSV or pipe-delimited, and you process it. Because of a bad file or discard file, some of the processing may not have happened, but you still process the remaining records so the business doesn't stop. You have a few records that failed, but you still complete the job, keep that failed data in place, and reprocess those records manually. Replication setup is a different thing. Either you're using Oracle GoldenGate, or materialized views, or a virtual data warehouse to dump the tables, or expdp/impdp export/import of all the tables or schemas into one place. If you're automating an entire box, like a schema or a database, set up the whole export/import process in one go. If you have 30 tables in a database and you're trying to replicate only 20 of them, then those specific tables have to be replicated table to table, source to target directly, or using some kind of CTAS method, or using a materialized view to compute the result. The end consumers could even be Salesforce or Workday, so you don't want the entire data set sitting there; you just want to cleanse the data and retrieve only a portion of it, so you have queries written in place using views or materialized views, and only that data is picked up and replicated. So that is one part. How do we automate this? There are different ways: you can run batch jobs or schedule the process, and set up alerts or APIs to run these jobs automatically, handling them depending on whether the transaction was successfully processed or failed. Depending on that, we can take over or have a set of operations in place to deal with it and fine-tune the automation: is the data just sitting in the end tables, or do you want to process it further to different targets, and who are the consumers? There are different ways the automation can be streamlined.
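A simple sketch of the ingest-with-discard-file idea in Python; the delimiter, expected column count, and file paths are assumptions, and the point is that a handful of bad rows never stops the job:

```python
import csv

def ingest_file(in_path, good_path, discard_path, expected_cols=3):
    """Load rows from a delimited file; route malformed rows to a discard file."""
    good, bad = 0, 0
    with open(in_path, newline="") as src, \
         open(good_path, "w", newline="") as ok_out, \
         open(discard_path, "w", newline="") as bad_out:
        reader = csv.reader(src, delimiter="|")
        ok_writer = csv.writer(ok_out, delimiter="|")
        bad_writer = csv.writer(bad_out, delimiter="|")
        for row in reader:
            # Simplistic validity check: right column count, no empty fields.
            if len(row) == expected_cols and all(field.strip() for field in row):
                ok_writer.writerow(row)      # business keeps moving
                good += 1
            else:
                bad_writer.writerow(row)     # quarantined for manual reprocessing
                bad += 1
    return good, bad

# Usage (paths are illustrative):
# loaded, rejected = ingest_file("inbound.dat", "clean.dat", "discard.dat")
```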
I have a Python snippet that is meant to filter active users, but something is keeping it from running properly. Could you detect the issue, like, what's wrong? Okay. The data is a list of users: id 1 with a name, id 2 is Bob, id 3 is Cathy. The active users are 1 and 3. Then filtered_users is a filter with a lambda over data, and the list of filtered_users is printed. Yeah, in filtered_users, on the fourth line, the lambda is the issue: it should directly take u and check u's id against active_users, like lambda u: u["id"] in active_users. That is the fix I would make. With that, it picks only the records with IDs 1 and 3, the active users, and since only 1 and 3 are active, only those records from data are returned.
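Reconstructing the snippet from the discussion above (the field names are approximate and the first user's name is not legible in the transcript), the fix described would look something like this:

```python
data = [
    {"id": 1, "name": "Alice"},   # first user's name assumed for illustration
    {"id": 2, "name": "Bob"},
    {"id": 3, "name": "Cathy"},
]
active_users = [1, 3]

# The lambda must take the user dict and test its id against active_users.
filtered_users = filter(lambda u: u["id"] in active_users, data)
print(list(filtered_users))   # [{'id': 1, 'name': 'Alice'}, {'id': 3, 'name': 'Cathy'}]
```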
This PostgreSQL PL/pgSQL function has a subtle mistake that results in the function not returning data; can you show how to fix it? Okay, it's a PL/pgSQL function: CREATE OR REPLACE FUNCTION returning a table of user id (integer), user name (text), and user status (text), AS ... BEGIN, a query selecting from the users table where the status equals active, END, LANGUAGE plpgsql. Yeah, so the RETURN QUERY should be there at the end, followed by a semicolon. The SELECT of the id, name, and status from the users table is fine, but if we do not have the RETURN QUERY statement, the function will not return anything; because it returns a table, which is already defined in the function's declaration, the RETURN QUERY statement should go just before the END.
How can an Azure-based data platform be leveraged to facilitate real-time data synchronization between Postgres and Redis? Okay, an Azure-based data platform leveraged to facilitate real-time data synchronization. Yeah, real-time synchronization between Postgres and the Redis cache, I'm not sure much about how we would do that. I'm sorry.
How do you tackle the challenge of seamlessly updating a live data pipeline with zero downtime, considering Postgres data constraints and Azure cloud services? Seamlessly updating a live data pipeline, yeah. There are different things. If you're using a constant data pipeline to update, say, from source to target, where the target table is T2 and the source is a CSV or pipe-delimited file, then as and when the data comes in, you process it constantly. When it says zero downtime, you'll obviously have some kind of staging area, or you create a stage where the data is pumped in as and when the files are produced. Let's say in the morning you have a file with 50 records already in process; those 50 records are being written continuously from the CSV or pipe-delimited file into the staging area. Between the staging and the target, a batch job runs continuously, or a scheduled pipeline is set up to ingest the data as and when it lands in stage. So if a file comes in at 9 AM, staging picks up those 50 changed records; staging may still hold, say, 200 existing records depending on the refresh. The existing pipeline, because ingestion is on and it is constantly looking for data, sees that the latest records in staging are those 50, picks them up, and processes them to the target table. Two hours later, new data flows in with some 300-odd records, which are transformed and loaded into the staging area, and the pipeline, constantly looking for the latest data, picks up only those 300 records and processes them to the final target tables from staging. So this is one way of achieving zero downtime, constantly piping data from source flat files to the target through an intermediate process. The target could be anything: PostgreSQL, MongoDB, or a Snowflake database. This would be the common best practice to load the data constantly and seamlessly.
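A toy sketch of the staged, watermark-driven ingestion described above, using in-memory structures in place of real staging and target tables:

```python
from datetime import datetime, timezone

# Hypothetical in-memory stand-ins for a staging area and the live target table.
staging = []                      # rows shaped like {"id": ..., "loaded_at": datetime}
target = {}
last_watermark = datetime.min.replace(tzinfo=timezone.utc)

def incremental_merge():
    """Merge only the rows that landed in staging since the last run (no full reload)."""
    global last_watermark
    new_rows = [r for r in staging if r["loaded_at"] > last_watermark]
    for row in new_rows:
        target[row["id"]] = row   # upsert; the target keeps serving reads throughout
    if new_rows:
        last_watermark = max(r["loaded_at"] for r in new_rows)
    return len(new_rows)

# The 9 AM file lands in staging; the scheduled merge picks up only those records.
staging.append({"id": 1, "loaded_at": datetime.now(timezone.utc)})
print(incremental_merge())        # 1
print(incremental_merge())        # 0 until new data arrives in staging
```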
What is your approach for creating high-availability clusters for Postgres on Linux, and how can the apps be integrated to manage the orchestration process? Yeah, this is again a process of setting up different means of having the staging area, and placing different sets of tables on different nodes. Depending on the application and on the source of the data, the resources are allocated: a data warehouse or a cache is allocated for each of these clusters. Some clusters or tables could be large, and they are dedicated to a specific virtual warehouse that consumes more data and takes more time to load, while other, smaller warehouses hold limited data, where we expect the tables or the streaming of data for those tables to be limited. So we can split these data warehouses, schemas, virtual warehouses, or caches depending on the data size or data volume being consumed each day. We can have DevOps methodologies in place to dedicate these, so it isn't just any random data, table, or file that flows in being segregated across the loads; depending on the data volume we expect for each of the nodes, we can have a dedicated warehouse and cache for each source.