
8 years of total IT experience with special emphasis on the design, development, architecture, administration, and implementation of data management and governance applications. Cloud and data processing specialist with roughly six projects delivered for Regeneron, IBM, and Intellus Learning (part of McMillan), plus an internship at the startup CleverLogik. Seeking to draw on proven software development and engineering skills to build and improve applications.
Technical Lead, Visionet Systems
Senior Software Engineer, IBM
Software Engineer, Intellus Learning INC
Software Developer, Cleverlogik Technologies PVT LTD
Amazon EKS
S3
EMR
Athena
EC2
SNS
SQS
IAM
Dremio
Airflow
Jenkins
Sharepoint
AWS Athena
Hive
Redshift
Kubernetes
Jira
GitHub
AWS
Django
MySQL
MongoDB
Pandas
Scrapy
BeautifulSoup
Selenium
OpenCV
Celery
Hello. I'm Zumbhuda Paul, and I have 8 years of experience on cloud and data engineering platforms. I'm an expert in building ETL data pipelines using PySpark and different AWS services, in building independent and reusable Python projects with Django and AWS, and in optimizing existing Python projects. Currently, I'm working at Visionet as a technical lead and developing skills such as handling teams, distributing work, and planning roadmaps for my team. I have contributed to two different projects here: the DID pipeline and the connected data lake platform (CDL). For DID, I developed an ETL pipeline using NiFi and AWS services like Airflow, EMR, and S3, with databases like Hive and Athena. For CDL, I built a connected data lake platform for managing access for all users by creating policies for groups and data lakes, using Kubernetes and AWS services like EKS, node groups, S3, and IAM, along with SaaS products like Privacera and Dremio. Previously, I was at IBM as a senior software engineer for 2 years, developing IBM Cloud as a product. I was responsible for building new VM features and managing compute cloud services like clusters, routers, networks, and host groups using Python and Kubernetes; there I learned about cloud services in depth. Before IBM, I was at a company named Intellus Learning as a software engineer, working mainly on two products: iClicker and Insights. Insights is a reporting application for instructors and students. We developed this product from scratch using Python, Django, AWS services like EC2, and databases like MongoDB and MySQL. I also created an image processing pipeline to convert images into meaningful text using Python and AWS services like Lambda, SNS, SQS, S3, API Gateway, and MySQL. Here I learned how to build independent and reusable Python projects.
When developing a complex ETL process, how do you ensure data quality and integrity throughout the pipeline? Ensuring data quality and integrity in a complex ETL process involves several key practices. Data validation: implement validation checks at each stage of processing to ensure data conforms to expected formats, ranges, and business rules. Data profiling: analyze source data to understand its structure, content, and quality, and identify and handle anomalies, missing values, and duplicates before loading. Error handling: design robust error handling and logging mechanisms to capture and report errors, enabling quick identification and resolution. Automated testing: develop automated tests to validate data transformation, consistency, and correctness at the various pipeline stages. Data lineage: track how data flows and is transformed from source to target, ensuring transparency and traceability and facilitating audits and debugging. Consistent standards: enforce consistent naming conventions, data types, and coding standards across the ETL process to maintain uniformity and reduce errors. Beyond these, monitoring and alerts, backup and recovery, and incremental loading can also be implemented. By integrating these practices, we can ensure data quality and integrity throughout processing and deliver accurate data to downstream applications.
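As a rough illustration of the validation practice described above, here is a minimal PySpark sketch; the bucket paths, column names, and rules are assumed for the example rather than taken from any specific project:

```python
# Minimal sketch (assumed paths and columns) of stage-level data quality checks in PySpark.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-quality-checks").getOrCreate()

df = spark.read.parquet("s3://example-bucket/raw/orders/")  # hypothetical source path

# 1. Schema / format check: required columns must exist.
required = {"order_id", "order_date", "amount"}
missing = required - set(df.columns)
if missing:
    raise ValueError(f"Missing required columns: {missing}")

# 2. Business-rule validation: non-null keys, amounts within an allowed range.
bad_rows = df.filter(F.col("order_id").isNull() | (F.col("amount") < 0))

# 3. Duplicate check on the primary key.
dupes = df.groupBy("order_id").count().filter("count > 1")

if bad_rows.count() > 0 or dupes.count() > 0:
    # Route bad records to a quarantine area instead of failing silently.
    bad_rows.write.mode("append").parquet("s3://example-bucket/quarantine/orders/")
    raise RuntimeError("Data quality checks failed; see quarantine output")

# Only clean, deduplicated data proceeds to the next pipeline stage.
df.dropDuplicates(["order_id"]).write.mode("overwrite").parquet("s3://example-bucket/staged/orders/")
```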
Provide a high-level overview of how you would plan for data disaster recovery in Snowflake. Planning for disaster recovery in Snowflake involves several key steps: understanding requirements, data replication, backups and snapshots, automated failover, access controls, testing, documentation, and continuous monitoring. Understanding requirements means defining the recovery time objective (RTO) and recovery point objective (RPO) based on business needs. For data replication, use Snowflake's cross-region replication to maintain replicas in other regions. Take backups and snapshots on a regular basis. Configure an automated failover process to quickly switch to the replica in the event of a disaster. Access controls ensure that proper security measures are in place on the replicated data so it is not exposed unnecessarily during a disaster. Testing means regularly exercising the disaster recovery plan through simulations so that processes and procedures work as expected. Everything should be documented and planned, including roles, responsibilities, escalation plans, and detailed recovery steps. Continuous monitoring makes it easier to detect issues early so the recovery process can be triggered promptly.
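A minimal sketch of the replication and failover setup described above, assuming hypothetical account, database, and credential names, using snowflake-connector-python:

```python
# Sketch (hypothetical names) of Snowflake cross-region replication for DR.
import snowflake.connector

# On the primary account: allow the DR account to replicate (and fail over to) prod_db.
primary = snowflake.connector.connect(account="myorg-primary", user="dr_admin",
                                       password="***", role="ACCOUNTADMIN")
cur = primary.cursor()
cur.execute("ALTER DATABASE prod_db ENABLE REPLICATION TO ACCOUNTS myorg.dr_account")
# Failover (promoting the replica) additionally requires Business Critical edition.
cur.execute("ALTER DATABASE prod_db ENABLE FAILOVER TO ACCOUNTS myorg.dr_account")

# On the DR account: create the secondary database and refresh it on a schedule
# that satisfies the agreed RPO (e.g. via a Snowflake task or external scheduler).
dr = snowflake.connector.connect(account="myorg-dr", user="dr_admin",
                                 password="***", role="ACCOUNTADMIN")
cur = dr.cursor()
cur.execute("CREATE DATABASE prod_db AS REPLICA OF myorg.primary.prod_db")
cur.execute("ALTER DATABASE prod_db REFRESH")

# During a disaster, promote the replica so applications can switch over:
# cur.execute("ALTER DATABASE prod_db PRIMARY")
```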
What method would you use to deploy a Spark application in AWS that ensures scalability and cost efficiency? To deploy a Spark application in AWS with scalability and cost efficiency, follow these steps. First, Amazon EMR: create an EMR cluster, select Spark, use a mix of EC2 instance types (on-demand and spot), and enable auto scaling. Instance fleets: specify multiple instance types and pricing options for flexibility and cost effectiveness. Data storage: use AWS S3 for durable, consistent storage. Submit the Spark job as an EMR step, or automate submission of EMR steps using the console or SDK. Monitoring and logging: use CloudWatch to set alarms, and enable logs to S3 for troubleshooting. Cluster termination: configure the cluster to auto-terminate after the job completes to save cost.
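A minimal boto3 sketch of this approach, with bucket paths, sizing, and IAM role names assumed for illustration:

```python
# Sketch (hypothetical bucket, paths, and sizing) of launching a transient,
# auto-terminating EMR cluster for a Spark job with boto3.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="spark-batch-job",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    LogUri="s3://example-bucket/emr-logs/",          # logs to S3 for troubleshooting
    Instances={
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            # Spot capacity for workers keeps cost down for fault-tolerant batch work.
            {"Name": "core", "InstanceRole": "CORE", "Market": "SPOT",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,         # auto-terminate after steps finish
    },
    Steps=[{
        "Name": "run-spark-app",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     "s3://example-bucket/jobs/etl_job.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    ManagedScalingPolicy={                            # scale workers with the workload
        "ComputeLimits": {"UnitType": "Instances",
                          "MinimumCapacityUnits": 2, "MaximumCapacityUnits": 10}
    },
)
print("Cluster started:", response["JobFlowId"])
```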
How can you use Linux system administration skills to enhance the security and reliability of a data processing system? Using Linux system administration skills, you can enhance security and reliability through the following practices. User and permission management: implement strict user access controls and group policies, and apply the principle of least privilege to restrict access to sensitive data and system resources. Firewall and network security: configure firewalls and use secure communication protocols like SSH and SSL/TLS to encrypt data in transit. System updates and patching: regularly apply patches to the Linux operating system and installed software. Monitoring and logging: use tools like Prometheus to track performance and detect anomalies, and configure logging with syslog or auditd. Backup and recovery should also be set up.
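As one small example of the least-privilege idea, here is a hypothetical Python audit script (the data directory is assumed) that flags files whose permissions are broader than they should be:

```python
# Sketch (hypothetical data directory) of a least-privilege audit: flag files under a
# sensitive path that are group-/world-writable or world-readable.
import os
import stat

DATA_DIR = "/var/data/pipeline"   # hypothetical sensitive directory

def audit_permissions(root: str):
    findings = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            mode = os.stat(path).st_mode
            if mode & (stat.S_IWGRP | stat.S_IWOTH):
                findings.append((path, "writable by group/others"))
            elif mode & stat.S_IROTH:
                findings.append((path, "readable by others"))
    return findings

if __name__ == "__main__":
    for path, issue in audit_permissions(DATA_DIR):
        # In practice this would feed a log shipper or alerting stack (auditd, Prometheus exporters).
        print(f"WARNING: {path}: {issue}")
```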
What techniques would you use to handle schema evolution in the Snowflake data warehouse without causing downtime for customers? To handle schema evolution in Snowflake without customer downtime, you can use the following techniques. Zero-copy cloning: create zero-copy clones of the database or schema to make and test changes without affecting the live environment. Time Travel: use Snowflake's Time Travel feature to revert to previous states if needed, ensuring data integrity during schema changes. Online schema changes: apply schema changes in a non-destructive manner by adding new columns or tables without dropping existing ones, and use views to maintain backward compatibility with the existing schema while transitioning to the new one. Transactional DDL: leverage Snowflake's transactional DDL support so schema changes are atomic and do not leave the database in an inconsistent state. Staged rollout: implement schema changes in stages, gradually rolling out updates and validating at each step to minimize impact.
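A minimal sketch of that flow, with database, table, and column names assumed for the example:

```python
# Sketch (hypothetical names) of non-disruptive schema evolution in Snowflake:
# clone, apply an additive change, validate, then expose via a stable view.
import snowflake.connector

conn = snowflake.connector.connect(account="myorg-primary", user="etl_admin",
                                    password="***", role="SYSADMIN")
cur = conn.cursor()

# 1. Zero-copy clone to rehearse the change without touching the live schema.
cur.execute("CREATE DATABASE analytics_dev CLONE analytics")
cur.execute("ALTER TABLE analytics_dev.public.orders ADD COLUMN discount_pct NUMBER(5,2)")

# ... run validation queries / downstream tests against analytics_dev here ...

# 2. Apply the same additive (backward-compatible) change to production.
cur.execute("ALTER TABLE analytics.public.orders ADD COLUMN discount_pct NUMBER(5,2)")

# 3. Keep downstream consumers on a stable contract via a view, so the new column
#    rolls out gradually (and Time Travel remains available for rollback).
cur.execute("""
    CREATE OR REPLACE VIEW analytics.public.orders_v1 AS
    SELECT order_id, customer_id, amount FROM analytics.public.orders
""")
```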
Examine the following code; it is intended to check whether all elements in the list are unique. You don't need to run the loop. First, `input_chat` is a wrong name: the variable is `input_list`, and there is nothing called `input_chat`. Second, you can directly convert the list to a set, which removes duplicates automatically, so the loop is unnecessary. After getting the set, compare it with the input list, for example by converting back to a list or comparing sizes: if they are equal, return True; otherwise, return False.
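A sketch of the corrected approach described above (the original snippet is not reproduced in the transcript, so the function and variable names are assumed from the discussion):

```python
# Corrected uniqueness check: no loop needed, just compare sizes after deduplication.
def all_unique(input_list):
    # Converting to a set drops duplicates; comparing lengths avoids the ordering
    # issues of comparing list(set(...)) to the original list directly.
    return len(set(input_list)) == len(input_list)

print(all_unique([1, 2, 3]))     # True
print(all_unique([1, 2, 2, 3]))  # False
```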
There is no need for a GROUP BY here; we can use ORDER BY instead.
Design a strategy to handle incremental loading in an ETL pipeline that ingests data into Snowflake from multiple changing data sources. To handle incremental loading in such a pipeline, you can follow this strategy. CDC (change data capture): implement a CDC mechanism, for example database triggers or log-based CDC tools, to capture changes (inserts, updates, and deletes) from the data sources. Data staging: load the captured data into a staging area in Snowflake, with dedicated staging tables per data source to temporarily store the incremental data. Timestamp columns: ensure each source table has a timestamp column tracking the latest modification time, which makes it possible to identify new and updated records. ETL process: extract, pulling the incremental data based on the timestamp or a similar mechanism; transform, applying the necessary transformations, data cleaning, and deduplication; and merge the incremental data into the target Snowflake tables using a MERGE statement to handle inserts, updates, and deletes. Scheduled jobs: run the ETL job at regular intervals, hourly or daily, to process and load the incremental data. Finally, add error handling, monitoring, and data validation checks on load to ensure completeness and accuracy. By following this strategy, you can implement incremental loading through staging for an ETL pipeline fed by multiple changing sources.
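A minimal sketch of the watermark-plus-MERGE step, with table and column names assumed for illustration:

```python
# Sketch (hypothetical tables/columns) of watermark-based incremental loading into
# Snowflake: pick up only rows changed since the last load and MERGE them.
import snowflake.connector

conn = snowflake.connector.connect(account="myorg-primary", user="etl_user",
                                    password="***", warehouse="ETL_WH")
cur = conn.cursor()

# 1. High-water mark: latest modification time already present in the target table.
cur.execute("SELECT COALESCE(MAX(updated_at), '1970-01-01') FROM analytics.public.orders")
watermark = cur.fetchone()[0]

# 2. Assume upstream CDC/extract has landed changed rows into a staging table
#    (e.g. via Snowpipe or COPY INTO); MERGE handles inserts and updates idempotently.
cur.execute("""
    MERGE INTO analytics.public.orders AS tgt
    USING (
        SELECT * FROM analytics.staging.orders_changes
        WHERE updated_at > %s
    ) AS src
    ON tgt.order_id = src.order_id
    WHEN MATCHED THEN UPDATE SET
        tgt.amount = src.amount,
        tgt.status = src.status,
        tgt.updated_at = src.updated_at
    WHEN NOT MATCHED THEN INSERT (order_id, amount, status, updated_at)
        VALUES (src.order_id, src.amount, src.status, src.updated_at)
""", (watermark,))
conn.commit()
```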
Plan a transition to an architecture supporting both real-time and batch processing, optimized for Python-based computation and modernization with shared components. For data ingestion: real-time, use tools like Apache Kafka or Amazon Kinesis to handle incoming data streams; batch, use AWS S3 or Hadoop HDFS for batch storage. For data processing: real-time, Apache Flink or Spark Streaming via PySpark for Python-based computation; batch, PySpark as well, so the computation stays Python-based. Data storage: both real-time and batch data can land in Snowflake for scalable and flexible storage. For orchestration, Apache Airflow can schedule and manage both real-time and batch pipelines, and shared Python components and ETL code can be reused across both paths. Monitoring and logging should cover both as well.
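A minimal PySpark sketch of the shared-component idea, with the Kafka topic, schema, and S3 paths assumed for the example (the streaming path also needs the spark-sql-kafka connector package on the cluster):

```python
# Sketch (hypothetical topic, paths, schema) showing one shared PySpark transformation
# reused by both the streaming (Kafka) and batch (S3) paths.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("hybrid-pipeline").getOrCreate()

def transform(df):
    # Shared business logic used by both the real-time and batch jobs.
    return (df.withColumn("amount", F.col("amount").cast("double"))
              .filter(F.col("amount") > 0))

# Batch path: periodic reprocessing of files landed in S3.
batch_df = spark.read.json("s3://example-bucket/raw/events/")
transform(batch_df).write.mode("overwrite").parquet("s3://example-bucket/curated/events/")

# Streaming path: the same transformation applied to a Kafka stream.
stream_df = (spark.readStream.format("kafka")
             .option("kafka.bootstrap.servers", "broker:9092")
             .option("subscribe", "events")
             .load()
             .select(F.from_json(F.col("value").cast("string"),
                                 "order_id STRING, amount STRING").alias("e"))
             .select("e.*"))

(transform(stream_df).writeStream
    .format("parquet")
    .option("path", "s3://example-bucket/curated/events_stream/")
    .option("checkpointLocation", "s3://example-bucket/checkpoints/events/")
    .start())
```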
There is a system for managing cross-region data replication within AWS; another handles the entire processing.