
8 years of total IT experience with special emphasis on the design, development, architecture, administration, and implementation of data management and governance applications. Cloud and data processing specialist with roughly six projects delivered for Regeneron, IBM, and Intellus Learning (part of McMillan), plus an internship at the startup CleverLogik. Seeking to draw on proven software development and engineering skills to build and improve applications.
Technical Lead, Visionet Systems
Senior Software Engineer, IBM
Software Engineer, Intellus Learning INC
Software Developer, Cleverlogik Technologies PVT LTD
Amazon EKS
S3
EMR
Athena
EC2
SNS
SQS
IAM
Dremio
Airflow
Jenkins
Sharepoint
AWS Athena
Hive
Redshift
Kubernetes
Jira
GitHub
AWS
Django
MySQL
MongoDB
Pandas
Scrapy
BeautifulSoup
Selenium
OpenCV
Celery
Hello. I'm Zumbhuda Paul, and I have 8 years of experience on cloud and data engineering platforms. I'm an expert in building ETL data pipelines using PySpark and different AWS services, in building independent and reusable Python projects with Django and AWS, and in optimizing existing Python projects. Currently, I'm working at Visionet as a technical lead and developing skills such as handling teams, distributing work, and planning roadmaps for my team. I have contributed to two different projects here: the DID pipeline and the connected data lake platform (CDL). For DID, I developed an ETL pipeline using NiFi and AWS services like Airflow, EMR, and S3, with databases like Hive and Athena. For CDL, I built a connected data lake platform for managing access for all users by creating policies for groups and data lakes, using Kubernetes and AWS services like EKS, node groups, S3, and IAM, along with SaaS products like Privacera and Dremio. Previously, I was at IBM as a senior software engineer for 2 years, developing IBM Cloud as a product. I was responsible for building new VM features and managing compute cloud services like clusters, routers, networks, and host groups using Python and Kubernetes; there I learned about cloud services in depth. Before IBM, I was at a company named Intellus Learning as a software engineer, working mainly on two products: iClicker and Insights. Insights is a reporting application for instructors and students. We developed this product from scratch using Python, Django, AWS services like EC2, and databases like MongoDB and MySQL. I also created an image processing pipeline to convert images into meaningful text using Python and AWS services like Lambda, SNS, SQS, S3, API Gateway, and MySQL. Here I learned how to build independent and reusable Python projects.
When developing a complex ETL process, how do you ensure data quality and integrity throughout the pipeline? Ensuring data quality and integrity in a complex ETL process involves several key practices. Data validation: implement validation checks at each stage of processing to ensure data conforms to expected formats, ranges, and business rules. Data profiling: analyze source data to understand its structure, content, and quality, and identify and handle anomalies, missing values, and duplicates before loading. Error handling: design robust error handling and logging mechanisms to capture and report errors, enabling quick identification and resolution. Automated testing: develop automated tests to validate data transformation, consistency, and correctness at the various pipeline stages. Data lineage: track how data flows and is transformed from source to target, ensuring transparency and traceability and facilitating audits and debugging. Consistent standards: enforce consistent naming conventions, data types, and coding standards across the ETL process to maintain uniformity and reduce errors. Beyond these, monitoring and alerts, backup and recovery, and incremental loading can also be implemented. By integrating these practices, we can ensure data quality and integrity throughout processing and deliver accurate data to downstream applications.
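As a rough illustration of the validation practice described above, here is a minimal PySpark sketch; the bucket paths, column names, and rules are assumed for the example rather than taken from any specific project:

```python
# Minimal sketch (assumed paths and columns) of stage-level data quality checks in PySpark.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-quality-checks").getOrCreate()

df = spark.read.parquet("s3://example-bucket/raw/orders/")  # hypothetical source path

# 1. Schema / format check: required columns must exist.
required = {"order_id", "order_date", "amount"}
missing = required - set(df.columns)
if missing:
    raise ValueError(f"Missing required columns: {missing}")

# 2. Business-rule validation: non-null keys, amounts within an allowed range.
bad_rows = df.filter(F.col("order_id").isNull() | (F.col("amount") < 0))

# 3. Duplicate check on the primary key.
dupes = df.groupBy("order_id").count().filter("count > 1")

if bad_rows.count() > 0 or dupes.count() > 0:
    # Route bad records to a quarantine area instead of failing silently.
    bad_rows.write.mode("append").parquet("s3://example-bucket/quarantine/orders/")
    raise RuntimeError("Data quality checks failed; see quarantine output")

# Only clean, deduplicated data proceeds to the next pipeline stage.
df.dropDuplicates(["order_id"]).write.mode("overwrite").parquet("s3://example-bucket/staged/orders/")
```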
Provide a high-level overview of how you would plan for data disaster recovery in Snowflake. Planning for disaster recovery in Snowflake involves several key steps: understanding requirements, data replication, backups and snapshots, automated failover, access controls, testing, documentation, and continuous monitoring. Understanding requirements means defining the recovery time objective (RTO) and recovery point objective (RPO) based on business needs. For data replication, use Snowflake's cross-region replication to maintain replicas in other regions. Take backups and snapshots on a regular basis. Configure an automated failover process to quickly switch to the replica in the event of a disaster. Access controls ensure that proper security measures are in place on the replicated data so it is not exposed unnecessarily during a disaster. Testing means regularly exercising the disaster recovery plan through simulations so that processes and procedures work as expected. Everything should be documented and planned, including roles, responsibilities, escalation plans, and detailed recovery steps. Continuous monitoring makes it easier to detect issues early so the recovery process can be triggered promptly.
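A minimal sketch of the replication and failover setup described above, assuming hypothetical account, database, and credential names, using snowflake-connector-python:

```python
# Sketch (hypothetical names) of Snowflake cross-region replication for DR.
import snowflake.connector

# On the primary account: allow the DR account to replicate (and fail over to) prod_db.
primary = snowflake.connector.connect(account="myorg-primary", user="dr_admin",
                                       password="***", role="ACCOUNTADMIN")
cur = primary.cursor()
cur.execute("ALTER DATABASE prod_db ENABLE REPLICATION TO ACCOUNTS myorg.dr_account")
# Failover (promoting the replica) additionally requires Business Critical edition.
cur.execute("ALTER DATABASE prod_db ENABLE FAILOVER TO ACCOUNTS myorg.dr_account")

# On the DR account: create the secondary database and refresh it on a schedule
# that satisfies the agreed RPO (e.g. via a Snowflake task or external scheduler).
dr = snowflake.connector.connect(account="myorg-dr", user="dr_admin",
                                 password="***", role="ACCOUNTADMIN")
cur = dr.cursor()
cur.execute("CREATE DATABASE prod_db AS REPLICA OF myorg.primary.prod_db")
cur.execute("ALTER DATABASE prod_db REFRESH")

# During a disaster, promote the replica so applications can switch over:
# cur.execute("ALTER DATABASE prod_db PRIMARY")
```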
What method would you use to deploy a Spark application in AWS that ensures scalability and cost efficiency? To deploy a Spark application in AWS with scalability and cost efficiency, follow these steps. First, Amazon EMR: create an EMR cluster, select Spark, use a mix of EC2 instance types (on-demand and spot), and enable auto scaling. Instance fleets: specify multiple instance types and pricing options for flexibility and cost effectiveness. Data storage: use AWS S3 for durable, consistent storage. Submit the Spark job as an EMR step, or automate submission of EMR steps using the console or SDK. Monitoring and logging: use CloudWatch to set alarms, and enable logs to S3 for troubleshooting. Cluster termination: configure the cluster to auto-terminate after the job completes to save cost.
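A minimal boto3 sketch of this approach, with bucket paths, sizing, and IAM role names assumed for illustration:

```python
# Sketch (hypothetical bucket, paths, and sizing) of launching a transient,
# auto-terminating EMR cluster for a Spark job with boto3.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="spark-batch-job",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    LogUri="s3://example-bucket/emr-logs/",          # logs to S3 for troubleshooting
    Instances={
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            # Spot capacity for workers keeps cost down for fault-tolerant batch work.
            {"Name": "core", "InstanceRole": "CORE", "Market": "SPOT",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,         # auto-terminate after steps finish
    },
    Steps=[{
        "Name": "run-spark-app",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     "s3://example-bucket/jobs/etl_job.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    ManagedScalingPolicy={                            # scale workers with the workload
        "ComputeLimits": {"UnitType": "Instances",
                          "MinimumCapacityUnits": 2, "MaximumCapacityUnits": 10}
    },
)
print("Cluster started:", response["JobFlowId"])
```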
How can you use Linux system administration skills to enhance the security and reliability of a data processing system? Using Linux system administration skills, you can enhance security and reliability through the following practices. User and permission management: implement strict user access controls and group policies, and apply the principle of least privilege to restrict access to sensitive data and system resources. Firewall and network security: configure firewalls and use secure communication protocols like SSH and SSL/TLS to encrypt data in transit. System updates and patching: regularly apply patches to the Linux operating system and installed software. Monitoring and logging: use tools like Prometheus to track performance and detect anomalies, and configure logging with syslog or auditd. Backup and recovery should also be set up.
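As one small example of the least-privilege idea, here is a hypothetical Python audit script (the data directory is assumed) that flags files whose permissions are broader than they should be:

```python
# Sketch (hypothetical data directory) of a least-privilege audit: flag files under a
# sensitive path that are group-/world-writable or world-readable.
import os
import stat

DATA_DIR = "/var/data/pipeline"   # hypothetical sensitive directory

def audit_permissions(root: str):
    findings = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            mode = os.stat(path).st_mode
            if mode & (stat.S_IWGRP | stat.S_IWOTH):
                findings.append((path, "writable by group/others"))
            elif mode & stat.S_IROTH:
                findings.append((path, "readable by others"))
    return findings

if __name__ == "__main__":
    for path, issue in audit_permissions(DATA_DIR):
        # In practice this would feed a log shipper or alerting stack (auditd, Prometheus exporters).
        print(f"WARNING: {path}: {issue}")
```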
What techniques would you use to handle schema evolution in the Snowflake data warehouse without causing downtime for customers? To handle schema evolution in Snowflake without customer downtime, you can use the following techniques. Zero-copy cloning: create zero-copy clones of the database or schema to make and test changes without affecting the live environment. Time Travel: use Snowflake's Time Travel feature to revert to previous states if needed, ensuring data integrity during schema changes. Online schema changes: apply schema changes in a non-destructive manner by adding new columns or tables without dropping existing ones, and use views to maintain backward compatibility with the existing schema while transitioning to the new one. Transactional DDL: leverage Snowflake's transactional DDL support so schema changes are atomic and do not leave the database in an inconsistent state. Staged rollout: implement schema changes in stages, gradually rolling out updates and validating at each step to minimize impact.
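A minimal sketch of that flow, with database, table, and column names assumed for the example:

```python
# Sketch (hypothetical names) of non-disruptive schema evolution in Snowflake:
# clone, apply an additive change, validate, then expose via a stable view.
import snowflake.connector

conn = snowflake.connector.connect(account="myorg-primary", user="etl_admin",
                                    password="***", role="SYSADMIN")
cur = conn.cursor()

# 1. Zero-copy clone to rehearse the change without touching the live schema.
cur.execute("CREATE DATABASE analytics_dev CLONE analytics")
cur.execute("ALTER TABLE analytics_dev.public.orders ADD COLUMN discount_pct NUMBER(5,2)")

# ... run validation queries / downstream tests against analytics_dev here ...

# 2. Apply the same additive (backward-compatible) change to production.
cur.execute("ALTER TABLE analytics.public.orders ADD COLUMN discount_pct NUMBER(5,2)")

# 3. Keep downstream consumers on a stable contract via a view, so the new column
#    rolls out gradually (and Time Travel remains available for rollback).
cur.execute("""
    CREATE OR REPLACE VIEW analytics.public.orders_v1 AS
    SELECT order_id, customer_id, amount FROM analytics.public.orders
""")
```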
Examine the following code; it is intended to check whether all elements in the list are unique. You don't need to run the loop. First, `input_chat` is a wrong name: the variable is `input_list`, and there is nothing called `input_chat`. Second, you can directly convert the list to a set, which removes duplicates automatically, so the loop is unnecessary. After getting the set, compare it with the input list, for example by converting back to a list or comparing sizes: if they are equal, return True; otherwise, return False.
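A sketch of the corrected approach described above (the original snippet is not reproduced in the transcript, so the function and variable names are assumed from the discussion):

```python
# Corrected uniqueness check: no loop needed, just compare sizes after deduplication.
def all_unique(input_list):
    # Converting to a set drops duplicates; comparing lengths avoids the ordering
    # issues of comparing list(set(...)) to the original list directly.
    return len(set(input_list)) == len(input_list)

print(all_unique([1, 2, 3]))     # True
print(all_unique([1, 2, 2, 3]))  # False
```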
There is no need for a GROUP BY here; we can use ORDER BY instead.
Design a strategy to handle incremental loading in an ETL pipeline that ingests data into Snowflake from multiple changing data sources. To handle incremental loading in such a pipeline, you can follow this strategy. CDC (change data capture): implement a CDC mechanism, for example database triggers or log-based CDC tools, to capture changes (inserts, updates, and deletes) from the data sources. Data staging: load the captured data into a staging area in Snowflake, with dedicated staging tables per data source to temporarily store the incremental data. Timestamp columns: ensure each source table has a timestamp column tracking the latest modification time, which makes it possible to identify new and updated records. ETL process: extract, pulling the incremental data based on the timestamp or a similar mechanism; transform, applying the necessary transformations, data cleaning, and deduplication; and merge the incremental data into the target Snowflake tables using a MERGE statement to handle inserts, updates, and deletes. Scheduled jobs: run the ETL job at regular intervals, hourly or daily, to process and load the incremental data. Finally, add error handling, monitoring, and data validation checks on load to ensure completeness and accuracy. By following this strategy, you can implement incremental loading through staging for an ETL pipeline fed by multiple changing sources.
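A minimal sketch of the watermark-plus-MERGE step, with table and column names assumed for illustration:

```python
# Sketch (hypothetical tables/columns) of watermark-based incremental loading into
# Snowflake: pick up only rows changed since the last load and MERGE them.
import snowflake.connector

conn = snowflake.connector.connect(account="myorg-primary", user="etl_user",
                                    password="***", warehouse="ETL_WH")
cur = conn.cursor()

# 1. High-water mark: latest modification time already present in the target table.
cur.execute("SELECT COALESCE(MAX(updated_at), '1970-01-01') FROM analytics.public.orders")
watermark = cur.fetchone()[0]

# 2. Assume upstream CDC/extract has landed changed rows into a staging table
#    (e.g. via Snowpipe or COPY INTO); MERGE handles inserts and updates idempotently.
cur.execute("""
    MERGE INTO analytics.public.orders AS tgt
    USING (
        SELECT * FROM analytics.staging.orders_changes
        WHERE updated_at > %s
    ) AS src
    ON tgt.order_id = src.order_id
    WHEN MATCHED THEN UPDATE SET
        tgt.amount = src.amount,
        tgt.status = src.status,
        tgt.updated_at = src.updated_at
    WHEN NOT MATCHED THEN INSERT (order_id, amount, status, updated_at)
        VALUES (src.order_id, src.amount, src.status, src.updated_at)
""", (watermark,))
conn.commit()
```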
Plan a transition to an architecture supporting both real-time and batch processing, optimized for Python-based computation and modernization with shared components. For data ingestion: real-time, use tools like Apache Kafka or Amazon Kinesis to handle incoming data streams; batch, use AWS S3 or Hadoop HDFS for batch storage. For data processing: real-time, Apache Flink or Spark Streaming via PySpark for Python-based computation; batch, PySpark as well, so the computation stays Python-based. Data storage: both real-time and batch data can land in Snowflake for scalable and flexible storage. For orchestration, Apache Airflow can schedule and manage both real-time and batch pipelines, and shared Python components and ETL code can be reused across both paths. Monitoring and logging should cover both as well.
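A minimal PySpark sketch of the shared-component idea, with the Kafka topic, schema, and S3 paths assumed for the example (the streaming path also needs the spark-sql-kafka connector package on the cluster):

```python
# Sketch (hypothetical topic, paths, schema) showing one shared PySpark transformation
# reused by both the streaming (Kafka) and batch (S3) paths.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("hybrid-pipeline").getOrCreate()

def transform(df):
    # Shared business logic used by both the real-time and batch jobs.
    return (df.withColumn("amount", F.col("amount").cast("double"))
              .filter(F.col("amount") > 0))

# Batch path: periodic reprocessing of files landed in S3.
batch_df = spark.read.json("s3://example-bucket/raw/events/")
transform(batch_df).write.mode("overwrite").parquet("s3://example-bucket/curated/events/")

# Streaming path: the same transformation applied to a Kafka stream.
stream_df = (spark.readStream.format("kafka")
             .option("kafka.bootstrap.servers", "broker:9092")
             .option("subscribe", "events")
             .load()
             .select(F.from_json(F.col("value").cast("string"),
                                 "order_id STRING, amount STRING").alias("e"))
             .select("e.*"))

(transform(stream_df).writeStream
    .format("parquet")
    .option("path", "s3://example-bucket/curated/events_stream/")
    .option("checkpointLocation", "s3://example-bucket/checkpoints/events/")
    .start())
```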
There is a system for managing cross-region data replication within AWS; another handles the entire processing.