Vetted Talent

Monica Kherajani

Disciplined and insightful data engineer with 5.5 years of experience building real-time data pipelines and ETL pipelines, implementing security, creating visualizations, and delivering useful business insights.
  • Role

    Lead ELK Stack Engineer

  • Years of Experience

    6 years

Skillsets

  • Python - 5 Years
  • Redshift - 5 Years
  • Data Analysis - 5 Years
  • PySpark - 4 Years
  • SQL - 4 Years
  • Shell Scripting - 4 Years
  • Elasticsearch - 4 Years
  • AWS - 3 Years
  • S3 - 3 Years
  • Kafka - 2 Years
  • CI/CD - 2 Years
  • Airflow - 1 Year
  • Logstash
  • Kibana
  • Glue
  • EMR
  • Lambda
  • Step Functions
  • Power BI
  • NoSQL
  • Unix
  • Linux
  • Jira
  • GitHub

Vetted For

13 Skills
  • Roles & Skills
  • Results
  • Details
  • Data Engineer II (Remote) - AI Screening
  • 72%
  • Skills assessed: Airflow, Data Governance, Machine Learning and Data Science, BigQuery, ETL processes, Hive, Relational DB, Snowflake, Hadoop, Java, PostgreSQL, Python, SQL
  • Score: 65/90

Professional Summary

6 Years
  • Mar 2022 - Present (3 yr 7 months)

    Lead Data Engineer

    Kyndryl
  • Sep 2019 - Feb 2022 (2 yr 5 months)

    Data Engineer

    Larsen & Toubro Infotech

Applications & Tools Known

  • Python
  • Elasticsearch
  • Kibana
  • AWS
  • S3
  • Glue
  • Redshift
  • AWS Lambda
  • Jira
  • Power BI
  • Kafka
  • Cassandra
  • Logstash
  • Spark
  • GitHub
  • Agile
  • Scrum
  • Airflow
  • Shell Scripting
  • CI/CD
  • SQL
  • NoSQL
  • Unix/Linux
  • Data Analysis

Work History

6 Years

Lead Data Engineer

Kyndryl
Mar 2022 - Present (3 yr 7 months)
    Developed an ETL data processing pipeline using Python, Elasticsearch, and Kibana: read multiple CSV files, performed transformations using NumPy and Pandas, and delivered business insights. Engineered and deployed an ETL pipeline in AWS using Glue, S3, Redshift, and Power BI, extracting data from the Jira API and Project Rak API to analyze teams, boards, and sprints, optimizing resource allocation for a 30% efficiency boost. Integrated data from APIs to build analytical dashboards. Automated backup reporting, reducing manual effort by 40%. Implemented best practices for logging. Guided team members to meet project requirements. Worked on a POC using AWS Data Migration Service to migrate data from Oracle to Redshift. Established a POC to read data from Kafka topics in Spark using PySpark. Upgraded Glue jobs from 2.0 to 4.0 using CloudFormation templates and migrated the jobs to production. Experience integrating LLMs using the OpenAI API.

Data Engineer

Larsen & Toubro Infotech
Sep 2019 - Feb 2022 (2 yr 5 months)
    Achieved real-time data reconciliation using Kafka, Cassandra, Logstash, Elasticsearch, and Kibana. Upgraded Elasticsearch, Logstash, Kibana, and Filebeat to the latest versions. Implemented authentication, authorization, and SSL in the ELK stack. Automated deployment of config files using shell scripts. Designed dashboards in Kibana for monitoring logs shipped via Filebeat. Automated committing of Git artifacts, saving developer time by 20%. Built a real-time data pipeline to automate data collection from Kafka. Increased system performance by 30% by tuning JVM heap size, workers, and threads.

Achievements

  • Developed an ETL data processing pipeline using Python, Elasticsearch, and Kibana: read multiple CSV files, performed transformations with NumPy and Pandas, and delivered useful business insights.
  • Engineered and deployed an ETL pipeline in AWS using Glue, S3, Redshift, and Power BI, extracting data from the Jira API to analyze teams, boards, and sprints, optimizing resource allocation for a 30% efficiency boost.
  • Integrated data from APIs to build analytical dashboards.
  • Automated backup reporting, reducing manual effort by 40%.
  • Implemented best practices for a clear logging mechanism.
  • Effectively communicated client needs, guiding team members toward meeting project requirements.
  • Worked on a POC using AWS Data Migration Service to migrate data from Oracle to Redshift.
  • Established a POC to read data from Kafka topics in Spark using PySpark.
  • Achieved real-time data reconciliation using Kafka, Cassandra, Logstash, Elasticsearch, and Kibana.
  • Upgraded Elasticsearch, Logstash, Kibana, and Filebeat to the latest versions.
  • Implemented authentication, authorization, and SSL in the ELK stack.
  • Automated deployment of config files using shell scripts.
  • Designed dashboards in Kibana for monitoring logs shipped via Filebeat.
  • Automated committing of Git artifacts, saving developer time by 20%.
  • Built a real-time data pipeline to automate data collection from Kafka.
  • Increased system performance by 30% by tuning JVM heap size, workers, and threads.
  • Appreciated by clients for exceeding requirements.

Major Projects

1 Project

Education

  • Bachelor of Engineering

    Walchand Institute of Technology (2019)

Certifications

  • NPTEL: OOPs

  • IBM Cognitive Class: Big Data 101

  • NPTEL: Cloud Computing

  • NPTEL: Joy of Computing using Python

AI-interview Questions & Answers

Hi, my name is Monica Kherajani, and I've been working as a data engineer for almost five years. I've worked on both batch ETL pipelines and real-time data pipelines. On the real-time side, I have experience with Logstash, Elasticsearch (a NoSQL analytics engine), Kibana for dashboarding, Kafka, Spark, and Cassandra. For batch ETL pipelines, I've worked with Python, Elasticsearch, and SQL, and I have experience building Glue pipelines that read data from various APIs and load it into Redshift. I have also used Power BI for dashboarding. I can always learn and pick up new technologies, and I'm excited to learn new things: Gen AI is trending now, so I'm trying to pick that up as well. I think my differentiating factor is that I can learn new things and take on new challenges. Thank you.

To build a scalable ETL data pipeline in Hadoop and Hive for processing large datasets, the first step would be to download and extract the packages; on a Linux server we could extract the tar files. We then change the configuration files to define our nodes: which will be the master node, which the worker nodes, and where the MapReduce jobs will run. We make these configurations depending on the size of the data. Then we write a program that fetches the data and puts it into the Hive database. We could also use Spark, which is a more advanced alternative to Hadoop MapReduce: it performs in-memory computation and has advantages like Spark Streaming and MLlib, and it integrates with Python through the PySpark API. Depending on the size of our data, we size the Hadoop cluster by deciding the nodes, CPU cores, and memory for the Hadoop processes, and then write a program to fetch the data from the source database, process it, and put it into Hive.
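The "size the cluster depending on the data" step above can be sketched as a back-of-envelope calculation. This is an illustrative helper, not a real capacity planner; the 128 MB block size, 3x replication factor, and per-node figures are assumptions, not from the answer:

```python
# Rough Hadoop/Spark cluster sizing from input data volume.
# Assumptions: 128 MB HDFS block size, 3x replication, 8 cores per node.

def cluster_sizing(data_gb: float, block_mb: int = 128, replication: int = 3,
                   cores_per_node: int = 8) -> dict:
    """Estimate partitions, raw HDFS storage, and a node count."""
    partitions = max(1, int(data_gb * 1024 // block_mb))  # one task per block
    raw_storage_gb = data_gb * replication                # replicated on HDFS
    # Aim for one wave of tasks across all available cores:
    nodes = max(1, -(-partitions // cores_per_node))      # ceiling division
    return {"partitions": partitions,
            "raw_storage_gb": raw_storage_gb,
            "suggested_nodes": nodes}

print(cluster_sizing(10))  # sizing for a 10 GB input
```

Running it for 10 GB suggests 80 partitions (one per 128 MB block) and 30 GB of replicated storage; real sizing would also weigh memory pressure and job concurrency.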

How would you design a PostgreSQL schema for optimal querying by external services? A PostgreSQL schema depends on factors like which primary key columns the external service wants to query on. If the service queries by specific time frames, we could partition the data by those date-time columns. Otherwise, we could create indexes: indexing is one of the optimization techniques that makes the database faster, so if we know which columns are frequently used in SELECT queries, we can index those columns. We can also design the queries themselves optimally: instead of subqueries we could use CTEs, and with two large datasets we could avoid joins by designing the table so that the needed data lives in a single table. By avoiding joins, applying indexing and partitioning, and removing redundant subqueries, we can design the PostgreSQL schema for optimal querying.
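The indexing point above can be shown concretely. This is a minimal sketch: sqlite3 stands in for PostgreSQL (the CREATE INDEX syntax is the same), and the table and column names are invented for the demo:

```python
import sqlite3

# Index the column external services filter on, then confirm the planner
# actually uses it via EXPLAIN QUERY PLAN.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, event_time TEXT, payload TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)",
                 [(i, f"2024-01-{i % 28 + 1:02d}", "x") for i in range(1000)])

# Index the frequently queried column:
conn.execute("CREATE INDEX idx_events_time ON events(event_time)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events WHERE event_time = '2024-01-05'"
).fetchall()
print(plan)  # the plan should mention idx_events_time
```

In PostgreSQL you would check the same thing with `EXPLAIN`, and add `PARTITION BY RANGE (event_time)` at table-creation time for the time-frame case.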

Analyzing streaming data for real-time insights in the Hadoop ecosystem is a bit difficult, since a "streaming" pipeline there effectively becomes a very frequent batch pipeline. To solve this, we can design a real-time data pipeline using Spark (or PySpark) that reads data from Kafka topics and ingests it into a more real-time database like Elasticsearch. We should use messaging systems that enable a real-time pipeline, such as Kafka, Spark, or Flink, and the data store should also be NoSQL in order to serve these real-time insights. Depending on the requirement, we could put a middle layer in for the real-time streaming, such as Kafka, then land the data in a NoSQL database like Elasticsearch, MongoDB, or Cassandra, and then build dashboards on top of it.
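The topic-to-NoSQL-sink pattern described above can be simulated without any infrastructure. This sketch uses an in-memory queue in place of a Kafka topic and a dict in place of Elasticsearch; a real pipeline would use a Kafka consumer library and the Elasticsearch client instead:

```python
import queue

# Simulated Kafka -> transform -> NoSQL-sink pipeline.
topic = queue.Queue()   # stands in for a Kafka topic
sink = {}               # stands in for an Elasticsearch index

for i in range(3):      # producer side: events arriving on the topic
    topic.put({"id": i, "value": i * 10})

while not topic.empty():                          # consumer side
    event = topic.get()
    event["value_doubled"] = event["value"] * 2   # a trivial transformation
    sink[event["id"]] = event                     # "index" the document by id

print(sink[2]["value_doubled"])  # 40
```

The real version swaps `queue.Queue` for a consumer poll loop and the dict assignment for an index/upsert call, but the read-transform-write shape is the same.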

For a data-intensive application, it depends whether reads or writes dominate. NoSQL databases handle frequent writes well. In a relational database, we could increase the memory, CPU, and core settings, or add a few nodes to scale how much data it can store. We could also apply optimization techniques like indexing and partitioning: if one particular column is queried very frequently, we create an index on it so the query runs fast, and partitioning stores the data in partitions for parallel processing. We could also avoid joins, since join conditions take time, by designing the schema so that all the necessary columns are available in one place, perhaps through a view, or a materialized view with a fast refresh rate. And we can use CTEs instead of subqueries where those are slow. With indexing, partitioning, and a schema designed to avoid joins and subqueries, we can make a relational database much faster for a data-intensive application.
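The "write the join once, read from a view" idea above can be sketched as follows; sqlite3 stands in for the relational database, the tables are invented, and a production system might use a materialized view with a refresh schedule rather than a plain view:

```python
import sqlite3

# Read path queries a pre-defined view, so callers never repeat the join.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL);
INSERT INTO users  VALUES (1, 'ana'), (2, 'raj');
INSERT INTO orders VALUES (10, 1, 99.0), (11, 2, 45.5), (12, 1, 10.0);

-- The join is written once, here, not per caller:
CREATE VIEW user_orders AS
SELECT u.name, o.amount FROM orders o JOIN users u ON u.id = o.user_id;
""")
total = conn.execute(
    "SELECT SUM(amount) FROM user_orders WHERE name = 'ana'").fetchone()[0]
print(total)  # 109.0
```

In PostgreSQL the same shape becomes `CREATE MATERIALIZED VIEW ... ; REFRESH MATERIALIZED VIEW ...` when the precomputed result should be stored rather than re-joined per read.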

Snowflake queries have to be very well optimized in such cases. We could make some configuration changes in Snowflake: to handle terabytes of data we will obviously need more memory, so settings like the warehouse size, workers, and CPU cores can be adjusted. Query performance will also depend on how we optimize the queries and how frequently we query the large dataset. Queries should avoid deeply nested subqueries, since those take much more time; we can avoid joins and try techniques such as clustering and partitioning to handle large datasets, with partitioning again enabling parallel processing so we can get through terabytes of data much faster. We could also tune the connection settings to the source in Snowflake.
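The knobs mentioned above map onto real Snowflake statements. These are shown as strings only (no connection is made), and the warehouse and table names are made up for illustration:

```python
# Representative Snowflake tuning statements (names are hypothetical).
SCALE_UP = "ALTER WAREHOUSE etl_wh SET WAREHOUSE_SIZE = 'LARGE'"   # more compute per query
SCALE_OUT = "ALTER WAREHOUSE etl_wh SET MAX_CLUSTER_COUNT = 4"     # handle concurrent load
CLUSTER = "ALTER TABLE big_events CLUSTER BY (event_date)"         # prune micro-partitions

for stmt in (SCALE_UP, SCALE_OUT, CLUSTER):
    print(stmt)
```

Scaling up helps a single heavy query; scaling out (multi-cluster warehouses) helps concurrency; clustering keys let Snowflake skip micro-partitions when queries filter on the clustered column.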

Here we are handling the exception in only one place, so we would not know whether the exception occurred during extraction, transformation, or loading of the data. We could put separate try/except blocks around each stage, so that we at least know extraction succeeded and the failure was in transformation, or that transformation succeeded and loading failed due to, say, a connectivity issue. Also, re-raising a custom exception inside the except block is not needed here: since the handler is `except Exception as e` and we are already printing the exception string, we will know exactly what went wrong, so the extra raise of a custom exception can be removed.
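The stage-wise handling described above can be sketched like this; the helper names are made up, and a real pipeline would log rather than return the message:

```python
# Separate try/except per ETL stage, so a failure pinpoints the stage.

def run_etl(extract, transform, load):
    try:
        data = extract()
    except Exception as e:
        return f"extract failed: {e}"
    try:
        data = transform(data)
    except Exception as e:
        return f"transform failed: {e}"
    try:
        load(data)
    except Exception as e:
        return f"load failed: {e}"
    return "ok"

# A transform that blows up shows up with its stage in the message:
print(run_etl(lambda: [1, 2], lambda d: d / 0, lambda d: None))
```

Compared with one big try/except around the whole pipeline, this tells the operator immediately which stage to debug.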

Here, inside `while True`, a row is fetched: the cursor fetches whatever `SELECT * FROM large_table` returns, and if the row is None we break. But `process_row` is called outside the `if`, so it can run even when the row is None, and we don't want to process a None row. We should guard the call so that `process_row(row)` only runs when the row is not None, make sure the `break` actually terminates the `while True` loop once the result set is exhausted (otherwise, since the condition is always true, the loop would run forever), and then close the cursor after the loop.
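A corrected version of that fetch loop, sketched against sqlite3 with a made-up table, looks like this:

```python
import sqlite3

# Fixed fetch loop: break ends the while loop on exhaustion, process_row only
# runs for real rows, and the cursor/connection are closed afterwards.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE large_table (id INTEGER)")
conn.executemany("INSERT INTO large_table VALUES (?)", [(i,) for i in range(5)])

processed = []

def process_row(row):
    processed.append(row[0])   # stand-in for real per-row work

cursor = conn.execute("SELECT * FROM large_table")
while True:
    row = cursor.fetchone()
    if row is None:
        break                  # exits the while loop; nothing below runs
    process_row(row)           # reached only for actual rows
cursor.close()
conn.close()
print(processed)  # [0, 1, 2, 3, 4]
```

Placing `process_row(row)` after the `if ... break` guard (rather than unconditionally) is what prevents the None row from ever being processed.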

While building a real-time data pipeline, we had various sources of data and different Kafka topics for them. The pipeline would get data from a Kafka topic, process it using Logstash, and then put it into Elasticsearch. These Kafka topics initially had the same settings regardless of the size of the data, so we ran several test scenarios across the number of partitions, workers, and memory so the pipeline would be well optimized: for a given amount of data, which configuration do we need? If we have little data, we don't need much memory, many workers, or many CPU cores, so we keep those low, saving resources for larger processing jobs. Similarly, for much larger topics we need a bigger cluster with more, and more performant, nodes; we could assign a specific master node so the cluster can properly allocate resources for the data nodes. Like this, I tuned various settings for both Kafka and Elasticsearch while designing the real-time pipeline: what size of data needs what configuration, and what the cluster health is. If the cluster goes down for some reason, the pipeline fails, and that should not happen; that's why we should have at least three nodes in the cluster so they can elect a master while coming up. These were the few issues and challenges I faced.
But with some amount of testing across various scenarios (when it fails, what configurations are needed, and how the nodes should be assigned in the cluster so that master election does not fail), we kept all of this in mind and designed a successful real-time data pipeline for varying sizes and amounts of data.
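The "what configuration for what data size" tuning described above can be reduced to a rough heuristic for partition counts. This is an illustrative sketch; the per-partition throughput figures are assumptions, not measured values from the project:

```python
# Kafka partition-count heuristic: enough partitions that the slower of the
# producer/consumer sides can keep up with the target throughput.
# Assumed per-partition rates: 10 MB/s produce, 20 MB/s consume.

def suggest_partitions(target_mb_s: float, producer_mb_s: float = 10.0,
                       consumer_mb_s: float = 20.0) -> int:
    per_partition = min(producer_mb_s, consumer_mb_s)  # bottleneck side
    return max(1, -(-int(target_mb_s) // int(per_partition)))  # ceiling

print(suggest_partitions(95))  # heavy topic: 10 partitions
print(suggest_partitions(5))   # small topic stays small: 1
```

Small topics get minimal partitions (saving broker resources, as in the answer), while heavy topics scale out; the three-node minimum for master election is a separate cluster-level setting.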

In a new ETL process, first we would look at the requirements: what is the size of the data, how frequently is it queried, and does the business need a batch pipeline or a real-time pipeline? If they use the dashboard once a week or once a month, a batch ETL pipeline is good; if they want it refreshed within a few minutes, we design a real-time data pipeline. For data governance, we should first check that the data we are getting from the source is not empty and that there are no connectivity issues, and we should log all of this properly: whether there was a connectivity problem, or whether the data was already empty at read time. After reading the data, we should make sure the data types of all the columns are correct (date, timestamp, string, number, whatever is required). Replacing nulls is very important; otherwise there could be data-type mismatches. We should also check that there are no special characters: if we are reading CSV files, there might be hidden characters like extra tabs, \n, or \t, which could make the ETL pipeline fail or stop it reading the data properly, so we replace those characters and make sure each column receives the data intended for that column. After loading into the target, we test that the connection works, the data is read, and it lands in the proper format.
The columns should not be mismatched or shifted due to delimiter issues; we can perform all of these data-quality checks.
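The checks listed above can be sketched as a small cleaning step; the column names and defaults here are invented for the example:

```python
# Data-quality sketch: reject empty input, enforce types, default nulls,
# and strip hidden characters (tabs/newlines) from string fields.

def clean_rows(rows):
    if not rows:
        raise ValueError("source returned no data")
    cleaned = []
    for row in rows:
        cleaned.append({
            "id": int(row.get("id") or 0),              # enforce type, default nulls
            "name": (row.get("name") or "unknown")
                    .replace("\t", " ")                 # strip hidden characters
                    .replace("\n", " ")
                    .strip(),
        })
    return cleaned

rows = [{"id": "7", "name": "acme\tcorp\n"}, {"id": None, "name": None}]
print(clean_rows(rows))
```

In a real pipeline these rules would come from a schema definition (or a validation library), and failures would be logged with the offending row rather than silently defaulted.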

With expertise in Python programming, how would you create a robust ETL solution that leverages concurrency? We could implement concurrency in such a way that we can use it for parallel processing. We could use object-oriented programming, creating different objects for the different data structures and data sources, and take advantage of parallel processing so that we are reading data from various sources and writing out through different objects in parallel. That is how we could make use of the concurrency feature.
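The parallel-reads idea above can be sketched with the standard library; `fetch` is a stand-in for a real API or database call, and the source names are invented:

```python
from concurrent.futures import ThreadPoolExecutor

# Read several sources concurrently with a thread pool (threads suit
# I/O-bound extract steps; CPU-bound transforms would use processes).

def fetch(source: str) -> list:
    return [f"{source}-record-{i}" for i in range(2)]  # pretend I/O

sources = ["jira", "s3", "kafka"]
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(fetch, sources))           # one task per source

flat = [rec for batch in results for rec in batch]
print(len(flat))  # 6 records from 3 sources
```

`pool.map` preserves the order of `sources`, so downstream code can still attribute each batch to its source.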