
Aswathy Raj

Vetted Talent

Diligent engineer with 12+ years of experience spanning data science and engineering, development of software frameworks, platforms, and applications, and customer interaction with multilingual and multicultural clients. An effective team player, well versed in multiple platforms, programming languages, and databases, with extensive experience across all phases of the software development life cycle under both waterfall and agile methodologies.

  • Role

    Senior Data Engineer

  • Years of Experience

    12.00 years

Skillsets

  • Reporting & documentation
  • GitHub
  • Implementation Support
  • IT infra management
  • Jira
  • Jupyter Notebook
  • MySQL
  • NSIS
  • Providing product demo
  • PyCharm
  • Eclipse
  • Requirement gathering
  • SQLite
  • SVN
  • Visual Studio
  • Data insights & strategy
  • Data analytics dashboard
  • Support software development
  • Tender proposals
  • Jaspersoft iReport Designer 5.1.0
  • Ant script
  • Snowflake - 4 Years
  • Java - 2 Years
  • Python - 4 Years
  • MS SQL - 2 Years
  • SQL - 8 Years
  • Redshift - 1 Year
  • ETL - 4 Years
  • S3 - 4 Years
  • PySpark
  • Azure DevOps - 1 Year
  • Tcl/Tk script
  • AWS - 4 Years
  • Airflow
  • Business Intelligence
  • CI/CD implementation
  • Client Management
  • Databricks

Vetted For

11 Skills
  • Senior Data Engineer - AI Screening
  • 62%
  • Skills assessed: BigQuery, AWS, Big Data Technology, ETL, NoSQL, PySpark, Snowflake, Embedded Linux, Problem Solving Attitude, Python, SQL
  • Score: 56/90

Professional Summary

12.00 Years
  • Oct, 2020 - Present 5 yr 7 months

    Consultant | Data Science Engineer

    Sinergia Media Labs
  • Jan, 2013 - May, 2016 3 yr 4 months

    IT Consultant

    Al Rawahy Technical Services
  • Sep, 2006 - Nov, 2010 4 yr 2 months

    Software Engineer

    Huawei

Applications & Tools Known

  • Airflow
  • PyCharm
  • Jupyter Notebook
  • Eclipse
  • Visual Studio
  • GitHub
  • SVN

Work History

12.00 Years

Consultant | Data Science Engineer

Sinergia Media Labs
Oct, 2020 - Present 5 yr 7 months
    Developing machine learning applications, selecting datasets, implementing ML algorithms, running tests, maintaining databases, filtering data, and preparing analysis reports.

IT Consultant

Al Rawahy Technical Services
Jan, 2013 - May, 2016 3 yr 4 months
    Planning project activities, managing end-to-end project management, handling technical aspects, and imparting training.

Software Engineer

Huawei
Sep, 2006 - Nov, 2010 4 yr 2 months
    Design and development of functional and technical solutions, removing corrupted data, rapid application development, compiling code, executing tests, defining pipeline steps, and integrating code changes.

Achievements

  • Awarded for the Tableau Dataset Migration (Customer Appreciation)
  • Achieved CEO Team Award in January 2023
  • Bagged Monthly Shining Star Award for exceptional performance under pressure, meeting strict timelines, and delivering quality results during March 2022
  • Successfully worked in China for 6 months to implement urgent requirements, completing tasks with high quality
  • Awarded for contributions towards making projects CI (Continuous Integration) compliant, including setting up Cruise Control, writing scripts in ANT, XSL, and XML, providing technical assistance, and conducting training sessions

Major Projects

8 Projects

Techstyle

Mar, 2024 - Present 2 yr 2 months
    Techstyle is an American fashion brand that operates in the e-commerce domain. Technology Used: AMWAA, Python, SQL, Snowflake, MS SQL Server, PyCharm, GitHub. Accountabilities: Continuously oversee active processes, promptly identifying and resolving any issues or failures to ensure seamless operations. Design and implement new features and data pipelines to enhance functionality and efficiency. Conduct rigorous data validation to ensure accuracy, consistency, and integrity across all datasets. Troubleshoot and fix any failures, optimizing system performance for improved reliability and speed.

NBC (National Broadcasting Company)

Nov, 2021 - Feb, 2024 2 yr 3 months
    National Broadcasting Company is an American commercial broadcast television and radio network. Technology Used: Python, PySpark, SQL, Amazon S3, Databricks, Airflow, Snowflake, MySQL. Accountabilities: Developed frameworks and pipelines for capturing data from APIs and other sources, storing it in Amazon S3 and loading it into Snowflake tables after transformation. Optimized and migrated Tableau datasets (PySpark code). Managed the migration from SnapLogic to Airflow and Python, and implemented distributed processing with PySpark and Spark SQL in Databricks. Added and maintained ETL pipelines in Airflow and optimized Spark SQL queries to reduce reporting query run times. Created and managed Delta tables and evaluated data optimization technologies. Developed an audit framework integrated with Python scripts. Imparted training to new hires on domains and pipelines.
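
    As a rough sketch of this kind of API-to-S3-to-Snowflake capture flow (the endpoint, bucket, stage, and table names below are illustrative, not the project's actual ones):

        import json

        import boto3
        import requests

        s3 = boto3.client("s3")

        def capture_to_s3(api_url: str, bucket: str, key: str) -> str:
            """Pull one payload from an API and land the raw JSON in S3."""
            payload = requests.get(api_url, timeout=30).json()
            s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(payload).encode("utf-8"))
            return f"s3://{bucket}/{key}"

        # After landing, the file can be loaded into Snowflake from an external stage, e.g.:
        #   COPY INTO raw.api_events FROM @my_stage/events/ FILE_FORMAT = (TYPE = 'JSON')
        capture_to_s3("https://api.example.com/v1/events", "my-landing-bucket", "events/2024-01-01.json")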

Social Pulse

Mar, 2023 - Sep, 2023 6 months
    In-house project to leverage data from various social media endpoints like YouTube, Facebook, Instagram, Twitter, LinkedIn, and TikTok to provide a reporting dashboard. Technology Used: Python, Redshift, Amazon S3, AWS QuickSight, React, Node.js. Accountabilities: Architected and managed the project through to completion and ensured the development of the framework and pipelines for data capture from APIs.

Amgen

Jun, 2021 - Oct, 2021 4 months
    Amgen is an American multinational biopharmaceutical company. The data science project was carried out to identify the factors leading to customer/patient dropout from one of their drugs, Otezla. Technology Used: ML, Python, SQL | Platforms: Databricks. Accountabilities: Analysed data in a Data Lake with over 300 tables to understand the pharma domain. Prepared two aggregated datasets: one at the customer level and another at the patient level. Conducted Exploratory Data Analysis, handled missing values, and encoded categorical data. Performed feature engineering for feature elimination and developed 12 machine-learning models for classification and clustering. Created an ML pipeline for model retraining.
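
    A minimal sketch of this kind of classification workflow in scikit-learn, assuming an illustrative aggregated dataset and target column (the file, columns, and model choice are made up, not the project's actual ones):

        import pandas as pd
        from sklearn.compose import ColumnTransformer
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.impute import SimpleImputer
        from sklearn.model_selection import train_test_split
        from sklearn.pipeline import Pipeline
        from sklearn.preprocessing import OneHotEncoder

        # Hypothetical patient-level aggregated dataset with a dropout label.
        df = pd.read_parquet("patient_level_dataset.parquet")
        X, y = df.drop(columns=["dropped_out"]), df["dropped_out"]

        numeric_cols = X.select_dtypes(include="number").columns.tolist()
        categorical_cols = [c for c in X.columns if c not in numeric_cols]

        # Impute missing values and one-hot encode categoricals, then fit a classifier.
        preprocess = ColumnTransformer([
            ("num", SimpleImputer(strategy="median"), numeric_cols),
            ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                              ("encode", OneHotEncoder(handle_unknown="ignore"))]), categorical_cols),
        ])
        model = Pipeline([("prep", preprocess), ("clf", RandomForestClassifier(n_estimators=200))])

        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
        model.fit(X_train, y_train)
        print("holdout accuracy:", model.score(X_test, y_test))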

Indventor

Oct, 2020 - Jun, 2021 8 months
    Indventor is a Bag Valve Mask-based low-cost ventilator; bag valve mask ventilation is the standard method of providing rescue ventilation to patients.

In-House Project, Indventor

Oct, 2020 - Jun, 2021 8 months
    Indventor is a Bag Valve Mask-based low-cost ventilator; bag valve mask ventilation is the standard method of providing rescue ventilation to patients. Technology Used: Python, SQL, Selenium. Accountabilities: Reviewed the documents and code and presented the client-side product presentation. Researched features as per customer requests and created UiPath RPA flows.

Ministry of Agriculture and Fisheries, Oman

Jan, 2013 - May, 2016 3 yr 4 months
    The Ministry of Agriculture and Fisheries was initiated to enrich the fields related to agriculture, livestock, and fisheries. The project aimed to centralize the data from various regions. Technology Used: Core Java, Jaspersoft iReport, Windows / software configurations. Accountabilities: Understood the project architecture and the functionality of the Fisheries Licensing module. Interacted with the ministry to clarify requirements, ensured alignment, and conducted legacy database data analysis for migration to a new database. Designed and created license cards, certificates, and statistical reports using iReport, and integrated these reports into the application. Deployed the database and application on the ministry's centralized server.

Security Solutions, Huawei

Sep, 2006 - Nov, 2010 4 yr 2 months
    Huawei is an organization known worldwide for its work in telecommunications. The project aimed to enhance the security offered at the IP layer; the product supports both IKEv1 and IKEv2. I worked on a project that developed applications to enhance the security of telecom servers. Technology Used: SQLite, C++, Tcl/Tk, Core Java. Accountabilities: Built projects to enhance code quality by developing an on-the-fly feedback and correction system for Eclipse. Created CI/CD pipelines for projects, incorporating command-mode integration of code quality and QA tools. Provided training for the team on building continuous integration systems for projects. Developed automation suites for building libraries across various platforms and boards. Implemented and managed Continuous Integration (CI) processes and conducted training for the project team. Developed and implemented GUI-specific code along with analysing new requirements and designing solutions for implementation. Enhanced coding skills in Core Java and Swing and gained proficiency in Oracle database administration. Extended customer support for LGT, LVM & LMT and implemented logging and auditing policies. Created XML configuration files based on CIS Benchmarks. Parsed, retrieved, and wrote XML configuration files and conducted training sessions on using the plug-in.

Education

  • M. Tech. (Data Science and Engineering)

    BITS Pilani, India (2022)
  • B.Tech. (Computer Science and Engineering)

    MG University, India (2006)

Certifications

  • Data Warehousing Workshop - Snowflake - October 2024 (Credential ID 119306090)

  • Academy Accreditation - Databricks Lakehouse Fundamentals - March 2023

  • Basics of Natural Language Processing Using Python - March 2021 (NIELIT) (Credential ID olc3190)

  • Databases and SQL for Data Science by IBM (Coursera) - November 2019 (Credential ID usxtmtufvyt8)

  • Introduction to Git and GitHub by Google (Coursera) - October 2020 (Credential ID nuwxqp5a3gte)

  • Exploratory Data Analysis with Python and Pandas (Coursera) - March 2021 (Credential ID yv8396ns2l25)

AI-interview Questions & Answers

I have 11 years of experience working in multiple domains, including e-commerce, media and entertainment, pharmaceutical, and telecom. Currently, I'm working in e-commerce; my previous roles were in media and entertainment for 2.5 years and in pharmaceutical before that. Prior to that, I worked in telecom, creating security solutions for server applications. I enjoy solving technical problems and am a quick learner of new tools and technical documentation. I'm skilled at creating proof-of-concepts and exploring the feasibility of new tools in our current environment. I'm also passionate about mentoring juniors and creating a seamless work atmosphere. As a team player, I've been working remotely for 3 years and find it to be a positive experience. My earlier experience working in an office was also positive, and I appreciated the opportunity to connect with colleagues, share knowledge, and receive assistance when needed.

Implementing a data quality framework using PySpark to ensure the integrity of ETL process data is very much required for the downstream process. Mainly, we use PySpark for big data processing. To ensure quality, we have to make sure all the mandatory fields used downstream are populated properly, with null checks and so on. We can enable these null checks and more using PySpark for the incoming columns. Also, we can enforce a schema while reading the data, which ensures that each column has the data type we expect. Additionally, if we need some mandatory checks to be conducted, those can also be accommodated in the schema.
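
A minimal PySpark sketch of the checks described above, assuming illustrative column names (order_id, amount, created_at) and an illustrative S3 path:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

    spark = SparkSession.builder.appName("dq-checks").getOrCreate()

    # Enforce an explicit schema while reading so each column has the expected type.
    schema = StructType([
        StructField("order_id", StringType(), nullable=False),
        StructField("amount", DoubleType(), nullable=True),
        StructField("created_at", TimestampType(), nullable=True),
    ])
    df = spark.read.schema(schema).json("s3://my-bucket/incoming/orders/")  # hypothetical path

    # Null checks on the mandatory columns used downstream.
    mandatory = ["order_id", "created_at"]
    null_counts = df.select(
        [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in mandatory]
    ).collect()[0].asDict()

    bad = {c: n for c, n in null_counts.items() if n and n > 0}
    if bad:
        raise ValueError(f"Mandatory columns contain nulls: {bad}")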

How do you perform deduplication on a dataset in Snowflake that has been ingested from an ETL pipeline incorrectly multiple times? This was something we handled in my workplace. We performed incremental loads into Snowflake, and the corresponding tables have timestamp columns that help identify the duplicate data. We performed a count analysis on how much data was ingested and when it was ingested into the table, since the load is expected to complete at a specific time. If there are any discrepancies, we detect the rows from the extra loads and delete that data from the Snowflake table.
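
One common way to express this kind of cleanup, sketched with the Snowflake Python connector and made-up table and column names (orders, order_id, ingestion_ts); this is an illustration of the pattern, not necessarily the exact approach used:

    import snowflake.connector  # assumes the snowflake-connector-python package

    # Connection parameters are placeholders.
    conn = snowflake.connector.connect(account="my_account", user="my_user", password="***",
                                       warehouse="my_wh", database="my_db", schema="my_schema")

    # Keep the earliest ingested row per business key and rebuild the table without duplicates.
    dedup_sql = """
    CREATE OR REPLACE TABLE orders AS
    SELECT *
    FROM orders
    QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY ingestion_ts) = 1
    """
    conn.cursor().execute(dedup_sql)
    conn.close()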

If I were to design an ideal pipeline that handles time series data, which design patterns would I implement and why? To be frank, I haven't handled any time series data until now. However, one thing that comes to my mind is that we would need specific checks to ensure that when the data is populated, the scheduler runs accordingly. I'm not sure if I'm hitting the right point of this question. However, this is what comes to my mind when I read the question. There should be mechanisms to check if the data is loaded at specific time points, and the scheduler should run at appropriate times. The data should be loaded into the chosen lake tools, whether it's Databricks, Delta Lake, or Snowflake, and it should have a time count or timestamp to ensure the data is populated at the right timings. In terms of design patterns, this is one pattern I would suggest to handle time series data. If there are any discrepancies, such as a failed load at a particular point in time, alerts should be posted to the appropriate channels for stakeholders to be notified that the data failed to load.
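
A rough illustration of the freshness-and-alerting check mentioned above, with made-up thresholds; a real pipeline would wire this into the scheduler and an alerting channel:

    from datetime import datetime, timedelta, timezone

    def check_freshness(latest_loaded_at: datetime, expected_interval: timedelta) -> bool:
        """Return True if data landed within the expected interval; otherwise alert."""
        now = datetime.now(timezone.utc)
        if now - latest_loaded_at > expected_interval:
            # In a real pipeline this would post to Slack/email/PagerDuty for stakeholders.
            print(f"ALERT: data last loaded at {latest_loaded_at}, older than {expected_interval}")
            return False
        return True

    # Example: data is expected to land at least hourly.
    check_freshness(datetime.now(timezone.utc) - timedelta(hours=3), timedelta(hours=1))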

How do you detect and handle skewness in a large dataset when performing data transformation using PySpark? Skewness in a large dataset means the data is not evenly distributed across partitions. When the data is received and stored, how it is partitioned while saving might cause the skewness. If it is not partitioned properly, a few partitions might hold a much larger amount of data than the others, which causes high latency for those larger partitions. If we encounter such issues while reading the data, where a few partitions are much larger than the others, then we should choose the partitioning columns appropriately. In Databricks, we typically order (Z-order) the data on a small number of columns, around 4 initially, and statistics are kept for the first 30 or so columns; the data will be stored ordered by the values of those columns. When the data is fetched, this ordering is of great help. For example, suppose we are filtering the data based on a value: the engine can go directly to the files containing that value and fetch the data from that location, using the information in the metadata. That is the way I might handle data skewness with my current knowledge.
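
A short PySpark sketch of detecting and mitigating skew along these lines, assuming an illustrative dataset and key column (customer_id); the adaptive-execution settings are one option, salting is another:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("skew-demo").getOrCreate()

    # Detect skew by looking at the row distribution across the join/partition key.
    df = spark.read.parquet("s3://my-bucket/events/")  # hypothetical path
    df.groupBy("customer_id").count().orderBy(F.desc("count")).show(10)

    # Option 1: let Adaptive Query Execution split skewed partitions at join time (Spark 3+).
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

    # Option 2: salt the hot key so its rows spread across more partitions.
    salted = df.withColumn("salt", (F.rand() * 16).cast("int")) \
               .repartition("customer_id", "salt")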

When optimizing a SQL query for reporting purposes in Snowflake: Snowflake stores data in micro-partitions. So, basically, whatever filter conditions we give, the filtering should happen first in the inner query rather than in the outer query. Once we have the inner queries, we should apply the filter conditions there as early as possible. If you apply the filter conditions in the outermost query, it will take up more resources, all that data will get loaded into memory for processing, and the warehouse will be throttled. So, the best practice is to push as many filter conditions as possible into the subqueries and leave only the computation in the outer query. In this way, we ensure that only the relevant data is actually fetched for the computation.
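
An illustrative before/after pair showing the filter-pushdown practice described above (table and column names are made up):

    # Filtering only in the outer query forces the warehouse to handle far more data:
    slow_query = """
    SELECT region, SUM(amount)
    FROM (SELECT * FROM sales) s          -- subquery returns everything
    WHERE s.order_date >= '2024-01-01'    -- filter applied late, in the outer query
    GROUP BY region
    """

    # Pushing the filter into the inner query prunes micro-partitions early,
    # so only the relevant rows reach the outer computation:
    fast_query = """
    SELECT region, SUM(amount)
    FROM (SELECT region, amount
          FROM sales
          WHERE order_date >= '2024-01-01') s   -- filter applied in the subquery
    GROUP BY region
    """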

A PySpark job that has to join two large datasets: okay, I'm not sure if my approach is correct, but what can be done is to perform the join and then write the result as Parquet. That Parquet output can then be copied into the corresponding Snowflake table, so the processing will be faster. For the join of the two large datasets, the filter conditions should be applied to the appropriate subsets first, and then the join should happen on those, on the partition columns basically. All the filters should happen on the partition columns so that the data is fetched faster. In the case of Databricks Delta, if you are fetching the data based on the partition columns, the statistics of the data are already available, so the engine doesn't need to scan the data of the other partitions; it can go directly to the corresponding partition, fetch the data, and apply the computational logic.
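
A minimal PySpark sketch of this approach, with made-up paths, partition columns, and join key:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("large-join").getOrCreate()

    # Filter each side on the partition columns first so only the needed partitions are scanned.
    orders = spark.read.parquet("s3://my-bucket/orders/").filter(F.col("order_date") >= "2024-01-01")
    customers = spark.read.parquet("s3://my-bucket/customers/").filter(F.col("region") == "NA")

    # Join on the key and write the result as Parquet for the downstream load.
    joined = orders.join(customers, on="customer_id", how="inner")
    joined.write.mode("overwrite").parquet("s3://my-bucket/staging/orders_enriched/")

    # The Parquet output can then be loaded into Snowflake, e.g. with a COPY INTO statement
    # run against an external stage pointing at the staging path above.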

AWS managed Airflow can be a full service wherein it can pick data from multiple data sources and store it to multiple endpoints. We can leverage that end-to-end, which I am currently using in my project. Lambda functions can fit into the architecture: they are basically used when we have to do some computation or transformation on the data before loading it into a table. So we have some source data, we do some processing on it, and we have to store the result in another location. That is when Lambda functions come into the picture. They can be invoked from an API call or triggered, so that when data is available in a particular location, the function runs, processes the data appropriately, and stores it in the appropriate location. Lambda functions can also spin up EC2 machines or any of the AWS services needed for processing. Once the result is available in another location, the downstream systems can use it.
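
A minimal sketch of an S3-triggered Lambda handler of the kind described, with made-up bucket names and a placeholder transformation:

    import json
    import boto3

    s3 = boto3.client("s3")

    def lambda_handler(event, context):
        """Triggered when a file lands in the source bucket; transform and store the result."""
        record = event["Records"][0]
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        rows = json.loads(body)

        # Placeholder transformation: keep only the fields the downstream table needs.
        cleaned = [{"id": r["id"], "amount": r["amount"]} for r in rows]

        s3.put_object(Bucket="my-processed-bucket",  # hypothetical target bucket
                      Key=f"processed/{key}",
                      Body=json.dumps(cleaned).encode("utf-8"))
        return {"status": "ok", "records": len(cleaned)}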

What strategy would you use to migrate Python ETL scripts running on legacy systems to utilize PySpark for enhanced parallel processing capabilities? Okay, this was one aspect we handled on the media and entertainment project. We had Python scripts running in a SnapLogic tool, which was scheduling them and triggering EMR jobs in the background. What we did was bring Databricks into place and leverage its parallel processing capability. We coded in PySpark and used Spark SQL. It's a fairly seamless transition: even though the scripts are in Python, internally the data becomes a Spark DataFrame, which does all the processing in the background in a parallel way. We could then save the data to a Lakehouse or save it in Parquet format, as needed. This can be achieved using Databricks, and we've already done that.
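
A small illustration of such a migration, showing a made-up legacy pandas script and its PySpark equivalent:

    # Legacy single-node version (pandas):
    import pandas as pd

    pdf = pd.read_csv("sales.csv")
    out = pdf[pdf["amount"] > 0].groupby("region", as_index=False)["amount"].sum()
    out.to_parquet("sales_by_region.parquet")

    # Equivalent PySpark version, which runs the same logic in parallel on a cluster:
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("legacy-migration").getOrCreate()
    sdf = spark.read.option("header", "true").option("inferSchema", "true").csv("sales.csv")
    (sdf.filter(F.col("amount") > 0)
        .groupBy("region")
        .agg(F.sum("amount").alias("amount"))
        .write.mode("overwrite")
        .parquet("sales_by_region/"))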

What key metrics would you use to measure and improve the performance of an ETL pipeline that frequently handles JSON and CSV data files? From a performance perspective, multiple things come into consideration: how fast is it able to process the data, and what is the failure rate? When the JSON files come in and we encounter some new elements, does the pipeline fail? Or in CSV, if we encounter any new columns, does it fail? So, we have generalized it. The main thing is to have a schema in place with the appropriate data types specified, so that we only process data of the expected type from the source. This helps improve both the quality and the performance of our ETL pipeline.
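
A short PySpark sketch of tracking such metrics by enforcing a schema and counting malformed records (the schema and path are illustrative):

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("etl-metrics").getOrCreate()

    schema = StructType([
        StructField("id", StringType()),
        StructField("amount", DoubleType()),
        StructField("_corrupt_record", StringType()),  # malformed rows land here in PERMISSIVE mode
    ])

    # Cache before inspecting the corrupt-record column (Spark requires this for such queries).
    df = spark.read.schema(schema).option("mode", "PERMISSIVE").json("s3://my-bucket/raw/").cache()

    total = df.count()
    malformed = df.filter(F.col("_corrupt_record").isNotNull()).count()
    print(f"rows={total}, malformed={malformed}, failure_rate={malformed / max(total, 1):.2%}")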

Can you propose a method for real-time data processing using AWS Lambda and Kinesis for a data-driven application? Yeah, as I said before, AWS Lambda is something that I have not used. But in the project that I worked on, there was a data science aspect to it. What was done is, we had all the models ready and deployed, and a Lambda function would run on a scheduled basis, taking in data, generating results, and storing them into the target table. That target table would then be used for predicting things like the best products that customers could buy. For Kinesis, yeah, this was used for iTrouble. iTrouble sends data in batches; Kinesis was configured to receive the data for processing, and that data would then be stored in our S3 location.
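
A minimal sketch of a Lambda handler consuming Kinesis records and landing them in S3, with made-up bucket and key names:

    import base64
    import json
    import boto3

    s3 = boto3.client("s3")

    def lambda_handler(event, context):
        """Decode Kinesis records from the stream and persist them to S3 for downstream use."""
        rows = []
        for record in event["Records"]:
            payload = base64.b64decode(record["kinesis"]["data"]).decode("utf-8")
            rows.append(json.loads(payload))

        s3.put_object(Bucket="my-landing-bucket",  # hypothetical bucket
                      Key=f"kinesis-batch/{context.aws_request_id}.json",
                      Body=json.dumps(rows).encode("utf-8"))
        return {"records": len(rows)}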