
Data Scientist
The Family Office Company Bsc
AI ML Engineer
Biomed Informatics
Data Science Internship
Biomed Informatics
R PROGRAMMING
I'm Kelly Turan. I'm currently working as an AI engineer at Biomed Informatics. My background is in computer science engineering, where I earned my bachelor's degree. I then completed a postgraduate diploma in AI and ML. After that, I joined Biomed Informatics as an AI engineer. Here, our main work is developing AI and ML models, mostly for the healthcare sector. We use computer vision to build predictive models that analyze CT scans and X-ray images to detect tumors or fractures. We also do predictive analysis for clients based on their data. For example, we recently worked on a project predicting IPL run scores and bowlers' wickets from the client's data. We also develop chatbots integrated with LLMs, implementing APIs and fine-tuning models for specific use cases. We then quantize the models so they can be deployed on hardware with minimal resources. These are the main aspects of my education and work experience.
So, basically, ETL pipelines are extract, transform, and load pipelines, which we use to turn raw data into a form the model can digest. Geospatial data usually comes with mixed types: text, numbers, and categorical values. To build the pipeline, we first convert the categorical fields, for example a weather value such as "sunny", into numeric vectors so the model can work with them. Then we scale the numeric fields. Here we use the MinMaxScaler, StandardScaler, or RobustScaler, depending on the kind of data and its range of values. We scale so that all features fall in a comparable range and no single feature dominates training. To set up the pipeline, we use the Pipeline class from scikit-learn, and in it we include the scaling steps and the model we want to use, and then we check the metrics we need for that model. So this is a high-level procedure for an ETL pipeline that handles geospatial data. Using the scikit-learn Pipeline, we can chain the preprocessing and the model for geospatial or any other kind of data.
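The two transform steps named in the answer above, encoding a categorical field and scaling numeric fields into a fixed range, can be sketched without any dependencies. This is only an illustration: the field names (`weather`, `lat`, `lon`) and the sample rows are hypothetical, and in practice scikit-learn's `Pipeline`, `OneHotEncoder`, and `MinMaxScaler` would play these roles.

```python
# Dependency-free sketch of one-hot encoding plus min-max scaling.
# Field names and sample rows are hypothetical, chosen only for illustration.

def one_hot(values):
    """Map each categorical value to a 0/1 vector over the sorted set of categories."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]

def min_max_scale(values):
    """Rescale numbers linearly into [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]

def transform(rows):
    """Transform step: encode 'weather', scale 'lat' and 'lon', and join the features."""
    categories, encoded = one_hot([r["weather"] for r in rows])
    lat = min_max_scale([r["lat"] for r in rows])
    lon = min_max_scale([r["lon"] for r in rows])
    features = [enc + [la, lo] for enc, la, lo in zip(encoded, lat, lon)]
    return categories, features

rows = [
    {"weather": "sunny", "lat": 10.0, "lon": 70.0},
    {"weather": "rainy", "lat": 20.0, "lon": 80.0},
    {"weather": "sunny", "lat": 30.0, "lon": 90.0},
]
categories, features = transform(rows)
for row in features:
    print(row)
```

Each output row holds the one-hot columns first (in the order given by `categories`) followed by the scaled latitude and longitude.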
So, basically, the system is designed to automate the recognition and flagging of outdated POI listings. I am not familiar with this kind of system, but I have a bit of knowledge of automated recognition. As I understand it, a POI listing is something that feeds a processing pipeline, and it's an important process, but, sorry, I'm not able to say exactly.
So an LLM is a good choice for enhancing an existing NLP-based system because such systems are often built on LSTMs or encoder-decoder models that were not trained on a huge amount of data, so they cannot give users good responses. If we instead use an LLM adapted to our own data, the chatbot becomes much more effective, users are more satisfied, and they use it more. It is also easier to train, because an existing LLM, already trained on a huge amount of data with billions of parameters, can pick up knowledge from our data once that data is converted into vectors. So an LLM not only generates good responses but can also respond more like a human than a typical chatbot. It also needs fewer training hours, which makes it cost-effective: instead of training from scratch on hundreds of GPUs, we can fine-tune a multi-billion-parameter model, even a multimodal one, on a single GPU, which is a strong option for upgrading an existing NLP-based system.
The question is how to scale an R script to handle a 10-fold increase in data size. So, basically, we can handle the data more efficiently in our R code, for example by using data.table instead of a data frame for faster data manipulation. Then memory management: we clean up unused objects with R's rm() and gc() functions to free memory. Then vectorization: we replace loops with vectorized operations wherever possible. R also supports parallel processing, using packages like parallel and foreach for parallelized tasks, and distributed computing, for example using sparklyr to interface with Apache Spark for distributed data processing. Then algorithm optimization: we choose algorithms that scale well with data size, such as stochastic gradient descent for linear models, and we use sampling techniques. Then batch processing: for algorithms that support it, we process data in batches instead of loading all the data at once. Then efficient data storage and database integration, with PostgreSQL or MySQL. Then hardware utilization: machines with more RAM and CPU, cloud solutions like AWS for scalability, and GPU acceleration for machine learning using packages like tensorflow and keras. For benchmarking and profiling, we can use the profvis package. And for cloud and distributed computing, we can use H2O for scalable machine learning with its Hadoop interface. So we can conclude that scaling an R script for a 10-fold increase in data size requires a combination of efficient coding practices, parallel and distributed computing, and potentially more powerful hardware or cloud services. By systematically applying these strategies, we can ensure that our script remains performant even with significantly larger datasets.
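The batch-processing point above is independent of language. The sketch below shows it in Python rather than R, to match the other examples in this transcript: a column mean is computed one batch at a time, so only running totals are held in memory. The column name, batch size, and in-memory sample are assumptions made for illustration.

```python
import csv
import io
from itertools import islice

def read_in_batches(handle, batch_size):
    """Yield lists of CSV rows (as dicts) holding at most batch_size rows each."""
    reader = csv.DictReader(handle)
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch

def streaming_mean(handle, column, batch_size=2):
    """Compute a column mean batch by batch, keeping only a running sum and count."""
    total, count = 0.0, 0
    for batch in read_in_batches(handle, batch_size):
        total += sum(float(row[column]) for row in batch)
        count += len(batch)
    return total / count if count else None

# Hypothetical in-memory data standing in for a large file on disk.
sample = io.StringIO("id,value\n1,10\n2,20\n3,30\n4,40\n5,50\n")
mean_value = streaming_mean(sample, "value")
print(mean_value)
```

The same pattern applies to any aggregate that can be updated incrementally, such as sums, counts, minima, and maxima.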
So, basically, we can start with the system architecture: ingesting real-time data streams with Apache Kafka, Amazon Kinesis, or Google Cloud, and maintaining a database of POI information in a relational database like PostgreSQL or MySQL. Then, in the data processing layer, we do stream processing with frameworks like Apache Flink, Spark Streaming, or Kafka Streams. Then we implement microservices for the different tasks, such as data ingestion, data processing, and data enrichment. We store the POI information in a relational table with appropriate indexing, and the real-time data goes into its own tables. Then comes enrichment and processing: data ingestion, data parsing, and geospatial enrichment, where we process the incoming data and join it with the POI information. For the real-time streaming data, we can run geospatial queries using the PostGIS extension for PostgreSQL. Then we have monitoring and scaling: we use monitoring tools to track performance and detect bottlenecks, and auto-scaling with Kubernetes or cloud-based solutions to handle varying loads. So, by combining real-time stream processing frameworks, an efficient database, the geospatial capabilities of PostGIS, and microservices, we can build a robust SQL-based solution for real-time POI data enrichment. And, of course, we need to monitor and scale the system as needed to handle growing data volumes.
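The geospatial enrichment step above, matching an incoming event to nearby POI information, can be illustrated in plain Python. This is a stand-in, not the PostGIS query itself: it uses the haversine formula and a linear scan, whereas a database would use a spatial index. The event fields, POI records, and the 1 km radius are hypothetical.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def enrich(event, pois, max_km=1.0):
    """Return a copy of the event tagged with the nearest POI within max_km, else None."""
    nearest_name, nearest_km = None, max_km
    for poi in pois:
        distance = haversine_km(event["lat"], event["lon"], poi["lat"], poi["lon"])
        if distance <= nearest_km:
            nearest_name, nearest_km = poi["name"], distance
    return {**event, "poi": nearest_name}

# Hypothetical POI table and streaming events.
pois = [
    {"name": "Hotel A", "lat": 40.7585, "lon": -73.9850},
    {"name": "Cafe B", "lat": 40.7812, "lon": -73.9665},
]
near_event = enrich({"id": 1, "lat": 40.7580, "lon": -73.9855}, pois)
far_event = enrich({"id": 2, "lat": 0.0, "lon": 0.0}, pois)
print(near_event["poi"], far_event["poi"])
```

In a stream processor, `enrich` would run once per incoming event, with the POI list replaced by a lookup against the indexed POI table.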
Selecting rows from the POI table with certain attributes is what this query is trying to accomplish. Looking at it, I can see that we are selecting the name, the location, and the category from POIS, which is a table. Then we filter by category: it attempts to keep the POIs where the category is either hotel or restaurant. Then the non-null location check filters out the POIs where the location field is null. Then the sort by name length: the results are intended to be ordered by the length of the name field in descending order. Then we limit the output to the top ten results. I think there are a few syntax errors in the SELECT statement. So this is what the SQL is trying to do.
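The query under discussion is not reproduced in the transcript, so the following is a reconstruction from the description above, run against an in-memory SQLite database. The table layout, column types, and sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema inferred from the description: name, location, category.
conn.execute("CREATE TABLE pois (name TEXT, location TEXT, category TEXT)")
conn.executemany(
    "INSERT INTO pois VALUES (?, ?, ?)",
    [
        ("Grand Plaza Hotel", "12.97,77.59", "hotel"),
        ("Cafe", "12.98,77.60", "restaurant"),
        ("City Museum", "12.99,77.61", "museum"),  # excluded by the category filter
        ("Lost Inn", None, "hotel"),               # excluded by the NULL check
    ],
)

query = """
SELECT name, location, category
FROM pois
WHERE category IN ('hotel', 'restaurant')
  AND location IS NOT NULL
ORDER BY LENGTH(name) DESC
LIMIT 10
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The museum row and the hotel with a missing location are filtered out, and the remaining rows come back longest name first.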
So we are examining this Python code for parsing geospatial data, because it could lead to unhandled exceptions. Yes, I think there are a few issues in the code. Firstly, there's an indentation error: the code inside the try block is not properly indented, so that needs to be fixed. Then the import statement brings in GeoPandas as 'gdp', but the code later uses 'GPT', which is inconsistent; the alias should be the same throughout. The read_file function call has mismatched quotes around the file name. Then, in the exception handling, the string inside the first print statement has mismatched quotes and is missing a closing parenthesis. The print statement in the second, general exception handler is missing its opening and closing parentheses. So these are the errors I can see in the code.
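The code being reviewed is not included in the transcript. As a sketch of the structure the answer describes, a correctly indented try block, a specific handler, and a general fallback handler, the example below parses a GeoJSON file with the standard `json` module. Swapping GeoPandas for `json` is a deliberate simplification to keep the example dependency-free, and the file name is hypothetical.

```python
import json

def load_geojson(path):
    """Return the list of GeoJSON features, or None if the file cannot be read or parsed."""
    try:
        # Everything that can fail sits inside the try block, consistently indented.
        with open(path, encoding="utf-8") as handle:
            data = json.load(handle)
        return data["features"]
    except FileNotFoundError:
        print(f"File not found: {path}")
    except (json.JSONDecodeError, KeyError) as exc:
        print(f"Invalid GeoJSON in {path}: {exc}")
    except Exception as exc:  # last-resort handler, like the general except in the answer
        print(f"Unexpected error while reading {path}: {exc}")
    return None

# Hypothetical path that is expected not to exist.
result = load_geojson("missing_file.geojson")
```

Ordering the handlers from most to least specific keeps the error messages informative while still catching anything unexpected.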
So, basically, extending an ETL pipeline to integrate third-party POI data feeds involves several steps. In the extraction phase, we use API integration: the RESTful APIs or web services provided by the third-party POI data providers, with tools like the requests library in Python for fetching the data. Then we schedule the extraction with cron jobs or a scheduling tool like Apache Airflow to pull data regularly. In the transformation phase, we do data cleansing and normalization; schema mapping, that is, mapping the third-party schema to our internal schema; and data validation, like validating data types, removing duplicates, and handling missing data. Then we have the loading phase, with transactional loading and staging tables: we use database transactions to ensure ACID properties, and we load data into staging tables first so we can validate it before merging it into the main tables. For atomicity, each ETL step should be atomic: in case of failure, the system should revert to the previous consistent state. For consistency, we use constraints, triggers, and validation rules in the database to maintain integrity. For isolation, we set proper transaction isolation levels so that transactions do not interfere with each other. And for durability, once a transaction is committed it must be stored permanently, so we use reliable storage and regular backups. So this is the process by which we can integrate third-party POI data feeds while maintaining
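The loading phase described above, staging first, validating, then merging inside one transaction that rolls back on failure, can be sketched with SQLite. The table names (`poi`, `poi_staging`), the validation rule, and the sample records are assumptions for illustration.

```python
import sqlite3

def load_feed(conn, records):
    """Load records through a staging table; either all of them reach poi or none do."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes the block
            conn.execute("DELETE FROM poi_staging")
            conn.executemany(
                "INSERT INTO poi_staging (id, name, category) VALUES (?, ?, ?)", records
            )
            bad = conn.execute(
                "SELECT COUNT(*) FROM poi_staging WHERE name IS NULL OR name = ''"
            ).fetchone()[0]
            if bad:
                raise ValueError(f"{bad} staged record(s) failed validation")
            conn.execute(
                "INSERT OR REPLACE INTO poi SELECT id, name, category FROM poi_staging"
            )
        return True
    except (ValueError, sqlite3.Error) as exc:
        print(f"Load rolled back: {exc}")
        return False

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE poi (id INTEGER PRIMARY KEY, name TEXT NOT NULL, category TEXT)")
conn.execute("CREATE TABLE poi_staging (id INTEGER, name TEXT, category TEXT)")

# Hypothetical feeds: the first is clean, the second contains a record with no name.
ok = load_feed(conn, [(1, "Harbor Hotel", "hotel"), (2, "Noodle Bar", "restaurant")])
failed = load_feed(conn, [(3, "Sky Lounge", "bar"), (4, None, "hotel")])
poi_count = conn.execute("SELECT COUNT(*) FROM poi").fetchone()[0]
print(ok, failed, poi_count)
```

Because the second feed fails validation, nothing from it reaches the main table, including its valid record.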
So, basically, the concern here is that if we optimize without care, data integrity is compromised. In this case, the pipeline has an extraction phase using API integration with RESTful APIs or web services, a transformation phase with schema mapping and data validation, and a loading phase with transactional loading and staging tables, and we ensured ACID properties. So we can plan our optimization on top of the same process as in the previous question, keeping the ACID properties. First, we refactor queries: we analyze the existing queries to identify bottlenecks. Then we use indexes on the columns that appear frequently in WHERE clauses and JOIN conditions. We limit the use of DISTINCT to the cases where it is actually necessary, to minimize its impact on query performance, and we use UNION ALL instead of UNION where duplicates cannot occur or do not matter. We keep the data model normalized, but we consider denormalizing tables for read-heavy operations to reduce joins and improve query performance. Then we analyze the query execution plans, especially for joins, and check that an appropriate join algorithm, such as a hash join, merge join, or index-based join, is used for the size of the tables and the available indexes. Then parameterization: instead of embedding values directly in SQL, we use parameterized queries, which also avoids SQL injection. Then there is regular maintenance. Finally, we test, so that the SQL queries in the ETL process become more efficient without sacrificing data integrity.
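Two of the points above, indexing a column used for filtering and binding values as parameters, can be shown with SQLite. The table, index name, and rows are hypothetical, and the plan text printed by `EXPLAIN QUERY PLAN` varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE poi (id INTEGER PRIMARY KEY, name TEXT, category TEXT)")
conn.executemany(
    "INSERT INTO poi (name, category) VALUES (?, ?)",
    [("Harbor Hotel", "hotel"), ("Noodle Bar", "restaurant"), ("City Museum", "museum")],
)
# Index the column that the WHERE clause filters on.
conn.execute("CREATE INDEX idx_poi_category ON poi (category)")

def find_by_category(conn, category):
    """Parameterized lookup: the value is bound, never spliced into the SQL text."""
    return conn.execute(
        "SELECT name FROM poi WHERE category = ? ORDER BY name", (category,)
    ).fetchall()

hotels = find_by_category(conn, "hotel")
# An injection attempt is treated as an ordinary string and matches nothing.
injected = find_by_category(conn, "hotel' OR '1'='1")

# Inspect the plan to see how SQLite intends to execute the filter.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM poi WHERE category = ?", ("hotel",)
).fetchall()
print(hotels, injected)
print(plan)
```

Reading the plan before and after adding an index is a quick way to confirm that a refactoring actually changed how the query is executed.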
The question is about integrating an LLM into an existing Python-based ETL pipeline for enhanced NLP processing. So, basically, we can go step by step. Firstly, choosing the model, then data preprocessing, and then model inference, that is, running the LLM on the input data for the use case. Then we post-process the output generated by the LLM as needed, which may include decoding, formatting, and further analysis, and then error handling. Then we do testing and validation. After that, we optimize the LLM integration for performance with techniques like batch processing and parallelization. Then documentation and training: documenting the integration process and providing training and support for the team members working with the integrated model. We then use CI/CD pipelines to continuously monitor and evaluate the performance of the LLM integration. This is the main process. Beyond that, we can also use A/B testing, and prompt tuning, P-tuning, prefix tuning, or fine-tuning the models on the data we have, which can improve ROUGE scores for some models. For evaluation, we can check ROUGE metrics or use an LLM as the evaluator. We also have Weights & Biases, also known as W&B, which can be used in the ETL pipeline: it regularly tracks the model's performance, and it can send an email alert if the model is lagging behind the scores we have benchmarked. This is the process to enhance NLP processing.
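The evaluation and monitoring idea above, scoring outputs with ROUGE and flagging a drop below a benchmark, can be sketched as follows. This is a simplified ROUGE-1 (lowercase whitespace tokens, no stemming), not a full ROUGE implementation, and the benchmark value and sample texts are hypothetical. The alert itself is left as a boolean rather than an email or a W&B call.

```python
from collections import Counter

def rouge1(candidate, reference):
    """Unigram-overlap precision, recall, and F1 on lowercase whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared unigrams
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

def below_benchmark(scores, benchmark_f1):
    """Return True when the mean F1 is under the benchmark, i.e. an alert is due."""
    mean_f1 = sum(f1 for _, _, f1 in scores) / len(scores)
    return mean_f1 < benchmark_f1

# Hypothetical model output and reference summary.
scores = [rouge1("the cat sat", "the cat sat on the mat")]
alert = below_benchmark(scores, benchmark_f1=0.8)
print(scores, alert)
```

In a pipeline, the list of scores would come from a held-out evaluation set scored after each run, with the alert wired to whatever notification channel the team uses.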