Open thesis projects @TDS Lab

  1. [ML] An evaluation of data analysis techniques in digital health applications

    With the surge in data availability in healthcare, the potential of data-driven digital health applications is rising. In this research you will review different categories of digital health applications and investigate how suitable different data analysis techniques are for each of them. Finally, you will benchmark these techniques on a real-world medical dataset, in the context of demonstrating the effectiveness of a digital health application. You will contribute to showing which techniques are most effective for specific types of digital health applications.

    Daily supervisor: Jim Achterberg (LUMC), Marco Spruit
  2. [ML] Balanced and balancing distance measures for mixed variable types

    Many AI, ML and data science methods depend on the notion of a distance, which often acts as a dissimilarity measure between observations in the data set. In real-world data sets, variables have various types, e.g. continuous, ordinal, nominal/categorical and binary, contained within one data set. In such cases, dissimilarity is almost always measured using Gower's distance. It min-max-scales numeric variables, and assigns distances to non-numeric variables as 1 if the values are unequal, and 0 if they are equal. Dimensions are just added directly, as in the Manhattan distance measure. The implication is that distances are dominated by categorical dimensions, because a non-zero categorical distance equals the largest possible distance in a numeric dimension, where distances will typically be smaller. Also, average distances per dimension are not equalized (not even if the dimensions themselves are normalized or standardized first), and are dominated by imbalanced columns. This project will develop a balanced version of Gower's distance that makes the contribution of every feature equal on average, and leaves open the possibility to re-weight the contribution of features. The resulting distance measure will be used for risk stratification of people with metabolic syndrome on a large-scale data warehouse with health, demographic and socio-economic data, but is expected to find widespread use in distance-based machine learning tasks on heterogeneous data.
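
    The behaviour described above can be illustrated with a minimal sketch of the plain (unbalanced) Gower-style distance, assuming the simplest form: min-max-scaled absolute differences for numeric columns, a 0/1 mismatch for categorical columns, and a direct sum over dimensions. The function names and the toy records (an age and a smoker flag) are hypothetical and only serve the illustration; the balanced variant that the project aims to develop is not shown.

```python
from itertools import combinations


def column_ranges(rows, numeric_cols):
    """Return {column index: max - min} for the numeric columns."""
    ranges = {}
    for j in numeric_cols:
        values = [row[j] for row in rows]
        ranges[j] = max(values) - min(values)
    return ranges


def gower_contributions(a, b, ranges):
    """Per-dimension distance between two records.

    Numeric columns (those present in `ranges`) contribute |x - y| / range;
    all other columns contribute 1 if the values differ and 0 otherwise.
    """
    contributions = []
    for j, (x, y) in enumerate(zip(a, b)):
        if j in ranges:
            span = ranges[j]
            contributions.append(abs(x - y) / span if span > 0 else 0.0)
        else:
            contributions.append(0.0 if x == y else 1.0)
    return contributions


def gower_distance(a, b, ranges):
    """Sum the per-dimension contributions, Manhattan-style."""
    return sum(gower_contributions(a, b, ranges))


def mean_contribution_per_column(rows, ranges):
    """Average contribution of each column over all pairs of records."""
    pairs = list(combinations(rows, 2))
    totals = [0.0] * len(rows[0])
    for a, b in pairs:
        for j, c in enumerate(gower_contributions(a, b, ranges)):
            totals[j] += c
    return [t / len(pairs) for t in totals]


# Hypothetical toy data: (age in years, smoker flag).
rows = [
    (30.0, "yes"),
    (40.0, "no"),
    (50.0, "yes"),
    (70.0, "no"),
]
ranges = column_ranges(rows, numeric_cols=[0])
means = mean_contribution_per_column(rows, ranges)
print("mean contribution per column:", means)
```

    On this toy data the categorical column contributes more on average than the numeric one, which is the imbalance the project sets out to correct. Note that the commonly used definition of Gower's distance additionally divides the sum by the number of variables; that rescaling does not change the relative contribution of the columns.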

    Daily supervisor: Marcel Haas (LUMC), Marco Spruit
  3. [NLP] From mobile app to furry social robot: Welzijn.AI

  4. [NLP] Dutch NLP with English BERT models

    It has been shown that fine-tuning English BERT models on translated Dutch clinical text can achieve results comparable to using Dutch BERT models on the original Dutch text. Given that Dutch BERT models are often trained on smaller datasets, translating into English to take advantage of larger and more robust English BERT models presents an exciting opportunity. Interestingly, this translation approach is largely unexplored in current NLP research, which typically focuses on building new models for each language. A shift to translation could save considerable time and resources. In this project, you will research how well translation works on a broader range of Dutch NLP tasks, with the opportunity to extend the approach to other languages as well. Your work could play a key role in shaping the future of BERT research for minority languages, and it is an opportunity to make a meaningful impact in the field of non-English NLP!

    Based on Extracting Patient Lifestyle Characteristics from Dutch Clinical Text with BERT Models. Daily supervisor: Hielke Muizelaar, Marco Spruit
  5. [ML] MDL-based association rule mining on ELAN data

    Further the research in MSc thesis by

    Daily supervisor: Marco Spruit, t.b.d.

  6. [NLP] LLMs in Dutch Elderly Care

    The MINUTES study: The aim of the COVID-19 management in nursing homes by outbreak teams (MINUTES) study is to describe the challenges, responses and the impact of the COVID-19 pandemic in Dutch nursing homes. In this first article, we describe the MINUTES Study and present data characteristics.
    The MINUTES study has been very valuable in managing the crisis that the COVID-19 pandemic caused in nursing homes. The data, minutes of crisis-team meetings, were gathered and analysed using traditional qualitative research methods. In total, more than 10,000 separate minutes have been collected.

    The research question (RQ) that we are interested in is:

    Relevant MINUTES papers: HERE.
  7. [NLP] LLMs in the analysis of interviews with older people about goals of care: a pilot study

    Large language models (LLMs), such as those used in ChatGPT and other chatbots, are a form of conversational artificial intelligence. Qualitative research based on interviews and focus groups analyses conversations to identify themes or paradigms.
    Qualitative research plays an especially important role in conversations about care goals with older people, as older people have care needs that may differ from those of younger adults. These qualitative interview studies are, however, time-consuming, and it is unknown what role LLMs may play.

    Our objective is to compare the results of the qualitative analysis of single interviews performed according to current standards with the analysis performed by LLMs.

    Methods: As part of the Master Health Ageing and Society, ten groups of four students will interview older people about their perspective on life and care. The interviews will be transcribed verbatim and coded using ATLAS.ti software, and an inductive analysis will be performed.
    In parallel, the interview transcripts will be analyzed by LLMs with the same intention, using different prompts and different LLMs.
    The results of the conventional analysis and the analyses by LLMs will be compared. Interviewees will be asked to blindly score all of the analyses (on a scale of 1 to 10) on how well each reflects their perspective, and to indicate which of the four analyses performed best.

  8. [ML] PHAETON: Portable platform-as-a-service for crowdsourced and privacy respecting data analysis and modeling in pandemic response

    Daily supervisors: Marcel Haas (LUMC), Marco Spruit