Dear Leiden University students, below you can read about the open thesis positions at TDS Lab for 2025-2026, which are available to Leiden University students only. More topics will be added in the coming weeks! Contact me if you are interested in one of these Master's thesis research projects. Cheers, Marco

Open thesis projects @TDS Lab

  1. [NLP] External validation of NLP model for lung cancer prediction

    In a research collaboration with Amsterdam UMC and Erasmus MC, you will externally validate a successful NLP model, developed at Amsterdam UMC, for the early detection of lung cancer from GPs' clinical notes. The validation will take place at LUMC and/or Erasmus MC, using real-world GP clinical texts. Fine-tuning procedures based on error analyses and ablation studies to further optimise the original model will likely constitute part of your scientific contribution.

    Daily supervisors: Marco Spruit and others

  2. [LLM] Leveraging Large Language Models to Improve Prediction in Geriatric Care

    In prospective studies of older patients, such as the TENT study (cancer patients, n=2000) and the APOP study (Emergency Department patients, n=750), we have performed comprehensive baseline geriatric assessments. These included daily functioning, comorbidities, living situation, cognition, nutrition, and frailty, with follow-up for one year on mortality, quality of life, and functional decline. Using these structured data, we developed prediction models, but their performance was modest, with AUCs typically below 0.75. We hypothesize that this limitation reflects the fact that many aspects of frailty, multimorbidity, and patient context are recorded not in structured variables but in free-text data such as referral letters, discharge summaries, and clinician notes. Recent advances in large language models (LLMs) allow scalable extraction of clinically relevant features from such text, potentially yielding more accurate and clinically useful predictions.

    Aim: To validate whether privacy-friendly large language models applied to unstructured electronic health record (EHR) text can improve prediction of mortality, functional decline, and quality of life in older patients, compared with models based only on structured geriatric assessments.

    Design & Setting: Retrospective analysis of existing prospective cohorts, enriched with EHR text data available via CTCue at Leiden University Medical Center.

    Expected Impact: This project directly tests whether LLMs can unlock hidden predictive value from routinely collected clinical text. If successful, it will provide the first validated evidence that LLM-enhanced models outperform standard geriatric assessments in predicting outcomes across two distinct high-risk populations. This could improve prognostication, support shared decision-making, and help allocate geriatric resources more effectively - contributing to more person-centered, equitable care for older patients.

    Daily supervisors: Simon Mooijaart, Bram van Dijk (LUMC), Marco Spruit

  3. [ML] Balanced and balancing distance measures for mixed variable types

    Many AI, ML and data science methods depend on the notion of a distance, which often acts as a dissimilarity measure between observations in the data set. Real-world data sets often combine variables of several types, e.g. continuous, ordinal, nominal/categorical and binary. In such cases, dissimilarity is almost always measured using Gower's distance. It min-max-scales numeric variables, and assigns a distance to non-numeric variables of 1 if the values are unequal and 0 if they are equal. The per-variable distances are then simply added, as in the Manhattan distance measure. The implication is that distances are dominated by categorical dimensions: a non-zero categorical distance equals the largest possible distance in a numeric dimension, whereas numeric distances will typically be smaller. Also, average distances per dimension are not equalized (not even if the dimensions themselves are normalized or standardized first), and are dominated by imbalanced columns. This project will develop a balanced version of Gower's distance that makes the contribution of every feature equal on average, while leaving the possibility to re-weight the contribution of features. The resulting distance measure will be used for risk stratification of people with metabolic syndrome on a large-scale data warehouse with health, demographic and socio-economic data, but is expected to find widespread use in distance-based machine learning tasks on heterogeneous data.

    Daily supervisors: Marcel Haas (LUMC), Marco Spruit
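
For students who want a feel for project 3, the Gower's distance described there can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the toy records, the function names (`feature_distances`, `gower`, `mean_contributions`) and the 1 / mean weighting are all hypothetical and are not the method the project will develop. The sketch computes the per-variable contributions, their average over all pairs of records, and one naive way to equalize those averages.

```python
from itertools import combinations

def feature_distances(a, b, is_numeric, ranges):
    """Per-variable Gower contributions for two records a and b."""
    parts = []
    for j, (u, v) in enumerate(zip(a, b)):
        if is_numeric[j]:
            # Numeric: absolute difference scaled by the observed range, so in [0, 1].
            parts.append(abs(u - v) / ranges[j] if ranges[j] > 0 else 0.0)
        else:
            # Non-numeric: 0 if the values are equal, 1 if they differ.
            parts.append(0.0 if u == v else 1.0)
    return parts

def gower(a, b, is_numeric, ranges, weights=None):
    """Weighted sum of per-variable contributions (unweighted by default)."""
    parts = feature_distances(a, b, is_numeric, ranges)
    if weights is None:
        weights = [1.0] * len(parts)
    return sum(w * d for w, d in zip(weights, parts))

def mean_contributions(rows, is_numeric, ranges):
    """Average contribution of each variable over all pairs of records."""
    pairs = list(combinations(rows, 2))
    totals = [0.0] * len(is_numeric)
    for a, b in pairs:
        for j, d in enumerate(feature_distances(a, b, is_numeric, ranges)):
            totals[j] += d
    return [t / len(pairs) for t in totals]

# Hypothetical toy records: (age, sex, smoker)
rows = [(30, "F", "no"), (40, "M", "no"), (50, "F", "no"), (70, "M", "yes")]
is_numeric = [True, False, False]
ranges = [
    (max(r[j] for r in rows) - min(r[j] for r in rows)) if is_numeric[j] else 0.0
    for j in range(len(is_numeric))
]

means = mean_contributions(rows, is_numeric, ranges)
# Illustrative balancing (an assumption, not the project's method):
# weight each variable by 1 / its mean contribution.
weights = [1.0 / m if m > 0 else 0.0 for m in means]

print("mean contribution per variable:", means)
print("plain Gower, rows 0 and 1:", gower(rows[0], rows[1], is_numeric, ranges))
print("balanced, rows 0 and 1:", gower(rows[0], rows[1], is_numeric, ranges, weights))
```

With these weights every variable contributes 1 on average over all pairs, and the `weights` argument still allows features to be re-weighted afterwards. Whether such a simple rescaling behaves well on imbalanced columns and large data sets is exactly the kind of question the thesis would investigate.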