Evaluating Vision-Language Foundation Models for Chest X-ray Abnormality Detection
Chest X-ray imaging is one of the most frequently used medical imaging modalities and
plays an important role in the diagnosis and follow-up of thoracic diseases such as
pneumonia, pleural effusion, atelectasis, cardiomegaly, pulmonary edema, and lung
opacities. Interpreting chest X-rays is challenging because abnormalities can be subtle,
overlapping, or ambiguous, and because the same visual finding may be described
differently across radiology reports.
In recent years, foundation models and vision-language models have shown strong
performance in natural image understanding by learning joint representations of images
and text. Similar models are now being developed for the medical domain, often using
large collections of medical images paired with radiology reports. These models could
potentially support tasks such as abnormality detection, zero-shot classification, report
retrieval, or image-text matching.
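As a minimal sketch of the image-text matching idea: in a CLIP-style model, zero-shot abnormality classification reduces to comparing an image embedding against text embeddings of candidate label prompts via cosine similarity. The embeddings below are random placeholders standing in for the outputs of a real pretrained encoder; the prompt strings and the 512-dimensional size are illustrative assumptions, not a specific model's interface.

```python
import numpy as np

def zero_shot_scores(image_emb, text_embs):
    """Cosine similarity between one image embedding and each label-prompt embedding."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return txt @ img  # one similarity score per prompt

# Placeholder embeddings standing in for a pretrained image/text encoder's outputs.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=512)
prompts = [
    "a chest X-ray showing pleural effusion",
    "a chest X-ray with no pleural effusion",
]
text_embs = rng.normal(size=(len(prompts), 512))

scores = zero_shot_scores(image_emb, text_embs)
predicted = prompts[int(np.argmax(scores))]
```

With a real model, only the embedding step changes; the similarity-and-argmax logic stays the same, which is what makes prompt wording the main degree of freedom in zero-shot evaluation.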
However, it remains unclear how reliable such models are on clinically relevant chest
X-ray interpretation tasks. General-purpose models may lack sufficient understanding of
medical concepts, while biomedical models may still be sensitive to prompt wording,
dataset bias, and domain shift. A systematic evaluation is therefore needed before such
models can be considered useful in medical imaging workflows.
The objective of this thesis is to evaluate the usefulness and limitations of pretrained
vision-language foundation models for chest X-ray interpretation. The project will
comprise three main steps. First, a literature study will be performed on vision-language
models, medical image foundation models, and chest X-ray analysis. Second, several
pretrained models will be applied to public chest X-ray datasets using zero-shot and
few-shot strategies. Third, the obtained results will be compared with classical supervised
machine learning or deep learning baselines.
Framework of the Thesis
The development will be carried out in Python, using open-source medical imaging,
machine learning, and deep learning libraries.
The project will involve:
- Literature study on medical vision-language models, chest X-ray classification, and
  prompt-based learning.
- Selection and preprocessing of one or more public chest X-ray datasets, for
  example CheXpert, NIH ChestX-ray14, MIMIC-CXR, or a smaller curated subset.
- Implementation of an evaluation pipeline for pretrained vision-language models.
- Design and comparison of different text prompts for relevant abnormalities.
- Evaluation of zero-shot and, if feasible, few-shot classification performance.
- Implementation or reuse of a supervised baseline model.
- Quantitative evaluation using metrics such as AUC, F1-score, precision, recall, and
  calibration.
- Analysis of model robustness, prompt sensitivity, and failure cases.
- Thesis writing.
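To make the quantitative evaluation step concrete, the sketch below computes AUC (via the rank formulation, which assumes no tied scores) and threshold-based precision, recall, and F1 with plain numpy. In the actual pipeline a library such as scikit-learn would typically be used; the label and score arrays here are toy values for illustration only.

```python
import numpy as np

def roc_auc(y_true, y_score):
    """AUC via the Mann-Whitney U (rank) formulation; assumes no tied scores."""
    order = np.argsort(y_score)
    ranks = np.empty(len(y_score))
    ranks[order] = np.arange(1, len(y_score) + 1)
    pos = y_true == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def precision_recall_f1(y_true, y_pred):
    """Binary precision, recall, and F1 from hard predictions."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy ground-truth labels and model scores for one abnormality.
y_true = np.array([1, 0, 1, 1, 0, 0])
y_score = np.array([0.9, 0.2, 0.7, 0.4, 0.6, 0.1])

auc = roc_auc(y_true, y_score)
p, r, f1 = precision_recall_f1(y_true, (y_score >= 0.5).astype(int))
```

Note that AUC is threshold-free while precision, recall, and F1 depend on the chosen operating point, which is one reason the thesis plan also includes calibration analysis.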