ETROVUB

Improving Factual Consistency in Radiology Vision-Language Models through Image-Text Grounding

Subject

Radiology reports contain important diagnostic information and are routinely used to
communicate findings from medical images. Recent advances in vision-language models
have made it possible to generate textual descriptions from medical images, retrieve
similar cases, or answer questions about imaging findings. These technologies could
eventually support radiologists by assisting with report drafting, quality control or
structured reporting.

However, radiology report generation remains a challenging and high-risk task. A model
may produce text that is fluent and plausible but medically incorrect. Such factual errors,
often referred to as hallucinations, are particularly problematic in healthcare because they
may introduce false findings, omit important abnormalities, or misrepresent the severity
of disease.

A promising way to reduce these errors is to ground generated text more explicitly in
image evidence, predicted clinical labels or retrieved similar cases. Instead of directly
generating a full report from an image, a model can first identify visual findings, retrieve
comparable image-report pairs, or constrain the generated text using structured medical
concepts. The effectiveness of such grounding strategies remains an active research
question.
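
As an illustration of the retrieval-based variant mentioned above, the sketch below ranks stored image-report pairs by the cosine similarity between image embeddings and returns the reports of the closest cases. It is a minimal sketch under stated assumptions, not a prescribed method: the embeddings are assumed to come from some pretrained image encoder (not shown here), and the names `cosine_similarity`, `retrieve_similar_reports` and `case_bank` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_similar_reports(query_embedding, case_bank, k=2):
    """Return the reports of the k stored cases most similar to the query image.

    case_bank is a list of (embedding, report_text) pairs. In practice the
    embeddings would come from a pretrained image encoder; here they are
    assumed to be given as plain lists of floats.
    """
    scored = [
        (cosine_similarity(query_embedding, embedding), report)
        for embedding, report in case_bank
    ]
    # Highest similarity first; the sort is stable, so ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [report for _, report in scored[:k]]


# Toy example with made-up 3-dimensional embeddings (illustrative values only).
toy_bank = [
    ([0.9, 0.1, 0.0], "Small left pleural effusion."),
    ([0.0, 1.0, 0.0], "No acute cardiopulmonary abnormality."),
    ([0.5, 0.5, 0.0], "Mild cardiomegaly."),
]
retrieved = retrieve_similar_reports([1.0, 0.0, 0.0], toy_bank, k=2)
```

The retrieved reports could then be supplied to the generator as additional context, or used as a reference when checking whether a generated statement is supported by comparable cases.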

Kind of work

The objective of this thesis is to study whether image-text grounding strategies can
improve the factual consistency of radiology vision-language models. The project will
comprise three main steps. Firstly, a literature study will be performed on radiology report
generation, medical vision-language models and hallucination evaluation. Secondly, a
baseline pipeline for image-conditioned report generation or report assistance will be
implemented using pretrained models. Thirdly, one or more grounding strategies will be
added and evaluated against the baseline.

Framework of the Thesis

The development will be carried out in Python, using open-source machine learning,
natural language processing and medical imaging libraries.
The project will involve:
• Literature study on medical vision-language models, radiology report generation
and factual consistency.
• Selection and preprocessing of a public image-report dataset.
• Implementation of a baseline report generation or report-assistance pipeline.
• Implementation of one or more grounding strategies, such as:
    o retrieval of similar image-report pairs
    o prediction of structured findings before text generation
    o concept-based report generation
    o post-hoc factual consistency checking
• Evaluation of generated text using both generic language metrics and medically
oriented metrics.
• Analysis of hallucinated findings, omitted findings and clinically relevant errors.
• Qualitative comparison of generated reports with reference reports.
• Thesis writing.
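
The analysis of hallucinated and omitted findings listed above could, for instance, be sketched at the level of finding labels. The example below assumes that each report has already been converted into a set of positive finding labels by some external labeller (not shown here); the function name `compare_findings` and the convention of returning 0.0 when a denominator is zero are illustrative assumptions rather than fixed design choices.

```python
def compare_findings(reference, generated):
    """Compare the finding labels of a generated report with those of its reference.

    Both arguments are iterables of label strings. Labels present only in the
    generated report are counted as hallucinated; labels present only in the
    reference are counted as omitted.
    """
    reference = set(reference)
    generated = set(generated)

    correct = reference & generated
    hallucinated = generated - reference
    omitted = reference - generated

    # Assumed convention: a metric is 0.0 when its denominator is zero.
    precision = len(correct) / len(generated) if generated else 0.0
    recall = len(correct) / len(reference) if reference else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0

    return {
        "correct": correct,
        "hallucinated": hallucinated,
        "omitted": omitted,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy example with made-up labels (illustrative only).
result = compare_findings(
    reference={"cardiomegaly", "pleural effusion"},
    generated={"pleural effusion", "pneumothorax"},
)
```

Aggregating these counts over a test set would give one clinically oriented measure to report alongside generic language metrics, and the per-report sets of hallucinated and omitted labels can guide the qualitative comparison with reference reports.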