Explainable Visual Anomaly Detection with Multimodal Models and Metadata-Augmented Prompts

Natalia Wojak-Strzelecka, Szymon Bobek, Mehrdad Asadi, Ann Nowe, Krzysztof Kutt, José García, Grzegorz J. Nalepa

Abstract

Anomaly detection is critical in industrial domains such as quality control and predictive maintenance, where combining it with visual inspection and explainability enhances trust and reduces errors. This study evaluates a pre-trained multimodal foundation model for visual anomaly detection and explanation on the MVTec dataset, using a post hoc fusion strategy that integrates outputs from independent models. The evaluation compares the model with PatchCore and tests extended configurations that incorporate metadata such as heatmaps, segmentation masks, and patches. Results show that the multimodal model outperforms PatchCore on texture categories (mean F1: 0.960 vs. 0.947) but underperforms on object categories. It offers interpretable explanations in simple cases; however, its limited classification accuracy, the marginal benefits of metadata augmentation, and its reduced specificity in complex scenes indicate that further refinement is needed. These findings highlight the potential of multimodal models for explainable anomaly detection while underscoring their current limitations in handling complex object structures.