Publication Details

International Conference on Quality of Multimedia Experience (QoMEX)

Contribution To Book Anthology


Explainable artificial intelligence (XAI) has gained considerable attention in recent years because it aims to help humans better understand machine learning decisions, making complex black-box systems more trustworthy. Visual explanation algorithms generate heatmaps that highlight the image regions a deep neural network focuses on when making decisions. While convolutional neural network (CNN) models typically follow similar processing operations for feature encoding, the emergence of the vision transformer (ViT) has introduced a new approach to machine vision decision-making. An important question, therefore, is which architecture provides more human-understandable explanations. This paper examines the explainability of deep architectures, including CNN and ViT models, under different vision tasks. To this end, we first performed a subjective experiment in which humans were asked to highlight the key visual features in images that helped them make decisions in two different vision tasks. Next, ground-truth heatmaps were generated from the human-annotated images and compared against the heatmaps produced by explanation methods for the deep architectures. In addition, perturbation tests were performed to objectively evaluate the deep models' explanation heatmaps. According to the results, the explanations generated from ViT are deemed more trustworthy than those produced by CNNs, and the more dispersed the features of the input image are, the more evident the advantage of ViT becomes.
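
The two kinds of evaluation mentioned in the abstract, comparison against human ground-truth heatmaps and perturbation testing, can be illustrated with a minimal sketch. The abstract does not state which similarity measure or perturbation protocol the paper uses, so everything below is an assumption for illustration: Pearson correlation stands in for the heatmap comparison, a deletion-style curve stands in for the perturbation test, the image is assumed to be a 2D grayscale array with the same shape as the heatmap, and the names `heatmap_similarity`, `deletion_curve`, and `score_fn` are hypothetical.

```python
import numpy as np

def normalize(h):
    """Scale a heatmap to [0, 1]; a constant map becomes all zeros."""
    h = np.asarray(h, dtype=float)
    span = h.max() - h.min()
    return np.zeros_like(h) if span == 0 else (h - h.min()) / span

def heatmap_similarity(model_map, human_map):
    """Pearson correlation between two heatmaps of equal shape (assumed metric)."""
    a = normalize(model_map).ravel()
    b = normalize(human_map).ravel()
    if a.std() == 0 or b.std() == 0:
        return 0.0  # correlation is undefined for a constant map
    return float(np.corrcoef(a, b)[0, 1])

def deletion_curve(image, heatmap, score_fn, steps=10, fill=0.0):
    """Remove pixels from most to least salient and record the model score.

    score_fn is a hypothetical callable mapping an image array to a scalar,
    e.g. the class probability returned by a CNN or ViT.
    """
    img = np.array(image, dtype=float)   # private copy of the input
    flat = img.reshape(-1)               # view onto the copy
    order = np.argsort(-np.asarray(heatmap, dtype=float).ravel())
    scores = [float(score_fn(img))]
    for chunk in np.array_split(order, steps):
        flat[chunk] = fill               # perturb the next most salient pixels
        scores.append(float(score_fn(img)))
    return scores
```

Under this sketch, a higher correlation indicates closer agreement with the human annotations, and a faster drop in the deletion curve suggests that the explanation has identified pixels the model actually relies on.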
