Authenticating devices based on audio feature selection with scene-specific tuning and landmark augmentation

Andi Bahtiar Semma, Kusrini, Arief Setyanto, Bruno da Silva Gomes, An Braeken

Abstract

Authentication methods have evolved significantly, transitioning from traditional passwords to biometric and multi-factor techniques, with audio-based systems now emerging as a promising frontier. These systems leverage ambient sounds or device-generated noise for continuous authentication but encounter challenges such as environmental noise, spoofing risks, and the lack of standardized audio feature selection. This study tackles these issues by focusing on robust handling of environmental variations and interference and by optimizing audio feature selection for effective environmental audio authentication. A key innovation introduced is the concept of “audio landmarks,” randomly generated signals embedded into audio samples. These landmarks enhance device authentication by enriching feature representation and reducing sensitivity to noise, leading to significant improvements in precision, recall, and F1 scores across various scenarios. In some cases, features achieved a perfect F1 score of 1.00 under ideal conditions. Among the audio features analyzed, the Constant-Q Transform (CQT) excels, particularly in music or speech scenes. However, combining multiple features often introduces redundancy due to overlapping information and varying optimal thresholds, so it may not consistently improve performance. Additionally, spectral centroid and spectral contrast, which are computationally lightweight at 9 ms and 10 ms, respectively, deliver excellent performance, making them ideal for real-time or resource-constrained applications, as tested on the Raspberry Pi 4. These findings provide practical guidelines for audio-based device authentication that leverage cryptographic hashing for deterministic landmark generation and the balanced fusion of landmark and acoustic features. This enables robust authentication even in challenging scenarios where environmental sounds are insufficient.
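To make the idea of hash-derived audio landmarks concrete, the following is a minimal sketch and not the authors' implementation. It assumes that two devices share a secret and a per-session nonce, hashes them with SHA-256, and maps the digest bytes to the frequencies and onset positions of short sine bursts that are mixed into an audio clip. The function names `landmark_params` and `embed_landmarks`, the frequency range, the burst length, and the gain are all hypothetical choices made for illustration; the abstract does not specify any of them.

```python
import hashlib
import math


def landmark_params(secret: bytes, nonce: bytes, n_tones: int = 4,
                    f_min: float = 1000.0, f_max: float = 7000.0):
    """Derive (frequency_hz, onset_fraction) pairs deterministically from a hash.

    Hypothetical mapping: each tone consumes 4 bytes of the 32-byte digest,
    so at most 8 tones are supported.
    """
    if not 1 <= n_tones <= 8:
        raise ValueError("n_tones must be between 1 and 8")
    digest = hashlib.sha256(secret + nonce).digest()
    params = []
    for i in range(n_tones):
        # Two bytes select the frequency, the next two select the onset.
        f_raw = int.from_bytes(digest[4 * i:4 * i + 2], "big")
        t_raw = int.from_bytes(digest[4 * i + 2:4 * i + 4], "big")
        freq = f_min + (f_max - f_min) * f_raw / 65535
        onset = 0.8 * t_raw / 65535  # keep bursts within the first 80% of the clip
        params.append((freq, onset))
    return params


def embed_landmarks(samples, sample_rate, params, burst_s=0.05, gain=0.1):
    """Return a copy of `samples` with a short sine burst mixed in per landmark."""
    out = list(samples)
    burst_len = int(burst_s * sample_rate)
    for freq, onset in params:
        start = int(onset * len(out))
        for k in range(burst_len):
            idx = start + k
            if idx >= len(out):
                break
            out[idx] += gain * math.sin(2 * math.pi * freq * k / sample_rate)
    return out


# Usage: both sides derive identical landmarks from the same secret and nonce.
sample_rate = 16000
clip = [0.0] * sample_rate  # one second of silence as a stand-in for real audio
params_a = landmark_params(b"shared-secret", b"session-1")
params_b = landmark_params(b"shared-secret", b"session-1")
marked = embed_landmarks(clip, sample_rate, params_a)
```

Because the parameters depend only on the hash input, a verifier holding the same secret and nonce can regenerate the expected landmarks and compare them against features extracted from the received audio, even when the ambient sound itself carries little information. How the landmark features are weighted against the acoustic features is a separate design decision that this sketch does not address.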