On March 29th, 2024 at 17:00, Jurgen Vandendriessche will defend their PhD thesis entitled “TOWARDS SMART ACOUSTIC CAMERAS FOR SIMULTANEOUS SOUND LOCALIZATION AND RECOGNITION”.
Everybody is invited to attend the presentation in room I.0.01, or digitally via this link.
Acoustic cameras are devices that visualize sound using an array of microphones. The signals from the microphones are combined with a beamforming algorithm to generate an acoustic heatmap, also called an acoustic image. These beamforming algorithms tend to have a high computational cost, which increases with the number of microphones. The high Input/Output (I/O) requirements of the microphones, combined with the large amount of parallel computation, make Field Programmable Gate Arrays (FPGAs) very suitable for processing the signals from these microphone arrays. FPGAs also have a low power consumption, which makes them especially attractive for battery-powered devices such as handheld acoustic cameras or nodes in a sensor network. Despite the high computational power per watt of FPGAs, satisfying real-time constraints remains a challenge, especially when targeting acoustic images with a higher resolution. To overcome this challenge, a multi-mode acoustic camera has been developed. The camera supports multiple modes depending on the task at hand, and the resolution of the acoustic heatmap can be adapted to satisfy the real-time requirement of each mode. A second limitation of existing acoustic cameras is the identification of the type of sound, which commonly requires human expertise to recognize and profile the sound.
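As a rough illustration of how such a heatmap can be produced, the sketch below implements a narrowband, frequency-domain delay-and-sum beamformer in Python/NumPy. The array geometry, scanning grid, and all names are hypothetical and chosen only to clarify the principle; the thesis targets FPGA implementations, whereas this is merely a floating-point reference.

```python
# Illustrative delay-and-sum beamforming sketch (hypothetical array and grid,
# not the FPGA implementation described in the thesis).
import numpy as np

C = 343.0          # speed of sound in air (m/s)
FS = 48_000        # sampling rate (Hz)
N_MICS = 8
N_SAMPLES = 1024

# Hypothetical uniform linear array along the x-axis, 4 cm spacing.
mic_pos = np.zeros((N_MICS, 2))
mic_pos[:, 0] = np.arange(N_MICS) * 0.04

def acoustic_heatmap(signals, freq_hz, azimuths_deg):
    """Steered-response power at one frequency over a grid of azimuths.

    signals: (N_MICS, N_SAMPLES) time-domain microphone samples.
    Returns one power value per steering angle (a 1-D heatmap slice).
    """
    spectra = np.fft.rfft(signals, axis=1)
    freqs = np.fft.rfftfreq(N_SAMPLES, d=1.0 / FS)
    k = np.argmin(np.abs(freqs - freq_hz))      # closest frequency bin
    x_k = spectra[:, k]                         # (N_MICS,) complex samples

    powers = []
    for az in np.deg2rad(azimuths_deg):
        # Far-field steering: delay of each microphone relative to the origin.
        direction = np.array([np.cos(az), np.sin(az)])
        delays = mic_pos @ direction / C        # seconds
        steering = np.exp(-2j * np.pi * freqs[k] * delays)
        # Phase-align (delay) the channels and sum, then take the power.
        powers.append(np.abs(steering.conj() @ x_k) ** 2)
    return np.array(powers)

# Example: white noise on every channel, scanning -90..90 degrees.
rng = np.random.default_rng(0)
signals = rng.standard_normal((N_MICS, N_SAMPLES))
heatmap_slice = acoustic_heatmap(signals, freq_hz=2000,
                                 azimuths_deg=np.arange(-90, 91, 5))
```

The computational cost mentioned above is visible here: every steering direction requires a weighted sum over all microphones, so the work grows with both the grid resolution and the number of channels, which is what motivates the adaptable heatmap resolution of the multi-mode camera.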
In recent years, deep learning, a form of Artificial Intelligence (AI), has shown promising results on the task of sound recognition by using Convolutional Neural Networks (CNNs). However, most of the research focuses on improving the accuracy of such models without considering the limitations one encounters when deploying them on resource-constrained devices such as FPGAs. FPGAs are nowadays used for embedded deep learning inference, mainly with two types of architecture. The first uses a general-purpose soft-core processor inside the Programmable Logic (PL) of the FPGA. The second is a dataflow-based architecture that translates each layer of a CNN into a functional block in the PL. Embedding these CNNs for inference on FPGAs is not a trivial task and comes with trade-offs in terms of resource consumption, accuracy, and supported layers, among others. These two architectures are compared against other embedded solutions, such as Google’s Edge Tensor Processing Unit (TPU) and a Raspberry Pi (RPi), to find the best fit for acoustic cameras. Acoustic cameras are targeted here because they can identify the location of a sound source, which is not possible with a single microphone. Furthermore, existing beamforming techniques such as delay-and-sum reconstruct the audio signal, which can then be used for audio classification tasks.
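To make the classification side concrete, the sketch below shows the kind of small CNN that could classify sounds from log-mel spectrogram patches. The architecture, layer sizes, and class count are hypothetical placeholders, not the networks or FPGA mappings evaluated in the thesis; the point is only that a compact convolutional model operating on a 2-D time-frequency input is what would be embedded on the candidate platforms.

```python
# Minimal sketch of a small audio-classification CNN (hypothetical architecture).
import torch
import torch.nn as nn

class SmallAudioCNN(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # 1 input channel: spectrogram
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),   # global average pooling keeps the model small
            nn.Flatten(),
            nn.Linear(32, n_classes),
        )

    def forward(self, x):              # x: (batch, 1, mel_bins, time_frames)
        return self.classifier(self.features(x))

# Example forward pass on a dummy 64x128 log-mel spectrogram.
model = SmallAudioCNN(n_classes=10)
dummy = torch.randn(1, 1, 64, 128)
logits = model(dummy)                  # shape: (1, 10)
```

The trade-offs discussed above appear exactly at this level: each convolutional layer must either run on a soft-core inside the PL or be translated into its own dataflow block, and choices such as the number of filters or the use of pooling directly affect resource consumption, supported layers, and accuracy on the target device.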