On November 7th, 2024 at 16:00, Boris Joukovsky will defend their PhD entitled “SIGNAL PROCESSING MEETS DEEP LEARNING: INTERPRETABLE AND EXPLAINABLE NEURAL NETWORKS FOR VIDEO ANALYSIS, SEQUENCE MODELING AND COMPRESSION”.
Everybody is invited to attend the presentation in room I.0.01 or online via this link.
Deep learning is increasingly used to solve signal processing tasks, and deep neural networks (DNNs) often outperform traditional methods while requiring little domain knowledge. However, DNNs behave as black boxes, making it difficult to understand their decisions. The empirical approaches used to design DNNs often lack theoretical guarantees and create high computational requirements, which poses risks for applications requiring trustworthy artificial intelligence (AI). This thesis addresses these issues, focusing on video processing and sequential problems across three domains: (1) efficient, model-based DNN designs, (2) generalization analysis and information-theory-driven learning, and (3) post-hoc explainability.
The first contributions consist of new deep learning models for successive frame reconstruction, foreground-background separation, and moving object detection in video. These models are based on the deep unfolding method, a hybrid approach that combines deep learning with optimization techniques, leveraging low-complexity prior knowledge of the data. The resulting networks require fewer parameters than standard DNNs. They outperform DNNs of comparable size, large semantic-based convolutional networks, as well as the underlying non-learned optimization methods.
The second area focuses on the theoretical generalization of deep unfolding models. The generalization error of reweighted-RNN (the model that performs video reconstruction) is characterized using Rademacher complexity analysis. This is a first-of-its-kind result that bridges machine learning theory with deep unfolding RNNs.
Another contribution in this area aims to learn optimally compressed, quality-scalable representations of distributed signals, a scheme traditionally known as Wyner-Ziv coding (WZC). The proposed method shows that deep models can retrieve layered binning solutions akin to optimal WZC, which is promising for learning constructive coding schemes for distributed applications.
The third area introduces InteractionLIME, an algorithm to explain how deep models learn multi-view or multi-modal representations. It is the first model-agnostic explanation method designed to identify the important feature pairs across inputs that affect the prediction. Experimental results demonstrate its effectiveness on contrastive vision and language models.
In conclusion, this thesis addresses important challenges in making deep learning models more interpretable, efficient, and theoretically grounded, particularly for video processing and sequential data, thereby contributing to the development of more trustworthy AI systems.