Building Reliable Multimodal Foundation Models for Cooperative Perception
Context
Cooperative perception in V2X systems enables multiple agents (vehicles,
infrastructure, and other sensors) to collaboratively perceive their environment by
sharing complementary information. This paradigm has the potential to significantly
improve robustness and safety in autonomous driving, particularly in complex and
occluded scenarios.
In recent years, numerous V2V, V2I, and V2X datasets have been introduced, covering a
wide range of sensor modalities such as cameras, LiDAR, radar, and even aerial or drone-based
observations. However, these datasets remain fragmented: they differ in formats,
annotation standards, sensor configurations, and underlying assumptions, which limits
their joint exploitation.
At the same time, the success of foundation models in computer vision has demonstrated
the value of training large-scale models on diverse and heterogeneous data. Extending
this paradigm to cooperative perception requires aggregating and harmonizing multiple
datasets into a unified training framework, while handling multimodal inputs and cross-domain
variability. This represents a key step toward building general-purpose, robust
cooperative perception systems.
Objectives
The main objective of this thesis is to aggregate and harmonize existing V2V/V2I/V2X
datasets into a unified multimodal dataset suitable for large-scale training.
A secondary objective is to design and train a multimodal cooperative perception
foundation model leveraging this aggregated data, and to evaluate its generalization
capabilities across datasets and modalities.
Framework of the Thesis
Description of Work
Literature review: conduct a comprehensive review of existing cooperative
perception datasets, multimodal learning approaches, and foundation models in
computer vision and autonomous driving. Identify key challenges in dataset
aggregation and cross-dataset generalization.
Dataset aggregation and harmonization: collect and integrate multiple public
V2X datasets. This includes unifying data formats, aligning coordinate systems,
standardizing annotations, and handling heterogeneous sensor modalities (e.g.,
LiDAR, radar, cameras, aerial views).
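As a rough illustration of what harmonization can involve, the following sketch converts raw samples into a shared schema: a common ego-centric coordinate frame and a unified class taxonomy. All field names, the class map, and the UnifiedSample structure are hypothetical placeholders, not the format of any particular public dataset.
```python
# Minimal sketch of harmonizing samples from heterogeneous V2X datasets into a
# unified schema. Field names and the class map are illustrative assumptions.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class UnifiedSample:
    lidar_points: np.ndarray          # (N, 4) x, y, z, intensity in a common ego frame
    boxes: np.ndarray                 # (M, 7) x, y, z, l, w, h, yaw in the same frame
    labels: list                      # unified class names, e.g. "car", "pedestrian"
    agent_id: str                     # which vehicle or roadside unit produced the data
    metadata: dict = field(default_factory=dict)

# Hypothetical mapping from dataset-specific class names to a shared taxonomy.
CLASS_MAP = {"Car": "car", "Van": "car", "Pedestrian": "pedestrian", "Cyclist": "cyclist"}

def to_common_frame(points: np.ndarray, T_src_to_ego: np.ndarray) -> np.ndarray:
    """Apply a 4x4 rigid transform to LiDAR points, preserving intensity."""
    xyz1 = np.hstack([points[:, :3], np.ones((points.shape[0], 1))])
    xyz = (T_src_to_ego @ xyz1.T).T[:, :3]
    return np.hstack([xyz, points[:, 3:4]])

def harmonize(raw: dict, T_src_to_ego: np.ndarray, agent_id: str) -> UnifiedSample:
    """Convert one raw, dataset-specific sample into the unified schema."""
    return UnifiedSample(
        lidar_points=to_common_frame(raw["points"], T_src_to_ego),
        boxes=raw["boxes"],  # box re-projection omitted for brevity
        labels=[CLASS_MAP.get(c, "other") for c in raw["classes"]],
        agent_id=agent_id,
        metadata={"source_dataset": raw.get("dataset", "unknown")},
    )
```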
Multimodal model design and training: design and implement a multimodal
architecture capable of ingesting heterogeneous inputs. Train a large-scale
model on the aggregated dataset, leveraging recent advances in representation
learning and foundation models.
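One plausible way to structure such an architecture is to tokenize each modality separately and fuse the resulting tokens in a shared transformer, which also tolerates sensors that are missing in some datasets. The sketch below is only illustrative; the module names, feature dimensions, and fusion strategy are assumptions, not the architecture developed in the thesis.
```python
# Illustrative PyTorch sketch of a modality-agnostic encoder: one lightweight
# tokenizer per modality, a shared transformer for fusion. Dimensions are placeholders.
import torch
import torch.nn as nn

class MultimodalEncoder(nn.Module):
    def __init__(self, dim: int = 256, num_layers: int = 4, num_heads: int = 8):
        super().__init__()
        self.tokenizers = nn.ModuleDict({
            "lidar": nn.Linear(64, dim),    # e.g. pillar/voxel features
            "camera": nn.Linear(512, dim),  # e.g. image backbone features
            "radar": nn.Linear(32, dim),
        })
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, inputs: dict) -> torch.Tensor:
        # inputs: {modality: (B, N_tokens, feat_dim)}; missing modalities are simply
        # skipped, one simple way to cope with heterogeneous sensor setups.
        tokens = [self.tokenizers[m](x) for m, x in inputs.items() if m in self.tokenizers]
        fused = self.fusion(torch.cat(tokens, dim=1))
        return fused.mean(dim=1)  # pooled scene embedding for downstream heads

# Usage: encoder = MultimodalEncoder(); z = encoder({"lidar": torch.randn(2, 100, 64)})
```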
Evaluation and benchmarking: evaluate the trained model across multiple
datasets and modalities. Assess its generalization capabilities, robustness to
domain shifts, and performance compared to single-dataset or single-modality
baselines.
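A minimal sketch of such a cross-dataset evaluation loop is given below; the model, dataloaders, and metric function are placeholders for whatever the thesis actually implements.
```python
# Sketch of evaluating one model on several held-out datasets; all names are placeholders.
import torch

@torch.no_grad()
def evaluate_across_datasets(model, dataloaders: dict, metric_fn) -> dict:
    """Run the same model on each held-out dataset and report per-dataset scores."""
    model.eval()
    results = {}
    for name, loader in dataloaders.items():
        scores = []
        for batch in loader:
            preds = model(batch["inputs"])
            scores.append(metric_fn(preds, batch["targets"]))
        results[name] = sum(scores) / max(len(scores), 1)
    return results

# Comparing these per-dataset scores against single-dataset or single-modality
# baselines gives a first measure of generalization and robustness to domain shift.
```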
Expected Student Profile
- Strong knowledge of machine learning, deep learning, and computer vision
- Solid experience in Python programming and deep learning frameworks (e.g., PyTorch)
- Interest in multimodal learning, large-scale models, and autonomous systems
- Ability to work independently, conduct a literature review, and implement research-level code