Vision-Language-Action Model for Autonomous Drone Mission Execution
Classical autonomous drone mission planning relies on a hand-engineered stack: a mission planner generates waypoints, a path planner finds collision-free trajectories, and a low-level controller tracks them. This pipeline is brittle: it cannot reason about abstract goals (such as "survey the tree line while avoiding foggy areas"), adapt to unexpected environmental conditions described in natural language, or leverage the rich semantic understanding embedded in modern foundation models. Vision-Language-Action (VLA) models such as RT-2 and OpenVLA have demonstrated the ability to map language instructions and visual observations directly to robot actions in manipulation tasks, but their application to aerial robotics remains largely unexplored. This master's thesis investigates whether a VLA, fine-tuned on drone flight data and grounded by a tightly-coupled RVIO pipeline, can serve as a high-level mission executor for a multispectral drone platform.
The student will survey existing VLA architectures (RT-2, OpenVLA, Octo) and select one suitable for adaptation to the aerial domain, then design the action-space interface so that the VLA outputs high-level velocity commands or waypoint deltas while the RVIO pipeline provides real-time metric grounding for closed-loop control. A demonstration dataset will be collected by flying scripted missions (area survey, perimeter patrol, point inspection), recording synchronised RGB and thermal imagery, IMU and range measurements, and operator commands paired with natural-language mission descriptions. The student will then fine-tune the VLA on this dataset and evaluate zero-shot generalisation on unseen mission descriptions. A safety layer leveraging the RVIO state estimate will enforce geofencing and collision avoidance as hard constraints on VLA outputs (see the sketch below). Validation will progress from simulation (Gazebo or AirSim) to the Tarot 990 platform, concluding with a research paper.
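To make the grounding idea concrete, the following is a minimal sketch of how such a safety layer might clamp VLA velocity commands against a geofence using the RVIO position estimate. The class name, command format (body-frame velocities in m/s), and all thresholds are illustrative assumptions, not the thesis design.

```python
import numpy as np

class GeofenceSafetyLayer:
    """Illustrative hard-constraint filter on VLA velocity commands.

    Assumes the RVIO pipeline provides a metric position estimate and
    that the VLA emits velocity commands (vx, vy, vz) in m/s. Names and
    thresholds are placeholders for the design developed in the thesis.
    """

    def __init__(self, fence_min, fence_max, v_max=2.0, margin=1.0):
        self.fence_min = np.asarray(fence_min, dtype=float)  # [x, y, z] lower bounds (m)
        self.fence_max = np.asarray(fence_max, dtype=float)  # [x, y, z] upper bounds (m)
        self.v_max = v_max      # speed limit enforced on all VLA outputs (m/s)
        self.margin = margin    # start braking this far from the fence (m)

    def filter(self, vla_velocity, rvio_position):
        """Clamp a VLA velocity command against the geofence."""
        v = np.asarray(vla_velocity, dtype=float)
        p = np.asarray(rvio_position, dtype=float)

        # Hard speed limit, regardless of what the policy requests.
        speed = np.linalg.norm(v)
        if speed > self.v_max:
            v = v * (self.v_max / speed)

        # Attenuate any component pushing toward a fence plane, scaling it
        # linearly to zero inside the braking margin (and zeroing it if the
        # estimate is already at or beyond the fence).
        for i in range(3):
            dist_hi = self.fence_max[i] - p[i]
            dist_lo = p[i] - self.fence_min[i]
            if v[i] > 0:
                v[i] *= np.clip(dist_hi / self.margin, 0.0, 1.0)
            elif v[i] < 0:
                v[i] *= np.clip(dist_lo / self.margin, 0.0, 1.0)
        return v

# Example: a command pushing past the upper x bound gets attenuated.
layer = GeofenceSafetyLayer(fence_min=[-50, -50, 0], fence_max=[50, 50, 30])
safe_v = layer.filter(vla_velocity=[3.0, 0.0, 0.5], rvio_position=[49.5, 0.0, 10.0])
```

Treating the fence as a per-axis braking region keeps the filter a pure hard constraint: the learned policy never needs to know the fence exists, which decouples safety from fine-tuning.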
Framework of the Thesis
The thesis will start with a literature review on VLA architectures (RT-2, OpenVLA, Octo), foundation models for robotics, language-conditioned policies, and the integration of learned policies with classical state estimators.
Next, the student will define the complete experimental framework: VLA architecture selection and adaptation to the aerial action space, demonstration data-collection protocol on the Tarot 990, fine-tuning pipeline, and design of the safety layer that grounds VLA outputs through the RVIO metric pipeline. Simulation infrastructure (Gazebo or AirSim) will be set up for safe iterative testing.
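As a concrete starting point for the data-collection protocol, the record layout below sketches what one time-synchronised demonstration frame could contain. All field names, shapes, and the episode grouping are assumptions for illustration, to be refined during the thesis.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class DemoFrame:
    """One time-synchronised record of the demonstration dataset (a sketch)."""
    stamp: float                 # common timestamp (s) after synchronisation
    rgb: np.ndarray              # HxWx3 uint8 RGB image
    thermal: np.ndarray          # HxW uint16 thermal image
    imu: np.ndarray              # [ax, ay, az, gx, gy, gz]
    range_m: float               # range measurement (m)
    rvio_pose: np.ndarray        # [x, y, z, qx, qy, qz, qw] metric state estimate
    operator_cmd: np.ndarray     # [vx, vy, vz, yaw_rate] action label
    instruction: str             # natural-language mission description

@dataclass
class DemoEpisode:
    """One scripted mission, grouping its frames for fine-tuning."""
    mission_type: str            # e.g. "area_survey", "perimeter_patrol"
    frames: list[DemoFrame] = field(default_factory=list)
```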
In the final phase, the student will conduct experimental validation: fine-tuning the VLA on the collected dataset, evaluating spatial accuracy and zero-shot generalisation on unseen mission descriptions, validating the contribution of multispectral input (RGB vs. RGB + thermal), and progressively testing more complex missions on the Tarot 990 platform. The validation phase concludes with a publication-ready research paper.
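Two of the planned evaluation quantities could be computed along these lines; the specific definitions (trajectory RMSE after resampling, a boolean per-mission success criterion) are assumptions sketched for illustration, not fixed choices.

```python
import numpy as np

def spatial_rmse(flown_xyz, reference_xyz):
    """RMSE between the flown trajectory and the reference mission path.

    One possible "spatial accuracy" metric; assumes both trajectories have
    been resampled to matched Nx3 arrays of metric positions from RVIO logs.
    """
    flown = np.asarray(flown_xyz, dtype=float)
    ref = np.asarray(reference_xyz, dtype=float)
    return float(np.sqrt(np.mean(np.sum((flown - ref) ** 2, axis=1))))

def zero_shot_success_rate(results):
    """Fraction of held-out mission descriptions completed within tolerance.

    `results` is a list of booleans, one per unseen instruction; what counts
    as "completed" (coverage, geofence compliance, etc.) is left to define.
    """
    return sum(results) / len(results)
```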
Expected Student Profile
The ideal candidate has a strong machine-learning background with hands-on PyTorch experience, particularly in fine-tuning large language and vision-language models. Familiarity with NLP, foundation models, and robot learning is essential, as are solid robotics and ROS2 skills and experience with simulation environments (Gazebo, AirSim). Knowledge of state estimation is helpful for building the safety/grounding layer. Strong Python programming skills and experience with GPU training complete the profile.