Interactive view synthesis is a fundamental task in computer vision, aiming at recreating a natural scene from any viewpoint using a sparse set of images. The primary focus of this Ph.D. thesis is to explore the acquisition processes and the ability to render high-quality novel views dynamically to a user. Furthermore, this research targets real-time rendering as the second objective.

In this thesis, we explore two different ways a scene can be reconstructed. The first option is to estimate the three-dimensional (3D) structure of the scene by means of a set of points in space called a point cloud (PC). Such a PC can be captured by a variety of devices and algorithms, such as Time of Flight (ToF) cameras, stereo matching, or structure-from-motion (SfM). Alternatively, the scene can be represented by a set of input views with their associated depth maps, which can be used in depth image-based rendering (DIBR) to synthesize new images.

We explore depth image-based rendering algorithms, using pictures of a scene and their associated depth maps. These algorithms project the color values to the novel view position using the depth information. However, the quality of the depth map strongly impacts the accuracy of the synthesized views. Therefore, we started by improving the Depth Estimation Reference Software (DERS) of the Moving Picture Experts Group (MPEG), a worldwide standardization committee for video compression. Unlike DERS, our Reference Depth Estimation (RDE) software can take any number of input views, leading to more robust results. It is currently used to generate novel depth maps for standardized datasets.

Depth estimation did not reach real-time generation: it takes minutes to hours to create a depth map, depending on the input views. We therefore explored active depth-sensing devices, such as the Microsoft Kinect, to acquire color data and depth maps simultaneously. With these depth maps available, we address the DIBR problem by providing a novel algorithm that seamlessly blends several views together. We focus on obtaining a real-time rendering method; in particular, we exploited the Open Graphics Library (OpenGL) pipeline to rasterize novel views and customized dynamic video-loading algorithms to provide frames from video data to the software pipeline. The developed Reference View Synthesizer (RVS) software achieves 2×90 frames per second in a head-mounted display while rendering natural scenes. RVS was initially the default rendering tool in the MPEG-Immersive (MPEG-I) community. Over time, it has evolved to function as the encoding tool and continues to play a crucial role as the reference verification tool during experiments.

We tested our methods on conventional, head-mounted, and holographic displays. Finally, we explored advanced acquisition devices and display media, such as (1) plenoptic cameras, for which we propose a novel calibration method and an improved conversion to sub-aperture views, and (2) a three-layer holographic tensor display, able to render multiple views without requiring glasses.

Each piece of this work contributed to the development of photo-realistic methods: we captured and open-sourced several public datasets of high quality and precision for the research community. They are also used by MPEG to develop novel algorithms for the future of immersive television.
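For readers unfamiliar with DIBR, the following is a minimal sketch of the warping step the abstract refers to ("project the color values to the novel view position using the depth information"), written for a generic pinhole camera model. The notation (per-view intrinsics K, rotation R, translation t, depth d) is assumed here for illustration only and does not reproduce the exact conventions, reprojection, or blending strategy of the RVS software developed in the thesis.

\[
X = R_s^{\top}\!\left( d\, K_s^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} - t_s \right),
\qquad
\lambda \begin{bmatrix} u' \\ v' \\ 1 \end{bmatrix} = K_t \left( R_t X + t_t \right),
\]

where (u, v) is a pixel of a source view s with depth d, X is the corresponding 3D point in world coordinates (assuming the world-to-camera convention x_cam = R X + t), and (u', v') is its reprojection into the target (novel) view t. The color at (u, v) is transported to (u', v'), and candidate contributions from several source views are then blended to form the synthesized image.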
Bonatto, D 2024, 'From multi-modal capture to photo-realistic view synthesis: A high-quality and real-time multiview approach', Vrije Universiteit Brussel.
Bonatto, D. (2024). From multi-modal capture to photo-realistic view synthesis: A high-quality and real-time multiview approach.
@phdthesis{6854daa9455745768b4c9a21acc9a8aa,
title = "From multi-modal capture to photo-realistic view synthesis: A high-quality and real-time multiview approach",
abstract = "Interactive view synthesis is a fundamental task in computer vision, aiming at recreating a natural scene from any viewpoint using a sparse set of images. The primary focus of this Ph.D. thesis is to explore the acquisition processes and the ability to render high-quality novel views dynamically to a user. Furthermore, this research targets real-time rendering as the second objective. In this thesis, we explore two different ways a scene can be reconstructed. The first option is to estimate the three-dimensional (3D) structure of the scene by means of a set of points in space called a point cloud (PC). Such a PC can be captured by a variety of devices and algorithms, such as Time of Flight (ToF) cameras, stereo matching, or structure-from-motion (SfM). Alternatively, the scene can be represented by a set of input views with their associated depth maps, which can be used in depth image-based rendering (DIBR) to synthesize new images. We explore depth image-based rendering algorithms, using pictures of a scene and their associated depth maps. These algorithms project the color values to the novel view position using the depth information. However, the quality of the depth map strongly impacts the accuracy of the synthesized views. Therefore, we started by improving the Depth Estimation Reference Software (DERS) of the Moving Picture Experts Group (MPEG), a worldwide standardization committee for video compression. Unlike DERS, our Reference Depth Estimation (RDE) software can take any number of input views, leading to more robust results. It is currently used to generate novel depth maps for standardized datasets. Depth estimation did not reach real-time generation: it takes minutes to hours to create a depth map, depending on the input views. We therefore explored active depth-sensing devices, such as the Microsoft Kinect, to acquire color data and depth maps simultaneously. With these depth maps available, we address the DIBR problem by providing a novel algorithm that seamlessly blends several views together. We focus on obtaining a real-time rendering method; in particular, we exploited the Open Graphics Library (OpenGL) pipeline to rasterize novel views and customized dynamic video-loading algorithms to provide frames from video data to the software pipeline. The developed Reference View Synthesizer (RVS) software achieves 2×90 frames per second in a head-mounted display while rendering natural scenes. RVS was initially the default rendering tool in the MPEG-Immersive (MPEG-I) community. Over time, it has evolved to function as the encoding tool and continues to play a crucial role as the reference verification tool during experiments. We tested our methods on conventional, head-mounted, and holographic displays. Finally, we explored advanced acquisition devices and display media, such as (1) plenoptic cameras, for which we propose a novel calibration method and an improved conversion to sub-aperture views, and (2) a three-layer holographic tensor display, able to render multiple views without requiring glasses. Each piece of this work contributed to the development of photo-realistic methods: we captured and open-sourced several public datasets of high quality and precision for the research community. They are also used by MPEG to develop novel algorithms for the future of immersive television.",
author = "Daniele Bonatto",
year = "2024",
language = "English",
school = "Vrije Universiteit Brussel",
}