From multi-modal capture to photo-realistic view synthesis: A high-quality and real-time multiview approach
 
 
Daniele Bonatto
 
Abstract 

Interactive view synthesis is a fundamental task in computer vision, aiming at recreating a natural scene from any viewpoint using a sparse set of images. The primary focus of this Ph.D. thesis is to explore the acquisition processes and the ability to render high-quality novel views dynamically to a user. Real-time rendering is the second objective of this research.

In this thesis, we explore two different ways a scene can be reconstructed. The first option is to estimate the three-dimensional (3D) structure of the scene by means of a set of points in space called a point cloud (PC). Such a PC can be captured by a variety of devices and algorithms, such as Time of Flight (ToF) cameras, stereo matching, or structure-from-motion (SfM). Alternatively, the scene can be represented by a set of input views with their associated depth maps, which can be used in depth image-based rendering (DIBR) to synthesize new images.

We explore depth image-based rendering algorithms, using pictures of a scene and their associated depth maps. These algorithms project the color values to the novel view position using the depth information. However, the quality of the depth map highly impacts the accuracy of the synthesized views. Therefore, we started by improving the Depth Estimation Reference Software (DERS) of the Moving Picture Experts Group (MPEG), a worldwide standardization committee for video compression. Unlike DERS, our Reference Depth Estimation (RDE) software can take any number of input views, leading to more robust results. It is currently used to generate novel depth maps for standardized datasets.

The depth estimation did not reach real-time generation; it takes minutes to hours to create a depth map, depending on the input views. We therefore explored active depth sensing devices, such as the Microsoft Kinect, to acquire color data and depth maps simultaneously. With these depth maps available, we address the DIBR problem by providing a novel algorithm that seamlessly blends several views together. We focus on obtaining a real-time rendering method; in particular, we exploited the Open Graphics Library (OpenGL) pipeline to rasterize novel views and customized dynamic video loading algorithms to provide frames from video data to the software pipeline. The developed Reference View Synthesizer (RVS) software achieves 2x90 frames per second in a head-mounted display while rendering natural scenes. RVS was initially the default rendering tool in the MPEG-Immersive (MPEG-I) community. Over time, it has evolved to function as the encoding tool and continues to play a crucial role as the reference verification tool during experiments.

We tested our methods on conventional, head-mounted, and holographic displays. Finally, we explored advanced acquisition devices and display media, such as (1) plenoptic cameras, for which we propose a novel calibration method and an improved conversion to sub-aperture views, and (2) a three-layer holographic tensor display, able to render multiple views without wearing glasses.

Each piece of this work contributed to the development of photo-realistic methods; we captured and open-sourced several public datasets of high quality and precision for the research community. They are also used by MPEG to develop novel algorithms for the future of immersive television.
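
The sketch below is a minimal illustration of the DIBR principle summarized above: each source pixel is back-projected with its depth value and re-projected into the novel viewpoint. It assumes pinhole cameras with known intrinsics K and world-to-camera extrinsics [R | t], and uses a simple z-buffered splat; the function and variable names are illustrative assumptions, not the RVS or RDE implementation.

# Illustrative DIBR forward warping (not the thesis software).
# Assumptions: pinhole model, depth map stores metric z along the optical axis.
import numpy as np

def forward_warp(color, depth, K_src, R_src, t_src, K_dst, R_dst, t_dst):
    """Warp a source view (color, depth) to a target camera by per-pixel projection."""
    h, w = depth.shape
    # Homogeneous pixel grid of the source view, shape 3 x N.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T

    # Back-project to 3D in the source camera frame, then to world coordinates.
    cam_pts = np.linalg.inv(K_src) @ pix * depth.reshape(1, -1)
    world_pts = R_src.T @ (cam_pts - t_src.reshape(3, 1))

    # Re-project into the target camera.
    dst_cam = R_dst @ world_pts + t_dst.reshape(3, 1)
    dst_pix = K_dst @ dst_cam
    z = dst_pix[2]
    valid = z > 1e-6
    x = np.zeros_like(z, dtype=int)
    y = np.zeros_like(z, dtype=int)
    x[valid] = np.round(dst_pix[0, valid] / z[valid]).astype(int)
    y[valid] = np.round(dst_pix[1, valid] / z[valid]).astype(int)
    valid &= (x >= 0) & (x < w) & (y >= 0) & (y < h)

    # Z-buffered splatting: the nearest surface wins where pixels collide.
    out = np.zeros_like(color)
    zbuf = np.full((h, w), np.inf)
    src_colors = color.reshape(-1, color.shape[-1])
    for i in np.flatnonzero(valid):
        if z[i] < zbuf[y[i], x[i]]:
            zbuf[y[i], x[i]] = z[i]
            out[y[i], x[i]] = src_colors[i]
    return out  # disoccluded pixels remain empty

In practice, as the abstract notes, such holes and disocclusions are handled by blending several warped input views rather than by warping a single one, and the projection is performed on the GPU (e.g., through the OpenGL rasterization pipeline) to reach real-time frame rates.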