Free viewpoint television (FTV) is a system for viewing natural video that allows the user to interactively control the viewpoint and generate new views of a dynamic scene from any 3D position. The equivalent system for computer-simulated video is known as virtual reality. With FTV, the focus of attention is controlled by the viewers rather than a director, meaning that each viewer may observe a unique viewpoint. It remains to be seen how FTV will affect television watching as a group activity.
Systems for rendering arbitrary views of natural scenes have long been known in the computer vision community, but only in recent years have their speed and quality reached levels suitable for serious consideration as an end-user system.
Professor Masayuki Tanimoto of Nagoya University (Japan) has done much to promote the use of the term "free-viewpoint television" and has published many papers on the ray-space representation, although other techniques can be, and are, used for FTV.
QuickTime VR might be considered a predecessor to FTV.
In order to acquire the views necessary to render the scene from any angle at high quality, several cameras are placed around the scene, either in a studio environment or at an outdoor venue such as a sporting arena. The resulting multiview video (MVV) must then be packaged suitably, so that the data can be compressed and so that the user's viewing device can easily access the views needed to interpolate new ones.
It is not enough to simply place cameras around the scene to be captured: the geometry of the camera setup must be measured, by a process known in computer vision as "camera calibration". Precise manual alignment would be too cumbersome, so in practice a best-effort physical alignment is performed, and a test pattern is then captured and used to compute the calibration parameters.
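The principle behind calibration can be illustrated with a minimal sketch. A pinhole camera model projects known 3D test-pattern points (e.g. checkerboard corners) to pixels, and calibration searches for the parameter values that minimise the "reprojection error" between projected and detected pixel positions. The focal length, principal point, and coordinates below are hypothetical, and real calibration (as in OpenCV's `calibrateCamera`) also estimates lens distortion and the camera's pose:

```python
def project(point, f, cx, cy):
    """Project a 3D point (camera coordinates) with a pinhole model:
    u = f*x/z + cx, v = f*y/z + cy."""
    x, y, z = point
    return (f * x / z + cx, f * y / z + cy)

def reprojection_error(points3d, observed, f, cx, cy):
    """Mean Euclidean distance between projected and observed pixels."""
    total = 0.0
    for p, (u, v) in zip(points3d, observed):
        pu, pv = project(p, f, cx, cy)
        total += ((pu - u) ** 2 + (pv - v) ** 2) ** 0.5
    return total / len(points3d)

# Hypothetical checkerboard corners (metres, camera coordinates) and the
# pixel positions a camera with f=1000 px would record for them.
pattern = [(0.0, 0.0, 2.0), (0.1, 0.0, 2.0), (0.0, 0.1, 2.0)]
pixels = [project(p, 1000.0, 640.0, 360.0) for p in pattern]

# The true parameters reproduce the observations exactly...
print(reprojection_error(pattern, pixels, 1000.0, 640.0, 360.0))  # 0.0
# ...while a wrong focal-length guess leaves a measurable error, which
# an optimiser would drive down to recover the calibration.
print(reprojection_error(pattern, pixels, 900.0, 640.0, 360.0))
```

In a real system this error is minimised jointly over all cameras, yielding each camera's intrinsic and extrinsic parameters.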
For large environments, a restricted form of free-viewpoint television can be captured with a single camera system mounted on a moving platform. Depth data must also be captured, since it is needed to synthesize a free viewpoint. The Google Street View capture system is an example with limited functionality. The first full commercial implementation, iFlex, was delivered in 2009 by Real Time Race.
Multiview video capture varies from partial (typically about 30 degrees) to complete (360-degree) coverage of the scene. In either case, it is possible to output stereoscopic views suitable for viewing with a 3D display or other 3D methods. Systems with more physical cameras capture more of the viewable scene, although some regions are likely to remain occluded from any viewpoint. A larger number of cameras also makes higher-quality output possible, because less interpolation is needed.
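The interpolation step can be sketched in a deliberately simplified form. When a requested virtual viewpoint lies between two physical cameras, one crude approximation is to blend the two neighbouring images with weights given by the viewpoint's position between them; practical FTV renderers additionally warp pixels by depth or disparity before blending. The images here are toy nested lists of grey values:

```python
def interpolate_view(left, right, t):
    """Blend two equally sized grey images (nested lists of pixel values).
    t = 0.0 returns the left camera's view, t = 1.0 the right camera's;
    intermediate t approximates a virtual viewpoint between them."""
    return [[(1.0 - t) * l + t * r for l, r in zip(lrow, rrow)]
            for lrow, rrow in zip(left, right)]

left = [[100, 100], [100, 100]]    # hypothetical 2x2 image from camera A
right = [[200, 200], [200, 200]]   # hypothetical 2x2 image from camera B

# Midway viewpoint: every pixel is the average of the two cameras.
print(interpolate_view(left, right, 0.5))  # [[150.0, 150.0], [150.0, 150.0]]
```

Fewer cameras mean the blend must bridge larger angular gaps, which is why dense camera rigs yield higher-quality synthesized views.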
More cameras, however, mean that efficient coding of the multiview video is required. This is less of a disadvantage than it might appear, as several representations can remove the redundancy in MVV, such as inter-view coding using MPEG-4 or Multiview Video Coding, the ray-space representation, and geometry videos.
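The core idea of inter-view coding can be illustrated with a minimal sketch: neighbouring cameras see nearly the same scene, so a view can be stored as a small residual against a reference view instead of as an independent image. The pixel rows below are hypothetical, and real codecs predict with motion/disparity compensation and entropy-code the residual rather than storing raw differences:

```python
def encode_residual(reference, target):
    """Residual of a target view against a reference view (flat rows of
    grey values). Highly correlated views give near-zero residuals."""
    return [t - r for r, t in zip(reference, target)]

def decode_residual(reference, residual):
    """Reconstruct the target view from the reference plus the residual."""
    return [r + d for r, d in zip(reference, residual)]

reference = [120, 121, 119, 120]   # hypothetical pixel row from camera A
target = [121, 122, 119, 121]      # nearly identical row from camera B

residual = encode_residual(reference, target)
print(residual)  # [1, 1, 0, 1] -- small values that compress well
print(decode_residual(reference, residual) == target)  # True: lossless round trip
```

Because the residual values cluster near zero, they need far fewer bits than a second independently coded view, which is what makes dense multi-camera capture tractable.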
In terms of hardware, the user requires a viewing device that can decode MVV and synthesize new viewpoints, and a 2D or 3D display.
In March 2009, the Moving Picture Experts Group (MPEG) standardized Multiview Video Coding as Annex H of MPEG-4 AVC, following the work of a group called '3DAV' (3D Audio and Visual) headed by Aljoscha Smolic at the Heinrich-Hertz Institute.