Pano2Vid: Automatic Cinematography for Watching 360° Videos

Concept figure

A 360° camera captures the entire visual world from its optical center, which opens exciting new ways to record and experience visual content by lifting restrictions on the field of view (FOV). Videographers no longer have to decide what to capture in the scene, and viewers can freely explore the visual content. On the other hand, it introduces new challenges for the viewer, who has to decide “where and what” to look at by controlling the viewing direction throughout the full duration of the video. Because the viewer has no information about the content beyond the current FOV, it can be difficult to find interesting content and determine where to look in real time.

To address this difficulty, we define “Pano2Vid”, a new computer vision problem. The task is to design an algorithm that automatically controls the pose and motion of a virtual normal-field-of-view (NFOV) camera within an input 360° video. Camera control must be optimized to produce video that could conceivably have been captured by a human observer equipped with a real NFOV camera. A successful Pano2Vid solution would therefore take the burden of choosing “where to look” off both the videographer and the end viewer: the videographer could enjoy the moment without consciously directing her camera, while the end viewer could watch intelligently-chosen portions of the video in the familiar NFOV format.



When watching a 360° video, the human viewer needs to actively control the viewing direction. This is not a trivial task, because the viewer has no information beyond the current field of view. For example, at the beginning of the above video, the viewer fails to notice an elephant approaching the camera from the opposite direction. The key challenge in watching 360° video is therefore finding the right direction to watch.


The Pano2Vid Problem

To overcome the challenge of viewing 360° video, we propose a new computer vision problem that helps people determine where and what to look at in 360° video.



Spatio-temporal Glimpse

Spatio-temporal Glimpse

A spatio-temporal glimpse is a short normal-field-of-view video extracted from a 360° video with a fixed viewing direction. It transforms 360° content into normal video, making its visual features directly comparable with those of ordinary videos.
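Concretely, a glimpse can be rendered by rectilinear (gnomonic) projection: cast a ray through each pixel of a virtual NFOV image plane, rotate it into the chosen viewing direction, and look up the corresponding equirectangular pixel. A minimal numpy sketch, where the output size, FOV value, and nearest-neighbor lookup are illustrative choices rather than the paper's exact settings:

```python
import numpy as np

def glimpse_pixels(frame, yaw, pitch, fov_deg=65.5, out_w=64, out_h=36):
    """Sample an NFOV view from an equirectangular frame (H x W x 3).

    yaw/pitch give the fixed viewing direction (radians); fov_deg is the
    horizontal field of view. All defaults here are illustrative.
    """
    H, W = frame.shape[:2]
    f = (out_w / 2) / np.tan(np.radians(fov_deg) / 2)  # focal length in pixels
    # Pixel grid on the virtual image plane, centered at the optical axis.
    x = np.arange(out_w) - out_w / 2 + 0.5
    y = np.arange(out_h) - out_h / 2 + 0.5
    xx, yy = np.meshgrid(x, y)
    # Rays in camera coordinates (z forward, x right, y down).
    dirs = np.stack([xx, yy, np.full_like(xx, f)], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    # Rotate rays by pitch (about x), then yaw (about y).
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    dirs = dirs @ (Ry @ Rx).T
    # Convert each ray to equirectangular (longitude, latitude) pixel coords.
    lon = np.arctan2(dirs[..., 0], dirs[..., 2])    # [-pi, pi]
    lat = np.arcsin(np.clip(dirs[..., 1], -1, 1))   # [-pi/2, pi/2]
    u = ((lon / (2 * np.pi) + 0.5) * W).astype(int) % W
    v = np.clip(((lat / np.pi + 0.5) * H).astype(int), 0, H - 1)
    return frame[v, u]
```

Extracting a glimpse is then just applying this per frame over a short time window with the same (yaw, pitch).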



We define capture-worthiness as how closely a spatio-temporal glimpse resembles human-captured normal-field-of-view video ("HumanCam").
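This score is learned discriminatively: a classifier is trained to separate HumanCam clips from ST-glimpses, and its probability for the HumanCam class serves as the score. The sketch below stands in for that idea with a plain logistic regression fit by gradient descent on synthetic features; the real system uses learned video features, which are replaced with random vectors here:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in features; the two classes are made separable by construction.
pos = rng.normal(+1.0, 1.0, size=(200, 16))   # HumanCam clips (positives)
neg = rng.normal(-1.0, 1.0, size=(200, 16))   # random ST-glimpses (negatives)
X = np.vstack([pos, neg])
y = np.concatenate([np.ones(200), np.zeros(200)])

# Plain logistic regression fit by batch gradient descent.
w, b = np.zeros(16), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.1 * (X.T @ (p - y) / len(y))
    b -= 0.1 * np.mean(p - y)

def capture_worthiness(feat):
    """Score in [0, 1]: how HumanCam-like a glimpse's features look."""
    return float(1.0 / (1.0 + np.exp(-(feat @ w + b))))
```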

Sample Spatio-temporal Glimpse

Sample ST-glimpses

Given a 360° video, we densely sample spatio-temporal glimpses both spatially and temporally. Every 5 seconds, we sample 198 glimpses, at 18 azimuthal angles and 11 polar angles. We then estimate the capture-worthiness score for all the glimpses.
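The sampling grid can be enumerated directly. In the sketch below, only the counts come from the text (18 azimuths × 11 polar angles, every 5 seconds); the specific polar angles are an assumption:

```python
import itertools

AZIMUTHS = [20 * i for i in range(18)]   # 0°..340° in 20° steps
# Illustrative choice of 11 latitudes; the exact values are an assumption.
POLARS = [0, 10, -10, 20, -20, 30, -30, 45, -45, 75, -75]
GLIMPSE_INTERVAL = 5  # seconds

def sample_glimpses(video_duration):
    """Enumerate (time, azimuth, polar) for every ST-glimpse in the video."""
    times = range(0, int(video_duration), GLIMPSE_INTERVAL)
    return list(itertools.product(times, AZIMUTHS, POLARS))
```

For a 60-second video this yields 12 time steps × 18 azimuths × 11 polar angles, i.e. 198 glimpses per time step.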

Construct Virtual Camera Trajectory

Construct trajectories

We transform the problem of controlling the viewing direction into selecting one spatio-temporal glimpse at each moment in the video. We find a path over the spatio-temporal glimpses that maximizes the accumulated capture-worthiness score while obeying a smooth camera motion constraint, which forbids the virtual camera from making abrupt motions. The problem reduces to a shortest-path problem and can be solved by dynamic programming.
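The dynamic program can be sketched as follows. For simplicity, candidate directions are modeled here as a 1-D circular grid, and the smoothness constraint as a bound on how many neighboring directions the camera may move per step; both are simplifications of the actual 2-D formulation:

```python
import numpy as np

def best_trajectory(scores, max_step=1):
    """Pick one glimpse per time step maximizing total capture-worthiness.

    scores: (T, D) array, capture-worthiness of D candidate directions at
    each of T time steps. The smooth-motion constraint allows moving at
    most `max_step` neighboring directions between consecutive steps.
    """
    T, D = scores.shape
    best = np.full((T, D), -np.inf)   # best accumulated score ending at (t, d)
    back = np.zeros((T, D), dtype=int)
    best[0] = scores[0]
    for t in range(1, T):
        for d in range(D):
            # Predecessors reachable under the smooth-motion constraint.
            preds = [(d + k) % D for k in range(-max_step, max_step + 1)]
            p = max(preds, key=lambda j: best[t - 1, j])
            back[t, d] = p
            best[t, d] = best[t - 1, p] + scores[t, d]
    # Trace the optimal path back from the best final direction.
    path = [int(np.argmax(best[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

Because each transition only looks one step back, the run time is linear in the number of time steps, as expected for a shortest-path formulation.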




Dataset

We collect the 360° and HumanCam videos from YouTube using the following keywords: "Hiking", "Mountain Climbing", "Parade", "Soccer".

              # Videos    Total Length
360° Videos   86          7.3 hours
HumanCam      9,171       343 hours


Baselines

  • Center prior – random trajectories biased toward the center of the 360° camera axis
  • Eye-level prior – static trajectories lying on the equator
  • Saliency – AutoCam with the capture-worthiness score replaced by a saliency score
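For illustration, the two heuristic baselines might be generated as follows; the bias strength, noise parameters, and angle ranges are invented for this sketch and are not the paper's settings:

```python
import random

def eye_level_trajectory(n_steps):
    """Static trajectory on the equator: one fixed azimuth, zero polar angle."""
    az = random.choice(range(0, 360, 20))
    return [(az, 0)] * n_steps

def center_prior_trajectory(n_steps, spread=20):
    """Random walk biased toward the 360° camera's center direction (0, 0)."""
    traj = []
    az, po = 0.0, 0.0
    for _ in range(n_steps):
        az = 0.7 * az + random.gauss(0, spread)      # pulled back toward center
        po = 0.7 * po + random.gauss(0, spread / 2)
        traj.append((az % 360, max(-75.0, min(75.0, po))))
    return traj
```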

Evaluation Metrics

  • HumanCam-based Metrics – whether algorithm-generated videos look like HumanCam videos
    • Distinguishability – can algorithm-generated and HumanCam videos be distinguished?
    • HumanCam-Likeness – which algorithm generates videos closest to HumanCam videos?
    • Transferability – do semantic classifiers transfer between algorithm-generated and HumanCam videos?
  • HumanEdit-based Metrics – whether the algorithm controls the viewing direction similarly to human viewers of the same 360° video
    • Cosine – cosine similarity between the viewing directions chosen by the human viewer and by the algorithm on the same 360° video
    • Overlap – field-of-view overlap between the human viewer's and the algorithm's virtual cameras on the same 360° video
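The two HumanEdit-based metrics can be sketched as follows. The cosine metric compares viewing directions as unit vectors; `fov_overlap` here is a crude axis-aligned approximation in (azimuth, polar) degrees, ignoring azimuth wraparound and spherical geometry, and is not the paper's exact computation:

```python
import numpy as np

def direction(azimuth_deg, polar_deg):
    """Unit vector for a viewing direction given in degrees."""
    az, la = np.radians(azimuth_deg), np.radians(polar_deg)
    return np.array([np.cos(la) * np.cos(az),
                     np.cos(la) * np.sin(az),
                     np.sin(la)])

def cosine_metric(traj_a, traj_b):
    """Frame-averaged cosine similarity between two trajectories,
    each a list of (azimuth, polar) viewing directions in degrees."""
    return float(np.mean([direction(*a) @ direction(*b)
                          for a, b in zip(traj_a, traj_b)]))

def fov_overlap(a, b, fov=65.5):
    """Rough FOV overlap: intersection-over-union of the two views,
    approximated as axis-aligned boxes in (azimuth, polar) degrees."""
    def overlap_1d(x, y):
        return max(0.0, fov - abs(x - y))
    inter = overlap_1d(a[0], b[0]) * overlap_1d(a[1], b[1])
    union = 2 * fov * fov - inter
    return inter / union
```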


Cosine Similarity
FOV Overlap

Video Examples

AutoCam Outputs

Comparison with Baselines

These two examples show why the center and eye-level heuristics are reasonable baselines to compare against.

  • Center – Videographers often hold the 360° camera in an orientation such that its center corresponds to some special direction, e.g. the direction facing the videographer.
  • Eye-level – Most events appear near the horizon.

Nevertheless, these heuristics cannot adapt to the content and often fail to achieve good framing.

Failure Cases

The first example shows two problems in the AutoCam algorithm:

  • The virtual camera has a limited FOV and cannot capture the entire subject at once
  • The camera motion is restricted by the smooth-motion constraint, so the virtual camera cannot turn to the subject promptly

The second example shows that the capture-worthiness score does not encode preferences among different kinds of content that all look equally HumanCam-like.

HumanEdit Interface

Design highlights:

  • Display the 360° video in equirectangular projection, so the editor can see all the content at once
  • Extend the panoramic strip by 90° on both sides to avoid discontinuous content at the borders
  • The human editor controls the virtual camera direction with the mouse location
  • Backproject the camera's field of view onto the 360° video
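Backprojecting the FOV amounts to mapping 3-D camera rays back to equirectangular pixel coordinates; drawing the four corner rays outlines the current framing. A sketch, where the axis conventions and corner-only boundary are illustrative choices:

```python
import numpy as np

def to_equirect(dir_xyz, width, height):
    """Project a 3-D ray onto equirectangular pixel coordinates."""
    x, y, z = dir_xyz / np.linalg.norm(dir_xyz)
    lon = np.arctan2(x, z)   # [-pi, pi], 0 = forward
    lat = np.arcsin(y)       # [-pi/2, pi/2]
    u = (lon / (2 * np.pi) + 0.5) * width
    v = (lat / np.pi + 0.5) * height
    return u, v

def fov_corners(yaw, pitch, hfov, vfov, width, height):
    """Pixel positions of the four FOV corner rays in the equirectangular frame."""
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    # Rotate camera rays by pitch about x, then yaw about y.
    R = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]]) @ \
        np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    tx, ty = np.tan(hfov / 2), np.tan(vfov / 2)
    corners = [np.array([i * tx, j * ty, 1.0])
               for i in (-1, 1) for j in (-1, 1)]
    return [to_equirect(R @ c, width, height) for c in corners]
```

In practice the full FOV boundary is curved in the equirectangular view, so a denser sampling of boundary rays gives a smoother outline than the four corners alone.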