‘DeepStereo’: Google research uses deep networks to turn Street View into actual movies
Wed 24 Jun 2015
New research from Google has produced some of the most impressive results ever seen from image interpolation – the creation of new images from other photos. By using deep networks to generate 'missing' frames from Google Street View, a team led by Google researcher and former visual effects wizard John Flynn has developed a technique, dubbed 'DeepStereo', that can turn the staccato images of Google Maps' Street View into what appears to be genuine video footage.
The new paper [PDF] from Flynn and fellow Google researchers Ivan Neulander, James Philbin and Noah Snavely outlines in detail the problems associated with image interpolation and the solutions that the team found via the application of deep network computing.
The core task is to generate a convincing ‘tween’ image from two adjacent images in a sequence:
The images are analysed for both colour and depth information. Without the latter, objects such as trees standing in front of buildings would appear to become part of the buildings. The deep network recognises that nearer objects move further across an image set than distant ones – the parallax cue – and from this, depth maps can be generated to aid in the creation of the completely new 'virtual' filler frames.
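The parallax cue described above can be illustrated with a toy sketch. This is not the paper's method – DeepStereo infers a per-pixel depth map via a deep network, whereas this sketch assumes a single global disparity over one-dimensional 'images' – but it shows the underlying geometry: larger pixel shift means smaller depth, and a halfway view corresponds to half the shift. All function and variable names here are illustrative.

```python
import numpy as np

def disparity_to_depth(disparity_px, baseline, focal_px):
    """Standard stereo relation: depth is inversely proportional to the
    pixel shift (disparity) an object exhibits between two views."""
    return baseline * focal_px / disparity_px

def midpoint_view(left, disparity_px):
    """Synthesise a view halfway between two camera positions by shifting
    the left view by half of the measured disparity.  A single global
    disparity is assumed; the real system predicts depth per pixel."""
    return np.roll(left, -(disparity_px // 2))

# Toy 1-D 'image': a single bright object at index 6 in the left view.
left = np.zeros(10)
left[6] = 1.0

# The camera moves right, so the object appears 4 pixels further left in
# the right view.  A large shift means the object is close to the camera.
right = np.roll(left, -4)           # object now at index 2

mid = midpoint_view(left, 4)        # interpolated view: object at index 4
depth = disparity_to_depth(4.0, baseline=1.0, focal_px=8.0)  # -> 2.0
```

The same inverse relation is why the trees in the example above can be separated from the building behind them: their larger frame-to-frame shift marks them as nearer.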
Initial tests were generated from photo sequences specially taken for the project, in circumstances very similar to those in which Google records street imagery for the diorama-style Street View feature in Google Maps. The paper notes:
To train our network, we used images of street scenes captured by a moving vehicle. The images were posed using a combination of odometry and traditional structure-from-motion techniques. The vehicle captures a set of images, known as a rosette, from different directions for each exposure. The capturing camera uses a rolling shutter sensor, which is taken into account by our camera model. We used approximately 100K of such image sets during training.
To test the efficacy of the deep network's image interpolation, the researchers tasked the system with recreating a 'middle' image that had purposely been removed from an image sequence, with convincing results:
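The evaluation protocol is simple to sketch: hold out each interior frame, reconstruct it from its two neighbours, and measure the error against the ground truth. In the toy version below a naive pixel average stands in for the deep network (the paper's actual model is far more sophisticated), and the function names are my own:

```python
import numpy as np

def evaluate_interpolation(frames, interpolate):
    """Hold out each interior frame, reconstruct it from its neighbours,
    and return the mean absolute pixel error against the ground truth."""
    errors = []
    for i in range(1, len(frames) - 1):
        predicted = interpolate(frames[i - 1], frames[i + 1])
        errors.append(np.mean(np.abs(predicted - frames[i])))
    return float(np.mean(errors))

# Naive stand-in for the network: average the two neighbouring frames.
naive = lambda a, b: (a + b) / 2.0

# Toy sequence: a 4x4 'scene' brightening linearly over time.  Linear
# change means the naive average recovers interior frames exactly.
frames = [np.full((4, 4), float(t)) for t in range(5)]
score = evaluate_interpolation(frames, naive)   # -> 0.0
```

On real street imagery a pixel average would produce ghosted double exposures wherever objects move with parallax, which is exactly the failure the learned depth-aware interpolation avoids.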
The paper notes that the test model using Google Street View image sets compares well to the results from custom shoots. Blurred objects, such as moving cars, 'fail gracefully' without disturbing the progression of the scene, though naturally two adjacent images cannot supply image detail for topography which is only visible (in the real world) from the point of view of the interstitial, interpolated frame. Thus if the DeepStereo system had two 'before and after' views of a house with a deep porch, each at about 45 degrees relative to the house, it would be unable to recreate the deeply recessed door, or anything else occluded in both input views, from the two lateral images alone.
Currently the DeepStereo system is fairly inflexible and, though fast, is not yet capable of real-time rendering. The researchers nonetheless see potential in utilising additional or alternative networks to make on-the-fly interpolation possible: 'We believe that with some of these improvements our method has the potential to offer real-time performance on a GPU.'
The researchers see potential in the technology in the fields of image stabilization, 2D-3D movie conversion techniques, teleconferencing, cinematography and, of course, virtual reality.
Project leader Flynn is an alumnus of Berkeley. He developed 3D tracking software for legendary effects house Digital Domain, then researched multi-threaded image processing at Sony Pictures Imageworks, before moving in 2008 to a senior software development position at Google.