Disney’s artificial intelligence project pairs objects with sound
Wed 16 Nov 2016
Scientists at Disney Research, partnering with ETH Zurich, have designed an artificial intelligence system that can associate an object with the sound it makes. Using a core algorithm that analyzes audio-visual input, together with an iterated clustering step that filters out irrelevant noise, the team taught the system to choose the appropriate audio when presented with a visual input.
While video recordings, with their synchronized audio tracks, provide a natural learning environment for pairing objects with sounds, it has been difficult to teach a machine to disregard extraneous audio, whether background music, narration or off-screen sounds.
Uncorrelated sounds create ambiguity when a machine attempts to associate a sound with an object, but Disney and ETH Zurich created a method for filtering them out. The team uses a core algorithm that analyzes audio-visual input, then applies an iterated mutual kNN clustering step that retains only the sounds recurring across several different inputs.
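The recurrence test behind that filtering step can be sketched in a few lines. The following is an illustrative single pass (the researchers iterate the procedure), with made-up feature vectors, Euclidean distance, and function names that are assumptions, not Disney's actual implementation: an audio feature survives only if it forms a mutual nearest-neighbour link with a feature from a *different* video.

```python
import numpy as np

def mutual_knn_filter(features, video_ids, k=3):
    """One pass of a mutual kNN recurrence filter (illustrative sketch).

    A feature is kept if one of its k nearest cross-video neighbours
    also lists it among its own k nearest neighbours -- i.e. the sound
    recurs across videos. One-off noise (narration, music) has no such
    mutual link and is dropped.
    """
    n = len(features)
    # Pairwise Euclidean distances between all audio features.
    dists = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    # Recurrence must be cross-video: forbid same-video neighbours.
    same_video = video_ids[:, None] == video_ids[None, :]
    dists[same_video] = np.inf
    # k nearest cross-video neighbours of each feature.
    knn = np.argsort(dists, axis=1)[:, :k]
    keep = np.zeros(n, dtype=bool)
    for i in range(n):
        # Mutual link: i names j as a neighbour AND j names i back.
        keep[i] = any(i in knn[j] for j in knn[i])
    return keep
```

For example, three engine-like features clustered together across three videos would all pass the test, while an isolated narration feature far from everything else would be filtered out.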
For example, when teaching a system to recognize the sound of a car’s engine and associate it with the image of a car, they might play several different videos of vehicles in motion. Using AI, the machine learns to recognize the recurring engine sound, and filter out irrelevant data: music, conversation, narration, and other background noise.
Jean-Charles Bazin, associate research scientist at Disney, said, “If we have a video collection of cars, the videos that contain actual car engine sounds will have audio features that recur across multiple videos. On the other hand, the uncorrelated sounds that some videos might contain generally won’t share any redundant features with other videos, and thus can be filtered out.”
Once irrelevant noise is filtered out, the algorithm can learn what sound is associated with that image, and when the image is presented independently, the system can pair it with the appropriate sound. A related study showed that using the filtered data consistently returned better results than using unfiltered inputs.
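That final pairing step can be illustrated with a toy nearest-neighbour lookup. This is a simplified stand-in for the learned model, not Disney's method: the feature representations, label scheme, and function name below are all assumptions. Given a new image, it finds the visually closest training example and returns the averaged (already filtered) audio feature for that example's object class.

```python
import numpy as np

def pair_sound(image_feat, visual_feats, audio_feats, labels):
    """Illustrative image-to-sound pairing via nearest neighbour.

    image_feat   -- feature vector of the query image
    visual_feats -- (n, d) visual features of training examples
    audio_feats  -- (n, m) cleaned audio features paired with them
    labels       -- object class of each training example
    """
    # Find the visually closest training example.
    dists = np.linalg.norm(visual_feats - image_feat, axis=1)
    label = labels[int(np.argmin(dists))]
    # Average that class's filtered audio features as its sound.
    mask = np.array([l == label for l in labels])
    return label, audio_feats[mask].mean(axis=0)
```

With the noise removed beforehand, the averaged audio feature for, say, the "car" class represents only the recurring engine sound, which is why the study found filtered inputs consistently outperformed unfiltered ones.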
The system can currently recognize a range of object/sound pairs, including the slamming of a door, the clink of glasses in a toast, or a running vacuum, car, or tram.
This system could have a number of applications, from sound effects for film to assistive technology for people with visual disabilities.