The future of video streaming – Training AI to see with the human eye
Mon 6 Apr 2020 | Professor Yiannis Andreopoulos
The latest research from Cisco says that global internet traffic will reach 4.8 zettabytes a year by 2022, or 150,700 gigabytes a second. Video will represent at least 80 per cent of the total internet traffic.
That research was published before the current coronavirus pandemic, which may well dramatically change the shape and per-type breakdown of global internet traffic, as face-to-face meetings are overwhelmingly being replaced with video conference calls and live video streaming. For example, NAB, the biggest event of the year in media production and distribution, recently announced that it will switch to a virtual conference for 2020, with live presentations and meetings taking place via video streamed over the web.
We are all aware of the bandwidth issues and contention that can pose a risk for internet connections. This is particularly true for video transfer over IP because of its need for sustained and consistent data rates and for low-latency packet delivery. But there is another significant implication of the rapid rise in data traffic for the chain of data centres that comprise “the internet”, each drawing huge amounts of power to support massive numbers of processor racks, vast storage arrays and a very significant number of IP routing servers.
KTH, the Royal Institute of Technology in Stockholm, has calculated that data centre services already account for more than 10 per cent of the world’s total energy consumption. That puts the sector’s carbon footprint at the same level as air travel today.
There is a real possibility that data centre operators – and therefore their customers – will have to declare their carbon impact, and potentially get taxed on it. There is an imperative, therefore, to reduce the demand on processing and storage, particularly for video.
Video traffic on the internet can be grouped under three broad headings:
- Video on demand – the likes of Netflix, Amazon Prime, Disney Streaming, Hulu, YouTube, Now TV, etc
- Live streaming – everything from video conferencing to sports and live broadcast of entertainment and infotainment outlets, such as music concerts and live news feeds
- Emerging markets – new services in the social media sphere like TikTok, Instagram Live Stories, and online gaming, where the game is rendered on GPU farms in the cloud and is streamed to the player as an ultra-low latency live video feed
Video is compressed before being sent over the web, using internationally-ratified standards from the likes of ISO MPEG, ITU-T VCEG, or, more recently, the Alliance for Open Media (AOMedia). These standards are asymmetrical: complex, generally hardware-supported encoders deliver to simple decoders, which usually decode the video in software and display it in a web browser or in an app, such as the Netflix player, on the consumer device.
Consumer expectation is that image quality will continue to rise, for instance by moving from full HD to 4K Ultra HD. Encoding standards and hardware therefore need to work significantly harder to keep transmission bitrates within what networks can support today. The result is that content streaming platforms rapidly hit a processing bottleneck if they strive to achieve very high video quality at the lowest possible bitrate. That is, encoding complexity rises steeply with increased volumes of content, increased resolutions and the increased sophistication of newer encoding standards, and many suppliers are hitting a so-called “complexity wall”.
Video vs audio
Let us take a step back from video for a moment, and think about audio. When the web started to become significant as a communications medium 25 years ago, high-fidelity audio could be delivered, but only if the IP connection could sustain 1–2 Mb/s, something exceptionally rare at the time.
This massive bottleneck was effectively resolved by tapping into our understanding of the psychoacoustics of human auditory perception. For example, MPEG-1 Audio Layer III (commonly known as MP3) eliminated the spectral parts of the audio signal that typical human audiences do not hear, thereby slashing the data rate from 1–2 Mb/s to as low as 64 kb/s. Audiophiles at the time were critical of the audio quality of MP3 (and some still are to this day!), but, for the vast majority of consumers listening on conventional speakers or headphones, the MP3 audio quality turned out to be perfectly acceptable, and MP3 and its successors such as AAC are now completely ubiquitous in commercial services all around the world.
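As a rough check on those figures, the short sketch below assumes that “high-fidelity” means CD-quality PCM audio (44.1 kHz, 16 bits per sample, two channels). That assumption is ours for illustration and is not stated above.

```python
# Assumption (not from the article): "high-fidelity" audio is CD-quality PCM.
SAMPLE_RATE_HZ = 44_100   # samples per second, per channel
BITS_PER_SAMPLE = 16
CHANNELS = 2              # stereo
MP3_BITRATE_BPS = 64_000  # the low-end MP3 rate quoted in the text


def pcm_bitrate_bps(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Bitrate of uncompressed PCM audio in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


cd_bps = pcm_bitrate_bps(SAMPLE_RATE_HZ, BITS_PER_SAMPLE, CHANNELS)
ratio = cd_bps / MP3_BITRATE_BPS

print(f"Uncompressed CD-quality audio: {cd_bps / 1e6:.2f} Mb/s")
print(f"Reduction at 64 kb/s: about {ratio:.0f}:1")
```

Under that assumption the uncompressed stream sits at about 1.41 Mb/s, inside the 1–2 Mb/s range quoted above, and a 64 kb/s MP3 stream is roughly 22 times smaller.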
Given the complexity wall of video encoding, it is now time to consider the same human-perception driven principles in the case of video encoding and delivery. Not every pixel of every frame is equal. If one determines in an automated manner what people see versus what they do not see in video frames, then one can attenuate the detail of pixel areas that are less important and thereby save encoding complexity and bitrate, even when using existing encoders for the actual compression. Until now, such work has been hand-crafted: manual designs attempt to find perceptually-significant areas in the content and attenuate or “blur out” the remaining parts. Alas, such hand-crafted designs only work in very narrow contexts (e.g., full-frontal videos of talking heads in conversational video, where the areas of the background can be easily detected and blurred out). The challenge here is twofold:
- The way we perceive and interpret video is very complicated – much more so than the way we humans perceive sound
- From an information theory point of view, video signals are extremely diverse in quantity and content in comparison to audio tonal information.
Therefore, developing automated, data-driven techniques to optimise images without visual distortion is a very complex problem, and one of the emerging long-term challenges in video representation and compression research.
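To make the hand-crafted idea concrete, here is a minimal sketch of “attenuate what matters less”. It is a toy illustration, not any production method: the `importance` mask is a hypothetical input (1.0 means keep the pixel as is, 0.0 means fully blur it), and producing that mask automatically is precisely the hard part discussed above.

```python
from typing import List

Frame = List[List[float]]  # luma values in row-major order


def box_blur(frame: Frame) -> Frame:
    """3x3 mean filter; pixels beyond the border are clamped to the edge."""
    h, w = len(frame), len(frame[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    total += frame[yy][xx]
            out[y][x] = total / 9.0
    return out


def attenuate(frame: Frame, importance: Frame) -> Frame:
    """Blend each pixel with its blurred version according to importance.

    importance is a hypothetical per-pixel mask in [0, 1]:
    1.0 keeps the original pixel, 0.0 replaces it with the blurred value.
    """
    blurred = box_blur(frame)
    return [
        [imp * p + (1.0 - imp) * b for p, b, imp in zip(row, brow, irow)]
        for row, brow, irow in zip(frame, blurred, importance)
    ]


# Toy 3x3 checkerboard: keep the centre pixel sharp, soften everything else.
frame = [[0.0, 255.0, 0.0], [255.0, 0.0, 255.0], [0.0, 255.0, 0.0]]
mask = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
softened = attenuate(frame, mask)
```

The softened regions contain less high-frequency detail, so a standard encoder downstream would typically spend fewer bits on them. A fixed blur like this is exactly the kind of narrow, manual design that a data-driven approach aims to replace.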
Video multi-method assessment fusion
A number of proposals that quantify the perceptual video quality of compression algorithms versus the source video have been laboriously developed by leading industry and academic stakeholders. Probably the most widely used “reference-based” approach today is VMAF, which stands for video multi-method assessment fusion. This was developed by Netflix, and was underpinned by academic research at, amongst others, the University of Southern California, the University of Texas, and the Université de Nantes in France. VMAF draws on existing image quality metrics and adds its own “combining” factors to give a score for visual perception, ranging from 100 for a compressed video that is visually indistinguishable from the reference video, down to 0 for complete visual incoherence between the compressed video and the reference (uncompressed) video.
Work continues on VMAF. Recently, variants have been added to cover 4K video material, and to adapt for screen size, reflecting the fact that the same video signal will be perceived differently when watching on an HD or 4K smartphone screen than when watching on a large flat-panel 4K TV.
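For readers who want to try a reference-based score themselves, one common route is FFmpeg’s `libvmaf` filter. The sketch below only assembles the command line; it assumes an FFmpeg build compiled with libvmaf support, and the file names are hypothetical placeholders. It is one possible workflow, not necessarily the one used by Netflix or iSize.

```python
import shutil
import subprocess
from typing import List, Optional


def build_vmaf_command(distorted: str, reference: str,
                       log_path: str = "vmaf.json") -> List[str]:
    """Assemble an FFmpeg command that scores `distorted` against `reference`.

    The libvmaf filter takes the distorted video as its first input and the
    reference as its second. Both should share resolution and frame rate.
    """
    return [
        "ffmpeg",
        "-i", distorted,   # first input: compressed / processed video
        "-i", reference,   # second input: pristine reference video
        "-lavfi", f"libvmaf=log_fmt=json:log_path={log_path}",
        "-f", "null", "-",  # discard the video output, keep only the log
    ]


def run_vmaf(distorted: str, reference: str) -> Optional[int]:
    """Run the command if ffmpeg is on PATH; return its exit code, else None."""
    if shutil.which("ffmpeg") is None:
        return None
    return subprocess.run(build_vmaf_command(distorted, reference),
                          check=False).returncode


# Hypothetical file names, for illustration only.
cmd = build_vmaf_command("encoded.mp4", "source.mp4")
print(" ".join(cmd))
```

The resulting JSON log holds the per-frame and pooled scores on the 0 to 100 scale described above.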
Armed with a means of providing reliable metrics for human visual perception, it is now rapidly becoming possible to develop machine learning (ML) tools to preprocess video to provide a chosen balance between file size and perceived image quality as scored by VMAF and similar high-level quality metrics. At iSize, we have so far achieved state-of-the-art results in this field worldwide.
There are two important points to make about this approach. First, as noted earlier, this is a preprocessing solution. The raw video passes through the ML-enabled processing stage, which is designed to output pixel volumes such that the subsequent encoder will allocate fewer bits to the parts of the image that are not important to human perception, while enhancing the parts that human viewers will notice the most.
The perceptually-preprocessed video then goes on to the standards-based video compression engine using normal encoders, and is received by our consumer devices in exactly the same way as it is today. The video clients (consumer devices) remain completely unaware that the video has been optimised by the preprocessing stage, and no changes are required in the stream packaging, delivery, decoding and viewing software or hardware. This means such solutions are deployable today, with no disruption to the way video is delivered, decoded and played by our devices.
The benefit is that the preprocessing stage changes the content such that the subsequent encode will achieve the same perceptual quality at a significantly lower data rate. In particular, the ML models of iSize have been shown to reduce the bitrate of H.264/AVC, H.265/HEVC and AV1 encoding standards by 20 to 40 per cent without changing the video resolution and while achieving the same or even higher VMAF score. This means that, for the same or even fewer compute cycles than they consume today, standards-based encoders can achieve the same visual quality at higher compression.
Importantly, since the ML-based preprocessing stage is a massively parallel process, it is ideally suited to graphics processing units (GPUs). These are readily available in the cloud, so the processing can be spun up or down as needed, and it consumes as little as 12 milliseconds per 1080p (full HD) video frame on an NVIDIA T4 GPU. Such perceptual preprocessing is a one-off process, and multiple encoding bitrates (or even encoding standards) can benefit from it. This mild overhead is therefore more than compensated by the reduction in compression complexity and by avoiding tight coupling with highly-optimised, hardware-enhanced codecs that carry out a separate compression stage for every format and every bitrate needed by every edge server. Moreover, no change or upgrade is required to consumer devices, thereby prolonging their useful life and improving the quality of experience for their users. The result is a substantial decrease in the bandwidth required to deliver video, for a barely noticeable processing overhead.
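To put the 12-millisecond figure in perspective, the sketch below converts it into throughput and shows how the one-off cost is shared across renditions. The 30 frames-per-second content rate and the six-rendition bitrate ladder are hypothetical values chosen only for illustration.

```python
MS_PER_FRAME = 12.0  # preprocessing time per 1080p frame quoted in the text


def frames_per_second(ms_per_frame: float) -> float:
    """Frames one GPU can preprocess per second at the given per-frame cost."""
    return 1000.0 / ms_per_frame


def realtime_factor(ms_per_frame: float, content_fps: float) -> float:
    """How many times faster than real time the preprocessing runs."""
    return frames_per_second(ms_per_frame) / content_fps


def amortised_ms_per_rendition(ms_per_frame: float, renditions: int) -> float:
    """Per-frame preprocessing cost shared across the encodes that reuse it."""
    return ms_per_frame / renditions


# Hypothetical scenario: 30 fps content encoded into a six-step bitrate ladder.
print(f"Throughput: {frames_per_second(MS_PER_FRAME):.1f} frames/s")
print(f"Real-time factor at 30 fps: {realtime_factor(MS_PER_FRAME, 30.0):.2f}x")
print(f"Cost per rendition: {amortised_ms_per_rendition(MS_PER_FRAME, 6):.1f} ms/frame")
```

At 12 milliseconds per frame a single GPU handles roughly 83 frames per second, comfortably faster than real time for 30 fps content, and the effective cost per rendition falls as more encodes reuse the same preprocessed frames.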
As a final thought, consider this. If we assume a median data rate saving of 30 per cent based on such perceptual preprocessing, and apply it to the over-80 per cent of internet traffic that Cisco says will be video in a couple of years’ time, an approach like iSize’s could reduce internet traffic by close to 25 per cent. Even a fraction of that saving would be a welcome reduction in bandwidth contention and in the carbon footprint generated by its support systems.
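The arithmetic behind that estimate is simple enough to write down. The sketch below uses the figures quoted in this article (an 80 per cent video share, a 30 per cent saving, 4.8 zettabytes a year); the `adoption` parameter is a hypothetical addition that models the case where only a fraction of video traffic is preprocessed.

```python
VIDEO_SHARE = 0.80                # share of internet traffic that is video
BITRATE_SAVING = 0.30             # assumed median saving from preprocessing
TOTAL_TRAFFIC_ZB_PER_YEAR = 4.8   # Cisco forecast quoted at the top


def total_traffic_reduction(video_share: float, bitrate_saving: float,
                            adoption: float = 1.0) -> float:
    """Fraction of all internet traffic saved.

    adoption is a hypothetical parameter: the fraction of video traffic that
    is actually preprocessed (1.0 means every stream).
    """
    return video_share * bitrate_saving * adoption


reduction = total_traffic_reduction(VIDEO_SHARE, BITRATE_SAVING)
saved_zb = TOTAL_TRAFFIC_ZB_PER_YEAR * reduction

print(f"Upper-bound reduction in total traffic: {reduction:.0%}")
print(f"Equivalent volume: {saved_zb:.2f} ZB a year")
```

With exactly 80 per cent video the upper bound is 24 per cent, and it edges closer to 25 per cent as the video share rises above 80 per cent. Lowering `adoption` shows what a partial roll-out would deliver.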