Object recognition will change the meaning of surveillance
Mon 3 Nov 2014
In 2013 the British Security Industry Association estimated that the United Kingdom has 5.9mn CCTV cameras – at the time, representing one video feed per 11 people. Across articles and their comment streams in the liberal media, Orwell is cited daily and Britain declared ‘the most surveilled country’ in the world.
But assuming that Britain has reached peak CCTV infrastructure – which is unlikely – it has not even begun to fully utilise the petabytes of video which stream daily through the 70,000 cameras operated by the British police and authorities, and the millions which run in the private sector. This is very big data indeed.
Storage alone is a problem, because the signal-to-noise ratio is appallingly unproductive, and the cost of permanently archiving it all is inconceivable. Most security cameras that are not dummies and actually do record data are linked to systems with limited retention periods. You cannot usually go back years – sometimes you cannot even go back months. And the availability of crucial detail may depend on the lossiness of the storage codec.
But new and ongoing research initiatives offer the possibility of automated, real-time video stream analysis that transcends ‘targeted’ observations such as facial recognition and the automated licence-plate reading which operates on motorways and underpins London’s Congestion Charging scheme.
What if you needed to identify a suspect’s car with unknown whereabouts and were able to command an intelligent Video Content Analysis (VCA) system to locate instances of that model, parked or in motion, in any surveilled part of the country? Or to identify people standing too near the platform edge as a train approaches? Or to raise an alarm when a person-entity splits into two (i.e. walks away from the bag they were carrying in a public place)?
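The ‘person-entity splits into two’ alarm can be illustrated with a minimal sketch. It assumes that some upstream tracker already supplies one ground-plane position per frame for the person and for the bag. The function name, the three-metre gap and the three-frame patience are all hypothetical choices made for illustration, not part of any real VCA product.

```python
import math


def abandoned_at(person_track, bag_track, max_gap=3.0, patience=3):
    """Return the first frame index at which the bag has been further than
    max_gap from the person for `patience` consecutive frames, else None.

    Both tracks are lists of (x, y) positions, one per frame (assumed input).
    """
    run = 0  # consecutive frames in which the gap has been exceeded
    for frame, (person, bag) in enumerate(zip(person_track, bag_track)):
        if math.dist(person, bag) > max_gap:
            run += 1
            if run >= patience:
                return frame
        else:
            run = 0  # person and bag are close again, so start over
    return None


# The person walks one metre per frame; the bag stops moving at x = 2.
person = [(float(x), 0.0) for x in range(10)]
bag = [(float(min(x, 2)), 0.0) for x in range(10)]
alarm_frame = abandoned_at(person, bag)
```

Requiring several consecutive frames before raising the alarm is one simple way to avoid reacting to a single noisy measurement.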
Real-time Video Content Analysis in view
Kaiming He of Microsoft Research Asia has recently published a paper [PDF], in collaboration with Xiangyu Zhang of Xi’an Jiaotong University, Shaoqing Ren of the University of Science and Technology of China and Jian Sun of Microsoft Research, outlining a breakthrough in the field of real-time visual recognition: a deep ‘Convolutional Neural Network’ (CNN) which does not require a pre-defined frame size for the input source, but can create dynamic matrices across standards.
Above: Cropping or warping to a fixed size due to the need to standardise input for the combined feed matrix. Below: the non-destructive SPP method of collating feeds into an analysable matrix
‘Spatial Pyramid Pooling’ resolves the bottleneck issue of interoperability between different video systems by changing the way Deep Neural Networks and deep learning approach the ‘stitching’ together of source feeds, and some of the problems of logic that arise from discontinuity or other anomalies in video feeds.
Anthony C Davies and Sergio A Velastin outlined many of these challenges in 2005 in ‘A Progress Review of Intelligent CCTV Surveillance Systems’ [PDF].
‘Occlusion’ is a major challenge for automated recognition systems:
“[If] the object disappears behind an obstacle, it is possible to use its velocity to estimate the place and time of its emergence assuming no change in its velocity (more sophisticated systems might use acceleration data too)…Since surveillance is typically achieved by multiple cameras, the ‘hand-over’ of identified objects from the field of view of one camera is needed. Sometimes the fields overlap, in other cases, there may be a part of the scene not covered by any camera.”
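The velocity-based estimate described in that passage can be sketched in a few lines. This is a one-dimensional simplification along the object’s direction of travel, with positions in metres and times in seconds; the function name and the figures are illustrative assumptions, not taken from the paper.

```python
import math


def predict_emergence(position, velocity, occluder_end, acceleration=0.0):
    """Estimate the seconds until a tracked object reaches the far edge of an
    obstacle, or return None if it is not predicted to reach that edge."""
    distance = occluder_end - position
    if distance <= 0:
        return 0.0  # already at or past the far edge
    if acceleration == 0.0:
        # Constant-velocity assumption: time = distance / speed.
        return distance / velocity if velocity > 0 else None
    # Constant acceleration: solve 0.5*a*t^2 + v*t - distance = 0
    # and take the smallest positive root.
    discriminant = velocity ** 2 + 2 * acceleration * distance
    if discriminant < 0:
        return None  # the object stops and turns back before the edge
    t = (-velocity + math.sqrt(discriminant)) / acceleration
    return t if t > 0 else None


# An object 2 m along its path, moving at 1.5 m/s, hidden until the 8 m mark.
steady = predict_emergence(2.0, 1.5, 8.0)
speeding_up = predict_emergence(2.0, 1.5, 8.0, acceleration=0.5)
```

A tracker could then look for a matching object near the predicted place and time, and widen the search window as the prediction ages.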
Davies and Velastin hypothesise the kind of occlusion that is likely to challenge an automated observation system:
“…two people being tracked may disappear behind an obstacle, and while there may meet, hold a conversation, then split up and depart in opposite directions. To automatically determine which one is which from the image sequences following their re-emergence is obviously not at all easy.”
The problem is magnified when the video feed is mobile, such as a police camera in a moving vehicle.
Much of the research into the field of Video Content Analysis has been driven towards progress in automating the extraction of evidence from CCTV footage or other archived sources, but He’s work is oriented to improving the current analysis speed of automated visual recognition systems by 20-100 times.
Microsoft has made major advances with Deep Neural Network analysis this year, with Project Adam able to distinguish the two possible breeds of corgi dog from video footage – something humans might find difficult.
SPP is a potential quantum leap in deep-learning object-detection, as it can generate fixed-size output from disparate input sizes. Most of the research in these fields is currently using the 22mn+ database of images provided by image.net, a reasonable definition of ‘big data’ – and the Microsoft solution is the sole real-time CNN detection system out of 38 entered into the ImageNet Large Scale Visual Recognition Challenge 2014.
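How a pooling step can turn inputs of different sizes into output of one fixed size is easiest to see in miniature. The sketch below is a simplified illustration of the pyramid-pooling idea and not the authors’ implementation: it max-pools each feature map into 1×1, 2×2 and 4×4 grids of bins, so every channel always yields 21 values. The pyramid levels and the helper `make_map` are assumptions made for the example.

```python
import math


def spatial_pyramid_pool(feature_maps, levels=(1, 2, 4)):
    """Max-pool each 2D feature map into an n x n grid of bins for every n in
    `levels`, and return all the pooled values as one flat list."""
    pooled = []
    for channel in feature_maps:
        height, width = len(channel), len(channel[0])
        for n in levels:
            for i in range(n):
                # Bin edges scale with the map, so any input size fits.
                top = math.floor(i * height / n)
                bottom = math.ceil((i + 1) * height / n)
                for j in range(n):
                    left = math.floor(j * width / n)
                    right = math.ceil((j + 1) * width / n)
                    pooled.append(max(
                        channel[r][c]
                        for r in range(top, bottom)
                        for c in range(left, right)
                    ))
    return pooled


def make_map(height, width):
    """Hypothetical stand-in for one channel of convolutional features."""
    return [[float(r * width + c) for c in range(width)] for r in range(height)]


small = spatial_pyramid_pool([make_map(6, 9)])
large = spatial_pyramid_pool([make_map(13, 10)])
print(len(small), len(large))
```

Because the output length depends only on the pyramid levels and the number of channels, the layers that follow never need the source frames to be cropped or warped first.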
Ghost in the machine
Commenting on Project Adam’s ability to recognise corgi breeds, Microsoft researcher Trishul Chilimbi is curiously uncertain as to what he is helping to create. “The deep, mysterious thing that we still don’t understand,” he says, “is how does a DNN, where all you’re presenting it is an image, and you’re saying, ‘This is a Pembroke Welsh corgi’ – how does it figure out how to decompose the image into these levels of features? There’s no instruction that we provide for that. You just have training algorithms saying, ‘This is the image, this is the label.’”
Chilimbi compares the state of the art in Deep Neural Network VCA to that of quantum physics at the beginning of the 20th century. “We tend to overestimate the impact of disruptive technologies in the short term and underestimate their long-term impact – the Internet being a good case in point.”
The potential for intelligent automated analysis in real time does not merely concern speeding up human reaction to an event which a computer has recognised, but ensuring that significant events do not get lost in the buffer dumps of obscure systems which either wipe their archives frequently or do not produce any recorded output at all.
Hanging over a notoriously unsafe railway station that I occasionally have to pass through are signs reminding me that station activity is monitored and recorded ‘for my safety and security’. I feel at least the comfort that I may gain a starring role on Britain’s Most Murdered. But I do not feel that anyone is actually likely to be watching in that moment, much as the omnipresent telescreens in Orwell’s ‘Nineteen Eighty-Four’ were an aid to occasional human operatives, but otherwise just a high-tech scarecrow for thought criminals. It is strange to think that each of those single and unblinking eyes may actually be looking back at me one day, each of us wondering what the other is thinking.