Algorithm allows analysis of compressed data without reconstruction
Tue 17 Jan 2017
A researcher from Telecom Bretagne has found that applying a K-means algorithm directly over compressed data allows for analysis with greater efficiency and at a lower coding rate.
The research focuses specifically on the data that sensors collect and send to a fusion center for processing. In a sensor network, the fusion center is the node that gathers and analyzes the measurements reported by the sensors. This binary sensor data can come from environmental monitoring, telecommunications, and nanoelectronic devices. Because these sensors typically collect large amounts of data, researchers are eager to find ways of reducing the complexity of analytics tasks.
K-means clustering is a method that partitions data points into k groups. Each point is assigned to the cluster whose center is nearest, and each center is then recomputed as the mean of the points assigned to it, with the two steps repeating until the assignments stop changing. It is a simple and efficient way of grouping and summarizing data, popular in many different data analysis applications.
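As a rough illustration of the clustering step described above, here is a minimal sketch of the standard K-means loop in Python. The function names, the sample points, and the choice to pass the starting centers in explicitly are assumptions made for this example. They are not taken from the research.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centroids, iterations=100):
    """Plain K-means: alternate assignment and centroid update until stable."""
    centroids = list(initial_centroids)
    k = len(centroids)
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments did not change, so the algorithm has converged
        labels = new_labels
        # Update step: each center moves to the mean of its assigned points.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old center if a cluster ends up empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


# Hypothetical sample data: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centroids, labels = kmeans(points, initial_centroids=[points[0], points[3]])
print(labels)
```

In practice the starting centers are usually chosen at random or with a seeding heuristic. They are fixed here only to keep the example deterministic.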
While compressing data is an effective way of transmitting it to a fusion center, the center then has to reconstruct all of the data from the sensors before analyzing it. This new research shows that a K-means algorithm can be applied directly to compressed data without the need for decompression and reconstruction.
Lead researcher Elsa Dupraz evaluated the K-means algorithm using Monte Carlo simulations, a technique that estimates performance by repeating an experiment over many randomly generated inputs. In these simulations, she found that a K-means algorithm could be successfully applied to compressed binary data without reconstruction, allowing data analysts to perform complex analytical functions directly over the compressed data. The new method thus eliminates a time- and resource-consuming step in the process of analyzing binary sensor data at a fusion center.
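To give a feel for this kind of experiment, the sketch below runs a small Monte Carlo evaluation of K-means on binary vectors. It is an illustration only and does not reproduce the researcher's method: it clusters uncompressed binary vectors, and the Hamming distance, majority-vote center update, farthest-point initialization, noise model, and every parameter value are assumptions chosen for the example.

```python
import random


def hamming(a, b):
    """Number of positions in which two binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def majority_vote(vectors):
    """Bitwise majority of a group of binary vectors (ties resolve to 0)."""
    n = len(vectors)
    return tuple(1 if 2 * sum(bits) > n else 0 for bits in zip(*vectors))


def binary_kmeans(vectors, iterations=20):
    """Two-cluster K-means for binary vectors using Hamming distance."""
    # Assumed initialization: the first vector and the vector farthest from it.
    first = vectors[0]
    second = max(vectors, key=lambda v: hamming(first, v))
    centroids = [first, second]
    labels = None
    for _ in range(iterations):
        new_labels = [
            0 if hamming(v, centroids[0]) <= hamming(v, centroids[1]) else 1
            for v in vectors
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in (0, 1):
            members = [v for v, label in zip(vectors, labels) if label == j]
            if members:
                centroids[j] = majority_vote(members)
    return labels


def noisy_copy(source, flip_prob, rng):
    """Copy a binary vector, flipping each bit independently with flip_prob."""
    return tuple(bit ^ int(rng.random() < flip_prob) for bit in source)


def run_trial(rng, length=32, per_cluster=10, flip_prob=0.05):
    """One random trial: generate noisy sensor readings, cluster, score."""
    sources = [(0,) * length, (1,) * length]  # hypothetical true cluster centers
    truth, vectors = [], []
    for label, source in enumerate(sources):
        for _ in range(per_cluster):
            truth.append(label)
            vectors.append(noisy_copy(source, flip_prob, rng))
    labels = binary_kmeans(vectors)
    agree = sum(a == b for a, b in zip(labels, truth)) / len(truth)
    return max(agree, 1.0 - agree)  # cluster numbering is arbitrary


def monte_carlo_accuracy(trials=200, seed=0):
    """Average clustering accuracy over many independent random trials."""
    rng = random.Random(seed)
    return sum(run_trial(rng) for _ in range(trials)) / trials


accuracy = monte_carlo_accuracy()
print(f"mean clustering accuracy: {accuracy:.3f}")
```

Averaging over many random trials is what makes this a Monte Carlo estimate: no single trial is representative, but the mean over the trials approximates the expected accuracy under the assumed noise model.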
Additionally, the tests confirmed that applying K-means to compressed data was both efficient and accurate, and that the coding rate needed to perform K-means on compressed vectors was lower than the rate needed to reconstruct the data.