Milk: working round the big data bottleneck in modern CPUs
Fri 16 Sep 2016
On Tuesday at the International Conference on Parallel Architectures and Compilation Techniques in Israel, MIT researchers presented a new run-time technique capable of speeding up Big Data access by up to a factor or four.
The paper presented, Optimizing Indirect Memory References with milk by Vladimir Kiriansky, Yunming Zhang and Saman Amarasinghe, explains why data retrieval in large databases needs to overcome the problem of the ‘principle of locality’ in modern CPU architectures.
Modern CPU architecture is built to fetch large amounts of data at a time, since journeys from the central controller to the memory bank need to be kept to a minimum for purposes of latency. But ‘sparse’ datasets contain many non-contiguous data points which are sitting next to nothing else that is of any interest – which means that the ten items you need to request may require ten individual journeys across the bus.
CPU architecture has developed in this unhelpful way because for many common types of data access, this approach really speeds things up. If you are fetching the first tenth of a large image into Photoshop, and you’re not otherwise constrained by D/RAM requirements or other bottlenecks, bringing the adjacent block of data along for the ride is likely to be useful. But calls to unstructured datasets are very unlikely to have the same benefit.
Therefore the researchers have originated a C++ language extension which ‘queues up’ a number of these diverse and unrelated requests and executes them in efficient batches, with the queuing time outweighed by the speed benefits of fewer journeys to storage.
PhD Student Vladimir Kiriansky explains the origin of the extension’s name, which also indicates the nature of the problem Milk is trying to solve:
“It’s as if, every time you want a spoonful of cereal, you open the fridge, open the milk carton, pour a spoonful of milk, close the carton, and put it back in the fridge.”
Once the queues from each of the CPU cores reach a worthwhile level, the cores merge their lists, re-distribute the addresses among themselves to minimise the total number of requests needed and ensure that item #3 and item #24 (which may have had no other relationship until now, but which happen to reside in the same or near block of memory) are fetched together.
The challenges of this method are aggravated by the increasingly inefficient layers of cache around core memory, and Milk also needs to negotiate data stored at those addresses, and to decide which addresses are likely to be needed again, and avoid flushing them after the operation.
The paper itself claims a threefold latency improvement, the surrounding MIT article indicates a potential fourfold improvement, and hints that further development could radically increase this lead. It is an impressive leap for a runtime extension; it is easy to imagine what this approach could mean for a fully-developed framework, even without the development of bespoke accompanying hardware.