Beware the Data Lake
Wed 20 Aug 2014
Paul Miller explains the potential risks behind the recent ‘data lake’ hype and warns that businesses may be heading into murky waters.
There’s a new wonder technology in town, and it promises to solve every problem you’ve ever had with all those separate silos of data, scattered throughout your organisation. Or so some would have us believe.
Despite seeing real potential in certain use cases, I’ve been wary of current vendor enthusiasm for the concept of the ‘data lake.’ The industry’s cheerleaders gush, breathlessly, that ‘simply’ throwing all of your data into one place allows magic to happen. Real businesses, with real (and therefore messy) data, may find the process slightly more complicated and a lot less magical than they are being led to believe.
Data lakes, in which multiple applications across the business can read from and write to a shared body of data, make a huge amount of sense. They free us from the seemingly endless duplication of an organisation’s core data. They hold out the promise of helping analysts spot – and act upon – trends hidden in data scattered across different systems. Their close association with big data darling Hadoop frees us from the strait-jacket of having to define every schema and every query in advance. They often make use of commodity hardware and open source software, hugely reducing the cost of storing and maintaining increasingly large volumes of data. They make it feasible to compare data from different sources and in different formats.
The potential is certainly huge, and companies like America’s General Electric have been in the news recently to talk about some of the advantages they saw in pouring jet engine diagnostics into a data lake partly powered by their recent big investment in EMC spin-off Pivotal. But a data lake is a tool, not a panacea. And unthinkingly pouring data of different resolution, provenance and intent into one place and then treating it all as equivalent is enough to give anyone who thinks about it for even a moment the most terrifying nightmares.
Some of the safeguards, some of the rules, some of the metadata, and some of the checks and balances baked so deeply into current systems are actually there for pretty good reasons. Without them, your crystal-clear lakes of beautiful, valuable data can all too quickly become scum-infested stagnant ponds. The digital mosquitoes buzzing across the muck carry nasty digital malaria, and you won’t like it when it gets into the lifeblood of your business.
Aggregate the data coming from lots of different sensors and processes associated with your jet engines by all means. Gather up everything you can find about stock levels, customer flow through your stores, and the effectiveness of promotional offers. Slice, dice, and slice again with the myriad metrics available on your ecommerce operation. Even look at combining what you know about your online and offline selling, to take a stab at understanding the customer who shops in your stores and on your website.
But, and it’s a big but, think a little before mixing your data. Experiment if you want to, but remember where everything came from, what it was originally collected for, and where its limitations lie. Limitations from one data set added to limitations from another data set can do one of two things. If you’re lucky, the aggregate offers you the means to fill in the blanks and to interpolate over the holes. If you’re unlucky, you end up with a whole that looks really convincing, but which means absolutely nothing. And, if you forgot to preserve the originals, you now have a data lake that’s even worse than what you started with. Goodbye, business.
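To make "remember where everything came from" a little more concrete, here is a minimal Python sketch of one way to carry provenance along with each record and to keep the originals next to any combined view. It is an illustration only: the names Provenance, Record and combine, and the sample sources and limitations, are hypothetical and do not come from Hadoop or any particular data lake product.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Provenance:
    """Hypothetical description of where a record came from."""
    source: str                        # system the record was read from
    purpose: str                       # what the data was originally collected for
    limitations: tuple[str, ...] = ()  # known caveats of that source


@dataclass(frozen=True)
class Record:
    key: str                 # shared identifier, e.g. a customer id
    values: dict[str, Any]   # the fields supplied by this source
    provenance: Provenance


def combine(records):
    """Build a combined view per key without discarding anything.

    Each combined value is stored with the name of the source that supplied
    it, the untouched original records are kept alongside, and the
    limitations of every contributing source are carried forward.
    """
    combined = {}
    for rec in records:
        entry = combined.setdefault(
            rec.key, {"values": {}, "originals": [], "limitations": []}
        )
        for name, value in rec.values.items():
            # The first source to supply a field wins; later sources never
            # silently overwrite it, and remain available in "originals".
            entry["values"].setdefault(name, (value, rec.provenance.source))
        entry["originals"].append(rec)
        entry["limitations"].extend(rec.provenance.limitations)
    return combined


# Assumed example data: one customer seen in a store and on the website.
store = Provenance("store_pos", "till receipts", ("cash sales have no customer id",))
web = Provenance("web_shop", "order fulfilment", ("guest checkouts are anonymous",))

view = combine([
    Record("cust-1", {"last_store_visit": "2014-08-01"}, store),
    Record("cust-1", {"last_web_order": "2014-08-10"}, web),
])
```

Anyone reading view["cust-1"] can see which source supplied each value, which caveats apply to the combination, and can still reach the original records if the combined picture turns out to be convincing but meaningless.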
And, as Silicon Angle reports, Gartner would appear – broadly – to agree.
For those who haven’t encountered the term before, PwC has a nice introduction to the data lake and some of its promise.