In the era of big data, how can we safeguard data integrity?
Fri 15 Mar 2019 | Neil Barton
With the right data lineage tools, ensuring data integrity in a big data environment becomes far easier
Multi-dimensional is something of an understatement when it comes to describing the beast of big data. However, the dimension that outweighs all others – so much so that it’s even in the name – is volume.
Given the enormous potential this data could hold, the challenge becomes applying all of the usual methodologies and technologies at scale. This is particularly important in a world where about 2.5 quintillion bytes of data are created every single day, and the rate of data growth is only increasing. In addition, an increasingly large portion of this data is unstructured, which makes it harder to categorise and sort than its structured counterpart; IDC has estimated that as much as 90 percent of all big data is unstructured.
Cutting through the noise
Compounding the problem, most businesses expect that decisions made based on data will be more effective and successful in the long run. However, with big data often comes big noise – after all, the more information you have, the greater the chance that some of it is incorrect, duplicated, outdated or otherwise flawed. This is a challenge that most data analysts are prepared for, but one that IT teams also need to factor into their downstream processing and decision making to ensure that bad data does not skew the resulting insights.
This is why overarching big data analytics solutions alone are not enough to ensure data integrity in the era of big data. In addition, while new technologies like AI and machine learning can help make sense of the data en masse, often these rely on a certain amount of cleaning and condensing going on behind the scenes to be effective and able to run at scale.
While accounting for some errors in the data is fine, being able to find and eliminate mistakes where possible is a valuable capability. This is particularly true when a configuration error, or a problem with a single data source, creates a stream of bad data, which can have a catastrophic effect by derailing analysis and delaying the time to value.
Without the right tools, these kinds of errors can create unexpected results and leave data professionals with an unwieldy mass of data to sort through to try and find the culprit.
This problem is compounded when data is ingested from multiple different sources and systems, each of which may have treated the data in a different way. The sheer complexity of big data architecture can turn the challenge from finding a single needle in a haystack to one more akin to finding a single needle in a whole barn.
Meanwhile, this problem no longer affects only the IT function and business decision making: solving it is becoming a legal requirement. Legislation like the GDPR mandates that businesses find a way to manage and track all of the personal data they hold, no matter how complicated the infrastructure or unstructured the information. In addition, upon receiving a valid request, organisations need to be able to delete information pertaining to an individual, or collect and share it as part of an individual’s right to data portability.
Safeguarding data integrity
So, what’s the solution? One of the best solutions for managing the beast of big data overall is also one that builds in a way to ensure data integrity – ensuring a full data lineage by automating data ingestion. This creates a clear path showing how data has been used over time, as well as its origins.
In addition, this process is done automatically, making it much easier and more reliable. However, it is important to ensure that lineage is captured down to the fine-detail level. WhereScape automation software, for example, can retrospectively go out and catalogue data sources and easily enable complex data extraction while ensuring compliance with GDPR requirements.
With the right data lineage tools, ensuring data integrity in a big data environment becomes far easier. The right tracking means that data scientists can track data back through the process to explain what data was used, from where, and why.
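To make the idea of tracking data back through the process concrete, here is a minimal sketch in Python. It is purely illustrative and is not how WhereScape or any particular lineage tool works; the `LineageLog` class and the dataset names are hypothetical. It assumes each ingestion or transformation step records which inputs produced which output, so that any result can later be walked back to its origins.

```python
from dataclasses import dataclass, field


@dataclass
class LineageStep:
    """One recorded hop: which inputs produced a dataset, and how."""
    output: str
    inputs: list
    operation: str


@dataclass
class LineageLog:
    """Hypothetical in-memory lineage record, keyed by output dataset."""
    steps: dict = field(default_factory=dict)

    def record(self, output, inputs, operation):
        self.steps[output] = LineageStep(output, list(inputs), operation)

    def trace(self, dataset):
        """Walk upstream from a dataset; return (steps taken, original sources)."""
        path, sources, seen = [], set(), set()
        stack = [dataset]
        while stack:
            name = stack.pop()
            if name in seen:
                continue
            seen.add(name)
            step = self.steps.get(name)
            if step is None:
                sources.add(name)  # no step produced it, so it is an origin
            else:
                path.append(step)
                stack.extend(step.inputs)
        return path, sources


# Illustrative dataset names only.
log = LineageLog()
log.record("clean_orders", ["crm_export", "web_orders"], "deduplicate and merge")
log.record("sales_report", ["clean_orders"], "aggregate by month")

path, sources = log.trace("sales_report")
for step in path:
    print(f"{step.output} <- {step.inputs} ({step.operation})")
print("Original sources:", sorted(sources))
```

In this sketch, tracing `sales_report` surfaces both the intermediate step and the two original sources, which is the "what data was used, and from where" part of the explanation. The "why" would have to come from richer metadata than is shown here.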
Meanwhile, businesses can track down the data of a single individual, sorting through all the noise to fulfil subject access requests without disrupting the big data pipeline as a whole, or diverting significant business resource. As a result, analysis of big data can deliver more insight, and thus more value, faster – despite its multidimensional complexity.
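As a rough illustration of how fine-grained lineage could support such requests, the sketch below keeps an index of where each individual's records were written during ingestion. Everything in it is an assumption made for illustration: the `SubjectIndex` class, the subject identifiers and the dataset names are hypothetical, and a real GDPR process involves far more than a lookup table.

```python
from collections import defaultdict


class SubjectIndex:
    """Hypothetical index: subject id -> places their personal data landed."""

    def __init__(self):
        self._locations = defaultdict(set)

    def register(self, subject_id, dataset, record_key):
        """Call at ingestion time whenever personal data is written."""
        self._locations[subject_id].add((dataset, record_key))

    def locate(self, subject_id):
        """List every (dataset, record_key) holding this person's data."""
        return sorted(self._locations.get(subject_id, set()))

    def forget(self, subject_id):
        """Drop the index entry and return the locations that need erasing."""
        return sorted(self._locations.pop(subject_id, set()))


# Illustrative identifiers only.
index = SubjectIndex()
index.register("subject-42", "crm_export", "row-1001")
index.register("subject-42", "web_orders", "order-77")
index.register("subject-7", "web_orders", "order-78")

found = index.locate("subject-42")    # input to an access or portability request
to_erase = index.forget("subject-42")  # input to an erasure request
remaining = index.locate("subject-42")
print(found)
```

The design choice here is to record locations as data is ingested, rather than searching every store after a request arrives, so a single individual's data can be found without scanning the whole pipeline.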
- Neil Barton is the Chief Technology Officer for WhereScape