Data, the lifeblood of the scientific experiment
Mon 7 Dec 2015
Science is the beating heart of our society. It has shaped almost every part of the modern era and it continues to drive advances in the way we live our lives.
Millions of hours have been spent researching new concepts and theories. At the Large Hadron Collider at CERN, for instance, one million hours were spent just to reinforce the connections to the superconducting magnets. Imagine the amount of time spent on actually conducting experiments and gathering data for analysis.
Modern scientific experimentation invariably relies on the data gathered during research. The data centre at CERN processes roughly 25 petabytes of data each year, or 3 gigabytes per second. This is a colossal amount of data which has to be secured, classified and shared for the purpose of analysis. This data is also made available to the public, allowing enthusiasts and scientists alike to go through the data and provide their own input on the research conducted.
Most modern scientific research experiments rely on digital data to validate a theory or generate new knowledge. Many experiments and observatories have been in action for decades, creating vast quantities of data. Protecting this data is imperative for the future success and validity of the scientific process.
This is where a robust database comes into play. A database, where data is classified, stored and made searchable, is central to the management of data. There are many kinds of database, and they differ particularly in the way they function. The best known are relational databases, which are typically queried with SQL (Structured Query Language), and non-relational databases, commonly called NoSQL.
Both database architectures have the same end function: they support storing, searching, and managing data. However, they differ in the guarantees they provide about the data and in their data modelling. A relational database provides a relational model for data (schemas) based on set theory, whereas non-relational databases don’t need a particular structure for the data to be accessed. This distinction is crucial when scientists consider the type of data being stored.
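The difference in data modelling can be sketched in a few lines of Python. This is a minimal illustration, not a recommendation of any product: the relational side uses the standard-library sqlite3 module, the non-relational side is imitated with a plain list of dictionaries, and the table name, column names and sample records are all invented for the example.

```python
import sqlite3

# Relational side: the schema is declared up front and enforced by the database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings ("
    " id INTEGER PRIMARY KEY,"
    " detector TEXT NOT NULL,"
    " energy_gev REAL NOT NULL)"
)
conn.execute(
    "INSERT INTO readings (detector, energy_gev) VALUES (?, ?)", ("A", 12.5)
)


def insert_incomplete_row():
    """Try to insert a row that omits a required column.

    Returns True if the database rejected the row.
    """
    try:
        conn.execute("INSERT INTO readings (detector) VALUES (?)", ("B",))
    except sqlite3.IntegrityError:
        return True
    return False


rejected = insert_incomplete_row()

# Non-relational side (imitated with dictionaries): each record carries its
# own structure, so differently shaped records can sit side by side.
documents = []
documents.append({"detector": "A", "energy_gev": 12.5})
documents.append({"detector": "B", "notes": "calibration run", "tags": ["test"]})

# The application, not the store, decides how to handle missing fields.
with_energy = [doc for doc in documents if "energy_gev" in doc]
```

In the relational half the incomplete row never reaches the table. In the document half both records are accepted, and it falls to the querying code to cope with fields that may be absent.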
SQL and NoSQL databases each have their benefits and limitations in certain settings. Banks, for instance, rely on relational databases to store and process customer information because of the uncomplicated nature of the database and the guarantees the relational model can provide on eliminating redundant data and maintaining data integrity. Account information can be stored in one table and personal details in another, while the integrity of the relation between the two tables is managed and guaranteed by the database. When a request is made for an individual customer’s details, the data is consolidated from the back-end of the infrastructure and made visible in the application layer.
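The banking scenario can be sketched as follows, again with Python's standard-library sqlite3 module. The table layout, the customer name and the balances are hypothetical; a real banking schema would be far more involved. The point is only that the database itself refuses an account that refers to a customer who does not exist, and that a join consolidates the two tables on request.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys once this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Personal details live in one table, account information in another.
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE accounts ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customers(id),"
    " balance REAL NOT NULL)"
)
conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Ada')")
conn.execute("INSERT INTO accounts (customer_id, balance) VALUES (1, 250.0)")


def orphan_account_rejected():
    """Try to create an account for a customer id that does not exist."""
    try:
        conn.execute("INSERT INTO accounts (customer_id, balance) VALUES (99, 10.0)")
    except sqlite3.IntegrityError:
        return True
    return False


def customer_details(customer_id):
    """Consolidate personal and account data for one customer with a join."""
    return conn.execute(
        "SELECT c.name, a.balance FROM customers c"
        " JOIN accounts a ON a.customer_id = c.id"
        " WHERE c.id = ?",
        (customer_id,),
    ).fetchall()
```

Calling customer_details(1) returns the name and balance together, which is the consolidation step the application layer would then display.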
Scientific research experiments make use of relational databases such as MySQL and MariaDB. Particular emphasis is placed on transactions, as all the data the experiments create needs to be stored reliably and easily found for analysis. Some laboratories, however, use both database architectures to make best use of the technology available. NoSQL databases, for instance, are well suited to scientific research because of the ease with which they scale alongside the enormous volume of data created. They are also suited to storing metadata, which makes classification and subsequent analysis easier.
The tools for managing databases are also important, especially when managing distributed systems, as these tend to be more complex. Consider a research experiment which involves several partners and data centres located around the world. These data centres use the same database architecture, but the experiment requires their databases to communicate and share data with one another. This is necessary for failure contingency planning, where an architecture is needed to ensure the data is always highly available even if a database fails. The mechanism for ensuring highly available data is replication: copies of the data are stored in two or more databases on different sites, so that failure of one database does not cause the system to be unavailable. Clustering, the process by which two or more nodes communicate for the purpose of replication, is another way to describe the mechanism used by a highly available, robust database.
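The idea behind replication can be shown with a toy, in-memory sketch in Python. The Node and ReplicatedStore classes, the site names and the sample record are all hypothetical, and nothing here reflects how MySQL, MariaDB or any real clustering product is implemented. A production cluster also has to resynchronise nodes that missed writes while offline and resolve conflicting updates, which this sketch leaves out.

```python
class Node:
    """A stand-in for one database at one site (in-memory only)."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.online = True


class ReplicatedStore:
    """Writes each record to every reachable node; reads from the first available copy."""

    def __init__(self, nodes):
        self.nodes = nodes

    def write(self, key, value):
        """Store the value on all online nodes and return how many copies were made."""
        written = 0
        for node in self.nodes:
            if node.online:
                node.data[key] = value
                written += 1
        if written == 0:
            raise RuntimeError("no node available for write")
        return written

    def read(self, key):
        """Return the value from the first online node that holds the key."""
        for node in self.nodes:
            if node.online and key in node.data:
                return node.data[key]
        raise KeyError(key)


site_a = Node("site-a")
site_b = Node("site-b")
store = ReplicatedStore([site_a, site_b])

copies = store.write("run-001", {"events": 1024})  # stored on both sites

site_a.online = False           # simulate an outage at one site
result = store.read("run-001")  # still served from the surviving copy
```

Because the record was written to both sites before the outage, the read succeeds even though one node is down, which is the availability property described above.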
The European Gravitational Observatory, which is part of a community of 19 different laboratories, is trying to determine the existence of gravitational waves, theorised by Albert Einstein. It implemented clustering technology both to protect the data in case of a database outage and to make it easier to share the data it has collected with the other research teams for analysis.
Having a robust database in place is imperative to running a valid research experiment. Scientists rely on the data they collect just as chefs rely on the ingredients they use to cook. The data could give them a glimpse into the past, the future or the makeup of the universe, which could change the lives of everyone on Earth.
Science experiments have given us so many new advancements in computing, manufacturing and medicine. It would be irresponsible to put valuable research data at risk without a solid strategy behind the collection, classification and analysis of that data. The beating heart of science needs the lifeblood of data to be able to make our understanding as advanced as it can be.