Q&A: Why the data centre industry needs a global database of operating incidents
Thu 24 Sep 2020 | Dr. Dennis D. Cronin
Founded in 2017, the Data Center Incident Reporting Network (DCIRN) is a global database monitoring the types, frequency, and impacts of data centre operating incidents. We sat down with Dr. Dennis Cronin, CEO-Americas at DCIRN, to understand more about the organisation’s objectives and why establishing a global database of data centre incidents has benefits for everyone involved in the sector.
How did DCIRN come about and what is the organisation’s mission?
Like most great ideas, DCIRN came about through a casual conversation. In 2016, Ed Ansett of i3Solutions, Peter Gross, then with Bloom Energy, George Rocket of DCD, and others were discussing how data centre end users kept asking them how often different data centres experience the same or similar incidents. They realised that, despite their vast collective knowledge, they did not have any hard industry statistics or a list of associated solutions. At that point, the idea of DCIRN was born.
While the idea was sound, there was still the issue of how to get the industry comfortable contributing information that had historically been stifled by its culture of extreme secrecy on such matters. To address those concerns, the group identified and adopted the Confidential Reporting Programme for Aviation and Maritime (CHIRP), whose focus is to contribute to the enhancement of aviation and maritime safety worldwide by providing a totally independent, confidential reporting system for all individuals employed by or associated with these industries.
As a parallel, DCIRN’s mission is to manage an independent, confidential reporting programme for data centre incidents and failures that operators of digital infrastructure can use to improve the safety, reliability and availability of the facilities they work in along with the network services they provide.
Why is an organisation like DCIRN so important for the future of the data centre industry?
Technology is built on the advances made by previous generations. If the knowledge embedded in those previous versions is not captured, then one has nothing to build upon and the technology fades into the abyss. The same is true of those who operate technology sites. Each generation of data centre operators builds upon the knowledge of its predecessors. If the predecessors’ knowledge and experience are not captured, then the succeeding generations cannot advance.
This is not a new concept, but in our hyper-speed world it is one that is often forgotten:
- “Those who don’t know history are doomed to repeat it.” ― Edmund Burke.
- “Those who forget the past are doomed to repeat it.” ― Sara Shepard, Wanted.
- “Those unable to catalog the past are doomed to repeat it.” ― Lemony Snicket, The End.
- “Those who cannot remember the past are condemned to repeat it.” ― George Santayana.
Which industry stakeholders, in which locations, are you encouraging to get involved?
The data centre footprint is global, with heavy concentrations in select geographies, so while it is expedient to focus on the sector’s dense geographies, all areas are welcome to participate.
In reality, geography is not a factor in the types, frequency, and severity of data centre incidents. One of our challenges is convincing the full breadth of data centre stakeholders that they all have a stake in this:
- Operators: They do not know what they do not know, so let’s give them a reference book and show them industry trends and the solutions others have successfully implemented.
- Designers: Are their great designs working as designed? If not, they need the facts as to why.
- Contractors: They do not want future liabilities from incidents that tie back to their work quality. DCIRN can give them reference experience on the most common quality errors.
- Commissioning (Cx): Commissioning agents are always challenged when testing doesn’t go well. With DCIRN, they can now prep the build teams on the most common testing issues, thereby reducing the test schedule and cost.
- Support Vendors: The multitude of outside vendors can now prep their technicians on what to be cautious about when working at various sites and share first-hand accounts of the potential ramifications.
- Investors: Investors now have a guide on what to look for when evaluating various data centre opportunities. A conservative investor might prefer everything to line up according to industry best practices; however, an aggressive investor can identify opportunities where a small improvement in operational reliability can provide enhanced returns. The DCIRN database can help identify the pros and cons of an investment.
What has been the response so far? Have you found obtaining detailed incident information a challenge?
We are working in an industry whose culture struggles to be transparent about anything that appears to be a negative result. As more and more incidents and outages are reported in the press, the culture is beginning to change: management is starting to recognise that it needs to get out ahead of the press reports and, more importantly, the rumour mills. Data centre clients want prompt and accurate reports from data centre management, not from the evening news or a Twitter blast.
What is your message to those data centre operators who are reluctant to divulge detailed incident information?
If you do not report an incident at your site, then someone else will, and they are likely to get it wrong. Further, with all the cell phone recordings available today and clients getting instant reports from their own equipment, managers will continue to be challenged to stay ahead. The solution is to learn from the mistakes that others have already made and avoid repeating them.
This can only be done by studying past documented incidents, their causes, and their solutions.
Are there any recent examples that demonstrate the benefits of this sort of transparency?
Actually, there is an older but great example from the early days of the Colo subsection of the industry.
It went like this:
A Tier III Colo lost half its incoming power from the utility for several days. Market rumours were ablaze with how this young Colo would fail within a month.
What the rumour mill didn’t know is that the management was totally transparent. They set up client conference calls every 4 hours and provided extensive details of where they were with repairs and shifting client loads where they could.
When this event happened, the site was 50% occupied with few sales leads. Within 6 months of the incident, the site was approaching 100% occupancy. Why did this happen? It seems that the clients were so impressed with the management’s transparency and subsequent follow-up that they took additional space and talked up the Colo company to others.
The bottom line is that, in a pinch, clients just want the truth that they can rely on.
Aside from providing a “bible” of sorts for the next generation of data centre professionals, how else will this endeavour benefit the industry?
As I stated before, establishing this global database of data centre incidents has benefits for everyone involved in data centres. Once the database grows to several thousand incidents with regular additions, we can begin to analyse the data for trends such as the frequencies of certain types of incidents and their causes. Are there certain calendar dates or outside events that trigger incidents? Are designs living up to the promises of their Tier ratings? If not, why not?
What is DCIRN’s roadmap?
First is to create awareness of what we are doing, grow membership, and get industry buy-in.
Next is data collection. In order to be successful, we need to reach the point where we receive incident reports every day to build the database. The more data the database holds, the more significant any resulting analysis will be. Note: all data entered into the database is anonymised. We are not interested in the people, companies or sites where the incident happened, as they are just distractions from the facts of an incident.
How can people get involved today?
First, start contributing to the database. All reports are anonymous, and no incident is too small, as we may one day be able to identify certain small incidents as precursors to the full outages or interruptions that no one wants.
Next, become a member or sponsor of DCIRN so we can continue to improve the data collection processes, continuously run the analytics and perhaps one day collect data centre anomalies via artificial intelligence apps.
- Read more about DCIRN and find out how to contribute to their database: www.dcirn.org/mission