
Study: data centre switches running SONiC OS twice as reliable

Written by Tue 18 May 2021

A three-month study of 180,000 data centre switches found that those running SONiC OS were twice as reliable as those running alternative operating systems.

Customer expectations for data centre reliability are high, and tolerance for costly downtime is extremely low. In this environment, it is critical for data centre operators to understand the reasons for any performance interruptions so that the risk of downtime can be mitigated.

To this end, Microsoft conducted a widespread study of data centre switch performance and released the results in a peer-reviewed paper. The researchers looked at 180,000 switches across 130 locations over a three-month period. They polled every switch at six-hour intervals and determined the reason for any reboots that had occurred since the previous poll.
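The article does not say how the researchers detected reboots from their polls. As a rough illustration only, the sketch below assumes each poll returns the switch's reported uptime and flags a reboot whenever the implied boot time moves forward between two consecutive polls. The `Poll` record, its fields, and the tolerance value are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass
from typing import List

POLL_INTERVAL_S = 6 * 60 * 60  # six-hour polling interval, as described in the study


@dataclass
class Poll:
    """One poll result. Both fields are hypothetical, for illustration only."""
    timestamp_s: int  # when the poll was taken, in seconds
    uptime_s: int     # uptime the switch reported at that moment, in seconds


def rebooted_between(prev: Poll, curr: Poll, tolerance_s: int = 60) -> bool:
    """Return True if the switch restarted at least once between two polls.

    Without a reboot, uptime grows by the elapsed time, so the implied boot
    time (timestamp minus uptime) stays the same. A later boot time means
    the switch restarted. The tolerance absorbs small clock jitter.
    """
    prev_boot = prev.timestamp_s - prev.uptime_s
    curr_boot = curr.timestamp_s - curr.uptime_s
    return curr_boot > prev_boot + tolerance_s


def count_reboot_windows(polls: List[Poll]) -> int:
    """Count polling windows in which at least one reboot occurred."""
    return sum(rebooted_between(a, b) for a, b in zip(polls, polls[1:]))


if __name__ == "__main__":
    history = [
        Poll(0, 1_000),
        Poll(POLL_INTERVAL_S, 1_000 + POLL_INTERVAL_S),  # no reboot
        Poll(2 * POLL_INTERVAL_S, 600),                  # restarted 10 minutes ago
        Poll(3 * POLL_INTERVAL_S, 600 + POLL_INTERVAL_S),
    ]
    print(count_reboot_windows(history))
```

One limitation of any approach like this: several reboots inside the same six-hour window look like a single reboot, so the count is a lower bound.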

The researchers found that there is a 2% chance that a given switch will fail in three months. The main reasons for switch failure are:

  • Hardware (32%)
  • Unplanned power loss (28%)
  • Software (17%)
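To put those percentages in perspective, here is a back-of-the-envelope calculation. It assumes, purely for illustration, that the 2% figure applies uniformly across all 180,000 switches and that each failure has a single cause. The resulting counts are not reported in the article and should be read as rough orders of magnitude.

```python
FLEET_SIZE = 180_000      # switches in the study
FAILURE_RATE_PCT = 2      # chance a given switch fails in three months

# Cause shares quoted in the article, in percent of all failures.
CAUSE_SHARE_PCT = {
    "hardware": 32,
    "unplanned power loss": 28,
    "software": 17,
}


def expected_failures(fleet_size: int, rate_pct: int) -> int:
    """Expected number of failed switches under a uniform failure rate."""
    return fleet_size * rate_pct // 100


def breakdown(total_failures: int, shares_pct: dict) -> dict:
    """Split a failure count by cause; whatever is left over goes to 'other'."""
    counts = {cause: total_failures * pct // 100 for cause, pct in shares_pct.items()}
    counts["other"] = total_failures - sum(counts.values())
    return counts


if __name__ == "__main__":
    total = expected_failures(FLEET_SIZE, FAILURE_RATE_PCT)
    print(f"expected failures in three months: {total}")
    for cause, count in breakdown(total, CAUSE_SHARE_PCT).items():
        print(f"  {cause}: {count}")
```

The three listed causes account for 77% of failures, so under these assumptions roughly a quarter of failures fall into categories the article does not name.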

According to the researchers, reboots rarely affected end users, thanks to a robust failure mitigation process and the redundancies built into the data centre environment. However, a subset of switch failures was referred for manual analysis because they caused a “significant disruption in services, making them a higher priority than others.” The manual analysis of this subset focused on determining root cause, with an eye to preventing disruption in the future.

Looking at the hardware in use, the researchers found that switches from one of the three main vendors were significantly more likely to fail than those from the other two. In the paper, the researchers anonymized the data and referred to the vendors only as Vendor 1, 2, and 3. However, they found that nearly 70% of Vendor 3’s failures were hardware issues. That is unfortunate for Vendor 3, as “hardware issues are harder to resolve, since software upgrades and fast reloads do not fix the problem.”

With these results, Vendor 3 appears to be the clear loser as far as switch hardware reliability is concerned.

On the software side, however, there was a clear winner. Researchers found that switches running open-source operating system SONiC were significantly more reliable than switches running a different OS. This held true for Vendor 1 hardware using SONiC OS, and for Vendor 2 as well.

Even more significant, the gap in reliability between switch operating systems widens over time, so that at the end of a three-month period the likelihood of survival of SONiC switches is 2% (as opposed to 1% for non-SONiC switches).

There’s an additional factor making SONiC OS more attractive to data centre operators: it is open-source. Vendor software patches can take far longer to roll out, as there is a smaller audience invested in identifying vulnerabilities and creating solutions. This leads to recurring issues on devices that are waiting for fixes from the vendor. SONiC failures, by contrast, “are root-caused and fixed over short timescales due to in-house expertise and development teams.”
