The Stack Archive

New analysis engine puts syslogs to fore in data centre management

Wed 6 Jul 2016

Researchers from Duke University in North Carolina have teamed up with Yahoo to develop a diagnosis framework that exploits the hard-to-parse syslog output from network devices in data centre environments.

The system, dubbed Log-Prophet, taps an analytical resource that developers have traditionally overlooked because syslogs are so opaque:

‘Surprisingly, network device syslogs, which have proved invaluable for diagnosing problems in ISP networks, have remained largely ignored in the data center space. Most notably, works focused on using syslogs to quantify and characterize physical failures. Instead of syslogs, others have used packet traces, SNMP, and ICMP to detect and diagnose problems; however, while highly powerful in detecting problems, these techniques often provide minimal aide in problem resolution.’

Syslogs are debugging aids for network operators trying to determine the cause of network anomalies. They were not designed to feed a more comprehensive, system-wide analysis framework. A busy data centre can generate millions of syslog messages each day. The messages are not standardised across devices and resist templating, not only because event time-stamps are hard to establish, but because meaningful analysis would in many cases require access to the closed-source, proprietary software running on the device.
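To make the templating problem concrete, the Python sketch below shows the most naive approach. It masks variable fields (IP addresses, port numbers, hex values, integers) so that messages describing the same kind of event collapse onto one template. The masking rules, function names and sample messages are hypothetical and are not taken from Log-Prophet.

```python
import re
from collections import Counter

# Hypothetical masking rules, applied in order: IPs before port numbers,
# and hex values before plain integers, so the more specific patterns win.
MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\d+(?:/\d+)+"), "<PORT>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def to_template(message: str) -> str:
    """Replace variable fields in one syslog message body with placeholders."""
    for pattern, placeholder in MASKS:
        message = pattern.sub(placeholder, message)
    return message


def count_templates(messages):
    """Count how many raw messages collapse onto each template."""
    return Counter(to_template(m) for m in messages)


# Invented sample messages, not real device output.
sample = [
    "Interface Ethernet1/1 changed state to down",
    "Interface Ethernet1/2 changed state to down",
    "BGP neighbor 10.0.0.1 state changed to Idle",
    "BGP neighbor 10.0.0.2 state changed to Idle",
]

for template, n in count_templates(sample).items():
    print(n, template)
```

Here four raw messages collapse onto two templates. Hand-written rules like these break down quickly on heterogeneous, vendor-specific output, which is exactly the difficulty described above.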

The researchers therefore adopted inference and correlation techniques more commonly used by social scientists, including Quasi-Experimental Design (QED). QED establishes causal relationships between events derived from syslogs without requiring access to device source code in order to impose precise time-stamps.

‘For example, using QED we are able to affirm a causal relationship between a link failure (treatment) and a routing protocol outage (outcome), conditioning for confounding factors such as device config changes, power, firmware, and operating system.’
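The quasi-experimental idea in the quoted example can be illustrated with exact matching, one of the simplest QED designs. It compares windows in which a link failed with otherwise similar windows in which it did not, holding the listed confounders fixed. This is a hypothetical Python illustration: the article does not describe how Log-Prophet actually implements QED, and the `Window` fields, the `matched_effect` function and the sample data are assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """A hypothetical observation window on one device."""
    link_failure: bool    # treatment
    routing_outage: bool  # outcome
    config_change: bool   # confounders follow
    power_event: bool
    firmware: str
    os_version: str

    def confounders(self):
        return (self.config_change, self.power_event, self.firmware, self.os_version)


def matched_effect(windows):
    """Estimate the effect of treatment on the treated by exact matching.

    Windows are grouped by identical confounder values. Within each group the
    outcome rate of untreated windows is subtracted from that of treated ones.
    Groups lacking either treated or untreated windows are discarded. Returns
    the per-group difference weighted by treated count, or None if no group
    can be matched.
    """
    groups = defaultdict(lambda: {True: [], False: []})
    for w in windows:
        groups[w.confounders()][w.link_failure].append(w.routing_outage)

    weighted_sum, n_treated = 0.0, 0
    for g in groups.values():
        treated, control = g[True], g[False]
        if treated and control:
            diff = sum(treated) / len(treated) - sum(control) / len(control)
            weighted_sum += diff * len(treated)
            n_treated += len(treated)
    return weighted_sum / n_treated if n_treated else None


# Invented sample data: (link_failure, routing_outage, config_change,
# power_event, firmware, os_version).
sample = [
    Window(True, True, False, False, "fw1", "os1"),
    Window(True, True, False, False, "fw1", "os1"),
    Window(False, False, False, False, "fw1", "os1"),
    Window(False, False, False, False, "fw1", "os1"),
    Window(False, False, False, False, "fw1", "os1"),
    Window(True, True, True, False, "fw1", "os1"),
    Window(False, True, True, False, "fw1", "os1"),
    Window(True, False, False, True, "fw2", "os1"),  # no untreated match; dropped
]

print(matched_effect(sample))
```

In the sample, one stratum shows outages in every treated window and none in the controls, another shows no difference, and a third has no untreated match and is discarded, giving an estimate of about 0.67. A positive estimate only supports a causal reading if every relevant confounder has been recorded, which is why the quoted passage names them explicitly.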

Diagram of Log-Prophet’s Workflow

The research team tested Log-Prophet on seven months’ worth of data from one of Yahoo’s data centres, and was able to distil millions of verbose, unstructured messages into as few as 13 Problem Graphs (PGs). In effect, this imposed templates retroactively on data intended for limited, strictly proprietary analysis, without agreement or cooperation with equivalent or similar systems.

The team validated Log-Prophet’s analysis model by interviewing data centre operators and presenting them with the boiled-down results of six weeks’ worth of log analysis for the centre. Those interviewed agreed that Log-Prophet had ‘succinctly captured events within a device’, despite being hampered by the lack of precise time-stamps (and therefore the inability to reveal ‘happens-before’ relationships).

