Why wait until meltdown, finger-pointing, and stale pizza take over your world as a system manager? You can plan what you're going to do in advance and make your life a lot easier. There are four useful building blocks that can help you to construct a successful incident management plan:
- Configuration database and audit
- External experts
- Triage (quick incident isolation)
- Dye tracing
Using these building blocks and a skilled crisis manager, you should be able to create a more smoothly-running system that handles both routine incidents and complex problems with less stress.
Configuration database and audit
A single, unified configuration management database (CMDB), completely up to date, is extremely valuable. We can ask it for the changes made in important components just before failure occurred; we can ask it what will happen if we take a particular element down for maintenance.
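As a rough illustration of those two questions, here is a minimal Python sketch. It assumes the CMDB can be reduced to a dependency map and a change log; the component names, the data, and the functions `impacted_by` and `changes_before` are all hypothetical and do not reflect any particular CMDB product.

```python
from collections import deque
from datetime import datetime, timedelta

# Hypothetical CMDB contents: each component and the components it depends on.
DEPENDS_ON = {
    "order-app": ["app-server-1", "orders-db"],
    "app-server-1": ["core-switch"],
    "orders-db": ["san-1", "core-switch"],
}

# Hypothetical change log: (timestamp, component, description).
CHANGES = [
    (datetime(2024, 1, 10, 9, 0), "core-switch", "firmware upgrade"),
    (datetime(2024, 1, 10, 13, 30), "orders-db", "index rebuilt"),
    (datetime(2024, 1, 3, 8, 0), "san-1", "disk replaced"),
]

def impacted_by(component):
    """Return every component that directly or indirectly depends on `component`."""
    reverse = {}
    for item, deps in DEPENDS_ON.items():
        for dep in deps:
            reverse.setdefault(dep, set()).add(item)
    seen, queue = set(), deque([component])
    while queue:  # breadth-first walk up the dependency graph
        for parent in reverse.get(queue.popleft(), ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

def changes_before(failure_time, window):
    """Return changes recorded in the `window` leading up to `failure_time`, oldest first."""
    start = failure_time - window
    return sorted(c for c in CHANGES if start <= c[0] <= failure_time)
```

With this sample data, `impacted_by("core-switch")` names everything that would be affected by taking the switch down, and `changes_before(datetime(2024, 1, 10, 14, 0), timedelta(hours=24))` lists the two changes made in the day before a failure at 14:00.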
In the real world, unfortunately, there almost certainly isn't a CMDB. Therefore, the system manager should start the organisation on the path to CMDB implementation -- which, by the way, will almost certainly require severe penalties for any staff who make changes without recording them in the CMDB.
With or without the CMDB, all-encompassing and daily audits are needed.
All major applications should be reviewed regularly by the all-encompassing audit to detect their dependencies and bring documentation up to date. If some mission-critical application depends on a cheap PC stored under someone's desk, or is running only because operations has written a Perl script to nudge it along, it's better to discover that before there's a crisis.
Then, at least once a day, a quick audit of the network and system configuration should be made to detect any changes or out-of-specification configurations or measurements. For example, the NetMRI tool from Netcordia can find and document changes in network topology and in the configuration of network elements. It can also run scripts to discover incompatible configuration changes. Similar tools are available for servers and other system components. The key idea is to find configuration changes before they interact with subsequent failures to create a major meltdown. Even if their impact is not recognised immediately, documenting the change will help troubleshooters when a problem occurs.
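The daily audit can be pictured as a comparison of two configuration snapshots plus a few consistency rules. The Python sketch below is only an illustration under that assumption: the snapshot layout, the device names, and the functions `audit` and `duplex_mismatches` are hypothetical, and commercial tools work from live device data rather than hand-built dictionaries.

```python
def audit(baseline, current):
    """Compare two {device: {setting: value}} snapshots and list the differences."""
    findings = []
    for device in sorted(set(baseline) | set(current)):
        if device not in baseline:
            findings.append((device, "device added", None, None))
            continue
        if device not in current:
            findings.append((device, "device removed", None, None))
            continue
        old, new = baseline[device], current[device]
        for key in sorted(set(old) | set(new)):
            if old.get(key) != new.get(key):
                findings.append((device, key, old.get(key), new.get(key)))
    return findings

def duplex_mismatches(snapshot, links):
    """Flag links whose two ends are configured with different duplex settings."""
    return [(a, b) for a, b in links
            if snapshot[a].get("duplex") != snapshot[b].get("duplex")]

# Hypothetical snapshots taken on consecutive days.
yesterday = {"sw1": {"duplex": "full", "vlan": "10"}, "rtr1": {"duplex": "full"}}
today = {"sw1": {"duplex": "half", "vlan": "10"}, "rtr1": {"duplex": "full"},
         "sw2": {"duplex": "full"}}

for finding in audit(yesterday, today):
    print(finding)
print(duplex_mismatches(today, [("sw1", "rtr1")]))
```

The first function documents what changed; the second is an example of a rule that catches a change that is incompatible with its neighbour even though each device looks valid on its own.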
External experts
Few organisations have the expertise in-house to handle extremely complex problems. When those appear, external diagnostic experts who understand many different system components in depth and who are experienced with many different meltdown situations are invaluable.
To find these experts, ask within your local community and ask major vendors who they use when they have a multi-vendor problem. Bring in the experts to work with your staff while setting up your plans for system instrumentation and for triage, then put them on retainer for the next problem that you can't solve without outside help.
Triage (quick incident isolation)
There are three goals for triage:
- Recognise well-understood incidents (and the quick fix for them!)
- Credibly isolate subsystem problems to the responsible organisation
- Recognise complex, major-crisis problems quickly
The operations staff should be trained to recognise and fix commonly-occurring incidents quickly. Even if they don't recognise an incident, it is often possible to isolate it to a particular subsystem if appropriate metrics are available. For example, synthetic transaction agents can repeatedly test the DNS, all access paths to the Internet, the availability and performance of critical VLANs, the response time of major database queries, and many other functions. Overhead is minimal, but these metrics can be extremely valuable in decreasing finger-pointing and in helping the operations staff quickly isolate an incident and convince the responsible group to take responsibility without needing to spend a lot of time arguing. If it's too difficult to create these metrics on your system, maybe your system needs to be redesigned. (Otherwise, it will be even more difficult to isolate a problem when you're in the middle of a meltdown and no one has slept for two days!)
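A synthetic transaction agent can be very small. The following Python sketch, using only the standard library, times a probe and compares it with a baseline; the function names, the tolerance factor, and the host names in the usage note are assumptions made for illustration, not part of any vendor's tool.

```python
import socket
import time

def dns_probe(hostname):
    """Resolve a hostname; raises OSError (socket.gaierror) on failure."""
    socket.getaddrinfo(hostname, None)

def tcp_probe(host, port, timeout=3.0):
    """Open and close a TCP connection; raises OSError on failure or timeout."""
    with socket.create_connection((host, port), timeout=timeout):
        pass

def run_check(name, probe, baseline_seconds, tolerance=2.0, clock=time.perf_counter):
    """Run one probe and classify it as OK, SLOW (over tolerance x baseline), or FAILED."""
    start = clock()
    try:
        probe()
    except OSError as exc:
        return {"check": name, "status": "FAILED", "detail": str(exc)}
    elapsed = clock() - start
    status = "OK" if elapsed <= baseline_seconds * tolerance else "SLOW"
    return {"check": name, "status": status, "seconds": elapsed}
```

A scheduler could then call, for example, `run_check("dns", lambda: dns_probe("intranet.example.com"), baseline_seconds=0.05)` every minute and record the results. Because each check is tied to one subsystem, a FAILED or SLOW result points at the group that owns it.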
If the operations staff doesn't recognise a problem as being one of the commonly-occurring incidents, and if it can't be quickly isolated to a particular subsystem, then it's time to realise that a major crisis may be brewing. The staff should alert the crisis manager, notify all concerned programmers and diagnostics experts that they may be called in soon, and start any special trace or measurement facilities that have been designed for these troublesome situations.
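The three triage outcomes can be written down as a simple decision procedure. This Python sketch is one possible reading of the process described above; the incident catalogue, the check names, and the owning groups are all hypothetical.

```python
# Hypothetical catalogue of well-understood incidents and their quick fixes.
KNOWN_INCIDENTS = {
    "print spooler queue full": "restart the spooler service",
    "web tier session table full": "recycle the web server pool",
}

# Hypothetical mapping from synthetic check to the group that owns that subsystem.
CHECK_OWNER = {
    "dns": "network team",
    "internet-path": "network team",
    "orders-db-query": "database team",
}

def triage(symptom, failed_checks):
    """Return (category, action) for an incident."""
    # 1. A recognised incident gets its documented quick fix.
    if symptom in KNOWN_INCIDENTS:
        return ("known incident", KNOWN_INCIDENTS[symptom])
    # 2. If every failing check belongs to one group, the problem is isolated.
    owners = {CHECK_OWNER.get(check) for check in failed_checks}
    if len(owners) == 1 and None not in owners:
        return ("isolated", "hand over to " + owners.pop())
    # 3. Otherwise, treat it as a possible major crisis.
    return ("possible major crisis",
            "alert crisis manager; warn programmers and experts; start special traces")
```

Note that an unrecognised symptom with no failing checks at all also falls through to the third case: if the metrics cannot point anywhere, that is itself a reason to escalate.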
Triage plans and their associated metrics and baselines should be designed by the same groups who are asked to respond when there's a meltdown: the operations staff, network designers, server and middleware staff, developers, security staff, system architects, and any external experts who might be called in.
The tools for taking measurements are readily available from many vendors, and many tools are embedded in devices and systems that enterprises already own. For example, Cisco routers include Cisco's IP SLA measurement tools, and F5 load balancers contain extensive test tools to evaluate server performance. There are also worldwide services, such as those from Gomez and Keynote, to test systems that provide services accessible from the public Web.
Within the enterprise's systems, it's important that test tools are familiar, that all passwords are known, and that the tools are easily connected to appropriate test points. It's a real time-waster when a test "spanning port" is difficult to configure or, worse, already has a cable plugged into it. (What happens if that cable is disconnected to allow attachment of a test device? What other thing will fail?) Vendors such as Apcon provide complex patching switches for testing; simpler multi-port taps are also available.
Be sure that applications log all errors they encounter, and that the log messages contain accurate time stamps and transaction identification information. Tools such as Splunk can be used to scan logs quickly to find possible indications of the reason for system failure. Poor handling of intermittent connection failure is a very common cause of system failure (for example, the system may freeze with one end thinking that a connection is up while the other end thinks it is down); how will that situation be detected and reset?
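As a sketch of what such log messages might look like, the Python example below uses the standard `logging` module to stamp every line with a UTC time and a transaction ID, then scans the output for errors grouped by transaction. The logger name, the `txn=` field, and the helper functions are assumptions chosen for the example.

```python
import io
import logging
import re
import time

def make_logger(stream):
    """Build a logger whose lines carry a UTC timestamp and a transaction ID."""
    logger = logging.getLogger("orders-demo")
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(logging.INFO)
    formatter = logging.Formatter(
        "%(asctime)s.%(msecs)03dZ %(levelname)s txn=%(txn)s %(message)s",
        datefmt="%Y-%m-%dT%H:%M:%S")
    formatter.converter = time.gmtime  # log in UTC so servers can be compared
    handler = logging.StreamHandler(stream)
    handler.setFormatter(formatter)
    logger.addHandler(handler)
    return logger

def for_transaction(logger, txn_id):
    """Return an adapter that adds the transaction ID to every message."""
    return logging.LoggerAdapter(logger, {"txn": txn_id})

LINE = re.compile(r"^(?P<ts>\S+) (?P<level>\w+) txn=(?P<txn>\S+) (?P<msg>.*)$")

def errors_by_transaction(text):
    """Group ERROR lines by transaction ID as {txn: [(timestamp, message), ...]}."""
    found = {}
    for line in text.splitlines():
        m = LINE.match(line)
        if m and m["level"] == "ERROR":
            found.setdefault(m["txn"], []).append((m["ts"], m["msg"]))
    return found

stream = io.StringIO()
log = make_logger(stream)
for_transaction(log, "T-1001").error("connection reset by peer")
for_transaction(log, "T-1002").info("order accepted")
print(errors_by_transaction(stream.getvalue()))
```

The scan here is deliberately simple; a log search tool does the same job at scale, but only if the timestamps and transaction IDs are in the messages to begin with.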
Dye tracing
Because it's almost impossible to reproduce a modern multi-server, network-based production system in a test lab, and because it's usually the subtleties of the production environment that are causing a complex incident, system managers must expect that programmers will appear in their operations center. Those programmers will be trying to trace an individual transaction through a maze of equipment and network paths. (In effect, the network is the backplane of a huge multi-server computer system.) Presenting them with simple protocol traces isn't an efficient solution.
Dye tracing, which can follow an individual transaction through multiple servers, is often worth the cost of installation and the additional overhead, because it gives programmers a familiar diagnostic environment. They can watch each program call, its timings, and its parameters, as if the entire process were contained within a single server. Examples of these tools are CA's Introscope Transaction Tracer, HP's Transaction Analyzer, Quest Software's PerformaSure, IBM Tivoli Composite Application Manager, and Symphoniq's TrueVue.
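The underlying idea can be shown in a few lines of Python. This is a toy model only, not how any of the products above are implemented: a transaction is tagged with a "dye" ID, the ID travels with each call from server to server, and every instrumented function records its name, parameters, and timing against that ID. The header name `X-Dye-Id`, the `TRACE` list, and the two sample functions are hypothetical.

```python
import contextvars
import functools
import time

_dye = contextvars.ContextVar("dye", default=None)
TRACE = []  # stand-in for a central trace collector

def traced(server):
    """Record name, arguments, and duration of a call, but only for dyed transactions."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            dye = _dye.get()
            if dye is None:  # untagged transaction: skip the tracing overhead
                return func(*args, **kwargs)
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                TRACE.append({"dye": dye, "server": server, "call": func.__name__,
                              "args": args, "seconds": time.perf_counter() - start})
        return wrapper
    return decorate

def receive(headers, handler, *args):
    """Simulate a server receiving a request: restore the dye from the headers."""
    token = _dye.set(headers.get("X-Dye-Id"))
    try:
        return handler(*args)
    finally:
        _dye.reset(token)

@traced("db-server")
def query_stock(item):
    return 3  # pretend database lookup

@traced("app-server")
def place_order(item):
    # The outgoing "network call" carries the dye in a header.
    stock = receive({"X-Dye-Id": _dye.get()}, query_stock, item)
    return stock > 0

receive({"X-Dye-Id": "dye-42"}, place_order, "widget")
for record in TRACE:
    print(record["dye"], record["server"], record["call"], record["args"])
```

Only the dyed transaction is recorded, which is what keeps the overhead tolerable in production: ordinary traffic passes through the same code without being traced.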
Simple metrics and procedures can detect and categorise ("triage") many performance incidents to assist in rapid service restoration, while complex incidents can be handled by specialised tools, such as dye-tracers, and pre-planned management strategies under the supervision of a trained crisis manager. In some complex cases, the assistance of external experts who are already familiar with the system is invaluable. These experts should help design the triage metrics and the diagnostic tools -- then be placed on retainer to await problems that require their aid.