Single Points of Failure

Processes and systems are often streamlined in an attempt to improve efficiency, increase productivity and reduce expenses. This can also help reduce mistakes by eliminating the variety of options from which to choose. On the other hand, there's the little issue of single points of failure. If there is only one path and only one option, when something happens to disrupt the path, the endpoint cannot be reached. It's the age-old problem of putting all one's eggs in a single basket. One trip and the whole shebang is broken. Here are some real-life examples:

Google Mistakenly Labels the Entire Internet as Malware
Earlier today, all Google searches returned results which were flagged with the notice "This site may harm your computer." Clicking on a link to a search result didn't take you to the result link, but directed you to a warning page telling you to go back to the original search page and choose a safe link. The warning page did provide a link to an explanation page, but that link only returned 502 errors. In order to actually pursue any of the search results, you had to copy/paste the URIs into a browser window. This mess was later cleared up, but it's a fine example of a single point of failure in action.

Google later explained that it is trying to make the Internet a safer place for everybody by having another company vet all the search results so as to prevent Google search users from being exposed to malware. I'm really not ok with that system. I don't need someone else deciding for me what I should have access to. Sounds like a close cousin to censorship to me. I would imagine a very similar system of pre-approving search results is what has been used in China to keep residents from easily finding things which would be embarrassing to the Chinese government.

Storms Cause Phone Outages for Thousands of "877" and "866" Phone Numbers
Cynergy apparently doesn't have redundant systems for one of its switching centers. Not a big deal if you don't live in the area? Maybe, maybe not. It all depends upon where your call goes when you dial. The outage is also affecting credit card processing and impacting the ability of customers to make credit card and debit card purchases. I understand individual customers not having phone service due to local transmission lines being damaged. I don't understand large-scale infrastructure failures. I would have assumed that there would be a second switching center somewhere else that would have been able to take the load and things would proceed as usual. Apparently this is not the case.

Now, not all systems require redundancy. It would be impractical to have two electrical lines from two separate power grids run to my house or multiple engines in my car. However, a large facility like a data center or a hospital might be on two power grids and also have automated backup generators. When the consequences of failure are high, redundancy is even more important. Yes, complete and utter disaster is rare under normal operating circumstances, but sometimes normal circumstances vanish. The engineers were absolutely certain that the Titanic was unsinkable so no plan was made for the unthinkable situation of having her sink. There were lifeboats and life preservers on board, but too few for the number of passengers and crew she carried. Even today we have Titanic Syndrome. Data backup systems can be automated to prevent data loss, but often aren't set up. Multiple staff can be trained to perform various duties in case someone becomes ill, but often people specialize and cross-training is viewed as a nuisance. For a while, continuity of operations plans were a big buzzword, but now they're falling by the wayside. Sooner or later that will bite organizations and corporations in the behind.

Compared to cleaning up after something has gone spectacularly badly, preparing for the rare possibility of a disaster is cheap and easy. Think critically about your system or organization and identify those points that are most subject to failure. Figure out how you can devise backups to those critical points and implement them before you need them. Also, test the backup systems periodically to make sure they can withstand the load they may receive. Don't just write down what you'll do on paper, then never try it. I also recommend periodically reviewing your plan and verifying that the proposed backup plans are actually still viable--sometimes changes in circumstances and personnel render a plan useless. It would be best to figure that out before the plan is put to the test.
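For anyone who wants to make the "identify those points" step a bit more concrete, here is a rough sketch in Python. It assumes you can write your system down as a simple map of which piece hands off to which, and it reports every piece whose loss would cut the start off from the end. The node names (caller, switch_a, switch_b, processor, bank) are made up for illustration and are not a description of any real company's network.

```python
from collections import deque


def reachable(links, start, goal, removed=None):
    """Breadth-first search from start to goal, pretending `removed` is down."""
    if removed in (start, goal):
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbor in links.get(node, ()):
            if neighbor != removed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


def single_points_of_failure(links, start, goal):
    """Return the intermediate nodes whose loss makes goal unreachable from start."""
    nodes = set(links) | {n for targets in links.values() for n in targets}
    candidates = nodes - {start, goal}
    return sorted(n for n in candidates
                  if not reachable(links, start, goal, removed=n))


# Hypothetical layout: one switching center, one card processor.
one_path = {
    "caller": ["switch_a"],
    "switch_a": ["processor"],
    "processor": ["bank"],
}

# Same layout with a second (hypothetical) switching center added.
with_backup = {
    "caller": ["switch_a", "switch_b"],
    "switch_a": ["processor"],
    "switch_b": ["processor"],
    "processor": ["bank"],
}

print(single_points_of_failure(one_path, "caller", "bank"))
print(single_points_of_failure(with_backup, "caller", "bank"))
```

In the first layout both the switch and the processor are flagged. Adding the second switch clears the switches but still leaves the processor as the one basket holding all the eggs. A real system has far more moving parts than this, of course, so treat it as a thinking aid, not an audit.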

