Do you have a bug from hell running in your data center?

Before working in data centers I worked on operating systems at Apple and Microsoft.  Creating software and working on data centers are kind of a yin and yang - soft and hard, white and black, etc.

In Chinese philosophy, the concept of yin-yang (simplified Chinese: 阴阳; traditional Chinese: 陰陽; pinyin: yīnyáng), which is often called "yin and yang", is used to describe how seemingly opposite or contrary forces are interconnected and interdependent in the natural world, and how they give rise to each other as they interrelate to one another. Many natural dualities (such as male and female, light and dark, high and low, hot and cold, water and fire, life and death, and so on) are thought of as physical manifestations of the yin-yang concept.

I never thought about it until now, but the hard part of IT is that it is yin and yang.  Software and hardware.  Bits and physical buildings.  Web services and real physical infrastructure.  A SW engineer and a mechanical engineer.

Getting everything to work just right in a data center can be frustrating, as things sometimes don't work exactly the way they are supposed to.  In all that physical infrastructure there are software bugs from hell.  Bugs so nasty and nerve-wracking they will make you want to pull your hair out.  Some of the nastiest of these bugs exist at the transition between light and dark, like yin and yang.  Here is a description of bugs from hell.

A BugFromHell is any bug where several hours or more of time is spent by a veteran developer attempting to track down (and fix) the cause of a software bug. By definition, any bug that takes this long to find is almost always the result of a side-effect of the problematic code (otherwise, the problem would be readily visible via typical debug tools--e.g., stack trace, stepping through code in debug mode, etc). A BugFromHell is very elusive and typically cannot be isolated or consistently reproduced.

  • Hours? Nah, a true BFH is one that takes weeks to find. (Especially in embedded systems work, when "it's a hardware problem" is always a possibility).

The effects of a BugFromHell typically appear anywhere except near the problematic code. Such a bug will write to a random part of memory, flip bits in a way that isn't detected for a long period of running time, or appear to happen randomly without having been triggered by anything; or, worse, appear to be affected by the act of observing it (a HeisenBug).

In the examples the author gives, you can see that many of these bugs from hell exist at the interface between software and hardware.

Examples:

  • overwriting part of the stack frame
  • writing to a memory location that has been moved or deleted (and is now occupied by a different object)
  • using an uninitialized variable that ultimately leads to writing to a random memory location
  • an unforeseen interaction between two threads or processes that only has a very small chance of occurring
  • thread interaction that won't happen running on a single CPU box, but which manifests on multiple CPUs
  • assumptions made by developers of one web browser that aren't made by any other. (You'll always have a <title> tag when setting the charset.)
  • Hardware drivers that aren't sufficiently paranoid / robust.
  • JMPing into an unprotected NULL, or into some other executable gibberish.
  • returning from a function with an unbalanced stack (primarily when embedding assembly code, for embedded systems).
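
To make the thread interaction examples in the list above a bit more concrete, here is a minimal sketch in Python. It is my own illustration, not code from the quoted description, and it does not use real threads: two generators stand in for threads so the unlucky interleaving can be replayed on demand. The names `increment`, `run`, and `schedule` are hypothetical.

```python
def increment(shared):
    """One simulated thread: read the counter, pause, then write back counter + 1."""
    seen = shared["count"]      # read the shared value
    yield                       # a context switch could happen right here
    shared["count"] = seen + 1  # write, based on a possibly stale read


def run(schedule):
    """Run two increment tasks, stepping them in the order given by schedule."""
    shared = {"count": 0}
    tasks = [increment(shared), increment(shared)]
    for task_id in schedule:
        next(tasks[task_id], None)  # advance that task by one step
    return shared["count"]


lucky = run([0, 0, 1, 1])    # task 0 finishes before task 1 starts
unlucky = run([0, 1, 0, 1])  # both tasks read 0 before either one writes
print(lucky, unlucky)
```

With the first schedule both increments land. With the second, one increment is silently lost and nothing crashes or complains. In a real system the scheduler picks the interleaving, so the bad one might show up once in a million runs, and quite possibly never while you are watching in a debugger.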

Bugs from hell are running in every data center and are so frustrating.  

Why did I write this post?  Because my SW dev lead has spent three weeks working full time to fix a bug from hell.  Ouch.  Three weeks of productivity unexpectedly sapped by a bug so nasty it was elusive yet extremely damaging.