Mike Manos has a post pointing out what he calls “data center junk science” and the data center thermal shock requirement.
Mike’s post got my curiosity up, and I spent time researching to build on it. This is my 956th post in less than 2 years, and people often think I have a journalism background. Well, fooled you: I am an Industrial Engineering and Operations Research graduate from Cal Berkeley. So, even though I write a lot, you are reading my notebook of stuff that I discover and want to share with others. For those of you who don’t know what industrial engineers do:
Industrial engineering is a branch of engineering that concerns with the development, improvement, implementation and evaluation of integrated systems of people, money, knowledge, information, equipment, energy, material and process. It also deals with designing new prototypes to help save money and make the prototype better. Industrial engineering draws upon the principles and methods of engineering analysis and synthesis, as well as mathematical, physical and social sciences together with the principles and methods of engineering analysis and design to specify, predict and evaluate the results to be obtained from such systems. In lean manufacturing systems, Industrial engineers work to eliminate wastes of time, money, materials, energy, and other resources.
This background all helps me think of how to green the data center.
And Operations Research helps me think about the technical methods and software to do this.
Operations research is an interdisciplinary branch of applied mathematics that uses methods such as mathematical modeling, statistics, and algorithms to arrive at optimal or near optimal solutions to complex problems. It is typically concerned with determining the maxima (of profit, assembly line performance, crop yield, bandwidth, etc.) or minima (of loss, risk, etc.) of some objective function. Operations research helps management achieve its goals using scientific methods.
Mike’s post got me thinking because one of my summer internships was at HP, where I worked as a reliability/quality engineer figuring out how to build better quality HP products. The team I worked in was an early innovator in thermal cycling and stressing components back in the early 1980s.
Data Center Junk Science: Thermal Shock \ Cooling Shock
October 1, 2009 by mmanos
I recently performed an interesting exercise where I reviewed typical co-location/hosting/ data center contracts from a variety of firms around the world. If you ever have a few long plane rides to take and would like an incredible amount of boring legalese documents to review, I still wouldn’t recommend it.
I did learn quite a bit from going through the exercise but there was one condition that I came across more than a few times. It is one of those things that I put into my personal category of Data Center Junk Science. I have a bunch of these things filed away in my brain, but this one is something that not only raises my stupidity meter from a technological perspective it makes me wonder if those that require it have masochistic tendencies.
I am of course referring to a clause for Data Center Thermal Shock and as I discovered its evil, lesser known counterpart “Cooling” Shock. For those of you who have not encountered this before its a provision between hosting customer and hosting provider (most often required by the customer) that usually looks something like this:
If the ambient temperature in the data center raises 3 degrees over the course of 10 (sometimes 12, sometimes 15) minutes, the hosting provider will need to remunerate (reimburse) the customer for thermal shock damages experienced by the computer and electronics equipment. The damages range from flat fees penalties to graduated penalties based on the value of the equipment.
Mike also raises the issue of duration.
Which brings up the next component which is duration. Whether you are speaking to 10 minutes or 15 minutes intervals these are nice long leisurely periods of time which could hardly cause a “Shock” to equipment. Also keep in mind the previous point which is the environment has not even violated the ASHRAE temperature range. In addition, I would encourage people to actually read the allowed and tested temperatures in which the manufacturers recommend for server operation. A 3-5 degree swing in temperature would rarely push a server into an operating temperature range that would violate the range the server has been rated to work in or worse — void the warranty.
Here is the military specification typically used by vendors, MIL-STD-810G, to define temperature/thermal shock.
MIL-STD-810G
METHOD 503.5
TEMPERATURE SHOCK
1. SCOPE.
1.1 Purpose.
Use the temperature shock test to determine if materiel can withstand sudden changes in the temperature of the surrounding atmosphere without experiencing physical damage or deterioration in performance. For the purpose of this document, "sudden changes" is defined as "an air temperature change greater than 10°C (18°F) within one minute."
1.2 Application.
1.2.1 Normal environment.
Use this method when the requirements documents specify the materiel is likely to be deployed where it may experience sudden changes of air temperature. This method is intended to evaluate the effects of sudden temperature changes of the outer surfaces of materiel, items mounted on the outer surfaces, or internal items situated near the external surfaces. This method is, essentially, surface-level tests. Typically, this addresses:
a. The transfer of materiel between climate-controlled environment areas and extreme external ambient conditions or vice versa, e.g., between an air conditioned enclosure and desert high temperatures, or from a heated enclosure in the cold regions to outside cold temperatures.
b. Ascent from a high temperature ground environment to high altitude via a high performance vehicle (hot to cold only).
c. Air delivery/air drop at high altitude/low temperature from aircraft enclosures when only the external material (packaging or materiel surface) is to be tested.
As Mike says, the surprising part is that the requirement for thermal shock is coming from technical people, most likely ones with military backgrounds.
Even more surprising to me was that these were typically folks on the technical side of the house more then the lawyers or business people. I mean, these are the folks that should be more in tune with logic than say business or legal people who can get bogged down in the letter of the law or dogmatic adherence to how things have been done. Right? I guess not.
I can’t imagine any business person or attorney thinking a thermal shock is a 3 degree change in 15 minutes. If an attorney were involved, they would go to the MIL-STD-810G definition of temperature shock: an air temperature change greater than 10°C (18°F) within one minute.
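To put numbers on the gap, here is a minimal Python sketch that compares the contract clause with the MIL-STD-810G threshold quoted above. The function names are mine. The clause's 3 degrees are assumed to be Celsius (the clause does not state a unit, and Celsius is the harsher reading), and treating the standard's definition as a simple average rate is a simplification.

```python
# Threshold from the MIL-STD-810G Method 503.5 text quoted above:
# an air temperature change greater than 10 °C within one minute.
MIL_STD_SHOCK_C_PER_MIN = 10.0


def rate_c_per_min(delta_c: float, minutes: float) -> float:
    """Average rate of temperature change in °C per minute."""
    return delta_c / minutes


def is_temperature_shock(delta_c: float, minutes: float) -> bool:
    """True if the average rate exceeds the MIL-STD-810G threshold.

    Simplification (my assumption): the definition is treated as an
    average rate over the whole interval.
    """
    return rate_c_per_min(delta_c, minutes) > MIL_STD_SHOCK_C_PER_MIN


if __name__ == "__main__":
    # Contract clause variants: 3 degrees over 10, 12, or 15 minutes.
    # Assumption: degrees Celsius; the clause does not state the unit.
    for minutes in (10, 12, 15):
        rate = rate_c_per_min(3.0, minutes)
        ratio = MIL_STD_SHOCK_C_PER_MIN / rate
        print(f"3 degrees in {minutes} min = {rate:.2f} °C/min, "
              f"about {ratio:.0f}x below the shock threshold; "
              f"shock: {is_temperature_shock(3.0, minutes)}")
```

Under these assumptions, even the fastest variant of the clause (3 degrees in 10 minutes) is roughly 33 times slower than the rate the standard calls a sudden change.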
So where does this myth come from? Most likely there is a social network effect among people who consider themselves smarter than others and have added thermal shock to the requirements. One of the comments on Mike’s blog documents the possible social network source.
Dave Kelley, Liebert Precision Cooling
The only place where something like this is “documented” in any way is in the ASHRAE Thermal Guidelines book. Since the group that wrote this book included all of the major server vendors, it must have been created with some type of justifiable reason. It states that the “maximum rate of temperature change is 5 degrees C (9 degrees F) per hour.”
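For comparison, the figures in this post can be put on a common per-hour scale. The Python sketch below is only illustrative arithmetic: the helper names are mine, the clause is read both as Celsius and as Fahrenheit because it does not state a unit, and extrapolating a 10 to 15 minute window to a full hour is an assumption.

```python
def per_hour(delta: float, minutes: float) -> float:
    """Scale a temperature change over `minutes` to a per-hour rate."""
    return delta * 60.0 / minutes


def f_delta_to_c(delta_f: float) -> float:
    """Convert a temperature difference (not a reading) from °F to °C."""
    return delta_f * 5.0 / 9.0


# ASHRAE figure as quoted in the comment above: 5 °C per hour.
ASHRAE_C_PER_HOUR = 5.0
# MIL-STD-810G "sudden change" quoted earlier: 10 °C in one minute.
MIL_STD_C_PER_HOUR = per_hour(10.0, 1.0)

if __name__ == "__main__":
    print(f"ASHRAE guideline:  {ASHRAE_C_PER_HOUR:.1f} °C/hour")
    print(f"MIL-STD-810G:      {MIL_STD_C_PER_HOUR:.1f} °C/hour")
    for minutes in (10, 12, 15):
        as_celsius = per_hour(3.0, minutes)
        as_fahrenheit = per_hour(f_delta_to_c(3.0), minutes)
        print(f"Clause, 3 degrees in {minutes} min: "
              f"{as_celsius:.1f} °C/hour if Celsius, "
              f"{as_fahrenheit:.1f} °C/hour if Fahrenheit")
```

On these assumptions, the clause's trigger rate lands somewhat above the ASHRAE figure from the comment and far below the MIL-STD-810G threshold, which would fit a clause borrowed from a rate-of-change guideline rather than from any definition of shock.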
And as Mike closes, this has unintended consequences.
But this brings up another important point. Many facilities might experience a chiller failure, or a CRAH failure or some other event which might temporarily have this effect within the facility. Lets say it happens twice in one year that you would potentially trigger this event for the whole or a portion of your facility (your probably not doing preventative maintenance – bad you!). So the contract language around Thermal shock now claims monetary damages. Based on what? How are these sums defined? The contracts I read through had some wild oscillations on damages with different means of calculation, and a whole lot more. So what is the basis of this damage assessment? Again there are no studies that says each event takes off .005 minutes of a servers overall life, or anything like that. So the cost calculations are completely arbitrary and negotiated between provider and customer.
This is where the true foolishness then comes in. The providers know that these events, while rare, might happen occasionally. While the event may be within all other service level agreements, they still might have to award damages. So what might they do in response? They increase the costs of course to potentially cover their risk. It might be in the form of cost per kw, or cost per square foot, and it might even be pretty small or minimal compared to your overall costs. But in the end, the customer ends up paying more for something that might not happen, and if it does there is no concrete proof it has any real impact on the life of the server or equipment, and really only salves the whim of someone who really failed to do their homework. If it never happens the hosting provider is happy to take the additional money.
Temperature/thermal shock is a term that doesn’t apply to data centers. Hopefully you’ll know when to call temperature/thermal shock requirements in data center operations a myth.
Thanks Mike for taking the time to write on this.