Modeling the Path to Higher Efficiency Servers - PUE for Servers?

James Hamilton has a good post on the next point of server differentiation being efficiency at very high temperature.

Next Point of Server Differentiation: Efficiency at Very High Temperature

High data center temperatures are the next frontier for server competition (see pages 16 through 22 of my Data Center Efficiency Best Practices talk: http://mvdirona.com/jrh/TalksAndPapers/JamesHamilton_Google2009.pdf and 32C (90F) in the Data Center). At higher temperatures the difference between good and sloppy mechanical designs is much more pronounced and needs to be a purchasing criterion.

The infrastructure efficiency gains of running at higher temperatures are obvious. In a typical data center, 1/3 of the power arriving at the property line is consumed by cooling systems. Large operational expenses can be avoided by raising the temperature set point. In most climates, raising data center set points to the 95F range will allow a facility to move to a pure air-side economizer configuration, eliminating 10% to 15% of the overall capital expense, with the latter number being the more typical.
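To put a rough number on that one-third-for-cooling figure, here is a quick back-of-envelope PUE calculation in Python. The distribution-loss and economizer percentages are my own assumptions for illustration, not James's numbers.

# Back-of-envelope PUE arithmetic for the cooling claim above.
# Assumed breakdown of power at the property line (hypothetical):
#   ~1/3 to cooling, ~10% to distribution/UPS losses, the rest reaches the IT load.
total_power_kw = 1000.0                        # power arriving at the property line
cooling_kw = total_power_kw / 3.0              # ~1/3 consumed by cooling
distribution_loss_kw = total_power_kw * 0.10   # assumed distribution/UPS losses
it_load_kw = total_power_kw - cooling_kw - distribution_loss_kw

print(f"PUE with conventional cooling: {total_power_kw / it_load_kw:.2f}")  # ~1.76

# If air-side economization at a higher set point cut cooling to, say, 5% of total power:
economized_total_kw = it_load_kw / (1.0 - 0.05 - 0.10)
print(f"Approx. PUE with economizer:   {economized_total_kw / it_load_kw:.2f}")  # ~1.18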

James gives three downsides of higher temperatures.

These savings are substantial and exciting. But, there are potential downsides: 1) increased server mortality, 2) higher semiconductor leakage current at higher temperatures, 3) increased air movement costs driven by higher fan speeds at higher temperatures. The first, increased server mortality, has very little data behind it. I’ve seen some studies that confirm higher failure rates at higher temperature and I’ve seen some that actually show the opposite. For all servers there clearly is some maximum temperature beyond which failure rates will increase rapidly. What’s unclear is what that temperature point actually is.

In my early career at HP I worked as a reliability engineer, stress testing equipment in extreme cold and heat and analyzing the failures. This problem also reminds me of one of the lessons I learned working in distribution logistics at HP and Apple: it is cost prohibitive to design the 99.9999% packaging to ship things, and you need to strike the right balance depending on what you are shipping and its value.

Intel, AMD, and disk drive vendors will discuss their energy efficiency, but just like packaging design, thermal efficiency is not sexy and is not what people think about when they think about energy efficiency.

The complexity of this is huge.

We also know that the knee of the curve where failures start to get more common is heavily influenced by the server components chosen and the mechanical design. Designs that cool more effectively will operate without negative impact at higher temperatures. We could try to understand all details of each server and try to build a failure prediction model for different temperatures, but this task is complicated by the diversity of servers and components and the near complete lack of data at higher temperatures.
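To see why that modeling task is so hard, here is what the simplest version of such a failure prediction model looks like: a sketch using the textbook Arrhenius acceleration factor. The activation energy and base failure rate are placeholder assumptions, and choosing them correctly per component and per design is exactly the detail that is missing.

import math

BOLTZMANN_EV = 8.617e-5      # Boltzmann constant in eV/K
ACTIVATION_ENERGY_EV = 0.7   # assumed activation energy; varies widely by component

def acceleration_factor(temp_base_c: float, temp_new_c: float) -> float:
    """Relative increase in failure rate going from temp_base_c to temp_new_c."""
    t_base_k = temp_base_c + 273.15
    t_new_k = temp_new_c + 273.15
    return math.exp((ACTIVATION_ENERGY_EV / BOLTZMANN_EV) * (1.0 / t_base_k - 1.0 / t_new_k))

base_afr = 0.03  # assumed 3% annualized failure rate at a 20C inlet
for inlet_c in (20, 27, 32, 35, 40):
    print(f"{inlet_c}C inlet: ~{base_afr * acceleration_factor(20, inlet_c):.1%} AFR")

A naive model like this predicts failure rates climbing steeply with temperature, yet James notes some field studies show the opposite, which is part of why handing the modeling problem to the people who build the boxes makes sense.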

And, here is where James totally gets it. He says we need models.

So, not being able to build a model, I chose to lean on a different technique that I’ve come to prefer: incent the server OEMs to produce the models themselves. If we ask the server OEMs to warrant the equipment at the planned operating temperature, we’re giving the modeling problem to the folks that have both the knowledge and the skills to model the problem faithfully and, much more importantly, they have the ability to change designs if they aren’t faring well in the field. The technique of transferring the problem to the party most capable of solving it and financially incenting them to solve it will bring success.

My belief is that this approach of transferring the risk, failure modeling, and field result tracking to the server vendor will control point 1 above (increased server mortality rate). We also know that the Telecom world has been operating at 40C (104F) for years (see NEBS), so clearly equipment can be designed to operate correctly at these temperatures and last longer than current servers are used. This issue looks manageable.

How do you solve this problem?

One smart dev guy, Ade Miller, had a good answer, which I hope he’ll blog about soon: calculating PUE for a desktop and a server.

So it is more like an equipment PUE vs. a data center PUE.

What if server vendors started to publish their equipment PUE? What is the IT load of the motherboard, and what is the overhead for the power supply and fans? Would we be looking to buy the servers with the best PUE?
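Here is a rough sketch of what an equipment PUE calculation could look like, treating the motherboard as the IT load and the power supply and fans as overhead. All of the wattages and the efficiency figure below are hypothetical, just to illustrate the ratio.

# Sketch of an "equipment PUE" for a single server (all numbers hypothetical).
motherboard_w = 250.0    # the "IT load": CPUs, memory, disks, board
fan_w = 30.0             # air movers inside the chassis
psu_efficiency = 0.85    # wall-to-DC conversion efficiency of the power supply

wall_power_w = (motherboard_w + fan_w) / psu_efficiency
equipment_pue = wall_power_w / motherboard_w
print(f"Wall power: {wall_power_w:.0f} W, equipment PUE: {equipment_pue:.2f}")  # ~1.32

A vendor with a more efficient power supply or a better airflow design would publish a lower number, and that number would climb as the fans spin up at higher inlet temperatures, which ties right back to James’s point.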

For those of you who get power supplies, fans, and analog devices, this will make a lot of sense. Oh yeah, I was also a program manager on the Macintosh II power supplies, and learned a lot from a great development team.