ex-Intel engineers at Microsoft share processor secrets, optimize performance per watt

Microsoft’s Dileep Bhandarkar and Kushagra Vaid published a paper on rightsizing servers for cost and power savings, both important pieces of a green data center strategy. To put things in context, both Dileep and Kushagra are ex-Intel processor engineers. Let’s start with the summary from their paper:

In conclusion, the first point to emphasize is that there is more to performance than just speed. When your definition of performance includes cost effectiveness, you also need to consider power. The next point is that in many cases processor speed has outpaced our ability to consume it. It’s difficult to exploit CPU performance across the board. This platform imbalance presents an opportunity to rightsize your configurations. The results will offer a reduction in both power and costs, with power becoming an increasingly important factor in the focus on total cost of ownership.

It is also important to remember that industry benchmarks may not reflect your environment. We strongly recommend that IT departments do their own workload characterization, understand the behavior of the applications in their own world, and then optimize for that.

Dileep and Kushagra are going out on a limb sharing details most wouldn’t. Intel’s and the server manufacturers’ goal is to maximize revenue per unit (chips or servers). If you buy high-performance chips in the belief that you are buying high-performance-per-watt systems, they make more money. But the truth is that much of the time you don’t need the high-performance processors. Plenty of server manufacturers are already selling big data center companies high-performance-per-watt systems built around low-cost processors.
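To put numbers on that, here is a minimal sketch of the performance-per-watt and performance-per-dollar comparison. The SKU figures below are made up for illustration, not taken from the paper; "perf" can be any throughput number you trust for your own workload.

```python
# Hypothetical SKU numbers for illustration only, not taken from the paper.
# "perf" can be any throughput figure you trust for your own workload.
skus = {
    "top-bin CPU": {"perf": 100.0, "watts": 130, "price_usd": 1500},
    "mid-bin CPU": {"perf": 85.0, "watts": 80, "price_usd": 400},
}

for name, s in skus.items():
    print(f"{name}: {s['perf'] / s['watts']:.2f} perf/W, "
          f"{s['perf'] / s['price_usd']:.3f} perf/$")
```

With those made-up numbers the slower part wins on both performance per watt and performance per dollar, which is exactly the trade the big data center buyers are making.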

Dileep has a blog post that goes along with the paper.

Before I came to Microsoft to manage server definition and purchases I worked on the other side of the fence. For 17 years I focused on processor architecture and performance at Digital Equipment Corporation, and then worked for 12 years at Intel, focusing on performance, architecture, and strategic planning. It’s interesting how now that I’m a hardware customer, the word “performance” encompasses cost effectiveness almost as much as it does throughput and response time. As my colleague Kushagra Vaid and I point out in our paper, when you look up performance in the dictionary it is defined as “how well something performs the functions for which it’s intended”.

Why should you read this paper? Because, as Dileep points out, the vast majority of people are making purchasing decisions based on processor benchmarks run on unrealistic configurations.

Figure: Three-year total cost of ownership of a basic 1U server

It also surprises me that so many IT groups base their purchasing decisions on published benchmark data about processors, even though that data is often generated using system configurations that are completely unrealistic when compared to real-world environments. Most folks sit up and take note when I display the facts about these topics, because the subject is important.

Rightsizing can clearly reduce the purchase price and the power consumption of a server. But the benefits go beyond the savings in capital expenditure. The lower power consumption has a big impact on the Total Cost of Ownership as shown in the Figure.
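To see why power ends up being such a large slice of TCO, here is a back-of-the-envelope sketch. The server price, wattage, PUE, and electricity rate are my assumptions, not numbers from the paper or the figure above.

```python
# Back-of-the-envelope 3-year TCO for a single 1U server.
# All inputs below are assumptions for illustration, not numbers from the paper.
server_price_usd = 2500          # purchase price of a basic 1U server
avg_power_watts = 250            # average draw at the wall
pue = 1.8                        # datacenter power usage effectiveness
electricity_usd_per_kwh = 0.10
years = 3

hours = years * 365 * 24
energy_kwh = avg_power_watts / 1000 * pue * hours
power_cost = energy_kwh * electricity_usd_per_kwh
tco = server_price_usd + power_cost

print(f"Energy over {years} years: {energy_kwh:,.0f} kWh")
print(f"Power cost: ${power_cost:,.0f}")
print(f"Capex + power TCO: ${tco:,.0f} (power is {power_cost / tco:.0%})")
```

With these assumptions power is roughly a third of the three-year cost, before counting networking, space, or administration.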

So, let’s start diving into the secrets in Dileep and Kushagra’s paper.  Here is the background.

Introduction
How do you make sure that the servers you purchase and deploy are most efficient in terms of cost and energy? In the Microsoft Global Foundation Services organization (GFS), which builds and manages the company’s datacenters that house tens of thousands of servers, we do this by first performing detailed analysis of our internal workloads. Then, by implementing a formal analysis process to rightsize the servers we deploy, immediate and long-term cost savings can be realized. GFS finds that testing on actual internal workloads leads to much more useful comparison data versus published benchmark data. In rightsizing our servers we balance systems to achieve substantial savings. Our analysis and experience show that it usually makes more sense to use fewer and less expensive processors because the bottleneck in performance is almost invariably the disk I/O portion of the platform, not the CPU.
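The "analyze your internal workloads" step does not have to be elaborate. Here is a rough sketch of the kind of sampling you might start with; it assumes the third-party psutil package, and the interval and duration are arbitrary choices.

```python
# Minimal workload-characterization sketch: sample CPU utilization and disk
# throughput over time and summarize them. Requires the third-party psutil
# package; the 5-second interval and 10-minute window are arbitrary.
import time
import psutil

samples = []
interval_s = 5
duration_s = 600

start = time.time()
prev_disk = psutil.disk_io_counters()
while time.time() - start < duration_s:
    cpu_pct = psutil.cpu_percent(interval=interval_s)   # blocks for interval_s
    disk = psutil.disk_io_counters()
    read_mb_s = (disk.read_bytes - prev_disk.read_bytes) / interval_s / 1e6
    write_mb_s = (disk.write_bytes - prev_disk.write_bytes) / interval_s / 1e6
    samples.append((cpu_pct, read_mb_s, write_mb_s))
    prev_disk = disk

avg = [sum(col) / len(samples) for col in zip(*samples)]
print(f"avg CPU {avg[0]:.1f}%  avg read {avg[1]:.1f} MB/s  avg write {avg[2]:.1f} MB/s")
```

Even this crude view is usually enough to tell whether a server fleet is CPU-bound or disk-bound.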

What benchmarks?  SPEC CPU2006.  Understand the conditions of the test.

One of the most commonly used benchmarks is SPEC CPU2006. It provides valuable insight into performance characteristics for different microprocessor central processing units (CPUs) running a standardized set of single-threaded integer and floating-point benchmarks. A multi-threaded version of the benchmark is CPU2006_rate, which provides insight into throughput characteristics using multiple running instances of the CPU2006 benchmark.

But important caveats need to be considered when interpreting the data provided by the CPU2006 benchmark suite. Published benchmark results are almost always obtained using very highly tuned compilers that are rarely if ever used in code development for production systems. They often include settings for code optimization switches uncommon in most production systems. Also, while the individual benchmarks that make up the CPU2006 suite represent a very useful and diverse set of applications, these are not necessarily representative of the applications running in customer production environments. Additionally, it is very important to consider the specifics of the system setup used for obtaining the benchmarking data (e.g., CPU frequency and cache size, memory capacity, etc.) while interpreting the benchmark results since the setup has an impact on results and needs to be understood before making comparisons for product selection.
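Before comparing published numbers across systems, it helps to at least normalize the composite score by the configuration it was measured on. Here is a crude sanity-check sketch; the submissions listed are hypothetical, not real results.

```python
# Crude sanity check before comparing published results across systems:
# normalize the composite score by the configuration it was measured on.
# All entries below are hypothetical, not real published submissions.
results = [
    # (label, score, cores, ghz, tdp_watts)
    ("vendor A, top bin", 410, 8, 3.2, 130),
    ("vendor B, mid bin", 340, 8, 2.6, 80),
]

for label, score, cores, ghz, tdp in results:
    print(f"{label}: {score / cores:.0f}/core, "
          f"{score / (cores * ghz):.0f}/(core*GHz), {score / tdp:.1f}/W")
```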

and TPC.

Additionally, the system configuration is often highly tuned to ensure there are no performance bottlenecks. This typically means using an extremely high performing storage subsystem to keep up with the CPU subsystem. In fact, it is not uncommon to observe system configurations with 1,000 or more disk drives in the storage subsystem for breakthrough TPC-C or TPC-E results. To illustrate this point, a recent real-world example involves a TPC-C result for a dual-processor server platform that has an entry level price a little over $3,000 (Source: http://www.tpc.org). The result from the published benchmark is impressive: more than 600,000 transactions per minute. But the total system cost is over $675,000. That’s not a very realistic configuration for most companies. Most of the expense comes from employing 144 GB of memory and over a thousand disk drives.
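It is worth doing the arithmetic on that example explicitly, using the round numbers quoted above (the exact published figures differ slightly):

```python
# Rough price/performance arithmetic for the TPC-C example quoted above
# (round numbers; the exact published figures will differ slightly).
tpmC = 600_000                 # transactions per minute
total_system_cost = 675_000    # USD, full benchmark configuration
base_server_cost = 3_000       # USD, entry-level price of the server itself

print(f"price/performance: ${total_system_cost / tpmC:.2f} per tpmC")
print(f"server is {base_server_cost / total_system_cost:.1%} of the benchmark config cost")
```

The server itself is well under one percent of the benchmark configuration cost; the headline number is really measuring the storage and memory spend.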

Both of these tests are generally set up to showcase CPU performance, but as Dileep and Kushagra say, few systems are deployed in these configurations. So what do you do? Rightsize the system, which usually means not buying the highest-performing CPU, because the CPU is not the bottleneck. Keep in mind that this advice is coming from ex-Intel processor engineers.

CPU is typically not your bottleneck: Balance your systems accordingly
So how should you look at performance in the real world? First you need to consider what the typical user configuration is in your organization. Normally this will be dictated either by the capability or by cost constraints. Typically your memory sizes are smaller than what you see in published benchmarks, and you have a limited amount of disk I/O. This is why CPU utilization throughout the industry is very low: server systems are not well balanced. What can you do about it? One option is to use more memory so there are fewer disk accesses. This adds a bit of cost, but can help you improve performance. The other option—the one GFS likes to use—is to deploy balanced servers so that major platform resources (CPU, memory, disk, and network) are sized correctly.
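One way to make "balanced" concrete is to estimate which resource saturates first for your transaction mix. The per-transaction costs and capacities in this sketch are invented for illustration; the point is the shape of the calculation, not the numbers.

```python
# Which resource saturates first? All per-transaction costs below are invented
# for illustration; plug in numbers from your own workload characterization.
per_txn = {                      # resource demand per transaction
    "cpu_ms": 2.0,               # CPU milliseconds
    "disk_ios": 4,               # random disk I/Os
    "net_kb": 30,                # network kilobytes
}
capacity = {                     # what the platform can supply per second
    "cpu_ms": 8 * 1000,          # 8 cores * 1000 ms of CPU per second
    "disk_ios": 8 * 150,         # 8 spindles * ~150 IOPS each
    "net_kb": 1_000_000 / 8,     # ~1 Gb/s expressed in KB/s
}

limits = {r: capacity[r] / per_txn[r] for r in per_txn}
for r, tps in sorted(limits.items(), key=lambda kv: kv[1]):
    print(f"{r:9s} limits you to {tps:,.0f} txn/s")
print(f"bottleneck: {min(limits, key=limits.get)}")
```

With these assumed numbers the disks cap the system at a few hundred transactions per second while the CPU could handle thousands, which is the imbalance the paper keeps pointing at.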

So, what happens if you don’t rightsize?

If memory or disk bandwidth is under-provisioned for a given application, the CPU will remain idle for a significant amount of time, wasting system power. The problem gets worse with multicore CPUs on the technology roadmap, offering further increases in CPU pipeline processing capabilities. A common technique to mitigate this mismatch is to increase the amount of system memory to reduce the frequency of disk accesses.
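The memory-versus-disk trade can be sketched the same way. The cache hit rates below are a toy assumption, not measured data, but they show why adding memory raises the disk-limited throughput ceiling.

```python
# Toy model: more memory -> higher cache hit rate -> fewer physical disk I/Os
# per transaction -> higher disk-limited throughput. Hit rates are invented.
disk_iops_total = 8 * 150            # 8 spindles at ~150 IOPS each (assumed)
logical_reads_per_txn = 20

for mem_gb, hit_rate in [(8, 0.80), (16, 0.90), (32, 0.95), (64, 0.98)]:
    physical_ios_per_txn = logical_reads_per_txn * (1 - hit_rate)
    disk_limited_tps = disk_iops_total / physical_ios_per_txn
    print(f"{mem_gb:3d} GB RAM -> {physical_ios_per_txn:.1f} I/Os per txn, "
          f"~{disk_limited_tps:,.0f} txn/s before the disks saturate")
```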

The old rule was to buy the highest-performing processors you could afford. Why not keep doing that? Because it wastes money and increases your power costs.

Another aspect to consider is shown in Figure 2 below. If you look at performance as measured by frequency for any given processor, typically there is a non-linear effect. At the higher frequency range, the price goes up faster than the frequency. To make matters worse, performance does not typically scale linearly with frequency. If you’re aiming for the highest possible performance, you’re going to end up paying a premium that’s out of proportion with the performance you’re going to get. Do you really need that performance, and is the rest of your system really going to be able to use it? It’s very important from a cost perspective to find the sweet spot you’re after.

[Figure 2]
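Finding that sweet spot is just price plus an honest estimate of how performance actually scales with frequency. The SKU prices and the 0.7 scaling exponent in this sketch are assumptions for illustration only.

```python
# Find the price/performance sweet spot across frequency bins of one CPU family.
# Prices and the sub-linear scaling exponent are assumptions for illustration.
skus = [   # (frequency_ghz, price_usd), hypothetical bins
    (2.0, 300), (2.3, 450), (2.6, 700), (2.9, 1100), (3.2, 1700),
]
base_ghz, scaling_exponent = 2.0, 0.7   # assume perf ~ (f / f_base) ** 0.7

for ghz, price in skus:
    rel_perf = (ghz / base_ghz) ** scaling_exponent
    print(f"{ghz:.1f} GHz: relative perf {rel_perf:.2f}, "
          f"perf per $100 {100 * rel_perf / price:.2f}")
```

Under these assumptions the top bin costs more than five times as much as the bottom bin for well under half again the performance.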

What is the relationship between system performance, CPU utilization, and disk count?

Figure 5 shows CPU utilization increasing with disk count as the result of the system being disk limited. As you increase the number of disk drives, the number of transactions per second goes up because you’re getting more I/O and consequently more throughput. With only eight drives CPU utilization is just 5 percent. At 24 drives CPU utilization goes up to 20 percent. If you double the drive count again, utilization only goes up to about 25 percent. What that says is that you’re disk I/O limited, so you don’t need to buy the most expensive, fastest processor. This kind of data allows us to rightsize the configuration, reducing both power and cost.

[Figure 5]
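The shape of that curve falls out of a very simple model: throughput scales with spindle count until the CPU becomes the limit. The per-transaction costs here are invented and are not meant to reproduce Figure 5 exactly.

```python
# Simple disk-limited throughput model: transactions scale with spindle count
# until the CPU becomes the limit. All per-transaction costs are assumptions.
iops_per_disk = 150        # ~150 random IOPS per spindle (assumed)
ios_per_txn = 4
cpu_ms_per_txn = 1.0
cpu_ms_per_sec = 8 * 1000  # 8 cores

for disks in (8, 24, 48, 96, 192):
    disk_limited_tps = disks * iops_per_disk / ios_per_txn
    cpu_limited_tps = cpu_ms_per_sec / cpu_ms_per_txn
    tps = min(disk_limited_tps, cpu_limited_tps)
    cpu_util = tps * cpu_ms_per_txn / cpu_ms_per_sec
    print(f"{disks:3d} disks -> {tps:6,.0f} txn/s, CPU {cpu_util:.0%}")
```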

The paper goes on to discuss web servers, where a faster processor does help if the content is cached.

[Figure: web server results]

To share the blame, two RAID controllers are compared: one with 256 MB of cache and another with 512 MB.

But when we looked at the results from our ETW workload analysis, we found that most of the time our queue depth never goes beyond 8 I/Os. So in our operational area, there is no difference in performance between the two RAID controllers. If we didn’t have the workload analysis and just looked at those curves, we might have been impressed by the 10-15 percent performance improvement at the high end of the scale, and paid a premium for performance we would never have used.

[Figure: RAID controller performance curves]
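You can run the same kind of check against your own I/O trace. This sketch assumes a hypothetical queue_depths.csv with one sampled queue-depth value per line, exported from whatever tracer you use (GFS used ETW).

```python
# Summarize a trace of sampled disk queue depths to see whether deep-queue
# performance matters for your workload. Assumes "queue_depths.csv" holds one
# sampled queue-depth integer per line (hypothetical file name and format).
import csv
from statistics import quantiles

with open("queue_depths.csv", newline="") as f:
    depths = [int(row[0]) for row in csv.reader(f) if row]

p50, p95, p99 = (quantiles(depths, n=100)[i] for i in (49, 94, 98))
deep = sum(d > 8 for d in depths) / len(depths)
print(f"median queue depth {p50:.1f}, p95 {p95:.1f}, p99 {p99:.1f}")
print(f"fraction of samples with queue depth > 8: {deep:.1%}")
```

If the 95th percentile never gets past a queue depth of 8, the extra controller cache is paying for headroom your workload never touches, which is exactly the conclusion in the quote above.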