Google's Urs Hölzle explains why beefier cores are better than wimpy cores

The Register covers a new paper by Google's Urs Hölzle.

Google ops czar condemns multi-core extremists

Sea of 'wimpy' cores will sink you

By Cade Metz in San Francisco

Posted in Servers, 17th September 2010 07:04 GMT

Google is the modern data poster-child for parallel computing. It's famous for splintering enormous calculations into tiny pieces that can then be processed across an epic network of machines. But when it comes to spreading workloads across multi-core processors, the company has called for a certain amount of restraint.

With a paper (PDF) soon to be published in IEEE Micro, the IEEE magazine of chip and silicon design, Google Senior Vice President of Operations Urs Hölzle – one of the brains overseeing the web giant's famous back-end – warns against the use of multi-core processors that take parallelization too far. Chips that spread workloads across more energy-efficient but slower cores, he says, may not be preferable to chips with faster but power-hungry cores.

The paper is here and only two pages long. As for what motivated Urs to write it, I suspect it was frustration that too many people focus on the number of cores available to solve a problem without considering what happens to the overall system when you attack that problem with a bunch of wimpy cores rather than brawny ones.

We classify multicore systems as brawny-core systems, whose single-core performance is fairly high, or wimpy-core systems, whose single-core performance is low. The latter are more power efficient. Typically, CPU power decreases by approximately O(k²) when CPU frequency decreases by k, and decreasing DRAM access speeds with core speeds can save additional power.
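
To make that quadratic trade-off concrete, here's a back-of-the-envelope sketch (my own, with made-up numbers, not anything from the paper): halving core speed cuts per-core power to roughly a quarter, but you need about twice as many cores to keep the same throughput, so CPU power for a fixed workload only halves.

```python
# A rough sketch of the idealized scaling rule in the quote: CPU power ~ frequency^2.
# All numbers are illustrative, not from the paper.

def cluster_cpu_power(slowdown_k, base_core_power_w, cores_at_full_speed):
    """Total CPU power for a cluster that must sustain a fixed throughput.

    Slowing each core down by a factor k cuts per-core power by ~k**2,
    but requires ~k times as many cores to keep the same total throughput.
    """
    per_core_power = base_core_power_w / slowdown_k ** 2
    cores_needed = cores_at_full_speed * slowdown_k
    return per_core_power * cores_needed

print(cluster_cpu_power(1, 20, 100))  # brawny: 100 cores * 20 W = 2000 W
print(cluster_cpu_power(2, 20, 100))  # wimpy:  200 cores *  5 W = 1000 W
# Wimpy cores really do save CPU power, which is exactly why the rest of the
# paper is about everything *other* than raw CPU power.
```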

Urs, as usual, presents his argument clearly, making his point in three areas.

First, the more threads handling a parallelized request, the larger the overall response time. Often all parallel tasks must finish before a request is completed, and thus the overall response time becomes the maximum response time of any subtask, and more subtasks will push further into the long tail of subtask response times. With 10 subtasks, a one-in-a-thousand chance of suboptimal process scheduling will affect 1 percent of requests (recall that the request time is the maximum of all subrequests), but with 1,000 subtasks it will affect virtually all requests.
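
The arithmetic behind that last sentence is easy to check. A quick sketch of my own, using the paper's one-in-a-thousand figure for a delayed subtask:

```python
# Probability that at least one of n parallel subtasks hits a rare slow path,
# assuming each subtask is independently delayed with probability p.
# Because the request waits for its slowest subtask, one delayed subtask
# delays the whole request.

def fraction_of_slow_requests(n_subtasks, p_slow=0.001):
    return 1 - (1 - p_slow) ** n_subtasks

print(fraction_of_slow_requests(10))    # ~0.01 -> about 1% of requests affected
print(fraction_of_slow_requests(1000))  # ~0.63 -> most requests hit at least one slow subtask
```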

In addition, a larger number of smaller systems can increase the overall cluster cost if fixed non-CPU costs can’t be scaled down accordingly. The cost of basic infrastructure (enclosures, cables, disks, power supplies, network ports, cables, and so on) must be shared across multiple wimpy-core servers, or these costs might offset any savings. More problematically, DRAM costs might increase if processes have a significant DRAM footprint that’s unrelated to throughput. For example, the kernel and system processes consume more aggregate memory, and applications can use memory-resident data structures (say, a dictionary mapping words to their synonyms) that might need to be loaded into memory on multiple wimpy-core machines instead of a single brawny-core machine.
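
To see how fixed per-server costs and replicated memory footprints can eat the savings, here's a toy cost model. The numbers are entirely illustrative, mine and not Google's:

```python
# Toy cost model: fixed per-server overhead plus per-GB DRAM, with an
# application data structure (e.g. an in-memory dictionary) that must be
# fully replicated on every server that handles requests.
# Illustrative numbers only.

def cluster_cost(n_servers, cpu_cost_per_server, fixed_cost_per_server=300,
                 replicated_dram_gb=16, dram_cost_per_gb=10):
    dram_cost = n_servers * replicated_dram_gb * dram_cost_per_gb
    return n_servers * (cpu_cost_per_server + fixed_cost_per_server) + dram_cost

# Same total CPU throughput: 10 brawny servers vs. 40 wimpy servers with cheaper CPUs.
print(cluster_cost(10, cpu_cost_per_server=800))  # 10*(800+300) + 10*160 = 12,600
print(cluster_cost(40, cpu_cost_per_server=150))  # 40*(150+300) + 40*160 = 24,400
```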

Third, smaller servers can also lead to lower utilization. Consider the task of allocating a set of applications across a pool of servers as a bin-packing problem—each of the servers is a bin, and we try to fit as many applications as possible into each bin. Clearly that task is harder when the bins are small, because many applications might not completely fill a server and yet use too much of its CPU or RAM to allow a second application to coexist on the same server. Thus, larger bins (combined with resource containers or virtual machines to achieve performance isolation between individual applications) might offer a lower total cost to run a given workload.
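
The bin-packing point is easy to demonstrate with a few lines of first-fit packing. Again, a sketch of my own, with made-up application sizes:

```python
# First-fit bin packing: place each application into the first server with room,
# opening a new server when none fits. Illustrative sizes only.

def pack(app_sizes, server_capacity):
    servers = []  # remaining capacity of each opened server
    for size in app_sizes:
        for i, free in enumerate(servers):
            if size <= free:
                servers[i] -= size
                break
        else:
            servers.append(server_capacity - size)
    used = sum(server_capacity - free for free in servers)
    return len(servers), used / (len(servers) * server_capacity)

apps = [6, 5, 4, 4, 3, 3, 2, 2, 1]  # resource units per application
print(pack(apps, server_capacity=16))  # (2, ~0.94): few big bins, well filled
print(pack(apps, server_capacity=7))   # (5, ~0.86): many small bins, more stranded capacity
```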

How many data center operations VPs can write this paper?  One.  :-)

Keep the number of cores in mind when designing a green data center: smaller, energy-efficient processors may not be the most efficient choice overall.