Two Ways to Save Server Power - Google (Tune to Latency) vs. Facebook (Efficient Load Balancing)

October 12, 2014 Dave Ohara

Saving energy in the data center takes more than a low PUE. Using 100% renewable power while wasting energy is not a good practice. I’ve been meaning to post on what Google and Facebook have done in these areas, and I have been staring at these open browser tabs for a while.

First is Google, which in June 2014 shared its method of turning a server's power consumption down as low as possible while still meeting its latency targets. The Register covered this method.

Google has worked out how to save as much as 20 percent of its data-center electricity bill by reaching deep into the guts of its infrastructure and fiddling with the feverish silicon brains of its chips.

In a paper to be presented next week at the ISCA 2014 computer architecture conference entitled "Towards Energy Proportionality for Large-Scale Latency-Critical Workloads", researchers from Google and Stanford University discuss an experimental system named "PEGASUS" that may save Google vast sums of money by helping it cut its electricity consumption.

The Google paper is here.

We presented PEGASUS, a feedback-based controller that implements iso-latency power management policy for large-scale, latency-critical workloads: it adjusts the power-performance settings of servers in a fine-grain manner so that the overall workload barely meets its latency constraints for user queries at any load. We demonstrated PEGASUS on a Google search cluster. We showed that it preserves SLO latency guarantees and can achieve significant power savings during periods of low or medium utilization (20% to 40% savings). We also established that overall workload latency is a better control signal for power management compared to CPU utilization. Overall, iso-latency provides a significant step forward towards the goal of energy proportionality for one of the challenging classes of large-scale, low-latency workloads.
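To make the iso-latency idea concrete, here is a minimal sketch of one step of a latency-driven feedback controller. It is not the PEGASUS policy from the paper. The function name, the three bands, the wattage limits, and the step size are all assumptions I made up for illustration. The only thing taken from the abstract is the principle: use measured latency against the SLO, rather than CPU utilization, to decide how much power a server gets.

```python
def next_power_cap(cap_watts, latency_ms, slo_ms,
                   min_cap=40.0, max_cap=120.0,
                   step_watts=5.0, headroom=0.85):
    """Return the power cap for the next control interval.

    All thresholds and limits here are hypothetical, not from the paper.
    """
    if latency_ms > slo_ms:
        # SLO violated: stop saving power and go back to full power.
        return max_cap
    if latency_ms > headroom * slo_ms:
        # Close to the SLO: hold the current cap.
        return cap_watts
    # Comfortable slack: lower the cap a little, but never below the floor.
    return max(min_cap, cap_watts - step_watts)


if __name__ == "__main__":
    cap = 120.0
    # Hypothetical latency samples against a 100 ms SLO.
    for latency in [40.0, 45.0, 60.0, 90.0, 105.0]:
        cap = next_power_cap(cap, latency, slo_ms=100.0)
        print(f"latency={latency} ms -> cap={cap} W")
```

The design choice worth noticing is the asymmetry: the cap comes down in small steps but jumps straight back to maximum on a violation, which is one simple way to favor the latency guarantee over the savings.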

In August 2014, Facebook shared Autoscale, its method of using load balancing to reduce energy consumption. Gigaom covered this idea.

The social networking giant found that when its web servers are idle and not taking user requests, they don’t need that much compute to function, thus they only require a relatively low amount of power. As the servers handle more networking traffic, they need to use more CPU resources, which means they also need to consume more energy.

Interestingly, Facebook found that during relatively quiet periods like midnight, while the servers consumed more energy than they would when left idle, the amount of wattage needed to keep them running was pretty close to what they need when processing a medium amount of traffic during busier hours. This means that it’s actually more efficient for Facebook to have its servers either inactive or running like they would during busier times; the servers just need to have network traffic streamed to them in such a way so that some can be left idle while the others are running at medium capacity.
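A quick bit of arithmetic shows why this matters. The wattage figures below are hypothetical numbers I picked to match the shape Gigaom describes (idle is cheap, low load costs nearly as much as medium load); they are not Facebook's measurements. The sketch also assumes that a server at "low" load handles half the traffic of a server at "medium" load.

```python
# Hypothetical per-server power draw in watts, not measured data.
POWER_WATTS = {"idle": 60, "low": 130, "medium": 150}


def cluster_power(n_idle, n_low, n_medium):
    """Total power for a cluster with the given number of servers per state."""
    return (n_idle * POWER_WATTS["idle"]
            + n_low * POWER_WATTS["low"]
            + n_medium * POWER_WATTS["medium"])


# Same traffic, two placements (assuming low load = half of medium load):
spread = cluster_power(n_idle=0, n_low=10, n_medium=0)        # all 10 at low
consolidated = cluster_power(n_idle=5, n_low=0, n_medium=5)   # 5 idle, 5 medium

if __name__ == "__main__":
    print("spread:", spread, "W")
    print("consolidated:", consolidated, "W")
    print("saved:", spread - consolidated, "W")
```

With these made-up numbers, concentrating the traffic on half the servers draws less power than spreading it thinly across all of them, which is the effect Autoscale is built to exploit.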

Facebook posts on Autoscale here.

Overall architecture

In each frontend cluster, Facebook uses custom load balancers to distribute workload to a pool of web servers. Following the implementation of Autoscale, the load balancer now uses an active, or “virtual,” pool of servers, which is essentially a subset of the physical server pool. Autoscale is designed to dynamically adjust the active pool size such that each active server will get at least medium-level CPU utilization regardless of the overall workload level. The servers that aren’t in the active pool don’t receive traffic.

Figure 1: Overall structure of Autoscale

We formulate this as a feedback loop control problem, as shown in Figure 1. The control loop starts with collecting utilization information (CPU, request queue, etc.) from all active servers. Based on this data, the Autoscale controller makes a decision on the optimal active pool size and passes the decision to our load balancers. The load balancers then distribute the workload evenly among the active servers. It repeats this process for the next control cycle.
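The sizing decision in that control loop can be sketched roughly as follows. This is my own simplification, not Facebook's controller: the function name, the 50% target, and the use of CPU utilization alone are assumptions, and a real controller would also look at request queues and keep a safety margin. Utilization is passed as whole percentages to keep the arithmetic exact.

```python
def active_pool_size(active_util_pct, physical_pool_size,
                     target_util_pct=50, min_active=1):
    """Pick how many servers should be in the active pool next cycle.

    active_util_pct: CPU utilization (integer percent) of each active server.
    target_util_pct: hypothetical "medium" utilization we want per server.
    """
    total_load = sum(active_util_pct)
    # Round down so each active server ends up at or above the target,
    # matching the "at least medium-level" wording above.
    wanted = total_load // target_util_pct
    # Never go below the minimum or above the physical pool.
    return max(min_active, min(physical_pool_size, wanted))


if __name__ == "__main__":
    # Quiet period: ten active servers at 20% each fit on four servers at 50%.
    print(active_pool_size([20] * 10, physical_pool_size=10))
    # Busy period: four servers at 90% each call for a larger pool.
    print(active_pool_size([90] * 4, physical_pool_size=10))
```

Calling this once per control cycle and handing the result to the load balancer gives the loop described in Figure 1: measure, decide on a pool size, redistribute, repeat.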

Why is the 7x24 Exchange conference popular with my friends?

October 11, 2014 Dave Ohara

The 7x24 Exchange Conference in Phoenix is two weeks away, so I am checking in with some friends to see if they’ll be there. So far I am batting 100% on the people I am looking forward to seeing. Why are so many of my data center friends going to 7x24 Exchange Conferences? At 7x24 Exchange Conferences there is a critical mass of friends and ideas that support data center innovation. Almost every DC conference will claim it is driving data center innovation, but often the innovation comes from the conversations, not the program.

I return to 7x24 to see friends and make new ones. Is this just a social event? No, there are good presentations, which is the benefit of not using a “pay to play” presentation model. At some conferences, a Platinum sponsorship gets you a keynote spot, a Silver one gets you a small breakout room, and so on.

What ideas are discussed? That is constantly changing, which is part of why you return after getting value from past conferences.

Disclosure: In the past I would meet most of my friends at another conference that I am blacklisted from attending, so I have an incentive to help drive my friends to a conference where we will feel free to talk about anything we want. 7x24 Exchange has been supportive and open to feedback on what it takes to be a data center event that my friends find useful for so many reasons.

Love Your Dog? You may love them more after watching this video

October 10, 2014 Dave Ohara

We have two kids and a dog. In many ways our dog is like our third child. In this 60 Minutes video, a dog owner thinks of his dog as his child.

Anderson Cooper: Do you view Chaser as a family pet? As a friend? How do you see Chaser?

John Pilley: She's our child.

Anderson Cooper: She's your child?

John Pilley: She's our child, a member of the family. Oh yes. She comes first.

Many people think of their dogs as children, but John Pilley has been teaching Chaser like a child as well. By assigning names to toys, Pilley has been helping Chaser learn words and simple sentences.

Check out this video that shows the smartest dog in the world, and you may love your dog more.

DCIM has not taken off the way people thought. Why?

October 10, 2014 Dave Ohara

In the data center world there has been hype around DCIM. Multiple start-ups have tried to build businesses on DCIM. The electrical equipment suppliers have added DCIM solutions. Yet DCIM has not taken off. I have had the pleasure or pain of seeing some DCIM implementations first hand and seen how they work or don’t.

So here are some of the reasons why I think DCIM has not lived up to its hype.

- Given the limited deployments many systems don’t scale well.

- Usability is not there yet. Main focus has been to just get things to work.

- Manual data entry is required too many times.

- Decision makers who choose DCIM are not the operations staff, so there is a disconnect between expectations and reality. Many people don’t know the operating expense of running a DCIM system.

- The data center market is actually shrinking in terms of the number of companies running data centers, even though overall capacity is increasing.

- The big players have tried many of the services, and none is the killer app.

Given that the hype is dying down, it is pretty hard to launch a start-up targeting DCIM. I would expect DCIM teams within electrical suppliers are finding it harder to get more resources and money given the limited sales.

If a DCIM solution scaled to 100K+ servers, was easy to use, automated data entry, bridged the reality of operations with executive expectations, and became a standard at the big data center users, then it would be the killer app.

I don’t see this happening any time soon. Do you? If you do which one of these can do it?


Having the Best doesn't necessarily work if you don't have the knowledge that supports it

October 10, 2014 Dave Ohara

F1 Racing is the most technically advanced racing out there. More money and more technology are thrown at winning than in any other racing. Back in the early 90s, when I was working at Microsoft, a bunch of us would get together at somebody’s house at 6 a.m. on Sunday morning to watch the European F1 races. One guy was so into F1, he quit Microsoft and joined the Ferrari race team to work on the computer systems in the cars.

McLaren dumped Mercedes engines for Honda for the 2015 season, and part of the reason is that McLaren wanted the source code for the engine systems.

"A modern grand prix engine at this moment in time is not just about sheer power; it's about how you harvest the energy, store the energy and effectively if you don't have control of that process - meaning access to source code - then you are not going to be able to stabilise your car in the entry to corners, for instance, and you lose lots of lap time. So even though you have the same brand of engine you do not have the ability to optimise the engine."

I have been away from following F1, but 2015 might be when I start again. Here is a video Honda released on its 2015 engine. Honda has bet on one team, McLaren, to win, which means they’ll be sharing everything they can to get the most performance out of their engine.