How Facebook Ships Code, hints of how their data centers work

My wife and I just watched The Social Network on DVD. We have a 9-year-old and a 6-year-old, so going out to the movies together is rare.

My wife worked in sales for companies like IDG and Ziff Davis, with clients like Intel, Palm, and Microsoft. She has seen SW developers, but was never really exposed to their world. Watching The Social Network was entertaining, and it is a Hollywood spin on SW culture. She was amazed at how the SW developers were portrayed, and at their focus on writing code. The bit of irony is that, having worked on the Apple Mac OS team and the Microsoft Windows team, I could recognize behaviors that reminded me of the days when I was much younger and would work in the same kind of mode Facebook was portrayed in. The reality, not the Hollywood version.

So what is it like in Facebook Development? Here is a post on How Facebook Ships Code.

How Facebook Ships Code

January 17, 2011 — yeeguy

I’m fascinated by the way Facebook operates.  It’s a very unique environment, not easily replicated (nor would their system work for all companies, even if they tried).  These are notes gathered from talking with many friends at Facebook about how the company develops and releases software.

Seems like others are also interested in Facebook…   The company’s developer-driven culture is coming under greater public scrutiny and other companies are grappling with if/how to implement developer-driven culture.   The company is pretty secretive about its internal processes, though.  Facebook’s Engineering team releases public Notes on new features and some internal systems, but these are mostly “what” kinds of articles, not “how”…  So it’s not easy for outsiders to see how Facebook is able to innovate and optimize their service so much more effectively than other companies.  In my own attempt as an outsider to understand more about how Facebook operates, I assembled these observations over a period of months.  Out of respect for the privacy of my sources, I’ve removed all names and mention of specific features/products.  And I’ve also waited for over six months to publish these notes, so they’re surely a bit out-of-date.   I hope that releasing these notes will help shed some light on how Facebook has managed to push decision-making “down” in its organization without descending into chaos…  It’s hard to argue with Facebook’s results or the coherence of Facebook’s product offerings.  I think and hope that many consumer internet companies can learn from Facebook’s example.

Over the last 2 years I have had friends interview at Facebook. Most turned down working there: they were being recruited for senior engineering manager positions, and they couldn’t see how they could do the job and be successful.

The post has lots of information. Here are the parts that give you an idea of how Facebook thinks about its SW, which in turn influences its hardware and data centers.

engineers generally want to work on infrastructure, scalability and “hard problems” — that’s where all the prestige is.  can be hard to get engineers excited about working on front-end projects and user interfaces.  this is the opposite of what you find in some consumer businesses where everyone wants to work on stuff that customers touch so you can point to a particular user experience and say “I built that.”  At facebook, the back-end stuff like news feed algorithms, ad-targeting algorithms, memcache optimizations, etc. are the juicy projects that engineers want.

Note that the reference above to consumer businesses can be read as implying Apple.

Additional information backs up the focus on infrastructure.

  • as of June 2010, the company has nearly 2000 employees, up from roughly 1100 employees 10 months ago.  Nearly doubling staff in under a year!
  • the two largest teams are Engineering and Ops, with roughly 400-500 team members each.  Between the two they make up about 50% of the company.
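
The headcount figures in the two bullets above can be checked with quick arithmetic. This is only a sketch using the rounded numbers quoted in the post; the variable names are mine.

```python
# Rounded figures quoted in the post (as of June 2010).
headcount_now = 2000      # "nearly 2000 employees"
headcount_before = 1100   # "roughly 1100 employees 10 months ago"

# Growth over the 10-month period.
growth = headcount_now / headcount_before - 1

# Engineering and Ops are each quoted at roughly 400-500 people.
combined_low = 2 * 400
combined_high = 2 * 500
share_low = combined_low / headcount_now
share_high = combined_high / headcount_now

print(f"growth over 10 months: {growth:.0%}")
print(f"Engineering + Ops share: {share_low:.0%} to {share_high:.0%}")
```

With these inputs, growth comes out at about 82%, so "nearly doubling" is a fair description, and "about 50% of the company" holds only at the upper end of the quoted team sizes.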

More details come out in the description of the release process.

  • by default all code commits get packaged into weekly releases (tuesdays)
  • with extra effort, changes can go out same day
  • tuesday code releases require all engineers who committed code in that week’s release candidate to be on-site
  • engineers must be present in a specific IRC channel for “roll call” before the release begins or else suffer a public “shaming”
  • ops team runs code releases by gradually rolling code out
    • facebook has around 60,000 servers
    • there are 9 concentric levels for rolling out new code
    • [CORRECTION thx epriest] “The nine push phases are not concentric. There are three concentric phases (p1 = internal release, p2 = small external release, p3 = full external release). The other six phases are auxiliary tiers like our internal tools, video upload hosts, etc.”
    • the smallest level is only 6 servers
    • e.g., new tuesday release is rolled out to 6 servers (level 1), ops team then observes those 6 servers and makes sure that they are behaving correctly before rolling forward to the next level.
    • if a release is causing any issues (e.g., throwing errors, etc.) then push is halted.  the engineer who committed the offending changeset is paged to fix the problem.  and then the release starts over again at level 1.
    • so a release may go thru levels repeatedly:  1-2-3-fix. back to 1. 1-2-3-4-5-fix.  back to 1.  1-2-3-4-5-6-7-8-9.
  • ops team is really well-trained, well-respected, and very business-aware.  their server metrics go beyond the usual error logs, load & memory utilization stats — also include user behavior.  E.g., if a new release changes the percentage of users who engage with Facebook features, the ops team will see that in their metrics and may stop a release for that reason so they can investigate.
  • during the release process, ops team uses an IRC-based paging system that can ping individual engineers via Facebook, email, IRC, IM, and SMS if needed to get their attention.  not responding to ops team results in public shaming.
  • once code has rolled out to level 9 and is stable, then done with weekly push.
  • if a feature doesn’t get coded in time for a particular weekly push, it’s not that big a deal (unless there are hard external dependencies) — features will just generally get shipped whenever they’re completed.
  • getting svn-blamed, publicly shamed, or slipping projects too often will result in an engineer getting fired.  “it’s a very high performance culture”.  people that aren’t productive or aren’t super talented really stick out.  Managers will literally take poor performers aside within 6 months of hiring and say “this just isn’t working out, you’re not a good culture fit”.  this actually applies at every level of the company, even C-level and VP-level hires have been quickly dismissed if they aren’t super productive.
  • [CORRECTION, thx epriest] “People do not get called out for introducing bugs. They only get called out if they ask for changes to go out with the release but aren’t around to support them in case something goes wrong (and haven’t found someone to cover for you).”
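
The "roll forward, halt on a problem, fix, start over at level 1" loop described above can be sketched in a few lines of Python. This is only an illustration of the process as the post describes it, not Facebook's actual tooling: `run_release`, `level_is_healthy`, and `max_attempts` are hypothetical names, the health check stands in for whatever the ops team watches, and the sketch treats the nine levels as a simple sequence even though the correction above says only three of the phases are really concentric.

```python
from typing import Callable, List


def run_release(num_levels: int,
                level_is_healthy: Callable[[int, int], bool],
                max_attempts: int = 10) -> List[str]:
    """Simulate a staged push and return a trace of what happened.

    level_is_healthy(attempt, level) is a hypothetical check standing in
    for the ops team observing the servers at that level.
    """
    trace: List[str] = []
    for attempt in range(max_attempts):
        for level in range(1, num_levels + 1):
            trace.append(str(level))
            if not level_is_healthy(attempt, level):
                # Push is halted, the engineer fixes the problem,
                # and the release starts over again at level 1.
                trace.append("fix")
                break
        else:
            # Every level was healthy: the weekly push is done.
            return trace
    raise RuntimeError("release did not stabilize within max_attempts")


# Reproduce the example from the post: a problem at level 3 on the first
# try, a problem at level 5 on the second try, then a clean push.
failures = {(0, 3), (1, 5)}
trace = run_release(9, lambda attempt, level: (attempt, level) not in failures)
print("-".join(trace))
# prints 1-2-3-fix-1-2-3-4-5-fix-1-2-3-4-5-6-7-8-9
```

The point of restarting at level 1 is that a fix is itself new code, so it gets the same small-blast-radius treatment as the original change.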

Note that the 60,000 server count is not accurate based on my research. The real number is at least twice that, with another 1/3 growth expected in the short term before the Prineville DC comes online.

Would you want to be a senior executive hired for this environment?  Now you can see why a lot of my friends turned down jobs.