Do you have sniper attacks as part of your power outage scenario? The California grid was hit last year

How many of you have thought about the possibility of the power grid feeding your data center being attacked by snipers?  GigaOm's Katie Fehrenbacher has a post on what happened in California.

Snipers took down major grid systems in California last year & it could have been a dress rehearsal

 

SUMMARY:

Turns out cyber security might not be the biggest threat to the power grid after all.

There are cases of power being taken out by hunters who choose to shoot at transformers.

And the Yahoo Mail story continues: Mail is still being recovered

I wrote about how the Yahoo Mail problem seemed like it could be similar to the Danger/T-Mobile Sidekick outage.  A week later, the similarities look even stronger.

Why am I continuing to follow this when almost all the rest of the media has dropped the story?  Because I think the root cause is an operations issue, which is interesting to those who run mission-critical services.  Think about it: a storage system went out that affected 1% of the users.  Yahoo immediately restored from backup, bringing down Mail for many more users.  If only 1% were truly affected, it seems there should have been a better way to restore Mail service.

Ironically, the impact was much more than a 1% mail loss.  Millions of users had no mail for days.

Here is a replay of the events from their log.  On this thread you can see the history.

A description of the problem on Dec 9.

So, what happened?

On Monday, December 9th at 10:27 p.m. PT, our network operating center alerted the Mail engineering team to a specific hardware outage in one of our storage systems serving 1% of our users. The Mail team immediately started working with the storage engineers to restore access and move to our back-up systems, estimating that full recovery would be complete by 1:30 p.m. PT on Tuesday.

Yahoo Mail said it was up and running, with updates from Marissa Mayer and the operations team reporting that restoration was 100% successful.

Update 12/14/13 10:40 am PST

Here are this morning’s updates:

  • Account Access: 99.9% of affected users may access their accounts
  • Outage Message Queue: 100% cleared
  • IMAP access: 100% restored

We're making progress on restoring full access to messages for affected customers and will update again with more information. 

+ Update 12/13/13 5:00 pm PST

We have posted an update on the Yahoo blog here: http://yahoo.tumblr.com/post/69929616860/an-update-on-yahoo-mail

Users were still complaining.  Two days later there was another update that explained the problem with getting to mail.  So even though the queues of mail from Dec 9 onward were cleared, the older mail was not restored.

+ Update 12/16/13 9:00 pm PST

We’ve restored access for users and continue to make progress on recovering email messages, folders and inboxes for those users who are still missing messages in their inbox.

As the engineering team continues the restoration process, we wanted to give a couple answers to the top questions we’re seeing:
 

Q:  “I’m missing emails in my inbox from certain dates, but can see everything else.”
A:  There are three periods of time at question when it comes to message restoration. Message restoration for each period can follow a different timeline.

  • Emails from Dec. 9 - now: 100% of emails during this time period have been delivered
  • Emails from Nov. 25 - Dec 9, 2013: 75% of emails from this period have been restored
  • Emails prior to Nov. 25: 90% of emails from this period have been restored

After Dec 9 you have 100% of your mail.  Before that you have between 75% and 90%.  Understandably, users don't think that counts as mail restored.
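To put numbers on that feeling, here is a minimal sketch using my own illustrative model, not anything Yahoo published: if each older message were restored independently at the quoted per-period rates, the chance a user's inbox is fully intact drops off fast with inbox size.

```python
# Illustrative model only: assumes each message is restored independently
# at the per-period rates Yahoo quoted (75% and 90%). Real restoration
# almost certainly fails in correlated chunks, but the point stands.
def p_complete_inbox(per_message_rate, n_messages):
    """Chance that every one of a user's messages was restored."""
    return per_message_rate ** n_messages

for period, rate in [("Nov 25 - Dec 9", 0.75), ("before Nov 25", 0.90)]:
    for n in (10, 50):
        p = p_complete_inbox(rate, n)
        print(f"{period}: {n} messages -> {p:.1%} chance nothing is missing")
```

Even at a 90% per-message restore rate, a user with 50 older messages has well under a 1% chance of seeing a complete inbox, which is why "90% restored" still reads as "broken" to users.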

And now Yahoo Mail users and Yahoo customer support are in support hell.

Update 12/18/13 12:30 pm PST

Here’s the latest update from us answering some of your questions:


Q:  I’m on hold for a while when I call Customer Care.  What’s happening?
We’ve heard that some users are experiencing longer wait times than usual. We appreciate your patience while we work through a large volume of calls.  We are adding agents quickly to support this large volume of calls. Alternatively, you can click the link to the right here that says “Contact Customer Care.”  We’ll ask you to provide us with a few more details and then will follow up with you.  


Thank you for your patience.

Q:  I still can't access my account, what can I do?
We believe we've restored access for all users related to the outage. If you're having trouble accessing your account, please reach out to customer service so that we can provide you with 1:1 support.

Previous updates

+ Update 12/17/13 2:45 pm PST

We continue to work on recovering email messages, folders and inboxes for users who are still not seeing some messages in their inbox. In the last 24 hours, we've seen an accelerated rate of message recovery for affected users. Additionally, we are reaching out directly to the impacted users with an update specifically related to their accounts. 
 
We believe that we have restored access for all affected users, but if you are still having trouble accessing your account for any reason, please contact Customer Care at 1-800-318-0612.

 

Growth of IT Operations and Automation

35 years ago, when I was getting my degree in Industrial Engineering and Operations Research, I liked manufacturing but realized that going into automotive was not where I wanted to go.  I focused on distribution logistics and going to a high-tech firm.  My first company was HP.  At HP I got really good at distribution logistics, which got me recruited by Apple.

When I started my degree it was the early days of computers and automation.  The PC was just getting going with the Apple II and Commodore 64.  The PC is not nearly as interesting as mobile and the data centers where I spend my time now.  But some things come back, like operations.  One big topic that continues to grow is IT operations and automation.  I like IT operations because it presents the same types of problems as manufacturing operations and distribution logistics.

And one benefit I have over many others is that I have been through a lot of different situations in 35 years of working at HP, Apple, and Microsoft, and of being independent.

One of my ski friends said he has a new job at company X starting in a week.  "Have you heard of them?"  Yep.  I know the co-founders, the VP of marketing, and the CTO.  IT automation is a hot topic and will get bigger.  My wife even chimed in that she knew the company, as she sees the t-shirts all the time.


IT Operations and Automation is a necessity if you are building a cloud.  

FUBAR & SNAFU are apt words for the NSA Utah Data Center's bad habit of frequent arc flash events

WSJ has an article that covers the electrical problems that the NSA data center is having.

What comes to mind are the military acronyms - FUBAR and SNAFU.  

SNAFU is a military slang acronym meaning "Situation Normal: All Fucked Up".

FUBAR stands for "Fucked Up Beyond All Recognition/Repair/Reason" and, like SNAFU and SUSFU, dates from World War II. The Oxford English Dictionary lists Yank, the Army Weekly magazine (1944, 7 Jan. p. 8) as its earliest citation: "The FUBAR squadron. ‥ FUBAR? It means 'Fouled Up Beyond All Recognition.'"  NFG is equipment that is not functional but may or may not be repairable; FUBAR is beyond repair.

Points from the WSJ article are below.  Can you imagine the size of the analysis documents for these outages?  They would probably take weeks to read and make your brain hurt.

Meltdowns Hobble NSA Data Center

Investigators Stumped by What's Causing Power Surges That Destroy Equipment

 

Chronic electrical surges at the massive new data-storage facility central to the National Security Agency's spying operation have destroyed hundreds of thousands of dollars worth of machinery and delayed the center's opening for a year, according to project documents and current and former officials.


There have been 10 meltdowns in the past 13 months that have prevented the NSA from using computers at its new Utah data-storage center, slated to be the spy agency's largest, according to project documents reviewed by The Wall Street Journal.


It sounds like there is a lot of ass-covering and finger-pointing going on.

But another government assessment concluded the contractor's proposed solutions fall short and the causes of eight of the failures haven't been conclusively determined. "We did not find any indication that the proposed equipment modification measures will be effective in preventing future incidents," said a report last week by special investigators from the Army Corps of Engineers known as a Tiger Team.

The architectural firm KlingStubbins designed the electrical system. The firm is a subcontractor to a joint venture of three companies: Balfour Beatty Construction, DPR Construction and Big-D Construction Corp. A KlingStubbins official referred questions to the Army Corps of Engineers.

The joint venture said in a statement it expected to submit a report on the problems within 10 days: "Problems were discovered with certain parts of the unique and highly complex electrical system. The causes of those problems have been determined and a permanent fix is being implemented."

There have been 10 arc flash events since Aug 2012.

The first arc fault failure at the Utah plant was on Aug. 9, 2012, according to project documents. Since then, the center has had nine more failures, most recently on Sept. 25. Each incident caused as much as $100,000 in damage, according to a project official.

It took six months for investigators to determine the causes of two of the failures. In the months that followed, the contractors employed more than 30 independent experts that conducted 160 tests over 50,000 man-hours, according to project documents.
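Rough arithmetic on the figures quoted above, treating the "$100,000 per incident" as an upper bound:

```python
# Back-of-envelope from the WSJ figures: 10 arc flash incidents since
# Aug 2012, each causing "as much as" $100,000 in damage.
incidents = 10
max_damage_per_incident = 100_000  # dollars, upper bound per the article
total_upper_bound = incidents * max_damage_per_incident
print(f"Damage upper bound: ${total_upper_bound:,}")
```

An upper bound of $1,000,000 is consistent with the article's "hundreds of thousands of dollars worth of machinery," since presumably not every incident hit the $100,000 cap.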

 

Who drives your data center? Finance or Operations?

Most of you are hard-core data center folks.  Operating a data center can be a pain, or it can put you in the "zone" so you can focus on the bigger issues of running the data center.  A good analogy is driving a car.  How many of you would like to drive a data center designed by a bunch of finance guys?  This issue is illustrated by Bob Lutz in "Car Guys vs. Bean Counters."  BusinessWeek reviews the book.

In Car Guys, Lutz argues that Detroit’s steady decline can be blamed on the fact that there aren’t enough Bob Lutzes anymore. After legendary designer and car-guy’s-car-guy Bill Mitchell retired as GM’s design chief in 1977, Lutz writes, the balance of power—at the company, in particular, and in Detroit, in general—began shifting from the car guys to the number crunchers. As a consequence, product planners determined which customers to target with a new sedan or wagon; engineers fretted over inexpensive assembly; and managers fretted about cheap mass production. Only at the end were designers summoned to wrap a steel body around a nearly completed vehicle.

The results, Lutz laments, were the not-so-fondly remembered Cadillac Cimarron, GMC Envoy XUV, Pontiac Aztek, and others.

How many of you have walked into a data center and could tell it was driven primarily by number crunchers who didn't have a clue about the electrical, mechanical, or operational issues?  They have a budget.  Hit it.

What the finance guys miss is that you can't reduce outages just by declaring you want X nines of availability.  One of the top things that affects your outages is operations.  And operations is the #1 factor in mean time to repair, which determines how long an outage lasts.
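A quick sketch of why operations (via MTTR) matters so much. The steady-state formula availability = MTBF / (MTBF + MTTR) is standard, but the MTBF and MTTR numbers below are purely illustrative:

```python
# Steady-state availability: MTBF / (MTBF + MTTR).
# Numbers are illustrative: one failure a year (MTBF = 8760 h), with
# repair time as the variable that operations actually controls.
def availability(mtbf_hours, mttr_hours):
    return mtbf_hours / (mtbf_hours + mttr_hours)

def downtime_minutes_per_year(avail):
    return (1.0 - avail) * 365 * 24 * 60

for mttr in (8.0, 1.0):  # slow vs. well-drilled repair, in hours
    a = availability(8760.0, mttr)
    print(f"MTTR {mttr:.0f} h -> availability {a:.5f}, "
          f"~{downtime_minutes_per_year(a):.0f} min of downtime/year")
```

Same failure rate in both cases, but cutting MTTR from 8 hours to 1 hour moves you from roughly three nines to nearly four nines. That improvement comes from operations, not from the design spec.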

Now, there are some technical staff who are like a Tim Allen character: they like the "sweetest" tools that give them a jolt of adrenaline, up their testosterone levels, and give them something new to brag about, driving envy in their friends.


There is a balance between good design and fiscal responsibility.  The days of over-designed data centers are in the past; they are being phased out as too expensive.

Part of the problem, though, is that in an operations guy vs. finance guy contest, the finance guy is almost always better at company politics.

Lutz: This is the downside of being a creative person who does not play the political game too well. If I had, for instance, been a little bit more circumspect in my dealings with Lee Iacocca and perhaps had held my mouth, I might well have been his successor at Chrysler Corporation.

Ryssdal: Seems kind of an easy answer: You shot your mouth too much.

Lutz: I tend to be a person, when I don't know something I say, "I don't know, I'll have to look it up." I think boards like a CEO who is totally buttoned up, has all the figures. People with my personality generally don't make CEO.

How many Data Center Operations Guys are good at politics?