It is currently Wed Feb 21, 2018 12:17 pm

All times are UTC - 8 hours [ DST ]




Post new topic This topic is locked, you cannot edit posts or make further replies.  [ 29 posts ]  Go to page Previous  1, 2
Author Message
PostPosted: Tue Mar 11, 2014 11:30 am 
Offline
Tv Watcher
Tv Watcher

Joined: Tue Sep 02, 2003 7:21 am
Posts: 14981
Location: Colorado
Quote:
Well.....he's just make believe.....we're the real thing. :D

Coca Cola?


 
PostPosted: Tue Mar 11, 2014 11:34 am 
Offline
Site Admin
Site Admin
User avatar

Joined: Wed Jan 08, 2003 5:17 pm
Posts: 1823
Location: Fairhaven, MA
I don't know why I'm so surprised how quickly this train runs off the track. :huh:

_________________
Check These Out: MYSThillarium Volume 1 and other Forumite trips!


 
PostPosted: Tue Mar 11, 2014 4:18 pm 
Offline
Site Admin
Site Admin
User avatar

Joined: Tue Jan 07, 2003 3:24 am
Posts: 2134
Location: California, USA
lswot wrote:
We are number one!

Thanks, Trucker...... :clap: :clap:


Of course you are. :)

Henry J wrote:
And here I thought Will Ryker was number one! (Make it so?)


Nope. He's number 2.
Henry J wrote:
Quote:
Well.....he's just make believe.....we're the real thing. :D

Coca Cola?


:rotfl:

Donahoo wrote:
I don't know why I'm so surprised how quickly this train runs off the track. :huh:


You were surprised?? With this group??? :scratchhead:
Hehe. I'm not surprised.
:coffee:

_________________
You can teach an old dog new tricks. :D
Sometimes.
Forum Host


 
PostPosted: Tue Mar 11, 2014 5:16 pm 
Offline
Tv Watcher
Tv Watcher
User avatar

Joined: Sun Aug 31, 2003 11:53 am
Posts: 13339
Location: California
trucker2000 wrote:
lswot wrote:
We are number one!

Thanks, Trucker...... :clap: :clap:


Of course you are. :)

Henry J wrote:
And here I thought Will Ryker was number one! (Make it so?)


Nope. He's number 2.
Henry J wrote:
Quote:
Well.....he's just make believe.....we're the real thing. :D

Coca Cola?



Donahoo wrote:
I don't know why I'm so surprised how quickly this train runs off the track. :huh:


You were surprised?? With this group??? :scratchhead:
Hehe. I'm not surprised.
:coffee:

:rotfl: :rotfl:

_________________
:beamup: lswot
eccl 2:13

"A Government big enough to give you every thing you want, is big enough to take away every thing you have."
......Thomas Jefferson......


 
PostPosted: Wed Mar 12, 2014 10:09 am 
Offline
Tv Watcher
Tv Watcher

Joined: Tue Sep 02, 2003 7:21 am
Posts: 14981
Location: Colorado
Surprise, surprise, surprise!


 
PostPosted: Wed Mar 12, 2014 11:05 am 
Offline
Tv Watcher
Tv Watcher
User avatar

Joined: Sun Aug 31, 2003 11:53 am
Posts: 13339
Location: California
:lol: it's a good thing we're not in Donahoo's vicinity or we would be getting slaps on our wrists. :bdsmile:

_________________
:beamup: lswot
eccl 2:13

"A Government big enough to give you every thing you want, is big enough to take away every thing you have."
......Thomas Jefferson......


 
PostPosted: Thu Mar 13, 2014 12:05 pm 
Offline
Tv Watcher
Tv Watcher
User avatar

Joined: Tue Sep 02, 2003 6:13 am
Posts: 13506
Location: Ohio
Tracks? What tracks?? I don't see no stinken' tracks!!! :scratchhead:


 
PostPosted: Thu Mar 13, 2014 5:03 pm 
Offline
Tv Watcher
Tv Watcher
User avatar

Joined: Sun Aug 31, 2003 11:53 am
Posts: 13339
Location: California
They're at the end of the tunnel

_________________
:beamup: lswot
eccl 2:13

"A Government big enough to give you every thing you want, is big enough to take away every thing you have."
......Thomas Jefferson......


 
PostPosted: Thu Mar 13, 2014 5:41 pm 
Offline
Site Admin
Site Admin
User avatar

Joined: Wed Jan 08, 2003 5:17 pm
Posts: 1823
Location: Fairhaven, MA
lswot wrote:
They're at the end of the tunnel


No, those are the lights at the end of the tunnel. :huh:

_________________
Check These Out: MYSThillarium Volume 1 and other Forumite trips!


 
PostPosted: Thu Mar 13, 2014 8:06 pm 
Offline
Tv Watcher
Tv Watcher

Joined: Tue Sep 02, 2003 7:21 am
Posts: 14981
Location: Colorado
Wait, the tracks are at the end of the tunnel? On what does the train run between here and there? :lol:


 
PostPosted: Fri Mar 14, 2014 9:57 am 
Offline
Tv Watcher
Tv Watcher
User avatar

Joined: Sun Aug 31, 2003 11:53 am
Posts: 13339
Location: California
....imagination?

_________________
:beamup: lswot
eccl 2:13

"A Government big enough to give you every thing you want, is big enough to take away every thing you have."
......Thomas Jefferson......


 
PostPosted: Fri Mar 14, 2014 10:32 am 
Offline
Site Admin
Site Admin
User avatar

Joined: Wed Jan 08, 2003 5:17 pm
Posts: 1823
Location: Fairhaven, MA
lswot wrote:
:lol: it's a good thing we're not in Donahoo's vicinity or we would be getting slaps on our wrists. :bdsmile:


Every time I come here from the email notice that there is a new post, the first thing I see is that Lock Topic option at the bottom of the page. :lol:

_________________
Check These Out: MYSThillarium Volume 1 and other Forumite trips!


 
PostPosted: Fri Mar 14, 2014 3:44 pm 
Offline
Site Admin
Site Admin
User avatar

Joined: Tue Jan 07, 2003 3:24 am
Posts: 2134
Location: California, USA
For something that started out seriously "Not Funny", this thread has turned into Comedy Central. :rotfl:

I'm sorry to ruin the fun folks, but...it's time.

Locked.

_________________
You can teach an old dog new tricks. :D
Sometimes.
Forum Host


 
PostPosted: Sun Mar 16, 2014 11:29 am 
Offline
Site Admin
Site Admin
User avatar

Joined: Tue Jan 07, 2003 3:24 am
Posts: 2134
Location: California, USA
Unlocked for a moment to post this from my webhost:

Quote:
As many of you know, last week we experienced a major outage, affecting multiple email and database servers for just under 5 days.

It was, without a doubt, the worst technical issue we’ve ever experienced as a company.

Thankfully, all systems were brought back up without any loss of data. While we were happy that we were able to restore our customers’ services, we knew the duration of the incident, as well as the incident itself, was unacceptable.

To make certain that something like this never happens again, we launched a full investigation to determine the root cause of the issue, outline the steps we took to handle it, and help us take preventative measures against this type of problem in the future.

I want to share this report with you, both to satisfy the curiosities of our more tech-savvy customers, and to illustrate the amount of time, work, and research it took to resolve this difficult issue. I also want to share the steps we’ll be taking to prevent this issue in the future, which are included at the end of the report.

So, here is the post-issue follow-up my system administrators delivered to me this morning. I think you’ll find it informative, and I hope it sheds some light on the mess that was last week:

Incident Name: Storage Outage – sas3
Incident Date: 2014-03-02
Report Date: 2014-03-14

Services Impacted:
Storage sas3 on dbmail01
93 shared VMs (mail and MySQL for cp9-11); resources for 30,366 accounts were affected.

Incident Root Cause:
The existing SAS SANs use a RAID50 configuration that consists of two RAID 5 groups (one parity group consisting of the even-numbered drives and one parity group consisting of the odd-numbered drives). The array can handle two disk failures at the same time as long as they are not part of the same parity group. In the case of our outage, drive 6 failed and drive 10 was added to the RAID to rebuild the group. During the rebuild process, drive 0 failed, causing us to lose the even-numbered parity group. This occurred just before 4AM EST on 3/2/2014 and caused the RAID to go into an unrecoverable state. We contacted our hardware vendor’s support line before acting, because there was a large potential for data loss, and were escalated to their engineering team. Total call time was 10 hours.
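
As a rough illustration of the failure rule described above (a sketch only, not the vendor's logic): the function names are made up for this example, and the one assumption taken from the report is that even-numbered slots form one RAID 5 group and odd-numbered slots the other. A drive whose replacement is still rebuilding is counted as failed, since its data is not yet fully restored.

```python
from collections import Counter
from typing import Iterable


def parity_group(drive: int) -> str:
    """Return the RAID 5 group for a drive slot (even slots vs. odd slots)."""
    return "even" if drive % 2 == 0 else "odd"


def array_survives(failed_drives: Iterable[int]) -> bool:
    """A RAID 5 group tolerates one missing member; RAID50 needs every group to survive."""
    failures_per_group = Counter(parity_group(d) for d in failed_drives)
    return all(count <= 1 for count in failures_per_group.values())


print(array_survives({6}))     # True: one failure in the even group
print(array_survives({6, 1}))  # True: one failure in each group
print(array_survives({6, 0}))  # False: two failures in the even group, as in the outage
```

The last call mirrors the incident: drive 0 failed while drive 6 was still being rebuilt, so the even group was missing two members.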

[Image: raidfailure]

Response:
To regain access to the data, we had to manually disable slots 10 and 15 (the spare drives) so that the RAID would not attempt to rebuild. Next, we reseated drive 6, which brought it online, but not as part of the RAID. This allowed the entire RAID to come back online in a degraded state with drive 0 active. Because drive 0 was still failing, we knew the RAID was in a very fragile state and that we had to move forward with great care or we would risk losing data.

Our hardware vendor showed us a binding procedure that allowed us to move the affected volumes from the storage system. We learned that if we triggered another failure in drive 0 at any point during this process, the RAID would go offline and we could lose access to the data. With this in mind, we began to migrate the volumes one at a time, to reduce the stress on drive 0 and thereby the chance of it failing again. We were methodical and deliberate in the way we approached this and thankfully, we were successful in migrating all data from the storage without triggering another failure in drive 0. The process completed, and all customers were back online as of 3/6/2014 just after 6PM EST. The whole process took just under 5 days.

Timeline:
Click here to view the timeline of events, from the initial outage to the final server’s reactivation.

What We’re Doing To Prevent This:
Improve Monitoring

Currently, our automated hardware checks are set to notify us when a storage system has an issue of any type. While sophisticated, they’re not specific enough to tell us what the actual problem is. For instance, if a drive fails, we get a general notification, rather than a ‘drive x has failed’ message. We are looking into using more specific, granular notifications for individual disks.
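
A minimal sketch of what a per-disk notification could look like, as opposed to one generic alert for the whole array. Everything here is hypothetical: the `disk_alerts` helper, the state names ("ok", "failed", "rebuilding"), and the message format are invented for illustration and are not the host's monitoring system.

```python
from typing import Dict, List


def disk_alerts(array_name: str, disk_states: Dict[int, str]) -> List[str]:
    """Return one message per unhealthy disk, naming the slot and its state."""
    alerts = []
    for slot in sorted(disk_states):
        state = disk_states[slot]
        if state != "ok":  # only report disks that need attention
            alerts.append(f"{array_name}: drive {slot} is {state}")
    return alerts


print(disk_alerts("sas3", {0: "ok", 6: "failed", 10: "rebuilding"}))
# ['sas3: drive 6 is failed', 'sas3: drive 10 is rebuilding']
```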

Proactive Hardware Replacement

It may be possible to check via SNMP for things like disk errors on specific disks before they actually fail out of the RAID and trigger a rebuild. This will result in fewer drive failures and fewer rebuilds.
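
One way the proactive check might work, sketched under assumptions: per-disk error counters have already been collected by some poller (the SNMP query itself is not shown, and no particular OID or library is implied), and a disk is flagged when its counter grows between two polls. The `degrading_drives` name and the `max_new_errors` threshold are hypothetical.

```python
from typing import Dict, List


def degrading_drives(
    previous: Dict[int, int],
    current: Dict[int, int],
    max_new_errors: int = 0,
) -> List[int]:
    """Return slots whose error counter grew by more than max_new_errors since the last poll."""
    flagged = []
    for slot in sorted(current):
        # A slot missing from the previous poll is treated as having had zero errors.
        new_errors = current[slot] - previous.get(slot, 0)
        if new_errors > max_new_errors:
            flagged.append(slot)
    return flagged


last_poll = {0: 2, 6: 0, 8: 1}
this_poll = {0: 9, 6: 0, 8: 1}
print(degrading_drives(last_poll, this_poll))  # [0]
```

A flagged drive could then be swapped at a convenient time instead of waiting for it to drop out of the RAID.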

Switch All Arrays to More Stable RAID

Our storage arrays currently use RAID50. Though this is standard, a rebuild can take more than eight hours to complete. That is a window of eight hours or more in which we risk losing a second drive from the same parity group. We can decrease this risk by moving to a RAID10 setup, which rebuilds significantly faster, in about 3 hours.
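
To see why a shorter rebuild window helps, here is a back-of-the-envelope sketch. It assumes drive failures are independent with a constant hourly probability, which real drives (especially drives under rebuild load) do not strictly obey. The hourly rate and the drive counts below are placeholders, not figures from the report.

```python
def loss_risk(rebuild_hours: float, at_risk_drives: int, hourly_failure_prob: float) -> float:
    """Probability that at least one at-risk drive fails before the rebuild finishes.

    Assumes independent failures at a constant hourly probability (a simplification).
    """
    survive_all = (1 - hourly_failure_prob) ** (rebuild_hours * at_risk_drives)
    return 1 - survive_all


HOURLY_RATE = 1e-5  # placeholder value, not a measured failure rate

# Same number of at-risk drives, only the window length differs.
risk_8h = loss_risk(8, at_risk_drives=7, hourly_failure_prob=HOURLY_RATE)
risk_3h = loss_risk(3, at_risk_drives=7, hourly_failure_prob=HOURLY_RATE)
print(risk_8h > risk_3h)  # True: the longer window carries more risk
```

Under these assumptions the risk grows with both the length of the rebuild and the number of drives whose failure would be fatal, so shortening the window from about 8 hours to about 3 reduces exposure.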

Thanks for reading and again, we’re so sorry about this inconvenience.

_________________
You can teach an old dog new tricks. :D
Sometimes.
Forum Host


 