Brian Enos's Forums... Maku mozo!

USPSA WEB SERVER DOWN



The USPSA web server went down at about 4AM UTC-04 today. I am really annoyed, since USPSA pays extra for RAID 5 storage. A drive failed on 4/20, and the server farm staff sent an email I never received asking for permission to replace the failed drive and reboot - which would have been a "non-event" and about 30 minutes (max) downtime.

Unfortunately, although the server farm has my number on file as an emergency contact, as well as Dave Thomas's cell phone, neither of us was called. A second drive failed today, which caused the outage.

We have daily backups, and we are seeing if we can get the RAID array to come up for a bit to extract the current data - if not, it's restore-from-backup time. USPSA subscribes to a premium service that specializes in managing web servers (mytss.com), and they are already working this issue. Those folks are in Uruguay and are hard-core nerds, which means the issue is being actively worked by experts and USPSA does not have to wait until I get home from the day job for action.

This also means mail to @uspsa.org addresses won't go anywhere until this is solved.


Bummer! They'll get it fixed. These things happen, but it's REALLY rare that a second drive on the same server went out shortly after. That's like 1 in a million. LITERALLY.


We have daily backups, and are seeing if we can get the raid array to come up for a bit to extract the current data -

When was the last time the backup was performed? Just curious as I uploaded classifier scores ~1600 CDT yesterday, and wondering if I may have to do it again?


I am really annoyed, since USPSA pays extra for RAID 5 storage. A drive failed on 4/20, and the server farm staff sent an email I never received asking for permission to replace the failed drive and reboot - which would have been a "non-event" and about 30 minutes (max) downtime.

The raid would have hot swap drives so a reboot would not have been necessary unless they were going to do firmware updates as well. They should have put the drive in and let it rebuild with or without permission and addressed any other updates as a separate issue. That's just plain ugly.

Edited by Gregg K

When I set up a RAID 5, I add one more drive as a hot spare (cheap insurance)...then when one drive fails, the system automatically swaps to the spare and rebuilds. Now lose two at the same time (which rarely, if ever, happens) and you are hosed....welcome to restore.

Why should they ask to replace the failed drive? That's a no-brainer...replace it to keep the system up...that's what the customer is paying for.

Rob, I'd be talking to these guys.


*snip*

Rob, I'd be talking to these guys.

Talking? I'd be yelling....

If they man AND own/rent the equipment to USPSA, swapping the drive should have been a non-issue. If they're just managing it, they STILL should have swapped the drive, though that could have been the reason they were supposed to call (which they didn't). Sorry, had to chime in again....LAME!!! This kind of stuff pisses me off.

Heck, for less than $10/month, I rent a web host with unlimited storage space, subdomains, addon domains, email accounts, etc and all my stuff is fully redundant. i.e. if the server my stuff is on crashes, it's already been backed up real-time to a redundant cluster somewhere else.


Meanwhile, back at the thread...

From an email I received this afternoon:

"

Pardon The Interruption

The USPSA server is down, perhaps until Monday or even longer. The issue is a hardware failure and has resulted in the USPSA website going down. USPSA staff email is also down.

USPSA is aware of the issue and is working on correcting it as soon as possible. We apologize for the interruption of service and thank you for your patience.

USPSA Staff

"

Later,

Chuck


I am not familiar with this particular RAID controller or whether it was hot-swap - but the server farm asked for a bit of downtime. Too bad they didn't call when I didn't get the email. It was an Adaptec controller (we're getting a new RAID controller as part of the repair since they want us on a newer model). Not all of the lower-end RAID controllers support things like hot spare (some just have 3 drive slots). RAID6 would have been just the trick. I live in the RAID world so I know more than a little bit about this sort of thing.

Backups are performed daily; however, my fingers remain crossed until I do the restore and see for myself. The hardware is back up and the server farm is finishing some config issues. Once that is done, a restore will be kicked off that should take a few hours, after which there will be some more config work (it's a backup of the data, not a bare-metal recovery backup).

As to the waitlist:

If I am not able to have the server up by Sunday, the waitlist opening will be delayed exactly one week. (This has been cleared with USPSA HQ, so I will be able to post info immediately if this delay is implemented.) I'm hoping that won't be necessary; however, this is the best procedure to make sure everyone gets a good shot at the waitlist. I have always grabbed a copy of the database from the server to my PC about 2 hours after the waitlist opens, and then again the next day, to give that particular data a bit of extra protection.

And yes, I most certainly am going to have a talk with the server farm folks. There is no excuse for me not getting a phone call (there are multiple emergency numbers on file with the company, including my home # and Dave Thomas's cell #). The only thing I haven't decided is if the overnight letter goes to the CEO or the VP of customer service and technology. I'm waiting until the issue is resolved and the server is back up before taking that step.

Now lose two at the same time (which rarely, if ever, happens) and you are hosed.

Not with RAID6 (that allows you to lose two drives), however, RAID6 setups tend to do more in software and less in the drive itself, so there are often performance implications.
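For anyone following the RAID 5 vs. RAID 6 point, here is a minimal sketch of why single parity survives one lost drive but not two. It is a simplification: the three data blocks are made up, and a real RAID 5 array rotates parity across drives and works on much larger stripes. RAID 6 adds a second, independently computed parity block, which is what lets it ride out a second failure.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical data blocks (one per drive) plus a single parity block.
data = [b"USPS", b"A_WE", b"BSRV"]
parity = xor_blocks(data)

# One drive lost: XOR of the surviving data blocks and parity restores it.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]

# Two drives lost: the one parity equation only yields the XOR of the two
# missing blocks, not each block on its own, so the data is gone.
leftover = xor_blocks([data[2], parity])
assert leftover == xor_blocks([data[0], data[1]])
```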


{snip}

And yes, I most certainly am going to have a talk with the server farm folks. There is no excuse for me not getting a phone call (there are multiple emergency numbers on file with the company, including my home # and Dave Thomas's cell #). The only thing I haven't decided is if the overnight letter goes to the CEO or the VP of customer service and technology. I'm waiting until the issue is resolved and the server is back up before taking that step.

Now lose two at the same time (which rarely, if ever, happens) and you are hosed.

Not with RAID6 (that allows you to lose two drives), however, RAID6 setups tend to do more in software and less in the drive itself, so there are often performance implications.

I would go straight to the CEO, on the grounds that his/her company has a broken process that needs his/her attention: a simple hard drive replacement would have prevented this, and the employees failed to escalate in a timely manner. Just my $0.02.


RAID 6 or DP can also be done in hardware. How much data is backed up daily? For a few extra dollars, mirrored sites can be set up. A single server or RAID controller is always a single point of failure.

Suggest that they put up a simple website that indicates the site is down and will be up early next week?

Good luck with Sr. Management; you're going to get "we're sorry" and a credit for the month, unless you have a Service Level Agreement (SLA) that has teeth in it.

PS: IM or Email me if I can help, I shoot on the evenings and weekends, build complex systems during the day ;)

Edited by Olivers_AR

Putting up a temp site is problematic since restore work is active on the server.

Mirrored sites are not as simple as with a purely static site, since the web site "self modifies". A mirrored site for a static web site is easy, and also not that hard for a system where all dynamic content is database driven. It's a bit trickier with CMS and "self modification". Clustered systems either rely on duplication of data on multiple nodes, or use SAN or NAS storage so nodes can access the same datastore when they switch over.

Classifier uploads go to files downloaded by USPSA HQ; content including photos is managed by a custom content management system; and certain other files are regularly updated. It's easy to replicate or cluster a database; it's a bit trickier with the file system.

I could use cloud-based storage so that in the event of a hardware failure, I just get another server and configure the FS mount point to the cloud - but even that has a failure possibility. There was a big news item recently where Amazon had a cloud failure, with some loss of customer data.

What I am interested in hearing from the server farm is whether they even consider the failure to make a phone call in a situation like this a process failure, and whether they are going to do something to improve their process. I really don't care about getting a bit of service fee back - I do care about having decent assurances that they can learn from this mistake ... which can only happen if they consider it a mistake. If they tell me "email is sufficient notice, we did exactly what we should, and exactly what we will do if it happens again", I may have some shopping to do.


It depends on how much content changes on the FS. Even if one took snapshots once an hour from site A to site B, site B would at worst be one hour behind; stand up another server, physical or virtual, and at worst one hour of data is lost. The content would be available immediately vs. having to restore (my assumption is from tape). Things like uploaded scores, photos, etc. could be separately replicated to a holding area in the second site, so at least they are captured prior to being ingested by the CMS.
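A rough sketch of the holding-area idea, for illustration only. The function name `stage_upload` and the directory layout are hypothetical, not anything in the USPSA CMS; in practice `holding_dir` would sit on storage at the second site, for example a network mount.

```python
import hashlib
import shutil
import time
from pathlib import Path

def stage_upload(src, holding_dir):
    """Copy an uploaded file into a holding area before the CMS ingests it.

    Returns the path of the staged copy. The name carries a UTC timestamp and
    a short content hash so that distinct uploads do not overwrite each other.
    """
    src = Path(src)
    holding_dir = Path(holding_dir)
    holding_dir.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(src.read_bytes()).hexdigest()[:12]
    stamp = time.strftime("%Y%m%dT%H%M%S", time.gmtime())
    dest = holding_dir / f"{stamp}_{digest}_{src.name}"
    shutil.copy2(src, dest)  # copy2 also preserves the file's timestamps
    return dest
```

Called once per upload, before the CMS processes the file, this leaves an untouched copy that can be re-ingested after a restore.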

Yep the Amazon glitch has made people say, this cloud concept isn't bulletproof.


<update> Yippie Kay Yay - the server farm staff found and fixed the problem, data is moving back. There is work to be done after the data is back, but this is the most critical (and scariest) part of the process. I expect we will lose less than one day's worth of updates to the web site. </update>

This is frustrating. The backups were configured correctly, but the restore program installed on the server errors out. I have escalated to the vendor, but this could take time as I don't know if the backup vendor offers weekend tech support at the level I need.

As to the suggestions - yes, I know we could rewrite the CMS system to put the uploaded files in a holding area; configure multi-system synchronizations; cluster the database; etc. I know how to do all these things, but it takes one of two things - time or money. I don't have the time to dedicate all of my USPSA efforts to sysadmin work, and USPSA does not have tech staff on board to implement all the suggestions. Of course, if you're offering to write the code, make sure all current processes and features work, configure the systems, and monitor it to make sure it continues to operate, HQ would like to talk to you :rolleyes: In other words, we do the best with the resources we have.

It's kind of ironic since I spend my days making sure that a lot of very familiar corporate names have access to quick, reliable disk based backups.


Bummer! They'll get it fixed. These things happen, but it's REALLY rare that a second drive on the same server went out shortly after. That's like 1 in a million. LITERALLY.

Maybe not in Uruguay.


Bummer! They'll get it fixed. These things happen, but it's REALLY rare that a second drive on the same server went out shortly after. That's like 1 in a million. LITERALLY.

Maybe not in Uruguay.

The server is in Texas (Dallas or Houston; not sure which). The server farm provides support, but USPSA also subscribes to www.totalserversolutions.com, also known as "The Uruguayians". The folks at TSS are absolutely exceptional and well worth what USPSA pays to keep them "on call" (well under $100/month).

"Shortly thereafter" was about 3 weeks - which really, really annoys me as timely notice via a call to the emergency contact # for the account would have avoided this problem.
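To put a rough number on the "1 in a million" claim quoted above, here is a back-of-the-envelope sketch. Everything in it is an assumption rather than a fact from this thread: a 3-drive array (so 2 surviving drives), a 3% annual failure rate per drive, and independent failures at a constant rate. Drives bought together often fail together, so if anything this is optimistic.

```python
import math

def p_second_failure(surviving_drives, annual_failure_rate, window_days):
    """Chance that at least one surviving drive fails within the window.

    Assumes independent drives with a constant failure rate (exponential model).
    """
    rate_per_day = -math.log(1.0 - annual_failure_rate) / 365.0
    return 1.0 - math.exp(-surviving_drives * rate_per_day * window_days)

# Assumed inputs: 2 surviving drives, 3% annual failure rate, 21-day window.
p = p_second_failure(surviving_drives=2, annual_failure_rate=0.03, window_days=21)
print(f"{p:.2%} chance of a second failure before the first is replaced")
```

Under those assumptions the result is a few tenths of a percent, nowhere near one in a million, which is why running degraded for three weeks is the real problem.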

At least I am getting to test our restore procedures.


WAITLIST ANNOUNCEMENT

The opening of the nationals waitlist will be delayed exactly one week due to the server issues. There is a decent chance the server will be in operation by the originally scheduled time, however, Kim and I are concerned that someone may not get notice due to the outage and this additional week will give everyone time to plan for the waitlist opening.

The notice on www.uspsa.org will be updated shortly after the site is back in operation.


A single web server with a RAID 5 and no hot spare runs the entire USPSA website?

Sounds like a management and planning fail more than anything. Sure, you can blame the datacenter jockeys, but in reality you set yourself up for failure.

Edited by waktasz

Sorry to hear all this Mr. Boudrie.

I commiserate with this unfortunate occurrence as someone who has experience with the notification failure of multiple hard drive failures in a RAID 5 array that had no hot spares due to budgetary & other resource issues. It stinks when the failsafes you put into place to prevent this exact situation do not work as they should. RAID 5 plus hot spares are nice when the budget & resources allocated allow for it. Occurrences like this tend to expand the budget and fill in the holes in the processes.

It is unfortunate that it was one of the few scheduled times of the year that the Nationals-bound folks would really become aware that the web site was down. :sick: A lot of major sites have gone down unexpectedly.

Good luck (especially with the restore, there are usually problems with unrehearsed restores.) Your work is appreciated.
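One way to rehearse a restore is to restore into a scratch directory and compare it against the tree it was backed up from. This is a minimal sketch of that comparison, not tied to any particular backup product; the function names are made up for illustration, and it reads each file fully into memory, so it only suits modest file sizes.

```python
import hashlib
from pathlib import Path

def tree_digests(root):
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    root = Path(root)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

def compare_restore(original_root, restored_root):
    """Return (missing, changed) file lists for a restored tree vs. the original."""
    before = tree_digests(original_root)
    after = tree_digests(restored_root)
    missing = sorted(set(before) - set(after))
    changed = sorted(k for k in before.keys() & after.keys() if before[k] != after[k])
    return missing, changed
```

An empty `missing` and `changed` means every file that was in the original tree came back byte for byte; files that changed legitimately after the backup ran will show up in `changed` and need a manual look.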


After browsing through this thread, it becomes apparent that the universal rule of manufacturing and repair also applies to web servers:

You can have it done cheap

You can have it done well

You can have it done quickly

You can only have two of the above.

