Rob Boudrie Posted May 6, 2011 The USPSA web server went down at about 4AM UTC-04 today. I am really annoyed, since USPSA pays extra for RAID 5 storage. A drive failed on 4/20, and the server farm staff sent an email I never received asking for permission to replace the failed drive and reboot - which would have been a "non-event" and about 30 minutes (max) downtime. Unfortunately, although the server farm has my number on file as an emergency contact, as well as Dave Thomas' cell phone, neither of us was called. A second drive failed today, resulting in the problem.
We have daily backups, and are seeing if we can get the raid array to come up for a bit to extract the current data - if not, it's restore from backup time. USPSA subscribes to a premium quality service that specializes in managing web servers (mytss.com), and they are already working this issue. Those folks are in Uruguay and are hard core nerds, which means the issue is being actively worked by experts and USPSA does not have to wait until I get home from the day job for action.
The outage also means mail to @uspsa.org addresses won't go anywhere until this is solved.
Erik S. Posted May 6, 2011 Bummer! They'll get it fixed. These things happen, but it's REALLY rare that a second drive on the same server went out shortly after. That's like 1 in a million. LITERALLY.
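A rough way to sanity-check the "1 in a million" figure is to work the arithmetic. The sketch below is only a back-of-envelope estimate: the 3% annual failure rate, the count of two surviving drives, and the assumption that drives fail independently at a constant rate are all hypothetical inputs, not facts about the USPSA server. Only the 16-day window comes from the dates in the opening post (April 20 and May 6).

```python
def second_failure_probability(surviving_drives: int, afr: float, window_days: float) -> float:
    """Chance that at least one surviving drive fails within the window.

    Assumes independent drives with a constant failure rate, which real
    arrays (same batch, same age, same heat) often violate.
    """
    # Probability that a single drive survives the whole window.
    survive_one = (1.0 - afr) ** (window_days / 365.0)
    # The array stays healthy only if every surviving drive makes it.
    return 1.0 - survive_one ** surviving_drives


# Hypothetical inputs: 2 surviving drives, 3% annual failure rate, 16 days.
p = second_failure_probability(surviving_drives=2, afr=0.03, window_days=16)
print(f"about 1 in {round(1 / p)}")
```

Under these assumed numbers the odds land in the hundreds-to-one range rather than a million to one, and correlated failures would make them worse.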
centermass Posted May 6, 2011 "We have daily backups, and are seeing if we can get the raid array to come up for a bit to extract the current data -" When was the last time the backup was performed? Just curious, as I uploaded classifier scores ~1600 CDT yesterday and am wondering if I may have to do it again.
Gregg K Posted May 6, 2011 (edited) "I am really annoyed, since USPSA pays extra for RAID 5 storage. A drive failed on 4/20, and the server farm staff sent an email I never received asking for permission to replace the failed drive and reboot - which would have been a "non-event" and about 30 minutes (max) downtime." The raid would have hot swap drives, so a reboot would not have been necessary unless they were going to do firmware updates as well. They should have put the drive in and let it rebuild with or without permission and addressed any other updates as a separate issue. That's just plain ugly. Edited May 6, 2011 by Gregg K
Mark R Posted May 6, 2011 When I set up a RAID 5, I add one more drive as a Hot Spare (cheap insurance)...then when 1 drive fails, the system automatically swaps to the spare and it rebuilds. Now lose two at the same time (which rarely, if ever happens) and you are hosed....welcome to restore. Why should they ask to replace the failed drive? That's a no-brainer...replace it to keep the system up...that's what the customer is paying for. Rob, I'd be talking to these guys.
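For readers wondering why one failed drive can be rebuilt but two cannot, here is a minimal sketch of the XOR parity idea behind RAID 5. It models a single stripe on a hypothetical 3-drive array; the block contents and the `xor_blocks` helper are made up for illustration, and a real controller works on fixed-size blocks with rotating parity.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: two data blocks plus the parity block computed from them.
data_a = b"classifier"
data_b = b"scores.txt"
parity = xor_blocks([data_a, data_b])

# The drive holding data_a dies: rebuild it from the two survivors.
rebuilt_a = xor_blocks([data_b, parity])
print(rebuilt_a == data_a)
```

With only one block left out of three there is nothing to XOR against, which is why a second failure before the rebuild finishes means restoring from backup. A hot spare does not change that limit; it just starts the rebuild sooner.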
MoNsTeR Posted May 6, 2011 (firmly tongue-in-cheek) http://www.miracleas.com/BAARF/BAARF2.html
Erik S. Posted May 6, 2011 "*snip* Rob, I'd be talking to these guys." Talking? I'd be yelling.... If they manage AND own/rent the equipment to USPSA, swapping the drive should have been a non-issue. If they're just managing it they STILL should have swapped the drive, though that could have been the reason they were supposed to call (which they didn't). Sorry, had to chime in again....LAME!!! This kind of stuff pisses me off. Heck, for less than $10/month, I rent a web host with unlimited storage space, subdomains, addon domains, email accounts, etc. and all my stuff is fully redundant, i.e. if the server my stuff is on crashes, it's already been backed up real-time to a redundant cluster somewhere else.
ChuckS Posted May 6, 2011 Meanwhile, back at the thread... From an email I received this afternoon: " Pardon The Interruption The USPSA server is down, perhaps until Monday or even longer. The issue is a hardware failure and has resulted in the USPSA website going down. USPSA staff email is also down. USPSA is aware of the issue and is working on correcting it as soon as possible. We apologize for the interruption of service and thank you for your patience. USPSA Staff " Later, Chuck
pjb45 Posted May 6, 2011 Does this impact the wait list sign up?
Pat Miles Posted May 7, 2011 "Does this impact the wait list sign up?" If ya can't get to the site ya can't sign up now can ya. Pat
Rob Boudrie (Author) Posted May 7, 2011 I am not familiar with this particular RAID controller or if it was hot-swap - but the server farm asked for a bit of downtime. Too bad they didn't call when I didn't get the email. It was an Adaptec controller (we're getting a new raid controller as part of the repair since they want us on a newer model). Not all of the lower end raid controllers support things like hot spare (some just have 3 drive slots). RAID6 would have been just the trick. I live in the RAID world so I know more than a little bit about this sort of thing.
Backups are performed daily; however, my fingers remain crossed until I do the restore and see for myself. The hardware is back up and the server farm is finishing some config issues. Once that is done, a restore will be kicked off that should take a few hours, after which there will be some more config work (it's a backup of the data, not a bare metal recovery backup).
As to the waitlist: If I am not able to have the server up by Sunday, the waitlist opening will be delayed exactly one week. (This has been cleared with USPSA HQ, so I will be able to post info immediately if this delay is implemented.) I'm hoping that won't be necessary; however, this is the best procedure to make sure everyone gets a good shot at the waitlist. I have always grabbed a copy of the database from the server to my PC about 2 hours after the waitlist opens, and then again the next day, to give that particular data a bit of extra protection.
And yes, I most certainly am going to have a talk with the server farm folks. There is no excuse for me not getting a phone call (there are multiple emergency numbers on file with the company, including my home # and Dave Thomas's cell #). The only thing I haven't decided is if the overnight letter goes to the CEO or the VP of customer service and technology. I'm waiting until the issue is resolved and the server is back up before taking that step.
"Now lose two at the same time (which rarely, if ever happens) and you are hosed." Not with RAID6 (that allows you to lose two drives); however, RAID6 setups tend to do more in software and less in the drive itself, so there are often performance implications.
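The RAID 5 versus RAID 6 trade-off mentioned here can be put in a few lines. This is a simplified sketch, not a model of any particular controller: the `RaidLayout` class and the 4-drive array are hypothetical, and it captures only the capacity cost and the number of simultaneous failures tolerated, not the performance implications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RaidLayout:
    name: str
    parity_drives: int  # drives' worth of capacity spent on redundancy

    def usable_drives(self, total_drives: int) -> int:
        """Drives' worth of capacity left for data."""
        return total_drives - self.parity_drives

    def survives(self, failed_drives: int) -> bool:
        """Data survives while failures do not exceed the parity count."""
        return failed_drives <= self.parity_drives


RAID5 = RaidLayout("RAID 5", parity_drives=1)
RAID6 = RaidLayout("RAID 6", parity_drives=2)

# Hypothetical 4-drive array that has lost two drives.
for layout in (RAID5, RAID6):
    print(layout.name, layout.usable_drives(4), layout.survives(2))
# prints "RAID 5 3 False" then "RAID 6 2 True"
```

On the same four drives, RAID 6 gives up one more drive of capacity in exchange for surviving the double failure described in this thread.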
adively Posted May 7, 2011 "{snip} And yes, I most certainly am going to have a talk with the server farm folks. There is no excuse for me not getting a phone call (There are multiple emergency numbers on file, including my home # and Dave Thomas's cell # on file with the company). The only thing I haven't decided is if the overnight letter goes to the CEO or the VP of customer service and technology. I'm waiting until the issue is resolved and the server is back up before taking that step. 'Now lose two at the same time (which rarely, if ever happens) and you are hosed.' Not with RAID6 (that allows you to lose two drives), however, RAID6 setups tend to do more in software and less in the drive itself, so there are often performance implications." I would go straight to the CEO: his/her company has a broken process that needs his/her attention, since a simple hard drive replacement would have prevented this from happening and the employees failed to escalate the resolution in a timely manner. Just my $0.02.
Olivers_AR Posted May 7, 2011 (edited) Raid 6 or DP can also be done in hardware. How much data is backed up daily? For a few extra dollars, mirrored sites can be set up. A single server or raid controller is always a single point of failure. Suggest that they put up a simple website that indicates the site is down and will be up early next week? Good luck with Sr. Management; you're going to get a "we're sorry" and a credit for the month, unless you have a Service Level Agreement (SLA) that has teeth in it. PS: IM or email me if I can help. I shoot on the evenings and weekends and build complex systems during the day. Edited May 7, 2011 by Olivers_AR
Rob Boudrie (Author) Posted May 7, 2011 Putting up a temp site is problematic since restore work is active on the server. Mirrored sites are not as simple as with a purely static site, since the web site "self modifies". A mirrored site for a static web site is easy, and also not that hard for a system where all dynamic content is database driven. It's a bit trickier with CMS and "self modification". Clustered systems either rely on duplication of data on multiple nodes, or use SAN or NAS storage so nodes can access the same datastore when they switch over.
Classifier uploads go to files downloaded by USPSA HQ; content including photos is managed by a custom content management system; and certain other files are regularly updated. It's easy to replicate or cluster a database; a bit trickier with the file system. I could use cloud based storage so that in the event of a hardware failure, I just get another server and configure the FS mount point to the cloud - but even that has a failure possibility. There was a big news item recently where Amazon had a cloud failure, with some loss of customer data.
What I am interested in learning from the server farm is whether they even consider the failure to make a phone call in a situation like this a process failure, and whether they are going to do something to improve their process. I really don't care about getting a bit of service fee back - I do care about having decent assurances that they can learn from this mistake ... which can only happen if they consider it a mistake. If they tell me "email is sufficient notice, we did exactly what we should, and exactly what we will do if it happens again", I may have some shopping to do.
Olivers_AR Posted May 7, 2011 It depends on how much content changes on the FS. Even if one took snapshots only once an hour from site A to site B, site B would at worst be one hour behind: stand up another server, physical or virtual, and at worst one hour of data is lost. The content would be available immediately vs having to restore (assumption is from tape). Things like uploaded scores, photos, etc. could be separately replicated to a holding area in the second site, so at least it's captured prior to being ingested by the CMS. Yep, the Amazon glitch has made people say this cloud concept isn't bulletproof.
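The "at worst one hour behind" reasoning can be checked with a small calculation. The sketch below is illustrative only: the function name, the schedule start time, and the exact failure time are hypothetical, and it assumes copies complete instantly on a fixed schedule.

```python
from datetime import datetime, timedelta


def data_loss_window(failure_time: datetime, first_copy: datetime, interval: timedelta) -> timedelta:
    """Time between the last completed copy and the failure."""
    elapsed = failure_time - first_copy
    completed = elapsed // interval  # whole intervals finished before the failure
    last_copy = first_copy + completed * interval
    return failure_time - last_copy


# Hypothetical schedule start and failure time.
first_copy = datetime(2011, 5, 1, 1, 30)
failure = datetime(2011, 5, 6, 4, 0)

daily_loss = data_loss_window(failure, first_copy, timedelta(days=1))
hourly_loss = data_loss_window(failure, first_copy, timedelta(hours=1))
print(daily_loss, hourly_loss)
```

The loss can never exceed the copy interval, so moving from daily backups to hourly snapshots caps the exposure at one hour instead of one day.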
LexTalionis Posted May 7, 2011 Also all the club sites that use USPSA's hosting are down as well. I have a match this morning, and my guys aren't able to get last-minute info. :/
Rob Boudrie (Author) Posted May 7, 2011 <update> Yippie Kay Yay - the server farm staff found and fixed the problem, data is moving back. There is work to be done after the data is back, but this is the most critical (and scariest) part of the process. I expect we will lose less than one day's worth of updates to the web site. </update>
This is frustrating. The backups were configured correctly, but the restore program installed on the server errors out. I have escalated to the vendor, but this could take time as I don't know if the backup vendor offers weekend tech support at the level I need.
As to the suggestions - yes, I know we could rewrite the CMS system to put the uploaded files in a holding area; configure multi-system synchronizations; cluster the database; etc. I know how to do all these things, but it takes one of two things - time or money. I don't have the time to dedicate all of my USPSA efforts to sysadmin work, and USPSA does not have tech staff on board to implement all the suggestions. Of course, if you're offering to write the code, make sure all current processes and features work, configure the systems, and monitor it to make sure it continues to operate, HQ would like to talk to you.
In other words, we do the best with the resources we have. It's kind of ironic since I spend my days making sure that a lot of very familiar corporate names have access to quick, reliable disk based backups.
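Since the restore step is where things went wrong here, one generic way to rehearse a restore is to compare checksums of the live files against a trial restore. The sketch below is not the backup vendor's tool or procedure; the function names and directory layout are hypothetical, and it reads whole files into memory, which is fine for a rehearsal on modest data but not for very large files.

```python
import hashlib
from pathlib import Path


def checksum_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def restore_problems(original: dict[str, str], restored: dict[str, str]) -> list[str]:
    """List files that are missing or differ after a trial restore."""
    problems = []
    for name, digest in original.items():
        if name not in restored:
            problems.append(f"missing: {name}")
        elif restored[name] != digest:
            problems.append(f"changed: {name}")
    return problems
```

Running `restore_problems(checksum_manifest(live_dir), checksum_manifest(trial_restore_dir))` on a scratch server would return an empty list when the trial restore matches, and files changed since the backup was taken will show up as expected differences.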
Carmoney Posted May 7, 2011 "Bummer! They'll get it fixed. These things happen, but it's REALLY rare that a second drive on the same server went out shortly after. That's like 1 in a million. LITERALLY." Maybe not in Uruguay.
Rob Boudrie (Author) Posted May 7, 2011 "Bummer! They'll get it fixed. These things happen, but it's REALLY rare that a second drive on the same server went out shortly after. That's like 1 in a million. LITERALLY." "Maybe not in Uruguay." The server is in Texas (Dallas or Houston; not sure which). The server farm provides support, but USPSA also subscribes to www.totalserversolutions.com, also known as "The Uruguayians". The folks at TSS are absolutely exceptional and well worth what USPSA pays to keep them "on call" (well under $100/month). "Shortly thereafter" was about 3 weeks - which really, really annoys me, as timely notice via a call to the emergency contact # for the account would have avoided this problem. At least I am getting to test our restore procedures.
Rob Boudrie (Author) Posted May 7, 2011 WAITLIST ANNOUNCEMENT The opening of the nationals waitlist will be delayed exactly one week due to the server issues. There is a decent chance the server will be in operation by the originally scheduled time; however, Kim and I are concerned that someone may not get notice due to the outage, and this additional week will give everyone time to plan for the waitlist opening. The notice on www.uspsa.org will be updated shortly after the site is back in operation.
spanky Posted May 8, 2011 That is a repair issue, not a maintenance issue and, IMO, should have been fixed immediately and without permission.
waktasz Posted May 8, 2011 (edited) A single web server with a RAID 5 and no hot spare runs the entire USPSA website? Sounds like a management and planning fail more than anything. Sure you can blame the datacenter jockeys, but in reality you set yourself up for failure. Edited May 8, 2011 by waktasz
spanky Posted May 8, 2011 I hope this doesn't delay my lifer upgrade I did last week.
furyalecto Posted May 8, 2011 Sorry to hear all this, Mr. Boudrie. I commiserate with this unfortunate occurrence as someone who has experience with the notification failure of multiple hard drive failures in a Raid 5 array that had no hot spares due to budgetary & other resource issues. It stinks when the failsafes you put into place to prevent this exact situation do not work as they should. Raid 5 plus hot spares are nice when the budget & resources allocated allow for it. Occurrences like this tend to expand the budget and fill in the holes in the processes. It is unfortunate that it was one of the few scheduled times of the year that the Nationals bound folks would really become aware that the web site was down. A lot of major sites have gone down unexpectedly. Good luck (especially with the restore, there are usually problems with unrehearsed restores). Your work is appreciated.
open17 Posted May 8, 2011 After browsing through this thread, it becomes apparent that the universal rule of manufacturing and repair also applies to web servers:
You can have it done cheap
You can have it done well
You can have it done quickly
You can only have two of the above.