UPDATE: This was mostly a non-event and shouldn’t have impacted many users.
Amazon has given us notice that there will be a maintenance window on the data centers that host the primary cloud servers for the designrelated.com app starting at 11PM EST tonight.
We’ve been working over the last few weeks to build an automation tool to quickly launch new cloud servers for cases like this and some planned growth and changes in 2012.
The estimated window lasts about 6 hours, but we may have an outage of as little as 6 minutes. :)
UPDATE: October 20th, 11:30PM EST
Turns out a wacky kernel failure caused the design:related app to require a rebuild from scratch. As fun as that sounds, we ran through the whole process a couple times and have been testing all of the functionality. The site should start to come back online for most people tomorrow morning, but it might take up to 24 hours to reach everyone. Thanks for your patience!
UPDATE: October 19th, 6:00PM EST
We came up with a plan last night after getting to the root of the problem. Something unusual happened to our server connections due to a hardware crash, which has considerably slowed our efforts to get the site back online. Unfortunately, the process is a bit time-consuming, but we’ll post any additional updates here.
Apparently one of Amazon’s cloud servers crashed last night, which took down the design:related site/service with it. We ran into a snag after their servers restarted, and we’re working to fix it and get things back online as soon as we can. We apologize for the inconvenience!
Designrelated.com is down due to network issues affecting all sites with our service provider. They’re working to correct it ASAP. (See update below) … We are putting up the information as it comes through, and we are very sorry for the inconvenience!
Update 10:00PM EST: The main designrelated.com site started to come back online about two and a half hours ago. We are continuing to monitor things as Rimu Hosting restores service for all of their customers after a major power outage.
“There has been an issue affecting one of our 6 service entrances. The actual ATS (Automatic Transfer Switch) is having an issue and all vendors are on site. Unfortunately, this is affecting service entrance 2 in the 3000 Irving facility so it is affecting a lot of the customers that have been here the longest…
Our electrical contractors, UPS maintenance team and generator contractor are all on-site and working to determine what the best course of action is to get this back up.”
In basic terms, this amounts to a wide-scale power outage that has affected all of their customers who host in Dallas, including design:related, with an unknown ETA for when it will be back online (ETA was posted around 5:30PM EST). Rimu is working on restoring services as soon as they can. Toggl’s support page has a long thread on the work that is currently underway.
We have been working with RimuHosting in some capacity since 2007, and we were really hoping they would be a large-scale solution for design:related’s architecture in 2011 and beyond. Our small team is actively discussing some major changes in the near future so that we can prevent issues like this from occurring again down the road. These are likely to include a transition to more cloud computing and additional server redundancy. We have been using Amazon S3 to store/serve images for some time now, and we will be working on optimizing our app to best complement this as well.
We sincerely apologize for the issues, and we thank you again for your patience as we continue to build and enhance the design:related platform.
Check back for updates here and on Matt Sung’s Twitter.
UPDATE: July 14th, 10:15PM:
After much investigation, we confirmed that the issues with some integral user actions on design:related were related to a hardware/network issue with some servers in the cluster, and we’re doing a bit of maintenance after we had a short outage for 10 minutes or so.
We hope to be back at full capacity once we verify that everything is running smoothly.
UPDATE: July 4th, 11:15PM:
Please note that we will have some needed downtime starting around 1:30AM EST on Tuesday, July 5th due to maintenance required by our hosting company. Our team blog will still be up and running, and we will post any updates here. The network hardware is still having some issues that affect the entire data center, so they will be rebooting all their systems.
Hopefully, it won’t take more than 15 minutes or so. If it takes much longer, it is because either:
A) Our hosting company is spending a little more time than planned on the network maintenance…
B) We are making sure that everything is running smoothly before restoring full service.
Thanks in advance for your patience!
– Your friends @ design:related