Posts tagged downtime

Scheduled downtime for maintenance Monday, December 19th

UPDATE: This was mostly a non-event and shouldn’t have impacted many users.

Amazon has given us notice that there will be a maintenance window on the data centers that host the primary cloud servers for the designrelated.com app starting at 11PM EST tonight.

We’ve been working over the last few weeks to build an automation tool to quickly launch new cloud servers for cases like this and some planned growth and changes in 2012.

The estimated window lasts about 6 hours, but we may have an outage of as little as 6 minutes. :)

@MattSung

Amazon server crash = bad times

UPDATE: October 20th, 11:30PM EST

Turns out a wacky kernel failure caused the design:related app to require a rebuild from scratch. As fun as that sounds, we ran through the whole process a couple times and have been testing all of the functionality. The site should start to come back online for most people tomorrow morning, but it might take up to 24 hours to reach everyone. Thanks for your patience!

__

UPDATE: October 19th, 6:00PM EST

We came up with a plan last night after getting to the root of the problem. Something unusual happened to our server connections due to a hardware crash, which has slowed things down considerably in terms of getting the site back online. Unfortunately, the process is a bit time-consuming, but we’ll post any additional updates here.

__

Apparently one of Amazon’s cloud servers crashed last night, which took down the design:related site/service with it. We ran into a snag after their servers restarted, and we’re working on fixing this ASAP to get things back online as soon as we can. We apologize for the inconvenience!

Network issues affecting design:related along with all of the hosting company’s Dallas sites

Designrelated.com is down due to network issues affecting all sites with our service provider. They’re working to correct it ASAP. (See update below) … We are putting up the information as it comes through, and we are very sorry for the inconvenience!

___

Update 10:00PM EST: The main designrelated.com site started to come back online about two and a half hours ago. We are continuing to monitor things as Rimu Hosting restores service for all of their customers after a major power outage event.

___

Update 3:00PM EST: Our service provider, @rimuhosting, has posted a tweet linking to a forum thread about what’s going on with the server farm in Dallas.


“There has been an issue affecting one of our 6 service entrances. The actual ATS (Automatic Transfer Switch) is having an issue and all vendors are on site. Unfortunately, this is affecting service entrance 2 in the 3000 Irving facility so it is affecting a lot of the customers that have been here the longest…

Our electrical contractors, UPS maintenance team and generator contractor are all on-site and working to determine what the best course of action is to get this back up.”


In basic terms, this amounts to a wide-scale power outage that has affected all of their customers who host in Dallas, including design:related, with an unknown ETA for when it will be back online (ETA was posted around 5:30PM EST). Rimu is working on restoring services as soon as they can. Toggl’s support page has a long thread on the work that is currently underway.


We have been working with RimuHosting in some capacity since 2007, and we were really hoping they would be a large-scale solution for design:related’s architecture in 2011 and beyond. Our small team is actively discussing making some major changes in the near future so that we can prevent issues like this from occurring again down the road. This is likely to include a transition to more cloud computing and additional server redundancy. We have been using Amazon S3 to store/serve images for some time now, and we will be working on optimizing our app to best complement this as well.

 

We sincerely apologize for the issues, and we thank you again for your patience as we continue to build and enhance the design:related platform.

-Matt

___

Check back for updates here and on Matt Sung’s twitter.

Scheduled downtime for maintenance

UPDATE: July 14th, 10:15PM:

After much investigation, we confirmed that the issues with some integral user actions on design:related were related to a hardware/network issue with some servers in the cluster, and we’re doing a bit of maintenance after we had a short outage for 10 minutes or so. 

We hope to be back at full capacity once we verify that everything is running smoothly.

__

UPDATE: July 4th, 11:15PM:

Please note that we will have some needed downtime starting around 1:30AM EST on Tuesday, July 5th due to maintenance required by our hosting company. Our team blog will still be up and running, and we will post any updates here. The network hardware is still having some issues that affect the entire data center, so they will be rebooting all their systems.

Hopefully, it won’t take more than 15 minutes or so. If it takes much longer, it will be because:

A) Our hosting company is spending a little more time than planned on the network maintenance…

And/or:

B) We are making sure that everything is running smoothly before restoring full service.

Thanks in advance for your patience!

– Your friends @ design:related

NOTE: We will have some planned downtime starting around 11pm EST on Tuesday, June 28th due to maintenance planned by our hosting company.
UPDATE: June 26, 2011 – 8:10PM EST: 
The site’s service has been restored since around 7:30PM EST after we got our backups back online and ran some tests on the replacement hardware.
—
UPDATE: June 26, 2011 – 6:30PM EST: 
We had to take the site offline temporarily to do some server maintenance relating to the hardware changes we had to make early Saturday morning (read below). We are working to get the site and its services back online just as soon as we can.
Thanks for your patience!
—
UPDATE: June 25, 2011 – 4AM EST: 
As it turns out, some of the issues we’ve been noticing with network requests, influences, deleting portfolios, and a couple of other random things may come down to hardware issues with one of the servers in our cluster.
Our hosting company just swapped out the defective server for a new one (after we discovered the issue around 1:30AM EST), so if you see any weirdness it’s probably due to the server-side things that we’re working on.
—
UPDATE: June 21, 2011 – 1:35AM EST: 
Search and all related functionality is now back online! This affected our jobs board as well as some modules and other pages, but it’s back up and running on the cluster servers as of 10pm EST or so.
Page loads should also be a bit speedier across the board, including Login, Dashboard, Profiles, Portfolios, and Inspirations.
More news to come!
—
UPDATE: June 20, 2011 – 3:35AM EST: 
Portfolio and Inspiration tools are back online!
The new server cluster seems to be humming along, load balancers and all. As it turns out, upgrading to Rails 3.0 (in addition to all the other changes/upgrades) caused a few additional hiccups along the way. We are working to fully restore the design:related search engine as well as fixing a few other bugs just as soon as we can.
—
UPDATE: June 18, 2011 – 1:45AM EST: 
Things are back up and running, and the DNS should be completely updated in the next 24 hours or so. We have a few more things to do, and we’re working to restore a few key user functionalities just as soon as we can.
—
June 17, 2011 – 4:45AM EST: 
All of our user data has successfully migrated over to our new servers, and we’re doing some testing now. We are working around the clock to get the site back up and running, and we appreciate your patience as we finish up some final touches on our new back-end architecture. :)
Read more here
– Your friends @ design:related

Server Migration rescheduled for Thursday/Friday, June 17

UPDATED June 16, 2011:

To all of our amazing members and viewers:

It’s that time again! We’re working to scale our hardware and infrastructure on Friday, starting at 12:30AM EST (rescheduled from Tuesday night around 10pm EST).

The reason for the downtime mostly surrounds hardware upgrades—adding/replacing hard drives (with the fastest ones we could order), moving to a private server cluster configuration, upgrading CPUs and RAM—all while adding load balancers on top of new operating systems and software.

The main site, designrelated.com, will be down while we go through a laundry list of items to smoothly migrate the application and all of our user data over to our custom server cluster.

Transferring everyone’s data and images will take quite a few hours, but we’ll be working to get things back up and running just as soon as we can on Friday. 

Thanks in advance for your patience!

– Your friends @ design:related