As you might know, our data center suffered a power outage late last Thursday morning. This outage caused the complete unavailability of our site for a few minutes, followed by serious disruptions over several hours that rendered the service almost, if not completely, unusable. This service interruption is unprecedented for Dailymotion, which hasn’t experienced an outage of this scale since its launch in 2005. This is why we think we owe you an explanation…but be warned, it’s rather technical.
We’ll start by explaining how power is handled in a data center such as ours. Electricity is brought in by two separate EDF (Électricité de France) circuits, then passed through transformers. In case of an outage, batteries keep things running while onsite generators start up (the site has a bit more than 24 hours of fuel in reserve). As it exits the transformers, the current is “cleaned” then stored in a battery back-up system. These back-ups, separated into two groups of three, provide two independent electrical supplies. According to our hosting provider, Equinix, human error cut off all six back-ups at once, triggering a total blackout for a good minute, the time it takes for the back-ups to restart.
And there you have the cause.
Due to this abrupt cutoff, our network equipment and our servers sustained some damage. The core of the network recovered in about 12 minutes, but only superficially. In reality, certain machines came back online with a mix of different configurations, some of them out of date. The cause of this mix remains unclear for the moment.
One of the consequences of this configuration mix-up was a breakdown in multicast routing between our front-end web servers, which kept them from synchronizing to share the workload correctly. Imagine the traffic lights of a busy intersection changing color at random rather than as part of a coordinated system.
Meanwhile, the databases needed a certain amount of time to verify their integrity. The storage servers, likewise, took a good hour to confirm that the video files had not been damaged. To top it off, a number of circuit breakers couldn’t support the simultaneous restart of all the servers, and tripped. We therefore had to wait for Equinix, which was hard at work elsewhere, to switch them back on.
And there you have the result.
Here’s a timeline:
· 11:16am - Power outage.
· 11:17am - Power restored, service unavailable.
· 11:30am - The network core goes back up, some videos are available in external players.
· 12:15pm - Part of the platform becomes available, but the site is in read-only mode and difficult to access.
· 12:45pm - The final circuit breakers are switched back on, the last machines restart.
· 1:00pm - The greater part of the platform is available, but the site remains read-only and difficult to access. A few non-critical machines fail to restart. Traffic begins to ramp back up.
· 1:30pm - The storage servers are available again.
· 2:30pm - The databases become available in write mode.
· 3:00pm - Communication between our front-end servers is reestablished, the site is fully accessible in read and write mode. A few services remain disrupted (webcam upload, search, encoding).
· 6:00pm - All services fully operational again.
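For readers who want to see how long each stage took, here is a minimal Python sketch that converts the timeline above into minutes elapsed since the outage. The times are copied from the list; the shortened labels and the names `TIMELINE`, `parse_time` and `minutes_since_outage` are ours, chosen for illustration only.

```python
from datetime import datetime

# Times copied from the timeline above; labels are shortened paraphrases.
TIMELINE = [
    ("11:16am", "Power outage"),
    ("11:17am", "Power restored, service unavailable"),
    ("11:30am", "Network core back up"),
    ("12:15pm", "Part of the platform available (read-only)"),
    ("12:45pm", "Final circuit breakers switched back on"),
    ("1:00pm", "Most of the platform available (read-only)"),
    ("1:30pm", "Storage servers available"),
    ("2:30pm", "Databases available in write mode"),
    ("3:00pm", "Site fully accessible in read and write mode"),
    ("6:00pm", "All services fully operational"),
]


def parse_time(text):
    """Parse a 12-hour clock string such as '1:30pm' (all events are on the same day)."""
    return datetime.strptime(text.upper(), "%I:%M%p")


def minutes_since_outage(timeline):
    """Return (elapsed_minutes, label) pairs, measured from the first entry."""
    start = parse_time(timeline[0][0])
    result = []
    for clock, label in timeline:
        elapsed = parse_time(clock) - start
        result.append((int(elapsed.total_seconds() // 60), label))
    return result


for minutes, label in minutes_since_outage(TIMELINE):
    hours, rest = divmod(minutes, 60)
    print(f"+{hours}h{rest:02d}  {label}")
```

Read this way, the site was fully accessible again 3 hours 44 minutes after the outage, and every service was back 6 hours 44 minutes after it.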
To conclude, we would like to offer our apologies for any inconvenience caused. We assure you that we will do everything we can, in the near future, to improve the quality of service and the availability of the site.