
On Wednesday 19 October, Editablewp Market sites suffered a prolonged incident and were intermittently unavailable for over eight hours. Throughout this time, customers would intermittently have seen our “Maintenance” page and therefore would not have been able to interact with the sites. The problem was caused by an inaccessible directory on a shared filesystem, which in turn was caused by a volume filling to capacity. The incident duration was eight hours 26 minutes; total downtime of the sites was 2 hours 56 minutes.
We’re sorry this happened. During the periods of downtime, the site was completely unavailable. Customers couldn’t find or purchase items, and authors couldn’t upload or manage their items. We’ve let our users down and let ourselves down too. We aim higher than this and are working to make sure it doesn’t happen again.
In the spirit of our “Tell it like it is” company value, we’re sharing the details of this incident with the public.

Context
Editablewp Market sites recently moved from a traditional hosting service to Amazon Web Services (AWS). The sites use a number of AWS services, including Elastic Compute Cloud (EC2), Elastic Load Balancing (ELB), and the CodeDeploy deployment service. The sites are served by a Ruby on Rails application, fronted by the Unicorn HTTP server. The web EC2 instances all connect to a shared network filesystem, powered by GlusterFS.
Timeline

Analysis
This incident manifested as five “waves” of outages, each subsequent one occurring after we thought the problem had been fixed. In reality there were several problems happening at the same time, as is often the case in complex systems. There was not one single underlying cause, but rather a chain of events and circumstances that led to this incident. A section follows for each of the major problems we found.
Disk space and Gluster problems
The first occurrence of the outage was due to a simple problem which went embarrassingly uncaught: our shared filesystem ran out of disk space.

As shown in the graph, free space started decreasing fairly rapidly prior to the incident, dropping from around 200 GiB to 6 GiB in a couple of days. Low free space isn’t a problem in and of itself, but the fact that we didn’t recognise and correct the issue is a problem. Why didn’t we know about it? Because we neglected to set an alert condition for it. We were collecting filesystem usage data, but never generating any alerts! An alert about rapidly decreasing free space may have allowed us to take action to avoid the problem entirely. It’s worth mentioning that we did have alerts on the shared filesystem in our previous environment, but they were inadvertently lost during our AWS migration.
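As an illustration of the alerting we have since put in place (described under “Corrective measures” below), here is a minimal sketch of a low-free-space alarm, assuming the free space figure is published as a custom CloudWatch metric. The namespace, metric name, dimensions, threshold, and SNS topic are all hypothetical.

```python
# Minimal sketch: alarm when free space on the shared filesystem stays below 50 GiB.
# Assumes a custom metric is already being published; all names here are hypothetical.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="shared-filesystem-free-space-low",
    Namespace="Custom/SharedFilesystem",        # hypothetical custom namespace
    MetricName="FreeSpaceGiB",                  # hypothetical custom metric
    Dimensions=[{"Name": "Volume", "Value": "gluster-shared"}],
    Statistic="Minimum",
    Period=300,                                 # evaluate five-minute datapoints
    EvaluationPeriods=3,                        # three consecutive breaches before alarming
    Threshold=50,
    ComparisonOperator="LessThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # hypothetical SNS topic
    AlarmDescription="Shared filesystem free space is below 50 GiB",
)
```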
An out-of-space condition doesn’t explain the behaviour of the site during the incident, however. As we came to realise, whenever a user made a request that touched the shared filesystem, the Unicorn worker servicing that request would hang forever waiting to access the shared filesystem mount. If the disk were merely full, one would expect the standard Linux error in that situation (ENOSPC, No space left on device).
The GlusterFS shared filesystem is a cluster consisting of three independent EC2 instances. When the Gluster expert on our Content team investigated, he found that the full disk had caused Gluster to shut down as a safety precaution. When the lack of disk space was addressed and Gluster started back up, it did so in a split-brain situation, with the data in an inconsistent state across the three instances. Gluster tried to heal this problem automatically, but was unable to do so because our application kept attempting to write files to it. The end result was that any access to a particular directory on the shared filesystem stalled forever.
A compounding factor was the uninterruptible nature of any process which tried to access this directory. As the Unicorn workers piled up, stuck, we tried killing them, first gracefully with SIGTERM, then with SIGKILL. The only option for clearing these stuck processes was to terminate the instances.
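Processes blocked on I/O against a hung mount sit in uninterruptible sleep (the “D” process state), which is why even SIGKILL had no effect. The sketch below is purely illustrative, not the tooling we used, but shows one way to spot such processes on a Linux host.

```python
# List processes in uninterruptible sleep ("D" state), which are typically blocked
# on I/O, for example against a hung network filesystem mount. Illustrative only.
import os

def uninterruptible_processes():
    stuck = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/stat") as f:
                stat = f.read()
            # The state field follows the command name, which is wrapped in parentheses.
            state = stat.rsplit(")", 1)[1].split()[0]
            if state == "D":
                with open(f"/proc/{pid}/comm") as f:
                    stuck.append((int(pid), f.read().strip()))
        except OSError:
            continue  # the process exited while we were scanning
    return stuck

if __name__ == "__main__":
    for pid, name in uninterruptible_processes():
        print(f"{pid}\t{name}")
```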
Resolution
One of the largest contributors to the extended recovery time was how long it took to identify the problem with the shared filesystem’s inaccessible directory: just over seven hours. Once we understood the problem, we reconfigured the application to use a different directory, redeployed, and had the sites back up in less than an hour.
GlusterFS is a fairly new addition to our tech stack and this is the first time we’ve seen it fail in production. As we didn’t understand its failure modes, we weren’t able to identify the underlying cause of the issue. Instead, we reacted to the symptom and kept trying to isolate our code from the shared filesystem. Fortunately the issue was eventually identified and we were able to work around it.
Takeaway: new systems will fail in unexpected ways; be prepared for that when putting them into production.
Unreliable outage flag
In order to isolate our systems from the systems they depend on when those dependencies experience problems, we’ve implemented a set of “outage flags”: essentially choke points that all code accessing a given system goes through, allowing that system to be disabled in a single place.
We have such a flag around our shared filesystem and most of our code respects it, but not all of it does. Waves three and five were both due to code paths that accessed the shared filesystem without checking the flag state first. Any requests that used these code paths would touch the problematic directory and stall their Unicorn worker. When all the available workers on an instance were stalled in this way, the instance was unable to service further requests. When that happened on all instances, the site went down.
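The pattern itself is simple: every shared filesystem access goes through a single guard that checks the flag. Our application is Ruby on Rails; the sketch below just illustrates the idea in Python, and all of the names in it are hypothetical.

```python
# Minimal sketch of an "outage flag" choke point: every shared-filesystem access goes
# through shared_fs_guard(), so the dependency can be switched off in one place.
# OUTAGE_FLAGS, SharedFilesystemUnavailable, and read_item_asset are hypothetical names.

OUTAGE_FLAGS = {"shared_filesystem": False}  # in practice this would live in a shared store

class SharedFilesystemUnavailable(Exception):
    """Raised instead of touching the filesystem while the flag is on."""

def shared_fs_guard():
    if OUTAGE_FLAGS["shared_filesystem"]:
        raise SharedFilesystemUnavailable("shared filesystem is flagged as unavailable")

def read_item_asset(path):
    shared_fs_guard()  # the choke point: never touch the mount when the flag is on
    with open(path, "rb") as f:
        return f.read()
```

The code paths behind waves three and five were, in effect, the ones that skipped the guard and touched the mount directly.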
Resolution
During the incident we identified two code paths which did not respect the shared filesystem outage flag. Had we not identified the underlying cause, we probably would have continued the cycle of fixing broken code paths, deploying, and waiting to find the next one. Fortunately, as we fixed the broken code the frequency with which the problem recurred decreased (the broken code we found in wave five took far longer to consume all available Unicorn workers than the code in the first wave).
Takeaway: testing emergency tooling is important; make sure it works before you need it.
Deployment difficulties
We use the AWS CodeDeploy service to deploy our application. The nature of how CodeDeploy deployments work in our environment severely slowed our ability to react to issues with code changes.
When you deploy with CodeDeploy, you create a revision which gets deployed to instances. When deploying to a fleet of running instances, the revision is deployed to every instance in the fleet and the status is recorded (successful or failed). When an instance first comes into service, it receives the revision from the latest successful deployment.
A couple of times during the outage we needed to deploy code changes. The process went something like this:
- Deploy the application
- The deployment would fail on one or more instances, which were in the process of starting up or shutting down because of the ongoing errors
- Scale the fleet down to a small number of instances (two)
- Deploy again to only those two instances
- Once that deployment was successful, scale the fleet back up to nominal capacity
This process takes between 20 and 60 minutes, depending on the current state of the fleet, so it can really impact the time to recovery.
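For context, triggering a CodeDeploy deployment programmatically looks roughly like the sketch below. The application, deployment group, and S3 artifact names are hypothetical, and this is not our actual deploy tooling.

```python
# Sketch: trigger a CodeDeploy deployment and wait for it to finish.
# Application, deployment group, bucket, and key names are hypothetical.
import boto3

codedeploy = boto3.client("codedeploy", region_name="us-east-1")

response = codedeploy.create_deployment(
    applicationName="market-web",
    deploymentGroupName="market-web-fleet",
    revision={
        "revisionType": "S3",
        "s3Location": {
            "bucket": "market-deploy-artifacts",
            "key": "releases/market-web-1234.tar.gz",
            "bundleType": "tgz",
        },
    },
)
deployment_id = response["deploymentId"]

# Block until the deployment succeeds; the waiter raises if it fails or times out.
codedeploy.get_waiter("deployment_successful").wait(deploymentId=deployment_id)
print(f"Deployment {deployment_id} succeeded")
```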
Resolution
This process was slow but functional. We will investigate whether we’ve configured CodeDeploy correctly and look for ways to decrease the time taken during emergency deployments.
Takeaway: consider both happy-path and emergency scenarios when designing critical tooling and processes.
Maintenance mode script
During outages, we sometimes block public access to the site in order to carry out tasks that would otherwise disrupt users. To implement this, we use a script which creates a network ACL (NACL) entry in our AWS VPC that blocks all inbound traffic. We found that when we used this script, outbound traffic destined for the internet was also blocked. This was especially problematic because it prevented us from deploying any code.
CodeDeploy uses an agent process on each instance to facilitate deployments: it communicates with the remote AWS CodeDeploy service and runs code locally. To talk to the service it initiates outbound requests to the CodeDeploy service endpoint on port 443. When we enabled maintenance mode, the agent was no longer able to establish connections with the service.
As soon as we realised that the maintenance mode switch was at fault, we disabled it (and blocked users from the site with a different mechanism). After the incident, we investigated the cause further, and it turned out to be an oversight in the design of the script. Our network is partitioned into public and private subnets. Web instances live in private subnets and communicate with the outside world via gateways residing in public subnets. Traffic destined for the public internet crosses the boundary between the private and public subnets, and at that point the network access controls are imposed. In this case, the internet-bound traffic was blocked by the NACL entry added by the maintenance mode script.
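Conceptually, the rule the script created looked like the sketch below (the NACL ID and rule number are hypothetical). Because a NACL ingress rule is evaluated for any traffic entering the subnet, a deny-all rule on the public subnets also catches traffic arriving from the private subnets on its way out to the internet, which is what broke the CodeDeploy agent’s outbound connections.

```python
# Sketch of the kind of deny-all ingress NACL entry the maintenance script created.
# The NACL ID and rule number are hypothetical. An ingress rule applies to any traffic
# entering the subnet, including traffic from private subnets heading for the internet
# via gateways in the public subnet.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.create_network_acl_entry(
    NetworkAclId="acl-0123456789abcdef0",  # hypothetical public-subnet NACL
    RuleNumber=10,                         # low number so it is evaluated before allow rules
    Protocol="-1",                         # all protocols
    RuleAction="deny",
    Egress=False,                          # ingress: traffic entering the subnet
    CidrBlock="0.0.0.0/0",
)
```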
Resolution
As soon as we realised that the maintenance mode script was blocking deployments, we disabled it and used a different mechanism to block access to the site. This was effectively the first time the script had been used in anger, and although it did work, it had unintended side effects.
Takeaway: again, testing emergency tooling is important.
Corrective measures
During this incident and the subsequent post-incident review meeting, we identified several opportunities to prevent these problems from recurring.
- Alert on low disk space in the shared filesystem: This alert should have been in place as soon as Gluster was put into production. If we’d been alerted about the low disk space before it ran out, we may have been able to avoid this incident entirely. We’re also considering more advanced alerting options to cover the situation where the available space is used up rapidly. This action is complete; we now receive alerts when free space drops below a threshold.
- Add monitoring for GlusterFS error conditions: When Gluster is not serving files as expected (due to low disk space, shutdown, healing, or any other kind of error) we want to know about it as soon as possible.
- Add more disk space: Space was freed on the server by deleting some unused files on the day of the incident. We also need to add more space so we have a suitable amount of headroom to avoid similar incidents in the future.
- Investigate interruptible mounts for GlusterFS: The stalled processes which could not be killed significantly increased our time to recovery. If we had been able to kill the stuck workers, we may have been able to recover the site much sooner. We’ll look into whether we can mount the shared filesystem in an interruptible way.
- Rethink GlusterFS: Is GlusterFS the right choice for us? Are there alternatives that would work better? Do we need a shared filesystem at all? We will consider these questions to decide the future of our shared filesystem dependency. If we do stick with Gluster, we’ll upskill our on-call engineers in troubleshooting it.
- Ensure all code respects the outage flag: Had all our code respected the shared filesystem outage flag, this would have been a much smaller incident. We will audit all code which touches the shared filesystem and ensure it respects the state of the outage flag.
- Fix the maintenance mode script: The unintended side effect of blocking deployments with our maintenance script extended the downtime unnecessarily. The script will be fixed to allow the site to function internally while still blocking public access.
- Ensure the incident management process is followed: We have an incident management process which (among other things) describes how incidents are communicated internally. The process was not followed correctly, so we’ll make sure it is clear to on-call engineers.
- Fire drills: We will practise the incident response process by running “fire drills”, in which an incident is simulated and on-call engineers respond as if it were real. We’ve not had many major incidents recently, so we need the practice. These drills will also include shared filesystem failure scenarios, since that system is relatively new.
Summary
Like many incidents, this one was due to a chain of events that ultimately resulted in a long, drawn-out outage. By addressing the links in that chain, similar problems can be prevented in the future. We sincerely regret the downtime, but we’ve learned several valuable lessons and welcome this opportunity to improve our systems and processes.
A version of this article originally appeared on WeBuild.
Featured image: DarioLoPresti