Incident: An operation created a large backlog of events that could not be processed within our timeout. The event processing job was rescheduled in the failing state, which left the corresponding workers stuck and severely delayed all event distribution. Full recovery to expected conditions took ~11 hours.
Impact: Events associated with this database were delayed. A subset of these events was never delivered because the events aged out during the delay. Of the events that were not delivered, only ~5-10% were customer events. No customer data was lost.
Moving forward: As a result of this incident, Asana is implementing changes to make our event distribution systems more resilient to cascading failures and high event volume.
Our metric considers a weighted average of uptime experienced by users at each data center. The number of minutes of downtime shown reflects this weighted average.
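As a rough illustration of how such a weighted average could be computed, here is a minimal sketch. It assumes each data center is weighted by its share of users, which the description above does not state explicitly. The function name and the figures in the example are hypothetical, not values from this incident.

```python
def weighted_downtime_minutes(per_data_center):
    """Return the user-weighted average downtime in minutes.

    per_data_center: list of (user_count, downtime_minutes) pairs,
    one pair per data center. Weighting by user count is an assumption.
    """
    total_users = sum(users for users, _ in per_data_center)
    if total_users == 0:
        # No users anywhere: report zero rather than divide by zero.
        return 0.0
    weighted_sum = sum(users * minutes for users, minutes in per_data_center)
    return weighted_sum / total_users


# Hypothetical example: 3000 users saw 60 minutes of downtime,
# 1000 users at another data center saw none.
example = weighted_downtime_minutes([(3000, 60), (1000, 0)])
print(example)  # 45.0
```

Under this assumption, an outage confined to one data center counts in proportion to the users served there, so the reported figure can be lower than the longest downtime any single user experienced.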