Incident: To ensure our systems can reliably recover from failures, Asana deliberately triggers the failure of an individual node at a time of low traffic, when engineers are available to address any problems. A software bug in an internal application that replaces failed nodes prevented recovery when a node was terminated in this manner.
Impact: Until engineers intervened to manually replace the failed node, about 12.5% of users experienced application crashes and about 1% of API requests failed.
Moving forward: Planned work includes improved monitoring of node failures and greater resilience to them.
Our metric considers a weighted average of uptime experienced by users at each data center. The number of minutes of downtime shown reflects this weighted average.
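
As an illustration of how such a weighted average can be computed, here is a minimal sketch. It assumes each data center is weighted by its share of users; the post does not say which weights Asana actually uses, and the function name and all figures below are hypothetical.

```python
def weighted_downtime_minutes(data_centers):
    """Average downtime across data centers, weighted per data center.

    `data_centers` is a list of (weight, downtime_minutes) pairs. The weight
    is assumed here to be the data center's share of users; this is an
    assumption for illustration, not Asana's documented method.
    """
    total_weight = sum(weight for weight, _ in data_centers)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(weight * minutes for weight, minutes in data_centers)
    return weighted_sum / total_weight


# Hypothetical figures: (share of users in percent, minutes of downtime there).
example = [(50, 10.0), (30, 0.0), (20, 5.0)]
reported_minutes = weighted_downtime_minutes(example)
print(reported_minutes)  # 6.0
```

Under these made-up numbers, an outage confined to some data centers yields a reported figure lower than the worst-affected data center's downtime, because unaffected users pull the average down.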