Partial outage

Incident Report for Asana

Postmortem

Incident: To verify that our systems recover reliably from failures, Asana deliberately terminates an individual node at a time of low traffic, when engineers are available to address any problems. A software bug in the internal application that replaces failed nodes prevented recovery when a node was terminated in this manner.

Impact: Until engineers intervened to manually replace the failed node, about 12.5% of users experienced application crashes and about 1% of API requests failed.

Moving forward: Planned work includes improved monitoring and resilience for node failures.

Our metric considers a weighted average of uptime experienced by users at each data center. The number of minutes of downtime shown reflects this weighted average.
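
As a rough illustration of how such a weighted average might be computed, the sketch below weights each data center's downtime by its number of users. The weighting by user count, the function name, and all figures are assumptions made for illustration; they are not taken from this incident or from Asana's actual metric.

```python
def weighted_downtime_minutes(data_centers):
    """Return the user-weighted average downtime in minutes.

    `data_centers` is a list of (users, downtime_minutes) pairs.
    Weighting by user count is an assumption, not Asana's documented method.
    """
    total_users = sum(users for users, _ in data_centers)
    if total_users == 0:
        raise ValueError("total user count must be positive")
    # Each data center contributes its downtime in proportion to its users.
    weighted_sum = sum(users * minutes for users, minutes in data_centers)
    return weighted_sum / total_users


# Hypothetical figures: three data centers with different user counts.
example = [(600, 40.0), (300, 0.0), (100, 10.0)]
print(weighted_downtime_minutes(example))
```

Under this assumption, a data center serving few users moves the reported number less than one serving many, so the figure shown can be lower than the longest downtime seen at any single data center.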

Posted Dec 23, 2022 - 18:58 UTC

Resolved

We are currently investigating this issue.

Posted Dec 17, 2022 - 01:13 UTC