Issues with loading Asana

Incident Report for Asana

Postmortem

On August 12, 2021, between 13:15 and 16:14 UTC, our invalidation pipeline [0] experienced an outage that affected a small number of users. During this period, these users may have been unable to see updates reflected in Asana. They may also have been unable to access the mobile app or API. The following is the sequence of events that caused this outage:

13:15 UTC Amazon Web Services (AWS) reports packet loss between availability zones in us-east-1, affecting several of our databases containing customer data. This AWS networking issue is the initial trigger of the outage, and it makes Asana unavailable for customers on the affected databases.
13:44 UTC AWS reports that packet loss between availability zones in us-east-1 has recovered. At this point, most affected users are able to access Asana again, but the invalidation pipeline has not yet fully recovered due to failures in a routing component.
14:38 UTC The routing component partially recovers, sending client traffic to the new invalidators. This partial recovery restores the invalidation pipeline for a subset of the affected users, but not for the rest.
16:14 UTC We decide to skip all updates since 13:14 UTC on the affected databases, fast-forwarding all clients to the most recent updates. This recovers the invalidation pipeline, so users see their new updates reflected again. However, it also means that users cannot see the updates they made over the preceding 3 hours until our caches are cleared. We therefore fully clear our caches for the affected databases, ensuring all users are able to see all of the updates they have made.
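
The 16:14 UTC recovery can be pictured with a minimal sketch. It assumes each client holds a cursor into an ordered invalidation log plus a local cache; the names Client, skip_backlog, and clear_cache are hypothetical and are not taken from Asana's systems. The sketch only illustrates why skipping the backlog had to be followed by a full cache clear.

```python
from dataclasses import dataclass, field


@dataclass
class Client:
    """Hypothetical client state; not Asana's actual data model."""
    cursor: int = 0                            # index of the next log entry to process
    cache: dict = field(default_factory=dict)  # object id -> cached value


def skip_backlog(client: Client, log: list) -> None:
    """Fast-forward: jump to the end of the log without applying the backlog."""
    client.cursor = len(log)


def clear_cache(client: Client) -> None:
    """Drop every cached value, since any of them may have been invalidated
    by a skipped log entry."""
    client.cache.clear()


# Two invalidations were logged during the outage but never processed.
log = ["task-1", "task-2"]
client = Client(cursor=0, cache={"task-1": "old title"})

skip_backlog(client, log)
stale_after_skip = "task-1" in client.cache   # stale entry is still cached

clear_cache(client)
stale_after_clear = "task-1" in client.cache  # stale entry is gone
```

Under these assumptions, fast-forwarding alone unblocks new updates but leaves stale entries in place, which matches the two-step recovery described above.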

We're taking a few steps to ensure similar incidents don't occur in the future:

We're planning to remove the routing component that failed during this incident. We've made a number of changes to our infrastructure that have eliminated the need for this layer.
We're improving the scalability of our invalidators. We currently have tight restrictions on the number of invalidators active per database. We're looking into loosening those restrictions and auto-scaling in response to increased load.
We're improving the signal of our alerts during database failures. When databases fail, many services depending on them also alert, making it less obvious which services are still failing once the databases have recovered. We're working on reducing the likelihood of alerts that fire only because of upstream service failures.
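
One way to think about the alerting improvement is as a filter over a service dependency graph. The sketch below is an illustration under assumptions: the service names and the depends_on mapping are hypothetical, and this is not a description of our alerting stack. It reports a failing service only when none of its direct dependencies is also failing.

```python
def root_cause_alerts(failing: set, depends_on: dict) -> set:
    """Return the failing services whose direct dependencies are all healthy."""
    return {
        service
        for service in failing
        if not any(dep in failing for dep in depends_on.get(service, ()))
    }


# Hypothetical dependency graph: service -> services it depends on.
depends_on = {
    "invalidator": ["database"],
    "router": ["invalidator"],
    "api": ["database"],
}

# While the database is down, only the database is reported.
during_outage = root_cause_alerts(
    {"database", "invalidator", "router", "api"}, depends_on
)

# Once the database recovers, a component that is still failing stands out.
after_recovery = root_cause_alerts({"router"}, depends_on)
```

In this sketch, a service that keeps failing after its upstream dependencies recover becomes the only alert, rather than being lost among downstream noise.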

[0]: The invalidation pipeline is the system that powers Asana's reactivity, keeping user data up-to-date. Failures in this pipeline mean that users may not be able to see the most recent version of their data. We make a read-your-own-writes guarantee on the API and mobile app, so failures in the pipeline may also translate to failures in those systems if the user has writes that haven't yet been processed. More information about our invalidation pipeline is available here.
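
As a rough illustration of why a stalled pipeline can surface as API or mobile failures, the sketch below models a read-your-own-writes check with version numbers. The function name, the version counters, and the error type are all assumptions made for this example, not Asana's API.

```python
class PendingWriteError(Exception):
    """Raised when a read cannot yet reflect the caller's own writes."""


def read_own_writes(store: dict, key: str,
                    last_write_version: int, processed_version: int):
    """Serve a read only if the pipeline has processed the caller's latest write."""
    if processed_version < last_write_version:
        # Returning cached data here would break the guarantee, so fail instead.
        raise PendingWriteError(
            f"write {last_write_version} not yet processed "
            f"(pipeline at {processed_version})"
        )
    return store[key]


store = {"task-1": "new title"}

# Healthy pipeline: the user's latest write (version 7) has been processed.
healthy = read_own_writes(store, "task-1", last_write_version=7, processed_version=7)

# Stalled pipeline: the user wrote version 9, but processing stopped at 7.
try:
    read_own_writes(store, "task-1", last_write_version=9, processed_version=7)
    stalled_failed = False
except PendingWriteError:
    stalled_failed = True
```

Under this model, users with unprocessed writes see errors rather than stale data, which is consistent with the API and mobile symptoms described above.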

Posted Aug 13, 2021 - 21:09 UTC

Resolved

This incident has been resolved.

Posted Aug 12, 2021 - 16:41 UTC

Update

We are continuing to investigate this issue.

Posted Aug 12, 2021 - 16:34 UTC

Investigating

We're currently experiencing some difficulties. Some users may have issues loading Asana. We are investigating the issue.

Posted Aug 12, 2021 - 16:24 UTC

This incident affected: US (App).