Incident: Between 20:00 UTC and 23:30 UTC, Asana experienced a major production outage that limited web application functionality and raised error rates for API traffic. Two principal application services that serve Asana's web traffic were impacted: LunaDb, which performs data loading and handles communication with web clients, and Worldstore, which functions as a database caching layer and allows users to see the changes they’ve made. Typically these services deploy independently to reduce the load on either system. During this incident, however, updates to both services overlapped, which placed stress on a shared service. That service failed, and the failure then cascaded to other services. Engineers responded to automated alerts within minutes of the start of the incident, but stabilizing the Worldstore cluster took several hours and several attempts.
Impact: For the duration of the incident, web app users saw a loss of reactivity: their own changes appeared not to be saved, or they did not receive collaborative edits made by other users. Users of Asana’s API and mobile apps may have been unable to make changes to Asana at all. At around 23:30 UTC, full application functionality across the web app and API was restored.
Moving forward: We are changing the configurations of our LunaDb and Worldstore services to prevent overload under similar circumstances, and adjusting deployment times of these services to avoid updating both simultaneously.
Our downtime metric is a weighted average of the uptime experienced by users at each data center. The number of minutes of downtime shown reflects this weighted average.
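As a rough illustration of how such a weighted average could be computed, here is a minimal sketch. It assumes each data center is weighted by its number of users. The report does not state the actual weights, and the data center names, user counts, and per-site downtime figures below are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class DataCenter:
    name: str                # hypothetical data center label
    users: int               # assumed weight: users served by this data center
    downtime_minutes: float  # downtime those users experienced


def weighted_downtime_minutes(centers: list[DataCenter]) -> float:
    """Return the user-weighted average downtime across data centers."""
    total_users = sum(c.users for c in centers)
    if total_users <= 0:
        raise ValueError("at least one data center must have users")
    weighted_sum = sum(c.users * c.downtime_minutes for c in centers)
    return weighted_sum / total_users


# Illustrative figures only, not taken from the incident.
centers = [
    DataCenter("dc-a", users=500, downtime_minutes=210.0),
    DataCenter("dc-b", users=300, downtime_minutes=60.0),
    DataCenter("dc-c", users=200, downtime_minutes=0.0),
]
reported = weighted_downtime_minutes(centers)
print(f"Weighted downtime: {reported:.1f} minutes")
```

With these made-up inputs, the reported figure falls between the worst-hit and unaffected data centers, which is why the number of minutes shown can differ from the wall-clock length of an incident.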