Issue with webhooks

Incident Report for Asana

Postmortem

Incident: An operation created a backlog of events too large to process within our timeout. The event processing job was rescheduled while still in the failing state, which left the corresponding workers stuck and severely delayed all event distribution. Full recovery to expected conditions took ~11 hours.

Impact: Events associated with the affected database were delayed. A subset of these events was never delivered because the events aged out during the delay. Of the undelivered events, only ~5-10% were customer events. No customer data was lost.

Moving forward: As a result of this incident, Asana is implementing changes to make our event distribution systems more resilient to cascading failures and high event volume.
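
For illustration only, the failure mode described above (a job that cannot finish within its timeout and is rescheduled in the same failing state) is commonly mitigated by processing the backlog in bounded batches and capping retries. The sketch below is a generic pattern, not Asana's implementation; `Job`, `run_job`, `batch_size`, and `max_attempts` are hypothetical names chosen for this example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    events: deque      # events still waiting to be delivered (hypothetical shape)
    attempts: int = 0  # consecutive failed runs of this job


def run_job(job, deliver, batch_size=100, max_attempts=3):
    """Deliver at most `batch_size` events, then report what the scheduler should do.

    Returns "done" (backlog drained), "requeue" (run again later), or
    "parked" (too many consecutive failures; stop rescheduling and
    release the worker for other jobs).
    """
    try:
        for _ in range(min(batch_size, len(job.events))):
            deliver(job.events[0])  # may raise on failure
            job.events.popleft()    # remove only after a successful delivery
    except Exception:
        job.attempts += 1
        # Cap retries so a permanently failing job cannot occupy workers forever.
        return "parked" if job.attempts >= max_attempts else "requeue"
    job.attempts = 0  # progress was made, so reset the failure counter
    return "requeue" if job.events else "done"
```

Because each run handles a fixed-size batch, a very large backlog is drained over several short runs instead of one run that exceeds the timeout, and a job that keeps failing is set aside rather than rescheduled indefinitely.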

Our metric considers a weighted average of uptime experienced by users at each data center. The number of minutes of downtime shown reflects this weighted average.
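
As a rough illustration of how such a weighted figure could be computed, the sketch below weights each data center's downtime by its share of users. The weighting scheme, the data center names, and the numbers are assumptions made for this example and do not come from the incident data.

```python
def weighted_downtime_minutes(downtime_by_dc, user_share_by_dc):
    """Average downtime in minutes, weighted by each data center's user share.

    Both arguments are dicts keyed by data center name (hypothetical names).
    A data center with no recorded downtime counts as 0 minutes.
    """
    total_share = sum(user_share_by_dc.values())
    weighted_sum = sum(
        downtime_by_dc.get(dc, 0.0) * share
        for dc, share in user_share_by_dc.items()
    )
    return weighted_sum / total_share


# Hypothetical example: 60 minutes of downtime at a data center serving
# 25% of users, and none at the data center serving the remaining 75%.
example = weighted_downtime_minutes(
    {"dc-a": 60.0, "dc-b": 0.0},
    {"dc-a": 0.25, "dc-b": 0.75},
)
```

Under these assumed inputs the reported figure would be 15 minutes, even though users at the affected data center experienced the full 60.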

Posted Oct 08, 2021 - 21:14 UTC

Resolved

This incident has been resolved.

Posted Oct 05, 2021 - 18:21 UTC

Update

We believe the incident has been resolved.

Posted Oct 05, 2021 - 18:06 UTC

Investigating

We believe there is currently an issue that could delay webhook notifications or cause them to fail. We are investigating and will post more information as it becomes available.

Posted Oct 05, 2021 - 10:58 UTC

This incident affected: US (Webhooks and Event Streams).