Incident: On Dec 16, 2021, operator error during a planned migration to complete webhook improvements caused webhook and event stream data to expire incorrectly after 2 days. In response to the data loss, we restored from a backup taken during the migration. After the restore, data inconsistencies caused by new events prevented webhook and event delivery until we manually reconciled the backup data with our database. No User Work Content was accessed to resolve the issue.
Impact: All webhooks and event streams were delayed beyond our published maintenance window, and some webhooks and event streams permanently missed events over a 60-hour period.
Moving Forward: As a result of this incident, we are improving our monitoring of webhook delivery and adjusting our incident response processes so that developers receive more transparent updates on webhook system status. With the migration now complete, webhooks and event streams are resilient to the issues that previously caused significant data loss.