Incident: At 2022-04-05 00:00 UTC, maintenance on underlying hardware caused a leader node in a critical system to become unavailable. This had an unexpected impact on dependent systems, causing them to crash. Automated resolution can take up to 15 minutes, so we took manual action to recover sooner. Even so, the dependent systems took longer to stabilize.
Impact: The web app was unavailable from 00:00 to 00:25 UTC, and the API and mobile apps were unavailable from 00:00 to 00:33 UTC.
Moving Forward: We are in the process of deprecating the dependency of our core systems on the system that failed. We are also looking at how our systems could have continued running in a degraded state instead of crashing.
Our metric considers a weighted average of uptime experienced by users at each data center. The number of minutes of downtime shown reflects this weighted average.
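To illustrate how a weighted average of this kind behaves, here is a minimal sketch. The data center names, the weights, and the per-data-center downtime values are all hypothetical, and the function name is our own; the report does not say what the weights are based on, so treat this as one plausible reading rather than the actual calculation.

```python
def weighted_downtime_minutes(downtime_by_dc, weight_by_dc):
    """Return the weighted average of per-data-center downtime, in minutes.

    Both arguments are dicts keyed by data center name. The weights are
    assumed to be non-negative numbers (for example, each data center's
    share of users); they do not need to sum to 1.
    """
    total_weight = sum(weight_by_dc.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(
        downtime_by_dc[dc] * weight for dc, weight in weight_by_dc.items()
    )
    return weighted_sum / total_weight


# Hypothetical example: one data center was down for 25 minutes,
# another was unaffected, and they carry 60% and 40% of the weight.
example_downtime = {"dc-a": 25, "dc-b": 0}
example_weights = {"dc-a": 60, "dc-b": 40}
example_result = weighted_downtime_minutes(example_downtime, example_weights)
print(example_result)  # 15.0
```

Under these assumed numbers the reported figure would be 15 minutes, which is less than the 25 minutes experienced at the affected data center. This is why the downtime shown can differ from the wall-clock duration of an incident.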