PyPI Down
Incident Report for Python Infrastructure
Postmortem

At approximately 15:00 UTC, the TLS certificates that PyPI’s internal deployment tooling uses to secure access to Vault expired. This led to a cascading failure within the PyPI infrastructure that caused running pods to lose access to secure credentials and stopped new instances from being launched.

Under normal circumstances, this would have been resolved as the Vault instances restarted and retrieved a new TLS certificate, but an abnormally large backlog of expired leases caused the new Vault instances to crash on startup and required manual intervention to clean up extraneous leases.

The initial remedy will be to upgrade our Vault instances to a version that resolves the crash-on-launch issue when the quantity of expired leases is too high, which would have allowed this outage to be recovered in a more automated fashion. Longer term, research and development time will be allocated to improving the automation around detection of instances nearing expiration as well as mechanisms to securely automate the unseal process for our secure storage.

Posted Apr 05, 2022 - 16:43 UTC
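
One way to detect certificates approaching expiration, the follow-up the postmortem names, shown as an added example that is not the Python infrastructure team's own tooling. The alert thresholds, the inventory names and dates, and the default port 8200 are assumptions.

```python
import socket
import ssl
from datetime import datetime, timezone

WARN_DAYS, CRIT_DAYS = 14, 3  # assumed alert thresholds; tune to the rotation period

def fetch_not_after(host, port=8200, cafile=None, timeout=5.0):
    """Read notAfter from a live endpoint. host, port and cafile are placeholders."""
    ctx = ssl.create_default_context(cafile=cafile)
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                return tls.getpeercert()["notAfter"]
    except ssl.SSLCertVerificationError as exc:
        # An already-expired certificate fails the handshake instead of returning a date.
        raise RuntimeError(f"{host}:{port} failed verification: {exc.verify_message}") from exc

def classify(not_after, now):
    """Return (severity, days_left) for a notAfter string in getpeercert() format."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    days_left = (expires - now).total_seconds() / 86400
    if days_left <= 0:
        return "EXPIRED", days_left
    if days_left <= CRIT_DAYS:
        return "CRITICAL", days_left
    if days_left <= WARN_DAYS:
        return "WARNING", days_left
    return "OK", days_left

def scan(inventory, now):
    """Return non-OK findings as (name, severity, days_left), most urgent first."""
    findings = []
    for name, not_after in inventory.items():
        severity, days_left = classify(not_after, now)
        if severity != "OK":
            findings.append((name, severity, round(days_left, 2)))
    return sorted(findings, key=lambda f: f[2])

# Constructed inventory: names and dates are sample values, not PyPI's real certificates.
SAMPLE = {
    "vault-internal": "Apr  5 15:00:00 2022 GMT",
    "other-service": "Jan  1 00:00:00 2023 GMT",
}

if __name__ == "__main__":
    for finding in scan(SAMPLE, datetime.now(timezone.utc)):
        print(*finding)
```

Run on a schedule against every internal endpoint, a check like this raises an alert while the certificate is still valid; in production the inventory would be filled from `fetch_not_after` rather than from fixed strings.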

Resolved
This incident is resolved.
Posted Apr 05, 2022 - 16:36 UTC
Monitoring
Applications are coming back online and we are monitoring for stability.
Posted Apr 05, 2022 - 15:41 UTC
Identified
Core failed service has been identified and is coming back online. Next we'll bring up the ancillary services. Once all our automation services are online we can begin to bring the applications back.
Posted Apr 05, 2022 - 15:38 UTC
Investigating
All backend services for PyPI are down due to a cascading failure in our deployment tooling. We are investigating and working on restoring service.
Posted Apr 05, 2022 - 15:25 UTC
This incident affected: PyPI (pypi.org - Backends, files.pythonhosted.org - Redirects, files.pythonhosted.org - Redirects Backends).