PyPI's backends were unavailable from 18:46 to 18:53 UTC on 2022-07-28. The outage stemmed from a cascading failure in the backend infrastructure that disrupted the management layer responsible for deploying and securing PyPI's application servers.
Beginning at 17:24 UTC, multiple cloud instances in the Kubernetes cluster that PyPI runs on were automatically replaced by the provider due to an unknown failure. By 18:20 UTC, all instances had been replaced and were healthy members of the Kubernetes cluster.
This replacement event caused many pods to be rescheduled, which impacted both our Consul and Vault clusters inside Kubernetes. Both services were restarted on new nodes.
At 18:23 UTC, the first alert was delivered to PSF Infrastructure, reporting that at least one Vault container was down and needed to be unsealed to come back online.
Upon investigation, it was determined that all Vault containers in the cluster were unavailable and that the Consul containers were unable to establish a new leader. The responding Infrastructure Team member worked to bring Consul back online and then unseal the new Vault containers.
Bringing Consul back to a healthy state was complicated by one pod that was stuck in the “Terminating” state, which left the discovery mechanism for Consul servers waiting for an additional server to participate in elections. Once this pod was forcibly removed and allowed to be re-created, recovery commenced.
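The effect of a missing server on leader election can be illustrated with a small sketch. This is a simplified model, not Consul's actual implementation: the majority rule is the standard one for Raft-style consensus, while `servers_still_awaited` and the cluster size of 3 are hypothetical, since this report does not state how many Consul servers were configured or which discovery setting was involved.

```python
def raft_quorum(voters: int) -> int:
    """Smallest majority of a voter set of the given size."""
    return voters // 2 + 1


def servers_still_awaited(expected: int, discovered: int) -> int:
    """Hypothetical gate: how many more servers must be discovered
    before the cluster stops waiting and attempts an election."""
    return max(expected - discovered, 0)


# Illustrative numbers only (assumptions, not taken from the incident):
# 3 expected servers, one of which is a pod stuck in "Terminating".
EXPECTED_SERVERS = 3
discovered_servers = 2

print("votes needed for a majority:", raft_quorum(EXPECTED_SERVERS))
print("servers still awaited:", servers_still_awaited(EXPECTED_SERVERS, discovered_servers))
```

In this model, the servers that are up would be enough for a majority, but the election is never attempted while discovery still expects one more participant. Removing the stuck pod so that it can be re-created brings the discovered count back to the expected count.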
With these backing services offline, any new container launched to serve PyPI’s applications failed. As a result, containers rescheduled because of the node replacements were unable to start. PyPI eventually became unresponsive at 18:46 UTC, as Kubernetes rescheduled more and more containers in an attempt to meet the specified instance count.
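The way serving capacity drains in this kind of cascade can be sketched with a toy simulation. It is a rough model under stated assumptions, not a reconstruction of the incident: the function `simulate_capacity`, the instance count of 6, and the eviction rate of 2 per step are all hypothetical, and real Kubernetes scheduling is considerably more involved.

```python
from typing import List


def simulate_capacity(
    desired: int,
    evicted_per_step: int,
    steps: int,
    backing_services_up: bool,
) -> List[int]:
    """Return the number of running containers after each step.

    Each step, node replacement evicts some running containers.
    Replacements are launched to restore the desired count, but in this
    model they only start if the backing services are reachable.
    """
    running = desired
    history = []
    for _ in range(steps):
        # Node replacement takes down some of the running containers.
        running -= min(evicted_per_step, running)
        # Replacements only come up when their dependencies are available.
        if backing_services_up:
            running = desired
        history.append(running)
    return history


# Illustrative numbers only (assumptions, not taken from the incident).
print(simulate_capacity(6, 2, 3, backing_services_up=True))   # [6, 6, 6]
print(simulate_capacity(6, 2, 3, backing_services_up=False))  # [4, 2, 0]
```

With the backing services healthy, each eviction is absorbed by a replacement. With them offline, every eviction is a permanent loss of capacity until the dependencies recover, which matches the gradual slide into unresponsiveness described above.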