Shortly after 2021-04-05 14:00:00 UTC, long term (1 year) internal TLS certificates for a core service in the infrastructure that runs PyPI began expiring.
These TLS certificates are issued automatically by our platform to secure internal communications among the hosted applications that power PyPI.
As the last of these certificates expired around 2021-04-05 14:04:00 UTC our deployment tooling was unable to provision new instances of the hosted applications that compose PyPI.
Due the the volume of short-term (30 day) internal TLS certificates issued by the impacted core service, when the core service was redeployed to reissue the long term (1 year) internal TLS certificates, the backlog of expirations and revocations to process caused the Certificate Revocation List (CRL) to exceed it's maximum size and the service to crash.
This complication with the CRL led to an extended outage as an administrator of PyPI worked to restore the core service without interfering with its other functions.
Once the existing short-term (30 day) internal TLS certificate store was properly tidied around 2021-04-05 16:00:00 UTC, the PyPI administrator was able to bring back online all hosted applications and PyPI service was restored.
Beginning around 2021-04-05 14:04:00 UTC requests to PyPI that reached downed backends began to fail. By 2021-04-05 14:25:00 UTC all requests to the PyPI backends resulted in errors. Once hosted applications were restored at 2021-04-05 16:04:00 UTC services were restored.
The total outage window was 2 hours.
All uploads, requests to the Web UI which were not cached, XMLRPC requests were unsuccessful and received errors.
Additionally uncached requests to the Simple API and JSON API endpoints were diverted to our internal static mirror.
Diverted requests to the Simple API internal mirror experienced additional confusing behavior due to missing data-requires-python
metadata may have led to some users reliant on that feature receiving confusing errors or incorrect versions of unpinned dependencies.
The applications that compose the PyPI service are deployed to a cluster using tooling built on top of Kubernetes.
This tooling uses Hashicorp Consul and Hashicorp Vault to manage access control to secrets and configuration as well as issue short-lived (30 day) TLS certificates to secure all internal traffic within the cluster.
In order to secure traffic to our Consul and Vault instances, Kubernetes Certificates API is used to issue long-lived (1 year) certificates to the pods that will run these services.
This outage was initiated when the long-lived certificates for our Vault deployment expired leading to the inability for our deployment to deploy new Kubernetes Pods due to a lack of short-lived TLS certificates and access to secrets.
Due to the volume of Pods with such TLS certificates, once the issue was identified and redeployment of our Vault instances began, enough of the short-lived certificates had expired that the given the time since the tidy operation for the automated Certificate Authority last ran a large enough Certificate Revocation List was generated by the backend to exceeded the size allowed for storage.
This led to instances of Vault crashing when attempting to take on the leader role. Identifying this issue and determining the appropriate way to clean up the backend without corrupting our Vault storage backend led to the extended outage.
We will deploy a similar tool to the one that monitors our short-lived certificates to ensure that our long-lived certificates do not reach >75% of their lifespan.
We will reassess the tuning parameters of the operation that revokes certificates for decommissioned pods. As PyPI's footprint has grown, so has the number of pods running behind the services. This along with deployment frequency has led to much larger Certificate Revocation Lists than we initially tuned for.
Our internal mirror will be upgraded to mirror all current Simple API features so that in the event of future outages, installs will not be impacted.