PyPI Unavailable

Incident Report for Python Infrastructure

Postmortem

PyPI Outage Report 2021-04-05

Summary

Shortly after 2021-04-05 14:00:00 UTC, long-term (1 year) internal TLS certificates for a core service in the infrastructure that runs PyPI began expiring.

These TLS certificates are issued automatically by our platform to secure internal communications among the hosted applications that power PyPI.

As the last of these certificates expired around 2021-04-05 14:04:00 UTC, our deployment tooling was unable to provision new instances of the hosted applications that compose PyPI.

Due to the volume of short-term (30 day) internal TLS certificates issued by the impacted core service, a backlog of expirations and revocations had accumulated by the time the core service was redeployed to reissue the long-term (1 year) internal TLS certificates. Processing that backlog caused the Certificate Revocation List (CRL) to exceed its maximum size and the service to crash.

This complication with the CRL led to an extended outage as an administrator of PyPI worked to restore the core service without interfering with its other functions.

Once the existing short-term (30 day) internal TLS certificate store was properly tidied around 2021-04-05 16:00:00 UTC, the PyPI administrator was able to bring all hosted applications back online and PyPI service was restored.

Impact

Timeline

Beginning around 2021-04-05 14:04:00 UTC, requests to PyPI that reached downed backends began to fail. By 2021-04-05 14:25:00 UTC, all requests to the PyPI backends resulted in errors. Service returned once the hosted applications were restored at 2021-04-05 16:04:00 UTC.

The total outage window was 2 hours.

Results

All uploads, uncached requests to the Web UI, and XMLRPC requests were unsuccessful and received errors.

Additionally, uncached requests to the Simple API and JSON API endpoints were diverted to our internal static mirror.

Unexpected behavior

Requests diverted to the internal Simple API mirror behaved unexpectedly because the mirror was missing data-requires-python metadata. This may have led some users reliant on that feature to receive confusing errors or incorrect versions of unpinned dependencies.
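To illustrate why missing data-requires-python metadata can change what an installer selects, here is a minimal sketch of the filtering step. The release list, the version numbers, and the helpers `is_compatible` and `pick_latest` are hypothetical, and only the simple `>=X.Y` specifier form is handled; real installers use a full specifier parser.

```python
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '3.6' into (3, 6) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def is_compatible(requires_python: Optional[str], python: str) -> bool:
    """Sketch only: supports '>=X.Y'. A missing value means no restriction."""
    if requires_python is None:
        return True
    if not requires_python.startswith(">="):
        raise ValueError(f"unsupported specifier in this sketch: {requires_python}")
    return parse_version(python) >= parse_version(requires_python[2:].strip())


def pick_latest(files, python: str) -> str:
    """Return the highest version whose metadata allows the running Python."""
    compatible = [f for f in files if is_compatible(f["requires_python"], python)]
    return max(compatible, key=lambda f: parse_version(f["version"]))["version"]


# Hypothetical project: release 2.0 dropped support for Python older than 3.8.
with_metadata = [
    {"version": "1.0", "requires_python": ">=2.7"},
    {"version": "2.0", "requires_python": ">=3.8"},
]
# The same listing as a mirror without the metadata would present it.
without_metadata = [dict(f, requires_python=None) for f in with_metadata]

print(pick_latest(with_metadata, "3.6"))     # 1.0
print(pick_latest(without_metadata, "3.6"))  # 2.0
```

With the metadata present, a Python 3.6 client falls back to the last compatible release; without it, the client picks the newest release regardless of compatibility.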

Details

The applications that compose the PyPI service are deployed to a cluster using tooling built on top of Kubernetes.

This tooling uses Hashicorp Consul and Hashicorp Vault to manage access control to secrets and configuration as well as issue short-lived (30 day) TLS certificates to secure all internal traffic within the cluster.

In order to secure traffic to our Consul and Vault instances, the Kubernetes Certificates API is used to issue long-lived (1 year) certificates to the pods that will run these services.

This outage began when the long-lived certificates for our Vault deployment expired, leaving our deployment tooling unable to deploy new Kubernetes Pods due to a lack of short-lived TLS certificates and access to secrets.

Due to the volume of Pods with such TLS certificates, by the time the issue was identified and redeployment of our Vault instances began, enough of the short-lived certificates had expired since the tidy operation for the automated Certificate Authority last ran that the Certificate Revocation List generated by the backend exceeded the size allowed for storage.

This led to instances of Vault crashing when attempting to take on the leader role. Identifying this issue and determining the appropriate way to clean up the backend without corrupting our Vault storage backend led to the extended outage.

Mitigations

Kubernetes Certificate Authority Monitoring

We will deploy a tool similar to the one that monitors our short-lived certificates, so that our long-lived certificates are renewed before they pass 75% of their lifespan.
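A check of this kind can be sketched as follows. This is an illustration only, not the monitoring tool itself: the function names and the example dates are hypothetical, and a real monitor would read the validity period from each certificate.

```python
from datetime import datetime, timedelta, timezone


def lifespan_used(not_before: datetime, not_after: datetime, now: datetime) -> float:
    """Fraction of the certificate's validity period that has elapsed."""
    total = (not_after - not_before).total_seconds()
    elapsed = (now - not_before).total_seconds()
    return elapsed / total


def needs_renewal(not_before, not_after, now, threshold=0.75) -> bool:
    """True once more than `threshold` of the lifespan has been used."""
    return lifespan_used(not_before, not_after, now) > threshold


# Illustrative 1-year certificate; the dates are assumptions for the example.
issued = datetime(2020, 4, 5, 14, 0, tzinfo=timezone.utc)
expires = issued + timedelta(days=365)

print(needs_renewal(issued, expires, issued + timedelta(days=200)))  # False
print(needs_renewal(issued, expires, issued + timedelta(days=300)))  # True
```

For a 1 year certificate, a 75% threshold leaves roughly three months between the first alert and expiry.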

Internal Certificate Authority Tidiness

We will reassess the tuning parameters of the operation that revokes certificates for decommissioned pods. As PyPI's footprint has grown, so has the number of pods running behind the services. This along with deployment frequency has led to much larger Certificate Revocation Lists than we initially tuned for.
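One routine way to keep a certificate store and CRL small is to run the CA's tidy operation on a schedule. The sketch below assumes the CA is Vault's PKI secrets engine mounted at `pki`. The report does not state the mount path, the schedule, or the parameters in use, so the address, token, and `build_tidy_request` helper are all hypothetical. The function only builds the HTTP request and does not send it. It shows routine tidying, not the recovery procedure used during this incident.

```python
import json
import urllib.request


def build_tidy_request(vault_addr, token, mount="pki", safety_buffer="72h"):
    """Build (but do not send) a POST to the PKI secrets engine tidy endpoint."""
    body = {
        "tidy_cert_store": True,         # drop expired certificates from storage
        "tidy_revoked_certs": True,      # drop expired entries from the revocation list
        "safety_buffer": safety_buffer,  # grace period past expiry before removal
    }
    return urllib.request.Request(
        url=f"{vault_addr.rstrip('/')}/v1/{mount}/tidy",
        data=json.dumps(body).encode("utf-8"),
        headers={"X-Vault-Token": token, "Content-Type": "application/json"},
        method="POST",
    )


# Placeholder address and token, for illustration only.
request = build_tidy_request("https://vault.example.internal:8200", "placeholder-token")
print(request.get_method(), request.full_url)
```

Sending the request would be a separate step, for example with `urllib.request.urlopen(request)`, run at an interval chosen with the deployment frequency and pod count in mind.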

Updates to Internal Mirror

Our internal mirror will be upgraded to mirror all current Simple API features so that in the event of future outages, installs will not be impacted.

Posted Apr 05, 2021 - 17:10 UTC

Resolved

This incident has been resolved.

Posted Apr 05, 2021 - 17:08 UTC

Monitoring

Backend services have been restored. We are monitoring for stability and preparing notes for an incident report.

Posted Apr 05, 2021 - 16:06 UTC

Update

Our core backend service is back online and stable. We are slowly bringing the backing services back online followed by the PyPI applications.

Posted Apr 05, 2021 - 16:01 UTC

Update

We are continuing to try to bring our backend systems back online.

Posted Apr 05, 2021 - 15:39 UTC

Identified

An internal certificate in our deployment infrastructure has expired and we are working to roll out a new certificate and restart services.

Posted Apr 05, 2021 - 14:54 UTC

Investigating

All Web UI and Uploads impacted.

Posted Apr 05, 2021 - 14:30 UTC

This incident affected: PyPI (pypi.org - CDN, pypi.org - Backends, files.pythonhosted.org - Redirects, files.pythonhosted.org - Redirects Backends).