PyPI Backends are unavailable.

Incident Report for Python Infrastructure

Postmortem

Summary

The cluster that hosts PyPI’s backends as well as multiple ancillary services experienced an outage during maintenance that interrupted access to services over HTTPS.

Details

From 2022-10-24 14:43 UTC until 2022-10-24 15:03 UTC, PyPI’s backends were not accessible over HTTPS. This interfered with our CDN’s ability to fetch pages, made uploads to uploads.pypi.org impossible, and interrupted other services such as our legacy file redirect service.

PyPI services run as deployments in a Kubernetes cluster and are exposed via Ingress with the AWS Elastic Load Balancer (ELB) integration. The TLS certificate that this load balancer uses is managed via AWS Certificate Manager (ACM). When our Kubernetes cluster was initially deployed, the Ingress-managed ELB was configured to use an existing ACM TLS certificate.

Earlier today, regular maintenance of our Kubernetes cluster required a rolling restart of all nodes in the cluster to distribute upgrades and new configurations. During this rolling restart, as the new Kubernetes API server hosts were deployed, Kubernetes validated and refreshed the Ingress configurations we had previously defined.

Since the most recent rolling upgrade, an additional hostname was needed for PyPI’s Ingress. PyPI administrators created a new ACM TLS certificate including that hostname and updated the Ingress-managed ELB to use this new certificate. The Ingress configuration held in Kubernetes, however, still referenced the previous certificate. As a result, when the new Kubernetes API servers came online, they were unable to find the previous ACM TLS certificate and disabled the HTTPS listener for the Ingress configuration that serves PyPI and associated services.

Once the cause was identified, PyPI administrators updated the Ingress configuration to point to the new ACM TLS certificate. Kubernetes then restored the HTTPS listener on the Ingress-managed ELB, restoring access to all services.
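
The failure and the fix can be pictured with a toy model. The sketch below is not the actual controller logic. It assumes, for illustration only, that the HTTPS listener is enabled only when the certificate referenced by the Ingress can be found. The annotation key, the certificate identifiers, and the function name are all hypothetical.

```python
from typing import Dict, Optional, Set

# Hypothetical annotation key: the report does not say which key is used.
CERT_ANNOTATION = "example.invalid/certificate-arn"


def https_listener(annotations: Dict[str, str], known_certs: Set[str]) -> Optional[dict]:
    """Return an HTTPS listener config, or None if the certificate cannot be found."""
    cert = annotations.get(CERT_ANNOTATION)
    if cert is None or cert not in known_certs:
        # Toy equivalent of "HTTPS listener disabled".
        return None
    return {"port": 443, "protocol": "HTTPS", "certificate": cert}


# Placeholder identifiers, not real certificate ARNs.
known_certs = {"new-cert"}             # only the new certificate can be found
stale = {CERT_ANNOTATION: "old-cert"}  # Ingress still references the previous one
fixed = {CERT_ANNOTATION: "new-cert"}  # Ingress after it was pointed at the new one

print(https_listener(stale, known_certs))
print(https_listener(fixed, known_certs))
```

In this model the stale configuration yields no HTTPS listener, and pointing the configuration at the new certificate brings the listener back.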

Mitigation

We will investigate mechanisms by which the hostnames needed on ACM TLS certificates for PyPI’s Ingress configurations can be managed via Kubernetes resources rather than manually via the AWS console. If all of these resources are managed via the Kubernetes API, drift between desired state and reality will be less likely to occur and then surface during similar maintenance in the future.
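
The kind of drift described here can be expressed as a simple comparison between what Kubernetes declares and what the load balancer actually uses. The following is a minimal sketch, assuming both values have already been fetched by some other means. The function name and the placeholder identifiers are hypothetical and are not part of PyPI's tooling.

```python
from typing import List, Optional


def certificate_drift(declared: Optional[str], in_use: List[str]) -> List[str]:
    """List each certificate in use that differs from the declared one."""
    problems = []
    for cert in in_use:
        if cert != declared:
            problems.append(
                f"listener uses {cert!r}, but Kubernetes declares {declared!r}"
            )
    return problems


# Placeholder identifiers, not real certificate ARNs.
# State before the incident: load balancer changed by hand, Ingress not updated.
print(certificate_drift("old-cert", ["new-cert"]))
# State after the fix: both sides agree, so no drift is reported.
print(certificate_drift("new-cert", ["new-cert"]))
```

An empty result means the declared and actual certificates agree. Any entry indicates a manual change that a later reconciliation could act on.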

Posted Oct 24, 2022 - 16:05 UTC

Resolved

This incident has been resolved.

Posted Oct 24, 2022 - 15:44 UTC

Monitoring

We have identified and resolved the cause of the outage and are monitoring to ensure that services remain stable.

Posted Oct 24, 2022 - 15:05 UTC

Investigating

The backend that hosts PyPI and associated services is experiencing a major outage. We are investigating.

Posted Oct 24, 2022 - 14:51 UTC

This incident affected: PyPI (pypi.org - Backends, files.pythonhosted.org - Redirects, files.pythonhosted.org - Redirects Backends).