The cluster that hosts PyPI’s backends as well as multiple ancillary services experienced an outage during maintenance that interrupted access to services over HTTPS.
From 2022-10-24 14:43 UTC until 2022-10-24 15:03 UTC, PyPI’s backends were not accessible over HTTPS. This interfered with our CDN’s ability to fetch pages, made uploads to uploads.pypi.org impossible, and interrupted other services such as our legacy file redirect service.
PyPI services run as deployments in a Kubernetes cluster and are exposed via Ingress with the AWS Elastic Load Balancer (ELB) integration. The TLS certificate that this load balancer uses is managed via AWS Certificate Manager (ACM). When we initially deployed our Kubernetes cluster, the Ingress-managed ELB was configured to use an existing ACM TLS certificate.
Earlier today, regular maintenance of our Kubernetes cluster required a rolling restart of all nodes in the cluster to distribute upgrades and new configurations. During this rolling restart, as the new Kubernetes API server hosts were deployed, Kubernetes validated and refreshed the Ingress configurations we had previously defined.
Since the previous rolling upgrade, an additional hostname had been needed for PyPI’s Ingress. PyPI administrators created a new ACM TLS certificate that included that hostname and manually updated the Ingress-managed ELB to use it, but the Ingress configuration in Kubernetes still referenced the previous certificate. As a result, as the new Kubernetes API servers came online, they were unable to find the previous ACM TLS certificate and disabled the HTTPS listener for the Ingress configuration that serves PyPI and associated services.
Once the cause was identified, the PyPI admins updated the Ingress configuration to point to the new ACM TLS certificate, and Kubernetes restored the HTTPS listener on the Ingress-managed ELB, restoring access to all services.
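The mismatch behind this outage, where the Ingress configuration referenced one certificate while the load balancer had been switched to another, is the kind of drift a small pre-maintenance check could surface. The following is a minimal sketch in Python, not the tooling PyPI uses. It assumes the certificate is declared through the `alb.ingress.kubernetes.io/certificate-arn` annotation of the AWS Load Balancer Controller (the report does not say which integration or annotation is in use), and it takes the set of certificate ARNs actually attached to the listener as an input rather than querying AWS. The function names and the ARNs are hypothetical.

```python
from __future__ import annotations

# Assumption: the Ingress declares its certificate through this
# AWS Load Balancer Controller annotation. The real PyPI setup may differ.
CERT_ANNOTATION = "alb.ingress.kubernetes.io/certificate-arn"


def declared_cert_arns(ingress: dict) -> set[str]:
    """Return the certificate ARNs declared on an Ingress manifest (as a dict)."""
    annotations = ingress.get("metadata", {}).get("annotations") or {}
    raw = annotations.get(CERT_ANNOTATION, "")
    # The annotation may carry several comma-separated ARNs.
    return {arn.strip() for arn in raw.split(",") if arn.strip()}


def find_cert_drift(ingress: dict, live_arns: set[str]) -> dict[str, list[str]]:
    """Compare declared certificates with those attached to the live listener.

    `live_arns` is supplied by the caller, for example from an inventory
    export; this sketch does not talk to AWS itself.
    """
    declared = declared_cert_arns(ingress)
    return {
        "only_declared": sorted(declared - live_arns),  # in Kubernetes only
        "only_live": sorted(live_arns - declared),      # on the ELB only
    }


if __name__ == "__main__":
    # Hypothetical ARNs for illustration only.
    old_arn = "arn:aws:acm:us-east-1:111111111111:certificate/old-example"
    new_arn = "arn:aws:acm:us-east-1:111111111111:certificate/new-example"
    ingress = {"metadata": {"annotations": {CERT_ANNOTATION: old_arn}}}
    print(find_cert_drift(ingress, {new_arn}))
```

An empty result in both lists means the declared and live state agree. Any entry under `only_declared` is a certificate that Kubernetes will try to apply the next time it reconciles the Ingress.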
We will investigate mechanisms by which the hostnames needed on ACM TLS certificates for PyPI’s Ingress configurations can be managed via Kubernetes resources rather than manually via the AWS console. Managing all of these resources via the Kubernetes API will make drift between desired state and reality less likely to occur and to surface during similar maintenance in the future.
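One building block for such a mechanism is a check that the hostnames an Ingress serves are all covered by the names on its certificate. The sketch below illustrates that idea in Python under stated assumptions: the host list and certificate names are passed in as plain lists, the helper names are hypothetical, and the wildcard rule is the common one in which `*` matches exactly one leftmost label. The hostname `new-host.example.org` is a placeholder, since the report does not name the hostname that was added.

```python
from __future__ import annotations


def hostname_covered(hostname: str, cert_names: list[str]) -> bool:
    """Return True if `hostname` matches any name on the certificate."""
    hostname = hostname.lower().rstrip(".")
    for name in cert_names:
        name = name.lower().rstrip(".")
        if name == hostname:
            return True
        if name.startswith("*."):
            suffix = name[1:]  # "*.pypi.org" becomes ".pypi.org"
            if hostname.endswith(suffix):
                left = hostname[: -len(suffix)]
                # A wildcard covers exactly one non-empty leftmost label.
                if left and "." not in left:
                    return True
    return False


def uncovered_hosts(ingress_hosts: list[str], cert_names: list[str]) -> list[str]:
    """List Ingress hostnames the certificate does not cover, in input order."""
    return [h for h in ingress_hosts if not hostname_covered(h, cert_names)]


if __name__ == "__main__":
    cert_names = ["pypi.org", "*.pypi.org"]
    hosts = ["pypi.org", "uploads.pypi.org", "new-host.example.org"]
    print(uncovered_hosts(hosts, cert_names))
```

A non-empty result signals that the certificate needs to be reissued with additional names before the Ingress change is applied.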