Python Infrastructure Outage

Incident Report for Python Infrastructure

Postmortem

The incident has been resolved.

Incident Report: Cluster-wide service outage

Duration: ~36 minutes (20:04–20:40 UTC)

Impact: Some PSF-hosted services were unavailable, including python.org, us.pycon.org, PyPI stats, bugs.python.org, and related services.

Our other cluster, which runs PyPI.org and other PyPI-related services, was unaffected.

Root Cause: During local development of Kubernetes workloads, the active context was incorrectly switched to one of our production clusters.

The scale-down commands ran against the production cluster instead of the local environment, iterating through all deployments and setting them to zero replicas, which caused cascading failures.
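
One way to guard against this failure mode is to check the active context before any bulk operation runs. The following is a minimal sketch, not the tooling used by the PSF. The context names in LOCAL_CONTEXTS and both helper functions are hypothetical. The only external command used is `kubectl config current-context`.

```python
import subprocess

# Assumption: names of contexts considered safe for local work (hypothetical).
LOCAL_CONTEXTS = frozenset({"minikube", "kind-local", "docker-desktop"})


def current_context() -> str:
    """Return the name of the active kubectl context."""
    result = subprocess.run(
        ["kubectl", "config", "current-context"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def assert_local_context(context: str, allowed=LOCAL_CONTEXTS) -> None:
    """Raise RuntimeError unless the context is on the local allowlist."""
    if context not in allowed:
        raise RuntimeError(
            f"refusing bulk operation: context {context!r} is not a local context"
        )
```

A scale-down script could call `assert_local_context(current_context())` as its first step, so that a wrong context stops the run before any deployment is touched.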

Recovery: Services were restored with the help of Ee Durbin by bringing up infrastructure in dependency order. Original replica counts were recovered from Kubernetes event history.
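
The two recovery steps can be sketched as follows. This is an illustration under stated assumptions, not the procedure that was actually run. It assumes scale-down event messages of the form "Scaled down replica set <name> to 0 from <N>" (the exact wording varies between Kubernetes versions), and the service names, event messages, and dependency graph are hypothetical.

```python
import re
from graphlib import TopologicalSorter

# Assumption: event message format; adjust the pattern to the cluster's version.
SCALE_DOWN = re.compile(r"Scaled down replica set (\S+) to 0 from (\d+)")


def original_replicas(messages):
    """Map replica set name to the replica count it had before scaling to zero."""
    counts = {}
    for message in messages:
        match = SCALE_DOWN.search(message)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


def bring_up_order(depends_on):
    """Order services so each one comes after the services it depends on."""
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical data for illustration only.
messages = [
    "Scaled down replica set web-5d9c to 0 from 3",
    "Scaled down replica set cache-7f6b to 0 from 1",
]
depends_on = {"web": {"database", "cache"}, "cache": set(), "database": set()}

replicas = original_replicas(messages)
order = bring_up_order(depends_on)
```

Here `depends_on` maps each service to the services it needs, so `order` lists the database and cache before the web tier, and `replicas` holds the counts to restore.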

Action items:

Separate kubeconfig files for production vs local, rather than relying on context switching
Research adding admission control or policies to prevent bulk scale-to-zero operations
Document the infrastructure dependency chain and recovery runbook for future incidents
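
As a rough illustration of the second action item, a bulk scale-to-zero check could look like the sketch below. It is a client-side pre-flight check rather than a Kubernetes admission controller, and the function name, the plan format, and the `max_zeroed` threshold are all hypothetical.

```python
def check_scale_plan(plan, max_zeroed=1):
    """Reject a plan that scales more than max_zeroed deployments to zero.

    plan maps a deployment name to its requested replica count.
    Returns the sorted names of deployments that would be scaled to zero.
    """
    zeroed = sorted(name for name, replicas in plan.items() if replicas == 0)
    if len(zeroed) > max_zeroed:
        raise ValueError(
            f"plan scales {len(zeroed)} deployments to zero: {zeroed}"
        )
    return zeroed
```

With the default threshold, a plan that zeroes a single deployment passes, while a loop over every deployment in a namespace is rejected before anything is applied.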

Jacob Coffee, PSF Infrastructure Team

Posted Feb 26, 2026 - 23:27 UTC

Resolved

The incident has been resolved.

Posted Feb 26, 2026 - 23:26 UTC

Update

The root cause has been identified as an erroneous scaling operation that affected some application workloads. All services have been restored and are coming back online. We are monitoring to confirm full recovery.

Posted Feb 26, 2026 - 20:40 UTC

Investigating

We are currently investigating this issue.

Posted Feb 26, 2026 - 20:04 UTC

This incident affected: python.org (python.org - CDN, python.org - Backends, python.org - Downloads Backends) and bugs.python.org, us.pycon.org.