The incident has been resolved.
Duration: ~36 minutes (20:04–20:40 UTC)
Impact: Some PSF-hosted services were unavailable, including python.org, us.pycon.org, PyPI stats, bugs.python.org, and related services.
What was unaffected was our other cluster that manages PyPI.org among other services related to PyPI.
Root Cause: During local development of kubernetes workloads locally there was an incorrect context switch to one of our production clusters.
The scale-down commands ran against the production cluster instead of the local environment, iterating through all deployments and setting them to zero replicas. which created cascading failures.
Recovery: Services were restored with the help of Ee Durbin by bringing up infrastructure in dependency order, original replica counts were recovered from Kubernetes event history.
Action items:
Jacob Coffee, PSF Infrastructure Team