PyPI Simple Index Timing Out

Incident Report for Python Infrastructure

Postmortem

What happened?

PyPI experienced an outage which resulted in serving 500-level responses for certain requests, including requests to a subset of the Simple Repository API, from approximately 19:00:00 UTC to 20:15:00 UTC. As a result, some packages were uninstallable during this period. While such outages are not entirely uncommon and generally easy to recover from, this outage was more severe due to PyPI becoming un-deployable as a result of its own outage.

Impact

At its peak, PyPI served errors at approximately 25 responses per second and in total nearly 100K requests during this ~1hr 15m period resulted in an error. This represents approximately 0.04% of usual traffic to the affected services.
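
As a rough cross-check of these figures, the arithmetic below derives the average error rate and the implied request volume for the affected services. The inputs are the rounded numbers stated above (100K errors, 75 minutes, 0.04%), so the derived values are only as precise as those inputs.

```python
# Rounded figures taken from the report above.
errors_total = 100_000          # "nearly 100K" failed requests
duration_s = 75 * 60            # ~1hr 15m, in seconds
error_share = 0.04 / 100        # 0.04% of usual traffic to affected services

# Average error rate over the window (the report gives ~25/s at peak).
avg_errors_per_s = errors_total / duration_s

# Total traffic implied by "errors were 0.04% of usual traffic".
implied_requests = errors_total / error_share
implied_requests_per_s = implied_requests / duration_s

print(f"average error rate: {avg_errors_per_s:.1f}/s")
print(f"implied requests in window: {implied_requests:,.0f}")
print(f"implied request rate: {implied_requests_per_s:,.0f}/s")
```

The average of roughly 22 errors per second is consistent with the stated peak of approximately 25 per second.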

Background

PyPI is served by a Python web application named ‘Warehouse’, with several dependencies on Python packages that are hosted on PyPI. As part of the deployment pipeline, Warehouse and its dependencies are built into a series of container images that are ultimately deployed into a container orchestration service.

At the time of the incident, PyPI contributors were working on adding new functionality, which included introducing a new service to the service layer used with PyPI’s web framework. While the details of this new functionality are unrelated to the outage, a misconfiguration of the new service caused views that depended on it to fail to serve a response. This was missed by PyPI’s unit test suite and by manual testing of the feature branch.

Investigation

PyPI administrators were notified of a production error approximately 10 minutes after it was introduced, and immediately after a faulty request was served:

2024-08-21T18:48:29 UTC - The feature branch is merged into the `main` branch of the Warehouse repository
2024-08-21T18:55:42 UTC - The container image is built and deployed to production
2024-08-21T18:58:37 UTC - PyPI admins are notified of an exception in production affecting a Simple Repository API view

At this point, PyPI administrators began working on a fix.

Root Causes and Trigger

PyPI administrators immediately determined that the recent merge was at fault due to timing and clear signals from the error observed.

As a result of PyPI’s container orchestration and deployment service not explicitly supporting rollbacks, issues like this are generally resolved by introducing a new pull request which reverts the problematic commit, which in turn is merged and deployed.

In this instance, a revert commit was prepared, but the decision was made to wait for a forward fix instead. This was partly because the fix was expected to be a trivial change that could be prepared as quickly as a revert, thus saving time overall.

In addition, the new feature included a database migration which had already been applied in production. This migration could not be undone simply by reverting the commit: deployment would fail when it attempted to bring the database up to date with the migration history and found that the database’s current migration version was no longer present. A revert would instead require an additional migration to move the database forward to a state compatible with the reverted commit, which would potentially take more time than introducing a fix.
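
The failure mode described above can be illustrated with a toy model. This is a minimal sketch, not Warehouse’s actual migration tooling: the revision names and the `upgrade_to_head` helper are hypothetical, and the model only assumes that the database records its current revision and that deployment refuses to proceed when the deployed code no longer contains that revision.

```python
class UnknownRevision(Exception):
    """Raised when the database points at a revision the code does not have."""


def upgrade_to_head(code_revisions: list[str], db_revision: str) -> list[str]:
    """Return the revisions that still need to be applied, in order.

    code_revisions is the ordered migration history shipped with the code;
    db_revision is the revision currently recorded in the database.
    """
    if db_revision not in code_revisions:
        raise UnknownRevision(db_revision)
    position = code_revisions.index(db_revision)
    return code_revisions[position + 1:]


# Hypothetical revision names, for illustration only.
with_feature = ["rev_a", "rev_b", "rev_feature"]
reverted = ["rev_a", "rev_b"]                        # plain revert drops rev_feature
roll_forward = with_feature + ["rev_undo_feature"]   # extra migration instead

db_revision = "rev_feature"                          # already applied in production

# Rolling forward leaves one migration to apply.
print(upgrade_to_head(roll_forward, db_revision))

# A plain revert leaves the database at a revision the code cannot find.
try:
    upgrade_to_head(reverted, db_revision)
except UnknownRevision as exc:
    print(f"deploy fails: database is at unknown revision {exc}")
```

In this model, the only way to deploy the reverted code is to ship a history that still contains the applied revision plus a new migration that undoes it, which matches the extra work described above.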

The fix was prepared, merged and deployed, but it contained an additional issue and did not resolve the root cause. An additional pull request to rectify the issue with the forward fix was prepared.

At this time, PyPI administrators noted that PyPI’s CI/CD suite was failing to build the containers required to run the test suite on new pull requests, because the `/simple` detail page for one of PyPI’s dependencies was unresolvable during the outage. As a result, the team decided to pursue a full revert of the feature instead, hoping to leverage our container layer cache so that external requests to PyPI would not be necessary to build. An additional pull request to fully revert the faulty commit was prepared.

While the container image build on the new PR was successful, a CI/CD check which did not use the container image, namely a check that ensures Warehouse’s dependency lock files are up-to-date, failed due to the outage. PyPI administrators attempted to disable the required status of these checks in branch protection for the repository, but the failed dependency check caused the remaining tests to be canceled before completion, which prevented the deployment pipeline from picking up the new commit and triggering a deployment.

At this time, PyPI administrators determined that overriding the build/deploy service to manually revert the deployed container image back to a known good image, bypassing the release phase that included migrations, was required.

Mitigation

Mitigation required bypassing PyPI’s build and deploy pipeline to manually re-deploy a previously built image, without running the release phase which included migrations. This allowed PyPI to respond successfully to previously failing requests, which in turn let the revert pull request build and be deployed, fully resolving the issue in production.

2024-08-21T20:17:58 UTC - The revert was fully deployed to production and errors subsided

Additionally, after the production incident was resolved, it was determined that although the final deployment to TestPyPI had been reported as successful, it had not succeeded because the database migration failed to get a lock. This highlighted a logic error in how our deployment pipeline reports failed deployments.

Lessons Learned

Until additional protections are in place, when an outage affects `/simple`, reverts must be immediate or they run the risk of requiring manual intervention.
It is clear that a lack of functional testing is a gap in assurances against production outages.
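
To illustrate how a stubbed unit test can pass while the deployed application fails, the sketch below uses a toy service registry. All names here (`ServiceRegistry`, `simple_detail`, `"new_feature"`) are hypothetical and are not taken from Warehouse. The misconfiguration is modeled as a missing registration, which is an assumption; this report does not specify the exact fault.

```python
from unittest import mock


class ServiceRegistry:
    """Toy stand-in for a web framework's service layer."""

    def __init__(self):
        self._factories = {}

    def register(self, name, factory):
        self._factories[name] = factory

    def find_service(self, name):
        # Fails at request time if the service was never registered.
        return self._factories[name]()


def simple_detail(registry, project):
    """Hypothetical view that depends on the new service."""
    service = registry.find_service("new_feature")
    return {"project": project, "extra": service.describe(project)}


def unit_test_with_stub():
    # The stub replaces the lookup, so a missing registration goes unnoticed.
    registry = mock.Mock()
    registry.find_service.return_value.describe.return_value = "ok"
    assert simple_detail(registry, "example")["extra"] == "ok"


def functional_test_with_real_registry():
    # Uses a real registry in which "new_feature" was never registered.
    registry = ServiceRegistry()
    simple_detail(registry, "example")


unit_test_with_stub()
print("unit test with stub: passed")

try:
    functional_test_with_real_registry()
except KeyError as exc:
    print(f"functional test: failed, service {exc} is not registered")
```

The stubbed test exercises only the view logic, while the second test goes through the same lookup path a real request would, which is where this class of fault surfaces.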

Things That Went Well

Maintaining cached container image layers proved potentially useful for recovering from issues where new container images could not be built due to the outage.

Things That Went Poorly

PyPI used to maintain a mirror of itself on a third-party service to prevent incidents like this, but the mirror is no longer used because its on-disk size made it difficult to maintain.
The decision to fix forward instead of reverting did not work out. While the additional work to revert could still have resulted in a stuck merge, waiting for the forward fix, which had issues of its own, allowed enough time to pass for one of PyPI’s dependencies to be affected by the outage.
A popular dependency, which was also a Warehouse sub-dependency, quickly fell out of cache. PyPI’s CDN is configured to serve stale responses while it revalidates for 5 minutes, and to serve stale responses while the backends are returning errors for 1 day, so the CDN should not have served an error for any of these pages.
The lack of a mirror and a production dependency on PyPI allowed for an ‘ouroboros’ style paradoxical outage that could not be resolved by a simple revert.
A subset of CI/CD checks in the critical path to deployment reach out to PyPI directly as part of their tests, and as a result they will likely fail in conditions like this outage.
Failing CI/CD checks that were not strictly required by branch protection rules still prevented the deployment pipeline from proceeding, because `cancel-in-progress: true` was set on the parent workflow, causing the workflow to fail when those checks failed or were canceled.
The Warehouse codebase has extensive unit testing which depends on stubbing/mocking which masked the root cause. The codebase only recently gained the ability to perform functional testing, but the problematic PR did not add additional functional tests, and the limited functional testing that did exist missed this issue.
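
The CDN behavior mentioned in the list above (serve stale responses for 5 minutes while revalidating, and for 1 day while backends return errors) can be sketched as a simple decision function. This is a simplified model in the spirit of the `stale-while-revalidate` and `stale-if-error` cache extensions, not PyPI’s actual CDN configuration; the function and parameter names are hypothetical.

```python
STALE_WHILE_REVALIDATE = 5 * 60        # seconds, from the report
STALE_IF_ERROR = 24 * 60 * 60          # seconds, from the report


def cdn_response(cached, seconds_past_expiry, backend_ok):
    """Decide what a simplified CDN would return for one request.

    cached: whether the object is still held in the cache at all.
    seconds_past_expiry: how long ago the cached copy went stale (<= 0 if fresh).
    backend_ok: whether the origin currently answers without a 5xx error.
    """
    if cached and seconds_past_expiry <= 0:
        return "fresh hit"
    if cached and backend_ok and seconds_past_expiry <= STALE_WHILE_REVALIDATE:
        return "stale, revalidating in background"
    if cached and not backend_ok and seconds_past_expiry <= STALE_IF_ERROR:
        return "stale, origin is failing"
    # Nothing usable in cache: the client sees whatever the origin returns.
    return "origin response" if backend_ok else "origin error"


# One hour past expiry with a failing origin: still served from cache.
print(cdn_response(cached=True, seconds_past_expiry=3600, backend_ok=False))

# Object no longer in cache with a failing origin: the error reaches the client.
print(cdn_response(cached=False, seconds_past_expiry=0, backend_ok=False))
```

In this model the stale windows only help while the object is still held in cache. An object that has been evicted results in an origin error regardless of those settings, which is one possibility to check; the actual cause remains an open action item.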

Where We Got Lucky

At the time of the incident, all of PyPI’s administrators were online and able to contribute towards mitigating the outage, including the sole administrator that was capable of performing the manual rollback.

Action Items

Determine why the sub-dependency fell out of cache more quickly than expected based on the CDN configuration, and how that could have been prevented.
Add functional testing to cover the Simple Repository API, with a focus on critical code paths that can cause similar outages.
Update the deployment pipeline to correctly report on deployment status.
Introduce rollback functionality for the deployment pipeline so that a rollback does not require manual intervention or specific knowledge.
Re-introduce a mirror (either static or a caching proxy) for PyPI that would not be affected by production outages.
Revisit the CI/CD dependency check and determine how it can be hardened against a production outage so that it does not block a deployment.

Posted 11 months ago. Aug 22, 2024 - 20:48 UTC

Resolved

This incident has been resolved.

Posted 11 months ago. Aug 21, 2024 - 20:23 UTC

Identified

An error serving /simple requests has caused timeouts for some requests. We have identified the issue and are working on a revert.

Posted 11 months ago. Aug 21, 2024 - 19:53 UTC

This incident affected: PyPI (pypi.org - Backends).