What happened?
PyPI experienced an outage which resulted in serving 500-level responses for certain requests, including requests to a subset of the Simple Repository API, from approximately 19:00:00 UTC to 20:15:00 UTC. As a result, some packages were uninstallable during this period. While such outages are not entirely uncommon and are generally easy to recover from, this one was more severe because the outage itself left PyPI un-deployable.
Impact
At its peak, PyPI served errors at approximately 25 responses per second. In total, nearly 100K requests during this ~1hr 15m period resulted in an error. This represents approximately 0.04% of usual traffic to the affected services.
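As a rough sanity check on these figures, the arithmetic below derives the average error rate and the traffic volume implied by the 0.04% figure. It uses only the rounded numbers quoted above, so the results are order-of-magnitude estimates rather than measured values.

```python
# Rounded figures taken from the impact summary above.
errors_total = 100_000        # "nearly 100K" failed requests
duration_seconds = 75 * 60    # ~1hr 15m
error_fraction = 0.0004       # "approximately 0.04%" of usual traffic

# Average error rate over the whole window (the stated peak was ~25/s).
avg_errors_per_second = errors_total / duration_seconds

# Traffic to the affected services implied by the stated fraction.
implied_requests = errors_total / error_fraction
implied_requests_per_second = implied_requests / duration_seconds

print(f"average errors/s: {avg_errors_per_second:.1f}")
print(f"implied requests in window: {implied_requests:,.0f}")
print(f"implied requests/s: {implied_requests_per_second:,.0f}")
```

The average of roughly 22 errors per second sits just under the stated peak of 25, which suggests the error rate was fairly steady for the duration of the outage.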
Background
PyPI is served by a Python web application named ‘Warehouse’, with several dependencies on Python packages that are hosted on PyPI. As part of the deployment pipeline, Warehouse and its dependencies are built into a series of container images that are ultimately deployed into a container orchestration service.
At the time of the incident, PyPI contributors were working on adding new functionality, which included introducing a new service to the service layer used with PyPI’s web framework. While the actual details of this new functionality are unrelated, a misconfiguration of this new service caused views that depended on it to fail to serve a response. This was missed both by PyPI’s unit test suite and by manual testing of the feature branch.
Investigation
PyPI administrators were notified of a production error approximately 10 minutes after it was introduced, and immediately after a faulty request was served.
At this point, PyPI administrators began working on a fix.
Root Causes and Trigger
PyPI administrators immediately determined that the recent merge was at fault due to timing and clear signals from the error observed.
Because PyPI’s container orchestration and deployment service does not explicitly support rollbacks, issues like this are generally resolved by opening a new pull request that reverts the problematic commit, which in turn is merged and deployed.
In this instance, a revert commit was prepared; however, the decision was made to wait for a forward fix instead. This was partly because the fix was expected to be a trivial change that could be prepared as quickly as a revert, thus saving time overall.
In addition, the new feature included a database migration which had already been applied in production. This migration could not be undone simply by reverting the commit: deployment would fail when it attempted to bring the database up to date with the migration history and found that the current migration version was no longer present. A revert would instead require an additional migration to move the database forward to a state compatible with the reverted commit, which would potentially take more time than introducing a fix.
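The failure mode described above can be illustrated with a toy model of a linear migration history. This is a minimal sketch and not Warehouse's actual migration tooling: the function `plan_upgrade`, the exception `MissingRevisionError`, and the revision names are all hypothetical.

```python
class MissingRevisionError(Exception):
    """The database is stamped with a revision the deployed code does not ship."""


def plan_upgrade(known_revisions, db_revision):
    """Return the revisions to apply, oldest first, to reach the newest one.

    known_revisions: revision ids shipped with the code, oldest first.
    db_revision: revision id recorded in the database, or None if empty.
    """
    if db_revision is None:
        return list(known_revisions)
    if db_revision not in known_revisions:
        raise MissingRevisionError(
            f"database is at {db_revision!r}, which this code does not know about"
        )
    position = known_revisions.index(db_revision)
    return list(known_revisions[position + 1:])


# Hypothetical histories: rev_c is the migration shipped with the new feature.
with_feature = ["rev_a", "rev_b", "rev_c"]
plain_revert = ["rev_a", "rev_b"]                       # commit reverted, rev_c gone
forward_revert = ["rev_a", "rev_b", "rev_c", "rev_d"]   # rev_d undoes rev_c's changes

print(plan_upgrade(with_feature, "rev_b"))    # the original deploy applies rev_c
print(plan_upgrade(forward_revert, "rev_c"))  # a forward migration is deployable

try:
    plan_upgrade(plain_revert, "rev_c")
except MissingRevisionError as exc:
    print("deploy would fail:", exc)
```

In this model, a plain revert removes `rev_c` from the code while the database is still stamped with it, so the upgrade step has nowhere to start from. Keeping `rev_c` and adding a new revision that undoes its schema changes keeps the history consistent, at the cost of writing and reviewing another migration.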
The fix was prepared, merged and deployed; however, it contained an additional issue and so did not ultimately resolve the root cause. An additional pull request to rectify the issue with the forward fix was prepared.
At this time, PyPI administrators noted that PyPI’s CI/CD suite was failing to build the containers required to run the test suite on new pull requests, because the outage had made the `/simple` detail page for one of PyPI’s dependencies unresolvable. As a result, the team decided to pursue a full revert of the feature instead, hoping to leverage our container layer cache so that no external requests to PyPI would be necessary to build. An additional pull request to fully revert the faulty commit was prepared.
While the container image build on the new PR was successful, a CI/CD check which did not use the container image, namely a check that ensures Warehouse’s dependency lock files are up-to-date, failed due to the outage. PyPI administrators attempted to disable the required status of these checks in branch protection for the repository; however, the failed dependency check caused the remaining tests to be canceled before completion, which prevented the deployment pipeline from picking up the new commit and triggering a deployment.
At this time, PyPI administrators determined that overriding the build/deploy service to manually revert the deployed container image back to a known good image, bypassing the release phase that included migrations, was required.
Mitigation
Mitigation required bypassing PyPI’s build and deploy pipeline to manually re-deploy a previously built image, without running the release phase which included migrations. This allowed PyPI to successfully respond to previously failing requests, allowing the reverted pull request to build and be deployed, fully resolving the issue in production.
After the production incident was resolved, it was also determined that although the final deployment to TestPyPI had been reported as successful, it had in fact failed because the database migration could not acquire a lock. This highlighted a separate logic error in how our deployment reporting handles failed deployments.
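The reporting problem can be sketched as follows: a deployment should only be reported as successful when every step, including the migration, exited cleanly. This is an illustrative sketch and not PyPI's actual deployment tooling; the `StepResult` type, the `deployment_status` function, and the step names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StepResult:
    name: str       # hypothetical step name, e.g. "migrate"
    exit_code: int  # 0 means the step succeeded


def deployment_status(results: List[StepResult]) -> str:
    """Summarize a deployment, naming every step that failed."""
    if not results:
        # Nothing ran, so there is nothing to call a success.
        return "failed: no steps ran"
    failed = [r.name for r in results if r.exit_code != 0]
    if failed:
        return "failed: " + ", ".join(failed)
    return "succeeded"


# A migration that could not acquire its lock exits non-zero, so the
# deployment is reported as failed even though a later step exited cleanly.
results = [StepResult("migrate", 1), StepResult("release", 0)]
print(deployment_status(results))
```

The design choice here is to derive the reported status from the exit codes of all steps rather than from the last step alone, so a failure early in the release phase cannot be masked by a later step that succeeds.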
Lessons Learned
Things That Went Well
Things That Went Poorly
Where We Got Lucky
Action Items