Redirect Loops on JSON API endpoints.
Incident Report for Python Infrastructure
Postmortem

Summary

PyPI's JSON API experienced an incident that caused redirect loops to occur for all clients when requesting many of the endpoints available on the service. These endpoints provide JSON documents describing projects and specific releases of projects hosted on pypi.org.

This incident was initiated by changes intended to make the JSON API more performant and less of a burden on the PyPI backends.

Specifically, the combination of this change to the warehouse codebase and this change to the CDN configuration for PyPI makes the canonical location for accessing the JSON documents deterministic. This allows for greater cache efficiency and the ability to move redirects to the canonical URL out to the CDN edge. The result is faster response times for clients and reduced load on the PyPI backends.

The redirect updates had an unintended consequence that led to this outage: the entries already in cache for the "normalized" project names still redirected to the "verbatim" names of projects.

Example:

  • Before

    • /pypi/pyOpenSSL/json "verbatim" project name URL returns a 200 with the document and is canonical.
    • /pypi/pyopenssl/json "normalized" project name URL returns a 301 redirect to the canonical URL.
  • After

    • /pypi/pyOpenSSL/json "verbatim" project name URL returns a 301 redirect to the canonical URL.
    • /pypi/pyopenssl/json "normalized" project name URL returns a 200 with the document and is canonical.

The issue at hand is that what is intended to be the new canonical URL, using the "normalized" project name, was already cached for many projects, and that cached response was a redirect to the "verbatim" project name, which... redirected back.
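
The loop can be reproduced in a few lines. The following is a minimal sketch, not the warehouse or CDN implementation: `stale_cache`, `origin_after_change`, `fetch`, and `follow` are hypothetical names introduced for illustration, and the edge redirect and the backend are collapsed into one function for brevity. The only part taken from a published specification is `normalize`, which follows the project-name normalization rule from PEP 503.

```python
import re


def normalize(name: str) -> str:
    """Project-name normalization as specified in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def json_url(name: str) -> str:
    return f"/pypi/{name}/json"


# Hypothetical stale entry: cached BEFORE the change, when the normalized
# URL still redirected to the verbatim one.
stale_cache = {"/pypi/pyopenssl/json": (301, "/pypi/pyOpenSSL/json")}


def origin_after_change(url: str):
    """Assumed behaviour AFTER the change: the normalized URL is canonical."""
    name = url.split("/")[2]
    canonical = json_url(normalize(name))
    if url == canonical:
        return (200, None)
    return (301, canonical)


def fetch(url: str, cache: dict):
    """Serve from cache when possible, otherwise ask the origin."""
    if url in cache:
        return cache[url]
    return origin_after_change(url)


def follow(url: str, cache: dict):
    """Follow redirects, stopping when a URL is visited twice."""
    seen = []
    while url not in seen:
        seen.append(url)
        status, location = fetch(url, cache)
        if status == 200:
            return "ok", seen
        url = location
    return "loop", seen
```

With the stale entry present, `follow("/pypi/pyOpenSSL/json", stale_cache)` bounces between the two URLs and reports a loop; with an empty cache the same request resolves after a single redirect. This is why purging the affected entries was enough to restore service.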

Impact

This impacted any client of the PyPI JSON API. These include the poetry tool for installing and managing Python dependencies, mirroring tools such as bandersnatch, and even our own internal service that redirects legacy file URLs on files.pythonhosted.org to their new locations.

The outage began with the deployment of this change to the warehouse codebase and this change to the CDN configuration for PyPI at approximately 2022-06-10T10:50 UTC and continued to impact some URLs in a "long-tail" fashion through to 2022-06-10T16:00 UTC. Most notably, project names beginning with p were the last to recover, as p is the most common first letter for projects uploaded to PyPI.

Mitigation

The changes that led to this outage had been actively attempted for over 6 weeks, after more than a year of being aware of the problem. Faced with a system that was functioning correctly but required a cache purge to be fully established, the PyPI administrator managing the deployment chose to roll forward rather than roll back, knowingly extending the impact of this incident in favor of getting PyPI to a more maintainable state for serving the JSON API into the future.

Because of the massive scale of PyPI's caches, it was untenable to purge only the bad URLs, leading to a need to clear the entire PyPI cache. This was undertaken by kicking off pools of processes that would iterate over individual first letters of project names to purge the entire project cache in parallel over a 1-2 hour duration. This duration was required because purging the entire cache at once would have overloaded the backends for PyPI: even with massive temporary scale up/out, the load would have been too much for our backends and would have led to many more hours of outage across the entire service.
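
The partitioning approach can be sketched roughly as follows. This is an illustration under stated assumptions rather than the tooling that was actually run: `purge_project` is a hypothetical placeholder for a per-project CDN purge call, the sample project list is made up of names mentioned in this report plus a few well-known packages, and a thread pool stands in for the pools of processes so the example stays small.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition_by_first_letter(names):
    """Group project names by their lowercased first character."""
    buckets = defaultdict(list)
    for name in names:
        buckets[name[0].lower()].append(name)
    return dict(buckets)


def purge_project(name):
    """Hypothetical placeholder: a real version would call the CDN purge API."""
    return name


def purge_bucket(names):
    # Purging one project at a time within a bucket spreads cache refills
    # out over time instead of hitting the backends all at once.
    return [purge_project(name) for name in names]


def purge_all(names, max_workers=4):
    """Purge every bucket, running up to max_workers buckets in parallel."""
    buckets = partition_by_first_letter(names)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(purge_bucket, buckets.values())
        purged = [name for bucket in results for name in bucket]
    return buckets, purged


# Made-up sample; the real index is orders of magnitude larger.
projects = ["pyOpenSSL", "pytest", "poetry", "pip", "requests", "bandersnatch"]
buckets, purged = purge_all(projects)
bucket_sizes = {letter: len(names) for letter, names in buckets.items()}
```

Even in this tiny sample the `p` bucket holds four of the six projects. Because each bucket is processed sequentially, total run time is bounded by the largest bucket, so partitioning on a skewed key such as the first letter leaves one worker running long after the others finish.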

While most of the purges completed within the two hour estimate, the letter p was the final segment, taking nearly 4 hours due to the popularity of the py prefix on the index.

While the purges were ongoing, PyPI's backends, even in a scaled-out state, provided lackluster response times and performance as the caches were slowly refilled.

Future work

Issues have been filed to create better tooling to safely and expediently purge the caches and to limit the blast radius of purges in the future.

We will also begin to research ways to build more confidence and expose these kinds of errors in our review process for changes to the configuration of our CDN.

Posted Jun 10, 2022 - 16:32 UTC

Resolved
This incident has been resolved.
Posted Jun 10, 2022 - 16:32 UTC
Monitoring
All purges of JSON API documents have completed. Our backends are recovering from the added load of repopulating the entire cache. Any failed purges may result in latent redirect loops for specific projects or releases in the JSON API; these will self-resolve within 24 hours as caches expire.
Posted Jun 10, 2022 - 16:01 UTC
Update
The cache purge has cleared all but projects starting with the letter `p`. Our estimates failed to take into consideration the popularity of project names on PyPI starting with p 🙃
Posted Jun 10, 2022 - 13:54 UTC
Update
Our mass purge operation is continuing. Based on the current rate at which we're able to process purges, the processes should be complete in 45-60 minutes.
Posted Jun 10, 2022 - 12:38 UTC
Update
We have started a task which will iterate over all projects and purge the cache for each individually. This will keep the PyPI backends from being overloaded by a completely bare cache. This process will take some time to complete; the current estimate is 1-2 hours.
Posted Jun 10, 2022 - 11:34 UTC
Identified
Some cached responses are causing redirect loops for endpoints on the JSON API. We are working to determine how to clear these cached values without impacting the overall health of PyPI.
Posted Jun 10, 2022 - 11:08 UTC
This incident affected: PyPI (pypi.org - CDN, pypi.org - Backends, files.pythonhosted.org - Redirects, files.pythonhosted.org - Redirects Backends).