In the rollout of the new implementation of PyPI (Warehouse, https://pypi.org), one part of the work undertaken was explicitly partitioning user-hosted content from that maintained by the Python Software Foundation and PyPI maintainers.
Previously, files uploaded to PyPI were hosted with URLs like:
https://pypi.python.org/packages/....
And in the new PyPI, they are hosted with URLs like:
https://files.pythonhosted.org/packages/...
This provides good semantics regarding the "ownership" of files and also has some benefits in separating the domains for security purposes.
During testing, all went well! But as you may have noticed, when we finally flipped the big switch... things were not so keen.
Three major symptoms occurred:
Symptoms occurred more or less in that sequence until the PyPI maintainers were able to resolve the issue.
The root cause was a non-trivial configuration for handling redirects from pypi.python.org to pypi.org, combined with mishandling of hostnames in our Content Delivery Network configuration.
Beginning at 15:00:00 UTC, redirects were enabled from pypi.python.org to pypi.org for all requests.
These redirects began pointing traffic for file downloads to files.pythonhosted.org rather than pypi.python.org. Requests for files which were not available in cache began entering redirect loops.
Overall this led to a period from 15:00:00 UTC to 16:45:00 UTC during which package downloads were for all intents and purposes unavailable.
PyPI relies heavily on a Content Delivery Network (CDN) to serve the community in a reliable and robust manner. The core concern of this incident was a misconfiguration on our part which led to an unforeseen state when we introduced redirects from the legacy PyPI service into the new service.
Files uploaded to PyPI are hosted and served from an industry standard object storage platform, proxied transparently by our CDN. The configuration for this hosting relies on a precise handling of hostnames, request signatures, and request conditions.
In the previous configuration, all of this logic lived in a single service configuration. For the transition to pypi.org and files.pythonhosted.org, the maintainers of PyPI split these services apart for better maintainability.
The caveat of this approach is that only a single service in our CDN's configuration canonically serves any given hostname. In making the switch, we overlooked that the object store hostname was still attached to the legacy service's configuration.
Because we also use a feature called "shielding", which effectively creates an internal cache for our cache, requests to files.pythonhosted.org that were translated to the appropriate hostname for our object store ended up landing on the internal shielding node for the legacy service rather than being sent to the actual object store.
Since the legacy service was configured to redirect to the new pypi.org, redirect loops were generated. Our initial attempts to fix the issue were incorrect, which led to each successive state (503s, 404s, etc.).
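The failure mode can be illustrated with a toy model. This is only a sketch, not PyPI's actual CDN configuration: the service names, the `OBJECT_STORE_HOST` value, and the routing table are hypothetical and assumed purely for illustration. It models a CDN in which each hostname is owned by exactly one service, and shows how a request whose hostname has been rewritten to the object store's can land back on the legacy service and loop.

```python
# Toy model of hostname-to-service routing inside a CDN.
# All names below are hypothetical; none are PyPI's real configuration.
OBJECT_STORE_HOST = "object-store.example"  # hypothetical object store hostname


def build_routing(legacy_claims_object_store: bool) -> dict:
    """Map each hostname to the single service that owns it."""
    routing = {
        "pypi.python.org": "legacy",
        "pypi.org": "warehouse",
        "files.pythonhosted.org": "files",
    }
    if legacy_claims_object_store:
        # The oversight: the legacy service still owned this hostname.
        routing[OBJECT_STORE_HOST] = "legacy"
    return routing


def handle(host: str, routing: dict) -> tuple:
    """Return ("redirect", new_host) or ("served", origin_host)."""
    service = routing.get(host)
    if service is None:
        # No CDN service owns this hostname: the request reaches the origin.
        return ("served", host)
    if service == "legacy":
        return ("redirect", "pypi.org")
    if service == "warehouse":
        # File downloads are pointed at the files domain.
        return ("redirect", "files.pythonhosted.org")
    # "files" service: rewrite the hostname and forward through the shield,
    # which looks the rewritten hostname up in the same routing table.
    return handle(OBJECT_STORE_HOST, routing)


def fetch(host: str, routing: dict, max_redirects: int = 5) -> tuple:
    """Follow redirects; return (hosts visited, whether a file was served)."""
    hops = [host]
    for _ in range(max_redirects):
        action, target = handle(host, routing)
        if action == "served":
            return hops, True
        host = target
        hops.append(host)
    return hops, False  # gave up: redirect loop


broken_hops, broken_ok = fetch("files.pythonhosted.org", build_routing(True))
fixed_hops, fixed_ok = fetch("pypi.python.org", build_routing(False))
```

With the object store hostname claimed by the legacy service, the client bounces between files.pythonhosted.org and pypi.org until it gives up. With that claim removed, the same request chain ends at the object store.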
Ultimately the resolution was straightforward: removing the object store hostname from the legacy configuration, and configuring our services to modify the hostname and perform request signing only when the request was being dispatched to the actual object store host.
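The second half of that fix, rewriting the hostname and signing only on the final hop, might look roughly like the following. This is a minimal sketch under stated assumptions, not the real change: the actual fix lives in CDN configuration, and `OBJECT_STORE_HOST`, `SIGNING_KEY`, `prepare_backend_request`, and the HMAC-based signature format are all hypothetical stand-ins for whatever the object store actually requires.

```python
import hashlib
import hmac

# Hypothetical values, for illustration only.
OBJECT_STORE_HOST = "object-store.example"
SIGNING_KEY = b"not-a-real-key"


def prepare_backend_request(headers: dict, path: str, backend_host: str) -> dict:
    """Return the headers to send to the given backend.

    The Host header is rewritten and the request signed only when the
    backend is the object store itself. Requests dispatched to any other
    backend (for example an internal shield node) keep their original
    Host and stay unsigned.
    """
    out = dict(headers)
    if backend_host == OBJECT_STORE_HOST:
        out["Host"] = OBJECT_STORE_HOST
        # Assumed signature scheme: HMAC-SHA256 over method, host, and path.
        message = f"GET\n{OBJECT_STORE_HOST}\n{path}".encode()
        out["Authorization"] = hmac.new(
            SIGNING_KEY, message, hashlib.sha256
        ).hexdigest()
    return out


incoming = {"Host": "files.pythonhosted.org"}
to_shield = prepare_backend_request(incoming, "/packages/example", "shield.example")
to_store = prepare_backend_request(incoming, "/packages/example", OBJECT_STORE_HOST)
```

Because the request keeps its public hostname until the last hop, a shield node resolves it to the files service rather than to whichever service happens to own the object store hostname.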
For the specific (one-line) fix, see here
None indicated; we rarely serve multiple hostnames from a single CDN configuration, and the work we have already performed to manage and break apart existing services will reduce the complexity of future changes. Building configuration management into the CDN configuration may also make it possible to test these changes in isolated environments in the future.