Next Generation PyPI Rollout
Incident Report for Python Infrastructure

PyPI / files.pythonhosted.org Outage 2018-04-16

As part of the rollout of the new implementation of PyPI (Warehouse, https://pypi.org), we explicitly partitioned user-hosted content from content maintained by the Python Software Foundation and the PyPI maintainers.

Previously, files uploaded to PyPI were hosted with URLs like:

https://pypi.python.org/packages/....

And in the new PyPI, they are hosted with URLs like:

https://files.pythonhosted.org/packages/...
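As an illustration of the change in URL shape, here is a minimal sketch of a rewrite from the old host to the new one. It assumes the path below /packages/ is carried over unchanged, which the elided URLs above suggest but do not confirm; the function name is hypothetical and this is not the redirect logic PyPI actually runs.

```python
from urllib.parse import urlsplit, urlunsplit

LEGACY_HOST = "pypi.python.org"
FILES_HOST = "files.pythonhosted.org"


def to_files_host(url: str) -> str:
    """Rewrite a legacy package-file URL to the new file host.

    Assumption: the path below /packages/ is preserved as-is.
    URLs that are not legacy package-file URLs are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.hostname == LEGACY_HOST and parts.path.startswith("/packages/"):
        parts = parts._replace(netloc=FILES_HOST)
    return urlunsplit(parts)
```

Only the host changes; anything outside /packages/ is left alone, since those requests belong to pypi.org rather than the file host.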

This provides good semantics regarding the "ownership" of files and also has some benefits in separating the domains for security purposes.

During testing, all went well! But as you may have noticed, when we finally flipped the big switch... things were not so keen.

Three major symptoms occurred:

  • Infinite redirects for file downloads
  • 404s for file downloads
  • 503s for file downloads

Symptoms occurred more or less in that sequence until the PyPI maintainers were able to resolve the issue.

The root cause was a non-trivial configuration for handling redirects from pypi.python.org to pypi.org combined with mishandling of hostnames in our Content Delivery Network configuration.

Impact

Beginning at 15:00:00 UTC, redirects were enabled from pypi.python.org to pypi.org for all requests.

These redirects began pointing traffic for file downloads to files.pythonhosted.org rather than pypi.python.org. Requests for files which were not available in cache began entering redirect loops.
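From a client's point of view, a redirect loop is a chain of Location targets that revisits a URL it has already seen. The sketch below shows one way a client could detect that. The `fetch` callable is a hypothetical stand-in for an HTTP request that returns the redirect target, or None for a final response; it is not how pip or any particular client is implemented.

```python
from typing import Callable, Optional

# fetch(url) returns the redirect target, or None when the response is final.
Fetch = Callable[[str], Optional[str]]


class RedirectError(Exception):
    """Raised when a redirect chain loops or grows too long."""


def follow(fetch: Fetch, url: str, max_hops: int = 10) -> str:
    """Follow redirects and return the final URL, refusing to loop."""
    seen = set()
    for _ in range(max_hops):
        if url in seen:
            raise RedirectError(f"redirect loop at {url}")
        seen.add(url)
        target = fetch(url)
        if target is None:
            return url
        url = target
    raise RedirectError(f"more than {max_hops} redirects")
```

A file URL that redirects to itself, as happened for uncached files during this incident, is caught on the second hop.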

Overall this led to a period from 15:00:00 UTC to 16:45:00 UTC during which package downloads were for all intents and purposes unavailable.

Timeline:

  • 15:00:00 UTC: Initial rollout of new file hosting service.
  • 15:15:00 UTC: Reports began coming in from the community and the PyPI maintainers worked to understand and resolve the issue.
  • 15:26:00 UTC: Initial attempt at fixing the issue was deployed, which led to HTTP 503 errors rather than redirects.
  • 15:38:00 UTC: Second change was deployed, which began delivering HTTP 404 errors rather than 503s.
  • 15:50:00 UTC: Third change reverted to HTTP 503 errors for all package downloads.
  • 16:26:00 UTC: Interim fix deployed, which reverted file hosting back to the legacy service.
  • 16:45:00 UTC: Final fix deployed bringing our CDN configuration into the desired state.

Cause

PyPI relies heavily on a Content Delivery Network (CDN) to serve the community in a reliable and robust manner. The core concern of this incident was a misconfiguration on our part which led to an unforeseen state when we introduced redirects from the legacy PyPI service into the new service.

Files uploaded to PyPI are hosted and served from an industry standard object storage platform, proxied transparently by our CDN. The configuration for this hosting relies on a precise handling of hostnames, request signatures, and request conditions.

In the previous configuration, all of this logic lived in a single service configuration. For the transition to pypi.org and files.pythonhosted.org the maintainers of PyPI split these services apart for better maintainability.

The caveat of this approach is that only a single service in our CDN's configuration can canonically serve any given hostname. In making the switch, we overlooked that the object store hostname was still attached to the legacy service configuration.

Because we also use a feature called "shielding", which effectively creates an internal cache for our cache, requests to files.pythonhosted.org that were translated to the appropriate hostname for our object store ended up landing on the internal shielding node for the legacy service rather than being sent to the actual object store.

Since the legacy service was configured to redirect to the new pypi.org, redirect loops were generated. Our initial attempts to fix the issue were incorrect, which led to each successive state (503s, 404s, etc.).
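The interaction can be illustrated with a toy model of hostname-based routing. Everything in it is an assumption made for illustration: the object store hostname, the service names, and the idea that a cache miss is re-routed by hostname at the shield. It is not our CDN's configuration language, and it only models cache misses for file paths.

```python
OBJECT_STORE_HOST = "object-store.example"  # hypothetical origin hostname


def build_claims(legacy_claims_object_store: bool) -> dict:
    """Map each hostname to the single service that serves it."""
    claims = {
        "pypi.python.org": "legacy",
        "files.pythonhosted.org": "files",
    }
    if legacy_claims_object_store:
        # The oversight: the legacy service still owned this hostname.
        claims[OBJECT_STORE_HOST] = "legacy"
    return claims


def handle(host: str, path: str, claims: dict) -> tuple:
    """Return (status, detail) for a cache miss on a file path."""
    service = claims.get(host)
    if service == "files":
        # The files service rewrites the hostname for the object store
        # and forwards through the shield, which routes by hostname again.
        return handle(OBJECT_STORE_HOST, path, claims)
    if service == "legacy":
        # The legacy service redirects file downloads to the new host.
        return 301, f"https://files.pythonhosted.org{path}"
    # No service claims the hostname: the request reaches the object store.
    return 200, f"object {path}"
```

With `build_claims(True)`, a request for files.pythonhosted.org/packages/x is answered with a 301 back to the same URL, which a client follows forever. With `build_claims(False)`, the same request reaches the object store.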

Resolution

Ultimately the resolution was straightforward: we removed the object store hostname from the legacy configuration and configured our services to modify the hostname and perform request signing only when the request was being dispatched to the actual object store host.
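The second half of that change, rewriting and signing only on the final hop, can be sketched as follows. The hostname, key, and HMAC-based signature are hypothetical placeholders rather than the object store's real signing scheme, and `next_hop` is an assumed way of telling the shield apart from the origin.

```python
import hashlib
import hmac

OBJECT_STORE_HOST = "object-store.example"  # hypothetical origin hostname
SIGNING_KEY = b"not-a-real-key"  # hypothetical secret


def prepare(headers: dict, path: str, next_hop: str) -> dict:
    """Return the headers to send to the next hop ("shield" or "origin").

    The hostname is rewritten and the request signed only when the next
    hop is the object store itself; requests bound for the shield keep
    their original hostname so they stay on the same service.
    """
    out = dict(headers)  # never mutate the caller's headers
    if next_hop != "origin":
        return out
    out["Host"] = OBJECT_STORE_HOST
    message = f"GET\n{OBJECT_STORE_HOST}\n{path}".encode()
    out["Authorization"] = hmac.new(SIGNING_KEY, message, hashlib.sha256).hexdigest()
    return out
```

Keeping the original hostname on the way to the shield means the shield never sees the object store hostname, so it cannot hand the request to a different service.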

For the specific (one-line) fix, see here

Future work

None indicated; we rarely serve multiple hostnames from a single CDN configuration, and the work we have already performed to manage and break apart existing services will reduce the complexity of future changes. Building configuration management into the CDN configuration may also make it possible to test these changes in isolated environments in the future.

Posted 7 days ago. Apr 16, 2018 - 19:21 UTC

Resolved
We have completed our rollout! If you have any issues, please reach out at https://github.com/pypa/warehouse/issues or #pypa-dev on the freenode IRC network.
Posted 7 days ago. Apr 16, 2018 - 19:19 UTC
Update
We're preparing an Incident Report on the files.pythonhosted.org incident now. All services are stable, and we'll be completing transition of XMLRPC later this afternoon.
Posted 7 days ago. Apr 16, 2018 - 17:31 UTC
Update
We've completed the migration to the intended state for all services. We're continuing to monitor our own telemetry and all support channels.
Posted 7 days ago. Apr 16, 2018 - 17:06 UTC
Update
We have gotten file hosting back into a stable place and are currently working towards a full resolution.
Posted 7 days ago. Apr 16, 2018 - 16:26 UTC
Update
We're continuing to work on resolving the files.pythonhosted.org package service.
Posted 7 days ago. Apr 16, 2018 - 16:05 UTC
Update
Now investigating widespread 404s for files.pythonhosted.org downloads.
Posted 7 days ago. Apr 16, 2018 - 15:45 UTC
Update
We've rolled out a fix for the redirect issue on files.pythonhosted.org for the new PyPI and are monitoring.
Posted 7 days ago. Apr 16, 2018 - 15:35 UTC
Update
files.pythonhosted.org is currently under maintenance to resolve redirect loops.
Posted 7 days ago. Apr 16, 2018 - 15:29 UTC
Update
We are currently investigating redirect loops for some package installs.
Posted 7 days ago. Apr 16, 2018 - 15:22 UTC
Update
pypi.org is live!
Posted 7 days ago. Apr 16, 2018 - 15:04 UTC
Monitoring
Over the next few hours we'll be cutting over to the new https://pypi.org for all traffic to https://pypi.python.org!

Most requests will be redirected or rerouted correctly at our CDN edge; users should not need to perform any specific changes.

The old service will be available at https://legacy.pypi.org for users who find immediate issues they cannot address. We plan to leave this endpoint available until the end of the month when the previous generation PyPI's backend services are spun down.

If you experience any issues, please note them at https://github.com/pypa/warehouse/issues or join #pypa-dev on the freenode IRC network.
Posted 7 days ago. Apr 16, 2018 - 14:20 UTC
This incident affected: PyPI (pypi.org, pypi.python.org (legacy.pypi.org)).