Newly uploaded files are returning 403 from files.pythonhosted.org
Incident Report for Python Infrastructure
Postmortem

Summary

The cloud infrastructure that runs PyPI is managed via an automation tool that allows configuration changes to be version controlled, reviewed, and deployed automatically. This tool includes multiple backends for state sharing and locking to facilitate collaboration between multiple contributors. While migrating the remote state backend and locking features to a more robust hosted platform, an error was made when setting the credentials that grant our CDN access to the object storage where files uploaded to PyPI are hosted.

The result was that newly uploaded files, as well as files that had expired from our CDN's cache, were unavailable. The outage began 2022-06-02 20:28 UTC and was initially mitigated 2022-06-02 22:38 UTC by rolling back the configuration to a previous version. The configuration issue in the hosted platform was confirmed and corrected 2022-06-03 12:00 UTC and a corrected new version of the CDN configuration was deployed. Future deploys will include this value and will not regress in the same manner.

Details

As the Infrastructure team at the PSF onboards a new hire, and because the clunkiness of collaborating on our Terraform configuration had left only two people able to deploy changes, a decision was made to migrate all state, sensitive variables, and locking to a hosted backend for Terraform. This will give everyone with appropriate access (PyPI Admins and PSF Infrastructure Staff) the ability to plan and apply changes to the infrastructure without having to pass around plaintext secrets.

During the migration from S3 state storage and DynamoDB locking to HashiCorp Terraform Cloud, the Infrastructure staff lead made an error in copying the value for the GCS secret access key that grants Fastly, our CDN, access to read the Cloud Storage bucket where PyPI uploads are stored.

Because the value was marked as sensitive in Terraform Cloud, it wasn't obvious to the staff member when planning the changes what would change, and the change that did appear was assumed to be an artifact of that setting. As a precaution, the staff member configured the Fastly changes so that the new configuration would not be activated automatically.
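
As a rough illustration of how a copied secret could be checked without displaying it, the sketch below compares two flat mappings of settings and reports only the names of the keys that differ, plus a short fingerprint that can be compared across systems. This is not something the tooling described above does; the function names and the key names in the usage note are hypothetical, and the fingerprint is only reasonable to display for high-entropy values such as generated access keys.

```python
import hashlib
import hmac


def fingerprint(value: str) -> str:
    """Return a short SHA-256 fingerprint to display instead of the secret.

    Assumption: the value is a high-entropy generated key. A fingerprint of
    a guessable password should not be displayed.
    """
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


def changed_keys(old: dict, new: dict) -> list:
    """List the keys whose values differ between two flat settings mappings.

    Only key names are returned, so the result can be shown in a review
    without exposing any secret value.
    """
    changed = []
    for key in sorted(set(old) | set(new)):
        before, after = old.get(key), new.get(key)
        if before is None or after is None:
            # Key was added or removed during the migration.
            changed.append(key)
        elif not hmac.compare_digest(before.encode("utf-8"), after.encode("utf-8")):
            # Constant-time comparison of the two secret values.
            changed.append(key)
    return changed
```

For example, with hypothetical keys `gcs_access_key_id` and `gcs_secret_access_key`, a value that picked up a stray trailing space during copying would make `changed_keys(old, new)` return `["gcs_secret_access_key"]`, and the two fingerprints for that key would no longer match.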

The changes were then applied without being automatically activated in Fastly. It is unclear why the incorrect value was not identified when diffing the service configurations, but ultimately it was missed and the services were activated.

Impact lasted 130 minutes, until enough signal from users was raised to alert PyPI admins. It affected all newly uploaded files as well as any files that had expired from our CDN caches.

Future Mitigation

This outage highlighted two key areas where improvements can be made to ensure timely response:

  1. More thorough automated monitoring of file access via our CDN. Due to the caching nature of our CDN, the error was not detected because the file our monitoring checked never expired from cache. We will implement a monitor that checks for a file that is explicitly not cached to ensure that the entire system from client to object store is working appropriately.
  2. Improvements to secrets handling in our infrastructure automation. We carried over some legacy practices for storing secrets/configuration in our systems that led to a pertinent configuration value being obscured in a complex object rather than stored as an individual key. We'll refactor this to make changes to values more obvious to contributors to the project.
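
The first item could be sketched roughly as follows. This is a minimal illustration, not the monitor that was or will be deployed: it assumes a probe URL for an object that the CDN is configured never to cache (no such URL is named in this report), and it assumes the CDN reports cache status in an `X-Cache` response header with `HIT` or `MISS` values. A response only counts as a pass if it succeeded and was not served from cache, since a cached response says nothing about whether the CDN can still read the object store.

```python
import urllib.error
import urllib.request


def classify_probe(status: int, headers) -> str:
    """Classify one probe response as "ok", "failed", or "inconclusive".

    `headers` is any mapping with a .get() method. The X-Cache header name
    and its HIT/MISS values are assumptions about the CDN's responses.
    """
    if status >= 400:
        # For example, the 403 seen during this incident.
        return "failed"
    tokens = [t.strip().upper() for t in (headers.get("X-Cache") or "").split(",")]
    tokens = [t for t in tokens if t]
    if tokens and all(t == "MISS" for t in tokens):
        # Every cache layer missed, so the request reached the object store.
        return "ok"
    # Served from cache (or cache status unknown): origin access is unproven.
    return "inconclusive"


def probe(url: str, timeout: float = 10.0) -> str:
    """Fetch `url` once and classify the result."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return classify_probe(response.status, response.headers)
    except urllib.error.HTTPError as exc:
        # urllib raises for 4xx/5xx; the status and headers are on the exception.
        return classify_probe(exc.code, exc.headers)
    except OSError:
        # Connection errors and timeouts (URLError is a subclass of OSError).
        return "failed"
```

A scheduled job could call `probe()` against the uncached object and alert on `"failed"`; an `"inconclusive"` result would suggest the probe object is being cached after all and the check needs attention.
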
Posted Jun 03, 2022 - 12:29 UTC

Resolved
We have resolved this incident, identified the cause of the breakage, and ensured that it will not regress. An incident report will be provided shortly.
Posted Jun 03, 2022 - 12:07 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jun 02, 2022 - 22:40 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Jun 02, 2022 - 22:40 UTC
Investigating
We are currently investigating this issue.
Posted Jun 02, 2022 - 22:20 UTC
This incident affected: PyPI (files.pythonhosted.org - Files).