The cloud infrastructure that runs PyPI is managed via an automation tool that allows configuration changes to be version controlled, reviewed, and deployed automatically. This tool includes multiple backends for state sharing and locking to facilitate collaboration between multiple contributors. While migrating the remote state backend and locking features to a more robust hosted platform, an error was made when setting the credentials that grant our CDN access to the object storage where files uploaded to PyPI are hosted.
The result was that newly uploaded files, as well as files that had expired from our CDN's cache, were unavailable. The outage began 2022-06-02 20:28 UTC and was initially mitigated 2022-06-02 22:38 UTC by rolling back the configuration to a previous version. The configuration issue in the hosted platform was confirmed and corrected 2022-06-02 12:00 UTC and a corrected new version of the CDN configuration was deployed. Future deploys will include this value and will not regress in the same manner.
As the Infrastructure team at the PSF onboards a new hire, and because the clunkiness of collaborating on our Terraform configuration had led to only two people having the ability to deploy changes, a decision was made to migrate all state, sensitive variables, and locking to a hosted backend for Terraform. This will allow everyone with appropriate permissions (PyPI Admins and PSF Infrastructure Staff) to plan and apply changes to the infrastructure without having to pass around the plaintext secrets those operations require.
During the migration from S3 state storage and DynamoDB locking to HashiCorp Terraform Cloud, the Infrastructure staff lead made an error in copying the value for the GCS secret access key that grants Fastly, our CDN, access to read the Cloud Storage bucket where PyPI uploads are stored.
Because the value was marked as sensitive in Terraform Cloud, it wasn’t obvious to the staff member when planning the changes what would change, and the reported change was assumed to be an artifact of that setting. As a precaution, the staff member configured the Fastly changes so that the new configuration would not be activated automatically.
The changes were then applied without being automatically activated in Fastly. It is unclear why the incorrect change was not identified when diffing the service configurations, but ultimately it was missed and the services were activated.
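As an illustration of how a changed secret can be surfaced in a configuration diff without revealing it, here is a minimal sketch in Python. It is not the tooling used during this incident: the `diff_configs` helper, the `secret_key` field name, and the sample dictionaries are all hypothetical, and the real Fastly and Terraform Cloud configuration formats are not modeled.

```python
import hashlib

# Hypothetical name for the field holding the storage credential.
SENSITIVE_KEYS = {"secret_key"}


def fingerprint(value: str) -> str:
    """Return a short SHA-256 fingerprint so a secret can be compared without being shown."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


def diff_configs(old: dict, new: dict) -> dict:
    """Map each changed key to an (old, new) pair, fingerprinting sensitive values."""
    changes = {}
    for key in sorted(old.keys() | new.keys()):
        before, after = old.get(key), new.get(key)
        if before == after:
            continue
        if key in SENSITIVE_KEYS:
            # Replace the raw secrets with fingerprints before reporting them.
            before = fingerprint(before) if before is not None else None
            after = fingerprint(after) if after is not None else None
        changes[key] = (before, after)
    return changes


# Made-up sample data: the draft differs only by a mis-copied secret.
active = {"bucket": "example-bucket", "secret_key": "correct-value"}
draft = {"bucket": "example-bucket", "secret_key": "correct-valu"}

changes = diff_configs(active, draft)
for key, (before, after) in changes.items():
    print(f"{key}: {before} -> {after}")
```

In this sketch an unexpected entry for `secret_key` in the output would signal that the credential itself changed, rather than the difference being an artifact of masking.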
Impact lasted 130 minutes, until enough signal from users was raised to alert PyPI admins. It affected all newly uploaded files as well as any files that had expired from our CDN caches.
This outage highlighted two key areas where improvements can be made to ensure timely response