partial outage
Incident Report for Python Infrastructure

Yesterday PSF infrastructure team was debugging an issue with the documentation build. During that process, the cron daemon on the backend server was paused. Unfortunately after the issue was resolved, cron was not restarted.

The internal certificate issued to the docs backend expired at 2017-07-11 03:15:00 UTC. These certificates are issued to authenticate and secure traffic within the PSF infrastructure and are normally renewed via cron.

Once the certificate had expired, requests to for documents which had expired led to 503s as our CDN was unable to reach the backend due to the expired certificate.

Our automated systems also failed to alert us or trigger an incident on the status page as they check a page which had not yet expired.

The incident was resolved by re-enabling cron and allowing the automation to renew the internal certificate at 2017-07-11 10:46:00 UTC.

Of the 1.5M requests issued to over the outage window, 110k requests received an error. The error ratio grew steadily over time as documents expired from our CDNs cache.

We've added a check and alert for the backend as well as automation to update the status page in the event that the backend is having issues.

Posted 12 months ago. Jul 11, 2017 - 11:32 UTC

From 2017-07-11 03:15 UTC until 2017-07-11 10:46 UTC experienced a partial outage. Many individual pages became accessible as they fell out of our CDN's cache due to a backend issue. The issue is resolved and a postmortem will be available shortly.
Posted 12 months ago. Jul 11, 2017 - 11:00 UTC
This incident affected: