python.org and us.pycon.org outage
Incident Report for Python Infrastructure
Postmortem

Database TLS Certificate renewal failure 2018-01-13

Summary

At 2018-01-13 09:30:00 UTC the TLS certificate issued to our internal PostgreSQL database servers for verification of client connections failed to properly renew.

The internally managed TLS Certificate Authority we operate for distributing TLS Certificates to our infrastructure nodes created a new Certificate for the database cluster that did not match the Key used by the servers.

This database cluster serves many services within our infrastructure, of which www.python.org and us.pycon.org are two of the largest.

Our systems are all designed to fail when TLS cannot be verified for connections to the database, and thus systems using the PostgreSQL infrastructure began to error out.

At 2018-01-13 11:15 UTC, operator intervention was required to reset the cached certificate in the Internal Certificate Authority. Shortly thereafter the PostgreSQL instances received a corrected Certificate and database activities resumed.

Discussion

This is a new flaw in the TLS tooling within our infrastructure. While we investigate what led to the Certificate/Key mismatch, the incident is an excellent verification of the designed "Fail Safe" nature of our Internal Certificate Authority.

PSF Internal Certificate Authority

In order to secure communications among the servers the Python Software Foundation Infrastructure Team operates, we issue short lived Certificate/Key pairs to servers which are valid for 24 hours. This requires quite a bit of automation that has served us well for years.

Short lived certificates have two advantages: in the event of a breach they are useless to an attacker once they expire, and they reduce the overhead of maintaining, distributing, and checking a Certificate Revocation List internally (although this is something we'd like to do someday).
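
As a rough illustration of the automation that 24 hour certificates call for, the sketch below decides when a certificate is due for renewal. Only the 24 hour validity comes from this report; the six hour renewal margin, the function name, and the example timestamps are assumptions made for illustration and do not describe our actual tooling.

```python
from datetime import datetime, timedelta, timezone

VALIDITY = timedelta(hours=24)     # validity period stated in this report
RENEW_MARGIN = timedelta(hours=6)  # assumption: renew well before expiry


def needs_renewal(not_after, now, margin=RENEW_MARGIN):
    """Return True once `now` is within `margin` of the expiry time."""
    return now >= not_after - margin


# Hypothetical issuance time, used only to exercise the function.
issued_at = datetime(2018, 1, 1, 0, 0, tzinfo=timezone.utc)
not_after = issued_at + VALIDITY

print(needs_renewal(not_after, issued_at + timedelta(hours=12)))  # early in the window
print(needs_renewal(not_after, issued_at + timedelta(hours=20)))  # inside the margin
```

Renewing with a margin leaves time for a failed renewal to be retried or escalated before the old certificate expires.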

"Fail Safe"

In short, while our systems could have continued to connect to the database with an invalid Certificate, they are explicitly designed not to. Full verification of TLS certificates for servers and clients is always performed when services within the PSF infrastructure communicate.
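
For readers unfamiliar with what "full verification" means in practice, here is a minimal sketch using Python's standard `ssl` module. It is not the configuration our services use; the function name and the file path parameters are hypothetical. It shows a client context that refuses to connect unless the peer's certificate chain and hostname both verify, and that optionally presents a client certificate.

```python
import ssl


def strict_client_context(ca_file=None, cert_file=None, key_file=None):
    """Build a TLS client context that fails closed on any verification error.

    ca_file, cert_file and key_file are hypothetical paths; when ca_file is
    None the system's default trust store is used instead.
    """
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)
    # These are already the defaults for SERVER_AUTH; set them explicitly so
    # the fail-closed intent is visible and cannot drift.
    ctx.check_hostname = True
    ctx.verify_mode = ssl.CERT_REQUIRED
    if cert_file and key_file:
        # Present a client certificate so the server can verify us in turn.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx


ctx = strict_client_context()
print(ctx.verify_mode == ssl.CERT_REQUIRED, ctx.check_hostname)
```

With a context like this, a handshake against a server offering an invalid certificate raises an error instead of silently continuing, which is the behavior described above.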

Future work

We certainly want to avoid this failure in the future (it woke the first responder up at ~2:00 AM local time and the second responder up at ~5:30 AM local time). As such we'll be reviewing the Internal Certificate Authority tooling to ensure that invalid Key/Cert pairs aren't offered up to servers, but are tossed away and regenerated.
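
One possible shape for such a check, sketched with Python's standard `ssl` module: `SSLContext.load_cert_chain` raises `ssl.SSLError` when the private key does not match the certificate, so it can serve as a gate before a pair is handed out. The function names and the `generate_pair` callable are hypothetical, and this is an illustration of the idea, not the change we will make to our tooling.

```python
import ssl


def pair_is_usable(cert_path, key_path):
    """Return True if OpenSSL accepts this certificate with this private key."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    try:
        # Raises ssl.SSLError for unparsable files or a key/certificate mismatch.
        ctx.load_cert_chain(certfile=cert_path, keyfile=key_path)
    except ssl.SSLError:
        return False
    return True


def issue_checked(generate_pair, max_attempts=3):
    """Call generate_pair() until it yields a matching pair, then return it.

    generate_pair is a hypothetical callable returning (cert_path, key_path).
    Mismatched pairs are discarded and regenerated rather than distributed.
    """
    for _ in range(max_attempts):
        cert_path, key_path = generate_pair()
        if pair_is_usable(cert_path, key_path):
            return cert_path, key_path
    raise RuntimeError(f"no valid Key/Cert pair after {max_attempts} attempts")
```

Raising after a bounded number of attempts keeps a persistent fault in the Certificate Authority visible to operators instead of looping indefinitely.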

Posted Jan 13, 2018 - 11:59 UTC

Resolved
python.org and us.pycon.org are back online. The internally managed TLS certificate for our database cluster failed to appropriately renew. We'll get a summary of the incident and steps we'll be taking to avoid this in the future posted later today.
Posted Jan 13, 2018 - 11:32 UTC
Investigating
We are currently investigating an issue with python.org and us.pycon.org.
Posted Jan 13, 2018 - 10:56 UTC
This incident affected: python.org (python.org - CDN).