PyPI's Backend is Not Responding
Incident Report for Python Infrastructure
Postmortem

PyPI's infrastructure uses drbd to store packages, documentation, configuration, and code across VMs on separate hypervisors. Unfortunately, failover for the drbd device and PyPI was not configured. The outcome was as follows:

  • Primary PyPI node was stuck at boot attempting to mount the drbd device
  • Secondary PyPI node was not capable of automatically mounting the drbd device

Additionally, the environment for running PyPI and the system/python dependencies were not available or up to date on the secondary node.

Our steps to recovery were as follows:

  • Obtain SSH access to the primary node: failure
  • Obtain Out-of-Band access to the primary node: failure
  • Bring secondary node online: partial success
  • Resolve access issues to primary: success
  • Bring drbd back online: success
  • Stop overloaded PyPI on secondary: success
  • Sync drbd device and set primary node to drbd Primary: success
  • Start PyPI on primary node: success

All told, pypi.python.org was unavailable for 3.5 hours.

Our shortcomings through this incident were many:

  • Our alerting notified the necessary teams, but did not escalate to the correct person when the frontline admin was unresponsive.
  • Out-of-Band access to the nodes had not been tested properly after an upgrade to the VM management software. In the end, we just needed to use another browser.
  • The team did not have appropriate shared documentation on the moving parts behind PyPI.
  • The primary node was not configured to auto reboot after a kernel panic.
  • The primary node was set to mount the drbd device before the service had been started, causing a boot hang.
  • Initially, the secondary node was not appropriately configured to run PyPI, and once brought online it was underpowered for the amount of traffic it received.
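
The escalation gap in the first item can be illustrated with a small sketch. It assumes an ordered on-call chain and a fixed acknowledgement timeout; the names `ESCALATION_CHAIN`, `ACK_TIMEOUT_MINUTES`, and `who_to_page` are hypothetical and do not describe the alerting setup actually used for PyPI.

```python
from typing import Sequence

# Hypothetical on-call chain, ordered from frontline to last resort.
ESCALATION_CHAIN = ("frontline-admin", "secondary-admin", "infrastructure-lead")
# Hypothetical time each person has to acknowledge before the next is paged.
ACK_TIMEOUT_MINUTES = 15


def who_to_page(minutes_unacknowledged: int,
                chain: Sequence[str] = ESCALATION_CHAIN,
                timeout: int = ACK_TIMEOUT_MINUTES) -> str:
    """Return the person who should currently be paged for an open alert."""
    if minutes_unacknowledged < 0:
        raise ValueError("minutes_unacknowledged must be non-negative")
    # Each elapsed timeout window moves one step up the chain,
    # stopping at the last entry instead of running off the end.
    level = min(minutes_unacknowledged // timeout, len(chain) - 1)
    return chain[level]
```

With these assumed values, an alert that has gone unacknowledged for 20 minutes would page the second person in the chain rather than waiting on the frontline admin.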

The packaging team and infrastructure team have had plans to resolve some of the configuration management and automation issues that could have made this outage far shorter... if noticeable at all.

Some of the immediate steps we've taken to mitigate this sort of extended outage in the future:

  • Servers have been configured to skip mounting the drbd device until the service is available
  • We are implementing auto reboot after kernel panic to assist the team in obtaining access as quickly as possible
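
As a rough illustration of these two steps, the sketch below interprets the Linux `kernel.panic` setting (readable at `/proc/sys/kernel/panic`) and shows a bounded wait that a boot script could use instead of blocking forever on a mount. The helper names `describe_panic_setting` and `wait_until` are hypothetical, and this is not the configuration actually deployed on the PyPI nodes.

```python
import time
from typing import Callable


def describe_panic_setting(raw: str) -> str:
    """Interpret the contents of /proc/sys/kernel/panic."""
    seconds = int(raw.strip())
    if seconds == 0:
        # The kernel stays halted after a panic until someone intervenes.
        return "no automatic reboot"
    if seconds < 0:
        return "reboot immediately"
    return f"reboot after {seconds}s"


def wait_until(ready: Callable[[], bool], timeout: float, interval: float = 1.0,
               sleep: Callable[[float], None] = time.sleep,
               clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll ready() until it returns True or `timeout` seconds elapse."""
    deadline = clock() + timeout
    while True:
        if ready():
            return True
        if clock() >= deadline:
            # Give up so the caller can skip the mount rather than hang.
            return False
        sleep(interval)
```

A boot script might call `wait_until(drbd_is_up, timeout=60)`, where `drbd_is_up` is a site-specific check that is assumed here, and mount the device only if the call returns `True`.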

In addition, we are considering the following to reduce the amount of manual intervention necessary:

  • Updated Configuration Management to keep the primary and secondary nodes in the correct state to run PyPI at all times.
  • Fully automated failover for drbd.
  • Fully automated failover for the PyPI service.
  • An official mirror of PyPI to use as a fallback behind the CDN which will maintain read-only access to PyPI regardless of backend state.
  • Active/Active high availability for the backend servers.
  • Continued work on the next generation infrastructure and code for PyPI, Warehouse.
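
The read-only fallback idea can be sketched as a simple origin-selection rule. This is only an illustration under assumed names: `choose_origin`, the backend labels, and the shape of the `health` mapping are hypothetical, and a real CDN would express this in its own configuration rather than in Python.

```python
from typing import Mapping, Sequence, Tuple


def choose_origin(health: Mapping[str, bool],
                  backends: Sequence[str] = ("pypi-primary", "pypi-secondary"),
                  mirror: str = "pypi-mirror") -> Tuple[str, bool]:
    """Return (origin, read_only) for the next request.

    The first healthy backend is used with full functionality; if none
    is healthy, traffic falls back to the mirror in read-only mode.
    """
    for name in backends:
        # A backend missing from the health report is treated as down.
        if health.get(name, False):
            return name, False
    return mirror, True
```

Under this rule, package downloads would keep working from the mirror even when every backend is down, while uploads and other writes would be refused until a backend recovers.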

As a team we'd like to apologize for the extended outage. We'll be working hard over the next week to stabilize the existing PyPI infrastructure and even harder in the future to build a more robust service for Python packaging.

Posted Nov 15, 2013 - 17:59 UTC

Resolved
This incident has been resolved.
Posted Nov 15, 2013 - 15:40 UTC
Monitoring
The primary server has come online and service is now restored. We are monitoring the situation.
Posted Nov 15, 2013 - 15:04 UTC
Update
The secondary PyPI server is failing to keep up with the increased load.
Posted Nov 15, 2013 - 14:50 UTC
Update
We're bringing the primary PyPI node back up and synchronizing the drbd cluster.
Posted Nov 15, 2013 - 14:34 UTC
Update
We've failed PyPI over to a new server and have brought PyPI back up in a degraded mode. There may be intermittent failures and slowdowns.
Posted Nov 15, 2013 - 14:14 UTC
Identified
We're having access issues with the out-of-band management and are currently attempting to work around the issue.
Posted Nov 15, 2013 - 13:15 UTC
Investigating
We're investigating 503 errors caused by the PyPI server not responding.
Posted Nov 15, 2013 - 12:09 UTC