PyPI's Backend is Not Responding

Incident Report for Python Infrastructure

Postmortem

PyPI's infrastructure uses drbd to store packages, documentation, configuration, and code across VMs on separate hypervisors. Unfortunately, failover was not configured for either the drbd device or PyPI itself. The outcome was as follows:

Primary PyPI node was stuck at boot attempting to mount the drbd device
Secondary PyPI node was not capable of automatically mounting the drbd device

Additionally, the environment for running PyPI and the system/python dependencies were not available or up to date on the secondary node.

Our steps to recovery were as follows:

Obtain SSH access to the primary node: failure
Obtain Out-of-Band access to the primary node: failure
Bring secondary node online: partial success
Resolve access issues to primary: success
Bring drbd back online: success
Stop overloaded PyPI on secondary: success
Sync drbd device and set primary node to drbd Primary: success
Start PyPI on primary node: success

All told, pypi.python.org was unavailable for 3.5 hours.
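The 3.5-hour figure can be cross-checked against the status updates on this page. The sketch below assumes that the first "Investigating" post (12:09 UTC) and the "Resolved" post (15:40 UTC) roughly bracket the outage; the actual downtime may have started somewhat before the first post.

```python
from datetime import datetime, timezone

# Timestamps taken from the status updates on this page (Nov 15, 2013, UTC).
# Assumption: the first and last posts approximate the start and end of the outage.
investigating = datetime(2013, 11, 15, 12, 9, tzinfo=timezone.utc)
resolved = datetime(2013, 11, 15, 15, 40, tzinfo=timezone.utc)

outage = resolved - investigating
outage_hours = outage.total_seconds() / 3600

print(f"{outage} ({outage_hours:.1f} hours)")  # prints "3:31:00 (3.5 hours)"
```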

Our shortcomings through this incident were many:

Our alerting notified the necessary teams, but did not escalate to the correct person when the frontline admin was unresponsive.
Out-of-Band access to the nodes had not been tested properly after an upgrade to the VM management software. In the end, we just needed to use another browser.
The team did not have appropriate shared documentation on the moving parts behind PyPI.
The primary node was not configured to auto reboot after a kernel panic.
The primary node was set to mount the drbd device before the service had been started, causing a boot hang.
Initially, the secondary node was not appropriately configured to run PyPI, and once brought online it was underpowered for the amount of traffic it received.
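To illustrate the escalation gap in the first item, here is a minimal sketch of a time-based escalation chain. It is not the alerting system that was in use; the contact names, timeouts, and the `EscalationStep` and `current_contact` names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class EscalationStep:
    contact: str       # hypothetical contact name
    wait_minutes: int  # how long this contact has to acknowledge the alert


def current_contact(chain: Sequence[EscalationStep],
                    minutes_unacknowledged: int) -> Optional[str]:
    """Return who should be paged now, or None if the chain is exhausted."""
    elapsed = 0
    for step in chain:
        elapsed += step.wait_minutes
        if minutes_unacknowledged < elapsed:
            return step.contact
    # Nobody acknowledged in time; the caller decides what happens next.
    return None


# Hypothetical chain: names and timeouts are illustrative only.
CHAIN = [
    EscalationStep("frontline-admin", 15),
    EscalationStep("secondary-admin", 15),
    EscalationStep("infrastructure-lead", 30),
]
```

With this chain, an alert that has gone unacknowledged for 20 minutes is routed to `secondary-admin` instead of staying with the unresponsive frontline admin.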

The packaging team and infrastructure team already had plans to resolve some of the configuration management and automation issues. Had those been addressed earlier, this outage could have been far shorter, if noticeable at all.

Some of the immediate steps we've taken to mitigate this sort of extended outage in the future:

Servers have been configured to skip mounting the drbd device until the service is available.
We are implementing auto reboot after kernel panic to assist the team in obtaining access as quickly as possible.
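The two mitigations above can be sketched roughly as follows. This is not the configuration that was deployed: `is_ready` and `mount` are hypothetical callables standing in for "is the drbd service up?" and "mount the device", and the timeout values are arbitrary. The last function only interprets the value of the Linux `kernel.panic` sysctl (readable at `/proc/sys/kernel/panic`); it does not change it.

```python
import time
from typing import Callable


def wait_until_ready(is_ready: Callable[[], bool],
                     timeout_s: float = 60.0,
                     interval_s: float = 2.0,
                     sleep: Callable[[float], None] = time.sleep,
                     clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll is_ready() until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def mount_if_ready(is_ready: Callable[[], bool],
                   mount: Callable[[], None],
                   **wait_kwargs) -> bool:
    """Mount only once the service is up; otherwise skip so boot can continue."""
    if wait_until_ready(is_ready, **wait_kwargs):
        mount()
        return True
    return False


def panic_reboots(kernel_panic_value: str) -> bool:
    """Interpret the kernel.panic sysctl value.

    0 means the kernel loops forever after a panic; a positive value reboots
    after that many seconds; a negative value reboots immediately.
    """
    return int(kernel_panic_value.strip()) != 0
```

The point of the first two functions is that a missing service leads to a skipped mount and a bootable machine, not an indefinite hang at boot.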

In addition we are considering the following to reduce the amount of manual intervention necessary:

Updated Configuration Management to keep the primary and secondary nodes in the correct state to run PyPI at all times.
Fully automated failover for drbd.
Fully automated failover for the PyPI service.
An official mirror of PyPI to use as a fallback behind the CDN, which will maintain read-only access to PyPI regardless of backend state.
Active/Active high availability for the backend servers.
Continued work on Warehouse, the next generation infrastructure and code for PyPI.
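As a rough illustration of how a read-only mirror could serve as a fallback behind the CDN, the sketch below picks the first healthy backend in preference order and never routes writes to the mirror. The backend names, the `healthy` mapping, and `choose_backend` are hypothetical; none of this describes an existing PyPI or CDN configuration.

```python
from typing import Mapping, Optional, Sequence

# Hypothetical backend names, in order of preference.
PREFERENCE: Sequence[str] = ("primary", "secondary", "mirror")
READ_ONLY = {"mirror"}  # the mirror can serve downloads but not uploads


def choose_backend(healthy: Mapping[str, bool],
                   is_write: bool) -> Optional[str]:
    """Return the first healthy backend that can serve this request."""
    for name in PREFERENCE:
        if not healthy.get(name, False):
            continue
        if is_write and name in READ_ONLY:
            continue
        return name
    return None  # nothing can serve the request


# Both application servers down: reads still succeed, writes do not.
status = {"primary": False, "secondary": False, "mirror": True}
print(choose_backend(status, is_write=False))  # prints "mirror"
print(choose_backend(status, is_write=True))   # prints "None"
```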

As a team we'd like to apologize for the extended outage. We'll be working hard over the next week to stabilize the existing PyPI infrastructure and even harder in the future to build a more robust service for Python packaging.

Posted Nov 15, 2013 - 17:59 UTC

Resolved

This incident has been resolved.

Posted Nov 15, 2013 - 15:40 UTC

Monitoring

The primary server has come online and service is now restored. We are monitoring the situation.

Posted Nov 15, 2013 - 15:04 UTC

Update

The secondary PyPI server is failing to keep up with the increased load.

Posted Nov 15, 2013 - 14:50 UTC

Update

We're bringing the primary PyPI node back up and synchronizing the drbd cluster.

Posted Nov 15, 2013 - 14:34 UTC

Update

We've failed PyPI over to a new server and have brought PyPI back up in a degraded mode. There may be intermittent failures and slowdowns.

Posted Nov 15, 2013 - 14:14 UTC

Identified

We're having access issues with the Out-of-Band management and are currently attempting to work around them.

Posted Nov 15, 2013 - 13:15 UTC

Investigating

We're investigating 503 errors caused by the PyPI Server not responding.

Posted Nov 15, 2013 - 12:09 UTC