PyPI's infrastructure uses drbd to store packages, documentation, configuration, and code across VMs on separate hypervisors. Unfortunately, failover for the drbd device and PyPI was not configured. The outcome was as follows:
- Primary PyPI node was stuck at boot attempting to mount the drbd device
- Secondary PyPI node was not capable of automatically mounting the drbd device
Additionally, the environment for running PyPI and the system/python dependencies were not available or up to date on the secondary node.
Our steps to recovery were as follows:
- Obtain SSH access to the primary node: failure
- Obtain Out-of-Band access to the primary node: failure
- Bring secondary node online: partial success
- Resolve access issues to primary: success
- Bring drbd back online: success
- Stop overloaded PyPI on secondary: success
- Sync drbd device and set primary node to drbd Primary: success
- Start PyPI on primary node: success
All told, pypi.python.org was unavailable for 3.5 hours.
Our shortcomings through this incident were many:
- Our alerting notified the necessary teams, but did not escalate to the correct person when the frontline admin was unresponsive.
- Out-of-Band access to the nodes had not been tested properly after an upgrade to the VM management software. In the end, we just needed to use another browser.
- The team did not have appropriate shared documentation on the moving parts behind PyPI
- The primary node was not configured to auto reboot after a kernel panic.
- The primary node was set to mount the drbd device before the service had been started, causing a boot hang.
- Initially, the secondary node was not appropriately configured to run PyPI and was underpowered for the amount of traffic it received once brought online.
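To illustrate the escalation gap described in the first item above, here is a minimal sketch of an escalation chain in which an unacknowledged alert moves on to the next person after a timeout. The `Responder` type, the contact labels, and the timeout values are hypothetical and are not taken from the alerting system actually in use.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Responder:
    name: str             # hypothetical contact label
    ack_timeout_min: int  # minutes to wait for an acknowledgement before escalating


def current_responder(chain: Sequence[Responder], minutes_unacked: int) -> Optional[Responder]:
    """Return who should be paged after `minutes_unacked` minutes without an ack.

    Each responder gets their own timeout window, in chain order. Once the
    whole chain is exhausted the last responder keeps being paged, so the
    alert is never silently dropped.
    """
    if not chain:
        return None
    elapsed = 0
    for responder in chain:
        elapsed += responder.ack_timeout_min
        if minutes_unacked < elapsed:
            return responder
    return chain[-1]


# Hypothetical chain: the frontline admin first, then a backup, then a team lead.
chain = [
    Responder("frontline", 15),
    Responder("backup", 15),
    Responder("lead", 30),
]
```

With this chain, an alert that has gone unacknowledged for 20 minutes would be routed to the backup rather than continuing to wait on the frontline admin.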
The packaging team and infrastructure team have had plans to resolve some of the configuration management and automation issues. Had those been addressed, this outage could have been far shorter, if noticeable at all.
Some of the immediate steps we've taken to mitigate this sort of extended outage in the future:
- Servers have been configured to skip mounting the drbd device until the service is available
- We are implementing auto reboot after kernel panic to assist the team in obtaining access as quickly as possible
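The two checks above can be sketched as small helpers. This is only an illustration, not the configuration actually deployed: it assumes a DRBD 8.3/8.4-style status line as found in /proc/drbd (with `cs:`, `ro:`, and `ds:` fields), and it assumes the panic behaviour is read from the kernel.panic sysctl (/proc/sys/kernel/panic), where zero means the kernel will not reboot after a panic. The function names are hypothetical.

```python
import re

# Matches the connection state (cs), roles (ro), and disk states (ds) fields
# of an assumed DRBD 8.3/8.4-style status line.
_FIELD = re.compile(r"\b(cs|ro|ds):(\S+)")


def parse_drbd_status(line: str) -> dict:
    """Extract the cs/ro/ds fields from one status line into a dict."""
    return dict(_FIELD.findall(line))


def safe_to_mount(line: str) -> bool:
    """True only if the local node is Primary and its disk is UpToDate.

    Both fields are written as local/peer, so only the part before the
    slash is inspected.
    """
    fields = parse_drbd_status(line)
    local_role = fields.get("ro", "").split("/")[0]
    local_disk = fields.get("ds", "").split("/")[0]
    return local_role == "Primary" and local_disk == "UpToDate"


def panic_reboot_enabled(sysctl_value: str) -> bool:
    """True if the kernel.panic value requests a reboot after a panic.

    Zero means no reboot; any other integer means the kernel reboots
    (after a delay if positive, immediately if negative).
    """
    try:
        return int(sysctl_value.strip()) != 0
    except ValueError:
        return False
```

A boot-time script could call `safe_to_mount` on the relevant status line and defer the mount when it returns false, rather than blocking the boot on a device that is not yet ready.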
In addition, we are considering the following to reduce the amount of manual intervention necessary:
- Updated Configuration Management to keep the primary and secondary nodes in the correct state to run PyPI at all times.
- Fully automated failover for drbd.
- Fully automated failover for the PyPI service
- An official mirror of PyPI to use as a fallback behind the CDN which will maintain read-only access to PyPI regardless of back end state.
- Active/Active high availability for the backend servers
- Continued work on the next generation infrastructure and code for PyPI, warehouse
As a team we'd like to apologize for the extended outage. We'll be working hard over the next week to stabilize the existing PyPI infrastructure and even harder in the future to build a more robust service for Python packaging.