[PyPI] Elevated 5xx error rates.

Incident Report for Python Infrastructure

Postmortem

From approximately 19:05 UTC until 23:35 UTC on July 20, 2014, PyPI delivered an elevated baseline of HTTP 503 errors to users. Latency for the PyPI service was also dramatically higher across the board.

Unfortunately, we were unable to identify anything in the PyPI infrastructure that was strictly to blame.

Our current working theory, which we will attempt to validate, is that network latency increased on either of the following networks:

Our CDN provider's internal network, from its Points-of-Presence to the cluster which shields our backend infrastructure.
The network between our CDN provider's cluster which shields our backend and our infrastructure.

When latency or packet loss increases on either of these networks, it will ultimately lead to timeouts which generate HTTP 503 errors for end users.
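As a rough illustration of that failure mode, the sketch below models an edge node that waits a fixed time for its origin and answers 503 when the wait expires. It is a toy simulation, not our CDN's actual behaviour: the function names, the delays, and the timeout values are all assumptions chosen for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def slow_origin(delay_s):
    """Hypothetical origin: sleeps to stand in for network plus backend time."""
    time.sleep(delay_s)
    return 200


def edge_fetch(origin_delay_s, timeout_s):
    """Hypothetical edge node: returns 503 if the origin misses the deadline."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(slow_origin, origin_delay_s)
        try:
            # Wait for the origin, but only up to the configured timeout.
            return future.result(timeout=timeout_s)
        except FutureTimeout:
            # The origin may be perfectly healthy; the path to it was too slow.
            return 503
```

With an assumed 50 ms timeout, `edge_fetch(0.01, 0.05)` succeeds while `edge_fetch(0.3, 0.05)` yields a 503, even though the simulated origin itself never fails.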

It is very difficult to express how unfun the above hand-wavy explanation is to convey. In the future, we will be investigating better ways of obtaining and tracking performance metrics for these networks by working with both our hosting provider and our CDN provider where possible.

On a much brighter note, during the prolonged investigation we found and resolved a long-standing performance issue in PyPI's XMLRPC interface. This fix should improve performance across the board for mirroring infrastructure which uses PyPI's serial log of changes.
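For readers unfamiliar with how mirrors use the serial log, here is a minimal sketch of a polling step built on the `changelog_since_serial` XMLRPC method. It assumes each returned event is a `(name, version, timestamp, action, serial)` tuple; the helper name `changes_since` and the way the client is passed in are illustrative choices, not part of PyPI or of any particular mirroring tool.

```python
def changes_since(client, last_seen_serial):
    """Return (changed_project_names, newest_serial) since last_seen_serial.

    `client` is any object exposing changelog_since_serial(serial), for
    example an xmlrpc.client.ServerProxy pointed at PyPI's XMLRPC endpoint.
    """
    events = client.changelog_since_serial(last_seen_serial)
    # Assumed event layout: (name, version, timestamp, action, serial).
    names = sorted({event[0] for event in events})
    # If nothing changed, keep the serial we already had.
    newest = max((event[4] for event in events), default=last_seen_serial)
    return names, newest


# Possible usage (the endpoint URL is an assumption, not taken from this report):
#   import xmlrpc.client
#   client = xmlrpc.client.ServerProxy("https://pypi.org/pypi")
#   names, serial = changes_since(client, serial)
```

A mirror would store the returned serial and pass it to the next call, so each poll only asks for changes it has not yet seen.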

Additionally, we attempted to mitigate the incident by scaling up the backend web cluster. The additional infrastructure will increase the overall reliability of PyPI's backend and, it appears, will further improve performance for all end users of PyPI.

Posted Jul 21, 2014 - 02:21 UTC

Resolved

PyPI is stable and performing at baseline again. The added capacity at the web application layer appears to have been unnecessary. We are working to determine the root cause of this incident. Any further details will be added via a Postmortem.

Posted Jul 21, 2014 - 00:06 UTC

Monitoring

PyPI's application server tier has been scaled up to meet rising demand. We are monitoring to verify resolution.

Posted Jul 20, 2014 - 20:45 UTC

Identified

PyPI's backend web servers appear to be overloaded due to an overzealous mirroring client. We will work to mitigate the 5xx errors while we add additional application server capacity.

Posted Jul 20, 2014 - 20:16 UTC

Investigating

PyPI is experiencing elevated rates of 503 and 502 errors.

Posted Jul 20, 2014 - 20:05 UTC