From approximately 19:05 UTC until 23:35 UTC on July 20, 2014, PyPI experienced an elevated baseline of HTTP 503 errors delivered to users. Additionally, latency for the PyPI service was dramatically higher across the board.
Unfortunately, we were unable to identify anything within the PyPI infrastructure that was strictly to blame.
Our current working theory, which we will attempt to validate, is based around increased network latency between either:
When latency or packet loss increases on either of these networks, it will ultimately lead to timeouts which generate HTTP 503 errors for end users.
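The mechanism above can be illustrated with a small sketch. It is not a model of PyPI's actual configuration: the timeout value, the sample latencies, and the function names are all assumptions made for illustration. It only shows how a fixed timeout turns slow upstream responses into 503s.

```python
def classify_responses(latencies_s, timeout_s):
    """Return the status a proxy would plausibly serve for each request.

    Assumption: any upstream response slower than timeout_s is abandoned
    and surfaced to the end user as an HTTP 503.
    """
    return [503 if latency > timeout_s else 200 for latency in latencies_s]


def error_rate(latencies_s, timeout_s):
    """Fraction of requests that would end in a 503 under that assumption."""
    statuses = classify_responses(latencies_s, timeout_s)
    return statuses.count(503) / len(statuses) if statuses else 0.0


# Hypothetical latencies (seconds) against a hypothetical 1-second timeout.
normal = [0.2, 0.3, 0.4, 0.5]
degraded = [0.2, 0.4, 1.5, 3.0]
print(error_rate(normal, 1.0), error_rate(degraded, 1.0))
```

In this toy model the backend does the same work in both cases; only the network time changes, which is why nothing inside the infrastructure itself would look wrong.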
It is very difficult to express how unfun the above hand-wavy explanation is to convey. In the future, we will be investigating better ways of obtaining and tracking performance metrics for these networks by working with both our hosting provider and our CDN provider where possible.
On a much brighter note, during the prolonged investigation we found and resolved a long-standing performance issue in PyPI's XMLRPC interface. This fix should yield performance increases across the board for mirroring infrastructure which utilizes PyPI's serial log of changes.
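For readers unfamiliar with how mirrors use the serial log, here is a minimal sketch of a client polling it. It relies on the XMLRPC method `changelog_since_serial`, which returns events as (name, version, timestamp, action, serial) tuples. The endpoint URL and the helper names `pypi_client` and `changed_packages` are assumptions for illustration, not part of any mirroring tool.

```python
import xmlrpc.client


def pypi_client(url="https://pypi.org/pypi"):
    """Build an XMLRPC proxy. The URL is an assumed endpoint; adjust as needed."""
    return xmlrpc.client.ServerProxy(url)


def changed_packages(client, last_seen_serial):
    """Return (sorted package names changed since last_seen_serial, newest serial).

    Each changelog event is a (name, version, timestamp, action, serial) tuple.
    If there are no new events, the serial is returned unchanged.
    """
    events = client.changelog_since_serial(last_seen_serial)
    names = sorted({event[0] for event in events})
    newest = max((event[4] for event in events), default=last_seen_serial)
    return names, newest
```

A mirror would store the returned serial and pass it back on the next poll, so each request fetches only the changes it has not yet seen.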
Additionally, we attempted to mitigate the incident by scaling up the backend web cluster. The additional infrastructure will increase the overall reliability of PyPI's backend and, it appears, will further increase performance for all end users of PyPI.