PyPI experienced increased error rates overnight, beginning at 00:35 UTC and ending at 00:43 UTC.
The root cause was increased latency in calls to the backing Elasticsearch cluster which services PyPI's search. This latency caused our backend servers to spin on search requests until the default 60s timeout leading to exhaustion of our backend worker pools. The root cause of the Elasticsearch issue was a failing job which temporarily filled the disk on the Elasticsearch server. This job's work is not at all critical to PyPI's operation and the volume of data it was processing has grown unchecked.
While search doesn't seem like it would generate sufficient traffic to overload our backends to this extent, the legacy PyPI XMLRPC interface exposes this search method and is frequently used by many automation tools.
- Disabled failing non-critical job, the volume of data it was processing had grown unchecked.
- Implemented metrics on calls to Elasticsearch
- Implemented strict timeouts on calls to Elasticsearch to guard PyPI's availability.
We'll be monitoring the affect of these timeouts via the implemented metrics and adjust them as needed to ensure search availability is not adversely affected during regular operation.
Posted about 1 month ago. Apr 15, 2017 - 14:14 UTC