Slow response times / intermittent request failures

Incident Report for Redivis

Resolved

Our database cluster started experiencing high (~100%) CPU load at approximately 2:15pm PST on 2/17. The root cause was the release of a dataset containing several million variables (thousands of tables each with thousands of variables), which interacted with a poorly optimized code path that created a spike in database load.

Our automated systems first alerted our team of a high request failure rate at approximately 2:19pm local time (T+4). Initial observations suggested that the system's auto-recovery functionality was working, with requests returning to a nominal success rate for periods of several minutes, but such recovery was intermittent.

We identified the underlying error at approximately 3:35pm (T +1:20), which was easily resolved by bypassing an unnecessary reload of variables from the database on summary statistic computation. We began deployment of the fix at 3:53pm (T +1:38), which was successfully deployed on our servers at 3:58pm (T +1:43). We observed an immediate drop in CPU utilization in our database cluster and associated errors after the fix was live.

The deployed fix will prevent large datasets from triggering such an event in the future. We plan to conduct a full internal investigation and review of testing protocols to ensure that similar downtime is best avoided going forward.

Posted Feb 17, 2023 - 22:00 UTC