Homec4science

Fix performance problem for large task queues

Authored by epriestley <git@epriestley.com> on Jan 22 2013, 21:27.

Description

Fix performance problem for large task queues

Summary:
Some time ago, we added ORDER BY id ASC to the worker UPDATE ... query, because someone reported that their MySQL read slaves were complaining about the query (I can't find the exact error message, but something to the effect of the rows the query affected not being deterministic). This seemed harmless since it should be the same as the query's implicit order (I guess?), but actually made the query dramatically slower for large numbers of rows.

On my local machine, this query takes about 2 seconds with ~1M rows. If I run SELECT, or run UPDATE without ORDER BY, the query takes < 0.01s. I don't understand exactly what's happening -- my guess is something to do with the ORDER BY implying that a lot of rows need to be locked?

In T2372, a user is seeing 20-60s rumtimes on this query.

I solved this by doing a SELECT, followed by an UPDATE. Each query runs quickly. This introduces the possibility of a race (two processes SELECT the same rows, then try to UPDATE), which we currently recover from by having the second UPDATE fail and then having that daemon try again 1 second later. This seems generally reasonable. Some alternatives I considered:

  • We could SELECT ... LOCK FOR UPDATE, but failing and retrying a little later seems at least as good as blocking.
  • We could select more rows than we need, and then try to lock some of them randomly. I think this would work well, but it's a bit more complex than what we're doing now so I left it until we have a clearer need.

Test Plan:
Inserted ~1M tasks into the queue. Ran phd debug taskmaster, saw ~2s task updates. Applied patch. Ran phd debug taskmaster, saw <1ms updates. Ran phd launch 8 taskmaster, saw rapid completion of tasks.

This stuff also has fairly thorough unit test coverage.

Reviewers: vrana, btrahan

Reviewed By: vrana

CC: aran

Maniphest Tasks: T2372

Differential Revision: https://secure.phabricator.com/D4576

Details

Committed
epriestley <git@epriestley.com>Jan 22 2013, 21:27
Pushed
aubortJan 31 2017, 17:16
Parents
rPH70a2a653ff23: Revert D4359 and apply a better fix
Branches
Unknown
Tags
Unknown

Event Timeline

epriestley <git@epriestley.com> committed rPH22c64c67ffca: Fix performance problem for large task queues (authored by epriestley <git@epriestley.com>).Jan 22 2013, 21:27