Make task queue more robust against long-running tasks
Summary:
See discussion in D8773. Three small adjustments which should help prevent this kind of issue:
- When queueing followup tasks, hold them on the worker until we finish the task, then queue them only if the work was successful.
- Increase the default lease time from 60 seconds to 2 hours. Although most tasks finish in far fewer than 60 seconds, the daemons are generally stable nowadays and these short leases don't serve much of a purpose. I think they also date from an era where lease expiry and failure were less clearly distinguished.
- Increase the default wait-after-failure from 60 seconds to 5 minutes. This largely dates from the MetaMTA era, where Facebook ran services with high failure rates and it was appropriate to repeatedly hammer them until things went through. In modern infrastructure, such failures are rare.
Test Plan:
- Verified that tasks queued properly after the main task was updated.
- Verified that leases default to 7200 seconds.
- Intentionally failed a task and verified default 300 second wait before retry.
- Removed all default leases shorter than 7200 seconds (there was only one).
- Checked all the wait before retry implementations for anything much shorter than 5 minutes (they all seem reasonable).
Reviewers: btrahan, sowedance
Reviewed By: sowedance
Subscribers: epriestley
Differential Revision: https://secure.phabricator.com/D8774