Homec4science

Fix bad counting in SQL when enforcing Drydock allocator soft limits

Authored by epriestley <git@epriestley.com> on Oct 14 2015, 15:18.

Description

Fix bad counting in SQL when enforcing Drydock allocator soft limits

Summary:
Ref T9252. This fixes a bug from D14236. D14272 discusses the observable effects of the bug, primarily that the window for racing is widened from ~a few milliseconds to several minutes under our configuration.

This SQL query is missing a GROUP BY clause, so all of the resources get counted as having the same status (specifically, the alphabetically earliest status any resource had, I think). For test cases this often gets the right result since the number of resources may be small and they may all have the same status, but in production this isn't true. In particular, the allocator would sometimes see "35 destroyed resources" (or whatever), when the real counts were "32 destroyed resources + 3 pending resources".

Since this allocator behavior is soft/advisory this didn't cause any actual problems, per se (we do expect races here occasionally), it just made the race very very easy to hit. For example, Drydock in production currently has three pending working copy resources. Although we do expect this to be possible, getting 4 resources when the configured limit is 1 should be hard (not lightning strike / cosmic radiaion hard, but "happens once a year" hard).

Also exclude destroyed resources since we never care about them.

Test Plan:
Followed the plan from D14272 and restarted two Harbormaster workers at the same time.

After this patch was applied, they no longer created two different resources (we expect it to be possible for this to happen, just very hard).

We should still be able to force this race by putting something like sleep(10) right before the query, then sleep(10) right after it. That would prevent the allocators from seeing one another (so they would both think there were no other resources) and push us down the pathway where we exceed the soft limit.

Reviewers: chad, hach-que

Reviewed By: hach-que

Maniphest Tasks: T9252

Differential Revision: https://secure.phabricator.com/D14274

Details

Committed
epriestley <git@epriestley.com>Oct 14 2015, 15:18
Pushed
aubortJan 31 2017, 17:16
Parents
rPH083a321dad1b: Fix an issue where newly created Drydock resources could be improperly acquired
Branches
Unknown
Tags
Unknown

Event Timeline

epriestley <git@epriestley.com> committed rPHac7edf54afe4: Fix bad counting in SQL when enforcing Drydock allocator soft limits (authored by epriestley <git@epriestley.com>).Oct 14 2015, 15:18