
Automatically sever databases after prolonged unreachability
ebff07d01983

Authored by epriestley <git@epriestley.com> on Apr 10 2016, 23:18.

Description

Automatically sever databases after prolonged unreachability

Summary:
Ref T4571. When a database goes down briefly, we fall back to replicas.

However, this fallback is slow (not good for users) and keeps sending a lot of traffic to the master (might be bad if the root cause is load-related).

Keep track of recent connections and fully degrade into "severed" mode if we see a sequence of failures over a reasonable period of time. In this mode, we send much less traffic to the master (faster for users; less load for the database).

We still send a little traffic to the master, and if it recovers we return to normal mode after seeing several connections in a row succeed.

This is similar to what most load balancers do when pulling web servers in and out of pools.

For now, the specific numbers are:

- We do at most one health check every 3 seconds.
- If 5 checks in a row fail or succeed, we sever or un-sever the database (so it takes about 15 seconds to switch modes).
- If the database is currently marked unhealthy, we reduce timeouts and retries when connecting to it.
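
The rules above can be sketched as a small state machine. This is a minimal illustration in Python, not the PHP code added in `PhabricatorDatabaseHealthRecord.php`: the class name, method names, and the concrete timeout and retry values are assumptions made for the example. Only the 3-second interval and the 5-check threshold come from the description, and the sketch treats "marked unhealthy" as equivalent to "severed".

```python
from collections import deque


class HealthRecord:
    """Hypothetical health tracker for one database host (illustrative only)."""

    CHECK_INTERVAL = 3.0  # seconds; at most one recorded check per interval
    REQUIRED_CHECKS = 5   # identical results in a row needed to switch modes

    def __init__(self):
        # Keep only the most recent results, oldest first.
        self.checks = deque(maxlen=self.REQUIRED_CHECKS)
        self.last_check_at = None
        self.severed = False

    def should_check(self, now):
        """Return True if enough time has passed to record another check."""
        if self.last_check_at is None:
            return True
        return now - self.last_check_at >= self.CHECK_INTERVAL

    def record(self, now, ok):
        """Record a connection result, rate-limited to one per interval."""
        if not self.should_check(now):
            return
        self.last_check_at = now
        self.checks.append(ok)
        if len(self.checks) < self.REQUIRED_CHECKS:
            return  # not enough history to change modes yet
        if not any(self.checks):
            self.severed = True   # every recent check failed: sever
        elif all(self.checks):
            self.severed = False  # every recent check succeeded: un-sever

    def connection_options(self):
        """Use cheaper connection attempts while severed (values are assumed)."""
        if self.severed:
            return {"timeout": 2, "retries": 0}
        return {"timeout": 10, "retries": 3}


if __name__ == "__main__":
    record = HealthRecord()
    for second in range(0, 13):
        record.record(float(second), ok=False)
    print(record.severed, record.connection_options())
```

With one connection attempt per second, checks are recorded at 0, 3, 6, 9, and 12 seconds, so the fifth consecutive failure lands near the "about 15 seconds" figure quoted above. A mixed window of successes and failures leaves the current mode unchanged, so a single lucky or unlucky connection does not flip the state.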

Test Plan:

- Configured a bad master.
- Browsed around for a bit, initially saw "unreachable master" errors.
- After about 15 seconds, saw "major interruption" errors instead.
- Fixed the config for master.
- Browsed around for a while longer.
- After about 15 seconds, things recovered.
- Used "Cluster Databases" console to keep an eye on health checks: it now shows how many recent health checks were good:

{F1213397}

Reviewers: chad

Reviewed By: chad

Maniphest Tasks: T4571

Differential Revision: https://secure.phabricator.com/D15677

Details

Committed

epriestley <git@epriestley.com>

Apr 11 2016, 17:43

Pushed

aubort

Jan 31 2017, 17:16

Parents

rPH5cf09f567a98: Fix an issue with date parsing when viewer timezone differs from server timezone

Branches

Unknown

Tags

Unknown

Event Timeline

epriestley <git@epriestley.com> committed rPHebff07d01983: Automatically sever databases after prolonged unreachability (authored by epriestley <git@epriestley.com>). Apr 11 2016, 17:43

Changes (6)

				Path
	M			src/__phutil_library_map__.php
	M			src/applications/cache/PhabricatorCaches.php
	M			src/applications/config/controller/PhabricatorConfigClusterDatabasesController.php
	A			src/infrastructure/cluster/PhabricatorDatabaseHealthRecord.php
	M			src/infrastructure/cluster/PhabricatorDatabaseRef.php
	M			src/infrastructure/env/PhabricatorEnv.php