Issue/Server online failure
Currently, the worst failure we know about is the case where one of the servers does not go down, but instead stays in a state where it accepts connections and then either responds incorrectly or hangs on the response.
In this case, a client that happens to reach this server in the pool will get an incorrect response or a delay.
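One strategy to minimize this is to use Nagios to monitor each server and to remove a failing server from the rotation as soon as possible. Nagios' minimum check interval is 1 minute, and the DNS is currently configured with a short time to live (TTL), so clients may still be directed to a failing server for roughly one check interval plus the TTL before the removal takes effect.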
Is this level of risk and mitigation strategy acceptable?
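
To detect this failure mode, the monitoring check has to look at more than whether the port is open: a server that accepts connections but answers wrongly or slowly would pass a plain connectivity check. The following is a minimal sketch of such a check in Python, written in the style of a Nagios plugin (exit code 0 for OK, 2 for CRITICAL). The function names, the expected-text parameter, and the 5 second limit are assumptions made for illustration, not part of the current setup.

```python
import http.client
import time
import urllib.request

OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes


def classify(status, body, elapsed, expected, max_seconds):
    """Return (exit_code, message) for one observed response."""
    if status != 200:
        return CRITICAL, f"CRITICAL: HTTP status {status}"
    if expected not in body:
        # The server answered, but not with the content we expect.
        return CRITICAL, "CRITICAL: expected text missing from response"
    if elapsed > max_seconds:
        # The server answered correctly, but too slowly overall.
        return CRITICAL, f"CRITICAL: response took {elapsed:.1f}s"
    return OK, f"OK: correct response in {elapsed:.1f}s"


def check(url, expected, max_seconds=5.0):
    """Fetch url and judge the response by content and total time."""
    start = time.monotonic()
    try:
        # The timeout applies to each socket operation, so the total
        # elapsed time is measured separately and checked in classify().
        with urllib.request.urlopen(url, timeout=max_seconds) as resp:
            status = resp.status
            body = resp.read().decode("utf-8", errors="replace")
    except (OSError, http.client.HTTPException) as exc:
        # Refused connections, timeouts, HTTP errors, truncated replies.
        return CRITICAL, f"CRITICAL: {exc}"
    elapsed = time.monotonic() - start
    return classify(status, body, elapsed, expected, max_seconds)


def main(argv):
    """Hypothetical entry point: main([prog, url, expected_text])."""
    code, message = check(argv[1], argv[2])
    print(message)
    return code
```

A wrapper script would call `sys.exit(main(sys.argv))` so that Nagios sees the exit code. Such a check only narrows what goes undetected; the detection delay is still bounded below by the check interval.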
