Issue/Server online failure

Currently, the worst failure we know about is the case where one of the servers doesn't go down, but instead enters a state where it accepts connections and then responds incorrectly or hangs on the response.

In this case, a client that happens to reach this server in the pool will get an incorrect response or a delay.

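A strategy to minimize this is to use Nagios to monitor each server and to remove a failing server from the rotation as soon as possible. Nagios' minimum check interval is 1 minute, and the DNS is currently configured to have a short time to live, so a misbehaving server can keep receiving traffic for roughly one check interval plus the DNS time to live before clients stop reaching it.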
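Because the failing server still accepts connections, a check that only tests whether the port is open would not catch this case; the check has to look at the response itself. The following is a minimal sketch of such a check, not the check we currently run. The function names, the expected text, and the thresholds are all hypothetical. It assumes each server exposes some HTTP URL whose correct response contains a known string. The return codes follow the standard Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import socket
import time
import urllib.error
import urllib.request

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(status, body, elapsed, expected, max_seconds):
    """Map one observed response to a Nagios exit code and a message."""
    if status != 200:
        return CRITICAL, f"unexpected HTTP status {status}"
    if expected not in body:
        # The server answered, but with the wrong content.
        return CRITICAL, "response body does not contain expected text"
    if elapsed > max_seconds:
        return WARNING, f"slow response: {elapsed:.2f}s"
    return OK, f"healthy in {elapsed:.2f}s"


def check_url(url, expected, max_seconds=2.0, timeout=5.0):
    """Fetch url and classify the result.

    The timeout bounds each blocking socket operation (not the total
    transfer), so a server that accepts the connection and then goes
    silent is reported as CRITICAL instead of hanging the check.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
            body = resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as exc:
        # HTTPError must be caught before URLError, its parent class.
        return CRITICAL, f"unexpected HTTP status {exc.code}"
    except (urllib.error.URLError, socket.timeout, OSError) as exc:
        return CRITICAL, f"no usable response: {exc}"
    elapsed = time.monotonic() - start
    return evaluate(status, body, elapsed, expected, max_seconds)
```

A Nagios plugin wrapper around this would print the message and exit with the returned code. Removing the server from the rotation would still be a separate step triggered by the CRITICAL result.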

Is this level of risk and mitigation strategy acceptable?