Failure detection based on a simple heartbeat protocol. Regularly polls members for
liveness and multicasts SUSPECT messages when a member is not reachable. The
algorithm works as follows: the membership is known and ordered. Each member's heartbeat protocol
periodically sends an 'are-you-alive' message to its *neighbor*. The neighbor is the next member in
rank in the membership list, which is recomputed upon a view change. When no response has
been received after m tries, each waiting n milliseconds, the corresponding member is suspected (and
eventually excluded if faulty).
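The neighbor selection described above can be sketched as follows. This is a minimal illustration, not the protocol's actual code: the class and method names are hypothetical, members are represented as plain strings, and wrapping from the last member back to the first is an assumption that the text does not state explicitly.

```java
import java.util.List;

// Hypothetical helper: picks the member to monitor from an ordered view.
final class NeighborSketch {
    /**
     * Returns the member ranked directly after self in the ordered list.
     * Assumption: the last member wraps around to the first.
     * Returns null when self is not in the list or has nobody to monitor.
     */
    static String neighborOf(List<String> members, String self) {
        int idx = members.indexOf(self);
        if (idx < 0 || members.size() < 2) {
            return null; // not a member, or no other member to ping
        }
        return members.get((idx + 1) % members.size());
    }
}
```

Because the neighbor depends only on the ordered membership, it would be recomputed by calling such a method again whenever a new view is installed.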
FD starts when it detects (in a view change notification) that there are at least
2 members in the group. It stops running when the membership drops below 2.
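The start/stop rule can be illustrated with a small sketch. The class name and the string-based membership are hypothetical; the only behavior taken from the text is that monitoring runs while the group has at least 2 members.

```java
import java.util.List;

// Hypothetical lifecycle holder: tracks whether monitoring should be active.
final class MonitorLifecycleSketch {
    private boolean running;

    /** Called with the new membership on every view change notification. */
    void onViewChange(List<String> members) {
        // Monitoring only makes sense when there is someone else to ping.
        running = members.size() >= 2;
    }

    boolean isRunning() {
        return running;
    }
}
```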
When a message is received from the monitored neighbor member, it causes the pinger thread to
'skip' sending the next are-you-alive message. Thus, traffic is reduced.
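The interplay between skipping a ping and counting failed tries might look like the state machine below. It is a sketch under stated assumptions: all names are hypothetical, the timer thread is replaced by an explicit onTimeout() call per interval, and treating any traffic from the neighbor as resetting the try counter is an assumption rather than documented behavior.

```java
// Hypothetical, single-threaded model of the pinger's decision per interval.
final class PingerSketch {
    enum Action { SEND_PING, SKIP, SUSPECT }

    private final int maxTries;        // "m tries" in the description
    private int missed;                // consecutive unanswered pings
    private boolean heardFromNeighbor; // traffic seen since the last interval
    private boolean awaitingAck;       // a ping is outstanding

    PingerSketch(int maxTries) {
        this.maxTries = maxTries;
    }

    /** Called when any message (ack or data) arrives from the neighbor. */
    void onTrafficFromNeighbor() {
        heardFromNeighbor = true;
        missed = 0;          // assumption: traffic proves liveness
        awaitingAck = false;
    }

    /** Called once per timeout interval ("n milliseconds"). */
    Action onTimeout() {
        if (heardFromNeighbor) {
            heardFromNeighbor = false;
            return Action.SKIP; // neighbor is alive, save one ping
        }
        if (awaitingAck && ++missed >= maxTries) {
            return Action.SUSPECT; // m pings went unanswered
        }
        awaitingAck = true;
        return Action.SEND_PING;
    }
}
```

With maxTries set to 2, a silent neighbor yields SEND_PING, SEND_PING, SUSPECT on three successive timeouts, while a single message from the neighbor turns the next timeout into SKIP and restarts the count.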
When we receive a ping from a member that's not in the membership list, we shun it by sending it a
NOT_MEMBER message. That member will then leave the group (and possibly rejoin). This is only done if
shun is true.
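The shunning decision can be sketched as a pure function. The class, method, and the ACK reply name are hypothetical; only NOT_MEMBER and the shun flag come from the text. Replying with a normal acknowledgement to a non-member when shun is false is an assumption, since the text does not say what happens in that case.

```java
import java.util.List;

// Hypothetical decision helper for an incoming are-you-alive message.
final class ShunSketch {
    enum Reply { ACK, NOT_MEMBER }

    /** Chooses the reply to a ping from sender, given the current membership. */
    static Reply replyToPing(String sender, List<String> members, boolean shun) {
        if (shun && !members.contains(sender)) {
            return Reply.NOT_MEMBER; // tell the stranger to leave (and possibly rejoin)
        }
        return Reply.ACK; // assumption: everyone else gets a normal acknowledgement
    }
}
```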
author: Bela Ban
version: $Revision: 1.40.2.2 $