Failure Detection in Cassandra Explained

qiemengdao


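I. Traditional Failure Detection and Its Shortcomings

Traditional failure detection

In distributed systems, heartbeats are commonly used to monitor server health. In theory, however, a heartbeat cannot truly determine whether the peer has crashed, because a crashed peer cannot be reliably distinguished from one that is merely slow. The traditional approach sets a timeout T and declares the peer down if no heartbeat arrives within T. The method is crude but widely used.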
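The fixed-timeout approach can be sketched in a few lines of Java. This is a minimal illustration only: the class name FixedTimeoutDetector, its methods, and the host name "node-1" are hypothetical and do not come from Cassandra. Timestamps are passed in explicitly so the behavior is deterministic.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the fixed-timeout scheme; not Cassandra code.
class FixedTimeoutDetector {
    private final long timeoutMillis; // the timeout T
    private final Map<String, Long> lastHeartbeat = new HashMap<>();

    FixedTimeoutDetector(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Record that a heartbeat from the given host arrived at nowMillis.
    void heartbeat(String host, long nowMillis) {
        lastHeartbeat.put(host, nowMillis);
    }

    // A host is declared down if it was never seen or has been silent for longer than T.
    boolean isDown(String host, long nowMillis) {
        Long last = lastHeartbeat.get(host);
        return last == null || nowMillis - last > timeoutMillis;
    }

    public static void main(String[] args) {
        FixedTimeoutDetector detector = new FixedTimeoutDetector(3000);
        detector.heartbeat("node-1", 0);
        System.out.println(detector.isDown("node-1", 2500)); // false: silent for 2.5 s
        System.out.println(detector.isDown("node-1", 3500)); // true: silent for 3.5 s
    }
}
```

The verdict is a single boolean that flips the moment the silence exceeds T, no matter whether the peer is dead or just slow.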

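Shortcomings of traditional failure detection

As described above, in the traditional scheme the target host sends a heartbeat every t seconds, and the receiver uses a timeout T (t < T) to decide whether the target is down. To choose T correctly, the receiver must know the target's heartbeat pattern (a period of t), yet the right T also depends on uncertain factors such as current network conditions and the target host's processing capacity. In practice T is therefore set to an upper bound obtained by testing or estimation. If the bound is too large, detection is slow but more accurate; if it is too small, detection is fast but more prone to false positives. The traditional method breaks down in the following scenarios: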

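1. Gossip communication

In applications built on Gossip communication, for example, peers are contacted at random, so there is no regular heartbeat between any two servers. A suitable timeout T is therefore hard to find, unless T is set very large, in which case detection becomes intolerably slow.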

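2. Dynamically changing network load

As network load grows, the heartbeat arrival time may exceed the upper bound T; when the load drops, it falls below T again. A fixed T cannot track these changes, so it produces either slow detection or false positives.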
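A small sketch makes the trade-off concrete. The interval values below are invented for illustration (light load, then heavy load, then light load again), and the class and method names are hypothetical. The server is alive the whole time, so every interval that exceeds T counts as a false positive.

```java
// Hypothetical illustration: one fixed timeout applied to heartbeat intervals that vary with load.
class FixedTimeoutUnderLoad {

    // Count the intervals longer than the timeout, i.e. the times a live server would be declared down.
    static int falseAlarms(long[] intervalsMillis, long timeoutMillis) {
        int count = 0;
        for (long interval : intervalsMillis) {
            if (interval > timeoutMillis) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up intervals in milliseconds: light load, heavy load, light load.
        long[] intervals = {1000, 1100, 900, 2500, 2800, 2600, 1000, 950};
        System.out.println(falseAlarms(intervals, 2000)); // 3: T is too small for the loaded period
        System.out.println(falseAlarms(intervals, 3000)); // 0: no false alarms, but a real crash takes 3 s to notice
    }
}
```

Neither value of T is right for both periods, which is the motivation for a detector that adapts to the observed intervals.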

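3. Separating heartbeat monitoring from interpretation

Not every application needs only a true/false verdict on whether the target host is down; many need to interpret the heartbeat state themselves and act accordingly. For example, if the target host has sent no heartbeat for 3 s, application A may treat it as crashed and retry, while application B may treat it as "inactive" and delegate the task to another server. In other words, whether the target is "down" should be decided by business logic rather than by a single timeout T. This requires separating heartbeat monitoring from the interpretation of its result, which gives applications more flexibility.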
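The separation can be sketched as follows: the monitoring side reports only how long the target has been silent, and each application maps that number to its own action. All names here (SuspicionInterpretation, Action, appA, appB) are hypothetical, and the 3 s cut-off is the one from the example above.

```java
// Hypothetical illustration of separating monitoring from interpretation.
class SuspicionInterpretation {

    enum Action { CONTINUE, RETRY, DELEGATE }

    // Monitoring: reports a raw measurement and makes no up/down decision.
    static long silenceMillis(long lastHeartbeatMillis, long nowMillis) {
        return nowMillis - lastHeartbeatMillis;
    }

    // Application A: treats 3 s of silence as a crash and retries.
    static Action appA(long silenceMillis) {
        return silenceMillis >= 3000 ? Action.RETRY : Action.CONTINUE;
    }

    // Application B: treats 3 s of silence as "inactive" and delegates the task elsewhere.
    static Action appB(long silenceMillis) {
        return silenceMillis >= 3000 ? Action.DELEGATE : Action.CONTINUE;
    }

    public static void main(String[] args) {
        long silence = silenceMillis(0, 3000);
        System.out.println(appA(silence)); // RETRY
        System.out.println(appB(silence)); // DELEGATE
    }
}
```

The same measurement leads to two different decisions, and neither application has to change the monitoring code.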

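II. The Φ Failure Detection Method Used in the Gossiper

According to the classic failure detection paper The Phi accrual failure detector (http://vsedach.googlepages.com/HDY04.pdf), in a distributed environment a host can be judged down from statistics of its past heartbeat intervals, as follows: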

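1. Choose a threshold Φ.

2. Over a period of time, record the intervals between successive heartbeats.

3. Model the intervals with an exponential distribution and compute the probability:

P = e ^ (-1 * (now - lastTimeStamp) / mean) (e is the base of the natural logarithm, 2.71828..., and mean is the average of the previously recorded intervals)

This is the probability that the next heartbeat takes longer than now - lastTimeStamp to arrive, measured from the last heartbeat received.

4. Compute φ = -log10 P.

5. When φ > Φ, the host can be considered down.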
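Steps 3 and 4 can be combined into a closed form, which makes the behavior of φ easier to see. In the derivation below, t stands for now - lastTimeStamp and μ for mean; both symbols are shorthand introduced here and do not appear in the original.

```latex
\begin{align*}
\varphi &= -\log_{10} P = -\log_{10} e^{-t/\mu} = \frac{t}{\mu \ln 10} \approx 0.434\,\frac{t}{\mu},\\
\varphi > \Phi &\iff t > \Phi\,\mu \ln 10 \approx 2.303\,\Phi\,\mu .
\end{align*}
```

Under this exponential model, φ grows linearly with the silence time and is scaled by the observed mean interval, so the effective timeout adapts automatically when heartbeats become slower or faster.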

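This can of course produce false positives. Because φ = -log10 P, the probability of a false positive at each threshold is:

Φ = 1, 10%

Φ = 2, 1%

Φ = 3, 0.1%

......

As the list shows, at Φ = 8 the false-positive probability is already very small (10^-8). Cassandra uses Φ = 8 by default.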
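To get a feel for these numbers, the sketch below computes two things from the formulas P = e^(-t / mean) and φ = -log10 P: the false-positive probability 10^-Φ for a threshold, and the silence time after which φ reaches Φ, which works out to Φ * mean * ln(10). The class and method names are hypothetical, and the 1000 ms mean interval is an assumed value chosen only for the example.

```java
// Hypothetical helper for exploring phi thresholds; not Cassandra code.
class PhiThresholds {

    // Since phi = -log10(P), a threshold of phi corresponds to P = 10^(-phi).
    static double mistakeProbability(double threshold) {
        return Math.pow(10, -threshold);
    }

    // Solving -log10(e^(-t / mean)) = threshold for t gives t = threshold * mean * ln(10).
    static double silenceToConvictMillis(double threshold, double meanIntervalMillis) {
        return threshold * meanIntervalMillis * Math.log(10);
    }

    public static void main(String[] args) {
        double meanIntervalMillis = 1000; // assumed mean heartbeat interval
        for (double threshold : new double[] {1, 2, 3, 8}) {
            System.out.printf("threshold=%.0f  P(mistake)=%.0e  silence needed=%.0f ms%n",
                    threshold,
                    mistakeProbability(threshold),
                    silenceToConvictMillis(threshold, meanIntervalMillis));
        }
    }
}
```

With a mean interval of 1000 ms, a threshold of 8 means a host is considered down after roughly 18.4 s of silence. A higher threshold lowers the false-positive rate at the cost of slower detection.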

Below is a Java implementation of the Phi failure detection algorithm. The implementation in Cassandra is similar.

/**
 * Java demo for the phi failure detector.
 */
import java.util.ArrayDeque;
import java.util.Iterator;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class PhiAccrualFailureDetector {

    private static final int sampleWindowSize = 1000;
    private static int phiSuspectThreshold = 8;
    private SamplingWindow samplingWindow = new SamplingWindow(sampleWindowSize);

    public PhiAccrualFailureDetector() {
    }

    // Record a heartbeat that arrives now.
    public void addSample() {
        samplingWindow.add(System.currentTimeMillis());
    }

    // Record a heartbeat with an explicit arrival timestamp in milliseconds.
    public void addSample(double sample) {
        samplingWindow.add(sample);
    }

    // Compute phi for the current moment and compare it with the threshold.
    public void interpret() {
        double phi = samplingWindow.phi(System.currentTimeMillis());
        System.out.println("PHI = " + phi);
        if (phi > phiSuspectThreshold) {
            System.out.println("We are assuming the monitored machine is down!");
        } else {
            System.out.println("We are assuming the monitored machine is still running!");
        }
    }

    /**
     * @param args
     *            the command line arguments
     */
    public static void main(String[] args) {
        PhiAccrualFailureDetector pafd = new PhiAccrualFailureDetector();

        // first try with phi < phiSuspectThreshold
        for (int i = 0; i < 10; i++) {
            pafd.addSample();
            try {
                Thread.sleep(10L);
            } catch (InterruptedException ex) {
                // no op
            }
        }
        try {
            Thread.sleep(500L);
        } catch (InterruptedException ex) {
            // no op
        }
        System.out.println(pafd.samplingWindow.toString());
        pafd.interpret();

        // second try result phi > phiSuspectThreshold
        for (int i = 0; i < 10; i++) {
            pafd.addSample();
            try {
                Thread.sleep(10L);
            } catch (InterruptedException ex) {
                // no op
            }
        }
        try {
            Thread.sleep(1500L);
        } catch (InterruptedException ex) {
            // no op
        }
        System.out.println(pafd.samplingWindow.toString());
        pafd.interpret();
    }

    static class SamplingWindow {

        private final Lock lock = new ReentrantLock();
        private double lastTimeStamp = 0L;
        private StatisticDeque arrivalIntervals;

        SamplingWindow(int size) {
            arrivalIntervals = new StatisticDeque(size);
        }

        // Record a heartbeat arrival time and store the interval since the previous one.
        void add(double value) {
            lock.lock();
            try {
                double interval;
                if (lastTimeStamp > 0L) {
                    interval = (value - lastTimeStamp);
                } else {
                    // first heartbeat: there is no previous timestamp, so seed with 500 ms
                    interval = 1000 / 2;
                }
                lastTimeStamp = value;
                arrivalIntervals.add(interval);
            } finally {
                lock.unlock();
            }
        }

        double sum() {
            lock.lock();
            try {
                return arrivalIntervals.sum();
            } finally {
                lock.unlock();
            }
        }

        double sumOfDeviations() {
            lock.lock();
            try {
                return arrivalIntervals.sumOfDeviations();
            } finally {
                lock.unlock();
            }
        }

        double mean() {
            lock.lock();
            try {
                return arrivalIntervals.mean();
            } finally {
                lock.unlock();
            }
        }

        double variance() {
            lock.lock();
            try {
                return arrivalIntervals.variance();
            } finally {
                lock.unlock();
            }
        }

        double stdev() {
            lock.lock();
            try {
                return arrivalIntervals.stdev();
            } finally {
                lock.unlock();
            }
        }

        void clear() {
            lock.lock();
            try {
                arrivalIntervals.clear();
            } finally {
                lock.unlock();
            }
        }

        /**
         * p = E ^ (-1 * (tnow - lastTimeStamp) / mean)
         */
        double p(double t) {
            double mean = mean();
            double exponent = (-1) * (t) / mean;
            return Math.exp(exponent);
        }

        // phi = -log10(p); returns 0 while no interval has been recorded yet.
        double phi(long tnow) {
            int size = arrivalIntervals.size();
            double log = 0d;
            if (size > 0) {
                double t = tnow - lastTimeStamp;
                double probability = p(t);
                log = (-1) * Math.log10(probability);
            }
            return log;
        }

        @Override
        public String toString() {
            StringBuilder s = new StringBuilder();
            for (Iterator<Double> it = arrivalIntervals.iterator(); it.hasNext();) {
                s.append(it.next()).append(" ");
            }
            return s.toString();
        }
    }

    // Bounded queue of intervals with basic statistics over its contents.
    static class StatisticDeque implements Iterable<Double> {

        private final int size;
        protected final ArrayDeque<Double> queue;

        public StatisticDeque(int size) {
            this.size = size;
            queue = new ArrayDeque<Double>(size);
        }

        @Override
        public Iterator<Double> iterator() {
            return queue.iterator();
        }

        public int size() {
            return queue.size();
        }

        public void clear() {
            queue.clear();
        }

        // Append a value, dropping the oldest one when the window is full.
        public void add(double o) {
            if (size == queue.size()) {
                queue.remove();
            }
            queue.add(o);
        }

        public double sum() {
            double sum = 0D;
            for (Double interval : this) {
                sum += interval;
            }
            return sum;
        }

        // Sum of squared deviations from the mean.
        public double sumOfDeviations() {
            double sumOfDeviations = 0D;
            double mean = mean();
            for (Double interval : this) {
                double d = interval - mean;
                sumOfDeviations += d * d;
            }
            return sumOfDeviations;
        }

        public double mean() {
            return sum() / size();
        }

        public double variance() {
            return sumOfDeviations() / size();
        }

        public double stdev() {
            return Math.sqrt(variance());
        }
    }
}

Reference: 《分布式系统实现》

2011-12-28 14:58