`

Consistent Hashing

 
阅读更多

(ORIGINALLY ADAPTED FROM:

http://weblogs.java.net/blog/2007/11/27/consistent-hashing

CSDN GET ANOTHER BEAUTIFUAL PASSAGE:

http://blog.csdn.net/sparkliang/archive/2010/02/02/5279393.aspx

)

 

I've bumped into consistent hashing a couple of times lately. The paper that introduced the idea (Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web by David Karger et al) appeared ten years ago, although recently it seems the idea has quietly been finding its way into more and more services, from Amazon's Dynamo to memcached (courtesy of Last.fm). So what is consistent hashing and why should you care?

The need for consistent hashing arose from limitations experienced while running collections of caching machines - web caches, for example. If you have a collection of n cache machines then a common way of load balancing across them is to put object o in cache machine number hash(o) mod n. This works well until you add or remove cache machines (for whatever reason), for then n changes and every object is hashed to a new location. This can be catastrophic since the originating content servers are swamped with requests from the cache machines. It's as if the cache suddenly disappeared. Which it has, in a sense. (This is why you should care - consistent hashing is needed to avoid swamping your servers!)

It would be nice if, when a cache machine was added, it took its fair share of objects from all the other cache machines. Equally, when a cache machine was removed, it would be nice if its objects were shared between the remaining machines. This is exactly what consistent hashing does - consistently maps objects to the same cache machine, as far as is possible, at least.

The basic idea behind the consistent hashing algorithm is to hash both objects and caches using the same hash function. The reason to do this is to map the cache to an interval, which will contain a number of object hashes. If the cache is removed then its interval is taken over by a cache with an adjacent interval. All the other caches remain unchanged.

Demonstration

Let's look at this in more detail. The hash function actually maps objects and caches to a number range. This should be familiar to every Java programmer - the hashCode method on Object returns an int, which lies in the range -231 to 231-1. Imagine mapping this range into a circle so the values wrap around. Here's a picture of the circle with a number of objects (1, 2, 3, 4) and caches (A, B, C) marked at the points that they hash to (based on a diagram from Web Caching with Consistent Hashing by David Karger et al):

consistent_hashing_1.png

To find which cache an object goes in, we move clockwise round the circle until we find a cache point. So in the diagram above, we see object 1 and 4 belong in cache A, object 2 belongs in cache B and object 3 belongs in cache C. Consider what happens if cache C is removed: object 3 now belongs in cache A, and all the other object mappings are unchanged. If then another cache D is added in the position marked it will take objects 3 and 4, leaving only object 1 belonging to A.

consistent_hashing_2.png

This works well, except the size of the intervals assigned to each cache is pretty hit and miss. Since it is essentially random it is possible to have a very non-uniform distribution of objects between caches. The solution to this problem is to introduce the idea of "virtual nodes", which are replicas of cache points in the circle. So whenever we add a cache we create a number of points in the circle for it.

You can see the effect of this in the following plot which I produced by simulating storing 10,000 objects in 10 caches using the code described below. On the x-axis is the number of replicas of cache points (with a logarithmic scale). When it is small, we see that the distribution of objects across caches is unbalanced, since the standard deviation as a percentage of the mean number of objects per cache (on the y-axis, also logarithmic) is high. As the number of replicas increases the distribution of objects becomes more balanced. This experiment shows that a figure of one or two hundred replicas achieves an acceptable balance (a standard deviation that is roughly between 5% and 10% of the mean).

ch-graph.png

Implementation

For completeness here is a simple implementation in Java. In order for consistent hashing to be effective it is important to have a hash function that mixes well. Most implementations of Object's hashCode do not mix well - for example, they typically produce a restricted number of small integer values - so we have a HashFunction interface to allow a custom hash function to be used. MD5 hashes are recommended here.

import java.util.Collection;
import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHash<T> {

 private final HashFunction hashFunction;
 private final int numberOfReplicas;
 private final SortedMap<Integer, T> circle = new TreeMap<Integer, T>();

 public ConsistentHash(HashFunction hashFunction, int numberOfReplicas,
     Collection<T> nodes) {
   this.hashFunction = hashFunction;
   this.numberOfReplicas = numberOfReplicas;

   for (T node : nodes) {
     add(node);
   }
 }

 public void add(T node) {
   for (int i = 0; i < numberOfReplicas; i++) {
     circle.put(hashFunction.hash(node.toString() + i), node);
   }
 }

 public void remove(T node) {
   for (int i = 0; i < numberOfReplicas; i++) {
     circle.remove(hashFunction.hash(node.toString() + i));
   }
 }

 public T get(Object key) {
   if (circle.isEmpty()) {
     return null;
   }
   int hash = hashFunction.hash(key);
   if (!circle.containsKey(hash)) {
     SortedMap<Integer, T> tailMap = circle.tailMap(hash);
     hash = tailMap.isEmpty() ? circle.firstKey() : tailMap.firstKey();
   }
   return circle.get(hash);
 }

}

The circle is represented as a sorted map of integers, which represent the hash values, to caches (of type T here). When a ConsistentHash object is created each node is added to the circle map a number of times (controlled by numberOfReplicas). The location of each replica is chosen by hashing the node's name along with a numerical suffix, and the node is stored at each of these points in the map.

To find a node for an object (the get method), the hash value of the object is used to look in the map. Most of the time there will not be a node stored at this hash value (since the hash value space is typically much larger than the number of nodes, even with replicas), so the next node is found by looking for the first key in the tail map. If the tail map is empty then we wrap around the circle by getting the first key in the circle.

Usage

So how can you use consistent hashing? You are most likely to meet it in a library, rather than having to code it yourself. For example, as mentioned above, memcached, a distributed memory object caching system, now has clients that support consistent hashing. Last.fm's ketama by Richard Jones was the first, and there is now a Java implementation by Dustin Sallings (which inspired my simplified demonstration implementation above). It is interesting to note that it is only the client that needs to implement the consistent hashing algorithm - the memcached server is unchanged. Other systems that employ consistent hashing include Chord, which is a distributed hash table implementation, and Amazon's Dynamo, which is a key-value store (not available outside Amazon).

分享到:
评论

相关推荐

    Python库 | ConsistentHashing-0.1.9.tar.gz

    而`ConsistentHashing-0.1.9.tar.gz`这个压缩包文件则包含了一个名为`ConsistentHashing`的Python库,版本为0.1.9。这个库主要涉及到一致性哈希(Consistent Hashing)算法,它在分布式系统和负载均衡中扮演着重要...

    Consistent Hashing and Random Trees

    Consistent Hashing and Random Trees

    (10)karger-Consistent Hashing.pdf

    “ConsistentHashingandRandomTrees: Distributed Caching Protocols for Relieving Hot Spots on the Worldwide Web”指的是由David Karger等人撰写的关于一致性哈希算法(Consistent Hashing)以及如何运用该算法...

    一致性Hash(Consistent Hashing)原理剖析1

    一致性哈希(Consistent Hashing)是一种用于分布式系统的哈希算法,主要应用于分布式缓存、分布式数据库等场景,目的是在节点动态增减时保持哈希表的稳定性,从而最小化数据迁移的影响。它解决了传统哈希取模方法在...

    Jump-Consistent-Hashing-Evaluation:对一致性哈希的评估,尤其是 Google 的 Jump Consistent Hashing

    跳跃一致哈希计算 甚至服务器之间的数据分布也非常重要:另一个重要方面是能够... 关于一致性哈希,使用的算法是谷歌的论文“A Fast, Minimal Memory, Consistent Hash Algorithm”中提出的Jump Consistent Hashing。

    Ketama一致性Hash算法(含Java代码) 1

    一致性哈希算法(Consistent Hashing)是一种在分布式系统中平衡数据分布的策略,尤其适用于缓存服务如Memcached或Redis。它的核心思想是通过哈希函数将对象映射到一个固定大小的环形空间中,然后将服务器也映射到这个...

    一致性哈希算法 consistent hashing

    在分布式系统中,常常需要使用缓存,而且通常是集群,访问缓存和添加缓存都需要一个 hash 算法来寻找到合适的 Cache 节点。但,通常不是用取余hash,而是使用我们今天的主角—— 一致性 hash 算法。

    consistent-hash:一致性哈希工具类 Implementing Consistent Hashing in Kotlin

    Implementing Consistent Hashing in Kotlin Java Kotlin实现的一致性哈希工具 简单示例 val a = HostPortPhysicalNode("A", "192.169.1.1", 8080) val b = HostPortPhysicalNode("B", "192.169.1.2", 8080) val c =...

    Consistent-Hashing:Consistent Hashing 一致哈希

    一致性哈希(Consistent Hashing)是一种分布式哈希算法,主要应用于分布式缓存、负载均衡等领域,例如在Redis、Memcached等系统中广泛使用。它解决了传统哈希算法在节点动态增减时导致的大量数据迁移问题。在Java中...

    Go-Consistent-hashing:Go中的散列环实现

    一致性哈希(Consistent Hashing)是一种在分布式系统中解决数据分片问题的算法,它在Go语言中的实现对于构建可扩展且容错的服务至关重要。在Go开发中,尤其是在涉及分布式缓存、负载均衡等场景下,一致性哈希能够...

    开源项目-lafikl-consistent.zip

    开源项目-lafikl-consistent.zip,lafikl/consistent: a package for Consistent Hashing and Consistent Hashing With Bounded Loads.

    consistent-hashing:一致哈希的简单JavaScript实现

    安装使用以下命令安装依赖项: $ npm install 用法var ConsistentHashing = require ( './consistent_hashing' ) ;var nodeNames = [ 'node1' , 'node2' , 'node3' , 'node4' , 'node5' , 'node6' ] ;var replica...

    09-散列3. Hashing - Hard Version (30).zip

    8. 散列在分布式系统中的应用:例如一致性哈希(Consistent Hashing),用于在分布式缓存和分布式数据库中分配节点,保持在节点增减时尽量小的影响。 9. 散列与安全性:在密码学中,散列函数常用于密码存储,通过...

    一致性哈希算法(ketama hashing)

    一致性哈希算法(Consistent Hashing)是一种在分布式系统中实现负载均衡的算法,尤其在分布式缓存如Memcached和Redis等场景下广泛使用。它解决了传统哈希算法在节点增减时导致的大量数据迁移问题,提高了系统的可用...

    Ketama Hashing Algorithm

    Ketama Hashing算法,全称为“Consistent Hashing with Ketama”,是一种在分布式系统中实现负载均衡的哈希算法。它的主要目的是在增加或减少服务器节点时,尽可能少地改变已分配的键(keys)到服务器的映射关系。在...

    分布式的Key-Value存储系统Ringo.zip

    Ringo is an experimental, distributed, replicating key-value store based on consistent hashing and immutable data. Unlike many general-purpose databases, Ringo is designed for a specific use case: For...

Global site tag (gtag.js) - Google Analytics