Perhaps you’re considering using a dedicated key-value or document store instead of a traditional relational database. Reasons for this might include:
- You’re suffering from Cloud-computing Mania.
- You need an excuse to ‘get your Erlang on’.
- You heard CouchDB was cool.
- You hate MySQL, and although PostgreSQL is much better, it still doesn’t have decent replication. There’s no chance you’re buying Oracle licenses.
- Your data is stored and retrieved mainly by primary key, without complex joins.
- You have a non-trivial amount of data, and the thought of managing lots of RDBMS shards and replication failure scenarios gives you the fear.
Whatever your reasons, there are a lot of options to choose from. At Last.fm we do a lot of batch computation in Hadoop, then dump it out to other machines where it’s indexed and served up over HTTP and Thrift as an internal service (stuff like ‘most popular songs in London, UK this week’ etc). Presently we’re using a home-grown index format which points into large files containing lots of data spanning many keys, similar to the Haystack approach mentioned in this article about Facebook photo storage. It works, but rather than build our own replication and partitioning system on top of this, we are looking to potentially replace it with a distributed, resilient key-value store for reasons 4, 5 and 6 above.
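For context, here’s a rough Python sketch of the general shape of such a Haystack-style layout: many values packed into one big data file, with a small index mapping each key to an offset and length. This is purely illustrative - it isn’t Last.fm’s actual format, and the file and key names are made up.

```python
class PackedFileStore:
    """Illustrative Haystack-style layout: one big append-only data file,
    plus an in-memory index of key -> (offset, length). Not Last.fm's
    actual format - just the general idea."""

    def __init__(self, path):
        self._file = open(path, "ab+")   # append-only data file, readable too
        self._index = {}                 # key -> (offset, length)

    def append(self, key, value: bytes):
        self._file.seek(0, 2)            # jump to the end of the file
        offset = self._file.tell()
        self._file.write(value)
        self._file.flush()
        self._index[key] = (offset, len(value))

    def get(self, key) -> bytes:
        offset, length = self._index[key]
        self._file.seek(offset)
        return self._file.read(length)

store = PackedFileStore("chart_data.bin")                        # hypothetical file name
store.append("chart:london:2009w03", b"...serialized chart...")  # hypothetical key
print(store.get("chart:london:2009w03"))
```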
This article represents my notes and research to date on distributed key-value stores (and some other stuff) that might be suitable as RDBMS replacements under the right conditions. I’m expecting to try some of these out and investigate further in the coming months.
Glossary and Background Reading
- Distributed Hash Table (DHT) and algorithms such as Chord or Kademlia
- Amazon’s Dynamo Paper, and this ReadWriteWeb article about Dynamo which explains why such a system is invaluable
- Amazon’s SimpleDB Service, and some commentary
- Google’s BigTable paper
- The Paxos Algorithm - read this page in order to appreciate that knocking up a Paxos implementation isn’t something you’d want to do whilst hungover on a Saturday morning.
The Shortlist
Here is a list of projects that could potentially replace a group of relational database shards. Some of these are much more than key-value stores, and aren’t suitable for low-latency data serving, but are interesting nonetheless.
| Name | Language | Fault-tolerance | Persistence | Client Protocol | Data model | Docs | Community |
|------|----------|-----------------|-------------|-----------------|------------|------|-----------|
| Project Voldemort | Java | partitioned, replicated, read-repair | Pluggable: BerkeleyDB, MySQL | Java API | Structured / blob / text | A | LinkedIn, no |
| Ringo | Erlang | partitioned, replicated, immutable | Custom on-disk (append-only log) | HTTP | blob | B | Nokia, no |
| Scalaris | Erlang | partitioned, replicated, paxos | In-memory only | Erlang, Java, HTTP | blob | B | OnScale, no |
| Kai | Erlang | partitioned, replicated? | On-disk Dets file | Memcached | blob | C | no |
| Dynomite | Erlang | partitioned, replicated | Pluggable: couch, dets | Custom ASCII, Thrift | blob | D+ | Powerset, no |
| MemcacheDB | C | replication | BerkeleyDB | Memcached | blob | B | some |
| ThruDB | C++ | replication | Pluggable: BerkeleyDB, Custom, MySQL, S3 | Thrift | Document oriented | C+ | Third rail, unsure |
| CouchDB | Erlang | replication, partitioning? | Custom on-disk | HTTP, JSON | Document oriented (JSON) | A | Apache, yes |
| Cassandra | Java | replication, partitioning | Custom on-disk | Thrift | Bigtable meets Dynamo | F | Facebook, no |
| HBase | Java | replication, partitioning | Custom on-disk | Custom API, Thrift, REST | Bigtable | A | Apache, yes |
| Hypertable | C++ | replication, partitioning | Custom on-disk | Thrift, other | Bigtable | A | Zvents, Baidu, yes |
Why 5 of these aren’t suitable
What I’m really looking for is a low latency, replicated, distributed key-value store. Something that scales well as you feed it more machines, and doesn’t require much setup or maintenance - it should just work. The API should be that of a simple hashtable: set(key, val), get(key), delete(key). This would dispense with the hassle of managing a sharded / replicated database setup, and hopefully be capable of serving up data by primary key efficiently.
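To make that target concrete, here’s a minimal sketch of the hashtable-style interface we’re after. The in-memory dict backend is purely illustrative; any of the stores below would sit behind an interface like this, with a networked, replicated implementation in place of the dict.

```python
class KeyValueStore:
    """Minimal sketch of the desired API. The dict backend is a stand-in
    for a real distributed, replicated store."""

    def __init__(self):
        self._data = {}

    def set(self, key, val):
        """Store val under key, overwriting any previous value."""
        self._data[key] = val

    def get(self, key):
        """Return the value for key, or None if it isn't present."""
        return self._data.get(key)

    def delete(self, key):
        """Remove key if it exists; deleting a missing key is a no-op."""
        self._data.pop(key, None)
```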
Five of the projects on the list are far from being simple key-value stores, and as such don’t meet the requirements - but they are definitely worth a mention.
1) HBase - we’re already heavy users of Hadoop, and have been experimenting with HBase for a while. It’s much more than a KV store, but latency is too great to serve data to the website. We will probably use HBase internally for other stuff though - we already have stacks of data in HDFS.
2) Hypertable provides a similar feature set to Hbase (both are inspired by Google’s Bigtable). They recently announced a new sponsor, Baidu - the biggest Chinese search engine. Definitely one to watch, but doesn’t fit the low-latency KV store bill either.
3) Cassandra sounded very promising when the source was released by Facebook last year. They use it for inbox search. It’s Bigtable-esque, but uses a DHT so doesn’t need a central server (one of the Cassandra developers previously worked at Amazon on Dynamo). Unfortunately it’s languished in relative obscurity since release, because Facebook never really seemed interested in it as an open-source project. From what I can tell there isn’t much in the way of documentation or a community around the project at present.
4) CouchDB is an interesting one - it’s a “distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API”. Data is stored in ‘documents’, which are essentially key-value maps themselves, using the data types you see in JSON. Read the CouchDB Technical Overview if you are curious how the web’s trendiest document database works under the hood. This article on the Rules of Database App Aging goes some way to explaining why document-oriented databases make sense. CouchDB can do full-text indexing of your documents, and lets you express views over your data in JavaScript. I could imagine using CouchDB to store lots of data on users: name, age, sex, address, IM name and lots of other fields, many of which could be null, and each site update adds or changes the available fields. In situations like that it quickly gets unwieldy adding and changing columns in a database, and updating versions of your application code to match. Although many people are using CouchDB in production, their FAQ points out they may still make backwards-incompatible changes to the storage format and API before version 1.0.
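To make the document model concrete, here’s a rough sketch of storing one of those sparse user documents over CouchDB’s HTTP/JSON API. It assumes a local CouchDB on the default port with a database called users already created; the host, database name and field values are all made up for illustration.

```python
import json
import urllib.request

# Sketch only: PUT a sparse user document into CouchDB over its HTTP/JSON API.
# Assumes a local CouchDB on the default port 5984 with a "users" database
# already created; names and values are invented for illustration.
doc = {
    "name": "Alice Example",
    "age": 30,
    "city": "London",
    "im": {"type": "jabber", "handle": "alice@example.org"},
    # fields that would be NULL columns in an RDBMS are simply left out
}

req = urllib.request.Request(
    "http://localhost:5984/users/alice_example",   # PUT /<db>/<doc_id>
    data=json.dumps(doc).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode())       # 201 and {"ok":true,...} on success
```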
5) ThruDB is a document storage and indexing system made up of four components: a document storage service, indexing service, message queue and proxy. It uses Thrift for communication, and has a pluggable storage subsystem, including an Amazon S3 option. It’s designed to scale well horizontally, and might be a better option than CouchDB if you are running on EC2. I’ve heard a lot more about CouchDB than ThruDB recently, but it’s definitely worth a look if you need a document database. It’s not suitable for our needs for the same reasons as CouchDB.
Distributed key-value stores
The rest are much closer to being ‘simple’ key-value stores with low enough latency to be used for serving data used to build dynamic pages. Latency will be dependent on the environment, and whether or not the dataset fits in memory. If it does, I’d expect sub-10ms response time, and if not, it all depends on how much money you spent on spinning rust.
MemcacheDB is essentially just memcached that saves stuff to disk using a Berkeley database. As useful as this may be for some situations, it doesn’t deal with replication and partitioning (sharding), so it would still require a lot of work to make it scale horizontally and be tolerant of machine failure. Other memcached derivatives such as repcached go some way to addressing this by giving you the ability to replicate entire memcache servers (async master-slave setup), but without partitioning it’s still going to be a pain to manage.
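Because it speaks the memcached protocol, any existing memcached client library should work against it unchanged. A rough Python sketch (the address is illustrative - 21201 is MemcacheDB’s usual default port, but check your own setup):

```python
import memcache   # python-memcached; any memcached client library would do

# Sketch: MemcacheDB speaks the memcached protocol, so a stock memcached
# client works unchanged. The address below is illustrative.
mc = memcache.Client(["127.0.0.1:21201"])

mc.set("user:1001:country", "UK")      # written through to BerkeleyDB, not just RAM
print(mc.get("user:1001:country"))     # -> "UK"
mc.delete("user:1001:country")
```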
Project Voldemort looks awesome. Go and read the rather splendid website, which explains how it works, and includes pretty diagrams and a good description of how consistent hashing is used in the Design section. (If consistent hashing butters your muffin, check out libketama - a consistent hashing library - and the Erlang libketama driver.) Project Voldemort handles replication and partitioning of data, and appears to be well written and designed. It’s reassuring to read in the docs how easy it is to swap out and mock different components for testing. It’s non-trivial to add nodes to a running cluster, but according to the mailing list this is being worked on. It sounds like this would fit the bill if we ran it with a Java load-balancer service (see their Physical Architecture Options diagram) that exposed a Thrift API so all our non-Java clients could use it.
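For anyone who hasn’t met consistent hashing before, here’s a bare-bones sketch of the idea in Python - not Voldemort’s or libketama’s actual implementation, just the mechanism of hashing nodes and keys onto the same ring so that adding or removing a node only remaps a small fraction of keys.

```python
import bisect
import hashlib

class HashRing:
    """Bare-bones consistent hash ring (illustrative only)."""

    def __init__(self, nodes, replicas=100):
        self._ring = []                               # sorted list of (point, node)
        for node in nodes:
            for i in range(replicas):                 # virtual nodes smooth the spread
                point = self._hash(f"{node}#{i}")
                self._ring.append((point, node))
        self._ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        """Walk clockwise from the key's position to the first node point."""
        point = self._hash(key)
        idx = bisect.bisect(self._ring, (point,)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["storage01", "storage02", "storage03"])   # hypothetical node names
print(ring.node_for("user:1001"))   # the same key always maps to the same node
```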
Scalaris is probably the most face-meltingly awesome thing you could build in Erlang. CouchDB, Ejabberd and RabbitMQ are cool, but Scalaris packs by far the most impressive collection of sexy technologies. Scalaris is a key-value store - it uses a modified version of the Chord algorithm to form a DHT, and stores the keys in lexicographical order, so range queries are possible. Although I didn’t see this explicitly mentioned, this should open up all sorts of interesting options for batch processing - map-reduce for example. On top of the DHT they use an improved version of Paxos to guarantee ACID properties when dealing with multiple concurrent transactions. So it’s a key-value store, but it can guarantee the ACID properties and do proper distributed transactions over multiple keys.
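The range-query point is worth spelling out: because keys are kept in lexicographic order rather than scattered by a hash function, a range scan only has to touch the part of the key space it covers. This isn’t Scalaris’s actual API - just a tiny Python illustration of why ordered keys make range queries cheap.

```python
import bisect

# Illustration only, not Scalaris's API: with keys kept in lexicographic
# order, a range query is two binary searches plus a scan of the slice in
# between - no full scan of the key space needed.
keys = sorted([
    "track:00042", "track:00117",
    "user:alice", "user:bob", "user:carol",
])

def range_query(sorted_keys, lo, hi):
    """Return every key k with lo <= k < hi."""
    start = bisect.bisect_left(sorted_keys, lo)
    end = bisect.bisect_left(sorted_keys, hi)
    return sorted_keys[start:end]

# ';' sorts immediately after ':', so this grabs every key with the "user:" prefix
print(range_query(keys, "user:", "user;"))   # ['user:alice', 'user:bob', 'user:carol']
```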
Oh, and to demonstrate how you can scale a webservice based on such a system, the Scalaris folk implemented their own version of Wikipedia on Scalaris, loaded in the Wikipedia data, and benchmarked their setup to prove it can do more transactions/sec on equal hardware than the classic PHP/MySQL combo that Wikipedia use. Yikes.
From what I can tell, Scalaris is only memory-resident at the moment and doesn’t persist data to disk. This makes it entirely impractical to actually run a service like Wikipedia on Scalaris for real - but it sounds like they tackled the hard problems first, and persisting to disk should be a walk in the park once you’ve rolled your own version of Chord and made Paxos your bitch. Take a look at this presentation about Scalaris from the Erlang Exchange conference: Scalaris presentation video.
The remaining projects, Dynomite, Ringo and Kai, are all, more or less, trying to be Dynamo. Of the three, Ringo looks to be the most specialist - it makes a distinction between small (less than 4KB) and medium-size (<100MB) data items. Medium-sized items are stored in individual files, whereas small items are all stored in an append-log, the index of which is read into memory at startup. From what I can tell, Ringo can be used in conjunction with the Erlang map-reduce framework Nokia are working on, called Disco.
I didn’t find out much about Kai other than it’s rather new, and some mentions in Japanese. You can choose either Erlang ets or dets as the storage system (memory or disk, respectively), and it uses the memcached protocol, so it will already have client libraries in many languages.
Dynomite doesn’t have great documentation, but it seems to be more capable than Kai, and is under active development. It has pluggable backends including the storage mechanism from CouchDB, so the 2GB file limit in dets won’t be an issue. Also I heard that Powerset are using it, so that’s encouraging.
Summary
Scalaris is fascinating, and I hope I can find the time to experiment more with it, but it needs to save stuff to disk before it’d be useful for the kind of things we might use it for at Last.fm.
I’m keeping an eye on Dynomite - hopefully more information will surface about what Powerset are doing with it, and how it performs at a large scale.
Based on my research so far, Project-Voldemort looks like the most suitable for our needs. I’d love to hear more about how it’s used at LinkedIn, and how many nodes they are running it on.
What else is there?
Here are some other related projects:
- Hazelcast - Java DHT/clustering library
- nmdb - a network database (dbm-style)
- Open Chord - Java DHT
If you know of anything I’ve missed off the list, or have any feedback/suggestions, please post a comment. I’m especially interested in hearing about people who’ve tested or are using KV-stores in lieu of relational databases.
UPDATE 1: Corrected table: memcachedb does replication, as per BerkeleyDB.