MapReduce is a parallel computing model proposed by Google for processing large data sets. It has proved to offer high availability, good scalability, and fault tolerance, and has been widely adopted in recent years. However, voices from the traditional database community disagree: D. J. DeWitt et al. argue that MapReduce gives up the performance and efficiency of a DBMS and represents a step backwards for data analysis techniques. In response, the HadoopDB paper presented a new MapReduce-based distributed database implementation, but it has the following disadvantages: (1) it does not add a query execution engine to the Hadoop client; HadoopDB is only a pilot project. (2) HadoopDB uses PostgreSQL as its underlying database, which lacks the redundancy and replication that Hadoop's underlying file system, HDFS, provides, so it loses Hadoop's original high availability. (3) It has no complete partitioning mechanism; its tables are partitioned manually, which is impractical. (4) Table joins in HadoopDB assume the ideal case in which the partitions of both tables reside on the same node, which does not hold in the real world.
This paper overcomes the above problems. Our implementation, named Anthill, retains the advantages of both shared-nothing MPP databases and MapReduce. Experiments show that: (1) Anthill provides better scalability than MPP databases and can be deployed on more than 100 nodes, even 500 nodes; (2) because Anthill uses the column-oriented database MonetDB as its underlying store, I/O is effectively reduced, and it achieves better performance than Hadoop and HadoopDB; (3) Anthill adds a complete query execution engine with a parser, optimizer, and planner, so it handles diverse data more intelligently and produces better query plans than HadoopDB. In short, Anthill is a new type of distributed database system with high commercial value.
Keywords: Distributed Database; Anthill; MapReduce; Hadoop
Comments:
See Hadoop's own DBInputFormat: a different SQL statement produces a different InputFormat instance, so its reusability is very poor.
Right, DBInputFormat uses the JDBC interface and operates through SQL statements.
A database interface such as JDBC or ODBC should be used.
Distribution can be hash-based or round-robin. Round-robin guarantees that a table is distributed evenly, but joins then require more inter-node data transfer. This is the trade-off: placing rows with the same hash value together makes joins efficient but can cause data skew, while round-robin distributes rows randomly without skew but makes joins expensive. One approach is secondary hashing: re-hash the skewed data, which requires precise metadata control.
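The trade-off above can be sketched in a few lines. This is a minimal illustration, not Anthill's actual code; the function names and the four-node setup are hypothetical. Round-robin spreads rows evenly, hash partitioning co-locates join keys but concentrates a hot key on one node, and a secondary hash salts only the skewed keys so their rows spread out (the master's metadata catalog would have to record which keys were salted so joins can still find them).

```python
NUM_NODES = 4

def round_robin(rows):
    """Assign rows to nodes in turn: perfectly even, but rows sharing
    a join key are scattered, so joins need inter-node transfer."""
    parts = {n: [] for n in range(NUM_NODES)}
    for i, row in enumerate(rows):
        parts[i % NUM_NODES].append(row)
    return parts

def hash_partition(rows, key):
    """Co-locate rows sharing a join key: joins run locally, but a hot
    key sends all of its rows to a single node (data skew)."""
    parts = {n: [] for n in range(NUM_NODES)}
    for row in rows:
        parts[hash(row[key]) % NUM_NODES].append(row)
    return parts

def secondary_hash(rows, key, skewed_keys):
    """Add a rotating salt only for the known-skewed keys, spreading
    their rows across all nodes; non-skewed keys stay hash-located."""
    parts = {n: [] for n in range(NUM_NODES)}
    for i, row in enumerate(rows):
        if row[key] in skewed_keys:
            node = (hash(row[key]) + i % NUM_NODES) % NUM_NODES
        else:
            node = hash(row[key]) % NUM_NODES
        parts[node].append(row)
    return parts

# 80 rows of one hot key plus 20 rows over ten ordinary keys.
rows = [{"k": "hot"}] * 80 + [{"k": c} for c in "abcdefghij"] * 2
sizes = lambda parts: sorted(len(v) for v in parts.values())

print("round-robin:", sizes(round_robin(rows)))          # [25, 25, 25, 25]
print("hash:       ", sizes(hash_partition(rows, "k")))  # one node holds all 80 hot rows
print("2nd hash:   ", sizes(secondary_hash(rows, "k", {"hot"})))
```

Running it shows the skew directly: hash partitioning piles the 80 hot rows onto one node, while the secondary hash caps every node well below that.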
Yes, distribution is the key part here; it involves quite a few considerations.
Information about table locations is kept in the master node's metadata catalog; the data nodes do not take on this responsibility. See Slony-I for reference.
First, as in an RDBMS, there is a SQL engine (compiler, optimizer, execution engine) that receives the client's SQL request, parses it, executes it, and returns the result.
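Those stages can be sketched as a toy pipeline. Everything here is hypothetical (this is not Anthill's engine): the parser handles only `SELECT cols FROM table`, the planner looks up the table's partitions in a stand-in metadata catalog and emits one scan fragment per data node, and the executor unions the per-node partial results.

```python
import re

def parse(sql):
    """Toy parser: accepts only `SELECT <cols> FROM <table>`."""
    m = re.match(r"SELECT\s+(.+?)\s+FROM\s+(\w+)\s*;?$", sql, re.I)
    if not m:
        raise ValueError("unsupported SQL")
    return {"cols": [c.strip() for c in m.group(1).split(",")],
            "table": m.group(2)}

def plan(ast, catalog):
    """Toy planner: one scan fragment per node holding a partition."""
    return [{"node": node, "scan": ast["table"], "cols": ast["cols"]}
            for node in catalog[ast["table"]]]

def execute(fragments, storage):
    """Toy executor: run each fragment on its node's local data,
    project the requested columns, and merge the partial results."""
    out = []
    for frag in fragments:
        for row in storage[frag["node"]][frag["scan"]]:
            out.append({c: row[c] for c in frag["cols"]})
    return out

catalog = {"t": ["node0", "node1"]}          # table t split across two nodes
storage = {"node0": {"t": [{"a": 1, "b": 2}]},
           "node1": {"t": [{"a": 3, "b": 4}]}}

result = execute(plan(parse("SELECT a FROM t"), catalog), storage)
print(result)  # [{'a': 1}, {'a': 3}]
```

A real engine would of course push optimization between parsing and planning; the point is only that the query is decomposed against the master's metadata and fanned out to the data nodes.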
The underlying layer is distributed storage across multiple nodes, so a master is needed.
With multiple replicas of the data, inserts, deletes, updates, and queries become hard to get right. This is probably the hardest part.
The storage location of each table is not specified manually but assigned automatically by the system.
Supporting MapReduce should be relatively easy to do.