1. Read http://wiki.apache.org/hadoop/Hbase/HbaseArchitecture on the HBase architecture. First read the Bigtable paper: http://labs.google.com/papers/bigtable.html
Want asynchronous processes to be continuously updating different pieces of data
– Want access to most current data at any time
• Need to support:
– Very high read/write rates (millions of ops per second)
– Efficient scans over all or interesting subsets of data
– Efficient joins of large one-to-one and one-to-many datasets
but how?
2. Key concepts:
Column-oriented: what is the key point?
Write-optimized: column aggregation.
http://heart.korea.ac.kr/trac/wiki/ReadablePapers
The row/column space is sparse.
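The two concepts above (column aggregation and a sparse row/column space) can be sketched in a few lines. This is a toy illustration, not HBase's actual storage engine: cells live per column, absent cells are simply never materialized, and a scan over one column never touches another column's data.

```python
from collections import defaultdict

class SparseColumnStore:
    """Toy column-oriented store: column -> {row_key: value}.
    Empty cells are never stored, so a sparse table costs only
    what is actually written."""

    def __init__(self):
        self._columns = defaultdict(dict)

    def put(self, row_key, column, value):
        self._columns[column][row_key] = value

    def get(self, row_key, column):
        # Missing cells simply return None; they occupy no space.
        return self._columns[column].get(row_key)

    def scan_column(self, column):
        # Scans one column without reading any other column's data:
        # the "aggregation" that makes column stores scan-friendly.
        return sorted(self._columns[column].items())

store = SparseColumnStore()
store.put("row1", "anchor:cnnsi.com", "CNN")
store.put("row2", "contents:", "<html>...</html>")
# row1 has no "contents:" cell and row2 has no "anchor:" cell; no cost.
print(store.get("row1", "contents:"))  # None: sparse, never stored
print(store.scan_column("anchor:cnnsi.com"))
```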
3. http://torrez.us/archives/2005/10/24/407/ , an RDF store; it looks good.
Copied here so it isn't forgotten:
This is excellent news for the Semantic Web. Google is building the RDF database we've been trying to build, and to this date, even though conceptually we are on the right track, our implementations do not scale in ways that would even match standard relational models today. This makes it very hard for real systems to adopt RDF as their platform. However, all of this is going to change with BigTable; let's pay attention to the details in the description and a summary from Andrew Hitchcock.
* Storing and managing very large amounts of structured data
* Row/column space can be sparse
* Columns are in the form of “family:optional_qualifier”. RDF Properties, Yeah!
* Columns have type information
* Because of the design of the system, columns are easy to create (and are created implicitly)
* Column families can be split into locality groups (Ontologies!)
Why do I think this is an RDF database? Well, in case you might not know, one of the problems with existing relational database models is that they are not flexible enough. If a company like Amazon starts carrying a new type of product with attributes not currently built into their systems, they have to jump through hoops to recreate the tables that store and manage product information. RDF, as an extensible description framework, answers this problem, because it allows a resource to have an unlimited number of properties associated with it. However, when we implement RDF stores atop existing RDBMS, we begin to use a row for each new property/attribute that we would like to store about the resource, thus making it sub-optimal for joins and other operations.

Here is where BigTable comes in, because its row/column space can be sparse (not all rows/resources contain all the same properties) and columns can be easily created with very little cost. Additionally, you can maintain a locality for families of properties, which we call Ontologies, so if we wanted all properties about a blog entry, we could get them fast enough (i.e. a locality for all Atom metadata columns).

Anyway, I have to get back to my school work, but I hope that everyone sees what I'm seeing and further analyzes this talk with more attention to the technical details. I think that better times are coming for the SW and we'll soon be enjoying a whole new class of semantic services on the Internet. One final note, or maybe a whole separate post, will be Bosworth's comments on how we should be limiting our SQL queries in order to gain the performance we need in RDF databases.
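The quoted post's central point, that a resource can gain new "family:qualifier" properties with no schema change, can be sketched as follows. The resources and property names below are invented for illustration; this only models the implicit-column behavior, not BigTable itself.

```python
# row_key -> {"family:qualifier": value}; no table schema declared anywhere.
table = {}

def set_property(resource, prop, value):
    # Columns are created implicitly, per row, the first time they are
    # written, which is what makes the model RDF-friendly.
    table.setdefault(resource, {})[prop] = value

set_property("post/407", "atom:title", "Google BigTable as RDF store")
set_property("post/407", "atom:updated", "2005-10-24")
# A different resource can carry properties the first one never declared,
# with no ALTER TABLE equivalent:
set_property("product/42", "dims:weight", "1.2kg")

print(sorted(table["post/407"]))  # only this row's own columns exist
```

Grouping all `atom:*` columns in one column family is what gives the "locality for families of properties" the post calls Ontologies.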
4. For crawling it seems suitable, but it looks better when combined with later processing.
5. See how Pig issues M/R queries.
See how to design a schema for Bigtable. Is Bigtable suitable?
6. How to get all URLs of one host? What schema? An index? Maybe yes, if needed.
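One answer to question 6, suggested by the webtable example in the Bigtable paper: make the row key the URL with its hostname components reversed, so all pages of one host sort adjacently and a prefix/range scan retrieves them without a secondary index. A minimal sketch over an in-memory sorted list (standing in for Bigtable's sorted row-key order):

```python
from bisect import bisect_left, bisect_right
from urllib.parse import urlsplit

def row_key(url):
    # "http://www.cnn.com/index.html" -> "com.cnn.www/index.html"
    parts = urlsplit(url)
    host = ".".join(reversed(parts.hostname.split(".")))
    return host + parts.path

keys = sorted(map(row_key, [
    "http://www.cnn.com/index.html",
    "http://maps.google.com/index.html",
    "http://www.cnn.com/world/",
]))

def scan_host(host):
    # All rows of one host share a prefix, so a range scan suffices.
    prefix = ".".join(reversed(host.split(".")))
    return keys[bisect_left(keys, prefix):bisect_right(keys, prefix + "\xff")]

print(scan_host("www.cnn.com"))  # both cnn.com pages, stored contiguously
```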
http://www.infoq.com/news/2008/04/hbase-interview
Where HBase fits in:
The M/R paradigm applies well to batch processing of data. How does Hadoop apply in a more transaction/single request based paradigm?
MapReduce (both Google's and Hadoop's) is ideal for processing huge amounts of data with sizes that would not fit in a traditional database. Neither is appropriate for transaction/single request processing. While HBase uses HDFS from Hadoop Core, it doesn't use MapReduce in its common operations.
However, HBase does support efficient random accesses, so it can be used for some of the transactional elements of your business. You will take a raw performance hit over something like MySQL, but you get the benefit of very good scaling characteristics as your transactional throughput grows.
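The interview's point, that the same structure serves both random point reads and ordered scans, comes from rows being stored sorted by row key. A toy illustration (not HBase code; row keys and values are invented):

```python
from bisect import bisect_left, bisect_right

# Rows kept sorted by row key, as an HBase region keeps its rows.
rows = sorted([("user#0001", "alice"), ("user#0002", "bob"),
               ("user#0100", "carol")])
keys = [k for k, _ in rows]

def get(row_key):
    # Random access by primary key: binary search here, index lookup in HBase.
    i = bisect_left(keys, row_key)
    return rows[i][1] if i < len(keys) and keys[i] == row_key else None

def scan(start, stop):
    # Range scan over a contiguous slice of the sorted key space.
    return rows[bisect_left(keys, start):bisect_right(keys, stop)]

print(get("user#0002"))                # 'bob'
print(scan("user#0001", "user#0099"))  # first two rows
```

This also matches the note's earlier point that retrieval is limited to the row key and row-key ranges: anything else would require a full scan or an application-maintained index.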