Bigtable and HBase study notes

stephen80


Blog category:

search engine

HBase Hadoop Mapreduce Google Social

1. Read http://wiki.apache.org/hadoop/Hbase/HbaseArchitecture
   (the HBase architecture). Read the Bigtable paper first: http://labs.google.com/papers/bigtable.html

   • Want asynchronous processes to be continuously updating different pieces of data
     – Want access to most current data at any time

   • Need to support:
     – Very high read/write rates (millions of ops per second)
     – Efficient scans over all or interesting subsets of data
     – Efficient joins of large one-to-one and one-to-many datasets

But how?

2. Important concepts:
   Column-oriented? What is the key point?
   Write-optimized: column aggregation.

   http://heart.korea.ac.kr/trac/wiki/ReadablePapers

   The row/column space is sparse.
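
To make the "sparse row/column" idea concrete for myself, here is a minimal in-memory sketch. It is not how Bigtable or HBase is implemented; `SparseTable` and its methods are hypothetical names chosen for illustration. It only shows that a cell exists when it is written, that columns named "family:qualifier" are created implicitly, and that each cell can hold several timestamped versions.

```python
from collections import defaultdict


class SparseTable:
    """Toy model of a sparse table: row -> column -> timestamp -> value.

    Hypothetical illustration only; not the Bigtable or HBase API.
    """

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, ts):
        # A column ("family:qualifier") is created implicitly on first write.
        self._rows[row][column][ts] = value

    def get(self, row, column):
        # Return the newest version, or None if the cell was never written.
        versions = self._rows.get(row, {}).get(column)
        if not versions:
            return None
        return versions[max(versions)]

    def columns(self, row):
        # Only columns that were actually written exist for this row.
        return sorted(self._rows.get(row, {}))


table = SparseTable()
table.put("com.example.www", "contents:", "<html>v1</html>", ts=1)
table.put("com.example.www", "contents:", "<html>v2</html>", ts=2)
table.put("com.example.www", "anchor:example.org", "Example", ts=1)
table.put("org.example", "contents:", "<html>other</html>", ts=1)

print(table.get("com.example.www", "contents:"))       # <html>v2</html>
print(table.columns("org.example"))                    # ['contents:']
print(table.get("org.example", "anchor:example.org"))  # None
```

The second row never pays for the "anchor:" column it does not use, which is the point of a sparse layout.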

3. http://torrez.us/archives/2005/10/24/407/ discusses Bigtable as an RDF store; it looks good.
Copied here so I don't forget it:

This is excellent news for the Semantic Web. Google is building the RDF database we’ve been trying to build and to this date even though conceptually we are on the right track, our implementations do not scale in ways that would even match standard relational models today. Thus, making it very hard for real systems to adopt RDF as their platform today. However, all of this is going to change with BigTable, but let’s pay attention to the details in the description and a summary from Andrew Hitchcock.

    * Storing and managing very large amounts of structured data
    * Row/column space can be sparse
    * Columns are in the form of “family:optional_qualifier”. RDF Properties, Yeah!
    * Columns have type information
    * Because of the design of the system, columns are easy to create (and are created implicitly)
    * Column families can be split into locality groups (Ontologies!)

Why do I think this is an RDF database? Well, in case you might not know, one of the problems with existing relational database models is that they are not flexible enough. If a company like Amazon starts carrying a new type of product with attributes not currently built into their systems, they have to jump through hoops to recreate the tables that store and manage product information. RDF, as an extensible description framework, answers this problem, because it allows a resource to have an unlimited number of properties associated with it. However, when we implement RDF stores atop existing RDBMS, we begin to use a row for each new property/attribute that we would like to store about the resource, thus making it sub-optimal for joins and other operations. Here is where BigTable comes in, because its row/column space can be sparse (not all rows/resources contain all the same properties) and columns can be easily created with very little cost. Additionally, you can maintain a locality for families of properties, which we called Ontologies, so if we wanted all properties about a blog entry, we could get them fast enough (i.e. a locality for all Atom metadata columns). Anyways, I have to get back to my school work, but I hope that everyone sees what I’m seeing and analyzes this talk further with more attention to the technical details. I think that better times are coming for the SW and we’ll be soon enjoying a whole new class of semantic services on the Internet. One final note or maybe a whole separate post will be Bosworth’s comments on how we should be limiting our SQL queries in order to gain the performance we need in RDF databases.
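
A small sketch of the contrast the quoted post draws: RDF triples stored one row per property versus folded into one sparse row per resource, with the predicate prefix playing the role of a column family. The function names (`triples_to_rows`, `family`) and the sample data are my own assumptions for illustration, and the sketch simplifies by keeping only one value per property.

```python
def triples_to_rows(triples):
    """Fold (subject, predicate, object) triples into one sparse row per subject.

    Assumption: predicates look like "family:qualifier", e.g. "atom:title".
    Simplification: a repeated predicate overwrites the earlier value.
    """
    rows = {}
    for subject, predicate, obj in triples:
        rows.setdefault(subject, {})[predicate] = obj
    return rows


def family(rows, subject, family_name):
    """Return all columns of one family for a subject (e.g. all Atom metadata)."""
    prefix = family_name + ":"
    return {col: val for col, val in rows.get(subject, {}).items()
            if col.startswith(prefix)}


triples = [
    ("post:1", "atom:title", "Hello"),
    ("post:1", "atom:updated", "2005-10-24"),
    ("post:1", "dc:creator", "alice"),
    ("post:2", "atom:title", "Bye"),
]
rows = triples_to_rows(triples)

print(len(triples), len(rows))         # 4 2
print(family(rows, "post:1", "atom"))  # {'atom:title': 'Hello', 'atom:updated': '2005-10-24'}
```

Fetching every property of a resource becomes a single row read rather than a join over one row per property.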

4. For crawling, it seems suitable; for the later processing stages, it seems an even better fit.

5. To look into: how does Pig express queries over M/R?
Also: how should a schema be designed for Bigtable, and is Bigtable suitable here?

6. How to get all URLs of one host? What schema? An index? Maybe yes, if needed.
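
One possible answer, sketched under assumptions: the Bigtable paper keys web pages by reversed hostname (for example `com.google.maps/index.html`), so pages of one host are adjacent in the sorted row space and can be read with a prefix scan, without a separate index. The helpers below (`row_key`, `scan_host`) are hypothetical names, a sorted Python list stands in for the table, and ports and URL schemes are ignored for simplicity.

```python
import bisect
from urllib.parse import urlsplit


def row_key(url):
    """Build a row key with the hostname reversed, e.g. com.cnn.www/index.html."""
    parts = urlsplit(url)
    host = ".".join(reversed(parts.hostname.split(".")))
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return host + path


def scan_host(sorted_keys, hostname):
    """Return every key of one host from a sorted key list via a prefix scan."""
    prefix = ".".join(reversed(hostname.split("."))) + "/"
    start = bisect.bisect_left(sorted_keys, prefix)
    result = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so the host's range has ended
        result.append(key)
    return result


urls = [
    "http://www.cnn.com/index.html",
    "http://money.cnn.com/markets",
    "http://www.cnn.com/world?page=2",
    "http://hadoop.apache.org/",
]
keys = sorted(row_key(u) for u in urls)

print(scan_host(keys, "www.cnn.com"))
# ['com.cnn.www/index.html', 'com.cnn.www/world?page=2']
```

Because the domain comes first in the key, scanning the shorter prefix `com.cnn.` would return all cnn.com subdomains together.
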
From the HBase interview at http://www.infoq.com/news/2008/04/hbase-interview, on where HBase fits in:

The M/R paradigm applies well to batch processing of data. How does Hadoop apply in a more transaction/single request based paradigm?

MapReduce (both Google's and Hadoop's) is ideal for processing huge amounts of data with sizes that would not fit in a traditional database. Neither is appropriate for transaction/single request processing. While HBase uses HDFS from Hadoop Core, it doesn't use MapReduce in its common operations.

However, HBase does support efficient random accesses, so it can be used for some of the transactional elements of your business. You will take a raw performance hit over something like MySQL, but you get the benefit of very good scaling characteristics as your transactional throughput grows.


2008-11-10 10:47