Bigtable, HBase study notes

1. Read http://wiki.apache.org/hadoop/Hbase/HbaseArchitecture
   (HBase architecture). Read the Bigtable paper first: http://labs.google.com/papers/bigtable.html

– Want asynchronous processes to be continuously updating
   different pieces of data
– Want access to most current data at any time

   • Need to support:
– Very high read/write rates (millions of ops per second)
– Efficient scans over all or interesting subsets of data
– Efficient joins of large one-to-one and one-to-many datasets

but how?

2. Important concepts:
   column oriented? What is the key point?
   write optimized: column aggregation.

   row/column sparse.
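
To make "row/column sparse" concrete, here is a minimal Python sketch. The names (`SparseTable`, `put`, `get`, `scan_column`) are made up for illustration and are not the HBase or Bigtable API. The point is only that each row stores just the cells it actually has, so a missing column costs nothing.

```python
from collections import defaultdict


class SparseTable:
    """Toy sparse table: row key -> {column name -> value}. Not the HBase API."""

    def __init__(self):
        # Rows appear on first write; columns are created implicitly.
        self._rows = defaultdict(dict)

    def put(self, row, column, value):
        self._rows[row][column] = value

    def get(self, row, column, default=None):
        # Use .get on the outer dict so a read never creates an empty row.
        return self._rows.get(row, {}).get(column, default)

    def scan_column(self, column):
        """Yield (row, value) for rows that have this column, in row-key order."""
        for row in sorted(self._rows):
            if column in self._rows[row]:
                yield row, self._rows[row][column]


table = SparseTable()
table.put("page1", "meta:title", "Home")
table.put("page1", "meta:lang", "en")
table.put("page2", "meta:title", "About")  # page2 simply has no meta:lang cell

print(table.get("page2", "meta:lang"))
print(list(table.scan_column("meta:title")))
```

A real column-oriented store would also keep the cells of one column family physically together on disk; this sketch only shows the logical model.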

3. http://torrez.us/archives/2005/10/24/407/ looks at Bigtable as an RDF store; it seems good.
  Copied here so I don't forget it:

  This is excellent news for the Semantic Web. Google is building the RDF database we’ve been trying to build and to this date even though conceptually we are on the right track, our implementations do not scale in ways that would even match standard relational models today. Thus, making it very hard for real systems to adopt RDF as their platform today. However, all of this is going to change with BigTable, but let’s pay attention to the details in the description and a summary from Andrew Hitchcock.

    * Storing and managing very large amounts of structured data
    * Row/column space can be sparse
    * Columns are in the form of “family:optional_qualifier”. RDF Properties, Yeah!
    * Columns have type information
    * Because of the design of the system, columns are easy to create (and are created implicitly)
    * Column families can be split into locality groups (Ontologies!)

Why do I think this is an RDF database? Well, in case you might not know one of the problems with existing relational database models is that they are not flexible enough. If a company like Amazon starts carrying a new type of product with attributes not currently built into their systems, they have to jump through hoops to recreate the tables that store and manage product information. RDF, as an extensible description framework answers this problem, because it allows a resource to have unlimited number of properties associated with it. However, when we implement RDF stores atop existing RDBMS, we begin to use a row for each new property/attribute that we would like to store about the resource, thus making it sub-optimal for joins and other operations. Here is where BigTable comes in, because its row/column space can be sparse (not all rows/resources contain all the same properties) and columns can be easily created with very little cost. Additionally, you can maintain a locality for families of properties, which we called Ontologies, so if we wanted all properties about a blog entry, we could get them fast enough (i.e. a locality for all Atom metadata columns). Anyways, I have to get back to my school work, but I hope that everyone sees what I’m seeing and further analyze this talk with more attention to the technical details. I think that better times are coming for the SW and we’ll be soon enjoying a whole new class of semantic services on the Internet. One final note or maybe a whole separate post will be Bosworth’s comments on how we should be limiting our SQL queries in order to gain the performance we need in RDF databases.
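
My own sketch of the mapping the quote describes (not code from the blog post): one row per resource, one "family:qualifier" column per RDF property. The function names, the example URLs and the `atom`/`dc` prefixes are assumptions for illustration. For simplicity a repeated property on the same resource overwrites the earlier value.

```python
from collections import defaultdict


def triples_to_rows(triples):
    """Group (subject, predicate, object) triples into one sparse row per subject."""
    rows = defaultdict(dict)
    for subject, predicate, obj in triples:
        # The predicate plays the role of the "family:qualifier" column name.
        rows[subject][predicate] = obj
    return dict(rows)


def family(row, name):
    """Return only the columns of one family, e.g. every 'atom:*' property."""
    prefix = name + ":"
    return {col: val for col, val in row.items() if col.startswith(prefix)}


triples = [
    ("http://example.org/post/1", "atom:title", "Hello"),
    ("http://example.org/post/1", "atom:updated", "2005-10-24"),
    ("http://example.org/post/1", "dc:creator", "alice"),
    ("http://example.org/post/2", "atom:title", "Second post"),
]
rows = triples_to_rows(triples)

# All Atom metadata of one blog entry, without touching the dc:* columns.
print(family(rows["http://example.org/post/1"], "atom"))
```

Compare this with the RDBMS layout the quote complains about: there every triple is its own row, so collecting all properties of one resource needs a join or a multi-row lookup instead of a single row read.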
4. For crawling it seems suitable, but it seems an even better fit for the later processing.

5. See how Pig runs queries on M/R.
   See how to design a schema for Bigtable. Is Bigtable suitable?

6. How to get the URLs of one host? What schema? An index? Maybe yes, if needed.
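
One possible answer to item 6, sketched in Python: the Webtable example in the Bigtable paper uses reversed hostnames as row keys (e.g. com.cnn.www), so all pages of one host sit next to each other in row-key order and can be read with a prefix range scan, with no separate index. The helper names `row_key` and `urls_of_host` and the exact key layout (reversed host followed by the path, query string dropped) are my assumptions, and a sorted Python list stands in for the table.

```python
import bisect
from urllib.parse import urlsplit


def row_key(url):
    """Build a key like 'com.example.www/path' so one host's URLs sort together."""
    parts = urlsplit(url)
    reversed_host = ".".join(reversed(parts.hostname.split(".")))
    # Simplification: the query string and fragment are ignored.
    return reversed_host + (parts.path or "/")


def urls_of_host(sorted_keys, host):
    """Return the row keys of `host` via a prefix range scan over sorted keys."""
    prefix = ".".join(reversed(host.split("."))) + "/"
    start = bisect.bisect_left(sorted_keys, prefix)
    # '0' is the ASCII character right after '/', so this is the first key
    # that can no longer start with the prefix.
    end = bisect.bisect_left(sorted_keys, prefix[:-1] + "0")
    return sorted_keys[start:end]


keys = sorted(
    row_key(u)
    for u in [
        "http://www.example.com/a",
        "http://www.example.com/b/c",
        "http://mail.example.com/inbox",
        "http://www.example.org/",
    ]
)
print(urls_of_host(keys, "www.example.com"))
```

With this key design a secondary index is only needed for lookups that do not follow row-key order, for example finding all pages that link to a given URL.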

Where HBase fits in:

The M/R paradigm applies well to batch processing of data. How does Hadoop apply in a more transaction/single request based paradigm?

MapReduce (both Google's and Hadoop's) is ideal for processing huge amounts of data with sizes that would not fit in a traditional database. Neither is appropriate for transaction/single request processing. While HBase uses HDFS from Hadoop Core, it doesn't use MapReduce in its common operations.

However, HBase does support efficient random accesses, so it can be used for some of the transactional elements of your business. You will take a raw performance hit over something like MySQL, but you get the benefit of very good scaling characteristics as your transactional throughput grows.


