nutchbase=nutch+hbase

yuhai.china

浏览: 161659 次
性别:
来自: 北京

最近访客更多访客>>

erpaoshouling

leiwuhenfan

clanmei

CURRY_LI

博主相关

博客

微博

相册

留言

关于我

文章分类

社区版块

存档分类

博客分类：

JAVA开发

HBase Access SQL IDEA Yahoo

当我们为nutch的架构发愁的时候，nutch的开发人员送来了nutchbase。我一些简单的测试表明，在hadoop0.20.1和hbase0.20.2上，稍加修改可以运行起来。
它的优点很明显：架构合理.

开发者是这样说的，引用自jira
http://issues.apache.org/jira/browse/NUTCH-650

A) Why integrate with hbase?

All your data in a central location
No more segment/crawldb/linkdb merges.
No more "missing" data in a job. There are a lot of places where we copy data from one structure to another just so that it is available in a later job. For example, during parsing we don't have access to a URL's fetch status. So we copy fetch status into content metadata. This will no longer be necessary with hbase integration.
A much simpler data model. If you want to update a small part in a single record, now you have to write a MR job that reads the relevant directory, change the single record, remove old directory and rename new directory. With hbase, you can just update that record. Also, hbase gives us access to Yahoo! Pig, which I think, with its SQL-ish language may be easier for people to understand and use.
B) Design
Design is actually rather straightforward.

We store everything (fetch time, status, content, parsed text, outlinks, inlinks, etc.) in hbase. I have written a small utility class that creates "webtable" with necessary columns.
So now most jobs just take the name of the table as input.
There are two main classes for interfacing with hbase. ImmutableRowPart wraps around a RowResult and has helper getters (getStatus(), getContent(), etc.). RowPart is similar to ImmutableRowPart but also has setters. The idea is that RowPart also wraps RowResult but also keeps a list of updates done to that row. So when getSomething is called, it first checks if Something is already updated (if so then returns the updated version) or returns from RowResult. RowPart can also create a BatchUpdate from its list of updates.
URLs are stores in reversed host order. For example, http://bar.foo.com:8983/to/index.html?a=b becomes com.foo.bar:http:8983/to/index.html?a=b. This way, URLs from the same tld/host/domain are stored closer to each other. TableUtil has methods for reversing and unreversing URLs.
CrawlDatum Status-es are simplifed. Since everything is in central location now, no point in having a DB and FETCH status.
Jobs:

Each job marks rows so that the next job knows which rows to read. For example, if GeneratorHbase decides that a URL should be generated it marks the URL with a TMP_FETCH_MARK (Marking a url is simply creating a special metadata field.) When FetcherHbase runs, it skips over anything without this special mark.
InjectorHbase: First, a job runs where injected urls are marked. Then in the next job, if a row has the mark but nothing else (here, I assumed that if a row has "status:" column, that it already exists), InjectorHbase initializes the row.
GeneratorHbase: Supports max-per-host configuration and topN. Marks generated urls with a marker.
FetcherHbase: Very similar to original Fetcher. Marks urls successfully fetched. Skips over URLs not marked by GeneratorHbase
ParseTable: Similar to original Parser. Outlinks are stored "outlinks:<fromUrl>" -> "anchor".
UpdateTable: Does updatedb's and invertlink's job. Also clears any markers.
IndexerHbase: Indexes the entire table. Skips over URLs not parsed successfully.

分享到：

macbook 使用感受 | hbase 0.20 client编程

2010-01-14 10:57
浏览 2285
评论(0)
分类:编程语言
查看更多

发表评论

您还没有登录,请您登录后再发表评论

最近访客更多访客>>

博主相关

文章分类

社区版块

存档分类

最新评论