Crawl workflow: updatedb

leibnitz


Blog category: nutch

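This step is fairly simple. It is the last link in the generate, fetch, updatedb cycle. In essence, it sends the newly discovered URLs and the URLs whose fetch failed back into crawldb/current.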

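Process:

I. input

inputpaths: contains [crawldb/current, crawl_fetch, crawl_parse]. Note: these are added with FileInputFormat.addInputPath(), which amounts to feeding several files into one job as a single input. This is not the same as MultipleInputs.addInputPath(), which lets each path have its own input format and mapper!

additionsAllowed: if false, only the URLs already in the crawldb are updated, and newly parsed URLs are not added to it.

Job creation: via CrawlDb.createJob().

II. output

path: crawldb/current, which ensures that the URLs for the next fetch are always valid.

format: MapFileOutputFormat, <Text, CrawlDatum>: <url, crawlDatum>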

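III. MR

M: uses the default mapper

R: CrawlDbReducer, which reconciles the status of one URL whose records come from different result files. If there are several fetch records, only the latest one by fetchTime is kept.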
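The reducer's merge rule can be pictured with a small sketch. This is not the Nutch CrawlDbReducer source: the Datum class, the string statuses, and the rule that any non-successful fetch maps to db_unfetched are simplifying assumptions made for illustration. It only shows how the latest fetch record wins, and how additionsAllowed decides whether an unknown URL may enter the crawldb.

```java
import java.util.Arrays;
import java.util.List;

class UpdateDbReduceSketch {
    // Hypothetical stand-in for CrawlDatum; only the fields this sketch needs.
    static class Datum {
        final String status;  // "db_*" = from crawldb, "fetch_*" = from crawl_fetch, "linked" = outlink
        final long fetchTime; // milliseconds since epoch

        Datum(String status, long fetchTime) {
            this.status = status;
            this.fetchTime = fetchTime;
        }
    }

    /** Merges all records of one URL; returns the new crawldb status, or null if the URL is dropped. */
    static String reduce(List<Datum> values, boolean additionsAllowed) {
        Datum old = null;   // the entry already stored in crawldb/current
        Datum fetch = null; // the latest fetch record by fetchTime
        boolean linked = false;
        for (Datum d : values) {
            if (d.status.startsWith("db_")) {
                old = d;
            } else if (d.status.startsWith("fetch_")) {
                if (fetch == null || d.fetchTime > fetch.fetchTime) {
                    fetch = d;
                }
            } else if (d.status.equals("linked")) {
                linked = true;
            }
        }
        if (old == null && !additionsAllowed) {
            return null; // unknown URL and additions are disabled
        }
        if (fetch != null) {
            // Simplification: every non-successful fetch is retried later.
            return fetch.status.equals("fetch_success") ? "db_fetched" : "db_unfetched";
        }
        if (old != null) {
            return old.status; // nothing new happened to this URL
        }
        return linked ? "db_unfetched" : null; // new outlink waits for the next round
    }

    public static void main(String[] args) {
        List<Datum> known = Arrays.asList(
            new Datum("db_unfetched", 0L),
            new Datum("fetch_retry", 100L),
            new Datum("fetch_success", 200L));
        List<Datum> outlink = Arrays.asList(new Datum("linked", 0L));
        System.out.println(reduce(known, true));    // db_fetched
        System.out.println(reduce(outlink, true));  // db_unfetched
        System.out.println(reduce(outlink, false)); // null
    }
}
```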

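note: 1. In the fetch.getStatus() branch, schedule.initializeSchedule() updates fetchTime to the current time. The status also goes from linked --> db_unfetched, which means the outlinks produced by the fetch are converted and wait for the next fetch round.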

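2. Because HashPartitioner is used, all records of the same URL go to the same reducer, so there is no need to worry about how a URL's records are brought together when several reducers run.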
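As a rough illustration of why this holds: Hadoop's HashPartitioner picks the reducer from the key's hash alone, so the file a record came from plays no part. The sketch below uses String.hashCode() in place of Text.hashCode() (an assumption that keeps the example free of Hadoop dependencies), and the reducer count of 4 is arbitrary.

```java
class HashPartitionSketch {
    // Same formula as HashPartitioner.getPartition(): mask the sign bit, then take the modulus.
    static int partition(String url, int numReduceTasks) {
        return (url.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Two records for the same URL, one from crawldb/current and one from crawl_fetch.
        String fromCrawlDb = "http://www.163.com/rss/";
        String fromCrawlFetch = new String("http://www.163.com/rss/");
        int reducers = 4;
        // Equal keys have equal hashes, so both records land in the same partition.
        System.out.println(partition(fromCrawlDb, reducers) == partition(fromCrawlFetch, reducers));
    }
}
```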

3. How is it guaranteed that URLs that were already fetched are not fetched again?

The generator filters them out:

A. In the mapper of the topN job

1) shouldFetch()

public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) {
  // pages are never truly GONE - we have to check them from time to time.
  // pages with too long fetchInterval are adjusted so that they fit within
  // maximum fetchInterval (segment retention period).
  if (datum.getFetchTime() - curTime > (long) maxInterval * 1000) {  // maxInterval is 90 days by default
    if (datum.getFetchInterval() > maxInterval) {
      datum.setFetchInterval(maxInterval * 0.9f); // above the configured maximum: cut to 90% of it
    }
    datum.setFetchTime(curTime);
  }
  if (datum.getFetchTime() > curTime) {
    return false;                                   // not time yet
  }
  return true;  // fetchTime is not later than the current time
}

This is only a simple fetchTime comparison. Because updatedb has moved fetchTime one interval (a month) into the future, the method returns false here.
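The comparison can be replayed with a dependency-free sketch. The Datum class is a hypothetical stand-in for CrawlDatum, and scheduleNextFetch() is an assumption about what updatedb does after a successful fetch (fetchTime moved forward by one fetchInterval), in line with the 30-day retry interval seen in this crawl.

```java
class ShouldFetchSketch {
    static final int MAX_INTERVAL = 90 * 24 * 3600;  // seconds, the 90-day default
    static final long DAY_MS = 24L * 3600 * 1000;

    // Hypothetical stand-in for CrawlDatum.
    static class Datum {
        long fetchTime;    // milliseconds since epoch
        int fetchInterval; // seconds

        Datum(long fetchTime, int fetchInterval) {
            this.fetchTime = fetchTime;
            this.fetchInterval = fetchInterval;
        }
    }

    // Assumed updatedb behavior after a successful fetch: next fetch is one interval away.
    static void scheduleNextFetch(Datum d, long fetchedAt) {
        d.fetchTime = fetchedAt + (long) d.fetchInterval * 1000;
    }

    // Same logic as the shouldFetch() method quoted above.
    static boolean shouldFetch(Datum d, long curTime) {
        if (d.fetchTime - curTime > (long) MAX_INTERVAL * 1000) {
            if (d.fetchInterval > MAX_INTERVAL) {
                d.fetchInterval = (int) (MAX_INTERVAL * 0.9f);
            }
            d.fetchTime = curTime;
        }
        return d.fetchTime <= curTime;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        Datum page = new Datum(0L, 30 * 24 * 3600); // 30-day interval
        scheduleNextFetch(page, now);
        System.out.println(shouldFetch(page, now + DAY_MS));      // false: next generate is too early
        System.out.println(shouldFetch(page, now + 31 * DAY_MS)); // true: the interval has elapsed
    }
}
```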

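So the remaining steps do not strictly need to be analyzed. They are listed here only as further explanation.

2) If the check above returns true, this code runs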

LongWritable oldGenTime = (LongWritable) crawlDatum.getMetaData().get(
    Nutch.WRITABLE_GENERATE_TIME_KEY);
if (oldGenTime != null) { // awaiting fetch & update; this value is null after updatedb
  if (oldGenTime.get() + genDelay > curTime) // still waiting for update: not expired yet, so skip it
    return;
}

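After updatedb, the corresponding _ngt_ entry no longer exists, so oldGenTime is null here. This check is therefore skipped and does not filter the URL. It only holds back URLs that were generated recently but not yet updated.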
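A minimal sketch of this generate-time check follows. A plain Map stands in for the CrawlDatum metadata, and the 7-day genDelay is an assumed value chosen for illustration. It returns true when the URL is held back because an earlier generate is still pending.

```java
import java.util.HashMap;
import java.util.Map;

class GenerateTimeCheckSketch {
    static final long DAY_MS = 24L * 3600 * 1000;

    // True if the URL was generated recently and is still waiting for fetch and update.
    static boolean heldBack(Map<String, Long> metaData, long genDelay, long curTime) {
        Long oldGenTime = metaData.get("_ngt_");
        return oldGenTime != null && oldGenTime + genDelay > curTime;
    }

    public static void main(String[] args) {
        long genDelay = 7 * DAY_MS;       // assumed delay
        long generatedAt = 1310616920169L; // the _ngt_ value from the segment dump
        Map<String, Long> pending = new HashMap<>();
        pending.put("_ngt_", generatedAt);
        Map<String, Long> updated = new HashMap<>(); // updatedb removed _ngt_

        System.out.println(heldBack(pending, genDelay, generatedAt + DAY_MS));     // true
        System.out.println(heldBack(pending, genDelay, generatedAt + 8 * DAY_MS)); // false
        System.out.println(heldBack(updated, genDelay, generatedAt + DAY_MS));     // false
    }
}
```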

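B. DecreasingFloatComparator

Even if a URL passes the filters in A, that does not mean it will be selected for fetching: the top N by score still have to be picked. So this step filters the URLs further.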
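The effect of sorting by decreasing score and cutting at topN can be sketched as below. This is a plain in-memory illustration, not the Generator's MapReduce code; the URLs other than http://www.163.com/rss/ and all scores except 0.01 are made up for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TopNSketch {
    // Sorts candidates by score, highest first, and keeps at most n URLs.
    static List<String> topN(Map<String, Float> scores, int n) {
        List<Map.Entry<String, Float>> entries = new ArrayList<>(scores.entrySet());
        // Decreasing order: compare b against a.
        entries.sort((a, b) -> Float.compare(b.getValue(), a.getValue()));
        List<String> selected = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            selected.add(entries.get(i).getKey());
        }
        return selected;
    }

    public static void main(String[] args) {
        Map<String, Float> candidates = new LinkedHashMap<>();
        candidates.put("http://www.163.com/rss/", 0.01f);
        candidates.put("http://example.com/a", 0.5f); // hypothetical URL
        candidates.put("http://example.com/b", 1.0f); // hypothetical URL
        // With topN = 2 the lowest-scored URL is left for a later round.
        System.out.println(topN(candidates, 2)); // [http://example.com/b, http://example.com/a]
    }
}
```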

In fact, in the next generate round, the www.163.com URL that was already fetched no longer appears:

hadoop fs -text output/debug/segments/20110714121627/crawl_generate/part-00000 | grep www.163.com -B 10 -A 10
http://www.163.com/rss/ Version: 7
Status: 1 (db_unfetched)
Fetch time: Tue Jul 12 23:49:27 CST 2011
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.01
Signature: null
Metadata: _ngt_: 1310616920169

Only URLs that have not been fetched yet show up.
