To work out how to solve the recrawl problem, today I took a close look at what actually happens at each step of a nutch crawl.
==============Preparation======================
(Cygwin is required on Windows)
Check out the code from SVN;
cd into the crawler directory;
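The inject step in the next section reads its seed list from a `urls` directory. If your checkout does not already have one, a minimal sketch of preparing it follows. It assumes a single seed file named `seed.txt` (the file name is my own choice for illustration; the URL is the one fetched later in this walkthrough):

```shell
# Create the seed directory that `bin/nutch inject crawl/crawldb urls` reads.
# seed.txt is a hypothetical file name used only for illustration;
# the injector expects plain text files with one URL per line.
mkdir -p urls
printf '%s\n' 'http://www.complaints.com/directory/directory.htm' > urls/seed.txt
cat urls/seed.txt
```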
==============inject==========================
$ bin/nutch inject crawl/crawldb urls
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
The crawldb directory is created at this point.
Inspect its contents:
$ bin/nutch readdb crawl/crawldb -stats
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls: 1
retry 0: 1
min score: 1.0
avg score: 1.0
max score: 1.0
status 1 (db_unfetched): 1
CrawlDb statistics: done
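When comparing the crawldb before and after each step, it can help to pull a single counter out of the `-stats` output rather than reading the whole block. A rough sketch, assuming the output format shown above; `status_count` is a hypothetical helper, not part of Nutch:

```shell
# status_count NAME: read `readdb -stats` output on stdin and print the
# count for the given status name (e.g. db_unfetched); prints 0 if absent.
# Hypothetical helper; it assumes lines like "status 1 (db_unfetched): 1".
status_count() {
  awk -v name="($1):" '
    $1 == "status" && $3 == name { n = $4 }
    END { print n + 0 }
  '
}

# Example fed with lines copied from the output above. Real use would pipe
# `bin/nutch readdb crawl/crawldb -stats` into the function instead.
status_count db_unfetched <<'EOF'
TOTAL urls: 1
retry 0: 1
status 1 (db_unfetched): 1
EOF
```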
===============generate=========================
$ bin/nutch generate crawl/crawldb crawl/segments
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20080112224520
Generator: filtering: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
$ s1=`ls -d crawl/segments/2* | tail -1`
The segments directory is created at this point, but the new segment contains only a crawl_generate directory:
$ bin/nutch readseg -list $s1
NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
20080112224520  1          ?                    ?                    ?        ?
The crawldb contents are unchanged at this point: still 1 unfetched URL.
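The `s1` assignment in this step works because segment directories are named with a timestamp (such as 20080112224520), so the lexically last name is also the newest. A small sketch of the same selection wrapped in a function, with a guard for the case where no segment exists yet; `latest_segment` is a hypothetical name, not a Nutch command:

```shell
# latest_segment DIR: print the newest segment under DIR, or fail if none.
# Relies on timestamp-style names sorting lexically in chronological order.
latest_segment() {
  seg=$(ls -d "$1"/2* 2>/dev/null | tail -1)
  [ -n "$seg" ] && printf '%s\n' "$seg"
}

# Demo on a throwaway directory (no Nutch needed).
demo=$(mktemp -d)
mkdir -p "$demo/segments/20080112224520" "$demo/segments/20080113001000"
latest_segment "$demo/segments"
```

In the walkthrough this would replace the backtick expression, for example `s1=$(latest_segment crawl/segments)`.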
=================fetch==============================
$ bin/nutch fetch $s1
Fetcher: starting
Fetcher: segment: crawl/segments/20080112224520
Fetcher: threads: 10
fetching http://www.complaints.com/directory/directory.htm
Fetcher: done
The segment directory now contains several more subdirectories.
$ bin/nutch readseg -list $s1
NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
20080112224520  1          2008-01-12T22:52:00  2008-01-12T22:52:00  1        1
The crawldb contents are still unchanged at this point: 1 unfetched URL.
================updatedb=============================
$ bin/nutch updatedb crawl/crawldb $s1
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20080112224520]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Now the crawldb contents have changed:
$ bin/nutch readdb crawl/crawldb -stats
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls: 97
retry 0: 97
min score: 0.01
avg score: 0.02
max score: 1.0
status 1 (db_unfetched): 96
status 2 (db_fetched): 1
CrawlDb statistics: done
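The generate, fetch, and updatedb steps above form one round. Since updatedb left 96 unfetched URLs in the crawldb, running the same three steps again should pick up the next batch. A hedged sketch of one such round, using only the commands shown in this post; `crawl_round` and `NUTCH` are names I made up, and this is not a claim about how the built-in crawl command is implemented:

```shell
# One generate -> fetch -> updatedb round, using the commands from this post.
# NUTCH and crawl_round are illustrative names, not part of Nutch.
NUTCH=${NUTCH:-bin/nutch}

crawl_round() {
  crawl_dir=$1
  "$NUTCH" generate "$crawl_dir/crawldb" "$crawl_dir/segments" || return 1
  # Newest segment = lexically last timestamp-named directory.
  seg=$(ls -d "$crawl_dir"/segments/2* | tail -1)
  "$NUTCH" fetch "$seg" || return 1
  "$NUTCH" updatedb "$crawl_dir/crawldb" "$seg"
}

# Usage (commented out so that sourcing this file does not start a crawl):
# for i in 1 2 3; do crawl_round crawl || break; done
```

Checking `readdb -stats` between rounds, as done above, shows whether db_unfetched is shrinking as expected.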
==============invertlinks ==============================
$ bin/nutch invertlinks crawl/linkdb crawl/segments/*
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: crawl/segments/20080112224520
LinkDb: done
The linkdb directory is created at this point.
===============index====================================
$ bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
Indexer: starting
Indexer: linkdb: crawl/linkdb
Indexer: adding segment: crawl/segments/20080112224520
Indexing [http://www.complaints.com/directory/directory.htm] with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@ba4211 (null)
Optimizing index.
merging segments _ram_0 (1 docs) into _0 (1 docs)
Indexer: done
The indexes directory is created at this point.
================Testing the crawl results==========================
$ bin/nutch org.apache.nutch.searcher.NutchBean complaints
Total hits: 1
0 20080112224520/http://www.complaints.com/directory/directory.htm
Complaints.com - Sitemap by date ?Complaints ...
[Actually written Jan 13, 2008, 12:10 am]