First, let's look at Nutch's overall workflow.
Below is a walkthrough of the whole-web crawling part described at http://lucene.apache.org/nutch/tutorial8.html:
Whole-web: Bootstrapping the Web Database
The injector adds urls to the crawldb. Let's inject URLs from the DMOZ Open Directory. First we must download and uncompress the file listing all of the DMOZ pages. (This is a 200+Mb file, so this will take a few minutes.)
wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
gunzip content.rdf.u8.gz
Next we select a random subset of these pages. (We use a random subset so that everyone who runs this tutorial doesn't hammer the same sites.) DMOZ contains around three million URLs. We select one out of every 5000, so that we end up with around 1000 URLs:
mkdir dmoz
bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 > dmoz/urls
Download content.rdf.u8.gz directly from http://rdf.dmoz.org/rdf/content.rdf.u8.gz; it provides a huge number of URLs to test with. Then uncompress the downloaded gzip archive into your Nutch root directory, e.g. d:\nutch\nutch-0.9.
Run the following commands in Cygwin:
mkdir dmoz: creates the dmoz directory under the current directory.
bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 > dmoz/urls: generates the test URLs. The DmozParser tool shipped with Nutch parses content.rdf.u8 and keeps one URL out of every 5000; since DMOZ contains roughly three million URLs, this yields around 1000 URLs, which are written to the urls file in the dmoz directory.
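A quick way to confirm that the parser produced roughly the expected number of URLs is to count the lines of the output file (a minimal check with standard shell tools; the exact number varies from run to run):
wc -l dmoz/urls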
The parser also takes a few minutes, as it must parse the full file. Finally, we initialize the crawl db with the selected urls.
bin/nutch inject crawl/crawldb dmoz
Now we have a web database with around 1000 as-yet unfetched URLs in it.
bin/nutch inject crawl/crawldb dmoz: injects the URLs under dmoz into crawl/crawldb, which initializes the crawldb. This is the first step in the Nutch workflow diagram above: the initial URLs are added and written into the crawldb directory that stores URL information.
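To verify the injection, the crawldb can be inspected with the readdb command, which is part of the standard bin/nutch command set (a quick sanity check; the exact statistics labels may differ slightly between Nutch versions):
bin/nutch readdb crawl/crawldb -stats
It should report on the order of 1000 URLs, all of them still unfetched at this point.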
Whole-web: Fetching
Starting from 0.8 the nutch user agent identifier needs to be configured before fetching. To do this you must edit the file conf/nutch-site.xml, insert at minimum the following properties into it and edit in proper values for the properties:
<property>
<name>http.agent.name</name>
<value></value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.
NOTE: You should also check other related properties:
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
and set their values appropriately.
</description>
</property>
<property>
<name>http.agent.description</name>
<value></value>
<description>Further description of our bot- this text is used in
the User-Agent header. It appears in parenthesis after the agent name.
</description>
</property>
<property>
<name>http.agent.url</name>
<value></value>
<description>A URL to advertise in the User-Agent header. This will
appear in parenthesis after the agent name. Custom dictates that this
should be a URL of a page explaining the purpose and behavior of this
crawler.
</description>
</property>
<property>
<name>http.agent.email</name>
<value></value>
<description>An email address to advertise in the HTTP 'From' request
header and User-Agent header. A good practice is to mangle this
address (e.g. 'info at example dot com') to avoid spamming.
</description>
</property>
To fetch, we first generate a fetchlist from the database:
bin/nutch generate crawl/crawldb crawl/segments
Before the actual crawl starts, Nutch must be configured. This configuration mainly identifies the crawler to the sites being crawled, and providing descriptive values helps site operators understand (and tolerate) the crawler. The <property> elements below go inside the <configuration> root element of conf/nutch-site.xml; note that http.agent.name must be given a non-empty value, otherwise the fetcher will refuse to run. For example:
<property>
<name>http.agent.name</name>
<value>MyNutch</value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.
NOTE: You should also check other related properties:
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
and set their values appropriately.
</description>
</property>
<property>
<name>http.agent.description</name>
<value>MyNutch</value>
<description>Further description of our bot- this text is used in
the User-Agent header. It appears in parenthesis after the agent name.
</description>
</property>
<property>
<name>http.agent.url</name>
<value>www.XX.com</value>
<description>A URL to advertise in the User-Agent header. This will
appear in parenthesis after the agent name. Custom dictates that this
should be a URL of a page explaining the purpose and behavior of this
crawler.
</description>
</property>
<property>
<name>http.agent.email</name>
<value>XX@163.com</value>
<description>An email address to advertise in the HTTP 'From' request
header and User-Agent header. A good practice is to mangle this
address (e.g. 'info at example dot com') to avoid spamming.
</description>
</property>
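Nutch assembles the User-Agent header it sends from these agent properties. Before fetching, it is worth double-checking that the values really ended up in conf/nutch-site.xml (a trivial check with standard shell tools; adjust the property name to inspect the others):
grep -A 1 "http.agent.name" conf/nutch-site.xml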
bin/nutch generate crawl/crawldb crawl/segments: generates, from the crawldb, a fetchlist of the pages to be fetched.
This is the second step in the workflow diagram above: creating a new segment.
After the command completes successfully, a directory like the following is created: crawl/segments/20080701162119/crawl_generate (the segment name is its creation timestamp, so yours will differ).
This generates a fetchlist for all of the pages due to be fetched. The fetchlist is placed in a newly created segment directory. The segment directory is named by the time it's created. We save the name of this segment in the shell variable s1:
s1=`ls -d crawl/segments/2* | tail -1`
echo $s1
Now we run the fetcher on this segment with:
bin/nutch fetch $s1
When this is complete, we update the database with the results of the fetch:
bin/nutch updatedb crawl/crawldb $s1
Now the database has entries for all of the pages referenced by the initial set.
Now we fetch a new segment with the top-scoring 1000 pages:
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
s2=`ls -d crawl/segments/2* | tail -1`
echo $s2
bin/nutch fetch $s2
bin/nutch updatedb crawl/crawldb $s2
Let's fetch one more round:
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
s3=`ls -d crawl/segments/2* | tail -1`
echo $s3
bin/nutch fetch $s3
bin/nutch updatedb crawl/crawldb $s3
By this point we've fetched a few thousand pages. Let's index them!
Explanation 1:
First fetch round:
s1=`ls -d crawl/segments/2* | tail -1`
echo $s1
These save the name of the segment directory, 20080701162119, in the shell variable s1.
bin/nutch fetch $s1
starts the first fetch round.
bin/nutch updatedb crawl/crawldb $s1
updates the database, storing the information about the fetched pages in the crawldb.
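To see what a round actually produced, the segment can be summarized with the readseg command (a sketch; readseg with the -list option is present in Nutch 0.8/0.9 as far as I know, and prints counts such as how many URLs were generated, fetched and parsed):
bin/nutch readseg -list $s1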
Explanation 2:
Second fetch round:
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
creates a new segment, selecting the top-scoring 1000 URLs for the second fetch.
s2=`ls -d crawl/segments/2* | tail -1`
echo $s2
These save the new segment directory, 20080701162531, in the variable s2.
bin/nutch fetch $s2
starts the second fetch round.
bin/nutch updatedb crawl/crawldb $s2
updates the database, storing the newly fetched page information.
Third fetch round:
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
as in the second round, creates a new segment and selects the top-scoring 1000 URLs to fetch.
s3=`ls -d crawl/segments/2* | tail -1`
echo $s3
These save the new segment directory, 20080701163439, in the variable s3.
bin/nutch fetch $s3
starts the third fetch round.
bin/nutch updatedb crawl/crawldb $s3
updates the database, storing the newly crawled page information.
When a round is complete, subdirectories like the following are created under the segment directory:
20080701162119/content, crawl_fetch, crawl_parse, parse_data, parse_text
This part covers the third step ("crawl/fetch") and the fourth step ("content parsing") of the workflow diagram above.
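Since the three rounds above all follow the same generate -> fetch -> updatedb pattern, they can also be scripted as a simple loop. A minimal sketch, using only the commands already shown (note that this version applies -topN 1000 to the first round as well, unlike the unbounded first round above):
for i in 1 2 3; do
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  s=`ls -d crawl/segments/2* | tail -1`
  echo "fetching segment $s"
  bin/nutch fetch $s
  bin/nutch updatedb crawl/crawldb $s
done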
Whole-web: Indexing
Before indexing we first invert all of the links, so that we may index incoming anchor text with the pages.
bin/nutch invertlinks crawl/linkdb crawl/segments/*
To index the segments we use the index command, as follows:
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
Now we're ready to search!
Indexing here consists of two parts:
bin/nutch invertlinks crawl/linkdb crawl/segments/*: inverts all of the links. The segments store, for each fetched page, its outgoing links; invertlinks aggregates them the other way around, so that for every URL the linkdb lists the pages linking to it together with their anchor text, and that incoming anchor text can then be indexed with the target page.
After it runs successfully, the directory crawl/linkdb is created.
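To see what link inversion produced, the linkdb can be dumped to plain text with the readlinkdb command (a sketch; readlinkdb with -dump is part of the standard bin/nutch command set in 0.8/0.9, and the name of the output part file may differ depending on how the job runs):
bin/nutch readlinkdb crawl/linkdb -dump linkdump
head linkdump/part-00000
Each entry lists a URL followed by its incoming links and their anchor text.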
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*: builds the index.
After it runs successfully, the directory crawl/indexes is created.
This corresponds to the indexing step in the workflow diagram above.
Finally, a short explanation of what each directory is for:
Whole-web: Concepts
Nutch data is composed of:
- The crawl database, or crawldb. This contains information about every url known to Nutch, including whether it was fetched, and, if so, when.
- The link database, or linkdb. This contains the list of known links to each url, including both the source url and anchor text of the link.
- A set of segments. Each segment is a set of urls that are fetched as a unit. Segments are directories with the following subdirectories:
  - a crawl_generate names a set of urls to be fetched
  - a crawl_fetch contains the status of fetching each url
  - a content contains the content of each url
  - a parse_text contains the parsed text of each url
  - a parse_data contains outlinks and metadata parsed from each url
  - a crawl_parse contains the outlink urls, used to update the crawldb
- The indexes are Lucene-format indexes.
crawldb: contains information about every URL Nutch knows of, including whether it has been fetched and, if so, when;
linkdb: the list of known links to each URL, including both the source URL and the anchor text of each link;
subdirectories under each segment:
crawl_generate: the list of URLs about to be fetched
crawl_fetch: the fetch status of each URL
content: the raw content of each URL
parse_text: the parsed text of each URL
parse_data: the outlinks and metadata parsed from each URL
crawl_parse: the outlink URLs, used to update the crawldb
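Once crawl/indexes exists, a quick end-to-end check is to run a query from the command line with the NutchBean class used by the search web application (a sketch; it assumes the searcher.dir property points at the crawl directory, as described in the searching part of the tutorial):
bin/nutch org.apache.nutch.searcher.NutchBean apache
If everything above worked, this should print the total number of hits for the query "apache" along with the top results.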