
Part 4: A study of the Nutch 1.0 site and crawler property configuration file

This article is an original work by solomon@javaeye; if you repost it, please credit the source (author: solomon; link: http://zolomon.iteye.com).
This series uses ikanalyzer for Chinese word segmentation; my thanks to its author for this great contribution to Chinese-language Java development.
My profile: http://www.google.com/profiles/solomon.royarr

I finally have a free day to write something,
only to find that the office I have been away from for so long (just a few days, really) no longer has the materials I need.
The connection here is so slow that even downloading Nutch would disrupt my colleagues' work,
so for now I will walk through a configuration file from an older Nutch version found online,
and revise it to match Nutch 1.0 later.
My apologies to readers.
The version used here comes from http://svn.apache.org/repos/asf/lucene/nutch/branches/branch-0.7/conf/nutch-default.xml

A note on sources:
this article is
1) mainly a translation (chiefly of nutch-default.xml),
2) plus my own experience using Nutch,
3) plus material from the FAQ on the official Nutch wiki, http://wiki.apache.org/nutch/FAQ,
4) combined with earlier walkthroughs of Nutch configuration files by other users.
It is made up of these four parts.

Comments of the form <!--begin comment here begin--> and <!--end comment here end--> in this document are extra commentary from me, not translation. The former provides notes before a block of translated properties; the latter provides explanations after one. The two kinds do not necessarily appear in pairs.



<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>


<!--begin  First, some notes describing how this file is meant to be used  begin-->
<!-- Do not modify this file directly. Instead, copy the entries you need (I translate "entry" as "property" here; in the original, an entry means the content between <property> and </property>, or more precisely, excluding variable parts such as <value>) into nutch-site.xml and change their values there. If nutch-site.xml does not exist, create it yourself. -->
<!--end  nutch-site.xml can be rendered in any of several styles by specifying a different xsl stylesheet, so do not be surprised if configuration files found online look different. What style each xsl file defines is not described here; please consult the xsl files shipped in the Nutch archive.  end-->
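To make the copy-to-nutch-site.xml workflow concrete, here is a minimal sketch of such an override file. The property name is real; the value is an illustrative placeholder, and note that Nutch 1.0 uses a <configuration> root element rather than the <nutch-conf> root shown in this 0.7-era file:

```xml
<?xml version="1.0"?>
<!-- nutch-site.xml: local overrides for nutch-default.xml.
     Only the entries copied here are overridden; every other
     property keeps its default value. -->
<nutch-conf>

<property>
  <name>http.agent.name</name>
  <value>MyTestCrawler</value> <!-- illustrative value, not a default -->
</property>

</nutch-conf>
```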


<!--begin Root element of the Nutch configuration file begin-->
<nutch-conf>


<!--begin  Properties in the Nutch configuration file are grouped into sections, each covering one area, so the structure is easy to scan; to change something, go straight to the relevant section and look for the property there. For example, the "HTTP properties" section below covers HTTP-related settings, and later sections cover FTP-related settings, searcher-related settings, and so on.  begin-->
<!-- HTTP properties -->

<property>
  <name>http.agent.name</name>
  <value>NutchCVS</value>
  <description>Our HTTP 'User-Agent' request header.</description>
</property>
<!--end  This is the name the crawler reports in its HTTP 'User-Agent' request header; it identifies your crawler to the sites it visits. It is one of the three mandatory properties in the Nutch 1.0 configuration file.  end-->



<property>
  <name>http.robots.agents</name>
  <value>NutchCVS,Nutch,*</value>
  <description>The agent strings we will look for in robots.txt files,
  comma-separated, in decreasing order of precedence.</description>
</property>
<!--end  Reading robots.txt is part of the robots exclusion convention; our crawler honors the rules a site declares there. For more on robots.txt, see http://www.robotstxt.org/  end-->

<property>
  <name>http.robots.403.allow</name>
  <value>true</value>
  <description>Some servers return HTTP status 403 (Forbidden) when /robots.txt does not exist. This usually means we may still crawl the site. If this property is set to false, we treat such sites as not allowing crawling and do not fetch them.</description>
</property>

<property>
  <name>http.agent.description</name>
  <value>Nutch</value>
  <description>Also used in the User-Agent header. A further explanation of the bot. It (the string in this value) appears in parentheses after agent.name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value>http://lucene.apache.org/nutch/bot.html</value>
  <description>Also used in the User-Agent header. It (the string in this value) appears after agent.name; just a URL advertised so site operators can learn about the crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value>nutch-agent@lucene.apache.org</value>
  <description>An email address to advertise in the HTTP 'From' request header and in the User-Agent header.</description>
</property>

<property>
  <name>http.agent.version</name>
  <value>0.7.2</value>
  <description>A version string to advertise in the User-Agent header.</description>
</property>

<property>
  <name>http.timeout</name>
  <value>10000</value>
  <description>The default network timeout, in milliseconds.</description>
</property>

<property>
  <name>http.max.delays</name>
  <value>3</value>
  <description>The number of times a page fetch may be deferred. Each time a host is found to be busy, Nutch waits fetcher.server.delay. After http.max.delays deferrals have occurred, the fetch gives up on that page.</description>
</property>

<property>
  <name>http.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonzero, content longer than it is truncated; otherwise, nothing is truncated.
  </description>
</property>
<!--end  The "download" here is not the manual kind where you click a download link. Some beginner readers mistake this for the case where a page offers something to download (say, an attachment). What we mean is that whenever you visit a web page, the page itself has to be downloaded over the network before your browser can display it; opening or visiting any page involves one such download of that page.  end-->
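As a sketch of overriding this entry in nutch-site.xml: per the description above, zero disables truncation in this 0.7-era file, while later Nutch versions use a negative value such as -1 for the same purpose, so check the wording in your own version's nutch-default.xml first:

```xml
<property>
  <name>http.content.limit</name>
  <value>0</value> <!-- 0 = no truncation in this (0.7) config; newer versions use -1 -->
</property>
```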

<property>
  <name>http.proxy.host</name>
  <value></value>
  <description>The proxy hostname. If empty, no proxy is used.</description>
</property>

<property>
  <name>http.proxy.port</name>
  <value></value>
  <description>The proxy port.</description>
</property>
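These two entries work together; a sketch of routing the crawler through a proxy, with a hypothetical hostname and port as placeholders:

```xml
<property>
  <name>http.proxy.host</name>
  <value>proxy.example.com</value> <!-- hypothetical proxy host -->
</property>

<property>
  <name>http.proxy.port</name>
  <value>8080</value> <!-- hypothetical proxy port -->
</property>
```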

<property>
  <name>http.verbose</name>
  <value>false</value>
  <description>If true, HTTP will log more verbosely.</description>
</property>
<!--end  The exact effect is unclear and needs further experimentation. Roughly translated: if true, HTTP activity is logged at great length. end-->

<property>
  <name>http.redirect.max</name>
  <value>3</value>
  <description>The maximum number of redirects followed during fetching; if a page has more redirects than this, the fetcher gives it up and tries the next page.</description>
</property>

<!-- FILE properties -->

<property>
  <name>file.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonzero, content longer than it is truncated; otherwise (zero or negative), nothing is truncated.
  </description>
</property>

<property>
  <name>file.content.ignored</name>
  <value>true</value>
  <description>If true, no file content is stored during fetch.
  This is usually what we want, since a file:// URL usually means the file is local and we can fetch it and index it directly. Otherwise (if false), file content is stored.
  !! NOT IMPLEMENTED YET !!
  </description>
</property>

<!-- FTP properties -->

<property>
  <name>ftp.username</name>
  <value>anonymous</value>
  <description>The ftp login username.</description>
</property>

<property>
  <name>ftp.password</name>
  <value>anonymous@example.com</value>
  <description>The ftp login password.</description>
</property>

<property>
  <name>ftp.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonzero, content longer than it is truncated; otherwise (zero or negative), nothing is truncated. Caution: the classical
  ftp RFCs never defined partial transfer, and in practice some ftp servers do not handle a forced client-side close-down well.
  We try our best to handle such situations so that things run smoothly.
  </description>
</property>

<property>
  <name>ftp.timeout</name>
  <value>60000</value>
  <description>The default ftp client socket timeout, in milliseconds. See also the ftp.keep.connection property below.</description>
</property>

<property>
  <name>ftp.server.timeout</name>
  <value>100000</value>
  <description>An estimate of the ftp server idle time, in milliseconds. For most ftp servers 120000 ms is typical.
  This setting is better kept conservative. Together with ftp.timeout, it is used to decide whether we need to delete (kill) the current ftp.client instance and force a restart of another one. This is necessary because a fetcher thread may not issue its next request in time (it may sit idle) before the ftp client is disconnected by the remote server's idle timeout. Only used when ftp.keep.connection (see below) is true.
  </description>
</property>

<property>
  <name>ftp.keep.connection</name>
  <value>false</value>
  <description>Whether to keep the ftp connection open. Useful when fetching from the same host over and over. If true, it skips the connection, login and directory-list parser setup (the original says "setup", not "install") for subsequent URLs. If true, you must make sure that:
  (1) ftp.timeout is less than ftp.server.timeout
  (2) ftp.timeout is greater than (fetcher.threads.fetch * fetcher.server.delay)
  Otherwise a lot of "delete client because idled too long" messages appear in the thread logs.</description>
</property>
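Before enabling this, the two constraints above can be checked against the default values given elsewhere in this file; a sketch:

```xml
<!-- Constraint check using the defaults in this file:
     (1) ftp.timeout (60000 ms) < ftp.server.timeout (100000 ms)       : satisfied
     (2) ftp.timeout (60000 ms) > fetcher.threads.fetch * fetcher.server.delay
                                = 10 * 5.0 s = 50000 ms                : satisfied -->
<property>
  <name>ftp.keep.connection</name>
  <value>true</value>
</property>
```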

<property>
  <name>ftp.follow.talk</name>
  <value>false</value>
  <description>Whether to log the dialogue between our client and the remote server. Useful when debugging.</description>
</property>

<!-- web db properties -->

<property>
  <name>db.default.fetch.interval</name>
  <value>30</value>
  <description>The default number of days between re-fetches of a page.
  </description>
</property>

<property>
  <name>db.ignore.internal.links</name>
  <value>true</value>
  <description>If true, when adding new links for a page, links from the same host are ignored. This is a very effective way to limit the size of the link database, keeping only the highest-quality links.
  </description>
</property>
<!--end  This property has a noticeable effect on how the search engine presents its result pages.  end-->

<property>
  <name>db.score.injected</name>
  <value>1.0</value>
  <description>The score of new pages added by the injector.
  </description>
</property>

<property>
  <name>db.score.link.external</name>
  <value>1.0</value>
  <description>The score factor for pages added due to a link from
  another host, relative to the referencing page's score.
  </description>
</property>

<property>
  <name>db.score.link.internal</name>
  <value>1.0</value>
  <description>The score factor for pages added due to a link from the
  same host, relative to the referencing page's score.
  </description>
</property>

<property>
  <name>db.max.outlinks.per.page</name>
  <value>100</value>
  <description>The maximum number of outlinks we will parse from a single page.</description>
</property>

<property>
  <name>db.max.anchor.length</name>
  <value>100</value>
  <description>The maximum length of an anchor (the text of a link).</description>
</property>

<property>
  <name>db.fetch.retry.max</name>
  <value>3</value>
  <description>The maximum number of retries when fetching.</description>
</property>

<!-- fetchlist tool properties -->

<property>
  <name>fetchlist.score.by.link.count</name>
  <value>true</value>
  <description>If true, set page scores on fetchlist entries based on
  log(number of anchors), instead of using original page scores. This
  results in prioritization of pages with many incoming links.
  </description>
</property>

<!-- fetcher properties -->

<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <description>The number of seconds the fetcher will delay between
   successive requests to the same server.</description>
</property>

<property>
  <name>fetcher.threads.fetch</name>
  <value>10</value>
  <description>The number of fetcher threads to use at once.
    This also determines the maximum number of requests that are
    made at once (each FetcherThread handles one connection).</description>
</property>

<property>
  <name>fetcher.threads.per.host</name>
  <value>1</value>
  <description>The maximum number of simultaneous fetch threads allowed per host.</description>
</property>

<property>
  <name>fetcher.verbose</name>
  <value>false</value>
  <description>If true, the fetcher logs more verbosely.</description>
</property>

<!-- parser properties -->
<property>
  <name>parser.threads.parse</name>
  <value>10</value>
  <description>The number of parser threads ParseSegment should use at once.</description>
</property>

<!-- i/o properties -->

<property>
  <name>io.sort.factor</name>
  <value>100</value>
  <description>The number of streams to merge at once while sorting
  files.  This determines the number of open file handles.</description>
</property>

<property>
  <name>io.sort.mb</name>
  <value>100</value>
  <description>The total amount of buffer memory to use while sorting
  files, in megabytes.  By default, gives each merge stream 1MB, which
  should minimize seeks.</description>
</property>

<property>
  <name>io.file.buffer.size</name>
  <value>131072</value>
  <description>The size of buffer for use in sequence files.
  The size of this buffer should probably be a multiple of hardware
  page size (4096 on Intel x86), and it determines how much data is
  buffered during read and write operations.</description>
</property>
 
<!-- file system properties -->

<property>
  <name>fs.default.name</name>
  <value>local</value>
  <description>The name of the default file system.  Either the
  literal string "local" or a host:port for NDFS.</description>
</property>

<property>
  <name>ndfs.name.dir</name>
  <value>/tmp/nutch/ndfs/name</value>
  <description>Determines where on the local filesystem the NDFS name node
      should store the name table.</description>
</property>

<property>
  <name>ndfs.data.dir</name>
  <value>/tmp/nutch/ndfs/data</value>
  <description>Determines where on the local filesystem an NDFS data node
      should store its blocks.</description>
</property>

<!-- map/reduce properties -->

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:8010</value>
  <description>The host and port that the MapReduce job tracker runs at.
  </description>
</property>

<property>
  <name>mapred.local.dir</name>
  <value>/tmp/nutch/mapred/local</value>
  <description>The local directory where MapReduce stores temporary files
      related to tasks and jobs.
  </description>
</property>

<!-- indexer properties -->

<property>
  <name>indexer.score.power</name>
  <value>0.5</value>
  <description>Determines the power of link analysis scores.  Each
  page's boost is set to <i>score<sup>scorePower</sup></i> where
  <i>score</i> is its link analysis score and <i>scorePower</i> is the
  value of this parameter.  This is compiled into indexes, so, when
  this is changed, pages must be re-indexed for it to take
  effect.</description>
</property>

<property>
  <name>indexer.boost.by.link.count</name>
  <value>true</value>
  <description>When true scores for a page are multipled by the log of
  the number of incoming links to the page.</description>
</property>

<property>
  <name>indexer.max.title.length</name>
  <value>100</value>
  <description>The maximum number of characters of a title that are indexed.
  </description>
</property>

<property>
  <name>indexer.max.tokens</name>
  <value>10000</value>
  <description>
  The maximum number of tokens that will be indexed for a single field
  in a document. This limits the amount of memory required for
  indexing, so that collections with very large files will not crash
  the indexing process by running out of memory.

  Note that this effectively truncates large documents, excluding
  from the index tokens that occur further in the document. If you
  know your source documents are large, be sure to set this value
  high enough to accommodate the expected size. If you set it to
  Integer.MAX_VALUE, then the only limit is your memory, but you
  should anticipate an OutOfMemoryError.
  </description>
</property>

<property>
  <name>indexer.mergeFactor</name>
  <value>50</value>
  <description>The factor that determines the frequency of Lucene segment
  merges. This must not be less than 2, higher values increase indexing
  speed but lead to increased RAM usage, and increase the number of
  open file handles (which may lead to "Too many open files" errors).
  NOTE: the "segments" here have nothing to do with Nutch segments, they
  are a low-level data unit used by Lucene.
  </description>
</property>

<property>
  <name>indexer.minMergeDocs</name>
  <value>50</value>
  <description>This number determines the minimum number of Lucene
  Documents buffered in memory between Lucene segment merges. Larger
  values increase indexing speed and increase RAM usage.
  </description>
</property>

<property>
  <name>indexer.maxMergeDocs</name>
  <value>2147483647</value>
  <description>This number determines the maximum number of Lucene
  Documents to be merged into a new Lucene segment. Larger values
  increase indexing speed and reduce the number of Lucene segments,
  which reduces the number of open file handles; however, this also
  increases RAM usage during indexing.
  </description>
</property>

<property>
  <name>indexer.termIndexInterval</name>
  <value>128</value>
  <description>Determines the fraction of terms which Lucene keeps in
  RAM when searching, to facilitate random-access.  Smaller values use
  more memory but make searches somewhat faster.  Larger values use
  less memory but make searches somewhat slower.
  </description>
</property>


<!-- analysis properties -->

<property>
  <name>analysis.common.terms.file</name>
  <value>common-terms.utf8</value>
  <description>The name of a file containing a list of common terms
  that should be indexed in n-grams.</description>
</property>

<!-- searcher properties -->

<property>
  <name>searcher.dir</name>
  <value>.</value>
  <description>
  Path to root of index directories.  This directory is searched (in
  order) for either the file search-servers.txt, containing a list of
  distributed search servers, or the directory "index" containing
  merged indexes, or the directory "segments" containing segment
  indexes.
  </description>
</property>

<property>
  <name>searcher.filter.cache.size</name>
  <value>16</value>
  <description>
  Maximum number of filters to cache.  Filters can accelerate certain
  field-based queries, like language, document format, etc.  Each
  filter requires one bit of RAM per page.  So, with a 10 million page
  index, a cache size of 16 consumes two bytes per page, or 20MB.
  </description>
</property>

<property>
  <name>searcher.filter.cache.threshold</name>
  <value>0.05</value>
  <description>
  Filters are cached when their term is matched by more than this
  fraction of pages.  For example, with a threshold of 0.05, and 10
  million pages, the term must match more than 1/20, or 50,000 pages.
  So, if out of 10 million pages, 50% of pages are in English, and 2%
  are in Finnish, then, with a threshold of 0.05, searches for
  "lang:en" will use a cached filter, while searches for "lang:fi"
  will score all 200,000 Finnish documents.
  </description>
</property>

<property>
  <name>searcher.hostgrouping.rawhits.factor</name>
  <value>2.0</value>
  <description>
  A factor that is used to determine the number of raw hits
  initially fetched, before host grouping is done.
  </description>
</property>

<property>
  <name>searcher.summary.context</name>
  <value>5</value>
  <description>
  The number of context terms to display preceding and following
  matching terms in a hit summary.
  </description>
</property>

<property>
  <name>searcher.summary.length</name>
  <value>20</value>
  <description>
  The total number of terms to display in a hit summary.
  </description>
</property>

<!-- URL normalizer properties -->

<property>
  <name>urlnormalizer.class</name>
  <value>org.apache.nutch.net.BasicUrlNormalizer</value>
  <description>Name of the class used to normalize URLs.</description>
</property>

<property>
  <name>urlnormalizer.regex.file</name>
  <value>regex-normalize.xml</value>
  <description>Name of the config file used by the RegexUrlNormalizer class.</description></property>

<!-- mime properties -->

<property>
  <name>mime.types.file</name>
  <value>mime-types.xml</value>
  <description>Name of file in CLASSPATH containing filename extension and
  magic sequence to mime types mapping information</description>
</property>

<property>
  <name>mime.type.magic</name>
  <value>true</value>
  <description>Defines if the mime content type detector uses magic resolution.
  </description>
</property>

<!-- ipc properties -->

<property>
  <name>ipc.client.timeout</name>
  <value>10000</value>
  <description>Defines the timeout for IPC calls in milliseconds. </description>
</property>

<!-- plugin properties -->

<property>
  <name>plugin.folders</name>
  <value>plugins</value>
  <description>Directories where nutch plugins are located.  Each
  element may be a relative or absolute path.  If absolute, it is used
  as is.  If relative, it is searched for on the classpath.</description>
</property>

<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>
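As an example of extending the expression, the sketch below adds the parse-pdf plugin to the default list; parse-pdf ships with Nutch releases of this era, but verify that the directory actually exists under your plugin.folders before relying on it:

```xml
<property>
  <name>plugin.includes</name>
  <!-- parse-(text|html) widened to parse-(text|html|pdf) -->
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)</value>
</property>
```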

<property>
  <name>plugin.excludes</name>
  <value></value>
  <description>Regular expression naming plugin directory names to exclude. 
  </description>
</property>

<property>
  <name>parser.character.encoding.default</name>
  <value>windows-1252</value>
  <description>The character encoding to fall back to when no other information
  is available</description>
</property>

<property>
  <name>parser.html.impl</name>
  <value>neko</value>
  <description>HTML Parser implementation. Currently the following keywords
  are recognized: "neko" uses NekoHTML, "tagsoup" uses TagSoup.
  </description>
</property>

<!-- urlfilter plugin properties -->

<property>
  <name>urlfilter.regex.file</name>
  <value>regex-urlfilter.txt</value>
  <description>Name of file on CLASSPATH containing regular expressions
  used by urlfilter-regex (RegexURLFilter) plugin.</description>
</property>

<property>
  <name>urlfilter.prefix.file</name>
  <value>prefix-urlfilter.txt</value>
  <description>Name of file on CLASSPATH containing url prefixes
  used by urlfilter-prefix (PrefixURLFilter) plugin.</description>
</property>

<property>
  <name>urlfilter.order</name>
  <value></value>
  <description>The order by which url filters are applied.
  If empty, all available url filters (as dictated by properties
  plugin-includes and plugin-excludes above) are loaded and applied in system
  defined order. If not empty, only named filters are loaded and applied
  in given order. For example, if this property has value:
  org.apache.nutch.net.RegexURLFilter org.apache.nutch.net.PrefixURLFilter
  then RegexURLFilter is applied first, and PrefixURLFilter second.
  Since all filters are AND'ed, filter ordering does not have impact
  on end result, but it may have performance implication, depending
  on relative expensiveness of filters.
  </description>
</property>

<!-- clustering extension properties -->

<property>
  <name>extension.clustering.hits-to-cluster</name>
  <value>100</value>
  <description>Number of snippets retrieved for the clustering extension
  if clustering extension is available and user requested results
  to be clustered.</description>
</property>

<property>
  <name>extension.clustering.extension-name</name>
  <value></value>
  <description>Use the specified online clustering extension. If empty,
  the first available extension will be used. The "name" here refers to an 'id'
  attribute of the 'implementation' element in the plugin descriptor XML
  file.</description>
</property>

<!-- ontology extension properties -->

<property>
  <name>extension.ontology.extension-name</name>
  <value></value>
  <description>Use the specified online ontology extension. If empty,
  the first available extension will be used. The "name" here refers to an 'id'
  attribute of the 'implementation' element in the plugin descriptor XML
  file.</description>
</property>

<property>
  <name>extension.ontology.urls</name>
  <value>
  </value>
  <description>Urls of owl files, separated by spaces, such as
  http://www.example.com/ontology/time.owl
  http://www.example.com/ontology/space.owl
  http://www.example.com/ontology/wine.owl
  Or
  file:/ontology/time.owl
  file:/ontology/space.owl
  file:/ontology/wine.owl
  You have to make sure each url is valid.
  By default, there is no owl file, so query refinement based on ontology
  is silently ignored.
  </description>
</property>

<!-- query-basic plugin properties -->

<property>
  <name>query.url.boost</name>
  <value>4.0</value>
  <description> Used as a boost for url field in Lucene query.
  </description>
</property>

<property>
  <name>query.anchor.boost</name>
  <value>2.0</value>
  <description> Used as a boost for anchor field in Lucene query.
  </description>
</property>


<property>
  <name>query.title.boost</name>
  <value>1.5</value>
  <description> Used as a boost for title field in Lucene query.
  </description>
</property>

<property>
  <name>query.host.boost</name>
  <value>2.0</value>
  <description> Used as a boost for host field in Lucene query.
  </description>
</property>

<property>
  <name>query.phrase.boost</name>
  <value>1.0</value>
  <description> Used as a boost for phrase in Lucene query.
  Multiplied by boost for field phrase is matched in.
  </description>
</property>

<!-- language-identifier plugin properties -->

<property>
  <name>lang.ngram.min.length</name>
  <value>1</value>
  <description> The minimum size of ngrams used to identify
  the language (must be between 1 and lang.ngram.max.length).
  The larger the range between lang.ngram.min.length and
  lang.ngram.max.length, the better the identification, but
  the slower it is.
  </description>
</property>

<property>
  <name>lang.ngram.max.length</name>
  <value>4</value>
  <description> The maximum size of ngrams used to identify
  the language (must be between lang.ngram.min.length and 4).
  The larger the range between lang.ngram.min.length and
  lang.ngram.max.length, the better the identification, but
  the slower it is.
  </description>
</property>

<property>
  <name>lang.analyze.max.length</name>
  <value>2048</value>
  <description> The maximum number of bytes of data used to identify
  the language (0 means full content analysis).
  The larger this value, the better the analysis, but the
  slower it is.
  </description>
</property>

</nutch-conf>