Re: [archive-crawler] Extend QueueAssignmentPolicy
Mr. Mohr,
From your response, I understand that NicknameQueueAssignmentPolicy will be problematic,
and I now understand why the download speed is initially fast and then very slow.
Your suggestion is that all URIs from the same host should go into one queue.
But my question is: if there are 20 thousand URLs to download in the same queue,
and there is just one active thread, I can't estimate how long it will take to finish.
Also, there are several QueueAssignmentPolicy options in the Heritrix web UI, but I don't know
in which cases I should use each of them. Could you give me some advice?
Thanks for your mail; I learned a lot from it.
I'm looking forward to your answer.
Best wishes
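One rough way to put a number on the single-queue case: assuming the frontier waits delay-factor times the previous fetch duration between fetches from one queue, bounded by min-delay-ms and max-delay-ms (the values 5.0, 3000 and 30000 appear in the order.xml quoted later in this thread), the total is about the URI count times the fetch time plus the wait. The sketch below is plain Java, not Heritrix code; the 500 ms average fetch time and every class and method name are assumptions made for illustration.

```java
import java.time.Duration;

class QueueTimeEstimate {
    // Wait after a fetch: delayFactor times the fetch duration,
    // clamped to the range [minDelayMs, maxDelayMs].
    static long politenessDelayMs(long fetchMs, double delayFactor,
                                  long minDelayMs, long maxDelayMs) {
        long scaled = Math.round(fetchMs * delayFactor);
        return Math.max(minDelayMs, Math.min(maxDelayMs, scaled));
    }

    // Approximate time to drain one queue that is served by a single connection.
    static Duration estimate(int uriCount, long fetchMs, double delayFactor,
                             long minDelayMs, long maxDelayMs) {
        long perUriMs = fetchMs
            + politenessDelayMs(fetchMs, delayFactor, minDelayMs, maxDelayMs);
        return Duration.ofMillis(perUriMs * uriCount);
    }

    public static void main(String[] args) {
        // 20,000 URIs with an assumed 500 ms average fetch time.
        System.out.println(estimate(20_000, 500, 5.0, 3_000, 30_000)); // prints PT19H26M40S
    }
}
```

Under these assumptions the 3-second minimum delay dominates, so lowering min-delay-ms (only for a site that can take the load) shortens the crawl far more than adding threads would.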
Gordon Mohr <gojomo@...> wrote:
Nick,
From your later message, I assume you succeeded in making your
NicknameQueueAssignmentPolicy appear in the web UI. (FYI, it is not
necessary to edit AbstractFrontier. You can do this just by providing a
whitespace-delimited list of implementation classes in the property
'org.archive.crawler.frontier.AbstractFrontier.queue-assignment-policy'.)
A QueueAssignmentPolicy which splits the content of a single host into
separate queues can be problematic. Heritrix generally assumes that all
URIs from the same host land in the same queue.
A basic form of politeness to others' sites, opening only a single
connection at once, comes naturally from this assumption.
Also, the common need to request certain prerequisite URIs (DNS and
robots fetches) before anything else is accomplished by pushing these at
the top of the same-host queue. Then, they are guaranteed to finish --
succeed or fail -- before other URIs on the same host are tried.
If different URIs from the same host land in different queues, the
prerequisites might be redundantly scheduled or not finished when other
URIs are tried, or a site might be overloaded with traffic via multiple
connections. Only if you are sure the site can handle the load, for
example it is your own site, should you risk generating such traffic.
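To illustrate the same-host assumption, here is a minimal sketch of a queue key derived only from the host and port, so that every URI on one host maps to one queue. It is standalone Java using java.net.URI, not the Heritrix HostnameQueueAssignmentPolicy source; the class name, the "host:port" key format and the "default" fallback are assumptions.

```java
import java.net.URI;
import java.util.Locale;

class HostQueueKey {
    // Returns "host" or "host:port"; the path and query never affect the key,
    // so all URIs on one host share one queue.
    static String classKey(String uri) {
        URI u = URI.create(uri);
        String host = u.getHost();
        if (host == null) {
            return "default"; // hypothetical fallback for URIs without a host
        }
        host = host.toLowerCase(Locale.ROOT);
        int port = u.getPort(); // -1 when the URI names no explicit port
        return port == -1 ? host : host + ":" + port;
    }

    public static void main(String[] args) {
        System.out.println(classKey("http://Example.com/a/b.html"));  // prints example.com
        System.out.println(classKey("http://example.com:8080/c?x=1")); // prints example.com:8080
    }
}
```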
Regarding your specific policy:
Prior QueueAssignmentPolicies have typically loaded multiple hosts into
a single queue, rather than splitting one host over many queues.
I suspect if you're seeing an initially-fast but then-very-slow effect
with your custom policy, that some of your queues have, as their topmost
items, unfetchable URIs. Certain kinds of failed-fetches go into a
slow-timeout retry-cycle, and while a URI is in this cycle, nothing else
from the same queue will be tried. This is a reasonable approach when
all URIs in a queue are subject to the same network failures, but can
cause problems if the queues are mixed, and the deeper URIs would
succeed quickly, but are stuck behind topmost URIs.
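For a sense of scale, the order.xml quoted below sets max-retries to 30 and retry-delay-seconds to 900. Assuming an unfetchable URI uses every retry and waits the full delay each time (an assumption, since not every failure type follows this cycle), a rough upper bound on how long it can hold up its queue is the product of the two. The class and method names here are hypothetical.

```java
import java.time.Duration;

class RetryBlockEstimate {
    // Upper bound on how long one failing URI can block the queue behind it,
    // assuming every retry is used and each waits the full delay.
    static Duration worstCaseBlock(int maxRetries, long retryDelaySeconds) {
        return Duration.ofSeconds(maxRetries * retryDelaySeconds);
    }

    public static void main(String[] args) {
        System.out.println(worstCaseBlock(30, 900)); // prints PT7H30M
    }
}
```

With 100 mixed queues, a handful of such URIs at the top of their queues would be enough to produce the fast-then-slow pattern.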
You may wish to change your policy so that no URIs on different hosts
land in the same queue. Still, this may not work, or may create problems
as described above, because your custom policy is operating outside the
assumptions of Heritrix.
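A sketch of such a changed policy, under the same caveats: the key combines the host with a small shard number, so one host is spread over a few queues but two hosts never share a queue. This is standalone Java, not a Heritrix QueueAssignmentPolicy subclass; the class name, the "host#shard" key format and the shardsPerHost parameter are all assumptions, and the prerequisite and multiple-connection problems described above still apply.

```java
import java.net.URI;
import java.util.Locale;

class HostShardKey {
    // Key is "host#shard": the host part keeps different hosts apart,
    // the shard part spreads one host's URIs over shardsPerHost queues.
    static String classKey(String uri, int shardsPerHost) {
        String host = URI.create(uri).getHost();
        host = (host == null) ? "unknown" : host.toLowerCase(Locale.ROOT);
        // floorMod keeps the shard non-negative even for negative hash codes.
        int shard = Math.floorMod(uri.hashCode(), shardsPerHost);
        return host + "#" + shard;
    }

    public static void main(String[] args) {
        System.out.println(classKey("http://a.example/page1", 4));
        System.out.println(classKey("http://b.example/page1", 4));
    }
}
```

Keeping shardsPerHost small limits how many simultaneous connections a single site can receive.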
- Gordon @ IA
nickzwk wrote:
> I wanted to extend QueueAssignmentPolicy, so I created my own policy,
> NicknameQueueAssignmentPolicy:
> /* NicknameQueueAssignmentPolicy
> *
> * $Id: HostnameQueueAssignmentPolicy.java 3838 2005-09-21 23:00:47Z gojomo $
> *
> * Created on Oct 5, 2004
> *
> * Copyright (C) 2004 Internet Archive.
> *
> * This file is part of the Heritrix web crawler (crawler.archive.org).
> *
> * Heritrix is free software; you can redistribute it and/or modify
> * it under the terms of the GNU Lesser Public License as published by
> * the Free Software Foundation; either version 2.1 of the License, or
> * any later version.
> *
> * Heritrix is distributed in the hope that it will be useful,
> * but WITHOUT ANY WARRANTY; without even the implied warranty of
> * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> * GNU Lesser Public License for more details.
> *
> * You should have received a copy of the GNU Lesser Public License
> * along with Heritrix; if not, write to the Free Software
> * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
> */
> package org.archive.crawler.frontier;
>
> import java.util.logging.Logger;
> import org.archive.crawler.datamodel.CandidateURI;
> import org.archive.crawler.framework.CrawlController;
>
> /**
>  * QueueAssignmentPolicy that hashes the full URI string into one of
>  * 100 queues, so URIs from the same host can land in different queues.
>  *
>  * @author nick
>  */
> public class NicknameQueueAssignmentPolicy extends QueueAssignmentPolicy {
>     private static final Logger logger = Logger
>         .getLogger(NicknameQueueAssignmentPolicy.class.getName());
>
>     /** Returns the queue key: the ELF hash of the URI, modulo 100. */
>     public String getClassKey(CrawlController controller, CandidateURI cauri) {
>         String uri = cauri.getUURI().toString();
>         long hash = ELFHash(uri);
>         String a = Long.toString(hash % 100);
>         return a;
>     }
>
>     /** ELF hash of the string, masked to a non-negative 31-bit value. */
>     public long ELFHash(String str) {
>         long hash = 0;
>         long x = 0;
>         for (int i = 0; i < str.length(); i++) {
>             hash = (hash << 4) + str.charAt(i);
>             if ((x = hash & 0xF0000000L) != 0) {
>                 hash ^= (x >> 24);
>                 hash &= ~x;
>             }
>         }
>         return (hash & 0x7FFFFFFF);
>     }
> }
>
> My Heritrix version is 1.12.1, and I changed AbstractFrontier as below:
>
> String queueStr = System.getProperty(AbstractFrontier.class.getName() +
>     "." + ATTR_QUEUE_ASSIGNMENT_POLICY,
>     NicknameQueueAssignmentPolicy.class.getName() + " " +
>     HostnameQueueAssignmentPolicy.class.getName() + " " +
>     IPQueueAssignmentPolicy.class.getName() + " " +
>     BucketQueueAssignmentPolicy.class.getName() + " " +
>     SurtAuthorityQueueAssignmentPolicy.class.getName());
> Pattern p = Pattern.compile("\\s*,\\s*|\\s+");
> String [] queues = p.split(queueStr);
> if (queues.length <= 0) {
>     throw new RuntimeException("Failed parse of " +
>         " assignment queue policy string: " + queueStr);
> }
> t = addElementToDefinition(new SimpleType(ATTR_QUEUE_ASSIGNMENT_POLICY,
>     "Defines how to assign URIs to queues. Can assign by host, " +
>     "by ip, and into one of a fixed set of buckets (1k).",
>     queues[0], queues));
> t.setExpertSetting(true);
> t.setOverrideable(true);
>
> I also changed heritrix.properties as below:
>
> org.archive.crawler.frontier.AbstractFrontier.queue-assignment-policy = \
> org.archive.crawler.frontier.NicknameQueueAssignmentPolicy \
> org.archive.crawler.frontier.HostnameQueueAssignmentPolicy \
> org.archive.crawler.frontier.IPQueueAssignmentPolicy \
> org.archive.crawler.frontier.BucketQueueAssignmentPolicy \
> org.archive.crawler.frontier.SurtAuthorityQueueAssignmentPolicy
> org.archive.crawler.frontier.BdbFrontier.level = INFO
>
> But when I restarted Heritrix, I found that order.xml had not picked up my
> changes:
> <newObject name="frontier"
> class="org.archive.crawler.frontier.BdbFrontier">
> <float name="delay-factor">5.0</float>
> <integer name="max-delay-ms">30000</integer>
> <integer name="min-delay-ms">3000</integer>
> <integer name="max-retries">30</integer>
> <long name="retry-delay-seconds">900</long>
> <integer name="preference-embed-hops">1</integer>
> <integer name="total-bandwidth-usage-KB-sec">0</integer>
> <integer name="max-per-host-bandwidth-usage-KB-sec">0</integer>
> <string name="queue-assignment-policy">org.archive.crawler.frontier.HostnameQueueAssignmentPolicy</string>
> <string name="force-queue-assignment"></string>
> <boolean name="pause-at-start">false</boolean>
> <boolean name="pause-at-finish">false</boolean>
> <boolean name="source-tag-seeds">false</boolean>
> <boolean name="recovery-log-enabled">true</boolean>
> <boolean name="hold-queues">true</boolean>
> <integer name="balance-replenish-amount">3000</integer>
> <integer name="error-penalty-amount">100</integer>
> <long name="queue-total-budget">-1</long>
> <string name="cost-policy">org.archive.crawler.frontier.UnitCostAssignmentPolicy</string>
> <long name="snooze-deactivate-ms">300000</long>
> <integer name="target-ready-backlog">50</integer>
> <string name="uri-included-structure">org.archive.crawler.util.BdbUriUniqFilter</string>
> </newObject>
>
> I wonder how I can change the QueueAssignmentPolicy, because when I crawl
> one site there is only 1 active thread, which is too slow, and I want to
> speed Heritrix up.
> I'm looking forward to the answer.