
Heritrix QueueAssignmentPolicy Question

Re: [archive-crawler] Extend QueueAssignmentPolicy


Mr.Mohr,
 
From your response, I see that NicknameQueueAssignmentPolicy will be problematic,
and I now understand why the download speed was initially fast and then very slow.
Your suggestion is that all URIs from the same host should go into one queue.
 
But my question is: if there are 20 thousand URLs to download in the same queue
and there is just one active thread, I can't estimate how long it will take to finish.
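
A back-of-the-envelope estimate is possible if one assumes that the wait between two fetches from the same queue is the last fetch time multiplied by delay-factor, clamped to the range between min-delay-ms and max-delay-ms. The sketch below uses the values from the order.xml quoted later in this thread (5.0, 3000 and 30000); the 200 ms fetch time and all class and method names are hypothetical, and none of this is Heritrix code.

```java
// Rough estimate only. Assumption: the per-queue wait equals
// fetch time * delay-factor, clamped to [min-delay-ms, max-delay-ms].
class SingleQueueEstimate {

    // Wait applied after one fetch before the same queue is used again.
    static long politenessDelayMs(long fetchMs, double delayFactor,
                                  long minDelayMs, long maxDelayMs) {
        long delay = Math.round(fetchMs * delayFactor);
        return Math.max(minDelayMs, Math.min(maxDelayMs, delay));
    }

    // Total time when every URI is fetched one after another from one queue.
    static long totalMs(long uriCount, long fetchMs, double delayFactor,
                        long minDelayMs, long maxDelayMs) {
        long perUri = fetchMs
            + politenessDelayMs(fetchMs, delayFactor, minDelayMs, maxDelayMs);
        return uriCount * perUri;
    }

    public static void main(String[] args) {
        // 200 ms per fetch is a made-up figure; measure your own site.
        long total = totalMs(20_000, 200, 5.0, 3_000, 30_000);
        System.out.println("estimated hours: " + total / 3_600_000.0);
    }
}
```

Under these assumptions each URI costs 3200 ms, so 20 thousand URIs take roughly 18 hours. With fast fetches the min-delay-ms floor dominates the total, which suggests the politeness settings, more than the thread count, bound the crawl time of a single host.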
 
Also, there are several QueueAssignmentPolicy options in the Heritrix web UI, but I don't know
which one I should use in which case. Could you give me some
advice?
 
Thanks for your mail; I learned a lot from it.
 
I'm looking forward to your answer.
 
Best wishes
 
 


Gordon Mohr <gojomo@...> wrote:
Nick,

From your later message, I assume you succeeded in making your
NicknameQueueAssignmentPolicy appear in the web UI. (FYI, it is not
necessary to edit AbstractFrontier. You can do this just by providing a
whitespace-delimited list of implementation classes in the property
'org.archive.crawler.frontier.AbstractFrontier.queue-assignment-policy'. )
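
As a minimal sketch of how such a list can be read, the following mirrors the parsing in the AbstractFrontier snippet quoted later in this thread (same property name, same split pattern, first entry used as the default). The class and method names here are hypothetical, and the sketch runs on its own without Heritrix.

```java
import java.util.regex.Pattern;

// Standalone illustration; not part of Heritrix.
class PolicyListProperty {

    static final String KEY =
        "org.archive.crawler.frontier.AbstractFrontier.queue-assignment-policy";

    // Splits on commas or runs of whitespace, as the quoted code does.
    static String[] parse(String value) {
        return Pattern.compile("\\s*,\\s*|\\s+").split(value.trim());
    }

    public static void main(String[] args) {
        // Falls back to a single policy when the property is not set.
        String value = System.getProperty(KEY,
            "org.archive.crawler.frontier.HostnameQueueAssignmentPolicy");
        String[] policies = parse(value);
        System.out.println("default choice: " + policies[0]);
        for (String policy : policies) {
            System.out.println("offered: " + policy);
        }
    }
}
```

Because the first entry becomes the default, a custom class listed first would be preselected for new jobs, while the remaining entries stay available as alternatives.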

A QueueAssignmentPolicy which splits the content of a single host into
separate queues can be problematic. Heritrix generally assumes that all
URIs from the same host land in the same queue.

A basic form of politeness to others' sites, opening only a single
connection at once, comes naturally from this assumption.

Also, the common need to request certain prerequisite URIs (DNS and
robots fetches) before anything else is accomplished by pushing these at
the top of the same-host queue. Then, they are guaranteed to finish --
succeed or fail -- before other URIs on the same host are tried.

If different URIs from the same host land in different queues, the
prerequisites might be redundantly scheduled or not finished when other
URIs are tried, or a site might be overloaded with traffic via multiple
connections. Only if you are sure the site can handle the load, for
example it is your own site, should you risk generating such traffic.
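
To make the difference concrete, here is a small standalone sketch comparing a per-URI key (the ELF hash of the whole URI modulo 100, as in the policy quoted later in this thread) with a per-host key. The per-host key built with java.net.URI is only a stand-in for a hostname-based policy, not the Heritrix implementation, and the example.com URLs are placeholders.

```java
import java.net.URI;

// Standalone illustration; not part of Heritrix.
class QueueKeyDemo {

    // Same arithmetic as the ELFHash method quoted later in this thread.
    static long elfHash(String s) {
        long hash = 0;
        for (int i = 0; i < s.length(); i++) {
            hash = (hash << 4) + s.charAt(i);
            long x = hash & 0xF0000000L;
            if (x != 0) {
                hash ^= (x >> 24);
                hash &= ~x;
            }
        }
        return hash & 0x7FFFFFFF;
    }

    // Key derived from the whole URI: one host spreads over many queues.
    static String perUriKey(String uri) {
        return Long.toString(elfHash(uri) % 100);
    }

    // Key derived from host (and explicit port): one host, one queue.
    static String perHostKey(String uri) {
        URI u = URI.create(uri);
        int port = u.getPort();
        return port == -1 ? u.getHost() : u.getHost() + ":" + port;
    }

    public static void main(String[] args) {
        String a = "http://example.com/page0";
        String b = "http://example.com/page1";
        System.out.println("per-URI keys:  " + perUriKey(a) + " / " + perUriKey(b));
        System.out.println("per-host keys: " + perHostKey(a) + " / " + perHostKey(b));
    }
}
```

The two URLs share a host but receive different per-URI keys, so a robots.txt or DNS prerequisite queued under one key gives no ordering guarantee for URIs queued under the other.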

Regarding your specific policy:

Prior QueueAssignmentPolicies have typically loaded multiple hosts into
a single queue, rather than splitting one host over many queues.

I suspect if you're seeing an initially-fast but then-very-slow effect
with your custom policy, that some of your queues have, as their topmost
items, unfetchable URIs. Certain kinds of failed-fetches go into a
slow-timeout retry-cycle, and while a URI is in this cycle, nothing else
from the same queue will be tried. This is a reasonable approach when
all URIs in a queue are subject to the same network failures, but can
cause problems if the queues are mixed, and the deeper URIs would
succeed quickly, but are stuck behind topmost URIs.
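
For a sense of scale, the sketch below computes how long one unfetchable topmost URI could hold up its queue, using the max-retries (30) and retry-delay-seconds (900) values from the order.xml quoted later in this thread. It assumes the URI uses up every retry and waits the full delay each time, so read it as an upper bound under those assumptions rather than a description of actual frontier behavior; the names are hypothetical.

```java
// Standalone arithmetic; not part of Heritrix.
class RetryBlockEstimate {

    // Assumption: every retry waits the full delay and all retries are used.
    static long worstCaseBlockedSeconds(int maxRetries, long retryDelaySeconds) {
        return (long) maxRetries * retryDelaySeconds;
    }

    public static void main(String[] args) {
        long secs = worstCaseBlockedSeconds(30, 900);
        // prints "27000 s (7.5 h)"
        System.out.println(secs + " s (" + secs / 3600.0 + " h)");
    }
}
```

With only 100 queues shared by many hosts, a handful of such URIs at the top of different queues could plausibly produce the initially-fast, then-very-slow pattern.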

You may wish to change your policy so that no URIs on different hosts
land in the same queue. Still, this may not work, or may create problems
as described above, because your custom policy is operating outside the
assumptions of Heritrix.

- Gordon @ IA

nickzwk wrote:
> I wanted to extend QueueAssignmentPolicy,
> so I created my own QueueAssignmentPolicy, NicknameQueueAssignmentPolicy:
> /* NicknameQueueAssignmentPolicy
> *
> * $Id: HostnameQueueAssignmentPolicy.java 3838 2005-09-21 23:00:47Z gojomo $
> *
> * Created on Oct 5, 2004
> *
> * Copyright (C) 2004 Internet Archive.
> *
> * This file is part of the Heritrix web crawler (crawler.archive.org).
> *
> * Heritrix is free software; you can redistribute it and/or modify
> * it under the terms of the GNU Lesser Public License as published by
> * the Free Software Foundation; either version 2.1 of the License, or
> * any later version.
> *
> * Heritrix is distributed in the hope that it will be useful,
> * but WITHOUT ANY WARRANTY; without even the implied warranty of
> * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> * GNU Lesser Public License for more details.
> *
> * You should have received a copy of the GNU Lesser Public License
> * along with Heritrix; if not, write to the Free Software
> * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
> */
> package org.archive.crawler.frontier;
>
> import java.util.logging.Logger;
> import org.archive.crawler.datamodel.CandidateURI;
> import org.archive.crawler.framework.CrawlController;
>
>
> /**
> * QueueAssignmentPolicy that hashes the full URI string of the given
> * CandidateURI into one of 100 queues.
> *
> * @author nick
> */
> public class NicknameQueueAssignmentPolicy extends QueueAssignmentPolicy {
>     private static final Logger logger = Logger
>         .getLogger(NicknameQueueAssignmentPolicy.class.getName());
>
>     /** Returns the queue key: the ELF hash of the URI string, modulo 100. */
>     public String getClassKey(CrawlController controller, CandidateURI cauri) {
>         String uri = cauri.getUURI().toString();
>         long hash = ELFHash(uri);
>         return Long.toString(hash % 100);
>     }
>
>     /** ELF string hash; the result is masked to a non-negative 31-bit value. */
>     public long ELFHash(String str) {
>         long hash = 0;
>         long x = 0;
>         for (int i = 0; i < str.length(); i++) {
>             hash = (hash << 4) + str.charAt(i);
>             if ((x = hash & 0xF0000000L) != 0) {
>                 hash ^= (x >> 24);
>                 hash &= ~x;
>             }
>         }
>         return (hash & 0x7FFFFFFF);
>     }
> }
>
> My Heritrix version is 1.12.1,
> and I changed AbstractFrontier as below:
>
> String queueStr = System.getProperty(
>     AbstractFrontier.class.getName() + "." + ATTR_QUEUE_ASSIGNMENT_POLICY,
>     NicknameQueueAssignmentPolicy.class.getName() + " " +
>     HostnameQueueAssignmentPolicy.class.getName() + " " +
>     IPQueueAssignmentPolicy.class.getName() + " " +
>     BucketQueueAssignmentPolicy.class.getName() + " " +
>     SurtAuthorityQueueAssignmentPolicy.class.getName());
> Pattern p = Pattern.compile("\\s*,\\s*|\\s+");
> String[] queues = p.split(queueStr);
> if (queues.length <= 0) {
>     throw new RuntimeException("Failed parse of " +
>         " assignment queue policy string: " + queueStr);
> }
> t = addElementToDefinition(new SimpleType(ATTR_QUEUE_ASSIGNMENT_POLICY,
>     "Defines how to assign URIs to queues. Can assign by host, " +
>     "by ip, and into one of a fixed set of buckets (1k).",
>     queues[0], queues));
> t.setExpertSetting(true);
> t.setOverrideable(true);
>
> I also changed heritrix.properties as below:
>
> org.archive.crawler.frontier.AbstractFrontier.queue-assignment-policy = \
> org.archive.crawler.frontier.NicknameQueueAssignmentPolicy \
> org.archive.crawler.frontier.HostnameQueueAssignmentPolicy \
> org.archive.crawler.frontier.IPQueueAssignmentPolicy \
> org.archive.crawler.frontier.BucketQueueAssignmentPolicy \
> org.archive.crawler.frontier.SurtAuthorityQueueAssignmentPolicy
> org.archive.crawler.frontier.BdbFrontier.level = INFO
>
> But when I restarted Heritrix, I found that order.xml did not pick up my
> changes:
> <newObject name="frontier" class="org.archive.crawler.frontier.BdbFrontier">
>   <float name="delay-factor">5.0</float>
>   <integer name="max-delay-ms">30000</integer>
>   <integer name="min-delay-ms">3000</integer>
>   <integer name="max-retries">30</integer>
>   <long name="retry-delay-seconds">900</long>
>   <integer name="preference-embed-hops">1</integer>
>   <integer name="total-bandwidth-usage-KB-sec">0</integer>
>   <integer name="max-per-host-bandwidth-usage-KB-sec">0</integer>
>   <string name="queue-assignment-policy">org.archive.crawler.frontier.HostnameQueueAssignmentPolicy</string>
>   <string name="force-queue-assignment"></string>
>   <boolean name="pause-at-start">false</boolean>
>   <boolean name="pause-at-finish">false</boolean>
>   <boolean name="source-tag-seeds">false</boolean>
>   <boolean name="recovery-log-enabled">true</boolean>
>   <boolean name="hold-queues">true</boolean>
>   <integer name="balance-replenish-amount">3000</integer>
>   <integer name="error-penalty-amount">100</integer>
>   <long name="queue-total-budget">-1</long>
>   <string name="cost-policy">org.archive.crawler.frontier.UnitCostAssignmentPolicy</string>
>   <long name="snooze-deactivate-ms">300000</long>
>   <integer name="target-ready-backlog">50</integer>
>   <string name="uri-included-structure">org.archive.crawler.util.BdbUriUniqFilter</string>
> </newObject>
>
> I wonder how I can change the QueueAssignmentPolicy,
> because when crawling one site there is only 1 active thread, which is too slow,
> and I want to speed Heritrix up.
> I'm looking forward to your answer.
>
>
>
>