Heritrix QueueAssignmentPolicy question

Posted: 2008-04-30
Re: [archive-crawler] Extend QueueAssignmentPolicy


Mr. Mohr,

From your response, I see that NicknameQueueAssignmentPolicy will be problematic,
and I now understand why the download speed was initially fast and then very slow.
Your suggestion is that all URIs from the same host should be in one queue.

But my question is: if 20 thousand URLs are downloaded from the same queue
and there is just one active thread, I can't estimate the time needed to finish.

Also, there are several QueueAssignmentPolicy options in the Heritrix web UI, but I don't know
in which cases I should use each one. Could you give me some
advice?

Thanks for your mail; I learned a lot from it.

I'm looking forward to your answer.
 
Best wishes
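
A rough way to approach the timing question above, as a sketch rather than Heritrix's actual scheduling code: assume one queue is fetched serially, and that the pause after each fetch is the fetch duration times delay-factor, clamped between min-delay-ms and max-delay-ms. The values 5.0, 3000, 30000, 30 and 900 come from the order.xml quoted later in this thread. The 200 ms average fetch time, the class name and the method names are assumptions made for illustration.

```java
// Back-of-envelope estimate only; this is not Heritrix code.
class QueueTimeEstimate {

    // Assumed politeness rule: pause = fetch time * delay-factor,
    // clamped to the range [min-delay-ms, max-delay-ms].
    static long delayMs(long fetchMs, double delayFactor,
            long minDelayMs, long maxDelayMs) {
        long scaled = Math.round(fetchMs * delayFactor);
        return Math.max(minDelayMs, Math.min(maxDelayMs, scaled));
    }

    // Hours to drain one queue when its URIs are fetched one at a time.
    static double drainHours(int uriCount, long fetchMs, double delayFactor,
            long minDelayMs, long maxDelayMs) {
        long perUriMs = fetchMs
                + delayMs(fetchMs, delayFactor, minDelayMs, maxDelayMs);
        return uriCount * (double) perUriMs / 3_600_000.0;
    }

    // Worst-case hours one unfetchable URI could hold the head of a queue,
    // assuming it uses every retry at the full retry delay.
    static double stallHours(int maxRetries, long retryDelaySeconds) {
        return maxRetries * (double) retryDelaySeconds / 3600.0;
    }

    public static void main(String[] args) {
        // 20,000 URIs, assumed 200 ms per fetch, settings from order.xml.
        System.out.println(drainHours(20_000, 200, 5.0, 3000, 30000));
        // max-retries = 30, retry-delay-seconds = 900.
        System.out.println(stallHours(30, 900));
    }
}
```

Under these assumptions the 3-second minimum delay dominates, so 20,000 URIs on one host would take somewhat under 18 hours regardless of how many threads are configured, and a single unfetchable URI could stall its queue for up to 7.5 hours.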
 
 


Gordon Mohr <gojomo@...> wrote:
Nick,

From your later message, I assume you succeeded in making your
NicknameQueueAssignmentPolicy appear in the web UI. (FYI, it is not
necessary to edit AbstractFrontier. You can do this just by providing a
whitespace-delimited list of implementation classes in the property
'org.archive.crawler.frontier.AbstractFrontier.queue-assignment-policy'. )

A QueueAssignmentPolicy which splits the content of a single host into
separate queues can be problematic. Heritrix generally assumes that all
URIs from the same host land in the same queue.

A basic form of politeness to others' sites, opening only a single
connection at once, comes naturally from this assumption.

Also, the common need to request certain prerequisite URIs (DNS and
robots fetches) before anything else is accomplished by pushing these at
the top of the same-host queue. Then, they are guaranteed to finish --
succeed or fail -- before other URIs on the same host are tried.

If different URIs from the same host land in different queues, the
prerequisites might be redundantly scheduled or not finished when other
URIs are tried, or a site might be overloaded with traffic via multiple
connections. Only if you are sure the site can handle the load, for
example it is your own site, should you risk generating such traffic.

Regarding your specific policy:

Prior QueueAssignmentPolicies have typically loaded multiple hosts into
a single queue, rather than splitting one host over many queues.

I suspect if you're seeing an initially-fast but then-very-slow effect
with your custom policy, that some of your queues have, as their topmost
items, unfetchable URIs. Certain kinds of failed-fetches go into a
slow-timeout retry-cycle, and while a URI is in this cycle, nothing else
from the same queue will be tried. This is a reasonable approach when
all URIs in a queue are subject to the same network failures, but can
cause problems if the queues are mixed, and the deeper URIs would
succeed quickly, but are stuck behind topmost URIs.

You may wish to change your policy so that no URIs on different hosts
land in the same queue. Still, this may not work, or may create problems
as described above, because your custom policy is operating outside the
assumptions of Heritrix.

- Gordon @ IA
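
One way to read the suggestion above is to make the host part of the queue key, so that a queue never holds URIs from two different hosts. The sketch below is plain Java, not the Heritrix API: the class and method names are hypothetical, and java.net.URI stands in for Heritrix's own URI types. In Heritrix 1.12.1 the equivalent logic would live in getClassKey(CrawlController, CandidateURI), as in the class quoted below.

```java
import java.net.URI;
import java.util.Locale;

// Illustrative only; names here are hypothetical, not Heritrix classes.
class HostScopedQueueKey {

    // Lower-cased host of the URI, or "" when the URI has no host.
    // URI.create throws IllegalArgumentException on malformed input.
    static String hostOf(String uri) {
        String host = URI.create(uri).getHost();
        return host == null ? "" : host.toLowerCase(Locale.ROOT);
    }

    // One queue per host: every URI on a host gets the same key.
    static String hostKey(String uri) {
        return hostOf(uri);
    }

    // Several queues per host, but never two hosts in one queue:
    // the key is the host plus a shard number derived from the full URI.
    static String hostShardKey(String uri, int shards) {
        int shard = Math.floorMod(uri.hashCode(), shards);
        return hostOf(uri) + "#" + shard;
    }
}
```

hostKey matches the usual one-queue-per-host assumption. hostShardKey keeps hosts apart but still splits one host across several queues, so the politeness and prerequisite concerns described above still apply to it.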

nickzwk wrote:
> I wanted to extend QueueAssignmentPolicy,
> so I created my own QueueAssignmentPolicy, NicknameQueueAssignmentPolicy:
> /* NicknameQueueAssignmentPolicy
> *
> * $Id: HostnameQueueAssignmentPolicy.java 3838 2005-09-21 23:00:47Z gojomo $
> *
> * Created on Oct 5, 2004
> *
> * Copyright (C) 2004 Internet Archive.
> *
> * This file is part of the Heritrix web crawler (crawler.archive.org).
> *
> * Heritrix is free software; you can redistribute it and/or modify
> * it under the terms of the GNU Lesser Public License as published by
> * the Free Software Foundation; either version 2.1 of the License, or
> * any later version.
> *
> * Heritrix is distributed in the hope that it will be useful,
> * but WITHOUT ANY WARRANTY; without even the implied warranty of
> * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> * GNU Lesser Public License for more details.
> *
> * You should have received a copy of the GNU Lesser Public License
> * along with Heritrix; if not, write to the Free Software
> * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
> */
> package org.archive.crawler.frontier;
>
> import java.util.logging.Logger;
> import org.archive.crawler.datamodel.CandidateURI;
> import org.archive.crawler.framework.CrawlController;
>
>
> /**
>  * QueueAssignmentPolicy based on a hash of the full URI string,
>  * modulo 100.
>  *
>  * @author nick
>  */
> public class NicknameQueueAssignmentPolicy extends QueueAssignmentPolicy {
>     private static final Logger logger = Logger
>             .getLogger(NicknameQueueAssignmentPolicy.class.getName());
>
>     /**
>      * Returns the queue key: ELF hash of the full URI, modulo 100.
>      */
>     public String getClassKey(CrawlController controller,
>             CandidateURI cauri) {
>         String uri = cauri.getUURI().toString();
>         long hash = ELFHash(uri);
>         return Long.toString(hash % 100);
>     }
>
>     /** ELF string hash; the final mask keeps the result non-negative. */
>     public long ELFHash(String str) {
>         long hash = 0;
>         long x = 0;
>         for (int i = 0; i < str.length(); i++) {
>             hash = (hash << 4) + str.charAt(i);
>             if ((x = hash & 0xF0000000L) != 0) {
>                 hash ^= (x >> 24);
>                 hash &= ~x;
>             }
>         }
>         return (hash & 0x7FFFFFFF);
>     }
> }
>
> My Heritrix version is 1.12.1,
> and I changed AbstractFrontier as below:
>
> String queueStr = System.getProperty(AbstractFrontier.class.getName() +
>     "." + ATTR_QUEUE_ASSIGNMENT_POLICY,
>     NicknameQueueAssignmentPolicy.class.getName() + " " +
>     HostnameQueueAssignmentPolicy.class.getName() + " " +
>     IPQueueAssignmentPolicy.class.getName() + " " +
>     BucketQueueAssignmentPolicy.class.getName() + " " +
>     SurtAuthorityQueueAssignmentPolicy.class.getName());
> Pattern p = Pattern.compile("\\s*,\\s*|\\s+");
> String [] queues = p.split(queueStr);
> if (queues.length <= 0) {
>     throw new RuntimeException("Failed parse of " +
>         " assignment queue policy string: " + queueStr);
> }
> t = addElementToDefinition(new SimpleType(ATTR_QUEUE_ASSIGNMENT_POLICY,
>     "Defines how to assign URIs to queues. Can assign by host, " +
>     "by ip, and into one of a fixed set of buckets (1k).",
>     queues[0], queues));
> t.setExpertSetting(true);
> t.setOverrideable(true);
>
> I also changed heritrix.properties as below:
>
> org.archive.crawler.frontier.AbstractFrontier.queue-assignment-policy = \
> org.archive.crawler.frontier.NicknameQueueAssignmentPolicy \
> org.archive.crawler.frontier.HostnameQueueAssignmentPolicy \
> org.archive.crawler.frontier.IPQueueAssignmentPolicy \
> org.archive.crawler.frontier.BucketQueueAssignmentPolicy \
> org.archive.crawler.frontier.SurtAuthorityQueueAssignmentPolicy
> org.archive.crawler.frontier.BdbFrontier.level = INFO
>
> But after I restarted Heritrix, I found that order.xml did not pick up my
> changes:
> <newObject name="frontier" class="org.archive.crawler.frontier.BdbFrontier">
> <float name="delay-factor">5.0</float>
> <integer name="max-delay-ms">30000</integer>
> <integer name="min-delay-ms">3000</integer>
> <integer name="max-retries">30</integer>
> <long name="retry-delay-seconds">900</long>
> <integer name="preference-embed-hops">1</integer>
> <integer name="total-bandwidth-usage-KB-sec">0</integer>
> <integer name="max-per-host-bandwidth-usage-KB-sec">0</integer>
> <string name="queue-assignment-policy">org.archive.crawler.frontier.HostnameQueueAssignmentPolicy</string>
> <string name="force-queue-assignment"></string>
> <boolean name="pause-at-start">false</boolean>
> <boolean name="pause-at-finish">false</boolean>
> <boolean name="source-tag-seeds">false</boolean>
> <boolean name="recovery-log-enabled">true</boolean>
> <boolean name="hold-queues">true</boolean>
> <integer name="balance-replenish-amount">3000</integer>
> <integer name="error-penalty-amount">100</integer>
> <long name="queue-total-budget">-1</long>
> <string name="cost-policy">org.archive.crawler.frontier.UnitCostAssignmentPolicy</string>
> <long name="snooze-deactivate-ms">300000</long>
> <integer name="target-ready-backlog">50</integer>
> <string name="uri-included-structure">org.archive.crawler.util.BdbUriUniqFilter</string>
> </newObject>
>
> I wonder how I can change the QueueAssignmentPolicy,
> because when crawling one site there is only 1 active thread, which is too slow,
> and I want to speed Heritrix up.
> I'm looking forward to your answer.