A preliminary summary of using Heritrix
http://jason823.iteye.com/blog/84206
http://blog.sina.com.cn/s/blog_4ef8aa560100bxop.html
Heritrix man
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler.
This document explains how to create, configure and run crawls using Heritrix. It is intended for users of the software and presumes that they possess at least a general familiarity with the concept of web crawling.
For a general overview on Heritrix, see An Introduction to Heritrix.
If you want to build Heritrix from source or if you'd like to make contributions and would like to know about contribution conventions, etc., see instead the Developer's Manual.
This chapter will explain how to set up Heritrix.
Because Heritrix is a pure Java program it can (in theory anyway) be run on any platform that has a Java 5.0 VM. However we are only committed to supporting its operation on Linux and so this chapter only covers setup on that platform. Because of this, what follows assumes basic Linux administration skills. Other chapters in the user manual are platform agnostic.
This chapter also only covers installing and running the prepackaged binary distributions of Heritrix. For information about downloading and compiling the source see the Developer's Manual.
The packaged binary can be downloaded from the project's SourceForge home page. Each release comes in four flavors, packaged as .tar.gz or .zip and including source or not.
For installation on Linux get the file heritrix-?.?.?.tar.gz (where ?.?.? is the most recent version number).
The packaged binary comes largely ready to run. Once downloaded it can be untarred into the desired directory.
% tar xfz heritrix-?.?.?.tar.gz
Once you have downloaded and untarred the correct file you can move on to the next step.
The Heritrix crawler is implemented purely in Java. This means that the only true requirement for running it is that you have a JRE installed (Building will require a JDK).
The Heritrix crawler, since release 1.10.0, makes use of Java 5.0 features so your JRE must be at least of a 5.0 (1.5.0+) pedigree.
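The version requirement above can be checked before launching. The following is a minimal sketch and not part of Heritrix: the function name java_version_ok is hypothetical, and it assumes the version string is either of the old "1.x.y" form (for example 1.5.0_22) or of a newer form whose first field is the major version.

```shell
# Hypothetical helper: succeed if a Java version string is 5.0 (1.5.0) or newer.
# Accepts old-style strings such as "1.5.0_22" and newer ones such as "11.0.2".
java_version_ok() {
    major=${1%%.*}            # text before the first dot
    rest=${1#*.}              # text after the first dot
    minor=${rest%%.*}         # second numeric field
    if [ "$major" -gt 1 ]; then
        return 0              # newer numbering scheme: 9, 11, 17, ...
    fi
    [ "$major" -eq 1 ] && [ "$minor" -ge 5 ]
}

# Example use with the version reported by the installed JVM, if there is one.
if command -v java >/dev/null 2>&1; then
    ver=$(java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
    if [ -n "$ver" ] && java_version_ok "$ver"; then
        echo "Java $ver is new enough for Heritrix"
    else
        echo "Java 5.0 or newer is required (found: ${ver:-unknown})" >&2
    fi
fi
```

The check only compares the first two numeric fields, which is all the 5.0 requirement needs.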
We currently include all of the free/open source third-party libraries necessary to run Heritrix in the distribution package. See dependencies for the complete list (licenses for all of the listed libraries are listed in the dependencies section of the raw project.xml at the root of the src download or on SourceForge).
If you do not have Java installed you can download Java from:
- Sun -- java.sun.com
- IBM -- www.ibm.com/java
The default Java heap is 256MB of RAM, which is usually suitable for crawls that range over hundreds of hosts. Assign more of your available RAM to the heap (see Section 2.2.1.3, “JAVA_OPTS” for how) if you are crawling thousands of hosts or experience Java out-of-memory problems.
To run Heritrix, first do the following:
% export HERITRIX_HOME=/PATH/TO/BUILT/HERITRIX
...where $HERITRIX_HOME is the location of your untarred heritrix.?.?.?.tar.gz.
Next run:
% cd $HERITRIX_HOME
% chmod u+x $HERITRIX_HOME/bin/heritrix
% $HERITRIX_HOME/bin/heritrix --help
This should give you usage output like the following:
Usage: heritrix --help
Usage: heritrix --nowui ORDER.XML
Usage: heritrix [--port=#] [--run] [--bind=IP,IP...] --admin=LOGIN:PASSWORD \
[ORDER.XML]
Usage: heritrix [--port=#] --selftest[=TESTNAME]
Version: @VERSION@
Options:
-b,--bind Comma-separated list of IP addresses or hostnames for web
server to listen on. Set to / to listen on all available
network interfaces. Default is 127.0.0.1.
-a,--admin Login and password for web user interface administration.
Required (unless passed via the 'heritrix.cmdline.admin'
system property). Pass value of the form 'LOGIN:PASSWORD'.
-h,--help Prints this message and exits.
-n,--nowui Put heritrix into run mode and begin crawl using ORDER.XML. Do
not put up web user interface.
-p,--port Port to run web user interface on. Default: 8080.
-r,--run Put heritrix into run mode. If ORDER.XML begin crawl.
-s,--selftest Run the integrated selftests. Pass test name to test it only
(Case sensitive: E.g. pass 'Charset' to run charset selftest).
Arguments:
ORDER.XML Crawl order to run.
Launch the crawler with the UI enabled by doing the following:
% $HERITRIX_HOME/bin/heritrix --admin=LOGIN:PASSWORD
This will start up Heritrix, printing out a startup message that looks like the following:
[b116-dyn-60 619] heritrix-0.4.0 > ./bin/heritrix
Tue Feb 10 17:03:01 PST 2004 Starting heritrix...
Tue Feb 10 17:03:05 PST 2004 Heritrix 0.4.0 is running.
Web UI is at: http://b116-dyn-60.archive.org:8080/admin
Login and password: admin/letmein
Note
By default, as of version 1.10.x, Heritrix binds to localhost only. This means that you need to be running Heritrix on the same machine as your browser to access the Heritrix UI. Read about the --bind argument above if you need to access the Heritrix UI over a network.
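As an illustration of the --bind option, the sketch below assembles a launch command that listens on a specific address. The function name heritrix_launch_cmd, the address 192.168.1.10 and the LOGIN:PASSWORD placeholder are assumptions made for illustration; only --bind, --port and --admin come from the usage output. The command is printed rather than executed so that it can be reviewed first.

```shell
# Hypothetical helper: print a Heritrix launch command for a given bind address
# and port. Nothing is executed; the caller decides whether to run the result.
heritrix_launch_cmd() {
    bind_addr=$1
    port=${2:-8080}           # 8080 is the documented default web UI port
    printf '%s --bind=%s --port=%s --admin=%s\n' \
        "${HERITRIX_HOME:-.}/bin/heritrix" "$bind_addr" "$port" "LOGIN:PASSWORD"
}

# Example: listen on one (assumed) LAN address instead of only 127.0.0.1.
heritrix_launch_cmd 192.168.1.10 8080
```

Binding to a non-local address exposes the web UI to the network, so the security considerations discussed later in this chapter apply.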
See Section 3, “Web based user interface” and Section 4, “A quick guide to running your first crawl job” to get your first crawl up and running.
Below are environment variables that affect Heritrix operation.
Set the HERITRIX_HOME environment variable to point at the Heritrix home directory. For example, if you've unpacked Heritrix in your home directory and Heritrix is sitting in the heritrix-1.0.0 directory, you'd set HERITRIX_HOME as follows. Assuming your shell is bash:
% export HERITRIX_HOME=~/heritrix-1.0.0
If you don't set this environment variable, the Heritrix start script makes a guess at the home for Heritrix. It doesn't always guess correctly.
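Because the start script's guess can be wrong, it may help to verify the setting before launching. This is a minimal sketch, not something shipped with Heritrix: the function name check_heritrix_home is hypothetical, and it assumes that an unpacked distribution can be recognised by an executable bin/heritrix script, as used in the commands above.

```shell
# Hypothetical helper: succeed only if the given directory looks like an
# unpacked Heritrix distribution, i.e. it contains an executable bin/heritrix.
check_heritrix_home() {
    if [ -z "$1" ]; then
        echo "HERITRIX_HOME is not set" >&2
        return 1
    fi
    if [ ! -x "$1/bin/heritrix" ]; then
        echo "no executable bin/heritrix under $1" >&2
        return 1
    fi
    return 0
}

# Example use before launching.
if check_heritrix_home "${HERITRIX_HOME:-}"; then
    echo "HERITRIX_HOME looks usable: $HERITRIX_HOME"
fi
```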
The JAVA_HOME environment variable may already exist. It should point to the Java installation on the machine. An example of how this might be set (assuming your shell is bash):
% export JAVA_HOME=/usr/local/java/jre/
Pass options to the Heritrix JVM by populating the JAVA_OPTS environment variable with values. For example, if you want to have Heritrix run with a larger heap, say 512 megs, you could do either of the following (assuming your shell is bash):
% export JAVA_OPTS="-Xmx512M"
% $HERITRIX_HOME/bin/heritrix
Or, you could do it all on the one line as follows:
% JAVA_OPTS="-Xmx512m" $HERITRIX_HOME/bin/heritrix
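When the heap size is chosen by a wrapper script, a small helper can guard against typos in the value. This is only a sketch: the function name heap_opt is hypothetical, and the only thing taken from the text is the -Xmx option with a size in megabytes.

```shell
# Hypothetical helper: print a JVM heap option for a size given in megabytes.
heap_opt() {
    case $1 in
        ''|*[!0-9]*)
            echo "heap size must be a whole number of megabytes" >&2
            return 1 ;;
    esac
    printf '%s\n' "-Xmx${1}m"
}

# Example: request a 512 MB heap, as in the text above.
JAVA_OPTS=$(heap_opt 512)
export JAVA_OPTS
echo "JAVA_OPTS=$JAVA_OPTS"
```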
Below we document the system properties passed on the command-line that can influence Heritrix's behavior. If you are using the /bin/heritrix script to launch Heritrix you may have to edit it to change/set these properties or else pass them as part of JAVA_OPTS.
Set the heritrix.properties property to point at an alternate heritrix.properties file (e.g. -Dheritrix.properties=/tmp/alternate.properties) when you want Heritrix to use a properties file other than that found at conf/heritrix.properties.
Provide an alternate context for the Heritrix admin UI. Usually the admin webapp is mounted on root: i.e. '/'.
Set this property when you want to run the crawler from Eclipse. This property takes no arguments. When this property is set, the conf and webapps directories will be found in their development locations and startup messages will show on the text console (standard out).
Where stdout/stderr are sent, usually heritrix_out.log and passed by the heritrix launch script.
Where to drop Heritrix jobs. Usually empty. Default location is ${HERITRIX_HOME}/jobs.
Specify an alternate configuration directory other than the default $HERITRIX_HOME/conf.
This set of system properties is rarely used. They are for use when Heritrix has NOT been started from the command-line -- e.g. it's been embedded in another application -- and the startup configuration that is usually set by command-line options instead needs to be done via system properties alone.
Value is a colon-delimited string of user name and password for the admin GUI.
If set to true, will prevent the embedded web server crawler control interface from starting up.
If set to a string file path, will use the specified crawl order XML file.
Heritrix has its own trust store at conf/heritrix.cacerts that it uses if the FetcherHTTP is configured to use a trust level other than open (open is the default setting). In the unusual case where you'd like to have Heritrix use an alternate truststore, point at the alternate by supplying the JSSE javax.net.ssl.trustStore property on the command line.
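One way to supply the property is through JAVA_OPTS, as sketched below. The trust store path /tmp/alternate.cacerts is an assumption chosen for illustration; javax.net.ssl.trustStore is the standard JSSE system property named in the text.

```shell
# Assumed location of an alternate trust store; replace with a real path.
ALT_TRUSTSTORE=/tmp/alternate.cacerts

# Pass the standard JSSE property to the Heritrix JVM through JAVA_OPTS.
JAVA_OPTS="-Djavax.net.ssl.trustStore=$ALT_TRUSTSTORE"
export JAVA_OPTS
echo "JAVA_OPTS=$JAVA_OPTS"

# Heritrix would then be started as usual, for example:
#   $HERITRIX_HOME/bin/heritrix --admin=LOGIN:PASSWORD
```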
The Heritrix conf directory includes a file named heritrix.properties. A section of this file specifies the default Heritrix logging configuration. To override these settings, point java.util.logging.config.file at a properties file with an alternate logging configuration. Below we reproduce the default heritrix.properties for reference:
# Basic logging setup; to console, all levels
handlers= java.util.logging.ConsoleHandler
java.util.logging.ConsoleHandler.level= ALL
# Default global logging level: only warnings or higher
.level= WARNING
# currently necessary (?) for standard logs to work
crawl.level= INFO
runtime-errors.level= INFO
uri-errors.level= INFO
progress-statistics.level= INFO
recover.level= INFO
# HttpClient is too chatty... only want to hear about severe problems
org.apache.commons.httpclient.level= SEVERE
Here's an example of how you might specify an override:
% JAVA_OPTS="-Djava.util.logging.config.file=heritrix.properties" \
    ./bin/heritrix --nowui order.xml
Alternatively you could edit the default file.
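A logging override of this kind could be prepared as sketched below. The file name alt-logging.properties and the choice of a temporary directory are assumptions made for illustration; the property keys are taken from the default configuration reproduced above.

```shell
# Write an alternate logging configuration to a temporary directory.
# The file name alt-logging.properties is an arbitrary choice.
LOG_DIR=$(mktemp -d)
LOG_CONF="$LOG_DIR/alt-logging.properties"

cat > "$LOG_CONF" <<'EOF'
# Same handler setup as the default configuration
handlers= java.util.logging.ConsoleHandler
java.util.logging.ConsoleHandler.level= ALL
# Log INFO and above globally instead of only WARNING and above
.level= INFO
# Keep HttpClient quiet, as in the default
org.apache.commons.httpclient.level= SEVERE
EOF

# Point the JVM at the alternate file through JAVA_OPTS.
JAVA_OPTS="-Djava.util.logging.config.file=$LOG_CONF"
export JAVA_OPTS
echo "JAVA_OPTS=$JAVA_OPTS"
```

Keeping the override in a separate file leaves the shipped heritrix.properties untouched, which makes it easy to return to the defaults.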
The crawler is a large and active network application which presents security implications, both local to the machine where it operates, and remotely for machines it contacts.
It is important to recognize that the web UI (discussed in Section 3, “Web based user interface”) and JMX agent (discussed in Section 9.5, “Remote Monitoring and Control”) allow remote control of the crawler process in ways that might potentially disrupt a crawl, change the crawler's behavior, read or write locally-accessible files, and perform or trigger other actions in the Java VM or local machine.
The administrative login and password are currently only a very mild protection against unauthorized access, unless you take additional steps to prevent access to the crawler machine. We strongly recommend some combination of the following practices:
First, use network configuration tools, like a firewall, to only allow trusted remote hosts to contact the web UI and, if applicable, JMX agent ports. (The default web UI port is 8080; JMX is 8849.)
Second, use a strong and unique username/password combination to secure the web UI and JMX agent. However, keep in mind that the default administrative web server uses plain HTTP for access, so these values are susceptible to eavesdropping in transit if network links between your browser and the crawler are compromised. (An upcoming update will change the default to HTTPS.) Also, setting the username/password on the command-line may result in their values being visible to other users of the crawling machine, and they are additionally printed to the console and heritrix_out.log for operator reference.
Third, run the crawler as a user with the minimum privileges necessary for its operation, so that in the event of unauthorized access to the web UI or JMX agent, the potential damage is limited.
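The third practice can be backed by a simple guard in a launch wrapper. This is a sketch only, with a hypothetical function name is_unprivileged; it checks for the root user and nothing else, so it does not replace a properly restricted account.

```shell
# Hypothetical guard: succeed only when the given numeric user id is not root.
is_unprivileged() {
    [ "$1" -ne 0 ]
}

# Example use at the top of a launch wrapper.
if is_unprivileged "$(id -u)"; then
    echo "running as unprivileged user $(id -un)"
else
    echo "refusing to start the crawler as root" >&2
fi
```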
Successful unauthorized access to the web UI or JMX agent could trivially end or corrupt a crawl, or change the crawler's behavior to be a nuisance to other network hosts. By adjusting configuration paths, unauthorized access could potentially delete, corrupt, or replace files accessible to the crawler process, and thus cause more extensive problems on the crawler machine.
Another potential risk is that some worst-case or maliciously-crafted crawled content might, in combination with crawler bugs, disrupt the crawl or other files or operations of the local system. For example, in the past, even without malicious intent, some rich-media content has caused runaway memory use in 3rd-party libraries used by the crawler, resulting in a memory-exhaustion condition that can stop or corrupt a crawl in progress. Similarly, atypical input patterns have at times caused runaway CPU use by crawler link-extraction regular expressions, severely slowing crawls. Crawl operators should monitor their crawls closely and stay informed via the project discussion list and bug database for any newly discovered similar bugs.
3. Web based user interface
After Heritrix has been launched from the command line, the web based user interface (WUI) becomes accessible.
The URI to access the WUI is printed on the text console from which the program was launched (typically http://<host>:8080/admin/).
The WUI is password protected. There is no default login for access; one must be specified using either the '-a'/'--admin' command-line option at startup or by setting the 'heritrix.cmdline.admin' system property. The currently valid username and password combination will be printed out to the console, along with the access URL for the WUI, at startup.
The WUI can be accessed via any web browser. While we've endeavoured to make certain that it functions in all recent browsers, Mozilla 5 or newer is recommended. IE 6 or newer should also work without problems.
The initial login page takes the username/password combination discussed above. Logins will time out after a period of non-use.
Caution
By default, communication with the WUI is not done over an encrypted HTTPS connection! Passwords will be submitted over the network in plain text, so you should take additional steps to protect your crawler administrative interface from unauthorized access, as described in Section 2.3, “Security Considerations”.
4. A quick guide to running your first crawl job
Once you've installed Heritrix and logged into the WUI (see above) you are presented with the web Console page. Near the top there is a row of tabs.
Step 1. Create a job
To create a new job, choose the Jobs tab; this will take you to the Jobs page. Once there you are presented with three options for creating a new job. Select 'With defaults'. This will create a new job based on the default profile (see Section 5.2, “Profile”).
On the screen that comes next you will be asked to supply a name, description and a seed list for the new job.
For a name, supply a short text with no special characters or spaces (except dash and underscore). You can skip the description if you like. In the seeds list, type in the URLs of the sites you are interested in harvesting, one URL to a line.
Creating a job is covered in greater detail in Section 5, “Creating jobs and profiles”.
Step 2. Configure the job
Once you've entered this information you are ready to go to the configuration pages. Click the Modules button in the row of buttons at the bottom of the page.
This will take you to the modules configuration page (more details in Section 6.1, “Modules (Scope, Frontier, and Processors)”). For now we are only interested in the option second from the top named Select crawl scope. It allows you to specify the limits of the crawl. By default it is limited to the domains that your seeds span. This may be suitable for your purposes. If not you can choose a broad scope (not limited to the domains of its seeds) or the more restrictive host scope that limits the crawl to the hosts that its seeds span. For more on scopes refer to Section 6.1.1, “Crawl Scope”.
To change scopes, select the new one from the combobox and click the Change button.
Next turn your attention to the second row of tabs at the top of the page, below the usual tabs. You are currently on the far left tab. Now select the tab called Settings near the middle of the row.
This takes you to the Settings page. It allows you to configure various details of the crawl. Exhaustive coverage of this page can be found in Section 6.3, “Settings”. For now we are only interested in the two settings under http-headers. These are the user-agent and from fields of the HTTP headers in the crawler's requests. You must set them to valid values before a crawl can be run. The default values show in upper case the parts that need replacing. If you have trouble with that, please refer to Section 6.3.1.3, “HTTP headers” for what is regarded as a valid value.
Once you've set the http-headers settings to proper values (and made any other desired changes), you can click the Submit job tab at the far right of the second row of tabs. The crawl job is now configured and ready to run.
Configuring a job is covered in greater detail in Section 6, “Configuring jobs and profiles”.
Step 3. Running the job
Submitted new jobs are placed in a queue of pending jobs. The crawler does not start processing jobs from this queue until the crawler is started. While the crawler is stopped, jobs are simply held.
To start the crawler, click on the Console tab. Once on the Console page, you will find the option Start at the top of the Crawler Status box, just to the right of the indicator of current status. Clicking this option will put the crawler into Crawling Jobs mode, where it will begin crawling the next pending job, such as the job you just created and configured.
The Console will update to display progress information about the on-going crawl. Click the Refresh option (or the top-left Heritrix logo) to update this information.
For more information about running a job see Section 7, “Running a job”.
Detailed information about evaluating the progress of a job can be found in Section 8, “Analysis of jobs”.
5. Creating jobs and profiles
In order to run a crawl a configuration must be created that defines it. In Heritrix such a configuration is called a crawl job.
A crawl job encompasses the configurations needed to run a single crawl. It also contains some additional elements such as file locations, status etc.
Once logged onto the WUI, new jobs can be created by going to the Jobs tab. Once the Jobs page loads, users can create jobs by choosing one of the following three options:
- Based on existing job: This option allows the user to create a job by basing it on any existing job, regardless of whether it has been crawled or not. This can be useful for repeating crawls or recovering a crawl that had problems. (See Section 9.3, “Recovery of Frontier State and recover.gz”.)
- Based on a profile: This option allows the user to create a job by basing it on any existing profile.
- With defaults: This option creates a new crawl job based on the default profile.
Options 1 and 2 will display a list of available options. Initially there are two profiles and no existing jobs.
All crawl jobs are created by basing them on profiles (see Section 5.2, “Profile”) or existing jobs.
Once the proper profile/job has been chosen to base the new job on, a simple page will appear asking for the new job's:
- Name: The name must only contain letters, numbers, dash (-) and underscore (_). No other characters are allowed. This name will be used to identify the crawl in the WUI but it need not be unique. The name cannot be changed later.
- Description: A short description of the job. This is a free-text input box and can be edited later.
- Seeds: The seed URIs to use for the job. This list can be edited later along with the general configurations.
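The naming rule for jobs can be checked ahead of time when names are generated by a script. This is a minimal sketch with a hypothetical function name, valid_job_name; it assumes that "letters" means ASCII letters, which the text does not state explicitly.

```shell
# Hypothetical helper: succeed if a job name is non-empty and uses only
# ASCII letters, digits, dash and underscore.
valid_job_name() {
    case $1 in
        '')                 return 1 ;;   # empty name
        *[!A-Za-z0-9_-]*)   return 1 ;;   # contains a disallowed character
        *)                  return 0 ;;
    esac
}

# Example: one acceptable name and two that break the rule.
for name in weekly-crawl_01 "my crawl" 'news&blogs'; do
    if valid_job_name "$name"; then
        echo "ok:       $name"
    else
        echo "rejected: $name"
    fi
done
```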
Below these input fields there are several buttons. The last one, Submit job, will immediately submit the job and (assuming it is properly configured) it will be ready to run (see Section 7, “Running a job”). The other buttons will take the user to the relevant configuration pages (those are covered in detail in Section 6, “Configuring jobs and profiles”). Once all desired changes have been made to the configuration, click the 'Submit job' tab (usually displayed top and bottom right) to submit it to the list of waiting jobs.
Note
Changes made afterwards to the original jobs or profiles that a new job is based on will not in any way affect the newly created job.
Note
Jobs based on the default profile provided with Heritrix are not ready to run as is. Their HTTP header information must be set to valid values. See Section 6.3.1.3, “HTTP headers” for details.
A profile is a template for a crawl job. It contains all the configurations that a crawl job would, but is not considered to be 'crawlable'. That is, Heritrix will not allow you to directly crawl a profile, only jobs based on profiles. The reason for this is that while profiles may in fact be complete, they may also not be.
A common example is leaving the HTTP headers (user-agent, from) in an illegal state in a profile to force the user to input valid data. This applies to the default profile (named default) that comes with Heritrix. Other examples would be leaving the seeds list empty, not specifying some processors (such as the writer/indexer), etc.
In general there is less error checking of profiles.
To manage profiles, go to the Profiles tab in the WUI. That page will display a list of existing profiles. To create a new profile select the option of creating a "New profile based on it" from the existing profile to use as a template. Much like jobs, profiles can only be created based on other profiles. It is not possible to create profiles based on existing jobs.
The process from there on mirrors the creation of jobs. A page will ask for the new profile's name, description and seeds list. Unlike job names, profile names must be unique among profiles (a job and a profile can share the same name); otherwise the same rules apply.
The user then proceeds to the configuration pages (see Section 6, “Configuring jobs and profiles”) to modify the behavior of the new profile from that of the parent profile.
Note
Even though profiles are based on other profiles, changes made to the original profiles afterwards will not affect the new ones.