Crawling with Heritrix requires three files: order.xml, seeds.txt, and state.job
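I used to configure order.xml through the web UI. Now that the crawl fetches the files I want, the generated order.xml can be reused directly with a few modifications. The order.xml is as follows: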
<?xml version="1.0" encoding="UTF-8"?><crawl-order xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="heritrix_settings.xsd"> <meta> <name>personstock</name> <description>hexunstockInformation</description> <operator>Admin</operator> <organization/> <audience/> <date>20110718194533</date> </meta> <controller> <string name="settings-directory">settings</string> <string name="disk-path"/> <string name="logs-path">logs</string> <string name="checkpoints-path">checkpoints</string> <string name="state-path">state</string> <string name="scratch-path">scratch</string> <long name="max-bytes-download">0</long> <long name="max-document-download">0</long> <long name="max-time-sec">0</long> <integer name="max-toe-threads">50</integer> <integer name="recorder-out-buffer-bytes">4096</integer> <integer name="recorder-in-buffer-bytes">65536</integer> <integer name="bdb-cache-percent">0</integer> <newObject name="scope" class="org.archive.crawler.deciderules.DecidingScope"> <boolean name="enabled">true</boolean> <string name="seedsfile">seeds.txt</string> <boolean name="reread-seeds-on-config">true</boolean> <newObject name="decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence"> <map name="rules"> <newObject name="rejectByDefault" class="org.archive.crawler.deciderules.RejectDecideRule"> </newObject> <newObject name="acceptIfSurtPrefixed" class="org.archive.crawler.deciderules.SurtPrefixedDecideRule"> <string name="decision">ACCEPT</string> <string name="surts-source-file"/> <boolean name="seeds-as-surt-prefixes">true</boolean> <string name="surts-dump-file"/> <boolean name="also-check-via">false</boolean> <boolean name="rebuild-on-reconfig">true</boolean> </newObject> <newObject name="rejectIfTooManyHops" class="org.archive.crawler.deciderules.TooManyHopsDecideRule"> <integer name="max-hops">20</integer> </newObject> <newObject name="rejectIfPathological" class="org.archive.crawler.deciderules.PathologicalPathDecideRule"> 
<integer name="max-repetitions">2</integer> </newObject> <newObject name="rejectIfTooManyPathSegs" class="org.archive.crawler.deciderules.TooManyPathSegmentsDecideRule"> <integer name="max-path-depth">20</integer> </newObject> <newObject name="acceptIfPrerequisite" class="org.archive.crawler.deciderules.PrerequisiteAcceptDecideRule"> </newObject> </map> </newObject> </newObject> <map name="http-headers"> <string name="user-agent">Mozilla/5.0 (compatible; heritrix/1.14.4 +http://192.168.111.200)</string> <string name="from">test@test.com</string> </map> <newObject name="robots-honoring-policy" class="org.archive.crawler.datamodel.RobotsHonoringPolicy"> <string name="type">classic</string> <boolean name="masquerade">false</boolean> <text name="custom-robots"/> <stringList name="user-agents"> </stringList> </newObject> <newObject name="frontier" class="org.archive.crawler.frontier.BdbFrontier"> <float name="delay-factor">4.0</float> <integer name="max-delay-ms">20000</integer> <integer name="min-delay-ms">2000</integer> <integer name="respect-crawl-delay-up-to-secs">300</integer> <integer name="max-retries">30</integer> <long name="retry-delay-seconds">900</long> <integer name="preference-embed-hops">1</integer> <integer name="total-bandwidth-usage-KB-sec">0</integer> <integer name="max-per-host-bandwidth-usage-KB-sec">0</integer> <string name="queue-assignment-policy">org.archive.crawler.frontier.HostnameQueueAssignmentPolicy</string> <string name="force-queue-assignment"/> <boolean name="pause-at-start">false</boolean> <boolean name="pause-at-finish">false</boolean> <boolean name="source-tag-seeds">false</boolean> <boolean name="recovery-log-enabled">true</boolean> <boolean name="hold-queues">true</boolean> <integer name="balance-replenish-amount">3000</integer> <integer name="error-penalty-amount">100</integer> <long name="queue-total-budget">-1</long> <string name="cost-policy">org.archive.crawler.frontier.ZeroCostAssignmentPolicy</string> <long 
name="snooze-deactivate-ms">300000</long> <integer name="target-ready-backlog">50</integer> <string name="uri-included-structure">org.archive.crawler.util.BdbUriUniqFilter</string> <boolean name="dump-pending-at-close">false</boolean> </newObject> <map name="uri-canonicalization-rules"> <newObject name="Lowercase" class="org.archive.crawler.url.canonicalize.LowercaseRule"> <boolean name="enabled">true</boolean> </newObject> <newObject name="Userinfo" class="org.archive.crawler.url.canonicalize.StripUserinfoRule"> <boolean name="enabled">true</boolean> </newObject> <newObject name="WWW[0-9]*" class="org.archive.crawler.url.canonicalize.StripWWWNRule"> <boolean name="enabled">true</boolean> </newObject> <newObject name="SessionIDs" class="org.archive.crawler.url.canonicalize.StripSessionIDs"> <boolean name="enabled">true</boolean> </newObject> <newObject name="SessionCFIDs" class="org.archive.crawler.url.canonicalize.StripSessionCFIDs"> <boolean name="enabled">true</boolean> </newObject> <newObject name="QueryStrPrefix" class="org.archive.crawler.url.canonicalize.FixupQueryStr"> <boolean name="enabled">true</boolean> </newObject> </map> <map name="pre-fetch-processors"> <newObject name="Preselector" class="org.archive.crawler.prefetch.Preselector"> <boolean name="enabled">true</boolean> <newObject name="Preselector#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence"> <map name="rules"> </map> </newObject> <boolean name="override-logger">false</boolean> <boolean name="recheck-scope">true</boolean> <boolean name="block-all">false</boolean> <string name="block-by-regexp"/> <string name="allow-by-regexp"/> </newObject> <newObject name="Preprocessor" class="org.archive.crawler.prefetch.PreconditionEnforcer"> <boolean name="enabled">true</boolean> <newObject name="Preprocessor#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence"> <map name="rules"> </map> </newObject> <integer name="ip-validity-duration-seconds">21600</integer> 
<integer name="robot-validity-duration-seconds">86400</integer> <boolean name="calculate-robots-only">false</boolean> </newObject> </map> <map name="fetch-processors"> <newObject name="DNS" class="org.archive.crawler.fetcher.FetchDNS"> <boolean name="enabled">true</boolean> <newObject name="DNS#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence"> <map name="rules"> </map> </newObject> <boolean name="accept-non-dns-resolves">false</boolean> <boolean name="digest-content">true</boolean> <string name="digest-algorithm">sha1</string> </newObject> <newObject name="HTTP" class="org.archive.crawler.fetcher.FetchHTTP"> <boolean name="enabled">true</boolean> <newObject name="HTTP#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence"> <map name="rules"> </map> </newObject> <newObject name="midfetch-decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence"> <map name="rules"> </map> </newObject> <integer name="timeout-seconds">1200</integer> <integer name="sotimeout-ms">20000</integer> <integer name="fetch-bandwidth">0</integer> <long name="max-length-bytes">0</long> <boolean name="ignore-cookies">false</boolean> <boolean name="use-bdb-for-cookies">true</boolean> <string name="load-cookies-from-file"/> <string name="save-cookies-to-file"/> <string name="trust-level">open</string> <stringList name="accept-headers"> </stringList> <string name="http-proxy-host"/> <string name="http-proxy-port"/> <string name="default-encoding">ISO-8859-1</string> <boolean name="digest-content">true</boolean> <string name="digest-algorithm">sha1</string> <boolean name="send-if-modified-since">true</boolean> <boolean name="send-if-none-match">true</boolean> <boolean name="send-connection-close">true</boolean> <boolean name="send-referer">true</boolean> <boolean name="send-range">false</boolean> <string name="http-bind-address"/> </newObject> </map> <map name="extract-processors"> <newObject name="ExtractorHTTP" 
class="org.archive.crawler.extractor.ExtractorHTTP"> <boolean name="enabled">true</boolean> <newObject name="ExtractorHTTP#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence"> <map name="rules"> </map> </newObject> </newObject> <newObject name="ExtractorHTML" class="org.archive.crawler.extractor.ExtractorHTML"> <boolean name="enabled">true</boolean> <newObject name="ExtractorHTML#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence"> <map name="rules"> </map> </newObject> <boolean name="extract-javascript">true</boolean> <boolean name="treat-frames-as-embed-links">true</boolean> <boolean name="ignore-form-action-urls">false</boolean> <boolean name="extract-only-form-gets">true</boolean> <boolean name="extract-value-attributes">true</boolean> <boolean name="ignore-unexpected-html">true</boolean> </newObject> <newObject name="ExtractorCSS" class="org.archive.crawler.extractor.ExtractorCSS"> <boolean name="enabled">true</boolean> <newObject name="ExtractorCSS#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence"> <map name="rules"> </map> </newObject> </newObject> <newObject name="ExtractorJS" class="org.archive.crawler.extractor.ExtractorJS"> <boolean name="enabled">true</boolean> <newObject name="ExtractorJS#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence"> <map name="rules"> </map> </newObject> </newObject> <newObject name="ExtractorSWF" class="org.archive.crawler.extractor.ExtractorSWF"> <boolean name="enabled">true</boolean> <newObject name="ExtractorSWF#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence"> <map name="rules"> </map> </newObject> </newObject> </map> <map name="write-processors"> <newObject name="MirrorWriter" class="org.archive.crawler.writer.MirrorWriterProcessor"> <boolean name="enabled">true</boolean> <newObject name="MirrorWriter#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence"> <map name="rules"> </map> </newObject> 
<boolean name="case-sensitive">true</boolean> <stringList name="character-map"> </stringList> <stringList name="content-type-map"> </stringList> <string name="directory-file">index.html</string> <string name="dot-begin">%2E</string> <string name="dot-end">.</string> <stringList name="host-map"> </stringList> <boolean name="host-directory">true</boolean> <string name="path">mirror</string> <integer name="max-path-length">1023</integer> <integer name="max-segment-length">255</integer> <boolean name="port-directory">false</boolean> <boolean name="suffix-at-end">true</boolean> <string name="too-long-directory">LONG</string> <stringList name="underscore-set"> </stringList> </newObject> </map> <map name="post-processors"> <newObject name="Updater" class="org.archive.crawler.postprocessor.CrawlStateUpdater"> <boolean name="enabled">true</boolean> <newObject name="Updater#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence"> <map name="rules"> </map> </newObject> </newObject> <newObject name="LinksScoper" class="org.archive.crawler.postprocessor.LinksScoper"> <boolean name="enabled">true</boolean> <newObject name="LinksScoper#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence"> <map name="rules"> </map> </newObject> <boolean name="override-logger">false</boolean> <boolean name="seed-redirects-new-seed">true</boolean> <integer name="preference-depth-hops">-1</integer> <newObject name="scope-rejected-url-rules" class="org.archive.crawler.deciderules.DecideRuleSequence"> <map name="rules"> </map> </newObject> </newObject> <newObject name="FrontierSchedulerForHexunStockNews" class="my.FrontierSchedulerForHexunStockNews"> <boolean name="enabled">true</boolean> <newObject name="FrontierSchedulerForHexunStockNews#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence"> <map name="rules"> </map> </newObject> </newObject> </map> <map name="loggers"> <newObject name="crawl-statistics" 
class="org.archive.crawler.admin.StatisticsTracker"> <integer name="interval-seconds">20</integer> </newObject> </map> <string name="recover-path"/> <boolean name="checkpoint-copy-bdbje-logs">true</boolean> <boolean name="recover-retain-failures">false</boolean> <boolean name="recover-scope-includes">true</boolean> <boolean name="recover-scope-enqueues">true</boolean> <newObject name="credential-store" class="org.archive.crawler.datamodel.CredentialStore"> <map name="credentials"> </map> </newObject> </controller> </crawl-order>
The content of seeds.txt is http://stock.hexun.com/
The content of state.job is:
20110718194533 hexunstock Pending false true 2 0 order.xml
The status must be Pending, or the job will not be crawled.
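The Pending requirement can be checked before starting a crawl. The following is a minimal sketch, assuming the status is the third whitespace-separated field of state.job, as the sample line above suggests. `StateJobCheck` and `isPending` are hypothetical names used only for illustration and are not part of Heritrix.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of Heritrix.
final class StateJobCheck {

    // Assumption: the status is the third field (index 2), as in the sample
    // "20110718194533 hexunstock Pending false true 2 0 order.xml".
    static final int STATUS_INDEX = 2;

    // Returns true when the third whitespace-separated field equals "Pending".
    // Splitting on any whitespace also handles one-field-per-line files.
    static boolean isPending(String stateJobContent) {
        List<String> fields = Arrays.asList(stateJobContent.trim().split("\\s+"));
        return fields.size() > STATUS_INDEX
                && "Pending".equals(fields.get(STATUS_INDEX));
    }

    public static void main(String[] args) {
        String sample = "20110718194533 hexunstock Pending false true 2 0 order.xml";
        System.out.println("Pending? " + isPending(sample));
    }
}
```

In practice the string would be read from the job's state.job file rather than hard-coded.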
Launcher class MainHeritrix
package my;

import java.io.File;

import javax.management.InvalidAttributeValueException;

import org.archive.crawler.event.CrawlStatusListener;
import org.archive.crawler.framework.CrawlController;
import org.archive.crawler.framework.exceptions.InitializationException;
import org.archive.crawler.settings.XMLSettingsHandler;

public class MainHeritrix {

    public static void main(String[] args) {
        // Absolute path to the job's order.xml; adjust it to your own workspace.
        String orderFile = "D:\\Workspaces\\MyEclipse 8.5\\MyHeritrix\\jobs\\personstock-20110718194533\\order.xml";
        // Optional status listener; it is left null here, so none is registered.
        CrawlStatusListener listener = null;

        try {
            // Load the crawl settings from order.xml.
            XMLSettingsHandler handler = new XMLSettingsHandler(new File(orderFile));
            handler.initialize();

            // Build the controller from those settings and request the crawl start.
            CrawlController controller = new CrawlController();
            controller.initialize(handler);
            if (listener != null) {
                controller.addCrawlStatusListener(listener);
            }
            controller.requestCrawlStart();

            // Poll once per second until the controller reports it is no longer running.
            while (controller.isRunning()) {
                Thread.sleep(1000);
                System.out.println("The current thread is:" + Thread.currentThread());
            }
        } catch (InvalidAttributeValueException e) {
            e.printStackTrace();
        } catch (InitializationException e) {
            e.printStackTrace();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}
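Because the launcher points at a fixed job directory, it can help to confirm that the three files named in this post are present before starting. The following is a minimal sketch; `JobFilesCheck` and `missingFiles` are hypothetical names, and it assumes all three files sit directly in the job directory.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical pre-flight check for illustration; not part of Heritrix.
final class JobFilesCheck {

    // The three files this post says a crawl job needs.
    static final String[] REQUIRED = {"order.xml", "seeds.txt", "state.job"};

    // Returns the names of required files that are absent from jobDir.
    static List<String> missingFiles(File jobDir) {
        List<String> missing = new ArrayList<String>();
        for (String name : REQUIRED) {
            if (!new File(jobDir, name).isFile()) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Assumption: the job directory is passed as the first argument.
        File jobDir = new File(args.length > 0 ? args[0] : ".");
        List<String> missing = missingFiles(jobDir);
        System.out.println(missing.isEmpty()
                ? "All job files present"
                : "Missing: " + missing);
    }
}
```

A launcher could call `missingFiles` on the parent directory of order.xml and stop early if the returned list is not empty.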
Extension class FrontierSchedulerForHexunStockNews
Code:
package my;

import java.util.logging.Logger;
import java.util.regex.Pattern;

import org.archive.crawler.datamodel.CandidateURI;
import org.archive.crawler.postprocessor.FrontierScheduler;

public class FrontierSchedulerForHexunStockNews extends FrontierScheduler {

    private static final long serialVersionUID = 1L;

    private static final Logger logger =
            Logger.getLogger(FrontierSchedulerForHexunStockNews.class.getName());

    // News article pages; the dots are escaped so they match literally.
    private static final Pattern ARTICLE_PATTERN = Pattern.compile(
            "http://stock\\.hexun\\.com/\\d+-\\d+-\\d+/\\d+\\.html");

    public FrontierSchedulerForHexunStockNews(String name) {
        super(name);
    }

    // Schedule only article pages plus the DNS and robots.txt prerequisites;
    // every other candidate URI is dropped.
    protected void schedule(CandidateURI cdUri) {
        String url = cdUri.toString();
        try {
            if (ARTICLE_PATTERN.matcher(url).matches()
                    || url.indexOf("dns:stock.hexun.com") != -1
                    || url.indexOf("robots.txt") != -1) {
                System.out.println("url: " + url);
                getController().getFrontier().schedule(cdUri);
            }
        } catch (Exception e) {
            logger.info(e.getMessage());
        }
    }
}
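The filtering rule inside schedule can be tried without a running crawl. The following is a minimal sketch that reproduces the same three conditions as a standalone predicate, with the dots in the pattern escaped so that they match literally. `HexunUrlFilterDemo` and `shouldSchedule` are hypothetical names, and the sample article URL is made up for illustration.

```java
import java.util.regex.Pattern;

// Standalone illustration of the URL filter; it does not depend on Heritrix.
final class HexunUrlFilterDemo {

    // Article pages of the form http://stock.hexun.com/<digits>-<digits>-<digits>/<digits>.html
    private static final Pattern ARTICLE = Pattern.compile(
            "http://stock\\.hexun\\.com/\\d+-\\d+-\\d+/\\d+\\.html");

    // True for article pages and for the DNS and robots.txt prerequisites.
    static boolean shouldSchedule(String url) {
        return ARTICLE.matcher(url).matches()
                || url.contains("dns:stock.hexun.com")
                || url.contains("robots.txt");
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://stock.hexun.com/2011-07-18/131500000.html", // made-up article URL
            "http://stock.hexun.com/real/",
            "dns:stock.hexun.com",
            "http://stock.hexun.com/robots.txt"
        };
        for (String url : samples) {
            System.out.println(url + " -> " + shouldSchedule(url));
        }
    }
}
```

The DNS and robots.txt URIs are let through because the crawler fetches them as prerequisites before it requests pages from a host.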
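Note that this class must also be registered in Processor.options under the conf/modules folder (in Heritrix 1.x, entries in that file typically take the form fully.qualified.ClassName|DisplayName).
Run the Main class to start the crawl.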