org.archive.crawler.restlet.JobResource - shareHua - ITeye Blog

org.archive.crawler.restlet.JobResource

Blog category: heritrix3


1. build: validateConfiguration()
2. launch: launch()
   A new thread is started: CrawlController.requestCrawlStart()
   getFrontier().run();
3. pause: getCrawlController().requestCrawlPause()
4. unpause: getCrawlController().requestCrawlResume()
   BdbFrontier.unpause()
   BdbFrontier: A Frontier using several BerkeleyDB JE Databases to hold its record of known hosts (queues), and pending URIs.

   sendCrawlStateChangeEvent(State.RUNNING, CrawlStatus.RUNNING);

CrawlController noteFrontierState INFO: Crawl running.
CrawlJob onApplicationEvent INFO: RUNNING 20121211155156

5. checkpoint: getCheckpointService().requestCrawlCheckpoint()
6. terminate: terminate()
7. teardown: teardown()
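
The mapping from action names to calls listed above can be pictured as a small dispatch table. The following is only a minimal, self-contained sketch: `JobActionSketch`, `FakeJob`, `actionTable` and `dispatch` are hypothetical names invented for illustration and are not part of Heritrix. The strings in the table are copied from the list above, and the fake job merely records which call would be made instead of touching a real crawl.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class JobActionSketch {

    /** Hypothetical stand-in for a crawl job: it only records the calls it receives. */
    static class FakeJob {
        final List<String> calls = new ArrayList<>();

        void call(String name) {
            calls.add(name);
        }
    }

    /** Action name -> call it triggers, in the order given in the notes above. */
    static Map<String, String> actionTable() {
        Map<String, String> table = new LinkedHashMap<>();
        table.put("build", "validateConfiguration()");
        table.put("launch", "launch()");
        table.put("pause", "getCrawlController().requestCrawlPause()");
        table.put("unpause", "getCrawlController().requestCrawlResume()");
        table.put("checkpoint", "getCheckpointService().requestCrawlCheckpoint()");
        table.put("terminate", "terminate()");
        table.put("teardown", "teardown()");
        return table;
    }

    /** Looks up the action and records the matching call; returns false for unknown actions. */
    static boolean dispatch(String action, FakeJob job) {
        String target = actionTable().get(action);
        if (target == null) {
            return false;
        }
        job.call(target);
        return true;
    }

    public static void main(String[] args) {
        FakeJob job = new FakeJob();
        // Walk through a typical job lifecycle in the order of the list above.
        for (String action : actionTable().keySet()) {
            dispatch(action, job);
        }
        job.calls.forEach(System.out::println);
    }
}
```

Running `main` prints one recorded call per action. An unknown action name is simply rejected by `dispatch`, which is a design choice of this sketch and not a statement about how JobResource handles it.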

2012-12-09 23:30