Article List
1. controller.getFetchChain().process(curi, this);
1.1 org.archive.crawler.prefetch.Preselector,
1.2 org.archive.crawler.prefetch.PreconditionEnforcer,
1.3 org.archive.modules.fetcher.FetchDNS,
// httpclient
1.4 org.archive.modules.fetcher.FetchHTTP,
1.5 org.archive.modules.extractor.ExtractorHTTP, ...
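As a rough sketch of how such a chain is applied, the fetch chain can be pictured as an ordered list of processors that each CrawlURI is passed through. The Processor interface below is a simplified stand-in, not Heritrix's actual API; only the processor order mirrors the beans listed above.

import java.util.List;

// Simplified stand-ins; Heritrix's real Processor/CrawlURI classes differ.
interface SimpleProcessor {
    void process(String curi); // the real chain passes a CrawlURI, not a String
}

public class FetchChainSketch {
    public static void main(String[] args) {
        // Order mirrors the beans above: preselection, precondition checks,
        // DNS lookup, HTTP fetch (backed by httpclient), header-level extraction.
        List<SimpleProcessor> fetchChain = List.of(
                uri -> System.out.println("Preselector          " + uri),
                uri -> System.out.println("PreconditionEnforcer " + uri),
                uri -> System.out.println("FetchDNS             " + uri),
                uri -> System.out.println("FetchHTTP            " + uri),
                uri -> System.out.println("ExtractorHTTP        " + uri));

        String curi = "http://example.com/";
        // controller.getFetchChain().process(curi, this) walks a list like this one
        for (SimpleProcessor p : fetchChain) {
            p.process(curi);
        }
    }
}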
1. build: validateConfiguration()
2. launch: launch()
a new Thread starts, then CrawlController.requestCrawlStart()
getFrontier().run();
3. pause: getCrawlController().requestCrawlPause()
4. unpause: getCrawlController().requestCrawlResume()
BdbFrontier.unpause()
BdbFrontier: A Frontier using several Ber ...
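A minimal sketch of driving that lifecycle from code, assuming Heritrix 3 is on the classpath and a CrawlJob handle has already been obtained from the engine. The import path for CrawlJob is an assumption; the lifecycle methods themselves are the ones named above.

// Sketch only: the CrawlJob package location is an assumption.
import org.archive.crawler.framework.CrawlJob;

public class JobLifecycleSketch {
    // Mirrors the sequence above: 1 build, 2 launch, 3 pause, 4 unpause.
    static void drive(CrawlJob job) throws Exception {
        job.validateConfiguration();                   // 1. build
        job.launch();                                  // 2. launch: requestCrawlStart(), then getFrontier().run()
        Thread.sleep(60_000);                          // let the crawl run for a while
        job.getCrawlController().requestCrawlPause();  // 3. pause
        Thread.sleep(10_000);
        job.getCrawlController().requestCrawlResume(); // 4. unpause, which unpauses the BdbFrontier
    }
}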
Yii2 questions, 2014-02-04
- Blog category:
- yii2
What do beginPage and beginBody in layouts/main.php do?
[Yii2 discussion group 146409855]
In this section we list the major changes from Yii 1.1 to Yii 2.0. We hope these lists make it easier for you to upgrade from Yii 1.1 and to get up to speed with Yii 2.0 faster, building on your existing knowledge of Yii.
Namespaces
--------------------
The most obvious change in Yii 2.0 is the use of namespaces ...
What Is Text Mining?
- Blog category:
- mining
Marti Hearst
SIMS, UC Berkeley
hearst@sims.berkeley.edu
October 17, 2003
I wrote this essay for people who are curious about the topic of text mining after having read the New York Times article by Lisa Guernsey (10/16/2003) or heard my Future Tense interview with Jon Gordon (10/20/2003).
What is t ...
A method for extracting company names from textual information uses a combination of heuristics, exception lists, and extensive corpus analysis. The method first locates company name suffixes (i.e., Company, Corporation) and attempts to locate the beginning of the company name. The method works on bo ...
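The abstract above only outlines the approach. The Java sketch below illustrates the general suffix-anchored idea (find a suffix such as "Company" or "Corporation", then scan backwards over capitalized words), under illustrative assumptions about the suffix list and the capitalization test; it is not the paper's actual method and omits the exception lists and corpus analysis.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

public class CompanyNameSketch {
    // Illustrative suffix list; the described method uses a fuller list plus exception lists.
    private static final Set<String> SUFFIXES =
            Set.of("Company", "Corporation", "Inc.", "Corp.", "Ltd.");

    // Returns candidate company names found in whitespace-tokenized text.
    static List<String> extract(String text) {
        String[] tokens = text.trim().split("\\s+");
        List<String> names = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (!SUFFIXES.contains(tokens[i])) continue;
            // Walk backwards over capitalized tokens to guess where the name begins.
            int start = i;
            while (start > 0 && !tokens[start - 1].isEmpty()
                    && Character.isUpperCase(tokens[start - 1].charAt(0))) {
                start--;
            }
            names.add(String.join(" ", Arrays.asList(tokens).subList(start, i + 1)));
        }
        return names;
    }

    public static void main(String[] args) {
        // Prints [Acme Widget Corporation]
        System.out.println(extract("Shares of Acme Widget Corporation rose today."));
    }
}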
/**
* The kind of "hop" from one URI to another. Each hop type can be
* represented by a single character; strings of these characters can
* appear in logs. E.g., "LLLX" means that a URI was three normal links from
* a seed, and then one speculative link.
*
* @author pjac ...
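To make the hop-path notation concrete, here is a small decoder sketch. It is not Heritrix's actual Hop enum; only 'L' (normal link) and 'X' (speculative link) come from the comment above, and the remaining letters are the conventional Heritrix hop codes, included here as assumptions.

public class HopPathSketch {
    // 'L' and 'X' come from the javadoc above; the other codes are the
    // conventional Heritrix hop letters and are assumptions here.
    static String describe(char hop) {
        switch (hop) {
            case 'L': return "normal link";
            case 'X': return "speculative link";
            case 'E': return "embedded resource";
            case 'R': return "redirect";
            case 'P': return "prerequisite";
            default:  return "unknown hop '" + hop + "'";
        }
    }

    public static void main(String[] args) {
        String hopPath = "LLLX"; // three normal links from a seed, then one speculative link
        for (char c : hopPath.toCharArray()) {
            System.out.println(c + " = " + describe(c));
        }
    }
}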
ToeThread.run()
ProcessorChain.process(CrawlURI curi, ChainStatusReceiver thread)
Processor.process(CrawlURI curi)
Scoper.isInScope(CrawlURI caUri)
// for each rule from getRules()
DecideResult r = rule.decisionFor(uri);
// inside the decisionFor method,
DecideResult result = innerDecide(uri);
//last decisiveRule n ...
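A minimal sketch of the "last decisive rule wins" evaluation traced above, using simplified stand-in types rather than Heritrix's DecideRule/DecideResult classes:

import java.util.List;

public class DecideSequenceSketch {
    enum Decision { ACCEPT, REJECT, NONE } // stand-in for DecideResult

    interface Rule {                        // stand-in for DecideRule
        Decision decisionFor(String uri);   // real rules delegate to innerDecide(uri)
    }

    // Iterate all rules; every decisive (non-NONE) answer overwrites the
    // previous one, so the last decisive rule determines the final result.
    static Decision decide(List<Rule> rules, String uri) {
        Decision result = Decision.NONE;
        for (Rule rule : rules) {
            Decision r = rule.decisionFor(uri);
            if (r != Decision.NONE) {
                result = r;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Rule> rules = List.of(
                uri -> Decision.ACCEPT,                                            // accept everything
                uri -> uri.endsWith(".jpg") ? Decision.REJECT : Decision.NONE);    // then reject images
        System.out.println(decide(rules, "http://example.com/page.html")); // ACCEPT
        System.out.println(decide(rules, "http://example.com/img.jpg"));   // REJECT
    }
}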
When a URI is crawled, a ToeThread will execute a series of processors on it.
The processors are split into 5 distinct chains that are executed in sequence:
Pre-fetch processing chain
Fetch processing chain
Extractor processing chain
Write/Index processing chain
Post-processing chain
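Schematically, the five chains and their order can be written down as below; the ProcessorChain API is not shown, and the example processor names placed in the later chains are typical Heritrix defaults included here as assumptions.

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ChainSequenceSketch {
    public static void main(String[] args) {
        // LinkedHashMap preserves insertion order: the order a ToeThread runs the chains.
        Map<String, List<String>> chains = new LinkedHashMap<>();
        chains.put("Pre-fetch",    List.of("Preselector", "PreconditionEnforcer"));
        chains.put("Fetch",        List.of("FetchDNS", "FetchHTTP"));
        chains.put("Extractor",    List.of("ExtractorHTTP", "ExtractorHTML"));
        chains.put("Write/Index",  List.of("WARCWriterProcessor"));
        chains.put("Post-process", List.of("CandidatesProcessor", "DispositionProcessor"));

        String curi = "http://example.com/";
        chains.forEach((chain, processors) ->
                System.out.println(chain + " chain on " + curi + ": " + processors));
    }
}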
crawler-beans.cxml
- Blog category:
- heritrix3
1. CrawlMetadata: including identification of crawler/operator
org.archive.modules.CrawlMetadata: Basic crawl metadata, as consulted by functional modules and recorded in ARCs/WARCs.
org.archive.modules.seeds.TextSeedModule
org.archive.modules.deciderules.DecideRuleSequence
org.archive.modules.Can ...
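Since crawler-beans.cxml is a Spring XML application context, the beans above can in principle be loaded and inspected from Java as in the sketch below. The job path and the bean id "metadata" are assumptions, and in a normal crawl the Heritrix engine, not user code, builds this context.

// Sketch only: assumes the Heritrix 3 and Spring jars are on the classpath
// and that a job directory containing crawler-beans.cxml exists at this path.
import org.archive.modules.CrawlMetadata;
import org.springframework.context.support.FileSystemXmlApplicationContext;

public class CrawlerBeansSketch {
    public static void main(String[] args) {
        String path = "jobs/myjob/crawler-beans.cxml"; // hypothetical job path
        FileSystemXmlApplicationContext ctx = new FileSystemXmlApplicationContext(path);
        try {
            // "metadata" is assumed to be the bean id of the CrawlMetadata bean;
            // adjust it to whatever your configuration uses.
            CrawlMetadata metadata = ctx.getBean("metadata", CrawlMetadata.class);
            System.out.println("loaded crawl metadata bean: " + metadata);
        } finally {
            ctx.close();
        }
    }
}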
Suppose you would like to save the crawled files in a file/directory format instead of saving them in WARC files.
First, create a job with a single seed, http://foo.org/bar/. Configure the warcWriter bean so that its class is org.archive.modules.writer.MirrorWriterProcessor. This Processor will store file ...
hbase-writer
- Blog category:
- heritrix3
A Hadoop HBase WriterPool implementation for the Heritrix crawler
1. Ensure Java 1.6+ is being used, before hitting a later cryptic error.
2. Set some system properties early:
ignoredSchemes, maxFormSize
3. Parse command-line options.
4. DEFAULTS until changed by cmd-line options:
authLogin, authPassword, jobsDir, properties, bindHosts, port, SSL options
6. Set timezone here.
7. Sta ...
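A schematic JDK-only sketch of the startup steps listed above (version check, early system properties, timezone). The property keys and the GMT timezone shown are illustrative assumptions rather than Heritrix's actual values.

import java.util.TimeZone;

public class StartupSketch {
    public static void main(String[] args) {
        // 1. Fail fast on an old JVM rather than hitting a cryptic error later.
        //    (Crude lexicographic check; adequate for a sketch.)
        String version = System.getProperty("java.specification.version");
        if (version.compareTo("1.6") < 0) {
            throw new IllegalStateException("Java 1.6+ required, found " + version);
        }

        // 2. Set some system properties early (these key names are placeholders,
        //    standing in for the ignoredSchemes/maxFormSize settings above).
        System.setProperty("example.ignoredSchemes", "mailto,javascript");
        System.setProperty("example.maxFormSize", "1048576");

        // 6. Set the default timezone before any logging or date formatting.
        TimeZone.setDefault(TimeZone.getTimeZone("GMT"));

        // 3/4/7: command-line parsing (authLogin, authPassword, jobsDir, properties,
        // bindHosts, port, SSL options, ...) and server start-up would follow here.
        System.out.println("startup sketch complete");
    }
}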
The Main Console page is displayed after you have installed Heritrix and logged into the WUI.
Enter the name of the new job in the text box with the "Create new job with recommended starting configuration" label. Then click "create."
The new job will be displayed in the list o ...
Use svn to check out the project from sourceforge.net at https://archive-crawler.svn.sourceforge.net/svnroot/archive-crawler/trunk/heritrix3
Especially if you're customizing Heritrix (as seems to be the case from
setting up a dev environment), you should be basing your work off of
Heritrix 3 ...