package org.archive.crawler.extractor;

import java.util.regex.Matcher;

import javax.management.AttributeNotFoundException;

import org.archive.crawler.datamodel.CoreAttributeConstants;
import org.archive.crawler.datamodel.CrawlURI;
import org.archive.crawler.framework.Processor;
import org.archive.crawler.settings.SimpleType;
import org.archive.crawler.settings.Type;
import org.archive.crawler.extractor.Link;
import org.archive.util.TextUtils;

/**
 * A very simple extractor. Assumes that any string matching a
 * configurable regular expression is a link.
 *
 * @author Kristinn Sigurdsson
 */
public class SimpleExtractor extends Processor
        implements CoreAttributeConstants {

    /** Name of the setting that holds the regular expression. */
    public static final String ATTR_REGULAR_EXPRESSION = "input-param";

    /** Default expression: finds domains. */
    public static final String DEFAULT_REGULAR_EXPRESSION =
        "http://([a-zA-Z0-9]+\\.)+[a-zA-Z0-9]+/";

    int numberOfCURIsHandled = 0;
    int numberOfLinksExtracted = 0;

    public SimpleExtractor(String name) {
        super(name, "A very simple link extractor. Doesn't do anything useful.");
        Type e;
        e = addElementToDefinition(new SimpleType(ATTR_REGULAR_EXPRESSION,
            "Regular expression; any string that matches it is treated as a link",
            DEFAULT_REGULAR_EXPRESSION));
        e.setExpertSetting(true);
    }

    protected void innerProcess(CrawlURI curi) {
        if (!curi.isHttpTransaction()) {
            // We only handle HTTP at the moment.
            return;
        }

        numberOfCURIsHandled++;

        // The fetched content, replayed as a character sequence.
        CharSequence cs = curi.getHttpRecorder().getReplayCharSequence();

        // Read the configured expression; fall back to the default
        // if the attribute is not found.
        String regexpr = null;
        try {
            regexpr = (String) getAttribute(ATTR_REGULAR_EXPRESSION, curi);
        } catch (AttributeNotFoundException e) {
            regexpr = DEFAULT_REGULAR_EXPRESSION;
        }

        Matcher match = TextUtils.getMatcher(regexpr, cs);
        while (match.find()) {
            // Every match is added to the CrawlURI as a speculative link.
            String link = cs.subSequence(match.start(), match.end()).toString();
            curi.createAndAddLink(link, Link.SPECULATIVE_MISC, Link.NAVLINK_HOP);
            numberOfLinksExtracted++;
            System.out.println("SimpleExtractor: " + link);
        }
        // Hand the matcher back to TextUtils once we are done with it.
        TextUtils.recycleMatcher(match);
    }

    public String report() {
        StringBuffer ret = new StringBuffer();
        ret.append("Processor: org.archive.crawler.extractor." +
            "SimpleExtractor\n");
        ret.append(" Function: Example extractor\n");
        ret.append(" CrawlURIs handled: " + numberOfCURIsHandled + "\n");
        ret.append(" Links extracted: " + numberOfLinksExtracted + "\n\n");
        return ret.toString();
    }
}