qzxfl008

An example processor

 
package org.archive.crawler.extractor;

import java.util.regex.Matcher;

import javax.management.AttributeNotFoundException;

import org.archive.crawler.datamodel.CoreAttributeConstants;
import org.archive.crawler.datamodel.CrawlURI;
import org.archive.crawler.framework.Processor;
import org.archive.crawler.settings.SimpleType;
import org.archive.crawler.settings.Type;
import org.archive.crawler.extractor.Link;
import org.archive.util.TextUtils;

/**
 * A very simple extractor. Will assume that any string that matches a 
 * configurable regular expression is a link.
 *
 * @author Kristinn Sigurdsson
 */
public class SimpleExtractor extends Processor
    implements CoreAttributeConstants
{
    public static final String ATTR_REGULAR_EXPRESSION = "input-param";
    public static final String DEFAULT_REGULAR_EXPRESSION = 
        "http://([a-zA-Z0-9]+\\.)+[a-zA-Z0-9]+/"; //Find domains
    
    int numberOfCURIsHandled = 0; 
    int numberOfLinksExtracted = 0;

    public SimpleExtractor(String name) {
        super(name, "A very simple link extractor. Doesn't do anything useful.");
        Type e;
        e = addElementToDefinition(new SimpleType(ATTR_REGULAR_EXPRESSION,
            "Regular expression that matches the strings to treat as links",
            DEFAULT_REGULAR_EXPRESSION));
        e.setExpertSetting(true);
    }

    protected void innerProcess(CrawlURI curi) {

        if (!curi.isHttpTransaction())
        {
            // We only handle HTTP at the moment.
            return;
        }
        
        numberOfCURIsHandled++;

        CharSequence cs = curi.getHttpRecorder().getReplayCharSequence();
        String regexpr = null;
        try {
            regexpr = (String)getAttribute(ATTR_REGULAR_EXPRESSION,curi);
        } catch(AttributeNotFoundException e) {
            regexpr = DEFAULT_REGULAR_EXPRESSION;
        }

        Matcher match = TextUtils.getMatcher(regexpr, cs);
        
        while (match.find()){ 
            String link = cs.subSequence(match.start(),match.end()).toString();
            curi.createAndAddLink(link, Link.SPECULATIVE_MISC, Link.NAVLINK_HOP);
            numberOfLinksExtracted++;
            System.out.println("SimpleExtractor: " + link);
        }
        
        TextUtils.recycleMatcher(match);
    }

    public String report() {
        StringBuffer ret = new StringBuffer();
        ret.append("Processor: org.archive.crawler.extractor." +
            "SimpleExtractor\n");
        ret.append("  Function:          Example extractor\n");
        ret.append("  CrawlURIs handled: " + numberOfCURIsHandled + "\n");
        ret.append("  Links extracted:   " + numberOfLinksExtracted + "\n\n");

        return ret.toString();
    }
}
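
To see what the default regular expression actually picks up, the matching loop can be tried outside the crawler. The following is only a sketch using plain java.util.regex: the class name RegexLinkDemo, the helper findLinks and the sample page text are made up for illustration and are not part of Heritrix. It uses Pattern and Matcher directly in place of TextUtils.getMatcher and TextUtils.recycleMatcher.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone demo, not part of Heritrix.
public class RegexLinkDemo {
    // Same pattern as DEFAULT_REGULAR_EXPRESSION in SimpleExtractor.
    public static final String DEFAULT_REGULAR_EXPRESSION =
        "http://([a-zA-Z0-9]+\\.)+[a-zA-Z0-9]+/";

    // Collects every substring of text that matches regex, in order.
    public static List<String> findLinks(String regex, CharSequence text) {
        List<String> links = new ArrayList<>();
        Matcher match = Pattern.compile(regex).matcher(text);
        while (match.find()) {
            links.add(match.group());
        }
        return links;
    }

    public static void main(String[] args) {
        // Sample input invented for this demo.
        String page = "<a href=\"http://www.archive.org/index.html\">home</a> "
            + "see also http://example.com/ and https://secure.example.org/";
        for (String link : findLinks(DEFAULT_REGULAR_EXPRESSION, page)) {
            System.out.println("SimpleExtractor: " + link);
        }
    }
}
```

With this input only http://www.archive.org/ and http://example.com/ are reported. The pattern stops at the first slash after the host name, so the path is dropped, and the https URL is skipped because the pattern requires a literal "http://".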