
Web-Harvest: a Web Scraping Tool




Overview

 

This section describes the motivation behind Web-Harvest and the notions and concepts it uses.

Rationale

 

The World Wide Web, though by far the largest knowledge base, is rarely regarded as a database in the traditional sense: a source of information used for further computing. Web-Harvest is inspired by the practical need to have the right data at the right time, and very often the Web is the only source that publicly provides the wanted information.

Basic concept

 

The main goal behind Web-Harvest is to leverage already existing extraction technologies. Its purpose is not to propose a new method, but to provide a way to easily use and combine the existing ones. Web-Harvest offers a set of processors for data handling and control flow. Each processor can be regarded as a function: it takes zero or more input parameters and produces a result after execution. Processors can be combined into a pipeline, forming a chain of execution. For easier manipulation and data reuse, Web-Harvest provides a variable context where named variables are stored. The following diagram describes one pipeline execution:

[Diagram: one pipeline execution]

The results of extraction can be found in files created during execution, or taken from the variable context if Web-Harvest is used programmatically.

Configuration language

 

Every extraction process is defined in one or more configuration files, using a simple XML-based language. Each processor is described by a specific XML element or structure of XML elements. As an illustration, here is an example of a configuration file:

 

 

<?xml version="1.0" encoding="UTF-8"?>
 
<config charset="UTF-8">
    <var-def name="urlList">
        <xpath expression="//img/@src">
            <html-to-xml>
                <http url="http://news.bbc.co.uk"/>
            </html-to-xml>
        </xpath>
    </var-def>
        
    <loop item="link" index="i" filter="unique">
        <list>
            <var name="urlList"/>
        </list>
        <body>
            <file action="write" type="binary" path="images/${i}.gif">
                <http url="${sys.fullUrl('http://news.bbc.co.uk', link)}"/>
            </file>
        </body>
    </loop>
</config>

 

This configuration contains two pipelines. The first pipeline performs the following steps:

 


  1. The HTML content at http://news.bbc.co.uk is downloaded,
  2. HTML cleaning is performed on the downloaded content, producing XHTML,
  3. An XPath expression is evaluated against it, giving the sequence of page-image URLs,
  4. A new variable named "urlList" is defined, containing the sequence of image URLs.


 

The second pipeline uses the result of the previous execution in order to collect all page images:

  1. The loop processor iterates over the URL sequence, and for every item:
  2. downloads the image at the current URL,
  3. stores the image on the file system.
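These three steps map directly onto the loop portion of the configuration above; repeated here with comments tying each element to its step:

```xml
<!-- the second pipeline from the example above, annotated step by step -->
<loop item="link" index="i" filter="unique">   <!-- 1. iterate over the URL sequence -->
    <list>
        <var name="urlList"/>                  <!-- the sequence produced by the first pipeline -->
    </list>
    <body>
        <!-- 3. store the image on the file system, named by the loop index -->
        <file action="write" type="binary" path="images/${i}.gif">
            <!-- 2. download the image at the current (absolutized) URL -->
            <http url="${sys.fullUrl('http://news.bbc.co.uk', link)}"/>
        </file>
    </body>
</loop>
```

Note how the innermost processor runs first and its result flows outward: http feeds file, just as in the first pipeline http fed html-to-xml.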


 

 

This example illustrates some procedural-language elements of Web-Harvest, like variable definition and list iteration, a few data-management processors (file and http), and a couple of HTML/XML processing instructions (the html-to-xml and xpath processors).

 

For a slightly more complex image-download example that uses some other features of Web-Harvest, see the Examples page. For technical coverage of the supported processors, see the User manual.

 


Data values

All data produced and consumed during an extraction process in Web-Harvest has three representations: text, binary and list. There is also a special data value, empty, whose textual representation is the empty string, whose binary representation is an empty byte array, and whose list representation is a zero-length list. Which form of the data is used depends on the processor that consumes it. In the previous configuration, the html-to-xml processor uses the downloaded content as text in order to transform it to XHTML, the loop processor uses the variable urlList as a list in order to iterate over it, and the file processor treats the downloaded images as binary data when saving them to files. In most cases the proper representation of the data is chosen by Web-Harvest. In some situations, however, it must be stated explicitly which one to use. One example is the file processor, where the default data type is text and binary content must be explicitly specified with type="binary".
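The text/binary distinction can be made concrete with a small sketch (the URLs and file names here are illustrative, not from the original example):

```xml
<config charset="UTF-8">
    <!-- default type is text: fine for HTML, XML and other character data -->
    <file action="write" path="page.html">
        <http url="http://news.bbc.co.uk"/>
    </file>
    <!-- type="binary" must be stated explicitly for images and other non-textual content -->
    <file action="write" type="binary" path="images/logo.gif">
        <http url="http://example.com/logo.gif"/>
    </file>
</config>
```

Without the explicit type="binary", the file processor would treat the image bytes as text, which can corrupt them during character decoding.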

Variables

Web-Harvest provides a variable context for storing and using variables. There is no special naming convention for variables as in most programming languages; names like arr[1], 100 or #$& are valid. However, if such variables were used in scripts or templates (see the next section), where expressions are dynamically evaluated, an exception would be thrown. It is therefore recommended to use conventional programming-language names in order to avoid any difficulties.
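A short sketch of the naming pitfall described above (variable names chosen for illustration):

```xml
<config charset="UTF-8">
    <!-- legal in the variable context, but unusable inside ${...} expressions:
         evaluating ${arr[1]} in a template or script would throw an exception -->
    <var-def name="arr[1]">first</var-def>

    <!-- a conventional identifier works everywhere, including templates -->
    <var-def name="firstItem">first</var-def>
    <var-def name="copy">
        <template>${firstItem}</template>
    </var-def>
</config>
```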

When Web-Harvest is used programmatically (from Java code rather than the command line), the variable context may be initially populated by the user in order to add custom values and functionality. Similarly, after execution the variable context is available for reading variables from it.

When user-defined functions are called (see the User manual), a separate local variable context is created (as in many programming languages, including Java). The valid way to exchange data between the caller and the called function is through function parameters.
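A sketch of this parameter-passing rule, using the function, call, call-param and return elements described in the User manual (the function name "download" and parameter name "url" are illustrative):

```xml
<config charset="UTF-8">
    <function name="download">
        <!-- "url" exists only in this function's local variable context,
             bound from the caller's call-param of the same name -->
        <return>
            <http url="${url}"/>
        </return>
    </function>

    <call name="download">
        <call-param name="url">http://news.bbc.co.uk</call-param>
    </call>
</config>
```

Variables defined in the caller's context are not visible inside the function body; everything the function needs must arrive as a call-param.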

Scripting and templating

Before Web-Harvest 0.5 the templating mechanism was based on OGNL (Object-Graph Navigation Language). In version 0.5 OGNL was replaced by BeanShell, and starting from version 1.0 multiple scripting languages are supported, giving developers the freedom to choose their favourite one.

Besides the set of powerful text and XML manipulation processors, Web-Harvest supports real scripting languages whose code can be easily integrated within scraper configurations. The languages currently supported are BeanShell, Groovy and JavaScript. BeanShell is probably the closest to Java in syntax and power, but Groovy and JavaScript have some other advantages. It is up to the developer to use the preferred language, or even to mix different languages in a single configuration.

Templating allows evaluation of marked parts of the text (text "islands" surrounded by ${ and }). Evaluation is performed using the chosen scripting language. In Web-Harvest all element attributes are implicitly passed to the templating engine. In the configuration above, there are two places where the templater does its job:

  • path="images/${i}.gif" in the file processor, producing file names based on the loop index,
  • url="${sys.fullUrl('http://news.bbc.co.uk', link)}" in the http processor, where a built-in function is called to calculate the full URL of the image (see the User manual for all built-in objects).
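Beyond attribute islands, the same evaluation can be used in element bodies; a minimal sketch (the variable name and greeting text are illustrative):

```xml
<config charset="UTF-8">
    <var-def name="name">World</var-def>
    <!-- the template processor evaluates each ${...} island against the
         variable context using the configured scripting language -->
    <var-def name="greeting">
        <template>Hello, ${name}!</template>
    </var-def>
</config>
```

After this pipeline runs, the variable greeting holds the text "Hello, World!".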
