Studying the Nutch Search Engine Source Code, Part 1: Web Page Fetching (3)

blessed24

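Today we look at the data structures Nutch uses for web page fetching.
The main classes involved are FetchListEntry and Page.
Let's start with the FetchListEntry class:
public final class FetchListEntry implements Writable, Cloneable
It implements the Writable and Cloneable interfaces, as many Nutch classes do.
Letting each class take care of its own reading and writing is a reasonable design; splitting that logic out into separate classes would feel fragmented.
Here are its member variables:

Java code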

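public static final String DIR_NAME = "fetchlist"; // directory name used when writing to disk
private final static byte CUR_VERSION = 2;         // current format version
private boolean fetch;                             // whether to fetch the page so it can be updated later
private Page page;                                 // the page this entry refers to
private String[] anchors;                          // anchor strings recorded for this page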

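Now let's see how the fields are read, that is, the method
public final void readFields(DataInput in) throws IOException
It first reads the version field and, if that version is greater than the current version, throws a version-mismatch exception.
It then reads the fetch and page fields.
If the version is greater than 1, anchors were saved with the record, so they are read; otherwise anchors is simply set to an empty String array.
The code is as follows:

Java code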

byte version = in.readByte();                 // read version
if (version > CUR_VERSION)                    // check version
  throw new VersionMismatchException(CUR_VERSION, version);

fetch = in.readByte() != 0;                   // read fetch flag

page = Page.read(in);                         // read page

if (version > 1) {                            // anchors added in version 2
  anchors = new String[in.readInt()];         // read anchors
  for (int i = 0; i < anchors.length; i++) {
    anchors[i] = UTF8.readString(in);
  }
} else {
  anchors = new String[0];
}

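The version check described above can be tried out in isolation. The following is a minimal sketch, not Nutch code: VersionedEntry is a hypothetical stand-in for FetchListEntry, a plain String replaces the Page field, and writeUTF/readUTF replace Nutch's UTF8 helpers. It shows a reader for version 2 still accepting a hand-built version 1 record that has no anchors section.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical stand-in for FetchListEntry; not a Nutch class.
class VersionedEntry {
    static final byte CUR_VERSION = 2;

    boolean fetch;
    String url = "";                     // assumption: stands in for the Page field
    String[] anchors = new String[0];

    // Always writes the current (version 2) layout.
    void write(DataOutput out) throws IOException {
        out.writeByte(CUR_VERSION);
        out.writeByte(fetch ? 1 : 0);
        out.writeUTF(url);
        out.writeInt(anchors.length);
        for (String anchor : anchors) {
            out.writeUTF(anchor);
        }
    }

    // Reads any layout up to CUR_VERSION and rejects newer ones.
    void readFields(DataInput in) throws IOException {
        byte version = in.readByte();
        if (version > CUR_VERSION) {
            throw new IOException("unsupported version " + version);
        }
        fetch = in.readByte() != 0;
        url = in.readUTF();
        if (version > 1) {               // anchors exist only from version 2 on
            anchors = new String[in.readInt()];
            for (int i = 0; i < anchors.length; i++) {
                anchors[i] = in.readUTF();
            }
        } else {
            anchors = new String[0];
        }
    }

    // Builds a version 1 record by hand: version, fetch flag, url, no anchors.
    static byte[] version1Record(boolean fetch, String url) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeByte(1);
        out.writeByte(fetch ? 1 : 0);
        out.writeUTF(url);
        return buf.toByteArray();
    }

    static VersionedEntry fromBytes(byte[] bytes) throws IOException {
        VersionedEntry entry = new VersionedEntry();
        entry.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
        return entry;
    }

    public static void main(String[] args) throws IOException {
        VersionedEntry old = fromBytes(version1Record(true, "http://example.com/"));
        // prints: true http://example.com/ 0
        System.out.println(old.fetch + " " + old.url + " " + old.anchors.length);
    }
}
```

Because the writer always emits the current version, old records are upgraded the next time they are written back.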

A static read method is also provided; it reads the fields and returns a newly built FetchListEntry object:

Java code

public static FetchListEntry read(DataInput in) throws IOException {
    FetchListEntry result = new FetchListEntry();
    result.readFields(in);
    return result;
}

The write code is easy to follow; it writes each field in turn:

Java code

public final void write(DataOutput out) throws IOException {
    out.writeByte(CUR_VERSION);                   // store current version
    out.writeByte((byte)(fetch ? 1 : 0));         // write fetch flag
    page.write(out);                              // write page
    out.writeInt(anchors.length);                 // write anchors
    for (int i = 0; i < anchors.length; i++) {
        UTF8.writeString(out, anchors[i]);
    }
}

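The remaining clone and equals methods are just as easy to follow.
Next, let's look at the Page class:
public class Page implements WritableComparable, Cloneable
Like FetchListEntry, it is Writable (here through WritableComparable) and Cloneable. Nutch's own comment makes the meaning of each field clear:

Java code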

/*********************************************
 * A row in the Page Database.
 * <pre>
 *   type   name      description
 * ---------------------------------------------------------------
 *   byte   VERSION   - A byte indicating the version of this entry.
 *   String URL       - The url of a page.  This is the primary key.
 *   128bit ID        - The MD5 hash of the contents of the page.
 *   64bit  DATE      - The date this page should be refetched.
 *   byte   RETRIES   - The number of times we've failed to fetch this page.
 *   byte   INTERVAL  - Frequency, in days, this page should be refreshed.
 *   float  SCORE     - Multiplied into the score for hits on this page.
 *   float  NEXTSCORE - Multiplied into the score for hits on this page.
 * </pre>
 *
 * @author Mike Cafarella
 * @author Doug Cutting
 *********************************************/

The fields:

Java code

private final static byte CUR_VERSION = 4;
private static final byte DEFAULT_INTERVAL =
    (byte)NutchConf.get().getInt("db.default.fetch.interval", 30);

private UTF8 url;
private MD5Hash md5;
private long nextFetch = System.currentTimeMillis();
private byte retries;
private byte fetchInterval = DEFAULT_INTERVAL;
private int numOutlinks;
private float score = 1.0f;
private float nextScore = 1.0f;

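One detail in the field list is worth a quick experiment: DEFAULT_INTERVAL narrows an int configuration value to a byte with a plain cast. The sketch below only demonstrates what Java's narrowing cast does with values outside the byte range; IntervalCast and checkedByte are hypothetical names and are not part of Nutch.

```java
// Hypothetical helper for illustration; not part of Nutch.
class IntervalCast {
    // Narrows an int to a byte, but fails loudly instead of wrapping around.
    static byte checkedByte(int days) {
        if (days < Byte.MIN_VALUE || days > Byte.MAX_VALUE) {
            throw new IllegalArgumentException("interval does not fit in a byte: " + days);
        }
        return (byte) days;
    }

    public static void main(String[] args) {
        System.out.println((byte) 30);       // prints 30
        System.out.println((byte) 200);      // prints -56, the cast wraps silently
        System.out.println(checkedByte(30)); // prints 30
    }
}
```

So with the plain cast, an interval has to stay at 127 days or below to survive the conversion unchanged.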

Let's also look at how it reads its own fields. With the comments that come with the code it is easy to follow, so we won't go into detail:

Java code

public void readFields(DataInput in) throws IOException {
    byte version = in.readByte();                 // read version
    if (version > CUR_VERSION)                    // check version
        throw new VersionMismatchException(CUR_VERSION, version);

    url.readFields(in);
    md5.readFields(in);
    nextFetch = in.readLong();
    retries = in.readByte();
    fetchInterval = in.readByte();
    numOutlinks = (version > 2) ? in.readInt() : 0; // added in Version 3
    score = (version>1) ? in.readFloat() : 1.0f;  // score added in version 2
    nextScore = (version>3) ? in.readFloat() : 1.0f;  // 2nd score added in V4
}

Writing the fields is just as direct:

Java code

public void write(DataOutput out) throws IOException {
    out.writeByte(CUR_VERSION);                   // store current version
    url.write(out);
    md5.write(out);
    out.writeLong(nextFetch);
    out.write(retries);
    out.write(fetchInterval);
    out.writeInt(numOutlinks);
    out.writeFloat(score);
    out.writeFloat(nextScore);
}

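Note that Page.write uses out.write(retries) where FetchListEntry used out.writeByte(...). Per the java.io.DataOutput contract, write(int) writes only the eight low-order bits of its argument, so both calls put a single byte on the stream and readByte gets the original value back. A small self-contained check, where SingleByteWrite is a hypothetical class name:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical demo class; not part of Nutch.
class SingleByteWrite {
    static byte[] encode(byte retries, byte interval) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.write(retries);        // write(int): only the low 8 bits are written
        out.writeByte(interval);   // writeByte(int): same effect, one byte
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = encode((byte) 3, (byte) -2);
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
        System.out.println(bytes.length);   // prints 2
        System.out.println(in.readByte());  // prints 3
        System.out.println(in.readByte());  // prints -2
    }
}
```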

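While we're here, let's look at FetcherOutput, a class that makes it convenient to read and write fetched content. By delegating to the read and write methods of the two classes introduced above, it supports reading and writing fetched data at several levels of granularity. The code is straightforward, so we won't go into detail.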
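
Since the FetcherOutput source is not shown here, the following is only a rough sketch of the delegation idea, with every name invented for illustration (SimpleWritable, UrlPart, StatusPart and FetchRecord are not Nutch types). The composite writes its parts in a fixed order and reads them back in the same order, leaving each part responsible for its own layout.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical composite record; it only delegates to its parts.
class FetchRecord implements SimpleWritable {
    final UrlPart entry = new UrlPart();
    final StatusPart result = new StatusPart();

    public void write(DataOutput out) throws IOException {
        entry.write(out);        // each part serializes itself
        result.write(out);
    }

    public void readFields(DataInput in) throws IOException {
        entry.readFields(in);    // same order as in write
        result.readFields(in);
    }

    // Serializes a record to bytes and reads it back into a fresh object.
    static FetchRecord copyViaBytes(FetchRecord source) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        source.write(new DataOutputStream(buf));
        FetchRecord copy = new FetchRecord();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        return copy;
    }

    public static void main(String[] args) throws IOException {
        FetchRecord record = new FetchRecord();
        record.entry.url = "http://example.com/";
        record.result.status = 1;
        record.result.fetchDate = 1000L;
        FetchRecord copy = copyViaBytes(record);
        // prints: http://example.com/ 1 1000
        System.out.println(copy.entry.url + " " + copy.result.status + " " + copy.result.fetchDate);
    }
}

// Hypothetical minimal version of the Writable idea.
interface SimpleWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

class UrlPart implements SimpleWritable {
    String url = "";
    public void write(DataOutput out) throws IOException { out.writeUTF(url); }
    public void readFields(DataInput in) throws IOException { url = in.readUTF(); }
}

class StatusPart implements SimpleWritable {
    int status;
    long fetchDate;
    public void write(DataOutput out) throws IOException {
        out.writeInt(status);
        out.writeLong(fetchDate);
    }
    public void readFields(DataInput in) throws IOException {
        status = in.readInt();
        fetchDate = in.readLong();
    }
}
```
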
Next time we'll look at the parse-html plugin and see how Nutch extracts content from HTML pages.

#3 sharong 2008-03-17

Or do I need to install some Nutch plugins?

#2 fuliang 2007-12-15

One more addition, the Content class:
public final class Content extends VersionedWritable
As we can see, it extends VersionedWritable, which implements reading and writing of the version field.
Let's look at the member variables first:

Java code

public static final String DIR_NAME = "content";
private final static byte VERSION = 1;
private String url;
private String base;
private byte[] content;
private String contentType;
private Properties metadata;

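DIR_NAME is the directory where Content is stored.
VERSION is the version constant.
url is the URL of the page this Content belongs to.
base is the base URL of that page.
content is the page's raw content as a byte array.
contentType is the content type of that page.
metadata holds the meta information of that page.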

Now let's see how Content reads and writes its own fields:
public final void readFields(DataInput in) throws IOException
This method reads each of the object's fields:

Java code

super.readFields(in);                         // check version

url = UTF8.readString(in);                    // read url
base = UTF8.readString(in);                   // read base

content = WritableUtils.readCompressedByteArray(in);

contentType = UTF8.readString(in);            // read contentType

int propertyCount = in.readInt();             // read metadata
metadata = new Properties();
for (int i = 0; i < propertyCount; i++) {
    metadata.put(UTF8.readString(in), UTF8.readString(in));
}

With the comments, the code is fairly clear.
super.readFields(in);
This line calls the parent class VersionedWritable to read and validate the version number.
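
To make the role of the parent class concrete, here is a minimal sketch of how such a base class could look. It is not the VersionedWritable source: VersionedBase and Note are hypothetical names, and the assumption that the base class accepts exactly one version (rather than any version up to the current one) is ours.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical subclass: handles only its own fields, after the base class
// has dealt with the version byte.
class Note extends VersionedBase {
    String text = "";

    @Override
    byte getVersion() { return 1; }

    @Override
    void write(DataOutput out) throws IOException {
        super.write(out);            // write version
        out.writeUTF(text);
    }

    @Override
    void readFields(DataInput in) throws IOException {
        super.readFields(in);        // check version
        text = in.readUTF();
    }

    static Note fromBytes(byte[] bytes) throws IOException {
        Note note = new Note();
        note.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
        return note;
    }

    public static void main(String[] args) throws IOException {
        Note note = new Note();
        note.text = "hello";
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        note.write(new DataOutputStream(buf));
        System.out.println(fromBytes(buf.toByteArray()).text);   // prints hello
    }
}

// Hypothetical base class: owns the version byte so subclasses don't repeat it.
abstract class VersionedBase {
    abstract byte getVersion();

    void write(DataOutput out) throws IOException {
        out.writeByte(getVersion());
    }

    void readFields(DataInput in) throws IOException {
        byte version = in.readByte();
        if (version != getVersion()) {   // assumption: exact match required
            throw new IOException("version mismatch: expected "
                    + getVersion() + ", found " + version);
        }
    }
}
```
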
The write code is simple as well:

Java code

public final void write(DataOutput out) throws IOException {
    super.write(out);                             // write version

    UTF8.writeString(out, url);                   // write url
    UTF8.writeString(out, base);                  // write base

    WritableUtils.writeCompressedByteArray(out, content); // write content

    UTF8.writeString(out, contentType);           // write contentType

    out.writeInt(metadata.size());                // write metadata
    Iterator i = metadata.entrySet().iterator();
    while (i.hasNext()) {
        Map.Entry e = (Map.Entry)i.next();
        UTF8.writeString(out, (String)e.getKey());
        UTF8.writeString(out, (String)e.getValue());
    }
}

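The two less obvious pieces of Content's format, the compressed content bytes and the count-prefixed metadata, can be approximated with the JDK alone. This is a sketch under assumptions: ContentCodec is a hypothetical class, the length-prefixed gzip layout is our guess at what a "compressed byte array" helper might do rather than the documented WritableUtils format, and writeUTF/readUTF stand in for Nutch's UTF8 helpers.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.Properties;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical codec for illustration; not part of Nutch.
class ContentCodec {
    // Assumed layout: int length of the gzip payload, then the payload itself.
    static void writeCompressed(DataOutput out, byte[] data) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(data);
        }
        byte[] packed = buf.toByteArray();
        out.writeInt(packed.length);
        out.write(packed);
    }

    static byte[] readCompressed(DataInput in) throws IOException {
        byte[] packed = new byte[in.readInt()];
        in.readFully(packed);
        ByteArrayOutputStream plain = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(packed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gz.read(chunk)) != -1) {
                plain.write(chunk, 0, n);
            }
        }
        return plain.toByteArray();
    }

    // Metadata: entry count first, then key/value pairs.
    static void writeMetadata(DataOutput out, Properties metadata) throws IOException {
        out.writeInt(metadata.size());
        for (Map.Entry<Object, Object> e : metadata.entrySet()) {
            out.writeUTF((String) e.getKey());
            out.writeUTF((String) e.getValue());
        }
    }

    static Properties readMetadata(DataInput in) throws IOException {
        int count = in.readInt();
        Properties metadata = new Properties();
        for (int i = 0; i < count; i++) {
            String key = in.readUTF();     // key first, then value, as written
            String value = in.readUTF();
            metadata.put(key, value);
        }
        return metadata;
    }

    public static void main(String[] args) throws IOException {
        Properties meta = new Properties();
        meta.put("Content-Type", "text/html");
        byte[] body = "<html>hello</html>".getBytes(StandardCharsets.UTF_8);

        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        writeCompressed(out, body);
        writeMetadata(out, meta);

        DataInputStream in = new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
        System.out.println(new String(readCompressed(in), StandardCharsets.UTF_8));
        System.out.println(readMetadata(in).getProperty("Content-Type"));
    }
}
```

Reading the key and value into local variables keeps the read order explicit; the original one-liner relies on Java evaluating method arguments left to right, which the language does guarantee.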

What really matters in these classes is their fields, and how the domain model is divided among them.

Posted: 2010-12-06 21:47