Nutch 0.9 Customization: Optimizing Search Results

nhy520


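Searching with the stock Nutch 0.9 package returns redundant results. For example, when we search for information about Yao Ming (姚明) or Yi Jianlian (易建联), Nutch by default also returns pages where those names appear only in a navigation bar or in a few headlines. Take Tencent as an example: the navigation bar of http://sports.qq.com/nba/ contains both names.

The rest of that page, however, has nothing to do with Yao Ming or Yi Jianlian, so the page is probably not what we are looking for.

There is a second case: when we search for "莎娃" (Sharapova), Nutch returns http://sports.qq.com/a/20090108/000407.htm, but on that page "莎娃" appears only in the column of hyperlinks on the right-hand side.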

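[Figure: screenshot of the page, with "莎娃" appearing only in the right-hand link column]

This page is probably not what we are looking for either.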

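A closer look at how Nutch works:

Conclusion: Nutch uses HTMLParser to parse each crawled page into plain text and saves that text to the local disk; Lucene then builds the index from it. To get rid of the useless information, we therefore have to start from the step marked in red in the figure, which is where the page is turned into text.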
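
To illustrate why text from a navigation bar or link column leads to a hit, here is a toy sketch of the two stages described above: each page is reduced to plain text, and every term of that text is indexed. It does not use Nutch or Lucene. The class `ToyIndex`, the example.com URLs, the sample texts and the whitespace tokenizer are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy stand-in for the "parse to text, then index" pipeline.
// Not Nutch or Lucene code; every name here is hypothetical.
final class ToyIndex {
    // term -> URLs of the pages whose extracted text contains the term
    private final Map<String, Set<String>> postings = new LinkedHashMap<>();

    // Index the plain text that the parser extracted from one page.
    void add(String url, String extractedText) {
        for (String term : extractedText.trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(url);
        }
    }

    // Return the URLs whose text contains the term (empty set if none).
    Set<String> search(String term) {
        return postings.getOrDefault(term, new TreeSet<>());
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        // Page 1: the name occurs only in the link column.
        index.add("http://example.com/links-only",
                "NBA Tennis Sharapova Federer schedule scores");
        // Page 2: the name occurs in the article body.
        index.add("http://example.com/article",
                "Sharapova wins her opening match");
        // Both pages are returned: once the page is flattened to text,
        // the index cannot tell link text from body text.
        System.out.println(index.search("Sharapova"));
    }
}
```

Because the distinction between link text and body text is already lost by the time the index is built, the filtering has to happen while the page is being converted to text.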

Search result optimization, method 1:

Modify the getTextHelper method in org.apache.nutch.parse.html.DOMContentUtils.

Original source of getTextHelper:

private boolean getTextHelper(StringBuffer sb, Node node,
                                             boolean abortOnNestedAnchors,
                                             int anchorDepth) {
    if ("script".equalsIgnoreCase(node.getNodeName())) {
      return false;
    }
    if ("style".equalsIgnoreCase(node.getNodeName())) {
      return false;
    }
    if (abortOnNestedAnchors && "a".equalsIgnoreCase(node.getNodeName())) {
      anchorDepth++;
      if (anchorDepth > 1)
        return true;
    }
    if (node.getNodeType() == Node.COMMENT_NODE) {
      return false;
    }
    if (node.getNodeType() == Node.TEXT_NODE) {
      // cleanup and trim the value
      String text = node.getNodeValue();
      text = text.replaceAll("\\s+", " ");
      text = text.trim();
      if (text.length() > 0) {
        if (sb.length() > 0) sb.append(' ');
        sb.append(text);
      }
    }
    boolean abort = false;
    NodeList children = node.getChildNodes();
    if (children != null) {
      int len = children.getLength();
      for (int i = 0; i < len; i++) {
        if (getTextHelper(sb, children.item(i),
                          abortOnNestedAnchors, anchorDepth)) {
          abort = true;
          break;
        }
      }
    }
    return abort;
}

Customized version (in effect, it drops the text that belongs to <a href=" "> links):

    if (node.getNodeType() == Node.TEXT_NODE) { // node is a node of the parsed page's DOM tree
      // Node.TEXT_NODE: the node is the text found inside an element
      // such as <body>, <div>, <a href> or <td>
      // cleanup and trim the value
      String text = node.getNodeValue(); // the node's text, i.e. the content with the HTML tags stripped

      /* strip special characters */
      text = text.replaceAll("\\s+", " ");
      text = text.replace("【", "");
      text = text.replace("】", "");
      text = text.replace("[", "");
      text = text.replace("]", "");
      text = text.replace("|", "");
      text = text.replace("┊", "");
      text = text.replace("?", "");
      text = text.replace("？", "");
      text = text.replace("、", "");
      text = text.replace("-", "");
      text = text.replace("~", "");
      text = text.replace("!", "");
      text = text.replace("@", "");
      text = text.replace("#", "");
      text = text.replace("$", "");
      text = text.replace("^", "");
      text = text.replace("*", "");
      text = text.replace("(", "");
      text = text.replace(")", "");
      text = text.replace("%", "");
      text = text.replace(">", "");
      text = text.trim();

      // Look at the element that directly contains this text node. Comparing
      // the tag name avoids depending on the DOM implementation's toString()
      // output, where a search for "A:" would also match names such as AREA.
      Node parent = node.getParentNode();
      boolean inAnchor = parent != null && "a".equalsIgnoreCase(parent.getNodeName());

      if (text.length() > 0 && !inAnchor) { // skip text that belongs to an <a href> link
        if (sb.length() > 0) sb.append(' ');
        sb.append(text);
      }
    }
    boolean abort = false;
    NodeList children = node.getChildNodes();
    if (children != null) {
      int len = children.getLength();
      for (int i = 0; i < len; i++) {
        if (getTextHelper(sb, children.item(i),
                          abortOnNestedAnchors, anchorDepth)) {
          abort = true;
          break;
        }
      }
    }
    return abort;
}
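
The check above only looks at the direct parent of a text node, so link text wrapped in another tag, for example <a href=" "><b>莎娃</b></a>, would still end up in the index. A minimal sketch of a stricter variant walks up all ancestors instead. It is a standalone illustration, not Nutch code: the class `AnchorTextFilter` and its methods are hypothetical names, the special-character stripping is left out, and the JDK's XML parser stands in for the HTML parser that Nutch uses, so the sample input has to be well-formed XHTML.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Standalone illustration; AnchorTextFilter is a hypothetical name, not part of Nutch.
final class AnchorTextFilter {

    // True if the node sits anywhere inside an <a> element.
    static boolean insideAnchor(Node node) {
        for (Node p = node.getParentNode(); p != null; p = p.getParentNode()) {
            if ("a".equalsIgnoreCase(p.getNodeName())) {
                return true;
            }
        }
        return false;
    }

    // Depth-first walk that appends text nodes to sb, optionally skipping link text.
    static void collectText(StringBuilder sb, Node node, boolean skipAnchors) {
        String name = node.getNodeName();
        if ("script".equalsIgnoreCase(name) || "style".equalsIgnoreCase(name)) {
            return;
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().replaceAll("\\s+", " ").trim();
            if (!text.isEmpty() && !(skipAnchors && insideAnchor(node))) {
                if (sb.length() > 0) sb.append(' ');
                sb.append(text);
            }
        }
        NodeList children = node.getChildNodes();
        if (children == null) {
            return;
        }
        for (int i = 0; i < children.getLength(); i++) {
            collectText(sb, children.item(i), skipAnchors);
        }
    }

    // Parse a well-formed XHTML string and return its text content.
    static String extract(String xhtml, boolean skipAnchors) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            StringBuilder sb = new StringBuilder();
            collectText(sb, doc.getDocumentElement(), skipAnchors);
            return sb.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse the input as XML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<div><a href=\"/s\"><b>Sharapova</b></a> <a href=\"/t\">Tennis</a></div>"
                + "<p>Match report: the final ended in two sets.</p>"
                + "</body></html>";
        System.out.println(extract(page, false)); // link text included
        System.out.println(extract(page, true));  // link text skipped
    }
}
```

With the sample page, the first call returns the link text followed by the paragraph text, and the second call returns only the paragraph text. The trade-off is that every link is dropped, including links inside the article body, so a search term that appears only as link text in a relevant article will no longer match that page.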

Reference: usage of the org.w3c.dom.Node interface: http://gceclub.sun.com.cn/Java_Docs/html/zh_CN/api/org/w3c/dom/class-use/Node.html


2009-05-23 00:22