Full-Text Search with Zend Framework's Lucene: Chinese Word Segmentation

ythzjk

Blog category: zend framework

Tags: full-text search, Zend, Lucene, algorithms, PHP

[2007/06/16 21:52 | Category: Advanced PHP » Object-Oriented PHP | by feifengxlq ]

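Preface: Last year I made a systematic study of Lucene, WebLucene, and related full-text search technologies. Later I also took a brief look at Zend Framework's Lucene component, but did not pursue it any further. I now plan to add full-text search to 研学吧 (http://www.yanxue8.com), so I am revisiting Zend Framework's Lucene module.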

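The next few articles will walk through building a simple, practical site-wide full-text search with Zend Framework, mainly for searching data stored in a database. I will not go into the basics of full-text search or how to set up a Zend Framework environment here. (In fact, I do not use Zend Framework as my framework; I use my own phpbean and treat Zend Framework purely as a class library.)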

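ZF does not ship a Chinese word segmentation algorithm, so you have to write your own for a real application. Here I use a simple bigram segmentation algorithm, which emits overlapping two-character tokens. It only works correctly with UTF-8; for other character sets, please modify the code.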
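To make the idea of bigram segmentation concrete, here is a minimal standalone sketch. The function name `bigrams` is a hypothetical helper for illustration and is not part of Zend Framework; it assumes the input is valid UTF-8 and that PCRE was built with UTF-8 support.

```php
<?php
// Hypothetical helper (not part of Zend Framework):
// returns the overlapping two-character tokens of a UTF-8 string.
function bigrams($text)
{
    // The "u" modifier makes PCRE split on UTF-8 characters instead of bytes.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);

    $result = array();
    for ($i = 0; $i + 1 < count($chars); $i++) {
        $result[] = $chars[$i] . $chars[$i + 1];
    }
    return $result;
}

// Four characters yield three overlapping bigrams: 全文, 文检, 检索
print_r(bigrams('全文检索'));
```

Because the tokens overlap, a query for any two adjacent characters can match the index without a dictionary, at the cost of a larger index.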

Step 1: How to test the analyzer's output.
The ZF manual does not cover this, so here is a simple example:

Code:

<?php
require_once 'Zend/Search/Lucene/Analysis/Analyzer.php';

// Fetch the default analyzer and feed it a test string
$analyzer = Zend_Search_Lucene_Analysis_Analyzer::getDefault();
$value = 'this is a test!';
$analyzer->setInput($value, 'utf-8');

// Collect every token until the stream is exhausted
$tokens = array();
while (($token = $analyzer->nextToken()) !== null) {
    $tokens[] = $token;
}
print_r($tokens);
?>

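This uses ZF's default analyzer (in ZF 1.x, Zend_Search_Lucene_Analysis_Analyzer_Common_Text_CaseInsensitive, which extends Zend_Search_Lucene_Analysis_Analyzer_Common_Text). You can also attach a token filter, for example to drop words such as "is" and "a".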
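Conceptually, a stop-word filter inspects each token after it has been produced and either keeps it or drops it. The sketch below illustrates only that idea: plain strings stand in for token objects, and `removeStopWords` is a hypothetical helper, not the Zend Framework API.

```php
<?php
// Conceptual sketch only: plain strings stand in for
// Zend_Search_Lucene_Analysis_Token objects.
function removeStopWords(array $tokens, array $stopWords)
{
    $kept = array();
    foreach ($tokens as $token) {
        // Compare case-insensitively; keep the token only if it is not a stop word.
        if (!in_array(strtolower($token), $stopWords, true)) {
            $kept[] = $token;
        }
    }
    return $kept;
}

$tokens = removeStopWords(array('this', 'is', 'a', 'test'), array('is', 'a'));
print_r($tokens); // this, test
```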

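Step 2: Write your own analyzer. You can refer to the manual, or read the implementation of the Zend_Search_Lucene_Analysis_Analyzer_Common_Text class.
The point to watch is filtering. Because we segment into bigrams, the built-in token filters cannot remove words such as "的" or "啊": by the time a filter sees a token, the stop character is already fused into a two-character token. These words have to be removed before segmentation, which can be done in reset().
Example: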

Code:

<?php
require_once 'Zend/Search/Lucene/Analysis/Analyzer/Common.php';

class Phpbean_Lucene_Analyzer extends Zend_Search_Lucene_Analysis_Analyzer_Common {

    // Current byte offset into $this->_input
    private $_position;

    // Chinese stop words that are blanked out before segmentation
    private $_cnStopWords = array();

    public function setCnStopWords($cnStopWords){
        $this->_cnStopWords = $cnStopWords;
    }

    /**
     * Reset token stream
     */
    public function reset()
    {
        $this->_position = 0;
        // Punctuation and whitespace (ASCII and full-width) are replaced by spaces
        $search = array(",", "/", "\\", ".", ";", ":", "\"", "!", "~", "`", "^", "(", ")", "?", "-", "\t", "\n", "'", "<", ">", "\r", "\r\n", "$", "&", "%", "#", "@", "+", "=", "{", "}", "[", "]", "：", "）", "（", "．", "。", "，", "！", "；", "“", "”", "‘", "’", "［", "］", "、", "—", "　", "《", "》", "－", "…", "【", "】");
        $this->_input = str_replace($search, ' ', $this->_input);
        // Remove Chinese stop words before segmentation; token filters run too late for bigrams
        $this->_input = str_replace($this->_cnStopWords, ' ', $this->_input);
    }

    /**
     * Tokenization stream API
     * Get next token
     * Returns null at the end of stream
     *
     * @return Zend_Search_Lucene_Analysis_Token|null
     */
    public function nextToken()
    {
        if ($this->_input === null) {
            return null;
        }
        $length = strlen($this->_input);
        while ($this->_position < $length) {
            // Skip separators
            while ($this->_position < $length &&
                    $this->_input[$this->_position] == ' ') {
                $this->_position++;
            }
            if ($this->_position >= $length) {
                return null; // only trailing separators were left
            }
            $termStartPosition = $this->_position;
            $isCnWord = false;
            if (ord($this->_input[$this->_position]) > 127) {
                // Non-ASCII run: take two characters, assuming 3 bytes per character (UTF-8 CJK)
                $i = 0;
                while ($this->_position < $length &&
                        ord($this->_input[$this->_position]) > 127) {
                    $this->_position = $this->_position + 3;
                    $i++;
                    if ($i == 2) {
                        $isCnWord = true;
                        break;
                    }
                }
                // A single character cannot form a bigram; skip it
                if ($i == 1) continue;
            } else {
                // ASCII run: consume a whole alphanumeric word
                while ($this->_position < $length &&
                        ctype_alnum($this->_input[$this->_position])) {
                    $this->_position++;
                }
            }
            if ($this->_position == $termStartPosition) {
                // ASCII symbol not covered by reset(): skip it instead of ending the stream early
                $this->_position++;
                continue;
            }

            $token = new Zend_Search_Lucene_Analysis_Token(
                                      substr($this->_input,
                                             $termStartPosition,
                                             $this->_position - $termStartPosition),
                                      $termStartPosition,
                                      $this->_position);
            // Apply the registered token filters; a filter may return null to drop the token
            $token = $this->normalize($token);
            // Step back one character so that consecutive bigrams overlap
            if ($isCnWord) $this->_position = $this->_position - 3;
            if ($token !== null) {
                return $token;
            }
        }
        return null;
    }

}
?>
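The analyzer above advances by a fixed 3 bytes for every non-ASCII character, which holds for most Chinese characters in UTF-8 but not for 2-byte or 4-byte sequences. As a possible refinement (not part of the original code), the step could be derived from the lead byte. `utf8CharLength` is a hypothetical helper name, and the sketch assumes well-formed UTF-8 input.

```php
<?php
// Hypothetical helper: byte length of a UTF-8 character, judged from its first byte.
function utf8CharLength($char)
{
    $code = ord($char); // ord() looks only at the first byte of the string
    if ($code < 0x80) {
        return 1;       // ASCII
    }
    if ($code >= 0xC0 && $code < 0xE0) {
        return 2;       // e.g. accented Latin letters
    }
    if ($code >= 0xE0 && $code < 0xF0) {
        return 3;       // most CJK characters
    }
    if ($code >= 0xF0 && $code < 0xF8) {
        return 4;       // supplementary planes, e.g. rare CJK extensions
    }
    return 1;           // continuation or invalid byte: step over it
}

echo utf8CharLength('中'), "\n"; // 3
echo utf8CharLength('a'), "\n";  // 1
```

Replacing the hard-coded `+ 3` and `- 3` with lengths computed this way would require remembering the length of the second character of each bigram, so treat this only as a starting point.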

Demo: testing the analyzer output

Code:

<?php
require_once 'Zend/Search/Lucene/Analysis/TokenFilter/StopWords.php';
// The Phpbean_Lucene_Analyzer class must already be loaded at this point.

// English stop words are handled by the built-in token filter
$stopWords = array('a', 'an', 'at', 'the', 'and', 'or', 'is', 'am');
$stopWordsFilter = new Zend_Search_Lucene_Analysis_TokenFilter_StopWords($stopWords);
$analyzer = new Phpbean_Lucene_Analyzer();
// Chinese stop words are removed inside reset(), before segmentation
$cnStopWords = array('的');
$analyzer->setCnStopWords($cnStopWords);
$analyzer->addFilter($stopWordsFilter);
$value = 'this is " a test【中文】的测试';
$analyzer->setInput($value, 'utf-8');

$tokens = array();
while (($token = $analyzer->nextToken()) !== null) {
    $tokens[] = $token;
}
print_r($tokens);
?>

The output above consists of four tokens: "this", "test", "中文", and "测试", which is exactly what we want.
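The demo drops "的" because the stop word is blanked out before segmentation. The standalone sketch below shows why the order matters; `segmentRuns` is a hypothetical helper that only imitates the bigram step and does not use Zend Framework.

```php
<?php
// Hypothetical helper: split on spaces, then emit overlapping bigrams for each run.
function segmentRuns($text)
{
    $tokens = array();
    foreach (preg_split('/ +/', $text, -1, PREG_SPLIT_NO_EMPTY) as $run) {
        // The "u" modifier splits the run into UTF-8 characters
        $chars = preg_split('//u', $run, -1, PREG_SPLIT_NO_EMPTY);
        for ($i = 0; $i + 1 < count($chars); $i++) {
            $tokens[] = $chars[$i] . $chars[$i + 1];
        }
    }
    return $tokens;
}

$text = '中文的测试';

// Segmenting first: the stop word is already fused into two bigrams.
$unfiltered = segmentRuns($text); // 中文, 文的, 的测, 测试

// Blanking the stop word first keeps it out of every token.
$filtered = segmentRuns(str_replace(array('的'), ' ', $text)); // 中文, 测试

print_r($unfiltered);
print_r($filtered);
```

A token-level filter looking for "的" would find nothing to remove in the first result, since no token equals the stop word on its own.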

2009-02-09 16:05