
Release: IKAnalyzer Chinese Word Segmenter V3.1.5GA

   Posted: 2009-10-22   Last modified: 2009-10-22
Congratulations to forum member -1987 (Li Liangjie) on joining the IKAnalyzer development team, and thanks for his work testing the Solr integration.

Changes in V3.1.5GA:

1. Added org.wltea.analyzer.solr.IKTokenizerFactory, which supports configuration through Solr's TokenizerFactory interface

Class: org.wltea.analyzer.solr.IKTokenizerFactory
Description: this class extends Solr's BaseTokenizerFactory and is the IK segmenter's implementation of the Solr TokenizerFactory extension point, available since V3.1.5.
Attribute: isMaxWordLength. This attribute determines whether the segmenter uses maximum-word-length segmentation.
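Configured this way, the factory reads isMaxWordLength from the attributes of the &lt;tokenizer/&gt; element in schema.xml. A minimal, self-contained sketch of that attribute handling follows; the Stub class below is a hypothetical stand-in for Solr's BaseTokenizerFactory (which in the real setup comes from the solr jar), and this is not the actual 3.1.5GA source:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for org.apache.solr.analysis.BaseTokenizerFactory,
// provided by the solr jar in a real deployment.
abstract class BaseTokenizerFactoryStub {
    protected Map<String, String> args = new HashMap<String, String>();

    public void init(Map<String, String> args) {
        this.args = args;
    }
}

// Sketch of the factory described above: it reads the isMaxWordLength
// attribute from the <tokenizer/> element so create() can pass the flag
// on to the IK tokenizer.
class IKTokenizerFactorySketch extends BaseTokenizerFactoryStub {
    private boolean maxWordLength = false;  // default: fine-grained segmentation

    @Override
    public void init(Map<String, String> args) {
        super.init(args);
        String flag = args.get("isMaxWordLength");
        if (flag != null) {
            maxWordLength = Boolean.parseBoolean(flag);
        }
    }

    public boolean isMaxWordLength() {
        return maxWordLength;
    }
}

public class Main {
    public static void main(String[] argv) {
        IKTokenizerFactorySketch factory = new IKTokenizerFactorySketch();
        Map<String, String> attrs = new HashMap<String, String>();
        attrs.put("isMaxWordLength", "true");  // as written in schema.xml
        factory.init(attrs);
        System.out.println(factory.isMaxWordLength()); // prints "true"
    }
}
```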


Solr configuration examples

Configuration using IKAnalyzer
<schema name="example" version="1.1">
……
<fieldType name="text" class="solr.TextField">
      <analyzer class="org.wltea.analyzer.lucene.IKAnalyzer"/>
</fieldType>
……
</schema>


Configuration using IKTokenizerFactory
<fieldType name="text" class="solr.TextField" >
	<analyzer type="index">
		<tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory" isMaxWordLength="false"/>
……

	</analyzer>
	<analyzer type="query">
		<tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory" isMaxWordLength="true"/>
	……
	</analyzer>
</fieldType>


2. Fixed a bug from 3.1.3GA where, in certain corner cases, segments for unknown words were not output

3. At the request of many users, the jar is now compiled and released against JDK 5.0


Download: IKAnalyzer3.1.5GA full package

For more details, see the "IKAnalyzer Chinese Segmenter V3.1.5 User Manual"






   Posted: 2009-10-28
Hi, apart from the Google Code SVN, could you also provide a download of the 3.1.5GA source? Thanks.
   Posted: 2009-10-28
shadowlin wrote:
Hi, apart from the Google Code SVN, could you also provide a download of the 3.1.5GA source? Thanks.


Please post your email address.

We use SVN so that users can always get the latest code.
   Posted: 2009-11-05
Which version of Lucene are you building against? With lucene-core-2.9.1 I get this error:

org.wltea.analyzer.solr.IKTokenizerFactory.create(Ljava/io/Reader;)Lorg/apache/lucene/analysis/Tokenizer;

java.lang.AbstractMethodError: org.wltea.analyzer.solr.IKTokenizerFactory.create(Ljava/io/Reader;)Lorg/apache/lucene/analysis/Tokenizer;

It seems this style is no longer supported:

    public TokenStream create(Reader reader) {
        return new IKAnalyzer().tokenStream("text", reader);
    }
   Posted: 2009-11-05   Last modified: 2009-11-05
jbas wrote:
[the lucene-core-2.9.1 AbstractMethodError report, quoted in full above]


That code is not from IKTokenizerFactory; it was posted earlier in this thread. In 3.1.5GA, IKTokenizerFactory is no longer written that way.
   Posted: 2009-11-05
jbas wrote:
[the lucene-core-2.9.1 AbstractMethodError report, quoted in full above]


Besides, with plain Lucene you should not be using IKTokenizerFactory at all; use IKAnalyzer instead. See the documentation.
   Posted: 2009-11-05
jbas wrote:
[the lucene-core-2.9.1 AbstractMethodError report, quoted in full above]


How did you configure it? Like this:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
			<analyzer type="index">
				<tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory" isMaxWordLength="false"/>
				<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
				<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" />
				<filter class="solr.LowerCaseFilterFactory" />
				<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
				<filter class="solr.RemoveDuplicatesTokenFilterFactory" />
			</analyzer>
			<analyzer type="query">
				<tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory" isMaxWordLength="false"/>
				<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
				<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
				<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" />
				<filter class="solr.LowerCaseFilterFactory" />
				<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
				<filter class="solr.RemoveDuplicatesTokenFilterFactory" />
			</analyzer>
		</fieldType>


Also, the correct implementation of org.apache.solr.analysis.TokenizerFactory#create(java.io.Reader) looks like this:


public TokenStream create(Reader reader) {
    return new IKTokenizer(reader, isMaxWordLength());
}
   Posted: 2009-11-05
linliangyi2007, hi,
I followed your documentation exactly, so the configuration should be correct. The problem is most likely that lucene-core-2.9.1 changed the TokenStream structure; other Chinese segmenters hit the same issue, though some already support the latest 2.9.1. Please take another look.
Thanks!

When I run the query http://localhost:8983/solr/db/select/?q=title%3A%22%E4%BA%92%E8%81%94%E7%BD%91%22&version=2.2&start=0&rows=10&indent=on the following error is shown:

HTTP ERROR: 500
org.wltea.analyzer.solr.IKTokenizerFactory.create(Ljava/io/Reader;)Lorg/apache/lucene/analysis/Tokenizer;

java.lang.AbstractMethodError: org.wltea.analyzer.solr.IKTokenizerFactory.create(Ljava/io/Reader;)Lorg/apache/lucene/analysis/Tokenizer;
at org.apache.solr.analysis.TokenizerChain.getStream(TokenizerChain.java:69)
at org.apache.solr.analysis.SolrAnalyzer.reusableTokenStream(SolrAnalyzer.java:74)
at org.apache.solr.schema.IndexSchema$SolrIndexAnalyzer.reusableTokenStream(IndexSchema.java:364)
at org.apache.lucene.queryParser.QueryParser.getFieldQuery(QueryParser.java:567)





My configuration is as follows:

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
<tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory" isMaxWordLength="true"/> 

<!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
<tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory" isMaxWordLength="true"/> 

<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>
   Posted: 2009-11-05
jbas wrote:
[the AbstractMethodError report and schema.xml configuration, quoted in full above]


Thanks for the valuable feedback; we will follow up on the lucene 2.9.1 changes promptly.
(PS: Lucene's design really is a problem. Who keeps changing interfaces without backward compatibility!!!)
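The AbstractMethodError above is consistent with the newer Solr/Lucene narrowing the return type of TokenizerFactory.create(Reader) from TokenStream to Tokenizer: a factory jar compiled against the old signature no longer overrides the new abstract method, so the JVM fails when the method is invoked. Since Java 5 allows covariant return types, a factory recompiled to declare the narrower Tokenizer return satisfies callers expecting either type. A self-contained sketch of that mechanism follows; TokenStreamStub and TokenizerStub are hypothetical stand-ins for the Lucene classes, not the real API:

```java
// Stub hierarchy mirroring Lucene's, where Tokenizer extends TokenStream.
class TokenStreamStub {}
class TokenizerStub extends TokenStreamStub {}

// Old-style base class: create() is declared to return the broad type.
abstract class OldFactoryBase {
    abstract TokenStreamStub create();
}

// With Java 5 covariant return types, declaring the narrower TokenizerStub
// still overrides OldFactoryBase.create(), so one compiled class can satisfy
// callers that expect either return type.
class CovariantFactory extends OldFactoryBase {
    @Override
    TokenizerStub create() {
        return new TokenizerStub();
    }
}

public class Demo {
    public static void main(String[] argv) {
        OldFactoryBase factory = new CovariantFactory();
        TokenStreamStub stream = factory.create();
        System.out.println(stream instanceof TokenizerStub); // prints "true"
    }
}
```

This also explains why compiling with JDK 5.0 (change 3 in the release notes) matters: covariant return types do not exist before Java 5.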
   Posted: 2009-11-08
Changing:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory" isMaxWordLength="false"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory" isMaxWordLength="false"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>
</fieldType>

to:

<fieldType name="text" class="solr.TextField">
<analyzer class="org.wltea.analyzer.lucene.IKAnalyzer"/>
</fieldType>

fixes it.

As for why the tokenizer-based configuration throws the error, my guess is that the Solr 1.4 interface changed; still investigating...