Posted: 2009-10-22
Last edited: 2009-10-22
Changes in V3.1.5GA:

1. Added org.wltea.analyzer.solr.IKTokenizerFactory, which supports configuration through Solr's TokenizerFactory interface.

Class: org.wltea.analyzer.solr.IKTokenizerFactory
Description: this class extends Solr's BaseTokenizerFactory and is the IK Analyzer implementation of the Solr TokenizerFactory extension point. Available since V3.1.5.
Attribute: isMaxWordLength. This attribute controls whether the tokenizer uses maximum-word-length segmentation.

Solr configuration examples

Configuration using IKAnalyzer:

<schema name="example" version="1.1">
  ……
  <fieldType name="text" class="solr.TextField">
    <analyzer class="org.wltea.analyzer.lucene.IKAnalyzer"/>
  </fieldType>
  ……
</schema>

Configuration using IKTokenizerFactory:

<fieldType name="text" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory" isMaxWordLength="false"/>
    ……
  </analyzer>
  <analyzer type="query">
    <tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory" isMaxWordLength="true"/>
    ……
  </analyzer>
</fieldType>

2. Fixed a bug from 3.1.3GA where, in certain corner cases, segments for unknown words were not emitted.

3. At the request of many users, the jar is now compiled and released with JDK 5.0.

Download: IKAnalyzer3.1.5GA full package
For details, see the user manual 《IKAnalyzer中文分词器V3.1.5使用手册》 (IKAnalyzer Chinese Tokenizer V3.1.5 User Manual).

Notice: ITeye articles are copyrighted by their authors and protected by law. Reproduction without the author's written permission is prohibited.
Posted: 2009-10-28
Hello, apart from the googlecode SVN, could you also provide a download of the 3.1.5GA source? Thanks.
Posted: 2009-10-28
shadowlin wrote:
Hello, apart from the googlecode SVN, could you also provide a download of the 3.1.5GA source? Thanks.

Please provide your email address. We use SVN so that users can always get the latest code.
Posted: 2009-11-05
Hi, which Lucene version are you building against? With lucene-core-2.9.1 I get this error:

java.lang.AbstractMethodError: org.wltea.analyzer.solr.IKTokenizerFactory.create(Ljava/io/Reader;)Lorg/apache/lucene/analysis/Tokenizer;

It seems this style is no longer supported:

public TokenStream create(Reader reader) {
    return new IKAnalyzer().tokenStream("text", reader);
}
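The AbstractMethodError above is the classic symptom of an interface method whose return type was narrowed: a factory compiled with only the old TokenStream-returning create(Reader) never implements the Tokenizer-returning method the new runtime dispatches to. The stand-in sketch below uses hypothetical classes (not the real Lucene/Solr API) purely to show why a factory that declares the narrower return type Tokenizer satisfies both signatures via Java's covariant return types:

```java
// Stand-in types, NOT the real Lucene classes; they only mirror the
// subtype relationship "Tokenizer extends TokenStream".
class TokenStream { }
class Tokenizer extends TokenStream { }

// Old factory shape: create(...) returns the broad type TokenStream.
interface OldFactory { TokenStream create(String input); }

// New factory shape (matching the error signature above):
// create(...) returns the narrower type Tokenizer.
interface NewFactory { Tokenizer create(String input); }

// Declaring the narrower return type implements BOTH interfaces:
// it matches NewFactory exactly and overrides OldFactory.create
// covariantly, since Tokenizer is a subtype of TokenStream.
class SafeFactory implements OldFactory, NewFactory {
    public Tokenizer create(String input) {
        return new Tokenizer();
    }
}

class CovariantDemo {
    public static void main(String[] args) {
        OldFactory oldView = new SafeFactory();
        NewFactory newView = new SafeFactory();
        System.out.println(oldView.create("x") instanceof Tokenizer); // true
        System.out.println(newView.create("x") instanceof Tokenizer); // true
    }
}
```

A factory class that was compiled with only the TokenStream-returning method, by contrast, leaves the new abstract Tokenizer-returning method unimplemented, and the JVM reports exactly the AbstractMethodError quoted above at the first call.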
Posted: 2009-11-05
Last edited: 2009-11-05
jbas wrote:
Hi, which Lucene version are you building against? With lucene-core-2.9.1 I get java.lang.AbstractMethodError: org.wltea.analyzer.solr.IKTokenizerFactory.create(Ljava/io/Reader;)Lorg/apache/lucene/analysis/Tokenizer; It seems this style is no longer supported: public TokenStream create(Reader reader) { return new IKAnalyzer().tokenStream("text", reader); }

That code is not from IKTokenizerFactory; it was posted in an earlier thread. In 3.1.5GA, IKTokenizerFactory is no longer written that way.
Posted: 2009-11-05
jbas wrote:
Hi, which Lucene version are you building against? With lucene-core-2.9.1 I get java.lang.AbstractMethodError: org.wltea.analyzer.solr.IKTokenizerFactory.create(Ljava/io/Reader;)Lorg/apache/lucene/analysis/Tokenizer; ……

Besides, with plain Lucene you do not need IKTokenizerFactory at all; you should use IKAnalyzer instead. Please see the user manual.
Posted: 2009-11-05
jbas wrote:
Hi, which Lucene version are you building against? With lucene-core-2.9.1 I get java.lang.AbstractMethodError: org.wltea.analyzer.solr.IKTokenizerFactory.create(Ljava/io/Reader;)Lorg/apache/lucene/analysis/Tokenizer; ……

I would like to know how you configured it. Here is a configuration that works:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory" isMaxWordLength="false"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory" isMaxWordLength="false"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

And the correct implementation of org.apache.solr.analysis.TokenizerFactory#create(java.io.Reader) is:

public TokenStream create(Reader reader) {
    return new IKTokenizer(reader, isMaxWordLength());
}
Posted: 2009-11-05
linliangyi2007, hello,

I followed your documentation exactly, so the configuration should be correct. The problem appears to be that lucene-core-2.9.1 changed the TokenStream structure; other Chinese tokenizers have the same problem, but some of them already support the latest 2.9.1. Please take another look. Thanks!

When I run this query URL:

http://localhost:8983/solr/db/select/?q=title%3A%22%E4%BA%92%E8%81%94%E7%BD%91%22&version=2.2&start=0&rows=10&indent=on

I get the following error:

HTTP ERROR: 500
java.lang.AbstractMethodError: org.wltea.analyzer.solr.IKTokenizerFactory.create(Ljava/io/Reader;)Lorg/apache/lucene/analysis/Tokenizer;
    at org.apache.solr.analysis.TokenizerChain.getStream(TokenizerChain.java:69)
    at org.apache.solr.analysis.SolrAnalyzer.reusableTokenStream(SolrAnalyzer.java:74)
    at org.apache.solr.schema.IndexSchema$SolrIndexAnalyzer.reusableTokenStream(IndexSchema.java:364)
    at org.apache.lucene.queryParser.QueryParser.getFieldQuery(QueryParser.java:567)

My configuration:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory" isMaxWordLength="true"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory" isMaxWordLength="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
Posted: 2009-11-05
jbas wrote:
linliangyi2007, hello, I followed your documentation exactly, so the configuration should be correct. The problem appears to be that lucene-core-2.9.1 changed the TokenStream structure; other Chinese tokenizers have the same problem, but some of them already support the latest 2.9.1. Please take another look. Thanks! ……

Thanks for the valuable feedback; we will follow up on the lucene 2.9.1 changes promptly. (PS: Lucene's design really is problematic. Who keeps changing interfaces without backward compatibility!!!)
Posted: 2009-11-08
Changing:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory" isMaxWordLength="false"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory" isMaxWordLength="false"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

to:

<fieldType name="text" class="solr.TextField">
  <analyzer class="org.wltea.analyzer.lucene.IKAnalyzer"/>
</fieldType>

fixed it for me. As for why the tokenizer-based configuration throws the error, my guess is that the Solr 1.4 interface changed; still investigating...