The First Donation from an Open Source Supporter

linliangyi2007

Blog category: Programmer's Life
Tags: IK, Chinese word segmentation, open source, donation

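January 9, 2013: an ordinary day, and yet not an ordinary one!

After six years of development, and just as it enters its seventh, the IK Chinese word segmentation open source project has received its first donation!

Thank you to Andy, an enthusiastic supporter from Guangzhou! As an individual user of and donor to an open source project, your generosity is a tremendous affirmation and encouragement for IK, and indeed for the many people in China who start open source projects!

I do not know how many open source projects in China (those started by individuals) have ever received a donation, nor how many open source users in China have ever donated to a project they use.

I am writing this post to tell everyone that the open source community atmosphere we have been hoping and working for is taking shape, and that things are quietly changing. Brothers and sisters who care about open source, take action! Whether you are a user or a project initiator, every bit of your support and effort will add up in the end, the way grains of sand build a tower and drops of water fill a sea!

Perhaps in the not-too-distant future we will see a vibrant spring...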


2013-01-09 21:15
Category: Open Source Software

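#17 majiedota 2015-07-01

Keep it up!

#16 gibberish2 2014-12-15

I came here specifically to give the author a thumbs-up!

#15 亦梦亦真 2013-10-17

I ran into a strange problem. When I tokenize a file name such as "测试文件.rar" ("test file.rar"), it is split into "测试文件" and ".rar". But the English name "test.rar" is not split at all; the result is still "test.rar". Why is that? Is it a problem with IK, or with Lucene itself?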
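
The thread does not settle whether IK or Lucene is responsible for this difference. If the goal is simply to index the name and the extension as separate terms in both cases, one option is to split the file name before it reaches the analyzer. The sketch below is a minimal illustration in plain Java; `FileNameSplitSketch` and `split` are hypothetical names, not part of IK or Lucene, and the Chinese file name is written with Unicode escapes only to keep the source ASCII.

```java
/**
 * Hypothetical helper (not part of IK or Lucene): splits a file name into
 * base name and extension so each part can be analyzed on its own.
 */
final class FileNameSplitSketch {

    /** Returns {baseName, extension}; extension is "" when there is none. */
    static String[] split(String fileName) {
        int dot = fileName.lastIndexOf('.');
        // No dot, a leading dot (".bashrc") or a trailing dot: treat as no extension
        if (dot <= 0 || dot == fileName.length() - 1) {
            return new String[] { fileName, "" };
        }
        return new String[] { fileName.substring(0, dot), fileName.substring(dot + 1) };
    }

    public static void main(String[] args) {
        // The two file names from the question; the first is the Chinese one
        String[] names = { "\u6d4b\u8bd5\u6587\u4ef6.rar", "test.rar" };
        for (String name : names) {
            String[] parts = split(name);
            System.out.println(parts[0] + " | " + parts[1]);
        }
    }
}
```

Each part could then be passed to the analyzer separately, or the extension stored in its own field. Whether that suits a given index is a design decision this sketch does not try to answer.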

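#14 rubricate 2013-07-10

rubricate wrote:

Hello, and thank you for your contribution. I modified your tokenizer so that it can convert the segmented Chinese characters into pinyin and pinyin initials. It is described here: http://blog.csdn.net/liugang51096557/article/details/9291331

http://blog.csdn.net/liugang51096557/article/details/9291331

#13 rubricate 2013-07-10

Hello, and thank you for your contribution. I modified your tokenizer so that it can convert the segmented Chinese characters into pinyin and pinyin initials. It is described here: http://blog.csdn.net/liugang51096557/article/details/9291331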

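#12 dandongsoft 2013-06-18

Is there a Chinese edition of the book Solr in Action?

#11 淫笑琪 2013-06-14

Six years on, and only a single donation, probably not even 50. Is this meant to be ironic?

#10 biy 2013-05-10

Congratulations!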

#9 dandongsoft 2013-03-08

TopDocs

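#8 Iloseyou 2013-01-28

Teacher Lin, wishing you and your family a happy New Year in advance.

#7 linliangyi2007 2013-01-11

snakeling wrote:

Reporting a bug to you (already fixed).
To track it down I spent the last two days reading the entire Lucene 4.0 analyzer API documentation...
Exact phrase search has never worked in the current version, because the tokenizer does not call the Lucene interface correctly.
With IKTokenizer changed as follows, exact phrase search works.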

/**
 * IK Chinese word segmentation, version 5.0.1
 * IK Analyzer release 5.0.1
 * 
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 *
 * Source code provided by Linliangyi (linliangyi2005@gmail.com)
 * Copyright 2012, Oolong Studio
 *
 */
package org.wltea.analyzer.lucene;

import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

/**
 * Lucene Tokenizer adapter for the IK segmenter.
 * Compatible with Lucene 4.0.
 */
public final class IKTokenizer extends Tokenizer {
	
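	// IK segmenter implementation
	private IKSegmenter _IKImplement;
	
	// Token text attribute
	private final CharTermAttribute termAtt;
	// Token offset attribute
	private final OffsetAttribute offsetAtt;
	// Token type attribute (see the type constants in org.wltea.analyzer.core.Lexeme)
	private final TypeAttribute typeAtt;
	// End position of the last token emitted
	private int endPosition;
	// Number of positions a term spans, used for phrase search (by snakeling)
	private final PositionLengthAttribute posLenAtt;
	// Position increment, used by Lucene for phrase search (by snakeling)
	private final PositionIncrementAttribute posIncAtt;
	// Begin position of the previous token (by snakeling)
	private int lastBeginePosition;
	// The first term needs special handling; start as true so the first
	// token is handled correctly even before reset() sets it again
	private boolean bFirstTerm = true;
	/**
	 * Constructor for the Lucene 4.0 Tokenizer adapter.
	 * @param in reader supplying the text to segment
	 * @param useSmart true for smart segmentation, false for finest-grained segmentation
	 */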
	public IKTokenizer(Reader in , boolean useSmart){
	    super(in);
	    offsetAtt = addAttribute(OffsetAttribute.class);
	    termAtt = addAttribute(CharTermAttribute.class);
	    typeAtt = addAttribute(TypeAttribute.class);
	    posLenAtt = addAttribute(PositionLengthAttribute.class);
	    posIncAtt = addAttribute(PositionIncrementAttribute.class);
		_IKImplement = new IKSegmenter(input , useSmart);
	}

	/* (non-Javadoc)
	 * @see org.apache.lucene.analysis.TokenStream#incrementToken()
	 */
	@Override
	public boolean incrementToken() throws IOException {
		// Clear all token attributes
		clearAttributes();
		Lexeme nextLexeme = _IKImplement.next();
		if(nextLexeme != null){
			// Convert the Lexeme into attributes
			// Set the token text
			termAtt.append(nextLexeme.getLexemeText());
			// Set the token length
			termAtt.setLength(nextLexeme.getLength());
			// Set the token offsets. Rarely useful: Lucene does not store offsets by default. (by snakeling)
			offsetAtt.setOffset(nextLexeme.getBeginPosition(), nextLexeme.getEndPosition());
			// Record the end position of the latest token
			endPosition = nextLexeme.getEndPosition();
			// Record the token type
			typeAtt.setType(nextLexeme.getLexemeTypeString());
			// Set the term's position information. Phrase search relies on this,
			// not on the offsets above. (by snakeling)
			if(bFirstTerm){
				// The first token always advances by one position
				posIncAtt.setPositionIncrement(1);
				bFirstTerm = false;
			}else{
				// Later tokens advance by the gap between begin positions
				posIncAtt.setPositionIncrement(nextLexeme.getBeginPosition() - lastBeginePosition);
			}
			posLenAtt.setPositionLength(nextLexeme.getLength());
			lastBeginePosition = nextLexeme.getBeginPosition();
			// Return true: more tokens may follow
			return true;
		}
		// Return false: all tokens have been emitted
		return false;
	}
	
	/*
	 * (non-Javadoc)
	 * @see org.apache.lucene.analysis.TokenStream#reset()
	 */
	@Override
	public void reset() throws IOException {
		super.reset();
		_IKImplement.reset(input);
		bFirstTerm = true;  // by snakeling
		lastBeginePosition = 0;  // by snakeling
	}	
	
	@Override
	public final void end() {
	    // set final offset
		int finalOffset = correctOffset(this.endPosition);
		offsetAtt.setOffset(finalOffset, finalOffset);
	}
}

Thank you very much for reporting this problem. I will include your code in the next release package! Thanks for supporting IK.
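
To make the position arithmetic in the patch above concrete, here is a small stand-alone sketch in plain Java. It does not use Lucene or IK; `PositionIncrementSketch`, `increments`, and `positions` are hypothetical names, and the begin offsets are made-up sample values. It mirrors the rule in `incrementToken()`: the first token gets an increment of 1, and each later token gets the difference between its begin offset and the previous token's begin offset.

```java
import java.util.Arrays;

/**
 * Hypothetical stand-alone sketch (not part of IK or Lucene): derives
 * position increments from token begin offsets using the same rule as
 * the incrementToken() logic quoted above.
 */
final class PositionIncrementSketch {

    /** Returns one increment per token: 1 for the first, then the gap between begin offsets. */
    static int[] increments(int[] beginPositions) {
        int[] result = new int[beginPositions.length];
        boolean firstTerm = true; // plays the role of bFirstTerm
        int lastBegin = 0;        // plays the role of lastBeginePosition
        for (int i = 0; i < beginPositions.length; i++) {
            if (firstTerm) {
                result[i] = 1;
                firstTerm = false;
            } else {
                result[i] = beginPositions[i] - lastBegin;
            }
            lastBegin = beginPositions[i];
        }
        return result;
    }

    /** Accumulates increments into absolute positions; an increment of 1 on the first token gives position 0. */
    static int[] positions(int[] increments) {
        int[] result = new int[increments.length];
        int position = -1;
        for (int i = 0; i < increments.length; i++) {
            position += increments[i];
            result[i] = position;
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed sample: two overlapping tokens starting at offset 0, then tokens at 2 and 4
        int[] begins = { 0, 0, 2, 4 };
        int[] incs = increments(begins);
        System.out.println(Arrays.toString(incs));            // prints [1, 0, 2, 2]
        System.out.println(Arrays.toString(positions(incs))); // prints [0, 0, 2, 4]
    }
}
```

In this sample the second token overlaps the first, so its increment is 0 and both tokens share a position. Because query text is analyzed the same way, the relative gaps between terms line up at search time, which is what a phrase match depends on.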

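#6 linliangyi2007 2013-01-11

remoteJavaSky wrote:

This is exciting!
If I use a Chinese open source project and find it good, I will donate. Alipay would be best, of course; it is the only online payment method I have.

Yes, Andy donated through the Alipay account (linliangyi2005@gmail.com), haha.

#5 sharewind 2013-01-10

Congratulations!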

#4 snakeling 2013-01-10

Feel free to contact me if you have any questions.
snakedling@gmail.com

#3 snakeling 2013-01-10

[Same IKTokenizer listing as quoted in #7 above.]

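#2 whiletrue 2013-01-10

Congratulations!

#1 remoteJavaSky 2013-01-10

This is exciting!
If I use a Chinese open source project and find it good, I will donate. Alipay would be best, of course; it is the only online payment method I have.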
