Document Similarity Calculation

henry2009


Blog category: java


Some notes from recent work on a web crawler, written down for the record.

Document similarity is most commonly computed with the law of cosines (cosine similarity). A representative introduction is:

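Google China Blog (Google 黑板报), "The Beauty of Mathematics" series, part 12: the law of cosines and news classification (this points to a repost of the original found online, because the Google China Blog itself has been censored)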

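Each document is turned into a vector of term weights, and the similarity is the cosine of the angle between two such vectors. This is mainly useful for cluster statistics in a crawler and for document classification, and it is a fairly simple classification algorithm: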

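	// Imports used by these methods:
	// java.util.Collection, java.util.Iterator, java.util.Map, java.util.Map.Entry

	/**
	 * Computes the similarity of two documents.
	 * 
	 * @param doci
	 * 				the document to compare
	 * @param docj
	 * 				the sample (reference) document
	 * @return the cosine of the angle between the two term-weight vectors
	 */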
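	public double calculateSimilary(Document doci, Document docj) {
		Map<String, Integer> ifreq = doci.documentFreq(); // term -> frequency in the document
		Map<String, Integer> jfreq = docj.documentFreq();
		if (ifreq == null || jfreq == null) {
			return 0; // a Document implementation may return null, e.g. for empty content
		}
		
		// Dot product over the terms that occur in both documents
		double ijSum = 0;
		Iterator<Entry<String, Integer>> it = ifreq.entrySet().iterator();
		while (it.hasNext()) {
			Map.Entry<String,Integer> entry = it.next();
			if(jfreq.containsKey(entry.getKey())) {
				double iw = weight(entry.getValue());
				double jw = weight(jfreq.get(entry.getKey()));
				ijSum += (iw * jw);
			}
		}
		
		// Despite its name, powSum returns the length (Euclidean norm) of the weight vector
		double iPowSum = powSum(doci);
		double jPowSum = powSum(docj);
		if (iPowSum == 0 || jPowSum == 0) {
			return 0; // a document without terms would otherwise give 0/0 = NaN
		}
		
		return ijSum / (iPowSum * jPowSum);
	}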
	
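	/**
	 * Returns the Euclidean norm of the document's term-weight vector:
	 * the square root of the sum of the squared weights.
	 * 
	 * @param document
	 * 				the document to measure
	 * @return the vector length used in the cosine denominator
	 */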
	public double powSum(Document document) {
		Map<String, Integer> mapfreq = document.documentFreq();
		Collection<Integer> freqs = mapfreq.values();
		
		double sum = 0;
		for(int f : freqs) {
			double dw = weight(f);
			sum += Math.pow(dw, 2);
		}
		
		return Math.sqrt(sum);
	}
	
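	/**
	 * Computes the weight (feature value) of a term.
	 * @param wordfreq
	 * 				number of times the term occurs in the document
	 * @return the square root of the frequency
	 */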
	public double weight(float wordfreq) {
		return Math.sqrt(wordfreq);
	}

The closer the cosine of two documents is to 1, the more similar the documents are.

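When the cosine equals 1, the documents coincide: their term-weight vectors point in the same direction, as they do for two identical documents.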
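Written out as a formula, the code above computes the following. The notation is my own and not from the original post: $f_{i,t}$ is the number of times term $t$ occurs in document $d_i$, and $T_i$ is the set of distinct terms in $d_i$. Each weight is $\sqrt{f_{i,t}}$, as in weight().

```latex
\cos(d_i, d_j)
  = \frac{\sum_{t \in T_i \cap T_j} \sqrt{f_{i,t}}\,\sqrt{f_{j,t}}}
         {\sqrt{\sum_{t \in T_i} f_{i,t}}\;\sqrt{\sum_{t \in T_j} f_{j,t}}}
```

Because each squared weight is just the term frequency, the two factors in the denominator are the square roots of the total token counts of the two documents.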

The other Java classes:

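import java.util.Map;

public interface Document {

	/**
	 * Segments the document content into terms and counts them.
	 * @return {@link Map} from term to frequency
	 */
	public Map<String, Integer> segment();

	/**
	 * Returns the document's term frequencies.
	 * @return {@link Map} from term to frequency
	 */
	public Map<String, Integer> documentFreq();
}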

import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.logging.Level;
import java.util.logging.Logger;

// IKSegmentation and Lexeme come from the IK Analyzer library; their imports
// are left out here because the package name depends on the IK version in use.

public class DocumentIpml implements Document {

	private String content;
	private IKSegmentation ikSegmentation;
	private Logger logger = Logger.getLogger("DocumentIpmlLogger");
	private Map<String, Integer> dfreq; // cached term frequencies
	
	
	public DocumentIpml(String cont) {
		this.content = cont;
	}
	
	public Map<String, Integer> documentFreq() {
		// Segment lazily on first use and reuse the result afterwards
		if(dfreq == null || dfreq.isEmpty()) {
			dfreq = segment();
		}
		
		return dfreq;
	}
	
	public Map<String, Integer> segment() {
		Map<String, Integer> mapfreq = new HashMap<String, Integer>();
		
		if(this.content == null || content.isEmpty()) {
			logger.log(Level.WARNING, "document content can not be empty");
			return mapfreq; // empty map instead of null, so callers need no null check
		}
		
		if(ikSegmentation == null)
			ikSegmentation = new IKSegmentation(new StringReader(content), true);
		else 
			ikSegmentation.reset(new StringReader(content));
		
		Lexeme lexeme = null;
		
		try {
			// Count how many times each segmented term occurs
			while((lexeme = ikSegmentation.next()) != null) {
				String term = lexeme.getLexemeText();
				Integer freq = mapfreq.get(term);
				mapfreq.put(term, freq == null ? 1 : freq + 1);
			}
		} catch (IOException e) {
			logger.log(Level.SEVERE, "segmentation failed", e);
			return new HashMap<String, Integer>(); // discard partial counts
		}
		
		return mapfreq;
	}

}
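The classes above need the IK Analyzer jar to run. To try the calculation without it, here is a minimal self-contained sketch. It is not the code from this post: the class name CosineSketch and its method names are hypothetical, and plain whitespace splitting stands in for the IK segmenter, so it only makes sense for space-delimited text. The weighting (square root of the term frequency) and the cosine formula are the same as above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-alone sketch: whitespace splitting replaces the IK segmenter.
class CosineSketch {

    // Count how often each whitespace-separated token occurs.
    static Map<String, Integer> termFreq(String text) {
        Map<String, Integer> freq = new HashMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                freq.merge(token, 1, Integer::sum);
            }
        }
        return freq;
    }

    // Same weighting as in the post: square root of the term frequency.
    static double weight(int freq) {
        return Math.sqrt(freq);
    }

    // Euclidean norm of the weight vector.
    static double norm(Map<String, Integer> freq) {
        double sum = 0;
        for (int f : freq.values()) {
            double w = weight(f);
            sum += w * w;
        }
        return Math.sqrt(sum);
    }

    // Cosine of the angle between the two weight vectors; 0 if either is empty.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += weight(e.getValue()) * weight(other);
            }
        }
        double denom = norm(a) * norm(b);
        return denom == 0 ? 0 : dot / denom;
    }

    public static void main(String[] args) {
        Map<String, Integer> d1 = termFreq("the cat sat on the mat");
        Map<String, Integer> d2 = termFreq("the cat ate the fish");
        System.out.println("similarity: " + cosine(d1, d2));
    }
}
```

In this example the two sentences share "the" (twice in each) and "cat" (once in each), so the numerator is 2 + 1 = 3 and the denominator is sqrt(6) * sqrt(5), which gives a similarity of roughly 0.55.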

Results:

Similarity between 1.txt and 2.txt: 0.32460869971007195
Similarity between 1.txt and 3.txt: 0.21837417258281094
Similarity between 1.txt and 94.txt: 0.1805190131222515
Similarity between 1.txt and 77.txt: 0.14018416797440844
Similarity between txt6 and 77.txt: 0.1979109275388269

These documents are in the attachment.

If you know a better way to compute document similarity, I would welcome your advice.

My email:

liuziheng5726@gmail.com

txt.rar (7.8 KB)

2010-08-23 00:46

#4 henry2009 2013-04-19

eight90 wrote:

Where is its main function?

Do I really have to list the main function too~~~

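#3 henry2009 2013-04-19

deydoris wrote:

What word segmentation algorithm does this use? Is it an open-source one, or did you write the segmentation algorithm yourself?

I just used IK directly. I didn't bother writing my own segmentation algorithm, because that wasn't what I was focusing on.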

#2 eight90 2013-04-10

Where is its main function?

#1 deydoris 2012-06-11

What word segmentation algorithm does this use? Is it an open-source one, or did you write the segmentation algorithm yourself?
