Text Similarity Computation: Google's simHash and Hamming Distance

dengqsintyt


Blog categories: Algorithms, Tools, Data Mining

Tags: simHash, Hamming distance, similarity computation, analysis

I. Overview

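When it comes to computing text similarity, the first thing many developers think of is the vector space model (VSM). With VSM, the text is first tokenized, a feature vector is built for each document, and similarity is then computed as a distance or similarity measure between feature vectors, such as cosine similarity, Euclidean distance, or the Jaccard similarity coefficient. This approach has one big problem: documents have to be compared pairwise, so n documents need on the order of n² comparisons, and the method does not scale to massive collections of text. Consider a web-scale search engine such as Google, which indexes more than ten billion pages and whose crawler fetches millions to tens of millions of pages every day. To avoid indexing the same page more than once, the crawler has to detect duplicates. With VSM, the amount of computation would be prohibitive.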
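To make the cost concrete, here is a minimal sketch of the VSM approach. It assumes whitespace tokenization and raw term counts as feature weights, neither of which is specified above, and the class and method names are hypothetical. It computes cosine similarity for every pair of documents, which is the step that grows quadratically.

```java
import java.util.HashMap;
import java.util.Map;

public class VsmCosineSketch {

    // Build a sparse term-frequency vector from whitespace-separated tokens.
    public static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Euclidean length of a sparse vector.
    public static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int x : v.values()) {
            sum += (double) x * x;
        }
        return Math.sqrt(sum);
    }

    // Cosine of the angle between two sparse vectors; 0.0 if either is empty.
    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        double na = norm(a);
        double nb = norm(b);
        if (na == 0.0 || nb == 0.0) {
            return 0.0;
        }
        return dot / (na * nb);
    }

    public static void main(String[] args) {
        String[] docs = {
            "China people's Republic of China",
            "China people's Republic",
            "Linux shell sort command"
        };
        // Every pair is compared once: n * (n - 1) / 2 cosine computations.
        for (int i = 0; i < docs.length; i++) {
            for (int j = i + 1; j < docs.length; j++) {
                double sim = cosine(termFrequencies(docs[i]), termFrequencies(docs[j]));
                System.out.println(i + " vs " + j + ": " + sim);
            }
        }
    }
}
```

With three documents the nested loop runs three times; with ten billion pages the same loop would need roughly 5 × 10^19 comparisons, which is why a compact fingerprint per document is attractive.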

II. The Idea

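The input is an N-dimensional vector V, for example the feature vector of a text, in which each feature has a weight. The output is a C-bit binary signature S.

1) Initialize a C-dimensional vector Q to all zeros and the C-bit binary signature S to 0.

2) For each feature in V, compute a C-bit hash value H using a conventional hash function. For 1 <= i <= C:

if the i-th bit of H is 1, add the feature's weight to the i-th element of Q;

otherwise, subtract the feature's weight from the i-th element of Q.

3) For each i, if the i-th element of Q is greater than 0, set the i-th bit of S to 1; otherwise set it to 0.

4) Return the signature S.

Similar feature vectors produce signatures that differ in only a few bit positions, so two texts can be compared by the Hamming distance between their signatures.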
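The steps above can be traced by hand on a tiny input. The sketch below is only an illustration: it uses a 4-bit signature, takes the feature hashes as given instead of computing them, and indexes bits from 0 (least significant) where the steps count from 1. The class name, method name, and sample values are all hypothetical.

```java
public class WeightedSimHashSketch {

    // Steps 1 to 4 for a signature of `bits` bits (bits <= 63).
    // hashes[f] is the hash H of feature f, weights[f] is its weight.
    public static long signature(long[] hashes, int[] weights, int bits) {
        int[] q = new int[bits];                      // step 1: Q starts at zero
        for (int f = 0; f < hashes.length; f++) {     // step 2: one pass per feature
            for (int i = 0; i < bits; i++) {
                if (((hashes[f] >>> i) & 1L) == 1L) {
                    q[i] += weights[f];               // bit i of H is 1: add the weight
                } else {
                    q[i] -= weights[f];               // bit i of H is 0: subtract it
                }
            }
        }
        long s = 0L;
        for (int i = 0; i < bits; i++) {              // step 3: keep only the sign of Q
            if (q[i] > 0) {
                s |= (1L << i);
            }
        }
        return s;                                     // step 4
    }

    public static void main(String[] args) {
        long[] hashes = {0b1010, 0b0110, 0b0011};     // made-up 4-bit feature hashes
        int[] weights = {4, 2, 1};
        // Q by bit position 3..0 is [1, -3, 7, -5], so S = 1010.
        System.out.println(Long.toBinaryString(signature(hashes, weights, 4))); // prints 1010
    }
}
```

Here the heaviest feature (weight 4, hash 1010) outvotes the other two at every position, so the signature equals its hash. Lowering its weight to 3 makes bit 3 sum to exactly 0, and because step 3 requires a value greater than 0 that bit becomes 0.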

III. Java Implementation

import java.math.BigInteger;
import java.util.StringTokenizer;

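/**
 * SimHash fingerprint of a whitespace-tokenized string, with the Hamming
 * distance between two fingerprints as the measure of dissimilarity.
 */
public class SimHash {

	private String tokens;          // input text; tokens are separated by whitespace
	private BigInteger strSimHash;  // fingerprint computed from tokens
	private int hashbits = 128;     // fingerprint length in bits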

	public SimHash(String tokens) {
		this.tokens = tokens;
		this.strSimHash = this.simHash();
	}
	
	public SimHash(String tokens, int hashbits) {
		this.tokens = tokens;
		this.hashbits = hashbits;
		this.strSimHash = this.simHash();
	}

	/**
	 * Computes the fingerprint. Every token occurrence carries weight 1,
	 * so a token's frequency in the text acts as its weight.
	 */
	public BigInteger simHash() {
		int[] v = new int[this.hashbits];
		StringTokenizer stringTokens = new StringTokenizer(this.tokens);
		while (stringTokens.hasMoreTokens()) {
			String temp = stringTokens.nextToken();
			BigInteger t = this.hash(temp);
			System.out.println("temp = " + temp + " : " + t);
			for (int i = 0; i < this.hashbits; i++) {
				// Bit i of the token hash is set: add the weight; otherwise subtract it.
				if (t.testBit(i)) {
					v[i] += 1;
				} else {
					v[i] -= 1;
				}
			}
		}
		BigInteger fingerprint = BigInteger.ZERO;
		for (int i = 0; i < this.hashbits; i++) {
			// Bit i of the fingerprint is 1 only when the sum is strictly positive.
			if (v[i] > 0) {
				fingerprint = fingerprint.setBit(i);
			}
		}
		return fingerprint;
	}

	/**
	 * Hashes one token to a non-negative value: multiply by 1000003, XOR in
	 * the next character, and truncate to hashbits bits at every step.
	 */
	private BigInteger hash(String source) {
		if (source == null || source.length() == 0) {
			return BigInteger.ZERO;
		}
		char[] sourceArray = source.toCharArray();
		BigInteger x = BigInteger.valueOf(((long) sourceArray[0]) << 7);
		BigInteger m = BigInteger.valueOf(1000003L);
		BigInteger mask = BigInteger.ONE.shiftLeft(this.hashbits).subtract(BigInteger.ONE);
		for (char item : sourceArray) {
			BigInteger temp = BigInteger.valueOf((long) item);
			x = x.multiply(m).xor(temp).and(mask);
		}
		// Mix in the token length. x was masked above, so it stays non-negative.
		x = x.xor(BigInteger.valueOf(source.length()));
		return x;
	}

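	/**
	 * Hamming distance: the number of bit positions in which the two
	 * fingerprints differ. Both fingerprints should use the same hashbits.
	 */
	public int hammingDistance(SimHash other) {
		BigInteger m = BigInteger.ONE.shiftLeft(this.hashbits).subtract(BigInteger.ONE);
		BigInteger x = this.strSimHash.xor(other.strSimHash).and(m);
		int tot = 0;
		// Each iteration clears the lowest set bit of x, so the loop
		// runs once per differing bit.
		while (x.signum() != 0) {
			tot += 1;
			x = x.and(x.subtract(BigInteger.ONE));
		}
		return tot;
	}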

	public static void main(String[] args) {
		String s = "China people's Republic of China Chinese China people's Republic of China People's Republic of China";
		SimHash hash1 = new SimHash(s, 128);
		System.out.println(hash1.strSimHash + "  "
				+ hash1.strSimHash.bitCount());

		s = "China people's Republic of China Chinese China people's Republic of China";
		SimHash hash2 = new SimHash(s, 128);
		System.out.println(hash2.strSimHash + "  "
				+ hash2.strSimHash.bitCount());

		s = "China people's Republic";
		SimHash hash3 = new SimHash(s, 128);
		System.out.println(hash3.strSimHash + "  "
				+ hash3.strSimHash.bitCount());

		System.out.println("============================");
		System.out.println(hash1.hammingDistance(hash2));
		System.out.println(hash1.hammingDistance(hash3));
	}
}
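The Hamming distance itself is just a population count of the XOR of two fingerprints. As a cross-check on the loop in hammingDistance, here is a minimal sketch using the JDK's built-in bit counting; the class and method names are hypothetical, and the BigInteger variant assumes both fingerprints are non-negative, as they are in the class above.

```java
import java.math.BigInteger;

public class HammingSketch {

    // For fingerprints of at most 64 bits stored in a long.
    public static int hamming(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    // For wider fingerprints; assumes a and b are non-negative, in which
    // case bitCount() is the number of 1 bits in the XOR.
    public static int hamming(BigInteger a, BigInteger b) {
        return a.xor(b).bitCount();
    }

    public static void main(String[] args) {
        // 1011 XOR 0010 = 1001, which has two 1 bits.
        System.out.println(hamming(0b1011L, 0b0010L)); // prints 2
        System.out.println(hamming(new BigInteger("1011", 2), new BigInteger("0010", 2))); // prints 2
    }
}
```

A small distance means the two texts share most of their weighted features; identical texts always give distance 0.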


2014-07-22 17:27