String Similarity Algorithms (3): NGram Distance

jimmee

Blog categories: J2SE, Algorithms, Search Engines

Tags: ngram distance, string similarity

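NGram distance is the n-gram version of edit distance. Instead of comparing single characters, it compares the n-grams ending at each position, and the substitution cost is the fraction of positions at which the two n-grams differ. Despite its name, the method below returns a similarity between 0 and 1, where 1 means the two strings are identical.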
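The recurrence computed by the code in this post can be written out as follows. This is a sketch in my own notation, not notation from the original post: $D(i,j)$ is the table cell for the first $i$ characters of the source $s$ and the first $j$ characters of the target $t$, $m(i,j)$ is the number of positions at which the n-grams ending at $s_i$ and $t_j$ differ, and $z(i,j)$ is the number of positions at which both n-grams hold the padding character. It applies when both strings have at least $n$ characters.

```latex
\begin{aligned}
D(i,0) &= i, \qquad D(0,j) = j,\\
D(i,j) &= \min\bigl(D(i-1,j)+1,\; D(i,j-1)+1,\; D(i-1,j-1)+c(i,j)\bigr),\\
c(i,j) &= \frac{m(i,j)}{n - z(i,j)},\\
\mathrm{sim}(s,t) &= 1 - \frac{D(|s|,|t|)}{\max(|s|,|t|)}.
\end{aligned}
```

With $n = 1$ and no padding, $c(i,j)$ is either 0 or 1 and the recurrence reduces to ordinary Levenshtein distance.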

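  // n is the n-gram size, a field of the enclosing class.
  // Returns a similarity in [0, 1]; 1 means the strings are identical.
  public float getDistance(String source, String target) {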
    final int sl = source.length();
    final int tl = target.length();
    
    if (sl == 0 || tl == 0) {
      if (sl == tl) {
        return 1;
      }
      else {
        return 0;
      }
    }

    int cost = 0;
    if (sl < n || tl < n) {
      for (int i=0,ni=Math.min(sl,tl);i<ni;i++) {
        if (source.charAt(i) == target.charAt(i)) {
          cost++;
        }
      }
      return (float) cost/Math.max(sl, tl);
    }

    char[] sa = new char[sl+n-1];
    float p[]; //'previous' cost array, horizontally
    float d[]; // cost array, horizontally
    float _d[]; //placeholder to assist in swapping p and d
    
    // construct sa with a prefix of n-1 padding characters (0),
    // so that every position of source has a full n-gram
    for (int i=0;i<sa.length;i++) {
      if (i < n-1) {
        sa[i]=0; //add prefix
      }
      else {
        sa[i] = source.charAt(i-n+1);
      }
    }
    p = new float[sl+1]; 
    d = new float[sl+1]; 
  
    // indexes into strings s and t
    int i; // iterates through source
    int j; // iterates through target

    char[] t_j = new char[n]; // jth n-gram of t

    // initialize the first row of the edit-distance table
    for (i = 0; i<=sl; i++) {
        p[i] = i;
    }

    for (j = 1; j<=tl; j++) { // process the second row through the last (tl-th) row
        // construct t_j, the j-th n-gram of target
        if (j < n) { // pad with the prefix
          for (int ti=0;ti<n-j;ti++) {
            t_j[ti]=0; //add prefix
          }
          for (int ti=n-j;ti<n;ti++) {
            t_j[ti]=target.charAt(ti-(n-j));
          }
        }
        else { // take the n-gram directly
          t_j = target.substring(j-n, j).toCharArray();
        }
        d[0] = j;
        for (i=1; i<=sl; i++) {
            cost = 0;
            int tn=n;
            // compare sa to t_j to compute the substitution cost f(i,j)
            for (int ni=0;ni<n;ni++) {
              if (sa[i-1+ni] != t_j[ni]) {
                cost++;
              }
              else if (sa[i-1+ni] == 0) { //discount matches on prefix
                tn--;
              }
            }
            float ec = (float) cost/tn;
            // minimum of cell to the left+1, to the top+1, diagonally left and up +cost
            d[i] = Math.min(Math.min(d[i-1]+1, p[i]+1),  p[i-1]+ec);
        }
        // copy current distance counts to 'previous row' distance counts
        _d = p;
        p = d;
        d = _d;
    }

    // our last action in the above loop was to switch d and p, so p now
    // actually has the most recent cost counts
    return 1.0f - ((float) p[sl] / Math.max(tl, sl));
  }
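
The method above depends on a field `n` from its enclosing class, so it cannot be run on its own. The following is a minimal self-contained sketch of the same computation, assuming `n` is passed as a parameter and both strings are padded up front rather than building the target n-gram on each row. The class name `NGramSimilaritySketch` and the helpers `pad` and `gramCost` are hypothetical names introduced here for illustration; they do not come from the original post.

```java
final class NGramSimilaritySketch {

    /** Pads s with n-1 leading '\0' characters so every position has a full n-gram. */
    static char[] pad(String s, int n) {
        char[] out = new char[s.length() + n - 1]; // Java zero-initializes the prefix
        s.getChars(0, s.length(), out, n - 1);
        return out;
    }

    /** Fraction of differing positions between the n-grams starting at a[i] and b[j]. */
    static float gramCost(char[] a, int i, char[] b, int j, int n) {
        int mismatches = 0;
        int counted = n;
        for (int k = 0; k < n; k++) {
            if (a[i + k] != b[j + k]) {
                mismatches++;
            } else if (a[i + k] == 0) {
                counted--; // padding matched padding: do not count this position
            }
        }
        return (float) mismatches / counted;
    }

    /** Similarity in [0, 1]; 1 means the strings are identical. */
    static float similarity(String source, String target, int n) {
        int sl = source.length();
        int tl = target.length();
        if (sl == 0 || tl == 0) {
            return sl == tl ? 1f : 0f;
        }
        if (sl < n || tl < n) {
            // Too short for a full n-gram: count equal characters position by position.
            int same = 0;
            for (int i = 0; i < Math.min(sl, tl); i++) {
                if (source.charAt(i) == target.charAt(i)) {
                    same++;
                }
            }
            return (float) same / Math.max(sl, tl);
        }
        char[] sa = pad(source, n);
        char[] ta = pad(target, n);
        float[] prev = new float[sl + 1]; // previous row of the table
        float[] cur = new float[sl + 1];  // row being filled
        for (int i = 0; i <= sl; i++) {
            prev[i] = i;
        }
        for (int j = 1; j <= tl; j++) {
            cur[0] = j;
            for (int i = 1; i <= sl; i++) {
                float ec = gramCost(sa, i - 1, ta, j - 1, n);
                // left + 1, top + 1, or diagonal + n-gram substitution cost
                cur[i] = Math.min(Math.min(cur[i - 1] + 1, prev[i] + 1), prev[i - 1] + ec);
            }
            float[] tmp = prev; // swap rows; prev now holds the newest row
            prev = cur;
            cur = tmp;
        }
        return 1.0f - prev[sl] / Math.max(sl, tl);
    }
}
```

For example, with n = 2, `similarity("ab", "ac", 2)` works out to 0.75: the n-grams `ab` and `ac` differ in one of two positions, so the last diagonal step costs 0.5, and 1 - 0.5/2 = 0.75.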


2014-06-08 17:54