原文:http://www.javacodegeeks.com/2013/11/java-implementation-of-optimal-string-alignment.html
---------------------------------------------------------------------------------------------------------------------------------
For a while, I’ve used the Apache Commons lang StringUtils implementation ofLevenshtein distance. It implements a few well known tricks to use less memory by only hanging on to two arrays instead of allocating a huge n x m table for the memoisation table. It also only checks a “stripe” of width 2 * k +1 where k is the maximum number of edits.
In most practical usages of levenshtein you just care if a string is within some small number (1, 2, 3) of edits from another string. This avoid much of the n * m computation that makes levenstein “expensive”. We found that with a k <= 3, levenshtein with these tricks was faster than Jaro-Winkler distance, which is an approximate edit distance calculation that was created to be a faster approximate (well there were many reasons).
Unfortunately, the Apache Commons Lang implementation only calculates Levenshtein and not the possible more useful Damerau-Levenshtein distance. Levenshtein defines the edit operations insert, delete, and substitute. The Damerau variant adds *transposition* to the list, which is pretty useful for most of the places I use edit distance. Unfortunately DL distance is not a true metric in that it doesn’t respect the triangle inequality, but there are plenty of applications that are unaffected by this. As you can see from that wikipedia page, there is often confusion between Optimal String Alignment and DL distance. In practice OSA is a simpler algorithm and requires less book-keeping so the runtime is probably marginally faster.
I could not find any implementations of OSA or DL that used the memory tricks and “stripe” tricks that I saw in Apache Commons Lang. So I implemented my own OSA using those tricks. At some point I’ll also implement DL with the tricks and see what the performance differences are:
Here’s OSA in Java. It’s public domain; feel free to use as you like. The unit tests are below. Only dependency is on Guava- but its just the preconditions class and an annotation for documentation so easy to remove that dependency if you like:
package com.github.steveash.util; import static com.google.common.base.Preconditions.checkArgument; import static com.google.common.base.Preconditions.checkNotNull; import static com.google.common.primitives.Shorts.checkedCast; import static java.lang.Math.abs; import static java.lang.Math.max; import java.util.Arrays; import com.google.common.annotations.VisibleForTesting; /** * Implementation of the OSA which is similar to the Damerau-Levenshtein in that it allows for transpositions to * count as a single edit distance, but is not a true metric and can over-estimate the cost because it disallows * substrings to edited more than once. See wikipedia for more discussion on OSA vs DL * <p/> * See Algorithms on Strings, Trees and Sequences by Dan Gusfield for more information. * <p/> * This also has a set of local buffer implementations to avoid allocating new buffers each time, which might be * a premature optimization * <p/> * @author Steve Ash */ public class OptimalStringAlignment { private static final int threadLocalBufferSize = 64; private static final ThreadLocal<short[]> costLocal = new ThreadLocal<short[]>() { @Override protected short[] initialValue() { return new short[threadLocalBufferSize]; } }; private static final ThreadLocal<short[]> back1Local = new ThreadLocal<short[]>() { @Override protected short[] initialValue() { return new short[threadLocalBufferSize]; } }; private static final ThreadLocal<short[]> back2Local = new ThreadLocal<short[]>() { @Override protected short[] initialValue() { return new short[threadLocalBufferSize]; } }; public static int editDistance(CharSequence s, CharSequence t, int threshold) { checkNotNull(s, "cannot measure null strings"); checkNotNull(t, "cannot measure null strings"); checkArgument(threshold >= 0, "Threshold must not be negative"); checkArgument(s.length() < Short.MAX_VALUE, "Cannot take edit distance of strings longer than 32k chars"); checkArgument(t.length() < Short.MAX_VALUE, "Cannot take edit distance of strings longer than 32k chars"); if (s.length() + 1 > threadLocalBufferSize || t.length() + 1 > threadLocalBufferSize) return editDistanceWithNewBuffers(s, t, checkedCast(threshold)); short[] cost = costLocal.get(); short[] back1 = back1Local.get(); short[] back2 = back2Local.get(); return editDistanceWithBuffers(s, t, checkedCast(threshold), back2, back1, cost); } @VisibleForTesting static int editDistanceWithNewBuffers(CharSequence s, CharSequence t, short threshold) { int slen = s.length(); short[] back1 = new short[slen + 1]; // "up 1" row in table short[] back2 = new short[slen + 1]; // "up 2" row in table short[] cost = new short[slen + 1]; // "current cost" return editDistanceWithBuffers(s, t, threshold, back2, back1, cost); } private static int editDistanceWithBuffers(CharSequence s, CharSequence t, short threshold, short[] back2, short[] back1, short[] cost) { short slen = (short) s.length(); short tlen = (short) t.length(); // if one string is empty, the edit distance is necessarily the length of the other if (slen == 0) { return tlen <= threshold ? tlen : -1; } else if (tlen == 0) { return slen <= threshold ? slen : -1; } // if lengths are different > k, then can't be within edit distance if (abs(slen - tlen) > threshold) return -1; if (slen > tlen) { // swap the two strings to consume less memory CharSequence tmp = s; s = t; t = tmp; slen = tlen; tlen = (short) t.length(); } initMemoiseTables(threshold, back2, back1, cost, slen); for (short j = 1; j <= tlen; j++) { cost[0] = j; // j is the cost of inserting this many characters // stripe bounds int min = max(1, j - threshold); int max = min(slen, (short) (j + threshold)); // at this iteration the left most entry is "too much" so reset it if (min > 1) { cost[min - 1] = Short.MAX_VALUE; } iterateOverStripe(s, t, j, cost, back1, back2, min, max); // swap our cost arrays to move on to the next "row" short[] tempCost = back2; back2 = back1; back1 = cost; cost = tempCost; } // after exit, the current cost is in back1 // if back1[slen] > k then we exceeded, so return -1 if (back1[slen] > threshold) { return -1; } return back1[slen]; } private static void iterateOverStripe(CharSequence s, CharSequence t, short j, short[] cost, short[] back1, short[] back2, int min, int max) { // iterates over the stripe for (int i = min; i <= max; i++) { if (s.charAt(i - 1) == t.charAt(j - 1)) { cost[i] = back1[i - 1]; } else { cost[i] = (short) (1 + min(cost[i - 1], back1[i], back1[i - 1])); } if (i >= 2 && j >= 2) { // possible transposition to check for if ((s.charAt(i - 2) == t.charAt(j - 1)) && s.charAt(i - 1) == t.charAt(j - 2)) { cost[i] = min(cost[i], (short) (back2[i - 2] + 1)); } } } } private static void initMemoiseTables(short threshold, short[] back2, short[] back1, short[] cost, short slen) { // initial "starting" values for inserting all the letters short boundary = (short) (min(slen, threshold) + 1); for (short i = 0; i < boundary; i++) { back1[i] = i; back2[i] = i; } // need to make sure that we don't read a default value when looking "up" Arrays.fill(back1, boundary, slen + 1, Short.MAX_VALUE); Arrays.fill(back2, boundary, slen + 1, Short.MAX_VALUE); Arrays.fill(cost, 0, slen + 1, Short.MAX_VALUE); } private static short min(short a, short b) { return (a <= b ? a : b); } private static short min(short a, short b, short c) { return min(a, min(b, c)); } }
相关推荐
damerau-levenshtein-js NPM软件包,用于同步或异步计算字符串之间的Damerau-Levenshtein距离。 安装 npm i damerau-levenshtein-js 用法 调用“ distance”或“ distanceProm”函数将输出一个整数,即计算出的两...
该gem实现了纯Levenshtein算法,即Damerau的改进算法(其中2个字符换位算作1个编辑距离)。 它还包括Boermer&Rees 2008对Damerau算法的修改,其中也考虑了大于1个字符块的转置 。 require "damerau-levenshtein...
获取与Damerau-Levenshtein距离的文本相似度。 要求 PHP 7.1.0或更高版本。 安装 composer require oefenweb/damerau-levenshtein 用法 $ pattern = 'foo bar' ; $ string = 'fuu baz' ; $ damerauLevenshtein = ...
Crystal语言的Damerau-Levenshtein算法实现。 安装 将其添加到Projectfile deps do github " suxxes/damerau-levenshtein " end 用法 require " damerau-levenshtein " DamerauLevenshtein .distance( " string " ...
pyxDamerauLevenshtein在Cython中为Python实现了Damerau-Levenshtein(DL)编辑距离算法,以实现高性能。 礼貌: 在信息论和计算机科学中,Damerau-Levenshtein距离(以Frederick J. Damerau和Vladimir I. ...
通过Quicklisp(推荐): (ql:quickload " mk-string-metrics " )文献资料 damerau-levenshtein x y计算两个给定字符串x和y之间的Damerau-Levenshtein距离。 hamming x y计算两个给定字符串x和y之间的汉明距离,...
它提供了一个接受两个字符串参数并返回如下哈希值的函数: { steps : 5 , // Levenstein demerau distance relative : 0.7 , // steps / length of the longer string similarity : 0.3 // 1 - relative } 安装 ...
"edits.cr"是一个用Crystal编程语言实现的库,它提供了对几种常见的编辑距离算法的支持,包括Jaro距离、Damerau-Levenshtein距离以及最佳对准算法。 1. **Jaro距离**:由Martin Winkler在1980年代提出的Jaro距离是...
Levenshtein距离是一种衡量字符串相似度的算法,它计算的是将一个字符串转换成另一个字符串所需要的最少单字符编辑(插入、删除或替换)的数量。 Levenshtein距离的核心思想是动态规划,由俄国数学家Vladimir ...
在"用Levenshtein距离算法最快的JS实现来测量两个字符串之间的差异"这个主题中,我们关注的是如何高效地在JavaScript中执行这个算法。通常,Levenshtein距离的计算会涉及到二维数组的填充,这可能导致性能问题,尤其...
在`python-Levenshtein-master`这个压缩包中,可能包含了Levenshtein库的源代码,包括实现算法的Python文件、测试用例、文档等资源。开发者可以通过阅读源代码来深入理解其内部工作原理,并根据项目需求进行定制和...
在"WebApplication1"这个项目中,可能包含了一个简单的示例程序,演示了如何实现这些字符串相似度算法并进行模糊匹配。通过学习和理解这些算法,开发者可以构建出能够处理各种模糊查询和相似性比较的应用,提升用户...
一个实现不同字符串相似度和距离度量的库。目前实现了十几种算法(包括 Levenshtein 编辑距离和兄弟、Jaro-Winkler、最长公共子序列、余弦相似度等)。查看下面的汇总表以获取完整列表... python字符串相似度 下载 ...
编辑距离,又称Levenshtein距离,是一种衡量两个字符串相似度的方法。它是通过计算将一个字符串转换成另一个字符串所需的最少单字符编辑(插入、删除或替换)次数来定义的。在文本处理、生物信息学、搜索建议等领域...
支持快速汉明、Levenshtein、受限Damerau-Levenshtein等距离计算和字符串搜索。尽管矢量化 SIMD 代码比它们的标量对应物允许高达 20-30 倍的加速,但处理依赖于平台的 SIMD 代码的困难使得 SIMD 例程不那么有吸引力...
Levenshtein 距离(也称为编辑距离)是衡量两个字符串之间最少单字符编辑操作(插入、删除、替换)次数的方法。例如,将 "kitten" 变为 "sitting" 需要3次操作:将 "k" 替换为 "s",将 "e" 替换为 "i",并在末尾...
在Go中实现Levenshtein距离算法,可以有效地帮助开发者解决与字符串相似度相关的实际问题。 Levenshtein距离算法的基本思想是通过计算将一个字符串转换为另一个字符串所需的最少单字符编辑操作次数(插入、删除或...
在IT领域,字符串相似度...总之,编辑距离算法(Levenshtein Distance)是字符串处理中的一个重要工具,它提供了一种量化字符串相似性的方法。理解并熟练运用这个算法能够帮助我们在处理文本数据时做出更准确的决策。
1. **Levenshtein距离**:Levenshtein距离是一种衡量两个字符串相似度的方法,定义为由一个字符串转换成另一个字符串最少的单字符编辑操作次数(插入、删除或替换)。Java中可以自定义实现,或者使用开源库Apache ...