KMP快速字符串查找算法 -

rein07

浏览: 22274 次
性别:
来自: 南京

最近访客更多访客>>

hnjzsyjyj

houwanxing

HeadFirst

博主相关

博客

微博

相册

留言

关于我

文章分类

社区版块

存档分类

2011-08 ( 20)
更多存档...

KMP快速字符串查找算法

博客分类：

算法

在C/C++语言编程过程中，一般的字符串搜索操作都是通过标准库的strstr()函数来完成的，这在通常的情况下，因为字符串的搜索操作不多，并不会产生效率问题。实际上，这个函数的时间复杂度不容乐观。如果要从长度为n的字符串中查找长度为m的子字符串，那么这个strstr()函数的最坏时间复杂度为O(n*m)，可见，随着子字符串长度m的增大，strstr()函数的时间复杂度也相应地成倍增加，有没有更加高效的算法呢？

KMP（Knuth-Morris-Pratt）算法通过预先计算模式字符串中相应字符处的回溯索引，避免了模式匹配时不必要的回溯操作，从而提高了效率，将时间复杂度变成了O(m+n)。

KMP字符串查找（匹配）算法最大的好处，并不是它比strstr快，而是它不回溯。这是很奇妙的一个特征。这意味着目标文本只需要提供一个取得下一个字符的函数（在WINX中，这个函数叫get），就可以实现搜索。这对KMP算法的客户而言，无疑是非常有利的一件事情。

WINX的KMP字符串查找（匹配）算法总体来说用法很简单，唯一需要注意的是，和一般的匹配算法不同，WINX匹配成功后，目标文本中当前位置（position）指向的是被匹配串的末尾，而不是开始。例如，C库的strstr("1234abcdefg", "abc")，返回的结果是指向"abcdefg"中的'a'。而WINX的KMP算法返回的是"defg"中的'd'。

Java SDK的String类中的indexOf方法没有使用KMP搜索，基本上算是最简单的搜索：

view sourceprint?01 /**

02 * Code shared by String and StringBuffer to do searches. The source is the

03 * character array being searched, and the target is the string being

04 * searched for.

05 *

06 * @param source

07 *            the characters being searched.

08 * @param sourceOffset

09 *            offset of the source string.

10 * @param sourceCount

11 *            count of the source string.

12 * @param target

13 *            the characters being searched for.

14 * @param targetOffset

15 *            offset of the target string.

16 * @param targetCount

17 *            count of the target string.

18 * @param fromIndex

19 *            the index to begin searching from.

20 */

21 static int indexOf(char[] source, int sourceOffset, int sourceCount,

22         char[] target, int targetOffset, int targetCount, int fromIndex) {

23     // if start from a position that is beyond the source string

24     if (fromIndex >= sourceCount) {

25         // return the string length if target string is empty, otherwise,

26         // return -1 which means match fails

27         return (targetCount == 0 ? sourceCount : -1);

28     }

29     // correct the fromIndex

30     if (fromIndex < 0) {

31         fromIndex = 0;

32     }

33     // if target string is empty, return fromIndex

34     if (targetCount == 0) {

35         return fromIndex;

36     }

37     // first char to match

38     char first = target[targetOffset];

39     /*

40      * a little optimize. let's say the source string length is 9 and the

41      * target String length is 7. Then starting from 3 (index is 2) of

42      * source string is the last change to match the whole target sting.

43      * Otherwise, there are only 6 characters in source string and it would

44      * definitely not going to match the target string whose length is 7.

45      */

46     int max = sourceOffset + (sourceCount - targetCount);

47     // loop from the first to the max

48     for (int i = sourceOffset + fromIndex; i <= max; i++) {

49         /* Look for first character. */

50         if (source[i] != first) {

51             // using i <= max, not i < max

52             while (++i <= max && source[i] != first)

53                 ;

54         }

55         /* Found first character, now look at the rest of v2 */

56         if (i <= max) {

57             int j = i + 1;

58             int end = j + targetCount - 1;

59             // using j < end, not j <= end

60             for (int k = targetOffset + 1; j < end

61                     && source[j] == target[k]; j++, k++)

62                 ;

63             if (j == end) {

64                 /* Found whole string. */

65                 return i - sourceOffset;

66             }

67             // if match fails, i++ and loop again, there are to iterators

68             // for two loops. i and j.

69         }

70     }

71     return -1;

72 }

这是Java String中的搜索算法，对于原字符串使用了两个指针来进行搜索。但是实质上来讲，这个算法还是有回溯的，可以看出来，每次搜索的时候，j都会搜索到一个大于i的位置，而如果搜索失败，则下次搜索将是从i++开始，这就是回溯了。

KMP的优势就是没有回溯，这对于只能够使用一个指针进行搜索的情况下，不仅仅有效率上的优势，实现起来也更自然。当然对于数组来说，使用俩指针并没有什么不便，如果是对于文件或者输入流进行搜索，那回溯起来就会很麻烦了。下面是KMP搜索。

KMP算法的核心就是不回溯原字符串指针，这点其实不难做到，重要的是要想到这一点——对于回溯的字符，其实都是已知的。解释一下就是，比如在"abcdefg"中搜索"abcdeg"，前五个字符"abcdeg"都是匹配的，第六个字符f和g不匹配，这时候，对于上面的搜索算法，i将会+1，整个匹配重新开始一次，这就是回溯了。但是仔细想一下，回溯其实完全可以避免的，因为如果知道是在第六个字符不匹配，那就说明前五个字符都是匹配的，从而说明"知道回溯之后的字符是什么"，对于这个例子来说，我们肯定知道源字符串前面五个字符是"abcde"。这是KMP搜索的根基。

好，下面让我们抛开源字符串吧！我们只关心目标字符串，也就是"abcdeg"。下面我们来设想，如果在搜索中发现源字符串的【n】字符和目标字符串的【m】字符匹配失败，那说明什么呢？说明之前的字符都是匹配的，否则也不会走到这里。也就是源字符串的【n-m】到【n-1】这m个字符与目标字符串的【0】到【m-1】这m个字符匹配。既然已经在搜索之前知道这个相等关系，那何苦在搜索的时候一次又一次的回溯呢？这个本来就是可以预测的，是搞一次就得的事情。因为源字符串的【n-m】到【n-1】是已知的。所以不用每次都死板的回溯到源字符串的n-m+1。

举例来说，对于在"abababc"中搜索"ababc"，第一次不匹配的情况如下

view sourceprint?1 0 1 2 3 4 5 6

2 a b a b a b c

3 a b a b c

4         ^

这时候，如果把指针回溯到源字符串的1位置，其实没有意义的，因为它是b，和目标字符串的a不匹配。而且，我们其实已经知道源字符串0到3这四个字符的值是跟目标字符串的四个字符一样的，都是abab。KMP的思想就是，充分利用这个已知条件，"源字符串不回溯，尽量让目标字符串少回溯，然后继续进行搜索"。那应该让目标字符串回溯到什么地方呢？这就看已经匹配的字符串的内容了。

使用S代表源字符串，T代表目标字符串，S[n]和T[m]失配（注意，因为失配了，这时候S[n]是什么是不知道的）。对于源字符串已知的只有S[n-m+1]到S[n-1]这m-1个字符。假设能够找到这样一个k，使得S[n-k]...S[n-1]=T[0]....T[k-1] (0<k<m)，那么就只需要保持S不回溯，让T回溯到K，然后继续匹配就好了。而如果能够找到一个最大的K值，那么效率则是最高的。

对于上面的例子，k的值是2，KMP搜索的下一个状态是：

view sourceprint?1 0 1 2 3 4 5 6

2 a b a b a b c

3     a b a b c

4         ^

然后继续匹配就成功啦。

所以，KMP算法的核心是，如何为目标字符串的每个位置的找到一个k值，组成一个数组F，好在每次匹配到目标字符串的m失配的时候，将目标字符串回溯到F[m]，然后继续进行匹配。找到这个数组之后，KMP搜索就算是完成80%了。

下面是构建这个数组F的方法。

这时候目标字符串身兼源字符串和目标字符串两个角色。构建数组T可以说是一个步进的过程，需要用到之前的结果。首先是F[0]，F[0]的意思是第一个字符就不匹配，也就是说对源字符串一无所知，这时候没得搞了，直接要源字符串向前挪动一个。在F里，我们使用-1来标记第一个字符就匹配失败的情况。也就是F[0]=-1。F[1]其实肯定是0。我们真正需要计算的是从F[2]到最后的。下面是>=2的时候的计算方法。注意，F[i]代表S的第i个字符匹配"失败"的时候，T需要回溯到的索引的值。如何求F[i]的值呢？首先取得F[i-1]的值，然后看S[i-1]是否=T[F[i-1]]，如果等于，那么F[i]=F[i-1]+1。这个原理是递归的。F[i-1]的值是在i-1失配的时候，T索引回溯到的值，如果这时候，这个值与S[i-1]相等，那就说明F[i]可以在F[i-1]的基础上增加1了。否则继续检查S[i-1]是否等于T[[F[i-1]]]，直到没有的搜索了，就是0。下面是具体的代码：

view sourceprint?01 /**

02      * each value of array rollback means: when source[i] mismatch pattern[i],

03      * KMP will restart match process form rollback[j] of pattern with

04      * source[i]. And if rollback[i] == -1, it means the current source[i] will

05      * never match pattern. then i should be added by 1 and j should be set to

06      * 0, which means restart match process from source[i+1] with pattern from

07      * pattern[0].

08      *

09      * @param pattern

10      * @return

11      */

12     private static int[] getRollbackArray(char[] pattern) {

13         int[] rollback = new int[pattern.length];

14         for (int i = 0; i < pattern.length; i++) {

15             rollback[i] = 0;

16         }

17         rollback[0] = -1;

18         for (int i = 1; i < rollback.length; i++) {

19             char prevChar = pattern[i - 1];

20             int prevRollback = i - 1;

21             while (prevRollback >= 0) {

22                 int previousRollBackIdx = rollback[prevRollback];

23                 if ((previousRollBackIdx == -1)

24                         || (prevChar == pattern[previousRollBackIdx])) {

25                     rollback[i] = previousRollBackIdx + 1;

26                     break;

27                 } else {

28                     prevRollback = rollback[prevRollback];

29                 }

30             }

31         }

32         return rollback;

33     }

上面并没有吧F[1]=1写成固定的，不过根据计算，F[1]始终是=0的。有了这个rollback数组，KMP搜索就是水到渠成了：

view sourceprint?01 /**

02      * search pattern chars in source chars.

03      *

04      * @param source

05      * @param pattern

06      * @return

07      */

08     public static int searchKMP(char[] source, char[] pattern) {

09         // validation

10         if (source == null || source.length == 0 || pattern == null

11                 || pattern.length == 0) {

12             return -1;

13         }

14

15         // get the rollback array.

16         int[] rollback = getRollbackArray(pattern);

17

18         // incremental index of pattern. pointing the char to compare with.

19         int currMatch = 0;

20         int len = pattern.length;

21         // i point the char to compare with

22         for (int i = 0; i < source.length;) {

23             // if current char match

24             if ((currMatch == -1) || (source[i] == pattern[currMatch])) {

25                 /*

26                  * then each of the indexes adding by one, moving to the next

27                  * char for comparation. notice that if currMatch is -1, it

28                  * means the first char in pattern can not be matched. so i add

29                  * by one to move on. and currMatch add by one so its value is

30                  * 0.

31                  */

32                 i++;

33                 currMatch++;

34                 /*

35                  * if reaches the end of pattern, then match success, return the

36                  * index of first matched char.

37                  */

38                 if (currMatch == len) {

39                     return i - len;

40                 }

41             } else {

42                 /*

43                  * if current char mismatch, then rollback the next char to

44                  * compare in pattern.

45                  */

46                 currMatch = rollback[currMatch];

47             }

48         }

49         return -1;

50     }

下面是几个测试方法：

view sourceprint?01 @Test

02 public void testRollBackArray() {

03     int[] expectedRollback = new int[] { -1, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0,

04             0, 0, 0, 0, 0, 1, 2, 3, 0, 0, 0, 0, 0 };

05     int[] rollback = getRollbackArray("PARTICIPATE IN PARACHUTE"

06             .toCharArray());

07     Assert.assertArrayEquals("Rollback array compare failed to match!",

08             expectedRollback, rollback);

09 }

10 @Test

11 public void testKMPSearchMatch() {

12     int matchIndex = searchKMP(

13             "aaaaaababacbaslierjalsdzmflkasjf".toCharArray(),

14             "ababacb".toCharArray());

15     Assert.assertEquals(5, matchIndex);

16     matchIndex = searchKMP(

17             "aaaaaababacbaslierjalsdzmflkasjf".toCharArray(),

18             "aaaaaababacbaslierjalsdzmflkasjf".toCharArray());

19     Assert.assertEquals(0, matchIndex);

20 }

21 @Test

22 public void testKMPSearchNoMatch() {

23     int matchIndex = searchKMP("ABCABCDABABCDABCDABDE".toCharArray(),

24             "hjABCDABD".toCharArray());

25     Assert.assertEquals(-1, matchIndex);

26 }

把这三段代码放在一个类里，KMP搜索就算是完事儿了。

在自己看KMP算法之前，很多文章都说神马KMP有代价，只适合目标字符串很长很长，搜索字符串也很长很长的case。但是就我看下来，KMP对于日常一般的搜索也是有优势的。首先，构建rollback数组计算并不复杂，当然需要一个额外的数组空间。但是对于匹配来说，还是有很大的加速优势的，而且目标字符串不需要回溯。所以KMP唯一的代价就是需要一个额外的数组，实际占用的内存应该是目标字符串的两倍（String是char的数组，char=short，int是char的两倍）。难道，真的是为了节省内存所以不采用KMP搜索？

分享到：

求解最大公约数问题

2011-08-25 19:29
浏览 697
评论(0)
分类:编程语言
查看更多

发表评论

您还没有登录,请您登录后再发表评论

最近访客更多访客>>

博主相关

文章分类

社区版块

存档分类

最新评论

KMP快速字符串查找算法

评论

发表评论

相关推荐

最近访客 更多访客>>

博主相关

文章分类

社区版块

存档分类

最新评论

KMP快速字符串查找算法

评论

发表评论

相关推荐

求解最大公约数问题

堆排序(Heap Sort)算法学习

整数拆分问题的动态规划解法

背包问题介绍与分析

求平方根sqrt()函数的底层算法效率问题

面试中常见的一些算法问题

各种排序算法的C++实现与性能比较

背包问题之硬币找零问题

求能被1到20的数整除的最小正整数

买书折扣最优惠问题解法

二叉树中的最近公共祖先问题

判断一个整数是否是2的N次方

字符串逆序的算法汇总

计算从1到N中1的出现次数

KMP快速字符串查找算法

快速排序的递归实现

数字1亿里面有多少个1呢

最大子序列、最长公共子串、最长公共子序列

一道关于男女比例的面试题

最近访客更多访客>>