
Suffix Tree

 

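A Suffix Tree is a data-structure that allows many problems on strings (sequences of characters) to be solved quickly. If txt = t_1 t_2 ... t_i ... t_n is a string, then T_i = t_i t_{i+1} ... t_n is the suffix of txt that starts at position i, e.g.

txt = mississippi
T1  = mississippi
T2  = ississippi
T3  = ssissippi
T4  = sissippi
T5  = issippi
T6  = ssippi
T7  = sippi
T8  = ippi
T9  = ppi
T10 = pi
T11 = i
T12 = (empty)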

The suffix tree for `txt' is a Trie-like or PATRICIA-like data structure that represents the suffixes of txt.

A given suffix tree can be used to search for a substring, pat[1..m] in O(m) time. There are n(n+1)/2 substrings in txt[1..n] so it is rather surprising that a suffix tree can be built in O(n) time. Adding just one character to txt increases the number of substrings by n+1, but they are not independent. Weiner (1973) gave the first algorithm and McCreight (1976) gave a more readable account for constructing the suffix tree while processing txt from right to left. Only much later did Ukkonen (1992, 1995) give a left-to-right on-line algorithm, i.e., an algorithm that maintains a suffix tree for txt[1..i] at each step as i is increased from 1 to n.

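If the non-empty suffixes are sorted:

T11 = i
T8  = ippi
T5  = issippi
T2  = ississippi
T1  = mississippi
T10 = pi
T9  = ppi
T7  = sippi
T4  = sissippi
T6  = ssippi
T3  = ssissippi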

it becomes obvious that some of them (may) share common prefixes. Here there are substrings starting with `i', `m', `p' and `s', but all of those starting `is' in fact start `issi'. Suffixes that have a common prefix share a common path from the root of the suffix tree (as in a PATRICIA tree). Now, if a search (sub)string pat occurs in txt, it must be a prefix of some suffix of txt.

 

           tree                       substrings

tree-->|---mississippi                m .. mississippi
       |
       |---i-->|---ssi-->|---ssippi   i .. ississippi
       |       |         |
       |       |         |---ppi      issip,issipp,issippi
       |       |
       |       |---ppi                ip, ipp, ippi
       |
       |---s-->|---si-->|---ssippi    s .. ssissippi
       |       |        |
       |       |        |---ppi       ssip, ssipp, ssippi
       |       |
       |       |---i-->|---ssippi     si .. sissippi
       |               |
       |               |---ppi        sip, sipp, sippi
       |
       |---p-->|---pi                 p, pp, ppi
               |
               |---i                  p, pi
--- Suffix Tree for "mississippi" ---

Each edge (arc) of the suffix tree is labelled with a substring of txt, which is implemented by pointers to the start and end of the substring, e.g. `ssi' by <3,5>. One of the observations in Ukkonen's algorithm is that an edge, <i,n>, leading to a leaf can be implemented by <i,∞> where `∞', i.e., infinity, means `to the end of the string'.
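The index-pair labelling can be illustrated with a short sketch. The helper name edgeLabel is hypothetical (it is not part of the page's code), and it takes the 1-based, inclusive positions used in the text above, converting them for JavaScript's 0-based slice.

```javascript
// Hypothetical helper: recover an edge label from its <start,end> pair.
// Positions are 1-based and inclusive, as in the text; an end of
// Infinity stands for "to the end of the string".
function edgeLabel(txt, start, end) {
  // slice() is 0-based with an exclusive end, so <start,end>
  // becomes slice(start - 1, end); slice clamps Infinity to txt.length.
  return txt.slice(start - 1, end);
}

console.log(edgeLabel("mississippi", 3, 5));        // ssi
console.log(edgeLabel("mississippi", 9, Infinity)); // ppi

// An open edge <9,∞> grows by itself as the text grows:
console.log(edgeLabel("mississip", 9, Infinity));   // p
console.log(edgeLabel("mississipp", 9, Infinity));  // pp
```

Because the end of an open edge is never stored, nothing has to be updated on a leaf edge when another character is appended to the text.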


If the termination of txt is important, this can be indicated by a special terminating character, often denoted by `$' in papers on strings (compare the zero character that terminates strings in C/Unix).

Building a Suffix Tree, (a) Slowly

We show how a suffix tree might be built "by hand". Three dots, `...', are used to show the current end of any suffix that will grow as more characters are processed. Starting with the empty suffix tree, consider the string `m':

 
        tree 1
tree-->----m...

Adding the second character to get `mi' there are now suffixes `mi' and `i':

 
        tree 2
tree-->|---mi...
       |
       |---i...

Next `mis'

 
        tree 3
tree-->|---mis...
       |
       |---is...
       |
       |---s...

There is no need to add any more splits for `miss' because `s' is part of `ss'.

 
        tree 4
tree-->|---miss...
       |
       |---iss...
       |
       |---ss...

However, with `missi' there must be a split because one `s' is followed by `i', the other by `s'

 
        tree 5
tree-->|---missi...
       |
       |---issi...
       |
       |---s-->|---si...
               |
               |---i...

The 6th character, `s', brings us to `missis' and no split because both `i's are followed by `s's.

 
        tree 6
tree-->|---missis...
       |
       |---issis...
       |
       |---s-->|---sis...
               |
               |---is...

`mississ'

 
        tree 7
tree-->|---mississ...
       |
       |---ississ...
       |
       |---s-->|---siss...
               |
               |---iss...

`mississi'

 
        tree 8
tree-->|---mississi...
       |
       |---ississi...
       |
       |---s-->|---sissi...
               |
               |---issi...

A lot suddenly happens for `mississip', because it brings the first `p', and causes the third `i' to be followed by `p' where the other two are followed by `ssi'. Consequently one of the `ssi' is followed by `p', the other by `ssip', ditto `si'.

 
        tree 9
tree-->|---mississip...
       |
       |---i-->|---ssi-->|---ssip...
       |       |         |
       |       |         |---p...
       |       |
       |       |---p...
       |
       |---s-->|---si-->|---ssip...
       |       |        |
       |       |        |---p...
       |       |
       |       |---i-->|---ssip...
       |               |
       |               |---p...
       |
       |---p...

By comparison `mississipp' is very quiet

 
        tree 10
tree-->|---mississipp...
       |
       |---i-->|---ssi-->|---ssipp...
       |       |         |
       |       |         |---pp...
       |       |
       |       |---pp...
       |
       |---s-->|---si-->|---ssipp...
       |       |        |
       |       |        |---pp...
       |       |
       |       |---i-->|---ssipp...
       |               |
       |               |---pp...
       |
       |---pp...

`mississippi' is an anti-climax

 
        tree 11
tree-->|---mississippi
       |
       |---i-->|---ssi-->|---ssippi
       |       |         |
       |       |         |---ppi
       |       |
       |       |---ppi
       |
       |---s-->|---si-->|---ssippi
       |       |        |
       |       |        |---ppi
       |       |
       |       |---i-->|---ssippi
       |               |
       |               |---ppi
       |
       |---p-->|---pi
               |
               |---i

and we are done. The challenge, to a computer scientist, is to make sure tree_i is updated to tree_{i+1} efficiently. This can be done (Ukkonen 1992, 1995) so that tree_n can be built, starting from tree_0, in O(n)-time overall.
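For comparison, the same final tree can be reached by another slow route: insert the suffixes one at a time into a compressed trie, splitting an edge wherever a new suffix diverges from it. The sketch below does that. It is not Ukkonen's algorithm and takes O(n^2) time in the worst case. The names naiveSuffixTree and insert are hypothetical, and edges store their labels as strings (rather than index pairs) purely for readability.

```javascript
// Naive builder: insert every suffix of txt into a compressed trie.
// A node is { children: Map(firstChar -> { label, node }) }.
function naiveSuffixTree(txt) {
  const root = { children: new Map() };
  for (let i = 0; i < txt.length; i++) insert(root, txt.slice(i));
  return root;
}

function insert(node, suffix) {
  while (suffix.length > 0) {
    const edge = node.children.get(suffix[0]);
    if (!edge) {
      // No edge starts with this character: hang a new leaf here.
      node.children.set(suffix[0], { label: suffix, node: { children: new Map() } });
      return;
    }
    // n = length of the common prefix of the edge label and the suffix.
    let n = 0;
    while (n < edge.label.length && n < suffix.length && edge.label[n] === suffix[n]) n++;
    if (n === suffix.length) return;   // suffix is already present (implicitly)
    if (n < edge.label.length) {
      // The suffix diverges part-way along the edge: split the edge.
      const mid = { children: new Map() };
      mid.children.set(edge.label[n], { label: edge.label.slice(n), node: edge.node });
      edge.label = edge.label.slice(0, n);
      edge.node = mid;
    }
    node = edge.node;
    suffix = suffix.slice(n);
  }
}

const tree = naiveSuffixTree("mississippi");
// Labels of the edges leaving the root, in insertion order.
console.log([...tree.children.values()].map((e) => e.label));
```

For `mississippi' the edges leaving the root are labelled `mississippi', `i', `s' and `p', as in tree 11.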

(b) Faster

The following terminology is adapted from Ukkonen (1995).

  • If `x' is a substring of txt then `x' represents the state (i.e., location) in the suffix-tree found by tracing out the characters of x from the root. Note that x might be part-way along an edge of the tree.
  • A vertex (node) of the suffix-tree is called an explicit state.
  • A substring x=txt[L..R] can be represented by (L,R).
  • If `v' is a vertex of the suffix-tree, the pair `(v,x)', equivalently (v,(L,R)), represents the state (location) in the suffix-tree found by tracing out the characters of x from v.
  • (v,x) is canonical if v is the last explicit state on the path from v to (v,x). NB. (v,empty) is canonical.
  • A special vertex called `bottom' is added and is denoted _|_.
The transition function, g( ), is defined as follows:
  g(_|_, a) = root, for all characters `a'.
  g(x, a) = y where y=xa, for character `a' (provided xa is a substring of txt).
The suffix function, f( ), on states is defined as follows:
  f(root) = _|_
  f(x) = y, if x~=empty and x=ay
The suffix function f'( ), restricted to explicit states (vertices), is defined as follows:
  f'(root) = _|_.
  If vertex v=x where x~=empty then f'(v)=y where x=ay for some character `a' and substring y (possibly empty).
The boundary path s_1, s_2, ..., s_i, s_{i+1} of suffix-tree_{i-1}:
  s_1 = (1,i-1), i.e., the state corresponding to txt[1..i-1]
  s_2 = (2,i-1)
  ...
  s_i = root
  s_{i+1} = _|_
The active point is the first s_j on the boundary path that is not a leaf, and
the end-point is the first s_j' that has a txt[i]-transition.

When tree_{i-1} is expanded into tree_i, character txt[i] must be dealt with. This is done during a traversal of the boundary path. Any state on the boundary path before s_j is a leaf and could be extended by adding txt[i] to the incoming arc, but this can be done for free by representing arcs to leaves by (L,∞) where `∞' is `infinity'. So it is only necessary to examine states from the active point s_j up to, but not including, the end-point s_j'.

"[states from s_j and before s_j'] create entirely new branches that start from states s_h, j<=h<j'. ... They are found along the boundary path of [tree_{i-1}] using reference pairs and suffix links." - Ukkonen (1995).

 

// JavaScript; a pair (a, b) is represented by the array [a, b].
// Txt (the text) and root are globals.

function upDate(s, k, i)
// (s, (k, i-1)) is the canonical reference pair for the active point
 { let oldr = root;
   let [endPoint, r] = test_and_split(s, k, i-1, Txt.charAt(i));

   while (!endPoint)
    { r.addTransition(i, Infinity, new State());  // new open leaf edge (i,∞)
      if (oldr !== root) oldr.sLink = r; // build suffix-link active-path

      oldr = r;
      [s, k] = canonize(s.sLink, k, i-1);         // next state on the boundary path
      [endPoint, r] = test_and_split(s, k, i-1, Txt.charAt(i));
    }

   if (oldr !== root) oldr.sLink = s;

   return [s, k];   // reference pair for the end-point
 }//upDate

Note that r.addTransition(...) adds an edge from state r, labelling the edge with a substring. New txt[i]-transitions must be "open" transitions of the form (L,∞).
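The State type used by these routines is not shown on this page. The following is a minimal sketch of what it might look like, assuming that transitions are stored on the state object itself, keyed by the first character of the edge label (as the lookups s[Txt.charAt(k)] suggest), and that a stored transition is the nested pair [[k, p], target]. These representation choices are assumptions, not the page author's implementation.

```javascript
// Globals used by the routines on this page.
let Txt = "mississippi";   // the text being indexed (0-based)
let root = null;           // set by the controlling routine

// Hypothetical minimal State: an explicit state (vertex) of the tree.
class State {
  constructor() {
    this.sLink = null;     // suffix link, filled in during construction
  }

  // Add (or replace) the edge labelled Txt[left..right] leading to target.
  // The edge is stored under its first character; right may be Infinity
  // for an "open" edge to a leaf.
  addTransition(left, right, target) {
    this[Txt.charAt(left)] = [[left, right], target];
  }
}

// Usage: an edge labelled Txt[2..4] = "ssi" from a to b.
const a = new State();
const b = new State();
a.addTransition(2, 4, b);
console.log(a["s"][0]);    // [ 2, 4 ]
```

Keying edges by single characters cannot clash with the property names sLink or addTransition, because those are longer than one character. A Map would be a more defensive choice in production code.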

 

Where necessary, test_and_split(...) replaces edges s--->s1 with s--->r--->s1 for a new node r. This makes r=(s,(k,p)) explicit.

function test_and_split(s, k, p, t)
// returns [isEndPoint, state]; a pair (a, b) is the array [a, b]
 { if (k <= p)
    { // find the t_k transition g'(s,(k',p'))=s' from s
      // k1 is k'  p1 is p' in Ukkonen '95
      const [[k1, p1], s1] = s[Txt.charAt(k)];

      if (t === Txt.charAt(k1 + p - k + 1))
         return [true, s];                  // t already follows (s,(k,p))
      else
       { const r = new State();
         s.addTransition(k1, k1+p-k,   r);     // s---->r---->s1
         r.addTransition(    k1+p-k+1, p1, s1);
         return [false, r];
       }
    }
   else // k > p;  ? is there a t-transition from s ?
      return [s[t] != null, s];
 }//test_and_split

Canonize(...) takes (s,w)=(s,(k,p)) and steps over intermediate nodes by spelling out the characters of w=txt[k..p] for as far as possible.

function canonize(s, k, p)    // s--->...
// returns [s, k]; a pair (a, b) is the array [a, b]
 { if (p < k) return [s, k];

   // find the t_k transition g'(s,(k',p'))=s' from s
   // k1 is k',  p1 is p' in Ukk' '95
   let [[k1, p1], s1] = s[Txt.charAt(k)];    // s--(k1,p1)-->s1

   while (p1 - k1 <= p - k)                  // s--(k1,p1)-->s1--->...
    { k += p1 - k1 + 1;  // remove |(k1,p1)| chars from front of (k,p)
      s = s1;
      if (k <= p)
       { [[k1, p1], s1] = s[Txt.charAt(k)];  // s--(k1,p1)-->s1
       }
    }
   return [s, k];
 }//canonize

The main controlling routine repeatedly takes the next character, updates the sites on the active path and finds and canonizes the new active point:

function ukkonen95()// construct suffix tree for Txt[0..N-1]
 { let s, k, i;

   root = new State();                          // root is a global
   const bt = new State();                      // bt (bottom or _|_)

   // Create a transition from bt to root for every character
   // that occurs in Txt
   for (i = 0; i < Txt.length; i++)
      bt.addTransition(i, i, root);

   root.sLink = bt;
   s = root; k = 0;  // NB. k=0, unlike Ukkonen our strings are 0 based

   for (i = 0; i < Txt.length; i++)
    { [s, k] = upDate(s, k, i);   // follow path from active-point
      [s, k] = canonize(s, k, i);
    }
 }//ukkonen95

It relies upon the fact (lemma 2 Ukkonen (1995)) that if (s,(k,i-1)) is a reference pair for the end-point, s_j', of tree_{i-1} then (s,(k,i)) is a reference pair for the active point of tree_i.

Suffix Tree Applications

Suffix Trees can be used to solve a large number of string problems that occur in text-editing, free-text search, computational biology, and other application areas. Some examples are given below.

String Search

Searching for a substring, pat[1..m], in txt[1..n], can be solved in O(m) time (after the suffix tree for txt has been built in O(n) time).
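The search itself is just a walk from the root, one character of pat at a time. The sketch below shows the walk on an uncompressed suffix trie, built naively in O(n^2) time, as a stand-in for a real suffix tree. The names buildSuffixTrie and contains are hypothetical. Only the lookup illustrates the O(m) claim; the construction here is deliberately the simple, slow one.

```javascript
// Stand-in for a suffix tree: an uncompressed suffix trie,
// one Map of children per node. Naive O(n^2) construction.
function buildSuffixTrie(txt) {
  const root = new Map();
  for (let i = 0; i < txt.length; i++) {
    let node = root;
    for (let j = i; j < txt.length; j++) {
      if (!node.has(txt[j])) node.set(txt[j], new Map());
      node = node.get(txt[j]);
    }
  }
  return root;
}

// pat occurs in txt iff pat can be traced out from the root,
// i.e. iff pat is a prefix of some suffix. Takes O(m) steps.
function contains(trie, pat) {
  let node = trie;
  for (let j = 0; j < pat.length; j++) {
    node = node.get(pat[j]);
    if (node === undefined) return false;
  }
  return true;
}

const trie = buildSuffixTrie("mississippi");
console.log(contains(trie, "ssip"));   // true
console.log(contains(trie, "ssipi"));  // false
```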

Longest Repeated Substring

Add a special ``end of string'' character, e.g. `$', to txt[1..n] and build a suffix tree; the longest repeated substring of txt[1..n] is indicated by the deepest fork node in the suffix tree, where depth is measured by the number of characters traversed from the root, i.e., `issi' in the case of `mississippi'. The longest repeated substring can be found in O(n) time using a suffix tree.
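The `deepest fork node' idea can be tried out on a small scale. The sketch below uses a naively built, uncompressed suffix trie of txt$ in place of a suffix tree, so it runs in O(n^2) time rather than O(n). The function name longestRepeatedSubstring is hypothetical, and the code assumes that `$' does not occur in txt.

```javascript
// Longest repeated substring via the deepest fork node.
// Assumption: "$" does not occur in txt.
function longestRepeatedSubstring(txt) {
  const text = txt + "$";
  const root = new Map();
  // Naive suffix trie of text: one Map of children per node.
  for (let i = 0; i < text.length; i++) {
    let node = root;
    for (let j = i; j < text.length; j++) {
      if (!node.has(text[j])) node.set(text[j], new Map());
      node = node.get(text[j]);
    }
  }
  // Depth-first search for the deepest node with two or more children.
  let best = "";
  const stack = [[root, ""]];
  while (stack.length > 0) {
    const [node, path] = stack.pop();
    if (node.size >= 2 && path.length > best.length) best = path;
    for (const [ch, child] of node) stack.push([child, path + ch]);
  }
  return best;
}

console.log(longestRepeatedSubstring("mississippi")); // issi
```

A fork node spells out a string that is followed by at least two different characters (counting `$'), so that string occurs at least twice in the text.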

Longest Common Substring

The longest common substring of two strings, txt1 and txt2, can be found by building a generalized suffix tree for txt1 and txt2: Each node is marked to indicate if it represents a suffix of txt1 or txt2 or both. The deepest node marked for both txt1 and txt2 represents the longest common substring.

Equivalently, one can build a (basic) suffix tree for the string txt1$txt2#, where `$' is a special terminator for txt1 and `#' is a special terminator for txt2. The longest common substring is indicated by the deepest fork node that has both `...$...' and `...#...' (no $) beneath it.

Note that the `longest common substring problem' is different from the `longest common subsequence problem', which is closely related to the `edit-distance problem': An instance of a subsequence can have gaps where it appears in txt1 and in txt2, but an instance of a substring cannot have gaps.
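The node-marking approach to the longest common substring can be sketched on a small scale as follows. As a simplification, the code uses a naively built, uncompressed generalized suffix trie (O(n^2) time and space) rather than a generalized suffix tree. The function name longestCommonSubstring and the bit-mask marking are assumptions made for illustration.

```javascript
// Longest common substring via a marked, generalized suffix trie.
// mask bit 1: node lies on a suffix of txt1; bit 2: on a suffix of txt2.
function longestCommonSubstring(txt1, txt2) {
  const root = { next: new Map(), mask: 0 };
  const addSuffixes = (txt, bit) => {
    for (let i = 0; i < txt.length; i++) {
      let node = root;
      for (let j = i; j < txt.length; j++) {
        if (!node.next.has(txt[j])) node.next.set(txt[j], { next: new Map(), mask: 0 });
        node = node.next.get(txt[j]);
        node.mask |= bit;
      }
    }
  };
  addSuffixes(txt1, 1);
  addSuffixes(txt2, 2);

  // The deepest node marked for both strings (mask 3) is the answer.
  let best = "";
  const stack = [[root, ""]];
  while (stack.length > 0) {
    const [node, path] = stack.pop();
    if (path.length > best.length) best = path;
    for (const [ch, child] of node.next) {
      if (child.mask === 3) stack.push([child, path + ch]);
    }
  }
  return best;
}

console.log(longestCommonSubstring("mississippi", "missouri")); // miss
```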

Palindromes

A palindrome is a string, P, such that P=reverse(P), e.g. `abba'=reverse(`abba'). For example, `ississi' is the longest palindrome in `mississippi'.

The longest palindrome of txt[1..n] can be found in O(n) time, e.g. by building the suffix tree for txt$reverse(txt)# or by building the generalized suffix tree for txt and reverse(txt).
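For checking small examples by hand, the longest palindrome can also be found without any suffix tree by expanding around every possible centre. This is a different and slower technique, O(n^2) in the worst case, and is offered only as a way to confirm answers such as `ississi'. The function name longestPalindrome is hypothetical.

```javascript
// Longest palindromic substring by expanding around each centre.
// Not the suffix-tree method: O(n^2) worst case.
function longestPalindrome(txt) {
  let best = "";
  for (let c = 0; c < txt.length; c++) {
    // w = 0: odd length, centred on c.
    // w = 1: even length, centred between c and c+1.
    for (const w of [0, 1]) {
      let lo = c;
      let hi = c + w;
      while (lo >= 0 && hi < txt.length && txt[lo] === txt[hi]) {
        lo--;
        hi++;
      }
      // The loop overshoots by one on each side.
      const candidate = txt.slice(lo + 1, hi);
      if (candidate.length > best.length) best = candidate;
    }
  }
  return best;
}

console.log(longestPalindrome("mississippi")); // ississi
```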

 

From:

http://www.allisons.org/ll/AlgDS/Tree/Suffix/

http://decomplexify.blogspot.com/2014/07/suffix-tree_19.html
