Reposted from the double-array trie algorithm article: <http://linux.thai.net/~thep/datrie/datrie.html>
An Implementation of Double-Array Trie
Contents
- What is Trie?
- What Does It Take to Implement a Trie?
- Triple-Array Trie
- Double-Array Trie
- Suffix Compression
- Key Insertion
- Key Deletion
- Double-Array Pool Allocation
- An Implementation
- Download
- Other Implementations
- References
What is Trie?
A trie is a kind of digital search tree. (See [Knuth1972] for the details of digital search trees.) [Fredkin1960] introduced the term trie, which is abbreviated from "Retrieval".
A trie is an efficient indexing method. It is also a kind of deterministic finite automaton (DFA) (see [Cohen1990], for example, for the definition of a DFA). Within the tree structure, each node corresponds to a DFA state, and each (directed) labeled edge from a parent node to a child node corresponds to a DFA transition. The traversal starts at the root node. Then, from head to tail, the characters of the key string are taken one by one to determine the next state. The edge labeled with the same character is chosen to walk. Notice that each step of such walking consumes one character from the key and descends one step down the tree. If the key is exhausted and a leaf node is reached, then we arrive at the exit for that key. If we get stuck at some node, either because there is no branch labeled with the current character or because the key is exhausted at an internal node, then the key is not recognized by the trie.
Notice that the time needed to traverse from the root to the leaf is not dependent on the size of the database, but is proportional to the length of the key. Therefore, it is usually much faster than B-tree or any comparison-based indexing method in general cases. Its time complexity is comparable with hashing techniques.
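The walk described above can be sketched in a few lines of Python. This is only an illustration with nested dictionaries, not the double-array layout discussed later; the "#" end marker and the function names are assumptions made for the example.

```python
END = "#"  # assumed end-of-key marker, so that every key ends at a leaf


def insert(root, key):
    """Add key to a trie stored as nested dicts (edge label -> child)."""
    node = root
    for ch in key + END:
        node = node.setdefault(ch, {})


def contains(root, key):
    """Walk one edge per character; the cost depends only on len(key)."""
    node = root
    for ch in key + END:
        if ch not in node:   # stuck: no branch labeled with this character
            return False
        node = node[ch]
    return True              # key exhausted at a leaf


root = {}
for word in ("pool", "prize", "preview"):
    insert(root, word)
```

The loop in contains runs once per character of the key, however many keys the trie holds.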
In addition to efficiency, a trie also provides flexibility in searching for the closest path when the key is misspelled. For example, by skipping a certain character in the key while walking, we can fix an insertion typo. By walking toward all the immediate children of one node without consuming a character from the key, we can fix a deletion typo, or even a substitution typo if we just drop the key character that has no branch to follow and descend to all the immediate children of the current node.
What Does It Take to Implement a Trie?
In general, a DFA is represented with a transition table, in which the rows correspond to the states, and the columns correspond to the transition labels. The data kept in each cell is then the next state to go for a given state when the input is equal to the label.
This is an efficient method for the traversal, because every transition can be calculated by two-dimensional array indexing. However, in terms of space usage, this is rather extravagant, because, in the case of a trie, most nodes have only a few branches, leaving the majority of the table cells blank.
Meanwhile, a more compact scheme is to use a linked list to store the transitions out of each state. But this results in slower access, due to the linear search.
Hence, table compression techniques which still allow fast access have been devised to solve the problem.
- [Johnson1975] (also explained in [Aho+1985] pp. 144-146) represented a DFA with four arrays, which can be simplified to three in the case of a trie. The transition table rows are allocated in an overlapping manner, allowing the free cells to be used by other rows.
- [Aoe1989] proposed an improvement on the three-array structure by reducing the arrays to two.
Triple-Array Trie
As explained in [Aho+1985] pp. 144-146, a DFA compression could be done using four linear arrays, namely default, base, next, and check. However, in a case simpler than the lexical analyzer, such as the mere trie for information retrieval, the default array could be omitted. Thus, a trie can be implemented using three arrays according to this scheme.
Structure
The triple-array structure is composed of:
- base. Each element in base corresponds to a node of the trie. For a trie node s, base[s] is the starting index within the next and check pool (to be explained later) for the row of the node s in the transition table.
- next. This array, in coordination with check, provides a pool for the allocation of the sparse vectors for the rows in the trie transition table. The vector data, that is, the vector of transitions from every node, would be stored in this array.
- check. This array works in parallel to next. It marks the owner of every cell in next. This allows the cells next to one another to be allocated to different trie nodes. That means the sparse vectors of transitions from more than one node are allowed to be overlapped.
Definition 1. For a transition from state s to t which takes character c as the input, the condition maintained in the triple-array trie is:
check[base[s] + c] = s
next[base[s] + c] = t
Walking
According to definition 1, the walking algorithm for a given state s and the input character c is:
t := base[s] + c;
if check[t] = s then
    next state := next[t]
else
    fail
endif
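A minimal Python sketch of this walking step follows. The tiny arrays are hand-built for a hypothetical trie with character codes a = 1 and b = 2, and the NONE marker for unowned cells is an assumption of the example.

```python
NONE = -1  # assumed marker for a cell that no state owns

# Hypothetical trie: 0 -a-> 1, 0 -b-> 2, 1 -b-> 3 (codes a=1, b=2).
base = [0, 1, 0, 0]        # base[s]: where the row of state s starts in the pool
nxt = [NONE, 1, 2, 3]      # next[]: target state stored in each pool cell
check = [NONE, 0, 0, 1]    # check[]: owner state of each pool cell


def step(s, c):
    """One transition according to Definition 1, or None on failure."""
    t = base[s] + c
    if 0 <= t < len(check) and check[t] == s:
        return nxt[t]
    return None


def walk(codes, s=0):
    """Follow a sequence of character codes from state s."""
    for c in codes:
        s = step(s, c)
        if s is None:
            return None
    return s
```

Note how the row of state 0 occupies cells 1 and 2 while the row of state 1 starts at cell 1 and uses only cell 3: the rows overlap, and check tells them apart.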
Construction
To insert a transition that takes character c to traverse from a state s to another state t, the cell next[base[s] + c] must be made available. If it is already vacant, we are lucky. Otherwise, either the entire transition vector for the current owner of the cell or that of the state s itself must be relocated. The estimated cost for each case could determine which one to move. After finding the free slots to place the vector, the transition vector must be recalculated. Assuming the new place begins at b, the relocation copies every transition of s from cell base[s] + c to cell b + c, marks the new cell as owned by s, frees the old cell, and finally sets base[s] to b.
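A minimal Python sketch of such a relocation for the triple-array layout, derived from Definition 1. The NONE marker, the function name and the sample arrays are assumptions of the example, and the target cells are assumed to be free already.

```python
NONE = -1  # assumed marker for a free pool cell


def relocate(base, nxt, check, s, b, alphabet):
    """Move the transition row of state s so that it starts at b."""
    old = base[s]
    for c in alphabet:
        src, dst = old + c, b + c
        if 0 <= src < len(check) and check[src] == s:
            assert check[dst] == NONE, "target cell must be free"
            check[dst], nxt[dst] = s, nxt[src]     # mark owner, copy data
            check[src], nxt[src] = NONE, NONE      # free the old cell
    base[s] = b


# Hypothetical trie: 0 -a-> 1, 0 -b-> 2, 1 -b-> 3 (codes a=1, b=2).
base = [0, 1, 0, 0]
nxt = [NONE, 1, 2, 3, NONE, NONE, NONE, NONE]
check = [NONE, 0, 0, 1, NONE, NONE, NONE, NONE]
relocate(base, nxt, check, s=0, b=4, alphabet=(1, 2))
```

Only the row of the moved state changes; the state numbers stored in next stay valid because states are referenced indirectly through base.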
Double-Array Trie
The triple-array structure for implementing a trie appears to be well defined, but is still not practical to keep in a single file. The next/check pool can be kept in a single array of integer pairs, but the base array does not grow in parallel with the pool, and is therefore usually stored separately.
To solve this problem, [Aoe1989] reduced the structure into two parallel arrays. In the double-array structure, the base and next are merged, resulting in only two parallel arrays, namely, base and check.
Structure
Instead of indirectly referencing through state numbers as in the triple-array trie, nodes in the double-array trie are linked directly within the base/check pool.
Definition 2. For a transition from state s to t which takes character c as the input, the condition maintained in the double-array trie is:
check[base[s] + c] = s
base[s] + c = t
Walking
According to definition 2, the walking algorithm for a given state s and the input character c is:
t := base[s] + c;
if check[t] = s then
    next state := t
else
    fail
endif
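The same step in Python, as a small sketch over hand-built arrays. The character codes (a = 1, b = 2), the root at cell 1 and the NONE marker are assumptions of the example.

```python
NONE = -1  # assumed marker for a cell that no node owns
ROOT = 1   # assumed position of the root node

# Hypothetical trie: 1 -a-> 2, 1 -b-> 3, 2 -b-> 4 (codes a=1, b=2).
base = [0, 1, 2, 0, 0]
check = [NONE, NONE, 1, 1, 2]


def step(s, c):
    """One transition according to Definition 2, or None on failure."""
    t = base[s] + c
    if 0 <= t < len(check) and check[t] == s:
        return t          # the cell index itself is the next state
    return None


def walk(codes, s=ROOT):
    """Follow a sequence of character codes from node s."""
    for c in codes:
        s = step(s, c)
        if s is None:
            return None
    return s
```

Compared with the triple-array version there is no next lookup: the cell index is the node.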
Construction
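The construction of a double-array trie is in principle the same as that of a triple-array trie. The difference is in the base relocation: since a child node is identified by its cell index, moving a child from base[s] + c to b + c also requires copying its base value and updating the check of each of its own children to point to the new cell.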
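A minimal Python sketch of relocation in the double-array layout, derived from Definition 2. The NONE marker, the function name and the sample arrays are assumptions of the example, and the target cells are assumed to be free.

```python
NONE = -1  # assumed marker for a free cell


def relocate(base, check, s, b, alphabet):
    """Move the children of node s so that they sit at b + c."""
    old = base[s]
    for c in alphabet:
        src, dst = old + c, b + c
        if 0 <= src < len(check) and check[src] == s:
            assert check[dst] == NONE, "target cell must be free"
            check[dst] = s                 # mark owner
            base[dst] = base[src]          # copy data
            # The child moves from src to dst, so its own children
            # must now name dst as their parent.
            for d in alphabet:
                g = base[src] + d
                if 0 <= g < len(check) and check[g] == src:
                    check[g] = dst
            check[src], base[src] = NONE, 0   # free the old cell
    base[s] = b


# Hypothetical trie: 1 -a-> 2, 1 -b-> 3, 2 -b-> 4 (codes a=1, b=2).
base = [0, 1, 2, 0, 0, 0, 0, 0, 0, 0]
check = [NONE, NONE, 1, 1, 2, NONE, NONE, NONE, NONE, NONE]
relocate(base, check, s=1, b=5, alphabet=(1, 2))
```

After the call the children of node 1 live in cells 6 and 7, and the former grandchild in cell 4 has check 6 instead of 2.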
Suffix Compression
[Aoe1989] also suggested a storage compression strategy: non-branching suffixes are split off into a separate string store, called the tail, so that the remaining non-branching steps are reduced to a mere string comparison.
With the two separate data structures, double-array branches and suffix-spool tail, key insertion and deletion algorithms must be modified accordingly.
Key Insertion
To insert a new key, the branching position can be found by traversing the trie with the key one by one character until it gets stuck. The state where there is no branch to go is the very place to insert a new edge, labeled by the failing character. However, with the branch-tail structure, the insertion point can be either in the branch or in the tail.
1. When the branching point is in the double-array structure
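Suppose that the new key is a string a_1a_2...a_{h-1}a_ha_{h+1}...a_n, where a_1a_2...a_{h-1} traverses the trie from the root to a node s_r in the double-array structure, and there is no edge labeled a_h that goes out of s_r. The algorithm called A_INSERT in [Aoe1989] inserts an edge labeled a_h from s_r to a new node s_t, and lets s_t be a separate node pointing to the string a_{h+1}...a_n in the tail pool.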
2. When the branching point is in the tail pool
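Since the path through a tail string has no branch, and therefore corresponds to exactly one key, suppose that the key corresponding to the tail is
a_1a_2...a_{h-1}a_h...a_{h+k-1}b_1...b_m,
where a_1a_2...a_{h-1} is in the double-array structure, and a_h...a_{h+k-1}b_1...b_m is in the tail. Suppose that the substring a_1a_2...a_{h-1} traverses the trie from the root to a node s_r.
And suppose that the new key is in the form
a_1a_2...a_{h-1}a_h...a_{h+k-1}a_{h+k}...a_n,
where a_{h+k} <> b_1. The algorithm called B_INSERT in [Aoe1989] inserts a straight path labeled a_h...a_{h+k-1} from s_r, ending at a new node s_t. From s_t, it inserts an edge labeled b_1 to a new separate node s_u pointing to the string b_2...b_m in the tail pool, and an edge labeled a_{h+k} to a new separate node s_v pointing to the string a_{h+k+1}...a_n in the tail pool.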
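The two insertion cases can be sketched in Python with ordinary objects standing in for the double-array cells and the tail pool. The Node class, the function names and the "#" terminator convention (so that no key is a prefix of another) are assumptions of this example, not the data layout of [Aoe1989].

```python
class Node:
    def __init__(self, tail=None):
        self.children = {}   # branch part: edge label -> Node
        self.tail = tail     # suffix string for a separate node, else None


def insert(root, key):
    """Insert key (assumed to end with a unique terminator such as '#')."""
    node, i = root, 0
    while node.tail is None and i < len(key) and key[i] in node.children:
        node = node.children[key[i]]
        i += 1
    if node.tail is None:
        # Case 1: stuck in the branch part; add one edge plus a tail.
        if i < len(key):
            node.children[key[i]] = Node(tail=key[i + 1:])
        return
    # Case 2: the walk ended at a separate node; split its tail.
    old, new = node.tail, key[i:]
    if old == new:
        return                       # key is already present
    k = 0
    while old[k] == new[k]:          # common prefix; the terminator ends it
        k += 1
    node.tail = None
    for ch in old[:k]:               # straight path for the shared part
        node.children[ch] = Node()
        node = node.children[ch]
    node.children[old[k]] = Node(tail=old[k + 1:])
    node.children[new[k]] = Node(tail=new[k + 1:])


def contains(root, key):
    node, i = root, 0
    while node.tail is None:
        if i == len(key) or key[i] not in node.children:
            return False
        node = node.children[key[i]]
        i += 1
    return node.tail == key[i:]      # the rest is a plain string comparison


root = Node()
for word in ("pool#", "prize#", "preview#"):
    insert(root, word)
```

In this sketch only the branching prefix of each key is stored as nodes; everything after the last branching point is kept as one string.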
Key Deletion
To delete a key from the trie, all we need to do is delete the tail block occupied by the key, and all double-array nodes belonging exclusively to the key, without touching any node belonging to other keys.
Consider a trie which accepts a language K = {pool#, prepare#, preview#, prize#, produce#, producer#, progress#} :
The key "pool#" can be deleted by removing the tail string "ol#" from the tail pool, and node 3 from the double-array structure. This is the simplest case.
To remove the key "produce#", it is sufficient to delete node 14 from the double-array structure. But the resulting trie will not obey the convention that every node in the double-array structure, except the separate nodes which point to tail blocks, must belong to more than one key. The path from node 10 on will belong solely to the key "producer#".
But there is no harm in violating this rule. The only drawback is that the trie becomes less compact. The traversal, insertion and deletion algorithms remain intact. Therefore, this rule should be relaxed, for the sake of simplicity and efficiency of the deletion algorithm. Otherwise, there must be extra steps to examine other keys in the same subtree ("producer#" for the deletion of "produce#") in case any node needs to be moved from the double-array structure to the tail pool.
Suppose further that having removed "produce#" as such (by removing only node 14), we also need to remove "producer#" from the trie. What we have to do is remove string "#" from tail, and remove nodes 15, 13, 12, 11, 10 (which now belong solely to the key "producer#") from the double-array structure.
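We can thus summarize the algorithm to delete a key k = a_1a_2...a_{h-1}a_h...a_n, where a_1a_2...a_{h-1} is in the double-array structure, and a_h...a_n is in the tail pool, as follows: delete a_h...a_n from the tail pool; then, starting from the node s_r reached by a_1a_2...a_{h-1}, repeatedly delete the current node s from the double-array structure and move to its parent, until s is the root or outdegree(s) > 0, where outdegree(s) is the number of child nodes of s.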
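A minimal Python sketch of this deletion loop, using plain objects with parent links in place of the double-array cells. The Node class, the add helper and the hand-built sample trie (keys "pool#", "produce#" and "producer#") are assumptions of the example.

```python
class Node:
    def __init__(self, parent=None, label=None, tail=None):
        self.parent, self.label = parent, label
        self.children = {}   # edge label -> Node
        self.tail = tail     # suffix string for a separate node, else None


def add(parent, label, tail=None):
    child = Node(parent, label, tail)
    parent.children[label] = child
    return child


def delete(root, key):
    """Remove key; return False if it is not in the trie."""
    node, i = root, 0
    while node.tail is None:
        if i == len(key) or key[i] not in node.children:
            return False
        node = node.children[key[i]]
        i += 1
    if node.tail != key[i:]:
        return False
    node.tail = None                      # delete the tail block
    s = node
    while True:                           # repeat ... until
        p = s.parent
        del p.children[s.label]           # delete node s
        s = p
        if s is root or len(s.children) > 0:   # outdegree(s) > 0
            return True


# Hand-built trie for {"pool#", "produce#", "producer#"}.
root = Node()
p = add(root, "p")
add(p, "o", tail="ol#")
node = p
for ch in "roduce":
    node = add(node, ch)
add(node, "#", tail="")
add(node, "r", tail="#")
```

Deleting "produce#" removes a single node here, while a later deletion of "producer#" climbs the whole non-branching path up to the node that still leads to "pool#".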
Double-Array Pool Allocation
When inserting a new branch for a node, it is possible that the array element for the new branch has already been allocated to another node. In that case, relocation is needed. The efficiency-critical part then turns out to be the search for a new place. A brute force algorithm iterates along the check array to find an empty cell to place the first branch, and then ensures that there are empty cells for all other branches as well. The time used is therefore proportional to the size of the double-array pool and the size of the alphabet.
Suppose that there are n nodes in the trie, and the alphabet is of size m. The size of the double-array structure would be n + cm, where c is a coefficient which depends on the characteristics of the trie. And the time complexity of the brute force algorithm would be O(nm + cm^2).
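A brute force search of this kind might look like the following Python sketch. The NONE marker, the function name and the sample check array are assumptions of the example.

```python
NONE = -1  # assumed marker for a free cell


def find_base_brute_force(check, symbols):
    """Try every base b until all cells b + c are free."""
    for b in range(len(check)):                      # O(n + cm) candidates
        if all(b + c < len(check) and check[b + c] == NONE
               for c in symbols):                    # O(m) cells each
            return b
    return None   # no room; a real implementation would grow the pool


# Hypothetical pool: cells 0-4 and 6 are occupied, the rest are free.
check = [0, 0, 1, 1, 2, NONE, 1, NONE, NONE, NONE]
```

The outer loop visits occupied and free cells alike, which is where the dependence on the total pool size comes from.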
[Aoe1989] proposed a free-space list in the double-array structure to make the time complexity independent of the size of the trie, but dependent on the number of free cells only. The check array for the free cells is redefined to keep a pointer to the next free cell (called G-link):
Definition 3. Let r_1, r_2, ..., r_{cm} be the free cells in the double-array structure, ordered by position. G-link is defined as follows:
check[0] = -r_1
check[r_i] = -r_{i+1} ; 1 <= i <= cm-1
check[r_{cm}] = -1
By this definition, negative check means unoccupied in the same sense as that for "none" check in the ordinary algorithm. This encoding scheme forms a singly-linked list of free cells. When searching for an empty cell, only cm free cells are visited, instead of all n + cm cells as in the brute force algorithm.
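As an illustration of Definition 3, the following Python sketch threads a G-link through a small check array and then walks it. The function names and the sample pool are assumptions of the example; it also assumes that cell 1 is never free, since -1 is used as the end marker.

```python
def build_g_link(check, free_cells):
    """Link the given free cells in position order, as in Definition 3."""
    prev = 0
    for r in sorted(free_cells):
        check[prev] = -r
        prev = r
    check[prev] = -1          # end of list


def free_cells_of(check):
    """Visit only the free cells by following the negative links."""
    out = []
    s = -check[0]
    while s != 1:             # -1 marks the end of the list
        out.append(s)
        s = -check[s]
    return out


# Hypothetical pool of 8 cells; cells 1, 2 and 5 are occupied.
check = [0, 0, 1, 0, 0, 1, 0, 0]
build_g_link(check, [3, 4, 6, 7])
```

Walking the list touches four cells here instead of all eight.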
This, however, can still be improved. Notice that for those cells with negative check, the corresponding base values are not given any definition. Therefore, in our implementation, Aoe's G-link is modified to be a doubly-linked list by letting the base of every free cell point to the previous free cell. This can speed up the insertion and deletion processes. And, for convenience in referencing the list head and tail, we let the list be circular. The zeroth node is dedicated to be the entry point of the list. And the root node of the trie will begin with cell number one.
Definition 4. Let r_1, r_2, ..., r_{cm} be the free cells in the double-array structure, ordered by position. G-link is defined as follows:
check[0] = -r_1
check[r_i] = -r_{i+1} ; 1 <= i <= cm-1
check[r_{cm}] = 0
base[0] = -r_{cm}
base[r_1] = 0
base[r_{i+1}] = -r_i ; 1 <= i <= cm-1
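Definition 4 can be checked with a short Python sketch that builds the circular list for a given set of free cells. The function names and the sample pool are assumptions of the example.

```python
def build_free_list(base, check, free_cells):
    """Thread a circular doubly-linked list through cell 0 and the free cells."""
    ring = [0] + sorted(free_cells)
    n = len(ring)
    for i, cell in enumerate(ring):
        check[cell] = -ring[(i + 1) % n]   # link to the next free cell
        base[cell] = -ring[(i - 1) % n]    # link to the previous free cell


def forward(check):
    out, s = [], -check[0]
    while s != 0:
        out.append(s)
        s = -check[s]
    return out


def backward(base):
    out, s = [], -base[0]
    while s != 0:
        out.append(s)
        s = -base[s]
    return out


# Hypothetical pool of 8 cells; cells 1, 2 and 5 are occupied.
base = [0, 1, 0, 0, 0, 0, 0, 0]
check = [0, 0, 1, 0, 0, 1, 0, 0]
build_free_list(base, check, [3, 4, 6, 7])
```

Because the list is circular, both the first and the last free cell are reachable from cell 0 in one step.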
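Then, searching for the slots for a node with input symbol set P = {c_1, c_2, ..., c_p} needs to iterate over only the cells with negative check: find the first free cell s beyond c_1, test whether the cells s + c_i - c_1 are free for every other c_i in P, return s - c_1 as the new base if they are, and otherwise move on to the next free cell.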
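A minimal Python sketch of this search over the free list of Definition 4. The function names and the sample arrays are assumptions of the example. One detail is an interpretation rather than something the text spells out: since Definition 4 stores 0 in the check of the last free cell, the sketch also treats the cell named by -base[0] as free.

```python
def is_free(base, check, cell):
    tail = -base[0]                      # last free cell, 0 if the list is empty
    return check[cell] < 0 or (tail != 0 and cell == tail)


def find_free_base(base, check, symbols):
    """Return a base b with every cell b + c free, or None."""
    cs = sorted(symbols)
    c1 = cs[0]
    s = -check[0]
    while s != 0 and s <= c1:            # skip cells that would give b <= 0
        s = -check[s]
    while s != 0:                        # s is a free cell for symbol c1
        b = s - c1
        if all(b + c < len(check) and is_free(base, check, b + c)
               for c in cs[1:]):
            return b
        s = -check[s]
    return None   # no fit; a real implementation would extend the pool


# Hypothetical pool of 10 cells; cells 1, 2 and 5 are occupied,
# free list: 3 -> 4 -> 6 -> 7 -> 8 -> 9.
check = [-3, 0, 1, -4, -6, 1, -7, -8, -9, 0]
base = [-9, 1, 0, 0, -3, 0, -4, -6, -7, -8]
```

Only free cells are ever considered as the position for the first symbol, which is what removes the dependence on n.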
The time complexity for free slot searching is reduced to O(cm^2). The relocation stage takes O(m^2). The total time complexity is therefore O(cm^2 + m^2) = O(cm^2).
It is useful to keep the free list ordered by position, so that the access through the array becomes more sequential. This would be beneficial when the trie is stored in a disk file or virtual memory, because the disk caching or page swapping would be used more efficiently. So, when a cell is freed, it should be linked back into the list at its place in that order, which requires walking the list from the head to find the insertion point.
Time complexity of freeing a cell is thus O(cm).
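Freeing a cell while keeping the list ordered can be sketched in Python as follows, again over the circular list of Definition 4. The function names and the sample arrays are assumptions of the example.

```python
def free_cell(base, check, s):
    """Link cell s into the free list at its position-ordered place."""
    t = -check[0]
    while t != 0 and t < s:      # O(cm) walk to the first free cell after s
        t = -check[t]
    prev = -base[t]              # t == 0 means s becomes the last free cell
    check[s] = -t
    base[s] = -prev
    check[prev] = -s
    base[t] = -s


def forward(check):
    out, s = [], -check[0]
    while s != 0:
        out.append(s)
        s = -check[s]
    return out


def backward(base):
    out, s = [], -base[0]
    while s != 0:
        out.append(s)
        s = -base[s]
    return out


# Hypothetical pool of 10 cells; cells 1, 2 and 5 are occupied,
# free list: 3 -> 4 -> 6 -> 7 -> 8 -> 9.
check = [-3, 0, 1, -4, -6, 1, -7, -8, -9, 0]
base = [-9, 1, 0, 0, -3, 0, -4, -6, -7, -8]
free_cell(base, check, 5)        # lands between 4 and 6
free_cell(base, check, 2)        # becomes the new first free cell
```

The walk to the insertion point is the O(cm) part; the relinking itself touches only four entries.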
An Implementation
In my implementation, I designed the API with persistent data in mind. Tries can be saved to disk and loaded for use afterward. And in newer versions, non-persistent usage is also possible. You can create a trie in memory, populate data to it, use it, and free it, without any disk I/O. Alternatively you can load a trie from disk and save it to disk whenever you want.
The trie data is portable across platforms. The byte order in the disk is always little-endian, and is read correctly on either little-endian or big-endian systems.
The trie index is a 32-bit signed integer. This allows 2,147,483,646 (2^31 - 2) total nodes in the trie data, which should be sufficient for most problem domains. And each data entry can store a 32-bit integer value associated with it. This value can be used for any purpose, according to your needs. If you don't need to use it, just store some dummy value.
For sparse data compactness, the trie alphabet set should be continuous, but that is usually not the case in general character sets. Therefore, a map between the input character and the low-level alphabet set for the trie is created in the middle. You will have to define your input character set by listing their continuous ranges of character codes in a .abm (alphabet map) file when creating a trie. Then, each character will be automatically assigned internal codes of continuous values.
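The idea of such a map can be illustrated with a short Python sketch. It is not the library's code or file format: the function name, the choice to start internal codes at 1, and the sample ranges are assumptions of the example.

```python
def build_alphabet_map(ranges):
    """Assign consecutive internal codes to the characters in the ranges."""
    mapping = {}
    code = 1                      # assumption: code 0 is kept for other uses
    for lo, hi in ranges:
        for cp in range(lo, hi + 1):
            if cp not in mapping:
                mapping[cp] = code
                code += 1
    return mapping


# Hypothetical input set: Latin small letters plus a block of Thai letters.
amap = build_alphabet_map([(0x0061, 0x007A), (0x0E01, 0x0E3A)])
```

Two ranges that are thousands of code points apart end up as one contiguous block of 84 internal codes, which keeps each node's row in the pool short.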
Download
Update: The double-array trie implementation has been simplified and rewritten from scratch in C, and is now named libdatrie. It is now available under the terms of GNU Lesser General Public License (LGPL):
- libdatrie-0.2.4 (30 June 2010)
- libdatrie-0.2.3 (27 February 2010)
- libdatrie-0.2.2 (29 April 2009)
- libdatrie-0.2.1 (5 April 2009)
- libdatrie-0.2.0 (24 March 2009)
- libdatrie-0.1.3 (28 January 2008)
- libdatrie-0.1.2 (25 August 2007)
- libdatrie-0.1.1 (12 October 2006)
- libdatrie-0.1.0 (18 September 2006)
SVN: svn co http://linux.thai.net/svn/software/datrie
The old C++ source code below is under the terms of GNU Lesser General Public License (LGPL):
- midatrie-0.3.3 (2 October 2001)
- midatrie-0.3.3 (16 July 2001)
- midatrie-0.3.2 (21 May 2001)
- midatrie-0.3.1 (8 May 2001)
- midatrie-0.3.0 (23 Mar 2001)
Other Implementations
- DoubleArrayTrie: Java implementation by Christos Gioran
References
- [Knuth1972] Knuth, D. E. The Art of Computer Programming Vol. 3, Sorting and Searching. Addison-Wesley. 1972.
- [Fredkin1960] Fredkin, E. Trie Memory. Communications of the ACM. Vol. 3:9 (Sep 1960). pp. 490-499.
- [Cohen1990] Cohen, D. Introduction to Theory of Computing. John Wiley & Sons. 1990.
- [Johnson1975] Johnson, S. C. YACC-Yet another compiler-compiler. Bell Lab. NJ. Computing Science Technical Report 32. pp.1-34. 1975.
- [Aho+1985] Aho, A. V., Sethi, R., Ullman, J. D. Compilers : Principles, Techniques, and Tools. Addison-Wesley. 1985.
- [Aoe1989] Aoe, J. An Efficient Digital Search Algorithm by Using a Double-Array Structure. IEEE Transactions on Software Engineering. Vol. 15, 9 (Sep 1989). pp. 1066-1077.
- [Virach+1993] Virach Sornlertlamvanich, Apichit Pittayaratsophon, Kriangchai Chansaenwilai. Thai Dictionary Data Base Manipulation using Multi-indexed Double Array Trie. 5th Annual Conference. National Electronics and Computer Technology Center. Bangkok. 1993. pp 197-206. (in Thai)
Copyright © 1999 by Theppitak Karoonboonyanan, Software and Language Engineering Laboratory, National Electronics and Computer Technology Center. All rights reserved.
Copyright © 2003-2010 by Theppitak Karoonboonyanan. All rights reserved.