`
thd52java
  • 浏览: 72255 次
  • 性别: Icon_minigender_1
  • 来自: 北京
社区版块
存档分类
最新评论

Prefix tree

阅读更多

Prefix tree

The trie, or prefix tree, is a data structure for storing strings or other sequences in a way that allows for a fast look-up. In its simplest form it can be used as a list of keywords or a dictionary. By associating each string with an object it can be used as an alternative to a hashmap. The name 'trie' comes from the word 'retrieval'.

The basic idea behind a trie is that each successive letter is stored as a separate node. To find out if the word 'cat' is in the list you start at the root and look up the 'c' node. Having found the 'c' node you search the list of c's children for an 'a' node, and so on. To differentiate between 'cat' and 'catalog' each word is ended by a special delimiter.

The figure below shows a schematic representation of a partial trie:

a schematic representation of a trie structure

Implementation

The fastest way to implement this is with fixed size arrays. Unfortunately this only works if you know which characters can show up in the sequences. For keywords with 26 letters its a fast but space consuming option, for unicode strings its pretty much impossible.

Instead of fixed sizes arrays you can use a linked list at each node. This has obvious space advantages, since no more empty spaces are stored. Unfortunately searching a long linked list is rather slow. For example to find the word 'zzz' you might need 3 times 26 steps.

Faster trie algorithms have been devised that lie somewhere between these two extremes in terms of speed and space consumption. These can be found by searching google.

Fun & games with prefix trees

Prefix trees are a bit of an overlooked data structure with lots of interesting possibilities.

Storage

By storing values at each leaf node you can use them as a kind of alternative hashmap, although when working with unicode strings a hashmap will greatly outperform a trie.

As a dictionary

Looking up if a word is in a trie takes O(n) operations, where n is the length of the word. Thus - for array implementations - the lookup speed doesn't change with increasing trie size.

Word completion

Word completion is straightforward to implement using a trie: simply find the node corresponding to the first few letters, and then collape the subtree into a list of possible endings.

This can be used in autocompleting user input in text editors or the T9 dictionary on your phone

Censoring strings

Given a large list of swear words and a string to censor a trie offers a speed advantage over a simple array of strings. If the swear word can appear anywhere in the string you'll need to attempt to match it from any possible starting offset. With a string of m characters and a list of n words this would mean m*n string comparisons.

Using a trie you can attempt to find a match from each given offset in the string, this means m trie lookups. Since the speed of a trie lookup scales well with an increasing number of words this is considerably faster than the array lookup.

Java linked list implementation

Just for fun, here's a java linked list implementation. Keep in mind that this is a fairly slow implementation. For serious speed boosts you'll need to investigate double or triple-array tries.

Please note: the version below is a simplified version intended only to give some insight into the workings of the Trie. For the full version please see theDownloads section.

publicclass Trie
{
    /**
     * The delimiter used in this word to tell where words end. Without a proper delimiter either A.
     * a lookup for 'win' would return false if the list also contained 'windows', or B. a lookup
     * for 'mag' would return true if the only word in the list was 'magnolia'
     *
     * The delimiter should never occur in a word added to the trie.
     */

    public final static char DELIMITER = '\u0001';

    /**
     * Creates a new Trie.
     */

    public Trie()
    {
        root = new Node('r');
        size = 0;
    }

    /**
     * Adds a word to the list.
     * @param word The word to add.
     * @return True if the word wasn't in the list yet
     */

    public boolean add(String word)
    {
        if (add(root, word+ DELIMITER, 0))
        {
            size++;
            int n = word.length();
            if (n > maxDepth) maxDepth = n;
            return true;
        }
        return false;
    }

    /*
     * Does the real work of adding a word to the trie
     */

    private boolean add(Node root, String word,int offset)
    {
        if (offset== word.length())return false;
        int c = word.charAt(offset);

        // Search for node to add to
        Node last = null, next = root.firstChild;
        while (next != null)
        {
            if (next.value < c)
            {
                // Not found yet, continue searching
                last = next;
                next = next.nextSibling;
            }
            else if (next.value == c)
            {
                // Match found, add remaining word to this node
                return add(next, word, offset+ 1);
            }
            // Because of the ordering of the list getting here means we won't
            // find a match
            else break;
        }

        // No match found, create a new node and insert
        Node node = new Node(c);
        if (last== null)
        {
            // Insert node at the beginning of the list (Works for next == null
            // too)
            root.firstChild = node;
            node.nextSibling = next;
        }
        else
        {
            // Insert between last and next
            last.nextSibling = node;
            node.nextSibling = next;
        }

        // Add remaining letters
        for (int i= offset + 1; i< word.length(); i++)
        {
            node.firstChild =new Node(word.charAt(i));
            node = node.firstChild;
        }
        return true;
    }

    /**
     * Searches for a word in the list.
     *
     * @param word The word to search for.
     * @return True if the word was found.
     */

    public boolean isEntry(String word)
    {
        if (word.length()== 0)
            throw new IllegalArgumentException("Word can't be empty");
        return isEntry(root, w+ DELIMITER, 0);
    }

    /*
     * Does the real work of determining if a word is in the list
     */

    private boolean isEntry(Node root, String word, int offset)
    {
        if (offset== word.length())return true;
        int c = word.charAt(offset);

        // Search for node to add to
        Node next = root.firstChild;
        while (next != null)
        {
            if (next.value < c) next= next.nextSibling;
            else if (next.value == c) return isEntry(next, word, offset +1);
            else return false;
        }
        return false;
    }

    /**
     * Returns the size of this list;
     */

    public int size()
    {
        return size;
    }

    /**
     * Returns all words in this list starting with the given prefix
     *
     * @param prefix The prefix to search for.
     * @return All words in this list starting with the given prefix, or if no such words are found,
     *         an array containing only the suggested prefix.
     */

    public String[] suggest(String prefix)
    {
        return suggest(root, prefix,0);
    }

    /*
     * Recursive function for finding all words starting with the given prefix
     */

    private String[] suggest(Node root,String word, int offset)
    {
        if (offset== word.length())
        {
            ArrayList<String> words = new ArrayList<String>(size);
            char[] chars= new char[maxDepth];
            for (int i = 0; i < offset; i++)
                chars[i] = word.charAt(i);
            getAll(root, words, chars, offset);
            return words.toArray(newString[words.size()]);
        }
        int c = word.charAt(offset);

        // Search for node to add to
        Node next = root.firstChild;
        while (next != null)
        {
            if (next.value < c) next= next.nextSibling;
            else if (next.value == c) return suggest(next, word, offset +1);
            else break;
        }
        return new String[]{ word };
    }

    /**
     * Searches a string for words present in the trie and replaces them with stars (asterixes).
     * @param z The string to censor
     */

    public String censor(String s)
    {
        if (size== 0) return s;
        String z = s.toLowerCase();
        int n = z.length();
        StringBuilder buffer = new StringBuilder(n);
        int match;
        char star = '*';
        for (int i= 0; i < n;)
        {
            match = longestMatch(root, z, i,0, 0);
            if (match > 0)
            {
                for (int j = 0; j < match; j++)
                {
                    buffer.append(star);
                    i++;
                }
            }
            else
            {
                buffer.append(s.charAt(i++));
            }
        }
        return buffer.toString();
    }

    /*
     * Finds the longest matching word in the trie that starts at the given offset...
     */

    private int longestMatch(Node root, String word, int offset,int depth, int maxFound)
    {
        // Uses delimiter = first in the list!
        Node next = root.firstChild;
        if (next.value== DELIMITER) maxFound = depth;
        if (offset== word.length())return maxFound;
        int c = word.charAt(offset);

        while (next != null)
        {
            if (next.value < c) next= next.nextSibling;
            else if (next.value == c) return longestMatch(next, word,
                offset + 1, depth + 1, maxFound);
            else return maxFound;
        }
        return maxFound;
    }

    /*
     * Represents a node in the trie. Because a node's children are stored in a linked list this
     * data structure takes the odd structure of node with a firstChild and a nextSibling.
     */

    private class Node
    {
        public int value;
        public Node firstChild;
        public Node nextSibling;

        public Node(int value)
        {
            this.value= value;
            firstChild = null;
            nextSibling = null;
        }
    }


    private Node root;
    private int size;
    private int maxDepth; // Not exact, but bounding for the maximum
}

Please note: the code given above is intended only to give some insight into the workings of the Trie. For the full version of the class please see theDownloads section.

1
1
分享到:
评论

相关推荐

    prefixtree:go中的前缀树(trie)实现

    前缀树prefixtree包实现了一个简单的前缀trie数据结构。 通过树,可以快速搜索与给定前缀唯一匹配的字符串。 该实现允许用户将数据与每个字符串相关联,因此它可以充当一种灵活的键值存储,在该存储中,搜索将以最短...

    Unity中基于前缀树的红点系统实现 (Unity + Lua + Prefix Tree) 红点系统.zip

    Unity中基于前缀树的红点系统实现 (Unity + Lua + Prefix Tree) 红点系统是在大部分游戏中都能看到的常见功能需求。在相关的UI上亮起红点来提示玩家进行相关操作。 项目基于UnityXFramework框架,集成tolua,使用lua...

    hbase-prefix-tree-1.4.3-API文档-中文版.zip

    赠送jar包:hbase-prefix-tree-1.4.3.jar; 赠送原API文档:hbase-prefix-tree-1.4.3-javadoc.jar; 赠送源代码:hbase-prefix-tree-1.4.3-sources.jar; 赠送Maven依赖信息文件:hbase-prefix-tree-1.4.3.pom; ...

    hbase-prefix-tree-1.1.3-API文档-中文版.zip

    赠送jar包:hbase-prefix-tree-1.1.3.jar; 赠送原API文档:hbase-prefix-tree-1.1.3-javadoc.jar; 赠送源代码:hbase-prefix-tree-1.1.3-sources.jar; 赠送Maven依赖信息文件:hbase-prefix-tree-1.1.3.pom; ...

    All-optical prefix tree adder with the help of terahertz optical asymmetric demultiplexer

    We propose and describe an all-optical prefix tree adder with the help of a terahertz optical asymmetric demultiplexer (TOAD) using a set of optical switches. The prefix tree adder is useful in ...

    hbase-prefix-tree-1.2.12-API文档-中文版.zip

    赠送jar包:hbase-prefix-tree-1.2.12.jar; 赠送原API文档:hbase-prefix-tree-1.2.12-javadoc.jar; 赠送源代码:hbase-prefix-tree-1.2.12-sources.jar; 赠送Maven依赖信息文件:hbase-prefix-tree-1.2.12.pom;...

    hbase-prefix-tree-1.2.12-API文档-中英对照版.zip

    赠送jar包:hbase-prefix-tree-1.2.12.jar; 赠送原API文档:hbase-prefix-tree-1.2.12-javadoc.jar; 赠送源代码:hbase-prefix-tree-1.2.12-sources.jar; 赠送Maven依赖信息文件:hbase-prefix-tree-1.2.12.pom;...

    FP_Tree.rar_ FP_tree _fp-tree_fp_tree_fp算法_tree

    - **前缀树(Prefix Tree)**:在构建阶段就利用了项目的有序性,比FP-Tree更快但占用更多空间。 - **FP-Growth**:通过剪枝策略进一步优化FP-Tree,减少了中间结果的存储。 总结,FP-Tree算法是数据挖掘中的一种...

    hbase-prefix-tree-1.1.3-API文档-中英对照版.zip

    赠送jar包:hbase-prefix-tree-1.1.3.jar; 赠送原API文档:hbase-prefix-tree-1.1.3-javadoc.jar; 赠送源代码:hbase-prefix-tree-1.1.3-sources.jar; 赠送Maven依赖信息文件:hbase-prefix-tree-1.1.3.pom; ...

    radixtree基数树

    基数树,也称为紧凑前缀树(Compact Prefix Tree)、字节序树(Byte-Ordered Tree),是一种高效的字符串集合数据结构,常用于存储和检索具有公共前缀的字符串。它通过将字符串的每个字符视为一个节点来组织,使得在...

    专题资料(2021-2022年)ASB 与遵义移动联合项目培训资料字冠分析与路由.doc

    - **SCP PREFIX TREE (TREE=204)**:可能涉及智能网络(Smart Network)中的Service Control Point。 - **CF PREFIX TREE (TREE=205)**:可能是Call Forwarding相关的字冠处理。 - **CLI 字冠分析**:CLI...

    数据结构常用算法c++实现

    Prefix Tree(Trie) *Suffix Tree(未实现)* B-Tree Hash by multiplication Hash table Universal hash function Perfect hash Java's string hash FNV-1a string hash SimHash Bloom Filter SHA-1 Message Digest ...

    Prefix Tree:快速的前缀树和DAWG在C和python中的实现-开源

    例如,对于字符串"apple"和"appletree",它们在前缀树中的路径将共享"app"部分,然后分别在各自的分支上结束。这种结构使得查找以特定前缀开头的所有字符串变得非常迅速。 压缩前缀树,或者称为Deterministic ...

    PrefixTreeESpan 频繁模式挖掘

    PrefixTree,也称为前缀树或Trie树,是一种高效的数据结构,用于存储字符串集合。在频繁模式挖掘中,PrefixTree常被用作项集的存储结构,以降低内存消耗并加速挖掘过程。通过构建前缀树,每个频繁项集作为一个节点...

    不生成候选集的频繁模式数据挖掘

    它是一种扩展的前缀树(Prefix Tree),能够有效地存储关键信息,并且避免了传统方法中的重复数据库扫描,从而显著提高了挖掘效率。 ##### 2.1 FP-Tree的关键特点: 1. **压缩性**:FP-Tree通过对频繁项集进行压缩...

    galileo二三层转发芯片工作原理.ppt

    Prefix Tree是一种高效的路由查找结构,它根据IP前缀匹配数据包的目的IP,确定最佳下一跳,从而决定数据包的转发路径。 虚设备的概念是为了让二层芯片利用三层芯片的路由功能。当为三层芯片分配一个虚设备号并启用...

    prefixspan 算法实现 python3

    3. **构建前缀树**:使用前缀树(prefix tree 或 trie)来存储频繁项集,便于进行前缀扩展。 4. **递归发现频繁项集**:从单个项目开始,对每个频繁项集进行前缀扩展,并创建投影数据库,递归查找更长的频繁项集。 ...

    基于三叉树的中文字典结构

    三叉树,又称为Trie或Prefix Tree,是一种有序树数据结构,它能有效存储和查找具有公共前缀的字符串。与二叉搜索树不同,三叉树每个节点通常有三个子节点,分别代表字符集中的三个不同字符。这样的设计使得三叉树在...

    succinct-binary-tree-encoding:简洁地表示二叉树的结构

    var tree = JSON . parse ( process . argv . slice ( 2 ) . join ( ' ' ) ) var bits = succinct . encode ( tree ) console . log ( bits . join ( '' ) ) $ node encode.js '{"left":null,"right":{"left":{...

Global site tag (gtag.js) - Google Analytics