
Python SuffixTree: a suffix-tree AutoComplete algorithm with Chinese support

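Posted: 2012-08-14   Last edited: 2012-08-14
The Python board on javaeye has been far too quiet lately, so here is a fun open-source program for everyone to play with. The code implements a suffix tree, a structure commonly used for autocomplete. If you have not come across it yet, take a look :)
Source: https://github.com/edisonlz/suffixTree_ch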



o SuffixTree.SuffixTree -- The suffix tree structure.  This is a
  thin wrapper around strmat's stree data structure.  This isn't a
  complete wrapper yet; I need to find some time to complete this.
  The wrapper appears to be good enough for simple stuff.

  Methods of SuffixTree:

      o SuffixTree(alphabet=STREE_ASCII)

          Construct a new SuffixTree.  By default, the alphabet
          used by the SuffixTree is ASCII.  Other choices include
          STREE_DNA, STREE_RNA, and STREE_PROTEIN.

      o add(string, id)

          Adds a string to the suffix tree with an id.

      o root()

          Returns the root SuffixNode of the tree.

      o num_nodes()

          Returns the total number of nodes held in the tree.

      o match(string)

          Given a string, traverse the suffix tree and return a
          3-tuple (match_length, suffix_node, endpos)
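
To make the add/match/num_nodes behaviour above more concrete, here is a minimal pure-Python sketch. It is not the strmat-backed SuffixTree: it is an uncompressed suffix trie (one character per edge, quadratic space), and the names Node and NaiveSuffixTrie are made up for illustration. Its match() returns only (match_length, node) and leaves out endpos, which the README does not describe further.

```python
class Node:
    """One trie node: outgoing edges plus the ids of strings passing through."""

    def __init__(self):
        self.children = {}   # char -> Node
        self.ids = set()     # ids of every added string containing this path


class NaiveSuffixTrie:
    """Hypothetical stand-in for SuffixTree; uncompressed, for illustration only."""

    def __init__(self):
        self.root = Node()
        self.node_count = 1  # the root itself

    def add(self, string, ident):
        # Insert every suffix of `string`, tagging each visited node with `ident`.
        for start in range(len(string)):
            node = self.root
            for ch in string[start:]:
                child = node.children.get(ch)
                if child is None:
                    child = node.children[ch] = Node()
                    self.node_count += 1
                child.ids.add(ident)
                node = child

    def num_nodes(self):
        return self.node_count

    def match(self, string):
        # Walk down from the root as far as `string` allows.
        node, length = self.root, 0
        for ch in string:
            child = node.children.get(ch)
            if child is None:
                break
            node, length = child, length + 1
        return length, node


tree = NaiveSuffixTrie()
tree.add("banana", 1)
tree.add("bandana", 2)
length, node = tree.match("nan")
print(length, sorted(node.ids))
```

Because every suffix is inserted, any substring of an added string is a path from the root, which is what makes substring lookup and autocomplete possible. A real suffix tree compresses chains of single-child nodes into one edge to keep the size linear.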



o SuffixTree.SuffixNode  (I need to fix the documentation here)

    Methods of SuffixNode:
    num_children()
    find_child(char ch)
    children()
    next()
    parent()
    suffix_link()
    edgelen()
    edgestr()
    getch()
    labellen()
    labelstr()
    ident()
    num_leaves()
    leaf(int leafnum)



o SuffixTree.SubstringDict -- An application of suffix trees toward
  substring matching.  An example might help:

  >>> #coding=utf-8
  >>> from SuffixTree import SubstringDict


  >>> sd = SubstringDict()
  >>> sd.__setitem__("我是python程序员",1)
  >>> sd.__setitem__("我是ruby程序员",2)
  >>> sd.__setitem__("我是javascript程序员",3)
  >>> sd.__setitem__("我是android程序员",4)
  >>> sd.__setitem__("我还是DBA",4)
  >>> print sd["我是"]
  >>> print sd["我还是"]



  >>> sd = SubstringDict()
  >>> sd["我是python程序员"] = 1
  >>> sd["我是ruby程序员"] = 2
  >>> sd["我是javascript程序员"] = 3
  >>> sd["我是android程序员"] = 4
  >>> sd["我还是DBA"] = 5
  >>> print sd["我还是"]


  SubstringDict provides a mapping that allows for substrings of
  keys.  The keys do need to be strings though.
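
The post does not show what the lookups print, so the following is only a sketch of the mapping semantics described above. It assumes a lookup returns the values of all keys that contain the given fragment. NaiveSubstringDict is a hypothetical name, and it uses a plain linear scan over the keys instead of a suffix tree.

```python
class NaiveSubstringDict:
    """Hypothetical model of SubstringDict: look values up by any substring of a key."""

    def __init__(self):
        self._items = {}  # full key -> value, in insertion order

    def __setitem__(self, key, value):
        if not isinstance(key, str):
            raise TypeError("keys must be strings")
        self._items[key] = value

    def __getitem__(self, fragment):
        # Linear scan over every key; a suffix tree answers the same
        # question by walking `fragment` down from the root instead.
        return [value for key, value in self._items.items() if fragment in key]


sd = NaiveSubstringDict()
sd["我是python程序员"] = 1
sd["我是ruby程序员"] = 2
sd["我是javascript程序员"] = 3
sd["我是android程序员"] = 4
sd["我还是DBA"] = 5

print(sd["我是"])     # values of the four keys containing "我是"
print(sd["我还是"])   # value of the single key containing "我还是"
```

For autocomplete, the fragment is whatever the user has typed so far, and the returned values identify the candidate completions.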

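  Chinese is supported by base64-encoding the strings. This increases the data size by roughly 30% (base64 emits 4 characters for every 3 bytes) and costs some performance, but the loss is small.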
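
The size overhead can be checked with the standard base64 module. This sketch only illustrates the encoding round trip and its cost; how the library applies base64 internally is not shown in the post, and encode_key / decode_key are hypothetical helper names.

```python
import base64


def encode_key(text):
    """UTF-8 encode the text, then base64 it into an ASCII-only string."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def decode_key(encoded):
    """Reverse encode_key: base64-decode, then interpret the bytes as UTF-8."""
    return base64.b64decode(encoded).decode("utf-8")


raw = "我是python程序员"
encoded = encode_key(raw)
raw_size = len(raw.encode("utf-8"))   # bytes before encoding
encoded_size = len(encoded)           # ASCII characters after encoding

print(encoded)
print(raw_size, encoded_size, encoded_size / raw_size)
assert decode_key(encoded) == raw     # the round trip is lossless
```

The encoded form contains only ASCII characters, which is what lets it fit the tree's default ASCII alphabet.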

  Installing on 64-bit:
  ARCHFLAGS="-arch i386 -arch x86_64" python setup.py install