AC Automaton (Aho-Corasick)

itace

Usage

Setting up the Trie is a piece of cake:

    Trie trie = Trie.builder()
        .addKeyword("hers")
        .addKeyword("his")
        .addKeyword("she")
        .addKeyword("he")
        .build();
    Collection<Emit> emits = trie.parseText("ushers");

You can now read the resulting collection. For this input it contains the following matches:

"she" starting at position 1, ending at position 3
"he" starting at position 2, ending at position 3
"hers" starting at position 2, ending at position 5
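To see where these three matches come from, here is a minimal, self-contained sketch of the Aho-Corasick idea: a keyword trie plus failure links. It is not the library's code. The class name `AhoCorasickSketch`, the `Node` type and the `keyword@start-end` output format are assumptions made purely for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch for illustration; not the library's Trie class.
class AhoCorasickSketch {
    // One trie node: outgoing edges, a failure link, and the keywords ending here.
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail;
        final List<String> outputs = new ArrayList<>();
    }

    private final Node root = new Node();

    AhoCorasickSketch(List<String> keywords) {
        // Phase 1: insert every keyword into the trie.
        for (String keyword : keywords) {
            Node node = root;
            for (char c : keyword.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.outputs.add(keyword);
        }
        // Phase 2: breadth-first pass that sets the failure links.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<Character, Node> edge : node.next.entrySet()) {
                Node child = edge.getValue();
                Node fallback = node.fail;
                while (fallback != root && !fallback.next.containsKey(edge.getKey())) {
                    fallback = fallback.fail;
                }
                Node target = fallback.next.get(edge.getKey());
                child.fail = (target != null) ? target : root;
                // Inherit keywords ending at the failure target (e.g. "he" inside "she").
                child.outputs.addAll(child.fail.outputs);
                queue.add(child);
            }
        }
    }

    // Returns "keyword@start-end" strings; end positions are inclusive.
    List<String> parseText(String text) {
        List<String> emits = new ArrayList<>();
        Node node = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Follow failure links until some node accepts the character.
            while (node != root && !node.next.containsKey(c)) {
                node = node.fail;
            }
            node = node.next.getOrDefault(c, root);
            for (String keyword : node.outputs) {
                emits.add(keyword + "@" + (i - keyword.length() + 1) + "-" + i);
            }
        }
        return emits;
    }

    public static void main(String[] args) {
        AhoCorasickSketch trie = new AhoCorasickSketch(List.of("hers", "his", "she", "he"));
        System.out.println(trie.parseText("ushers")); // [she@1-3, he@2-3, hers@2-5]
    }
}
```

The text is scanned once. After reading "she" the automaton also reports "he" through the inherited outputs, and on the following "r" it falls back from "she" to "he" and continues towards "hers" without rescanning.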

In normal situations you probably want to remove overlapping instances, retaining the longest and left-most matches.

    Trie trie = Trie.builder()
        .removeOverlaps()
        .addKeyword("hot")
        .addKeyword("hot chocolate")
        .build();
    Collection<Emit> emits = trie.parseText("hot chocolate");

The removeOverlaps method tells the Trie to remove all overlapping matches. For this it relies on the following conflict resolution rules: 1) longer matches prevail over shorter matches, 2) left-most prevails over right-most. There is only one result now:

"hot chocolate" starting at position 0, ending at position 12
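The two conflict resolution rules can be sketched as a simple greedy filter. This is only an illustration under stated assumptions: `OverlapSketch` is a hypothetical helper, matches are plain `{start, end}` pairs with inclusive ends, and the library may implement the same rules differently.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration; matches are {start, end} pairs with inclusive ends.
class OverlapSketch {
    static List<int[]> removeOverlaps(List<int[]> matches) {
        List<int[]> sorted = new ArrayList<>(matches);
        // Rule 1: longer matches first. Rule 2: on equal length, left-most first.
        sorted.sort(Comparator
                .comparingInt((int[] m) -> m[0] - m[1]) // most negative value = longest match
                .thenComparingInt((int[] m) -> m[0]));
        List<int[]> kept = new ArrayList<>();
        for (int[] candidate : sorted) {
            boolean overlaps = false;
            for (int[] k : kept) {
                // Two inclusive ranges overlap when each starts before the other ends.
                if (candidate[0] <= k[1] && k[0] <= candidate[1]) {
                    overlaps = true;
                    break;
                }
            }
            if (!overlaps) {
                kept.add(candidate);
            }
        }
        // Return the survivors in reading order.
        kept.sort(Comparator.comparingInt((int[] m) -> m[0]));
        return kept;
    }
}
```

For the example above, passing `{0, 2}` ("hot") and `{0, 12}` ("hot chocolate") leaves only `{0, 12}`.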

If you want the algorithm to only check for whole words, you can tell the Trie to do so:

    Trie trie = Trie.builder()
        .onlyWholeWords()
        .addKeyword("sugar")
        .build();
    Collection<Emit> emits = trie.parseText("sugarcane sugarcane sugar canesugar");

In this case, it will only find one match, whereas it would normally find four. The sugarcane/canesugar words are discarded because they are partial matches.
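The whole-word rule can be illustrated with a small boundary check. The sketch below uses a naive `indexOf` scan instead of the Trie, and it assumes that a word boundary means "no letter directly before or after the match"; `WholeWordSketch` is a hypothetical name and the library's exact boundary test may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; uses a naive scan rather than the Trie.
class WholeWordSketch {
    // Returns the start offsets of keyword occurrences that are not glued to a letter.
    static List<Integer> wholeWordStarts(String text, String keyword) {
        List<Integer> starts = new ArrayList<>();
        for (int start = text.indexOf(keyword); start >= 0; start = text.indexOf(keyword, start + 1)) {
            int end = start + keyword.length() - 1; // inclusive end
            boolean leftOk = start == 0 || !Character.isLetter(text.charAt(start - 1));
            boolean rightOk = end == text.length() - 1 || !Character.isLetter(text.charAt(end + 1));
            if (leftOk && rightOk) {
                starts.add(start);
            }
        }
        return starts;
    }
}
```

Called with the text above and the keyword "sugar", the scan sees four occurrences (offsets 0, 10, 20 and 30) and keeps only the one at offset 20.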

Some text is WrItTeN in a combination of lowercase and uppercase and is therefore hard to match. You can instruct the Trie to lowercase the entire search text to ease the matching process. The lowercasing extends to the keywords as well.

    Trie trie = Trie.builder()
        .caseInsensitive()
        .addKeyword("casing")
        .build();
    Collection<Emit> emits = trie.parseText("CaSiNg");

Normally, this match would not be found. With the caseInsensitive setting, the entire search text is lowercased before matching begins, so exactly one match is found. Since you still have the original search text and know exactly where the match was found, you can still recover the original casing.
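Recovering the original casing from the match position can be sketched as follows. `CaseInsensitiveSketch` is a hypothetical helper that uses `indexOf` instead of the Trie, and it assumes that lowercasing does not change the length of the text, which holds for plain ASCII input.

```java
import java.util.Locale;

// Hypothetical helper for illustration; uses indexOf rather than the Trie.
class CaseInsensitiveSketch {
    // Returns the matched fragment in its original casing, or null if there is no match.
    static String findOriginalCasing(String text, String keyword) {
        String loweredText = text.toLowerCase(Locale.ROOT);
        String loweredKeyword = keyword.toLowerCase(Locale.ROOT);
        int start = loweredText.indexOf(loweredKeyword);
        if (start < 0) {
            return null;
        }
        // Offsets carry over as long as lowercasing did not change the length.
        return text.substring(start, start + loweredKeyword.length());
    }
}
```

For example, `findOriginalCasing("Mixed CaSiNg here", "casing")` returns `"CaSiNg"`.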

It is also possible to just ask whether the text matches any of the keywords, or just to return the first match it finds.

    Trie trie = Trie.builder().removeOverlaps()
            .addKeyword("ab")
            .addKeyword("cba")
            .addKeyword("ababc")
            .build();
    Emit firstMatch = trie.firstMatch("ababcbab");

The firstMatch will now be "ababc", found at position 0. The containsMatch method simply checks whether there is a first match and returns true if so.

If you just want the bare-bones Aho-Corasick algorithm (i.e., no handling of case insensitivity, overlaps or whole words) and you prefer to supply your own handler, that is also possible.

    Trie trie = Trie.builder()
            .addKeyword("hers")
            .addKeyword("his")
            .addKeyword("she")
            .addKeyword("he")
            .build();

    final List<Emit> emits = new ArrayList<>();
    EmitHandler emitHandler = new EmitHandler() {

        @Override
        public void emit(Emit emit) {
            emits.add(emit);
        }
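    };

    // Assumption: parseText has an overload that accepts the handler as a second argument.
    trie.parseText("ushers", emitHandler);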

In many cases you want to process both the matching and the non-matching text. In that case you may be better served by Trie.tokenize(). It lets you loop over the entire text and handle each match as soon as you encounter it. Let's look at an example where we want to highlight words from The Hitchhiker's Guide to the Galaxy (HGttG) in HTML:

    String speech = "The Answer to the Great Question... Of Life, " +
            "the Universe and Everything... Is... Forty-two,' said " +
            "Deep Thought, with infinite majesty and calm.";
    Trie trie = Trie.builder().removeOverlaps().onlyWholeWords().caseInsensitive()
        .addKeyword("great question")
        .addKeyword("forty-two")
        .addKeyword("deep thought")
        .build();
    Collection<Token> tokens = trie.tokenize(speech);
    StringBuilder html = new StringBuilder();
    html.append("<html><body><p>");
    for (Token token : tokens) {
        if (token.isMatch()) {
            html.append("<i>");
        }
        html.append(token.getFragment());
        if (token.isMatch()) {
            html.append("</i>");
        }
    }
    html.append("</p></body></html>");
    System.out.println(html);
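The idea behind tokenizing, cutting the text into alternating non-matching and matching fragments, can be sketched without the library. `TokenizeSketch` and its `highlight` method are hypothetical, and the sketch assumes the matches are already non-overlapping and sorted by start position.

```java
import java.util.List;

// Hypothetical helper for illustration; not the library's tokenize implementation.
class TokenizeSketch {
    // matches: non-overlapping {start, end} pairs (inclusive end), sorted by start.
    static String highlight(String text, List<int[]> matches) {
        StringBuilder html = new StringBuilder();
        int cursor = 0;
        for (int[] m : matches) {
            html.append(text, cursor, m[0]); // non-matching fragment before the match
            html.append("<i>").append(text, m[0], m[1] + 1).append("</i>");
            cursor = m[1] + 1;
        }
        html.append(text, cursor, text.length()); // trailing non-matching fragment
        return html.toString();
    }
}
```

For example, `highlight("say forty-two now", List.of(new int[]{4, 12}))` yields `say <i>forty-two</i> now`.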

Reposted from: https://github.com/robert-bor/aho-corasick


2016-09-09 10:18