Analyzers in Lucene 3.6.0

zhwj184

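Tokens: besides the term text itself, the position increment is the only token metadata carried through to the index.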
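
To make the position increment idea concrete, here is a minimal plain-Java sketch. It is not Lucene code: the class `PositionIncrementSketch` and its `analyze` method are hypothetical names made up for illustration. Each surviving token records how many positions it advanced since the previous surviving token, so a dropped stop word shows up as a gap instead of disappearing silently. The sketch assumes gaps are preserved; in Lucene that depends on how the stop filter's position increments are configured.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class PositionIncrementSketch {

    /** Returns "term(+increment)" for every token that is not a stop word. */
    public static List<String> analyze(String text, Set<String> stopWords) {
        List<String> out = new ArrayList<String>();
        int skipped = 0; // stop words dropped since the last emitted token
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // empty input produces one empty piece
            }
            String term = raw.toLowerCase(Locale.ROOT);
            if (stopWords.contains(term)) {
                skipped++;
                continue;
            }
            // a normal token advances by 1; each dropped word adds 1 more
            out.add(term + "(+" + (skipped + 1) + ")");
            skipped = 0;
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<String>(Arrays.asList("the", "of"));
        // prints [quick(+2), fox(+1), north(+3)]
        System.out.println(analyze("The quick fox of the north", stop));
    }
}
```

The increments are what let phrase queries notice that a word used to sit between "fox" and "north".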

Implementation of the Porter stemming algorithm

/**
 *
 * Stemmer, implementing the Porter Stemming Algorithm
 *
 * The Stemmer class transforms a word into its root form.  The input
 * word can be provided a character at time (by calling add()), or at once
 * by calling one of the various stem(something) methods.
 */
PorterStemmer.java
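
As a rough illustration of what "transforms a word into its root form" means, the sketch below implements only step 1a of the Porter algorithm (the plural-suffix rules). It is not Lucene's PorterStemmer.java; the class name `PorterStep1aSketch` is hypothetical, input is assumed to be lowercase already, and the remaining steps of the algorithm are left out.

```java
public class PorterStep1aSketch {

    /** Applies only step 1a of the Porter algorithm (plural suffixes). */
    public static String step1a(String word) {
        if (word.endsWith("sses")) {
            return word.substring(0, word.length() - 2); // sses -> ss
        }
        if (word.endsWith("ies")) {
            return word.substring(0, word.length() - 2); // ies -> i
        }
        if (word.endsWith("ss")) {
            return word;                                  // ss -> ss (unchanged)
        }
        if (word.endsWith("s")) {
            return word.substring(0, word.length() - 1); // s -> (removed)
        }
        return word;
    }

    public static void main(String[] args) {
        // prints: caresses -> caress, ponies -> poni, caress -> caress, cats -> cat
        for (String w : new String[] {"caresses", "ponies", "caress", "cats"}) {
            System.out.println(w + " -> " + step1a(w));
        }
    }
}
```

The rules are tried longest suffix first, which is why "caress" is left alone instead of losing its final "s".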


Getting tokens in Lucene 3.6.0
            TokenStream tokenStream = analyzer.tokenStream("context", new StringReader("旧水泥袋"));
            OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
            CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);

            // reset the stream before consuming it
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                int startOffset = offsetAttribute.startOffset();
                int endOffset = offsetAttribute.endOffset();
                String term = charTermAttribute.toString();
                // print the term text together with its character offsets
                System.out.println(term + " [" + startOffset + "," + endOffset + ")");
            }
            // finish the stream and release its resources
            tokenStream.end();
            tokenStream.close();

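WhitespaceAnalyzer: splits the text into tokens at whitespace.
SimpleAnalyzer: splits the text at non-letter characters and lowercases the tokens.
StopAnalyzer: splits the text at non-letter characters, lowercases the tokens, then removes stop words.
StandardAnalyzer: tokenizes with a more sophisticated grammar that recognizes e-mail addresses, acronyms, Chinese-Japanese-Korean characters, and alphanumerics; lowercases the tokens; removes stop words.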
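
The difference between the first three analyzers is easiest to see on one input. The sketch below imitates their described behaviour in plain Java, without any Lucene classes; `AnalyzerBehaviourSketch` and its methods are hypothetical names, and the three-word stop list is made up for the example. StandardAnalyzer is left out because its grammar is too involved for a short sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class AnalyzerBehaviourSketch {

    /** WhitespaceAnalyzer-like: split on whitespace only, keep case and punctuation. */
    public static List<String> whitespace(String text) {
        List<String> out = new ArrayList<String>();
        for (String t : text.split("\\s+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    /** SimpleAnalyzer-like: split at non-letters, then lowercase. */
    public static List<String> simple(String text) {
        List<String> out = new ArrayList<String>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    /** StopAnalyzer-like: the simple() tokens minus the given stop words. */
    public static List<String> stop(String text, Set<String> stopWords) {
        List<String> out = new ArrayList<String>();
        for (String t : simple(text)) {
            if (!stopWords.contains(t)) out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "The XY&Z Corporation - xyz@example.com";
        Set<String> stopWords = new HashSet<String>(Arrays.asList("the", "a", "an"));
        System.out.println(whitespace(text)); // [The, XY&Z, Corporation, -, xyz@example.com]
        System.out.println(simple(text));     // [the, xy, z, corporation, xyz, example, com]
        System.out.println(stop(text, stopWords)); // [xy, z, corporation, xyz, example, com]
    }
}
```

Note how splitting at non-letters breaks the e-mail address into three tokens, which is the kind of input StandardAnalyzer's grammar is meant to handle better.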


StopAnalyzer's stop word list. You can supply your own stop word list through the constructor.
public final class StopAnalyzer extends StopwordAnalyzerBase {
  
  /** An unmodifiable set containing some common English words that are not usually useful
  for searching.*/
  public static final Set<?> ENGLISH_STOP_WORDS_SET;
  
  static {
    final List<String> stopWords = Arrays.asList(
      "a", "an", "and", "are", "as", "at", "be", "but", "by",
      "for", "if", "in", "into", "is", "it",
      "no", "not", "of", "on", "or", "such",
      "that", "the", "their", "then", "there", "these",
      "they", "this", "to", "was", "will", "with"
    );
    final CharArraySet stopSet = new CharArraySet(Version.LUCENE_CURRENT, 
        stopWords.size(), false);
    stopSet.addAll(stopWords);  
    ENGLISH_STOP_WORDS_SET = CharArraySet.unmodifiableSet(stopSet); 
  }
  /** Builds an analyzer with the stop words from the given set.
   * @param matchVersion See <a href="#version">above</a>
   * @param stopWords Set of stop words */
  public StopAnalyzer(Version matchVersion, Set<?> stopWords) {
    super(matchVersion, stopWords);
  }

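StandardAnalyzer: its tokenizer is generated from a grammar (JFlex in the 3.x line; early Lucene releases used JavaCC). You can also pass a stop word list to its constructor to get the same stop word removal as StopAnalyzer.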

PerFieldAnalyzerWrapper.java lets you use a different analyzer for specific fields.

Map<String, Analyzer> analyzerPerField = new HashMap<String, Analyzer>();
analyzerPerField.put("firstname", new KeywordAnalyzer());
analyzerPerField.put("lastname", new KeywordAnalyzer());

// fields without an entry in the map fall back to the default analyzer
PerFieldAnalyzerWrapper aWrapper =
    new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_36), analyzerPerField);
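
The wrapper idea itself is just a lookup with a fallback. Below is a minimal plain-Java sketch of that dispatch, under the assumption that an analyzer can be modelled as "text in, tokens out". `PerFieldDispatchSketch`, `MiniAnalyzer`, and `demo()` are hypothetical names for illustration only and are not part of Lucene.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class PerFieldDispatchSketch {

    /** Hypothetical stand-in for an analyzer: turns field text into tokens. */
    public interface MiniAnalyzer {
        String[] tokens(String text);
    }

    private final MiniAnalyzer defaultAnalyzer;
    private final Map<String, MiniAnalyzer> perField;

    public PerFieldDispatchSketch(MiniAnalyzer defaultAnalyzer,
                                  Map<String, MiniAnalyzer> perField) {
        this.defaultAnalyzer = defaultAnalyzer;
        this.perField = perField;
    }

    /** Uses the field-specific analyzer when one is registered, else the default. */
    public String[] tokens(String fieldName, String text) {
        MiniAnalyzer analyzer = perField.get(fieldName);
        return (analyzer != null ? analyzer : defaultAnalyzer).tokens(text);
    }

    /** Builds a wrapper where name fields are kept whole and everything else is split. */
    public static PerFieldDispatchSketch demo() {
        // keyword-style: the whole field value is a single token
        MiniAnalyzer keyword = text -> new String[] { text };
        // default: lowercase, then split on whitespace
        MiniAnalyzer lowerWords = text -> text.toLowerCase(Locale.ROOT).trim().split("\\s+");

        Map<String, MiniAnalyzer> perField = new HashMap<String, MiniAnalyzer>();
        perField.put("firstname", keyword);
        perField.put("lastname", keyword);
        return new PerFieldDispatchSketch(lowerWords, perField);
    }

    public static void main(String[] args) {
        PerFieldDispatchSketch wrapper = demo();
        System.out.println(wrapper.tokens("firstname", "Mary Ann").length); // 1
        System.out.println(wrapper.tokens("body", "Mary Ann").length);      // 2
    }
}
```

Registering the same keyword-style analyzer for both name fields mirrors the firstname/lastname setup in the snippet above.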

2012-05-13 19:45