Sentence: i have two cats
result = new GermanNormalizationFilter(result);
result = new GermanLightStemFilter(result);
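List<String> list = new ArrayList<String>();
// Words in this list are not stemmed or reduced to a base form
list.add("player");
CharArraySet ar = new CharArraySet(Version.LUCENE_43, list, true);
// The analyzer's second argument is the stopword set; the third is the set of
// words excluded from stemming (including singular/plural reduction)
GermanAnalyzer sa = new GermanAnalyzer(Version.LUCENE_43, null, ar);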
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
  // Standard tokenizer and standard filter
  final Tokenizer source = new StandardTokenizer(matchVersion, reader);
  TokenStream result = new StandardFilter(matchVersion, source);
  // Lowercase filter
  result = new LowerCaseFilter(matchVersion, result);
  // Stopword filter
  result = new StopFilter(matchVersion, result, stopwords);
  // Marks the words in the exclusion set as keywords so they are not stemmed
  result = new SetKeywordMarkerFilter(result, exclusionSet);
  if (matchVersion.onOrAfter(Version.LUCENE_36)) {
    // From Lucene 3.6 onward, the following filters are used.
    // Normalization: maps special German characters to plain-ASCII equivalents
    result = new GermanNormalizationFilter(result);
    // Stemming: reduces each word to its stem
    result = new GermanLightStemFilter(result);
  } else if (matchVersion.onOrAfter(Version.LUCENE_31)) {
    // For versions from 3.1 up to (but not including) 3.6, SnowballFilter is used
    result = new SnowballFilter(result, new German2Stemmer());
  } else {
    // Before Lucene 3.1, the backward-compatible GermanStemFilter is used
    result = new GermanStemFilter(result);
  }
  return new TokenStreamComponents(source, result);
}
package org.apache.lucene.analysis.de;

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.StemmerUtil;

/**
 * Normalizes German characters according to the heuristics
 * of the <a href="http://snowball.tartarus.org/algorithms/german2/stemmer.html">
 * German2 snowball algorithm</a>.
 * It allows for the fact that ä, ö and ü are sometimes written as ae, oe and ue.
 *
 * <ul>
 * <li> 'ß' is replaced by 'ss'
 * <li> 'ä', 'ö', 'ü' are replaced by 'a', 'o', 'u', respectively.
 * <li> 'ae' and 'oe' are replaced by 'a', and 'o', respectively.
 * <li> 'ue' is replaced by 'u', when not following a vowel or q.
 * </ul>
 * <p>
 * This is useful if you want this normalization without using
 * the German2 stemmer, or perhaps no stemming at all.
 *
 * As the description above makes clear, this filter mainly converts
 * certain special German letters to their plain-ASCII equivalents.
 */
public final class GermanNormalizationFilter extends TokenFilter {
  // FSM with 3 states:
  private static final int N = 0; /* ordinary state */
  private static final int V = 1; /* stops 'u' from entering umlaut state */
  private static final int U = 2; /* umlaut state, allows e-deletion */

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public GermanNormalizationFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (input.incrementToken()) {
      int state = N;
      char buffer[] = termAtt.buffer();
      int length = termAtt.length();
      for (int i = 0; i < length; i++) {
        final char c = buffer[i];
        switch(c) {
          case 'a':
          case 'o':
            state = U;
            break;
          case 'u':
            state = (state == N) ? U : V;
            break;
          case 'e':
            if (state == U)
              length = StemmerUtil.delete(buffer, i--, length);
            state = V;
            break;
          case 'i':
          case 'q':
          case 'y':
            state = V;
            break;
          case 'ä':
            buffer[i] = 'a';
            state = V;
            break;
          case 'ö':
            buffer[i] = 'o';
            state = V;
            break;
          case 'ü':
            buffer[i] = 'u';
            state = V;
            break;
          case 'ß':
            buffer[i++] = 's';
            buffer = termAtt.resizeBuffer(1+length);
            if (i < length)
              System.arraycopy(buffer, i, buffer, i+1, (length-i));
            buffer[i] = 's';
            length++;
            state = N;
            break;
          default:
            state = N;
        }
      }
      termAtt.setLength(length);
      return true;
    } else {
      return false;
    }
  }
}
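To make the three-state machine easier to follow, here is a minimal standalone sketch of the same normalization applied to a plain String. It does not depend on Lucene; the class name GermanNormalizeSketch and the method normalize are hypothetical names chosen for illustration, and a StringBuilder stands in for the term buffer that the real filter edits in place.

```java
// Standalone sketch of the normalization state machine (assumption: this
// mirrors the filter's logic on a String; it is not part of Lucene).
class GermanNormalizeSketch {
    private static final int N = 0; // ordinary state
    private static final int V = 1; // after a vowel or 'q': a following 'u' cannot enter U
    private static final int U = 2; // umlaut-candidate state: a following 'e' is deleted

    static String normalize(String term) {
        StringBuilder sb = new StringBuilder(term);
        int state = N;
        for (int i = 0; i < sb.length(); i++) {
            char c = sb.charAt(i);
            switch (c) {
                case 'a':
                case 'o':
                    state = U;
                    break;
                case 'u':
                    state = (state == N) ? U : V;
                    break;
                case 'e':
                    if (state == U) {
                        sb.deleteCharAt(i); // "ae", "oe", "ue" lose the 'e'
                        i--;                // revisit the character that moved into slot i
                    }
                    state = V;
                    break;
                case 'i':
                case 'q':
                case 'y':
                    state = V;
                    break;
                case '\u00e4': // a-umlaut
                    sb.setCharAt(i, 'a');
                    state = V;
                    break;
                case '\u00f6': // o-umlaut
                    sb.setCharAt(i, 'o');
                    state = V;
                    break;
                case '\u00fc': // u-umlaut
                    sb.setCharAt(i, 'u');
                    state = V;
                    break;
                case '\u00df': // eszett becomes "ss"
                    sb.setCharAt(i, 's');
                    sb.insert(i + 1, 's');
                    i++;            // skip the inserted 's'
                    state = N;
                    break;
                default:
                    state = N;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] samples = {"wei\u00dfbier", "m\u00fcller", "mueller", "quelle"};
        for (String s : samples) {
            System.out.println(normalize(s));
        }
    }
}
```

With these four inputs the sketch returns weissbier, muller, muller and quelle. The last case shows why the V state exists: the 'u' after 'q' never enters the umlaut state, so the following 'e' is kept.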
package org.apache.lucene.analysis.de;

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.miscellaneous.SetKeywordMarkerFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;

/**
 * A {@link TokenFilter} that applies {@link GermanLightStemmer} to stem German
 * words.
 * <p>
 * To prevent terms from being stemmed use an instance of
 * {@link SetKeywordMarkerFilter} or a custom {@link TokenFilter} that sets
 * the {@link KeywordAttribute} before this {@link TokenStream}.
 *
 * This class performs the stemming; the part to focus on is what
 * GermanLightStemmer does.
 */
public final class GermanLightStemFilter extends TokenFilter {
  private final GermanLightStemmer stemmer = new GermanLightStemmer();
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final KeywordAttribute keywordAttr = addAttribute(KeywordAttribute.class);

  public GermanLightStemFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (input.incrementToken()) {
      if (!keywordAttr.isKeyword()) {
        final int newlen = stemmer.stem(termAtt.buffer(), termAtt.length());
        termAtt.setLength(newlen);
      }
      return true;
    } else {
      return false;
    }
  }
}
package org.apache.lucene.analysis.de;

/*
 * This algorithm is updated based on code located at:
 * http://members.unine.ch/jacques.savoy/clef/
 *
 * Full copyright for that code follows:
 */

/*
 * Copyright (c) 2005, Jacques Savoy
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions are met:
 *
 * Redistributions of source code must retain the above copyright notice, this
 * list of conditions and the following disclaimer. Redistributions in binary
 * form must reproduce the above copyright notice, this list of conditions and
 * the following disclaimer in the documentation and/or other materials
 * provided with the distribution. Neither the name of the author nor the names
 * of its contributors may be used to endorse or promote products derived from
 * this software without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
 * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
 * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
 * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
 * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
 * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
 * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
 * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
 * POSSIBILITY OF SUCH DAMAGE.
 */

/**
 * Light Stemmer for German.
 * <p>
 * This stemmer implements the "UniNE" algorithm in:
 * <i>Light Stemming Approaches for the French, Portuguese, German and Hungarian Languages</i>
 * Jacques Savoy
 */
public class GermanLightStemmer {

  // Maps accented characters to their base vowels, then runs both stripping passes
  public int stem(char s[], int len) {
    for (int i = 0; i < len; i++)
      switch(s[i]) {
        case 'ä':
        case 'à':
        case 'á':
        case 'â': s[i] = 'a'; break;
        case 'ö':
        case 'ò':
        case 'ó':
        case 'ô': s[i] = 'o'; break;
        case 'ï':
        case 'ì':
        case 'í':
        case 'î': s[i] = 'i'; break;
        case 'ü':
        case 'ù':
        case 'ú':
        case 'û': s[i] = 'u'; break;
      }

    len = step1(s, len);
    return step2(s, len);
  }

  // True for the consonants that may precede a removable "s" or "st"
  private boolean stEnding(char ch) {
    switch(ch) {
      case 'b':
      case 'd':
      case 'f':
      case 'g':
      case 'h':
      case 'k':
      case 'l':
      case 'm':
      case 'n':
      case 't': return true;
      default: return false;
    }
  }

  // First stripping pass: removes ern, em/en/er/es, e, or a final s after a consonant
  private int step1(char s[], int len) {
    if (len > 5 && s[len-3] == 'e' && s[len-2] == 'r' && s[len-1] == 'n')
      return len - 3;

    if (len > 4 && s[len-2] == 'e')
      switch(s[len-1]) {
        case 'm':
        case 'n':
        case 'r':
        case 's': return len - 2;
      }

    if (len > 3 && s[len-1] == 'e')
      return len - 1;

    if (len > 3 && s[len-1] == 's' && stEnding(s[len-2]))
      return len - 1;

    return len;
  }

  // Second stripping pass: removes est, er/en, or a final st after a consonant
  private int step2(char s[], int len) {
    if (len > 5 && s[len-3] == 'e' && s[len-2] == 's' && s[len-1] == 't')
      return len - 3;

    if (len > 4 && s[len-2] == 'e' && (s[len-1] == 'r' || s[len-1] == 'n'))
      return len - 2;

    if (len > 4 && s[len-2] == 's' && s[len-1] == 't' && stEnding(s[len-3]))
      return len - 2;

    return len;
  }
}
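0. Replace certain special German characters with their plain-ASCII equivalents (GermanNormalizationFilter).
1. Map every accented vowel in the term to its base vowel: a, o, i, or u.
Step 2 (step1 in the code; the rules are tried in order and the first one that matches returns):
2. Words longer than 5 characters that end in "ern": remove "ern".
3. Words longer than 4 characters that end in "em", "en", "es", or "er": remove that two-letter ending.
4. Words longer than 3 characters that end in "e": remove the "e".
5. Words longer than 3 characters that end in "bs", "ds", "fs", "gs", "hs", "ks", "ls", "ms", "ns", or "ts": remove the final "s".
Step 3 (step2 in the code; again the first matching rule returns):
6. Words longer than 5 characters that end in "est": remove "est".
7. Words longer than 4 characters that end in "er" or "en": remove that ending.
8. Words longer than 4 characters that end in "bst", "dst", "fst", "gst", "hst", "kst", "lst", "mst", "nst", or "tst": remove the final two letters "st".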
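The rules above can be traced on a few words with a small standalone sketch. It restates rules 1 to 8 on plain Strings without any Lucene dependency; the class name GermanLightStemSketch and its String-based stem method are hypothetical names for illustration, and the sample words are assumptions chosen to hit different rules, not taken from the original post.

```java
// Standalone sketch of the light-stemming rules (assumption: a String-based
// restatement for illustration; the real stemmer works on a char buffer).
class GermanLightStemSketch {

    static String stem(String word) {
        char[] s = word.toCharArray();
        int len = s.length;
        // Rule 1: map accented vowels to their base vowel
        for (int i = 0; i < len; i++) {
            switch (s[i]) {
                case '\u00e4': case '\u00e0': case '\u00e1': case '\u00e2': s[i] = 'a'; break;
                case '\u00f6': case '\u00f2': case '\u00f3': case '\u00f4': s[i] = 'o'; break;
                case '\u00ef': case '\u00ec': case '\u00ed': case '\u00ee': s[i] = 'i'; break;
                case '\u00fc': case '\u00f9': case '\u00fa': case '\u00fb': s[i] = 'u'; break;
                default: break;
            }
        }
        len = step1(s, len);
        len = step2(s, len);
        return new String(s, 0, len);
    }

    // Consonants that may precede a removable "s" or "st"
    private static boolean stEnding(char ch) {
        return "bdfghklmnt".indexOf(ch) >= 0;
    }

    // Rules 2 to 5: the first matching rule decides the new length
    private static int step1(char[] s, int len) {
        if (len > 5 && s[len - 3] == 'e' && s[len - 2] == 'r' && s[len - 1] == 'n') return len - 3;
        if (len > 4 && s[len - 2] == 'e' && "mnrs".indexOf(s[len - 1]) >= 0) return len - 2;
        if (len > 3 && s[len - 1] == 'e') return len - 1;
        if (len > 3 && s[len - 1] == 's' && stEnding(s[len - 2])) return len - 1;
        return len;
    }

    // Rules 6 to 8: the first matching rule decides the new length
    private static int step2(char[] s, int len) {
        if (len > 5 && s[len - 3] == 'e' && s[len - 2] == 's' && s[len - 1] == 't') return len - 3;
        if (len > 4 && s[len - 2] == 'e' && (s[len - 1] == 'r' || s[len - 1] == 'n')) return len - 2;
        if (len > 4 && s[len - 2] == 's' && s[len - 1] == 't' && stEnding(s[len - 3])) return len - 2;
        return len;
    }

    public static void main(String[] args) {
        String[] samples = {"kindern", "h\u00e4user", "sch\u00f6nste", "tags", "autos"};
        for (String w : samples) {
            System.out.println(stem(w));
        }
    }
}
```

For these inputs the sketch yields kind, haus, schon, tag and autos. "schönste" is the case that needs both passes: rule 4 removes the final "e", then rule 8 removes "st" because it follows "n". "autos" is left unchanged because the final "s" follows a vowel, which none of the rules cover.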