Lucene: Full-Text Search (Problem Analysis)

wu_quanyin


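Handling at index-creation time:

1. Whether the field is analyzed (tokenized)

(1) Field.Index.ANALYZED: the field value is tokenized, and searches match against the resulting tokens

(2) Field.Index.NOT_ANALYZED: the field value is not tokenized, and searches match against the original value as a whole

(3) Field.Index.NO: the field is not indexed at all, so it is neither tokenized nor searchable

2. Whether the value is stored in the index files

(1) Field.Store.YES: the value is stored at index time and can be read back from a search hit

(2) Field.Store.NO: the value is not stored

Issue 1: configuring Field.Index.NO together with Field.Store.NO makes the program throw an error, because a field that is neither indexed nor stored serves no purpose.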

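Example

A Lucene Document with the following fields:

Field(path, NOT_ANALYZED, Store.YES)

Field(content, ANALYZED, Store.YES)

Field(name, NOT_ANALYZED, Store.NO)

(1) Searching on path only matches the full path, because the value was not tokenized

(2) Searching on content matches any token of the text

(3) Searching on name only matches the full name. The field is indexed but not stored, so its own value cannot be read back from a hit, although the stored values of other fields on the same Document can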

-----------------------------------------------------------------------------

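Handling at search time:

1. How clauses are combined

(1) BooleanClause.Occur.MUST: the clause must match (AND)

(2) BooleanClause.Occur.SHOULD: the clause may match (OR)

(3) BooleanClause.Occur.MUST_NOT: the clause must not match (NOT)

These are mainly used to combine searches across fields, and they correspond to AND, OR and NOT in SQL.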
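The SQL analogy is close but not exact: a SHOULD clause only becomes required when the query contains no MUST clause, and a query made up solely of MUST_NOT clauses matches nothing. The sketch below is a Lucene-free model of that matching rule. The class OccurModel and its method are hypothetical names used for illustration only; scoring is ignored and no minimum number of SHOULD matches is assumed.

```java
import java.util.Collection;
import java.util.Set;

/** Hypothetical, Lucene-free model of how MUST / SHOULD / MUST_NOT decide a match. */
final class OccurModel {

    /**
     * Returns true if a document holding docTerms satisfies the clauses.
     * Each clause is modelled as a single term that is either present or absent.
     */
    static boolean matches(Set<String> docTerms,
                           Collection<String> must,
                           Collection<String> should,
                           Collection<String> mustNot) {
        // Any matching MUST_NOT clause excludes the document.
        for (String term : mustNot) {
            if (docTerms.contains(term)) {
                return false;
            }
        }
        // Every MUST clause has to match.
        for (String term : must) {
            if (!docTerms.contains(term)) {
                return false;
            }
        }
        // With at least one MUST clause, SHOULD clauses are optional.
        if (!must.isEmpty()) {
            return true;
        }
        // Without MUST clauses, at least one SHOULD clause has to match.
        for (String term : should) {
            if (docTerms.contains(term)) {
                return true;
            }
        }
        return false;
    }
}
```

In this model a document containing "fujian" and "shanghai" matches MUST shanghai plus SHOULD beijing, but does not match SHOULD beijing on its own.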

------------------------------------------------------------------------------

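Problems:

1. In a one-to-many query, searching the index on the "many" side pulls in the "one" side's information several times. Lucene provides a filter for this, DuplicateFilter, which removes the duplicates.

2. Sorting: Lucene offers several kinds of sorting. Sorting by time, however, compares the values as strings (compareTo), so all values must use the same format for the order to be correct.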
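A short, Lucene-free sketch of why the format matters: string comparison only agrees with chronological order when every value is zero-padded and written most-significant part first. The class SortableDate and the pattern yyyyMMddHHmmss are assumptions chosen for illustration; Lucene's own DateTools class can produce similar sortable strings.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

/** Hypothetical helper that turns a Date into a string whose compareTo order is chronological. */
final class SortableDate {

    /** Formats the date as a fixed-width, year-first string in UTC. */
    static String toSortKey(Date date) {
        // A new formatter per call, because SimpleDateFormat is not thread-safe.
        SimpleDateFormat format = new SimpleDateFormat("yyyyMMddHHmmss", Locale.ROOT);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        return format.format(date);
    }
}
```

By contrast, unpadded values such as "2011-9-5" and "2011-10-1" compare in the wrong order as strings, because '9' sorts after '1'.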

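3. BooleanQuery can wrap other queries to build combined searches such as (福建 | 上海) + 上海, that is, (Fujian OR Shanghai) AND Shanghai.

4. MultiFieldQueryParser.parse() only gives special treatment to spaces, plus signs and similar special characters. A query such as "调拨单 人员" (transfer order, personnel) returns records matching 调拨单 or 人员, but not records that only match 调拨 (transfer). So the query text is segmented first, with IKAnalyzer handling the stop words: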

// Segment the query text with IK (true = maximum-word-length mode)
IKSegmentation ikSegmentation = new IKSegmentation(
		new StringReader(queryValue), true);
StringBuilder segmentCombine = new StringBuilder();
// Keep the whole input as a quoted phrase, then append every token
segmentCombine.append('"').append(queryValue).append('"');
segmentCombine.append(' ');
for (Lexeme lexeme = ikSegmentation.next(); lexeme != null; lexeme = ikSegmentation.next()) {
	segmentCombine.append(lexeme.getLexemeText());
	segmentCombine.append(' ');
}
queryValue = segmentCombine.toString();

// Parse the combined string against all index fields and require it to match
Query q2 = MultiFieldQueryParser.parse(Version.LUCENE_30,
		queryValue, indexFields, clauses, getAnalyzer());
((BooleanQuery) query).add(q2, BooleanClause.Occur.MUST);

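5. The highlight markup and the size of the returned relevant fragment can be set with

  SimpleHTMLFormatter

  highlighter.setTextFragmenter(new SimpleFragmenter(n));

6. The number of hits returned is set to 100 by default (configurable). For paging there is no way to fetch a middle page directly: you have to retrieve the hits up to the end of the page you want and work out the slice yourself.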
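The slicing arithmetic can be sketched without Lucene. The class HitPager and its methods are hypothetical; with Lucene 3.0 the list would typically be the scoreDocs of the TopDocs returned by IndexSearcher.search(query, n), with n taken from hitsNeeded.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for paging over an already retrieved list of top hits. */
final class HitPager {

    /** Number of top hits to request so that the given 1-based page is fully covered. */
    static int hitsNeeded(int page, int pageSize) {
        if (page < 1 || pageSize < 1) {
            throw new IllegalArgumentException("page and pageSize must be >= 1");
        }
        return page * pageSize;
    }

    /** Returns the hits of the given 1-based page, or an empty list if the page is past the end. */
    static <T> List<T> pageOf(List<T> topHits, int page, int pageSize) {
        if (page < 1 || pageSize < 1) {
            throw new IllegalArgumentException("page and pageSize must be >= 1");
        }
        int from = Math.min((page - 1) * pageSize, topHits.size());
        int to = Math.min(from + pageSize, topHits.size());
        // Copy the view so the result does not depend on the source list.
        return new ArrayList<T>(topHits.subList(from, to));
    }
}
```

The cost of this approach grows with the page number, since page k always needs the first k * pageSize hits.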

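7. Searching fails with a "Too many open files" error. The usual cause is a searcher that was never closed, yet the project did close its searcher.

Cause:

When the IndexSearcher is not constructed from the NIOFSDirectory itself (that is, it is given an IndexReader instead), indexSearcher.close() does not really close anything, because the reader still belongs to the caller. You have to call searcher.getIndexReader().close() yourself.

8. When searching, Lucene loads the relevant index data into memory. If the index is large and memory runs short, make the searcher a singleton and leave it open instead of reopening it per query.

9. Similarity (see the documentation for each parameter): override the Similarity class.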

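10. Deduplication: Lucene ships a DuplicateFilter, but it has a problem and does not meet this requirement.

For example:

  Master table   id: aa-xx-yy   title: 表单问题 (form issue)

  Detail table   id: aa   pid: aa-xx-yy   content: 你好 (hello)

                 id: bb   pid: aa-xx-yy   content: 谢谢 (thanks)

With DuplicateFilter deduplicating on pid, the default keeps the last entry and filters out the first. So searching for 你好 cannot reach the master record 表单问题, while searching for 谢谢 can.

Attempted fix: override one method of this class along the lines of FilteredQuery (in the uploaded file). It did not work, and no solution has been found yet.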
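The behaviour can be reproduced with a small Lucene-free model. It assumes, as described above, that deduplication runs over all rows before the query's matches are considered. DedupModel and Row are hypothetical names, and the contents are the English glosses of the example rows.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of deduplicating detail rows on pid before searching them. */
final class DedupModel {

    /** A detail row: its own id, the parent id (pid) and its content. */
    static final class Row {
        final String id;
        final String pid;
        final String content;

        Row(String id, String pid, String content) {
            this.id = id;
            this.pid = pid;
            this.content = content;
        }
    }

    /** Keeps one row per pid: the last one if keepLast is true, otherwise the first. */
    static List<Row> dedupByPid(List<Row> rows, boolean keepLast) {
        Map<String, Row> kept = new LinkedHashMap<String, Row>();
        for (Row row : rows) {
            if (keepLast || !kept.containsKey(row.pid)) {
                kept.put(row.pid, row);
            }
        }
        return new ArrayList<Row>(kept.values());
    }

    /** True if any surviving row contains the search text. */
    static boolean anyContains(List<Row> rows, String text) {
        for (Row row : rows) {
            if (row.content.contains(text)) {
                return true;
            }
        }
        return false;
    }
}
```

Keeping the first row instead only moves the problem: "hello" becomes reachable and "thanks" is lost. The requirement really calls for deduplicating the query's matches rather than the rows themselves.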

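11. Chinese segmentation with IKQueryParser has a problem:

searching for 调拨功能 (transfer function) returns only records containing both 调拨 and 功能, never records containing just 调拨 or just 功能.

This is the issue from item 4. The source of the class shows that IK joins all terms with MUST, so IKQueryParser was modified.

The replacement class, BWSQueryParser, is listed after item 14 (the file upload failed, so the code is pasted directly).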

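12. Advanced search

Advanced search is mostly about numbers (such as prices) and times. Lucene provides NumericRangeQuery and TermRangeQuery for these respectively, but numeric fields also need special handling at index time.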
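The special handling is needed because plain index terms are compared as strings. In Lucene 3.0 the usual route is to index the value with NumericField so that NumericRangeQuery can be used. The sketch below does not use that mechanism. It only illustrates the underlying ordering problem and the older zero-padding workaround, with the hypothetical class PaddedNumber.

```java
/** Hypothetical helper showing why numbers stored as plain strings break range checks. */
final class PaddedNumber {

    /** Zero-pads a non-negative value to a fixed width so that string order equals numeric order. */
    static String pad(long value, int width) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values need a different encoding");
        }
        String digits = Long.toString(value);
        if (digits.length() > width) {
            throw new IllegalArgumentException("value is wider than " + width + " digits");
        }
        StringBuilder padded = new StringBuilder(width);
        for (int i = digits.length(); i < width; i++) {
            padded.append('0');
        }
        return padded.append(digits).toString();
    }

    /** Inclusive range test that compares terms as strings, the way a term range does. */
    static boolean inRange(String term, String lower, String upper) {
        return term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0;
    }
}
```

Without padding, the term "50" falls outside the range "9" to "100" because the comparison is character by character. With every value padded to the same width, the same check succeeds.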

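13. Sorting

Lucene provides sorting through sort = new Sort(sortFields.toArray(sortArray)); Sorting on a single field works, but once further sort fields such as size and time are added, the results are no longer correct.

14. Highlighting

When the combined search query is reused for highlighting, fields that are indexed without analysis (NOT_ANALYZED) get no highlight markup.

Workaround: build a separate query, used only for highlighting, from the value sent by the client. Its conditions are the loosest possible, and every term in it is analyzed.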

/**
 * 
 */
package com.fdauto.bws.business.module.lucene;

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.BooleanClause.Occur;

import org.wltea.analyzer.IKSegmentation;
import org.wltea.analyzer.Lexeme;

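/**
 * IK query parser.
 * Builds non-conflicting combinations of ambiguous segmentation results,
 * which improves search hits for ambiguous keywords.
 * An optimized implementation for IK Analyzer V3.
 */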
public final class BWSQueryParser {
	
	
	//Thread-local cache of parsed query keywords
	private static ThreadLocal<Map<String , TokenBranch>> keywordCacheThreadLocal 
			= new ThreadLocal<Map<String , TokenBranch>>();
	
	
	//Whether to segment using maximum word length
	private static boolean isMaxWordLength = false;

	

	/**
	 * Sets the segmentation strategy.
	 * isMaxWordLength = true uses maximum-word-length segmentation.
	 * @param isMaxWordLength
	 */
	//public static void setMaxWordLength(boolean isMaxWordLength) {
		//BWSQueryParser.isMaxWordLength = isMaxWordLength ;
	//}
	
	/**
	 * Combines a list of queries into one,
	 * reducing the nesting of Query expressions.
	 * @param queries the queries to combine
	 * @param exactMatch true joins the queries with MUST, false with SHOULD
	 * @return the combined query, or null if the list is empty
	 */
	private static Query optimizeQueries(List<Query> queries,boolean exactMatch){	
		//Build the complete query for the current branch
		if(queries.size() == 0){
			return null;
		}else if(queries.size() == 1){
			return queries.get(0);
		}else{
			Occur occur = exactMatch ? Occur.MUST : Occur.SHOULD;
			BooleanQuery combined = new BooleanQuery();
			for(Query q : queries){
				combined.add(q, occur);
			}
			return combined;
		}			
	}
	
	/**
	 * Returns the thread-local parse cache, creating it on first use.
	 * @return the cache of this thread
	 */
	private static Map<String , TokenBranch> getTheadLocalCache(){
		Map<String , TokenBranch> keywordCache = keywordCacheThreadLocal.get();
		if(keywordCache == null){
			 keywordCache = new HashMap<String , TokenBranch>(4);
			 keywordCacheThreadLocal.set(keywordCache);
		}
		return keywordCache;
	}
	
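	/**
	 * Looks up the cached parse tree (TokenBranch) for a query string.
	 * @param query
	 * @return the cached tree, or null if this query has not been parsed yet
	 */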
	private static TokenBranch getCachedTokenBranch(String query){
		Map<String , TokenBranch> keywordCache = getTheadLocalCache();
		return keywordCache.get(query);
	}
	
	/**
	 * Stores the parse tree (TokenBranch) of a query string in the cache.
	 * @param query
	 * @param tb the parse tree to cache
	 */
	private static void cachedTokenBranch(String query , TokenBranch tb){
		Map<String , TokenBranch> keywordCache = getTheadLocalCache();
		keywordCache.put(query, tb);
	}
		
	
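	/**
	 * Parses a single contiguous string (no whitespace) against a single field.
	 * @param field
	 * @param query
	 * @param exactMatch true joins the terms with MUST, false with SHOULD
	 * @return
	 * @throws IOException
	 */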
	private static Query _parse(String field , String query,boolean exactMatch) throws IOException{
		if(field == null){
			throw new IllegalArgumentException("parameter \"field\" is null");
		}

		if(query == null || "".equals(query.trim())){
			return new TermQuery(new Term(field));
		}
		
		//Reuse the TokenBranch built by an earlier parse of this query, if cached
		TokenBranch root = getCachedTokenBranch(query);
		if(root != null){
			return optimizeQueries(root.toQueries(field,exactMatch),exactMatch); 
		}else{
			
			
			
			root = new TokenBranch(null);	
			if(!query.startsWith("\"")&&!query.endsWith("\"")){
				//Segment the query string with IK
				StringReader input = new StringReader(query.trim());
				IKSegmentation ikSeg = new IKSegmentation(input , isMaxWordLength);
				for(Lexeme lexeme = ikSeg.next() ; lexeme != null ; lexeme = ikSeg.next()){
					//Add the lexeme to the branch tree
					root.accept(lexeme);
				}
			}else{
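				//A quoted term is not segmented. (Passing 调拨单 unquoted yields the tokens 调拨单, 调拨 and 单,
				//so a record that contains exactly 调拨单 outweighs one that contains only 调拨.)
				//See coord() in Similarity: the more clauses match, the higher the weight.
				//The quoted text is split on "-" and each part is added as one lexeme.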
				String[] querys=query.replace("\"", "").split("-");
				for(String queryValue:querys){
					Lexeme lexeme=new Lexeme(0,0,queryValue.length(),0);
					lexeme.setLexemeText(queryValue);
					root.accept(lexeme);
				}
				if(query.length()!=1){
					String queryTemp=query.replaceAll("-", "");
					Lexeme lexeme=new Lexeme(0,0,queryTemp.length(),0);
					lexeme.setLexemeText(queryTemp);
					root.accept(lexeme);
				}
			}
			//Cache the parse tree
			cachedTokenBranch(query , root);
			return optimizeQueries(root.toQueries(field,exactMatch), exactMatch);
		}
	}
	
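	/**
	 * Parses a single condition against a single field.
	 * @param field -- Document field name
	 * @param query -- keyword
	 * @param exactMatch true joins the terms with MUST, false with SHOULD
	 * @return the resulting Query
	 * @throws IOException
	 */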
	public static Query parse(String field , String query,boolean exactMatch) throws IOException{
		if(field == null){
			throw new IllegalArgumentException("parameter \"field\" is null");
		}
		String[] qParts = query.split("\\s");
		if(qParts.length > 1){			
			BooleanQuery resultQuery = new BooleanQuery();
			for(String q : qParts){
				//Skip empty strings produced by consecutive spaces
				if("".equals(q)){
					continue;
				}
				Query partQuery = _parse(field , q,exactMatch);
				if(partQuery != null && 
				          (!(partQuery instanceof BooleanQuery) || ((BooleanQuery)partQuery).getClauses().length>0)){
					if(exactMatch)
						resultQuery.add(partQuery, Occur.MUST);
					else
						resultQuery.add(partQuery, Occur.SHOULD);
				}
			}
			return resultQuery;
		}else{
			return _parse(field , query,exactMatch);
		}
	}
	
	
	
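	/**
	 * Parses a single condition against multiple fields, each with its own Occur.
	 * @param fields -- Document fields name
	 * @param query	-- keyword
	 * @param flags -- BooleanClause
	 * @param exactMatch true joins the terms with MUST, false with SHOULD
	 * @return the resulting Query
	 * @throws IOException
	 */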
	public static Query parseMultiField(String[] fields , String query ,  BooleanClause.Occur[] flags,boolean exactMatch) throws IOException{
		if(fields == null){
			throw new IllegalArgumentException("parameter \"fields\" is null");
		}
		if(flags == null){
			throw new IllegalArgumentException("parameter \"flags\" is null");
		}
		
		if (flags.length != fields.length){
		      throw new IllegalArgumentException("flags.length != fields.length");
		}		
		
		BooleanQuery resultQuery = new BooleanQuery();		
		for(int i = 0; i < fields.length; i++){
			if(fields[i] != null){
				Query partQuery = parse(fields[i] , query,exactMatch);
				if(partQuery != null && 
				          (!(partQuery instanceof BooleanQuery) || ((BooleanQuery)partQuery).getClauses().length>0)){
					resultQuery.add(partQuery, flags[i]); 
				}
			}			
		}		
		return resultQuery;
	}


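	/**
	 * Parses multiple conditions against multiple fields, each with its own Occur.
	 * @param fields
	 * @param queries
	 * @param flags
	 * @param exactMatch true joins the terms with MUST, false with SHOULD
	 * @return the resulting Query
	 * @throws IOException
	 */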
	public static Query parseMultiField(String[] fields , String[] queries , BooleanClause.Occur[] flags,boolean exactMatch) throws IOException{
		if(fields == null){
			throw new IllegalArgumentException("parameter \"fields\" is null");
		}				
		if(queries == null){
			throw new IllegalArgumentException("parameter \"queries\" is null");
		}
		if(flags == null){
			throw new IllegalArgumentException("parameter \"flags\" is null");
		}
		
	    if (!(queries.length == fields.length && queries.length == flags.length)){
	        throw new IllegalArgumentException("queries, fields, and flags arrays have different lengths");
	    }

	    BooleanQuery resultQuery = new BooleanQuery();		
		for(int i = 0; i < fields.length; i++){
			if(fields[i] != null){
				Query partQuery = parse(fields[i] , queries[i], exactMatch);
				if(partQuery != null && 
				          (!(partQuery instanceof BooleanQuery) || ((BooleanQuery)partQuery).getClauses().length>0)){
					resultQuery.add(partQuery, flags[i]); 
				}
			}			
		}		
		return resultQuery;
	}	
	/**
	 * Token branch.
	 * When segmentation is ambiguous, token branches hold the alternative combinations.
	 *
	 */
	private static class TokenBranch{
		
		private static final int REFUSED = -1;
		private static final int ACCEPTED = 0;
		private static final int TONEXT = 1;
		
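		//Left border of this branch
		private int leftBorder;
		//Right border of this branch
		private int rightBorder;
		//Main lexeme of this branch
		private Lexeme lexeme;
		//Child branches that this branch has accepted
		private List<TokenBranch> acceptedBranchs;
		//The next adjacent branch after this one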
		private TokenBranch nextBranch;
		
		TokenBranch(Lexeme lexeme){
			if(lexeme != null){
				this.lexeme = lexeme;
				//Initialize the left and right borders of the branch
				this.leftBorder = lexeme.getBeginPosition();
				this.rightBorder = lexeme.getEndPosition();
			}
		}
		
		public int getLeftBorder() {
			return leftBorder;
		}

		public int getRightBorder() {
			return rightBorder;
		}

		public Lexeme getLexeme() {
			return lexeme;
		}

		public List<TokenBranch> getAcceptedBranchs() {
			return acceptedBranchs;
		}

		public TokenBranch getNextBranch() {
			return nextBranch;
		}

		public int hashCode(){
			if(this.lexeme == null){
				return 0;
			}else{
				return this.lexeme.hashCode() * 37;
			}
		}
		
		public boolean equals(Object o){			
			if(o == null){
				return false;
			}		
			if(this == o){
				return true;
			}
			if(o instanceof TokenBranch){
				TokenBranch other = (TokenBranch)o;
				if(this.lexeme == null ||
						other.getLexeme() == null){
					return false;
				}else{
					return this.lexeme.equals(other.getLexeme());
				}
			}else{
				return false;
			}			
		}	
		
		/**
		 * Merges a lexeme into the branch tree.
		 * @param _lexeme
		 * @return whether the current branch could take the lexeme
		 */
		boolean accept(Lexeme _lexeme){
			
			/*
			 * Check how the new lexeme can be accepted by the current branch
			 * acceptType : REFUSED  cannot be accepted
			 * acceptType : ACCEPTED accepted
			 * acceptType : TONEXT   accepted by the adjacent branch
			 */			
			int acceptType = checkAccept(_lexeme);			
			switch(acceptType){
			case REFUSED:
				// REFUSED case
				return false;
				
			case ACCEPTED : 
				if(acceptedBranchs == null){
					//The current branch has no child branch yet, so add the lexeme directly under it
					acceptedBranchs = new ArrayList<TokenBranch>(2);
					acceptedBranchs.add(new TokenBranch(_lexeme));					
				}else{
					boolean acceptedByChild = false;
					//The current branch has child branches, so let them try to accept the lexeme first
					for(TokenBranch childBranch : acceptedBranchs){
						acceptedByChild = childBranch.accept(_lexeme) || acceptedByChild;
					}
					//If no child branch can accept it, the current branch takes it
					if(!acceptedByChild){
						acceptedBranchs.add(new TokenBranch(_lexeme));
					}					
				}
				//Extend the right border of the branch if needed
				if(_lexeme.getEndPosition() > this.rightBorder){
					this.rightBorder = _lexeme.getEndPosition();
				}
				break;
				
			case TONEXT : 
				//Hand the lexeme to the adjacent branch of the current branch
				if(this.nextBranch == null){
					//No adjacent branch yet, so create a non-overlapping one
					this.nextBranch = new TokenBranch(null);
				}
				this.nextBranch.accept(_lexeme);
				break;
			}

			return true;
		}
		
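		/**
		 * Converts the branch data into Query logic.
		 * @return the queries for this branch, its children and its next branch
		 */
		List<Query> toQueries(String fieldName,boolean exactMatch){			
			List<Query> queries = new ArrayList<Query>(1);			
			//Build the query for the current branch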
			if(lexeme != null){
				queries.add(new TermQuery(new Term(fieldName , lexeme.getLexemeText())));
			}			
			//Build the queries for the child branches
			if(acceptedBranchs != null && acceptedBranchs.size() > 0){
				if(acceptedBranchs.size() == 1){
					Query onlyOneQuery = optimizeQueries(acceptedBranchs.get(0).toQueries(fieldName,exactMatch),exactMatch);
					if(onlyOneQuery != null){
						queries.add(onlyOneQuery);
					}					
				}else{
					BooleanQuery orQuery = new BooleanQuery();
					for(TokenBranch childBranch : acceptedBranchs){
						Query childQuery = optimizeQueries(childBranch.toQueries(fieldName,exactMatch),exactMatch);
						if(childQuery != null){
							orQuery.add(childQuery, Occur.SHOULD);
						}
					}
					if(orQuery.getClauses().length > 0){
						queries.add(orQuery);
					}
				}
			}			
			//Build the queries for nextBranch
			if(nextBranch != null){				
				queries.addAll(nextBranch.toQueries(fieldName,exactMatch));
			}
			return queries;	
		}
		
		/**
		 * Decides whether the given lexeme can be accepted by the current branch.
		 * @param _lexeme
		 * @return how the lexeme is accepted (REFUSED, ACCEPTED or TONEXT)
		 */
		private int checkAccept(Lexeme _lexeme){
			int acceptType = 0;
			
			if(_lexeme == null){
				throw new IllegalArgumentException("parameter:lexeme is null");
			}
			
			if(null == this.lexeme){//The current branch is a non-overlapping (ROOT) branch
				if(this.rightBorder > 0  //The current branch already holds at least one lexeme
						&& _lexeme.getBeginPosition() >= this.rightBorder){
					//_lexeme does not overlap the current branch
					acceptType = TONEXT;
				}else{
					acceptType = ACCEPTED;
				}				
			}else{//The current branch is an overlapping branch
				
				if(_lexeme.getBeginPosition() < this.lexeme.getBeginPosition()){
					//_lexeme starts before this.lexeme (this should not happen)
					acceptType = REFUSED;
				}else if(_lexeme.getBeginPosition() >= this.lexeme.getBeginPosition()
							&& _lexeme.getBeginPosition() < this.lexeme.getEndPosition()){
					//_lexeme overlaps this.lexeme
					acceptType = REFUSED;
				}else if(_lexeme.getBeginPosition() >= this.lexeme.getEndPosition()
							&& _lexeme.getBeginPosition() < this.rightBorder){
					//_lexeme does not overlap this.lexeme, but overlaps the current branch
					acceptType = ACCEPTED;
				}else{//_lexeme.getBeginPosition() >= this.rightBorder
					//_lexeme does not overlap the current branch
					acceptType=  TONEXT;
				}
			}
			return acceptType;
		}
	
	}
}

package com.fdauto.doc;

import org.apache.lucene.search.Similarity;

public class MySimilarity extends Similarity {

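	/**
	 * Coordination factor: when searching for several terms, documents that
	 * match more of them weigh more, e.g. for "调拨 功能" documents containing
	 * both terms score higher.
	 * This override returns a constant 1.0f, which neutralizes the factor.
	 */
	@Override
	public float coord(int overlap, int maxOverlap) {
		return 1.0f;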
	}

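	/**
	 * Inverse document frequency: a term such as "调拨" that appears in many
	 * documents should, in principle, weigh less the more documents it spans.
	 * This override returns a constant 1.0f, which neutralizes the factor.
	 */
	@Override
	public float idf(int docFreq, int numDocs) {
		return 1.0f;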
	}

	/**
	 * Length normalization: the shorter the field, the higher the weight.
	 * This override returns a constant 1.0f, which neutralizes the factor.
	 */
	@Override
	public float lengthNorm(String fieldName, int numTokens) {
		return 1.0f;
	}

	/**
	 * Normalization factor that depends only on the query.
	 */
	@Override
	public float queryNorm(float sumOfSquaredWeights) {
		return 1.0f;
	}

	/**
	 * Used for phrase matches, e.g. 调拨功能:
	 * "调拨功能" scores higher than "调拨我们的功能"; the closer the terms, the higher the score.
	 */
	@Override
	public float sloppyFreq(int distance) {
		return 1.0f / (distance + 1);
	}

	/**
	 * Term frequency: the more often a term occurs in a document, the higher the weight.
	 * This override returns a constant 1.0f, which neutralizes the factor.
	 */
	@Override
	public float tf(float freq) {
		return 1.0f;
	}

}


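/**
 * Lucene scoring formula. The numbers (1) to (6) label the factors;
 * (2) to (6) correspond to the Similarity methods overridden in MySimilarity above.
 *
 * score(q,d) = (6)coord(q,d) · (3)queryNorm(q) · ∑_{t in q} ( (4)tf(t in d) · (5)idf(t)^2 · t.getBoost() · (1)norm(t,d) )
 *
 * norm(t,d) = doc.getBoost() · (2)lengthNorm(field) · ∏_{field f in d named as t} f.getBoost()
 */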
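As a worked illustration of the formula, here is a minimal, Lucene-free sketch that plugs numbers into it. ScoreSketch and TermFactors are hypothetical names, and the factor values are supplied by the caller rather than computed from an index.

```java
import java.util.List;

/** Hypothetical calculator for the score formula quoted above. */
final class ScoreSketch {

    /** The per-term factors of one query term for one document. */
    static final class TermFactors {
        final float tf;
        final float idf;
        final float boost;
        final float norm;

        TermFactors(float tf, float idf, float boost, float norm) {
            this.tf = tf;
            this.idf = idf;
            this.boost = boost;
            this.norm = norm;
        }
    }

    /** score = coord * queryNorm * sum over terms of (tf * idf^2 * boost * norm). */
    static float score(float coord, float queryNorm, List<TermFactors> terms) {
        float sum = 0f;
        for (TermFactors t : terms) {
            sum += t.tf * t.idf * t.idf * t.boost * t.norm;
        }
        return coord * queryNorm * sum;
    }
}
```

With coord, queryNorm, tf, idf and the norm all held at 1.0f, the result reduces to the sum of the matched terms' boosts, so a document matching two unboosted terms scores 2.0 and one matching a single term scores 1.0.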

The Lucene wiki is very thorough and makes a good reference: http://wiki.apache.org/lucene-java/LuceneFAQ#Why_am_I_getting_an_IOException_that_says_.22Too_many_open_files.22.3F

How to speed up searching: http://wiki.apache.org/lucene-java/ImproveSearchingSpeed

How to speed up indexing: http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
