mozhenghua

Optimizing Query Performance with Solr PostFilter

Blog category:

solr

solr lucene performance

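Background

In real business scenarios, a search sometimes needs two-stage filtering: the final result is obtained by searching further within an earlier result set (search-within-search).

Suppose the final result set is the intersection of the hits for two conditions (A AND B). If the document set matching A is very small (roughly no more than 300 documents) while the document set matching B is very large, two-stage filtering in Solr is a very good fit.

Implementation details

In the first stage, A is evaluated to obtain its hits; the second stage then filters on top of the first-stage result.

There is not much to say about the first stage: setting the ordinary q parameter in Solr is enough. The key is the second-stage filtering.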
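
To make the cost argument concrete, here is a minimal sketch in plain Java. It uses no Solr classes; `TwoPhaseSketch`, `postFilter`, and the sample doc ids are made up for illustration. It assumes stage one has already produced the small candidate list for A, and shows that stage two only evaluates condition B once per candidate rather than intersecting with B's large document set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.IntPredicate;

public class TwoPhaseSketch {
    /** Stage 2: test only the candidates that stage 1 produced. */
    static List<Integer> postFilter(List<Integer> candidates, IntPredicate conditionB) {
        List<Integer> kept = new ArrayList<>();
        for (int doc : candidates) {
            if (conditionB.test(doc)) {
                kept.add(doc);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Stage 1: hypothetical doc ids matched by the selective condition A.
        List<Integer> matchedByA = Arrays.asList(3, 17, 42, 99);
        // Stand-in for condition B, checked per document ("doc id is odd").
        IntPredicate conditionB = doc -> doc % 2 == 1;
        System.out.println(postFilter(matchedByA, conditionB)); // [3, 17, 99]
    }
}
```

The work in stage two grows with the number of candidates from A, not with the size of B, which is why the approach pays off when A is very selective.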

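Narrowing an existing hit set by searching within it is what Filter did in early Lucene versions. For reasons I am not sure of, newer Lucene versions removed Filter, and this kind of filtering is now done through the collector chain. (Perhaps Filter and Collector were considered to overlap in functionality.)

First, use org.apache.solr.search.PostFilter. Its interface documentation reads: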

/** The PostFilter interface provides a mechanism to further filter documents
 * after they have already gone through the main query and other filters.
 * This is appropriate for filters with a very high cost.
 * <p>
 * The filtering mechanism used is a {@link DelegatingCollector}
 * that allows the filter to not call the delegate for certain documents,
 * thus effectively filtering them out.  This also avoids the normal
 * filter advancing mechanism which asks for the first acceptable document on
 * or after the target (which is undesirable for expensive filters).
 * This collector interface also enables better performance when an external system
 * must be consulted, since document ids may be buffered and batched into
 * a single request to the external system.
 * <p>
 * Implementations of this interface must also be a Query.
 * If an implementation can only support the collector method of
 * filtering through getFilterCollector, then ExtendedQuery.getCached()
 * should always return false, and ExtendedQuery.getCost() should
 * return no less than 100.
 */

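One important point: the subclass must set cache to false and return a cost of no less than 100. The corresponding code is a short fragment of the getProcessedFilter() method in SolrIndexSearcher: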

if (q instanceof ExtendedQuery) {
        ExtendedQuery eq = (ExtendedQuery)q;
        if (!eq.getCache()) {
          if (eq.getCost() >= 100 && eq instanceof PostFilter) {
            if (postFilters == null) postFilters = new ArrayList<>(sets.length-end);
            postFilters.add(q);
          } else {
            if (notCached == null) notCached = new ArrayList<>(sets.length-end);
            notCached.add(q);
          }
          continue;
        }
}

When a Query object has eq.getCache() returning false, has a cost >= 100, and is a PostFilter, it is added to the postFilters list for later use.
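
The branching in that fragment can be restated as a small decision function. This is only a sketch of the quoted snippet, not Solr code: `FilterRoutingSketch`, `Route`, and `route` are hypothetical names, and `DEFAULT_PATH` stands for whatever the rest of getProcessedFilter() does with filters that do not hit the `continue` (that part is not shown in the snippet).

```java
public class FilterRoutingSketch {
    enum Route { DEFAULT_PATH, NOT_CACHED, POST_FILTER }

    /**
     * Mirrors the quoted branch: only filters with cache == false are diverted,
     * and of those only PostFilter implementations with cost >= 100 become post filters.
     */
    static Route route(boolean cache, int cost, boolean isPostFilter) {
        if (cache) {
            return Route.DEFAULT_PATH; // falls through to the code after the snippet
        }
        if (cost >= 100 && isPostFilter) {
            return Route.POST_FILTER;
        }
        return Route.NOT_CACHED;
    }

    public static void main(String[] args) {
        System.out.println(route(false, 100, true));  // POST_FILTER
        System.out.println(route(false, 99, true));   // NOT_CACHED
        System.out.println(route(false, 100, false)); // NOT_CACHED
        System.out.println(route(true, 200, true));   // DEFAULT_PATH
    }
}
```

Note the last case: a PostFilter implementation that leaves cache enabled is not treated as a post filter, whatever its cost.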

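In addition, newer Lucene versions introduced docValues, which makes it practical in the second stage to look up a field's value by docid. Before docValues existed, the only option was to cache the field's values in memory through the fieldCache; using docValues greatly reduces the memory overhead.

Building the PostFilterQuery: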

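import java.io.IOException;
import java.util.Set;

import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.SortedDocValues;
import org.apache.lucene.search.IndexSearcher;
import org.apache.solr.search.DelegatingCollector;
import org.apache.solr.search.ExtendedQueryBase;
import org.apache.solr.search.PostFilter;

public class PostFilterQuery extends ExtendedQueryBase implements PostFilter {
	private final boolean exclude;   // true: drop matching docs; false: keep only matching docs
	private final Set<String> items; // field values to match against
	private final String field;      // name of the docValues field to read

	public PostFilterQuery(boolean exclude, Set<String> items, String field) {
		super();
		this.exclude = exclude;
		this.items = items;
		this.field = field;
	}

	// Identity-based equality: the query is never cached, so it is never compared by value.
	@Override
	public int hashCode() {
		return System.identityHashCode(this);
	}

	@Override
	public boolean equals(Object obj) {
		return this == obj;
	}

	// Caching is always off, so requests to change it are ignored.
	@Override
	public void setCache(boolean cache) {
	}

	@Override
	public boolean getCache() {
		return false;
	}

	// A post filter must report a cost of at least 100.
	@Override
	public int getCost() {
		return Math.max(super.getCost(), 100);
	}

	@Override
	public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
		return new DelegatingCollector() {
			private SortedDocValues docValue;

			@Override
			public void collect(int doc) throws IOException {
				// Random-access docValues API (Lucene 5.x/6.x); -1 means the doc has no value.
				int order = this.docValue.getOrd(doc);
				boolean matched = order != -1
						&& items.contains(this.docValue.lookupOrd(order).utf8ToString());
				// Include mode passes matches on; exclude mode passes everything else on.
				if (matched != exclude) {
					super.collect(doc);
				}
			}

			@Override
			protected void doSetNextReader(LeafReaderContext context) throws IOException {
				super.doSetNextReader(context);
				// docValues are per segment, so reload them for each new leaf reader.
				this.docValue = DocValues.getSorted(context.reader(), field);
			}
		};
	}

}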

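The three constructor parameters mean:

boolean exclude: whether to filter by exclusion or by inclusion
Set<String> items: the set of items to filter on
String field: which field of the document to filter on

For this Query class to take effect at query time, a QParserPlugin is needed: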
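
The interaction between exclude and items reduces to a single comparison. The sketch below is plain Java with hypothetical names (`ExcludeSemanticsSketch`, `keep`); it models a document's field value as a String, with null standing for a document that has no value in the field, and returns whether the document would be passed on to the delegate collector.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ExcludeSemanticsSketch {
    /**
     * @param value   the document's field value, or null if the document has none
     * @param items   the values supplied to the filter
     * @param exclude true to drop matching documents, false to keep only matching ones
     * @return true if the document survives the filter
     */
    static boolean keep(String value, Set<String> items, boolean exclude) {
        boolean matched = value != null && items.contains(value);
        return matched != exclude;
    }

    public static void main(String[] args) {
        Set<String> items = new HashSet<>(Arrays.asList("a", "b"));
        System.out.println(keep("a", items, false));  // true:  include mode, value listed
        System.out.println(keep("c", items, false));  // false: include mode, value not listed
        System.out.println(keep(null, items, false)); // false: include mode, no value
        System.out.println(keep("a", items, true));   // false: exclude mode, value listed
        System.out.println(keep("c", items, true));   // true:  exclude mode, value not listed
        System.out.println(keep(null, items, true));  // true:  exclude mode, no value
    }
}
```

A document without a value never matches, so it is dropped in include mode and kept in exclude mode.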

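import java.util.Set;

// Assumption: the original does not show its imports; StringUtils is taken to be commons-lang3.
import org.apache.commons.lang3.StringUtils;
import org.apache.lucene.search.Query;
import org.apache.solr.common.params.CommonParams;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;
import org.apache.solr.search.SyntaxError;

import com.google.common.collect.Sets;

public class PostFilterQParserPlugin extends QParserPlugin {

	@Override
	@SuppressWarnings("all")
	public void init(NamedList args) {
	}

	@Override
	public QParser createParser(String qstr, SolrParams localParams, SolrParams params,
			SolrQueryRequest req) {
		// CommonParams.FIELD is the local param "f".
		String field = (localParams == null) ? null : localParams.get(CommonParams.FIELD);
		if (field == null) {
			throw new IllegalArgumentException(
					"local param '" + CommonParams.FIELD + "' (field name) has not been defined");
		}
		// Default to include mode when "exclude" is omitted (avoids unboxing a null Boolean).
		boolean exclude = localParams.getBool("exclude", false);
		// The query string is a comma-separated list of field values.
		Set<String> items = Sets.newHashSet(StringUtils.split(qstr, ','));
		final PostFilterQuery q = new PostFilterQuery(exclude, items, field);
		return new QParser(qstr, localParams, params, req) {
			@Override
			public Query parse() throws SyntaxError {
				return q;
			}
		};
	}
}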

Register the plugin in solrconfig.xml:

<queryParser name="postfilter" class="com.dfire.tis.solrextend.queryparse.PostFilterQParserPlugin" />

Next, use it in a query from the Solr client. Here is an example:

SolrQuery query = new SolrQuery();

query.setQuery("customerregister_id:193d43b1734245f5d3bf35092dbb3a40");
query.addFilterQuery("{!postfilter f=menu_id exclude=true}000008424a4234f0014a5746c2cd1065,000008424a4234f0014a5746c2cd1065");
SimpleQueryResult<Object> result = client.query("search4totalpay",
		"00000241", query, Object.class);
System.out.println("getNumberFound:" + result.getNumberFound());
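
When the id list comes from application code, the filter string can be assembled with a small helper instead of being written by hand. The following is a sketch in plain Java; `PostFilterFqSketch` and `buildFq` are hypothetical names, and it assumes the parser is registered under the name postfilter and splits its input on commas, as in the plugin above, so individual values must not contain a comma.

```java
import java.util.Arrays;
import java.util.Collection;

public class PostFilterFqSketch {
    /** Builds a filter query string in the local-params form "{!postfilter f=... exclude=...}v1,v2". */
    static String buildFq(String field, boolean exclude, Collection<String> values) {
        for (String v : values) {
            if (v.indexOf(',') >= 0) {
                // The parser splits on ',', so a comma inside a value would corrupt the list.
                throw new IllegalArgumentException("value must not contain a comma: " + v);
            }
        }
        return "{!postfilter f=" + field + " exclude=" + exclude + "}" + String.join(",", values);
    }

    public static void main(String[] args) {
        String fq = buildFq("menu_id", true, Arrays.asList("id1", "id2"));
        System.out.println(fq); // {!postfilter f=menu_id exclude=true}id1,id2
    }
}
```

The returned string can be passed to query.addFilterQuery(...) in place of the literal used in the example.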

Summary

In the right scenarios, a post filter can greatly improve query efficiency. Give it a try!

2017-02-07 14:20