Reading Solr's facet source code (3): facet.field on single-valued numeric fields

suichangkele

Blog categories: solr, lucene

Tags: lucene, solr, facet, numeric types

(The Solr version used here is 4.10.)

Picking up from the previous article: when faceting on a single-valued numeric field, Solr takes the FCS method, which in turn calls NumericFacets.getCounts(searcher, base, field, offset, limit, mincount, missing, sort). So let's read that code:

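/**
 * Facets on a single-valued numeric field.
 * @param searcher	the index searcher
 * @param docs 		the base doc set (the ids of all docs selected by q and fq)
 * @param fieldName	name of the field to facet on
 * @param offset	offset into the final result list
 * @param limit		number of results to return; a negative value returns everything
 * @param mincount	minimum number of matching docs a term needs in order to be included; a term that matches fewer docs is left out. If this is 0 and the terms found in docs do not fill the result, zero-count terms (terms that occur in no doc of docs) are collected as well.
 * @param missing	whether to also return the number of docs in docs that have no value for the field
 * @param sort		sort order of the collected values
 * @return  		the facet values mapped to their doc counts
 * @throws IOException
 */
public static NamedList<Integer> getCounts(SolrIndexSearcher searcher, DocSet docs, String fieldName, int offset, int limit, int mincount, boolean missing, String sort) throws IOException {
	final boolean zeros = mincount <= 0;// whether to collect values that match no doc (including terms that only occur in docs outside of docs)
	mincount = Math.max(mincount, 1);   // this speeds things up: the zero-count values may never be needed, because the terms of the matched docs alone may already fill the result
	final SchemaField sf = searcher.getSchema().getField(fieldName);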
	final FieldType ft = sf.getType();
	final NumericType numericType = ft.getNumericType();
	if (numericType == null) {// only numeric types can be faceted here
		throw new IllegalStateException();
	}
	final List<AtomicReaderContext> leaves = searcher.getIndexReader().leaves();
	
	// 1. First collect the values of the docs that were already matched into hashTable, a dedicated class that stores each term together with the number of docs it matches.
	final HashTable hashTable = new HashTable();
	final Iterator<AtomicReaderContext> ctxIt = leaves.iterator();
	AtomicReaderContext ctx = null;
	FieldCache.Longs longs = null;
	Bits docsWithField = null;
	int missingCount = 0;// number of docs in docs that have no value for the field
	for (DocIterator docsIt = docs.iterator(); docsIt.hasNext();) {
		final int doc = docsIt.nextDoc();
		if (ctx == null || doc >= ctx.docBase + ctx.reader().maxDoc()) {// advance to the segment that contains this doc
			do {
				ctx = ctxIt.next();
			} while (ctx == null || doc >= ctx.docBase + ctx.reader().maxDoc());
			// Fetch the values from the FieldCache. As the earlier post showed, the FieldCache itself prefers docValues when they exist, so this collection path effectively prefers docValues too.
			switch (numericType) {
			case LONG:
				longs = FieldCache.DEFAULT.getLongs(ctx.reader(), fieldName, true);
				break;
			case INT:
				final FieldCache.Ints ints = FieldCache.DEFAULT.getInts(ctx.reader(), fieldName, true);
				longs = new FieldCache.Longs() {
					@Override
					public long get(int docID) {
						return ints.get(docID);
					}
				};
				break;
			case FLOAT:
				final FieldCache.Floats floats = FieldCache.DEFAULT.getFloats(ctx.reader(), fieldName, true);
				longs = new FieldCache.Longs() {
					@Override
					public long get(int docID) {
						return NumericUtils.floatToSortableInt(floats.get(docID));
					}
				};
				break;
			case DOUBLE:
				final FieldCache.Doubles doubles = FieldCache.DEFAULT.getDoubles(ctx.reader(), fieldName, true);
				longs = new FieldCache.Longs() {
					@Override
					public long get(int docID) {
						return NumericUtils.doubleToSortableLong(doubles.get(docID));
					}
				};
				break;
			default:
				throw new AssertionError();
			}
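			docsWithField = FieldCache.DEFAULT.getDocsWithField(ctx.reader(), fieldName);// bits of the docs that have a value for this field
		}
		long v = longs.get(doc - ctx.docBase);// the value of this doc, looked up by its segment-local id
		if (v != 0 || docsWithField.get(doc - ctx.docBase)) {// v != 0 means the doc certainly has a value; v == 0 can be a real 0 or a missing value, so docsWithField is checked as well
			hashTable.add(doc, v, 1);// the doc has a value: add it to the hash table
		} else {
			++missingCount;// no value, i.e. null; this count is what gets returned when missing is requested
		}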
		}
	}
	// 2. Select offset + limit entries from the hash table according to the sort order.
	final int pqSize = limit < 0 ? hashTable.size : Math.min(offset + limit, hashTable.size);// if limit < 0 every term is returned; otherwise at most offset + limit
	final PriorityQueue<Entry> pq;// a priority queue ordered by the requested sort
	if (FacetParams.FACET_SORT_COUNT.equals(sort) || FacetParams.FACET_SORT_COUNT_LEGACY.equals(sort)) {// sort by the number of docs each term matches
		pq = new PriorityQueue<Entry>(pqSize) {
			@Override
			protected boolean lessThan(Entry a, Entry b) {
				if (a.count < b.count || (a.count == b.count && a.bits > b.bits)) {// order by count first; on equal counts, order by the numeric value
					return true;
				} else {
					return false;
				}
			}
		};
	} else {
		pq = new PriorityQueue<Entry>(pqSize) {
			@Override
			protected boolean lessThan(Entry a, Entry b) {// order by the numeric value of the facet term
				return a.bits > b.bits;
			}
		};
	}
	Entry e = null;
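	for (int i = 0; i < hashTable.bits.length; ++i) {// walk the collected terms; each of them matches at least one doc, because they were gathered from the docs that were already matched
		if (hashTable.counts[i] >= mincount) {// keep the entries that reach mincount. A collected count is at least 1, which is why mincount was clamped to a minimum of 1 above: if these terms already fill the result, the terms dictionary is never consulted. A user-supplied mincount greater than 1 is used as is.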
			if (e == null) {
				e = new Entry();
			}
			e.bits = hashTable.bits[i];
			e.count = hashTable.counts[i];
			e.docID = hashTable.docIDs[i];
			e = pq.insertWithOverflow(e);
		}
	}
	
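	// 4. build the NamedList (the final result)
	final ValueSource vs = ft.getValueSource(sf, null);// the ValueSource looks up the actual values: so far we only hold longs, but the result has to contain strings
	final NamedList<Integer> result = new NamedList<>();
	
	// The collected terms may not be enough in two situations: the sort is by the term value itself, or mincount=0, which asks for every term. In those cases the terms dictionary has to be read, and that is where it gets complicated.
	// This stuff is complicated because if facet.mincount=0, the counts needs to be merged with terms from the terms dict (in other words: with mincount=0 the terms dictionary must be read to get all terms, because so far only the terms of the matched docs have been collected)
	// This branch covers the cases where zero-count terms are not wanted or the sort is by count; the terms dictionary is not needed up front here.
	if (!zeros || FacetParams.FACET_SORT_COUNT.equals(sort) || FacetParams.FACET_SORT_COUNT_LEGACY.equals(sort)) {
		final Deque<Entry> counts = new ArrayDeque<>();// holds the entries that come after the offset
		while (pq.size() > offset) {// pop everything except the top offset entries into counts; the queue pops its lowest-ranked entry first, so addFirst restores the final order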
			counts.addFirst(pq.pop());
		}
		// Entries from the PQ first, then using the terms dictionary
		for (Entry entry : counts) {
			final int readerIdx = ReaderUtil.subIndex(entry.docID, leaves);
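			final FunctionValues values = vs.getValues(Collections.emptyMap(), leaves.get(readerIdx));// the ValueSource reads the actual value, via the FieldCache
			result.add(values.strVal(entry.docID - leaves.get(readerIdx).docBase), entry.count);// add it to the result
		}

		// If zero-count terms are wanted and the doc set alone did not yield enough entries, the terms dictionary has to be consulted, i.e. every term is checked
		if (zeros && (limit < 0 || result.size() < limit)) { // need to merge with the term dict
			if (!sf.indexed()) {// the field must be indexed, otherwise there is no terms dictionary to read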
				throw new IllegalStateException("Cannot use " + FacetParams.FACET_MINCOUNT + "=0 on field " + sf.getName() + " which is not indexed");
			}
			// Add zeros until there are limit results
			final Set<String> alreadySeen = new HashSet<>();
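			// Put every value already found through the doc set into a set, to prevent duplicates
			while (pq.size() > 0) {// step one: the entries skipped by offset, which are still in the queue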
				Entry entry = pq.pop();
				final int readerIdx = ReaderUtil.subIndex(entry.docID, leaves);
				final FunctionValues values = vs.getValues(Collections.emptyMap(), leaves.get(readerIdx));
				alreadySeen.add(values.strVal(entry.docID - leaves.get(readerIdx).docBase));
			}
			// step two: the entries that were already added to result
			for (int i = 0; i < result.size(); ++i) {
				alreadySeen.add(result.getName(i));
			}
			
			// get all terms of this field
			final Terms terms = searcher.getAtomicReader().terms(fieldName);
			if (terms != null) {
				
				final String prefixStr = TrieField.getMainValuePrefix(ft);// the prefix of this field's main-value terms
				final BytesRef prefix;
				if (prefixStr != null) {
					prefix = new BytesRef(prefixStr);
				} else {
					prefix = new BytesRef();
				}
				
				final TermsEnum termsEnum = terms.iterator(null);
				BytesRef term;
				switch (termsEnum.seekCeil(prefix)) {
				case FOUND:
				case NOT_FOUND:
					term = termsEnum.term();
					break;
				case END:
					term = null;
					break;
				default:
					throw new AssertionError();
				}
				
				final CharsRef spare = new CharsRef();

				
				// skip a further offset - hashTable.size unseen terms, because they fall before the offset and are not wanted
				for (int skipped = hashTable.size; skipped < offset && term != null	&& StringHelper.startsWith(term, prefix);) {
					ft.indexedToReadable(term, spare);
					final String termStr = spare.toString();
					if (!alreadySeen.contains(termStr)) {
						++skipped;
					}
					term = termsEnum.next();
				}
				
				
				
				// read up to limit - result.size() more terms
				for (; term != null && StringHelper.startsWith(term, prefix) && (limit < 0 || result.size() < limit); term = termsEnum.next()) {
					ft.indexedToReadable(term, spare);
					final String termStr = spare.toString();
					if (!alreadySeen.contains(termStr)) {// this term has not been seen yet
						result.add(termStr, 0);// add it to the result with a count of 0
					}
				}	
			}
		}
	} else {// zero-count terms are wanted and the sort is by term value: read the terms dictionary
		// sort=index, mincount=0 and we have less than limit items => Merge the PQ and the terms dictionary on the fly
		if (!sf.indexed()) {
			throw new IllegalStateException("Cannot use " + FacetParams.FACET_SORT + "=" + FacetParams.FACET_SORT_INDEX + " on a field which is not indexed");
		}
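		// the key is the readable form of the facet number, the value is its count
		final Map<String, Integer> counts = new HashMap<>();
		while (pq.size() > 0) {// drain the priority queue into counts; the key is the readable value, the value is the count found in the doc set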
			final Entry entry = pq.pop();
			final int readerIdx = ReaderUtil.subIndex(entry.docID, leaves);
			final FunctionValues values = vs.getValues(Collections.emptyMap(), leaves.get(readerIdx));
			counts.put(values.strVal(entry.docID - leaves.get(readerIdx).docBase), entry.count);
		}
		final Terms terms = searcher.getAtomicReader().terms(fieldName);
		if (terms != null) {
			final String prefixStr = TrieField.getMainValuePrefix(ft);
			final BytesRef prefix;
			if (prefixStr != null) {
				prefix = new BytesRef(prefixStr);
			} else {
				prefix = new BytesRef();
			}
			final TermsEnum termsEnum = terms.iterator(null);
			BytesRef term;
			switch (termsEnum.seekCeil(prefix)) {
			case FOUND:
			case NOT_FOUND:
				term = termsEnum.term();
				break;
			case END:
				term = null;
				break;
			default:
				throw new AssertionError();
			}
			final CharsRef spare = new CharsRef();
			for (int i = 0; i < offset && term != null && StringHelper.startsWith(term, prefix); ++i) {// skip the first offset terms
				term = termsEnum.next();
			}
			for (; term != null && StringHelper.startsWith(term, prefix)
					&& (limit < 0 || result.size() < limit); term = termsEnum.next()) {
				ft.indexedToReadable(term, spare);
				final String termStr = spare.toString();
				Integer count = counts.get(termStr);
				if (count == null) {
					count = 0;
				}
				result.add(termStr, count);
			}
		}
	}

	if (missing) {// add the number of docs whose value is null
		result.add(null, missingCount);
	}
	return result;
}
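
In the FLOAT and DOUBLE cases above, the values are turned into integers with NumericUtils.floatToSortableInt and NumericUtils.doubleToSortableLong before they reach the hash table, so that a single long comparison (a.bits > b.bits) orders every numeric type correctly. The following is a minimal standalone sketch of that idea for floats. The class SortableBitsSketch and its two methods are hypothetical names written for this illustration; the bit trick is my understanding of what the Lucene helper does, not a copy of its source.

```java
import java.util.Arrays;

public class SortableBitsSketch {

    // Maps a float to an int whose signed integer order matches the float order.
    public static int floatToSortableInt(float value) {
        int bits = Float.floatToIntBits(value);
        // Negative floats have the sign bit set, and a larger magnitude means larger raw bits.
        // Flipping the lower 31 bits reverses their order while keeping them negative.
        return bits < 0 ? bits ^ 0x7fffffff : bits;
    }

    // Inverse mapping: applying the same flip again restores the original bits.
    public static float sortableIntToFloat(int sortable) {
        int bits = sortable < 0 ? sortable ^ 0x7fffffff : sortable;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        float[] values = {3.5f, -2.0f, 0.0f, -7.25f, 1.0f};
        long[] keys = new long[values.length];
        for (int i = 0; i < values.length; i++) {
            keys[i] = floatToSortableInt(values[i]); // widened to long, as in getCounts
        }
        Arrays.sort(keys); // plain integer sort
        for (long key : keys) {
            System.out.println(sortableIntToFloat((int) key));
        }
    }
}
```

Running main prints the values in ascending float order even though only integers were sorted. That property is what lets the priority queue in step 2 compare Entry.bits directly without knowing the field's numeric type.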

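The takeaway from the code above: when faceting on a single-valued numeric field, it is best to set mincount > 0 and sort by count. In that case only the values of the matched docs are aggregated, which is a comparatively small set, and nothing else happens. Otherwise the terms dictionary may be read, which is inefficient: with mincount=0 it is always read when sorting by index, and under count sort it is read whenever the matched docs yield fewer than limit entries (or limit is negative). Also note that everything above runs in a single thread; the multithreading mentioned earlier only comes into play when there are several facet.field parameters.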
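
To make the cheap path concrete (mincount > 0, sort by count), here is a simplified sketch of steps 1 and 2 in plain Java. It is an illustration under assumptions, not Solr's implementation: java.util.HashMap and java.util.PriorityQueue stand in for Solr's own HashTable and Lucene's PriorityQueue, the input is just an array holding one long per matched doc, and the names TopFacetsSketch, Facet and topFacets are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class TopFacetsSketch {

    // One facet value together with the number of matched docs that carry it.
    public static final class Facet {
        final long value;
        final int count;
        Facet(long value, int count) { this.value = value; this.count = count; }
        @Override public String toString() { return value + "=" + count; }
    }

    public static List<Facet> topFacets(long[] docValues, int offset, int limit, int mincount) {
        // Step 1: count each value over the matched docs only.
        Map<Long, Integer> counts = new HashMap<>();
        for (long v : docValues) {
            counts.merge(v, 1, Integer::sum);
        }
        // Step 2: keep the best offset + limit entries in a bounded heap.
        // The head of the heap is the worst entry: lowest count, and on ties the larger value.
        Comparator<Facet> worstFirst = (a, b) -> {
            int byCount = Integer.compare(a.count, b.count);
            return byCount != 0 ? byCount : Long.compare(b.value, a.value);
        };
        int pqSize = limit < 0 ? counts.size() : Math.min(offset + limit, counts.size());
        PriorityQueue<Facet> pq = new PriorityQueue<>(Math.max(1, pqSize), worstFirst);
        int effectiveMincount = Math.max(mincount, 1);
        for (Map.Entry<Long, Integer> e : counts.entrySet()) {
            if (e.getValue() < effectiveMincount) {
                continue; // below mincount: never enters the result
            }
            pq.add(new Facet(e.getKey(), e.getValue()));
            if (pq.size() > pqSize) {
                pq.poll(); // overflow: drop the current worst entry
            }
        }
        // Pop everything except the top `offset` entries; addFirst puts the best entry first.
        Deque<Facet> page = new ArrayDeque<>();
        while (pq.size() > offset) {
            page.addFirst(pq.poll());
        }
        return new ArrayList<>(page);
    }

    public static void main(String[] args) {
        long[] docValues = {5, 3, 5, 7, 3, 5, 9};
        System.out.println(topFacets(docValues, 0, 2, 1));
        System.out.println(topFacets(docValues, 1, 2, 1));
    }
}
```

Nothing in this sketch touches a terms dictionary: the work is proportional to the number of matched docs plus the number of distinct values among them. That is the reason mincount > 0 is the cheap setting.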

2018-02-18 20:44