Reading the docValue implementation source in Lucene (8): writing SortedNumericDocValue

suichangkele

Category: lucene

Tags: lucene, docValue, sortedNumericDocValue, storage format

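At first glance, SortedNumericDocValue left me with several questions. Are the stored values sorted across docs? Can a doc's rank be obtained from its docid? Is the field single-valued, like SortedBinaryDocValue? After reading the code, the answer to all three is no. SortedNumericDocValue does not sort across docs, so there is no way to get a doc's rank from it. It is also multi-valued: one doc can hold several numbers. The "sorted" in the name means that the numbers of each individual doc are stored in sorted order; there is no ordering between docs.

As before, start with how values are buffered in memory: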

/** Buffers up pending long[] per doc, sorts, then flushes when segment flushes. */
class SortedNumericDocValuesWriter extends DocValuesWriter {
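	/** All values of all docs, appended doc by doc (each doc's values are sorted first). */
	private PackedLongValues.Builder pending; // stream of all values
	/** Number of values for each doc; 0 if the doc has none. */
	private PackedLongValues.Builder pendingCounts; // count of values per doc
	private final Counter iwBytesUsed; // declaration missing from the original excerpt; type taken from the constructor
	private long bytesUsed; // declaration missing from the original excerpt
	private final FieldInfo fieldInfo;
	/** ID of the doc currently being buffered. */
	private int currentDoc;
	/** Buffer holding the values of the current doc. */
	private long currentValues[] = new long[8];
	/** Number of values buffered so far for the current doc, i.e. the next free slot in currentValues. */
	private int currentUpto = 0;

	// Constructor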
	public SortedNumericDocValuesWriter(FieldInfo fieldInfo, Counter iwBytesUsed) {
		this.fieldInfo = fieldInfo;
		this.iwBytesUsed = iwBytesUsed;
		pending = PackedLongValues.deltaPackedBuilder(PackedInts.COMPACT);
		pendingCounts = PackedLongValues.deltaPackedBuilder(PackedInts.COMPACT);
		bytesUsed = pending.ramBytesUsed() + pendingCounts.ramBytesUsed();
		iwBytesUsed.addAndGet(bytesUsed);
	}
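	// Add one doc value
	public void addValue(int docID, long value) {
		if (docID != currentDoc) { // a different docID means the previous doc is complete (a doc can have several values)
			finishCurrentDoc(); // finalize the previous doc
		}
		// Fill in any holes: docs without a value get a 0 in pendingCounts, meaning they have zero values
		while (currentDoc < docID) {
			pendingCounts.add(0); // no values
			currentDoc++;
		}
		addOneValue(value); // buffer one value for the current doc
		updateBytesUsed(); // memory accounting; the method itself is not shown in this excerpt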
	}
	// finalize currentDoc: this sorts the values in the current doc
	private void finishCurrentDoc() {
		Arrays.sort(currentValues, 0, currentUpto); // sort the current doc's values in ascending order
		for (int i = 0; i < currentUpto; i++) { // append them to pending
			pending.add(currentValues[i]);
		}
		// record the number of values for this doc
		pendingCounts.add(currentUpto);
		currentUpto = 0;
		currentDoc++;
	}

	// Called once all docs have been added
	@Override
	public void finish(int maxDoc) {
		finishCurrentDoc();
		// fill in any holes
		for (int i = currentDoc; i < maxDoc; i++) {
			pendingCounts.add(0); // no values
		}
	}

	/** Appends one long to the current doc's buffer, growing the array if it is full. */
	private void addOneValue(long value) {
		if (currentUpto == currentValues.length) {
			currentValues = ArrayUtil.grow(currentValues, currentValues.length + 1);
		}

		currentValues[currentUpto] = value;
		currentUpto++;
	}
}

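The code above makes it clear that SortedNumericDocValue supports several numbers per doc. Two things are recorded for each doc: its values (kept in pending, sorted before they are appended) and its value count (kept in pendingCounts).

Now the flush step: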
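To make the buffering concrete, here is a minimal sketch that mirrors the logic above with plain Java collections. It is not the Lucene class: `SortedNumericBufferSketch` is a hypothetical name, `ArrayList` stands in for `PackedLongValues.Builder`, the buffer is grown with `Arrays.copyOf` instead of `ArrayUtil.grow`, and memory accounting is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for the writer's buffering logic; plain lists replace PackedLongValues. */
public class SortedNumericBufferSketch {
    public final List<Long> pending = new ArrayList<>();          // all values, doc by doc
    public final List<Integer> pendingCounts = new ArrayList<>(); // number of values per doc
    private int currentDoc = 0;
    private long[] currentValues = new long[8];
    private int currentUpto = 0;

    public void addValue(int docID, long value) {
        if (docID != currentDoc) {
            finishCurrentDoc(); // the previous doc is complete
        }
        while (currentDoc < docID) { // docs without values get a count of 0
            pendingCounts.add(0);
            currentDoc++;
        }
        if (currentUpto == currentValues.length) {
            currentValues = Arrays.copyOf(currentValues, currentValues.length * 2);
        }
        currentValues[currentUpto++] = value;
    }

    private void finishCurrentDoc() {
        Arrays.sort(currentValues, 0, currentUpto); // sort only this doc's values
        for (int i = 0; i < currentUpto; i++) {
            pending.add(currentValues[i]);
        }
        pendingCounts.add(currentUpto);
        currentUpto = 0;
        currentDoc++;
    }

    public void finish(int maxDoc) {
        finishCurrentDoc();
        for (int i = currentDoc; i < maxDoc; i++) { // trailing docs without values
            pendingCounts.add(0);
        }
    }

    public static void main(String[] args) {
        SortedNumericBufferSketch w = new SortedNumericBufferSketch();
        w.addValue(0, 7);
        w.addValue(0, 3);
        w.addValue(0, 5);
        w.addValue(2, 9); // doc 1 has no value
        w.addValue(2, 1);
        w.finish(4);      // doc 3 has no value either
        System.out.println(w.pending);       // [3, 5, 7, 1, 9]
        System.out.println(w.pendingCounts); // [3, 0, 2, 0]
    }
}
```

Doc 0's values come out as 3, 5, 7 and doc 2's as 1, 9: each doc is sorted on its own, while the stream as a whole is not.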

@Override
public void flush(SegmentWriteState state, DocValuesConsumer dvConsumer) throws IOException {
	final int maxDoc = state.segmentInfo.getDocCount();
	assert pendingCounts.size() == maxDoc;
	final PackedLongValues values = pending.build(); // all values that were added
	final PackedLongValues valueCounts = pendingCounts.build(); // number of longs per doc

	dvConsumer.addSortedNumericField(fieldInfo,
			// doc -> valueCount
			new Iterable<Number>() {
				@Override
				public Iterator<Number> iterator() {
					return new CountIterator(valueCounts); // number of values of each doc
				}
			},
			// values
			new Iterable<Number>() {
				@Override
				public Iterator<Number> iterator() { // all values
					return new ValuesIterator(values);
				}
			});
}
private static class ValuesIterator implements Iterator<Number> {
	final PackedLongValues.Iterator iter;
	ValuesIterator(PackedLongValues values) {
		iter = values.iterator();
	}
	@Override
	public boolean hasNext() {
		return iter.hasNext();
	}
	@Override
	public Number next() {
		if (!hasNext()) {
			throw new NoSuchElementException();
		}
		return iter.next();
	}
	@Override
	public void remove() {
		throw new UnsupportedOperationException();
	}
}
private static class CountIterator implements Iterator<Number> {
	final PackedLongValues.Iterator iter;
	CountIterator(PackedLongValues valueCounts) {
		this.iter = valueCounts.iterator();
	}
	@Override
	public boolean hasNext() {
		return iter.hasNext();
	}
	@Override
	public Number next() {
		if (!hasNext()) {
			throw new NoSuchElementException();
		}
		return iter.next();
	}
	@Override
	public void remove() {
		throw new UnsupportedOperationException();
	}
}

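flush is simple as well: it hands the consumer two iterators, one yielding the number of long values of each doc and one yielding all the values. Next, look at the method that actually writes to the Directory, Lucene410DocValuesConsumer.addSortedNumericField(FieldInfo, Iterable<Number>, Iterable<Number>):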

public void addSortedNumericField(FieldInfo field, final Iterable<Number> docToValueCount, final Iterable<Number> values) throws IOException {
	meta.writeVInt(field.number);//域号
	meta.writeByte(Lucene410DocValuesFormat.SORTED_NUMERIC);
	if (isSingleValued(docToValueCount)) { // no doc has more than one value, so there is nothing to sort within a doc; reuse the plain numeric format
		meta.writeVInt(SORTED_SINGLE_VALUED);
		// The field is single-valued, we can encode it as NUMERIC
		addNumericField(field, singletonView(docToValueCount, values, null)); // same format as NumericDocValues, which picks one of three encodings
	} else { // the normal case: some doc has several numbers
		meta.writeVInt(SORTED_WITH_ADDRESSES);
		// write the stream of values as a numeric field, i.e. write all values to the Directory first
		addNumericField(field, values, true);
		// write the doc -> ord count as a absolute index to the stream
		addAddresses(field, docToValueCount); // then write the index used to locate each doc's own longs
	}
}
private void addAddresses(FieldInfo field, Iterable<Number> values) throws IOException {
	meta.writeVInt(field.number);
	meta.writeByte(Lucene410DocValuesFormat.NUMERIC);
	meta.writeVInt(MONOTONIC_COMPRESSED);
	meta.writeLong(-1L);
	meta.writeLong(data.getFilePointer());
	meta.writeVLong(maxDoc);
	meta.writeVInt(PackedInts.VERSION_CURRENT);
	meta.writeVInt(BLOCK_SIZE);
	final MonotonicBlockPackedWriter writer = new MonotonicBlockPackedWriter(data, BLOCK_SIZE);
	long addr = 0;
	writer.add(addr); // the first doc starts at offset 0
	for (Number v : values) {
		addr += v.longValue();
		writer.add(addr); // running total of longs written so far: the start offset of the next doc in the numeric block, and the end offset of this one
	}
	writer.finish();
	meta.writeLong(data.getFilePointer());
}

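After reading flush, the storage layout of SortedNumericDocValue is clear. It has two parts. The first part holds all the numbers: the values of each doc are stored together, in ascending order. The second part holds, for each doc, the position of its first value within that stream of numbers. For example, if the first doc has three numbers, the entry stored for the second doc is 3, because the second doc's own longs begin right after those first three. The number of values a doc has is the next doc's entry minus its own entry, so reading that many longs from the start position yields all of the doc's values. Since the writer emits a leading 0 and then one running total per doc, maxDoc + 1 addresses are written in total, which gives the last doc an end offset too. This two-part layout is used only when some doc has more than one value; otherwise the field is written as a plain numeric field.

It also confirms that nothing is sorted across docs; only the numbers within a single doc are sorted.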
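The address arithmetic can be sketched as follows. This is only an illustration under simplifying assumptions: `SortedNumericAddressSketch`, `buildAddresses` and `valuesOf` are hypothetical names, and plain arrays stand in for the packed structures that Lucene actually writes to disk.

```java
import java.util.Arrays;

/** Hypothetical sketch: plain arrays stand in for the packed on-disk structures. */
public class SortedNumericAddressSketch {

    /** Turns per-doc counts into maxDoc + 1 offsets: a running total that starts at 0. */
    public static long[] buildAddresses(int[] counts) {
        long[] addresses = new long[counts.length + 1];
        for (int doc = 0; doc < counts.length; doc++) {
            addresses[doc + 1] = addresses[doc] + counts[doc];
        }
        return addresses;
    }

    /** Returns the values of one doc by slicing the value stream between two addresses. */
    public static long[] valuesOf(int doc, long[] values, long[] addresses) {
        int start = (int) addresses[doc];
        int end = (int) addresses[doc + 1]; // end - start is the doc's value count
        return Arrays.copyOfRange(values, start, end);
    }

    public static void main(String[] args) {
        long[] values = {3, 5, 7, 1, 9}; // doc 0: 3,5,7   doc 2: 1,9
        int[] counts = {3, 0, 2, 0};     // docs 1 and 3 have no value
        long[] addresses = buildAddresses(counts);
        System.out.println(Arrays.toString(addresses));                      // [0, 3, 3, 5, 5]
        System.out.println(Arrays.toString(valuesOf(0, values, addresses))); // [3, 5, 7]
        System.out.println(Arrays.toString(valuesOf(1, values, addresses))); // []
        System.out.println(Arrays.toString(valuesOf(2, values, addresses))); // [1, 9]
    }
}
```

A doc without values simply has two equal neighbouring addresses, so no separate marker for "missing" is needed in this scheme.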


2018-02-14 12:03