Reading the Lucene docValue source code (4): writing BinaryDocValue

suichangkele


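A BinaryDocValue stores a byte[], so it can hold strings, images, or anything else that can be represented as a byte[]. We will not go into its use cases here; the focus is on how Lucene stores it. As with the other docValue types, values are added in DefaultIndexingChain.indexDocValue and are first buffered in memory, so we start with the in-memory representation. The class responsible is BinaryDocValuesWriter, whose constructor looks like this: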

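public BinaryDocValuesWriter(FieldInfo fieldInfo, Counter iwBytesUsed) {
	this.fieldInfo = fieldInfo; // the field being added
	// Container for all the byte[] values. It keeps them in fixed-size pages (blocks) rather than
	// one contiguous array (it is paged, not compressed); think of it as one very large byte[].
	this.bytes = new PagedBytes(BLOCK_BITS);
	this.bytesOut = bytes.getDataOutput(); // the entry point for appending to that large byte[]
	// Records the byte[] length of each doc. All per-doc byte[] values end up concatenated
	// in one large byte[], so the length of each doc's value has to be recorded.
	this.lengths = PackedLongValues.deltaPackedBuilder(PackedInts.COMPACT);
	this.iwBytesUsed = iwBytesUsed; // tracks memory usage; can be ignored here
	this.docsWithField = new FixedBitSet(64); // records the IDs of the docs that have a value
	this.bytesUsed = docsWithFieldBytesUsed();
	iwBytesUsed.addAndGet(bytesUsed);
}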

Next, the method that adds a docValue:

public void addValue(int docID, BytesRef value) {
	// ... (validation checks omitted)
	// Fill in any holes:
	// Some docs have no byte[], so their slot in the length recorder is filled with 0.
	// (I think this is unnecessary: a small change to the iterator code below would avoid it.
	// See the note in the comments of next() further down.)
	while (addedValues < docID) {
		addedValues++;
		lengths.add(0);
	}
	addedValues++;
	lengths.add(value.length); // record the size of the byte[] being added
	try {
		bytesOut.writeBytes(value.bytes, value.offset, value.length); // append the byte[] to the in-memory buffer
	} catch (IOException ioe) {
		// Should never happen!
		throw new RuntimeException(ioe);
	}
	docsWithField = FixedBitSet.ensureCapacity(docsWithField, docID);
	docsWithField.set(docID); // record this ID: the doc has a value in this field
	updateBytesUsed();
}
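
To make the bookkeeping concrete, here is a minimal sketch of the same idea in plain Java. It is not Lucene code: ByteArrayOutputStream, ArrayList and BitSet are stand-ins I chose for PagedBytes, PackedLongValues and FixedBitSet, and the class name BinaryBufferSketch is hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Simplified stand-in for the in-memory buffering; not Lucene's actual classes. */
class BinaryBufferSketch {
    private final ByteArrayOutputStream bytes = new ByteArrayOutputStream(); // stand-in for PagedBytes
    private final List<Integer> lengths = new ArrayList<>();                 // stand-in for PackedLongValues
    private final BitSet docsWithField = new BitSet();                       // stand-in for FixedBitSet
    private int addedValues = 0;

    /** Appends one value; docIDs are assumed to arrive in increasing order. */
    void addValue(int docID, byte[] value) {
        // Fill holes: docs skipped since the last call get a length of 0.
        while (addedValues < docID) {
            addedValues++;
            lengths.add(0);
        }
        addedValues++;
        lengths.add(value.length);
        bytes.write(value, 0, value.length); // all values share one growing buffer
        docsWithField.set(docID);
    }

    List<Integer> lengths() { return lengths; }
    int totalBytes() { return bytes.size(); }
    boolean hasValue(int docID) { return docsWithField.get(docID); }

    public static void main(String[] args) {
        BinaryBufferSketch w = new BinaryBufferSketch();
        w.addValue(0, new byte[] {1, 2});
        w.addValue(3, new byte[] {7, 8, 9});
        System.out.println(w.lengths());    // [2, 0, 0, 3]
        System.out.println(w.totalBytes()); // 5
    }
}
```

Docs 1 and 2 had no value, so they show up as zero lengths, while the byte buffer itself holds only the five real bytes.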

With the method above, the byte[] of every doc has been written into memory. Now let's look at what happens on flush:

public void flush(SegmentWriteState state, DocValuesConsumer dvConsumer) throws IOException {
	final int maxDoc = state.segmentInfo.getDocCount(); // number of docs in the current segment
	bytes.freeze(false);
	final PackedLongValues lengths = this.lengths.build(); // turn the builder into the final list of lengths
	// The second argument is an Iterable whose iterator() is overridden. The iterator it returns
	// takes the total number of docs and the object that records each doc's byte[] length.
	dvConsumer.addBinaryField(fieldInfo, new Iterable<BytesRef>() {
		public Iterator<BytesRef> iterator() {
			return new BytesIterator(maxDoc, lengths);
		}
	});
}
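
One detail worth noting is that flush hands over an Iterable, not a single Iterator. The consumer shown later in this article loops over values more than once, so each loop needs a fresh iterator. The sketch below illustrates that with plain Java collections; the class MultiPassSketch and its method names are hypothetical, and byte[] stands in for BytesRef.

```java
import java.util.Arrays;
import java.util.List;

/** Illustration only: why the consumer is handed an Iterable rather than one Iterator. */
class MultiPassSketch {
    static int iteratorsCreated = 0; // counts how many passes were started

    /** Wraps per-doc values (null = no value) so that every for-loop gets a fresh iterator. */
    static Iterable<byte[]> replayable(List<byte[]> perDoc) {
        return () -> {
            iteratorsCreated++;
            return perDoc.iterator();
        };
    }

    /** Pass 1: min and max length, counting a missing value as length 0. */
    static int[] minMaxLength(Iterable<byte[]> values) {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (byte[] v : values) {
            int len = (v == null) ? 0 : v.length;
            min = Math.min(min, len);
            max = Math.max(max, len);
        }
        return new int[] {min, max};
    }

    /** Pass 2: running end offset over all docs; the result is the last address. */
    static long lastAddress(Iterable<byte[]> values) {
        long addr = 0;
        for (byte[] v : values) {
            if (v != null) {
                addr += v.length;
            }
        }
        return addr;
    }

    public static void main(String[] args) {
        Iterable<byte[]> values =
            replayable(Arrays.asList(new byte[] {1, 2}, null, new byte[] {7, 8, 9}));
        int[] mm = minMaxLength(values);
        System.out.println(mm[0] + " " + mm[1]);  // 0 3
        System.out.println(lastAddress(values));  // 5
        System.out.println(iteratorsCreated);     // 2
    }
}
```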

Now the iterator. The next method of BytesIterator returns every byte[] that is to be stored in the index:

public BytesRef next() {
	if (!hasNext()) {
		throw new NoSuchElementException();
	}
	final BytesRef v;
	if (upto < size) {
		int length = (int) lengthsIterator.next(); // length of this doc's byte[]
		value.grow(length);
		value.setLength(length);
		try {
			bytesIterator.readBytes(value.bytes(), 0, value.length()); // read exactly that many bytes into value
		} catch (IOException ioe) {
			// Should never happen!
			throw new RuntimeException(ioe);
		}
		// If this doc ID has a value, return it. (If this check were moved before
		// lengthsIterator.next(), a doc without a value could return null right away,
		// and the hole filling in addValue would no longer be needed.)
		if (docsWithField.get(upto)) {
			v = value.get();
		} else { // no value for this doc: return null
			v = null;
		}
	} else {
		v = null;
	}
	upto++;
	return v;
}
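
The alternative suggested in the comments (check docsWithField first, and skip the hole filling) can be sketched as follows. This is only an illustration of that suggestion under my own assumptions, not how Lucene does it: the class NoHoleFillSketch is hypothetical, and lengths here holds one entry per doc that actually has a value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

/** Hypothetical variant: no zero-length hole filling, presence is checked before reading a length. */
class NoHoleFillSketch {
    /**
     * Replays the buffered values for docs 0..maxDoc-1.
     * lengths has one entry per doc that HAS a value, in docID order;
     * allBytes is the concatenation of those values.
     */
    static List<byte[]> replay(int maxDoc, BitSet docsWithField, List<Integer> lengths, byte[] allBytes) {
        List<byte[]> out = new ArrayList<>();
        int nextLength = 0; // index of the next unread entry in lengths
        int offset = 0;     // read position inside allBytes
        for (int doc = 0; doc < maxDoc; doc++) {
            if (!docsWithField.get(doc)) {
                out.add(null); // checked first, so no length entry is consumed
                continue;
            }
            int len = lengths.get(nextLength++);
            out.add(Arrays.copyOfRange(allBytes, offset, offset + len));
            offset += len;
        }
        return out;
    }

    public static void main(String[] args) {
        BitSet present = new BitSet();
        present.set(0);
        present.set(3);
        List<byte[]> values = replay(5, present, Arrays.asList(2, 3), new byte[] {1, 2, 7, 8, 9});
        for (byte[] v : values) {
            System.out.println(v == null ? "null" : Arrays.toString(v));
        }
    }
}
```

With docs 0 and 3 present out of five, the replay yields [1, 2], null, null, [7, 8, 9], null, which is the same sequence the hole-filling version produces.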

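From this method we can see that it returns an entry for every doc, and that the entry is null when the doc has no value. The byte[] values are read by combining two objects: the one that records the length of each byte[] and the one that holds all the bytes. Now let's look at the final flush step, that is, how the DocValuesConsumer uses this iterator. In 4.10.4 the implementation is Lucene410DocValuesConsumer, the same class as for numeric docValues, except that the method called is addBinaryField: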

@Override
public void addBinaryField(FieldInfo field, Iterable<BytesRef> values) throws IOException {
	meta.writeVInt(field.number); // write the field number; meta is the same as for numeric docValues: the index file for data
	meta.writeByte(Lucene410DocValuesFormat.BINARY);
	int minLength = Integer.MAX_VALUE;
	int maxLength = Integer.MIN_VALUE;
	final long startFP = data.getFilePointer();
	long count = 0;
	boolean missing = false;
	for (BytesRef v : values) { // loop over every byte[]; each non-null one is written to data
		final int length;
		if (v == null) {
			length = 0;
			missing = true;
		} else {
			length = v.length;
		}
		minLength = Math.min(minLength, length);
		maxLength = Math.max(maxLength, length);
		if (v != null) {
			// Write every byte[] into data, so data becomes one large byte[] holding many small ones.
			data.writeBytes(v.bytes, v.offset, v.length);
		}
		count++;
	}

	// Two formats: fixed-length when every byte[] has the same length, variable-length otherwise.
	// As the constant names show, both are uncompressed.
	meta.writeVInt(minLength == maxLength ? BINARY_FIXED_UNCOMPRESSED : BINARY_VARIABLE_UNCOMPRESSED);
	if (missing) { // some docs have no value: record in data which doc IDs do have one, just as for numeric docValues
		meta.writeLong(data.getFilePointer()); // record in meta the fp (the index) at which the doc set is written
		writeMissingBitset(values);
	} else { // otherwise write -1
		meta.writeLong(-1L);
	}
	meta.writeVInt(minLength);
	meta.writeVInt(maxLength);
	// Number of docs. The comment in the source calls this the number of values, which is not accurate:
	// it is the number of docs, because some docs have no value.
	meta.writeVLong(count);
	meta.writeLong(startFP); // the fp of data before any byte[] was written

	// if minLength == maxLength, its a fixed-length byte[], we are done (the addresses are implicit) otherwise, we need to record the length fields...
	// In other words, if every byte[] has the same length nothing more is needed; in practice, though, that is rarely the case.
	if (minLength != maxLength) { // lengths differ: write per-doc offsets, which make it easy to find where each doc starts in the large byte[]
		meta.writeLong(data.getFilePointer()); // record the current fp of data; the start position of every doc is written next
		meta.writeVInt(PackedInts.VERSION_CURRENT);
		meta.writeVInt(BLOCK_SIZE);

		final MonotonicBlockPackedWriter writer = new MonotonicBlockPackedWriter(data, BLOCK_SIZE);
		long addr = 0;
		writer.add(addr); // start position of the first doc
		for (BytesRef v : values) {
			// Write the end position of each doc, which is also the start position of the next doc.
			// The difference between two adjacent entries is the length of the earlier doc's value.
			// For a doc with no value the length is 0 and the two entries are equal.
			if (v != null) {
				addr += v.length;
			}
			writer.add(addr); // if v == null, addr is unchanged, so the length is 0
		}
		writer.finish();
	}
}
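
The address scheme is easier to see with a small example. The sketch below builds the same cumulative offsets in a plain long[] and shows how a value could be located from them, as well as the fixed-length case where no table is needed. It is a simplified model under my own assumptions, not the Lucene reader: AddressSketch and its methods are hypothetical names, the data array is taken to start at startFP, and the packed encoding done by MonotonicBlockPackedWriter is ignored.

```java
import java.util.Arrays;
import java.util.List;

/** Simplified model of the address table; offsets are relative to startFP. */
class AddressSketch {
    /** Builds perDoc.size() + 1 offsets: addr[0] = 0 and addr[i + 1] = end of doc i. */
    static long[] buildAddresses(List<byte[]> perDoc) {
        long[] addr = new long[perDoc.size() + 1];
        long cur = 0;
        for (int i = 0; i < perDoc.size(); i++) {
            byte[] v = perDoc.get(i);
            if (v != null) {
                cur += v.length; // a missing value leaves the offset unchanged
            }
            addr[i + 1] = cur;
        }
        return addr;
    }

    /** Variable-length case: doc's value is the slice [addr[doc], addr[doc + 1]) of data. */
    static byte[] lookupVariable(byte[] data, long[] addr, int doc) {
        return Arrays.copyOfRange(data, (int) addr[doc], (int) addr[doc + 1]);
    }

    /** Fixed-length case: no table is stored, the offset is simply doc * length. */
    static byte[] lookupFixed(byte[] data, int length, int doc) {
        return Arrays.copyOfRange(data, doc * length, (doc + 1) * length);
    }

    public static void main(String[] args) {
        List<byte[]> perDoc = Arrays.asList(new byte[] {1, 2}, null, new byte[] {7, 8, 9});
        long[] addr = buildAddresses(perDoc);
        byte[] data = {1, 2, 7, 8, 9};
        System.out.println(Arrays.toString(addr));                          // [0, 2, 2, 5]
        System.out.println(Arrays.toString(lookupVariable(data, addr, 2))); // [7, 8, 9]
        System.out.println(lookupVariable(data, addr, 1).length);           // 0
    }
}
```

A zero-length slice alone cannot tell a doc with no value apart from a doc whose value is an empty byte[], which is why the set of docs that have a value is written separately when any value is missing.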

At this point every byte[] has been written to the index.

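To summarize: BinaryDocValue simply writes all the byte[] values to disk, then writes the numbers that locate each doc's byte[] (the cumulative end offsets, from which each length follows) to disk as well, and records the fp of each part (its start position, which can be thought of as an index) in the meta file.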

BinaryDocValue is in fact simpler than NumericDocValue, because it does not come in as many formats.

2018-02-08 15:53