FST源代码解读5——FST的压缩 -

suichangkele

浏览: 204234 次
性别:
来自: 北京

最近访客更多访客>>

jieyuan_cg

z9780420

jzhfmm

geeksun

博主相关

博客

微博

相册

留言

关于我

文章分类

社区版块

存档分类

FST源代码解读5——FST的压缩

博客分类：

lucene

fst lucene

lucene fst 源代码解读

FST中最麻烦的其实是压缩，压缩的代码很难懂，必要性也不是很大（比如再词典表的存储的时候就没有使用压缩），所以我就仅把代码注释写上了，我估计有可能读的人不会很多，所以就不过多解释了.压缩的本质是对那些含有指向自己的arc的数量很多node编号做精简，有两种方式，或者是使用一个单独的对象记录这种节点的编号到位置的映射，在记录一个arc的指向的target的位置的时候记录编号，然后再用编号查找位置；或者是使用target的开始位置和指向这个节点的arc的开始位置差值（因为target一定再arc的后面，插值一定会更小）。在压缩之前生成的BytesStore中，每个arc会记录自己指向的target，可能用编号（在最终做压缩的fst中是编号，不压缩的时候是用具体值），对于那些重复次数很多的，这种方式可以起到压缩空间的作用。

/**
 * Expert: creates an FST by packing this one. This process requires substantial additional RAM (currently up to ~8 bytes per node depending
 * on <code>acceptableOverheadRatio</code>), but then should produce a smaller FST.
 * <p>
 * The implementation of this method uses ideas from <a target="_blank" href= "http://www.cs.put.poznan.pl/dweiss/site/publications/download/fsacomp.pdf">
 * Smaller Representation of Finite State Automata</a>, which describes techniques to reduce the size of a FST. However, this is not a strict implementation of the algorithms described in this paper.
 */
FST<T> pack(Builder<T> builder, int minInCountDeref, int maxDerefNodes, float acceptableOverheadRatio) throws IOException {

	// NOTE: maxDerefNodes is intentionally int: we cannot
	// support > 2.1B deref nodes

	// TODO: other things to try
	// - renumber the nodes to get more next / better locality?
	// - allow multiple input labels on an arc, so
	// singular chain of inputs can take one arc (on
	// wikipedia terms this could save another ~6%)
	// - in the ord case, the output '1' is presumably
	// very common (after NO_OUTPUT)... maybe use a bit
	// for it..?
	// - use spare bits in flags.... for top few labels /
	// outputs / targets

	if (nodeAddress == null) {
		throw new IllegalArgumentException("this FST was not built with willPackFST=true");
	}

	T NO_OUTPUT = outputs.getNoOutput();

	Arc<T> arc = new Arc<>();

	final BytesReader r = getBytesReader();//倒着读的
	// 一共收集多少个
	final int topN = Math.min(maxDerefNodes, inCounts.size());

	// Find top nodes with highest number of incoming arcs:
	NodeQueue q = new NodeQueue(topN);

	// TODO: we could use more RAM efficient selection algo here...
	NodeAndInCount bottom = null;//未填充满堆之前，都是null
	// 将所有的节点都写入到堆中，保留最大的topN个
	for (int node = 0; node < inCounts.size(); node++) {// node==0的不知道干啥的，忽略,因为等于0的不会进入下面的if判断。
		if (inCounts.get(node) >= minInCountDeref) {
			if (bottom == null) {
				q.add(new NodeAndInCount(node, (int) inCounts.get(node)));
				if (q.size() == topN) {
					bottom = q.top();
				}
			} else if (inCounts.get(node) > bottom.count) {//插入，挤掉最小的
				q.insertWithOverflow(new NodeAndInCount(node, (int) inCounts.get(node)));
			}
		}
	}

	// Free up RAM:
	inCounts = null;
	
	final Map<Integer, Integer> topNodeMap = new HashMap<>();//指向的arc数量最多的n个中排名第几，从0开始，key是节点，value是排名
	for (int downTo = q.size() - 1; downTo >= 0; downTo--) {
		NodeAndInCount n = q.pop();
		topNodeMap.put(n.node, downTo);
	}

	// +1 because node ords start at 1 (0 is reserved as stop node):
	// 按照正向存储，已经生成的最新的fst中每个节点的开始位置（nodeAddress记录的是反向存储时的结束位置），加一是GrowableWriter是从0开始的，但是在fst中node=0是给最后一个节点准备的（参见addNode方法nodeIn.numArcs == 0的情况），
	// 其他的不是最后一个节点的编号是从1开始的，所以记录这1-n个节点最大的是n，所以需要n+1个
	// 但是下面会更新这个newNodeAddress的值，在更新后就可以将其理解为每个节点在新的fst中的开始的值了。
	final GrowableWriter newNodeAddress = new GrowableWriter(PackedInts.bitsRequired(builder.bytes.getPosition()), (int) (1 + builder.nodeCount), acceptableOverheadRatio);
	// Fill initial coarse guess:   注意这里写的仅仅是大概的值，因为下面是从root开始写入的，在写一个前面的节点时，需要记录其arc的target的节点，而此时距离他的target的路径上可能还有好多节点
	// 这些节点还没有处理过他们将来会变，所以写入的仅仅是大概的值
	for (int node = 1; node <= builder.nodeCount; node++) {
		// builder.bytes.getPosition() - nodeAddress.get(node)表示全部的结束位置减去这个节点的结束位置，也就是反向记录的话这个节点的开始位置；因为bytes的开始会写入一个0的byte，所以开始位置需要再加1。
		newNodeAddress.set(node, 1 + builder.bytes.getPosition() - nodeAddress.get(node));//newNodeAddress从1开始，供n个，但是GrowableWriter从0开始的，所以上面是n+1个
	}
	// 还是对上面的newNodeAddress的解释：里面存储的有的是开始位置，是预估的位置，下面还会有解释。
	
	FST<T> fst;

	// Iterate until we converge:
	while (true) {
		boolean changed = false;//这次压缩中有没有有没有发生新的变化，如果发生了变化，说明要继续压缩，直到不再发生变化。
		// for assert:
		boolean negDelta = false;
		// 新的fst
		fst = new FST<>(inputType, outputs, builder.bytes.getBlockBits());
		final BytesStore writer = fst.bytes;
		// Skip 0 byte since 0 is reserved target: 先写入0，所以上面的newNodeAddress中会加一个1.
		writer.writeByte((byte) 0);
		// 当前节点向后移动的距离（负数表示向前移动了）只要这个不是0，说明从这个节点后面的所有的节点的位置都发生了变化，需要继续进行重复压缩,也就是上面的while true需要继续进行，因为之后的节点的位置都发送了变化，前面的节点的target也要变化。
		long addressError = 0;//对于已经处理过的节点，新的值一定不会比之前的值大，因为是压缩过的，所以这个会是负的（或者等于0）

		// Since we re-reverse the bytes, we now write the nodes backwards, so that BIT_TARGET_NEXT is unchanged:
		for (int node = (int) builder.nodeCount; node >= 1; node--) {//遍历所有的node，从最后一个node开始遍历，也就是从root开始（之前是倒叙的，root是最后写入的）
			
			final long address = writer.getPosition();//这个节点在新的fst中的开始位置
			if (address != newNodeAddress.get(node)) {//如果新的开始位置和之前的开始位置不一样
				addressError = address - newNodeAddress.get(node);//从这里看，不会大于0
				changed = true;//当前的节点的开始位置发生了变化，更新这个节点的最新的位置，后续的所有的节点都不准确了，都会进入这个if
				newNodeAddress.set(node, address);//更新当前节点的最新的位置
			}
			int bytesPerArc = 0;//最新的fst中这个节点每个arc使用的内存的最大值，在第一次压缩的时候，就是最早生成的fst中的最大值，第二次（如果存在）压缩的时候就是第一次压缩过程中确定的最大值
			boolean retry = false;//retry表示在此次的fst的压缩过程中，该节点是否是第二次尝试，第一次压缩的时候是false，最多只有两次压缩。
			// for assert:
			boolean anyNegDelta = false;
			
			// Retry loop: possibly iterate more than once, if this is an array'd node and bytesPerArc changes:
//				writeNode: 
			while (true) {//处理这个node，采用while true是因为可能存在第二次压缩，因为第一次压缩的时候不确定bytesPerArc到底是多大。

				readFirstRealTargetArc(node, arc, r);//读取这个节点的第一个arc，读取到arc里面

				final boolean useArcArray = arc.bytesPerArc != 0;//之前有没有用fixedBytes格式
				// 写入这个节点的一些元数据信息
				if (useArcArray) {//之前采用了fixedBytes格式，此时需要先写入这个node的header，里面包括了使用fixedArc的标识，每个arc使用内存的大小还有arc的数量
					// Write false first arc:
					if (bytesPerArc == 0) {
						bytesPerArc = arc.bytesPerArc;
					}
					writer.writeByte(ARCS_AS_FIXED_ARRAY);
					writer.writeVInt(arc.numArcs);
					writer.writeVInt(bytesPerArc);
				}

				int maxBytesPerArc = 0;//压缩之后这个节点在所有的arc中占用内存最大的一个
				while (true) { // iterate over all arcs for this node   写这个节点的多个arc

					final long arcStartPos = writer.getPosition();//arc在bytesStore中的开始

					byte flags = 0;

					if (arc.isLast()) {
						flags += BIT_LAST_ARC;
					}
					if (!useArcArray && node != 1 && arc.target == node - 1) {// 进入fst的node是从1开始的，如果等于1说明他的target节点不在fst中，arc.target == node - 1说明他在fst中的下一个节点就是target节点（倒着读的）
						flags += BIT_TARGET_NEXT;
						if (!retry) {
//								nextCount++;
						}
					}
					if (arc.isFinal()) {//这个arc构成一个term（路径）
						flags += BIT_FINAL_ARC;
						if (arc.nextFinalOutput != NO_OUTPUT) {
							flags += BIT_ARC_HAS_FINAL_OUTPUT;
						}
					} else {
						assert arc.nextFinalOutput == NO_OUTPUT;
					}
					if (!targetHasArcs(arc)) {//指向最后一个节点
						flags += BIT_STOP_NODE;
					}

					if (arc.output != NO_OUTPUT) {//是否有输出
						flags += BIT_ARC_HAS_OUTPUT;
					}

					final long absPtr;//记录target的位置的绝对值
					// 是否需要记录target
					final boolean doWriteTarget = targetHasArcs(arc) && (flags & BIT_TARGET_NEXT) == 0;//是否需要记录指向的target，如果target没有arc，则是最后一个，不需要记录，如果满足BIT_TARGET_NEXT，也不需要记录
					if (doWriteTarget) {//记录target
						final Integer ptr = topNodeMap.get(arc.target);// 新的编号
						if (ptr != null) {//有编号的话用编号
							absPtr = ptr;//这个值一定小于 topNodeMap.size()，这样再取出的时候，通过与 topNodeMap.size()判断大小，就可以知道是哪种情况了
						} else {//没有编号，计算一个，记录的还是绝对值，+topNodeMap.size()仅仅是为了和上面的ptr != null的情况有所区别 
							absPtr = topNodeMap.size() + newNodeAddress.get((int) arc.target) + addressError;//后面的 newNodeAddress.get((int) arc.target) + addressError表示target节点的预估距离，是不准确的
						}
						
						//计算差值： target节点顺序写的话在原来fst中的开始位置  + 移动的距离 - 这个arc在新的fst中的开始位置 ，-2这个可以理解为一个阈值，只有当delta - absPtr < 2的时候才使用差值
						long delta = newNodeAddress.get((int)arc.target) +  addressError - writer.getPosition() - 2;
						if (delta < 0) {
							anyNegDelta = true;
							delta = 0;
						}

						if (delta < absPtr) {//如果记录差值更小，则记录差值
							flags |= BIT_TARGET_DELTA;//此边的目标节点以差值编码方式存储 
						}
					} else {
						absPtr = 0;
					}

					assert flags != ARCS_AS_FIXED_ARRAY;
					writer.writeByte(flags);//写入flag

					fst.writeLabel(writer, arc.label);//写入label

					if (arc.output != NO_OUTPUT) {//写入输出
						outputs.write(arc.output, writer);
					}
					if (arc.nextFinalOutput != NO_OUTPUT) {//写入fianl输出
						outputs.writeFinalOutput(arc.nextFinalOutput, writer);
					}

					if (doWriteTarget) {
						// 差值，计算方式是：目标节点的预估位置（不准确的）减去现在位置，解释： newNodeAddress.get((int) arc.target) + addressError 表示target节点在新的fst中的预估位置（不经过压缩的真实位置+现在已经节省的空间），
						// writer.getPosition()表示这个arc的开始位置；
						long delta = newNodeAddress.get((int) arc.target)/* 这个值是不准确的，因为路径上后面的节点还没有处理，可能会发生变化 */ + addressError/* 这个极可能是负数 */ - writer.getPosition();
						if (delta < 0) {//TODO 这个不可能小于0 啊！
							anyNegDelta = true;
							delta = 0;
						}

						if (flag(flags, BIT_TARGET_DELTA)) {//如果记录差值的，
							writer.writeVLong(delta);
						} else {
							writer.writeVLong(absPtr);//不是记录差值的，还是用绝对的
						}
					}

					if (useArcArray) {// 扩容用的
						final int arcBytes = (int) (writer.getPosition() - arcStartPos);//这个arc占用了多少内存
						maxBytesPerArc = Math.max(maxBytesPerArc, arcBytes);//更新这个节点的所有的arc使用的最大内存（从这个也看的出来，第二次压缩不会更新maxBytesPerArc）
						// NOTE: this may in fact go "backwards", if somehow (rarely, possibly never) we use
						// more bytesPerArc in this rewrite than the incoming FST did... but in this case we
						// will retry (below) so it's OK to ovewrite bytes: wasted += bytesPerArc - arcBytes;
						writer.skipBytes((int) (arcStartPos + bytesPerArc - writer.getPosition()));//扩容，每个arc都占用一个bytesPerArc的大小
					}
					if (arc.isLast()) {//这个arc是最后一个，不读取了，否则继续读下一个
						break;
					}
					readNextRealArc(arc, r);//继续读取下一个
				}// 写这个节点的多个arc 循环结束

				// 如果不是使用fixedArray，则已经是紧凑的了，不会有多余的浪费，退出；如果是fixedArray的，因为之前写入的header中是原来的bytesPerArc，压缩后可能会发生变化，所以这一次写错了，需要重新写，直到maxBytesPerArc == bytesPerArc
				if (useArcArray) {
					if (maxBytesPerArc == bytesPerArc/* 经过压缩之后没有发生变化，退出 */ || (retry && maxBytesPerArc <= bytesPerArc) /*retry=true表示不是第一次压缩了，第一次的时候是false，后面的一定是true，第二次压缩不会改变maxBytesPerArc */ ) {
						break;
					}
				} else {
					break;
				}

				// Retry: 
				bytesPerArc = maxBytesPerArc;//本次的最大值（也可以叫做第一次压缩的最大值，因为后面的几次不会修改这个）
				writer.truncate(address);//废弃掉本次的输入
				retry = true;//表示不是第一次压缩这个节点了
				anyNegDelta = false;
			}// 处理这个node结束

			negDelta |= anyNegDelta;
		}//遍历所有的node

		if (!changed) {//只要有一个arc变化了，则必须要从头全部重来一次。
			// We don't renumber the nodes (just reverse their order) so nodes should only point forward to
			// other nodes because we only produce acyclic FSTs w/ nodes only pointing "forwards":
			assert !negDelta;
			break;
		}
	}

	long maxAddress = 0;
	for (long key : topNodeMap.keySet()) {//key是节点，这个循环的目的是查找最大的值，供下面的packedInts使用
		maxAddress = Math.max(maxAddress, newNodeAddress.get((int) key));
	}

	// 记录上面使用编号的节点的开始位置，因为之前仅仅是记录的起编号，没有记录其开始位置，要获得开始位置，需要使用这个
	PackedInts.Mutable nodeRefToAddressIn = PackedInts.getMutable(topNodeMap.size(), PackedInts.bitsRequired(maxAddress), acceptableOverheadRatio);
	for (Map.Entry<Integer, Integer> ent : topNodeMap.entrySet()) {
		nodeRefToAddressIn.set(ent.getValue(), newNodeAddress.get(ent.getKey()));
	}
	fst.nodeRefToAddress = nodeRefToAddressIn;

	fst.startNode = newNodeAddress.get((int) startNode);

	if (emptyOutput != null) {
		fst.setEmptyOutput(emptyOutput);
	}

	fst.bytes.finish();
	fst.cacheRootArcs();

	return fst;
}

这段代码很晦涩，需要好好理解，而且我觉得需要一点点数学证明证明一点点的压缩是可以压缩完的，我没有完成这个证明。如果有读者在看这个代码时发现我写的不对，可以和我联系，我的qq是：1308567317

分享到：

FST源代码解读6——FST的读取 | FST源代码解读4——结束添加

2019-12-17 18:23
浏览 1010
评论(0)
分类:开源软件
查看更多

发表评论

您还没有登录,请您登录后再发表评论

最近访客更多访客>>

博主相关

文章分类

社区版块

存档分类

最新评论

FST源代码解读5——FST的压缩

评论

发表评论

相关推荐

最近访客 更多访客>>

博主相关

文章分类

社区版块

存档分类

最新评论

FST源代码解读5——FST的压缩

评论

发表评论

相关推荐

FST源代码解读6——FST的读取

FST源代码解读4——结束添加

FST源代码解读3——编译节点

FST源代码解读2——FST的生成

FST源代码解读1——FST是什么

packedints总结

lucene中的PackedInts源码解读(3)-PACKED格式

lucene中的PackedInts源码解读(2)-Packed64SingleBlock

lucene中的PackedInts源码解读-1

SpanQuery的得分

lucene的spanNearQuery（二）——不带有顺序的

solr的facet源码解读（四）——facet.field之非数字单值域类型

solr的facet源码解读（三）——facet.field之数字单值域类型

solr的facet源码解读（二）——facet.field

lucene中关于正向信息的获取——FielldCache

solr对docValue的使用

lucene中的docValue实现源码解读（十二）——总结

lucene中的docValue实现源码解读（十一）——SortedSet的读取

lucene中的docValue实现源码解读（十）——SortedSet的写入

lucene中的docValue实现源码解读（九）——SortedNumericDocValue的读取

最近访客更多访客>>