Efficient in-memory extensible inverted file
Abstract
The growing amount of on-line data demands efficient parallel and distributed indexing mechanisms to manage large resource requirements and unpredictable system failures. Parallel and distributed indices built using commodity hardware like personal computers (PCs) can substantially save cost because PCs are produced in bulk, achieving economies of scale. However, PCs have a limited amount of random access memory (RAM), and the effective utilization of RAM for in-memory inversion is crucial. This paper presents an analytical investigation and an empirical evaluation of storage-efficient in-memory extensible inverted files, which are represented by fixed- or variable-sized linked list nodes. The size of these linked list nodes is determined by minimizing the storage wastes or maximizing the storage utilization under different conditions, which lead to different storage allocation schemes. Minimizing storage wastes also reduces the number of address indirections (i.e., chaining). We evaluated our storage allocation schemes using a number of reference collections. We found that the arrival rate scheme is the best in terms of both storage utilization and the mean number of chainings per term. The final storage utilization can be over 90% in our evaluation if a sufficient number of documents is indexed, and the mean number of chainings is small (less than 2.6 for all the reference collections). We have also shown that our best storage allocation scheme can be used for our extensible compressed inverted file, whose final storage utilization can likewise be over 90% in our evaluation provided that a sufficient number of documents is indexed. The proposed storage allocation schemes can also be used by compressed extensible inverted files with word positions.
Keywords: Information retrieval; Indexing; Optimization
1. Introduction
As more and more data are made available on-line, it becomes increasingly difficult to manage a single inverted file. This difficulty arises from the substantial resource requirement for large-scale indexing and from the long indexing time, which makes the system vulnerable to unpredictable system failures. For example, the very large collection (VLC) from TREC [1] requires 100 Gb of storage, and the TREC terabyte track requires 426 Gb [2]. The WebBase repository [3] requires 220 Gb, which is estimated to cover only 4% of the indexable web pages. The volume of high-quality non-English content is also increasing; in the near future, Japanese patent data from NTCIR [4] may be as large as 160 Gb. One way to manage such large quantities of data is to create the index by merging smaller indices, which are built using multiple machines indexing different document subsets in parallel [5]. This limits the impact of a system failure to individual machines and increases indexing speed.
Acquiring computing machines in bulk using commodity hardware substantially reduces monetary costs. Commodity hardware, like personal computers (PCs), also makes in-memory inversion an attractive proposition because RAM for the PC market is relatively cheap and fast, and because RAM has the potential to be upgraded later at lower prices (e.g., DDR-300 RAM to DDR-400 RAM). However, PCs can only hold a relatively small amount of RAM (i.e., 4 Gb) compared with mainframe computers. Efficient RAM utilization becomes an important issue for in-memory inversion using a large number of PCs because, given the large volume of data on-line, the entire inverted index typically cannot be stored in RAM. Instead, the inverted file is typically built in relatively small batches [6] and [7]. For each batch, a partial index is built in RAM and then written out to disk as a run. The runs are then merged into the final inverted file.
Efficient RAM utilization can reduce indexing time by reducing the number of inverted files to merge, because it enables more documents to be indexed per run. During updates, temporary indices are maintained in memory and then integrated into the main inverted file in batches. Lester and Zobel [8] showed that, for the different inverted file maintenance methods (i.e., the re-build, re-merge and in-place methods), the amortized time cost of integrating the temporary index with the main inverted file is reduced when more documents are indexed. Therefore, during both initial construction and update, making better use of memory resources can reduce overall costs. Since indexing is memory-intensive while loading and flushing data are disk- or network-intensive, efficient in-memory inversion also has the potential to better balance system resource utilization, making it crucial to index construction.
The major contribution of this paper is to enhance existing, simple-to-implement, single-pass in-memory inversion so that it is storage-efficient for creating partial inverted files and/or temporary indices, by developing novel storage allocation schemes that predict the needed storage with minimal storage waste. The partial index created by our in-memory inversion can be merged with the main inverted file or with other partial inverted files. The temporary index created by our in-memory inversion can also be searched while it is being built. This reduces the latency before recently indexed documents become available for searching, which is important for certain applications (e.g., searching recently published news articles).
An evaluation was carried out to determine which of our storage allocation schemes is the best and whether the results are comparable to existing methods (Section 6). The evaluation was carried out using 3.5 Gb of test data from the VLC. The best allocation scheme was the arrival rate scheme, which achieved 95% final storage utilization for this VLC dataset. To ascertain the generality of the results, various independent datasets for both English (TREC-2, TREC-6 and TREC-2005) and Chinese (NTCIR-5) were also used to evaluate the best storage allocation scheme. We also showed that the indexing speed of our best storage allocation scheme is similar to the indexing speeds reported by others [6] and [9].
The rest of this paper is organized as follows. Section 2 discusses our extensible inverted file structures, the modifications of our extensible inverted file to incorporate compressed postings and word positions, and the related storage wastes. This section also provides the rationale behind the choice of data structure for our storage allocation schemes and the rationale behind the need to optimize their storage wastes. Section 3 describes the first approach to determining optimal node sizes, which uses a stepwise optimization strategy and results in three related storage allocation schemes. Section 4 discusses the second approach, which determines the optimal node size that minimizes the asymptotic worst-case storage waste per period for individual terms. Section 5 evaluates these storage allocation schemes and discusses the best scheme in terms of storage utilization, mean number of chainings, robustness of performance and indexing speed. This section also shows that our storage allocation schemes can be used for allocating nodes that store compressed postings, using the best storage allocation scheme and variable byte compression as an example. Section 6 discusses related work and describes how the storage allocation schemes can predict the storage of our extended inverted file that incorporates compressed postings and word positions. Section 7 concludes.
2. Extensible inverted file and storage wastes
This section describes the structure of the extensible inverted file and the related considerations for handling compressed postings and word position information. This section also discusses the storage wastes of the extensible inverted file and the rationale to optimize them, as well as the rationale for using the variable-size linked list data structure.
2.1. Extensible inverted file
An in-memory inverted file can be considered as a set of inverted lists, which are implemented as linked lists. Each linked list node holds a set of (basic) postings of the form ⟨di, tf(di, tk)⟩, where each basic posting consists of a document identifier di and the within-document term frequency tf(di, tk) of the kth term in the ith document. The rest of this paper assumes that, unless otherwise indicated, all postings are basic postings. If a linked list node can hold a variable number of postings, then two additional fields of information other than postings are stored in each node: a node size variable and an extension pointer. The node size variable specifies the amount of storage allocated for the current node, and the extension pointer facilitates the chaining of nodes.
Fig. 1 shows the conceptual structure of the extensible inverted file, implemented using a set of variable-size linked list nodes. The dictionary data structure holds the set of index terms and the start address of the corresponding variable-size linked list. A new node is allocated whenever the linked list of the kth index term tk is full (e.g., the linked list of the index term “Network” in Fig. 1) and a new posting for tk arrives. The size of the new node is determined using one of the storage allocation schemes discussed in the next two sections. If the linked list nodes hold a fixed number of postings per node, then the node size variable can be discarded, saving storage space.
Fig. 1. The conceptual structure of our extensible inverted file, represented as variable-sized nodes. The start and last pointers point, respectively, to the first and last linked list nodes of the inverted list.
Each dictionary entry for an index term has a start pointer and a last pointer, which point, respectively, to the beginning of the inverted list and to the last linked list node of the inverted list. The last pointer avoids traversing the linked list when a new posting for the index term is inserted. During insertion, the last linked list node needs to be exclusively locked to maintain the integrity of the data for concurrent access [10]. To reduce memory usage, the start pointers can be stored in a file, since start pointers are used only for retrieval and not for inserting new postings. For clarity of presentation, additional information that each dictionary entry may contain (e.g., document frequency) is not shown in Fig. 1. In particular, each dictionary entry should hold a variable, say mpos, which indicates the position of the unfilled portion of the last node, to speed up posting insertion.
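As an illustration of this layout, the following C sketch shows one possible representation of the variable-size nodes and dictionary entries. This is a minimal sketch, not the implementation used in the paper; the field widths (e.g., 4-byte document identifiers and 2-byte term frequencies) and all names are assumptions for illustration only.

    #include <stdint.h>

    /* One basic posting: document identifier and within-document term frequency. */
    typedef struct {
        uint32_t doc_id;          /* d_i */
        uint16_t tf;              /* tf(d_i, t_k) */
    } Posting;

    /* A variable-size linked list node: node size variable, extension pointer,
       and the postings themselves (a C99 flexible array member). */
    typedef struct Node {
        uint32_t     capacity;    /* node size variable: postings this node can hold */
        struct Node *next;        /* extension pointer for chaining */
        Posting      postings[];  /* storage for up to `capacity` postings */
    } Node;

    /* Dictionary entry: start and last pointers plus mpos, the position of the
       unfilled portion of the last node, so insertion needs no list traversal. */
    typedef struct {
        Node    *start;           /* first node (used for retrieval only) */
        Node    *last;            /* last node (used for insertion) */
        uint32_t mpos;            /* index of the first unfilled posting slot */
    } DictEntry;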
The extensible inverted file can support storing a special type of posting for block addressing inverted files [9] and [11], which index fixed-size text blocks instead of variable-size documents. This special type of posting, called a block-address posting in this paper, has only the di field, without the term frequency tf(di, tk) field of the basic posting, where di is a block identifier instead of a document identifier and tk is the kth term. Our storage waste optimization, discussed in Sections 3 and 4, can minimize the storage wastes of nodes that store basic postings or block-address postings, because the storage of these postings is a constant (i.e., c1) in the optimization.
The extensible inverted file can support storage of compressed postings [12], as well as word positions. For compressed postings (e.g., γ coding [13] or variable byte compression [14]), each dictionary entry keeps track of the bit position (again using mpos) or the byte position of the unfilled portion of the last node. A new compressed posting is inserted at mpos as a bit/byte string. If the last node does not have enough free space for the new compressed posting, then the posting is split between the last node and a newly allocated node.
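To make the splitting behaviour concrete, the following hedged C sketch inserts a variable-byte-coded posting at byte position mpos and spills the remainder into a newly allocated node when the last node is full. The byte-granular node type, the encoding convention (low 7 bits per byte, high bit marking the final byte) and the helper names are illustrative assumptions; the paper's exact coding and allocation scheme may differ.

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* Byte-granular node for compressed postings. */
    typedef struct ByteNode {
        uint32_t         capacity;  /* bytes allocated for posting data */
        struct ByteNode *next;      /* extension pointer */
        uint8_t          bytes[];   /* compressed posting data */
    } ByteNode;

    typedef struct {
        ByteNode *start, *last;
        uint32_t  mpos;             /* byte position of the unfilled portion of the last node */
    } CDictEntry;

    /* Variable-byte encode v (e.g., a d-gap); returns the number of bytes
       written (1..5). Low 7 bits per byte; high bit set on the final byte. */
    static int vbyte_encode(uint32_t v, uint8_t out[5]) {
        int n = 0;
        while (v >= 128) { out[n++] = (uint8_t)(v & 127); v >>= 7; }
        out[n++] = (uint8_t)(v | 128);
        return n;
    }

    static ByteNode *alloc_bnode(uint32_t cap) {  /* cap from a storage allocation scheme */
        ByteNode *n = malloc(sizeof(ByteNode) + cap);
        n->capacity = cap;
        n->next = NULL;
        return n;
    }

    /* Insert an encoded posting at mpos; if the last node lacks room, split
       the byte string between the last node and a newly allocated node.
       Usage: uint8_t buf[5]; int len = vbyte_encode(dgap, buf);
              insert_compressed(&entry, buf, len, new_cap);               */
    static void insert_compressed(CDictEntry *e, const uint8_t *enc, int len,
                                  uint32_t new_cap) {
        int fit = (int)(e->last->capacity - e->mpos);
        if (fit > len) fit = len;
        memcpy(e->last->bytes + e->mpos, enc, (size_t)fit);
        e->mpos += (uint32_t)fit;
        if (fit < len) {            /* spill the remainder into a new node */
            ByteNode *n = alloc_bnode(new_cap);
            e->last->next = n;
            e->last = n;
            memcpy(n->bytes, enc + fit, (size_t)(len - fit));
            e->mpos = (uint32_t)(len - fit);
        }
    }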
There are two general approaches to storing postings with word positions. One approach stores the word positions in the nodes, and one way to do this (as in [6]) is to store the posting followed by a sequence of word positions, i.e., ⟨di, tf(di, tk), pos(di, tk, 1), …, pos(di, tk, tf(di, tk))⟩, where di is the identifier of the ith document, tf(di, tk) is the within-document frequency of the kth term in the ith document, and pos(di, tk, x) is the xth word position of the kth term in the ith document. In this case, the node size includes the storage to hold the word positions as well as the postings. Another approach stores extended postings of the form ⟨di, tf(di, tk), f(di, tk)⟩, where f(di, tk) is the file position in an auxiliary file that stores the sequence of word positions of the kth term in the ith document. In this approach, the word positions are stored in the auxiliary file. Whenever a new extended posting is added, the last position of the auxiliary file is stored as f(di, tk) of the new extended posting, and the word positions of the term associated with the new extended posting are appended sequentially to the auxiliary file. These word positions can be compressed, for example, using differential coding [9], [12], [13], [14], [15] and [16]. If the within-document term frequency is one (i.e., tf(di, tk) = 1), then f(di, tk) of the extended posting can directly store the single word position of the kth term in the ith document, saving both storage and access time. For both approaches, the storage allocation schemes can be modified to determine the node sizes, as discussed in Section 5.
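The second approach might be sketched in C as follows; the in-line storage of a single word position when tf = 1 follows the description above, while the field widths and function names are illustrative assumptions.

    #include <stdint.h>
    #include <stdio.h>

    /* Extended posting <d_i, tf(d_i,t_k), f(d_i,t_k)>: f is the offset in an
       auxiliary file holding this document's word positions for the term. */
    typedef struct {
        uint32_t doc_id;
        uint16_t tf;
        uint64_t fpos;      /* file offset, or the word position itself when tf == 1 */
    } ExtPosting;

    /* Append the word positions to the auxiliary file and build the extended
       posting. The positions could be differential-coded before writing. */
    static ExtPosting make_ext_posting(FILE *aux, uint32_t doc_id,
                                       const uint32_t *pos, uint16_t tf) {
        ExtPosting p = { doc_id, tf, 0 };
        if (tf == 1) {
            p.fpos = pos[0];                 /* store the single position directly */
        } else {
            fseek(aux, 0, SEEK_END);
            p.fpos = (uint64_t)ftell(aux);   /* last position of the auxiliary file */
            fwrite(pos, sizeof(uint32_t), tf, aux);
        }
        return p;
    }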
2.2. Rationale for variable-size linked lists
The variable-size linked-list data structure is chosen here because it is used in many in-memory inverted files (where the nodes are sometimes called buckets or fixed list blocks) and because it is relatively simple to implement and to analyze. Instead of linked lists, postings can be stored in RAM blocks that expand to hold more postings as they are inserted. This type of RAM block expansion may involve copying and moving data chunks. Our work can be considered as extending the RAM block approach, where each block is pre-allocated with enough storage to hold the expected number of postings so that it is not necessary to copy or move data chunks in the RAM blocks. This pre-allocation avoids memory fragmentation, and the difficulty shifts to predicting the expected number of postings for each term instead of relying on advanced storage allocators. If a fast storage allocator is used so that the allocation time is amortized to a constant, then storage utilization may be sacrificed. Instead of using advanced storage allocators, dynamic data structures like hash tables, skip lists and balanced trees can be used, which support deletion as well as insertion of postings. However, the storage utilization of these dynamic data structures is typically low (i.e., no more than 60% if each node contains at least one 4-byte pointer and one 6-byte posting). It is possible to store multiple postings per node in these data structures, but in that case the problem of optimizing the storage waste per node re-appears, whether one is dealing with a dynamic data structure (e.g., balanced trees) or a variable-size linked list. Therefore, we propose to use variable-size linked lists in this study because they are simple to program, use simple and fast storage allocators, are commonly used by in-memory inverted files, can easily be adapted to store compressed postings, and can be optimized for storage waste in the same way as other dynamic data structures (e.g., balanced trees).
Our choice of using linked lists to store (compressed) postings implies that our extensible inverted files are designed largely for append-only operations, where direct deletions and direct modifications can be avoided. Deletions can be done indirectly by filtering out document identifiers that are known to be invalid (or deleted) using a registry of stale documents [8] and [17], because search engines can trade off data consistency against availability [17] according to the CAP theorem [18] and [19]. We expect few deletions or modifications during in-memory inversion because the incoming documents are relatively fresh. On the other hand, deletions or modifications may occur more often after the inverted file has been transferred to disk. Since disk storage cost per byte is much cheaper than RAM, deletion by filtering document identifiers is a practical solution for large-scale search engines. Similarly, a modification can be implemented effectively as a deletion (implemented as filtering) followed by a document insertion. When the registry of stale document identifiers becomes large or when the temporary index is full, the main inverted file on disk can be maintained by the re-building, re-merging or in-place updating approaches [8]. Therefore, the choice of an append-only data structure, like variable-size linked lists, may not be a severe handicap.
The use of the variable-size linked list representation of inverted lists requires some consideration of how to merge partial inverted files. The first approach saves the in-memory inverted lists, which are represented as linked lists, as contiguous lists on disk. This requires the system to follow the extension pointers of the linked list nodes when transferring the in-memory inverted file to disk. Tracing all the extension pointers incurs some additional time due to cache misses; however, most of the time cost is due to transferring data to disk, provided that the mean number of chainings per term is not large. Once the in-memory inverted file is transferred to disk as a set of contiguous lists, the conventional inverted file-merging algorithm can be used to merge these partial inverted files on disk. An alternative approach dumps the in-memory inverted file onto disk as it is. During the first level of partial inverted file merging, the merging algorithm combines two inverted lists on disk, each represented by a set of linked lists, into a single contiguous inverted list on disk. The merged partial inverted file then has a set of contiguous inverted lists on disk, and it can subsequently be merged with other merged partial inverted files using the conventional merging algorithm. However, following extension pointers on disk requires file seek operations, which incur more time than cache misses. Therefore, we prefer the first approach, because the time cost of a cache miss is less than that of a file seek and because this approach is applicable to the inverted file maintenance methods mentioned by Lester and Zobel [8] (i.e., re-build, re-merge and in-place updates) using an in-memory temporary index (called the document buffer in [8]).
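Under the first (preferred) approach, transferring an in-memory inverted list to disk amounts to walking the chain once and writing the filled portion of each node, as in the following minimal sketch, which reuses the Posting/Node/DictEntry types sketched in Section 2.1 and is an illustration rather than the actual implementation:

    #include <stdint.h>
    #include <stdio.h>

    /* Flush one in-memory inverted list to disk as a contiguous list by
       following the extension pointers. Every node except the last is full;
       the last node is filled only up to mpos. */
    static void flush_list(FILE *out, const DictEntry *e) {
        for (const Node *n = e->start; n != NULL; n = n->next) {
            uint32_t filled = (n == e->last) ? e->mpos : n->capacity;
            fwrite(n->postings, sizeof(Posting), filled, out);
        }
    }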
2.3. Rationale for storage waste optimization
The success of representing inverted lists by linked lists rests on the ability to accurately predict the required storage so that RAM storage utilization is maximized. Otherwise, if the final storage utilization is low (say 60%), other data structures that can support deletion should be used instead. The storage utilization U of the extensible inverted file is the ratio of the total storage P of all (compressed) postings to the total storage allocated, i.e., U = P/(P + S), where S is the total storage waste. Maximizing the storage utilization U can therefore be treated as minimizing the storage wastes of the extensible inverted file.
Storage wastes in the extensible inverted file can be divided into two types:
(a) The first type of storage waste is the storage overhead ε that includes the storage for extension pointers and for the node size variables; and
(b) The second type of storage waste is the latent free storage, which has been allocated but not yet filled with posting information. If this latent free storage were not counted as storage waste, then the optimal linked-list node size would be as large as possible, so that the overhead appears minimal relative to the total storage allocated to the node.
The storage waste of each node in the extensible inverted file is the sum of these two types of storage wastes of the node.
There are many advantages to optimizing storage wastes. First, it maximizes storage utilization, which can reduce the number of inverted file merge operations and the amortized time cost of inverted file maintenance [8]. Second, it can indirectly reduce the number of chainings per term. This reduces (a) the time to search the temporary index when it is built on the fly, (b) the time to store the in-RAM inverted lists as contiguous lists (without chaining) when the partial inverted file is transferred to disk, and (c) the number of file seeks when merging two partial inverted files on disk if these files represent inverted lists as linked lists. Third, the analysis of optimizing storage wastes applies not just to linked lists but also to other dynamic data structures (e.g., balanced trees) where each node holds more than one (compressed) posting; in this case, the optimization analysis treats the storage waste of each node of these dynamic structures as a constant ε with a different value.
3. Stepwise storage allocation approach
The stepwise storage allocation approach determines the optimal node size for the incoming documents based on statistics of the current set of indexed documents. This approach optimizes the expected worst-case storage waste E(W(S(N))) after N documents are indexed (Eq. (1)).
This approach has three related storage allocation schemes. The first storage allocation scheme determines the optimal node size after indexing N documents, which is the same as the optimal node size for a static collection of N documents. This scheme is called the fixed-sized node scheme (F16). The next storage allocation scheme is called the vocabulary growth rate (VGR) scheme. It extends the formula of the F16 scheme by determining the optimal node size based on the parameter values extracted at the time a new node is allocated. The assumption is that this optimal node size remains more or less the same between the time the node is allocated and the time the node is filled (i.e., the system behavior should be smooth). Unfortunately, the VGR scheme allocates the same optimal node size at a given time instance for common terms and non-common terms, which are known to have widely different numbers of postings and different desirable node sizes. Thus, the final storage allocation scheme, called the term growth rate (TGR) scheme, determines the optimal node size for individual terms. The first two schemes optimize the expected worst-case storage waste E(W(S(N))) over all terms (as in Eq. (1)). The TGR scheme optimizes the expected worst-case storage waste E(W(S(N, tk))) for the kth term after indexing N documents, where S(n, tk) is the storage waste of the kth term after indexing n documents; the quantity E(W(S(N, tk))) is defined analogously to Eq. (1).
3.1. Fixed-sized node scheme (F16)
The fixed-sized node storage allocation scheme allocates storage to hold B postings for each new node. The overhead εp of a node allocated by this scheme is the storage for the extension pointer. Assuming that each posting occupies c1 bytes, each node requires c1B + εp bytes. The storage waste S(n, tk) for term tk after indexing n documents is the latent free storage of the last node plus the storage overhead of all the chained nodes. The latent free storage of the last node is c1(⌈df(n, tk)/B⌉B − df(n, tk)), where df(n, tk) is the number of documents that contain the kth term and ⌈·⌉ is the ceiling function. The storage overhead of all the chained nodes (including the last, unfilled node) is due to the extension pointers and amounts to εp⌈df(n, tk)/B⌉. The storage waste S(n, tk) for term tk is therefore

S(n, tk) = c1(⌈df(n, tk)/B⌉B − df(n, tk)) + εp⌈df(n, tk)/B⌉.
By disregarding the storage waste due to the latent free space, a lower bound of the storage waste can be approximated as

S(n, tk) ≥ εp⌈df(n, tk)/B⌉ ≈ εp df(n, tk)/B.
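As a worked check of these formulas, the following C sketch computes S(n, tk) for the fixed-sized node scheme directly from the expression above (the variable names are illustrative):

    #include <stdint.h>

    /* Storage waste S(n, t_k) under the fixed-sized node scheme: the latent
       free storage of the last node plus the extension-pointer overhead of
       all chained nodes. df = df(n, t_k), B = postings per node, c1 = bytes
       per posting, eps_p = bytes per extension pointer. */
    static uint64_t f16_waste(uint64_t df, uint64_t B, uint64_t c1, uint64_t eps_p) {
        uint64_t nodes  = (df + B - 1) / B;        /* ceil(df / B) */
        uint64_t latent = c1 * (nodes * B - df);   /* unfilled slots in the last node */
        return latent + eps_p * nodes;
    }

For example, with c1 = 6 bytes, εp = 4 bytes, B = 16 and df(n, tk) = 100, there are ⌈100/16⌉ = 7 nodes, the latent free storage is 6 × 12 = 72 bytes, and S(n, tk) = 72 + 4 × 7 = 100 bytes.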