MG4J (Managing Gigabytes for Java) is a free full-text search engine, written in Java, for large document collections.
The main features of MG4J are:
* Powerful indexing. Support for document collections and factories makes it possible to analyse, index and query consistently large document collections, providing easy-to-understand snippets that highlight relevant passages in the retrieved documents.
* Efficiency. We do not provide meaningless data such as "we index x GiB per second" (with which configuration? which language? which data source?); we invite you to try it instead. MG4J can index the TREC GOV2 collection without effort (document factories are provided for this purpose) and scales to hundreds of millions of documents.
* Multi-index interval semantics. When you submit a query, MG4J returns, for each index, a list of intervals satisfying the query. This provides the basis for several high-precision scorers and for very efficient implementations of sophisticated operators. The intervals are built in linear time using new research algorithms.
* Expressive operators. MG4J goes far beyond the bag-of-words model, providing efficient implementations of phrase queries, proximity restrictions, ordered conjunction, and combined multiple-index queries. Each operator is represented internally by an abstract object, so you can easily plug in your favourite syntax.
* Virtual fields. MG4J supports virtual fields—fields containing text for a different, virtual document; the typical example is anchor text, which must be attributed to the target document.
* Flexibility. You can build much smaller indices by dropping term positions, or even term counts. It's up to you. Several different types of codes can be chosen to balance efficiency and index size. Documents coming from a collection can be renumbered (e.g., to match a static rank or experiment with indexing techniques).
* Openness. The document collection/factory interfaces provide an easy way to present your own data representation to MG4J, making it a breeze to set up a web-based search engine that accesses your data directly. Every element along the path of query resolution (parsers, document-iterator builders, query engines, etc.) can be replaced with your own version.
* Distributed processing. Indices can be built for a collection split into several parts and combined later. Combined indices need not be contiguous, and even a single document can be split across different collections (e.g., when indexing anchor text).
* Multithreading. Indices can be queried and scored concurrently.
* Clustering. Indices can be clustered both lexically and documentally (possibly after a partitioning). The clustering system is completely open, and user-defined strategies decide how to combine documents from different sources. This architecture makes it possible, for instance, to load into RAM the part of an index that contains the terms appearing most frequently in user queries.
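To make the idea of interval semantics more concrete, here is a minimal, self-contained sketch of what "a list of intervals satisfying the query" can mean for a conjunction of terms inside a single document. It does not use the MG4J API and is not MG4J's actual algorithm: the class `MinimalIntervals`, its nested `Interval` type and the method `and` are hypothetical names chosen for illustration. The sketch assumes each term comes with a sorted array of token positions, and that two different terms never share a position. It reports the minimal intervals, i.e. the position ranges that contain every term and have no smaller range inside them that also does.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a hypothetical helper, not part of MG4J.
class MinimalIntervals {

    /** A closed range of token positions [left..right]. */
    static final class Interval {
        final int left;
        final int right;

        Interval(int left, int right) {
            this.left = left;
            this.right = right;
        }

        @Override
        public String toString() {
            return "[" + left + ".." + right + "]";
        }
    }

    /**
     * Returns the minimal intervals containing at least one position of
     * every term. positions[i] holds the sorted positions of term i.
     */
    static List<Interval> and(int[][] positions) {
        List<Interval> result = new ArrayList<>();
        if (positions.length == 0) return result;
        for (int[] p : positions) {
            if (p.length == 0) return result; // a missing term matches nothing
        }

        int[] head = new int[positions.length]; // current index into each list
        Interval pending = null;                // last candidate, not yet confirmed minimal

        while (true) {
            // The candidate spans from the smallest to the largest current head.
            int minList = 0;
            int lo = Integer.MAX_VALUE;
            int hi = Integer.MIN_VALUE;
            for (int i = 0; i < positions.length; i++) {
                int p = positions[i][head[i]];
                if (p < lo) { lo = p; minList = i; }
                if (p > hi) hi = p;
            }

            // Candidates have increasing left ends, so the pending one is
            // minimal exactly when the new candidate ends strictly later.
            if (pending != null && pending.right < hi) result.add(pending);
            pending = new Interval(lo, hi);

            // Advance the list that supplied the left end; stop when it runs out.
            if (++head[minList] == positions[minList].length) break;
        }
        result.add(pending); // the last candidate has nothing nested inside it
        return result;
    }

    public static void main(String[] args) {
        // Term A occurs at positions 1, 5, 9; term B at 3, 4, 12.
        System.out.println(and(new int[][] {{1, 5, 9}, {3, 4, 12}}));
        // prints [[1..3], [4..5], [9..12]]
    }
}
```

Each position is visited once, so the sweep takes time proportional to the total number of positions multiplied by the number of terms. A scorer built on such intervals could, for instance, favour documents in which the matching intervals are short.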
MG4J is free software distributed under the GNU Lesser General Public License.
homepage: http://mg4j.dsi.unimi.it/