
Lucene: Search Engine Architecture

 


Components for indexing


 ACQUIRE CONTENT

 

The first step, at the bottom of figure 1.4, is to acquire content. This process, which may involve a crawler or spider, gathers and scopes the content that needs to be indexed. That may be trivial, for example, if you’re indexing a set of XML files that reside in a specific directory in the file system or if all your content resides in a well-organized database. Alternatively, it may be horribly complex and messy if the content is scattered in all sorts of places (file systems, content management systems, Microsoft Exchange, Lotus Domino, various websites, databases, local XML files, CGI scripts running on intranet servers, and so forth).
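For the trivial case described above (XML files under one directory), content acquisition can be sketched in a few lines. This is only an illustration using Python's standard library; the function name `acquire_xml_files` is hypothetical, not something from Lucene.

```python
from pathlib import Path


def acquire_xml_files(root):
    """Return every .xml file under root (recursively), in a stable sorted order.

    Hypothetical helper for illustration; a real crawler would also handle
    permissions, symlinks, and incremental updates.
    """
    return sorted(p for p in Path(root).rglob("*.xml") if p.is_file())
```

Each returned path would then be handed to the Build Document step. The messy cases (mail servers, content management systems, websites) need dedicated connectors or a crawler instead.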

 

BUILD DOCUMENT

 

Once you have the raw content that needs to be indexed, you must translate the content into the units (usually called documents) used by the search engine. The document typically consists of several separately named fields with values, such as title, body, abstract, author, and url. You’ll have to carefully design how to divide the raw content into documents and fields as well as how to compute the value for each of those fields. Often the approach is obvious: one email message becomes one document, or one PDF file or web page is one document.

 

Once you’ve worked out this design, you’ll need to extract text from the original raw content for each document. If your content is already textual in nature, with a known standard encoding, your job is simple. But more often these days documents are binary in nature (PDF, Microsoft Office, Open Office, Adobe Flash, streaming video and audio multimedia files) or contain substantial markup that you must remove before indexing (RDF, XML, HTML). You’ll need to run document filters to extract text from such content before creating the search engine document.
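As a rough idea of what a document filter does for markup-heavy content, here is a minimal sketch that strips HTML tags with Python's standard `html.parser`. The names `TextExtractor` and `extract_text` are made up for this example; binary formats such as PDF need a real filter library and are not covered here.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes; ignores tags and the contents of script/style."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join the pieces and collapse runs of whitespace to single spaces.
        return " ".join(" ".join(self._chunks).split())


def extract_text(html):
    """Hypothetical filter: return the visible text of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

The extracted string is what would go into a field such as body; the tags themselves never reach the index.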

 

Interesting business logic may also apply during this step to create additional fields. For example, if you have a large “body text” field, you might run semantic analyzers to pull out proper names, places, dates, times, locations, and so forth into separate fields in the document. Or perhaps you tie in content available in a separate store (such as a database) and merge it into a single document for the search engine.
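A very small version of this idea: pull ISO-style dates out of the body text and store them in their own field. This is a sketch under simple assumptions; the regular expression only recognizes `YYYY-MM-DD`, and the field names and the `build_document` helper are illustrative, not part of any Lucene API.

```python
import re

# Assumption: dates appear as YYYY-MM-DD; real semantic analyzers do far more.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def build_document(title, body):
    """Return a document as a dict of named fields, with extracted dates."""
    return {
        "title": title,
        "body": body,
        # Derived field: unique dates found in the body, sorted.
        "dates": sorted(set(DATE_PATTERN.findall(body))),
    }
```

Keeping the derived values in a separate field lets a search be restricted to dates without scanning the whole body.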

 

Another common part of building the document is to inject boosts to individual documents and fields that are deemed more or less important.

 

Lucene provides an API for building fields and documents, but it doesn’t provide any logic to build a document because that’s entirely application specific. It also doesn’t provide any document filters, although Lucene has a sister project at Apache, Tika, which handles document filtering very well (see chapter 7). If your content resides in a database, projects like DBSight, Hibernate Search, LuSQL, Compass, and Oracle/Lucene integration make indexing and searching your tables simple by handling the Acquire Content and Build Document steps seamlessly.

 

ANALYZE DOCUMENT

 

No search engine indexes text directly: rather, the text must be broken into a series of individual atomic elements called tokens. This is what happens during the Analyze Document step. Each token corresponds roughly to a “word” in the language, and this step determines how the textual fields in the document are divided into a series of tokens.
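To make the notion of a token concrete, here is a deliberately naive analyzer: it lowercases the text and splits it on anything that is not a letter, digit, or underscore. The function name `analyze` is hypothetical, and real analyzers typically add steps such as stop-word removal and stemming.

```python
import re


def analyze(text):
    """Naive analysis sketch: lowercase, then keep runs of word characters."""
    return re.findall(r"\w+", text.lower())
```

For example, `analyze("The Quick-brown fox's tail!")` yields `['the', 'quick', 'brown', 'fox', 's', 'tail']`. Note how the hyphen and the apostrophe both split tokens; whether that is desirable is exactly the kind of decision this step has to make.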
 

INDEX DOCUMENT

 

During the indexing step, the document is added to the index.
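The usual data structure behind this step is an inverted index, which maps each token to the documents that contain it. The sketch below is an in-memory toy, assuming whitespace-free word tokens and integer document IDs; the class and method names are invented for illustration and say nothing about how Lucene lays out its index on disk.

```python
import re
from collections import defaultdict


class InvertedIndex:
    """Toy in-memory inverted index: token -> set of document IDs."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        # Analyze the text into lowercase word tokens, then record the doc ID
        # under each token.
        for token in re.findall(r"\w+", text.lower()):
            self.postings[token].add(doc_id)

    def lookup(self, term):
        """Return the sorted IDs of documents containing the term."""
        return sorted(self.postings.get(term.lower(), set()))
```

Looking up a term is then a single dictionary access rather than a scan over every document, which is the whole point of building the index ahead of time.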

---------------------------------------------------

Components for searching

 

SEARCH USER INTERFACE

The user interface is what users actually see, in the web browser, desktop application, or mobile device, when they interact with your search application. The UI is the most important part of your search application!

 

Keep the interface simple: don’t present a lot of advanced options on the first page. Provide a ubiquitous, prominent search box, visible everywhere, rather than requiring a two-step process of first clicking a search link and then entering the search text (this is a common mistake).

 

Once a user interacts with your search interface, she or he submits a search request, which first must be translated into an appropriate Query object for the search engine.

 

BUILD QUERY

 

When you manage to entice a user to use your search application, she or he issues a search request, often as the result of an HTML form or Ajax request submitted by a browser to your server. You must then translate the request into the search engine’s Query object.

 

The query may contain Boolean operations, phrase queries (in double quotes), or wildcard terms. If your application has further controls on the search UI, or further interesting constraints, you must implement logic to translate this into the equivalent query.
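As a sketch of the translation step, the snippet below splits a raw search string into quoted phrases and single terms. It is a simplified stand-in, not Lucene's query parser: it ignores Boolean operators and wildcards, and the `parse_query` name and the returned dict layout are assumptions made for this example.

```python
import re

# Either a double-quoted phrase (group 1) or a run of non-space characters (group 2).
QUERY_TOKEN = re.compile(r'"([^"]+)"|(\S+)')


def parse_query(query_string):
    """Split a search string into phrase clauses and single-term clauses."""
    phrases, terms = [], []
    for phrase, term in QUERY_TOKEN.findall(query_string):
        if phrase:
            phrases.append(phrase.lower().split())
        else:
            terms.append(term.lower())
    return {"phrases": phrases, "terms": terms}
```

For instance, `parse_query('lucene "search engine" Java')` returns one phrase, `['search', 'engine']`, and the terms `['lucene', 'java']`. Extra UI controls, such as a date-range picker, would add further clauses to this structure.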

 

SEARCH QUERY

 

Search Query is the process of consulting the search index and retrieving the documents matching the Query, sorted in the requested sort order. Lucene is also wonderfully extensible at this point, so if you’d like to customize how results are gathered, filtered, sorted, and so forth, it’s straightforward.
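In miniature, "retrieve the matches, then sort them as requested" might look like the following. This is a linear scan over dicts, used only to show the shape of the operation; a real engine consults the index instead, and the `search` function and its field names are hypothetical.

```python
def search(documents, terms, sort_key, reverse=False):
    """Return documents whose body contains every term, sorted by sort_key.

    documents: list of dicts with at least a "body" field and the sort field.
    """
    wanted = {t.lower() for t in terms}
    hits = [d for d in documents if wanted <= set(d["body"].lower().split())]
    return sorted(hits, key=lambda d: d[sort_key], reverse=reverse)
```

Passing a different `sort_key`, for example a date field with `reverse=True` for newest first, changes the order without changing which documents match.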

 

There are three common theoretical models of search:

  • Pure Boolean model: Documents either match or don’t match the provided query, and no scoring is done. In this model there are no relevance scores associated with matching documents, and the matching documents are unordered; a query simply identifies a subset of the overall corpus as matching the query.
  • Vector space model: Both queries and documents are modeled as vectors in a high-dimensional space, where each unique term is a dimension. Relevance, or similarity, between a query and a document is computed by a vector distance measure between these vectors.
  • Probabilistic model: In this model, you compute the probability that a document is a good match to a query using a full probabilistic approach.
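The vector space model is the easiest of the three to show in code. The sketch below represents each text as a vector of raw term counts and ranks documents by cosine similarity to the query. Cosine similarity and raw counts are choices made for this example (weighting schemes such as TF-IDF are common in practice), and the function names are hypothetical.

```python
import math
from collections import Counter


def to_vector(text):
    """Term-count vector: each unique term is one dimension."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return IDs of documents with a nonzero score, best match first."""
    q = to_vector(query)
    scored = [(cosine_similarity(q, to_vector(text)), doc_id)
              for doc_id, text in documents.items()]
    # Sort by descending score, breaking ties by document ID.
    return [doc_id for score, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))
            if score > 0]
```

By contrast, a pure Boolean search would return the same matching documents as an unordered set, with no score attached to any of them.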

RENDER RESULTS

 

Once you have the raw set of documents that match the query, sorted in the right order, you then render them to the user in an intuitive, consumable manner. The UI should also offer a clear path for follow-on searches or actions, such as clicking to the next page, refining the search, or finding documents similar to one of the matches, so that the user never hits a dead end.
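One small piece of rendering is paging. The sketch below slices a sorted hit list into one page and reports whether a next page exists, so the UI can decide whether to show a "next" link. The `render_page` name, the 1-based page number, and the returned dict are assumptions for illustration.

```python
def render_page(hits, page, page_size=10):
    """Return display lines for one page of results plus paging metadata.

    hits: list of result titles already sorted in display order.
    page: 1-based page number.
    """
    start = (page - 1) * page_size
    window = hits[start:start + page_size]
    # Number results continuously across pages (page 2 starts at 11, and so on).
    lines = [f"{start + i + 1}. {title}" for i, title in enumerate(window)]
    return {
        "lines": lines,
        "has_next": start + page_size < len(hits),
        "total": len(hits),
    }
```

Returning `has_next` and `total` alongside the lines gives the UI what it needs to offer a follow-on action on every page.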

 
