Lucene: Introduction to Lucene (Part I)

1. Why do we use Lucene?

    1) Suppose we run a query like this in a database:

        (content like '%DataStructure%') or (content like '%XMU%')

        The database then searches the whole content of every row from start to end, which is inefficient.

        Lucene builds an index over the whole content. To run the query above, we only have to search the index files and not the real content, which is much more efficient.

    2) If we want to search the content of attachments, that is not practical with plain database queries.
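The difference between scanning the content and searching an index can be sketched in plain Java. This is only an illustration of the idea of an inverted index, not Lucene's actual data structure; the class name InvertedIndexSketch, its methods and the sample documents are made up for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class InvertedIndexSketch {
    /** Maps each lower-cased term to the sorted ids of the documents containing it. */
    static Map<String, Set<Integer>> build(List<String> docs) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String term : docs.get(id).toLowerCase().split("\\W+")) {
                if (!term.isEmpty()) {
                    index.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
                }
            }
        }
        return index;
    }

    /** Index lookup: one hash probe, no matter how long the documents are. */
    static Set<Integer> search(Map<String, Set<Integer>> index, String term) {
        return index.getOrDefault(term.toLowerCase(), new TreeSet<>());
    }

    /** Rough equivalent of LIKE '%term%': reads the full text of every document. */
    static List<Integer> scan(List<String> docs, String term) {
        List<Integer> hits = new ArrayList<>();
        for (int id = 0; id < docs.size(); id++) {
            if (docs.get(id).toLowerCase().contains(term.toLowerCase())) {
                hits.add(id);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
                "DataStructure notes for XMU students",
                "Lucene builds an index",
                "XMU campus guide");
        Map<String, Set<Integer>> index = build(docs);
        System.out.println(search(index, "XMU"));           // prints [0, 2]
        System.out.println(search(index, "DataStructure")); // prints [0]
        System.out.println(scan(docs, "XMU"));              // prints [0, 2]
    }
}
```

Note that the index matches whole terms while the scan matches any substring, so the two are not interchangeable in general. The "or" query from the example above would be the union of the two result sets.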

 

2. The versions of Lucene

    1) 2.9-Core

    2) 3.0-Core --> Differs significantly from 2.9

    3) 3.5-Core --> Differs significantly from 3.0 in several places

 

3. Every full-text indexing tool consists of three parts:

    1) Index part ---> What kind of information should be stored in the index files?

                         ---> E.g. for "I am a boy.", should 'a' be stored in the index files?

    2) Analysis (tokenization) part ---> How should a sentence be broken into terms?

    3) Search part ---> How should a query be looked up in the index files?
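The analysis question can be made concrete with a small plain-Java sketch. It is not Lucene's StandardAnalyzer; the class name AnalyzerSketch and the three-word stop list are assumptions chosen for illustration. It lower-cases the sentence, splits it into terms, and drops stop words such as 'a'.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class AnalyzerSketch {
    // Hypothetical stop-word list for illustration only;
    // real analyzers ship their own lists.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "the");

    /** Breaks a sentence into lower-cased terms and removes stop words. */
    static List<String> analyze(String sentence) {
        List<String> terms = new ArrayList<>();
        for (String token : sentence.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("I am a boy.")); // prints [i, am, boy]
    }
}
```

With this stop list, 'a' never reaches the index, so the index is smaller but a search for 'a' finds nothing.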

 

4. An Example of Creating an Index Using Lucene

    1. Core function

package edu.xmu.lucene.Lucene_ModuleOne;

import java.io.File;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.util.Version;

/**
 * Hello world!
 * 
 */
public class App
{
	/**
	 * Create Index
	 * 
	 * @throws IOException
	 * @throws LockObtainFailedException
	 * @throws CorruptIndexException
	 */
	public void buildIndex() throws CorruptIndexException,
			LockObtainFailedException, IOException
	{
		// 1. Create Directory
		// --> Where should the index be stored? In memory or on the hard disk?
		// Directory dir = new RAMDirectory(); --> index files stored in memory
		Directory dir = FSDirectory.open(new File("E:/LuceneIndex"));

		// 2. Create IndexWriter
		// --> It is used to write data into index files
		IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35,
				new StandardAnalyzer(Version.LUCENE_35));
		IndexWriter writer = new IndexWriter(dir, config);
		// In earlier releases the writer was created like this (now deprecated):
		// new IndexWriter(Directory d, Analyzer a, boolean c, MaxFieldLength
		// mfl);
		// d: Directory, a: Analyzer, c: should a new index be created each time
		// mfl: the maximum length of a field to be indexed.

		// 3. Create Document
		// --> The target we want to search may be a file or a table in a DB.
		// --> For a file: its path, name, size and modified date.
		// --> All of this information is put into the Document.
		Document doc = null;

		// 4. Each item of the Document is called a Field.
		// --> A document relates to its fields like a table row to its cells.

		// E.g. we want to index all the txt files in the E:/LuceneData dir.
		// Each txt file in this dir is a document,
		// and its name, path and content are fields.
		File[] files = new File("E:/LuceneData").listFiles();
		try
		{
			// listFiles() returns null if the directory does not exist
			for (File file : files == null ? new File[0] : files)
			{
				if (!file.isFile())
				{
					continue; // skip sub-directories
				}
				doc = new Document();
				// A Reader-valued field is analyzed and indexed but not stored
				doc.add(new Field("content", new FileReader(file)));
				// Field.Store.YES --> The field value is stored in the index
				// Field.Index.NOT_ANALYZED --> The value is indexed as a single
				// term; Field.Index.ANALYZED would tokenize it first
				doc.add(new Field("name", file.getName(), Field.Store.YES,
						Field.Index.NOT_ANALYZED));
				doc.add(new Field("path", file.getAbsolutePath(),
						Field.Store.YES, Field.Index.NOT_ANALYZED));

				// 5. Add the document to the index through the IndexWriter.
				writer.addDocument(doc);
			}
		} finally
		{
			// 6. Close the IndexWriter even if indexing fails
			writer.close();
		}
	}
}

   2. Test Case

package edu.xmu.lucene.Lucene_ModuleOne;

import java.io.IOException;

import org.junit.Test;

/**
 * Unit test for simple App.
 */
public class AppTest
{
	@Test
	public void buildIndex() throws IOException
	{
		// Let exceptions propagate so that JUnit reports a failure
		// instead of printing a stack trace and passing.
		new App().buildIndex();
	}
}
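
The Field.Store and Field.Index options used in buildIndex() control two separate things: whether a value can be read back from a hit, and whether (and how) it can be searched. The following plain-Java sketch models that distinction. It is not Lucene code; the class StoredVsIndexedSketch and its methods are hypothetical names for this illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class StoredVsIndexedSketch {
    // "field:term" -> ids of matching documents (the searchable side)
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // doc id -> stored field values (the retrievable side)
    private final Map<Integer, Map<String, String>> stored = new HashMap<>();

    /** store ~ Field.Store.YES/NO, analyze ~ Field.Index.ANALYZED/NOT_ANALYZED. */
    void add(int docId, String field, String value, boolean store, boolean analyze) {
        if (store) {
            stored.computeIfAbsent(docId, id -> new HashMap<>()).put(field, value);
        }
        // Analyzed values are split into lower-cased terms;
        // non-analyzed values are indexed verbatim as one term.
        String[] terms = analyze
                ? value.toLowerCase().split("\\W+")
                : new String[] { value };
        for (String term : terms) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(field + ":" + term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    Set<Integer> search(String field, String term) {
        return postings.getOrDefault(field + ":" + term, new TreeSet<>());
    }

    /** Returns the stored value, or null if the field was not stored. */
    String get(int docId, String field) {
        Map<String, String> fields = stored.get(docId);
        return fields == null ? null : fields.get(field);
    }

    public static void main(String[] args) {
        StoredVsIndexedSketch index = new StoredVsIndexedSketch();
        index.add(0, "content", "Java and Lucene", false, true); // searchable only
        index.add(0, "name", "Notes.txt", true, false);          // stored, one term

        System.out.println(index.search("content", "java")); // prints [0]
        System.out.println(index.get(0, "content"));         // prints null
        System.out.println(index.get(0, "name"));            // prints Notes.txt
        System.out.println(index.search("name", "notes"));   // prints []
    }
}
```

In this model the "content" field behaves like the Reader-based field above (searchable, but nothing to read back), while "name" can be read back but only matches its exact, unmodified value.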

 

5. An Example of Querying Using Index Files

    1. Core Function of Query

package edu.xmu.lucene.Lucene_ModuleOne;

import java.io.File;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.util.Version;

/**
 * Hello world!
 * 
 */
public class App
{
	/**
	 * Create Index
	 * 
	 * @throws IOException
	 * @throws LockObtainFailedException
	 * @throws CorruptIndexException
	 */
	public void buildIndex() throws CorruptIndexException,
			LockObtainFailedException, IOException
	{
		// 1. Create Directory
		// --> Where should the index be stored? In memory or on the hard disk?
		// Directory dir = new RAMDirectory(); --> index files stored in memory
		Directory dir = FSDirectory.open(new File("E:/LuceneIndex"));

		// 2. Create IndexWriter
		// --> It is used to write data into index files
		IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35,
				new StandardAnalyzer(Version.LUCENE_35));
		IndexWriter writer = new IndexWriter(dir, config);
		// In earlier releases the writer was created like this (now deprecated):
		// new IndexWriter(Directory d, Analyzer a, boolean c, MaxFieldLength
		// mfl);
		// d: Directory, a: Analyzer, c: should a new index be created each time
		// mfl: the maximum length of a field to be indexed.

		// 3. Create Document
		// --> The target we want to search may be a file or a table in a DB.
		// --> For a file: its path, name, size and modified date.
		// --> All of this information is put into the Document.
		Document doc = null;

		// 4. Each item of the Document is called a Field.
		// --> A document relates to its fields like a table row to its cells.

		// E.g. we want to index all the txt files in the E:/LuceneData dir.
		// Each txt file in this dir is a document,
		// and its name, path and content are fields.
		File[] files = new File("E:/LuceneData").listFiles();
		try
		{
			// listFiles() returns null if the directory does not exist
			for (File file : files == null ? new File[0] : files)
			{
				if (!file.isFile())
				{
					continue; // skip sub-directories
				}
				doc = new Document();
				// A Reader-valued field is analyzed and indexed but not stored
				doc.add(new Field("content", new FileReader(file)));
				// Field.Store.YES --> The field value is stored in the index
				// Field.Index.NOT_ANALYZED --> The value is indexed as a single
				// term; Field.Index.ANALYZED would tokenize it first
				doc.add(new Field("name", file.getName(), Field.Store.YES,
						Field.Index.NOT_ANALYZED));
				doc.add(new Field("path", file.getAbsolutePath(),
						Field.Store.YES, Field.Index.NOT_ANALYZED));

				// 5. Add the document to the index through the IndexWriter.
				writer.addDocument(doc);
			}
		} finally
		{
			// 6. Close the IndexWriter even if indexing fails
			writer.close();
		}
	}
	
	/**
	 * Search
	 * @throws IOException 
	 * @throws ParseException 
	 */
	public void search() throws IOException, ParseException
	{
		// 1. Create Directory
		Directory dir = FSDirectory.open(new File("E:/LuceneIndex"));
		
		// 2. Create IndexReader
		IndexReader reader = IndexReader.open(dir);
		
		// 3. Create IndexSearcher using IndexReader
		IndexSearcher searcher = new IndexSearcher(reader);
		
		try
		{
			// 4. Create the query
			// Find the documents whose content contains the keyword 'java'
			QueryParser parser = new QueryParser(Version.LUCENE_35, "content",
					new StandardAnalyzer(Version.LUCENE_35));
			Query query = parser.parse("java");

			// 5. Execute the query and get the TopDocs
			// param1: the query to be executed
			// param2: the maximum number of results to return
			TopDocs topDocs = searcher.search(query, 10);

			// 6. Get the ScoreDocs from the TopDocs
			ScoreDoc[] docs = topDocs.scoreDocs;
			System.out.println("Hits: " + docs.length);
			for (ScoreDoc scoreDoc : docs)
			{
				// 7. Get the Document using the searcher and the ScoreDoc
				Document d = searcher.doc(scoreDoc.doc);

				// 8. Read the stored fields from the Document
				System.out.println("File Path : " + d.get("path"));
			}
		} finally
		{
			// 9. Close the searcher and reader even if the search fails
			searcher.close();
			reader.close();
		}
	}
}

     2. Test Case

package edu.xmu.lucene.Lucene_ModuleOne;

import java.io.IOException;

import org.apache.lucene.queryParser.ParseException;
import org.junit.Test;

/**
 * Unit test for simple App.
 */
public class AppTest
{
	@Test
	public void buildIndex() throws IOException
	{
		// Let exceptions propagate so that JUnit reports a failure
		// instead of printing a stack trace and passing.
		new App().buildIndex();
	}

	@Test
	public void search() throws IOException, ParseException
	{
		new App().search();
	}
}

 
