Apache Lucene is a high-performance, full-featured text search engine library.
This post covers:
1. A simple example of how to use Lucene for indexing and searching
2. The packages that make up the Lucene API
3. The steps an application follows to use Lucene
4. Some simple example code from the Lucene demo
code detail:
(1) FileDocument.java
(2) IndexFiles.java
(3) DeleteFiles.java
(4) SearchFiles.java
1. Here's a simple example of how to use Lucene for indexing and searching (the original version uses JUnit assertions to check that the results are what we expect; here they appear as comments and the results are printed instead):
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

/**
 * @since V2.0
 * @author David.Wei
 * @date 2008-4-16
 */
public class Test {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();

        // Store the index in memory:
        Directory directory = new RAMDirectory();
        // To store an index on disk, use this instead:
        // Directory directory = FSDirectory.getDirectory("/tmp/testindex");
        IndexWriter iwriter = new IndexWriter(directory, analyzer, true);
        iwriter.setMaxFieldLength(25000);
        Document doc = new Document();
        String text = "This is the text to be indexed.";
        doc.add(new Field("fieldname", text, Field.Store.YES, Field.Index.TOKENIZED));
        iwriter.addDocument(doc);
        iwriter.optimize();
        iwriter.close();

        // Now search the index:
        IndexSearcher isearcher = new IndexSearcher(directory);
        // Parse a simple query that searches for "text":
        QueryParser parser = new QueryParser("fieldname", analyzer);
        Query query = parser.parse("text");
        Hits hits = isearcher.search(query);
        // assertEquals(1, hits.length());
        // Iterate through the results:
        for (int i = 0; i < hits.length(); i++) {
            Document hitDoc = hits.doc(i);
            // Each hit's "fieldname" value is "This is the text to be indexed."
            System.out.println(hitDoc.get("fieldname"));
        }
        isearcher.close();
        directory.close();
    }
}
2. The Lucene API is divided into several packages:
- org.apache.lucene.analysis defines an abstract Analyzer API for converting text from a java.io.Reader into a TokenStream, an enumeration of Tokens. A TokenStream is composed by applying TokenFilters to the output of a Tokenizer. A few simple implementations are provided, including StopAnalyzer and the grammar-based StandardAnalyzer (a short analysis sketch follows this list).
- org.apache.lucene.document provides a simple Document class. A document is simply a set of named Fields, whose values may be strings or instances of java.io.Reader.
- org.apache.lucene.index provides two primary classes: IndexWriter, which creates and adds documents to indices; and IndexReader, which accesses the data in the index.
- org.apache.lucene.search provides data structures to represent queries (TermQuery for individual words, PhraseQuery for phrases, and BooleanQuery for boolean combinations of queries) and the abstract Searcher which turns queries into Hits. IndexSearcher implements search over a single IndexReader.
- org.apache.lucene.queryParser uses JavaCC to implement a QueryParser.
- org.apache.lucene.store defines an abstract class for storing persistent data, the Directory, a collection of named files written by an IndexOutput and read by an IndexInput. Two implementations are provided, FSDirectory, which uses a file system directory to store files, and RAMDirectory which implements files as memory-resident data structures.
- org.apache.lucene.util contains a few handy data structures, e.g., BitVector and PriorityQueue.
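To make the analysis package concrete, here is a minimal sketch (not from the original post) that feeds a string through StandardAnalyzer and prints the resulting tokens. It assumes the same Lucene 2.x API as the example above, where TokenStream.next() returns a Token and Token.termText() exposes its text:
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class AnalysisSketch {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        // Tokenize a sample string; the field name only matters for per-field analyzers.
        TokenStream stream = analyzer.tokenStream("fieldname",
                new StringReader("The quick brown fox jumped over the lazy dog"));
        Token token;
        while ((token = stream.next()) != null) {
            // StandardAnalyzer lower-cases terms and drops stop words such as "the".
            System.out.println(token.termText());
        }
    }
}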
3. To use Lucene, an application should:
- Create Documents by adding Fields;
- Create an IndexWriter and add documents to it with addDocument();
- Call QueryParser.parse() to build a query from a string (a programmatic alternative is sketched just after this list); and
- Create an IndexSearcher and pass the query to its search() method.
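Queries do not have to go through QueryParser; they can also be built programmatically from the classes in org.apache.lucene.search listed above. A minimal sketch (not from the original post), assuming an index with a "contents" field such as the one the demo code below creates; note that TermQuery bypasses the analyzer, so the terms must already match the indexed (lower-cased) form:
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class ProgrammaticQuerySketch {
    public static void main(String[] args) throws Exception {
        // Require both terms in the "contents" field (field and terms are illustrative).
        BooleanQuery query = new BooleanQuery();
        query.add(new TermQuery(new Term("contents", "lucene")), BooleanClause.Occur.MUST);
        query.add(new TermQuery(new Term("contents", "search")), BooleanClause.Occur.MUST);

        IndexSearcher searcher = new IndexSearcher("index"); // directory created by IndexFiles below
        Hits hits = searcher.search(query);
        System.out.println(hits.length() + " matching documents for: " + query);
        searcher.close();
    }
}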
4. Some simple examples of code which does this are:
- FileDocument.java contains code to create a Document for a file.
- IndexFiles.java creates an index for all the files contained in a directory.
- DeleteFiles.java deletes some of these files from the index.
- SearchFiles.java prompts for queries and searches an index.
code detail:
(1)FileDocument.java
/**
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import java.io.File;
import java.io.FileReader;

import org.apache.lucene.document.DateTools;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

/** A utility for making Lucene Documents from a File. */
public class FileDocument {

    /**
     * Makes a document for a File.
     * <p>
     * The document has three fields:
     * <ul>
     * <li><code>path</code>--containing the pathname of the file, as a
     * stored, untokenized field;
     * <li><code>modified</code>--containing the last modified date of the
     * file as a field as created by
     * <a href="lucene.document.DateTools.html">DateTools</a>; and
     * <li><code>contents</code>--containing the full contents of the file,
     * as a Reader field;
     * </ul>
     */
    public static Document Document(File f) throws java.io.FileNotFoundException {

        // make a new, empty document
        Document doc = new Document();

        // Add the path of the file as a field named "path". Use a field that is
        // indexed (i.e. searchable), but don't tokenize the field into words.
        doc.add(new Field("path", f.getPath(), Field.Store.YES,
                Field.Index.UN_TOKENIZED));

        // Add the last modified date of the file as a field named "modified".
        // Use a field that is indexed (i.e. searchable), but don't tokenize
        // the field into words.
        doc.add(new Field("modified",
                DateTools.timeToString(f.lastModified(), DateTools.Resolution.MINUTE),
                Field.Store.YES, Field.Index.UN_TOKENIZED));

        // Add the contents of the file to a field named "contents". Specify a
        // Reader, so that the text of the file is tokenized and indexed, but
        // not stored. Note that FileReader expects the file to be in the
        // system's default encoding. If that's not the case searching for
        // special characters will fail.
        doc.add(new Field("contents", new FileReader(f)));

        // return the document
        return doc;
    }

    private FileDocument() {
    }
}
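A small usage sketch of the helper above (not part of the original demo; the file path is illustrative). Because the "contents" field is built from a Reader, it is indexed but not stored:
import java.io.File;
import org.apache.lucene.document.Document;

public class FileDocumentSketch {
    public static void main(String[] args) throws Exception {
        Document doc = FileDocument.Document(new File("docs/readme.txt"));
        System.out.println(doc.get("path"));      // stored path, e.g. docs/readme.txt
        System.out.println(doc.get("modified"));  // stored date string produced by DateTools
        System.out.println(doc.get("contents"));  // null: Reader fields are indexed, not stored
    }
}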
(2)IndexFiles.java
/**
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Date;

/** Index all text files under a directory. */
public class IndexFiles {

    private IndexFiles() {
    }

    static final File INDEX_DIR = new File("index");

    /** Index all text files under a directory. */
    public static void main(String[] args) {
        String usage = "java org.apache.lucene.demo.IndexFiles <root_directory>";
        if (args.length == 0) {
            System.err.println("Usage: " + usage);
            System.exit(1);
        }

        if (INDEX_DIR.exists()) {
            System.out.println("Cannot save index to '" + INDEX_DIR
                    + "' directory, please delete it first");
            System.exit(1);
        }

        final File docDir = new File(args[0]);
        if (!docDir.exists() || !docDir.canRead()) {
            System.out.println("Document directory '" + docDir.getAbsolutePath()
                    + "' does not exist or is not readable, please check the path");
            System.exit(1);
        }

        Date start = new Date();
        try {
            IndexWriter writer = new IndexWriter(INDEX_DIR, new StandardAnalyzer(), true);
            System.out.println("Indexing to directory '" + INDEX_DIR + "'...");
            indexDocs(writer, docDir);
            System.out.println("Optimizing...");
            writer.optimize();
            writer.close();

            Date end = new Date();
            System.out.println(end.getTime() - start.getTime() + " total milliseconds");
        } catch (IOException e) {
            System.out.println(" caught a " + e.getClass()
                    + "\n with message: " + e.getMessage());
        }
    }

    static void indexDocs(IndexWriter writer, File file) throws IOException {
        // do not try to index files that cannot be read
        if (file.canRead()) {
            if (file.isDirectory()) {
                String[] files = file.list();
                // an IO error could occur
                if (files != null) {
                    for (int i = 0; i < files.length; i++) {
                        indexDocs(writer, new File(file, files[i]));
                    }
                }
            } else {
                System.out.println("adding " + file);
                try {
                    writer.addDocument(FileDocument.Document(file));
                }
                // at least on windows, some temporary files raise this
                // exception with an "access denied" message
                // checking if the file can be read doesn't help
                catch (FileNotFoundException fnfe) {
                    ;
                }
            }
        }
    }
}
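IndexFiles is meant to be run from the command line (java IndexFiles <root_directory>), but its indexDocs() helper can also be reused directly. A minimal sketch (not from the original demo), assuming the caller sits in the same default package as IndexFiles since indexDocs() is package-private; the paths are illustrative:
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class IndexFilesSketch {
    public static void main(String[] args) throws Exception {
        // Write the index to a caller-chosen directory instead of the hard-coded ./index.
        IndexWriter writer = new IndexWriter("/tmp/myindex", new StandardAnalyzer(), true);
        IndexFiles.indexDocs(writer, new File("docs"));
        writer.optimize();
        writer.close();
    }
}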
(3)DeleteFiles.java
/**
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

/** Deletes documents from an index that contain a given term. */
public class DeleteFiles {

    private DeleteFiles() {
    } // singleton

    /** Deletes documents from an index that contain a given term. */
    public static void main(String[] args) {
        String usage = "java org.apache.lucene.demo.DeleteFiles <unique_term>";
        if (args.length == 0) {
            System.err.println("Usage: " + usage);
            System.exit(1);
        }
        try {
            Directory directory = FSDirectory.getDirectory("index");
            IndexReader reader = IndexReader.open(directory);

            Term term = new Term("path", args[0]);
            int deleted = reader.deleteDocuments(term);

            System.out.println("deleted " + deleted + " documents containing " + term);

            // one can also delete documents by their internal id:
            // for (int i = 0; i < reader.maxDoc(); i++) {
            //     System.out.println("Deleting document with id " + i);
            //     reader.delete(i);
            // }

            reader.close();
            directory.close();
        } catch (Exception e) {
            System.out.println(" caught a " + e.getClass()
                    + "\n with message: " + e.getMessage());
        }
    }
}
(4)SearchFiles.java
/**
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.FilterIndexReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Date;

/** Simple command-line based search demo. */
public class SearchFiles {

    /**
     * Use the norms from one field for all fields. Norms are read into memory,
     * using a byte of memory per document per searched field. This can cause
     * search of large collections with a large number of fields to run out of
     * memory. If all of the fields contain only a single token, then the norms
     * are all identical and a single norm vector may be shared.
     */
    private static class OneNormsReader extends FilterIndexReader {
        private String field;

        public OneNormsReader(IndexReader in, String field) {
            super(in);
            this.field = field;
        }

        public byte[] norms(String field) throws IOException {
            return in.norms(this.field);
        }
    }

    private SearchFiles() {
    }

    /** Simple command-line based search demo. */
    public static void main(String[] args) throws Exception {
        String usage = "Usage: java org.apache.lucene.demo.SearchFiles [-index dir] [-field f] [-repeat n] [-queries file] [-raw] [-norms field]";
        if (args.length > 0 && ("-h".equals(args[0]) || "-help".equals(args[0]))) {
            System.out.println(usage);
            System.exit(0);
        }

        String index = "index";
        String field = "contents";
        String queries = null;
        int repeat = 0;
        boolean raw = false;
        String normsField = null;

        for (int i = 0; i < args.length; i++) {
            if ("-index".equals(args[i])) {
                index = args[i + 1];
                i++;
            } else if ("-field".equals(args[i])) {
                field = args[i + 1];
                i++;
            } else if ("-queries".equals(args[i])) {
                queries = args[i + 1];
                i++;
            } else if ("-repeat".equals(args[i])) {
                repeat = Integer.parseInt(args[i + 1]);
                i++;
            } else if ("-raw".equals(args[i])) {
                raw = true;
            } else if ("-norms".equals(args[i])) {
                normsField = args[i + 1];
                i++;
            }
        }

        IndexReader reader = IndexReader.open(index);

        if (normsField != null)
            reader = new OneNormsReader(reader, normsField);

        Searcher searcher = new IndexSearcher(reader);
        Analyzer analyzer = new StandardAnalyzer();

        BufferedReader in = null;
        if (queries != null) {
            in = new BufferedReader(new FileReader(queries));
        } else {
            in = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
        }
        QueryParser parser = new QueryParser(field, analyzer);

        while (true) {
            if (queries == null) // prompt the user
                System.out.println("Enter query: ");

            String line = in.readLine();

            if (line == null || line.length() == -1)
                break;

            line = line.trim();
            if (line.length() == 0)
                break;

            Query query = parser.parse(line);
            System.out.println("Searching for: " + query.toString(field));

            Hits hits = searcher.search(query);

            if (repeat > 0) { // repeat & time as benchmark
                Date start = new Date();
                for (int i = 0; i < repeat; i++) {
                    hits = searcher.search(query);
                }
                Date end = new Date();
                System.out.println("Time: " + (end.getTime() - start.getTime()) + "ms");
            }

            System.out.println(hits.length() + " total matching documents");

            final int HITS_PER_PAGE = 10;
            for (int start = 0; start < hits.length(); start += HITS_PER_PAGE) {
                int end = Math.min(hits.length(), start + HITS_PER_PAGE);
                for (int i = start; i < end; i++) {
                    if (raw) { // output raw format
                        System.out.println("doc=" + hits.id(i) + " score=" + hits.score(i));
                        continue;
                    }

                    Document doc = hits.doc(i);
                    String path = doc.get("path");
                    if (path != null) {
                        System.out.println((i + 1) + ". " + path);
                        String title = doc.get("title");
                        if (title != null) {
                            System.out.println("   Title: " + doc.get("title"));
                        }
                    } else {
                        System.out.println((i + 1) + ". " + "No path for this document");
                    }
                }

                if (queries != null) // non-interactive
                    break;

                if (hits.length() > end) {
                    System.out.println("more (y/n) ? ");
                    line = in.readLine();
                    if (line.length() == 0 || line.charAt(0) == 'n')
                        break;
                }
            }
        }
        reader.close();
    }
}
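Finally, a stripped-down, non-interactive variant of SearchFiles (not part of the original demo) that runs a single query against the index built by IndexFiles; the index path and query string are illustrative:
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class QuickSearch {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("index");  // on-disk index built by IndexFiles
        QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
        Query query = parser.parse("lucene");                 // illustrative query string
        Hits hits = searcher.search(query);
        System.out.println(hits.length() + " total matching documents");
        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);
            System.out.println((i + 1) + ". " + doc.get("path"));
        }
        searcher.close();
    }
}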