The goal is to index a set of commonly used doc, txt, and html documents. The key observation is that Lucene only needs each doc or html file stripped down to a plain string that still carries the meaningful text. Following the Lucene demo, I created DOCDocument.java, whose getText() uses POI for extraction, and factored out an abstract base class FileDocument that dynamically loads one handler class per file type, named by the convention suffix + "Document".

The base class FileDocument:
```java
package bts.jsp.kbase;

import java.io.*;
import java.util.Map;
import java.util.HashMap;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public abstract class FileDocument {

    static Map<String, FileDocument> DocmentMap;

    static {
        try {
            DocmentMap = init();
        } catch (Exception e) {
            e.printStackTrace();
            System.exit(1);
        }
    }

    // one handler instance per suffix, loaded by naming convention: SUFFIX + "Document"
    private static Map<String, FileDocument> init() throws Exception {
        Map<String, FileDocument> map = new HashMap<String, FileDocument>();
        for (String t : KbaseConfig.TYPES) {
            map.put(t, (FileDocument) Class.forName(
                    "bts.jsp.kbase." + t.toUpperCase() + "Document").newInstance());
        }
        return map;
    }

    // re-extracts the raw text for a path; SearchFiles needs this for highlighting
    // (it was commented out in the original post, but is called below)
    public static String getCommonContent(String path) {
        if (!KbaseConfig.accceptFile(path)) return null;
        String subtype = path.substring(path.lastIndexOf(".") + 1);
        return DocmentMap.get(subtype.toLowerCase()).getTextContent(path);
    }

    public static String getCacheStringContent(String path) {
        String stringPath = KbaseConfig.getCacheStringPath(path);
        System.out.println("get cache :" + stringPath);
        StringBuffer sb = new StringBuffer("");
        try {
            BufferedReader reader = new BufferedReader(new FileReader(stringPath));
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append("\n");
            }
            reader.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return sb.toString();
    }

    // extract the text once and cache it as a plain-string mirror under DATA_STRING_DIR
    private String cacheAndGetStringContent(String path) {
        String content = this.getTextContent(path);
        String stringPath = KbaseConfig.getCacheStringPath(path);
        System.out.println(stringPath);
        {
            String dir = stringPath.substring(0, stringPath.lastIndexOf("/"));
            File f = new File(dir);
            if (!f.exists()) f.mkdirs();
        }
        try {
            PrintWriter pw = new PrintWriter(stringPath);
            pw.println(content);
            pw.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return content;
    }

    public static Document getCommonDocument(String path) {
        if (!KbaseConfig.accceptFile(path)) return null;
        String subtype = path.substring(path.lastIndexOf(".") + 1);
        try {
            // lower-case the suffix so "DOC" and "doc" hit the same handler
            return DocmentMap.get(subtype.toLowerCase()).Document(new File(path));
        } catch (Exception e) {
            e.printStackTrace();
        }
        return null;
    }

    public abstract String getTextContent(String path);

    public Document Document(File f) throws java.io.FileNotFoundException {
        // make a new, empty document
        Document doc = new Document();

        // Add the path of the file as a field named "path". Use a field that is
        // indexed (i.e. searchable), but don't tokenize it into words.
        doc.add(new Field("path", f.getPath(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("title", f.getName(), Field.Store.YES, Field.Index.NOT_ANALYZED));

        // Add the last modified time as a field named "modified", also untokenized.
        doc.add(new Field("modified", f.lastModified() + "", Field.Store.YES, Field.Index.NOT_ANALYZED));

        // Add the extracted text to a field named "contents": tokenized and
        // indexed with term vectors, but not stored (the string cache holds it).
        String content = cacheAndGetStringContent(f.getPath());
        doc.add(new Field("contents", content, Field.Store.NO, Field.Index.ANALYZED,
                Field.TermVector.WITH_POSITIONS_OFFSETS));

        // return the document
        return doc;
    }
}
```
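The Class.forName dispatch in init() can be illustrated standalone. All class names below are made up for the sketch; the real code maps each suffix in KbaseConfig.TYPES to a bts.jsp.kbase SUFFIXDocument instance:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-ins for the per-suffix handler classes.
abstract class Extractor {
    abstract String describe();
}

class TXTExtractor extends Extractor {
    String describe() { return "plain text"; }
}

public class HandlerRegistry {
    static final Map<String, Extractor> MAP = new HashMap<String, Extractor>();

    static {
        try {
            // dynamic loading by naming convention, as FileDocument.init() does
            // with "bts.jsp.kbase." + suffix.toUpperCase() + "Document"
            for (String suffix : new String[]{"txt"}) {
                MAP.put(suffix, (Extractor) Class.forName(
                        suffix.toUpperCase() + "Extractor").newInstance());
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // pick the handler by the (lower-cased) file suffix
    static Extractor forPath(String path) {
        String suffix = path.substring(path.lastIndexOf('.') + 1).toLowerCase();
        return MAP.get(suffix);
    }
}
```

Supporting a new file type then only requires dropping in a new class that follows the naming convention and adding its suffix to the type list.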
For example, the doc suffix is handled by the class DOCDocument:
```java
package bts.jsp.kbase;

import java.io.FileInputStream;

import org.apache.poi.hwpf.extractor.WordExtractor;

/**
 * Text extraction for the "doc" suffix, using POI's WordExtractor.
 */
public class DOCDocument extends FileDocument {

    public String getTextContent(String path) {
        String content = "";
        try {
            WordExtractor wordExtractor = new WordExtractor(new FileInputStream(path));
            content = wordExtractor.getText();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return content;
    }
}
```
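The same convention implies a TXTDocument for the txt suffix, which the post does not show; for plain text the "strip to a string" step is just reading the file. A minimal sketch (hypothetical class, written here without the FileDocument base so it stands on its own):

```java
import java.io.*;

// Hypothetical TXTDocument-style extractor: for plain-text suffixes,
// extraction is simply reading the file line by line.
public class TxtExtractor {
    public static String getTextContent(String path) {
        StringBuilder sb = new StringBuilder();
        try {
            BufferedReader reader = new BufferedReader(new FileReader(path));
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append("\n");
            }
            reader.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return sb.toString();
    }
}
```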
The next problem is avoiding duplicate indexing, and re-indexing files after they change. This is where the file's lastModified attribute comes in: after each indexing run, save every indexed file together with its last-modified time; before the next run, check that snapshot and only truly index files that are new or have been updated.

In other words, each run works out which documents to delete and which to add (updating a Lucene index is implemented as a delete followed by an add).
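The delete/add decision can be sketched as a standalone helper. This is a hypothetical rewrite, not the post's code: it keys the snapshots by path in maps, whereas KbaseConfig below does the same comparison with nested list scans:

```java
import java.util.*;

// Hypothetical helper: given two (path -> lastModified) snapshots,
// decide what to remove from and what to add to the index.
public class IndexDiff {

    // in the old snapshot but deleted or modified since -> remove from the index
    public static List<String> toDelete(Map<String, Long> oldSnap, Map<String, Long> nowSnap) {
        List<String> result = new ArrayList<String>();
        for (Map.Entry<String, Long> e : oldSnap.entrySet()) {
            Long current = nowSnap.get(e.getKey());
            if (current == null || current > e.getValue()) result.add(e.getKey());
        }
        return result;
    }

    // new or modified -> (re-)add to the index
    public static List<String> toAdd(Map<String, Long> oldSnap, Map<String, Long> nowSnap) {
        List<String> result = new ArrayList<String>();
        for (Map.Entry<String, Long> e : nowSnap.entrySet()) {
            Long previous = oldSnap.get(e.getKey());
            if (previous == null || e.getValue() > previous) result.add(e.getKey());
        }
        return result;
    }
}
```

A modified file ends up in both lists, which matches the delete-then-add update model.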
```java
package bts.jsp.kbase;

import bts.roi.BtsManager;

import java.io.*;
import java.util.ArrayList;

public class KbaseConfig {

    // file recording the set of already-indexed files and their timestamps
    static final String INDEXEDFILES = BtsManager.getProperty("Bts.INDEXEDFILES");
    // where the index lives
    static final File INDEX_DIR = new File(BtsManager.getProperty("Bts.INDEX_DIR"));
    // the real documents
    static final String DATA_DIR = BtsManager.getProperty("Bts.DATA_DIR");
    // plain-string cache of the real documents
    static final String DATA_STRING_DIR = BtsManager.getProperty("Bts.DATA_STRING_DIR");
    // indexable suffixes
    static String[] TYPES = {"html", "htm", "txt", "doc", "ppt", "xls", "pdf"};

    static {
        File f = new File(DATA_STRING_DIR);
        if (!f.exists()) f.mkdirs();
    }

    // persist the (path, lastModified) snapshot, tab-separated, one file per line
    public static void saveIndexedFiles(ArrayList<String[]> data) {
        try {
            PrintWriter pw = new PrintWriter(INDEXEDFILES);
            for (int i = 0; i < data.size(); i++) {
                String[] d = data.get(i);
                for (int j = 0; j < d.length; j++) {
                    pw.print(d[j] + "\t");
                }
                pw.println();
            }
            pw.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // map a document path into the string-cache directory
    // (note: replaceAll treats DATA_DIR as a regex, which works for plain paths)
    static String getCacheStringPath(String path) {
        path = path.replaceAll("\\\\", "/");
        return path.replaceAll(KbaseConfig.DATA_DIR, KbaseConfig.DATA_STRING_DIR);
    }

    public static ArrayList<String[]> loadIndexedFiles() {
        ArrayList<String[]> data = new ArrayList<String[]>();
        if (new File(INDEXEDFILES).exists()) {
            try {
                BufferedReader reader = new BufferedReader(new FileReader(INDEXEDFILES));
                String line;
                while ((line = reader.readLine()) != null) {
                    if ((line = line.trim()).equals("")) continue;
                    data.add(line.split("\t"));
                }
                reader.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        return data;
    }

    public static ArrayList<String[]> getCurrentFiles(String dir) {
        ArrayList<String[]> d = new ArrayList<String[]>();
        getCurrentFiles(dir, d);
        return d;
    }

    private static int indexArray(String[] array, String value) {
        value = value.trim();
        for (int i = 0; i < array.length; i++) {
            if (array[i].equals(value)) return i;
        }
        return -1;
    }

    static boolean accceptFile(String path) {
        int index = path.lastIndexOf(".");
        if (index == -1) return false;
        String subtype = path.substring(index + 1);
        return indexArray(TYPES, subtype.toLowerCase()) != -1;
    }

    // recursively collect (path, lastModified) for every readable, indexable file
    private static void getCurrentFiles(String dir, ArrayList<String[]> data) {
        File f = new File(dir);
        if (f.isDirectory()) {
            File[] fs = f.listFiles(new FileFilter() {
                public boolean accept(File pathname) {
                    return pathname.isDirectory() || accceptFile(pathname.getAbsolutePath());
                }
            });
            for (int i = 0; i < fs.length; i++) {
                getCurrentFiles(fs[i].getAbsolutePath(), data);
            }
            return;
        }
        if (!f.canRead()) return;
        data.add(new String[]{f.getAbsolutePath(), f.lastModified() + ""});
    }

    // in the old snapshot but modified or removed since -> delete from the index
    public static ArrayList<String> getDeleted(ArrayList<String[]> original, ArrayList<String[]> newData) {
        ArrayList<String> result = new ArrayList<String>();
        for (int i = 0; i < original.size(); i++) {
            String path = original.get(i)[0];
            long lm = Long.parseLong(original.get(i)[1]);
            boolean modified = false;
            int j = 0;
            for (j = 0; j < newData.size(); j++) {
                if (newData.get(j)[0].equals(path)) {
                    if (Long.parseLong(newData.get(j)[1]) > lm) modified = true;
                    break;
                }
            }
            // modified, or no longer present
            if (modified || j == newData.size()) result.add(path);
        }
        return result;
    }

    // modified or brand new -> add to the index
    public static ArrayList<String> getAdded(ArrayList<String[]> original, ArrayList<String[]> newData) {
        ArrayList<String> result = new ArrayList<String>();
        for (int i = 0; i < newData.size(); i++) {
            String path = newData.get(i)[0];
            long lm = Long.parseLong(newData.get(i)[1]);
            boolean modified = false;
            int j = 0;
            for (j = 0; j < original.size(); j++) {
                if (original.get(j)[0].equals(path)) {
                    if (lm > Long.parseLong(original.get(j)[1])) modified = true;
                    break;
                }
            }
            // modified, or newly added
            if (modified || j == original.size()) result.add(path);
        }
        return result;
    }
}
```
The remaining search and delete logic closely follows the Lucene demo, with the highlighter added:
```java
package bts.jsp.kbase;

/**
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import java.io.IOException;
import java.util.ArrayList;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.FilterIndexReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocCollector;
import org.apache.lucene.search.highlight.*;

/**
 * Paging search with highlighting, adapted from Lucene's SearchFiles demo.
 */
public class SearchFiles {

    /**
     * Use the norms from one field for all fields. Norms are read into memory
     * (one byte per document per searched field), so searching large collections
     * with many fields can run out of memory; if every field holds a single
     * token, the norms are identical and one norm vector can be shared.
     */
    private static class OneNormsReader extends FilterIndexReader {
        private String field;

        public OneNormsReader(IndexReader in, String field) {
            super(in);
            this.field = field;
        }

        public byte[] norms(String field) throws IOException {
            return in.norms(this.field);
        }
    }

    private SearchFiles() {
    }

    public static KbaseFiles search(String field, String queries, int start, int limit) throws Exception {
        IndexReader reader = IndexReader.open(KbaseConfig.INDEX_DIR);
        IndexSearcher searcher = new IndexSearcher(reader);
        Analyzer analyzer = new StandardAnalyzer();
        QueryParser parser = new QueryParser(field, analyzer);
        Query query = parser.parse(queries);

        KbaseFiles files = null;
        if (start >= 0) {
            files = doPagingSearch(analyzer, searcher, query, start, limit);
        } else {
            doStreamingSearch(searcher, query);
        }
        return files;
    }

    /**
     * Streaming search: a custom HitCollector sees every matching document,
     * regardless of relevance.
     */
    public static void doStreamingSearch(final IndexSearcher searcher, Query query) throws IOException {
        HitCollector streamingHitCollector = new HitCollector() {
            public void collect(int doc, float score) {
                // every matching docId and score passes through here
            }
        };
        searcher.search(query, streamingHitCollector);
    }

    /**
     * Typical paging search: collect start + limit hits and return the page
     * [start, end) together with highlighted fragments.
     */
    public static KbaseFiles doPagingSearch(Analyzer analyzer, IndexSearcher searcher, Query query,
                                            int start, int limit) throws IOException {
        TopDocCollector collector = new TopDocCollector(start + limit);
        searcher.search(query, collector);
        ScoreDoc[] hits = collector.topDocs().scoreDocs;

        int numTotalHits = collector.getTotalHits();
        int end = Math.min(numTotalHits, start + limit);

        SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter("<em>", "</em>");
        Highlighter highlighter = new Highlighter(htmlFormatter, new QueryScorer(query));

        KbaseFiles fileResult = new KbaseFiles();
        fileResult.setTotal(numTotalHits);
        ArrayList<KbaseFile> files = new ArrayList<KbaseFile>();

        for (int i = start; i < end; i++) {
            Document doc = searcher.doc(hits[i].doc);
            String path = doc.get("path");
            if (path != null) {
                String title = doc.get("title");
                // contents are not stored in the index, so re-read the raw text
                String contents = FileDocument.getCommonContent(path);
                String highLightText = highlighter.getBestFragment(analyzer, "contents", contents);
                String modified = doc.get("modified");
                // drop the millisecond digits of lastModified
                modified = modified.substring(0, modified.length() - 3);
                files.add(new KbaseFile(title, path, modified, highLightText));
            }
        }
        fileResult.setFiles(files);
        return fileResult;
    }
}
```
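For readers new to the highlighter: SimpleHTMLFormatter("&lt;em&gt;", "&lt;/em&gt;") wraps each matched term in the fragment it returns. A toy illustration of that output shape (not Lucene's Highlighter, which also scores fragments and picks the best one):

```java
import java.util.regex.Pattern;

// Toy illustration only: wrap case-insensitive occurrences of a query term
// in <em> tags, mimicking what SimpleHTMLFormatter("<em>", "</em>") emits.
public class ToyHighlighter {
    public static String highlight(String fragment, String term) {
        return fragment.replaceAll("(?i)" + Pattern.quote(term), "<em>$0</em>");
    }
}
```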
Comments

#2 yiminghe (2011-08-15)

Quoting 奈落王: "The BtsManager class is missing, could you post it?"

Sorry, this code is from long ago and I no longer have it.

#1 奈落王 (2011-08-12)

The BtsManager class is missing, could you post it?