Building a knowledge base with Lucene


 

The goal is to index the common document types (doc, txt, html, and so on). The key point is that Lucene itself only needs plain text: each format merely has to be stripped down to a string that keeps the meaningful content.


Following the Lucene demo, I created DOCDocument.java, whose text extraction uses POI. A common base class, FileDocument, is factored out, and the concrete handler for each file type is loaded dynamically by naming convention: upper-cased extension + "Document" (so .doc maps to DOCDocument).

 

The base class FileDocument:

 

package bts.jsp.kbase;


import java.io.*;
import java.util.Map;
import java.util.HashMap;
import java.util.Arrays;

import org.apache.lucene.document.DateTools;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;


public abstract class FileDocument {
    static Map<String, FileDocument> DocumentMap;

    static {
        try {
            DocumentMap = init();
        } catch (Exception e) {
            e.printStackTrace();
            System.exit(1);
        }
    }

    private static Map<String, FileDocument> init() throws Exception {
        Map<String, FileDocument> map = new HashMap<String, FileDocument>();
        for (String t : KbaseConfig.TYPES) {
            map.put(t, (FileDocument) Class.forName("bts.jsp.kbase." + t.toUpperCase() + "Document").newInstance());
        }

        return map;
    }


    public static String getCacheStringContent(String path) {
        String stringPath = KbaseConfig.getCacheStringPath(path);
        System.out.println("get cache :" + stringPath);
        StringBuffer sb = new StringBuffer("");
        try {
            BufferedReader reader = new BufferedReader(new FileReader(stringPath));
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append("\n");
            }
            reader.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return sb.toString();
    }

    private String cacheAndGetStringContent(String path) {
        String content = this.getTextContent(path);
        String stringPath = KbaseConfig.getCacheStringPath(path);
        System.out.println(stringPath);
        {
            String dir = stringPath.substring(0, stringPath.lastIndexOf("/"));
            File f = new File(dir);
            if (!f.exists()) f.mkdirs();
        }
        try {
            PrintWriter pw = new PrintWriter(stringPath);
            pw.println(content);
            pw.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return content;
    }

    public static Document getCommonDocument(String path) {
        if (!KbaseConfig.acceptFile(path))
            return null;
        String subtype = path.substring(path.lastIndexOf(".") + 1);
        try {
            return DocumentMap.get(subtype.toLowerCase()).Document(new File(path)); // map keys are lowercase
        } catch (Exception e) {
            e.printStackTrace();
        }
        return null;
    }


    public abstract String getTextContent(String path);

    public Document Document(File f)
            throws java.io.FileNotFoundException {

        // make a new, empty document
        Document doc = new Document();

        // Add the path of the file as a field named "path".  Use a field that is
        // indexed (i.e. searchable), but don't tokenize the field into words.
        doc.add(new Field("path", f.getPath(), Field.Store.YES, Field.Index.NOT_ANALYZED));

        doc.add(new Field("title", f.getName(), Field.Store.YES, Field.Index.NOT_ANALYZED));


        // Add the last modified date of the file as a field named "modified".  Use
        // a field that is indexed (i.e. searchable), but don't tokenize the field
        // into words.
        doc.add(new Field("modified",
                f.lastModified() + "",
                Field.Store.YES, Field.Index.NOT_ANALYZED));

        // Add the contents of the file to a field named "contents". The extracted
        // text is passed as a String (and cached to disk by cacheAndGetStringContent)
        // so it can be re-read at search time for highlighting; it is tokenized and
        // indexed, with term vectors, but not stored in the index itself.
        String content = cacheAndGetStringContent(f.getPath());
        doc.add(new Field("contents", content, Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));

        // return the document
        return doc;
    }

}
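
For reference, a minimal caller might look like the sketch below (FileDocumentDemo and the file path are hypothetical): getCommonDocument() picks the handler by extension, extracts and caches the text, and returns a ready-to-index Lucene Document.

package bts.jsp.kbase;

import org.apache.lucene.document.Document;

public class FileDocumentDemo {
    public static void main(String[] args) {
        // hypothetical file somewhere under KbaseConfig.DATA_DIR
        Document doc = FileDocument.getCommonDocument("/data/kbase/manual.doc");
        if (doc != null) {
            System.out.println(doc.get("title"));    // stored file name
            System.out.println(doc.get("modified")); // lastModified() millis as a string
        }
    }
}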

For example, the .doc extension is handled by the DOCDocument class:

 

package bts.jsp.kbase;

import org.apache.poi.hwpf.extractor.WordExtractor;

import java.io.FileInputStream;

/**
 * Handler for .doc files: extracts plain text with POI's WordExtractor.
 */
public class DOCDocument extends FileDocument {
    /**
     * Extracts the Word document's plain text; on failure returns "".
     */
    public String getTextContent(String path) {
        String content = "";
        try {
            FileInputStream fis = new FileInputStream(path);
            WordExtractor wordExtractor = new WordExtractor(fis);
            content = wordExtractor.getText();
            fis.close(); // release the file handle once the text is out
        } catch (Exception e) {
            e.printStackTrace();
        }
        return content;
    }


}
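
Handlers for the other extensions in TYPES follow the same pattern, and init() finds them by naming convention (TXTDocument, HTMLDocument, and so on). A plain-text handler needs no extraction library at all; a TXTDocument, not shown in the original post, might be as simple as this sketch:

package bts.jsp.kbase;

import java.io.BufferedReader;
import java.io.FileReader;

public class TXTDocument extends FileDocument {
    public String getTextContent(String path) {
        StringBuilder sb = new StringBuilder();
        try {
            // the file already is plain text; just read it line by line
            BufferedReader reader = new BufferedReader(new FileReader(path));
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append("\n");
            }
            reader.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return sb.toString();
    }
}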

The next problem is how to avoid indexing the same file twice, and how to re-index a file after it changes. This is where the file's lastModified attribute comes in: after every indexing run, the set of indexed files and their last-modified times is saved to disk; before the next run that snapshot is checked, and a file is only actually indexed when it is new or has been updated.

In other words, each run works out which documents should be deleted and which should be added (an index update is implemented as a delete followed by an add). KbaseConfig below does this bookkeeping; a driver sketch that ties it to an IndexWriter follows the class.

 

package bts.jsp.kbase;

import bts.roi.BtsManager; // external helper that reads configuration properties (not included in the post)

import java.io.*;
import java.util.ArrayList;

/**
 * Configuration and bookkeeping for incremental indexing: index location,
 * data directories, accepted file types, and the indexed-files snapshot.
 */
public class KbaseConfig {
    // snapshot of already-indexed files and their last-modified times
    static final String INDEXEDFILES = BtsManager.getProperty("Bts.INDEXEDFILES");
    // directory where the Lucene index is stored
    static final File INDEX_DIR = new File(BtsManager.getProperty("Bts.INDEX_DIR"));

    // root directory of the source documents
    static final String DATA_DIR = BtsManager.getProperty("Bts.DATA_DIR");

    // cache directory for the extracted plain-text versions of the documents
    static final String DATA_STRING_DIR = BtsManager.getProperty("Bts.DATA_STRING_DIR");

    // file extensions accepted for indexing
    static String[] TYPES = {"html", "htm", "txt", "doc", "ppt", "xls", "pdf"};

    static {
        File f = new File(DATA_STRING_DIR);
        if (!f.exists()) f.mkdirs();
    }

    public static void saveIndexedFiles(ArrayList<String[]> data) {
        try {
            PrintWriter pw = new PrintWriter(INDEXEDFILES);
            for (int i = 0; i < data.size(); i++) {
                String[] d = data.get(i);
                for (int j = 0; j < d.length; j++) {
                    pw.print(d[j] + "\t");
                }
                pw.println();
            }
            pw.close();
        } catch (Exception e) {
            e.printStackTrace();
        }

    }

    static String getCacheStringPath(String path) {
        path = path.replace('\\', '/');
        // use replace(), not replaceAll(): DATA_DIR is a literal path, not a regex
        return path.replace(KbaseConfig.DATA_DIR, KbaseConfig.DATA_STRING_DIR);
    }

    public static ArrayList<String[]> loadIndexedFiles() {
        ArrayList<String[]> data = new ArrayList<String[]>();
        if (new File(INDEXEDFILES).exists()) {
            try {
                BufferedReader reader = new BufferedReader(new FileReader(INDEXEDFILES));
                String line;
                while ((line = reader.readLine()) != null) {
                    if ((line = line.trim()).equals("")) continue;
                    String[] d = line.split("\t");
                    data.add(d);
                }
                reader.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        return data;
    }


    public static ArrayList<String[]> getCurrentFiles(String dir) {
        ArrayList<String[]> d = new ArrayList<String[]>();
        getCurrentFiles(dir, d);
        return d;
    }


    private static int indexArray(String[] array, String value) {
        value = value.trim();
        for (int i = 0; i < array.length; i++) {
            if (array[i].equals(value))
                return i;
        }
        return -1;
    }

    static boolean acceptFile(String path) {
        int index = path.lastIndexOf(".");
        if (index == -1) return false;
        String subtype = path.substring(index + 1);
        int array_index = indexArray(TYPES, subtype.toLowerCase());
        if (array_index == -1) return false;
        return true;
    }

    private static void getCurrentFiles(String dir, ArrayList<String[]> data) {
        File f = new File(dir);
        if (f.isDirectory()) {
            // keep directories (to recurse into) and files of accepted types
            File[] fs = f.listFiles(new FileFilter() {
                public boolean accept(File pathname) {
                    return pathname.isDirectory() || acceptFile(pathname.getAbsolutePath());
                }
            });
            if (fs == null) return; // unreadable directory

            for (int i = 0; i < fs.length; i++) {
                getCurrentFiles(fs[i].getAbsolutePath(), data);
            }
            return;
        }

        if (!f.canRead()) return;

        String[] d = new String[2];
        d[0] = f.getAbsolutePath();
        d[1] = f.lastModified() + "";
        data.add(d);

    }


    public static ArrayList<String> getDeleted(ArrayList<String[]> original, ArrayList<String[]> newData) {
        ArrayList<String> result = new ArrayList<String>();
        for (int i = 0; i < original.size(); i++) {
            String path = original.get(i)[0];
            long lm = Long.parseLong(original.get(i)[1]);
            boolean modified = false;
            int j = 0;
            for (j = 0; j < newData.size(); j++) {
                String path2 = newData.get(j)[0];
                long lm2 = Long.parseLong(newData.get(j)[1]);
                if (path2.equals(path)) {
                    if (lm2 > lm) {
                        modified = true;
                        break;
                    } else {
                        break;
                    }
                }


            }

            // modified, or no longer present (deleted)
            if (modified || j == newData.size()) {
                result.add(path);
            }

        }

        return result;

    }


    public static ArrayList<String> getAdded(ArrayList<String[]> original, ArrayList<String[]> newData) {
        ArrayList<String> result = new ArrayList<String>();
        for (int i = 0; i < newData.size(); i++) {
            String path = newData.get(i)[0];
            long lm = Long.parseLong(newData.get(i)[1]);
            boolean modified = false;
            int j = 0;
            for (j = 0; j < original.size(); j++) {
                String path2 = original.get(j)[0];
                long lm2 = Long.parseLong(original.get(j)[1]);
                if (path2.equals(path)) {
                    if (lm > lm2) {
                        modified = true;
                        break;
                    } else {
                        break;
                    }
                }


            }

            // modified, or newly added
            if (modified || j == original.size()) {
                result.add(path);
            }

        }

        return result;

    }
}
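
The post does not show the indexing driver itself; a rough sketch is below. KbaseIndexer is a hypothetical name, the IndexWriter constructor is the Lucene 2.4-era one matching the rest of the code, and the create flag is simply derived from whether the index directory exists yet. Because the "path" field is indexed NOT_ANALYZED, it can serve as an exact-match primary key for deletions.

package bts.jsp.kbase;

import java.util.ArrayList;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class KbaseIndexer {
    public static void index() throws Exception {
        // previous snapshot vs. what is on disk right now
        ArrayList<String[]> original = KbaseConfig.loadIndexedFiles();
        ArrayList<String[]> current = KbaseConfig.getCurrentFiles(KbaseConfig.DATA_DIR);

        ArrayList<String> deleted = KbaseConfig.getDeleted(original, current);
        ArrayList<String> added = KbaseConfig.getAdded(original, current);

        IndexWriter writer = new IndexWriter(KbaseConfig.INDEX_DIR,
                new StandardAnalyzer(), !KbaseConfig.INDEX_DIR.exists(),
                IndexWriter.MaxFieldLength.UNLIMITED);

        // an update is a delete followed by an add, keyed on the exact path
        for (String path : deleted) {
            writer.deleteDocuments(new Term("path", path));
        }
        for (String path : added) {
            Document doc = FileDocument.getCommonDocument(path);
            if (doc != null) writer.addDocument(doc);
        }

        writer.optimize();
        writer.close();

        // persist the new snapshot for the next run
        KbaseConfig.saveIndexedFiles(current);
    }
}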
   

 

Search and deletion are otherwise much like the stock Lucene demo; the main addition is the Highlighter, which wraps the best-matching fragment in <em> tags for display:

 

package bts.jsp.kbase;

/**
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import java.io.*;
import java.util.ArrayList;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.FilterIndexReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocCollector;
import org.apache.lucene.search.highlight.*;

/**
 * Simple command-line based search demo.
 */
public class SearchFiles {

    /**
     * Use the norms from one field for all fields.  Norms are read into memory,
     * using a byte of memory per document per searched field.  This can cause
     * search of large collections with a large number of fields to run out of
     * memory.  If all of the fields contain only a single token, then the norms
     * are all identical, and a single norm vector may be shared.
     */
    private static class OneNormsReader extends FilterIndexReader {
        private String field;

        public OneNormsReader(IndexReader in, String field) {
            super(in);
            this.field = field;
        }

        public byte[] norms(String field) throws IOException {
            return in.norms(this.field);
        }
    }

    private SearchFiles() {
    }

    /**
     * Simple command-line based search demo.
     */
    public static KbaseFiles search(String field, String queries, int start, int limit) throws Exception {

        IndexReader reader = IndexReader.open(KbaseConfig.INDEX_DIR);

        /*
        if (normsField != null)
            reader = new OneNormsReader(reader, normsField);
         */


        IndexSearcher searcher = new IndexSearcher(reader);
        Analyzer analyzer = new StandardAnalyzer();


        QueryParser parser = new QueryParser(field, analyzer);


        Query query = parser.parse(queries);
        //System.out.println("Searching for: " + query.toString(field));
        KbaseFiles files = null;
        if (start >= 0) {
            files = doPagingSearch(analyzer, searcher, query, start, limit);
        } else {
            doStreamingSearch(searcher, query);
        }
        return files;

    }

    /**
     * This method uses a custom HitCollector implementation which simply prints out
     * the docId and score of every matching document.
     * <p/>
     * This simulates the streaming search use case, where all hits are supposed to
     * be processed, regardless of their relevance.
     */
    public static void doStreamingSearch(final IndexSearcher searcher, Query query) throws IOException {
        HitCollector streamingHitCollector = new HitCollector() {

            // simply print docId and score of every matching document
            public void collect(int doc, float score) {
                //System.out.println("doc=" + doc + " score=" + score);
            }

        };

        searcher.search(query, streamingHitCollector);
    }

    /**
     * This demonstrates a typical paging search scenario, where the search engine presents
     * pages of size n to the user. The user can then go to the next page if interested in
     * the next hits.
     * <p/>
     * When the query is executed for the first time, then only enough results are collected
     * to fill 5 result pages. If the user wants to page beyond this limit, then the query
     * is executed another time and all hits are collected.
     */
    public static KbaseFiles doPagingSearch(Analyzer analyzer, IndexSearcher searcher, Query query,
                                            int start, int limit) throws IOException {
        // collect just enough hits to cover the requested page window
        TopDocCollector collector = new TopDocCollector(start + limit);
        searcher.search(query, collector);
        ScoreDoc[] hits = collector.topDocs().scoreDocs;
        int numTotalHits = collector.getTotalHits();
        //System.out.println(numTotalHits + " total matching documents");

        int end = Math.min(numTotalHits, start + limit);
        SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter("<em>", "</em>");
        Highlighter highlighter = new Highlighter(htmlFormatter, new QueryScorer(query));

        KbaseFiles fileResult = new KbaseFiles();
        fileResult.setTotal(numTotalHits);
        ArrayList<KbaseFile> files = new ArrayList<KbaseFile>();
        for (int i = start; i < end; i++) {
            Document doc = searcher.doc(hits[i].doc);
            String path = doc.get("path");
            if (path != null) {
                String title = doc.get("title");
                // re-read the plain text cached at index time; the highlighter
                // needs the raw contents, which are not stored in the index
                String contents = FileDocument.getCacheStringContent(path);
                String highLightText = highlighter.getBestFragment(analyzer, "contents", contents);

                String modified = doc.get("modified");
                // drop the last three digits: millisecond -> second resolution
                modified = modified.substring(0, modified.length() - 3);
                KbaseFile file = new KbaseFile(title, path, modified, highLightText);
                files.add(file);
            }
        }
        fileResult.setFiles(files);
        return fileResult;

    }

}
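
A caller then only needs search(). KbaseFiles and KbaseFile are the author's result beans and are not shown in the post; assuming a getTotal() getter alongside the setTotal() used above, fetching the first page of ten hits looks like:

package bts.jsp.kbase;

public class SearchDemo {
    public static void main(String[] args) throws Exception {
        // "contents" is the analyzed full-text field; start=0, limit=10 -> first page
        KbaseFiles result = SearchFiles.search("contents", "lucene", 0, 10);
        System.out.println("total hits: " + result.getTotal()); // assumed getter
    }
}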
Comments

#2 yiminghe (2011-08-15)
奈落王 wrote: "The BtsManager class is missing. Could you post it?"
Sorry, this is very old code and I no longer have it.

#1 奈落王 (2011-08-12)
The BtsManager class is missing. Could you post it?
