mshijie

Getting Started with Lucene

Category: Java

Tags: lucene, search engine, F#, Apache

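Lucene is a high-performance, scalable information retrieval (IR) library. It is not a complete search engine, but a search application can be built on top of it quickly. Lucene provides the index-building and index-searching functionality that search requires, along with the supporting facilities. It is an open-source Java project under Apache, which also hosts a family of search-related subprojects.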

1. The Basic Workflow of a Search Application

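Search revolves around two activities: building an index and querying it. A search engine first extracts textual content from the raw source material, analyzes that text into a series of tokens (keywords), builds an index from those tokens, and stores the index somewhere.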

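Searching starts by obtaining a query string from the user. A query is built from that string and executed, and references to the original content are retrieved from the index.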
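To make the two phases concrete, here is a toy inverted index in plain Java. It is only a sketch of the idea, not how Lucene stores or searches an index: the class name `ToyInvertedIndex`, the simple split-on-punctuation tokenizer, and the integer document IDs are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy illustration of index building and index lookup (not Lucene's implementation). */
public class ToyInvertedIndex {
    // term -> sorted set of IDs of the documents that contain the term
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // document ID -> reference to the original content (here, just a name)
    private final List<String> sources = new ArrayList<>();

    /** Indexing phase: tokenize the text and record which document each term occurs in. */
    public int add(String sourceName, String text) {
        int docId = sources.size();
        sources.add(sourceName);
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    /** Search phase: look the term up and return references to the original content. */
    public List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        Set<Integer> docIds =
                postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
        for (int docId : docIds) {
            hits.add(sources.get(docId));
        }
        return hits;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add("a.txt", "Lucene is a search library");
        index.add("b.txt", "Solr is built on Lucene");
        System.out.println(index.search("lucene")); // prints [a.txt, b.txt]
        System.out.println(index.search("solr"));   // prints [b.txt]
    }
}
```

Note that the index stores only document IDs per term; the original content is reached through a separate lookup, which is the same split between "index" and "reference to the original" described above.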

Lucene is the core of this process.

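Lucene accepts content that has been extracted from the raw source by any means and packaged as a Document. A Document is made up of a series of Fields, each representing one part of the resource, such as an e-mail's subject, sender, and body. A Field can also hold application-specific business data. An Analyzer then tokenizes and analyzes the Document; Lucene ships with many built-in analyzers. Finally, an IndexWriter stores the resulting index in a Directory, which can be backed by the file system (FSDirectory), memory (RAMDirectory), a database, and so on.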
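As a rough picture of what the analysis step produces, the sketch below turns a field's text into tokens by lowercasing, splitting on non-alphanumeric characters, and dropping a few stop words. This is a hand-rolled stand-in, not Lucene's `Analyzer` API: `ToyAnalyzer`, its `analyze` method, and the stop-word list are hypothetical, and Lucene's built-in analyzers do considerably more.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hand-rolled stand-in for an analysis step (hypothetical, not Lucene's Analyzer API). */
public class ToyAnalyzer {
    // A tiny illustrative stop-word list; real analyzers ship their own.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "is", "of", "the", "to"));

    /** Lowercases the text, splits it on non-alphanumeric characters, and drops stop words. */
    public static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The subject line of a hypothetical e-mail document
        System.out.println(analyze("Re: The state of the Lucene index"));
        // prints [re, state, lucene, index]
    }
}
```

The same analysis generally has to be applied to the query string at search time; otherwise a query for "Lucene" would not match the lowercased token "lucene" in the index.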

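To search with an IndexSearcher, you supply the Directory that holds the index and either use a QueryParser to build a Query from the query string or construct a query object directly; Lucene provides query types for many needs. The search returns a TopDocs, which holds references to the matching Documents. Once you have fetched a Document, you can read the application-specific fields from it.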
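The indirection in that flow (a search returns ranked document IDs, and the stored fields are fetched only afterwards) can be sketched in plain Java. This is a toy model, not Lucene's scoring or its `TopDocs` class: `ToyTopHits`, the term-count score, and the in-memory `filenames` list are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Toy model of "search returns ranked doc IDs, fields are resolved later". */
public class ToyTopHits {
    /** One hit: a document ID plus its score (loosely analogous to a ScoreDoc). */
    public static final class Hit {
        public final int doc;
        public final int score;

        Hit(int doc, int score) {
            this.doc = doc;
            this.score = score;
        }
    }

    private final List<String> filenames = new ArrayList<>();    // stored field per document
    private final List<List<String>> tokens = new ArrayList<>(); // analyzed content per document

    public void add(String filename, String... docTokens) {
        filenames.add(filename);
        tokens.add(Arrays.asList(docTokens));
    }

    /** Scores each document by how often it contains the term and keeps the best n. */
    public List<Hit> search(String term, int n) {
        List<Hit> hits = new ArrayList<>();
        for (int doc = 0; doc < tokens.size(); doc++) {
            int score = 0;
            for (String t : tokens.get(doc)) {
                if (t.equals(term)) {
                    score++;
                }
            }
            if (score > 0) {
                hits.add(new Hit(doc, score));
            }
        }
        // Highest score first; ties broken by ascending document ID.
        hits.sort(Comparator.comparingInt((Hit h) -> h.score).reversed()
                .thenComparingInt(h -> h.doc));
        return new ArrayList<>(hits.subList(0, Math.min(n, hits.size())));
    }

    /** Resolves a document ID to its stored field as a separate step. */
    public String filename(int doc) {
        return filenames.get(doc);
    }

    public static void main(String[] args) {
        ToyTopHits index = new ToyTopHits();
        index.add("a.txt", "lucene", "search");
        index.add("b.txt", "lucene", "lucene", "index");
        index.add("c.txt", "solr");
        for (Hit hit : index.search("lucene", 10)) {
            System.out.println(index.filename(hit.doc)); // prints b.txt, then a.txt
        }
    }
}
```

Returning only IDs and scores keeps the result list small; the caller decides which hits are worth the extra lookup of stored fields.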

Below is a simple example from Lucene in Action. It builds an index over the text files in a directory and then finds files by keyword.

Indexer

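import java.io.File;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// Uses the Lucene 3.0-era API (IndexWriter.MaxFieldLength, Version.LUCENE_CURRENT).
public class Indexer {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            throw new IllegalArgumentException("Usage: java " + Indexer.class.getName()
                    + " <index dir> <data dir>");
        }
        String indexDir = args[0]; // directory in which the index is created
        String dataDir = args[1];  // directory containing the *.txt files to index
        long start = System.currentTimeMillis();
        Indexer indexer = new Indexer(indexDir);
        int numIndexed;
        try {
            numIndexed = indexer.index(dataDir);
        } finally {
            indexer.close(); // always release the index write lock
        }
        long end = System.currentTimeMillis();
        System.out.println("Indexing " + numIndexed + " files took "
                + (end - start) + " milliseconds");
    }

    private IndexWriter writer;

    public Indexer(String indexDir) throws IOException {
        // A null LockFactory selects the default locking implementation.
        Directory dir = FSDirectory.open(new File(indexDir), null);
        // create = true: build a new index, overwriting any existing one.
        writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_CURRENT), true,
                IndexWriter.MaxFieldLength.UNLIMITED);
    }

    public void close() throws IOException {
        writer.close(); // commit pending changes and close the IndexWriter
    }

    public int index(String dataDir) throws Exception {
        File[] files = new File(dataDir).listFiles();
        if (files == null) { // listFiles() returns null if dataDir is not a readable directory
            throw new IOException(dataDir + " is not a readable directory");
        }
        for (File f : files) {
            if (acceptFile(f)) {
                indexFile(f);
            }
        }
        return writer.numDocs(); // number of documents in the index
    }

    // Index only readable, non-hidden regular files ending in .txt.
    private boolean acceptFile(File f) {
        return !f.isDirectory() &&
                !f.isHidden() &&
                f.exists() &&
                f.canRead() &&
                f.getName().toLowerCase().endsWith(".txt");
    }

    private void indexFile(File f) throws Exception {
        System.out.println("Indexing " + f.getCanonicalPath());
        Document doc = getDocument(f);
        if (doc != null) {
            writer.addDocument(doc); // analyze the document and add it to the index
        }
    }

    private Document getDocument(File f) throws Exception {
        Document doc = new Document();
        // File content: tokenized and indexed, but not stored.
        // FileReader decodes with the platform default charset.
        doc.add(new Field("contents", new FileReader(f)));
        // File path: stored verbatim and indexed as a single, unanalyzed term.
        doc.add(new Field("filename", f.getCanonicalPath(),
                Field.Store.YES, Field.Index.NOT_ANALYZED));
        return doc;
    }

}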

Searcher

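import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// Uses the Lucene 3.0-era API, matching the Indexer above.
public class Searcher {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            throw new IllegalArgumentException("Usage: java " + Searcher.class.getName()
                    + " <index dir> <query>");
        }
        String indexDir = args[0]; // directory holding the index built by Indexer
        String q = args[1];        // query string
        search(indexDir, q);
    }

    public static void search(String indexDir, String q)
            throws Exception {
        Directory dir = FSDirectory.open(new File(indexDir), null);
        IndexSearcher is = new IndexSearcher(dir); // open the index for searching
        try {
            // Parse the query string against the "contents" field, using the same
            // analyzer that was used at indexing time.
            QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "contents",
                    new StandardAnalyzer(Version.LUCENE_CURRENT));
            Query query = parser.parse(q);
            long start = System.currentTimeMillis();
            TopDocs hits = is.search(query, 10); // retrieve at most the 10 best hits
            long end = System.currentTimeMillis();
            // totalHits counts every match, not just the 10 returned.
            System.err.println("Found " + hits.totalHits +
                    " document(s) (in " + (end - start) +
                    " milliseconds) that matched query '" +
                    q + "':");
            for (ScoreDoc scoreDoc : hits.scoreDocs) {
                Document doc = is.doc(scoreDoc.doc);     // load the stored fields of the hit
                System.out.println(doc.get("filename")); // print the stored file path
            }
        } finally {
            is.close(); // release the searcher's resources
        }
    }
}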

2010-02-22 14:27