hz_chenwenbiao

Field Bridge: adding uploaded file content to the full-text index (repost)


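Below is a repost of the Hibernate Search reference documentation; some of the examples are incomplete.

Note in particular that every property carrying one of the bridge annotations must also carry the @Field annotation. Without @Field the property is not indexed, so the bridge has no effect.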

4.2. Property/Field Bridge

In Lucene all index fields have to be represented as Strings. For this reason all entity properties annotated with @Field have to be indexed in a String form. For most of your properties, Hibernate Search does the translation job for you thanks to a built-in set of bridges. In some cases, though, you need finer-grained control over the translation process.

4.2.1. Built-in bridges

Hibernate Search comes bundled with a set of built-in bridges between a Java property type and its full text representation.

null

null elements are not indexed. Lucene does not support null elements and this does not make much sense either.

java.lang.String

Strings are indexed as is.

short, Short, int, Integer, long, Long, float, Float, double, Double, BigInteger, BigDecimal

Numbers are converted to their String representation. Note that numbers cannot be compared by Lucene (i.e. used in range queries) out of the box: they have to be padded.

Note

Using a Range query is debatable and has drawbacks; an alternative approach is to use a Filter query, which will filter the query results to the appropriate range.

Hibernate Search will support a padding mechanism
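
The need for padding can be seen with plain JDK string comparison, which is how terms end up being ordered when numbers are indexed as text. The following is only a sketch: the class name PaddedOrderingDemo and the helper pad are hypothetical and are not part of Hibernate Search, and negative numbers are deliberately rejected.

```java
import java.util.Arrays;

public class PaddedOrderingDemo {

    /** Left-pads a non-negative integer with zeros up to the given width. */
    public static String pad(int value, int width) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values are not handled in this sketch");
        }
        String raw = Integer.toString(value);
        if (raw.length() > width) {
            throw new IllegalArgumentException("value does not fit in " + width + " digits");
        }
        StringBuilder padded = new StringBuilder();
        for (int i = raw.length(); i < width; i++) {
            padded.append('0');
        }
        return padded.append(raw).toString();
    }

    public static void main(String[] args) {
        // Unpadded: sorted character by character, so "10" comes before "2".
        String[] raw = { "2", "10", "9" };
        Arrays.sort(raw);
        System.out.println(Arrays.toString(raw));

        // Padded to a fixed width: string order now matches numeric order.
        String[] padded = { pad(2, 5), pad(10, 5), pad(9, 5) };
        Arrays.sort(padded);
        System.out.println(Arrays.toString(padded));
    }
}
```

With a fixed width, a range such as 00002 to 00010 covers exactly the values a numeric range from 2 to 10 would cover.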

java.util.Date

Dates are stored as yyyyMMddHHmmssSSS in GMT time (20061107210300012 for Nov 7th of 2006 4:03PM and 12ms EST). You shouldn't really bother with the internal format. What is important is that when using a DateRange Query, you should know that the dates have to be expressed in GMT time.

Usually, storing the date up to the millisecond is not necessary. @DateBridge defines the appropriate resolution you are willing to store in the index (@DateBridge(resolution=Resolution.DAY) ). The date pattern will then be truncated accordingly.

 

@Entity
@Indexed
public class Meeting {
    @Field(index=Index.UN_TOKENIZED)
    @DateBridge(resolution=Resolution.MINUTE)
    private Date date;
    ...
}

Warning

A Date whose resolution is lower than MILLISECOND cannot be a @DocumentId
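
To make the GMT conversion and the effect of a coarser resolution concrete, here is a JDK-only sketch. It does not call Hibernate Search; the class GmtDateFormatDemo and its methods are hypothetical, and it assumes that a MINUTE resolution simply drops the seconds and milliseconds from the pattern.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.TimeZone;

public class GmtDateFormatDemo {

    /** Formats a date with the given pattern, always expressed in GMT. */
    public static String format(Date date, String pattern) {
        SimpleDateFormat formatter = new SimpleDateFormat(pattern);
        formatter.setTimeZone(TimeZone.getTimeZone("GMT"));
        return formatter.format(date);
    }

    /** Nov 7th 2006, 4:03 PM and 12 ms, at a fixed offset of GMT-5 (EST). */
    public static Date sampleDate() {
        Calendar cal = new GregorianCalendar(TimeZone.getTimeZone("GMT-05:00"));
        cal.clear();
        cal.set(2006, Calendar.NOVEMBER, 7, 16, 3, 0);
        cal.set(Calendar.MILLISECOND, 12);
        return cal.getTime();
    }

    public static void main(String[] args) {
        Date date = sampleDate();
        // Full millisecond precision: 16:03 at GMT-5 is 21:03 GMT.
        System.out.println(format(date, "yyyyMMddHHmmssSSS")); // 20061107210300012
        // Assumed shape of the value at minute resolution.
        System.out.println(format(date, "yyyyMMddHHmm"));      // 200611072103
    }
}
```

Range boundaries passed to a query would need the same GMT conversion, otherwise they are shifted by the local offset.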

java.net.URI, java.net.URL

URI and URL are converted to their string representation

java.lang.Class

Classes are converted to their fully qualified class name. The thread context classloader is used when the class is rehydrated.
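
As an illustration of what "rehydrated with the thread context classloader" can look like in plain Java, consider the sketch below. It is not the actual built-in bridge; ClassNameRoundTripDemo and its two methods are hypothetical names used only for this example.

```java
public class ClassNameRoundTripDemo {

    /** Class to String: the fully qualified class name. */
    public static String classToString(Class<?> type) {
        return type.getName();
    }

    /** String to Class: resolved through the thread context classloader. */
    public static Class<?> stringToClass(String className) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        try {
            return Class.forName(className, true, loader);
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException("Unable to load class " + className, e);
        }
    }

    public static void main(String[] args) {
        String indexed = classToString(java.util.ArrayList.class);
        System.out.println(indexed); // java.util.ArrayList
        System.out.println(stringToClass(indexed) == java.util.ArrayList.class); // true
    }
}
```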

4.2.2. Custom Bridge

Sometimes, the built-in bridges of Hibernate Search do not cover some of your property types, or the String representation used by the bridge does not meet your requirements. The following paragraphs describe several solutions to this problem.

4.2.2.1. StringBridge

The simplest custom solution is to give Hibernate Search an implementation of your expected Object to String bridge. To do so you need to implement the org.hibernate.search.bridge.StringBridge interface. All implementations have to be thread-safe as they are used concurrently.

Example 4.13. Implementing your own StringBridge

 

/**
 * Padding Integer bridge.
 * All numbers will be padded with 0 to match 5 digits
 *
 * @author Emmanuel Bernard
 */
public class PaddedIntegerBridge implements StringBridge {

    private static final int PADDING = 5;

    public String objectToString(Object object) {
        String rawInteger = ( (Integer) object ).toString();
        if (rawInteger.length() > PADDING) 
            throw new IllegalArgumentException( "Try to pad on a number too big" );
        StringBuilder paddedInteger = new StringBuilder( );
        for ( int padIndex = rawInteger.length() ; padIndex < PADDING ; padIndex++ ) {
            paddedInteger.append('0');
        }
        return paddedInteger.append( rawInteger ).toString();
    }
}                

Then any property or field can use this bridge thanks to the @FieldBridge annotation:

 

@FieldBridge(impl = PaddedIntegerBridge.class)
private Integer length;     

Parameters can be passed to the Bridge implementation, making it more flexible. The Bridge implementation implements a ParameterizedBridge interface, and the parameters are passed through the @FieldBridge annotation.

Example 4.14. Passing parameters to your bridge implementation

 

public class PaddedIntegerBridge implements StringBridge, ParameterizedBridge {

    public static String PADDING_PROPERTY = "padding";
    private int padding = 5; //default

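    public void setParameterValues(Map parameters) {
        // @Parameter values are declared as strings (value="10"), so parse instead of casting
        Object padding = parameters.get( PADDING_PROPERTY );
        if (padding != null) this.padding = Integer.parseInt( padding.toString() );
    }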

    public String objectToString(Object object) {
        String rawInteger = ( (Integer) object ).toString();
        if (rawInteger.length() > padding) 
            throw new IllegalArgumentException( "Try to pad on a number too big" );
        StringBuilder paddedInteger = new StringBuilder( );
        for ( int padIndex = rawInteger.length() ; padIndex < padding ; padIndex++ ) {
            paddedInteger.append('0');
        }
        return paddedInteger.append( rawInteger ).toString();
    }
}


//property
@FieldBridge(impl = PaddedIntegerBridge.class,
             params = @Parameter(name="padding", value="10")
            )
private Integer length;

The ParameterizedBridge interface can be implemented by StringBridge, TwoWayStringBridge, and FieldBridge implementations.

All implementations have to be thread-safe, but the parameters are set during initialization and no special care is required at this stage.
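
Because parameter values are written as strings in the annotation (value="10"), a bridge has to convert them itself. The following JDK-only sketch shows one defensive way to read such a value; PaddingParameterDemo and readPadding are hypothetical names, and the assumption is that the parameters arrive as a Map of String to String.

```java
import java.util.HashMap;
import java.util.Map;

public class PaddingParameterDemo {

    public static final String PADDING_PROPERTY = "padding";

    /** Reads the padding parameter, falling back to a default when it is absent. */
    public static int readPadding(Map<String, String> parameters, int defaultValue) {
        String raw = parameters.get(PADDING_PROPERTY);
        if (raw == null) {
            return defaultValue;
        }
        // Throws NumberFormatException for non-numeric input, which surfaces
        // a misconfigured annotation early.
        return Integer.parseInt(raw.trim());
    }

    public static void main(String[] args) {
        Map<String, String> parameters = new HashMap<String, String>();
        System.out.println(readPadding(parameters, 5)); // 5 (default)

        parameters.put(PADDING_PROPERTY, "10");
        System.out.println(readPadding(parameters, 5)); // 10
    }
}
```

Doing the conversion once during initialization keeps the per-value path free of parsing and of shared mutable state.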

If you expect to use your bridge implementation on an id property (i.e. annotated with @DocumentId), you need to use a slightly extended version of StringBridge named TwoWayStringBridge. Hibernate Search needs to read the string representation of the identifier and generate the object out of it. There is no difference in the way the @FieldBridge annotation is used.

Example 4.15. Implementing a TwoWayStringBridge which can for example be used for id properties

 

public class PaddedIntegerBridge implements TwoWayStringBridge, ParameterizedBridge {

    public static String PADDING_PROPERTY = "padding";
    private int padding = 5; //default

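    public void setParameterValues(Map parameters) {
        // @Parameter values are declared as strings (value="10"), so parse instead of casting
        Object padding = parameters.get( PADDING_PROPERTY );
        if (padding != null) this.padding = Integer.parseInt( padding.toString() );
    }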

    public String objectToString(Object object) {
        String rawInteger = ( (Integer) object ).toString();
        if (rawInteger.length() > padding) 
            throw new IllegalArgumentException( "Try to pad on a number too big" );
        StringBuilder paddedInteger = new StringBuilder( );
        for ( int padIndex = rawInteger.length() ; padIndex < padding ; padIndex++ ) {
            paddedInteger.append('0');
        }
        return paddedInteger.append( rawInteger ).toString();
    }

    public Object stringToObject(String stringValue) {
        return Integer.valueOf(stringValue);
    }
}


//id property
@DocumentId
@FieldBridge(impl = PaddedIntegerBridge.class,
             params = @Parameter(name="padding", value="10") )
private Integer id;
                
It is critically important for the two-way process to be idempotent (i.e. object = stringToObject( objectToString( object ) ) ).
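
The round-trip property can be checked without an index. The sketch below copies the padding logic into a standalone class; RoundTripCheck and roundTrips are hypothetical names, and the padding of 10 is just the value used in the example. It also shows a case where the property breaks: zero-padding a negative number produces a string that cannot be parsed back.

```java
public class RoundTripCheck {

    private static final int PADDING = 10; // mirrors value="10" in the example

    /** Same zero-padding idea as objectToString in the bridge. */
    public static String objectToString(Integer value) {
        String raw = value.toString();
        if (raw.length() > PADDING) {
            throw new IllegalArgumentException("Try to pad on a number too big");
        }
        StringBuilder padded = new StringBuilder();
        for (int i = raw.length(); i < PADDING; i++) {
            padded.append('0');
        }
        return padded.append(raw).toString();
    }

    public static Integer stringToObject(String stringValue) {
        return Integer.valueOf(stringValue);
    }

    /** True when object equals stringToObject(objectToString(object)). */
    public static boolean roundTrips(Integer value) {
        try {
            return value.equals(stringToObject(objectToString(value)));
        } catch (NumberFormatException e) {
            return false; // the padded form could not be parsed back
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrips(42)); // true:  "0000000042" parses to 42
        System.out.println(roundTrips(0));  // true
        System.out.println(roundTrips(-5)); // false: "00000000-5" is not a valid integer
    }
}
```

A bridge used on an id property would therefore need either to reject negative ids or to encode the sign in a way that stringToObject can reverse.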

4.2.2.2. FieldBridge

Some use cases require more than a simple object to string translation when mapping a property to a Lucene index. To give you the greatest possible flexibility you can also implement a bridge as a FieldBridge. This interface gives you a property value and lets you map it the way you want in your Lucene Document. The interface is very similar in concept to the Hibernate UserTypes.

You can for example store a given property in two different document fields:

Example 4.16. Implementing the FieldBridge interface in order to store a given property into multiple document fields

 

/**
 * Store the date in 3 different fields - year, month, day - to ease Range Query per
 * year, month or day (eg get all the elements of December for the last 5 years).
 * 
 * @author Emmanuel Bernard
 */
public class DateSplitBridge implements FieldBridge {
    private final static TimeZone GMT = TimeZone.getTimeZone("GMT");

    public void set(String name, Object value, Document document, 
                    LuceneOptions luceneOptions) {
        Date date = (Date) value;
        Calendar cal = GregorianCalendar.getInstance(GMT);
        cal.setTime(date);
        int year = cal.get(Calendar.YEAR);
        int month = cal.get(Calendar.MONTH) + 1;
        int day = cal.get(Calendar.DAY_OF_MONTH);
  
        // set year
        Field field = new Field(name + ".year", String.valueOf(year),
            luceneOptions.getStore(), luceneOptions.getIndex(),
            luceneOptions.getTermVector());
        field.setBoost(luceneOptions.getBoost());
        document.add(field);
  
        // set month and pad it if needed
        field = new Field(name + ".month",
            (month < 10 ? "0" : "") + String.valueOf(month), luceneOptions.getStore(),
            luceneOptions.getIndex(), luceneOptions.getTermVector());
        field.setBoost(luceneOptions.getBoost());
        document.add(field);
  
        // set day and pad it if needed
        field = new Field(name + ".day",
            (day < 10 ? "0" : "") + String.valueOf(day), luceneOptions.getStore(),
            luceneOptions.getIndex(), luceneOptions.getTermVector());
        field.setBoost(luceneOptions.getBoost());
        document.add(field);
    }
}

//property
@FieldBridge(impl = DateSplitBridge.class)
private Date date;                
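
The splitting and padding logic can be tried in isolation, without Lucene on the classpath. The following sketch uses hypothetical names (DateSplitDemo, split, pad2) and returns the three values a bridge of this kind would write to the .year, .month and .day fields. Note the parentheses in pad2: the conditional operator binds more loosely than +, so v < 10 ? "0" : "" + v would evaluate to just "0" for single-digit values.

```java
import java.util.Calendar;
import java.util.Date;
import java.util.TimeZone;

public class DateSplitDemo {

    private static final TimeZone GMT = TimeZone.getTimeZone("GMT");

    /** Pads a value below 10 with a leading zero. */
    static String pad2(int v) {
        return (v < 10 ? "0" : "") + v;
    }

    /** Returns { year, month, day } as strings, computed in GMT. */
    public static String[] split(Date date) {
        Calendar cal = Calendar.getInstance(GMT);
        cal.setTime(date);
        int year = cal.get(Calendar.YEAR);
        int month = cal.get(Calendar.MONTH) + 1; // Calendar months are zero-based
        int day = cal.get(Calendar.DAY_OF_MONTH);
        return new String[] { String.valueOf(year), pad2(month), pad2(day) };
    }

    public static void main(String[] args) {
        String[] parts = split(new Date(0L)); // the epoch: 1970-01-01T00:00:00 GMT
        System.out.println(parts[0] + " " + parts[1] + " " + parts[2]); // 1970 01 01
    }
}
```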
 

4.2.2.3. ClassBridge

It is sometimes useful to combine more than one property of a given entity and index this combination in a specific way into the Lucene index. The @ClassBridge and @ClassBridges annotations can be defined at the class level (as opposed to the property level). In this case the custom field bridge implementation receives the entity instance as the value parameter instead of a particular property. Though not shown in this example, @ClassBridge supports the termVector attribute discussed in Section 4.1.1, “Basic mapping”.

Example 4.17. Implementing a class bridge

 

@Entity
@Indexed
@ClassBridge(name="branchnetwork",
             index=Index.TOKENIZED,
             store=Store.YES,
             impl = CatFieldsClassBridge.class,
             params = @Parameter( name="sepChar", value=" " ) )
public class Department {
    private int id;
    private String network;
    private String branchHead;
    private String branch;
    private Integer maxEmployees;
    ...
}


public class CatFieldsClassBridge implements FieldBridge, ParameterizedBridge {
    private String sepChar;

    public void setParameterValues(Map parameters) {
        this.sepChar = (String) parameters.get( "sepChar" );
    }

    public void set(String name, Object value, Document document, LuceneOptions luceneOptions) {
        // In this particular class the name of the new field was passed
        // from the name field of the ClassBridge Annotation. This is not
        // a requirement. It just works that way in this instance. The
        // actual name could be supplied by hard coding it below.
        Department dep = (Department) value;
        String fieldValue1 = dep.getBranch();
        if ( fieldValue1 == null ) {
            fieldValue1 = "";
        }
        String fieldValue2 = dep.getNetwork();
        if ( fieldValue2 == null ) {
            fieldValue2 = "";
        }
        String fieldValue = fieldValue1 + sepChar + fieldValue2;
        Field field = new Field( name, fieldValue, luceneOptions.getStore(), luceneOptions.getIndex(), luceneOptions.getTermVector() );
        field.setBoost( luceneOptions.getBoost() );
        document.add( field );
   }
}

In this example, the particular CatFieldsClassBridge is applied to the department instance; the field bridge then concatenates both branch and network and indexes the concatenation.