Configuring mmseg4j Word Segmentation for the Solr Search Server

wanglihu

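This post configures mmseg4j word segmentation for a Solr search server, using the Sogou dictionary.
mmseg4j download page: http://code.google.com/p/mmseg4j/
Sogou dictionary download page: http://code.google.com/p/mmseg4j/downloads/detail?name=data.zip&can=2&q
Download the latest release, mmseg4j-1.8.5.zip (built against Lucene/Solr 3.1), and data.zip.
The configuration steps are as follows:
1. Unzip the downloaded mmseg4j-1.8.5.zip and copy the mmseg4j-all-1.8.5.jar file from it into the Tomcat 6.0.26\webapps\solr\WEB-INF\lib directory.

2. Create a dic folder under Tomcat 6.0.26\webapps\solr and copy the newly downloaded dictionary files into it.

3. In \Tomcat 6.0.26\webapps\solr\conf\multicore\core0\conf\schema.xml, add the following nodes inside the types node: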

<!-- dicPath must point at the dic folder created in step 2; the same path is used for all three types -->
<fieldType name="textComplex" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="E:/install/Tomcat 6.0.26/webapps/solr/dic" />
    </analyzer>
</fieldType>
<fieldType name="textMaxWord" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="maxword" dicPath="E:/install/Tomcat 6.0.26/webapps/solr/dic" />
    </analyzer>
</fieldType>
<fieldType name="textSimple" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="simple" dicPath="E:/install/Tomcat 6.0.26/webapps/solr/dic" />
    </analyzer>
</fieldType>

4. In the same \Tomcat 6.0.26\webapps\solr\conf\multicore\core0\conf\schema.xml file, add the following nodes inside the fields node:

<field name="simple" type="textSimple" indexed="true" stored="true" multiValued="true" />
<field name="complex" type="textComplex" indexed="true" stored="true" multiValued="true" />
<field name="text" type="textMaxWord" indexed="true" stored="true" multiValued="true" />

5. This Solr 3.5 setup has two cores, so repeat steps 3 and 4 for core1.

6. Test the segmentation by opening http://localhost:8080/solr/core0/admin/analysis.jsp?highlight=on

Test 1: In Field [name], enter: complex

Test 2: In Field Value (Index), enter 中国银行第一分行, and tick the verbose output checkbox below Field Value (Index).

Test 3: Click the Analyze button and check the segmentation result: 中国银行 | 第一 | 分行

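7. Solr 3.5 can now segment text. Next, configure Solr 3.5 to connect to a MySQL database, build an index from it, and segment the indexed data.

7.1   Download the MySQL driver for Java, extract mysql-connector-java-5.1.18-bin.jar locally, and copy it into the Tomcat 6.0.26\webapps\solr\WEB-INF\lib directory.

7.2   Create a db folder under \Tomcat 6.0.26\webapps\solr.

7.3   In the \Tomcat 6.0.26\webapps\solr\db folder, create a db-data-config.xml file with the following content: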

<dataConfig>
    <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/test" user="root" password="123" />
    <document name="messages">
        <entity name="message" transformer="ClobTransformer" query="select * from test1">
            <field column="ID" name="id" />
            <field column="Val" name="text" />
        </entity>
    </document>
</dataConfig>

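url="jdbc:mysql://localhost:3306/test" user="root" password="123" configures the MySQL connection URL, user name, and password.

<field column="ID" name="id" /><field column="Val" name="text" /> configures which database columns are indexed. Note that name must match a field in schema.xml (text is the field configured in step 4).

7.4   In the solrconfig.xml file under Tomcat 6.0.26\webapps\solr\conf\multicore\core0\conf, add the following: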

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">  
    <lst name="defaults">  
      <str name="config">E:/install/Tomcat 6.0.26/webapps/solr/db/db-data-config.xml</str>   
    </lst>  
  </requestHandler>

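"E:/install/Tomcat 6.0.26/webapps/solr/db/db-data-config.xml" is the absolute path of the configuration file created in 7.3.

7.5 Repeat 7.4 in Tomcat 6.0.26\webapps\solr\conf\multicore\core1\conf\solrconfig.xml.

7.6 From the dist directory of the locally extracted Solr 3.5 download, copy apache-solr-dataimporthandler-3.5.0.jar and apache-solr-dataimporthandler-extras-3.5.0.jar into the Tomcat 6.0.26\webapps\solr\WEB-INF\lib directory.

7.7 The connection from Solr 3.5 to MySQL is now configured. To test reading from MySQL and building the index, open: http://localhost:8080/solr/core0/dataimport?command=full-import

7.8 To test segmented queries, open http://localhost:8080/solr/core0/admin/ and search for a word that occurs in the indexed database column.

Note: this only configures Solr 3.5 to connect to MySQL and build an index. Queries for ordinary words work, but segmented queries on search phrases do not.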

The multicore directory contains several core folders. Each core is a separate endpoint with its own configuration files and handles one kind of data.
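As a rough illustration of how the cores are registered, the sketch below shows what the solr.xml file in the multicore directory typically looks like. It assumes the stock multicore example layout shipped with Solr 3.x, with core0 and core1 as sibling folders; your file may differ.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<!-- Assumed: the default multicore example layout of Solr 3.x -->
<solr persistent="false">
  <cores adminPath="/admin/cores">
    <!-- instanceDir is relative to the multicore directory; each folder has its own conf/ -->
    <core name="core0" instanceDir="core0" />
    <core name="core1" instanceDir="core1" />
  </cores>
</solr>
```

The core name is what appears in URLs such as http://localhost:8080/solr/core0/admin/.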

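The schema.xml file under multicore/core0/conf/ is comparable to a table definition: it defines the data types of the data added to the index. The file contains a <uniqueKey>id</uniqueKey> setting, which makes the id field the unique identifier of each indexed document. This setting is very important.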
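A minimal sketch of how the key field and the uniqueKey setting fit together is shown below. It assumes a field type named string (a solr.StrField) is already defined in the types node, as in the example schema; check your own schema.xml before copying it.

```xml
<fields>
  <!-- Key field; "string" is assumed to be a solr.StrField type defined in <types> -->
  <field name="id" type="string" indexed="true" stored="true" required="true" />
</fields>

<!-- Documents with the same id overwrite each other when re-indexed -->
<uniqueKey>id</uniqueKey>
```

In the data import configuration of step 7.3, the ID column is mapped to this id field, so each database row becomes one document.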

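FieldType: name is the name of the FieldType, and class points to the implementing class (the solr. prefix is shorthand, so solr.TextField resolves to org.apache.solr.schema.TextField), which defines how the type behaves. The most important part of a FieldType definition is the analyzer used for data of this type at index time and at query time, which covers both tokenization and filtering.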
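To make the tokenization-plus-filtering idea concrete, here is a hedged sketch of a field type with separate index-time and query-time analyzers. The name textComplexLower is hypothetical and not part of the original setup; the dicPath is assumed to be the dic folder from step 2. Both chains use the same tokenizer mode so that indexed and queried tokens line up.

```xml
<!-- Hypothetical type name; not used elsewhere in this post -->
<fieldType name="textComplexLower" class="solr.TextField" positionIncrementGap="100">
  <!-- Applied when documents are indexed -->
  <analyzer type="index">
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="E:/install/Tomcat 6.0.26/webapps/solr/dic" />
    <!-- Filter step: lowercases tokens so matching of Latin text is case-insensitive -->
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
  <!-- Applied to the query string -->
  <analyzer type="query">
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="E:/install/Tomcat 6.0.26/webapps/solr/dic" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>
```

A single analyzer element without a type attribute, as used in step 3, applies the same chain to both indexing and querying.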

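Fields: the fields node defines the concrete fields (similar to columns in a database). Each field element has name, type (one of the FieldTypes defined earlier), indexed (whether it is indexed), stored (whether it is stored), and multiValued (whether it can hold multiple values).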

copyField (copy field): defines a copy field, so that all full-text fields are copied into a single field for unified searching.
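A short sketch of a copyField setup follows. The catch-all field name all is hypothetical; the source fields simple and complex are the ones added in step 4. Treat it as an illustration under those assumptions, not as part of the original configuration.

```xml
<!-- Inside <fields>: hypothetical catch-all field; multiValued because several sources feed it -->
<field name="all" type="textComplex" indexed="true" stored="false" multiValued="true" />

<!-- After </fields>, as direct children of <schema> -->
<copyField source="simple" dest="all" />
<copyField source="complex" dest="all" />
```

Queries can then target the single all field instead of listing every full-text field.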

2011-12-29 11:08
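#1 hfkiss44 2012-01-13

Thanks for sharing this article. I followed your steps and got database indexing and querying working, but only English terms return results. Queries in Chinese return nothing. What could be causing this?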
