Solr (5): Synonyms

extrimlycold


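Solr ships with built-in synonym support, but it is quite limited for Chinese: Chinese text has to be segmented into words before it can be searched, so the stock configuration from the official examples is of little practical use on its own.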

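Concept: roughly speaking, synonym support means that when a user enters a word, Solr also retrieves from the index the content whose terms are synonyms or near-synonyms of that word and shows it to the user, which makes search friendlier. These synonyms must be defined in a configuration file in advance. For example, if a user enters "TV", documents containing related terms such as "Television" or "TVs" can be matched as well, provided the group was defined in the configuration file beforehand.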

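I) The official configuration: this setup is the one mentioned in the cookbook, but it cannot be combined with Chinese word segmentation, so for Chinese text it is essentially useless.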

1: In schema.xml, add a <fieldType> inside the <types> tag, as follows:

<fieldType name="text_syn" class="solr.TextField">
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

The synonyms.txt file referenced here already exists in the example configuration; it is the synonym definition file. Its format is roughly as follows:

# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#-----------------------------------------------------------------------
#some test synonym mappings unlikely to appear in real input text
aaafoo => aaabar
bbbfoo => bbbfoo bbbbar
cccfoo => cccbar cccbaz
fooaaa,baraaa,bazaaa

# Some synonym groups specific to this example
GB,gib,gigabyte,gigabytes
MB,mib,megabyte,megabytes
Television, Televisions, TV, TVs
#notice we use "gib" instead of "GiB" so any WordDelimiterFilter coming
#after us won't split it into two words.
中国,英国,日本

# Synonym mappings can be used for spelling correction too
pixima => pixma

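I have already added a Chinese entry to the file above. Because of character-set issues, after editing the file open it in EditNote and choose Format --> UTF-8 encoding; if you see garbled characters, fix the encoding. The intent of the entry is that entering any of these Chinese words returns the same search results. The file also contains both => rules and comma-separated entries; the explanation from the official material is quoted here for reference: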

Let's get back to our example for a second. What if the person from the marketing
department says that he/she wants not only to be able to find books that have the word
"machine" to be found when entering the word "electronics", but also all the books that
have the word "electronics", to be found when entering the word "machine". The answer
is simple. First, we would set the expand attribute (of the filter) to true. Then we would
change our synonyms.txt file to something like this:
machine, electronics
As I said earlier Solr would expand synonyms to equivalent forms.

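In other words, => defines an explicit one-way mapping (the terms on the left are replaced by the terms on the right), while a comma-separated line defines a group of equivalent terms, that is, many-to-many. How a group is applied depends on the expand attribute: with expand="true" each term in the group is expanded to all of them, while with expand="false" every term is reduced to the first one on the line. Because the fieldType above applies the filter only at index time with expand="false", a group is indexed under its first entry only, so matching in both directions needs expand="true", as the quoted passage describes.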
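The difference between the two rule forms can be sketched in a few lines of Python. This is only an illustration of the semantics, not Solr's implementation: the function name parse_synonym_rules is hypothetical, ignoreCase is not modelled, and a multi-word right-hand side such as "cccbar cccbaz" is kept as a single string rather than split into tokens.

```python
def parse_synonym_rules(lines, expand=True):
    """Turn synonyms.txt-style lines into a dict: term -> list of replacement terms.

    Hypothetical helper for illustration only; it mimics the documented
    behaviour of the expand attribute, not Solr's actual code.
    """
    mapping = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            # Explicit mapping: left-hand terms are replaced by right-hand terms.
            left, right = line.split("=>", 1)
            sources = [t.strip() for t in left.split(",") if t.strip()]
            targets = [t.strip() for t in right.split(",") if t.strip()]
        else:
            # Equivalence group: expand=True maps each term to the whole group,
            # expand=False reduces each term to the first one on the line.
            group = [t.strip() for t in line.split(",") if t.strip()]
            sources = group
            targets = group if expand else group[:1]
        for src in sources:
            bucket = mapping.setdefault(src, [])
            for target in targets:
                if target not in bucket:
                    bucket.append(target)
    return mapping


rules = ["pixima => pixma", "GB,gib,gigabyte,gigabytes", "中国,英国,日本"]
expanded = parse_synonym_rules(rules, expand=True)
reduced = parse_synonym_rules(rules, expand=False)

print(expanded["pixima"])  # ['pixma']
print(expanded["gib"])     # ['GB', 'gib', 'gigabyte', 'gigabytes']
print(reduced["日本"])      # ['中国']
```

With expand=False the last group collapses to its first entry, which is why the order of terms on a comma-separated line matters in that mode.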

Of course, the relevant field also has to be declared with this type, as follows.

<field name="content_copy" type="text_syn" indexed="true" stored="true"/>

Now test it on the analysis page of the admin interface: enter pixima, and the mapped term pixma appears in the analysis output.
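The same check can also be scripted instead of using the admin page. The sketch below only builds the request URL for Solr's field analysis handler. It assumes that the handler is registered at /analysis/field in solrconfig.xml and that Solr is reachable at http://localhost:8983/solr; neither is stated in this article, and the helper name field_analysis_url is hypothetical.

```python
from urllib.parse import urlencode


def field_analysis_url(base_url, field_type, value):
    """Build a URL that asks Solr to run index-time analysis on a value.

    base_url and the /analysis/field path are assumptions about the local
    setup; adjust them to match your solrconfig.xml.
    """
    params = {
        "analysis.fieldtype": field_type,  # the fieldType defined in schema.xml
        "analysis.fieldvalue": value,      # text to push through the index analyzer
        "wt": "json",                      # ask for a JSON response
    }
    return base_url.rstrip("/") + "/analysis/field?" + urlencode(params)


url = field_analysis_url("http://localhost:8983/solr", "text_syn", "pixima")
print(url)
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return the token stream after each analysis step, where pixma is expected to appear after the synonym filter.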


2013-04-09 13:08