利用thinking sphinx实现全文检索

全部 Ruby Python PHP Flash C++ .net Rails Flex C C# Django

浏览 23128 次

锁定老帖子主题：利用thinking sphinx实现全文检索该帖已经被评为精华帖
作者	正文
司徒正美等级: 性别: 文章: 89 积分: 330 来自: 广州	发表时间：2009-07-23 最后修改：2009-07-23 相关推荐: 使用thinking sphinx实现全文检索 Thinking Sphinx + Coreseek + rmmseg的安装与使用全文检索 Thinking-sphinx 的中文支持 mysql thinking_sphinx coreseek rails3 更多相关推荐 Ruby 利用thinking sphinx实现全文检索随便抄几段介绍一下Sphinx。 Sphinx支持高速建立索引（可达10MB/秒，而Lucene建立索引的速度是1.8MB/秒）高性能搜索（在2-4 GB的文本上搜索，平均0.1秒内获得结果）高扩展性（实测最高可对100GB的文本建立索引，单一索引可包含1亿条记录）支持分布式检索支持基于短语和基于统计的复合结果排序机制支持任意数量的文件字段（数值属性或全文检索属性）支持不同的搜索模式（“完全匹配”，“短语匹配”和“任一匹配”）支持作为Mysql的存储引擎 Thinking Sphinx是其Ruby接口，既然railscasts.com的Ryan都推荐，我们没有理由不去试试这个热门插件。 http://www.kuqin.com/searchengine/20081204/29401.html http://www.zhengjinjun.com/index.php/2009/06/28/sphinx搜索语法/ 安装支持 git clone git://github.com/freelancing-god/thinking-sphinx.git vendor/plugins/thinking_sphinx 如果是gem安装 config.gem( 'freelancing-god-thinking-sphinx', :lib => 'thinking_sphinx', :version => '1.1.12' ) 务必在你的rails项目的根目录的Rakefile文件中添加，插件方式则免去这一步！ require 'thinking_sphinx/tasks' 至于sphinx可到我的博客下载！（支持中文分词的版本!!!）解压后，记得把Sphinx的bin目录添加到系统变量中。定义索引 class Topic < ActiveRecord::Base #……………………………………………………其他实现……………………………………………………………… #如果想重新建立索引，在控制台输入rake ts:in INDEX_ONLY = true #=======================搜索部分===================== define_index do indexes :title,:sortable => true indexes first_post.body,:as => :body,:sortable => true indexes author has :created_at,:updated_at,:forum_id,:digest end #……………………………………………………其他实现……………………………………………………………… end class Post < ActiveRecord::Base #……………………………………………………其他实现……………………………………………………………… #如果想重新建立索引，在控制台输入rake ts:in INDEX_ONLY = true #=======================搜索部分===================== define_index do indexes topic(:title),:as => :title ,:sortable => true indexes body,:sortable => true indexes author has :created_at,:updated_at,:forum_id end #……………………………………………………其他实现……………………………………………………………… end 与Ferret相比，它不能实时更新索引（亦即不能在我们创建更新删除一个模型，不能重新索引相关数据）。幸好，Delta Indexes(增量索引)的出现解决了这问题。实现增量索引有三种方式：实时索引，定时索引与延时索引。实时索引最常用的就是实时索引，它要求我们在被索引的模型中增加一个属性，名为delta。 ruby script/generate migration AddDeltaToTopic delta:boolean class AddDeltaToTopic < ActiveRecord::Migration def self.up add_column :topics, :delta,:boolean, :default => true, :null => false add_column :posts, :delta,:boolean, :default => true, :null => false end def self.down remove_column :topics, :delta remove_column :posts, :delta end end define_index do indexes :title,:sortable => true indexes first_post.body,:as => :body,:sortable => true indexes author has :created_at,:updated_at,:forum_id,:digest #声明使用实时索引 set_property :delta => true end indexes topic(:title),:as => :title ,:sortable => true indexes body,:sortable => true indexes author has :created_at,:updated_at,:forum_id #声明使用实时索引 set_property :delta => true end 定时索引注:这个非我们项目的流程，作为小知识记一记就是！ set_property :delta => :datetime, :threshold => 1.hour 一小时索引一次，实际上是重建所有索引，所以注意间隔时间不要短于建立索引的时间。默认使用updated_at，这样就不用给被索引模型添加delta属性了。需要结合的重建索引的命令为rake thinking_sphinx:index:delta或rake ts:in:delta 延时索引 set_property :delta => :delayed 待到搜索时才重建索引。 sphinx索引速度在所有开源搜索引擎中是最快，能应对这种需求。需要结合的重建索引的命令为rake thinking_sphinx:delayed_delta或rake ts:dd 建立索引注:非流程的小知识以下任一个命令都可以，下面的基本上是上面的简写。 rake thinking_sphinx:index #将生成配置文件 rake thinking_sphinx:index INDEX_ONLY=true #不生成配置文件 rake ts:index rake ts:in 重建索引注:非流程的小知识以下任一个命令都可以，下面的基本上是上面的简写。 rake thinking_sphinx:rebuild rake ts:rebuild 设置Thinking Sphinx配置文件这个可有可无。虽然推崇COC，但对于某些场合中配置文件还是必不可少的。thinking_sphinx会根据我们提供的配置文件生成sphinx的配置文件，利用后者指导引擎去建立索引。在config下新建一文件sphinx.yml 一个简单的模板 development: &my_settings enable_star: 1 min_prefix_len: 0 min_infix_len: 2 min_word_len: 1 max_results: 70000 morphology: none listen: localhost:3312 exceptions: D:/Louvre/log/sphinx_exception.log chinese_dictionary: I:/sphinx/bin/xdict test: <<: my_settings production: <<: my_settings 生成Sphinx配置文件以下随便选一个。 rake thinking_sphinx:configure rake ts:conf rake ts:config 随便运行上面一个，就可以看到config文件夹中多出一个文件development.sphinx.conf 修改 searchd { address = 127.0.0.1:3312 port = 3312 ………………………………………………………… } 为 searchd { listen = 127.0.0.1:3312 ………………………………………………………… } 搜索charset_type = utf-8，在其下面添加下面添加chinese_dictionary = I:/sphinx/bin/xdict，chinese_dictionary是指定分词词典的选项，包括路径和文件名。有几个charset_type = utf-8，就添加几个chinese_dictionary = I:/sphinx/bin/xdict(通常位于index XXXX_core中) 到这里我们就可以执行索引了 rake ts:in INDEX_ONLY=true 注意，一定要使用NDEX_ONLY=true 索引成功，没有报错，大抵是这个样子。它索引的速度非常快。单一索引最大可包含1亿条记录，在1千万条记录情况下的查询速度为0.x秒（毫秒级）。Sphinx创建索引的速度为：创建100万条记录的索引只需3～4分钟，创建1000万条记录的索引可以在50分钟内完成，而只包含最新10万条记录的增量索引，重建一次只需几十秒。建立Search模块 ruby script/generate controller search index 添加路由规则。 map.online '/seach', :controller => 'seach', :action => 'index' 修改控制器。 class SearchController < ApplicationController def index @class = params[:class] \|\| "topic" @query = params[:query] \|\| '' unless @query.blank? if @class == "topic" @results = Topic.search @query else @results = Post.search @query end end end end 修改相应视图 thinking sphink现在还不支持摘要与高亮，不过我们还可以用一些取巧的方法来取代它们，如用truncate缩短显示的结果，利用rails自带的highlight方法实现高亮。 <% form_tag '/search', :method => :get ,:style => "margin-left:40%" do %> <input type="radio" name="class" value = "topic" <%= @class == "topic"? 'checked="checked"':'' %>>仅主题贴 <input type="radio" name="class" value = "post" <%= @class == "post"? 'checked="checked"':'' %>>所有贴子<br> <p> <%= text_field_tag :query, @query %> <%= submit_tag "搜索", :name => nil %> </p> <% end %> <%- if defined? @results -%> <style type="text/css"> .hilite{ color:#0042BD; background:#F345CC; } </style> <div id="search_result"> <%- @results.each do \|result\| -%> <h3 style="text-align:center;color:red"><%= highlight h(result.title),@query, '<span class="hilite">\1</span>' %></h3> <div><%= highlight simple_format(result.body), @query, '<span class="hilite">\1</span>'%></div> <%- end -%> </div> <%- end -%> 现在可以测试一下，但务必先启动rake ts:start才能进行搜索！关闭时一定要使用rake ts:stop，sphinx是后台运行的。关闭netBeans时不会像mongrel那样一同关闭，由于忘记关闭再又再开启一个sphinx进程，会导致各种奇怪的错误。这时，我们搜索到的结果应该是这个样子。分页 Thinking sphinx是原生支持will_paginate，默认每页为20条记录。我们可以修改一下。 `1.@results` `= Topic.search params[:search],` `:page` `=> (params[:page] \|\|` `1),:per_page` `=>` `5` 生成的结果是个特殊的集合（!seq:ThinkingSphinx::Collection ），拥有以下属性： @results.total_entries:总记录数 @results.total_pages:总页数 @results.current_page:当前页数 @results.per_page:每页的记录数声明：ITeye文章版权属于作者，受法律保护。没有作者书面许可不得转载。推荐链接
返回顶楼

司徒正美等级: 性别: 文章: 89 积分: 330 来自: 广州	发表时间：2009-07-23 多模型查询 ThinkingSphinx::Search.search "term", :classes => [Post, User, Photo, Event] 指定匹配模式 SPH_MATCH_ALL模式，匹配所有查询词（默认模式）。 Topic.search "Ruby Louvre",:match_mode => :all Topic.search "Louvre" PH_MATCH_ANY模式，匹配查询词中的任意一个。 Topic.search "Ruby Louvre", :match_mode => :any SPH_MATCH_PHRASE模式，将整个查询看作一个词组，要求按顺序完整匹配。 Topic.search "Ruby Louvre", :match_mode => :phrase SPH_MATCH_BOOLEAN模式，将查询看作一个布尔表达式。 Topic.search "Ruby \| Louvre, :match_mode => :boolean SPH_MATCH_EXTENDED模式，将查询看作一个Sphinx内部查询语言的表达式。 User.search "@name pat \| @username pat", :match_mode => :extended 加权为了提高搜索质量，我们可以对某些模型或字段进行加权，提高它们的优先度。默认都是1 User.search "pat allan", :field_weights => { :alias => 4, :aka => 2 } ThinkingSphinx::Search.search "pat", :index_weights => { User => 10 } 完整例子,修改我们的编辑器。 unless params[:search].blank? @q = params[:search] @results = Topic.search @q, :page => (params[:page] \|\| 1), :per_page => 5, :order => "@relevance DESC,updated_at DESC", :field_weights => { :title => 20 } end 对应的视图代码 <% @results.each_with_weighting do \|topic, weight\| %> <h3 style="text-align:center;color:red"><%= hilight h(topic.title),@q %></h3> <div><%= hilight simple_format(topic.body), @q %></div> <p>作者：<%= hilight h(topic.author),@q %> 发表时间 <%= topic.created_at.to_s(:db) %> 权重：<%= weight %></p> <% end %> Weighted MultiModel Search ThinkingSphinx::Search.search params[:query], :classes => [Article, Term], :limit => 1_000_000, :page => params[:page] \|\| 1, :per_page => 20, :index_weights => { "article_core" => 100, "article_delta" => 100, } 相关语法 @results.each_with_weighting do \|result, weight\| @results.each_with_group do \|result, group\| @results.each_with_group_and_count do \|result, group, count\| @results.each_with_count do \|result, count\| #Any attribute from the result set will work with each_with_blah - so, distances: @result.each_with_geodist do \|result, distance\| 条件搜索这有利于缩小搜索的范围，更容易得到我们想要的结果。注意，用作条件的属性必须被索引（以indexes或has方式） ThinkingSphinx::Search.search :with => {:tag_ids => 10} ThinkingSphinx::Search.search :with => {:tag_ids => [10,12]} ThinkingSphinx::Search.search :with_all => {:tag_ids => [10,12]} User.search :conditions => {:name => "司徒正美",roles => "admin"} Topic.search :conditions => {:forum_id => 10} Article.search "East Timor", :conditions => {:rating => 3..5} User.search :without => {:roles => "user"} #搜索所有高级会员，版主与超版。 User.search :without_ids => 1 通配符搜索在配置文件sphinx.yml中声明 development: &my_settings enable_star: 1 min_prefix_len: 0 min_infix_len: 2 嘛，我原来给出的模板就有了，所以大家就不要费心了。 Location.search "elbourn" Location.search "elbourn -ustrali", :star => true, :match_mode => :boolean #相当于搜索elbourn -ustrali User.search("oo@bar.c", :star => /[\w@.]+/u)#相当于搜索oo@bar.c 分组对时间进行分组。 define_index do # 其他设置 has :created_at end Fruit.search "apricot", :group_function => :day, :group_by => 'created_at' 一些时间函数 # * <TT>:day</TT> - YYYYMMDD # * <TT>:week</TT> - YYYYNNN (NNN is the first day of the week in question, counting from the start of the year ) # * <TT>:month</TT> - YYYYMM # * <TT>:year</TT> - YYYY 有了它我们可以很轻松地统计每个版本一周内的发贴量通过模型的某个属性进行分组 Fruit.search "apricot", :group_function => :attr, :group_by => 'size' 同一个模型使用多个索引 class User < ActiveRecord::Base #…………其他设置…………………… has_many :user_tags define_index do has user_tags, :as=>:type_ones where "tag_type=0" end define_index do has user_tags, :as=>:type_twos where "tag_type=1" end 一些有用的链接 http://freelancing-gods.com/posts/a_concise_guide_to_using_thinking_sphinx http://railscasts.com/episodes/120-thinking-sphinx http://freelancing-god.github.com/ts/en/ PS:javaeye中有许多地方不能达到我排版效果，可以到我的博客看一看。
返回顶楼	回帖地址 0 0 请登录后投票

cquaker 等级: 初级会员性别: 文章: 55 积分: 50 来自: 北京	发表时间：2009-07-24 最后修改：2009-07-24 写的很不错，楼主这个中文分词效果怎么样？另外下载了你主页上的支持中文的sphinx ，发现是windows下的，有无linux平台下的？
返回顶楼	回帖地址 0 0 请登录后投票

wodesign 等级: 初级会员性别: 文章: 2 积分: 60 来自: 苏州	发表时间：2009-07-25 谢谢分享，正需要，要好好学习下。
返回顶楼	回帖地址 0 3 请登录后投票

yangzhihuan 等级: 性别: 文章: 225 积分: 283 来自: 广州	发表时间：2009-07-26 看完楼主的介绍,感觉比act_as_ferret好用了,不支持中文分词的性能怎么样,另外gem版的插件支持ruby 1.9吗?
返回顶楼	回帖地址 0 0 请登录后投票

司徒正美等级: 性别: 文章: 89 积分: 330 来自: 广州	发表时间：2009-07-26 看了一下我的ruby版本，是1.8.6。应该支持ruby1.9吧，毕竟其插件的作者与合作者是那么多，每月都有更新，google讨论组的人气也很高！ http://github.com/freelancing-god/thinking-sphinx/tree/master
返回顶楼	回帖地址 0 0 请登录后投票

qichunren 等级: 性别: 文章: 501 积分: 340 来自: 蕲春－>上海	发表时间：2009-07-26 我想知道 Sphinx和Ferret 能不能用于Oracle数据库下。
返回顶楼	回帖地址 0 0 请登录后投票

derk 等级: 初级会员文章: 11 积分: 30 来自: ...	发表时间：2009-07-28 最后修改：2009-07-28 发现一个问题 def post_ids1 self.posts.map { \|post\| post.id }.join ',' end def post_ids2 self.posts.collect { \|post\| post.id } end User.search :without_ids => 1 ok ---------------------------------------- User.search :without_ids => post_ids1 wrong User.search :without_ids => post_ids2 ok
返回顶楼	回帖地址 0 0 请登录后投票

Raecoo 等级: 初级会员性别: 文章: 56 积分: 40 来自: 北京	发表时间：2009-07-30 chinese_dictionary 这个配置项是楼主自己添加的? 貌似TS的configure.rb里不支持,看demo里搜索的是英文,有搜索中文成功的示例看看不
返回顶楼	回帖地址 0 0 请登录后投票

司徒正美等级: 性别: 文章: 89 积分: 330 来自: 广州	发表时间：2009-07-31 Raecoo 写道 chinese_dictionary 这个配置项是楼主自己添加的? 貌似TS的configure.rb里不支持,看demo里搜索的是英文,有搜索中文成功的示例看看不给！大小: 191.7 KB 大小: 144.2 KB 查看图片附件
返回顶楼	回帖地址 0 0 请登录后投票