Installing and using Sphinx: Windows, CentOS, Coreseek

sinykk

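Back when I was working in Java I knew about Lucene and meant to learn it when I had time. Later I moved mostly to PHP and lost interest in Lucene. This time a project needed full-text search, which I did not know how to do and was very curious about, so I learned Sphinx. I had many questions before I started; after a lot of reading and experimenting I understand a good deal more. Below I answer the questions I had at the very beginning, from where I stood then.

What full-text search is, what it is for, and how to use it

Full-text search means searching the specified fields of the specified tables in a database, but far more efficiently than SQL conditions such as LIKE or OR. The text is split into words (tokenized) and indexed in advance, a step that only full-text search uses.

Full-text search can also cover arbitrary documents, because Coreseek can use Python as a universal data source.

As for how to use it: it is a bit like memcache, a separate service that works like a plugin. PHP calls Sphinx and gets back the data that matched.

You can also query the data in Sphinx by writing SQL-style statements.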
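To make the difference from LIKE concrete, here is a minimal sketch in plain PHP. It is not how Sphinx is implemented; the sample rows and the function names (scanSearch, buildIndex, indexSearch) are made up for illustration. It only shows the idea of tokenizing once at indexing time so that a query becomes a lookup instead of a scan.

```php
<?php
// Toy data standing in for rows of an articles table (hypothetical).
$docs = array(
    1 => 'sphinx is a full text search engine',
    2 => 'mysql like scans every row',
    3 => 'coreseek adds chinese segmentation to sphinx',
);

// LIKE-style search: look inside every document on every query.
function scanSearch(array $docs, $word) {
    $hits = array();
    foreach ($docs as $id => $text) {
        if (strpos($text, $word) !== false) {
            $hits[] = $id;
        }
    }
    return $hits;
}

// Full-text style: tokenize once at indexing time, word => list of doc ids.
function buildIndex(array $docs) {
    $index = array();
    foreach ($docs as $id => $text) {
        foreach (array_unique(explode(' ', $text)) as $word) {
            $index[$word][] = $id;
        }
    }
    return $index;
}

// A query is then one array lookup instead of a pass over all documents.
function indexSearch(array $index, $word) {
    return isset($index[$word]) ? $index[$word] : array();
}

$index = buildIndex($docs);
print_r(scanSearch($docs, 'sphinx'));
print_r(indexSearch($index, 'sphinx'));
```

Both calls find documents 1 and 3; the second one never touches the document text at query time. Splitting on spaces only works for languages that separate words with spaces, which is why Chinese needs a separate segmentation step.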

Quick start with Coreseek 4

1. Write the conf file

2. Build the indexes
D:\coreseek4\bin>indexer -c ../etc/csft_sinykk.conf --all

3. Start the Sphinx service
D:\coreseek4\bin>searchd -c ../etc/csft_sinykk.conf --index all

You can then test from the command line
D:\coreseek4\bin>search -c ../etc/csft_sinykk.conf 'teststr'

---------------------------------------------------------------
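Note: Coreseek 4.0.1 does not support string primary keys. The string support the official site talks about is only sql_field_string.

SphinxSE was upgraded to version 1.11-dev, which supports string attributes (patch file: see below)
Through SphinxSE you can read back attributes declared as strings, such as sql_field_string, so a MySQL query that goes through SphinxSE can return the string values stored in the Coreseek/Sphinx index

To use the new sql_attr_string and sql_field_string settings you must use the new sphinxapi.php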
---------------------------------------------------------------

=========================================

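Key points

The configuration file contains:

# the first column of sql_query (id) must be an integer
# string/text columns (contents here) are full-text indexed
sql_query                = SELECT id,cate_id,contents FROM articles

# every column declared with sql_attr_* is left out of full-text search, but can be used to filter results with SetFilter
sql_attr_uint            = cate_id

# used by the command-line search tool to read the original rows from the database; MySQL only, for debugging only
#sql_query_info = SELECT * FROM articles WHERE id=$id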

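Merging indexes:
indexer --merge main delta --config /usr/local/coreseek/etc/csft.conf --rotate

bin\searchd -c etc\csft_mysql.conf --pidfile
Note: the --pidfile option is required. It forces searchd to write its pid file; without it, merging with --rotate fails with an error that the pid file cannot be opened (this is very important).

Updating an index (this amounts to rebuilding it; see the documentation for the benefits)

D:\coreseek\bin>indexer -c ../etc/csft_rtsinykk2.conf rtarticles_2_delta --rotate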

==========================================

II. Installing Sphinx on Windows

1. Download the latest Windows build from http://www.sphinxsearch.com/downloads.html. I used "Win32 release binaries with MySQL support". Unzip it into D:/sphinx.

2. Under D:/sphinx/ create a data directory for the index files and a log directory for the log files, then copy D:/sphinx/sphinx.conf.in to D:/sphinx/bin/sphinx.conf (note the changed file name).

3. Edit D:/sphinx/bin/sphinx.conf. These are the settings that need changing:

type           = mysql # data source type; MySQL here
sql_host       = localhost # database server
sql_user       = root # database user
sql_pass       = '' # database password
sql_db         = test # database name
sql_port       = 3306 # database port

sql_query_pre = SET NAMES utf8 # uncomment this line if your database uses utf8

index test1
{
# index file path and base name (no extension)
path      = D:/sphinx/data/test1
# encoding
charset_type     = utf-8
# utf-8 character table
charset_table     = 0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F
# simple n-gram tokenization; only 0 and 1 are supported; set it to 1 to search Chinese
ngram_len       = 1
# characters to tokenize as n-grams; uncomment this to search Chinese
ngram_chars      = U+3000..U+2FA1F
}

# searchd settings that need changing
searchd
{
# log file
log = D:/sphinx/log/searchd.log

# PID file, searchd process ID file name
pid_file = D:/sphinx/log/searchd.pid

# on Windows this line must be commented out to start searchd
# seamless_rotate = 1
}
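What ngram_len = 1 means for Chinese text can be sketched in a few lines of PHP. This is only an illustration of one-character tokens, not Sphinx's own tokenizer; the function names unigrams and matchesAllUnigrams are hypothetical, and real Sphinx also tracks word positions and ranking.

```php
<?php
// Split a UTF-8 string into single characters. The empty pattern with the
// /u flag matches between code points, so each piece is one character.
function unigrams($text) {
    return preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
}

// Toy version of 1-gram matching: a document matches when it contains
// every character of the query.
function matchesAllUnigrams($doc, $query) {
    $docChars = array_flip(unigrams($doc));
    foreach (unigrams($query) as $ch) {
        if (!isset($docChars[$ch])) {
            return false;
        }
    }
    return true;
}

print_r(unigrams('测试中文'));
var_dump(matchesAllUnigrams('测试中文', '中文'));   // bool(true)
var_dump(matchesAllUnigrams('测试中文', '英文'));   // bool(false)
```

Because every character becomes its own token, no dictionary is needed, but there is also no notion of real words. That gap is what Coreseek's dictionary-based segmentation is meant to fill.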

4. Import the test data

The SQL file is D:/sphinx/example.sql

C:/Program Files/MySQL/MySQL Server 5.0/bin>mysql -uroot test<d:/sphinx/example.sql

5. Build the index

D:/sphinx/bin>indexer.exe test1 (note: test1 is the "index test1" block in sphinx.conf)
Sphinx 0.9.8-release (r1533)
Copyright (c) 2001-2008, Andrew Aksyonoff

using config file './sphinx.conf'...
indexing index 'test1'...
collected 4 docs, 0.0 MB
sorted 0.0 Mhits, 100.0% done
total 4 docs, 193 bytes
total 0.101 sec, 1916.30 bytes/sec, 39.72 docs/sec

D:/sphinx/bin>

6. Search for 'test'

D:/sphinx/bin>search.exe test (note: test is the English word being searched for)

The output looks like this:

using config file './sphinx.conf'...
index 'test1': query 'test ': returned 3 matches of 3 total in 0.000 sec

displaying matches:
1. document=1, weight=2, group_id=1, date_added=Wed Nov 26 14:58:59 2008
id=1
group_id=1
group_id2=5
date_added=2008-11-26 14:58:59
title=test one
content=this is my test document number one. also checking search within
phrases.
2. document=2, weight=2, group_id=1, date_added=Wed Nov 26 14:58:59 2008
id=2
group_id=1
group_id2=6
date_added=2008-11-26 14:58:59
title=test two
content=this is my test document number two
3. document=4, weight=1, group_id=2, date_added=Wed Nov 26 14:58:59 2008
id=4
group_id=2
group_id2=8
date_added=2008-11-26 14:58:59
title=doc number four
content=this is to test groups

words:
1. 'test': 3 documents, 5 hits
D:/sphinx/bin>

7. Test Chinese search

Modify the documents table in the test database:

UPDATE `test`.`documents` SET `title` = '测试中文', `content` = 'this is my test document number two，应该搜的到吧' WHERE `documents`.`id` = 2;

Rebuild the index:

D:/sphinx/bin>indexer.exe test1

Search for '中文':

D:/sphinx/bin>search.exe 中文 (note: 中文 is the Chinese word being searched for)
Sphinx 0.9.8-release (r1533)
Copyright (c) 2001-2008, Andrew Aksyonoff

using config file './sphinx.conf'...
index 'test1': query '中文 ': returned 0 matches of 0 total in 0.000 sec

words:
D:/sphinx/bin>

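Nothing was found. That is because the Windows command line uses GBK encoding, so the query does not match the UTF-8 index. We can try from a program instead: create a file foo.php under D:/sphinx/api, and be sure to save it as UTF-8.

Start the Sphinx searchd service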

D:/sphinx/bin>searchd.exe
Sphinx 0.9.8-release (r1533)
Copyright (c) 2001-2008, Andrew Aksyonoff

WARNING: forcing --console mode on Windows
using config file './sphinx.conf'...
creating server socket on 0.0.0.0:9312
accepting connections

<?php
require 'sphinxapi.php';
$s = new SphinxClient();
$s->SetServer('localhost', 9312);
$result = $s->Query('中文');
var_dump($result);
?>

Run the PHP query:

Open http://www.test.com/sphinx/api/foo.php (a virtual host I configured myself)

Reference: http://blog.csdn.net/siren0203/article/details/5564082
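The encoding mismatch behind the empty command-line result for 中文 can be checked in PHP. The sketch below is my own illustration, not part of sphinxapi.php: isUtf8 and toUtf8Query are hypothetical helper names, and the conversion step assumes that any query that is not valid UTF-8 came from a GBK console.

```php
<?php
// True when $s is valid UTF-8: preg_match() with the /u flag fails on
// subjects that are not valid UTF-8 and returns false instead of 1.
function isUtf8($s) {
    return preg_match('//u', $s) === 1;
}

// Convert a query typed in a GBK console to UTF-8 before sending it to
// searchd. Assumption: anything that is not valid UTF-8 is GBK.
function toUtf8Query($q) {
    if (isUtf8($q) || !function_exists('iconv')) {
        return $q;
    }
    $converted = iconv('GBK', 'UTF-8', $q);
    return $converted === false ? $q : $converted;
}

$utf8 = "\xE4\xB8\xAD\xE6\x96\x87";   // "中文" encoded as UTF-8 (6 bytes)
$gbk  = "\xD6\xD0\xCE\xC4";           // "中文" encoded as GBK (4 bytes)

var_dump(isUtf8($utf8));              // bool(true)
var_dump(isUtf8($gbk));               // bool(false)
var_dump(bin2hex(toUtf8Query($utf8)));
```

The same two characters are entirely different byte sequences in the two encodings, so a GBK query can never match tokens that were indexed as UTF-8.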

======================================================

Using Coreseek word segmentation

Coreseek bundles Sphinx, so once you download Coreseek you can use the conf files in its etc/ folder directly.

Main settings; for details see:

http://www.coreseek.cn/products-install/mysql/

http://www.coreseek.cn/products-install/coreseek_mmseg/

charset_dictpath = D:/coreseek/etc/

charset_type = zh_cn.utf-8

Test code

<?php
require("sphinxapi.php");

$cl = new SphinxClient();
$cl->SetServer('127.0.0.1', 9312);
$cl->SetConnectTimeout(3);
$cl->SetArrayResult(true);          // matches come back as a plain list, each with an 'id' key
$cl->SetMatchMode(SPH_MATCH_ANY);   // match documents containing any of the query words
$cl->SetLimits(20, 10);             // paging: offset 20, 10 rows
$res = $cl->Query('已经被我们格式化了', "*");   // "*" searches all indexes

if (!empty($res['matches'])) {
    // collect the matched ids in the order Sphinx ranked them
    $idsarr = array();
    foreach ($res['matches'] as $r) {
        $idsarr[] = (int)$r['id'];
    }
    $ids = implode(',', $idsarr);

    $conn = new mysqli('localhost', 'root', '', 'demo');
    $conn->set_charset('utf8');
    $sql = "select id,title,contents from articles where id in ($ids)";
    echo $sql;
    $result = $conn->query($sql);

    // the words Sphinx split the query into, used for highlighting
    $words = array_keys($res['words']);
    var_dump($words);

    // put the database rows into a new array keyed by id, so they can be
    // printed in Sphinx order (IN (...) does not keep the order of the list)
    $data = array();
    while ($i = $result->fetch_assoc()) {
        $data[$i['id']] = $i;
    }
    foreach ($idsarr as $i) {
        if (!isset($data[$i])) {
            continue;   // indexed, but no longer in the database
        }
        echo ' ================================= ';
        echo $data[$i]['id'] . replacestr($words, $data[$i]['title']) . ' ';
        echo replacestr($words, $data[$i]['contents']) . ' ';
    }

    var_dump($res['matches']);
} else {
    echo ' no matches ';
    var_dump($res);
}

print_r($cl);
print_r($res);

// wrap every query word found in $str in highlight markup
function replacestr($arr, $str) {
    foreach ($arr as $r) {
        // the original wrapping markup was lost from this post; <b> is a stand-in
        $str = str_replace($r, "<b>" . $r . "</b>", $str);
    }
    return $str;
}
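The SetLimits(20, 10) call in the test code is page 3 at 10 rows per page. A small helper can do that arithmetic; the sketch below is my own, the name pageToLimits is hypothetical, and the 1000 default is an assumption that mirrors what I understand to be searchd's default max_matches.

```php
<?php
// Turn a 1-based page number into the (offset, limit) pair that
// SphinxClient::SetLimits() expects. $maxMatches is assumed to mirror
// searchd's max_matches setting.
function pageToLimits($page, $perPage = 10, $maxMatches = 1000) {
    $page   = max(1, (int)$page);
    $offset = ($page - 1) * $perPage;
    if ($offset >= $maxMatches) {
        // rows past max_matches are not available; fall back to the last page
        $offset = (int)(floor(($maxMatches - 1) / $perPage) * $perPage);
    }
    $limit = min($perPage, $maxMatches - $offset);
    return array($offset, $limit);
}

list($offset, $limit) = pageToLimits(3);
echo "$offset,$limit\n";   // prints 20,10
```

With a client object you would then call $cl->SetLimits($offset, $limit) before Query().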

======================================================

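If you configure several source/index pairs in one configuration file and use AddQuery in your code, you can run several kinds of query in one round trip, for example users, apps and tags at the same time.

// AddQuery saves the current settings with each query, so set filters and limits first
$sphinx->SetFilter('name', array(3));
$sphinx->SetLimits(0, 10);
$sphinx->AddQuery($query, 'artists');
$sphinx->AddQuery($query, 'variations');
$result = $sphinx->RunQueries();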
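RunQueries() hands back one result set per AddQuery() call, and each result set carries its own error and warning entries, so one query can fail while the others succeed. Here is a rough sketch of walking that structure. The helper name summarizeResults is hypothetical and the $fake array is hand-written sample data, not real searchd output.

```php
<?php
// Summarize what RunQueries() returned. $results is expected to be the
// plain array of result sets; $names labels them in AddQuery order.
function summarizeResults($results, array $names) {
    $summary = array();
    if ($results === false) {
        return $summary;   // the whole batch failed, e.g. searchd unreachable
    }
    foreach ($results as $i => $res) {
        $name = isset($names[$i]) ? $names[$i] : "query$i";
        if (!empty($res['error'])) {
            $summary[$name] = 'error: ' . $res['error'];
        } else {
            $count = isset($res['matches']) ? count($res['matches']) : 0;
            $summary[$name] = $count . ' matches';
        }
    }
    return $summary;
}

// Hand-written stand-in for a RunQueries() return value.
$fake = array(
    array('error' => '', 'warning' => '', 'matches' => array(7 => array('weight' => 2))),
    array('error' => 'unknown local index', 'warning' => ''),
);
print_r(summarizeResults($fake, array('artists', 'variations')));
```

In real code you would pass the return value of $sphinx->RunQueries() and the index names in the order they were added.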

======================================================

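Searched data is never static; it keeps changing, so the search has to be dynamic too and must be able to find the newest information.

Approach:

1. Build a main index plus a delta index

2. Serve the indexes: searchd -c ../etc/csft_rtsinykk.conf --pidfile

3. Scheduled job (Linux crontab)

Update the delta index: indexer -c ../etc/csft_rtsinykk.conf --rotate delta

Merge the indexes: indexer --merge rtarticles delta --config /usr/local/coreseek/etc/csft_rtsinykk.conf --rotate

4. Scheduled job to rebuild the main index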
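If the scheduled job is itself a PHP script, the two indexer commands from step 3 can be built in one place. This is only a sketch under assumptions: deltaCommands is a hypothetical helper, the config path and index names are the ones used above, and the values are treated as trusted constants, so no shell escaping is done.

```php
<?php
// Build the two indexer command lines used by the scheduled job.
// Assumption: $conf, $main and $delta are trusted constants from your own setup.
function deltaCommands($conf, $main, $delta) {
    return array(
        'reindex' => "indexer -c $conf --rotate $delta",
        'merge'   => "indexer --merge $main $delta --config $conf --rotate",
    );
}

$cmds = deltaCommands('/usr/local/coreseek/etc/csft_rtsinykk.conf', 'rtarticles', 'delta');
foreach ($cmds as $step => $cmd) {
    echo "$step: $cmd\n";
    // A real cron script would run the command and stop on failure, e.g.:
    // passthru($cmd, $exit); if ($exit !== 0) { exit($exit); }
}
```

Running the delta reindex first and the merge second matters: the merge folds whatever is currently in the delta index into the main one.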

-------------------------------------------

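Today I tried building the incremental index without the delta approach: the incremental index is an independent index whose configuration is the same as the main index, except for sql_query, for example:

SELECT id,cate_id,title,contents FROM articles WHERE id>(SELECT max_doc_id FROM sphinx_counter WHERE counter_id=1)

With this setup, updating the index only takes

D:\coreseek\bin>indexer -c ../etc/csft_rtsinykk2.conf rtarticles_2_delta --rotate

and the search then queries both indexes:

$res = $cl->Query ( '人生不过是一场忍耐', "rtarticles_2 rtarticles_2_delta" );

This way no index merge is needed (merging indexes causes heavy I/O).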

-----------------------------------2011-8-24---------------------------------------

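Reference: http://www.coreseek.cn/docs/coreseek_3.2-sphinx_0.9.9.html#live-updates

3.11. Live index updates

A common situation: the whole dataset is too large to reindex often, but the number of new records each time is quite small. A typical example is a forum with 1,000,000 archived posts and only 1,000 new posts per day.

In that case, "near real-time" index updates can be implemented with the so-called "main+delta" scheme.

The basic idea is to set up two sources and two indexes: a main index for data that rarely or never changes, and a delta index for newly added documents. In the example above, the 1,000,000 archived posts go into the main index and the 1,000 new posts per day go into the delta index. The delta index can be rebuilt very frequently, so documents become searchable within minutes of appearing.

Deciding which index a given document belongs to can be automated. One option is a counter table that stores the document ID that splits the collection in two, updated every time the main index is rebuilt.

Example 4. Fully automated live updates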

# in MySQL
CREATE TABLE sph_counter
(
    counter_id INTEGER PRIMARY KEY NOT NULL,
    max_doc_id INTEGER NOT NULL
);

# in sphinx.conf
source main
{
    # ...
    sql_query_pre = SET NAMES utf8
    sql_query_pre = REPLACE INTO sph_counter SELECT 1, MAX(id) FROM documents
    sql_query = SELECT id, title, body FROM documents \
        WHERE id<=( SELECT max_doc_id FROM sph_counter WHERE counter_id=1 )
}

source delta : main
{
    sql_query_pre = SET NAMES utf8
    sql_query = SELECT id, title, body FROM documents \
        WHERE id>( SELECT max_doc_id FROM sph_counter WHERE counter_id=1 )
}

index main
{
    source = main
    path = /path/to/main
    # ... all the other settings
}

# note how all other settings are copied from main,
# but source and path are overridden (they MUST be)
index delta : main
{
    source = delta
    path = /path/to/delta
}
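How the counter table divides documents between the two indexes can be simulated without a database. The sketch below is a toy model of the two sql_query conditions in the example config; splitByCounter and rebuildMain are hypothetical names, and the ids are made up.

```php
<?php
// Split document ids the way the two sql_query statements do:
// main takes id <= max_doc_id, delta takes id > max_doc_id.
function splitByCounter(array $ids, $maxDocId) {
    $main = array();
    $delta = array();
    foreach ($ids as $id) {
        if ($id <= $maxDocId) {
            $main[] = $id;
        } else {
            $delta[] = $id;
        }
    }
    return array('main' => $main, 'delta' => $delta);
}

// Rebuilding main runs REPLACE INTO sph_counter SELECT 1, MAX(id) ...,
// which moves the split point to the newest id.
function rebuildMain(array $ids) {
    return count($ids) ? max($ids) : 0;
}

$ids = array(1, 2, 3, 4);
$maxDocId = rebuildMain($ids);              // counter becomes 4
$ids[] = 5;                                 // two new posts arrive
$ids[] = 6;
print_r(splitByCounter($ids, $maxDocId));   // main: 1..4, delta: 5, 6
```

Every id lands in exactly one of the two lists, and rebuilding main again would move the counter to 6 and leave the delta empty. That is why only the small delta index has to be rebuilt between full rebuilds.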

======================================

---------------------------------------------------------------
Installing the SphinxSE engine
SphinxSE Plugin Install (plugin mode):
---------------------------------------------------------------

    Unzip ha_sphinx.dll into the lib/plugin/ directory of MySQL 5.1.x, then log in to MySQL as root.

    Install (enable):
    mysql> INSTALL PLUGIN sphinx SONAME "ha_sphinx.dll" ;

    --------------------------------------------
    Uninstall (disable):
    mysql > UNINSTALL PLUGIN sphinx ;

    Check that the engine module loaded correctly
    mysql> show engines;

    In a table definition that ends with CONNECTION='sphinx://localhost:3312/cgfinal'; the table uses the SPHINXSE engine, its character set is utf8, and the connection string to Sphinx is 'sphinx://localhost:3312/cgfinal', where cgfinal is the index name
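The table definition that the CONNECTION clause belongs to is not shown in this post. As a sketch of what it could look like, the helper below builds one from PHP. The column layout (id, weight, query, plus an index on query) follows the layout I know from the Sphinx documentation for SphinxSE tables; the function name sphinxSeTableSql and the table name cgfinal_search are made up for illustration.

```php
<?php
// Build a CREATE TABLE statement for a SphinxSE search table.
// Assumption: the first three columns and INDEX(query) follow the layout
// described in the Sphinx documentation; $table is a hypothetical name.
function sphinxSeTableSql($table, $host, $port, $index) {
    return "CREATE TABLE $table (\n"
         . "    id     INTEGER UNSIGNED NOT NULL,\n"
         . "    weight INTEGER NOT NULL,\n"
         . "    query  VARCHAR(3072) NOT NULL,\n"
         . "    INDEX(query)\n"
         . ") ENGINE=SPHINX DEFAULT CHARSET=utf8 "
         . "CONNECTION='sphinx://$host:$port/$index';";
}

$sql = sphinxSeTableSql('cgfinal_search', 'localhost', 3312, 'cgfinal');
echo $sql, "\n";
// The table is then searched through its query column, for example:
//   SELECT * FROM cgfinal_search WHERE query='test;mode=any';
```

The table holds no data of its own; a SELECT against it is forwarded to searchd at the host, port and index named in the connection string.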

============================

Error when using Coreseek 4.1 beta on a 64-bit CentOS 5.6 machine

Reinstalling libiconv did not help

============================

/usr/local/sphinx/src/sphinx.cpp:15557: undefined reference to `libiconv_open'
libsphinx.a(sphinx.o)(.text+0x53a01):/usr/local/sphinx/src/sphinx.cpp:15575: undefined
reference to `libiconv'
libsphinx.a(sphinx.o)(.text+0x53a28):/usr/local/sphinx/src/sphinx.cpp:15581: undefined
reference to `libiconv_close'

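Official fix

## If you get errors like undefined reference to `libiconv', handle them as follows:
## Method 1 (on Linux):
## run: export LIBS="-liconv"
## then run configure again and build and install with make && make install

Community fix

Installing sphinx reported the error again
Fix
At first I thought the problem was libiconv itself and reinstalled it several times with the same result. In the end I found a way.
Edit:
./src/Makefile (this must be edited after running configure; the point is to tell the g++ compiler to link in the iconv library)
Change
LIBS = -lm -lz -lexpat -L/usr/local/lib -lrt -lpthread
to
LIBS = -lm -lz -lexpat -L/usr/local/lib -lrt -lpthread -liconv

That fixed it.

Note that it is -liconv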

sinykk_coreseek_conf.zip (4.1 KB)
2011-08-20 15:44