A Brief Guide to Installing and Using Sphinx / Coreseek


1. Introduction

    1.1 What is Sphinx

     Sphinx is a full-text search engine developed by Andrew Aksyonoff of Russia. It is designed to give other applications fast, low-footprint, highly relevant full-text search. Sphinx integrates easily with SQL databases and scripting languages. It ships with built-in support for MySQL and PostgreSQL data sources, and can also read specially formatted XML data from standard input. By modifying the source code, users can add new data source types (for example, native support for other DBMSs).

    1.2 Features of Sphinx

(1) Fast indexing (peak performance up to 10 MB/sec on modern CPUs);

(2) High-performance searching (average query response time under 0.1 s on 2-4 GB of text);

(3) Handles large data volumes (known installations index more than 100 GB of text, or around 100 million documents on a single-CPU system);

(4) Good relevance ranking, using a composite of phrase proximity and statistical (BM25) ranking;

(5) Distributed search support;

(6) Phrase search support;

(7) Document excerpt (snippet) generation;

(8) Can act as a MySQL storage engine to provide search;

(9) Multiple match modes: boolean, phrase, word similarity, and more;

(10) Multiple full-text fields per document (up to 32);

(11) Additional per-document attributes (for example group IDs and timestamps);

(12) Word-breaking (tokenization) support.

    1.3 Chinese word segmentation for Sphinx

       Full-text search for Chinese differs from English and other Latin-script languages: the latter can be tokenized on spaces and punctuation, while Chinese has to be segmented by meaning. Most databases, MySQL included, do not yet support Chinese full-text search, so a number of Chinese full-text plugins for MySQL have appeared; one of the better ones is hightman's Chinese word segmenter. To do Chinese full-text search with Sphinx you likewise need an add-on; the ones I know of are coreseek and sfc.

2. Installation and configuration example

    2.1 Installation on GNU/Linux/Unix

        2.1.1 Installing Sphinx

    Download the package to /usr/local/src.

     tar zxvf sphinx-2.0.8-release.tar.gz

     cd  sphinx-2.0.8-release

    ./configure --prefix=/usr/local/sphinx    # note: this Sphinx build supports MySQL by default

     make && make install    # warnings here can be ignored
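     A quick, optional sanity check: run the freshly installed binaries once; with no arguments (or with --help) they print their version banner and usage text.

     /usr/local/sphinx/bin/indexer          # prints version and usage when run without arguments
     /usr/local/sphinx/bin/searchd --help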

 

 

     Next, install coreseek (the tarball bundles the mmseg segmenter and csft, the coreseek-patched Sphinx):

       tar xzvf  coreseek-3.2.14.tar.gz

       cd coreseek-3.2.14

Install the mmseg Chinese word segmenter:
    cd mmseg-3.2.14
    ./bootstrap    # warnings can be ignored; errors must be fixed before continuing
    ./configure --prefix=/usr/local/mmseg3
    make && make install
    cd ..

## Install coreseek
   cd csft-3.2.14
   sh buildconf.sh    # warnings can be ignored; errors must be fixed before continuing
   ./configure --prefix=/usr/local/coreseek --without-unixodbc --with-mmseg --with-mmseg-includes=/usr/local/mmseg3/include/mmseg/ --with-mmseg-libs=/usr/local/mmseg3/lib/ --with-mysql    ## if configure complains about MySQL, see the MySQL data source installation notes
   make && make install

   cd ..

   

    Generate and install the segmentation dictionary:

            /usr/local/mmseg3/bin/mmseg -u /usr/local/mmseg3/etc/unigram.txt

      This generates a unigram.txt.uni file next to unigram.txt.

         cp /usr/local/mmseg3/etc/unigram.txt.uni  /usr/local/mmseg3/etc/uni.lib

            cd /usr/local/coreseek

 

[root@localhost coreseek]# mkdir dict    ### create a dictionary directory under the coreseek install
              cp   /usr/local/mmseg3/etc/unigram.txt.uni  dict/uni.lib    ### copy the generated dictionary into dict/ (renamed to uni.lib)
              sudo  vim dict/mmseg.ini    #### create the mmseg configuration file

mmseg.ini:
[mmseg]
merge_number_and_ascii=1;
number_and_ascii_joint=-;
compress_space=0;
seperate_number_ascii=1;
That completes the mmseg setup. The next step is the search engine configuration file itself (csft.conf for coreseek; the equivalent for the plain Sphinx install is sphinx.conf, covered next).
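Before wiring mmseg into the search engine, it is worth checking that the segmenter and dictionary work on their own. A minimal sketch, assuming the bundled mmseg binary's -d option points at the directory holding uni.lib (run mmseg without arguments to see the exact options of your build):

    echo "中文全文检索测试" > /tmp/seg_test.txt
    /usr/local/mmseg3/bin/mmseg -d /usr/local/mmseg3/etc /tmp/seg_test.txt    # prints the input split into segmented words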

 

 

   

2.1.2 Configuration file

      cd /usr/local/sphinx/etc           # go to the Sphinx configuration directory

      cp sphinx.conf.dist sphinx.conf    # create the Sphinx configuration file from the shipped template

      sudo gedit sphinx.conf             # edit sphinx.conf

 

2.1.3 Configuration changes

         (1) Database settings: fill in the connection parameters for your database.

          (2) Exceptions: comment out the line  exceptions = /data/exceptions.txt

          (3) Uncomment  sql_query_pre = SET NAMES utf8

          (4) Add the line  charset_dictpath = /usr/local/mmseg3/etc

          (5) Change the charset setting to  charset_type = zh_cn.utf-8
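
Pulled together, the relevant fragments of the edited sphinx.conf look roughly like this. This is only a sketch: the table and column names (documents, id, title, content) are placeholders, and the paths follow the layout used above.

source src1
{
    type            = mysql
    sql_host        = localhost
    sql_user        = root
    sql_pass        =
    sql_db          = test
    sql_query_pre   = SET NAMES utf8            # uncommented so data reaches the indexer as utf8
    sql_query       = SELECT id, title, content FROM documents
}

index test1
{
    source           = src1
    path             = /usr/local/sphinx/var/data/test1
    # exceptions     = /data/exceptions.txt     # left commented out
    charset_dictpath = /usr/local/mmseg3/etc    # added: mmseg dictionary directory (holds uni.lib)
    charset_type     = zh_cn.utf-8              # changed: enables Chinese segmentation (coreseek)
}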

 2.1.4 Commands

    # build all indexes

     /usr/local/coreseek/bin/indexer --all

    # search from the command line (here, for the character 贾)

     /usr/local/coreseek/bin/search -c /usr/local/coreseek/etc/csft.conf 贾
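
    Once the indexes build cleanly, start the search daemon so that applications can query it (port 9312 by default). A sketch using the same install layout as above:

     /usr/local/coreseek/bin/searchd -c /usr/local/coreseek/etc/csft.conf                    # start the daemon
     /usr/local/coreseek/bin/indexer -c /usr/local/coreseek/etc/csft.conf --all --rotate     # rebuild later without stopping searchd
     /usr/local/coreseek/bin/searchd -c /usr/local/coreseek/etc/csft.conf --stop             # stop the daemon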

3. Configuration example

    Configuring the MySQL data source

  vi /usr/local/coreseek/etc/csft.conf

     Below is an excerpt of my MySQL data source configuration:

source src1
{
	# data source type. mandatory, no default value
	# known types are mysql, pgsql, mssql, xmlpipe, xmlpipe2, odbc
	type					= mysql

	#####################################################################
	## SQL settings (for 'mysql' and 'pgsql' types)
	#####################################################################

	# some straightforward parameters for SQL source types
	sql_host				= localhost
	sql_user				= root
	sql_pass				= 
	sql_db					= test
	sql_port				= 3306	# optional, default is 3306

	# UNIX socket name
	# optional, default is empty (reuse client library defaults)
	# usually '/var/lib/mysql/mysql.sock' on Linux
	# usually '/tmp/mysql.sock' on FreeBSD
	#
	# sql_sock				= /tmp/mysql.sock


	# MySQL specific client connection flags
	# optional, default is 0
	#
	# mysql_connect_flags	= 32 # enable compression

	# MySQL specific SSL certificate settings
	# optional, defaults are empty
	#
	# mysql_ssl_cert		= /etc/ssl/client-cert.pem
	# mysql_ssl_key		= /etc/ssl/client-key.pem
	# mysql_ssl_ca		= /etc/ssl/cacert.pem

	# MS SQL specific Windows authentication mode flag
	# MUST be in sync with charset_type index-level setting
	# optional, default is 0
	#
	# mssql_winauth			= 1 # use currently logged on user credentials


	# MS SQL specific Unicode indexing flag
	# optional, default is 0 (request SBCS data)
	#
	# mssql_unicode			= 1 # request Unicode data from server


	# ODBC specific DSN (data source name)
	# mandatory for odbc source type, no default value
	#
	# odbc_dsn				= DBQ=C:\data;DefaultDir=C:\data;Driver={Microsoft Text Driver (*.txt; *.csv)};
	# sql_query				= SELECT id, data FROM documents.csv


	# pre-query, executed before the main fetch query
	# multi-value, optional, default is empty list of queries
	#
	  sql_query_pre			= SET NAMES utf8
	# sql_query_pre			= SET SESSION query_cache_type=OFF


	# main document fetch query
	# mandatory, integer document ID field MUST be the first selected column
	sql_query				= \
                SELECT goods_id, goods_name, goods_color \
                FROM goods_test

	# range query setup, query that must return min and max ID values
	# optional, default is empty
	#
	# sql_query will need to reference $start and $end boundaries
	# if using ranged query:
	#
	# sql_query				= \
	#	SELECT doc.id, doc.id AS group, doc.title, doc.data \
	#	FROM documents doc \
	#	WHERE id>=$start AND id<=$end
	#
	# sql_query_range		= SELECT MIN(id),MAX(id) FROM documents


	# range query step
	# optional, default is 1024
	#
	# sql_range_step		= 1000


	# unsigned integer attribute declaration
	# multi-value (an arbitrary number of attributes is allowed), optional
	# optional bit size can be specified, default is 32
	#
	# sql_attr_uint			= author_id
	# sql_attr_uint			= forum_id:9 # 9 bits for forum_id
	sql_attr_uint			= group_id

	# boolean attribute declaration
	# multi-value (an arbitrary number of attributes is allowed), optional
	# equivalent to sql_attr_uint with 1-bit size
	#
	# sql_attr_bool			= is_deleted


	# bigint attribute declaration
	# multi-value (an arbitrary number of attributes is allowed), optional
	# declares a signed (unlike uint!) 64-bit attribute
	#
	# sql_attr_bigint			= my_bigint_id


	# UNIX timestamp attribute declaration
	# multi-value (an arbitrary number of attributes is allowed), optional
	# similar to integer, but can also be used in date functions
	#
	# sql_attr_timestamp	= posted_ts
	# sql_attr_timestamp	= last_edited_ts
	sql_attr_timestamp		= date_added

	# string ordinal attribute declaration
	# multi-value (an arbitrary number of attributes is allowed), optional
	# sorts strings (bytewise), and stores their indexes in the sorted list
	# sorting by this attr is equivalent to sorting by the original strings
	#
	# sql_attr_str2ordinal	= author_name


	# floating point attribute declaration
	# multi-value (an arbitrary number of attributes is allowed), optional
	# values are stored in single precision, 32-bit IEEE 754 format
	#
	# sql_attr_float = lat_radians
	# sql_attr_float = long_radians


	# multi-valued attribute (MVA) attribute declaration
	# multi-value (an arbitrary number of attributes is allowed), optional
	# MVA values are variable length lists of unsigned 32-bit integers
	#
	# syntax is ATTR-TYPE ATTR-NAME 'from' SOURCE-TYPE [;QUERY] [;RANGE-QUERY]
	# ATTR-TYPE is 'uint' or 'timestamp'
	# SOURCE-TYPE is 'field', 'query', or 'ranged-query'
	# QUERY is SQL query used to fetch all ( docid, attrvalue ) pairs
	# RANGE-QUERY is SQL query used to fetch min and max ID values, similar to 'sql_query_range'
	#
	# sql_attr_multi	= uint tag from query; SELECT id, tag FROM tags
	# sql_attr_multi	= uint tag from ranged-query; \
	#	SELECT id, tag FROM tags WHERE id>=$start AND id<=$end; \
	#	SELECT MIN(id), MAX(id) FROM tags


	# post-query, executed on sql_query completion
	# optional, default is empty
	#
	# sql_query_post		=

	
	# post-index-query, executed on successful indexing completion
	# optional, default is empty
	# $maxid expands to max document ID actually fetched from DB
	#
	# sql_query_post_index = REPLACE INTO counters ( id, val ) \
	#	VALUES ( 'max_indexed_id', $maxid )


	# ranged query throttling, in milliseconds
	# optional, default is 0 which means no delay
	# enforces given delay before each query step
	sql_ranged_throttle	= 0

	# document info query, ONLY for CLI search (ie. testing and debugging)
	# optional, default is empty
	# must contain $id macro and must fetch the document by that id
	sql_query_info		= SELECT * FROM goods_test WHERE goods_id=$id

	# kill-list query, fetches the document IDs for kill-list
	# k-list will suppress matches from preceding indexes in the same query
	# optional, default is empty
	#
	# sql_query_killlist	= SELECT id FROM documents WHERE edited>=@last_reindex


	# columns to unpack on indexer side when indexing
	# multi-value, optional, default is empty list
	#
	# unpack_zlib = zlib_column
	# unpack_mysqlcompress = compressed_column
	# unpack_mysqlcompress = compressed_column_2


	# maximum unpacked length allowed in MySQL COMPRESS() unpacker
	# optional, default is 16M
	#
	# unpack_mysqlcompress_maxsize = 16M


	#####################################################################
	## xmlpipe settings
	#####################################################################

	# type				= xmlpipe

	# shell command to invoke xmlpipe stream producer
	# mandatory
	#
	# xmlpipe_command	= cat /usr/local/coreseek/var/test.xml

	#####################################################################
	## xmlpipe2 settings
	#####################################################################

	# type				= xmlpipe2
	# xmlpipe_command	= cat /usr/local/coreseek/var/test2.xml


	# xmlpipe2 field declaration
	# multi-value, optional, default is empty
	#
	# xmlpipe_field				= subject
	# xmlpipe_field				= content


	# xmlpipe2 attribute declaration
	# multi-value, optional, default is empty
	# all xmlpipe_attr_XXX options are fully similar to sql_attr_XXX
	#
	# xmlpipe_attr_timestamp	= published
	# xmlpipe_attr_uint			= author_id


	# perform UTF-8 validation, and filter out incorrect codes
	# avoids XML parser choking on non-UTF-8 documents
	# optional, default is 0
	#
	# xmlpipe_fixup_utf8		= 1
}


# inherited source example
#
# all the parameters are copied from the parent source,
# and may then be overridden in this source definition
source src1throttled : src1
{
	sql_ranged_throttle			= 100
}

#############################################################################
## index definition
#############################################################################

# local index example
#
# this is an index which is stored locally in the filesystem
#
# all indexing-time options (such as morphology and charsets)
# are configured per local index
index test1
{
	# document source(s) to index
	# multi-value, mandatory
	# document IDs must be globally unique across all sources
	source			= src1

	# index files path and file name, without extension
	# mandatory, path must be writable, extensions will be auto-appended
	path			= /usr/local/coreseek/var/data/test1

	# document attribute values (docinfo) storage mode
	# optional, default is 'extern'
	# known values are 'none', 'extern' and 'inline'
	docinfo			= extern
        
	# memory locking for cached data (.spa and .spi), to prevent swapping
	# optional, default is 0 (do not mlock)
	# requires searchd to be run from root
	mlock			= 0

	# a list of morphology preprocessors to apply
	# optional, default is empty
	#
	# builtin preprocessors are 'none', 'stem_en', 'stem_ru', 'stem_enru',
	# 'soundex', and 'metaphone'; additional preprocessors available from
	# libstemmer are 'libstemmer_XXX', where XXX is algorithm code
	# (see libstemmer_c/libstemmer/modules.txt)
	#
	# morphology 	= stem_en, stem_ru, soundex
	# morphology	= libstemmer_german
	# morphology	= libstemmer_sv
	morphology		= none

	# minimum word length at which to enable stemming
	# optional, default is 1 (stem everything)
	#
	# min_stemming_len	= 1


	# stopword files list (space separated)
	# optional, default is empty
	# contents are plain text, charset_table and stemming are both applied
	#
	#stopwords			= G:\data\stopwords.txt


	# wordforms file, in "mapfrom > mapto" plain text format
	# optional, default is empty
	#
	#wordforms			= G:\data\wordforms.txt


	# tokenizing exceptions file
	# optional, default is empty
	#
	# plain text, case sensitive, space insensitive in map-from part
	# one "Map Several Words => ToASingleOne" entry per line
	#
	#exceptions		= /data/exceptions.txt


	# minimum indexed word length
	# default is 1 (index everything)
	min_word_len		= 1

	# charset encoding type
	# optional, default is 'sbcs'
	# known types are 'sbcs' (Single Byte CharSet) and 'utf-8'
	# 'zh_cn.utf-8' is a coreseek extension that enables mmseg Chinese segmentation
	charset_type		= zh_cn.utf-8

	# mmseg dictionary directory created earlier (holds uni.lib and mmseg.ini)
	charset_dictpath	= /usr/local/coreseek/dict/

	# charset definition and case folding rules "table"
	# optional, default value depends on charset_type
	#
	# defaults are configured to include English and Russian characters only
	# you need to change the table to include additional ones
	# this behavior MAY change in future versions
	#
	# 'sbcs' default value is
	# charset_table		= 0..9, A..Z->a..z, _, a..z, U+A8->U+B8, U+B8, U+C0..U+DF->U+E0..U+FF, U+E0..U+FF
	#
	# 'utf-8' default value is
	# charset_table		= 0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F


	# ignored characters list
	# optional, default value is empty
	#
	# ignore_chars		= U+00AD


	# minimum word prefix length to index
	# optional, default is 0 (do not index prefixes)
	#
	# min_prefix_len	= 0


	# minimum word infix length to index
	# optional, default is 0 (do not index infixes)
	#
	# min_infix_len		= 0


	# list of fields to limit prefix/infix indexing to
	# optional, default value is empty (index all fields in prefix/infix mode)
	#
	# prefix_fields		= filename
	# infix_fields		= url, domain


	# enable star-syntax (wildcards) when searching prefix/infix indexes
	# known values are 0 and 1
	# optional, default is 0 (do not use wildcard syntax)
	#
	# enable_star		= 1


	# n-gram length to index, for CJK indexing
	# only supports 0 and 1 for now, other lengths to be implemented
	# optional, default is 0 (disable n-grams)
	#
	# ngram_len				= 1


	# n-gram characters list, for CJK indexing
	# optional, default is empty
	#
	# ngram_chars			= U+3000..U+2FA1F


	# phrase boundary characters list
	# optional, default is empty
	#
	# phrase_boundary		= ., ?, !, U+2026 # horizontal ellipsis


	# phrase boundary word position increment
	# optional, default is 0
	#
	# phrase_boundary_step	= 100


	# whether to strip HTML tags from incoming documents
	# known values are 0 (do not strip) and 1 (do strip)
	# optional, default is 0
	html_strip				= 0

	# what HTML attributes to index if stripping HTML
	# optional, default is empty (do not index anything)
	#
	# html_index_attrs		= img=alt,title; a=title;


	# what HTML elements contents to strip
	# optional, default is empty (do not strip element contents)
	#
	# html_remove_elements	= style, script


	# whether to preopen index data files on startup
	# optional, default is 0 (do not preopen), searchd-only
	#
	# preopen					= 1


	# whether to keep dictionary (.spi) on disk, or cache it in RAM
	# optional, default is 0 (cache in RAM), searchd-only
	#
	# ondisk_dict				= 1


	# whether to enable in-place inversion (2x less disk, 90-95% speed)
	# optional, default is 0 (use separate temporary files), indexer-only
	#
	# inplace_enable			= 1


	# in-place fine-tuning options
	# optional, defaults are listed below
	#
	# inplace_hit_gap			= 0		# preallocated hitlist gap size
	# inplace_docinfo_gap		= 0		# preallocated docinfo gap size
	# inplace_reloc_factor	= 0.1	# relocation buffer size within arena
	# inplace_write_factor	= 0.1	# write buffer size within arena


	# whether to index original keywords along with stemmed versions
	# enables "=exactform" operator to work
	# optional, default is 0
	#
	# index_exact_words		= 1


	# position increment on overshort (less than min_word_len) words
	# optional, allowed values are 0 and 1, default is 1
	#
	# overshort_step			= 1


	# position increment on stopword
	# optional, allowed values are 0 and 1, default is 1
	#
	# stopword_step			= 1
}


# inherited index example
#
# all the parameters are copied from the parent index,
# and may then be overridden in this index definition
index test1stemmed : test1
{
	path			= /usr/local/coreseek/var/data/test1stemmed
	morphology		= stem_en
}


# distributed index example
#
# this is a virtual index which can NOT be directly indexed,
# and only contains references to other local and/or remote indexes
index dist1
{
	# 'distributed' index type MUST be specified
	type				= distributed

	# local index to be searched
	# there can be many local indexes configured
	local				= test1
	local				= test1stemmed

	# remote agent
	# multiple remote agents may be specified
	# syntax for TCP connections is 'hostname:port:index1,[index2[,...]]'
	# syntax for local UNIX connections is '/path/to/socket:index1,[index2[,...]]'
	agent				= localhost:9313:remote1
	agent				= localhost:9314:remote2,remote3
	# agent				= /var/run/searchd.sock:remote4

	# blackhole remote agent, for debugging/testing
	# network errors and search results will be ignored
	#
	# agent_blackhole		= testbox:9312:testindex1,testindex2


	# remote agent connection timeout, milliseconds
	# optional, default is 1000 ms, ie. 1 sec
	agent_connect_timeout	= 1000

	# remote agent query timeout, milliseconds
	# optional, default is 3000 ms, ie. 3 sec
	agent_query_timeout		= 3000
}

#############################################################################
## indexer settings
#############################################################################

indexer
{
	# memory limit, in bytes, kilobytes (16384K) or megabytes (256M)
	# optional, default is 32M, max is 2047M, recommended is 256M to 1024M
	mem_limit			= 32M

	# maximum IO calls per second (for I/O throttling)
	# optional, default is 0 (unlimited)
	#
	# max_iops			= 40


	# maximum IO call size, bytes (for I/O throttling)
	# optional, default is 0 (unlimited)
	#
	# max_iosize		= 1048576


	# maximum xmlpipe2 field length, bytes
	# optional, default is 2M
	#
	# max_xmlpipe2_field	= 4M


	# write buffer size, bytes
	# several (currently up to 4) buffers will be allocated
	# write buffers are allocated in addition to mem_limit
	# optional, default is 1M
	#
	# write_buffer		= 1M
}

#############################################################################
## searchd settings
#############################################################################

searchd
{
	# hostname, port, or hostname:port, or /unix/socket/path to listen on
	# multi-value, multiple listen points are allowed
	# optional, default is 0.0.0.0:9312 (listen on all interfaces, port 9312)
	#
	# listen				= 127.0.0.1
	# listen				= 192.168.0.1:9312
	# listen				= 9312
	# listen				= /var/run/searchd.sock


	# log file, searchd run info is logged here
	# optional, default is 'searchd.log'
	log					= /usr/local/coreseek/var/log/searchd.log

	# query log file, all search queries are logged here
	# optional, default is empty (do not log queries)
	query_log			= /usr/local/coreseek/var/log/query.log

	# client read timeout, seconds
	# optional, default is 5
	read_timeout		= 5

	# request timeout, seconds
	# optional, default is 5 minutes
	client_timeout		= 300

	# maximum amount of children to fork (concurrent searches to run)
	# optional, default is 0 (unlimited)
	max_children		= 30

	# PID file, searchd process ID file name
	# mandatory
	pid_file			= /usr/local/coreseek/var/log/searchd.pid

	# max amount of matches the daemon ever keeps in RAM, per-index
	# WARNING, THERE'S ALSO PER-QUERY LIMIT, SEE SetLimits() API CALL
	# default is 1000 (just like Google)
	max_matches			= 1000

	# seamless rotate, prevents rotate stalls if precaching huge datasets
	# optional, default is 1
	seamless_rotate		= 1

	# whether to forcibly preopen all indexes on startup
	# optional, default is 0 (do not preopen)
	preopen_indexes		= 0

	# whether to unlink .old index copies on successful rotation.
	# optional, default is 1 (do unlink)
	unlink_old			= 1

	# attribute updates periodic flush timeout, seconds
	# updates will be automatically dumped to disk this frequently
	# optional, default is 0 (disable periodic flush)
	#
	# attr_flush_period	= 900


	# instance-wide ondisk_dict defaults (per-index value take precedence)
	# optional, default is 0 (precache all dictionaries in RAM)
	#
	# ondisk_dict_default	= 1


	# MVA updates pool size
	# shared between all instances of searchd, disables attr flushes!
	# optional, default size is 1M
	mva_updates_pool	= 1M

	# max allowed network packet size
	# limits both query packets from clients, and responses from agents
	# optional, default size is 8M
	max_packet_size		= 8M

	# crash log path
	# searchd will (try to) log crashed query to 'crash_log_path.PID' file
	# optional, default is empty (do not create crash logs)
	#
	# crash_log_path		= /usr/local/coreseek/var/log/crash


	# max allowed per-query filter count
	# optional, default is 256
	max_filters			= 256

	# max allowed per-filter values count
	# optional, default is 4096
	max_filter_values	= 4096


	# socket listen queue length
	# optional, default is 5
	#
	# listen_backlog		= 5


	# per-keyword read buffer size
	# optional, default is 256K
	#
	# read_buffer			= 256K


	# unhinted read size (currently used when reading hits)
	# optional, default is 32K
	#
	# read_unhinted		= 32K
}

 

    a. source defines the data source. Fill in the MySQL host, account, password and database as indicated; my MySQL instance runs on the same machine (installing MySQL itself is out of scope here).

  b. sql_query_pre is SQL executed before the main fetch query runs. (Note: coreseek only understands utf8-encoded data, so we use it to switch the connection to utf8.)

  c. sql_query is the query whose result set gets indexed. sql_attr_uint and sql_attr_timestamp declare attributes; attributes can be used for filtering and sorting in full-text searches (each declared attribute column must also be selected by sql_query).

  d. index and source should appear in pairs: an index defines how a source is indexed (you can also define several indexes, e.g. a main index plus a delta index; see the sketch after this list).

  e. searchd is the resident full-text search daemon; by default it listens on port 9312 on the local machine.

  f. charset_type and charset_dictpath configure Chinese word segmentation.
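
  As note d mentions, a common refinement is a main index plus a delta index for newly added rows. Below is a minimal sketch of that pattern; the helper table sph_counter(counter_id, max_doc_id) is my own assumption and is not part of the configuration above, while goods_test and goods_id are reused from the example source.

source src_main : src1
{
	sql_query_pre	= SET NAMES utf8
	sql_query_pre	= REPLACE INTO sph_counter SELECT 1, MAX(goods_id) FROM goods_test
	sql_query	= SELECT goods_id, goods_name, goods_color FROM goods_test \
		WHERE goods_id <= ( SELECT max_doc_id FROM sph_counter WHERE counter_id=1 )
}

source src_delta : src1
{
	# inherits SET NAMES utf8 from src1; only fetches rows added since the last main build
	sql_query	= SELECT goods_id, goods_name, goods_color FROM goods_test \
		WHERE goods_id > ( SELECT max_doc_id FROM sph_counter WHERE counter_id=1 )
}

index main
{
	source			= src_main
	path			= /usr/local/coreseek/var/data/main
	charset_dictpath	= /usr/local/coreseek/dict/
	charset_type		= zh_cn.utf-8
}

index delta : main
{
	source			= src_delta
	path			= /usr/local/coreseek/var/data/delta
}

  With this layout the delta index can be rebuilt frequently (indexer delta --rotate), while main is rebuilt, or merged via indexer --merge main delta --rotate, much less often.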

4. Usage

     4.1 Testing from the command line

          Build the indexes:  /usr/local/coreseek/bin/indexer --all

          Search Chinese text:  /usr/local/coreseek/bin/search  -c /usr/local/coreseek/etc/csft.conf '贾'



 

     

5. Appendix


Reposted from jiafuguang.iteye.com/blog/1912039