Sphinx Chinese Beginner's Guide

http://www.sphinxsearch.org/sphinx-tutorial

  • 1. Introduction to Sphinx
  • 1.1. What is Sphinx
  • 1.2. Features of Sphinx
  • 1.3. Sphinx Chinese word segmentation
  • 2. Installation configuration example
  • 2.1 Installation on GNU/Linux/unix systems
    • 2.1.1 sphinx installation
    • 2.1.2. sfc installation (see separate article)
    • 2.1.3. coreseek installation (see separate article)
  • 2.2 Installation under Windows
  • 3. Configuration example
  • 4. Application
  • 4.1 Test on CLI
  • 4.2 Using API calls
  • 5. Appendix

1. Introduction to Sphinx

1.1. What is Sphinx

Sphinx is a full-text search engine developed by Andrew Aksyonoff in Russia. It aims to give other applications fast, compact, and highly relevant full-text search. Sphinx integrates easily with SQL databases and scripting languages. It currently has built-in data source support for MySQL and PostgreSQL, and it can also read XML data in a specific format from standard input. By modifying the source code, users can add new data sources themselves (for example, native support for other types of DBMS).

1.2. Features of Sphinx

  • High-speed indexing (peak performance of up to 10 MB/sec on contemporary CPUs);
  • High-performance search (average response time under 0.1 seconds per query on 2-4 GB of text data);
  • Handles massive data (known to handle more than 100 GB of text and 100 M documents on a single-CPU system);
  • Good relevance ranking that combines phrase proximity with statistical (BM25) ranking;
  • Supports distributed search;
  • Supports phrase search;
  • Provides document excerpt (summary) generation;
  • Can be used as a storage engine for MySQL to provide search services;
  • Supports Boolean, phrase, word proximity, and other query modes;
  • Supports multiple full-text fields per document (up to 32);
  • Supports multiple additional attributes per document (e.g. grouping information, timestamps);
  • Supports word breaking (tokenization).

1.3. Sphinx Chinese word segmentation

Chinese full-text retrieval differs from retrieval in English and other Latin-script languages. The latter split words on special characters such as spaces, while Chinese has no such delimiters and must be segmented according to meaning. At present most databases, MySQL among them, do not support Chinese full-text search. As a result, several Chinese full-text search plug-ins for MySQL have appeared in China, of which hightman's Chinese word segmentation is among the better ones. Sphinx likewise needs a plug-in to perform full-text search in Chinese. The ones I know of are coreseek and sfc.

  • Coreseek is the most widely used Chinese full-text search solution for Sphinx. It provides LibMMSeg, a Chinese word segmentation package designed for Sphinx, and binary distributions for several systems, including rpm, deb, and Windows binary packages. In addition, coreseek has contributed the following to sphinx:
    • GBK-encoded data source support
    • A Chinese tokenizer using Chih-Hao Tsai's MMSEG algorithm
    • A Chinese manual (very convenient for people in China who are new to sphinx, especially those who are not comfortable with English)
  • sfc (sphinx-for-chinese) is another Chinese word segmentation plug-in, provided by the netizen happy brother. Its Chinese dictionary uses xdict. According to its introduction, tests on a Linux platform show that the current version indexes at roughly half the speed of indexing UTF-8 English text, that is, half the officially stated speed; the time is spent mainly on word segmentation. It now provides sphinx-for-chinese-0.9.10-dev-r2006.tar.gz, which is synchronized with the latest sphinx version (sphinx 0.9.10). This version adds sql_attr_string, which I have tested myself. Installation and configuration are very convenient. Happy brother has made another contribution to word segmentation: php-mmseg, a PHP extension library for Chinese word segmentation.

Here I would like to express my utmost respect to the two authors above.

  • Also, if you are not interested in Chinese word segmentation, or only need something like SQL's LIKE, such as: select * from product where prodName like '%mobile%', Sphinx will not let you down either. This is the official, simple way of handling Chinese: direct single-character indexing. The search speed is not bad either ^_^.

This article tests the three Chinese approaches above and documents the results, which may be the focus of this document.
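To make the difference concrete, here is a minimal Python sketch (not Sphinx code; the function names are my own) that contrasts splitting on whitespace with splitting Chinese text into single characters, which is roughly what direct single-character indexing amounts to. It only checks the main CJK Unified Ideographs block, which is a simplifying assumption.

```python
def is_cjk(ch):
    """Return True for characters in the main CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def whitespace_tokens(text):
    """Split on whitespace, which is enough for English-like text."""
    return text.split()


def unigram_tokens(text):
    """Emit each CJK character as its own token; keep other runs whole."""
    tokens, buf = [], []
    for ch in text:
        if is_cjk(ch) or ch.isspace():
            # A CJK character or a space ends the current non-CJK run.
            if buf:
                tokens.append("".join(buf))
                buf = []
            if is_cjk(ch):
                tokens.append(ch)
        else:
            buf.append(ch)
    if buf:
        tokens.append("".join(buf))
    return tokens


if __name__ == "__main__":
    print(whitespace_tokens("北京欢迎你"))      # one unsplit token
    print(unigram_tokens("北京欢迎你 Sphinx"))  # one token per Chinese character
```

A real segmenter such as the ones in coreseek or sfc would instead group characters into dictionary words; this sketch does not attempt that.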

2. Installation configuration example

2.1 Installation on GNU/Linux/unix systems

There are two ways to use Sphinx with mysql:
1. Use API calls, for example querying through the API functions or methods for PHP, Java, and so on. The advantages are that mysql does not need to be recompiled, the server processes are loosely coupled, and programs can call the API flexibly and conveniently; the disadvantage is that an existing search program needs some modification. Recommended for programmers.
2. Use the plug-in method (sphinxSE): compile sphinx as a mysql plug-in and retrieve with specific sql statements. Its advantages are that results are easy to combine on the sql side and data can be returned directly to the client without a second query (see note); only the corresponding sql in the program needs to be modified. However, this is inconvenient for programs built on a framework, for example one using an ORM. In addition, mysql must be recompiled, and mysql-5.1 or later is required for plug-in storage engine support. System administrators can use this method.
Note on the second query: as of the released version, sphinx-0.9.9, sphinx returns only the IDs of the matching records, not the sql data itself, so the database must be queried again using these IDs. The sphinx 0.9.10 version under development can store this text data. I have tried it, but the performance and storage results were not good; after all, it has not been officially released yet.
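The second query mentioned in the note can be sketched as follows. This is only an illustration in Python, with sqlite3 standing in for mysql so that it runs anywhere; the function name and the sample rows are hypothetical. The point is that the ranked ID list from the search drives one `IN (...)` query, and the rows are then put back into the search order.

```python
import sqlite3


def fetch_rows_in_rank_order(conn, ids):
    """Fetch rows for the IDs returned by the search, keeping the search order."""
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    sql = f"SELECT id, title FROM sphinx_article WHERE id IN ({placeholders})"
    by_id = {row[0]: row for row in conn.execute(sql, list(ids))}
    # IN (...) does not preserve order, so reorder by the ranked ID list.
    return [by_id[i] for i in ids if i in by_id]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # stand-in for the mysql database
    conn.execute("CREATE TABLE sphinx_article (id INTEGER PRIMARY KEY, title TEXT)")
    conn.executemany(
        "INSERT INTO sphinx_article VALUES (?, ?)",
        [(17, "c"), (76, "a"), (85, "b")],  # made-up rows for the demo
    )
    ranked_ids = [76, 85, 17]  # hypothetical IDs, in the order a search might rank them
    print(fetch_rows_in_rank_order(conn, ranked_ids))
```

IDs that no longer exist in the table are silently skipped, which is one reasonable choice when the index is slightly older than the data.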

This article uses the first method.

To install on a *nix system, the following software is required first.

Software Environment:

  • Operating System: Centos-5.2
  • Database: mysql-5.0.77-3.el5 mysql-devel (if you want to use sphinxSE plug-in storage, please use mysql-5.1 or above)
  • Compile software: gcc gcc-c++ autoconf automake
  • Sphinx: Sphinx-0.9.9 (latest stable version)

Install:

  • [root@localhost ~]# yum install -y mysql mysql-devel
  • [root@localhost ~]# yum install -y automake autoconf
  • [root@localhost ~]# cd /usr/local/src/
  • [root@localhost src]# wget http://www.sphinxsearch.com/downloads/sphinx-0.9.9.tar.gz
  • [root@localhost src]# tar zxvf sphinx-0.9.9.tar.gz
  • [root@localhost src]# cd sphinx-0.9.9
  • [root@localhost sphinx-0.9.9]# ./configure --prefix=/usr/local/sphinx #Note: here sphinx already supports mysql by default
  • [root@localhost sphinx-0.9.9]# make && make install # The "warning" can be ignored

After the installation completes, check that the three directories bin, etc, and var exist under /usr/local/sphinx. If they do, the installation succeeded!
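If you prefer to script this check, a small Python sketch could look like the following. The function name is my own, and the prefix is the one passed to ./configure above.

```python
import os


def missing_install_dirs(prefix, expected=("bin", "etc", "var")):
    """Return the expected subdirectories that are absent under prefix."""
    return [d for d in expected if not os.path.isdir(os.path.join(prefix, d))]


if __name__ == "__main__":
    missing = missing_install_dirs("/usr/local/sphinx")
    if missing:
        print("missing directories:", ", ".join(missing))
    else:
        print("installation looks complete")
```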

2.1.2. sfc installation (see separate article)
2.1.3. coreseek installation (see separate article)

3. Configuration example

3.1. Data source

Here we use a mysql data source. Details are as follows:

MySQL server: 192.168.1.10

MySQL db: test

MySQL table: test.sphinx_article

mysql> desc sphinx_article;
+-----------+---------------------+------+-----+---------+----------------+
| Field     | Type                | Null | Key | Default | Extra          |
+-----------+---------------------+------+-----+---------+----------------+
| id        | int(11) unsigned    | NO   | PRI | NULL    | auto_increment |
| title     | varchar(255)        | NO   |     |         |                |
| cat_id    | tinyint(3) unsigned | NO   | MUL |         |                |
| member_id | int(11) unsigned    | NO   | MUL |         |                |
| content   | longtext            | NO   |     |         |                |
| created   | int(11)             | NO   | MUL |         |                |
+-----------+---------------------+------+-----+---------+----------------+
6 rows in set (0.00 sec)

3.2. Configuration file

  • [root@localhost ~]# cd /usr/local/sphinx/etc #Enter the sphinx configuration file directory
  • [root@localhost etc]# cp sphinx.conf.dist sphinx.conf #Create a new Sphinx configuration file
  • [root@localhost etc]# vim sphinx.conf #Edit sphinx.conf

The configuration file used in this example:

##### Index source #####
source article_src
{
type = mysql # data source type
sql_host = 192.168.1.10 # mysql host
sql_user = root # mysql username
sql_pass = pwd # mysql password
sql_db = test # mysql database name
sql_port = 3306 # mysql port
sql_query_pre = SET NAMES UTF8 # connection encoding for mysql; pay special attention to this: many people cannot search in Chinese because the database encoding is GBK or another non-UTF8 encoding
sql_query = SELECT id, title, cat_id, member_id, content, created FROM sphinx_article # sql that fetches the data

##### The following attributes are used for filtering and conditional queries #####

sql_attr_uint = cat_id # unsigned integer attribute
sql_attr_uint = member_id
sql_attr_timestamp = created # UNIX timestamp attribute

sql_query_info = select * from sphinx_article where id=$id # used by the command-line (CLI) test tool

}

##### Index #####

index article
{
source = article_src # index source
path = /usr/local/sphinx/var/data/article # index file storage path and index file name
docinfo = extern # document information (docinfo) storage mode
mlock = 0 # memory locking of cached data
morphology = none # morphology (has no effect on Chinese)
min_word_len = 1 # minimum indexed word length
charset_type = utf-8 # data encoding

##### Character table. Note: with this approach sphinx indexes Chinese character by character
##### (single-character indexing). For real Chinese word segmentation, use a segmentation plug-in such as coreseek or sfc.

charset_table = U+FF10..U+FF19->0..9, 0..9, U+FF41..U+FF5A->a..z, U+FF21..U+FF3A->a..z,\
A..Z->a..z, a..z, U+0149, U+017F, U+0138, U+00DF, U+00FF, U+00C0..U+00D6->U+00E0..U+00F6,\
U+00E0..U+00F6, U+00D8..U+00DE->U+00F8..U+00FE, U+00F8..U+00FE, U+0100->U+0101, U+0101,\
U+0102->U+0103, U+0103, U+0104->U+0105, U+0105, U+0106->U+0107, U+0107, U+0108->U+0109,\
U+0109, U+010A->U+010B, U+010B, U+010C->U+010D, U+010D, U+010E->U+010F, U+010F,\
U+0110->U+0111, U+0111, U+0112->U+0113, U+0113, U+0114->U+0115, U+0115, \
U+0116->U+0117,U+0117, U+0118->U+0119, U+0119, U+011A->U+011B, U+011B, U+011C->U+011D,\
U+011D,U+011E->U+011F, U+011F, U+0130->U+0131, U+0131, U+0132->U+0133, U+0133, \
U+0134->U+0135,U+0135, U+0136->U+0137, U+0137, U+0139->U+013A, U+013A, U+013B->U+013C, \
U+013C,U+013D->U+013E, U+013E, U+013F->U+0140, U+0140, U+0141->U+0142, U+0142, \
U+0143->U+0144,U+0144, U+0145->U+0146, U+0146, U+0147->U+0148, U+0148, U+014A->U+014B, \
U+014B,U+014C->U+014D, U+014D, U+014E->U+014F, U+014F, U+0150->U+0151, U+0151, \
U+0152->U+0153,U+0153, U+0154->U+0155, U+0155, U+0156->U+0157, U+0157, U+0158->U+0159,\
U+0159,U+015A->U+015B, U+015B, U+015C->U+015D, U+015D, U+015E->U+015F, U+015F, \
U+0160->U+0161,U+0161, U+0162->U+0163, U+0163, U+0164->U+0165, U+0165, U+0166->U+0167, \
U+0167,U+0168->U+0169, U+0169, U+016A->U+016B, U+016B, U+016C->U+016D, U+016D, \
U+016E->U+016F, U+016F, U+0170->U+0171, U+0171, U+0172->U+0173, U+0173, U+0174->U+0175, \
U+0175, U+0176->U+0177, U+0177, U+0178->U+00FF, U+00FF, U+0179->U+017A, U+017A, \
U+017B->U+017C, U+017C, U+017D->U+017E, U+017E, U+0410..U+042F->U+0430..U+044F, \
U+0430..U+044F, U+05D0..U+05EA, U+0531..U+0556->U+0561..U+0586, U+0561..U+0587, \
U+0621..U+063A, U+01B9, U+01BF, U+0640..U+064A, U+0660..U+0669, U+066E, U+066F, \
U+0671..U+06D3, U+06F0..U+06FF, U+0904..U+0939, U+0958..U+095F, U+0960..U+0963, \
U+0966..U+096F, U+097B..U+097F, U+0985..U+09B9, U+09CE, U+09DC..U+09E3, U+09E6..U+09EF, \
U+0A05..U+0A39, U+0A59..U+0A5E, U+0A66..U+0A6F, U+0A85..U+0AB9, U+0AE0..U+0AE3, \
U+0AE6..U+0AEF, U+0B05..U+0B39,U+0B5C..U+0B61, U+0B66..U+0B6F, U+0B71, U+0B85..U+0BB9, \
U+0BE6..U+0BF2, U+0C05..U+0C39,U+0C66..U+0C6F, U+0C85..U+0CB9, U+0CDE..U+0CE3, \
U+0CE6..U+0CEF, U+0D05..U+0D39, U+0D60,U+0D61, U+0D66..U+0D6F, U+0D85..U+0DC6, \
U+1900..U+1938, U+1946..U+194F, U+A800..U+A805,U+A807..U+A822, U+0386->U+03B1, \
U+03AC->U+03B1, U+0388->U+03B5, U+03AD->U+03B5,U+0389->U+03B7, U+03AE->U+03B7, \
U+038A->U+03B9, U+0390->U+03B9, U+03AA->U+03B9,U+03AF->U+03B9, U+03CA->U+03B9, \
U+038C->U+03BF, U+03CC->U+03BF, U+038E->U+03C5,U+03AB->U+03C5, U+03B0->U+03C5, \
U+03CB->U+03C5, U+03CD->U+03C5, U+038F->U+03C9,U+03CE->U+03C9, U+03C2->U+03C3, \
U+0391..U+03A1->U+03B1..U+03C1,U+03A3..U+03A9->U+03C3..U+03C9, U+03B1..U+03C1, \
U+03C3..U+03C9, U+0E01..U+0E2E, U+0E30..U+0E3A, U+0E40..U+0E45, U+0E47, U+0E50..U+0E59, \
U+A000..U+A48F, U+4E00..U+9FBF, U+3400..U+4DBF, U+20000..U+2A6DF, U+F900..U+FAFF, \
U+2F800..U+2FA1F, U+2E80..U+2EFF, U+2F00..U+2FDF, U+3100..U+312F, U+31A0..U+31BF, \
U+3040..U+309F, U+30A0..U+30FF, U+31F0..U+31FF, U+AC00..U+D7AF, U+1100..U+11FF, \
U+3130..U+318F, U+A000..U+A48F, U+A490..U+A4CF
min_prefix_len = 0 # minimum prefix length
min_infix_len = 1 # minimum infix length
ngram_len = 1 # n-gram length

# With this option, every Chinese and English word will be segmented, and the speed will be slower
#ngram_chars = U+4E00..U+9FBF, U+3400..U+4DBF, U+20000..U+2A6DF, U+F900..U+FAFF,\
#U+2F800..U+2FA1F, U+2E80..U+2EFF, U+2F00..U+2FDF, U+3100..U+312F, U+31A0..U+31BF,\
#U+3040..U+309F, U+30A0..U+30FF, U+31F0..U+31FF, U+AC00..U+D7AF, U+1100..U+11FF,\
#U+3130..U+318F, U+A000..U+A48F, U+A490..U+A4CF

}

##### indexer configuration #####
indexer
{
mem_limit = 256M # memory limit
}

##### sphinx service process #####
searchd
{
#listen = 9312 # listening port; starting with this version the default is 9312, which is officially assigned by IANA; previous versions defaulted to 3312

log = /usr/local/sphinx/var/log/searchd.log # service process log; when sphinx hits an error, useful information can usually be found here, and index rotation problems can generally be diagnosed from it
query_log = /usr/local/sphinx/var/log/query.log # client query log; author's note: to gather keyword statistics, analyze this log file
read_timeout = 5 # request timeout, in seconds
max_children = 30 # maximum number of searchd processes running at the same time
pid_file = /usr/local/sphinx/var/log/searchd.pid # process ID file
max_matches = 1000 # maximum number of matches returned for a query
seamless_rotate = 1 # whether to use seamless rotation; usually needed for incremental indexing
}
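The charset_table entries in the index section can look cryptic, so here is a rough Python sketch of what they mean. It is not Sphinx's parser, only an illustration with names of my own: an entry such as U+FF21..U+FF3A->a..z accepts a range of characters and folds it onto another range, a plain entry such as a..z accepts characters unchanged, and characters that are not listed act as separators. Only the simple range and mapping forms used in the configuration above are handled.

```python
def parse_codepoint(token):
    """Convert 'U+FF21' or a single literal character such as 'a' to an int."""
    token = token.strip()
    if token.upper().startswith("U+"):
        return int(token[2:], 16)
    return ord(token)


def parse_range(spec):
    """Convert 'X..Y' or a single 'X' to an inclusive (start, end) pair."""
    start, _, end = spec.partition("..")
    lo = parse_codepoint(start)
    hi = parse_codepoint(end) if end else lo
    return lo, hi


def build_fold_table(charset_table):
    """Map every accepted code point to the code point it is folded to."""
    table = {}
    for entry in charset_table.split(","):
        entry = entry.strip()
        if not entry:
            continue
        src, _, dst = entry.partition("->")
        s_lo, s_hi = parse_range(src)
        d_lo = parse_range(dst)[0] if dst else s_lo
        for offset in range(s_hi - s_lo + 1):
            table[s_lo + offset] = d_lo + offset
    return table


def fold(text, table):
    """Fold accepted characters; anything outside the table becomes a space."""
    return "".join(chr(table[ord(c)]) if ord(c) in table else " " for c in text)


if __name__ == "__main__":
    # A shortened table using entries of the same shape as the config above.
    sample = "0..9, A..Z->a..z, a..z, U+FF21..U+FF3A->a..z, U+4E00..U+9FBF"
    print(fold("ＡBc北京!", build_fold_table(sample)))
```

With the shortened table, the full-width "Ａ" and the upper-case "B" both fold to lower case, the Chinese characters pass through, and "!" turns into a separator.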

3.3. Create an index file

[root@localhost sphinx]# bin/indexer -c etc/sphinx.conf article ### Commands to build index files
Sphinx 0.9.9-release (r2117)
Copyright (c) 2001-2009, Andrew Aksyonoff

using config file 'etc/sphinx.conf'...
indexing index 'article'...
collected 1000 docs, 0.2 MB
sorted 0.4 Mhits, 99.6% done
total 1000 docs, 210559 bytes
total 3.585 sec, 58723 bytes/sec, 278.89 docs/sec
total 2 reads, 0.031 sec, 1428.8 kb/call avg, 15.6 msec/call avg
total 11 writes, 0.032 sec, 671.6 kb/call avg, 2.9 msec/call avg
[root@localhost sphinx]#

The output above means the index was built successfully. If it fails, modify the configuration file according to the error message, or ask a question on this site and I will reply as soon as I see it.
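When indexing runs from a script, the summary line can be checked automatically. The following Python sketch assumes the output format shown above for 0.9.9 (a line of the form "total N docs, M bytes"); other versions may print something different, and the function name is my own.

```python
import re

# Matches the summary line as it appears in the 0.9.9 output shown above.
SUMMARY = re.compile(r"^total (\d+) docs, (\d+) bytes$", re.MULTILINE)


def parse_indexer_summary(output):
    """Return (docs, bytes) from indexer output, or None if no summary line is found."""
    match = SUMMARY.search(output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    sample = (
        "collected 1000 docs, 0.2 MB\n"
        "total 1000 docs, 210559 bytes\n"
        "total 3.585 sec, 58723 bytes/sec, 278.89 docs/sec\n"
    )
    print(parse_indexer_summary(sample))
```

A result of None, or a document count of 0, is a hint to look at the error message and the configuration file.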

4. Application

4.1 Test on CLI

In the previous step we built the index; now we test it. There are two ways to test: from the CLI and through API calls.

The CLI test uses the search command that ships with sphinx: search

###### Search the article index for the keyword 北京 (Beijing) ######
[root@localhost sphinx]# bin/search -c etc/sphinx.conf 北京
Sphinx 0.9.9-release (r2117)
Copyright (c) 2001-2009, Andrew Aksyonoff

using config file 'etc/sphinx.conf'...
index 'article': query '北京 ': returned 995 matches of 995 total in 0.008 sec

displaying matches:
1. document=76, weight=2, cat_id=1, member_id=2, created=Sat Jan 23 19:05:09 2010
id=76
title=??????????
cat_id=1
member_id=2
content=????????????????????????????????
created=1264244709
2. document=85, weight=2, cat_id=1, member_id=2, created=Sat Jan 23 19:05:09 2010
id=85
title=????????????
cat_id=1
member_id=2
content=??▒????????????▒????????▒????▒?????????????????????????????
created=1264244709
..... omitted here .....
20. document=17, weight=1, cat_id=1, member_id=2, created=Sat Jan 23 19:05:09 2010
id=17
title=????????????
cat_id=1
member_id=2
content=??????????????????????????????????????????????????????????
created=1264244709

words:
1. '北京': 995 documents, 999 hits

At this point we can see that all the records containing 北京 (Beijing) have been retrieved.

Note: I am using the putty client with the client encoding set to utf-8, which is a prerequisite for this test.
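Why the client encoding matters can be shown with a few lines of Python. This is only an illustration with a function name of my own: the index was built from UTF-8 data, so the keyword typed in the terminal must arrive as the same UTF-8 bytes. A terminal set to GBK sends different bytes for the same characters, and those bytes cannot match the index.

```python
def query_bytes(keyword, terminal_encoding):
    """Bytes a terminal using the given encoding would send for the keyword."""
    return keyword.encode(terminal_encoding)


if __name__ == "__main__":
    utf8 = query_bytes("北京", "utf-8")
    gbk = query_bytes("北京", "gbk")
    print(utf8.hex(), gbk.hex())
    print("same bytes:", utf8 == gbk)
```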

4.2 Using API calls

In this example I use the PHP API to test. Before testing, start the sphinx service process and open port 9312 in the CentOS firewall.

[root@localhost sphinx]# bin/searchd -c etc/sphinx.conf & ### make sphinx run in the background
[1] 5759
[root@localhost sphinx]# Sphinx 0.9.9-release (r2117)
Copyright (c) 2001-2009, Andrew Aksyonoff

using config file 'etc/sphinx.conf'...
listening on all interfaces, port=9312

[1]+ Done bin/searchd -c etc/sphinx.conf

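PHP test code:

<?php
header('Content-Type: text/html; charset=utf-8');
?>
<form name="form1" method="get" action="">
<label>
<input style="width:400px;" type="text" name="keyword">
</label>
<label>
<input type="submit" name="Submit" value="Sphinx search">
</label>
</form>

<?php
// Dump a variable inside <pre> tags so the structure stays readable in a browser.
function dump($var)
{
    echo '<pre>';
    var_dump($var);
    echo '</pre>';
}

// Read the keyword safely; it is absent on the first page load.
$keyword = isset($_GET['keyword']) ? trim($_GET['keyword']) : '';
if ($keyword == '') {
    die('Please enter the keyword');
}
// Escape the keyword before echoing it back into the page.
echo 'The keyword is: ' . htmlspecialchars($keyword, ENT_QUOTES, 'UTF-8');

require "sphinxapi.php";
$cl = new SphinxClient();
$cl->SetServer('192.168.1.150', 9312); // Pay attention to the host here
#$cl->SetMatchMode(SPH_MATCH_EXTENDED); // Use the extended (multi-field) match mode
$index = "article";
$res = $cl->Query($keyword, $index);
if ($res === false) {
    // Query() returns false on failure; show the reason.
    die('Query failed: ' . $cl->GetLastError());
}
dump($res);
?>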

The dump of the result after searching for 北京 (Beijing) is as follows:

array(10) {
  ["error"]=>
  string(0) ""
  ["warning"]=>
  string(0) ""
  ["status"]=>
  int(0)
  ["fields"]=>
  array(2) {
    [0]=>
    string(5) "title"
    [1]=>
    string(7) "content"
  }
  ["attrs"]=>
  array(3) {
    ["cat_id"]=>
    int(1)
    ["member_id"]=>
    int(1)
    ["created"]=>
    int(2)
  }
  ["matches"]=>
  array(20) {
    [76]=>
    array(2) {
      ["weight"]=>
      string(1) "2"
      ["attrs"]=>
      array(3) {
        ["cat_id"]=>
        string(1) "1"
        ["member_id"]=>
        string(1) "2"
        ["created"]=>
        string(10) "1264244709"
      }
    }
  .....Omitted here.....
    [17]=>
    array(2) {
      ["weight"]=>
      string(1) "1"
      ["attrs"]=>
      array(3) {
        ["cat_id"]=>
        string(1) "1"
        ["member_id"]=>
        string(1) "2"
        ["created"]=>
        string(10) "1264244709"
      }
    }
  }
  ["total"]=>
  string(3) "995"
  ["total_found"]=>
  string(3) "995"
  ["time"]=>
  string(5) "0.008"
  ["words"]=>
  array(1) {
    ["北京"]=>
    array(2) {
      ["docs"]=>
      string(3) "995"
      ["hits"]=>
      string(3) "999"
    }
  }
}

At this point, PHP can query sphinx and get the results!
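To show how a result of this shape might be consumed, here is a small Python sketch. It assumes a dictionary with the same layout as the dump above (a "matches" mapping from document ID to weight and attrs, with values as strings); the function name and the cut-down sample are my own.

```python
def ranked_ids(result):
    """Return document IDs sorted by descending weight; ties keep their input order."""
    matches = result.get("matches") or {}
    ordered = sorted(matches.items(), key=lambda item: -int(item[1]["weight"]))
    return [doc_id for doc_id, _ in ordered]


if __name__ == "__main__":
    # A cut-down result with the same layout as the dump above.
    result = {
        "total_found": "995",
        "matches": {
            17: {"weight": "1", "attrs": {"cat_id": "1", "member_id": "2"}},
            76: {"weight": "2", "attrs": {"cat_id": "1", "member_id": "2"}},
            85: {"weight": "2", "attrs": {"cat_id": "1", "member_id": "2"}},
        },
    }
    print(ranked_ids(result))
```

The resulting ID list is what you would hand to the database for the second query.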

5. Appendix

This is the introductory manual I wrote for sphinx, and it also serves as my own archive. To write this article, I re-installed Sphinx, created a new mysql table with 1000 records, and repeated every step by hand. If you find any errors or have questions, please report them at the address below. Thank you!
Welcome to the sphinx Chinese website (www.sphinxsearch.org) to discuss sphinx-related issues and exchange ideas with me!

You are also welcome to read my other articles about Sphinx: Sphinx Chinese Word Segmentation, Sphinx Advanced Application, Sphinx FAQ, Sphinx Service Architecture
