Introduction to Sphinx

Sphinx is an SQL-based full-text search engine. It can be combined with MySQL or PostgreSQL to provide far more capable full-text search than the database itself offers, making it easier for applications to implement professional full-text search. Sphinx provides dedicated search API clients for several scripting languages, such as PHP, Python, Perl, and Ruby, and also offers a storage engine plug-in (SphinxSE) for MySQL.
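
To give a feel for what the scripting-language API looks like, here is a minimal Python sketch using the sphinxapi.py client bundled with the Sphinx distribution; the server address and the index name idx_articles are placeholders.

import sphinxapi

client = sphinxapi.SphinxClient()
client.SetServer('localhost', 9312)                  # searchd host and API port (9312 is the usual default)
client.SetMatchMode(sphinxapi.SPH_MATCH_EXTENDED2)   # extended query syntax
client.SetLimits(0, 20)                              # offset 0, return at most 20 matches

result = client.Query('full-text search', 'idx_articles')   # hypothetical index name
if result is None:
    print('query failed:', client.GetLastError())
else:
    print('total found:', result['total_found'])
    for match in result['matches']:
        print(match['id'], match['weight'])

The query returns matching document IDs and weights; the application then fetches the full rows from the database by ID.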

A single Sphinx index can contain up to about 100 million records, and queries against 10 million records typically return on the order of 0.x seconds. Index creation is also fast: an index of 1 million records takes only 3 to 4 minutes to build, an index of 10 million records can be completed in about 50 minutes, and an incremental index containing only the latest 100,000 records can be rebuilt in just a few tens of seconds.

Key features of Sphinx include:

- High-speed indexing (nearly 10 MB/sec on modern CPUs)
- High-speed searching (average query time under 0.1 seconds on 2-4 GB of text)
- High scalability (up to 100 GB of text and 100 million documents on a single CPU)
- Good relevance ranking and support for distributed search
- Document summary (excerpt) generation
- A plug-in storage engine for searching from inside MySQL
- Boolean, phrase, and synonym queries (illustrated in the sketch below)
- Multiple full-text fields per document (up to 32 by default)
- Multiple attributes per document
- Word segmentation
- Single-byte encodings and UTF-8
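
The following Python sketch illustrates the phrase/boolean query syntax and the document summary (excerpt) generation from the list above, again using the bundled sphinxapi.py client; the index name idx_articles and the sample text are placeholders.

import sphinxapi

client = sphinxapi.SphinxClient()
client.SetServer('localhost', 9312)
client.SetMatchMode(sphinxapi.SPH_MATCH_EXTENDED2)

# Phrase and boolean operators in the extended query syntax:
# an exact phrase in the title field, excluding documents that mention "draft".
result = client.Query('@title "full-text search" -draft', 'idx_articles')

# Document summary (excerpt) generation: highlight the keywords inside raw text.
docs = ['Sphinx is a full-text search engine that integrates with MySQL ...']
snippets = client.BuildExcerpts(docs, 'idx_articles', 'full-text search',
                                {'before_match': '<b>', 'after_match': '</b>'})
print(snippets)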

 

 

How Sphinx Works

Sphinx's overall workflow is as follows: the indexer program extracts data from the database, tokenizes (segments) it, and builds one or more indexes from the resulting terms. The searchd daemon then serves those indexes, and clients search them through API calls.
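
A rough sketch of that pipeline from the application side, assuming a sphinx.conf that already defines a data source and an index named idx_articles, with the indexer binary and a running searchd available; all names here are placeholders.

import subprocess
import sphinxapi

# Step 1: have indexer pull data from the database and (re)build the index.
# --rotate tells a running searchd to pick up the new index files without a restart.
subprocess.run(['indexer', '--rotate', 'idx_articles'], check=True)

# Step 2: query searchd through the API once the new index has been rotated in.
client = sphinxapi.SphinxClient()
client.SetServer('localhost', 9312)
result = client.Query('some keyword', 'idx_articles')
print(result['total_found'] if result else client.GetLastError())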

 

 

Why use Sphinx

A typical requirement: users should be able to search articles by both title and body text, but the article titles and article bodies are stored in different databases, which moreover sit in different data centers.

 

Candidate solutions

A. Implement cross-database LIKE query directly in the database

Advantages: simple to implement. Disadvantages: low efficiency, and cross-database queries cause considerable network overhead.

 

B. Use Sphinx, combined with Chinese word segmentation, as the search engine

Advantages: high efficiency and good scalability. Disadvantages: Sphinx is not responsible for data storage; the data itself still lives in the database.

 

With the Sphinx search engine, the data is indexed once, loaded in a single pass, and then kept in memory. When users search, they only need to retrieve results from the Sphinx server. Moreover, Sphinx avoids the disk I/O that accompanies MySQL queries, so its performance is better.
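
For the title-and-body scenario above, the application side might look like the following sketch, assuming both sources have been indexed into a single hypothetical index idx_articles with title and content as full-text fields.

import sphinxapi

client = sphinxapi.SphinxClient()
client.SetServer('sphinx-host', 9312)                # placeholder host
client.SetMatchMode(sphinxapi.SPH_MATCH_EXTENDED2)

# One request replaces the cross-database LIKE query: search both
# full-text fields of the combined index and get matching document IDs back.
result = client.Query('@(title,content) distributed search', 'idx_articles')
ids = [m['id'] for m in (result['matches'] if result else [])]
# The IDs are then used to fetch the full rows from the databases that hold the data.
print(ids)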

Other typical usage scenarios

1. Fast, efficient, scalable and core full-text search

When the data volume is large, it is faster than searching directly in MyISAM or InnoDB. It can build indexes over mixed data from multiple source tables, not just the fields of a single table, and it can consolidate search results from multiple indexes. Full-text searches can also be constrained by additional conditions on attributes.
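
A sketch of consolidating results from several indexes while filtering on an attribute, using the bundled Python client; the index names and the category_id attribute are hypothetical.

import sphinxapi

client = sphinxapi.SphinxClient()
client.SetServer('localhost', 9312)

# Restrict the full-text search with an attribute condition ...
client.SetFilter('category_id', [3, 7])              # keep only these categories

# ... and search several indexes in one call; searchd merges the results.
result = client.Query('replication tuning', 'idx_articles idx_comments')
if result:
    for match in result['matches']:
        print(match['id'], match['attrs'])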

 

2. Efficient use of WHERE clauses and LIMIT clauses

When a SELECT query has many WHERE conditions and the indexes have poor selectivity, or the filtered fields are not indexed at all, performance suffers. Sphinx can index the keywords instead. The difference is that in MySQL the internal optimizer decides whether to use an index or a full scan, whereas Sphinx lets you choose the access method. Because Sphinx keeps attribute data in RAM, it performs little I/O. MySQL, by contrast, does a kind of semi-random disk I/O: it reads records into the sort buffer row by row, sorts them, and then discards most of them. So Sphinx ends up using less memory and less disk I/O.
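
A rough Python equivalent of "WHERE ... AND ... ORDER BY ... LIMIT ..." expressed against Sphinx; the attribute names (category_id, price, published_ts) and the index name are placeholders.

import sphinxapi

client = sphinxapi.SphinxClient()
client.SetServer('localhost', 9312)

# Roughly: WHERE category_id IN (5) AND price BETWEEN 10 AND 100
client.SetFilter('category_id', [5])
client.SetFilterRange('price', 10, 100)

# Roughly: ORDER BY published_ts DESC LIMIT 40, 20
client.SetSortMode(sphinxapi.SPH_SORT_ATTR_DESC, 'published_ts')
client.SetLimits(40, 20)                             # offset, limit

result = client.Query('ssd benchmark', 'idx_products')
print(result['total_found'] if result else client.GetLastError())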

 

3. Optimize GROUP BY query

Sorting and grouping in Sphinx are done in a fixed amount of memory, which makes it somewhat more efficient than MySQL even for queries whose entire working set fits in RAM.
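
A sketch of a grouped query with the Python client; the group-by attribute category_id and the index name are hypothetical.

import sphinxapi

client = sphinxapi.SphinxClient()
client.SetServer('localhost', 9312)

# Roughly: GROUP BY category_id ORDER BY COUNT(*) DESC
client.SetGroupBy('category_id', sphinxapi.SPH_GROUPBY_ATTR, '@count desc')

result = client.Query('laptop', 'idx_products')
if result:
    for match in result['matches']:
        # each grouped match carries @groupby and @count virtual attributes
        print(match['attrs']['@groupby'], match['attrs']['@count'])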

 

4. Generate result sets in parallel

Sphinx allows you to produce several result sets from the same data at the same time, again using a fixed amount of memory. In contrast, the traditional SQL approach either runs two separate queries or creates a temporary table for each result set. Sphinx accomplishes this with its multi-query mechanism: instead of issuing queries one by one, several queries are batched together and submitted in a single request.
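
A sketch of the multi-query mechanism with the Python client: the same full-text search is grouped two different ways in a single round trip. The attribute and index names are placeholders.

import sphinxapi

client = sphinxapi.SphinxClient()
client.SetServer('localhost', 9312)

# First result set: matches grouped by day.
client.SetGroupBy('published_day', sphinxapi.SPH_GROUPBY_ATTR, '@count desc')
client.AddQuery('kernel panic', 'idx_articles')

# Second result set: the same search grouped by author, batched into the same request.
client.ResetGroupBy()
client.SetGroupBy('author_id', sphinxapi.SPH_GROUPBY_ATTR, '@count desc')
client.AddQuery('kernel panic', 'idx_articles')

results = client.RunQueries()          # one network round trip, several result sets
for res in results or []:
    print(res['total_found'])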

 

5. Scale up and scale out

Scale up: add CPUs/cores and expand disk I/O. Scale out: use multiple machines, i.e. distributed Sphinx.

 

6. Aggregate sharded data

This is ideal when data is distributed across different physical MySQL servers. Example: a 1 TB table with 1 billion articles is sharded across 10 MySQL servers by user ID. Queries for a single user are of course fast, but suppose you need an archive-paging feature that shows all articles published by a user's friends. That requires hitting several MySQL servers at the same time, which is very slow. With Sphinx you only need to create a few instances, map the frequently accessed article attributes from each table, and then run the paged query; the distributed setup takes a total of about three lines of configuration.
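
A sketch of that paged "articles by my friends" query against a hypothetical distributed index idx_articles_dist; the friend IDs and attribute names are placeholders.

import sphinxapi

friend_ids = [1001, 1002, 1003]        # placeholder: the user's friends

client = sphinxapi.SphinxClient()
client.SetServer('localhost', 9312)

# Keep only articles written by the friends, newest first, page 2 with 20 per page.
client.SetFilter('user_id', friend_ids)
client.SetSortMode(sphinxapi.SPH_SORT_ATTR_DESC, 'published_ts')
client.SetLimits(20, 20)               # offset, limit
client.SetMatchMode(sphinxapi.SPH_MATCH_FULLSCAN)   # no keyword, filter-only query

# searchd fans the query out to the shards behind the distributed index
# and merges the results before returning them.
result = client.Query('', 'idx_articles_dist')
print(result['total_found'] if result else client.GetLastError())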
