[Sphinx for full-text search]

Sphinx is an SQL-oriented full-text search engine. Combined with MySQL or PostgreSQL, it provides far more capable full-text search than the databases' built-in facilities, making it easy for applications to offer professional search. Sphinx ships native search API clients for scripting languages such as PHP, Python, Perl, and Ruby, and it also provides a storage-engine plug-in for MySQL.

A single Sphinx index can hold up to 100 million records, and queries over 10 million records complete in fractions of a second. Index creation is fast as well: an index of 1 million records takes only 3 to 4 minutes to build, an index of 10 million records about 50 minutes, and an incremental index covering only the latest 100,000 records can be rebuilt in a few tens of seconds.

Key features of Sphinx include:

- High-speed indexing (nearly 10 MB/sec on modern CPUs)
- High-speed search (average query time under 0.1 sec on 2-4 GB of text)
- High scalability (up to 100 GB of text and 100 million documents on a single CPU)
- Good relevance ranking, with support for distributed search
- Document summary (snippet) generation
- Search from within MySQL via the plug-in storage engine
- Boolean, phrase, and word-proximity queries
- Multiple full-text fields per document (up to 32 by default)
- Multiple attributes per document
- Word segmentation
- Single-byte encodings and UTF-8


How Sphinx Works

The overall workflow of Sphinx: the indexer program extracts data from the database, performs word segmentation on it, builds one or more indexes from the segmented words, and hands them over to the searchd program. Clients can then search through API calls.
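As a concrete illustration, here is a minimal sphinx.conf sketch wiring the indexer to a MySQL source and exposing searchd. The database, table, and field names (blog, articles, user_id, and so on) are assumptions for the example, not taken from this article:

```
source articles_src
{
    type      = mysql
    sql_host  = localhost
    sql_user  = sphinx
    sql_pass  = secret
    sql_db    = blog
    # the first selected column must be the unique document ID
    sql_query = SELECT id, user_id, title, content, \
                UNIX_TIMESTAMP(created_at) AS created_at FROM articles
    sql_attr_uint      = user_id
    sql_attr_timestamp = created_at
}

index articles
{
    source = articles_src
    path   = /var/lib/sphinx/articles
}

searchd
{
    listen   = 9312             # native API port
    listen   = 9306:mysql41     # SphinxQL over the MySQL protocol
    log      = /var/log/sphinx/searchd.log
    pid_file = /var/run/sphinx/searchd.pid
}
```

Running `indexer --all` builds the index; starting `searchd` then serves queries against it.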


Why use Sphinx

A scenario we encountered

We ran into a requirement like this: users need to search articles by both title and body text, but article titles and article bodies are stored in different databases, located in different data centers.

Candidate approaches

A. Run cross-database LIKE queries directly in the database

Advantages: simple to implement.
Disadvantages: inefficient, and it incurs significant network overhead.

 

B. Use the Sphinx search engine with Chinese word segmentation

Advantages: efficient, with good scalability.
Disadvantages: Sphinx does not take responsibility for data storage.

 

With Sphinx, the data is indexed up front, loaded in one pass, and then held in memory, so a user search only needs to hit the Sphinx server. Sphinx also avoids the disk I/O that drags MySQL down, so its performance is better.

 

Other typical usage scenarios

1. Fast, efficient, scalable core full-text search

When the data volume is large, Sphinx searches faster than MyISAM or InnoDB. It can build indexes over mixed data from multiple source tables, rather than being limited to the fields of a single table, and it can merge search results from multiple indexes. Full-text searches can also be filtered by additional conditions on attributes.

 

2. Efficient use of WHERE clauses and LIMIT clauses

When a SELECT has multiple WHERE conditions and index selectivity is poor, or the filtered fields are not indexed at all, performance suffers. Sphinx indexes the keywords instead. One difference is that in MySQL the internal engine decides whether to use an index or a full scan, whereas Sphinx lets you choose the access method. Because Sphinx keeps its data in RAM, it performs few I/O operations. MySQL, by contrast, does a kind of semi-random disk I/O: it reads records into the sort buffer row by row, sorts them, and finally discards most of the rows. Sphinx therefore uses less memory and less disk I/O.
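To make this concrete, here is a hedged SphinxQL sketch, sent to searchd over its MySQL-protocol port (for example `mysql -h 127.0.0.1 -P 9306`). The index name `articles` and the attributes `user_id` and `created_at` are assumptions for illustration:

```sql
-- Full-text MATCH() plus attribute filters and LIMIT, all evaluated
-- inside searchd rather than by MySQL's optimizer.
SELECT id, user_id
FROM articles
WHERE MATCH('database index') AND user_id = 42
ORDER BY created_at DESC
LIMIT 0, 20;
```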

 

3. Optimized GROUP BY queries

Sorting and grouping in Sphinx use a fixed amount of memory, which is slightly more efficient than MySQL even when the query's whole dataset fits in RAM.

 

4. Generate result sets in parallel

Sphinx can generate several result sets from the same data at the same time, again using a fixed amount of memory. The traditional SQL approach either runs two separate queries or creates a temporary table for each result set. Sphinx instead uses a multi-query mechanism: rather than issuing queries one by one, several queries are batched together and submitted in a single request.
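As a hedged SphinxQL sketch (names assumed as before), two SELECTs sharing the same MATCH() can be sent in one batch, letting searchd evaluate the full-text part once:

```sql
-- Both statements are submitted together in a single request; because
-- they share the MATCH() expression, searchd can compute it once and
-- derive both result sets from it.
SELECT id FROM articles WHERE MATCH('sphinx') ORDER BY created_at DESC LIMIT 20;
SELECT user_id, COUNT(*) FROM articles WHERE MATCH('sphinx') GROUP BY user_id;
```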

 

5. Scale up and scale out

Scale up: add CPUs/cores and expand disk I/O.
Scale out: add machines, i.e. distributed Sphinx.

 

6. Aggregate sharded data

Sphinx is ideal when data is distributed across different physical MySQL servers. Example: a 1 TB table holding 1 billion articles is sharded across 10 MySQL servers by user ID. Queries scoped to a single user are of course fast, but a feature like archive paging, showing all articles published by a given user's friends, would have to query several MySQL servers at the same time, which is very slow. With Sphinx, you only need to create a few instances, map the frequently accessed article attributes from each table, and then run the paged query; the configuration amounts to about three lines.
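A hedged sketch of what that aggregation layer could look like in sphinx.conf: a distributed index that fans queries out to per-shard indexes. The host names and the index name `articles` are assumptions:

```
index articles_all
{
    type  = distributed
    # one agent per shard; each shard runs its own searchd with a
    # local `articles` index built from its MySQL server
    agent = shard01.example.com:9312:articles
    agent = shard02.example.com:9312:articles
    agent = shard03.example.com:9312:articles
}
```

Queries against `articles_all` are forwarded to every agent, and the partial results are merged by the aggregating searchd.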


Inverted index

Under full-text search, an inverted index is a data structure that stores, for each word, a mapping to the locations where that word occurs in a document or a set of documents. It is the most commonly used data structure in document retrieval systems.

Inverted index: a concrete storage form that implements the "word-document matrix". Through the inverted index, the list of documents containing a given word can be obtained quickly.

A traditional (forward) index maps index ID -> document content, while an inverted index maps document content (after word segmentation) -> index ID. The relationship can be understood by analogy with forward and reverse proxies: a forward proxy relays internal requests to the outside, while a reverse proxy relays external requests to the inside. In that sense, "transposed index" might be the more fitting name.
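A minimal Python sketch of this transposition: starting from a forward index (document ID -> content), we derive the inverted index (word -> sorted list of document IDs). The three sample documents and the whitespace "segmentation" are toy assumptions standing in for a real tokenizer:

```python
# Forward index: doc ID -> content. Inverting it yields word -> doc IDs.
from collections import defaultdict

forward_index = {
    1: "sphinx indexes text fast",
    2: "mysql stores text",
    3: "sphinx works with mysql",
}

def build_inverted_index(forward):
    inverted = defaultdict(set)
    for doc_id, content in forward.items():
        for word in content.split():  # trivial whitespace "segmentation"
            inverted[word].add(doc_id)
    # posting lists are conventionally kept sorted by document ID
    return {word: sorted(ids) for word, ids in inverted.items()}

inverted_index = build_inverted_index(forward_index)
print(inverted_index["sphinx"])  # -> [1, 3]
print(inverted_index["mysql"])   # -> [2, 3]
```

Given a query word, the engine now fetches its posting list directly instead of scanning every document's content.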

The inverted index mainly consists of two parts: the "word dictionary" and the "inverted file".

The word dictionary is a very important part of the inverted index. It maintains information about every word that appears in the document collection, and it records where each word's inverted (posting) list is located within the inverted file. To serve a search, the engine looks the user's query words up in the word dictionary, obtains the corresponding posting lists, and uses them as the basis for subsequent ranking.

 

A large-scale document collection may contain hundreds of thousands or even millions of distinct words, and how quickly a word can be located directly affects search response time. An efficient data structure is therefore needed to build and search the word dictionary; common choices are hash tables combined with linked lists, and tree-structured dictionaries.

 

Inverted Index Basics

Document: search engines generally process Internet web pages, but the concept of a document is broader, denoting any storage object in text form. It covers more formats than web pages: files in formats such as Word, PDF, HTML, and XML can all be called documents, and so can an e-mail, a text message, or a microblog post. In the remainder of this text, "document" is used to represent textual information.

Document Collection: A collection composed of several documents is called a document collection. For example, a large number of Internet web pages or a large number of e-mails are specific examples of document collections.

Document ID: inside the search engine, each document in the collection is given a unique internal number that serves as its identifier, for convenience of internal processing. This internal number is called the "document number"; DocID is sometimes used below as shorthand for it.

Word ID: similar to the document number, the search engine represents each word by a unique number, which serves as the word's unique identifier.

The indexer program segments the fetched records according to the configured word-segmentation algorithm and then stores them using an inverted index as the data structure.
