Awesome! Notes on optimizing LIKE searches against a single table with tens of millions of rows

We often use the LIKE operator for fuzzy searches in a database. LIKE is used in a WHERE clause to search a column for a specified pattern.

To find every customer in the Customer table whose surname is "张" (Zhang), you can use the following SQL statement:

SELECT * FROM Customer WHERE Name LIKE '张%'

To find every customer in the Customer table whose mobile phone number ends in "1234", you can use the following SQL statement:

SELECT * FROM Customer WHERE Phone LIKE '%1234'

To find every customer in the Customer table whose name contains "秀" (Xiu), you can use the following SQL statement:

SELECT * FROM Customer WHERE Name LIKE '%秀%'

These three queries are, respectively, a left prefix match, a right suffix match, and a fuzzy (contains) query, and each calls for a different optimization method.

Data overview

Suppose we have a table named tbl_like that holds every sentence from the Four Great Classical Novels of Chinese literature, tens of millions of rows in total.

Left prefix matching query optimization

To query all sentences beginning with "孙悟空" (Sun Wukong, the Monkey King), you can use the following SQL statement:

SELECT * FROM tbl_like WHERE txt LIKE '孙悟空%'

Even though SQL Server is a fairly powerful database, the query takes more than 800 milliseconds, which is not fast.

We can build an index on the txt column to optimize the query:

CREATE INDEX tbl_like_txt_idx ON [tbl_like] ( [txt] )

With the index in place, the query is much faster and takes only 5 milliseconds.

From this we can see that a left prefix match can be sped up by adding an index on the column.
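Why the index helps can be sketched outside the database. A B-tree index keeps its values sorted, so a pattern such as '孙悟空%' can be answered as a range lookup instead of a scan over every row. The Python snippet below is only an illustration under that assumption: a sorted list stands in for the index, and the function name prefix_range and the sample strings are made up for this sketch.

```python
import bisect


def prefix_range(sorted_values, prefix):
    """Return every value in an already sorted list that starts with prefix."""
    if not prefix:
        return list(sorted_values)
    # Smallest string that sorts after every string starting with prefix:
    # bump the last character by one code point (assumes it is not U+10FFFF).
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    lo = bisect.bisect_left(sorted_values, prefix)
    hi = bisect.bisect_left(sorted_values, upper)
    return sorted_values[lo:hi]


# Hypothetical sample data, sorted the way an index would keep it.
rows = sorted(["唐僧取经", "孙悟空三打白骨精", "孙悟空大闹天宫", "猪八戒背媳妇"])
print(prefix_range(rows, "孙悟空"))
```

Both bounds are found by binary search, so the work grows roughly with the logarithm of the number of rows plus the number of matches, rather than with the size of the whole table.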

Right suffix matching query optimization

The index above does not help a right suffix match. Use the following SQL statement to query all rows ending with "孙悟空" (Sun Wukong):

SELECT * FROM tbl_like WHERE txt LIKE '%孙悟空'

This query is very slow and takes 2.5 seconds.

We can trade space for time to solve the low efficiency of the right suffix match.

Put simply, we reverse the string so that the right suffix match becomes a left prefix match. Take "防着古海回来再抓孙悟空" (roughly, "guard against Gu Hai coming back to catch Sun Wukong") as an example: reversed, it becomes "空悟孙抓再来回海古着防". To find the rows ending with "孙悟空", we only need to search for the rows whose reversed text starts with "空悟孙".

The specific method is: add the "txt_back" column to the table, invert the value of the "txt" column, fill in the "txt_back" column, and finally add an index for the "txt_back" column.

ALTER TABLE tbl_like ADD txt_back nvarchar(1000);-- add the new column
UPDATE tbl_like SET txt_back = reverse(txt); -- populate txt_back with the reversed text
CREATE INDEX tbl_like_txt_back_idx ON [tbl_like] ( [txt_back] );-- add an index on the txt_back column

After the data table is adjusted, our SQL statement also needs to be adjusted:

SELECT * FROM tbl_like WHERE txt_back LIKE '空悟孙%'

After this change, the query runs very fast.

From this we can see that a right suffix match can be sped up by adding a reversed column, which turns the right suffix match into a left prefix match.
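The same idea can be sketched in a few lines of Python. This is an illustration, not the author's implementation: a sorted list of (reversed text, original text) pairs plays the role of the indexed txt_back column, and the names build_reversed_index and ends_with, as well as the sample sentences, are assumptions made for this sketch.

```python
import bisect


def build_reversed_index(texts):
    """Sorted (reversed text, original text) pairs, standing in for the txt_back index."""
    return sorted((t[::-1], t) for t in texts)


def ends_with(index, suffix):
    """Return the original texts ending with suffix, via a prefix lookup on the reversed key."""
    if not suffix:
        return [orig for _, orig in index]
    rev = suffix[::-1]
    # Exclusive upper bound: bump the last character of the reversed suffix
    # by one code point (assumes it is not U+10FFFF).
    upper = rev[:-1] + chr(ord(rev[-1]) + 1)
    # A 1-tuple sorts before any 2-tuple that has the same first element.
    lo = bisect.bisect_left(index, (rev,))
    hi = bisect.bisect_left(index, (upper,))
    return [orig for _, orig in index[lo:hi]]


# Hypothetical sample sentences.
index = build_reversed_index(["师父来了", "快去请孙悟空", "孙悟空走了", "他就是孙悟空"])
print(ends_with(index, "孙悟空"))
```

The extra list of reversed strings is the "space" being spent; the lookup itself is again a binary search over sorted keys.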

Fuzzy query optimization

To query all sentences containing "悟空" (Wukong), we use the following SQL statement:

SELECT * FROM tbl_like WHERE txt LIKE '%悟空%'

This statement cannot use the index, so the query is very slow and takes 2.7 seconds.

Unfortunately, there is no easy way to optimize this query. But no easy way does not mean no way at all. One solution is word segmentation + inverted index.

Word segmentation is the process of recombining a continuous sequence of characters into a sequence of words according to certain rules. In written English, spaces act as natural delimiters between words. In Chinese, only characters, sentences and paragraphs can be told apart by obvious delimiters; words have no formal delimiter. English also has the problem of segmenting phrases, but at the word level Chinese is much more complicated and difficult than English.

The inverted index comes from the practical need to find records by the value of an attribute. Each entry in such an index consists of an attribute value and the addresses of all records that have that value. Because the position of a record is determined by the attribute value, rather than the attribute value being determined by the record, it is called an inverted index. A file with an inverted index is called an inverted index file, or inverted file for short.

The two rather baffling paragraphs above come from Baidu Baike; feel free to skip them, as I did.

We don't need sophisticated word segmentation. Given the characteristics of Chinese, simple "binary" (two-character, or bigram) segmentation is enough.

Binary segmentation means taking every two adjacent characters of a text as one word. Take the sentence "防着古海回来再抓孙悟空" as an example. After binary segmentation the result is: 防着, 着古, 古海, 海回, 回来, 来再, 再抓, 抓孙, 孙悟, 悟空. A brief C# implementation:

public static List<String> Cut(String str)
{
       var list = new List<String>();
       var buffer = new Char[2];
       for (int i = 0; i < str.Length - 1; i++)
       {
             buffer[0] = str[i];
             buffer[1] = str[i + 1];
             list.Add(new String(buffer));
       }
       return list;
}

Test the results:
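As a quick check of the idea, here is a rough Python port of Cut. The function name cut and the use of Python are choices made for this sketch; the author's own test was run in C#. It is applied to the term "东土大唐", which is used again in the query examples further down.

```python
def cut(text):
    """Split text into overlapping two-character words (bigrams)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


print(cut("东土大唐"))  # ['东土', '土大', '大唐']
print(cut("空"))        # [] (a single character yields no bigram)
```

A text of n characters produces n - 1 words, so the inverted table will hold several times as many rows as the main table.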

We need a data table to match the segmented entries with the original data. In order to obtain better efficiency, we also use a covering index:

CREATE TABLE tbl_like_word (
  [id] int identity,
  [rid] int NOT NULL,
  [word] nchar(2) NOT NULL,
  PRIMARY KEY CLUSTERED ([id])
);
CREATE INDEX tbl_like_word_word_idx ON tbl_like_word(word,rid);-- covering index

The SQL above creates a table named "tbl_like_word" and adds a composite index on its "word" and "rid" columns. This is our inverted table; the next step is to fill it with data.
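Conceptually, the inverted table maps each two-character word to the ids of the rows that contain it. A minimal in-memory sketch of that mapping, assuming a Python dict in place of the tbl_like_word table (the function names and the sample rows are hypothetical):

```python
from collections import defaultdict


def cut(text):
    """Split text into overlapping two-character words (bigrams)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def build_inverted_index(rows):
    """Map each bigram to the set of row ids whose text contains it."""
    index = defaultdict(set)
    for rid, txt in rows:
        for word in cut(txt):
            index[word].add(rid)
    return index


# Hypothetical (id, txt) rows standing in for tbl_like.
rows = [(1, "孙悟空来了"), (2, "悟空快走"), (3, "师父救我")]
index = build_inverted_index(rows)
print(sorted(index["悟空"]))  # [1, 2]
```

In the database, each (word, rid) pair of this mapping becomes one row of tbl_like_word, and the composite index on (word, rid) lets the lookup be answered from the index alone.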

We use LINQPad's built-in database connection feature to connect to the database, so that we can work with it from LINQPad. The script reads tbl_like in batches of 3000 rows ordered by Id, segments the value of the txt field to generate the rows for tbl_like_word, and then stores the data in batches. The complete LINQPad code is as follows:

void Main()
{
       var maxId = 0;
       const int limit = 3000;
       var wordList = new List<Tbl_like_word>();
       while (true)
       {
             $"Processing: {limit} rows after Id {maxId}".Dump("Log");
             // Read one batch
             var items = Tbl_likes
             .Where(i => i.Id > maxId)
             .OrderBy(i => i.Id)
             .Select(i => new { i.Id, i.Txt })
             .Take(limit)
             .ToList();
             if (items.Count == 0)
             {
                    break;
             }
             // Generate the words row by row
             foreach (var item in items)
             {
                    maxId = item.Id;
                    // Skip rows that contain only a single character
                    if (item.Txt.Length < 2)
                    {
                           continue;
                    }
                    var words = Cut(item.Txt);
                    wordList.AddRange(words.Select(str => new Tbl_like_word {  Rid = item.Id, Word = str }));
             }
             // Store this batch, then clear the list so memory use stays bounded
             this.BulkInsert(wordList);
             wordList.Clear();
       }
       "Import complete.".Dump("Log");
}
// Define other methods, classes and namespaces here
public static List<String> Cut(String str)
{
       var list = new List<String>();
       var buffer = new Char[2];
       for (int i = 0; i < str.Length - 1; i++)
       {
             buffer[0] = str[i];
             buffer[1] = str[i + 1];
             list.Add(new String(buffer));
       }
       return list;
}

The above LINQPad script uses Entity Framework Core to connect to the database, and references the NuGet package "EFCore.BulkExtensions" to do data bulk insertion.
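For readers without LINQPad, the batch loop can be sketched with Python's standard sqlite3 module. This is not the author's script: it assumes an in-memory SQLite database instead of SQL Server, uses LIMIT where T-SQL would use TOP, and the names fill_word_table and build_demo are invented for the example. The reading pattern is the same: fetch the rows whose id is greater than the last one seen, in id order, one batch at a time.

```python
import sqlite3


def cut(text):
    """Split text into overlapping two-character words (bigrams)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def fill_word_table(conn, batch_size=3000):
    """Read tbl_like in id order, batch by batch, and write bigrams to tbl_like_word."""
    max_id = 0
    total = 0
    while True:
        batch = conn.execute(
            "SELECT id, txt FROM tbl_like WHERE id > ? ORDER BY id LIMIT ?",
            (max_id, batch_size),
        ).fetchall()
        if not batch:
            break
        max_id = batch[-1][0]  # remember where the next batch starts
        pairs = [(rid, word) for rid, txt in batch for word in cut(txt)]
        conn.executemany("INSERT INTO tbl_like_word (rid, word) VALUES (?, ?)", pairs)
        conn.commit()  # store each batch before reading the next one
        total += len(pairs)
    return total


def build_demo():
    """Create the two tables in memory with a few hypothetical rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tbl_like (id INTEGER PRIMARY KEY, txt TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE tbl_like_word ("
        "id INTEGER PRIMARY KEY, rid INTEGER NOT NULL, word TEXT NOT NULL)"
    )
    conn.execute("CREATE INDEX tbl_like_word_word_idx ON tbl_like_word (word, rid)")
    conn.executemany(
        "INSERT INTO tbl_like (id, txt) VALUES (?, ?)",
        [(1, "孙悟空来了"), (2, "走"), (3, "悟空快走")],
    )
    return conn


conn = build_demo()
total = fill_word_table(conn, batch_size=2)
print(total)
```

Filtering on id > max_id instead of using an offset keeps every batch cheap, because the primary key index can jump straight to the starting row.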

After that, we can write the query: look up the inverted index first, then join back to the main table:

SELECT TOP 10 * FROM tbl_like WHERE id IN (
SELECT rid FROM tbl_like_word WHERE word IN ('悟空'))

The query is very fast and takes only a dozen or so milliseconds.

Because every sentence has been split into two-character words, a fuzzy query for a single character is more economically served by LIKE directly. If the search term is longer than two characters, the term itself has to be segmented as well. To search for the term "东土大唐" (the Great Tang of the Eastern Land), the constructed query might look like this:

SELECT TOP 10 * FROM tbl_like WHERE id IN (
SELECT rid FROM tbl_like_word WHERE word IN ('东土','土大','大唐'))

However, this query does not do what we expect, because it also returns sentences that contain only one of the words, for example only "土大".

We can use a small trick to solve this problem, such as grouping by rid first:

SELECT TOP
    10 *
FROM
    tbl_like
WHERE
    id IN (
    SELECT
        rid
    FROM
        tbl_like_word
    WHERE
        word IN ( '东土', '土大', '大唐' )
    GROUP BY
        rid
    HAVING
    COUNT ( DISTINCT ( word ) ) = 3
    )

In the SQL statement above, we group by rid and keep only the groups whose number of distinct words is three (that is, the number of words in our search term). With that, we get the correct result.

From this we can see that fuzzy queries can be sped up with word segmentation + inverted index.
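The GROUP BY query above can be tried end to end with Python's sqlite3 module. This is a sketch under stated assumptions: an in-memory SQLite database replaces SQL Server, the helper names build_demo, candidates and search are invented, and the three sample sentences are made up. It also adds one step that is not in the original text: since a row can contain all the words without containing them next to each other, the candidates are checked once more against the full term.

```python
import sqlite3


def cut(text):
    """Split text into overlapping two-character words (bigrams)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def build_demo(rows):
    """Create in-memory tbl_like and tbl_like_word tables filled from rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tbl_like (id INTEGER PRIMARY KEY, txt TEXT NOT NULL)")
    conn.execute("CREATE TABLE tbl_like_word (rid INTEGER NOT NULL, word TEXT NOT NULL)")
    conn.execute("CREATE INDEX tbl_like_word_word_idx ON tbl_like_word (word, rid)")
    conn.executemany("INSERT INTO tbl_like (id, txt) VALUES (?, ?)", rows)
    conn.executemany(
        "INSERT INTO tbl_like_word (rid, word) VALUES (?, ?)",
        [(rid, word) for rid, txt in rows for word in cut(txt)],
    )
    return conn


def candidates(conn, term):
    """Rows that contain every distinct bigram of term (the GROUP BY / HAVING query)."""
    words = sorted(set(cut(term)))
    if not words:
        raise ValueError("term needs at least two characters")
    placeholders = ", ".join("?" for _ in words)
    sql = (
        "SELECT id, txt FROM tbl_like WHERE id IN ("
        " SELECT rid FROM tbl_like_word"
        f" WHERE word IN ({placeholders})"
        " GROUP BY rid HAVING COUNT(DISTINCT word) = ?"
        ") ORDER BY id"
    )
    return conn.execute(sql, (*words, len(words))).fetchall()


def search(conn, term, limit=10):
    """Keep only the candidates in which the whole term really occurs."""
    return [(rid, txt) for rid, txt in candidates(conn, term) if term in txt][:limit]


# Hypothetical rows: 1 contains the term, 2 contains only "土大",
# 3 contains all three words but not next to each other.
ROWS = [
    (1, "贫僧自东土大唐而来"),
    (2, "此地土大物博"),
    (3, "东土人说土大,大唐人不信"),
]
conn = build_demo(ROWS)
print(candidates(conn, "东土大唐"))
print(search(conn, "东土大唐"))
```

The count passed to HAVING is the number of distinct words in the term, which differs from the number of words when the term repeats a bigram. The final substring check only runs on the few candidate rows, so it costs little.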

Postscript

Although SQL Server is used in this demonstration, the optimization experience above applies to most relational databases, such as MySQL and Oracle.

If, like the author, you use PostgreSQL in your actual work, you can use an array type with a GIN index for the inverted index, which gives a better development and usage experience. Note that, according to the author, although PostgreSQL supports functional indexes, the index will not be hit when the function result is filtered with LIKE.

For small databases such as SQLite, LIKE searches cannot use an ordinary index in the default configuration, so the left prefix and right suffix optimizations do not take effect. However, we generally do not store large amounts of data in SQLite, even though the word segmentation + inverted index approach can be implemented there as well.

Origin blog.csdn.net/weixin_45784983/article/details/108411887