E-commerce system architecture design series (7): How to build an e-commerce product search system?

In the last article, I left you with a thought question: how do you build a product search system?

In today's article, let's talk about the product search system of e-commerce.

Introduction

Search is practically ubiquitous: very few websites or systems today lack a search function. So even if you are not a search specialist, you will inevitably run into search-related requirements. On the surface, search looks very simple: a search box where you enter keywords and get back the content you want.

The implementation behind search can be very simple. How simple? A single SQL statement with LIKE will do. It can also be very complicated. How complicated? Leaving aside companies that specialize in search, such as Baidu and Google, even at Internet companies whose core business is not search, most search teams run to around a thousand people, including not only programmers but also algorithm engineers, business experts, and so on. What separates the two extremes is only the speed of the search and the quality of the results.

In this article, we will use the product search in e-commerce as an example to talk about how to use ES (Elasticsearch) to build a search system with a good experience quickly and at low cost.

Understand the inverted index mechanism

As we said just now, since most of our data is stored in the database, and SQL LIKE can also be used to achieve matching and search results, why do we need to build a special search system? Let's first analyze why the database is not suitable for searching.

The core requirement of search is full-text matching. For full-text matching, a database index is of no use at all (an ordinary index cannot help a query such as LIKE '%keyword%'), so the database can only scan the whole table. A full-table scan is already slow, and that is not all: every record must also be matched against the keyword, comparing character by character, which is slower still. So using a database for search cannot meet the performance requirements.
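To make the point concrete, here is a minimal sketch that asks a database how it would execute the two kinds of query. It assumes SQLite (from Python's standard library) as a stand-in for MySQL, and the table name, index name and sample rows are made up for illustration; the exact plan text differs between engines and versions, but the pattern is the same: an exact match can use the index, while LIKE with a leading wildcard has to scan.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable 'detail' column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]


# Hypothetical product table with an ordinary index on the title column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sku (sku_id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("CREATE INDEX idx_title ON sku (title)")
conn.executemany(
    "INSERT INTO sku VALUES (?, ?)",
    [
        (666, "Yantai Red Fuji apple 5kg"),
        (888, "Apple iPhone XS Max 256GB mobile phone"),
    ],
)

# An exact match can be answered by searching the index.
exact_plan = query_plan(conn, "SELECT sku_id FROM sku WHERE title = ?", ("x",))

# A pattern with a leading wildcard cannot: every row has to be visited.
like_plan = query_plan(
    conn, "SELECT sku_id FROM sku WHERE title LIKE ?", ("%apple%",)
)

matches = sorted(
    row[0]
    for row in conn.execute(
        "SELECT sku_id FROM sku WHERE title LIKE ?", ("%apple%",)
    )
)

print(exact_plan)
print(like_plan)
print(matches)
```

The LIKE query still returns the right rows; the problem is only that the cost grows with the size of the table.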

So how does ES solve the search problem? Let's give an example to illustrate, assuming we have two products, one is Yantai Red Fuji apple, and the other is iPhone XS Max.

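DOCID    Product title
666      Yantai Red Fuji apple 5kg first-class platinum large single fruit 230g or more fresh fruit
888      Apple iPhone XS Max (A2104) 256GB Gold Mobile Unicom Telecom 4G Mobile Phone Dual SIM Dual Standby

The DOCID in this table is the ID that uniquely identifies a record, similar to a primary key in a database.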

To support fast full-text search, ES uses a special index for text: the inverted index. So what does the inverted index for these two products look like in ES? The excerpt below shows the entries for two of the words.

Word            DOCIDs
apple           666, 888
mobile phone    888

As you can see, the inverted index uses words as its keys. The value for each word is a list whose elements are the DOCIDs of the product records that contain that word.

How is this inverted index constructed?

When we write product records to ES, ES will first segment the field that needs to be searched, that is, the product title. Word segmentation is to split a continuous text into multiple words according to semantics. Then ES indexes the product records according to the words, forming an inverted index like the above table.
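The construction just described can be sketched in a few lines. This is only an illustration: the word lists below are segmented by hand and are an assumption, whereas in ES an analyzer does the segmentation, and the function name is hypothetical.

```python
from collections import defaultdict

# Hypothetical, hand-segmented product titles keyed by DOCID.
# In ES the analyzer would produce these word lists.
SEGMENTED_TITLES = {
    666: ["Yantai", "Red Fuji", "apple", "fresh", "fruit"],
    888: ["apple", "iPhone", "XS", "Max", "mobile phone"],
}


def build_inverted_index(segmented_docs):
    """Map each word to the sorted list of DOCIDs whose title contains it."""
    index = defaultdict(list)
    for doc_id in sorted(segmented_docs):
        # dict.fromkeys removes repeated words while keeping their order.
        for word in dict.fromkeys(segmented_docs[doc_id]):
            index[word].append(doc_id)
    return dict(index)


inverted_index = build_inverted_index(SEGMENTED_TITLES)

for word, doc_ids in inverted_index.items():
    print(word, doc_ids)
```

The word "apple" ends up pointing at both products, while "mobile phone" points only at the phone, which is exactly what the search step relies on.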

When we search for the keyword "苹果手机" (literally "apple mobile phone"), ES segments the keyword as well, for example into "苹果" (apple) and "手机" (mobile phone). ES then looks up each of these words in the inverted index, and the lookup result should be:

Word            DOCIDs
apple           666, 888
mobile phone    888

Both record 666 and record 888 match the search keywords, but product 888 matches better than product 666 because it matches both words. The results are therefore sorted by matching degree, and the search results finally returned are:

Apple iPhone XS Max (A2104) 256GB Gold Mobile Unicom Telecom 4G Mobile Phone Dual SIM Dual Standby

Yantai Red Fuji apple 5kg first-class platinum large single fruit 230g or more fresh fruit

The search results look quite reasonable.
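The lookup and sorting steps can be sketched as follows. Counting the number of matched words is a deliberate simplification used here as an assumption; ES computes a relevance score (the _score field in its responses) rather than a plain count. The index literal and the function name are hypothetical.

```python
from collections import Counter

# Hypothetical excerpt of the inverted index for the two products.
INVERTED_INDEX = {
    "apple": [666, 888],
    "mobile phone": [888],
}


def search(index, query_words):
    """Return DOCIDs ordered by how many query words they contain."""
    hits = Counter()
    for word in query_words:
        for doc_id in index.get(word, []):
            hits[doc_id] += 1
    # More matched words first; ties broken by DOCID for a stable order.
    return sorted(hits, key=lambda doc_id: (-hits[doc_id], doc_id))


ranked = search(INVERTED_INDEX, ["apple", "mobile phone"])
print(ranked)
```

Product 888 matches both words and product 666 only one, so 888 comes first.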

Why can an inverted index achieve fast search? Let's analyze the search performance of the above example.

This search process actually performs two lookups on the inverted index: one for "apple" and one for "mobile phone". Note that during the entire search process we did not do any fuzzy text matching. When the ES storage engine stores the inverted index, it certainly does not store it as a two-dimensional table like the one shown above. Its physical storage structure is similar to that of a MySQL InnoDB index: a search tree.

Doing two lookups on the inverted index means doing two lookups on a tree. The time complexity is similar to that of two index-hit lookups in MySQL. Obviously, this is several orders of magnitude faster than a MySQL full-table scan plus fuzzy matching.
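As a rough illustration of why each lookup is cheap, the sketch below keeps the words in sorted order and finds them by binary search. A sorted list with bisect is used here only as a simple stand-in for a search tree, not as a description of how ES stores its index; the data and names are hypothetical.

```python
import bisect

# Hypothetical excerpt of the inverted index.
INVERTED_INDEX = {
    "apple": [666, 888],
    "mobile phone": [888],
    "Yantai": [666],
}

# Keep the words sorted so that each one can be found by binary search,
# which takes a number of steps logarithmic in the number of words.
TERMS = sorted(INVERTED_INDEX)
POSTINGS = [INVERTED_INDEX[term] for term in TERMS]


def lookup(word):
    """Return the DOCID list for an exact word, or [] if it is absent."""
    pos = bisect.bisect_left(TERMS, word)
    if pos < len(TERMS) and TERMS[pos] == word:
        return POSTINGS[pos]
    return []


# The search in the example is just two such lookups.
results = [lookup(word) for word in ("apple", "mobile phone")]
print(results)
```

Neither lookup touches the product titles themselves, which is why no text comparison over the whole data set is needed.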

How to build a product index in ES?

Now that we understand the principle of the inverted index, let's use ES to build a product index and implement a simple product search system. Although ES was born for search, it is still a storage system in essence. For most concepts in ES you can find a corresponding term in a relational database. To help you grasp these concepts quickly, here is how they correspond:

ES          Relational database
INDEX       table
DOCUMENT    row
FIELD       column
MAPPING     table structure

In ES, the logical structure of data is similar to MongoDB. Each piece of data is called a DOCUMENT, or DOC for short, and a DOC is a JSON object. Each JSON field in a DOC is called a FIELD. A group of DOCs with the same fields is stored together in a logical container called an INDEX, and the JSON structure of these DOCs is called the MAPPING. The hardest of these to get used to is INDEX: it is similar to a table in MySQL, not to the index for finding data that we usually have in mind.

ES is a server-side program developed in Java, with no external dependencies other than Java, so installation and deployment are very simple. You can install ES by following its official documentation, or refer to my ELK tutorial.

In addition, for ES to support Chinese word segmentation, you need to install a Chinese word segmentation plug-in, IK Analysis for Elasticsearch. The job of this plug-in is to tell ES how to segment Chinese text.

To implement product search, we first need to store the product information in ES. We start by defining the data structure of the products stored in ES, that is, the MAPPING.

Our MAPPING only needs two fields, sku_id is the product ID, and title saves the title of the product. When the user searches for a product, we match the product title in ES and return the sku_id list of eligible products.

ES provides a standard RESTful interface by default. It does not require a client and can be accessed directly using the HTTP protocol. You can use curl to operate ES through the command line.

Next, we use the above MAPPING to create INDEX, which is similar to creating a table in MySQL.

curl -X PUT "localhost:9200/sku" -H 'Content-Type: application/json' -d '{
        "mappings": {
                "properties": {
                        "sku_id": {
                                "type": "long"
                        },
                        "title": {
                                "type": "text",
                                "analyzer": "ik_max_word",
                                "search_analyzer": "ik_max_word"
                        }
                }
        }
}'
{"acknowledged":true,"shards_acknowledged":true,"index":"sku"}

Here we use the PUT method to create an INDEX named "sku"; the name is written directly in the request URL. The request body is a JSON object containing the MAPPING we defined above, that is, the data structure. Note that because we want full-text search on the title field, we define its data type as text and specify the IK Chinese word segmentation plug-in we just installed as the analyzer for this field.
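If you would rather issue the same request from a program than from curl, a minimal sketch using only Python's standard library might look like this. The base URL assumes a local node on port 9200 as in the curl example, and the helper name is hypothetical. The sketch only builds the request; actually sending it requires a running ES node with the IK plug-in installed.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumed local node, as in the curl example

# The same MAPPING as in the curl command above.
SKU_MAPPING = {
    "mappings": {
        "properties": {
            "sku_id": {"type": "long"},
            "title": {
                "type": "text",
                "analyzer": "ik_max_word",
                "search_analyzer": "ik_max_word",
            },
        }
    }
}


def build_create_index_request(index_name, mapping, base_url=ES_URL):
    """Build (but do not send) the PUT request that creates an INDEX."""
    return urllib.request.Request(
        url=f"{base_url}/{index_name}",
        data=json.dumps(mapping).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_create_index_request("sku", SKU_MAPPING)
print(request.get_method(), request.full_url)

# To actually create the INDEX against a running node:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```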

After creating the INDEX, you can write product data into it. Inserting data uses the HTTP POST method:

curl -X POST "localhost:9200/sku/_doc/" -H 'Content-Type: application/json' -d '{
        "sku_id": 100002860826,
        "title": "烟台红富士苹果 5kg 一级铂金大果 单果230g以上 新鲜水果"
}'
{"_index":"sku","_type":"_doc","_id":"yxQVSHABiy2kuAJG8ilW","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":0,"_primary_term":1}

curl -X POST "localhost:9200/sku/_doc/" -H 'Content-Type: application/json' -d '{
        "sku_id": 100000177760,
        "title": "苹果 Apple iPhone XS Max (A2104) 256GB 金色 移动联通电信4G手机 双卡双待"
}'
{"_index":"sku","_type":"_doc","_id":"zBQWSHABiy2kuAJGgim1","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":1,"_primary_term":1}

Here we have inserted two product records, one for the Yantai Red Fuji apples and one for the iPhone. Now we can search for products directly. Searching uses the HTTP GET method.

curl -X GET 'localhost:9200/sku/_search?pretty' -H 'Content-Type: application/json' -d '{
  "query" : { "match" : { "title" : "苹果手机" }}
}'
{
  "took" : 23,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.8594865,
    "hits" : [
      {
        "_index" : "sku",
        "_type" : "_doc",
        "_id" : "zBQWSHABiy2kuAJGgim1",
        "_score" : 0.8594865,
        "_source" : {
          "sku_id" : 100000177760,
          "title" : "苹果 Apple iPhone XS Max (A2104) 256GB 金色 移动联通电信4G手机 双卡双待"
        }
      },
      {
        "_index" : "sku",
        "_type" : "_doc",
        "_id" : "yxQVSHABiy2kuAJG8ilW",
        "_score" : 0.18577608,
        "_source" : {
          "sku_id" : 100002860826,
          "title" : "烟台红富士苹果 5kg 一级铂金大果 单果230g以上 新鲜水果"
        }
      }
    ]
  }
}

Let's look at the URL in the request first. "sku" means we search in the INDEX named sku, "_search" is a keyword that means a search is to be performed, and the parameter pretty asks ES to format the returned JSON so that it is easy to read. Now look at the JSON in the request body. The match inside query indicates that full-text matching is required, the field to match is title, and the keyword is "苹果手机" (apple mobile phone).

It can be seen that in the returned result, 2 product records are matched, which is consistent with the expected returned result when we explained the inverted index earlier.
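Since the goal stated earlier is to return the list of sku_id values for the matching products, here is a small sketch of pulling that list out of the response. The sample text is a trimmed copy of the response shown above, and the function name is hypothetical.

```python
import json

# Trimmed copy of the search response shown above.
SAMPLE_RESPONSE = """
{
  "hits": {
    "total": {"value": 2, "relation": "eq"},
    "max_score": 0.8594865,
    "hits": [
      {"_id": "zBQWSHABiy2kuAJGgim1", "_score": 0.8594865,
       "_source": {"sku_id": 100000177760}},
      {"_id": "yxQVSHABiy2kuAJG8ilW", "_score": 0.18577608,
       "_source": {"sku_id": 100002860826}}
    ]
  }
}
"""


def extract_sku_ids(response_text):
    """Return the sku_id of every hit, in the order ES returned them."""
    body = json.loads(response_text)
    return [hit["_source"]["sku_id"] for hit in body["hits"]["hits"]]


sku_ids = extract_sku_ids(SAMPLE_RESPONSE)
print(sku_ids)
```

ES returns the hits already sorted by _score, so the list keeps the iPhone ahead of the apples without any extra sorting.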

Let's review the entire process of using ES to build a product search service: first install ES and start the service, then create an INDEX and define its MAPPING, write the data, and finally execute a query and get the results back. This is the same process we follow with a database: create a table, insert data, then query. So you can simply treat ES as a database that supports full-text search.

Summary

ES is essentially a distributed storage system that supports full-text search (it leans heavily on memory for speed, but the data itself is persisted to disk), and it is especially suitable for building search systems. The most important reason ES delivers very good full-text search performance is its use of inverted indexes.

The inverted index is an index structure designed specifically for search. It first segments the fields to be indexed into words, then uses those words as index keys to form a search tree, thereby converting a full-text matching search into lookups on a tree. This is the fundamental reason the inverted index can perform searches quickly.

However, compared with the B-tree index used in general databases, the inverted index has poor write and update performance. Therefore, the inverted index is only suitable for full-text search and not suitable for frequently updated transaction data.

Thanks for reading, if you think this article has inspired you, please share it with your friends.

Thought question

What should you do when the order data keeps growing and the database gets slower and slower?

You are welcome to leave a message or get in touch to discuss with me: "learn together, grow together".

Previous article

E-commerce system architecture design series (6): What issues should be considered in the design of the "account system" of e-commerce?


Origin blog.csdn.net/hemin1003/article/details/132085399