Architectural Thinking Growth Series Tutorials (10) - E-commerce Search Engine Architecture Design

Background

An e-commerce search engine is a tool that helps customers quickly find the products they want to buy.

Content

An e-commerce search engine succeeds when each search in a session brings the customer closer to what they actually need. The sooner the customer reaches a product page and starts browsing, the more accurate the returned search results were.

E-commerce search is a vertical application of general-purpose search engine technology. To understand it, first look at the technical architecture of a complete search engine.

Technical architecture of a search engine

As shown in the figure, the technical framework of a complete search engine is divided into three parts: information collection, building the index library, and providing retrieval services.

Search Engine Technical Architecture
  • 1. Information collection

This step discovers and collects information and data on the Internet, usually by fetching web pages with a crawler (also called a spider). Each independent search engine has its own crawler. The crawler follows the hyperlinks in a page from one site to another and, by analyzing and visiting links continuously, fetches more and more pages. The stored copies of fetched pages are called web page snapshots. Because hyperlinks are so widely used on the Internet, in theory the vast majority of web pages can be collected starting from a limited set of pages.
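The link-following behavior described above can be sketched as a breadth-first traversal. This is only a minimal illustration: `LINK_GRAPH`, `fetch_links`, and the `example.com` URLs are hypothetical stand-ins for real HTTP fetching and HTML parsing, which a production crawler would also combine with politeness rules and deduplication.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the hyperlinks found on that page.
LINK_GRAPH = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/c"],
    "https://example.com/b": ["https://example.com/"],
    "https://example.com/c": [],
    "https://example.com/orphan": [],  # no page links here, so it is never found
}


def fetch_links(url):
    """Stand-in for downloading a page and extracting its hyperlinks."""
    return LINK_GRAPH.get(url, [])


def crawl(seed_urls, max_pages=100):
    """Breadth-first crawl from the seeds; returns URLs in visiting order."""
    visited = []
    seen = set(seed_urls)        # URLs already queued, to avoid revisiting
    queue = deque(seed_urls)
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)      # a real crawler would store a snapshot here
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


pages = crawl(["https://example.com/"])
```

Pages that no crawled page links to (the `orphan` entry here) are never reached, which is why the claim about collecting most pages holds only "in theory".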

  • 2. Build an index library

This step extracts and organizes the collected information to build the index library. After fetching web pages, the search engine must do a lot of preprocessing before it can serve queries. The most important steps are extracting keywords and building the index. Depending on the application scenario, other processing may include removing duplicate pages, word segmentation (for Chinese), classifying page types, analyzing hyperlinks, and calculating the importance or richness of pages.
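Keyword extraction and index building are commonly implemented with an inverted index, which maps each term to the documents that contain it. The sketch below assumes that structure; the sample product titles and the regex tokenizer are illustrative only, and Chinese text would need a dedicated word segmenter in place of `tokenize`.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {term: {doc_id: term_frequency}}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)


# Hypothetical product titles used as documents.
DOCS = {
    1: "Red running shoes",
    2: "Blue running jacket",
    3: "Red red wine glasses",
}

index = build_inverted_index(DOCS)
```

Looking up `index["red"]` then gives the documents containing "red" together with how often the term occurs in each, which is the raw material for relevance scoring at query time.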

  • 3. Provide retrieval services

The retriever serves queries based on the keywords the user enters. After receiving the keywords, the system quickly looks up matching documents in the index library, evaluates the relevance of each document to the query, sorts the results, and returns them to the user. In addition to the page title and URL, results usually include a snippet from the page and other information for the user's convenience.

Search is already a mature technology, so it is not covered in more depth here; interested readers can find more material online.

Two technical points in the search architecture deserve a closer look: distributed indexing and distributed search.
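The look up, score, sort, and return flow can be sketched as follows. Relevance here is just the summed term frequency of the query terms, a deliberately naive stand-in for real ranking functions; the documents, the `shop.example` URLs, and the 40-character snippet length are all made up for illustration.

```python
import re
from collections import Counter, defaultdict

# Hypothetical documents; the URLs are placeholders.
DOCS = {
    1: {"title": "Red running shoes", "url": "https://shop.example/p/1",
        "body": "Lightweight red shoes for road running."},
    2: {"title": "Blue running jacket", "url": "https://shop.example/p/2",
        "body": "Windproof jacket for running in cold weather."},
    3: {"title": "Red wine glasses", "url": "https://shop.example/p/3",
        "body": "Set of six glasses for red wine."},
}


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Inverted index: term -> Counter({doc_id: term_frequency})."""
    index = defaultdict(Counter)
    for doc_id, doc in docs.items():
        for term in tokenize(doc["title"] + " " + doc["body"]):
            index[term][doc_id] += 1
    return index


def search(query, index, docs, top_k=10):
    """Score documents by summed term frequency and return the best top_k."""
    scores = Counter()
    for term in tokenize(query):
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    # Highest score first; ties are broken by the smaller doc_id.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [
        {"title": docs[doc_id]["title"], "url": docs[doc_id]["url"],
         "snippet": docs[doc_id]["body"][:40], "score": score}
        for doc_id, score in ranked[:top_k]
    ]


results = search("red running", build_index(DOCS), DOCS)
```

Each result carries the title, URL, and a snippet, matching the fields the paragraph above says are returned to the user.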

Key technical points

1. Distributed index

Distributed indexing uses many commodity machines to build partial indexes at the same time and then merges them. The advantage is scalability: when the data grows, there is no need to add storage to a single machine; the system scales horizontally by adding ordinary machines.

A distributed system such as Hadoop can be used to build a distributed index:

  • Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS). HDFS is highly fault-tolerant and designed to be deployed on inexpensive hardware. It also provides high-throughput access to application data, which suits applications with very large data sets.
  • On top of HDFS sits the MapReduce engine, which is used for parallel computation over large data sets. The concepts of Map (mapping) and Reduce (reduction), and their main ideas, are borrowed from functional programming languages, with some features borrowed from vector programming languages. Because the work is distributed this way, index building can scale out by adding machines.
  • Using the Hadoop platform and the MapReduce mechanism is a well-established way to build a distributed search index.
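To make the Map and Reduce roles concrete, here is a single-process simulation of building an inverted index in the MapReduce style. It does not use any Hadoop API; `map_phase`, `shuffle`, `reduce_phase`, and the sample input splits are hypothetical names chosen for this sketch, and on a real cluster the framework would run the map and reduce calls on different machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(doc_id, text):
    """Map: emit one (term, doc_id) pair per distinct term in a document."""
    return [(term, doc_id) for term in sorted(set(text.lower().split()))]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for term, doc_id in pairs:
        grouped[term].append(doc_id)
    return grouped


def reduce_phase(term, doc_ids):
    """Reduce: turn all doc_ids for a term into one sorted posting list."""
    return term, sorted(doc_ids)


def build_index(splits):
    """Run map over every split, then shuffle and reduce into one index."""
    mapped = chain.from_iterable(
        map_phase(doc_id, text) for split in splits for doc_id, text in split
    )
    return dict(reduce_phase(term, ids) for term, ids in shuffle(mapped).items())


# Each inner list stands for an input split handled by a different worker.
SPLITS = [
    [(1, "red running shoes"), (2, "blue running jacket")],
    [(3, "red wine glasses")],
]

mr_index = build_index(SPLITS)
```

Because each `map_phase` call depends only on its own document and each `reduce_phase` call only on its own term, both phases can be spread across more machines as the data grows.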

2. Distributed search

Distributed search divides the original single index into n shards. At query time the n shards are searched in parallel, each shard returns its local top-K hits, and the n local top-K lists are then merged and sorted to obtain the global top-K result.
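The scatter and gather pattern above can be sketched with a thread pool standing in for remote shard servers. The shard contents, the integer scores, and the function names are assumptions made for this example; in a real system each shard would compute scores from its own index.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from itertools import chain

# Hypothetical shards: doc_id -> (terms in the document, precomputed score).
SHARDS = [
    {1: ({"red", "shoes"}, 9), 2: ({"red", "wine"}, 4), 3: ({"blue", "jacket"}, 7)},
    {4: ({"red", "dress"}, 8), 5: ({"red", "hat"}, 6)},
    {6: ({"red", "scarf"}, 5), 7: ({"green", "hat"}, 10)},
]


def search_shard(shard, term, k):
    """Return this shard's local top-k hits as (score, doc_id), best first."""
    hits = [(score, doc_id) for doc_id, (terms, score) in shard.items()
            if term in terms]
    return heapq.nlargest(k, hits)


def distributed_search(shards, term, k):
    """Query all shards in parallel, then merge local top-k into global top-k."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        local_results = list(pool.map(lambda s: search_shard(s, term, k), shards))
    return heapq.nlargest(k, chain.from_iterable(local_results))


top2 = distributed_search(SHARDS, "red", 2)
```

Merging only the local top-K lists is sufficient because any document in the global top-K must also be in the top-K of its own shard, so the merge step handles at most n × K candidates regardless of index size.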

The benefits of distributed search are:

  • Better scalability: the system scales horizontally with both the number of user requests and the size of the index.
  • Higher stability: partial failures are tolerated, so the call success rate improves significantly.
  • More flexible full-update strategies that can be tailored to different types of data.
  • More flexible ranking: sorting can be customized for different categories.
  • Better maintainability and generality, with support for different types of search.

 

Previous Chapter Tutorial

Architectural Thinking Growth Series Tutorials (9) - Personalized Recommendation Engine Architecture Design

The series of tutorials

Architectural Thinking Growth Series Tutorials


This concludes the introduction.

 

 

-------------------------------

 

My CSDN homepage

About me (personal domain name, more information about me)

My open source project collection Github

 

I look forward to learning, growing, and encouraging one another together with everyone. O(∩_∩)O Thank you!

Questions and discussion are welcome: add my personal QQ 469580884,

or join my group 751925591 to discuss together.

No empty talk; be a doer.

Talk is cheap, show me the code.


Origin blog.csdn.net/hemin1003/article/details/114928605