Chapter 7 Jelinek and Modern Language Processing
Summary: Learning is a lifelong pursuit.
Chapter 8 The Beauty of Simplicity—Boolean Algebra and Search Engines
⭐Search engine principles:
①Automatically download as many web pages as possible
②Build a fast and effective index
③Rank web pages fairly and accurately by relevance
1 Boolean algebra
Example query: find documents that contain "atomic energy", contain "applications", and do not contain "atomic bomb".
For each of these conditions a document gives a TRUE or FALSE answer, and the answers are combined with AND, OR and NOT. In this way, logical reasoning and calculation are combined into one.
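The atomic energy query can be sketched as plain Boolean expressions. This is only a minimal illustration: the three document texts and the names docs and matches are made up for the example, and substring matching stands in for real keyword matching.

```python
# Hypothetical toy documents, invented for this sketch.
docs = {
    1: "peaceful applications of atomic energy",
    2: "atomic energy applications include the atomic bomb",
    3: "applications of solar energy",
}

def matches(text):
    """Evaluate the query: atomic energy AND applications AND NOT atomic bomb."""
    has_energy = "atomic energy" in text   # TRUE or FALSE
    has_apps = "applications" in text      # TRUE or FALSE
    has_bomb = "atomic bomb" in text       # TRUE or FALSE
    return has_energy and has_apps and not has_bomb

result = [doc_id for doc_id, text in docs.items() if matches(text)]
print(result)
```

Each document yields three TRUE/FALSE answers, and the query result is a single Boolean computed from them.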
2 Index
Database query languages such as SQL support all kinds of complex logical combinations, but the basic principle behind them is Boolean operations.
The simplest index structure uses one long binary number per keyword to record whether that keyword appears in each document. The number has as many bits as there are documents, and each bit corresponds to one document: 1 means the document contains the keyword, 0 means it does not. In effect, this turns widely varying texts into numbers.
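A minimal sketch of this bitmap idea, using Python integers as the long binary numbers. The documents, the helper name bitmap, and the use of substring matching are assumptions made for illustration.

```python
# Hypothetical toy collection; list position is the document number.
docs = [
    "peaceful applications of atomic energy",              # document 0
    "atomic energy applications include the atomic bomb",  # document 1
    "applications of solar energy",                        # document 2
]

def bitmap(keyword):
    """Return an integer whose bit i is 1 if document i contains the keyword."""
    bits = 0
    for i, text in enumerate(docs):
        if keyword in text:
            bits |= 1 << i
    return bits

all_docs = (1 << len(docs)) - 1   # one bit set per document
energy = bitmap("atomic energy")
apps = bitmap("applications")
bomb = bitmap("atomic bomb")

# AND / NOT on whole bitmaps answers the query for every document at once.
hits = energy & apps & (all_docs & ~bomb)
matching = [i for i in range(len(docs)) if (hits >> i) & 1]
print(bin(hits), matching)
```

The mask all_docs keeps the NOT confined to bits that correspond to real documents, since ~ on a Python integer otherwise produces a negative number.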
Note that Boolean operations are extremely fast for a computer. In practice, the index of a search engine is therefore a large table: each row corresponds to a keyword, and each keyword is followed by a list of numbers, which are the serial numbers of the documents containing that keyword.
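The table described here is usually called an inverted index. A small sketch, assuming whitespace tokenization and made-up documents; build_index and intersect are hypothetical helper names, not taken from the book.

```python
from collections import defaultdict

# Hypothetical toy documents keyed by serial number.
docs = {
    1: "peaceful applications of atomic energy",
    2: "atomic energy applications include the atomic bomb",
    3: "applications of solar energy",
}

def build_index(docs):
    """Map each keyword to the sorted list of document numbers containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for word in set(docs[doc_id].split()):
            index[word].append(doc_id)
    return index

def intersect(a, b):
    """AND of two sorted posting lists, done as a linear merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

index = build_index(docs)
both = intersect(index["atomic"], index["energy"])   # atomic AND energy
excluded = set(index["bomb"])                        # documents to drop for NOT bomb
result = [d for d in both if d not in excluded]
print(result)
```

Compared with one bit per document, storing only the serial numbers keeps each row short when a keyword appears in few documents.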
Due to the limitation of computer speed and capacity, early search engines could only index important and key subject terms.
Today's common search engines index every word, which is an extremely challenging engineering problem.
When the index is very large, it is usually stored in a distributed manner across many servers. The common practice is to split the index into many parts according to the serial number of the web page and store each part on a different server. When a query arrives, it is sent to all of these servers, which process it in parallel and return their results to the main server. The main server merges the results and returns the final answer to the user.
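A rough sketch of this split-and-merge flow, with threads standing in for separate servers. The sharding rule (serial number modulo the number of shards), NUM_SHARDS, the documents, and all function names are assumptions for illustration; a real system would differ in many ways.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 3  # hypothetical number of index servers

# Hypothetical toy documents keyed by serial number.
docs = {
    1: "peaceful applications of atomic energy",
    2: "atomic energy applications include the atomic bomb",
    3: "applications of solar energy",
    4: "atomic energy policy",
    5: "history of the atomic bomb",
}

def build_shards(docs, num_shards):
    """Split the index by document serial number; each shard is its own small index."""
    shards = [dict() for _ in range(num_shards)]
    for doc_id in sorted(docs):
        shard = shards[doc_id % num_shards]
        for word in set(docs[doc_id].split()):
            shard.setdefault(word, []).append(doc_id)
    return shards

def search_shard(shard, words):
    """Run an AND query against one shard only."""
    postings = [set(shard.get(w, [])) for w in words]
    return set.intersection(*postings) if postings else set()

def search(shards, words):
    """Send the query to every shard in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: search_shard(s, words), shards))
    return sorted(set().union(*partials))

shards = build_shards(docs, NUM_SHARDS)
print(search(shards, ["atomic", "energy"]))
```

Because every document lives in exactly one shard, an AND query can be answered inside each shard independently and the merge step is a simple union.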
Users generate more and more content, so indexes of different levels are designed accordingly. The commonly used index must be fast to access, while the requirements for the rarely used index are much lower.