The Beauty of Mathematics - Personal Notes for Chapters 7 and 8

Chapter 7 Jelinek and Modern Language Processing

Summary: Learning is a lifelong pursuit.

Chapter 8 The Beauty of Simplicity—Boolean Algebra and Search Engines

⭐Search engine principles:

①Automatically download as many web pages as possible

②Establish a fast and effective index

③Rank web pages fairly and accurately by relevance

1 Boolean algebra

The relationship between literature search and Boolean operations: suppose you want documents about applications of atomic energy but not about the atomic bomb. You can write the query "Atomic energy AND application AND (NOT atomic bomb)", which means a matching document must satisfy three conditions at the same time:

It contains "atomic energy", it contains "application", and it does not contain "atomic bomb".

A document has a TRUE or FALSE answer for each of the above conditions. In this way, logical reasoning and calculation are combined into one.
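
As a minimal sketch of this idea, the snippet below evaluates the query against a few documents. The sample texts and the function name `matches` are made up for illustration, and plain substring matching stands in for real keyword detection.

```python
# Hypothetical sample documents; the texts are invented for illustration.
docs = {
    1: "the application of atomic energy in power plants",
    2: "atomic energy and the atomic bomb",
    3: "application of solar power",
}

def matches(text):
    """Evaluate: atomic energy AND application AND (NOT atomic bomb)."""
    has_energy = "atomic energy" in text      # TRUE or FALSE
    has_application = "application" in text   # TRUE or FALSE
    has_bomb = "atomic bomb" in text          # TRUE or FALSE
    return has_energy and has_application and not has_bomb

# Keep only the documents for which the whole expression is TRUE.
hits = [doc_id for doc_id, text in docs.items() if matches(text)]
```

Each condition yields one Boolean per document, and the query is just a Boolean expression over those three values.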

2 Index

The index is what lets a search engine find a large number of results so quickly.

Database query languages (SQL) support all kinds of complex logical combinations, but the basic principle behind them is still Boolean operations.

The simplest index structure uses one long binary number per keyword to record whether that keyword appears in each document. The number has as many bits as there are documents, and each bit corresponds to one document: 1 means the document contains the keyword, and 0 means it does not. In effect, this turns texts that vary widely into numbers.
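
A rough sketch of such a bitmap index, using Python integers as the long binary numbers. The four-document corpus and the helper name `bitmap` are assumptions for illustration; bit i stands for document i.

```python
# Hypothetical corpus of four documents, numbered 0..3.
docs = [
    "atomic energy application in medicine",
    "atomic bomb history",
    "atomic energy and the atomic bomb application",
    "solar energy application",
]

def bitmap(keyword):
    """Return an integer whose bit i is 1 if document i contains the keyword."""
    bits = 0
    for i, text in enumerate(docs):
        if keyword in text:
            bits |= 1 << i
    return bits

# Mask with one bit per document, needed because ~x is negative in Python.
all_docs = (1 << len(docs)) - 1

energy = bitmap("atomic energy")
application = bitmap("application")
bomb = bitmap("atomic bomb")

# atomic energy AND application AND (NOT atomic bomb)
result = energy & application & (~bomb & all_docs)

# Decode the result bits back into document numbers.
matching = [i for i in range(len(docs)) if (result >> i) & 1]
```

The whole query reduces to a couple of bitwise operations on integers, regardless of how long the documents are.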

Note that computers perform Boolean operations very, very fast. Because most bits in these binary numbers are 0, it is enough to record only the positions that are 1. The index of the search engine therefore becomes a large table: each row corresponds to a keyword, and each keyword is followed by a list of numbers, which are the serial numbers of the documents containing that keyword.
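
The table described here can be sketched as a dictionary from keyword to a list of document numbers. This is a simplified illustration, not how a production engine stores its index: the documents, `build_index`, and `search` are hypothetical, and keywords are single whitespace-separated words.

```python
from collections import defaultdict

# Hypothetical documents keyed by serial number.
docs = {
    1: "atomic energy application",
    2: "atomic bomb test",
    3: "atomic energy bomb application",
}

def build_index(documents):
    """Map each word to the sorted list of document numbers containing it."""
    index = defaultdict(list)
    for doc_id in sorted(documents):
        for word in set(documents[doc_id].split()):
            index[word].append(doc_id)
    return dict(index)

index = build_index(docs)

def search(must, must_not):
    """Return documents containing every word in must and none in must_not."""
    if not must:
        return []
    result = set(index.get(must[0], []))
    for word in must[1:]:
        result &= set(index.get(word, []))   # AND
    for word in must_not:
        result -= set(index.get(word, []))   # AND NOT
    return sorted(result)

found = search(["energy", "application"], ["bomb"])
```

Here AND becomes a set intersection and NOT becomes a set difference over the document lists.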

Due to the limitation of computer speed and capacity, early search engines could only index important and key subject terms.

Today's common search engines index every word, which is an extremely challenging engineering task.

When the index is very large, it is usually distributed across many servers. The common practice is to split the index into many parts by web page serial number and store each part on a different server. Each incoming query is sent to all of these servers, which process it in parallel and return their results to the main server. The main server merges them and returns the final result to the user.
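
The scatter-and-merge flow can be sketched on a single machine as follows. Everything here is an assumption for illustration: threads stand in for the index servers, pages are assigned to shards by serial number modulo `NUM_SHARDS` (a range-based split would work equally well), and the names `build_shards`, `query_shard`, and `search` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 3  # hypothetical number of index servers

def shard_of(doc_id):
    """Pick a shard from the page's serial number (assumed scheme: modulo)."""
    return doc_id % NUM_SHARDS

def build_shards(documents):
    """Build one small keyword -> document-number table per shard."""
    shards = [{} for _ in range(NUM_SHARDS)]
    for doc_id in sorted(documents):
        shard = shards[shard_of(doc_id)]
        for word in set(documents[doc_id].split()):
            shard.setdefault(word, []).append(doc_id)
    return shards

def query_shard(shard, word):
    """What a single index server would do: look up its own part only."""
    return shard.get(word, [])

def search(shards, word):
    """Main server: send the query to all shards in parallel, then merge."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: query_shard(s, word), shards))
    return sorted(doc_id for part in partials for doc_id in part)

docs = {
    0: "atomic energy",
    1: "atomic bomb",
    2: "solar energy",
    3: "energy application",
    4: "bomb test",
}
shards = build_shards(docs)
energy_docs = search(shards, "energy")
```

Because each page lives in exactly one shard, the merge step only has to concatenate and sort the partial results.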

Users generate more and more content, so indexes of different levels are designed accordingly. The commonly used index must be fast, while the rarely used one has much lower requirements.