This Is a Search Engine (Part 1): Architecture, Web Crawler, and Indexing

This series of articles is my attempt, from a product perspective, to understand and write down the complexity of large search engines. There are bound to be shortcomings, so both technical and non-technical readers are welcome to point out mistakes after reading, and we can explore and improve together.

This part describes the architecture of search engines, web crawlers, and indexing.

 

1. Search Engine Basics


1.1 What Is a Search Engine

In plain terms, a search engine picks out the content users are interested in from the massive amount of information on the Internet and presents it to them.

1.2 Development History

Directory stage: purely manually curated navigation directories, represented by Yahoo and hao123

-> Text retrieval: uses information-retrieval models to compute the relevance between the query keywords and the text of each page

-> Link analysis: uses the link relationships between pages to analyze page importance; Google's PageRank is the representative technique

-> User-centric: takes understanding user needs as the core, typified by personalized results ("a thousand faces for a thousand users").

1.3 Basic Architecture of a Search Engine

The architecture provides three functions:
1. A crawler obtains massive amounts of web page information from the Internet, stores it locally, and indexes it for easier lookup;
2. When a user enters a query, the engine parses the query intent and dispatches the query;
3. Using the query and various ranking algorithms, the indexed documents (web pages) are ordered, and the results that best match the user's intent are returned.
This article introduces the components of a search engine starting from the first point.
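To make these three functions concrete, here is a very small sketch of the pipeline in Python. All names (`crawl`, `build_index`, `search`) and the plain set-intersection "ranking" are illustrative placeholders, not how a production engine is built:

```python
from urllib.request import urlopen

def crawl(seed_urls):
    """Function 1: fetch pages from the Internet and store them locally (in a dict)."""
    pages = {}
    for url in seed_urls:
        with urlopen(url, timeout=5) as resp:
            pages[url] = resp.read().decode("utf-8", errors="ignore")
    return pages

def build_index(pages):
    """Function 1 (cont.): index the stored pages so they are easy to look up."""
    index = {}
    for url, text in pages.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)
    return index

def search(index, query):
    """Functions 2-3: parse the query and return matching documents
    (a real engine would also rank the results by relevance)."""
    hits = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*hits) if hits else set()
```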

2. Web Crawlers


2.1 General Crawler Framework

Crawling process:

Select some pages as seed URLs and place them in the to-be-crawled URL queue

-> Read a URL from the to-be-crawled queue, resolve it via DNS, and download the page

-> Store the downloaded page in the local database and index it; move the URL into the crawled URL queue

-> Parse the downloaded page, extract all links, and deduplicate them against the crawled and to-be-crawled URL queues

-> Continue downloading URLs from the to-be-crawled queue

-> Repeat this cycle until the to-be-crawled URL queue is empty.
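Below is a minimal sketch of this crawl loop, using only the Python standard library and assuming plain-HTML pages; the function and variable names are illustrative, not from the original article:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags found in a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=100):
    to_crawl = deque(seed_urls)   # the "to-be-crawled" URL queue
    crawled = set()               # the "crawled" URL set, used for de-duplication
    store = {}                    # local page store: url -> html
    while to_crawl and len(store) < max_pages:
        url = to_crawl.popleft()
        if url in crawled:
            continue
        try:
            with urlopen(url, timeout=5) as resp:   # DNS resolution + download
                html = resp.read().decode("utf-8", errors="ignore")
        except OSError:
            continue
        store[url] = html                            # store locally (indexing comes later)
        crawled.add(url)
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:                    # extract links, de-duplicate, enqueue
            absolute = urljoin(url, link)
            if absolute not in crawled and absolute not in to_crawl:
                to_crawl.append(absolute)
    return store
```

The two collections `to_crawl` and `crawled` correspond directly to the to-be-crawled and crawled URL queues described above.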

2.2 Crawler Types

Batch crawler: has a clearly defined crawling scope and goal, and stops once the goal is reached.

Incremental crawler: crawls continuously, regularly re-fetching pages that get updated.

Focused (vertical) crawler: at fetch time, decides whether a page is relevant to the target topic and, based on that, whether to crawl it.

2.3 Crawling Strategies

Objective: crawl important pages first.

Breadth-first traversal strategy (breadth-first): the links extracted from a newly downloaded page are appended directly to the end of the to-be-crawled URL queue. This strategy implicitly contains some assumptions about page priority.

Partial PageRank strategy (partial PageRank): the downloaded pages together with the to-be-crawled URLs form a page set; PageRank is computed over this set, and the to-be-crawled URLs are sorted by PageRank for crawling. To improve efficiency, PageRank is recomputed only after every K newly downloaded pages; a newly extracted page that does not yet have a PageRank value sums the PageRank passed along from all of its in-links and uses that sum as a temporary PageRank value.

OPIC strategy (Online Page Importance Computation): every page starts with the same amount of "cash"; after a page P is downloaded, P's cash is divided equally among the links it contains, and to-be-crawled URLs are ultimately downloaded in order of accumulated cash (see the sketch after this list).

Larger-sites-first strategy (larger sites first): among the sites with pages waiting to be downloaded, the site with the most pending pages is downloaded first.
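As a toy illustration of the OPIC idea, the sketch below runs the cash-splitting process over a tiny, made-up link graph; the graph, the initial cash amount, and the tie-breaking are all illustrative assumptions:

```python
def opic_order(link_graph, start_cash=1.0):
    """Toy OPIC: every page starts with equal cash; 'downloading' a page
    splits its cash equally among its out-links; pages are fetched in
    order of accumulated cash."""
    cash = {page: start_cash for page in link_graph}
    downloaded = []
    while len(downloaded) < len(link_graph):
        # pick the undownloaded page with the most cash
        page = max((p for p in link_graph if p not in downloaded),
                   key=lambda p: cash[p])
        downloaded.append(page)
        out_links = link_graph[page]
        if out_links:
            share = cash[page] / len(out_links)   # split the cash among out-links
            for target in out_links:
                cash[target] = cash.get(target, 0.0) + share
        cash[page] = 0.0
    return downloaded

# Hypothetical link graph: page -> list of pages it links to.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(opic_order(graph))   # download order determined by accumulated cash
```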

2.4 Page Update Policies

Objective: decide when to re-fetch downloaded pages so that the local copies stay consistent with the original pages on the Internet.

Historical reference strategy: pages that were updated frequently in the past will be updated frequently in the future.

User experience strategy: keep multiple historical versions of a page, compute the average impact that each content change has had on search quality, and use that as the basis for deciding when to re-crawl and update.

Cluster sampling strategy: cluster the pages; pages with the same attributes are assumed to share the same update cycle.

2.5 Crawling the Deep Web

Web information stored in databases cannot be reached by simply following links; it is fetched by issuing queries built from information-rich query templates.

The ISIT algorithm determines whether a query template is information-rich. Its basic idea is to start from one-dimensional templates; if a template is information-rich, extend it to two-dimensional templates, and so on, gradually increasing the number of dimensions until no more information-rich query templates can be found.

 

The crawler's purpose is to obtain the latest and most comprehensive website information and store it locally.

3. Indexing

After the crawler downloads the documents (i.e., web pages) to local storage, an inverted index needs to be built over them. An inverted index extracts the words in each document and establishes a mapping from words to documents, so that the relevant documents can be found by keyword matching.

3.1 Glossary

TF (term frequency): the number of times a word appears in a document

DF (document frequency): the number of documents in which a word appears

Word dictionary: maintains all the words that appear in the document collection together with their related information, and records, for each word, the location of its posting list in the inverted file.

Document ID (document id): each document in the collection is given a unique internal number; when the data is stored in compressed form, the differences between consecutive document IDs (D-gaps) are stored instead.
Posting list (inverted list): records all the documents in which a word appears and the positions of the word within those documents. Each record is called a posting.

Inverted file: a disk file in which the posting lists of all words are stored sequentially.

Inverted index: a particular storage form of the word-document matrix. Given a word, the inverted index returns the list of documents containing it. It consists of two parts: the word dictionary and the inverted file.
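A minimal sketch of how these pieces fit together, with a dictionary that stores DF and postings that record (document id, TF, positions); the two example documents are made up:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: mapping of document id -> text.
    Returns (dictionary, postings) where dictionary[word] = DF and
    postings[word] = list of (doc_id, tf, positions)."""
    postings = defaultdict(list)
    for doc_id, text in docs.items():
        positions = defaultdict(list)
        for pos, word in enumerate(text.lower().split()):
            positions[word].append(pos)
        for word, pos_list in positions.items():
            postings[word].append((doc_id, len(pos_list), pos_list))
    dictionary = {word: len(plist) for word, plist in postings.items()}  # DF
    return dictionary, postings

docs = {1: "search engines index the web",
        2: "the web crawler downloads the web"}
dictionary, postings = build_inverted_index(docs)
print(dictionary["web"])   # DF of "web" -> 2
print(postings["web"])     # [(1, 1, [4]), (2, 2, [1, 5])]
```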

 

3.2 Index Construction


3.2.1 Building the Word Dictionary

The common data structures are a hash table with chaining and a tree-structured dictionary.

Hash table with chaining: the main body is a hash table; each entry stores a pointer to a linked list of the words whose hash values collide.

Tree-structured dictionary: the dictionary entries must be kept in sorted order, and lookup proceeds hierarchically. (To be honest, I did not fully figure this one out.)

3.2.2 Building the Inverted Index

Two-pass document traversal (2-pass in-memory inversion, building the full index in memory): the first pass scans the collection to gather statistics (number of documents, number of words, DF information for each word, etc.) and allocates memory and other resources as preparation. The second pass fills in the memory space allocated during the first pass.

This method requires a large amount of memory, and scanning the collection twice is slow.

Sort-based inversion: allocate a fixed amount of memory to hold dictionary information and intermediate index results; when the space runs out, write the intermediate results to disk and clear them from memory, so the space can be reused for the next batch of intermediate results.

Merge-based inversion: the overall flow is similar to the sort-based method, but the sort-based method keeps dictionary information and (word, document, frequency) triples in memory with no direct link between them, whereas the merge-based method builds a complete inverted index in memory for the subset of documents processed so far. When intermediate results are written to disk, the sort-based method sorts the triples and writes them to a temporary file while the dictionary stays in memory; the merge-based method writes the words together with their posting lists to disk and then completely clears the memory they occupied.
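The sketch below illustrates the merge-based idea, with a tiny document batch standing in for the memory budget and temporary JSON files standing in for on-disk runs; the batch size and file format are illustrative assumptions, not what a real engine uses:

```python
import heapq, json, tempfile
from collections import defaultdict

def merge_based_inversion(doc_stream, batch_size=2):
    """Build a full in-memory inverted index for each batch of documents,
    flush it to a temporary run file, then merge all runs into one index."""
    runs = []
    batch = defaultdict(list)                 # word -> [(doc_id, tf), ...]
    for i, (doc_id, text) in enumerate(doc_stream, 1):
        counts = defaultdict(int)
        for word in text.lower().split():
            counts[word] += 1
        for word, tf in counts.items():
            batch[word].append((doc_id, tf))
        if i % batch_size == 0:               # memory "full": flush this run to disk
            runs.append(flush_run(batch))
            batch = defaultdict(list)         # clear all memory used by the batch
    if batch:
        runs.append(flush_run(batch))
    return merge_runs(runs)

def flush_run(batch):
    """Write one word-sorted (word, postings) run to a temporary file."""
    f = tempfile.NamedTemporaryFile("w+", delete=False, suffix=".run")
    json.dump(sorted(batch.items()), f)
    f.close()
    return f.name

def merge_runs(paths):
    """Merge the word-sorted runs into a single word -> postings mapping."""
    merged = defaultdict(list)
    streams = [json.load(open(p)) for p in paths]
    for word, postings in heapq.merge(*streams):
        merged[word].extend(postings)
    return merged

docs = [(1, "web search engine"), (2, "web crawler"), (3, "inverted index")]
index = merge_based_inversion(docs, batch_size=2)
print(index["web"])   # [[1, 1], [2, 1]]
```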

3.2.3 Index Updates

Web pages change constantly. To keep the index updated dynamically in real time, a temporary index and a deleted-document list must also be maintained.


Temporary index: an inverted index built in memory in real time.

Deleted-document list: stores the IDs of deleted documents, forming a document ID list.

When a document is modified, the original document is put into the deletion queue, and the modified content is parsed and added to the temporary index; this is how real-time updates are achieved. When a user issues a query, results are obtained from both the inverted index and the temporary index, and then filtered through the deleted-document list to form the final search results.
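A minimal sketch of that query-time flow, assuming both indexes are simple word-to-document-id mappings (the data structures and example data are illustrative):

```python
def query(word, main_index, temp_index, deleted_ids):
    """Look the word up in both the main index and the in-memory temporary
    index, then filter out documents on the deleted-document list."""
    hits = main_index.get(word, set()) | temp_index.get(word, set())
    return hits - deleted_ids

main_index = {"engine": {1, 2, 3}}
temp_index = {"engine": {4}}          # doc 4 was just (re)indexed in memory
deleted_ids = {2}                     # doc 2 was modified, so its old version is deleted
print(query("engine", main_index, temp_index, deleted_ids))  # {1, 3, 4}
```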

Strategies for updating the index with the temporary (incremental) index:


1. Complete rebuild: when the number of new documents exceeds a threshold, merge the new and old documents and rebuild the index from scratch.

2. Re-merge strategy: when the number of new documents exceeds a threshold, merge the temporary index into the old index.

3. In-place update strategy: append the posting lists of the incremental index to the end of the corresponding posting lists in the old index.

4. Hybrid strategy: classify words by their properties and apply a different update strategy to each class.

3.2.4 Querying the Index

Two commonly used query strategies:

Document-at-a-time: processes the documents in the posting lists one at a time, computing a document's complete query-document similarity score before moving on to the next document.

Term-at-a-time: processes one query term at a time, computing each document's partial score for that term, and finally sums the per-term scores for each document.
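The sketch below contrasts the two strategies on a tiny posting list, using raw TF as a stand-in for a real relevance score; the data and the scoring are illustrative only:

```python
from collections import defaultdict

# postings[word] = list of (doc_id, tf) pairs, sorted by doc_id
postings = {"search": [(1, 2), (3, 1)],
            "engine": [(1, 1), (2, 3), (3, 1)]}

def term_at_a_time(query_terms):
    """Process one term at a time, accumulating partial scores per document."""
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, tf in postings.get(term, []):
            scores[doc_id] += tf          # toy score: just add TF
    return dict(scores)

def document_at_a_time(query_terms):
    """Process one document at a time, finishing its full score before moving on."""
    lists = [postings.get(t, []) for t in query_terms]
    doc_ids = sorted({doc_id for plist in lists for doc_id, _ in plist})
    scores = {}
    for doc_id in doc_ids:
        scores[doc_id] = sum(tf for plist in lists
                             for d, tf in plist if d == doc_id)
    return scores

print(term_at_a_time(["search", "engine"]))      # {1: 3.0, 3: 2.0, 2: 3.0}
print(document_at_a_time(["search", "engine"]))  # {1: 3, 2: 3, 3: 2}
```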

3.3 Index Compression

At query time, posting lists must be read from disk into memory, and a search engine's index is extremely large, so the index needs to be compressed. The index consists mainly of two parts, the word dictionary and the corresponding posting lists, and compression targets both.
Metrics for a compression algorithm (in decreasing order of importance): decompression speed, compression ratio, compression speed.

3.3.1 Dictionary Compression

In the dictionary structure shown in the book's figure, the DF and the posting-list pointer can each be represented in 4 bytes, but because word lengths differ, storing the word strings themselves can waste space.

In the optimized structure, the contiguous dictionary is split into blocks; each word carries its own length information, and several words share one pointer. This way the document frequency, the posting-list pointer, and the word address can all be stored with a fixed size.
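A rough sketch of that blocked layout: the words of each block are concatenated into one string, each word keeps only its length, and the whole block shares a single offset; the block size and field choices are illustrative assumptions:

```python
def pack_dictionary(words, block_size=4):
    """Pack sorted dictionary words into blocks: one shared string offset per
    block plus a per-word length, instead of one pointer per word."""
    words = sorted(words)
    string_heap = ""                  # all words concatenated into one string
    blocks = []                       # (offset_into_heap, [word lengths])
    for i in range(0, len(words), block_size):
        block = words[i:i + block_size]
        blocks.append((len(string_heap), [len(w) for w in block]))
        string_heap += "".join(block)
    return string_heap, blocks

def lookup(string_heap, blocks, block_no, slot):
    """Recover the `slot`-th word of block `block_no` from the packed layout."""
    offset, lengths = blocks[block_no]
    start = offset + sum(lengths[:slot])
    return string_heap[start:start + lengths[slot]]

heap, blocks = pack_dictionary(["crawler", "engine", "index", "query", "web"])
print(lookup(heap, blocks, 1, 0))   # "web" (second block, first slot)
```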

3.3.2 Document ID Reordering

Renumber the document IDs so that adjacent documents in a posting list have IDs that are as close together as possible, making the D-gap values as small as possible; this makes the compression algorithms more effective. The specific algorithms are not covered here.
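A minimal sketch of D-gap encoding and decoding for one posting list; the document IDs are made up:

```python
def to_dgaps(doc_ids):
    """Store the first ID as-is and every later ID as the difference
    from its predecessor (the list must be sorted)."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_dgaps(gaps):
    """Rebuild the original document IDs by cumulative summation."""
    ids, total = [], 0
    for g in gaps:
        total += g
        ids.append(total)
    return ids

posting = [3, 7, 8, 15, 16]
print(to_dgaps(posting))                          # [3, 4, 1, 7, 1] -- small numbers compress well
print(from_dgaps(to_dgaps(posting)) == posting)   # True
```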

3.3.3 Static Index Pruning

This is a lossy form of compression that removes unimportant parts of the index while preserving search quality as much as possible. Two common methods: term-centric index pruning and document-centric index pruning.

Term-centric pruning computes the similarity between a word and each of its documents and uses it to decide whether to keep an index entry; pruning happens after the index is built.

Document-centric pruning computes the importance of each word within a document and discards unimportant words; pruning happens before the index is built.
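As a toy illustration of term-centric pruning, the sketch below keeps only the top-k postings per word according to a precomputed score; the cutoff k and the scores are illustrative assumptions:

```python
def term_centric_prune(postings, keep_top_k=2):
    """postings: word -> list of (doc_id, score) pairs.
    Keep only the k highest-scoring postings per word (lossy)."""
    pruned = {}
    for word, plist in postings.items():
        plist = sorted(plist, key=lambda p: p[1], reverse=True)
        pruned[word] = sorted(plist[:keep_top_k])   # re-sort kept postings by doc_id
    return pruned

postings = {"web": [(1, 0.9), (2, 0.2), (3, 0.7)],
            "crawler": [(2, 0.4), (5, 0.1)]}
print(term_centric_prune(postings))
# {'web': [(1, 0.9), (3, 0.7)], 'crawler': [(2, 0.4), (5, 0.1)]}
```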

 

At this point, the search engine has completed the acquisition and storage of web page data; the remaining parts will be covered in the following articles.

All screenshots in this article come from the book 《这就是搜索引擎：核心技术详解》 (This Is the Search Engine: A Detailed Explanation of the Core Technologies); if there are any copyright issues, I will redraw them myself \ (-- 。 --) /

 
