Crawler and Inverted Index Technology in Search Engines

I don't read much, so don't lie to me. This may be the best sentence for us. Last night, I chatted with the former Minister Cheung Tsai, and learned that he is now doing search engine optimization and works in that company, and he told me about his professional knowledge. But because I read too little books, I can't understand many professional terms. I even heard the word crawler for a while. Fortunately, I remember that crawler is a must for search engines. I checked it out today while reading the book on automata.

Web crawlers download web pages from the World Wide Web for search engines. Generally divided into traditional crawler and focused crawler.

The traditional crawler starts from the URL of one or several initial web pages , and obtains the URL on the initial web page. During the process of crawling the web page, it continuously extracts new URLs from the current page and puts them into the queue until certain stopping conditions of the system are met. In layman's terms, it is to obtain the desired content through source code analysis.

The workflow of the focus crawler is more complicated. It needs to filter the links that are not related to the topic according to a certain web page analysis algorithm, retain useful links and put them into the URL queue waiting to be crawled. Then, it will select the web page URL to be crawled next from the queue according to a certain search strategy, and repeat the above process until it reaches a certain condition of the system and stops. In addition, all web pages crawled by the crawler will be stored by the system, subjected to certain analysis, filtering, and indexed for subsequent query and retrieval ; for the focused crawler, the analysis results obtained in this process may also be Give feedback and guidance on the subsequent grabbing process.

Baidu directly. The crawler is just a website index, and then indexes this and then jumps to another website. The websites he skips are downloaded to our local.

Let's see how he grabs it

3. Grab strategy

In a crawler system, the queue of URLs to be crawled is an important part. The order in which the URLs in the queue of URLs to be crawled are arranged is also an important issue, because it involves which page to crawl first and which page to crawl later. The method of determining the order in which these URLs are arranged is called a crawling strategy. The following highlights several common crawling strategies:

1. Depth-first traversal strategy

The depth-first traversal strategy means that the web crawler will start from the start page, follow one link by one link, and then go to the next start page after processing the line, and continue to follow the link. Let's take the following figure as an example:

Path traversed: AFG EHI BCD

2. Breadth-first traversal strategy

The basic idea of breadth-first traversal is to insert links found in newly downloaded web pages directly at the end of the queue of URLs to be crawled. That is to say, the web crawler will first crawl all the webpages linked in the starting webpage, and then select one of the linked webpages, and continue to crawl all the webpages linked in this webpage. Or take the above picture as an example:

Traversal path: ABCDEF GHI

Well, the reptiles are here, and I was stunned for a while. Below is the inverted index.

Forward and Inverted Indexes

Let's first look at what an inverted index is and the difference between an inverted index and a positive index:

We know that the key step for a search engine is to establish an inverted index. The so-called inverted index is generally expressed as a keyword, followed by its frequency (the number of times it appears), its location (in which article or webpage it appears, and information about the date, author, etc.), it is equivalent to making an index for hundreds of billions of web pages on the Internet, just like a book's table of contents and tags. Readers who want to see a chapter related to a topic can find the relevant page directly according to the table of contents. There is no need to search page by page from the first page to the last page of the book.

Next, explain the difference between the forward index and the inverted index:

General index (forward index)

正排表是以文档的ID为关键字，表中记录文档中每个字的位置信息，查找时扫描表中每个文档中字的信息直到找出所有包含查询关键字的文档。正排表结构如图1所示，这种组织方法在建立索引的时候结构比较简单，建立比较方便且易于维护;因为索引是基于文档建立的，若是有新的文档假如，直接为该文档建立一个新的索引块，挂接在原来索引文件的后面。若是有文档删除，则直接找到该文档号文档对因的索引信息，将其直接删除。但是在查询的时候需对所有的文档进行扫描以确保没有遗漏，这样就使得检索时间大大延长，检索效率低下。

尽管正排表的工作原理非常的简单，但是由于其检索效率太低，除非在特定情况下，否则实用性价值不大。

倒排索引

倒排表以字或词为关键字进行索引，表中关键字所对应的记录表项记录了出现这个字或词的所有文档，一个表项就是一个字表段，它记录该文档的ID和字符在该文档中出现的位置情况。由于每个字或词对应的文档数量在动态变化，所以倒排表的建立和维护都较为复杂，但是在查询的时候由于可以一次得到查询关键字所对应的所有文档，所以效率高于正排表。在全文检索中，检索的快速响应是一个最为关键的性能，而索引建立由于在后台进行，尽管效率相对低一些，但不会影响整个搜索引擎的效率。

倒排表的结构图如图2：

倒排表的索引信息保存的是字或词后继数组模型、互关联后继数组模型条在文档内的位置，在同一篇文档内相邻的字或词条的前后关系没有被保存到索引文件内。

我只是默默地觉得正排索引没什么用，倒排索引就是专门为搜索引擎设计的吧，但是我比较想知道word（关键字）是全部还是只是大概每篇文章就几个关键字那样，我觉得应该全部都是关键字同时设立几个关键字优先吧。同时倒排列表如下:

roger 地址次数 <第几行，第4行，。。。>

地址他经常以相对地址来比的比如说第一个是10000 第二个是10005 第二个就直接写5了，否则会占用很多内存储存吧，真是机智啊。

倒排索引是搜索引擎之基石。建成了倒排索引后，用户要查找某个query，如在搜索框输入某个关键词：“结构之法”后，搜索引擎不会再次使用爬虫又一个一个去抓取每一个网页，从上到下扫描网页，看这个网页有没有出现这个关键词，而是会在它预先生成的倒排索引文件中查找和匹配包含这个关键词“结构之法”的所有网页。找到了之后，再按相关性度排序，最终把排序后的结果显示给用户。