In practice (2): The classic data structures and algorithms behind a search engine

How can a small amount of code running on a single machine (8 GB of memory, a 100+ GB hard disk) implement a simple search engine? The search engine is divided into four parts: collecting, analyzing, indexing, and querying. Collecting uses a crawler to fetch web pages. Analyzing is responsible for extracting the text of each page, segmenting it into words, building a temporary index, and computing PageRank values. Indexing takes the temporary index produced by the analysis phase and builds an inverted index from it. Querying handles the user's request: it uses the inverted index to retrieve the relevant pages, computes their ranking, and returns the query results to the user.

Collect

A search engine does not know the addresses of web pages in advance, so how does it know which pages to crawl?

Treat the entire Internet as a directed graph data structure: each page is a vertex, and if one page contains a link to another, there is a directed edge between the two vertices. A graph traversal algorithm can then visit every page. Search engines use a breadth-first search strategy: links to a few well-known, high-weight pages (such as the Sina home page or the Tencent home page) are placed into a queue as seed links. Following the breadth-first strategy, the crawler keeps taking a link out of the queue, fetching the corresponding page, parsing out the links it contains, and adding those links back into the queue.

1. File of links to be crawled: links.bin

During the breadth-first crawl, the crawler keeps parsing links out of pages and putting them into the queue. This queue grows too large to fit in memory, so it is stored in a file on disk, links.bin, which serves as the BFS queue. The crawler takes a link from links.bin, fetches the corresponding page, and after crawling it, appends the links parsed out of that page directly back to links.bin.

How do we parse the links out of a page? Treat the whole page as one big string and use a string matching algorithm to search this big string for tags such as <link>, then read out the string between <link> and </link>, which is a web link.
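A minimal sketch of that link extraction, assuming the page is already held in memory as one string and that links sit between literal <link> and </link> tags as described above; a real crawler would also have to handle <a href="..."> anchors and malformed HTML, and the function name extract_links is only illustrative.

```python
def extract_links(page: str) -> list[str]:
    """Scan the raw page string and pull out everything between <link> and </link>.

    A naive single-pattern string-matching sketch.
    """
    links = []
    pos = 0
    while True:
        start = page.find("<link>", pos)           # locate the opening tag
        if start == -1:
            break
        end = page.find("</link>", start)          # locate the matching closing tag
        if end == -1:
            break
        links.append(page[start + len("<link>"):end].strip())
        pos = end + len("</link>")                 # continue scanning after this tag
    return links


if __name__ == "__main__":
    html = "<link>http://a.example/page1</link> ... <link>http://b.example/page2</link>"
    print(extract_links(html))
```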

2. Page deduplication file: bloom_filter.bin

How do we avoid crawling the same page twice? Use a Bloom filter: it deduplicates pages quickly while saving memory. However, if the machine goes down and restarts, the Bloom filter is emptied, so a large number of pages that have already been crawled may be crawled again. To avoid this, the Bloom filter can be persisted to disk periodically, for example every half hour, in the file bloom_filter.bin; after a restart, bloom_filter.bin is read from disk and restored into memory.
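A minimal Bloom filter sketch with the persist/restore step described above. Only the file name bloom_filter.bin comes from the text; the bit-array size, the number of hash functions, and the MD5-based double-hashing scheme are illustrative choices.

```python
import hashlib


class BloomFilter:
    """A minimal Bloom filter; sizes and the hashing scheme are illustrative."""

    def __init__(self, num_bits: int = 8 * 1024 * 1024, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key: str):
        # Derive k bit positions from two independent hashes (double hashing).
        digest = hashlib.md5(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little")
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

    def save(self, path: str = "bloom_filter.bin") -> None:
        # Persist the bit array so a restart does not forget which links were crawled.
        with open(path, "wb") as f:
            f.write(self.bits)

    @classmethod
    def load(cls, path: str = "bloom_filter.bin", num_hashes: int = 4) -> "BloomFilter":
        with open(path, "rb") as f:
            bits = bytearray(f.read())
        bf = cls(num_bits=len(bits) * 8, num_hashes=num_hashes)
        bf.bits = bits
        return bf
```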

3. Raw page storage file: doc_raw.bin

After a page has been crawled, it is stored for later offline analysis and indexing. How do we store such a huge amount of raw page data?

Store multiple pages in one file, with each page separated by a fixed delimiter so the file is easy to read back. The format is as follows:

doc1_id \t doc1_size \t doc1 \r\n\r\n

That is: page ID \t page size \t page content, ending with the delimiter \r\n\r\n.

A single file of this kind should not grow too large, because the file system places limits on file size. We set a maximum size for each file; once it is exceeded, a new file is created to store newly crawled pages.

With a hard disk of about 100 GB and an average page size of 64 KB, a single machine can store roughly 1 to 2 million crawled pages; crawling 1 million pages takes at most a few hours.
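A sketch of how pages might be appended to doc_raw.bin-style files in the record format above, rolling over to a new file once a size cap is hit. The 1 GB cap and the file-naming scheme are assumptions for illustration.

```python
import os

MAX_FILE_SIZE = 1 << 30  # 1 GB per raw-page file; the cap itself is an illustrative choice


class RawPageStore:
    """Append crawled pages to doc_raw.bin-style files, rolling over when a file gets too big."""

    def __init__(self, prefix: str = "doc_raw"):
        self.prefix = prefix
        self.file_index = 0

    def _current_path(self) -> str:
        return f"{self.prefix}_{self.file_index}.bin"

    def append(self, doc_id: int, html: str) -> None:
        record = f"{doc_id}\t{len(html)}\t{html}\r\n\r\n"
        path = self._current_path()
        # Roll over to a new file once the current one exceeds the size cap.
        if os.path.exists(path) and os.path.getsize(path) >= MAX_FILE_SIZE:
            self.file_index += 1
            path = self._current_path()
        with open(path, "a", encoding="utf-8") as f:
            f.write(record)
```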

4. Mapping file between page links and page IDs: doc_id.bin

Page numbering means assigning each page a unique ID, which makes later analysis and indexing easier. How are pages numbered?

Pages are numbered from small to large in the order in which they are crawled. Concretely: maintain a central counter; every time a page is crawled, take a number from the counter, assign it to that page, and then increment the counter by one. While the page itself is being stored, the correspondence between its link and its number is written to another file, doc_id.bin.
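A sketch of the central counter and the doc_id.bin mapping, assuming a single-process crawler so a plain in-memory dictionary can stand in for the central counter's bookkeeping; class and method names are illustrative.

```python
class DocIdAllocator:
    """Assign page IDs in crawl order and record the link-to-ID mapping in doc_id.bin."""

    def __init__(self, path: str = "doc_id.bin"):
        self.path = path
        self.counter = 0               # central counter: the next ID to hand out
        self.link_to_id: dict[str, int] = {}

    def allocate(self, link: str) -> int:
        if link in self.link_to_id:    # already numbered, reuse the existing ID
            return self.link_to_id[link]
        doc_id = self.counter
        self.counter += 1
        self.link_to_id[link] = doc_id
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(f"{doc_id}\t{link}\r\n")
        return doc_id
```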

Analyze

After pages have been crawled, they are analyzed offline. This has two steps: first, extract the page's text; second, segment the text into words and create a temporary index.

1. Extract the text from the page

A web page is semi-structured data: it contains all kinds of tags, JavaScript code, and CSS styles. For a search engine, only the part of the page the user can actually see matters. How do we extract the text from the page?

A web page is semi-structured precisely because it is written according to certain rules, namely the HTML syntax specification, so we can rely on the tags to extract the text. The extraction process has two steps:

  • Remove the JavaScript code, the CSS styles, and the contents of drop-down boxes (the contents of a drop-down box are invisible unless the user operates it). That means removing everything enclosed by the three tag pairs <style></style>, <script></script>, and <option></option>. Using the AC automaton, a multi-pattern string matching algorithm, all occurrences of these three keywords can be found in the big page string in a single pass, and the content between them deleted while traversing the string.
  • Remove all remaining HTML tags, which can also be implemented with string matching algorithms (a simplified sketch follows this list).
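A simplified sketch of the two-step extraction, assuming regular expressions as a stand-in for the single-pass AC automaton described above; a production version would match all three tag pairs in one pass over the page string.

```python
import re

# Blocks whose contents are invisible to the user and should be dropped entirely.
INVISIBLE_BLOCKS = re.compile(
    r"<(script|style|option)\b[^>]*>.*?</\1>",
    re.IGNORECASE | re.DOTALL,
)
# Any remaining HTML tag.
TAG = re.compile(r"<[^>]+>")


def extract_text(html: str) -> str:
    """Return the visible text of a page: drop script/style/option blocks, then strip tags."""
    without_blocks = INVISIBLE_BLOCKS.sub(" ", html)
    text = TAG.sub(" ", without_blocks)
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    page = "<html><style>p{color:red}</style><p>Hello <b>search</b></p><script>x=1</script></html>"
    print(extract_text(page))  # -> "Hello search"
```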

2. Segment the text into words and create a temporary index

After the text has been extracted, it is segmented into words and a temporary index is created.

Segmenting English pages is very simple: just split on spaces and punctuation. For Chinese pages, a segmentation method based on a dictionary and rules can be used.

The dictionary, also called the word lexicon, contains a large number of commonly used words. With the lexicon and the longest-match rule, the text is split into words. The longest-match rule means matching the longest possible word first.

For example, suppose the lexicon contains the words "中国" ("China"), "中国人民" ("the Chinese people"), and "解放军" ("Liberation Army"). When segmenting the text "中国人民解放军" ("the Chinese People's Liberation Army"), the longest-match rule splits off "中国人民" as a single word rather than stopping at "中国". To do this efficiently, the lexicon is built into a Trie tree, and the page text is then matched against the Trie.
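A minimal Trie plus greedy longest-match segmentation sketch; the three-word toy dictionary is only illustrative.

```python
class TrieNode:
    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children: dict[str, "TrieNode"] = {}
        self.is_word = False


class Trie:
    def __init__(self, words):
        self.root = TrieNode()
        for word in words:
            node = self.root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
            node.is_word = True

    def longest_match(self, text: str, start: int) -> int:
        """Length of the longest dictionary word starting at text[start], or 0."""
        node, best = self.root, 0
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break
            if node.is_word:
                best = i - start + 1
        return best


def segment(text: str, trie: Trie) -> list[str]:
    """Greedy longest-match segmentation; unknown characters become single-character tokens."""
    words, i = [], 0
    while i < len(text):
        length = trie.longest_match(text, i) or 1
        words.append(text[i:i + length])
        i += length
    return words


if __name__ == "__main__":
    trie = Trie(["中国", "中国人民", "解放军"])   # toy dictionary
    print(segment("中国人民解放军", trie))        # -> ['中国人民', '解放军']
```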

After the text of each page has been segmented, we get a list of words. The correspondence between these words and the page is written to a temporary index file, tmp_index.bin; this temporary index file will later be used to build the inverted index. Its format is:

term1_id \t doc_id \r\n

That is: word ID \t page ID, ending with the delimiter \r\n.

What is stored in the temporary index file is the word's number, the term_id, not the word itself, in order to save storage space. Where does this number come from? Maintain a counter: whenever a new word is split out of a page's text, take a number from the counter to assign to it and then increment the counter by one.

We also need a hash table to record the words that have already been numbered. While segmenting a page's text, each word split out is first looked up in the hash table; if it is found, the existing number is used directly; if not, a new number is taken from the counter and the word together with its number is added to the hash table.

When all pages have been segmented, the correspondence between words and their numbers is written to a disk file named term_id.bin.
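A sketch of the counter-plus-hash-table word numbering and of how each (term_id, doc_id) pair is appended to tmp_index.bin. The file names follow the text; the class and function names and the in-memory dict are assumptions.

```python
class TermIdTable:
    """Number words with a counter plus a hash table, as described above."""

    def __init__(self):
        self.counter = 0
        self.word_to_id: dict[str, int] = {}

    def get_id(self, word: str) -> int:
        if word not in self.word_to_id:        # new word: take the next number from the counter
            self.word_to_id[word] = self.counter
            self.counter += 1
        return self.word_to_id[word]

    def dump(self, path: str = "term_id.bin") -> None:
        # Write the word-to-number correspondence once all pages are segmented.
        with open(path, "w", encoding="utf-8") as f:
            for word, term_id in self.word_to_id.items():
                f.write(f"{term_id}\t{word}\r\n")


def append_tmp_index(doc_id: int, words: list[str], table: TermIdTable,
                     path: str = "tmp_index.bin") -> None:
    """Write one term_id \t doc_id line per word of the page into the temporary index."""
    with open(path, "a", encoding="utf-8") as f:
        for word in words:
            f.write(f"{table.get_id(word)}\t{doc_id}\r\n")
```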

Index

The indexing phase builds the inverted index from the temporary index produced by the analysis phase. The inverted index records, for each word, the list of pages that contain it (i.e., which documents contain this word). The structure of the inverted index is:

tid1 \t did1,did2,did3,...,didx \r\n

tid is short for term_id, the word's number; did1, ..., didx are the numbers of the pages that contain the word.

When constructing the inverted index file, the temporary index file is too large to be loaded into memory in one go, so multi-way merge sort is used. First, the temporary index file is sorted by word number. Because the temporary index is so large, ordinary in-memory sorting algorithms cannot handle it; instead we use the idea of merge sort: split it into multiple small files, sort each small file independently, and finally merge them together. This could also be done directly with MapReduce.
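A sketch of that external sort, assuming the temporary index is a plain text file with one term_id \t doc_id line per row: sort fixed-size chunks in memory, spill each sorted run to a temporary file, then k-way merge the runs with a heap. The chunk size and file names other than tmp_index.bin are illustrative.

```python
import heapq
import os
import tempfile


def sort_tmp_index(src: str = "tmp_index.bin", dst: str = "tmp_index_sorted.bin",
                   chunk_lines: int = 1_000_000) -> None:
    """External sort of the temporary index by term_id: sort chunks in memory, then k-way merge."""
    def key(line: str) -> int:
        return int(line.split("\t", 1)[0])         # the leading field is the term_id

    run_names = []
    with open(src, encoding="utf-8") as f:
        while True:
            chunk = [line if line.endswith("\n") else line + "\n"
                     for _, line in zip(range(chunk_lines), f)]
            if not chunk:
                break
            chunk.sort(key=key)                     # each chunk fits in memory
            run = tempfile.NamedTemporaryFile("w", delete=False, encoding="utf-8")
            run.writelines(chunk)
            run.close()
            run_names.append(run.name)

    run_files = [open(name, encoding="utf-8") for name in run_names]
    with open(dst, "w", encoding="utf-8") as out:
        out.writelines(heapq.merge(*run_files, key=key))   # k-way merge of the sorted runs
    for rf in run_files:
        rf.close()
    for name in run_names:
        os.remove(name)
```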

Once the temporary index file has been sorted, entries with the same word are adjacent. We then simply traverse the sorted temporary index file in order, find the list of page numbers corresponding to each word, and store them in the inverted index file.

Besides the inverted index file, we also need a file, term_offset.bin, that records the offset of each word's entry within the inverted index file. The role of this file is to let us quickly find where a word number's entry is stored in the inverted index, so that the list of page numbers for that word can be read quickly.

tid1 \t offset1 \r\n

That is: word ID \t offset, ending with the delimiter \r\n.
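A sketch that turns the sorted temporary index into index.bin and term_offset.bin in the formats above: consecutive lines with the same term_id are grouped, the posting line is written, and its byte offset is recorded. Function and variable names are illustrative.

```python
from itertools import groupby


def build_inverted_index(sorted_tmp: str = "tmp_index_sorted.bin",
                         index_path: str = "index.bin",
                         offset_path: str = "term_offset.bin") -> None:
    """Group the sorted temporary index by term_id and emit index.bin plus term_offset.bin."""
    with open(sorted_tmp, encoding="utf-8") as src, \
         open(index_path, "wb") as index_file, \
         open(offset_path, "w", encoding="utf-8") as offset_file:
        rows = (line.rstrip("\r\n").split("\t") for line in src if line.strip())
        for term_id, group in groupby(rows, key=lambda parts: parts[0]):
            offset = index_file.tell()                        # byte position of this word's entry
            doc_ids = ",".join(parts[1] for parts in group)   # all pages containing the word
            index_file.write(f"{term_id}\t{doc_ids}\r\n".encode("utf-8"))
            offset_file.write(f"{term_id}\t{offset}\r\n")
```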

Query

The query phase uses several of the files produced earlier to serve the user's final search:

  • doc_id.bin
  • term_id.bin
  • index.bin
  • term_offset.bin

All of these files except index.bin are loaded into memory and organized as hash tables. When the user enters query text in the search box, the text is segmented into words; suppose we get k words. Take the k words to the hash table built from term_id.bin and find each word's number; after these lookups we have k word numbers. Take the k word numbers to the hash table built from term_offset.bin and find the offset of each word's entry in the inverted index file; this gives us k offsets. Finally, go to the inverted index and, at each of these offsets, read the list of page numbers containing the corresponding word.

For the k lists of page numbers, count how many times each page number appears; a hash table can be used for this. Once the counts are known, sort the pages by count from large to small: the more times a page appears, the more of the user's query words it contains. Then take each page number to the doc_id.bin file to find the corresponding page link, and display the results to the user page by page.
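An end-to-end query sketch under the assumptions above: term_id.bin, term_offset.bin, and doc_id.bin have already been loaded into in-memory hash tables (plain dicts here), and index.bin is read on demand by offset. Ranking here is only by how many query words a page contains, as described; PageRank and tf-idf are left out.

```python
from collections import Counter


def search(query_words: list[str],
           term_id: dict[str, int],       # word -> word ID, loaded from term_id.bin
           term_offset: dict[int, int],   # word ID -> offset in index.bin, from term_offset.bin
           doc_links: dict[int, str],     # page ID -> page link, loaded from doc_id.bin
           index_path: str = "index.bin") -> list[str]:
    """Fetch each query word's posting list from index.bin and rank pages by matched word count."""
    counts = Counter()
    with open(index_path, "rb") as index_file:
        for word in query_words:
            tid = term_id.get(word)
            if tid is None or tid not in term_offset:
                continue                                   # word never seen during indexing
            index_file.seek(term_offset[tid])              # jump straight to this word's entry
            line = index_file.readline().decode("utf-8").rstrip("\r\n")
            _, doc_list = line.split("\t", 1)
            for doc in doc_list.split(","):
                counts[int(doc)] += 1
    # Pages containing more of the query words rank higher.
    return [doc_links[doc] for doc, _ in counts.most_common() if doc in doc_links]
```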

Summary

The PageRank algorithm is used to compute page weight, and the tf-idf model is used to compute the ranking of query results.

Exercise: try writing the code for such a search engine.

How can summary information (snippets) and cached pages be supported?

Summary information:
Add summary.bin and summary_offset.bin. After the page text has been extracted, take roughly the first 80 to 160 characters as the summary, write it to summary.bin, and write its offset location to summary_offset.bin.
summary.bin format:
doc_id \t summary_size \t summary \r\n\r\n
summary_offset.bin format:
doc_id \t offset \r\n
The snippet that Google shows in its search results is the text near the search terms. To achieve that effect, save the full page text; when building the search results, find the positions of the search terms in the page text and extract the text around them.
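A sketch of writing summary.bin and summary_offset.bin in the formats above, taking the leading characters of the extracted page text as the summary; the 160-character cut-off and the function name are illustrative.

```python
def write_summary(doc_id: int, page_text: str,
                  summary_path: str = "summary.bin",
                  offset_path: str = "summary_offset.bin",
                  max_chars: int = 160) -> None:
    """Store the leading part of the page text as its summary and record where the record lives."""
    summary = page_text[:max_chars]
    with open(summary_path, "ab") as sf, open(offset_path, "a", encoding="utf-8") as of:
        offset = sf.tell()                                  # byte offset of this summary record
        sf.write(f"{doc_id}\t{len(summary)}\t{summary}\r\n\r\n".encode("utf-8"))
        of.write(f"{doc_id}\t{offset}\r\n")
```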

Cached pages (snapshots):
doc_raw.bin can serve directly as the snapshot store; add a file doc_raw_offset.bin that records the offset of each doc_id within doc_raw.bin.
doc_raw_offset.bin format:
doc_id \t offset \r\n


Origin blog.csdn.net/ywangjiyl/article/details/104893377