Introduction to Search Engine

graph LR
  A[search engine]
  style A fill:#cc66ff
  A --> B[4 stages]
  A --> C[crawler]
  A --> D[inverted index<br>word-document matrix]
  C --> E[link relationship<br>anti-cheat]
  C --> F[PageRank<br>OPIC]
  A --> G[dark web & two difficulties]
  D --> I[word segmentation & NLP]
  I --> J[document number<br>word frequency<br>position]
  D --> K[dynamic index, three modules]
  K --> L[distributed<br>by document & by word]
  D --> M[phrase queries]

This article is a set of study notes on the book "This Is a Search Engine: Core Technology Explained"

Search engines are an important part of the Internet, and search engine technology is currently the main means of dealing with information overload

By the techniques used, the development of search engines can be divided into four stages:

  1. Directory classification, such as the hao123 home page, which has no search box. At this stage sites had to be categorized and summarized manually
  2. Text retrieval, which uses an information retrieval model to match the user's query against pages. At this stage pages were treated as independent, with no association between them
  3. Link analysis, which builds on text retrieval by taking the link (hyperlink) relationships between pages into account and using them to measure how popular (important) a page is
  4. User-centric, which tries to understand and meet the individual needs of users. For example, a user searches for "Apple" but may actually want to find the iPhone 11

The goals of a search engine are: more complete, faster, more accurate

Search engine technology architecture

graph LR
  A((Internet))
  style A fill:#00ff00
  A --> B(Web crawler<br>similar web pages)
  B --> D[inverted index<br><link relations>]
  C[cloud platform] --> D
  E[anti-cheat] --> D
  style E fill:#cc66ff
  D --> F[content similarity<br>link analysis]
  F --> G((page ranking))
  H((user))
  style H fill:#00ff00
  H --> I[Cache]
  I --> G

Related concepts

  • Seed URLs & web crawler: a web crawler needs initial URLs to start from; these are the seed URLs
  • Page classification from the crawler's point of view: downloaded, expired, to be downloaded (URLs already in the crawler's queue), knowable pages, unknown pages
  • Crawler types: batch (stops once the target is reached), incremental (crawls continuously), vertical (focuses on a specific domain)
  • Friendly crawlers: obey the crawler exclusion protocol to protect the private parts of a site, and minimize the network load on the crawled site
  • Crawl strategy: crawler resources are limited while web resources are nearly limitless, so crawlers use different crawl and update strategies for different types of pages
    • Breadth-first traversal
    • Incomplete PageRank: crawl part of the pages, analyze the URLs they contain, then fetch the more important pages first. A simple approach: view pages as nodes of a graph; the more links point to a page, the more important it generally is (ignoring cheating). Computing PageRank over the whole web is almost impossible, so only the downloaded part is processed
    • OPIC (online page importance computation): the idea is essentially the same as incomplete PageRank. Every page starts with the same amount of cash; when a page is crawled, its cash is divided equally among the URLs it links to, and the URLs waiting in the system are sorted by the amount of cash they hold. Compared with PageRank, OPIC's advantages are speed and real-time computation
    • Large-site priority
  • Update policy for crawled pages
    • Historical reference: pages that were updated frequently in the past are likely to be updated frequently in the future; updates are modeled as a Poisson process
    • User experience strategy: pages with a high impact on users should be updated first
    • Cluster sampling strategy: different categories of pages have different update frequencies
  • Distributed crawling
    • Master-slave crawlers
    • Peer-to-peer crawlers
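
The OPIC idea from the list above can be sketched on a small in-memory link graph. This is only an illustrative simulation, not the book's implementation: the graph, the function name `opic_crawl_order`, and the assumption that all pages are known up front (so that each can receive the same starting cash) are made up for the example.

```python
def opic_crawl_order(links, seeds, steps):
    """Simulate OPIC-style crawl ordering on a known link graph.

    links: dict mapping a URL to the list of URLs it links to (hypothetical data).
    Returns the URLs in the order they were crawled.
    """
    pages = set(links) | {t for out in links.values() for t in out}
    cash = {p: 1.0 / len(pages) for p in pages}  # every page starts with the same cash
    frontier = set(seeds)                        # URLs waiting to be crawled
    crawled = []
    for _ in range(steps):
        if not frontier:
            break
        # Crawl the waiting URL that currently holds the most cash
        # (sorted() makes ties deterministic)
        url = max(sorted(frontier), key=lambda u: cash[u])
        frontier.remove(url)
        crawled.append(url)
        out = links.get(url, [])
        if out:
            share = cash[url] / len(out)         # split the cash equally among out-links
            for target in out:
                cash[target] += share
                if target not in crawled:
                    frontier.add(target)
        cash[url] = 0.0                          # the crawled page has spent its cash
    return crawled


links = {"seed": ["a", "b", "c"], "a": ["c"], "b": ["c"], "c": []}
order = opic_crawl_order(links, ["seed"], steps=4)
print(order)  # ['seed', 'a', 'c', 'b']
```

Page `c` is crawled before `b` because it has already collected cash from `a`; a plain breadth-first traversal would have visited `b` first.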

Dark Web

The so-called dark web refers to resources that are stored in databases and cannot be reached through hyperlinks. Simply put, the dark web is the set of network resources that are difficult or impossible for search engines to index. For example, Ctrip's flight ticket data can only be retrieved by submitting a combination of query conditions. This process is hard for a web crawler to automate, and ordinary brute-force crawling puts great pressure on the crawled site

Dark web crawling has two technical difficulties:

  • There are too many combinations: how should the query combinations be selected?
  • Most queries use text boxes: how does the crawler fill in appropriate content?

For solutions to these two difficulties, see Section 2.6 of "This Is a Search Engine"
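
To get a feeling for the first difficulty, here is a small sketch that enumerates every query combination for a made-up search form. The field names and values are assumptions chosen for illustration only; they do not describe any real site.

```python
from itertools import product

# Hypothetical search form: fields and candidate values are invented for the example
form_fields = {
    "from": ["PEK", "SHA", "CAN"],
    "to": ["PEK", "SHA", "CAN"],
    "date": ["2019-10-01", "2019-10-02"],
    "cabin": ["economy", "business"],
}


def all_queries(fields):
    """Enumerate every combination of field values (the brute-force approach)."""
    names = list(fields)
    for values in product(*(fields[name] for name in names)):
        yield dict(zip(names, values))


queries = list(all_queries(form_fields))
print(len(queries))  # 3 * 3 * 2 * 2 = 36 requests, even for this tiny form
```

The count is the product of the number of values per field, so it grows quickly as fields and values are added, which is why a crawler has to pick combinations selectively.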

Index

The index is one of the core technologies of a search engine and the basis that lets the search engine find content quickly

Some concepts

  • Document (Doc): an object that exists in text form, or a unit that the system can parse, segment into words and store independently. Examples: PDF, Word, HTML, XML, etc.
  • Inverted index: a concrete storage form of the word-document matrix described below. With an inverted index, the list of documents containing a word can be obtained quickly from the word. Inverted indexes are commonly implemented with hash tables or multi-way trees
  • Word dictionary: the unit of indexing in a search engine is usually the word; the word dictionary is the set of strings made up of all words that have appeared in the document collection
  • Inverted list: the same word may appear in different documents, and several times in the same document; the inverted list stores the association between a word and the documents
  • Inverted file: stores all the words and their corresponding inverted lists
  • Document frequency: the number of documents in the current collection in which a word appears

Word-document matrix

The word-document matrix indicates whether a given word occurs in a given document, for example:

          Document 1   Document 2   ...   Document N
Word 1    yes
...       ...          ...          ...   ...
Word n    yes          ...
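
A minimal sketch of such a matrix, assuming whitespace-separated English text. The three sample documents and the function name `word_document_matrix` are made up for the example.

```python
# Hypothetical sample documents: doc_id -> text
docs = {
    1: "search engines build an index",
    2: "a crawler downloads pages",
    3: "the index maps words to pages",
}


def word_document_matrix(docs):
    """Return (doc_ids, matrix), where matrix[word][i] says whether
    the word occurs in document doc_ids[i]."""
    doc_ids = sorted(docs)
    tokens = {d: set(docs[d].split()) for d in doc_ids}
    vocab = sorted({w for words in tokens.values() for w in words})
    matrix = {w: [w in tokens[d] for d in doc_ids] for w in vocab}
    return doc_ids, matrix


doc_ids, matrix = word_document_matrix(docs)
print(matrix["index"])  # [True, False, True]
print(matrix["pages"])  # [False, True, True]
```

Most cells of a real matrix are empty, which is one reason the inverted index stores only the documents that actually contain each word.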

Word segmentation

In English, the space between words is a natural delimiter. In Chinese, only characters, sentences and paragraphs are clearly delimited; there is no formal delimiter between words. Word segmentation is the process of recombining a continuous sequence of characters into a sequence of words according to certain rules.

The word is the basic unit of search in a search engine. To build the word-document matrix, a word segmentation system is needed to extract all the words in each document

Word segmentation is one of the core tools of natural language processing (NLP), and NLP is an extremely important tool in the era of big data. This article does not describe word segmentation algorithms; see other references for details.

Inverted index example

Example documents:

Document No.   Document content
1              Father of Google Maps quit Facebook
2              Father of Google Maps to join Facebook
3              Google Maps founder Las leave Google to join Facebook
4              Google Maps and Wave father quit Facebook cancel the project
5              Google Maps Las father joined the social networking site Facebook

Example of an inverted index with document frequency, word frequency and position information:

Word ID   Word      Document frequency   Inverted list (document number; word frequency; <positions>)
1         Google    5                    (1;1;<1>),(2;1;<1>),(3;2;<1;6>),(4;1;<1>),(5;1;<1>)
2         Founder   1                    (3;1;<3>)
...       ...       ...                  ...

The word "Google" appears in five documents. It appears twice in document 3, at the 1st and 6th word positions
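
The structure of this table can be reproduced with a short sketch. It assumes the English example texts split on whitespace and lowercased, so positions are counted on those texts and need not agree with the table for every document; the entry for document 3 does. The function names are hypothetical.

```python
from collections import defaultdict

docs = {
    1: "Father of Google Maps quit Facebook",
    2: "Father of Google Maps to join Facebook",
    3: "Google Maps founder Las leave Google to join Facebook",
    4: "Google Maps and Wave father quit Facebook cancel the project",
    5: "Google Maps Las father joined the social networking site Facebook",
}


def build_inverted_index(docs):
    """word -> {doc_id: [1-based word positions]}"""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split(), start=1):
            index[word].setdefault(doc_id, []).append(pos)
    return index


def postings(index, word):
    """Return (document frequency, [(doc_id, word frequency, positions), ...])."""
    entry = index.get(word.lower(), {})
    plist = [(doc_id, len(pos), pos) for doc_id, pos in sorted(entry.items())]
    return len(entry), plist


index = build_inverted_index(docs)
df, plist = postings(index, "Google")
print(df)                          # 5
print(plist[2])                    # (3, 2, [1, 6])
print(postings(index, "founder"))  # (1, [(3, 1, [3])])
```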

Building and updating the index

There are three common ways to build an index: the two-pass document method, the sorting method, and the merging method; the merging method is the most commonly used. When the data is too large to be held in memory, only an external-memory algorithm can be used: merging
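
A rough sketch of the merging idea, under simplifying assumptions: each batch of documents is indexed in memory and turned into a sorted run, and the runs are then merged in one pass. Here the runs are plain Python lists standing in for files on disk, and the names `build_run` and `merge_runs` are made up.

```python
import heapq
from collections import Counter
from itertools import groupby


def build_run(batch):
    """Index one in-memory batch and return a sorted run of (word, doc_id, tf)."""
    run = []
    for doc_id, text in batch:
        for word, tf in Counter(text.lower().split()).items():
            run.append((word, doc_id, tf))
    run.sort()
    return run


def merge_runs(runs):
    """Merge sorted runs into the final index: word -> [(doc_id, tf), ...]."""
    merged = heapq.merge(*runs)  # streams the runs in sorted order
    return {
        word: [(doc_id, tf) for _, doc_id, tf in group]
        for word, group in groupby(merged, key=lambda item: item[0])
    }


docs = [(1, "google maps"), (2, "google search"), (3, "maps of maps")]
batch_size = 2  # pretend only two documents fit in memory at once
runs = [build_run(docs[i:i + batch_size]) for i in range(0, len(docs), batch_size)]
index = merge_runs(runs)
print(index["maps"])    # [(1, 1), (3, 2)]
print(index["google"])  # [(1, 1), (2, 1)]
```

Because `heapq.merge` only needs the current head of each run, the merge step does not require all postings to be in memory at the same time.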

Dynamic index

When searching a dynamic document collection we need a dynamic index. The system then has three key components: the inverted index, a temporary index, and a list of deleted documents

Document changes are first written to the temporary index. An updated document is handled as a delete followed by a re-add. Deleted documents are recorded in the deleted documents list, which is used to filter the query results before they are returned.

graph LR
  A((query))
  style A fill:#00ff00
  A --> C[temporary index<br><New Document>]
  A --> D[inverted index]
  C --> E[deleted documents list <filter>]
  D --> E
  E --> F((query result))
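
The three components can be sketched as a small class. This is a toy model under stated assumptions: the class and method names are invented, an updated document receives a new document id, and merging the temporary index back into the main inverted index is left out.

```python
from collections import defaultdict


class DynamicIndex:
    def __init__(self):
        self.main = defaultdict(set)  # main inverted index: word -> doc ids
        self.temp = defaultdict(set)  # temporary index for newly added documents
        self.deleted = set()          # deleted documents list
        self._next_id = 1

    def add(self, text):
        """New documents go into the temporary index first."""
        doc_id = self._next_id
        self._next_id += 1
        for word in text.lower().split():
            self.temp[word].add(doc_id)
        return doc_id

    def delete(self, doc_id):
        """Deletion only records the id; postings are filtered at query time."""
        self.deleted.add(doc_id)

    def update(self, doc_id, text):
        """An update is a delete followed by a re-add (assumed to get a new id)."""
        self.delete(doc_id)
        return self.add(text)

    def search(self, word):
        word = word.lower()
        hits = self.main.get(word, set()) | self.temp.get(word, set())
        return sorted(hits - self.deleted)  # filter with the deleted list


idx = DynamicIndex()
first = idx.add("google maps")    # id 1
second = idx.add("google search") # id 2
third = idx.update(first, "bing maps")  # id 1 is deleted, new id 3
print(idx.search("google"))  # [2]
print(idx.search("maps"))    # [3]
```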

There are several index update strategies: complete rebuild, re-merge, in-place update, and hybrid; see Section 3.6 of the original book for details. As with crawlers, words with different attributes may use different index update strategies.

Multiple-Field Index

Some documents are structured. An e-mail, for example, has a sender, recipients, a subject and a body, and some searches specify which part of the structure to search, such as searching only in the recipient list. A multi-field index therefore needs some way to record the relationship between words and the structure of the document

There are three common ways to build a multi-field index:

  • Multiple indexes: build a separate index for each structural part of the document
  • Inverted list: build only one index over the documents, but add the field information to each entry in a word's inverted list; this information is later used to filter out results that are not in the specified field
  • Expanded list: similar to the second approach, but the field information is not stored in the inverted list. Instead, an expanded list describes the structure of each document. For example: words 1 to 21 of a document are the sender information, so if the query word in a result is not located within words 1 to 21, the hit is not in the sender field
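
The expanded list approach can be sketched as follows. The positional index, the field boundaries and the function name `search_in_field` are all hypothetical; only the "words 1 to 21 are the sender" example comes from the text above.

```python
# Hypothetical positional index: word -> {doc_id: [word positions]}
positions = {
    "alice": {1: [3, 40]},
    "bob": {1: [25]},
}

# Hypothetical expanded list: field -> {doc_id: (first position, last position)}
extents = {
    "sender": {1: (1, 21)},
    "body": {1: (22, 200)},
}


def search_in_field(word, field):
    """Return ids of documents in which `word` occurs inside `field`."""
    result = []
    for doc_id, word_positions in sorted(positions.get(word, {}).items()):
        span = extents.get(field, {}).get(doc_id)
        if span is None:
            continue  # this document has no such field
        start, end = span
        if any(start <= p <= end for p in word_positions):
            result.append(doc_id)
    return result


print(search_in_field("alice", "sender"))  # [1]  (position 3 is within 1..21)
print(search_in_field("bob", "sender"))    # []   (position 25 is outside 1..21)
print(search_in_field("bob", "body"))      # [1]
```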

Distributed Index

With current technology, massive data can only be processed in a distributed way. There are two schemes for building a distributed index for a search engine: partitioning by document and partitioning by word.

Partitioning by document: the document collection is divided evenly across the machines, each machine indexes its own subset of documents, and every query is broadcast to all machines

Partitioning by word: different machines index different words, so a query involves only some of the machines

Partitioning by document is the commonly used method, because partitioning by word has the following disadvantages:

  • Poor scalability: each added document touches many machines, because a document contains many words and those words are indexed on different machines
  • Poor load balancing: some words are very common, and the machines that index them need more resources than the others
  • Poor fault tolerance
  • Query restrictions: partitioning by word can only use the one-word-at-a-time query method (described below), while partitioning by document has no such restriction
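
The two partitioning schemes can be compared with a toy sketch in which each "machine" is just a dictionary. The number of machines, the routing rules (document id modulo the machine count, and a CRC32 hash of the word) and all function names are assumptions for illustration.

```python
import zlib
from collections import defaultdict

NUM_MACHINES = 3


def machine_for_word(word):
    # A stable hash keeps the example reproducible across runs
    return zlib.crc32(word.encode("utf-8")) % NUM_MACHINES


def partition_by_document(docs):
    """Each machine indexes its own subset of documents."""
    shards = [defaultdict(set) for _ in range(NUM_MACHINES)]
    for doc_id, text in docs.items():
        shard = shards[doc_id % NUM_MACHINES]
        for word in text.split():
            shard[word].add(doc_id)
    return shards


def partition_by_word(docs):
    """Each machine holds the full inverted list for its own subset of words."""
    shards = [defaultdict(set) for _ in range(NUM_MACHINES)]
    for doc_id, text in docs.items():
        for word in text.split():
            shards[machine_for_word(word)][word].add(doc_id)
    return shards


def query_document_partition(shards, word):
    """The query is broadcast to every machine and the partial results are merged."""
    hits = set()
    for shard in shards:
        hits |= shard.get(word, set())
    return sorted(hits)


def query_word_partition(shards, word):
    """Only the machine responsible for the word is asked."""
    return sorted(shards[machine_for_word(word)].get(word, set()))


docs = {1: "google maps", 2: "google search", 3: "maps api", 4: "search api"}
by_doc = partition_by_document(docs)
by_word = partition_by_word(docs)
print(query_document_partition(by_doc, "google"))  # [1, 2]
print(query_word_partition(by_word, "google"))     # [1, 2]
```

Adding a document touches exactly one machine in `partition_by_document`, but one machine per distinct word in `partition_by_word`, which illustrates the scalability point above.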

Query

Once the index is built it can be used to answer queries. There are two query methods: one document at a time, and one word at a time

  • One document at a time

    Works in units of the documents contained in the inverted lists: the final similarity between one document and the query is computed completely, then the remaining documents are processed in turn, and finally the documents are sorted by score

  • One word at a time

    Works in units of query words: for one word, each document in its inverted list receives a partial score; all documents involved with that word are processed before moving on to the next word. A document's final score is the accumulation of its scores over all words
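
The difference between the two methods is only the order of the loops, as the following sketch shows. The tiny index and the scoring function (raw term frequency as a stand-in for a real similarity measure) are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical index: word -> {doc_id: term frequency}
index = {
    "google": {1: 1, 2: 1, 3: 2},
    "maps": {1: 1, 3: 1},
}


def score(tf):
    return float(tf)  # placeholder for a real similarity function


def rank(scores):
    # Highest score first; ties broken by document id
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


def document_at_a_time(query):
    """Outer loop over documents: each document's final score is computed in one go."""
    candidates = sorted({d for w in query for d in index.get(w, {})})
    scores = {}
    for doc_id in candidates:
        scores[doc_id] = sum(
            score(index[w][doc_id]) for w in query if doc_id in index.get(w, {})
        )
    return rank(scores)


def term_at_a_time(query):
    """Outer loop over words: partial scores are accumulated per document."""
    accumulators = defaultdict(float)
    for w in query:
        for doc_id, tf in index.get(w, {}).items():
            accumulators[doc_id] += score(tf)
    return rank(accumulators)


print(document_at_a_time(["google", "maps"]))  # [(3, 3.0), (1, 2.0), (2, 1.0)]
print(term_at_a_time(["google", "maps"]))      # [(3, 3.0), (1, 2.0), (2, 1.0)]
```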

Phrase Queries

Common methods for phrase queries: position information index, two-word index, and phrase index

The position information index is intuitive, but it is inefficient when the inverted lists of the words involved are long
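
A minimal sketch of a phrase query over a position information index, assuming whitespace-separated lowercase words. The sample documents and function names are made up.

```python
from collections import defaultdict


def build_positional_index(docs):
    """word -> {doc_id: [1-based word positions]}"""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split(), start=1):
            index[word].setdefault(doc_id, []).append(pos)
    return index


def phrase_match(index, phrase):
    """Return ids of documents that contain the words of `phrase` consecutively."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return []
    # Only documents containing every word can match
    candidates = set(index[words[0]])
    for w in words[1:]:
        candidates &= set(index[w])
    hits = []
    for doc_id in sorted(candidates):
        for start in index[words[0]][doc_id]:
            # The i-th word of the phrase must sit at position start + i
            if all(start + i in index[w][doc_id] for i, w in enumerate(words)):
                hits.append(doc_id)
                break
    return hits


docs = {1: "google maps founder", 2: "maps by google", 3: "new google maps app"}
index = build_positional_index(docs)
print(phrase_match(index, "google maps"))   # [1, 3]
print(phrase_match(index, "maps founder"))  # [1]
```

Every candidate document requires a scan of position lists, which is where the cost grows when the lists are long.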

The two-word index records the connection between the two words of a common two-word phrase, so that the second word can be found quickly from the first word. A two-word index consumes a lot of storage space, so it is generally used only for common phrases
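
A small sketch of a two-word index as a nested mapping from first word to second word to documents. For simplicity it indexes every adjacent word pair rather than only common phrases; the sample documents and names are made up.

```python
from collections import defaultdict


def build_two_word_index(docs):
    """first word -> second word -> ids of documents where the pair is adjacent"""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, text in docs.items():
        words = text.lower().split()
        for first, second in zip(words, words[1:]):
            index[first][second].add(doc_id)
    return index


def two_word_lookup(index, first, second):
    """Find the second word under the first word and return the matching documents."""
    return sorted(index.get(first, {}).get(second, set()))


docs = {1: "google maps founder", 2: "maps by google", 3: "new google maps app"}
two_word_index = build_two_word_index(docs)
print(two_word_lookup(two_word_index, "google", "maps"))  # [1, 3]
print(two_word_lookup(two_word_index, "maps", "google"))  # []
```

Indexing every pair is what makes the storage cost high; restricting the index to frequent pairs keeps it manageable.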

The phrase index treats a whole phrase as a word and indexes it; this generally requires mining popular phrases from data first

Phrases also have attributes and categories; phrases with different attributes can use different indexing methods

Origin www.cnblogs.com/jiahu-Blog/p/11621816.html