Search engine work process and SEO

The working process of a search engine is very complicated, but it can be roughly divided into three stages:
  Crawling and fetching: search engine spiders visit pages by following links, obtain the pages' HTML code, and store it in a database.
  Preprocessing: the indexing program performs text extraction, Chinese word segmentation, indexing, and other processing on the fetched page data to prepare it for the ranking program to call.
  Ranking: after the user enters a keyword, the ranking program calls the index database, calculates relevance, and then generates a search results page in a certain format.
  Crawling and fetching
  Crawling and fetching is the first step of search engine work; it completes the task of data collection.
  Spider
  The programs search engines use to crawl and access pages are called spiders, also known as robots (bots).
  Spider user-agent names:
  Baidu spider: Baiduspider+(+http://www.baidu.com/search/spider.htm)
  Yahoo China spider: Mozilla/5.0 (compatible; Yahoo! Slurp China; http://misc.yahoo.com.cn/help.html)
  English Yahoo spider: Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)
  Google spider: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
  Microsoft Bing spider: msnbot/1.1 (+http://search.msn.com/msnbot.htm)
  Sogou spider: Sogou+web+robot+(+http://www.sogou.com/docs/help/webmasters.htm#07)
  Soso spider: Sosospider+(+http://help.soso.com/webspider.htm)
  Youdao spider: Mozilla/5.0 (compatible; YodaoBot/1.0; http://www.yodao.com/help/webmaster/spider/; )
  Following links
  In order to crawl as many pages as possible on the web, search engine spiders follow the links on each page, crawling from one page to the next just like a spider crawling across a web; this is the origin of the name "spider". The two simplest crawling traversal strategies are depth-first and breadth-first.
  Depth-first search
  Depth-first search expands only one child node at each level of the search tree, advancing in depth until it can go no further (reaching a leaf node or hitting a depth limit), then backtracks to the previous level and advances in another direction. The search tree of this method grows branch by branch from the root.
  Depth-first search is also called vertical search. Because the tree of a solvable problem may contain infinite branches, a depth-first search that strays into an infinite branch (that is, one of unbounded depth) can never find the target node, so the depth-first strategy is incomplete. In addition, the solution it finds is not necessarily the best one (the shortest path).
  Breadth-first search
  In depth-first search, the node with the greatest depth is expanded first. If the algorithm is changed to search level by level, so that no node on a lower level is processed until every node on the current level has been searched, then nodes with the smallest depth are expanded first: nodes discovered earlier are expanded earlier. This kind of search algorithm is called breadth-first search.
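  To make the difference concrete, here is a minimal Python sketch (a toy in-memory link graph stands in for the web; all names are illustrative): the two strategies differ only in whether the crawl frontier behaves as a stack or a queue.

```python
from collections import deque

# Toy link graph standing in for the web (assumption for illustration).
LINKS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["E"],
    "D": [],
    "E": [],
}

def crawl(seeds, depth_first=False):
    """Follow links from the seed pages; a stack gives depth-first
    order, a queue gives breadth-first order."""
    frontier = deque(seeds)
    visited = set(seeds)
    order = []
    while frontier:
        url = frontier.pop() if depth_first else frontier.popleft()
        order.append(url)
        for link in LINKS.get(url, []):
            if link not in visited:   # skip pages already discovered
                visited.add(link)
                frontier.append(link)
    return order

print(crawl(["A"]))                    # breadth-first: A B C D E
print(crawl(["A"], depth_first=True))  # depth-first:   A C E B D
```

  The visited set in this sketch is what the address library described below implements at scale.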
  Attracting spiders
  Which pages are considered more important? There are several influencing factors:
  · Website and page weight. Websites of high quality and long standing are considered to carry higher weight; pages on such websites are crawled to a greater depth, so more of their internal pages get indexed.
  · Page update frequency. The spider stores the page data each time it crawls. If a second crawl finds the page exactly the same as the first, the page has not been updated and the spider has no need to crawl it often. If a page's content is updated frequently, the spider will visit it more often, and new links appearing on the page will naturally be followed faster, so new pages get crawled sooner.
  · Inbound links. Whether from an external site or from within the same website, a page must have an inbound link for the spider to reach it; otherwise the spider has no way of knowing the page exists. High-quality inbound links also tend to increase the crawl depth of the outbound links on a page. Generally, the homepage carries the highest weight on a website: most external links point to it, and it is the page spiders visit most frequently. The fewer clicks a page is from the homepage, the higher its weight and the greater its chance of being crawled.
  Address database
  To avoid repeatedly crawling and fetching the same URLs, search engines build an address database that records both pages that have been discovered but not yet crawled and pages that have already been crawled. URLs enter the address database from several sources:
  (1) Seed websites entered manually.
  (2) After the spider fetches a page, it parses new URLs out of the HTML and compares them against the address database; URLs not already in the database are added to the to-be-visited list.
  (3) URLs submitted by webmasters through the search engine's URL submission form.
  The spider extracts URLs from the to-be-visited address library in order of importance, visits and fetches the pages, then deletes those URLs from the to-be-visited list and puts them into the visited address library.
  Most major search engines provide a form for webmasters to submit URLs, but submitted URLs are only stored in the address database; whether they are indexed depends on the importance of the page. Most of the pages a search engine indexes are found by spiders following links, so submitting pages is of little use: search engines prefer to discover new pages by following links themselves.
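  A minimal sketch of such an address library in Python, assuming each URL comes with an importance score (the class and field names here are hypothetical, not any engine's actual design):

```python
import heapq

class AddressLibrary:
    """To-be-visited URLs ordered by importance, plus a visited set."""

    def __init__(self):
        self.to_visit = []    # heap of (-importance, url): most important first
        self.seen = set()     # every URL ever recorded, to avoid duplicates
        self.visited = set()  # URLs already fetched

    def add(self, url, importance=0.0):
        if url not in self.seen:  # URLs already in the library are skipped
            self.seen.add(url)
            heapq.heappush(self.to_visit, (-importance, url))

    def next_url(self):
        """Move the most important pending URL to the visited set."""
        _, url = heapq.heappop(self.to_visit)
        self.visited.add(url)
        return url

lib = AddressLibrary()
lib.add("http://example.com/", importance=1.0)   # manually entered seed
lib.add("http://example.com/a", importance=0.3)  # parsed from fetched HTML
lib.add("http://example.com/b", importance=0.5)  # submitted by a webmaster
print(lib.next_url())  # http://example.com/ (highest importance first)
```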
  File storage
  The data fetched by the search engine spider is stored in the original page database. The page data stored there is exactly the same HTML the user's browser receives, and each URL has a unique file number.
  Copied-content detection while crawling
  Detecting and deleting copied content is usually done during the preprocessing stage described below, but spiders now also perform a certain amount of copied-content detection while crawling and fetching files. When a spider encounters large amounts of reprinted or plagiarized content on a low-weight website, it may well stop crawling. This is why some webmasters find spider visits in their log files even though their pages have never actually been indexed.
  Preprocessing
  In some SEO materials, "preprocessing" is also called "indexing", because indexing is the most important step of preprocessing.
  The original pages fetched by search engine spiders cannot be used directly for query ranking. Search engine databases hold pages numbering in the trillions; if the ranking program had to analyze the relevance of that many pages in real time after the user entered a search term, the amount of computation would be far too large to return results within a second or two. The fetched pages must therefore be preprocessed to prepare for the final query ranking.
  Like crawling, preprocessing is carried out in advance in the background; users do not notice this process when they search.
  1. Extract text
  Current search engines are still text-based. Besides the visible text a user sees in the browser, the HTML code of a page fetched by the spider contains large amounts of HTML tags, JavaScript programs, and other content that cannot be used for ranking. The first task of preprocessing is to strip the tags and programs from the HTML file and extract the text content of the page that can be used for ranking.
  For example, suppose the visible text of a page is "Today is April Fools' Day". After the HTML code is removed, the text left for ranking is just this line:
  Today is April Fools' Day
  In addition to visible text, search engines also extract some special code that carries text information, such as text in Meta tags, image alt text, alternative text for Flash files, and link anchor text.
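  As a sketch of this step, Python's standard-library HTML parser can strip tags and scripts and keep only the visible text (real extractors are far more robust; this is illustrative only):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text while skipping <script> and <style> blocks."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0   # > 0 while inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())

extractor = TextExtractor()
extractor.feed("<div><script>var x = 1;</script>"
               "<p>Today is April Fools' Day</p></div>")
print(" ".join(extractor.parts))  # Today is April Fools' Day
```

  The special text mentioned above (Meta tag text, alt text, anchor text) would be picked up in handle_starttag by reading the tag attributes.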
  2. Chinese word segmentation
  Word segmentation is a step unique to Chinese search engines. Search engines store and process pages, and handle user searches, on the basis of words. In languages such as English there are spaces between words, so the indexing program can directly divide a sentence into sets of words. Chinese has no separators between words: all the characters of a sentence run together, so the search engine must first work out which characters form a word and which characters are a word on their own. For example, "weight-loss method" would be segmented into the two words "weight loss" and "method".
  There are basically two Chinese word segmentation methods: one based on dictionary matching and one based on statistics.
  Dictionary-based matching compares the string of Chinese characters to be analyzed against the entries of a pre-built dictionary; when a string scanned from the text matches an existing dictionary entry, the match succeeds and a word is cut off.
  By scanning direction, dictionary-based matching divides into forward matching and reverse matching; by matching-length priority, into maximum matching and minimum matching. Mixing scanning direction and length priority yields methods such as forward maximum matching and reverse maximum matching.
  Dictionary matching is computationally simple, and its accuracy depends largely on the completeness and freshness of the dictionary.
  Statistical word segmentation analyzes large text samples and calculates the probability of characters appearing adjacent to one another; the more often characters appear next to each other, the more likely they are to form a word. The advantage of the statistical method is that it reacts faster to newly coined words, and it also helps to resolve ambiguity.
  The dictionary-based and statistics-based methods each have advantages and disadvantages, so practical segmentation systems mix the two: the combination is fast and efficient, recognizes new words, and resolves ambiguity.
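  A minimal Python sketch of the forward maximum matching variant; the toy dictionary below holds just the Chinese words for "weight loss" (减肥) and "method" (方法) from the example above:

```python
def forward_max_match(text, dictionary, max_word_len=4):
    """Greedily cut the longest dictionary word starting at each position."""
    words = []
    i = 0
    while i < len(text):
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)  # unknown single chars become words
                i += length
                break
    return words

DICTIONARY = {"减肥", "方法"}  # toy dictionary (assumption for illustration)
print(forward_max_match("减肥方法", DICTIONARY))  # ['减肥', '方法']
```

  Reverse maximum matching is the same loop scanning from the end of the string; a statistical segmenter would instead score candidate cuts by adjacency probabilities learned from a corpus.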
  The accuracy of Chinese word segmentation often affects the relevance of search engine rankings. For example, when searching for "search engine optimization" on Baidu, you can see from the snapshot that Baidu treats the six characters of "search engine optimization" as a single word.
  When searching for the same term on Google, the snapshot shows that Google segments it into the two words "search engine" and "optimization". Baidu's segmentation is evidently more reasonable, since search engine optimization is one complete concept; Google tends to segment more finely.
  This difference in segmentation is probably one of the reasons some keywords rank differently in different search engines. For example, Baidu prefers pages that match the search term completely: when searching for "Gouxi blog", pages on which those characters appear consecutively and in full find it easier to rank well on Baidu. Google is different and does not require a complete match. On some pages the words "Gouxi" and "blog" both appear but not next to each other, "Gouxi" near the top and "blog" elsewhere on the page; such a page can still rank well on Google for the search "Gouxi blog".
  How a search engine segments a page depends on the size, accuracy, and quality of its dictionary and segmentation algorithm rather than on the page itself, so there is very little SEO personnel can do about segmentation. The only thing that can be done is to use on-page cues, especially in the page title, h1 tags, and bold text, to prompt the search engine that certain characters should be treated as one word where ambiguity is possible. If a page is about "和服" (kimono), those two characters can be marked in bold. If a page is about "化妆和服装" (makeup and clothing), a string whose middle characters could be misread as "和服", the word "服装" (clothing) can be bolded instead. When the search engine analyzes the page, it then knows that the bolded characters should be treated as one word.
  3. Removing stop words
  Whether in English or Chinese, some words appear frequently in page content yet contribute nothing to its meaning: auxiliary particles such as "的", "地", and "得", interjections such as "啊" and "哈", and adverbs or prepositions such as "从而" (thereby). These words are called stop words because they have no effect on the main meaning of the page. Common English stop words include the, a, an, to, of, and so on.
  Search engines remove these stop words before indexing pages, which makes the subject of the index data more prominent and reduces unnecessary computation.
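  The removal itself is a simple set filter, sketched here with a tiny mixed Chinese/English stop list (real lists are much longer):

```python
STOP_WORDS = {"的", "地", "得", "啊", "the", "a", "an", "to", "of"}

def remove_stop_words(tokens):
    """Drop tokens that add no topical meaning before indexing."""
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words(["减肥", "的", "方法"]))  # ['减肥', '方法']
```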
  4. Eliminate noise
  Most pages also contain content that contributes nothing to the page's theme, such as copyright notices, navigation bars, and advertisements. Take common blog navigation as an example: almost every blog page carries navigation blocks such as article categories and history archives, yet the page itself has nothing to do with the words "category" or "history". When a user searches for "history" or "category", returning a blog post merely because those words appear in its navigation is meaningless and completely irrelevant. These blocks are therefore noise; they can only dilute the theme of the page.
  Search engines need to identify and eliminate this noise, and noisy content is not used in ranking. The basic denoising method is to divide the page into blocks according to its HTML tags, distinguishing the header, navigation, body, footer, advertising, and other areas; blocks repeated many times across a website are usually noise. What remains after denoising is the main content of the page.
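  One simple way to approximate the "repeated blocks are noise" heuristic: count how many of a site's pages each text block appears on, and drop blocks that recur almost everywhere. This is only a sketch of the idea, and the 80% threshold is an arbitrary assumption:

```python
from collections import Counter

def strip_noise(pages, threshold=0.8):
    """pages: one list of text blocks per page. Blocks repeated on
    more than `threshold` of all pages are treated as noise."""
    counts = Counter(block for page in pages for block in set(page))
    cutoff = threshold * len(pages)
    return [[b for b in page if counts[b] <= cutoff] for page in pages]

pages = [
    ["Home | Categories | Archive", "Post one: how spiders crawl"],
    ["Home | Categories | Archive", "Post two: what stop words are"],
    ["Home | Categories | Archive", "Post three: building an index"],
]
print(strip_noise(pages))  # navigation block removed, post bodies kept
```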
  5. Deduplication
  Search engines also need to deduplicate pages.
  The same article often appears repeatedly on different websites and at different URLs of the same website. Search engines do not like this kind of repetitive content: if a user's search returns the same article from different websites across the first two pages of results, the user experience is poor even though every result is on topic. Search engines want to return only one copy of any given article, so before indexing they must identify and delete duplicate content, a process called "deduplication".
  The basic deduplication method is to compute a fingerprint of the page's feature keywords: select the most representative keywords from the page's main content (often those with the highest frequency) and then calculate a digital fingerprint of them. The keywords here are selected after word segmentation, stop-word removal, and noise elimination. Experiments show that selecting about 10 feature keywords usually achieves fairly high accuracy; selecting more words contributes little further improvement.
  A typical fingerprint calculation method is the MD5 algorithm (Message-Digest Algorithm 5). A characteristic of this type of fingerprint algorithm is that any slight change in the input (the feature keywords) produces a very different fingerprint.
  Knowing how search engine deduplication works, SEO personnel should realize that so-called pseudo-original tricks, such as simply inserting particles like "的", "地", and "得" or shuffling paragraph order, cannot escape the deduplication algorithm, because such operations do not change the article's feature keywords. Moreover, the deduplication algorithm is likely applied not only at the page level but at the paragraph level; mixing several articles and cross-shuffling their paragraph order will not make a reprint or plagiarized piece original.
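  A sketch of the fingerprint idea: take the ten highest-frequency keywords (after segmentation, stop-word removal, and denoising) and hash them with MD5. Shuffling paragraph order leaves the frequencies, and therefore the fingerprint, unchanged:

```python
import hashlib
from collections import Counter

def page_fingerprint(tokens, k=10):
    """MD5 over the k highest-frequency feature keywords."""
    top = sorted(word for word, _ in Counter(tokens).most_common(k))
    return hashlib.md5(" ".join(top).encode("utf-8")).hexdigest()

original = ["seo", "ranking", "seo", "spider", "index"]
shuffled = ["ranking", "seo", "index", "spider", "seo"]  # paragraphs reordered
print(page_fingerprint(original) == page_fingerprint(shuffled))  # True
```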
  6. Forward index
  The forward index can also be referred to simply as the index.
  After text extraction, word segmentation, denoising, and deduplication, the search engine has unique, word-based content that reflects the main substance of each page. The indexing program then extracts the keywords, dividing the text according to the segmentation program, and converts each page into a set of keywords, recording each keyword's frequency, number of occurrences, format (for example, whether it appears in the title tag, bold text, an H tag, or anchor text), and position (for example, the first paragraph of the page). Each page is thus recorded as a string of keyword sets, with weight information such as word frequency, format, and position recorded for every keyword.
  The search engine indexing program stores the pages and keywords in this vocabulary structure in the index database. A simplified form of the index vocabulary is shown in Table 2-1.
  Each file corresponds to a file ID, and the file's content is represented as a set of keywords. In fact, in the search engine's index library the keywords have also been converted into keyword IDs. Such a data structure is called a forward index.
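  In Python terms, a forward index is roughly a mapping from file ID to that file's keywords with per-keyword weight data. The field names below are illustrative, not any engine's actual schema:

```python
# file ID -> {keyword -> weight information recorded for that keyword}
forward_index = {
    1: {"seo":    {"freq": 4, "positions": [1, 20], "in_title": True},
        "spider": {"freq": 2, "positions": [5, 9],  "in_title": False}},
    2: {"seo":     {"freq": 1, "positions": [30],       "in_title": False},
        "ranking": {"freq": 3, "positions": [2, 7, 15], "in_title": True}},
}

# Answering "which files contain 'seo'?" requires scanning every file:
print([fid for fid, kws in forward_index.items() if "seo" in kws])  # [1, 2]
```

  The full scan in the last line is exactly the problem the inverted index solves next.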
  7. Inverted index
  The forward index cannot be used directly for ranking. Suppose a user searches for keyword 2: with only a forward index, the ranking program would have to scan every file in the index library, find those containing keyword 2, and then perform relevance calculations. That amount of computation cannot satisfy the requirement of returning ranking results in real time.
  The search engine therefore reconstructs the forward index into an inverted index, converting the mapping from files to keywords into a mapping from keywords to files, as shown in Table 2-2.
  In the inverted index the keyword is the primary key, and each keyword corresponds to the series of files in which it appears. When a user searches for a keyword, the ranking program locates that keyword in the inverted index and can immediately find every file containing it.
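  Inverting the structure is a single pass over the forward index; afterwards the files containing any keyword come from one dictionary lookup instead of a full scan (a simplified, self-contained sketch):

```python
from collections import defaultdict

def build_inverted_index(forward_index):
    """Convert file -> keywords into keyword -> files."""
    inverted = defaultdict(set)
    for file_id, keywords in forward_index.items():
        for word in keywords:
            inverted[word].add(file_id)
    return inverted

forward_index = {1: {"seo", "spider"}, 2: {"seo", "ranking"}}
inverted = build_inverted_index(forward_index)
print(inverted["seo"])  # {1, 2} - found without scanning the whole library
```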
  8. Link relationship calculation
  Link relationship calculation is also a very important part of preprocessing. All mainstream search engines now include link-flow information between web pages among their ranking factors. After fetching page content, the search engine must calculate in advance which links on each page point to which other pages, which inbound links each page has, and what anchor text those links use. These complex link relationships form the link weight of the website and its pages.
  Google's PR (PageRank) value is one of the most important expressions of this kind of link relationship. Other search engines perform similar calculations, though they do not call the result PR.
  Because the number of pages and links is enormous, and the link relationships on the Internet are constantly changing, calculating link relationships and PR takes a long time. Dedicated chapters cover PR and link analysis.
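  As a rough illustration of link-weight calculation, here is a bare-bones PageRank-style iteration over a toy link graph. The 0.85 damping factor comes from the published PageRank formula, but this sketch ignores real-world complications such as dangling pages:

```python
def pagerank(links, damping=0.85, iterations=20):
    """Repeatedly share each page's rank across its outgoing links."""
    n = len(links)
    rank = {page: 1.0 / n for page in links}
    for _ in range(iterations):
        new_rank = {page: (1 - damping) / n for page in links}
        for page, outgoing in links.items():
            for target in outgoing:
                new_rank[target] += damping * rank[page] / len(outgoing)
        rank = new_rank
    return rank

links = {"home": ["about", "post"], "about": ["home"], "post": ["home"]}
print(pagerank(links))  # "home" ranks highest: every other page links to it
```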
  9. Special file processing
  In addition to HTML files, search engines can usually crawl and index a variety of text-based file types, such as PDF, Word, WPS, XLS, PPT, and TXT files, and we often see these file types in search results. However, current search engines cannot yet handle non-text content such as images, video, and Flash, nor can they execute scripts and programs.
  Although search engines have made some progress in recognizing images and extracting text content from Flash, they are still far from the goal of returning results by directly reading image, video, and Flash content. The ranking of image and video content is usually based on the text associated with it; for details, see the integrated search section below.
  Ranking
  After the search engine spider has fetched pages and the indexing program has computed the inverted index, the search engine is ready to handle user searches at any moment. Once the user enters a keyword in the search box, the ranking program calls the index library data, computes the ranking, and displays the results to the user. The ranking process interacts with the user directly.
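  Tying the pieces together, here is a toy end-to-end lookup-and-rank step: candidate files come from the inverted index, and the score here is just summed keyword frequency. Real engines combine many more relevance and link-weight signals; every structure below is illustrative:

```python
def search(query_terms, inverted_index, term_freq):
    """Find candidate files via the inverted index, then rank them."""
    candidates = set()
    for term in query_terms:
        candidates |= inverted_index.get(term, set())
    scores = {doc: sum(term_freq[doc].get(t, 0) for t in query_terms)
              for doc in candidates}
    return sorted(scores, key=scores.get, reverse=True)

inverted_index = {"seo": {1, 2}, "spider": {1}}
term_freq = {1: {"seo": 2, "spider": 1}, 2: {"seo": 5}}
print(search(["seo", "spider"], inverted_index, term_freq))  # [2, 1]
```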
