How to optimize website SEO

1. What is a search engine

A search engine is a computer program that helps users find the content they need. In other words, the search engine matches the information stored in its systems against the user's information need and displays the matching results.

For example, if you want to buy an iPhone12 and want to know its configuration or price, you type "iPhone12" into the search box and click the search button. The keyword "iPhone12" is your information need. In the fraction of a second before the results appear, the search engine's programs have already queried an enormous database by keyword and computed the ranking of every page related to "iPhone12".

2. How search engines work

Behind a search engine sits a very large database that stores a huge number of keywords, each of which maps to many URLs. These URLs are collected bit by bit from across the Internet by programs called "search engine spiders" or "web crawlers". As new websites keep appearing, these hard-working "spiders" crawl the web every day, following one link to the next, downloading content, analyzing and distilling it, and extracting keywords. If the spider decides that a keyword is not yet in the database but is useful to users, it stores it in the back-end database. Conversely, if it decides the content is spam or a duplicate, it discards it and keeps crawling, looking for the latest useful information to save and make available for search. When users search, they retrieve the URLs associated with their keywords and see them in the results.

A single keyword corresponds to multiple URLs, so there is a ranking problem: the URL that best matches the keyword is ranked first. While the spider crawls page content and extracts keywords, there is another question: can the spider actually understand the content? If a page is built from Flash, JavaScript and the like, the spider cannot understand it and gets confused, and even well-chosen keywords are useless. Conversely, if a website's content can be recognized by the search engine, the engine will increase the site's weight and its friendliness toward the site. Optimizing a site so that this process works in its favor is what we call SEO.

3. Search engine work process (three stages)

The working process of search engines can be roughly divided into three stages.

[Stage 1] Crawling and fetching: search engine spiders visit web pages by following links, fetch the HTML code of each page, and store it in the database.

1. What is a spider?

The spider is the program that actually fetches web page data. It is just a computer program, but because its way of working closely resembles a spider in the real world, the industry calls it a search engine spider. The spider program sends a request for a page to the website's server, the server returns the HTML code, and the spider stores the received code in the original-page database. Before a spider visits any website, it first requests the robots.txt file in the site's root directory. If robots.txt forbids search engines from crawling certain files or directories, the spider complies and does not crawl those forbidden URLs.
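For reference, a minimal robots.txt might look like the sketch below. The paths and domain are made up for illustration; any URL matching a Disallow rule is simply skipped by a compliant spider.

```
# Illustrative robots.txt, served from the site root, e.g. https://www.example.com/robots.txt
User-agent: *
Disallow: /admin/
Disallow: /tmp/

Sitemap: https://www.example.com/sitemap.xml
```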

2. How to track links?

To crawl as many pages as possible, search engine spiders follow the links on each page and crawl from one page to the next, just like a spider moving across a web; that is where the name "search engine spider" comes from. The entire Internet is made of linked pages, so in theory, given enough time, a spider could reach every linked page on the Internet. In practice this does not happen: websites and link structures are extremely complex, so spiders need crawling strategies to traverse the web.

3. Crawling strategy

Based on a website's link structure, spider crawling strategies can be divided into two types: depth-first crawling and breadth-first crawling.

  • Depth-first crawling: the spider follows one link forward from the page it has found and keeps going until there are no further links ahead, then returns to the first page and follows the next link forward.
  • Breadth-first crawling: when the spider finds several links on a page, it does not follow a single link all the way down; instead it crawls all the first-level links on that page first, then moves along the second-level pages to their links and on to the third level, and so on.

In practice, a spider's bandwidth and time are limited and it cannot crawl every page. Depth-first and breadth-first are therefore usually combined, so that as many websites as possible are covered (breadth first) while some of each site's inner pages are also reached (depth first).
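As a concrete illustration, here is a minimal sketch of a breadth-first crawl with a depth cap, loosely mirroring the mixed strategy described above. It assumes a Node 18+ or browser environment with a global fetch; the link extraction is deliberately naive and ignores robots.txt, politeness delays and URL normalization that real spiders need.

```typescript
// Naive href extraction for illustration only; real crawlers parse HTML properly.
function extractLinks(html: string, baseUrl: string): string[] {
  const links: string[] = [];
  const hrefPattern = /<a\s[^>]*href=["']([^"'#]+)["']/gi;
  let match: RegExpExecArray | null;
  while ((match = hrefPattern.exec(html)) !== null) {
    try {
      links.push(new URL(match[1], baseUrl).toString()); // resolve relative URLs
    } catch {
      /* ignore malformed hrefs */
    }
  }
  return links;
}

async function crawl(seedUrl: string, maxDepth = 2, maxPages = 50): Promise<Map<string, string>> {
  const pages = new Map<string, string>();   // URL -> raw HTML (toy "original page database")
  const visited = new Set<string>([seedUrl]);
  let frontier: string[] = [seedUrl];        // all URLs at the current depth level

  for (let depth = 0; depth <= maxDepth && frontier.length > 0; depth++) {
    const nextLevel: string[] = [];
    for (const url of frontier) {
      if (pages.size >= maxPages) return pages;  // the crawl budget is limited
      let html: string;
      try {
        const response = await fetch(url);
        html = await response.text();
      } catch {
        continue;                                // skip unreachable pages
      }
      pages.set(url, html);
      for (const link of extractLinks(html, url)) {
        if (!visited.has(link)) {
          visited.add(link);
          nextLevel.push(link);                  // breadth first: queue it, don't dive
        }
      }
    }
    frontier = nextLevel;
  }
  return pages;
}
```

Calling crawl("https://www.example.com/") would return a small map of URL to HTML, the toy equivalent of the original-page database described above.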

4. Attract spiders

As explained above, spiders cannot crawl and index every page, so part of SEO is using various methods to attract spiders to crawl and index more pages of your website. Since not every page can be included, the spider tries to crawl the important pages first. How does a spider decide which pages are more important? Several factors influence this:

  • Website and page weight: high-quality, long-established websites are given higher weight.
  • Page update frequency: frequently updated sites carry higher weight.
  • Inbound links: whether external or internal, a page must have an inbound link for the spider to reach it at all. High-quality inbound links also tend to increase the depth to which the page's outbound links are crawled.
  • Click distance from the homepage: in general the homepage has the highest weight, most external links point to it, and spiders visit it most often; so the fewer clicks a page is from the homepage, the higher its weight and the better its chance of being crawled.

5. Address library

Search engines build an address library to avoid fetching and crawling the same URL repeatedly. This library records both pages that have already been crawled and pages that have been discovered but not yet crawled. Is every URL in the library guaranteed to be crawled? No. Some entries are manually added seed URLs, and some are URLs that webmasters submit through the search engine's submission page (typically used for personal blogs or small websites). After the spider fetches a page, it parses out the URLs it contains, compares them against the address library, and stores any that are not already there.
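A toy version of such an address library might look like the sketch below; seed URLs, webmaster submissions and URLs parsed out of crawled pages would all pass through the same add() gate. The class and method names are invented for illustration; real engines persist this data in large distributed stores.

```typescript
// Separates "discovered" URLs from "already crawled" URLs so nothing is fetched twice.
class AddressLibrary {
  private known = new Set<string>();     // every URL ever seen
  private crawled = new Set<string>();   // URLs that have actually been fetched
  private pending: string[] = [];        // discovered but not yet crawled

  add(url: string): void {
    if (!this.known.has(url)) {
      this.known.add(url);
      this.pending.push(url);
    }
  }

  next(): string | undefined {
    const url = this.pending.shift();
    if (url !== undefined) this.crawled.add(url);
    return url;
  }

  hasBeenCrawled(url: string): boolean {
    return this.crawled.has(url);
  }
}
```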

6. File storage

The data fetched by search engine spiders is stored in the original-page database, where each page's data is exactly the same HTML that a user's browser would receive. Each URL is assigned a unique file ID.

 7. Detection of copied content

While crawling and fetching files, the spider also performs a degree of duplicate-content detection. When it encounters a low-weight site carrying large amounts of plagiarized content, it is likely to stop crawling. This is why some webmasters see spider visits in their log files yet never see those pages actually indexed.

[Stage 2] Preprocessing: the indexing program performs text extraction, Chinese word segmentation, indexing and other processing on the crawled page data, preparing it for the ranking program to call on.

 

The search engine's database holds far too much data for rankings to be computed from scratch after the user types a keyword, yet search still feels instant. The key is the preprocessing stage, which, like crawling, is completed in advance in the background. Some people equate preprocessing with indexing, but indexing is only one major step of it. So what is an index? An index is a structure that sorts the values of one or more columns in a database table. Five tasks are carried out before indexing:

 1. Extract text

The first thing the search engine does is remove HTML tags and JavaScript from the HTML file and extract the page's text content for ranking. Besides the visible text, search engines can also extract certain invisible text, such as the text in meta tags, image alt text, alternative text for Flash files, and link anchor text.
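A rough sketch of this step, using naive regular expressions purely for illustration (real engines work on a parsed DOM tree, not regexes):

```typescript
// Strip scripts, styles and tags, but keep some "invisible" text such as
// image alt text and the meta description, as described above.
function extractText(html: string): string {
  const altTexts = [...html.matchAll(/<img\s[^>]*alt=["']([^"']*)["']/gi)].map(m => m[1]);
  const metaDescription =
    html.match(/<meta\s[^>]*name=["']description["'][^>]*content=["']([^"']*)["']/i)?.[1] ?? "";

  const visibleText = html
    .replace(/<script[\s\S]*?<\/script>/gi, " ")   // drop JavaScript
    .replace(/<style[\s\S]*?<\/style>/gi, " ")     // drop CSS
    .replace(/<[^>]+>/g, " ")                      // drop remaining tags
    .replace(/\s+/g, " ")
    .trim();

  return [visibleText, metaDescription, ...altTexts].join(" ").trim();
}
```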

 2. Chinese word segmentation

Chinese sentences have no separators between words; all the characters run together. The search engine therefore has to work out which characters form a word together and which characters are words by themselves. For example, "波司登羽绒服" ("Bosideng down jacket") is segmented into two words: "波司登" ("Bosideng") and "羽绒服" ("down jacket"). There are two general approaches to Chinese word segmentation:

  • Dictionary matching: the string of Chinese characters to be analyzed is matched against the entries of a pre-built dictionary; when a substring of the input matches an existing dictionary entry, it is split off as a word.
  • Statistical segmentation: statistics-based methods analyze large volumes of text and count how often characters appear next to each other; the more often characters co-occur, the more likely they are to form a word. Statistical methods adapt to newly coined words more quickly and also help resolve ambiguity.

Dictionary-based and statistics-based segmentation each have advantages and disadvantages. Practical segmentation systems mix the two methods, which is fast and efficient, recognizes new words, and resolves ambiguity.
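As a small illustration of the dictionary-matching approach, here is a forward maximum matching segmenter with a tiny stand-in dictionary; real systems combine far larger dictionaries with statistical models.

```typescript
// Forward maximum matching: at each position, try the longest dictionary word first.
function segment(text: string, dictionary: Set<string>, maxWordLength = 4): string[] {
  const words: string[] = [];
  let i = 0;
  while (i < text.length) {
    let matched = "";
    for (let len = Math.min(maxWordLength, text.length - i); len >= 1; len--) {
      const candidate = text.slice(i, i + len);
      if (len === 1 || dictionary.has(candidate)) {   // fall back to a single character
        matched = candidate;
        break;
      }
    }
    words.push(matched);
    i += matched.length;
  }
  return words;
}

const dict = new Set(["波司登", "羽绒服"]);
console.log(segment("波司登羽绒服", dict)); // ["波司登", "羽绒服"]
```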

 3. Removing stop words

What is a stop word? It is a word that appears frequently in page content but contributes nothing to its meaning: particles such as "的", "地", "得"; interjections such as "啊", "哈", "呀"; and connectives such as "从而", "以", "却". Common English stop words include "the" and "of". These words are called stop words because they have no effect on the main meaning of the page. Search engines remove stop words for two main reasons (a tiny filtering sketch follows the list):

  • to make the subject matter of the indexed data stand out and reduce unnecessary computation;
  • to detect whether your content largely duplicates content already in the database.
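A trivial sketch of this filtering step over an already-segmented token list; the stop-word list here is a tiny stand-in for the much longer, language-specific lists real systems use.

```typescript
const STOP_WORDS = new Set(["的", "地", "得", "啊", "哈", "呀", "the", "of", "a"]);

// Drop tokens that carry no meaning before indexing.
function removeStopWords(tokens: string[]): string[] {
  return tokens.filter(token => !STOP_WORDS.has(token.toLowerCase()));
}

console.log(removeStopWords(["weight loss", "the", "method"])); // ["weight loss", "method"]
```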

 4. Remove noise

Noise here does not mean audible noise; it refers specifically to redundant content such as copyright notices, navigation bars and advertisements. Search engines need to identify and eliminate this noise, and noisy content plays no part in ranking. The basic method of noise elimination is to divide the page into blocks by HTML tags, distinguishing the header, navigation, body, footer, advertisement and other areas; blocks that repeat across a large number of the site's pages are usually noise and only dilute the page's topic. What remains after de-noising is the main content of the page.

 5. De-duplication

The same article often appears on different websites, and on different URLs of the same website. Search engines do not like this kind of repetition: if users saw the same article from different sites across the first two result pages, the experience would be poor even though every result is relevant. The search engine wants to return only one copy of a given article, so it has to identify and discard duplicate content before indexing. This step is called de-duplication.
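As a toy illustration, the sketch below hashes the extracted body text and treats a repeated hash as a duplicate. This only catches exact duplicates; production engines use near-duplicate fingerprinting techniques that this sketch does not attempt. It assumes a Node environment for node:crypto.

```typescript
import { createHash } from "node:crypto";

const seenFingerprints = new Set<string>();

// Returns true if this exact body text has already been indexed.
function isDuplicate(extractedText: string): boolean {
  const normalized = extractedText.replace(/\s+/g, " ").trim().toLowerCase();
  const fingerprint = createHash("sha256").update(normalized).digest("hex");
  if (seenFingerprints.has(fingerprint)) return true;
  seenFingerprints.add(fingerprint);
  return false;
}
```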

 

After the above five steps, the search engine has unique, word-based content that reflects the main substance of each page. The indexing program then runs the extracted text through the segmenter and converts each page into a set of keywords, recording for each keyword its frequency and number of occurrences on the page, its format (for example title tag, bold text, H tags, anchor text) and its position (for example which paragraph it appears in). All of this is recorded as weights and stored in the structure that holds these keyword sets: the index library, also called the keyword index table.

6. Forward index

Each page is converted into a set of keywords, and for each keyword the engine records its frequency on the page, its number of occurrences, its format (for example whether it appears in the title tag, bold text, H tags or anchor text) and its position (for example the first paragraph, the body text, and so on). Each page can thus be recorded as a string of keywords together with weight information such as each keyword's frequency, format and position. Each file corresponds to a file ID, and the file's content is represented as a set of keywords. At this point the keywords in the index library have not yet been converted into keyword IDs. This data structure is called the forward index.
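The forward index can be pictured as a map from page ID to that page's keyword records, roughly like the illustrative structure below (the field names are invented for the sketch):

```typescript
interface KeywordOccurrence {
  term: string;
  frequency: number;       // how often the term appears on the page
  inTitle: boolean;        // "format" signals: title tag, bold, H tags...
  inHeading: boolean;
  firstPosition: number;   // where the term first appears in the text
}

type ForwardIndex = Map<number, KeywordOccurrence[]>;  // page ID -> its keyword records

const forwardIndex: ForwardIndex = new Map([
  [1, [{ term: "羽绒服", frequency: 12, inTitle: true, inHeading: true, firstPosition: 0 }]],
  [2, [{ term: "羽绒服", frequency: 3, inTitle: false, inHeading: false, firstPosition: 47 }]],
]);
```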

 7. Inverted index

The forward index cannot be used directly for ranking. Suppose a user searches for keyword 2: with only a forward index, the ranking program would have to scan every file to find the ones containing that keyword, which cannot return a ranking in real time. This is where the inverted index comes in. In an inverted index the keyword becomes the primary key, and each keyword maps to the series of files that contain it, so that when a user searches for a keyword, the ranking program can look the keyword up in the inverted index and immediately find all the corresponding files.
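Building the inverted index from a forward index is essentially flipping that map around, as in this sketch (the types mirror the forward-index illustration above and are illustrative only):

```typescript
type ForwardIndex = Map<number, { term: string; frequency: number }[]>;
type InvertedIndex = Map<string, Set<number>>;   // keyword -> IDs of pages containing it

function invert(forward: ForwardIndex): InvertedIndex {
  const inverted: InvertedIndex = new Map();
  for (const [pageId, occurrences] of forward) {
    for (const { term } of occurrences) {
      if (!inverted.has(term)) inverted.set(term, new Set());
      inverted.get(term)!.add(pageId);   // the keyword now points back to its pages
    }
  }
  return inverted;
}
```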

8. Handling special file types

Besides HTML files, search engines can usually crawl and index a variety of text-based file types, such as PDF, Word, WPS, XLS, PPT and TXT files, and we often see these file types in search results. However, current search engines cannot process non-text content such as images, video and Flash, nor can they execute scripts and programs. Although search engines have made some progress in recognizing images and extracting text from Flash, they are still far from being able to return results by directly reading image, video or Flash content; the ranking of image and video content is usually based on the text associated with it. So in SEO, use these formats as sparingly as possible on your website.

9. Calculating link relationships

After crawling a page, the search engine must also compute in advance which pages the links on that page point to, which inbound links each page has, and what anchor text those links use. These complex link relationships are what form the link weight of a website and of each page.

[Stage 3] Ranking: after the user enters a keyword, the ranking program calls on the index library data, calculates relevance, and generates a search results page in a certain format.

1. Search term processing

  • Chinese word segmentation: as with page indexing, the search term must be segmented, converting the query string into a word-based combination of keywords. The segmentation principle is the same as for pages.
  • Stop word removal: as with indexing, the search engine removes stop words from the search term to maximize ranking relevance and efficiency.
  • Operator handling: operators such as plus and minus signs must be recognized and handled accordingly.
  • Spelling correction: if the user types an obviously misspelled word, the search engine suggests the correct word or spelling.
  • Universal search triggers: a search for a celebrity, for example, may also bring up images, videos and other content types, which suits hot topics.
     

2. File matching

The inverted index makes file matching fast. Suppose the user searches for "keyword 2 keyword 7": the ranking program only needs to look up "keyword 2" and "keyword 7" in the inverted index to find all the pages containing each word; a simple intersection then yields every page containing both, say file 1 and file 6.
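In code, this matching step is just an intersection of posting lists, as in the sketch below, which reproduces the "file 1 and file 6" example with made-up data:

```typescript
type InvertedIndex = Map<string, Set<number>>;

function matchFiles(index: InvertedIndex, terms: string[]): Set<number> {
  const postings = terms.map(term => index.get(term) ?? new Set<number>());
  postings.sort((a, b) => a.size - b.size);          // start from the rarest term
  let result = new Set(postings[0] ?? []);
  for (const posting of postings.slice(1)) {
    result = new Set([...result].filter(id => posting.has(id)));  // intersect
  }
  return result;
}

const index: InvertedIndex = new Map([
  ["keyword2", new Set([1, 3, 6])],
  ["keyword7", new Set([1, 6, 9])],
]);
console.log(matchFiles(index, ["keyword2", "keyword7"])); // files 1 and 6
```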

3. Initial subset selection

There are a vast number of pages on the Internet, and a single keyword can match tens of millions of them. If the search engine computed relevance for every matching page up front, it would simply take too long; and users do not need to see thousands of results anyway, one or two useful pages are enough. So the search engine first selects a subset of around 100 files for the search term and ranks only those. Which files make the cut depends on how well your pages match the user's keyword and on page weight: pages with higher weight enter the search engine's pre-selected subset.

4. Relevance calculation

Once the initial subset is selected, keyword relevance is calculated for the pages in it. The main factors affecting relevance are the following (a toy scoring sketch follows the list):

  1. How common the keyword is: the more common a word, the less it contributes to the meaning of the search term; the less common, the more it contributes. Suppose the user searches for "we DKI": the word "we" is extremely common and appears on a huge number of pages, so it contributes little to identifying what the query means, while pages containing the rarer word "DKI" are far more relevant to the query "we DKI".
  2. Word frequency and density: in the absence of keyword stuffing, it is generally held that the more often the search term appears on a page and the higher its density, the more relevant the page is to the search term.
  3. Keyword position and format: as mentioned in the indexing section, the index records the format and position of each page keyword. Keywords appearing in more important positions, such as the title tag, bold text or an H1, indicate a page that is more relevant to them. This is exactly what on-page SEO sets out to improve.
  4. Keyword proximity: after segmentation, a contiguous, exact match of the keywords indicates the strongest relevance. For example, when searching for "weight loss method", a page where the whole phrase "weight loss method" appears contiguously is the most relevant; if "weight loss" and "method" are not contiguous but appear close together, the page is still considered slightly more relevant.
  5. Link analysis and page weight: besides on-page factors, the link and weight relationships between pages also affect keyword relevance, the most important signal being anchor text: the more inbound links a page has whose anchor text contains the search term, the more relevant the page is. Link analysis also takes in the topic of the linking page itself, the text surrounding the anchor text, and so on.
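To make the interplay of these factors concrete, here is a purely illustrative scoring function that combines term density, position/format bonuses, anchor-text matches and page weight. The field names and weights are invented for the sketch and are not any real engine's formula.

```typescript
interface PageSignals {
  termFrequency: number;         // occurrences of the query terms on the page
  wordCount: number;             // total words on the page
  inTitle: boolean;              // query terms appear in the title tag
  inH1: boolean;                 // query terms appear in an H1
  inboundAnchorMatches: number;  // inbound links whose anchor text matches the query
  pageWeight: number;            // externally computed page weight, 0..1
}

function relevanceScore(p: PageSignals): number {
  const density = p.wordCount > 0 ? p.termFrequency / p.wordCount : 0;
  let score = 10 * Math.min(density, 0.1);          // cap density so keyword stuffing doesn't pay
  if (p.inTitle) score += 2;                        // position/format bonuses
  if (p.inH1) score += 1;
  score += 0.5 * Math.log1p(p.inboundAnchorMatches); // anchor-text links, diminishing returns
  score += 3 * p.pageWeight;                         // page weight from link analysis
  return score;
}
```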

5. Ranking filtering and adjustment

After relevance is calculated, the overall ranking is largely determined. The search engine may then apply filtering algorithms that adjust the ranking slightly, the most important of which impose penalties. Some pages suspected of cheating would rank high on normal weight and relevance calculations alone, but the engine's penalty algorithms may push them far down in this last step. Typical examples are Baidu's "position 11" penalty and Google's -6, -30 and -950 penalties.

6. Ranking display

Once the ranking is determined, the ranking program displays the results using each page's title tag, description tag and snapshot date. Sometimes the search engine generates a page summary dynamically instead of using the page's own description tag.

7. Search cache

Recomputing the ranking for every search would be a huge waste. Search engines therefore keep the most common search terms in a cache; when a user searches for one of them, the results come straight from the cache without file matching or relevance calculation, which greatly improves ranking efficiency and shortens response time.
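A toy version of such a cache, keyed by the normalized query string, with a crude size cap standing in for the eviction and expiry policies a real engine would use:

```typescript
const MAX_CACHE_ENTRIES = 10_000;
const queryCache = new Map<string, number[]>();   // normalized query -> ranked page IDs

function cachedSearch(query: string, search: (q: string) => number[]): number[] {
  const key = query.trim().toLowerCase();
  const hit = queryCache.get(key);
  if (hit) return hit;                            // skip matching and relevance calculation

  const results = search(key);
  if (queryCache.size >= MAX_CACHE_ENTRIES) {
    // Evict the oldest entry; a real cache would use LRU/LFU plus expiry.
    const oldestKey = queryCache.keys().next().value;
    if (oldestKey !== undefined) queryCache.delete(oldestKey);
  }
  queryCache.set(key, results);
  return results;
}
```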

8. Query and click log

The search engine records the IP address of the search user, the keywords searched, the search time, and which result pages are clicked. The data in these log files is of great significance for search engines to judge the quality of search results, adjust search algorithms, and anticipate search trends.
 

4. Overview of front-end SEO specifications

1. Write a sensible title, description and keywords; search engines weight these three in descending order. The title only needs to stress the key points; the description should summarize the page's content at a high level without piling up keywords; keywords should list the page's important keywords.

2. Semantic HTML tags

3. Non-decorative images must have an alt attribute

4. Put important content at the top of the HTML so it loads first. Search engines crawl HTML from top to bottom, so this ensures the important content gets crawled.

5. Only one h1 tag appears on each page

6. Try not to build pages out of Flash, images or video, because search engines cannot crawl that content

7. Use iframes sparingly; their content cannot be crawled

8. Keep the page structure as flat as possible; levels that are too deep hinder crawling

9. Content loaded asynchronously (Ajax) also cannot be crawled by search engines. Output important information directly in the HTML, which benefits both user experience and SEO

10. Exchange friendly links so that other people's websites carry inbound links to your own

11. Submit unindexed sites to the URL submission portals of the major search engines

12. Improve website speed, which is an important indicator of search engine ranking

13. A good 404 page not only improves the spider's experience but also the user's

5. The front-end SEO specifications in detail

[1] Optimizing website structure and layout

Generally speaking, the fewer structural levels a website has, the easier it is for "spiders" to crawl and the easier it is to get indexed. On small and medium-sized sites, spiders are reluctant to go deeper than three directory levels. Surveys also suggest that if a visitor has not found the information they need within three clicks, they are likely to leave, so a three-level directory structure is also a matter of user experience. To achieve this, we need to do the following:

  1. Control the number of homepage links: the homepage carries the highest weight on the site. If it has too few links, there is no "bridge" for the spider to follow down to the inner pages, which directly reduces the number of pages indexed. But there should not be too many homepage links either: a pile of links with no real substance hurts the user experience, dilutes the homepage's weight, and leads to poor indexing.
  2. Flatten the directory hierarchy: try to make sure the "spider" can reach any inner page of the site within three jumps.
  3. Navigation optimization: navigation should use text wherever possible. Image navigation can be used as a supplement, but the image markup must be optimized: the <img> tag needs alt and title attributes so the search engine understands what the navigation points to, and users still see hint text if the image fails to load. Secondly, add breadcrumb navigation to every page. For users, it shows where they are and where the current page sits within the site, helps them quickly understand the site's organization, gives them a better sense of place, and offers a way back to each level. For "spiders", it makes the site structure clear and adds a large number of internal links, which helps crawling and reduces the bounce rate.
  4. Website structure and layout: page header: logo, main navigation and user information. Main area: the body text on the left, including breadcrumb navigation and the content itself; popular and related articles on the right, which keep visitors reading longer and, for the "spider", act as related links that strengthen the page's relevance and weight. Page footer: copyright information and friendly links.
  5. Put important HTML code first: search engines crawl HTML from top to bottom. Use this to make the main content's code come first and push unimportant code such as advertisements further down. For example, with the markup order of the left and right columns fixed, you can use float: left; and float: right; to swap the two columns' displayed positions at will, keeping the important code first so the crawler reads it first. The same applies to multi-column layouts.
  6. Control page size, reduce HTTP requests, and improve load speed: a page should ideally stay under 100 KB. If it is too large the page loads slowly, the user experience suffers, visitors are lost, and once a request times out the "spider" leaves as well.

[2] Webpage code optimization

  1. Highlight important content: design the title, description and keywords sensibly. The <title> should stress only the key points: put important keywords first, do not repeat keywords, and give every page a different title. <meta name="keywords"> should list just the page's few important keywords, never stuffed. <meta name="description"> should be a high-level summary of the page's content, neither too long nor stuffed with keywords, and different on every page. (A small server-side rendering sketch of these tags follows this list.)
  2. Write semantic HTML: make the code as semantic as possible, using the right tag in the right place so that both human readers and "spiders" understand the page at a glance. For example: h1-h6 for headings, the <nav> tag for the main navigation, ul or ol for list content, strong for important text, and so on.
  3. <a> tags: for internal links, add a title attribute to describe the target so both visitors and "spiders" know what it is. For external links to other websites, add rel="nofollow" to tell the "spider" not to follow the link, because once the spider crawls off to an external link it will not come back.
  4. Body headings and the <h1> tag: the h1 tag carries weight of its own, and the "spider" treats it as the most important element. A page should have one and only one H1 tag, placed on the page's most important heading; on the homepage, for example, the logo can carry the H1 tag. Use <h2> for subheadings, and do not scatter h tags elsewhere indiscriminately.
  5. <img> should carry an alt attribute: when the connection is slow or the image URL is broken, the alt text tells the user what the image is for even though it is not displayed. Also set the image's width and height to improve page loading.
  6. Tables should use the <caption> title tag: the caption element defines a table's title, and the caption tag must immediately follow the table tag.
  7. <strong> and <em> tags: <strong> is highly valued by search engines; it highlights keywords and marks important content. The emphasis of <em> is second only to <strong>. The <b> and <i> tags are purely presentational and have no SEO effect.
  8. Do not output important content with JS: "spiders" do not read content generated by JavaScript, so important content must be present in the HTML. The SEO shortcomings of front-end frameworks can be compensated for with server-side rendering.
  9. Minimize the use of iframes: "spiders" generally do not read their content.
  10. Search engines filter out content hidden with display: none.
  11. Spiders only follow the href of <a> tags: a link like <a href="Default.aspx?id=1">Test</a> should ideally not carry query parameters; prefer <a href="Default.aspx">Test</a>. If the URL carries parameters, the spider will not consider it, and URL rewriting is then required.
  12. Spiders do not execute JavaScript: in other words, a link driven by onclick on an <a> tag will not be crawled.
  13. Spiders can only fetch pages requested with GET, not pages requested with POST.
  14. Create a robots.txt file: we want the site's front-facing pages to be crawled by spiders, but not the back-office pages, and spiders are not smart enough to tell on their own which pages are which. For this you create a file named robots.txt in the site root (note that robots.txt is a convention, not a command, and well-behaved crawlers follow it); it is the first file a search engine requests when it visits a website. A sample robots.txt appears in the spider section above.
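As promised above, here is a minimal sketch of emitting the title, description and keywords tags from the server (for example inside a server-rendered template), so spiders see them in the raw HTML without having to execute JavaScript. The helper names and fields are invented for the sketch, not any particular framework's API.

```typescript
interface PageMeta {
  title: string;
  description: string;
  keywords: string[];
}

// Escape attribute/text values so user-provided metadata cannot break the markup.
function escapeHtml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Render the three SEO tags into the <head> of a server-rendered page.
function renderHead(meta: PageMeta): string {
  return [
    `<title>${escapeHtml(meta.title)}</title>`,
    `<meta name="description" content="${escapeHtml(meta.description)}">`,
    `<meta name="keywords" content="${escapeHtml(meta.keywords.join(","))}">`,
  ].join("\n");
}

console.log(renderHead({
  title: "iPhone12 price and specs | Example Store",
  description: "Example Store lists the price, configuration and availability of the iPhone12.",
  keywords: ["iPhone12", "price", "configuration"],
}));
```

The same idea extends to any content that matters for SEO: render it into the HTML on the server rather than injecting it with client-side JavaScript.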

