Spider pool website program development

First, the principle of the spider pool (the following explanation is taken from the Internet):

Web pages generally contain hyperlinks, and those hyperlinks connect most of the pages on the Internet into a structure that resembles a spider web. One job of a search-engine spider is to follow hyperlinks and crawl as many not-yet-crawled pages as possible. A spider pool, in other words, artificially creates a constantly growing web, traps the spider inside it, and lets it endlessly crawl pages within the site.
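To make the principle concrete, here is a minimal sketch in PHP (the author's language) of the "constantly growing web" idea: a page that always emits a batch of links to freshly invented internal URLs, so a crawler following them never runs out of uncrawled pages. The /story/ path pattern and the link count are my illustrative choices, not taken from the author's code.

```php
<?php
// Minimal sketch of the "constantly growing web": every page emits
// a batch of internal links to never-before-seen URLs, so a crawler
// that follows links always finds new pages to fetch. The
// /story/<id>.html pattern and the link count are hypothetical.
function spiderTrapLinks(int $count = 20): string
{
    $html = "<ul>\n";
    for ($i = 0; $i < $count; $i++) {
        $id = mt_rand(1, 100000000); // random ids: an endless URL supply
        $html .= "  <li><a href=\"/story/{$id}.html\">Story {$id}</a></li>\n";
    }
    return $html . "</ul>\n";
}

echo spiderTrapLinks();
```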

Let's start. Step one: gather information.

First, some past experience. If I remember correctly, this is the third time I have gone looking for information on spider pools, and every time there has been an unexpected harvest. The first time I came into contact with the idea, I was completely lost in the fog. The second time, I bought source code outright, worked out the mechanics from the code, and along the way learned the best language in the world (PHP). At the same time, I rebuilt the CMS I had on hand into a "parasite mutant single-cell" version (a name I made up on the spot, just to get the idea across).

The modifications were mainly to the data and the presentation. On the data side, each novel is cut into 500-word chunks and saved into the CMS's info table. Every time the home page is accessed, 20 records are randomly extracted from the 1,000-plus stored records (to tell the spider that the site has been updated), and the content is shuffled again when the detail page is rendered (this step is a bit redundant, and the later problems with indexing may well have come from here).
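A rough sketch of that data-side logic, under assumed names: an `info` table with `id`, `title`, and `content` columns, and placeholder PDO credentials. None of this is the author's actual schema.

```php
<?php
// Sketch of the data layer described above. The `info` table, its
// columns, and the PDO credentials are assumptions, not the
// author's actual schema.
$pdo = new PDO('mysql:host=localhost;dbname=cms;charset=utf8mb4',
               'user', 'password');

// Home page: 20 random records per visit, so the spider always
// sees a "freshly updated" page. ORDER BY RAND() is acceptable at
// this scale (~1,000 rows).
$rows = $pdo->query(
    'SELECT id, title FROM info ORDER BY RAND() LIMIT 20'
)->fetchAll(PDO::FETCH_ASSOC);

// Detail page: shuffle the chunk's sentences again before display
// (the redundant step the author suspects hurt indexing later).
function shuffledContent(string $content): string
{
    $parts = preg_split('/(?<=[。!?.])/u', $content, -1, PREG_SPLIT_NO_EMPTY);
    shuffle($parts);
    return implode('', $parts);
}
```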

This time was no exception. I am in the habit of reading the first three or four pages of results on Baidu, but this time the top ten results already suited my taste. With the previous experience, I could quickly locate what I needed and take notes.

Step two: practice.

I spent a while taking jobs on a certain "Bajie" freelance platform, mostly PHP odd jobs, and that got me through the door. I also found that the PHP world is still very lively, and even my half bucket of water has to be scooped out and put to use.

The main tools are the Snoopy class and a readability class, together with common functions for string replacement, substring interception, and regular expressions. There is also one important piece, the pseudo-static rewrite: because I did not create a real story directory, I needed to redirect requests for the story directory to where I wanted them; otherwise I would have had to actually fill that story directory with files.
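A rough sketch of how these pieces could fit together, assuming the "SNOOP class" refers to the well-known Snoopy HTTP client (Snoopy.class.php) and that a single front controller handles the pseudo-static routing. The source URL, regexes, and route pattern are illustrative assumptions, not the author's code.

```php
<?php
// Fetch-and-clean sketch, assuming the well-known Snoopy class.
require 'Snoopy.class.php';

$snoopy = new Snoopy();
$snoopy->fetch('http://example.com/novel/1.html'); // hypothetical source
$raw = $snoopy->results;

// Replacement, interception, and regular expressions, as described:
$text  = strip_tags($raw);                     // drop markup
$text  = preg_replace('/\s+/u', ' ', $text);   // collapse whitespace
$chunk = mb_substr($text, 0, 500, 'UTF-8');    // intercept a 500-char chunk

// Pseudo-static: /story/ has no real files, so a rewrite rule
// (e.g. in .htaccess) sends such requests to this script, which
// dispatches by URL pattern.
$uri = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);
if (preg_match('#^/story/(\d+)\.html$#', $uri, $m)) {
    // show_detail((int)$m[1]);  // hypothetical detail renderer
}
```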

The homepage and detail-page content is fetched, processed, and then displayed. Take a look at the site first (http://www.relon.net.cn/); forgive me for showing my clumsy work in front of the masters.

Step three: wait for the results.

After reading the author's website, my hands began to itch. I imitated the general functionality, launched the site, and put the statistics-tracking code in place. The author says his site gets 20,000 IPs a day, and I get a little excited just thinking about it.

The original plan was to modify the parasite site to check each visitor's IP: if the visitor is a search-engine spider, add links to content the spider has not yet crawled onto the page it is visiting.
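A minimal sketch of that visitor check. The author plans to judge by IP; the version below branches on the user-agent string instead, which is a common (if easily spoofed) complement to IP checks. The helper name and the signature list are my illustrative choices.

```php
<?php
// Minimal spider detection. The UA fragments are real spider
// signatures; the helper name and the branch are illustrative.
function isSearchSpider(): bool
{
    $ua = $_SERVER['HTTP_USER_AGENT'] ?? '';
    foreach (['Baiduspider', 'Googlebot', 'Sogou', 'bingbot'] as $sig) {
        if (stripos($ua, $sig) !== false) {
            return true;
        }
    }
    return false;
}

if (isSearchSpider()) {
    // Serve extra links to not-yet-crawled pages, per the plan above.
}
```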

The essence of the spider pool is to "trap the spider inside and let it constantly crawl the pages within the site." At this stage the site does not actually trap spiders yet; I am too kind-hearted to do a bad thing properly. Later I will keep filling in this pit (trapping the spider) to achieve the effect of a real pool.
