The Battle of Web Crawlers

The basic principles of web crawlers

Before you begin, here are the basic principles of a crawler, i.e. its general workflow:

  1. Get the initial URL(s).
  2. Crawl the page at each URL: extract the required data (or store it in a database for later analysis), and collect the new URL addresses found on the page. URLs that have already been crawled are stored in a crawled list, which is used to decide whether a URL still needs to be crawled.
  3. Put the new URLs into the queue of URLs to be crawled.
  4. Read the next URL from the to-crawl queue and repeat the crawling process above.
  5. Stop crawling when the crawler's configured stop condition is met.
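
As a concrete illustration of the workflow above, here is a minimal sketch in Python using the requests and BeautifulSoup libraries. The start URL, the page limit, and the "extract data" step are placeholders I added, not part of the original article:

from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

start_url = "http://example.com/"          # placeholder initial URL
to_crawl = deque([start_url])              # queue of URLs to be crawled
crawled = set()                            # URLs that have already been crawled
max_pages = 100                            # stop condition: page limit

while to_crawl and len(crawled) < max_pages:
    url = to_crawl.popleft()               # read the next URL from the queue
    if url in crawled:
        continue
    resp = requests.get(url, timeout=10)
    crawled.add(url)
    soup = BeautifulSoup(resp.text, "html.parser")
    # extract and store the data you need here
    for a in soup.find_all("a", href=True):
        new_url = urljoin(url, a["href"])  # resolve relative links
        if new_url not in crawled:
            to_crawl.append(new_url)       # new URLs go to the end of the queue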

Problems a web crawler needs to solve:

First, is the website public? (i.e., is login required?)

If login is not required, how does the site identify each user: (1) session, (2) cookies, (3) IP address.
If login is required, what is its purpose? (Is it really necessary? Do you have to log in every time?)
If you have to log in every time, how do you log in programmatically (i.e., in code)?

Second, how is the page loaded? (i.e., the dynamic-loading problem)

Where can the required data be found: (1) in the HTML, (2) in JSON responses.
What if the data is loaded dynamically via Ajax? (A small sketch of this case follows.)
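
When the data is loaded via Ajax, one common approach is to find the JSON endpoint in the browser's network panel and request it directly. A minimal sketch; the URL and query parameters below are hypothetical:

import requests

# hypothetical Ajax endpoint discovered in the browser's network panel
api_url = "http://example.com/api/list"
params = {"page": 1, "size": 20}    # hypothetical query parameters

resp = requests.get(api_url, params=params, timeout=10)
items = resp.json()                 # the data arrives as JSON rather than HTML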

Third, can you request the page directly, and how? (How are the request headers constructed? Which parts of the request header are actually necessary?)

How to crawl gracefully

Precision: every byte of data requested is data we want; do not request extra garbage.
Speed: use optimal parsing strategies, deduplication strategies, and storage strategies.
Stability: have a complete exception-handling mechanism and the ability to resume from a breakpoint after a crash.
Quality: do not affect the normal operation of the target site in any way.
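
A minimal sketch of what "stability" and "quality" can look like in code: retry on exceptions instead of crashing, and pause between requests so the target site is not overloaded. The retry count, delay, and URLs are arbitrary placeholders, not from the original article:

import time
import requests

def polite_get(url, retries=3, delay=1.0):
    # Stability: catch exceptions and retry instead of letting the whole crawl crash.
    for attempt in range(retries):
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
            return resp
        except requests.RequestException:
            time.sleep(delay * (attempt + 1))  # back off before retrying
    return None                                # give up after the final attempt

for url in ["http://example.com/a", "http://example.com/b"]:  # placeholder URLs
    polite_get(url)
    time.sleep(1)  # Quality: pause between requests so the site's normal operation is unaffected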

Crawling strategies

In a crawler system, the queue of URLs to be crawled is a very important component, because it determines the order in which URLs are fetched, i.e. which page is crawled first and which is crawled later. The method for deciding this order is called the crawling strategy. The following highlights some common crawling strategies:

1. Depth-first traversal strategy

With the depth-first traversal strategy, the web crawler starts from the start page and follows one chain of links, link after link, to its end; only after finishing that chain does it come back and continue with the next link. (A small sketch contrasting the depth-first and breadth-first frontiers appears after the next subsection.)
Traversal path: AFG EHI BCD

2. Breadth-first traversal strategy

The basic idea of the breadth-first (width-first) traversal strategy is to insert the links found in a newly downloaded page directly at the end of the to-crawl URL queue. In other words, the web crawler first crawls all pages linked from the start page, then selects one of those linked pages and crawls all pages linked from it. Taking the same example graph as above:

Traversal path: ABCDEF GHI
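
The two traversal strategies differ only in how the frontier is managed: depth-first takes the most recently discovered URL (a LIFO stack), breadth-first takes the oldest one (a FIFO queue). A minimal sketch of the difference; the crawl_links function is a placeholder for fetching a page and returning its out-links:

def traverse(start_url, crawl_links, depth_first=False):
    frontier = [start_url]
    visited = set()
    while frontier:
        # Depth-first: pop the newest URL; breadth-first: pop the oldest one.
        url = frontier.pop() if depth_first else frontier.pop(0)
        if url in visited:
            continue
        visited.add(url)
        for link in crawl_links(url):   # crawl_links fetches url and returns its out-links
            if link not in visited:
                frontier.append(link)
    return visited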

3. Backlink-count strategy

The backlink count of a page is the number of links from other pages pointing to it. It indicates the extent to which the page's content is recommended by others. For this reason, search-engine crawling systems often use this indicator to evaluate the importance of web pages and thereby decide the order in which different pages are crawled.

In a real network environment, because of advertising links and link spam, the backlink count cannot be fully trusted as a measure of importance. Therefore, search engines tend to count only reliable backlinks.
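
A minimal sketch of ranking by backlink count, assuming we already have a link graph mapping each page to the pages it links to; the graph below is illustrative, not real data:

from collections import Counter

# page -> pages it links to (illustrative link graph)
link_graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C", "B"],
}

# Count backlinks: how many pages link *to* each page.
backlinks = Counter(target for targets in link_graph.values() for target in targets)

# Crawl pages with more backlinks first.
to_crawl = sorted(link_graph, key=lambda page: backlinks[page], reverse=True)
print(to_crawl)   # C has the most backlinks, so it comes first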

4. Partial PageRank strategy

The Partial PageRank algorithm borrows the idea of the PageRank algorithm: the pages that have already been downloaded, together with the URLs in the to-crawl queue, form a collection of web pages, and a PageRank value is computed for each page in it. After the computation, the URLs in the to-crawl queue are sorted by PageRank value and fetched in that order.

Recomputing PageRank every time a single page is crawled would be too expensive, so a compromise is to recompute the PageRank values after every K crawled pages. This raises a problem: the links extracted from downloaded pages point to the "unknown" pages mentioned earlier, which have no PageRank value yet. To solve this, these pages are given a temporary PageRank value: the PageRank passed in over all of the page's in-links is aggregated to form the unknown page's value, which then participates in the sorting.
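
A rough sketch of this idea, assuming the networkx library is available for the PageRank computation itself. Downloaded pages and their out-links form the graph; a frontier URL that appears as a link target inherits a score from the computation, and one with no known in-link gets a temporary score of 0. All names here are illustrative:

import networkx as nx

def rank_frontier(downloaded_links, frontier):
    # downloaded_links: {downloaded page: list of its out-links}
    # frontier: URLs waiting in the to-crawl queue
    graph = nx.DiGraph()
    for page, links in downloaded_links.items():
        for link in links:
            graph.add_edge(page, link)   # not-yet-downloaded URLs appear as targets only
    pr = nx.pagerank(graph)              # PageRank over downloaded pages and their targets
    # A frontier URL with no known in-link yet gets a temporary score of 0.
    return sorted(frontier, key=lambda url: pr.get(url, 0.0), reverse=True)

# Recomputing on every download is expensive; the compromise described above
# is to call rank_frontier again only after every K newly crawled pages.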

5. OPIC strategy

The OPIC (Online Page Importance Computation) algorithm essentially assigns each page an importance score. Before the algorithm starts, every page is given the same initial amount of cash. After a page P is downloaded, P's cash is apportioned among all the links extracted from P, and P's own cash is emptied. The URLs in the to-crawl queue are then sorted by their amount of cash.
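
A minimal sketch of the cash flow just described, with illustrative names; get_links stands in for downloading a page and reading its out-links:

def opic_crawl_order(pages, get_links, initial_cash=1.0, steps=10):
    cash = {p: initial_cash for p in pages}     # every page starts with the same cash
    order = []
    for _ in range(steps):
        candidates = [p for p in pages if p not in order]
        if not candidates:
            break
        page = max(candidates, key=lambda p: cash[p])   # most cash gets crawled next
        order.append(page)
        links = get_links(page)                 # "download" the page and read its out-links
        if links:
            share = cash[page] / len(links)     # apportion the cash among the out-links
            for link in links:
                cash[link] = cash.get(link, 0.0) + share
        cash[page] = 0.0                        # the downloaded page's cash is emptied
    return order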

6. Major-stations-first strategy

The URLs in the to-crawl queue are classified according to the site they belong to, and the site with the largest number of pages waiting to be downloaded gets the highest priority. This is why the strategy is called the major-stations-first strategy.
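
A minimal sketch of this grouping, assuming the frontier is a plain list of URL strings:

from collections import defaultdict
from urllib.parse import urlparse

def major_station_first(frontier):
    by_site = defaultdict(list)
    for url in frontier:
        by_site[urlparse(url).netloc].append(url)   # classify by the site the URL belongs to
    # Sites with more pages waiting to be downloaded get the highest priority.
    ranked_sites = sorted(by_site, key=lambda site: len(by_site[site]), reverse=True)
    return [url for site in ranked_sites for url in by_site[site]]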

The thought process behind one crawler

Regarding data extraction, i.e. the first layer in the figure above, the strategies are as follows:

Login analysis:

Login serves two purposes:

The first is to ensure data security: access to data pages is not equal, different users have different permissions, and each user can choose whether the data they upload is shared or private, similar to Qzone (QQ空间).
The second is to guarantee user security: these pages are generally "personal web profiles", where every user can access the site's functionality but their own data must not be leaked, similar to Alipay.
Although the login procedure varies from site to site, and major sites go to great lengths to block crawlers, a web system has only two ways to maintain login state: Cookies and Sessions (easy to understand if you have web development experience). So we only need to simulate the Cookie value the server expects in order to give the server the "illusion" that we are already logged in. However, you always need to log in once to obtain the Cookies produced after login, so the next question is how to log in.
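
A minimal sketch of reusing login state with the requests library: log in once through a Session (or paste a Cookie header captured from the browser), and the cookies are sent automatically on later requests. The login URL, form fields, and protected page below are placeholders, not taken from any real site:

import requests

session = requests.Session()

# Log in once; the server's Set-Cookie response is stored in the session.
login_url = "http://example.com/login"   # placeholder login endpoint
session.post(login_url, data={"username": "user", "password": "pass"})  # placeholder fields

# Later requests reuse the same cookies, so the server believes we are logged in.
resp = session.get("http://example.com/protected/page")  # placeholder protected page

# Alternatively, copy a Cookie header captured from the browser after a manual login:
# session.headers["Cookie"] = "captured cookie string here"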

Login strategy:

What deters most people when signing in is nothing more than the CAPTCHA. In fact, there are many public strategies online for handling them, such as image-recognition CAPTCHAs and automatic slider CAPTCHAs; the hardest to deal with is the phone (SMS) verification code.

But these are not the methods I want to talk about.

In the current domestic network environment, BAT (Baidu, Alibaba, Tencent) is like a big ecosystem holding huge numbers of users, so many sites, in order to lower their entry threshold, cooperate with them. That is why we so often see "Log in with QQ", "Log in with WeChat", "Log in with Taobao", and so on. We can therefore use these login interfaces directly, and the automated login strategies for these platforms published online are already very mature.

Constructing the request:

headers = {
    'Host': 'search.originoo.cn',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:65.0) Gecko/20100101 Firefox/65.0',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
    'Accept-Encoding': 'gzip, deflate',
    'Referer': 'http://www.originoo.com/ws/p.topiclist.php?cGljX2tleXdvcmRzPeW3peS6uiZtZWRpdW1fdHlwZT1waWMtdmVjdG9y',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Content-Length': '213',
    'Origin': 'http://www.originoo.com',
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
}

data = {
    'pic_keywords': '%E5%B7%A5%E4%BA%BA',
    'medium_type': 'pic-vector',
    'pic_quality': 'all',
    'sort_type': '4',
    'pic_orientation': '',
    'pic_quantity': '',
    'pic_gender': '',
    'pic_age': '',
    'pic_race': '',
    'pic_color': '',
    'pic_url': '',
    'user_id': '0',
    'company_id': '0',
    'page_index': str(i),   # i is the page-loop index
    'page_size': '40',
}

This is a typical POST request: the defining feature of POST is that it carries parameters (the data above) telling the server exactly what data is being requested.
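
With the headers and data dicts above, the request itself is a single call to requests.post. The endpoint path is not shown in the capture, so the URL below is a placeholder; note also that requests computes Content-Length itself, so that header can be dropped:

import requests

post_url = "http://search.originoo.cn/"   # placeholder: the real endpoint path comes from the capture
resp = requests.post(post_url, headers=headers, data=data, timeout=10)
result = resp.json()                      # the Accept header says the response is JSON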

Not all requests are this long. What a request must contain is determined by the server: if the server's design is complex, the corresponding request becomes very complex, and vice versa:

data = {
    "keyword": "workers",
    "color": "0",
    "type": "6",
}
The above is the POST data portion of a request to another site.

At the other extreme, a complex request is one for which the server generates and verifies a large number of parameters, for example:

headers = {
    'Host': 'dpapi.dispatch.paixin.com',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:65.0) Gecko/20100101 Firefox/65.0',
    'Accept': 'application/json, text/plain, */*',
    'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
    'Accept-Encoding': 'gzip, deflate, br',
    'Referer': 'https://v.paixin.com/media/photo/standard/%E5%B7%A5%E4%BA%BA/1',
    'Content-Type': 'application/json;charset=utf-8',
    'Content-Length': '43',
    'Origin': 'https://v.paixin.com',
    'Connection': 'keep-alive',
    'Cookie': 'Hm_lvt_8a9ebc00eda51ba9f665488c37a93f41=1552048374; Hm_lpvt_8a9ebc00eda51ba9f665488c37a93f41=1552048386; Qs_lvt_169722=1552048378; Qs_pv_169722=4547593134179576000%2C332112608763589300; Hm_lvt_f72440517129ff03cc6f22668c61aef3=1552048378; Hm_lpvt_f72440517129ff03cc6f22668c61aef3=1552048386; _ga=GA1.2.1067225704.1552048380; _gid=GA1.2.1727359604.1552048380; _gat=1',
    'Cache-Control': 'max-age=0',
}
The Cookie value in this request header is clearly not human-readable. In fact, don't be intimidated by the baffling parameters you see while capturing packets: they were written by other programmers, who have their own naming methods and strategies. We only need to confirm whether a given value is actually required. The method is simple: on each request, delete a portion of the parameters and see whether the request still succeeds. If it does, keep deleting and re-requesting until it no longer works. If it fails, the parameter you just deleted is important, and it is time to track down where that parameter comes from.
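
The trial-and-error approach described above can even be automated: drop one header at a time and check whether the response is still normal. A rough sketch; the URL and the "request still works" check are placeholders for whatever the real site requires:

import requests

def find_required_headers(url, headers, data):
    required = dict(headers)
    for name in list(headers):
        trial = {k: v for k, v in required.items() if k != name}
        resp = requests.post(url, headers=trial, data=data, timeout=10)
        # Placeholder check: treat any 200 response as "the request still works".
        if resp.status_code == 200:
            required.pop(name)   # this header was not needed after all
        # otherwise keep it: it is important, so go find where its value comes from
    return required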

Reference:
Vision_Tung, "General ideas for writing crawlers": https://blog.csdn.net/Vision_Tung/article/details/88591726
