Two Methods for a Python Crawler to Simulate a Browser (with Examples)

This article describes, with examples, two methods for a Python crawler to simulate a browser. It is shared for your reference, as follows:

When crawling some sites, the crawler gets a 403 error because the site has anti-crawler settings in place.

1. The Headers property

Crawling a CSDN blog post:

import urllib.request
url = "http://blog.csdn.net/hurmishine/article/details/71708030"file = urllib.request.urlopen(url)

The crawl result:

urllib.error.HTTPError: HTTP Error 403: Forbidden

This shows that CSDN has made some settings to prevent others from maliciously crawling its information.
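For reference, a minimal sketch of catching this error so the crawler fails gracefully instead of crashing:

import urllib.request
import urllib.error

url = "http://blog.csdn.net/hurmishine/article/details/71708030"
try:
    file = urllib.request.urlopen(url)
except urllib.error.HTTPError as e:
    # Without a browser-like User-Agent, the site refuses the request
    print(e.code, e.reason)  # 403 Forbidden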

So next, we need to make the crawler simulate a browser.

Open any web page, for example Baidu, and press F12; a developer-tools window appears. Switch to the Network tab, refresh the page, and select "www.baidu.com" in the list on the left.
Scrolling down, you will see a "User-Agent" field; that is exactly what we want, so copy it down.
The information we get is: "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
Then we can use either of two methods to simulate a browser accessing the page.

2. Method 1: Use build_opener() to modify the headers

Since urlopen() does not support some advanced HTTP features, we need to modify the headers. This can be done with urllib.request.build_opener(); we modify the code above:

import urllib.request
url = "http://blog.csdn.net/hurmishine/article/details/71708030"headers = ("User-Agent","Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36")
opener = urllib.request.build_opener()
opener.addheaders = [headers]
data = opener.open(url).read()
print(data)

In the code above, we first define a variable headers to store the User-Agent information, in the format ("User-Agent", value).
The value is the string we already obtained. Once acquired, it can be reused when crawling other sites as well, so it is worth saving instead of pressing F12 to look it up every time.

Then we use urllib.request.build_opener() to create a custom opener object and assign it to the variable opener. Next we set the opener's addheaders attribute to the corresponding header information, in the format opener.addheaders = [header tuple]. Once the opener object is set up, we can call its open() method to open the corresponding URL, in the format opener.open(url), and then call read() on the result to read the page data and assign it to the data variable.
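As an aside (a small sketch beyond the original walkthrough): urllib.request also provides install_opener(), which installs the custom opener globally, so later plain urlopen() calls carry the same header:

import urllib.request

headers = ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36")
opener = urllib.request.build_opener()
opener.addheaders = [headers]
urllib.request.install_opener(opener)  # urlopen() now uses this opener by default
data = urllib.request.urlopen("http://blog.csdn.net/hurmishine/article/details/71708030").read()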

We get the output:

b'\r\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\r\n     \r\n    <html xmlns="http://www.w3.org/1999/xhtml">\r\n    \r\n<head>  \r\n\r\n            <link rel="canonical" href="http://blog.csdn.net/hurmishine/article/details/71708030" rel="external nofollow" /> ...
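A side note, continuing from the data variable above: read() returns bytes, which is why the output starts with b'. Assuming the page is UTF-8 encoded, it can be decoded to text:

html = data.decode("utf-8")  # bytes -> str; UTF-8 is an assumption about the page's encoding
print(html[:200])            # show the first 200 characters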

3. Method 2: Use add_header() to add a header

Besides the method above, we can also simulate a browser by using the add_header() method of urllib.request.Request.

First, the code:

import urllib.request
url = "http://blog.csdn.net/hurmishine/article/details/71708030"req = urllib.request.Request(url)
req.add_header('User-Agent','Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36')
data = urllib.request.urlopen(req).read()
print(data)

Now let's analyze it.

Importing the package and defining the url address need no further explanation. We use urllib.request.Request(url) to create a Request object and assign it to the variable req; the format for creating a Request object is urllib.request.Request(url).

Then we use the add_header() method to add the corresponding header information, in the format req.add_header('header name', 'header value').

Now that the header is set, we use urlopen() to open the Request object, which opens the corresponding website. That is:

data = urllib.request.urlopen(req).read()

opens the URL, reads the page content, and assigns it to the data variable.
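A related variant worth knowing (a minimal sketch, not from the original walkthrough): the Request constructor also accepts a headers dict, so the header can be supplied in one step:

import urllib.request

url = "http://blog.csdn.net/hurmishine/article/details/71708030"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"}
req = urllib.request.Request(url, headers=headers)  # header set at construction time
data = urllib.request.urlopen(req).read()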

In the above, we used two methods to make a crawler simulate a browser, open a website, and get its content, avoiding the 403 error.
It is worth noting that Method 1 uses the addheaders attribute while Method 2 uses the add_header() method; mind the difference between them: the presence or absence of the underscore and of the trailing s.
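To keep the two straight, a compact recap using the same User-Agent value:

import urllib.request

ua = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"

# Method 1: addheaders is an attribute (no underscore, with a trailing s)
opener = urllib.request.build_opener()
opener.addheaders = [("User-Agent", ua)]

# Method 2: add_header() is a method (with an underscore, no trailing s)
req = urllib.request.Request("http://blog.csdn.net/hurmishine/article/details/71708030")
req.add_header("User-Agent", ua)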


Original article: blog.csdn.net/haoxun09/article/details/104741266