Python crawler notes [1]: simulating a normal user's request headers

I am working through the book "Web Scraping with Python"; most of the code in this post comes from that book.

Before a crawler can collect data, it first has to be allowed to access the site at all: even the best-written scraping code cannot run if the server refuses the request. So the first task is to disguise the crawler, so that its requests look like they come from a person browsing the web rather than from a bot. Without further ado, let's start on the disguise.

1. Modifying the request headers

Here we use Python's requests module. First, a quick word about HTTP request headers: they are a set of attributes and settings that the client transmits to the server every time it requests a page. The seven fields below are the ones most browsers send when initiating a request.

Attribute        Content
Host             https://www.google.com/
Connection       keep-alive
Accept           text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
User-Agent       Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36
Referrer         https://www.google.com/
Accept-Encoding  gzip,deflate,sdch
Accept-Language  en-US,en;q=0.8

These are the headers a real user's browser sends when visiting a website (the values above are the book author's own, viewed by pressing F12 to open the browser's developer tools). By contrast, a Python crawler that uses urllib without any customization sends only the following headers:

Accept-Encoding  identity
User-Agent       Python-urllib/3.4
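You can confirm urllib's default User-Agent locally, without sending a single request: a freshly built opener carries its default headers in the addheaders list (the version suffix depends on your interpreter, e.g. Python-urllib/3.4 on Python 3.4).

```python
import urllib.request

# A fresh opener carries urllib's default headers. The only header set
# by default is the User-Agent, "Python-urllib/<python version>".
opener = urllib.request.build_opener()
print(opener.addheaders)
```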

With the requests module, the headers can be customized. The following program sends a request to a site that echoes back the headers it receives, so we can verify what our "browser" is actually sending:

    import requests
    from bs4 import BeautifulSoup

    session = requests.Session()  # create a Session object
    headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) "
                             "AppleWebKit/537.36 (KHTML, like Gecko) Chrome",
               "Accept": "text/html,application/xhtml+xml,application/xml;"
                         "q=0.9,image/webp,*/*;q=0.8"}
    # This site displays the request headers it receives, which lets us verify ours
    url = "https://www.whatismybrowser.com/developers/what-http-headers-is-my-browser-sending"
    req = session.get(url, headers=headers)  # send the GET request
    bsObj = BeautifulSoup(req.text, "html.parser")
    print(bsObj.find("table", {"class": "table-striped"}).get_text())

The headers shown in the output should match the ones set in the program. That completes the first step: simulating the request headers of a normal user.
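If you want to run the same verification without depending on an external site, a minimal sketch is to stand up a local HTTP server that echoes back the headers it receives (similar to what whatismybrowser.com does) and send it a request with a spoofed User-Agent. The handler class and the header value here are illustrative, not from the book:

```python
import http.server
import json
import threading
import urllib.request


class EchoHeadersHandler(http.server.BaseHTTPRequestHandler):
    """Echo the received request headers back to the client as JSON."""

    def do_GET(self):
        # Normalize header names (urllib sends e.g. "User-agent" on the wire)
        received = {k.title(): v for k, v in self.headers.items()}
        body = json.dumps(received).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging


# Bind to port 0 so the OS picks a free port
server = http.server.HTTPServer(("127.0.0.1", 0), EchoHeadersHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = "http://127.0.0.1:%d/" % server.server_address[1]
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0 (test)"})
with urllib.request.urlopen(req) as resp:
    echoed = json.loads(resp.read().decode("utf-8"))

print(echoed["User-Agent"])  # Mozilla/5.0 (test)
server.shutdown()
```

The same trick works for checking any header your crawler sends, since the server simply reflects whatever arrives on the wire.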

 

Origin www.cnblogs.com/dfy-blog/p/11518406.html