Python crawler, part 2: disguising the crawler as a browser

Some web pages return an error when a crawler requests them:

urllib.error.HTTPError: HTTP Error 403: Forbidden

This usually happens because the site inspects the incoming request and rejects clients that do not look like a browser. The crawler therefore has to disguise itself as a browser by setting the User-Agent header.
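To see what the server can detect, it helps to look at the User-Agent that urllib sends when none is set. The sketch below is only an illustration; the helper name `default_user_agent` is made up for this example and does not appear in the original script.

```python
import urllib.request

def default_user_agent():
    """Return the User-Agent urllib sends when the caller sets none (hypothetical helper)."""
    opener = urllib.request.build_opener()
    # opener.addheaders is a list of (name, value) pairs added to every request
    return dict(opener.addheaders).get("User-agent")

# Prints a string starting with "Python-urllib/", followed by the Python version
print(default_user_agent())
```

A value like this is easy for a site to tell apart from a real browser, which is why replacing it is often enough to get past the 403.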

Open the web page in a browser ---> press F12 ---> Network tab ---> refresh the page

Then select any request in the list and look for User-Agent among its request headers:

User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36

 

import urllib.request                   # URL-handling package

def openUrl(url):
    # Browser-like User-Agent; urllib fills in the Host header from the URL itself,
    # so it is not hard-coded here and the function works for other sites too
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
    }
    req = urllib.request.Request(url, headers=headers)
    response = urllib.request.urlopen(req)      # send the request
    html = response.read()                      # read the raw response bytes
    html = html.decode("utf-8")                 # decode the bytes to text
    print(html)                                 # print the page source

if __name__ == "__main__":
    url = "http://jandan.net/ooxx/" # 'http://www.douban.com/'
    openUrl(url)
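If the server still refuses the request, the script above stops with a traceback. A minimal sketch of catching the error instead is shown below. The names `BROWSER_UA`, `build_request` and `fetch` are hypothetical and introduced only for this example, and returning `None` on failure is an assumption about what the caller wants.

```python
import urllib.error
import urllib.request

# User-Agent copied from the browser's developer tools, as described above
BROWSER_UA = ("Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36")

def build_request(url, user_agent=BROWSER_UA):
    """Create a Request that carries a browser-like User-Agent (hypothetical helper)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

def fetch(url):
    """Return the decoded page, or None if the server answers with an HTTP error."""
    try:
        with urllib.request.urlopen(build_request(url)) as response:
            return response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # err.code is the HTTP status (for example 403), err.reason its text
        print("request failed:", err.code, err.reason)
        return None
```

Calling `fetch("http://jandan.net/ooxx/")` would then either return the page source or print the status code, so a loop over many URLs can continue after one failure.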
 

 
