An Introduction to Python 2 Web Crawlers

This post collects the author's notes on getting started with web crawlers in Python 2. It is for reference only; if you spot any errors, corrections are welcome.

1. Crawling a static page (the Baidu homepage, http://www.baidu.com); the IDE used is PyCharm.
# -*- coding: utf-8 -*-
import urllib2

def base():
    # Define the URL, open it with urlopen, and read the response
    url = "http://www.baidu.com"
    data = urllib2.urlopen(url)
    # Save the response body to a file; use binary mode, since read() returns raw bytes
    fhandle = open("D://demo.html", "wb")
    fhandle.write(data.read())
    fhandle.close()

if __name__ == '__main__':
    base()

This code uses the most basic urlopen function to fetch page content. Some websites have anti-crawling measures that reject such bare requests; for those, we add request headers so the crawler is disguised as a browser. CSDN, for example, has such an anti-crawling mechanism.
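Besides attaching headers to an opener (as in the next snippet), urllib2 also lets you set headers on a single request via urllib2.Request. A minimal sketch; the try/except import is my own addition so the snippet also runs under Python 3, where urllib2 was merged into urllib.request:

```python
# -*- coding: utf-8 -*-
# Compat shim (an assumption, not from the original post):
# urllib2 became urllib.request in Python 3
try:
    import urllib2
except ImportError:
    import urllib.request as urllib2

# Attach a User-Agent header to one request via urllib2.Request
url = "http://www.baidu.com"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; rv:109.0) Gecko/20100101 Firefox/115.0"}
req = urllib2.Request(url, headers=headers)

# urllib2 stores header names capitalized, e.g. "User-agent";
# this is the value that will be sent with the request
print(req.get_header("User-agent"))
```

You would then pass `req` to `urllib2.urlopen(req)` instead of the bare URL string.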

2. Disguising the crawler as a browser. The method used here is to add a User-Agent field to the request headers. To see what this field looks like, open the browser's F12 developer tools, switch to the Network tab, and inspect the request headers. The author uses Firefox.

# -*- coding: utf-8 -*-
import random
import urllib2

def openurl():
    url = "http://blog.csdn.net/szc889988/article/details/56331844"
    # A pool of User-Agent strings to pick from at random
    user_agents = [
        'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11',
        'Opera/9.25 (Windows NT 5.1; U; en)',
        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
        'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
        'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12',
        'Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9',
        'Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.7 (KHTML, like Gecko) Ubuntu/11.04 Chromium/16.0.912.77 Chrome/16.0.912.77 Safari/535.7',
        'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0',
    ]
    # Build an opener and attach a randomly chosen User-Agent header
    opener = urllib2.build_opener()
    agent = random.choice(user_agents)
    opener.addheaders = [("User-Agent", agent)]
    res = opener.open(url)
    htmlSource = res.read()
    # Save the page source to a file (binary mode for the raw bytes)
    fhandle = open("D://CSDN.html", "wb")
    fhandle.write(htmlSource)
    fhandle.close()
    print(htmlSource)

if __name__ == '__main__':
    openurl()
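The function above raises an unhandled exception if the network or the server misbehaves. Below is a hedged sketch of a fetch_html helper (the name, signature, and defaults are my own, not from the original post) that adds a timeout and catches urllib2.URLError; the try/except import makes it runnable on Python 3 as well:

```python
# -*- coding: utf-8 -*-
try:
    import urllib2                      # Python 2
except ImportError:
    import urllib.request as urllib2    # Python 3 compat (assumption, not in the original)

def fetch_html(url, user_agent, timeout=10):
    """Fetch a page with a custom User-Agent; return None on failure."""
    opener = urllib2.build_opener()
    opener.addheaders = [("User-Agent", user_agent)]
    try:
        res = opener.open(url, timeout=timeout)
        return res.read()
    except urllib2.URLError as e:
        # Covers DNS failures, refused connections, unknown schemes, etc.
        print("request failed: %s" % e)
        return None
```

Callers can then check for None instead of wrapping every crawl in their own try/except.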
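As a side note, urllib also provides urlretrieve, which downloads a URL straight to a local file in one step, a shortcut for the manual open/read/write pattern above. A small sketch; to keep it self-contained it copies a local file:// URL, but for a real crawl you would pass an http URL such as http://www.baidu.com instead:

```python
# -*- coding: utf-8 -*-
import os
import tempfile
try:
    from urllib import urlretrieve, urlcleanup          # Python 2
except ImportError:
    from urllib.request import urlretrieve, urlcleanup  # Python 3 compat (assumption)

# A throwaway local page so the sketch needs no network access
src = os.path.join(tempfile.gettempdir(), "demo_src.html")
with open(src, "w") as f:
    f.write("<html>hello</html>")

# Download the URL directly to dst; returns (filename, headers)
dst = os.path.join(tempfile.gettempdir(), "demo_dst.html")
filename, headers = urlretrieve("file://" + src, dst)
urlcleanup()  # clear the temporary cache urlretrieve may keep

with open(filename) as f:
    content = f.read()
print(content)
```

urlretrieve also accepts a reporthook callback, which is handy for showing download progress on large files.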
