Getting started with Python crawlers

I started learning about Python crawlers today and found it very interesting, so I wrote this post to record the learning process.

1. Libraries that crawlers need to use

The most basic crawler only needs the urllib, re, and chardet libraries.

The urllib library is Python's built-in library for handling network requests. For a basic crawler we only need its submodule urllib.request.

The main function we use from urllib.request is urlopen().

urllib.request.urlopen(url) returns a <class 'http.client.HTTPResponse'> object.
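Here is a minimal sketch of what that looks like, assuming a reachable URL (https://example.com is just a stand-in):

from urllib.request import urlopen

# urlopen returns an http.client.HTTPResponse object
response = urlopen('https://example.com')
print(type(response))   # <class 'http.client.HTTPResponse'>
print(response.status)  # 200 on success
raw = response.read()   # read() gives the body as a bytes object
print(type(raw))        # <class 'bytes'>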

The re library is Python's regular-expression library; we use it to match patterns in the page source and pull out the content we need.

The chardet library detects the character encoding of a web page. Its chardet.detect() function takes the raw bytes of a page and reports which encoding the page most likely uses.
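A quick sketch of detect() in action (the GBK-encoded bytes here are just an illustrative input):

import chardet

raw = '你好，世界'.encode('gbk')  # some bytes whose encoding we pretend not to know
print(chardet.detect(raw))
# e.g. {'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}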



2. The idea of crawling web pages

First, use the urllib library to fetch the page. The response body comes back as raw bytes, so use the chardet library to detect which encoding the page uses, then decode the bytes with that encoding to get the complete text of the page. Next, look at the HTML surrounding the information you need, work out its pattern, extract it with a regular expression, and finally write the results to a file.
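Putting those steps together, a minimal sketch of the whole pipeline might look like this (the URL and the pattern are placeholders, not a real target):

from urllib.request import urlopen
import chardet
import re

url = 'https://example.com'                         # placeholder page
raw = urlopen(url).read()                           # 1. fetch raw bytes
encoding = chardet.detect(raw)['encoding']          # 2. detect the encoding
html = raw.decode(encoding)                         # 3. decode to text
titles = re.findall('<title>(.*?)</title>', html)   # 4. extract with a regex
with open('result.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(titles))                      # 5. save to a file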


3. Detailed explanation of crawling web pages

The usage of the urlopen() function is

urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT, *, cafile=None, capath=None, cadefault=False, context=None)

url is the web page URL (a string)

data is the request body; leave it as None (the default) to send a GET request, or pass bytes to send a POST request

timeout is the number of seconds to wait for the server to respond before giving up

The remaining parameters deal with SSL certificates and generally do not need to be changed; the defaults are fine
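For example, a small sketch of a GET request with a two-second timeout (again with example.com standing in for a real target):

from urllib.request import urlopen
from urllib.error import URLError

try:
    # Give up if the server has not responded within 2 seconds
    response = urlopen('https://example.com', timeout=2)
    print(response.status)
except URLError as e:
    print('request failed:', e.reason)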

The http.client.HTTPResponse object returned by urlopen() does not give you text directly; calling its read() method returns the response body as a bytes object.

The chardet.detect() function returns a dictionary like this:

# {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}

Here encoding is the detected encoding, and confidence is how confident chardet is in that detection (1.0 means certain).

Then use the bytes object's decode() method with that encoding to turn the whole page into text.
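A small sketch of the detect-then-decode step, with a utf-8 fallback that I added myself in case detection fails (the fallback is not part of chardet):

import chardet

def decode_page(raw):
    # detect() may return None for 'encoding' if it cannot decide
    encoding = chardet.detect(raw)['encoding'] or 'utf-8'
    return raw.decode(encoding, errors='replace')

print(decode_page('hello'.encode('ascii')))  # hello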

The functions commonly used in the re library are match(pattern, string), search(pattern, string), and findall(pattern, string).

match() only matches at the very beginning of the string; search() scans the whole string and returns the first match, wherever it is; findall() returns every non-overlapping match in the string.
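A short sketch that contrasts the three (the sample string is made up for illustration):

import re

text = 'cat hat cat'
print(re.match('hat', text))    # None: 'hat' is not at the start
print(re.match('cat', text))    # a Match object: 'cat' is at the start
print(re.search('hat', text))   # the first 'hat' anywhere in the string
print(re.findall('cat', text))  # ['cat', 'cat']: every match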



Here is a small example to finish with.

It crawls the Douban Read publisher list and extracts each publisher's name and the number of works it has for sale.



from urllib.request import urlopen
import chardet
import re

class Publish(object):

    def getInfo(self, address):
        # Fetch the page as raw bytes, giving up after 2 seconds
        response = urlopen(address, timeout=2).read()
        # Detect the page's encoding, then decode the bytes into text
        char = chardet.detect(response)
        data = response.decode(char['encoding'])
        # Capture each publisher's name and its works-for-sale count
        pattern1 = '<div class="name">(.*?)</div>'
        pattern2 = '<div class="works-num">(.*?) works for sale</div>'
        result1 = re.compile(pattern1).findall(data)
        result2 = re.compile(pattern2).findall(data)
        return [result1, result2]

    def writeTxT(self, address, fileName):
        result = self.getInfo(address)
        # Write one line per publisher: index, name, works count
        with open(fileName, 'w', encoding='utf-8') as f:
            for i, (name, num) in enumerate(zip(result[0], result[1]), start=1):
                f.write(str(i) + '\t' + name + '\t' + num + '\n')


if __name__ == '__main__':
    publish = Publish()
    fileName = 'publish.txt'
    address = 'https://read.douban.com/provider/all'
    publish.writeTxT(address, fileName)
