I started learning Python web crawling today and found it very interesting, so I wrote this post to record the learning process.
1. Libraries that crawlers need to use
The most basic crawler only needs three libraries: urllib, re, and chardet (the last is third-party and can be installed with pip install chardet).
The urllib library is Python's built-in library for handling network requests. For a basic crawler we only need its submodule urllib.request.
The function we need from urllib.request is urlopen()
urllib.request.urlopen(url) returns a <class 'http.client.HTTPResponse'> object
The re library is a regular expression library used for string pattern matching to find the web content we need.
The chardet library detects the character encoding of a web page. The chardet.detect() function returns the encoding the page most likely uses.
2. The idea of crawling web pages
First, fetch the page with the urllib library; reading the response gives the raw page content as bytes. Next, use the chardet library to detect which encoding the page uses, and decode the bytes with that encoding to get the complete page text. Then inspect the HTML around the information you need, work out its pattern, extract it with regular expressions, and finally write the results to a file.
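The steps above (minus the final file write) can be sketched end to end. This is a minimal sketch: a data: URL and the div pattern here are stand-ins so it runs without a network connection; for real crawling you would substitute an actual URL and patterns matching that page's HTML.

```python
import re
import urllib.request

# A data: URL stands in for a real web page so this sketch runs offline.
url = "data:text/html;charset=utf-8,<div class='name'>Example%20Press</div>"

raw = urllib.request.urlopen(url).read()        # step 1: fetch; read() returns bytes
try:
    import chardet                              # third-party: pip install chardet
    encoding = chardet.detect(raw)["encoding"]  # step 2: detect the encoding
except ImportError:
    encoding = "utf-8"                          # fallback if chardet is unavailable
text = raw.decode(encoding)                     # step 3: decode bytes to text
names = re.findall(r"<div class='name'>(.*?)</div>", text)  # step 4: extract
print(names)                                    # ['Example Press']
```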
3. Detailed explanation of crawling web pages
The usage of the urlopen() function is
urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT,*, cafile=None, capath=None, cadefault=False, context=None)
url is the web page URL (string format)
data is an optional request body; if supplied, the request is sent as a POST instead of a GET (usually left as the default None)
timeout is the maximum number of seconds to wait for a response before giving up
The other parameters rarely need changing; the defaults are fine
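A quick sketch of calling urlopen() with a timeout and reading the body. The data: URL here is a stand-in so the call works without a network connection; against a real site you would pass something like urlopen('https://example.com', timeout=2).

```python
import urllib.request

# data: URL used as an offline stand-in for a real page
response = urllib.request.urlopen("data:,Hello%20crawler", timeout=2)
raw = response.read()   # read() returns the body as bytes
print(raw)              # b'Hello crawler'
```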
The http.client.HTTPResponse object returned by the urlopen function has a read() method that returns the raw page content as bytes.
The chardet.detect() function returns a dictionary
# {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
Where encoding is the detected encoding and confidence is how certain the guess is, from 0 to 1
Then call the decode() method on the bytes with the detected encoding to get the complete page text.
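For instance, running detect() on plain ASCII bytes reproduces the dictionary shown above, and the detected encoding can be passed straight to decode() (chardet is a third-party package, installed with pip install chardet):

```python
import chardet  # third-party: pip install chardet

raw = b"Hello, world!"               # raw bytes, as returned by read()
info = chardet.detect(raw)
print(info)                          # {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
text = raw.decode(info["encoding"])  # decode with the detected encoding
print(text)                          # Hello, world!
```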
The functions commonly used in the re library are match(pattern,string), search(pattern,string) and findall(pattern,string)
match() only matches at the beginning of the string; search() scans the whole string and returns the first match; findall() returns all matches in the string.
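The difference between the three is easiest to see side by side on a small made-up string:

```python
import re

text = "cat hat bat"

print(re.match(r"hat", text))           # None: match() only tries the start
print(re.match(r"cat", text).group())   # 'cat': the string does start with this
print(re.search(r"hat", text).group())  # 'hat': first match anywhere in the string
print(re.findall(r"\w+at", text))       # ['cat', 'hat', 'bat']: every match
```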
Here is a small example for you:
Crawl the publisher names and the number of works each has for sale from Douban Read
from urllib.request import urlopen
import chardet
import re


class Publish(object):
    def getInfo(self, address):
        # Fetch the page and read the raw bytes
        response = urlopen(address, timeout=2).read()
        # Detect the page's encoding, then decode the bytes to text
        char = chardet.detect(response)
        data = response.decode(char['encoding'])
        # Patterns for the publisher name and the number of works on sale
        pattern1 = '<div class="name">(.*?)</div>'
        pattern2 = '<div class="works-num">(.*?) works for sale</div>'
        result1 = re.compile(pattern1).findall(data)
        result2 = re.compile(pattern2).findall(data)
        return [result1, result2]

    def writeTxt(self, address, fileName):
        result = self.getInfo(address)
        # Write one line per publisher: index, name, number of works
        with open(fileName, 'w', encoding='utf-8') as f:
            length = len(result[0])
            for i in range(length):
                f.write(str(i + 1) + '\t' + result[0][i] + '\t' + result[1][i] + '\n')


if __name__ == '__main__':
    publish = Publish()
    fileName = 'publish.txt'
    address = 'https://read.douban.com/provider/all'
    publish.writeTxt(address, fileName)