This post records my own web-crawler learning process.
Crawling web pages with the urllib package
```python
import urllib.request  # module for opening and reading URLs

def main():
    url = "http://www.douban.com/"
    response = urllib.request.urlopen(url)  # send the request
    html = response.read()                  # read the raw response bytes
    html = html.decode("utf-8")             # decode bytes to a string
    print(html)                             # print the page source

if __name__ == "__main__":
    main()
```
The urllib.request module is used to open and read URLs.
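In practice, many sites (douban.com among them) reject requests that carry Python's default User-Agent, so it is common to build a Request object with an explicit header before calling urlopen. A minimal sketch, where the browser string is just an illustrative value:

```python
import urllib.request

# Build a Request carrying a custom User-Agent header; the value below
# is a hypothetical browser string used only for illustration.
url = "http://www.douban.com/"
headers = {"User-Agent": "Mozilla/5.0"}
req = urllib.request.Request(url, headers=headers)

# The Request can be inspected before sending it with urlopen(req).
print(req.get_full_url())            # the target URL
print(req.get_header("User-agent"))  # the header we attached
```

Passing `req` to `urllib.request.urlopen()` then sends the request with the custom header attached.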
Several commonly used character encodings:
ASCII: used to represent English. Each character is stored in 1 byte; the first bit is fixed at 0 and the other 7 bits store data, so a total of 128 characters can be represented.
Extended ASCII: used to represent additional European characters. All 8 bits store data, so a total of 256 characters can be represented.
GBK/GB2312/GB18030: encodings for Chinese text. GB2312 covers Simplified Chinese; GBK extends it with Traditional characters and more symbols; GB18030 is a further superset of both.
Unicode: a character set that assigns a code point to every character in the world; it defines the characters themselves, not how they are stored as bytes.
UTF-8: one of the encodings that implements Unicode. It uses 1 to 4 bytes to represent a character, with the byte length varying by character.