Crawling web pages with the urllib library

1. urllib.request.urlopen

To crawl a web page, pass its URL to urlopen(), which returns a file-like response object:

import urllib.request

file = urllib.request.urlopen("https://www.baidu.com")

Note 1: Reading the response object returned by urlopen (see the sketch after this list):

    1. Line-by-line reading: readline()

    readline() reads one line from the response at a time, so it is normally called inside an always-true loop (while True). Once the file pointer reaches the end of the file, readline() returns an empty bytes object instead of another line, so the loop needs a judgment statement that detects this and breaks out; otherwise it would spin forever.

    2. Multi-line reading: readlines()

    readlines() reads all of the remaining lines of the response at once and returns them as a list, so to process the data you loop over the elements of that list.

    3. One-time reading: read()

    The simplest way to read the response is read(), which reads all of the content at once and returns it as a single bytes object that can be assigned to a variable.
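A minimal sketch of all three reading methods, assuming the same baidu.com URL as above (each method consumes the response, so it is re-opened each time):

import urllib.request

# 1. Line-by-line: readline() returns an empty bytes object (b"")
#    once the pointer reaches the end of the file, so break then.
file = urllib.request.urlopen("https://www.baidu.com")
while True:
    line = file.readline()
    if not line:   # end of file reached
        break
    print(line)

# 2. Multi-line: readlines() returns a list of bytes lines to loop over.
file = urllib.request.urlopen("https://www.baidu.com")
for line in file.readlines():
    print(line)

# 3. One-time: read() returns the whole page as a single bytes object.
file = urllib.request.urlopen("https://www.baidu.com")
data = file.read()
print(len(data))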


2. Writing the data to a file and saving it:

    1. Basic Python file writing:

data = file.read()                  # bytes read from the urlopen response above
fhandle = open("D:/1.html", "wb")   # open the local file in binary write mode
fhandle.write(data)
fhandle.close()
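The same write is often expressed with a with block, which closes the file automatically even if an error occurs:

import urllib.request

data = urllib.request.urlopen("https://www.baidu.com").read()
with open("D:/1.html", "wb") as fhandle:   # closed automatically on exit
    fhandle.write(data)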

    2. urllib.request.urlretrieve(url, filename=local file path)

filename, headers = urllib.request.urlretrieve("https://www.baidu.com", filename="D:/1.html")

    3. urllib.request.urlcleanup():

    Called after the code above, it clears the temporary cache files that urlretrieve may leave behind.
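For instance, the two calls are typically paired like this (same URL and local path as above):

import urllib.request

# Download the page; urlretrieve returns a (local filename, headers) tuple.
urllib.request.urlretrieve("https://www.baidu.com", filename="D:/1.html")

# Clear the temporary cache files that urlretrieve may have created.
urllib.request.urlcleanup()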



