How to quickly get started with a Python crawler

Requirement: crawl information from the Baidu homepage.
Remarks: the first crawler uses the urllib module together with its submodule request (urllib.request), and this module needs to be brought in before use; there are two ways to do so, sketched below. urllib ships with Python, so nothing has to be downloaded or installed.
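The original post does not spell out the two import forms, so here is a minimal sketch of the two standard ways to bring in urllib.request (the URL is just the Baidu homepage used throughout this post):

# Way 1: import the dotted module path and use the full name
import urllib.request
response = urllib.request.urlopen("http://www.baidu.com/")

# Way 2: import the submodule directly and use the short name
from urllib import request
response = request.urlopen("http://www.baidu.com/")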

Method 1:
1. First download and install IDLE (Python 3.8 64-bit) and Anaconda Navigator. Jupyter Notebook, launched from Anaconda Navigator, is recommended here, because the Anaconda environment already ships with many packages, so they do not have to be downloaded one by one, which makes things much easier.

2. Suppose we want to crawl Baidu's homepage. Note when writing the code that the s after http is dropped: use http://www.baidu.com/. The address with the extra s (https) is the stricter, secure version of the protocol, and with it this simple request may get no useful information back at all, so the plain http address is used here. The next step is to determine the URL.
Sometimes, to keep escape characters in the URL string from being interpreted, it is recommended to prefix the string with r (a raw string), as in the small sketch below.
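A minimal illustration of what the r prefix does (the Windows-style path here is a hypothetical example; the Baidu URL contains no backslashes, so the prefix is simply harmless insurance):

plain = "C:\new\folder"             # \n and \f are interpreted as escape characters
raw = r"C:\new\folder"              # raw string: backslashes are kept literally
url = r"http://www.baidu.com/"      # no escapes here, so r changes nothing
print(plain)
print(raw)
print(url)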
The next step is to send the request and get the response: pass the URL to urllib.request.urlopen(), assign the result to a variable, and then call .read() on it to read the crawled content.
Then print the result and run the script. After tidying it up, the complete code is as follows:

# Method 1
import urllib.request

url = r"http://www.baidu.com/"

# Send the request and read the response
response = urllib.request.urlopen(url).read()

print(response)

Running the script prints the crawled homepage content as a bytes object.
Method 2, the code is as follows:

# Method 2
import urllib.request

s = urllib.request.urlopen("http://www.baidu.com")
print(s.read())
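As a side note (a sketch, not part of the original post), a slightly more idiomatic variant of method 2 uses a with statement, since the object returned by urlopen() in Python 3 works as a context manager and the connection is then closed automatically:

import urllib.request

# The response is closed automatically when the with block ends.
with urllib.request.urlopen("http://www.baidu.com") as s:
    print(s.read())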

Running it produces the same kind of byte output as method 1.
3. Let's compare this output with the page's actual source code. You can open the web page in a browser and view the source directly. The page source can also be fetched with code:

# View the page source code
import requests

head = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"}
s = requests.get("http://www.baidu.com", headers=head)
print(s.text)
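For comparison, the same User-Agent header can also be sent with urllib by wrapping the URL in a Request object; this is a sketch of the equivalent call, not something from the original post:

import urllib.request

head = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"}

# Build a Request object carrying the headers, then open it as before.
req = urllib.request.Request("http://www.baidu.com", headers=head)
print(urllib.request.urlopen(req).read())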

The result is readable HTML text. Comparing the two, the urllib version reads the page as raw binary data (bytes), which is why the Chinese characters show up there as escaped byte sequences instead of readable text.
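To make the urllib output readable as well, the bytes can be decoded; a minimal sketch, assuming the page is served as UTF-8 (Baidu's homepage normally is):

import urllib.request

raw = urllib.request.urlopen("http://www.baidu.com/").read()  # bytes
html = raw.decode("utf-8")                                     # str, Chinese is now readable
print(html)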

4. To check or print the length of the response, the code only changes slightly: just wrap the result in len():

# Check the length
import urllib.request

url = r"http://www.baidu.com/"

# Send the request and read the response
response = urllib.request.urlopen(url).read()
print(len(response))
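Note that this is the length in bytes, because .read() returns bytes; the character count of the decoded text is a different (usually smaller) number. A small sketch, again assuming UTF-8:

import urllib.request

raw = urllib.request.urlopen("http://www.baidu.com/").read()
print(len(raw))                  # number of bytes
print(len(raw.decode("utf-8")))  # number of characters after decoding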

After execution, the length of the crawled page is printed.
Friends are welcome to discuss and comment. Please credit the source when reprinting. Thank you!


Origin blog.csdn.net/Louisliushahe/article/details/109673764