9 Introductory Python Crawler Examples, Recommended for Your Collection!

While teaching my friends Python web crawling, I prepared a few simple introductory examples to share with you.

Main knowledge points involved:

1. How the web interacts

2. The get and post functions of the requests library

3. Relevant functions and attributes of the response object

4. Opening and saving files in Python

Comments are included in the code, and every example can be run directly. First, how to install the requests library (friends who already have Python installed can follow along directly; if not, it is recommended to set up a Python environment first).

The process is almost the same for Windows and Linux users: open cmd and enter the following command. If the Python environment is installed in a directory on the C drive, it may prompt that permissions are insufficient; in that case, just run the cmd window as administrator.

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple requests  

Linux users do much the same (taking Ubuntu as an example): if permissions are insufficient, just add sudo before the command.

sudo pip install -i https://pypi.tuna.tsinghua.edu.cn/simple requests  
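To confirm that the installation succeeded, you can import the library and print its version; if the import fails, the installation did not take effect. A minimal check (the version number on your machine will differ):

import requests  # if this import fails, the installation did not succeed

print(requests.__version__)  # e.g. 2.31.0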

1. Crawl the Baidu homepage and print the page information

# First crawler example: crawl the Baidu homepage
import requests  # import the requests library, otherwise the crawler functions cannot be called

response = requests.get("http://www.baidu.com")  # create a Response object
response.encoding = response.apparent_encoding  # set the encoding format
print("Status code: " + str(response.status_code))  # print the status code
print(response.text)  # output the crawled information
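A note on the encoding line above: response.encoding is guessed from the HTTP response headers, while response.apparent_encoding is detected from the page bytes themselves, so assigning the latter to the former helps avoid garbled text. A small sketch of the difference:

import requests

response = requests.get("http://www.baidu.com")
print(response.encoding)           # guessed from the response headers
print(response.apparent_encoding)  # detected from the page content itself
response.encoding = response.apparent_encoding  # decode .text with the detected encoding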

2. A common get method example; parameter-passing examples follow below.

# Second example: the get method
import requests  # import the requests library first, otherwise the crawler functions cannot be called

response = requests.get("http://httpbin.org/get")  # the get method
print(response.status_code)  # status code
print(response.text)
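Since httpbin.org/get returns its response as JSON, you can parse it directly with response.json() instead of reading the raw text. A small sketch:

import requests

response = requests.get("http://httpbin.org/get")
data = response.json()  # parse the JSON body into a Python dict
print(data["headers"])  # httpbin echoes back the request headers
print(data["origin"])   # and the requesting IP address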

3. A common post method example; there are parameter-passing examples below as well.

# Third example: the post method
import requests  # import the requests library first, otherwise the crawler functions cannot be called

response = requests.post("http://httpbin.org/post")  # access via the post method
print(response.status_code)  # status code
print(response.text)

4. A put method example

# Fourth example: the put method
import requests  # import the requests library first, otherwise the crawler functions cannot be called

response = requests.put("http://httpbin.org/put")  # access via the put method
print(response.status_code)  # status code
print(response.text)
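requests exposes the other common HTTP verbs in the same one-function-per-verb style, and httpbin provides matching echo endpoints, so the pattern extends directly. A quick sketch:

import requests

print(requests.delete("http://httpbin.org/delete").status_code)  # delete method
print(requests.head("http://httpbin.org/get").status_code)       # head: headers only, no body
print(requests.options("http://httpbin.org/get").status_code)    # options: query allowed methods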

5. Passing parameters with the get method (1)

If you need to pass multiple parameters, just connect them with the & symbol, as follows:

# Fifth example: passing parameters with get
import requests  # import the requests library first, otherwise the crawler functions cannot be called

response = requests.get("http://httpbin.org/get?name=hezhi&age=20")  # pass parameters in the URL
print(response.status_code)  # status code
print(response.text)

6. Passing parameters with the get method (2)

You can also pass multiple parameters using a dictionary:

# Sixth example: passing parameters with get
import requests  # import the requests library first, otherwise the crawler functions cannot be called

data = {
    "name": "hezhi",
    "age": 20
}
response = requests.get("http://httpbin.org/get", params=data)  # pass parameters via params
print(response.status_code)  # status code
print(response.text)
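When a dictionary is passed via params, requests builds and URL-encodes the query string for you, which you can confirm by inspecting response.url. A quick check:

import requests

data = {"name": "hezhi", "age": 20}
response = requests.get("http://httpbin.org/get", params=data)
print(response.url)  # http://httpbin.org/get?name=hezhi&age=20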

7. Passing parameters with the post method. Doesn't it look similar to the previous one?

# Seventh example: passing parameters with post
import requests  # import the requests library first, otherwise the crawler functions cannot be called

data = {
    "name": "hezhi",
    "age": 20
}
response = requests.post("http://httpbin.org/post", data=data)  # pass parameters in the form body
print(response.status_code)  # status code
print(response.text)
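Note the distinction between the two keyword arguments (the original example passed params= here, which also runs but puts the values in the URL): data= sends the values in the request body as a form, while params= appends them to the URL as a query string. httpbin echoes them back under different keys, which makes the difference easy to see:

import requests

payload = {"name": "hezhi", "age": 20}
body = requests.post("http://httpbin.org/post", data=payload).json()
query = requests.post("http://httpbin.org/post", params=payload).json()
print(body["form"])   # {'name': 'hezhi', 'age': '20'} -- values sent in the request body
print(query["args"])  # {'name': 'hezhi', 'age': '20'} -- values sent in the query string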

8. Bypassing anti-crawling mechanisms, taking Zhihu as an example

# One more method example
import requests  # import the requests library first, otherwise the crawler functions cannot be called

response = requests.get("http://www.zhihu.com")  # first visit to Zhihu, without setting header information
print("First attempt, no header information, status code: " + str(response.status_code))  # without headers, crawling fails and the status code is not 200

# The change below is what makes crawling work: the User-Agent field is set
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36"
}  # set header information to disguise the request as a browser

response = requests.get("http://www.zhihu.com", headers=headers)  # access via get, passing in the headers parameter
print(response.status_code)  # 200! the status code for a successful visit
print(response.text)
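You can check exactly which headers were sent by inspecting response.request.headers. By default requests identifies itself as python-requests, which is precisely what simple anti-crawling checks reject; after passing headers=, the disguised value is used instead. A small sketch (using httpbin so it works regardless of Zhihu's current rules):

import requests

response = requests.get("http://httpbin.org/get")
print(response.request.headers["User-Agent"])  # python-requests/x.y.z -- the default identity

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
response = requests.get("http://httpbin.org/get", headers=headers)
print(response.request.headers["User-Agent"])  # now the disguised browser identity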

9. Crawl information and save it locally

Because of the directory structure, I first created a folder named 爬虫 (crawler) on the D drive and then saved the information there.

Pay attention to the encoding setting when saving the file

# Crawl an html page and save it
import requests

url = "http://www.baidu.com"
response = requests.get(url)
response.encoding = "utf-8"  # set the encoding format for the received content

print("\nType of response: " + str(type(response)))
print("\nStatus code: " + str(response.status_code))
print("\nHeader information: " + str(response.headers))
print("\nResponse content:")
print(response.text)

# Save the file
file = open("D:\\爬虫\\baidu.html", "w", encoding="utf-8")  # open a file; "w" creates the file if it does not exist; "wb" is not needed here because we are not saving binary content
file.write(response.text)
file.close()
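To save binary content instead (an image, a zip archive, and so on), open the file in "wb" mode and write response.content rather than response.text. A sketch under the assumption that the URL points at a real image (the address below is only a placeholder):

import requests

url = "http://www.example.com/logo.png"  # placeholder URL: substitute a real image address
response = requests.get(url)
with open("logo.png", "wb") as file:  # "wb": write binary, no encoding involved
    file.write(response.content)      # .content is raw bytes; .text is decoded text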


Origin: blog.csdn.net/Z987421/article/details/133270471