This exercise pairs a Python crawler with a PHP server.
Overview:
- PHP acts as the server.
- Python acts as the crawler.
- The crawler fetches data from one interface on the server.
- The server checks the User-Agent header: if it is missing or identifies a Python HTTP client, the server responds with 403; otherwise it returns data.
- Interface data: a few made-up records are enough.
PHP server interface code:
Interface: xx/paqustuinfo.php?name=siri
Get information by name.
<?php
// The response body is JSON, so declare it as such.
header("Content-Type: application/json; charset=utf-8");
date_default_timezone_set("PRC");

// Read and trim the "name" query parameter.
$name = isset($_GET['name']) ? $_GET['name'] : "";
$name = trim($name);
if (empty($name) or $name == '""') {
    $err = array('status' => 'error', 'msg' => 'Invalid parameter!');
    echo json_encode($err, JSON_UNESCAPED_UNICODE);
    die();
}

// Sample data.
$arrstuinfo = array(
    array('id' => 1, 'name' => '小王', 'age' => 14, 'sex' => 1),
    array('id' => 2, 'name' => '小李', 'age' => 14, 'sex' => 1),
    array('id' => 3, 'name' => '小明', 'age' => 14, 'sex' => 1),
    array('id' => 4, 'name' => '小爱', 'age' => 14, 'sex' => 1),
    array('id' => 5, 'name' => '小华', 'age' => 14, 'sex' => 1),
    array('id' => 6, 'name' => '小黄', 'age' => 14, 'sex' => 1),
    array('id' => 7, 'name' => 'siri', 'age' => 14, 'sex' => 1),
);

// Default to an error, then look for an exact name match.
$currStu = array('status' => 'error', 'msg' => 'No such person!');
foreach ($arrstuinfo as $stu) {
    if ($stu['name'] === $name) {
        $currStu = $stu;
        break;
    }
}
echo json_encode($currStu, JSON_UNESCAPED_UNICODE);
?>
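To make the two response shapes concrete, here is a rough Python mirror of the lookup logic above. It is only a sketch under stated assumptions: `STUDENTS` and `lookup` are hypothetical names, the list is shortened, the messages are English paraphrases, and PHP's `empty()` has edge cases (such as the string "0") that this version does not reproduce.

```python
import json

# Shortened, made-up sample data (hypothetical stand-in for $arrstuinfo).
STUDENTS = [
    {"id": 1, "name": "小王", "age": 14, "sex": 1},
    {"id": 7, "name": "siri", "age": 14, "sex": 1},
]


def lookup(name):
    """Return the matching record, or an error dict with 'status' and 'msg'."""
    name = (name or "").strip()
    if not name or name == '""':
        return {"status": "error", "msg": "Invalid parameter!"}
    for stu in STUDENTS:
        if stu["name"] == name:
            return stu
    return {"status": "error", "msg": "No such person!"}


# ensure_ascii=False plays the role of JSON_UNESCAPED_UNICODE in the PHP code.
print(json.dumps(lookup("siri"), ensure_ascii=False))
print(json.dumps(lookup("nobody"), ensure_ascii=False))
```

A successful lookup returns a four-field record, while every failure returns a two-field dict carrying `status` and `msg`; the crawler relies on that difference to decide what to print.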
The Python crawler code is as follows:
import requests
import json

def show(dic):
    # Error responses carry a 'msg' field instead of a student record.
    if 'msg' in dic:
        print(dic['msg'])
        print()
        return
    num = 33
    print("*" * num)
    print("| %s\t" * 4 % ('ID', 'Name', 'Age', 'Sex'), "|", sep="")
    print("-" * num)
    print("| %s\t" % dic['id'], end="")
    print("|%s\t" % dic['name'], end="")
    print("| %s\t" % dic['age'], end="")
    sex = "male" if dic['sex'] == 1 else "female"
    print("| %s\t|" % sex, end="")
    print()
    print("-" * num)
    print()

while True:
    name = input("Name: ")
    # "xx" stands for the server address, as in the interface URL above.
    url = "xx/paqustuinfo.php?name=%s" % name
    response = requests.get(url)
    data = json.loads(response.text)
    show(data)
The effect is as follows:
A real server, however, should not hand its data to crawlers this easily, so it needs to restrict access. For example, when the server detects that a request does not carry a browser-like User-Agent, it can answer with a 403 error or some other message. How should the crawler deal with that?
The PHP server interface needs to handle this case:
Check the User-Agent:
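<?php
// The response body is JSON, so declare it as such.
header("Content-Type: application/json; charset=utf-8");
date_default_timezone_set("PRC");

// Treat a missing User-Agent as an empty string so it is rejected too.
$ua = isset($_SERVER["HTTP_USER_AGENT"]) ? $_SERVER["HTTP_USER_AGENT"] : "";
// Only a simple check here; nothing more elaborate.
if ($ua == "" or substr($ua, 0, 16) == "python-requests/" or substr($ua, 0, 14) == "Python-urllib/") {
    $err = array('status' => 'error', 'msg' => 'bug bug bug bug');
    http_response_code(403);
    die(json_encode($err, JSON_UNESCAPED_UNICODE));
}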

// Read and trim the "name" query parameter.
$name = isset($_GET['name']) ? $_GET['name'] : "";
$name = trim($name);
if (empty($name) or $name == '""') {
    $err = array('status' => 'error', 'msg' => 'Invalid parameter!');
    echo json_encode($err, JSON_UNESCAPED_UNICODE);
    die();
}

// Sample data.
$arrstuinfo = array(
    array('id' => 1, 'name' => '小王', 'age' => 14, 'sex' => 1),
    array('id' => 2, 'name' => '小李', 'age' => 14, 'sex' => 1),
    array('id' => 3, 'name' => '小明', 'age' => 14, 'sex' => 1),
    array('id' => 4, 'name' => '小爱', 'age' => 14, 'sex' => 1),
    array('id' => 5, 'name' => '小华', 'age' => 14, 'sex' => 1),
    array('id' => 6, 'name' => '小黄', 'age' => 14, 'sex' => 1),
    array('id' => 7, 'name' => 'siri', 'age' => 14, 'sex' => 1),
);

// Default to an error, then look for an exact name match.
$currStu = array('status' => 'error', 'msg' => 'No such person!');
foreach ($arrstuinfo as $stu) {
    if ($stu['name'] === $name) {
        $currStu = $stu;
        break;
    }
}
echo json_encode($currStu, JSON_UNESCAPED_UNICODE);
?>
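When the Python crawler requests again without setting a User-Agent (requests then sends its default python-requests/... value):
Note: requests does not raise an exception for a 403 response by default, so we call raise_for_status(), which raises requests.exceptions.HTTPError for 4xx and 5xx status codes.
Modify the crawler as follows (key part):
# ...
response = requests.get(url)
response.raise_for_status()  # raises HTTPError on a 4xx/5xx response
data = json.loads(response.text)
show(data)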
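If the crawler should report the failure instead of stopping with a traceback, the exception can be caught. The following is a minimal sketch, not the author's code: `parse_or_report` is a hypothetical helper, and the response object is built by hand (including the private `_content` attribute, used here only to fake a body) so that the example runs without the PHP server.

```python
import requests


def parse_or_report(response):
    """Return the parsed JSON body, or an error dict if the status is 4xx/5xx."""
    try:
        response.raise_for_status()
    except requests.exceptions.HTTPError as exc:
        # exc.response is the Response object that triggered the error.
        return {"status": "error", "msg": "HTTP %s" % exc.response.status_code}
    return response.json()


def fake_response(status_code, body):
    """Build a Response by hand for demonstration; no request is sent."""
    resp = requests.Response()
    resp.status_code = status_code
    resp.url = "http://example.invalid/paqustuinfo.php?name=siri"  # placeholder
    resp._content = body  # private attribute, set only to fake a body
    return resp


forbidden = fake_response(403, b'{"status": "error", "msg": "bug bug bug bug"}')
ok = fake_response(200, b'{"id": 7, "name": "siri", "age": 14, "sex": 1}')

print(parse_or_report(forbidden))
print(parse_or_report(ok))
```

Because the error dict has the same `status` and `msg` fields as the server's own error responses, it could be passed to `show` unchanged.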
Look at the effect again:
When the Python crawler requests again, this time sending a browser User-Agent:
# ...
hd = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36"
}
response = requests.get(url, headers=hd)  # send a browser User-Agent
response.raise_for_status()  # still fail loudly on any 4xx/5xx response
data = json.loads(response.text)
show(data)
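Both outcomes can also be reproduced locally without PHP. The sketch below swaps in Python's standard library for both sides: a small `http.server` handler stands in for the PHP interface and `urllib.request` stands in for requests. The names `Handler`, `fetch_status` and `run_demo` are hypothetical, and the handler always returns the same record, so it illustrates only the User-Agent check.

```python
import json
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

BLOCKED_PREFIXES = ("python-requests/", "Python-urllib/")
BROWSER_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36")


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Reject a missing User-Agent or one that starts with a blocked prefix.
        ua = self.headers.get("User-Agent", "")
        blocked = not ua or ua.startswith(BLOCKED_PREFIXES)
        if blocked:
            status, payload = 403, {"status": "error", "msg": "bug bug bug bug"}
        else:
            status, payload = 200, {"id": 7, "name": "siri", "age": 14, "sex": 1}
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch_status(url, headers=None):
    """Return the HTTP status code; urllib raises HTTPError on 4xx/5xx."""
    # An empty ProxyHandler keeps the request away from any configured proxy.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    req = urllib.request.Request(url, headers=headers or {})
    try:
        with opener.open(req) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code


def run_demo():
    server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = "http://127.0.0.1:%d/paqustuinfo.php?name=siri" % server.server_port
    try:
        default_status = fetch_status(url)  # default Python-urllib/... UA
        browser_status = fetch_status(url, {"User-Agent": BROWSER_UA})
    finally:
        server.shutdown()
        server.server_close()
    return default_status, browser_status


default_status, browser_status = run_demo()
print(default_status, browser_status)
```

The first request goes out with urllib's default `Python-urllib/...` User-Agent and is refused, while the second passes the check only because of the header it claims. A prefix check like this deters only crawlers that never change their defaults.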
Effect:
----END----
For learning purposes only.