PHP server and python crawler combined with user-agent

Use python crawlers with php to do an exercise.

Describe it:

  • php as the server.
  • python as a crawler.
  • The crawler program crawls the data of an interface on the server side.
  • The server checks the user-agent: if there is no response 403, sometimes it returns data.
  • Interface data: just write some.

PHP server interface code:

Interface: xx/paqustuinfo.php?name=siri

Get information by name.

<?php
header("Content-type:text/html;charset=utf-8");
date_default_timezone_set("PRC");


$name = isset($_GET['name']) ? $_GET['name'] : "";
$name = trim($name);

if(empty($name) or $name=='""'){
    
    
	$err = array('status' => 'error','msg' => '参数错误!');
	echo json_encode($err,JSON_UNESCAPED_UNICODE);
	die();
}

$arrstuinfo= array(
	array('id' => 1,'name' => '小王','age' => 14,'sex' => 1),
	array('id' => 2,'name' => '小李','age' => 14,'sex' => 1),
	array('id' => 3,'name' => '小明','age' => 14,'sex' => 1),
	array('id' => 4,'name' => '小爱','age' => 14,'sex' => 1),
	array('id' => 5,'name' => '小华','age' => 14,'sex' => 1),
	array('id' => 6,'name' => '小黄','age' => 14,'sex' => 1),
	array('id' => 7,'name' => 'siri','age' => 14,'sex' => 1),

);

$currStu = array('status' => 'error','msg' => '没有这个人哦!');


foreach($arrstuinfo as $stu){
    
    
	if($stu['name'] === $name){
    
    
		$currStu = $stu;
		break;
	}
}

echo json_encode($currStu,JSON_UNESCAPED_UNICODE);

?>

The python crawler program code is as follows:

import requests
import json
import os

def show(dic):
    if len(dic)<=2:
        print(dic['msg'])
        print()
        return 

    num=33
    print("*"*num)
    print("| %s\t"*4 %('编号 ','姓名 ','年龄 ','性别 '),"|",sep="")
    print("-"*num)
    print("| %s\t" %dic['id'],end="")
    print("|%s\t" %dic['name'], end="")
    print("| %s\t" %dic['age'], end="")
    sex="男" if dic['sex']==1 else "女"
    print("| %s\t|" %sex, end="")
    print()    
    print("-"*num)
    print()

while True:
    name=input("姓名:")
    url = "xx/paqustuinfo.php?name=%s"%name

    response = requests.get(url)
    data = json.loads(response.text)
    show(data)

The effect is as follows:

insert image description here


But as a server, we cannot obtain information for crawlers so easily, so the server needs to restrict processing. For example, when the server detects that the crawler program does not carry the user-agent, the server will report a 403 error or return other information, etc. What should I do? ?


The php server interface needs to handle:

Check user-agent:

<?php
header("Content-type:text/html;charset=utf-8");
date_default_timezone_set("PRC");

$ua = $_SERVER["HTTP_USER_AGENT"];

#本处只简单处理,不过多处理
if(substr($ua, 0,16)=="python-requests/" or substr($ua, 0,14)=="Python-urllib/"){
    
    
	$err = array('status' => 'error','msg' => 'bug bug bug bug');
	http_response_code(403);
	die(json_encode($err,JSON_UNESCAPED_UNICODE));
}


$name = isset($_GET['name']) ? $_GET['name'] : "";
$name = trim($name);

if(empty($name) or $name=='""'){
    
    
	$err = array('status' => 'error','msg' => '参数错误!');
	echo json_encode($err,JSON_UNESCAPED_UNICODE);
	die();
}

$arrstuinfo= array(
	array('id' => 1,'name' => '小王','age' => 14,'sex' => 1),
	array('id' => 2,'name' => '小李','age' => 14,'sex' => 1),
	array('id' => 3,'name' => '小明','age' => 14,'sex' => 1),
	array('id' => 4,'name' => '小爱','age' => 14,'sex' => 1),
	array('id' => 5,'name' => '小华','age' => 14,'sex' => 1),
	array('id' => 6,'name' => '小黄','age' => 14,'sex' => 1),
	array('id' => 7,'name' => 'siri','age' => 14,'sex' => 1),

);

$currStu = array('status' => 'error','msg' => '没有这个人哦!');


foreach($arrstuinfo as $stu){
    
    
	if($stu['name'] === $name){
    
    
		$currStu = $stu;
		break;
	}
}

echo json_encode($currStu,JSON_UNESCAPED_UNICODE);

?>

When the python crawler requests again:

When no user-agent is carried:

insert image description here
Note: Since the requests module will not report an error, we can call raise_for_status()

Modify the crawler program as follows (key part):

    #....
    
    response = requests.get(url)
    response.raise_for_status() ####
    
    data = json.loads(response.text)
    show(data)

Look at the effect again:

insert image description here


When the python crawler requests again:

When carrying user-agent:

    #....
    
    hd = {
    
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36"}
	response = requests.get(url, headers=hd) #携带 user-agent
	response.raise_for_status() ####
	
	data = json.loads(response.text)
	show(data)
	

Effect:

insert image description here


----END----
Learning only.

おすすめ

転載: blog.csdn.net/qq_52722885/article/details/127640677