Python crawler technology series 03/4: using Flask with requests to test crawling of static and dynamic pages

Building a web service with Python

Flask reference: Flask framework introductory tutorial (very detailed)

Installing Flask and running a test

Install Flask:

pip install flask

Create a file named webapp.py with the following content:

from flask import Flask

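# Instantiate the Flask object with the current module name so that Flask
# can locate the resources it needs relative to this script
app = Flask(__name__)

# The application instance needs to know which code to run for each requested URL,
# so the program must define a mapping from URLs to Python functions.
# The code that handles this mapping between URLs and view functions is called a "route";
# in Flask, routes are declared with the @app.route decorator.
@app.route("/")
# The function mapped to the URL; to accept arguments, declare them in the route above
def index():
    return "Hello World!"

# Only the first function directly below the decorator is bound as the view function;
# a second one would be an ordinary function.
# Each route needs its own view function.
# def not_hello():
#     return "Not Hello World!"

# Start a local development server to serve the page
app.run()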


Run the code:

python webapp.py

The terminal output is as follows:

& D:/ProgramData/Anaconda3/envs/py10/python.exe d:/zjdemo/webapp.py
 * Serving Flask app 'webapp'
 * Debug mode: off
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on http://127.0.0.1:5000
Press CTRL+C to quit
127.0.0.1 - - [20/Nov/2023 08:20:47] "GET / HTTP/1.1" 200 -     
127.0.0.1 - - [20/Nov/2023 08:20:47] "GET /favicon.ico HTTP/1.1" 404 -

Enter the following address in the browser:

http://127.0.0.1:5000

The browser displays the text Hello World!
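
The comments in webapp.py mention that a view function can take arguments if they are declared in the route. A minimal sketch of that idea follows; the /hello/<name> route and the hello function are hypothetical additions for illustration and are not part of the original webapp.py. Flask's test client is used so the example runs without starting a server.

```python
from flask import Flask

app = Flask(__name__)


@app.route("/")
def index():
    return "Hello World!"


# Hypothetical route (not in the original article): <name> is captured from
# the URL and passed to the view function as a keyword argument.
@app.route("/hello/<name>")
def hello(name):
    return f"Hello {name}!"


if __name__ == "__main__":
    # The test client sends requests to the app without starting a server.
    client = app.test_client()
    print(client.get("/").get_data(as_text=True))
    print(client.get("/hello/Flask").get_data(as_text=True))
```

Each route still maps to exactly one view function; only the URL pattern changes.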

Flask returns a complex HTML string

Create the webapp_html_str.py file with the following code:

from flask import Flask

# Instantiate the Flask object with the current module name so that Flask
# can locate the resources it needs relative to this script
app = Flask(__name__)


html_str="""
<!DOCTYPE html>
<html lang="en">

<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Document</title>
</head>
<body>
    <table id="g570b4" border="1">
        <tr id="g419fe">
            <th id="g16b02">th标头
            </th>
            <th id="gaae0b">th标头
            </th>
            <th id="gd78bc" class=" u5899e">地址
            </th>
        </tr>
        <tr id="g5af9b">
            <td id="g920bb">td表格单元
            </td>
            <td id="g9de93" class=" uab6e6">td表格单元
            </td>
            <td id="gea8dc">上海浦东虹桥某某小区某某地点
            </td>
        </tr>
        <tr id="cf47d6" class=" u0cbcd ">
            <td id="c913e3" class=" ud690a ">td表格单元
            </td>
            <td id="c452e0" class=" uab6e6 ">td表格单元
            </td>
            <td id="c917b3" class=" u7eb06 ">td表格单元
            </td>
        </tr>
        <tr id="cba81f" class=" u0cbcd ">
            <td id="c3dae7" class=" ud690a ">td表格单元
            </td>
            <td id="c7d0f9" class=" uab6e6 ">td表格单元
            </td>
            <td id="c9fe10" class=" u7eb06 ">td表格单元
            </td>
        </tr>
    </table>
    <style>
        .u5899e {
            width: 162px;
        }
    </style>
</body>

</html>

"""

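# The application instance needs to know which code to run for each requested URL,
# so the program must define a mapping from URLs to Python functions.
# The code that handles this mapping between URLs and view functions is called a "route";
# in Flask, routes are declared with the @app.route decorator.
@app.route("/")
# The function mapped to the URL; to accept arguments, declare them in the route above
def index():
    return html_str

# Only the first function directly below the decorator is bound as the view function;
# a second one would be an ordinary function.
# Each route needs its own view function.
# def not_hello():
#     return "Not Hello World!"

# Start a local development server to serve the page
app.run()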


Run the code:

python webapp_html_str.py

Enter the following address in the browser:

http://127.0.0.1:5000

The browser renders the HTML table defined in html_str.
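
The browser renders the string as a page because Flask sends a plain string return value as an HTML response. The sketch below checks this with Flask's test client; the shortened html_str is a stand-in for the full string in webapp_html_str.py, not the original content.

```python
from flask import Flask

app = Flask(__name__)

# Short stand-in for the full html_str used in the article.
html_str = (
    "<!DOCTYPE html><html><body>"
    "<table><tr><td>cell</td></tr></table>"
    "</body></html>"
)


@app.route("/")
def index():
    return html_str


# The test client calls the app directly, so no server has to be running.
client = app.test_client()
response = client.get("/")
print(response.mimetype)                            # media type of the response
print(response.get_data(as_text=True) == html_str)  # body equals the string
```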

Flask returns an HTML page

Returning a static HTML page

In the project directory, create a templates directory, then create a file named a.html inside it with the following content:

<!DOCTYPE html>
<html lang="en">

<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Document</title>
</head>
<body>
    <table id="g570b4" border="1">
        <tr id="g419fe">
            <th id="g16b02">th标头
            </th>
            <th id="gaae0b">th标头
            </th>
            <th id="gd78bc" class=" u5899e">地址
            </th>
        </tr>
        <tr id="g5af9b">
            <td id="g920bb">td表格单元
            </td>
            <td id="g9de93" class=" uab6e6">td表格单元
            </td>
            <td id="gea8dc">上海浦东虹桥某某小区某某地点
            </td>
        </tr>
        <tr id="cf47d6" class=" u0cbcd ">
            <td id="c913e3" class=" ud690a ">td表格单元
            </td>
            <td id="c452e0" class=" uab6e6 ">td表格单元
            </td>
            <td id="c917b3" class=" u7eb06 ">td表格单元
            </td>
        </tr>
        <tr id="cba81f" class=" u0cbcd ">
            <td id="c3dae7" class=" ud690a ">td表格单元
            </td>
            <td id="c7d0f9" class=" uab6e6 ">td表格单元
            </td>
            <td id="c9fe10" class=" u7eb06 ">td表格单元
            </td>
        </tr>
    </table>
    <style>
        .u5899e {
            width: 162px;
        }
    </style>
</body>

</html>

At this point the project directory contains the scripts created so far (webapp.py and webapp_html_str.py) and a templates directory holding a.html.

Create the webapp_html.py file with the following code:

from flask import Flask, render_template

app = Flask(__name__)


# The "/show" route maps to the index function
# index renders and returns the a.html page from the templates directory
@app.route("/show")
def index():
    return render_template("a.html")


if __name__ == '__main__':
    app.run()

Run the code:

python webapp_html.py

The output is as follows:

(py10) PS D:\zjdemo> & D:/ProgramData/Anaconda3/envs/py10/python.exe d:/zjdemo/webapp_html.py
 * Serving Flask app 'webapp_html'
 * Debug mode: off
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on http://127.0.0.1:5000
Press CTRL+C to quit
127.0.0.1 - - [20/Nov/2023 08:38:23] "GET / HTTP/1.1" 404 -
127.0.0.1 - - [20/Nov/2023 08:38:28] "GET /show HTTP/1.1" 200 -

Enter the following address in the browser:

http://127.0.0.1:5000/show

The browser renders the table from a.html.
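
The 404 logged for GET / in the output above is expected: webapp_html.py registers only the /show route, so any other path has no view function. A small sketch that reproduces the two status codes with Flask's test client is shown below. The view body here is a stand-in string so the example runs without the templates directory; the original returns render_template("a.html").

```python
from flask import Flask

app = Flask(__name__)


@app.route("/show")
def index():
    # Stand-in body; the article's version returns render_template("a.html").
    return "<table></table>"


client = app.test_client()
print(client.get("/").status_code)      # no route registered for "/"
print(client.get("/show").status_code)  # handled by index()
```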

Returning a dynamic HTML page

Create jsdemo.html in the templates directory with the following content:

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta http-equiv="X-UA-Compatible" content="IE=edge">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Document</title>
  <style>
    fieldset,#d1 {
      padding: 10px;
      width: 300px;
      margin: 0 auto;
    }
  </style>
  
</head>
<body>
  <form id="form1" name="form1" method="post" action="">
    <fieldset>
      <legend>按时</legend>
      
      输入表格的行数:<input type="text" id="row" value="3" placeholder="请输入表格的行数" required autofocus><br>
      输入表格的列数:<input type="text" id="col" value="5" placeholder="请输入表格的列数" required autofocus><br>
      <input type="button" id="ok" value="产生表格" onclick="createTable()"/>
    </fieldset>
  </form>
  <div id="d1"></div>
  <script type="text/javascript">
    function createTable(){
      n=1;
      var str="<table width='100%' border='1' cellspacing='0' cellpadding='0'><tbody>";
      var r1=document.getElementById("row").value;
      var c1=document.getElementById("col").value;
      for(i=0;i<r1;i++)
      {
        str=str+"<tr align='center'>";
        for(j=0;j<c1;j++)
        {
          str=str+"<td>"+(n++)+"</td>";
        }
        str=str+"</tr>";
      }
      var d1=document.getElementById("d1");
      d1.innerHTML=str+"</tbody></table>";
    }
    createTable()
  </script>
</body>
</html>

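Add the following code to webapp_html.py, above the if __name__ == '__main__': block so that the route is registered before app.run() starts the server: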

@app.route("/jsdemo")
def jsdemo():
    return render_template("jsdemo.html")

Restart the web service and run the code:

python webapp_html.py

The output is as follows:

(py10) PS D:\zjdemo> & D:/ProgramData/Anaconda3/envs/py10/python.exe d:/zjdemo/webapp_html.py
 * Serving Flask app 'webapp_html'
 * Debug mode: off
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on http://127.0.0.1:5000
Press CTRL+C to quit

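Enter the following address in the browser:

http://127.0.0.1:5000/jsdemo

The browser shows the form and, below it, a table generated by createTable() with the default 3 rows and 5 columns.

Enter the following address in the browser:

http://127.0.0.1:5000/show

The browser still shows the static table from a.html.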
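
One way to confirm that both routes are registered after the edit is to inspect the application's URL map. The following is a sketch of how webapp_html.py might look after the addition; the registered_routes helper is a hypothetical name introduced here for illustration, and app.run() is left commented out so the snippet exits instead of blocking.

```python
from flask import Flask, render_template

app = Flask(__name__)


@app.route("/show")
def index():
    return render_template("a.html")


@app.route("/jsdemo")
def jsdemo():
    return render_template("jsdemo.html")


def registered_routes(flask_app):
    """List the URL rules on the app, skipping Flask's built-in static route."""
    return sorted(
        rule.rule
        for rule in flask_app.url_map.iter_rules()
        if rule.endpoint != "static"
    )


if __name__ == "__main__":
    print(registered_routes(app))
    # app.run()  # uncomment to start the development server
```

Listing the rules does not render any template, so this check works even before the HTML files exist.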

Fetching the static and dynamic HTML pages with requests

Create requestsdemo.py with the following content:

import requests

url_one = "http://127.0.0.1:5000/show"
url_two = "http://127.0.0.1:5000/jsdemo"

res_one = requests.get(url_one)
print(res_one.content.decode('utf-8'))
print("--------------------------")
res_two = requests.get(url_two)
print(res_two.content.decode('utf-8'))

Run the code:

python .\requestsdemo.py

The output is as follows:

(py10) PS D:\zjdemo> python .\requestsdemo.py
<!DOCTYPE html>
<html lang="en">

<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Document</title>
</head>
<body>
    <table id="g570b4" border="1">
        <tr id="g419fe">
            <th id="g16b02">th标头
            </th>
            <th id="gaae0b">th标头
            </th>
            <th id="gd78bc" class=" u5899e">地址
            </th>
        </tr>
        <tr id="g5af9b">
            <td id="g920bb">td表格单元
            </td>
            <td id="g9de93" class=" uab6e6">td表格单元
            </td>
            <td id="gea8dc">上海浦东虹桥某某小区某某地点        
            </td>
        </tr>
        <tr id="cf47d6" class=" u0cbcd ">
            <td id="c913e3" class=" ud690a ">td表格单元
            </td>
            <td id="c452e0" class=" uab6e6 ">td表格单元
            </td>
            <td id="c917b3" class=" u7eb06 ">td表格单元
            </td>
        </tr>
        <tr id="cba81f" class=" u0cbcd ">
            <td id="c3dae7" class=" ud690a ">td表格单元
            </td>
            <td id="c7d0f9" class=" uab6e6 ">td表格单元
            </td>
            <td id="c9fe10" class=" u7eb06 ">td表格单元
            </td>
        </tr>
    </table>
    <style>
        .u5899e {
            width: 162px;
        }
    </style>
</body>

</html>
--------------------------
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta http-equiv="X-UA-Compatible" content="IE=edge">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Document</title>
  <style>
    fieldset,#d1 {
      padding: 10px;
      width: 300px;
      margin: 0 auto;
    }
  </style>

</head>
<body>
  <form id="form1" name="form1" method="post" action="">        
    <fieldset>
      <legend>按时</legend>

      输入表格的行数:<input type="text" id="row" value="3" placeholder="请输入表格的行数" required autofocus><br>
      输入表格的列数:<input type="text" id="col" value="5" placeholder="请输入表格的列数" required autofocus><br>
      <input type="button" id="ok" value="产生表格" onclick="createTable()"/>
    </fieldset>
  </form>
  <div id="d1"></div>
  <script type="text/javascript">
    function createTable(){
      n=1;
      var str="<table width='100%' border='1' cellspacing='0' cellpadding='0'><tbody>";
      var r1=document.getElementById("row").value;
      var c1=document.getElementById("col").value;
      for(i=0;i<r1;i++)
      {
        str=str+"<tr align='center'>";
        for(j=0;j<c1;j++)
        {
          str=str+"<td>"+(n++)+"</td>";
        }
        str=str+"</tr>";
      }
      var d1=document.getElementById("d1");
      d1.innerHTML=str+"</tbody></table>";
    }
    createTable()
  </script>
</body>
</html>

The source returned for the static page matches what the browser renders. For the dynamic page, however, the source fetched by requests differs substantially from the rendered result: the table is built by JavaScript in the browser, so the response contains only an empty <div id="d1"></div> plus the script, and the table data cannot be extracted from it with methods such as XPath.
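
The difference can be checked programmatically by counting table cells in the raw source. The sketch below uses the standard library's html.parser rather than XPath, and the two short strings are abbreviated stand-ins for the responses printed above, not the full pages. TagCounter and count_tag are names introduced here for illustration.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count start tags in raw HTML source without executing any JavaScript."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_tag(html_source, tag):
    """Return how many times `tag` opens in the given HTML source."""
    parser = TagCounter()
    parser.feed(html_source)
    return parser.counts.get(tag, 0)


# Abbreviated stand-ins for the two responses.
static_source = "<table><tr><td>a</td><td>b</td></tr></table>"
dynamic_source = (
    '<div id="d1"></div>'
    '<script>var str = "<td>" + 1 + "</td>";</script>'
)

print(count_tag(static_source, "td"))   # cells present in the static markup
print(count_tag(dynamic_source, "td"))  # script text is not parsed as tags
```

The parser treats the contents of the script element as text, so the cells that JavaScript would create never appear as elements in the fetched source.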

The complete project now contains webapp.py, webapp_html_str.py, webapp_html.py, and requestsdemo.py in the project directory, with a.html and jsdemo.html in the templates directory.

Remarks: further reading on the HTML rendering process
Talk about the page rendering process
Browser rendering process (concise talk)

Summary

This article walks through installing Flask and serving both a static page and a dynamic page, then crawls each page with the requests library. Comparing the two results makes the meaning of dynamic page rendering clearer and motivates the use of the selenium library.

Origin blog.csdn.net/m0_38139250/article/details/134499193