Extracting the course categories and course names from the micro-course mall

Project requirements:
Extract all the course categories and course names in the micro-course mall, as shown in the following figure:

[Figure: the micro-course mall page]

After you receive a requirement, don't rush to write code. First think through the logic behind it:

  1. What is the structure of the micro-course mall web page, and what are its components?
  2. How are those components nested, i.e. what is the hierarchy of the page structure?
  3. Which functions will implement the requirement, e.g. the re module and findall for extraction?
  4. What steps are needed to fulfill the requirement?
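
Questions 1 and 2 can be explored before writing any extraction code. The following is only a minimal sketch, not part of the original walkthrough: it uses the standard library's `html.parser` to print the nesting path of every tag. The `OutlineParser` class, the `VOID_TAGS` set, and the shortened `sample` string are all made up for illustration.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"img", "input", "br", "meta", "link", "hr"}


class OutlineParser(HTMLParser):
    """Collect the nesting path of every opening tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.stack = []   # tags that are currently open
        self.paths = []   # one "a > b > c" string per opening tag

    def handle_starttag(self, tag, attrs):
        self.paths.append(" > ".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching opening tag, tolerating sloppy markup.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


# Shortened, made-up sample that mimics one course entry of the mall page.
sample = (
    '<article><section class="main_section"><h1>Chapter 1</h1>'
    '<figure><a href="course.html"><img src="course.png">'
    '<figcaption><span class="course_name">Intro</span></figcaption>'
    '</a></figure></section></article>'
)

parser = OutlineParser()
parser.feed(sample)
for path in parser.paths:
    print(path)
```

Reading the printed paths shows that the h1 tag sits directly under section, while the course name span sits deeper, under figure, a, and figcaption. That is the hierarchy the regular expressions later rely on.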

Next, implement the requirement in code:

# Goal: analyze the mall's category structure and extract the course names and categories
import re
with open('static/html/index.html','r',encoding='utf-8') as f:
    html=f.read()
    print(html)

Output result:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>CSDN微课商城</title>
    <link rel="stylesheet" href="../css/main.css">
    <script type="text/javascript" src="../js/main.js"></script>
</head>
<body>

<div id="register" hidden="hidden">
    <h2 class="form_p">注册</h2>
    <p id="register_message">
        <!--信息有误-->
    </p>
    <form action="#" method="post" id="register_form">
        <input id="register_account" type="text" name="account" placeholder="账号(数字、英文、下换线,8-16位)"><br/>
        <input id="register_password" type="password" name="password" placeholder="密码(数字、英文、下换线,6-16位)"><br/>
        <!--<input type="password" name="repassword" placeholder="确认密码"><br/>-->
        <input id="register_submit" type="submit" value="注册">
    </form>
</div>

<div id="login" hidden="hidden">
    <h2 class="form_p">登录</h2>
    <p id="login_message">
        <!--信息有误-->
    </p>
    <form action="#" method="post" id="login_form">
        <input id="login_account" type="text" name="account" placeholder="账号"><br>
        <input id="login_password" type="password" name="password" placeholder="密码"><br>
        <input id="login_submit" type="submit" value="登录">
    </form>
</div>

<header>
    <span class="title"> <a href="index.html">CSDN微课商城</a> </span>
    <span>
        <form action="#" class="search_form">
            <input type="text" name="course" placeholder="按课程名称搜索">
            <input type="submit" value="搜索">
        </form>
    </span>
    <span class="user">
        <a href="javascript:show('login')">登录</a>/
        <a href="javascript:show('register')">注册</a>
        <!-- 已经登录显示的内容 -->
        你好:
        <a href="user.html">用户1</a>
        <a href="#">注销</a>
    </span>
</header>

<article>
    <section class="nav_section"><img src="../img/csdn_static/2.png" alt="" width="100%"></section>
    <section class="main_section"><h1>第一章 路由与模板</h1>
        <figure><a href="course.html"> <img src="../img/course/course.png">
            <figcaption><span class="course_name">Web原理与框架简介</span><span class="price">75</span></figcaption>
        </a></figure>
        <figure><a href="course.html"> <img src="../img/course/course.png">
            <figcaption><span class="course_name">Django环境搭建与入门案例</span><span class="price">153</span></figcaption>
        </a></figure>
        <figure><a href="course.html"> <img src="../img/course/course.png">
            <figcaption><span class="course_name">基本路由映射与命名空间</span><span class="price">154</span></figcaption>
        </a></figure>
        <figure><a href="course.html"> <img src="../img/course/course.png">
            <figcaption><span class="course_name">正则路由映射参数的传递与接收</span><span class="price">177</span></figcaption>
        </a></figure>
        <figure><a href="course.html"> <img src="../img/course/course.png">
            <figcaption><span class="course_name">反向解析处理器</span><span class="price">161</span></figcaption>
        </a></figure>
        <figure><a href="course.html"> <img src="../img/course/course.png">
            <figcaption><span class="course_name">Request对象与Response对象</span><span class="price">44</span></figcaption>
        </a></figure>
        <figure><a href="course.html"> <img src="../img/course/course.png">
            <figcaption><span class="course_name">上下文与模板调用</span><span class="price">97</span></figcaption>
        </a></figure>
        <figure><a href="course.html"> <img src="../img/course/course.png">
            <figcaption><span class="course_name">模板层基础语法</span><span class="price">105</span></figcaption>
        </a></figure>
        <figure><a href="course.html"> <img src="../img/course/course.png">
            <figcaption><span class="course_name">模板过滤器</span><span class="price">133</span></figcaption>
        </a></figure>
    </section>
    <section class="main_section"><h1>第二章 模型类实现</h1>
        <figure><a href="course.html"> <img src="../img/course/course.png">
            <figcaption><span class="course_name">ORM原理与数据库配置</span><span class="price">143</span></figcaption>
        </a></figure>
        <figure><a href="course.html"> <img src="../img/course/course.png">
            <figcaption><span class="course_name">表与字段的定义和常用字段约束</span><span class="price">118</span></figcaption>
        </a></figure>
        <figure><a href="course.html"> <img src="../img/course/course.png">
            <figcaption><span class="course_name">数据迁移与维护</span><span class="price">57</span></figcaption>
        </a></figure>
        <figure><a href="course.html"> <img src="../img/course/course.png">
            <figcaption><span class="course_name">模型类的增删改</span><span class="price">45</span></figcaption>
        </a></figure>
        <figure><a href="course.html"> <img src="../img/course/course.png">
            <figcaption><span class="course_name">模型类的查询方法</span><span class="price">187</span></figcaption>
        </a></figure>
        <figure><a href="course.html"> <img src="../img/course/course.png">
            <figcaption><span class="course_name">QuerySet详解</span><span class="price">197</span></figcaption>
        </a></figure>
    </section>
</article>

<footer>
    <div id="footer_div1">
        <p><a href="#">关于我们</a>| <a href="#">招聘</a>| <a href="#">广告服务</a>| <a href="#">网站地图</a></p>
        <p><a href="#">QQ客服</a>| <a href="#">kefu@csdn.ent</a>| <a href="#">客服论坛</a>| <a href="#">400-660-0108</a>| <a
                href="#">工作时间:8:30-22:00</a></p>
        <p> 百度提供站内搜索 北ICP备19004658 </p>
        <p> ©1999-2019 北京创新乐知网络技术有限公司 </p>
        <p> 版权申诉 家长监护 经营性网站备案信息 网络110报警服务 中国互联网举报中心 北京互联网违法和不良信息举报中心 </p>
    </div>
    <div id="footer_div2">
        <figure><img src="../img/csdn_static/二维码1.png">
            <figcaption>CSDN咨询</figcaption>
        </figure>
        <figure><img src="../img/csdn_static/二维码1.png">
            <figcaption>CSDN学院</figcaption>
        </figure>
        <figure><img src="../img/csdn_static/二维码1.png">
            <figcaption>CSDN企业招聘</figcaption>
        </figure>
    </div>
</footer>

</body>
</html>

The first thing to look at is the HTML document itself. It has its own tags and attributes, such as section, h1, and span tags. The main part of the mall is wrapped in section tags, so the content of each main section is extracted first. The code is as follows:
PS: Note one processing detail: the \n characters are removed with the sub function first, because by default . in a regular expression does not match a newline.

html_s=re.sub('\n','',html)  # clean the html: remove every \n with sub
section_pattern='<section class="main_section">(.*?)</section>'  # regular expression for one section
section_s=re.findall(section_pattern,html_s)
# Next, find the course category in each section. The category is wrapped in an h1 tag, so define a regular expression to extract it.
# Then find the course names under each category. Each course name is wrapped in a span tag, so define a regular expression to extract it.

category_pattern='<h1>(.*?)</h1>'
course_pattern='<span class="course_name">(.*?)</span>'
data_s=[]
for section in section_s:
    category=re.findall(category_pattern,section)[0]
    course=re.findall(course_pattern,section)
    data_s.append(
        {
            'category':category,
            'course':course
        }
    )
print(data_s)

Output result:

[{'category': '第一章 路由与模板', 'course': ['Web原理与框架简介', 'Django环境搭建与入门案例', '基本路由映射与命名空间', '正则路由映射参数的传递与接收', '反向解析处理器', 'Request对象与Response对象', '上下文与模板调用', '模板层基础语法', '模板过滤器']}, {'category': '第二章 模型类实现', 'course': ['ORM原理与数据库配置', '表与字段的定义和常用字段约束', '数据迁移与维护', '模型类的增删改', '模型类的查询方法', 'QuerySet详解']}]

for data in data_s:
    print(data.get('category'))
    for course in data.get('course'):
        print('  ',course)

Output result:

第一章 路由与模板
   Web原理与框架简介
   Django环境搭建与入门案例
   基本路由映射与命名空间
   正则路由映射参数的传递与接收
   反向解析处理器
   Request对象与Response对象
   上下文与模板调用
   模板层基础语法
   模板过滤器
第二章 模型类实现
   ORM原理与数据库配置
   表与字段的定义和常用字段约束
   数据迁移与维护
   模型类的增删改
   模型类的查询方法
   QuerySet详解
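
The `\n` replacement step used earlier matters because `.` does not match a newline by default, so a pattern such as `(.*?)` cannot span several lines. The sketch below illustrates this on a made-up two-line `snippet` (not the real page), and also shows the `re.S` flag as one possible alternative to stripping newlines. The variable names are chosen for illustration only.

```python
import re

# Made-up snippet: the closing tag is on a different line from the opening tag.
snippet = '<section class="main_section"><h1>Chapter 1</h1>\n</section>'
pattern = '<section class="main_section">(.*?)</section>'

# By default "." stops at "\n", so the pattern cannot span both lines.
default_result = re.findall(pattern, snippet)

# Option 1 (the approach used in this article): remove the newlines first.
stripped_result = re.findall(pattern, re.sub('\n', '', snippet))

# Option 2: leave the text untouched and let "." match newlines via re.S.
dotall_result = re.findall(pattern, snippet, flags=re.S)

print(default_result)
print(stripped_result)
print(dotall_result)
```

The first call finds nothing, while the other two find the section. With `re.S` the captured text keeps its newline, which may or may not be wanted depending on what is done with the result afterwards.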

Experience:

  • The code above relies only on basic Python knowledge:
  • Basic data types, such as dictionaries and lists
  • Basic statements, such as for loops
  • Basic functions and methods, such as .get() and re.findall()
  • Regular expressions applied to an HTML scenario
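
The two methods named above can be tried in isolation. The following is a small sketch with made-up data (the `record` dictionary and the HTML fragments are illustrative, not taken from the mall page). It shows how `dict.get` behaves for a missing key and what `re.findall` returns with and without a match.

```python
import re

record = {'category': 'Chapter 1', 'course': ['Intro', 'Setup']}

# dict.get returns None (or a chosen default) instead of raising KeyError.
category = record.get('category')
missing = record.get('price')
missing_with_default = record.get('price', 0)

# With one capture group, re.findall returns only the captured text.
names = re.findall(
    '<span class="course_name">(.*?)</span>',
    '<span class="course_name">Intro</span><span class="price">75</span>',
)

# With no match, findall returns an empty list, so indexing [0] directly
# would raise IndexError; checking the list first avoids that.
titles = re.findall('<h1>(.*?)</h1>', '<p>no heading</p>')
first_title = titles[0] if titles else None

print(category, missing, missing_with_default)
print(names, first_title)
```

The last check is relevant to the extraction loop in this article: it indexes the findall result with [0], which works here because every main section contains an h1 tag.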

Origin blog.csdn.net/weixin_42961082/article/details/109789334