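1. HTML Structure of the Web Page

After crawling a web page, how do you find the content you want inside it?

Suppose, for example, you want the image links. The image list in the page is structured like this: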

 <div class="c s_li zxgx_list l">
     <ul> 
          <li>
             <a target="_blank" 
                href="https://www.xxx.com/HTM/2020/0301/1.html" title="测试">
                 <img src="https://img.xxx.com/1.jpg" 
                      alt="Python 练习" width="120" height="160" />
                 <strong>图片名字</strong>
             </a>
         </li> 
          <li>
             <a target="_blank" 
                href="https://www.xxx.com/HTM/2020/0301/2.html" title="测试">
                 <img src="https://img.xxx.com/2.jpg" 
                      alt="Python 练习" width="120" height="160" />
                 <strong>图片名字</strong>
             </a>
         </li> 
          <li>
             <a target="_blank" 
                href="https://www.xxx.com/HTM/2020/0301/3.html" title="测试">
                 <img src="https://img.xxx.com/3.jpg" 
                      alt="Python 练习" width="120" height="160" />
                 <strong>图片名字</strong>
             </a>
         </li> 
     </ul>
 </div>

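The crawler gives us the HTML as text (the rest of the code is omitted).

 import requests

 # weburl and headers are defined in the omitted code
 req = requests.get(url=weburl, headers=headers)
 req.encoding = "gb2312"
 req_html = req.text  # the page's HTML as text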
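
Setting req.encoding matters because requests decodes the raw response bytes with that codec when req.text is read. The offline sketch below illustrates the same decoding step without a network call; the byte string is a made-up stand-in for a response body, not data from the real site.

```python
# Hypothetical bytes standing in for a raw HTTP response body encoded as GB2312.
raw_body = '<a title="测试">图片名字</a>'.encode("gb2312")

# Decoding with the matching codec recovers the original text.
decoded_ok = raw_body.decode("gb2312")

# Decoding with the wrong codec garbles the text; errors="replace" substitutes
# U+FFFD for byte sequences that are not valid UTF-8 instead of raising.
decoded_bad = raw_body.decode("utf-8", errors="replace")

print(decoded_ok)
print(decoded_bad)
```

If the extracted text looks garbled, the page's declared charset is the first thing worth checking.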

2. Analyzing the HTML Structure

Looking at the HTML, we can see that every image sits inside an li element.

 <li>
     <a target="_blank" 
        href="https://www.xxx.com/HTM/2020/0301/3.html" title="测试">
         <img src="https://img.xxx.com/3.jpg" 
              alt="Python 练习" width="120" height="160" />
         <strong>图片名字</strong>
     </a>
 </li>

3. Applying a Regular Expression

To find these images in Python, build a regular expression:

 ex = '<li>.*?<img src="(.*?)".*?</li>'

Use the findall() function of the re module to find all the image URLs.

 L = re.findall(ex, req_html, re.DOTALL|re.I)

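Because the string contains line breaks, the re.DOTALL flag is used here. It makes the dot (.) in the pattern match any single character, including a newline. The re.I flag makes the match case-insensitive.

By default, the dot (.) matches any character except a newline.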
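
As a small illustration of the difference, the sketch below runs the same pattern with and without re.DOTALL on a shortened, made-up li fragment (the file name a.jpg is just a placeholder).

```python
import re

# A shortened, hypothetical fragment with line breaks inside the <li>.
text = '<li>\n<img src="a.jpg" />\n</li>'
pattern = r'<li>.*?<img src="(.*?)".*?</li>'

# Without re.DOTALL the dot cannot cross the newline, so nothing matches.
without_dotall = re.findall(pattern, text)

# With re.DOTALL the dot also matches "\n", so the src value is captured.
with_dotall = re.findall(pattern, text, re.DOTALL)

print(without_dotall)  # []
print(with_dotall)     # ['a.jpg']
```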

Printing the elements of list L in a loop gives, for example:

 https://img.xxx.com/1.jpg
 https://img.xxx.com/2.jpg
 https://img.xxx.com/3.jpg

These are the image paths we wanted.
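
Putting the pieces together, here is a minimal runnable sketch that applies the pattern to the sample markup directly, with no network request. The HTML is pasted in as a string, and the first image is assumed to be 1.jpg so that it matches the output listed above.

```python
import re

# Sample markup from section 1, embedded as a string for an offline run.
req_html = """
<div class="c s_li zxgx_list l">
    <ul>
        <li>
            <a target="_blank"
               href="https://www.xxx.com/HTM/2020/0301/1.html" title="测试">
                <img src="https://img.xxx.com/1.jpg"
                     alt="Python 练习" width="120" height="160" />
                <strong>图片名字</strong>
            </a>
        </li>
        <li>
            <a target="_blank"
               href="https://www.xxx.com/HTM/2020/0301/2.html" title="测试">
                <img src="https://img.xxx.com/2.jpg"
                     alt="Python 练习" width="120" height="160" />
                <strong>图片名字</strong>
            </a>
        </li>
        <li>
            <a target="_blank"
               href="https://www.xxx.com/HTM/2020/0301/3.html" title="测试">
                <img src="https://img.xxx.com/3.jpg"
                     alt="Python 练习" width="120" height="160" />
                <strong>图片名字</strong>
            </a>
        </li>
    </ul>
</div>
"""

# One capturing group: the value of the first src attribute inside each <li>.
ex = r'<li>.*?<img src="(.*?)".*?</li>'

# re.DOTALL lets "." cross line breaks; re.I ignores tag-name case.
urls = re.findall(ex, req_html, re.DOTALL | re.I)

for url in urls:
    print(url)
```

Because the pattern has exactly one capturing group, findall returns a plain list of strings, one per li element.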

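4. How the Regular Expression Works

Regular expressions come up often when writing crawlers in Python, and (.*?) is used especially frequently. So what exactly does it mean?

.*? is a non-greedy (lazy) match: it consumes as few characters as possible while still letting the whole pattern succeed.

.* is a greedy match: it consumes as many characters as possible.

( ) is a capturing group: findall returns only the text matched between the parentheses.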
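
A quick side-by-side on a tiny made-up string may help before the longer example; the b tags here are only illustrative and do not come from the crawled page.

```python
import re

sample = "<b>one</b> and <b>two</b>"

# Greedy: .* runs to the LAST </b>, so the whole string is one match.
greedy = re.findall(r"<b>.*</b>", sample)

# Lazy: .*? stops at the FIRST </b> it can, giving one match per tag pair.
lazy = re.findall(r"<b>.*?</b>", sample)

# Capturing group: only the text inside the parentheses is returned.
captured = re.findall(r"<b>(.*?)</b>", sample)

print(greedy)    # ['<b>one</b> and <b>two</b>']
print(lazy)      # ['<b>one</b>', '<b>two</b>']
print(captured)  # ['one', 'two']
```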

An example:

 req_html = """
 你好，我是那天你看到的那个 Red Man。
 This is the best day! I have a dream.
 Now that the Greeks held Troy and King Priam lay dead, 
 Aeneas said to his father,
 """

Regex 1:

 ex = r"t.*\s"

 L = re.findall(ex, req_html, re.DOTALL | re.I)  # let . match newlines; ignore case
 print( len(L) )
 for item in L:
     print( item +"\n")

Output: L has just one element, running from "This..." to the end of the string.

 1
 This is the best day! I have a dream.
 Now that the Greeks held Troy and King Priam lay dead, 
 Aeneas said to his father,

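Analysis: .* is greedy, so it matches as much as possible. The pattern starts at the first "T" and must end with a whitespace character (\s matches spaces, tabs, and newlines, not only spaces). Since the dot also matches newlines here, the match runs all the way to the last whitespace character in the string, the final newline.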

Regex 2:

 ex = r"t.*?\s"

Output:

 8
 This   
 the 
 t   # the t in "best"
 that 
 the 
 Troy 
 to 
 ther,

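Analysis: .*? is non-greedy, so each match starts at a t (or T, since case is ignored) and stops at the first whitespace character that follows it. That yields 8 matches. Each result contains everything the pattern matched, including the leading t and the trailing whitespace character (\s).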

Regex 3:

 ex = r"t(.*?)\s"

Output:

 8
 his
 he
 
 hat
 he
 roy
 o
 her,

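Analysis: ( ) is a capturing group, so findall returns only the text between the t and the whitespace character (\s). The third result is an empty string because the t in "best" is followed immediately by a space.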
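
For readers who want to check the three results themselves, the following sketch runs all three patterns on the sample text in one script. It uses raw strings for the patterns (so \s is not treated as a string escape) and the same flags as above; the variable names are chosen here for illustration.

```python
import re

req_html = """
你好，我是那天你看到的那个 Red Man。
This is the best day! I have a dream.
Now that the Greeks held Troy and King Priam lay dead,
Aeneas said to his father,
"""

flags = re.DOTALL | re.I  # let . match newlines; ignore case

greedy = re.findall(r"t.*\s", req_html, flags)      # regex 1
lazy = re.findall(r"t.*?\s", req_html, flags)       # regex 2
captured = re.findall(r"t(.*?)\s", req_html, flags)  # regex 3

print(len(greedy), len(lazy), len(captured))  # 1 8 8
print(lazy)
print(captured)
```

The lazy results keep their trailing whitespace character (the last one ends with a newline), which is why the printed list shows 'ther,\n' rather than 'ther,'.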

stones4zd

Python Study Notes 10: Using Regular Expressions in a Crawler to Extract the Data You Want from a Page
