Python study notes 10: using regular expressions in a crawler to extract the data you want from a page

First, the structure of HTML pages

Once a crawler has fetched a page, how do you find the content you want in it?

For example, suppose you want the image links. The image list on the page is structured as follows:

 <div class="c s_li zxgx_list l">
     <ul>
         <li>
             <a target="_blank"
                href="https://www.xxx.com/HTM/2020/0301/1.html" title="Test">
                 <img src="https://img.xxx.com/1.jpg"
                      alt="Python practice" width="120" height="160" />
                 <strong>Picture name</strong>
             </a>
         </li>
         <li>
             <a target="_blank"
                href="https://www.xxx.com/HTM/2020/0301/2.html" title="Test">
                 <img src="https://img.xxx.com/2.jpg"
                      alt="Python practice" width="120" height="160" />
                 <strong>Picture name</strong>
             </a>
         </li>
         <li>
             <a target="_blank"
                href="https://www.xxx.com/HTM/2020/0301/3.html" title="Test">
                 <img src="https://img.xxx.com/3.jpg"
                      alt="Python practice" width="120" height="160" />
                 <strong>Picture name</strong>
             </a>
         </li>
     </ul>
 </div>

The crawler fetches the HTML text like this (the rest of the code is omitted):

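 import requests

 # weburl and headers are defined in the omitted code
 req = requests.get(url=weburl, params=None, headers=headers)
 req.encoding = "gb2312"  # decode the response body as GB2312
 req_html = req.text  # the HTML text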
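
The req.encoding line matters because req.text is decoded from the raw response bytes using that encoding. The effect can be seen without any network access in the following minimal sketch. The sample string and the names `original`, `raw`, `decoded_ok` and `decoded_wrong` are made up for illustration and are not part of the original crawler.

```python
# Bytes as a GB2312-encoded page would deliver them (illustrative sample).
original = "图片名字"
raw = original.encode("gb2312")

decoded_ok = raw.decode("gb2312")      # right codec: the same text comes back
decoded_wrong = raw.decode("latin-1")  # wrong codec: unreadable characters

print(decoded_ok)
print(decoded_ok == original, decoded_wrong == original)
```

If the text extracted from a page looks garbled, the encoding is the first thing to check.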

Second, analyzing the HTML structure

Looking at the HTML, you can see that each image sits inside its own li element.

 <li>
     <a target="_blank"
        href="https://www.xxx.com/HTM/2020/0301/3.html" title="Test">
         <img src="https://img.xxx.com/3.jpg"
              alt="Python practice" width="120" height="160" />
         <strong>Picture name</strong>
     </a>
 </li>

Third, using regular expressions

To find the images in Python, construct a regular expression:

 ex = '<li>.*?<img src="(.*?)".*?</li>'

Then use the re.findall() method to find all the image addresses.

 L = re.findall(ex, req_html, re.DOTALL|re.I)

Because the string contains newlines, the re.DOTALL flag is used here. It makes the dot (.) match any single character, including a newline.

By default, the dot (.) matches any character except a newline.

Looping over L and printing each element gives:

 https://img.xxx.com/1.jpg
 https://img.xxx.com/2.jpg
 https://img.xxx.com/3.jpg

These are the image paths we need.
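
Putting the steps above together, here is a minimal, self-contained sketch. It assumes a hard-coded `sample_html` string standing in for req_html, so no network request is made. The names `sample_html`, `with_dotall` and `without_dotall` are illustrative and do not come from the original code.

```python
import re

# Stand-in for req_html: a shortened copy of the list structure shown earlier.
sample_html = """
<ul>
  <li>
    <a target="_blank" href="https://www.xxx.com/HTM/2020/0301/1.html">
      <img src="https://img.xxx.com/1.jpg"
           alt="demo" width="120" height="160" />
    </a>
  </li>
  <li>
    <a target="_blank" href="https://www.xxx.com/HTM/2020/0301/2.html">
      <img src="https://img.xxx.com/2.jpg"
           alt="demo" width="120" height="160" />
    </a>
  </li>
</ul>
"""

ex = r'<li>.*?<img src="(.*?)".*?</li>'

# With re.DOTALL the dot also matches newlines, so the pattern can span lines.
with_dotall = re.findall(ex, sample_html, re.DOTALL | re.I)

# Without re.DOTALL the dot stops at each newline, so nothing matches here.
without_dotall = re.findall(ex, sample_html, re.I)

for src in with_dotall:
    print(src)
print(len(without_dotall))
```

Dropping re.DOTALL leaves the list empty for this input, because the li and img tags are on different lines.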

Fourth, what the regular expression means

When writing crawlers in Python you often run into regular expressions, and (.*?) is one of the most frequently used patterns. So what exactly does it mean?

  • .*? is non-greedy (lazy) matching: it matches as few characters as possible.

  • .* is greedy matching: it matches as many characters as possible.

  • () is a capturing group: only the text matched inside the parentheses is returned.

For example:

 req_html = """
 你好,我是那天你看到的那个 Red Man。
 This is the best day! I have a dream.
 Now that the Greeks held Troy and King Priam lay dead, 
 Aeneas said to his father,
 """
  • Regex 1:

 ex = r"t.*\s"
 L = re.findall(ex, req_html, re.DOTALL | re.I)  # dot matches newlines, ignore case
 print(len(L))
 for item in L:
     print(item + "\n")

Output: L has a single element, which runs from "This ..." to the end of the string.

 1
 This is the best day! I have a dream.
 Now that the Greeks held Troy and King Priam lay dead, 
 Aeneas said to his father,

Analysis: .* is greedy and matches as much as possible. The pattern asks for text that starts at a t (case is ignored, so the T in This qualifies) and ends at a whitespace character (\s). Being greedy, it runs all the way to the last whitespace in the string, the final newline.

  • Regex 2:

 ex = r"t.*?\s"

Output:

 8
 This   
 the 
 t   #  the t in best
 that 
 the 
 Troy 
 to 
 ther,

Analysis: .*? is non-greedy, so it matches as few characters as possible. The pattern still starts at a t and ends at a whitespace character (\s), but now each match stops at the first whitespace after its t. The result is eight short matches instead of one long one.
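
The same difference matters for the crawler pattern. As a small sketch (the `tag` string and the names `greedy` and `lazy` are made up for illustration), compare the two quantifiers on a single img tag:

```python
import re

tag = '<img src="https://img.xxx.com/1.jpg" alt="demo" width="120" />'

# Greedy: .* runs on to the LAST double quote in the tag.
greedy = re.findall(r'src=".*"', tag)

# Non-greedy: .*? stops at the FIRST double quote after src=".
lazy = re.findall(r'src=".*?"', tag)

print(greedy)
print(lazy)
```

The greedy version swallows the alt and width attributes as well, which is why the image pattern uses .*? rather than .* .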

  • Regex 3:

 ex = r"t(.*?)\s"

Output:

 8
 his
 he
     #  empty string, from the t in best
 hat
 he
 roy
 o
 her,

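Analysis: the parentheses () form a capturing group, so findall returns only the text matched inside them, that is, whatever lies between the t and the whitespace character (\s). For the t in best, that is an empty string.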
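
Capturing groups also scale to more than one value. When a pattern contains several groups, re.findall returns a list of tuples with one entry per group. The sketch below pulls both the page link and the image address out of an li; the sample string and the names `item_html` and `pairs` are illustrative assumptions, not part of the original code.

```python
import re

# One list item in the same shape as the page structure shown earlier.
item_html = """
<li>
    <a target="_blank"
       href="https://www.xxx.com/HTM/2020/0301/3.html" title="demo">
        <img src="https://img.xxx.com/3.jpg"
             alt="demo" width="120" height="160" />
    </a>
</li>
"""

# Two groups: the first captures the href, the second the img src.
ex = r'<li>.*?href="(.*?)".*?<img src="(.*?)".*?</li>'
pairs = re.findall(ex, item_html, re.DOTALL | re.I)

for href, src in pairs:
    print(href, src)
```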

Origin blog.csdn.net/weixin_42703239/article/details/104684626