Scraping Qiushibaike jokes with a simple script (using regular expressions)

1. Observe the URL:

Observing the URLs, a clear pattern emerges: https://www.qiushibaike.com/text/page/X/

where X increases with the page number: 1, 2, 3, ...

 

2. Writing the main function:

def main():
    for i in range(1, 11):
        url = "https://www.qiushibaike.com/text/page/%s/" % i
        parse_page(url)

Here parse_page() is the page-parsing function, defined below. The code above crawls the jokes on pages 1 through 10.
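As a quick sanity check (not part of the original script), the same pattern can be expanded into a list to see exactly which URLs the loop will visit:

```python
# Expand the page-number loop into the full list of URLs it generates
urls = ["https://www.qiushibaike.com/text/page/%s/" % i for i in range(1, 11)]
print(urls[0])   # first page
print(urls[-1])  # last page
```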

 

3. Analyzing the page and writing the parser:

Looking at the page source for a target joke:

We find that each joke is placed in a span tag inside a div tag, so we can write the corresponding regular expression: <div class="content".*?<span>(.*?)</span>
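A quick check of that pattern on a made-up HTML fragment (the markup below is a simplified stand-in for the real page, not copied from it):

```python
import re

# Simplified stand-in for the page source: two jokes, one spanning two lines
html = '''<div class="content">
<span>Joke one text</span>
</div>
<div class="content">
<span>Joke two
spans two lines</span>
</div>'''

# re.DOTALL lets . match newlines, so multi-line jokes are captured too
contents = re.findall(r'<div class="content".*?<span>(.*?)</span>', html, re.DOTALL)
print(contents)
```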

Writing the page-parsing function:

1) Set the request headers, get the response, and check its encoding.

import requests

def parse_page(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3724.8 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    text = response.text
    print(text)

  Print text to check whether the output is garbled. If it is, change text = response.text to text = response.content.decode("utf-8").
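The difference can be illustrated without a network call. The byte string below stands in for response.content (assumed to be UTF-8, which is what the decode call above expects):

```python
# Stand-in for response.content: UTF-8 encoded bytes from the server
raw = "糗事百科".encode("utf-8")

garbled = raw.decode("latin-1")  # what a wrong charset guess produces (mojibake)
fixed = raw.decode("utf-8")      # explicitly decoding as UTF-8 recovers the text
print(garbled)
print(fixed)
```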

2) Write the regular expression, store the jokes, strip the leftover markup, and print the results.

import re
import requests

def parse_page(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3724.8 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    text = response.text
    contents = re.findall(r'<div class="content".*?<span>(.*?)</span>', text, re.DOTALL)
    duanzi = []
    for content in contents:
        # flags must be passed by keyword: re.sub's fourth positional
        # argument is count, not flags
        x = re.sub(r'<(.*?)>', "", content, flags=re.DOTALL)
        duanzi.append(x.strip())
        print(x.strip())
        print("=" * 60)

  The re module's findall function finds every match on the page and stores the results in the duanzi list. re.DOTALL makes the dot (.) match any character, including newlines. Because the captured strings still contain some tags, they need to be stripped out, i.e. x = re.sub(r'<(.*?)>', "", content, flags=re.DOTALL). Finally, a line of equals signs is printed between jokes so each one is shown clearly.
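The tag-stripping step on its own, run against a hypothetical captured <span> body (the fragment below is illustrative, not taken from the real page):

```python
import re

# Hypothetical captured <span> body; <br/> tags mark line breaks on the page
content = "First line<br/>Second line<br/>\n"

# Remove every <...> tag, leaving only the joke text
x = re.sub(r'<(.*?)>', "", content, flags=re.DOTALL)
print(x.strip())
```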

4. Full code and results:

import re
import requests

def parse_page(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3724.8 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    text = response.text
    contents = re.findall(r'<div class="content".*?<span>(.*?)</span>', text, re.DOTALL)
    duanzi = []
    for content in contents:
        x = re.sub(r'<(.*?)>', "", content, flags=re.DOTALL)
        duanzi.append(x.strip())
        print(x.strip())
        print("=" * 60)

def main():
    for i in range(1, 11):
        url = "https://www.qiushibaike.com/text/page/%s/" % i
        parse_page(url)

if __name__ == '__main__':
    main()

  

 


Origin www.cnblogs.com/zyde-2893/p/11203421.html