Data parsing with regular expressions: crawling pictures from Qiushibaike

 

An overview of the principle of data parsing: the content to be extracted is stored either in a tag's text or in one of the tag's attributes. The process consists of two steps:

      1. Locate the target tag.

      2. Extract the data stored in the tag's text or in its attribute (the parsing step).

Data parsing is mainly used in focused crawlers. There are three common approaches: 1. regular expressions, 2. bs4, 3. xpath.

Today I will mainly use regular expressions to crawl the pictures on Qiushibaike.

Compared with the article I posted before, the workflow here has one extra step: data parsing. The process is as follows:

                    Coding process: 1. Specify the URL 2. Initiate the request 3. Get the response data 4. Parse the data 5. Persist the results

Regular expressions follow their own set of rules; for the details, you can refer to related books, such as introductory texts on the topic.
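
Two regex features do the heavy lifting in the crawler below: non-greedy quantifiers (`.*?`) and the `re.S` (DOTALL) flag. A minimal, self-contained illustration:

```python
import re

# Two key ingredients: 1) non-greedy matching with .*?  2) the re.S flag
html = "<div>one</div>\n<div>two</div>"

# Greedy .* runs all the way to the LAST </div> when re.S is on:
print(re.findall('<div>(.*)</div>', html, re.S))   # ['one</div>\n<div>two']
# Non-greedy .*? stops at the FIRST </div>, giving one result per tag:
print(re.findall('<div>(.*?)</div>', html, re.S))  # ['one', 'two']

# Without re.S, '.' cannot cross a newline, so a tag spanning lines is missed:
print(re.findall('<b>(.*?)</b>', "<b>a\nb</b>"))        # []
print(re.findall('<b>(.*?)</b>', "<b>a\nb</b>", re.S))  # ['a\nb']
```

This is why the crawler below passes `re.S` to `re.findall`: each picture's `<div class="thumb">` block spans several lines of HTML.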

 

 

First, open the Qiushibaike website, click on the hot pictures section, press F12, and select the Elements panel; you will find the code for the corresponding picture, shown below. Our main goal is the src attribute, because that is where the actual URL of the picture is stored.

 

<div class="thumb">
<a href="/article/123081097" target="_blank">
<img src="//pic.qiushibaike.com/system/pictures/12308/123081097/medium/ZS1B0Z8319JCAUGN.jpg" alt="糗事#123081097" class="illustration" width="100%" height="auto">
</a>
</div>
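
Before writing the full crawler, we can dry-run the extraction against this snippet, with no network access needed (the pattern here is the same one used in the complete code below):

```python
import re

# The HTML snippet shown above, as captured from the Elements panel.
snippet = '''<div class="thumb">
<a href="/article/123081097" target="_blank">
<img src="//pic.qiushibaike.com/system/pictures/12308/123081097/medium/ZS1B0Z8319JCAUGN.jpg" alt="糗事#123081097" class="illustration" width="100%" height="auto">
</a>
</div>'''

# Anchor on the thumb <div>, then capture everything between src=" and the closing quote.
ex = '<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'
print(re.findall(ex, snippet, re.S))
# → ['//pic.qiushibaike.com/system/pictures/12308/123081097/medium/ZS1B0Z8319JCAUGN.jpg']
```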

 

Next is the complete code, with detailed comments marked inline; I believe everyone can follow it.

import requests
import re
import os

if __name__ == '__main__':
    # Specify the page URL, fake the UA, and create a folder for the images
    if not os.path.exists('./qiutu'):
        os.mkdir('./qiutu')
    headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3947.100 Safari/537.36'
        }
    # Set up a URL template that covers every page
    url = 'https://www.qiushibaike.com/imgrank/page/%d/'
    # Send the request and get the response data for each whole page
    for pageNum in range(1, 3):
        new_url = url % pageNum
        page_text = requests.get(url=new_url, headers=headers).text
        # Focused crawling: use a regular expression to parse/extract every picture
        ex = '<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'
        # Apply the regular expression to all of the crawled page text
        img_list_data = re.findall(ex, page_text, re.S)
        # print(img_list_data)  # sanity check
        for src in img_list_data:
            src = 'http:' + src   # build a complete URL (the page uses protocol-relative src values)
            img_data = requests.get(url=src, headers=headers).content  # fetch the image as binary data
            img_name = src.split('/')[-1]  # derive the file name from the URL
            img_path = './qiutu/' + img_name  # path where the image is stored
            with open(img_path, 'wb') as fp:
                fp.write(img_data)
                print(img_name, 'over!!!')
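
The two string operations inside the loop can be dry-run on the src value extracted earlier, again without touching the network:

```python
# Dry run of the URL handling from the loop above, no network needed.
src = '//pic.qiushibaike.com/system/pictures/12308/123081097/medium/ZS1B0Z8319JCAUGN.jpg'

full_url = 'http:' + src       # the page uses protocol-relative URLs, so prepend a scheme
img_name = src.split('/')[-1]  # the last path segment serves as the file name

print(full_url)
print(img_name)  # ZS1B0Z8319JCAUGN.jpg
```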

After running it, a folder storing the picture data will appear in the current directory.

Origin blog.csdn.net/qwerty1372431588/article/details/105979889