Requests library + regular expressions: a simple crawler example (4K beauty pictures)

section1: Statement

These are my own study notes; the crawled content will not be used for any commercial purpose.

section2: Download link analysis

First of all, we open the 4K beauty category page we want to crawl; the pictures on it are exactly what we want.
Next, we need to look at the page's source code for further analysis. (You can right-click and choose Inspect, or use the shortcut Ctrl+Shift+I.)
Take the first picture as an example: in its source we find the img tag and its src attribute. We want to take the value after src and join it with 'http://pic.netbian.com' to form a complete link. (Why join it with http://pic.netbian.com? Because that is the link to the website's homepage, hahaha.) But if you try the resulting link in the browser first, the result is not quite what we expect.
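By the way, the joining step itself can also be done with urllib.parse.urljoin from the standard library, which handles leading slashes for you. A minimal sketch (the src value here is only an illustrative example, not taken from the real page):

from urllib.parse import urljoin

# Illustrative relative path, as it might appear after an img tag's src attribute
src = '/uploads/allimg/example_thumb.jpg'
full_url = urljoin('http://pic.netbian.com', src)
print(full_url)  # http://pic.netbian.com/uploads/allimg/example_thumb.jpg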

It's a thumbnail, not the high-definition big picture we want.
Let's take a closer look: there is also a hyperlink tag above the img tag. Clicking it takes us to a detail page, and we inspect that page's source in the same way.
There we find another img src, so let's try joining it with 'http://pic.netbian.com' as well.

Ohhhhhh, this time it is the high-definition big picture we want!

So the idea is clear; the next step is to start writing the crawler!

section3: Code writing

1. Import section

import requests
import re
import os

2. Construct the request

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.116 Safari/537.36'
}
url='http://pic.netbian.com/4kmeinv/'
response_1=requests.get(url=url,headers=headers)
html_1=response_1.text
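Before moving on, a small robustness sketch (not part of the original steps): it can help to confirm the request actually succeeded before parsing the text.

# Optional sanity check: raise an error if the page failed to load
response_1.raise_for_status()
print(response_1.status_code)  # expect 200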

For learning the Requests library, I put together some notes earlier (I didn't really know how to write at the time, so they are a bit messy):

Web crawler-----Introduction to the Requests library

3. The construction of regular expressions

Because we have to go from the list page through a detail page to reach the high-definition image, the first regular expression we construct extracts the links to those detail pages.
What we need is the value after the hyperlink tag's href attribute, so the regular expression can be written like this:

<a.*?href.+?(/tupian.+?html)

Regular expressions use a set of common symbols; for reference, see: Regular expression - commonly used matching rules
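As a quick, self-contained demonstration of how this pattern behaves (the href value below is made up for illustration):

import re

html_sample = '<li><a href="/tupian/12345.html" target="_blank">pic</a></li>'
print(re.findall('<a.*?href.+?(/tupian.+?html)', html_sample, re.S))
# ['/tupian/12345.html']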

4. Data processing

After constructing the expression, we use it to take out the data:

image_urls=re.findall('<a.*?href.+?(/tupian.+?html)',html_1,re.S)

We can print it and take a look at the result.
We can see that the result returned by re.findall is a list; this is something to pay attention to. We therefore need to traverse the list before we can use each link.

for image_url in image_urls:
    picture_urls = 'http://pic.netbian.com' + image_url

Because the extracted results are not complete links, I prepend the site prefix while traversing. Then we print and check the result.
We get a series of links; opening the first one shows exactly the detail page we just inspected, which means this step succeeded.

Then we repeat the above steps: request each detail page and extract the data from it.

for image_url in image_urls:
    picture_urls = 'http://pic.netbian.com' + image_url
    # print(picture_urls)
    response_2 = requests.get(url=picture_urls, headers=headers)
    html_2 = response_2.text
    pictures = re.findall('<div.*?photo-pic.*?<img src="(/uploads.+?jpg).*?alt.+?"(.*?)"', html_2, re.S)

The regular expression here is built the same way; what it extracts is the image's source-file path and the text after alt (which we will use as the file name).
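Because the pattern now has two capture groups, re.findall returns a list of (src, alt) tuples. A quick check on a made-up HTML fragment (illustrative values only):

import re

html_sample = '<div class="photo-pic"><img src="/uploads/allimg/sample.jpg" alt="sample title">'
print(re.findall('<div.*?photo-pic.*?<img src="(/uploads.+?jpg).*?alt.+?"(.*?)"', html_sample, re.S))
# [('/uploads/allimg/sample.jpg', 'sample title')]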

    for picture in pictures:
        picture_url = picture[0]
        picture_src = 'http://pic.netbian.com' + picture_url  # source-file link of the HD image

        picture_name = picture[1] + '.jpg'  # build the file name for the picture to be saved
        picture_name = picture_name.encode('iso-8859-1').decode('gbk')  # prevent mojibake (garbled characters) in the name
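A note on that encode/decode round-trip: it works because requests guessed the wrong charset for the page, so the GBK bytes were mis-decoded as ISO-8859-1; re-encoding recovers the original bytes, which we then decode as GBK. An alternative sketch, assuming the site really does serve GBK-encoded pages, is to tell requests the encoding up front, after which the per-name round-trip is unnecessary:

response_2 = requests.get(url=picture_urls, headers=headers)
response_2.encoding = 'gbk'  # decode the page as GBK from the start
html_2 = response_2.text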

5. Save the data

First, we have to create a folder

if not os.path.exists('D:/4K美女'):
    os.mkdir('D:/4K美女')
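An equivalent sketch uses os.makedirs, which also creates any missing parent directories and skips the explicit existence check:

os.makedirs('D:/4K美女', exist_ok=True)  # no error if the folder already exists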

Secondly, do the final processing of the data we got

        picture_data = requests.get(url=picture_src, headers=headers).content  # the content to write to the file: the HD image we want
        picture_path = 'D:/4K美女/' + picture_name  # build the image's storage path

Finally, write the file, save and close

        with open(picture_path, 'wb') as f:
            f.write(picture_data)
            print(picture_path, 'download complete')
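For very large images, a streaming sketch (an optional variation, not in the original code) avoids holding the whole file in memory at once:

with requests.get(url=picture_src, headers=headers, stream=True) as r:
    with open(picture_path, 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):  # chunk size is an arbitrary choice
            f.write(chunk)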

6. Complete code

import requests
import re
import os

# create the save folder
if not os.path.exists('D:/4K美女'):
    os.mkdir('D:/4K美女')

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.116 Safari/537.36'
}
url='http://pic.netbian.com/4kmeinv/'
response_1=requests.get(url=url,headers=headers)
html_1=response_1.text
image_urls=re.findall('<a.*?href.+?(/tupian.+?html)',html_1,re.S)
# print(image_urls)
for image_url in image_urls:
    picture_urls = 'http://pic.netbian.com' + image_url
    # print(picture_urls)
    response_2 = requests.get(url=picture_urls, headers=headers)
    html_2 = response_2.text
    pictures = re.findall('<div.*?photo-pic.*?<img src="(/uploads.+?jpg).*?alt.+?"(.*?)"', html_2, re.S)

    for picture in pictures:
        picture_url = picture[0]
        picture_src = 'http://pic.netbian.com' + picture_url  # source-file link of the HD image

        picture_name = picture[1] + '.jpg'  # build the file name for the picture to be saved
        picture_name = picture_name.encode('iso-8859-1').decode('gbk')  # prevent mojibake (garbled characters) in the name

        picture_data = requests.get(url=picture_src, headers=headers).content  # the content to write: the HD image we want
        picture_path = 'D:/4K美女/' + picture_name  # build the image's storage path

        # save the image
        with open(picture_path, 'wb') as f:
            f.write(picture_data)
            print(picture_path, 'download complete')

section4: Supplement (multi-page crawling)

If you want to crawl multiple pages, you only need to wrap the first URL in a for loop, namely:

for i in range(2,5):
    url='http://pic.netbian.com/4kmeinv/index_{}.html'.format(i)
    response_1=requests.get(url=url,headers=headers)
    html_1=response_1.text
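Note that on this site the first page is the bare /4kmeinv/ URL rather than index_1.html, so a sketch that covers page 1 as well (the page range is just an example) could look like this; the extraction and download steps from above then run inside the loop:

for i in range(1, 5):
    if i == 1:
        url = 'http://pic.netbian.com/4kmeinv/'
    else:
        url = 'http://pic.netbian.com/4kmeinv/index_{}.html'.format(i)
    response_1 = requests.get(url=url, headers=headers)
    html_1 = response_1.text
    # ...then run the same parsing and saving steps on html_1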

So let's go!!!

This is the first time I have completed a crawler example, and I feel it turned out quite well. Friends, could you give it a thumbs up to show some support, hehe.

I know my terminology may not be very standard in some places, and I hope more experienced readers can offer some suggestions.


Origin blog.csdn.net/qq_44921056/article/details/112982240