Crawler in practice: batch-crawling Jingdong underwear pictures (automatically crawling multiple pages, not just one)

Doing what the boys want to do: grabbing a large batch of pictures of women's underwear.
Author: Electrical - Yu Dengwu

Preparation

If we want to download all the pictures of Jingdong underwear products to the local disk, copying and pasting them by hand would be a huge job. Instead, we can do it with a Python crawler.
Step 1: Analyze the web address

Starting page address

https://search.jd.com/Search?keyword=%E5%86%85%E8%A1%A3%E5%A5%B3&suggest=4.def.0.base&wq=%E5%86%85%E8%A1%A3%E5%A5%B3&page=1&s=56&click=1

(In the browser's address bar the keyword appears in Chinese, but once you copy the URL and paste it into Notepad or into code, it turns into the percent-encoded form above.)
In many websites' URLs, some GET parameters or keywords are encoded, so they look garbled when copied out. The copied URL can still be opened directly, so there is nothing to worry about in this example.
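
As a quick check, a couple of lines of Python (just an illustration, not part of the crawler) convert between the Chinese keyword and the percent-encoded form seen in the copied URL:

from urllib.parse import quote, unquote

print(unquote("%E5%86%85%E8%A1%A3%E5%A5%B3"))  # 内衣女, the search keyword
print(quote("内衣女"))                          # %E5%86%85%E8%A1%A3%E5%A5%B3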

So how can we automatically crawl the pages beyond the first one? Open the third page: its address is shown below, and the only difference from the first page is the page parameter, which ends in &page=1 on the first page and &page=3 on the third.
We can therefore fetch multiple pages automatically with a for loop, adding 1 to page after each iteration (a small sketch follows the third-page URL below).

The third-page URL is as follows:

https://search.jd.com/Search?keyword=%E5%86%85%E8%A1%A3%E5%A5%B3&suggest=4.def.0.base&wq=%E5%86%85%E8%A1%A3%E5%A5%B3&page=3&s=56&click=1
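
A minimal sketch of how the page URLs can be generated in a loop (the full crawler at the end of this article does the same thing):

base = ("https://search.jd.com/Search?keyword=%E5%86%85%E8%A1%A3%E5%A5%B3"
        "&suggest=4.def.0.base&wq=%E5%86%85%E8%A1%A3%E5%A5%B3&page={}&s=56&click=1")
for page in range(1, 4):        # pages 1, 2 and 3
    url = base.format(page)     # only the page parameter changes
    print(url)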

Step 2: Analyze the image links in the page
On each page we need to extract the product images. We can use a regular expression to match the image-link part of the page source, and then use urllib.request.urlretrieve() to save each linked image locally.
There is one problem, though: the images on the page include not only the product-list pictures but also some unrelated pictures nearby, so we have to filter them. We need to find the region of the source code where the product pictures sit.

  1. Step 1: Inspect the element of the first product picture on the first page; the element is shown in the figure.
  2. Step 2: Right-click a blank area and view the page source, then press CTRL+F and search for the last few characters of the image filename from step 1 to locate the part of the source where product picture 1 sits.

After locating a few of them this way, we find that the source code of a product picture looks like this:
Picture 1 source code

<img width="220" height="220" data-img="1" data-lazy-img="//img13.360buyimg.com/n7/jfs/t1/88198/38/15103/241083/5e6ef386E75f87219/0945cd20a8d40904.jpg" />

Picture 2 source code

<img width="220" height="220" data-img="1" data-lazy-img="//img10.360buyimg.com/n7/jfs/t1/62113/37/10114/445422/5d7a2269E8e2e7ed3/4b90428b88320241.jpg" />

So we can define the regular expression:

pat1='<img width="220" height="220" data-img="1" data-lazy-img="//(.+?\.jpg)'
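
A quick test of this pattern against the two sample tags above, plus a hedged sketch of saving one matched link with urllib.request.urlretrieve() (this assumes the markup still matches exactly; if Jingdong changes its HTML the pattern has to be adjusted):

import re
from urllib import request

html = ('<img width="220" height="220" data-img="1" data-lazy-img="//img13.360buyimg.com/n7/jfs/t1/88198/38/15103/241083/5e6ef386E75f87219/0945cd20a8d40904.jpg" />'
        '<img width="220" height="220" data-img="1" data-lazy-img="//img10.360buyimg.com/n7/jfs/t1/62113/37/10114/445422/5d7a2269E8e2e7ed3/4b90428b88320241.jpg" />')
pat1 = r'<img width="220" height="220" data-img="1" data-lazy-img="//(.+?\.jpg)'
links = re.compile(pat1).findall(html)
print(links)  # two relative links, starting with img13... and img10...
# request.urlretrieve("http://" + links[0], filename="test.jpg")  # prepend the scheme, then download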

Knowledge point
Sometimes we cannot crawl certain webpages with the default headers of our own machine and get a 403 error, because those sites have anti-crawler measures to stop people from maliciously harvesting their data.
We can set some Headers information and masquerade as a browser when visiting these sites, which solves the problem.
To get a User-Agent string, open the browser's developer tools, trigger an action on a page (for example, perform a search on Baidu), and a lot of requests appear in the lower window, as shown in the figure.

Now click www.baidu.com in the list, and the request details appear as shown in the figure.

Under Headers, scroll down to find the user-agent string; this is the information used below to simulate a browser. Copy it out.
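
Below is a minimal sketch of how that User-Agent string is attached with urllib, which is the same technique the full code below uses; the header value is simply the string copied from the browser:

import urllib.request

headers = ("User-Agent",
           "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
           "(KHTML, like Gecko) Chrome/69.0.3947.100 Safari/537.36")
opener = urllib.request.build_opener()
opener.addheaders = [headers]
urllib.request.install_opener(opener)  # every later urlopen()/urlretrieve() call now sends this header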

Code

Language: python

import re
from urllib import request

# Read one search-result page and download its product images

def craw(url, page):
    # Masquerade as a browser by sending a User-Agent header
    headers = ("User-Agent",
               "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3947.100 Safari/537.36")
    opener = request.build_opener()
    opener.addheaders = [headers]
    # Install the opener globally so later urlopen()/urlretrieve() calls use it
    request.install_opener(opener)

    url_request = request.Request(url)
    html = request.urlopen(url_request, timeout=10).read().decode('utf-8')  # read() returns bytes, so decode to str
    # print(html)

    # Locate the product images
    pat1 = r'<img width="220" height="220" data-img="1" data-lazy-img="//(.+?\.jpg)'
    imagelist = re.compile(pat1).findall(html)
    # print(imagelist)
    x = 1
    for each in imagelist:
        print(each)
        try:
            imagename = 'D:\\deeplearn\\xuexicaogao\\图片\\' + str(page) + str(x) + '.jpg'
            imageurl = "http://" + each  # complete the image URL (the match starts after "//")
            request.urlretrieve(imageurl, filename=imagename)  # save into a folder created in advance
            print('Download finished.')
        except Exception as e:
            print(e)
        x += 1


for i in range(1, 30):  # loop over pages 1-29
    url = "https://search.jd.com/Search?keyword=%E5%86%85%E8%A1%A3%E5%A5%B3&suggest=4.def.0.base&wq=%E5%86%85%E8%A1%A3%E5%A5%B3&page=" + str(i) + "&s=56&click=1"
    craw(url, i)
print('Done')

Results
The output folder contains more than 800 photos.

Author: Electrical - Yu Dengwu

Origin: blog.csdn.net/kobeyu652453/article/details/112691666