[Python3 crawler] 14_ Crawling mobile phone pictures on Taobao

Now we want to use a crawler to download mobile phone pictures from Taobao. How do we go about it, and what preparation is needed?

First, we need to analyze the web page and find the patterns in its URLs and source.

Open the Taobao website: http://www.taobao.com/

image

On the left is the theme market. Moving the mouse over the [Women's Clothing/Men's Clothing/Underwear] column reveals a more detailed list of categories.

image

Suppose we want to crawl [down jacket]. Click through to the [down jacket] listing page.

image

Looking at the browser address bar at this point, we can see:

image

When the URL is copied into Word or another application, the keyword in it appears percent-encoded (URL transcoding).
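The transcoding can be reproduced with the standard library's urllib.parse. The sketch below is only illustrative: the Chinese keyword 羽绒服 (down jacket) is an assumption about what the original address contained, since the address itself is only shown as a picture.

```python
from urllib.parse import quote, unquote

# Assumed keyword: the Chinese term for "down jacket"
keyword = "羽绒服"

encoded = quote(keyword)    # each UTF-8 byte is written as %XX
decoded = unquote(encoded)  # reverses the encoding

print(encoded)
print(decoded)

# An ASCII keyword only needs its special characters escaped
print(quote("down jacket"))  # down%20jacket
```

Decoding with unquote is a quick way to read a copied address back in its original form.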

Open pages 1, 2, and 3 of the [down jacket] listing and compare their URLs. The comparison results are as follows:

image

From the figure above, we can see that the s value differs by 60 between consecutive pages.
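The paging rule can be sketched as a small helper. This is a simplified illustration, not the exact address Taobao expects: page_url and PAGE_SIZE are hypothetical names, only the q and s parameters are kept, and the first page is assumed to use s=0.

```python
from urllib.parse import urlencode

PAGE_SIZE = 60  # the s parameter grows by 60 per page


def page_url(keyword, page_index):
    """Return a listing URL for a zero-based page index (assumed s=0 for page 1)."""
    params = {"q": keyword, "s": page_index * PAGE_SIZE}
    # urlencode escapes the keyword and joins the parameters with "&"
    return "https://s.taobao.com/list?" + urlencode(params)


urls = [page_url("down jacket", i) for i in range(3)]
for u in urls:
    print(u)
```

The real listing URL carries further parameters (spm, cat, style, and so on), which may or may not be required by the site.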

Then we look at the picture address again:

image

The marked part of each address is probably what distinguishes the two pictures, so open the page source and search for it.

Image 1 search result

image

Image 2 search result

image

The two URLs share a common feature: in the page source, each one follows "pic_url":"// . That completes the URL analysis, so next we write the code.
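Before running anything against the live site, the matching rule can be tried on a small string. The fragment below is hypothetical (the host img.example.com and the title fields are made up for illustration); only the "pic_url":"// prefix comes from the analysis above.

```python
import re

# Hypothetical fragment of page source; real pages contain many more fields
sample = (
    '{"title":"item A","pic_url":"//img.example.com/a.jpg"},'
    '{"title":"item B","pic_url":"//img.example.com/b.jpg"}'
)

# The non-greedy group stops at the first closing quote after the prefix
pattern = re.compile(r'"pic_url":"//(.*?)"')
matches = pattern.findall(sample)

# The captured text has no scheme, so one is prepended before downloading
image_urls = ["http://" + m for m in matches]
print(image_urls)
```

Note that the pattern must not contain stray spaces around the quotes, otherwise nothing in the page source will match.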

The code is as follows:

import re
import urllib.parse
import urllib.request

# Set the search keyword
keywords = "down jacket"
# quote() percent-encodes the keyword so it is safe to place in a URL
key = urllib.parse.quote(keywords)
# Set the User-Agent header so the request looks like it comes from a browser
headers = ("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0")
# Build a custom opener and install it globally so that urlopen() and urlretrieve() send the header
opener = urllib.request.build_opener()
opener.addheaders = [headers]
urllib.request.install_opener(opener)
# Capture everything after "pic_url":"// up to the closing quote
pattern = re.compile(r'"pic_url":"//(.*?)"')
# Loop over the first two result pages; s advances by 60 per page
for i in range(0, 2):
    url = "https://s.taobao.com/list?spm=a21bo.2017.201867-links-0.3.5af911d9rLmo4K&q=" + key + "&cat=16&style=grid&seller_type=taobao&bcoffset=12&s=" + str(i * 60)
    # print(url)
    content = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
    imglist = pattern.findall(content)  # picture addresses found on this page
    for j, img in enumerate(imglist):
        imgurl = "http://" + img
        # The D:/source/img directory must already exist
        file = "D:/source/img/" + str(i) + "_" + str(j) + ".jpg"
        urllib.request.urlretrieve(imgurl, filename=file)
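One practical detail: urlretrieve does not create the target directory, so the download fails if D:\source\img is missing. A minimal sketch of one way to handle this follows; image_path is a hypothetical helper, and a temporary directory stands in for the real output folder.

```python
import os
import tempfile


def image_path(out_dir, page, index):
    """Create out_dir if needed and return the target path for one image."""
    os.makedirs(out_dir, exist_ok=True)
    # An underscore between page and index keeps names unambiguous
    return os.path.join(out_dir, "{}_{}.jpg".format(page, index))


# A temporary directory stands in for D:/source/img here
base = os.path.join(tempfile.mkdtemp(), "img")
path = image_path(base, 0, 11)
print(path)
```

The returned path could then be passed as the filename argument of urllib.request.urlretrieve.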

After the crawl finishes, open D:\source\img to view the downloaded pictures.

image

The crawl succeeded: the downloaded images match those shown on the page.
