Now we want to use a crawler to download product pictures from Taobao. How do we crawl them, and what preparation is needed?
First of all, we need to analyze the web page and see what patterns it follows.
Open the Taobao website: http://www.taobao.com/
On the left side is the theme market. Move the mouse over the [Women's Clothing/Men's Clothing/Underwear] column to see a more detailed list of categories.
Suppose we want to crawl [down jacket]: we then open the [down jacket] listing page.
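Look at the address in the browser's address bar at this point.
Note that when the URL is copied into Word or elsewhere, it appears percent-encoded: any non-ASCII characters and spaces in the keyword show up as %XX escape sequences.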
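The encoding mentioned above can be reproduced with Python's standard library. This is a minimal sketch; the Chinese keyword is only an assumed example of non-ASCII text and does not appear in the article itself.

```python
from urllib.parse import quote, unquote

# Percent-encode the keyword used later in this article.
encoded = quote("down jacket")
print(encoded)           # down%20jacket

# unquote() reverses the encoding.
print(unquote(encoded))  # down jacket

# Assumed example: a Chinese keyword is encoded byte by byte as UTF-8.
encoded_cn = quote("羽绒服")
print(encoded_cn)        # %E7%BE%BD%E7%BB%92%E6%9C%8D
```

The encoded form is what must be placed in the q= parameter when the URL is built in code.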
We can take the URLs of pages 1, 2, and 3 of the down jacket listing and compare them. The comparison results are as follows:
From the above figure, we can see that the s value increases by 60 from one page to the next, so page i (counting from 0) corresponds to s = i*60.
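The paging rule can be sketched as a small helper. The names PAGE_SIZE, page_offset, and page_url are hypothetical, and the parameter set is a reduced one assumed for illustration (the tracking parameters seen in the browser are left out); whether the site accepts such a reduced URL is not confirmed here.

```python
from urllib.parse import urlencode

PAGE_SIZE = 60  # step between the s values of consecutive pages


def page_offset(page_index):
    """Return the s value for a zero-based page index."""
    return page_index * PAGE_SIZE


def page_url(keyword, page_index):
    """Build a listing URL for the given keyword and page (assumed parameter set)."""
    params = {
        "q": keyword,
        "cat": 16,
        "style": "grid",
        "seller_type": "taobao",
        "s": page_offset(page_index),
    }
    # urlencode() also percent-encodes the keyword (spaces become "+").
    return "https://s.taobao.com/list?" + urlencode(params)


for i in range(3):
    print(page_url("down jacket", i))
```

Only the s value changes between the three printed URLs, which is the same pattern seen in the comparison.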
Next, we look at the picture addresses:
The marked part of each address is likely what distinguishes the two pictures, so open the page source and search for it.
Image 1 search result
Image 2 search result
The two results share a common feature: in the page source, each image address is preceded by "pic_url":"// . That completes the URL analysis, so next we write the code.
The code is as follows:
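To see how that common prefix turns into a matching rule, here is a minimal sketch run against a made-up fragment of page source. The sample string and the host img.example.com are assumptions for illustration, not real Taobao data.

```python
import re

# Made-up fragment imitating the structure found in the page source.
sample = (
    '{"nid":"1","pic_url":"//img.example.com/a.jpg","title":"x"},'
    '{"nid":"2","pic_url":"//img.example.com/b.jpg","title":"y"}'
)

# Non-greedy group: capture everything after "pic_url":"// up to the next quote.
PIC_RULE = re.compile(r'"pic_url":"//(.*?)"')


def extract_image_urls(page_source):
    """Return full image URLs found in the given page source."""
    return ["http://" + match for match in PIC_RULE.findall(page_source)]


print(extract_image_urls(sample))
```

The non-greedy (.*?) matters here: a greedy (.*) would run on to the last quotation mark on the line and merge several addresses into one match.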
import re
import urllib.parse
import urllib.request

# Set the search keyword
keywords = "down jacket"
# quote() percent-encodes the keyword so special characters are safe in a URL
key = urllib.parse.quote(keywords)

# Set the User-Agent header so the request looks like it comes from a browser
headers = ("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0")

# Build a custom opener that sends this header and install it globally
opener = urllib.request.build_opener()
opener.addheaders = [headers]
urllib.request.install_opener(opener)

# Loop through the result pages and grab the images
for i in range(0, 2):
    url = "https://s.taobao.com/list?spm=a21bo.2017.201867-links-0.3.5af911d9rLmo4K&q=" + key + "&cat=16&style=grid&seller_type=taobao&bcoffset=12&s=" + str(i * 60)
    # print(url)
    content = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
    # Regular expression that captures each image address
    rule = '"pic_url":"//(.*?)"'
    # Get the picture list for this page
    imglist = re.compile(rule).findall(content)
    for j in range(0, len(imglist)):
        img = imglist[j]
        imgurl = "http://" + img
        # The D:/source/img directory must already exist
        file = "D:/source/img/" + str(i) + str(j) + ".jpg"
        urllib.request.urlretrieve(imgurl, filename=file)
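The download loop above stops at the first request that fails and assumes the target directory exists. One possible hardening is sketched below; save_images and its retrieve parameter are hypothetical names introduced here, not part of the original script.

```python
import os
import urllib.request


def save_images(img_urls, out_dir, page, retrieve=urllib.request.urlretrieve):
    """Download each URL into out_dir, skipping the ones that fail.

    `retrieve` defaults to urllib.request.urlretrieve and can be swapped
    for a stand-in when experimenting without network access.
    """
    os.makedirs(out_dir, exist_ok=True)  # create the folder if it is missing
    saved = []
    for j, img_url in enumerate(img_urls):
        target = os.path.join(out_dir, f"{page}_{j}.jpg")
        try:
            retrieve(img_url, filename=target)
        except OSError as exc:  # urllib.error.URLError is a subclass of OSError
            print(f"skipped {img_url}: {exc}")
            continue
        saved.append(target)
    return saved
```

Inside the page loop it could be called as save_images(["http://" + img for img in imglist], "D:/source/img", i). The underscore in the file name keeps the page number and the image number visibly separate.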
After the crawl finishes, we can open D:\source\img to view the downloaded images.
The crawl succeeded, and the downloaded images are the same as those on the page.