Python crawler for beginners, part II: scraping Sogou Images (dynamically loaded)

 

After successfully scraping Douban movie reviews, I was feeling pretty good about crawlers, so I wanted to try scraping some images for fun ...

Sogou Images address: https://pic.sogou.com/?from=category

 

First, the final working source code:

import requests
import urllib.request
import json
from fake_useragent import UserAgent

def getSougouImag(category,length,path):
    n = length
    cate = category
    imgs_url = []    # empty list that collects the image URLs
    m = 0    # counter used to number the downloaded images
    # tag=%E5%85%A8%E9%83%A8 is the URL-encoded form of '全部' ("all")
    url = 'https://pic.sogou.com/pics/channel/getAllRecomPicByTag.jsp?category=' + cate + '&tag=%E5%85%A8%E9%83%A8&start=0&len=' + str(n)
    headers = {'User-Agent': UserAgent().random}    # set a random User-Agent
    f = requests.get(url, headers=headers)    # send the GET request
    print(f.status_code)
    js = json.loads(f.text)
    js = js['all_items']
    for j in js:
        imgs_url.append(j['thumbUrl'])
    for img_url in imgs_url:
        print('***** '+str(m)+'.jpg *****'+' Downloading...')
        urllib.request.urlretrieve(img_url, path + str(m) + '.jpg')    # download the URL to a local file
        m += 1
    print('Download complete!')


getSougouImag('壁纸', 500, r'D:\souGouImg/')    # '壁纸' means "wallpaper"
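
The request URL in getSougouImag is assembled by string concatenation, with the tag already percent-encoded by hand. As a minimal sketch of an alternative, the query string can be built with the standard library's urlencode, which percent-encodes non-ASCII values as UTF-8. The helper name build_sogou_url is hypothetical, and that the endpoint accepts UTF-8 encoded parameters is an assumption based on the tag value seen in the original URL.

```python
from urllib.parse import urlencode, unquote

# Endpoint taken from the article's code
BASE_URL = 'https://pic.sogou.com/pics/channel/getAllRecomPicByTag.jsp'

def build_sogou_url(category, tag, start, length):
    """Return the request URL with percent-encoded query parameters.

    Hypothetical helper for illustration; parameter names follow the
    query string used in the article (category, tag, start, len).
    """
    params = {'category': category, 'tag': tag, 'start': start, 'len': length}
    return BASE_URL + '?' + urlencode(params)

url = build_sogou_url('壁纸', '全部', 0, 15)
print(url)

# Decoding the tag from the article's URL shows what it stands for
print(unquote('%E5%85%A8%E9%83%A8'))
```

Building the URL this way keeps the readable values ('壁纸', '全部') in the code and leaves the encoding to the library.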

Result:

 

Below are the steps I went through as a crawler beginner ...

1. First, open the page and inspect the HTML source

Press F12 to open the developer tools -> right-click an image -> click Inspect

The information shown in the red box appears. It is easy to see that the image URL is the value of the img tag's src attribute.

 

So easy? Just grab the values of the src attributes, download them, and that's it, right?

Enough talk, let's get to work.

from bs4 import BeautifulSoup
import requests
from fake_useragent import UserAgent    # library of User-Agent strings

url = 'https://pic.sogou.com/pics/recommend?category=%B1%DA%D6%BD&from=home#%E5%85%A8%E9%83%A8269'
headers = {'User-Agent': UserAgent().random}    # set a random User-Agent
f = requests.get(url, headers=headers)    # send the GET request
print(f.status_code)    # print the status code
soup = BeautifulSoup(f.text, 'lxml')    # parse the page content with the lxml parser
print(soup.select('img'))    # select all img tags and print them with their attributes

The output is as follows:

The printed HTML is not the same as what the browser shows, and none of the img tags in it carry the real image URLs. I guessed that the images were loaded dynamically and kept searching on Baidu ... until I found an article by an expert, which led me to the following approach.
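
A quick way to confirm this kind of suspicion is to count the img tags in the static HTML and see how many actually carry a src value. The sketch below uses only the standard library and a made-up HTML snippet (sample_html is an assumption standing in for a real response, not Sogou's actual markup); the class name ImgSrcCollector is hypothetical.

```python
from html.parser import HTMLParser

class ImgSrcCollector(HTMLParser):
    """Count img tags and collect the src values that are present."""

    def __init__(self):
        super().__init__()
        self.img_count = 0
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            self.img_count += 1
            src = dict(attrs).get('src')
            if src:
                self.srcs.append(src)

# Made-up static HTML: an empty container that a script would fill later,
# a logo, and a placeholder img without a src attribute
sample_html = '''
<html><body>
  <div id="container"></div>
  <img class="logo" src="/static/logo.png">
  <img class="placeholder">
</body></html>
'''

collector = ImgSrcCollector()
collector.feed(sample_html)
print(collector.img_count, collector.srcs)
```

If the list of src values is much shorter than the number of pictures visible in the browser, the pictures are most likely being added by JavaScript after the page loads.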

 

2. Click Network -> click XHR -> scroll down with the mouse wheel so that new images load -> click the newly loaded request -> click Preview on the right

The content under Preview is in JSON format

Expand all_items and you will see entries numbered 0, 1, ... Expanding one of them reveals many fields that contain URLs. Paste one into the browser and it turns out to be an image URL (hooray).
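
The structure seen in Preview can be walked in code the same way. The sketch below parses a made-up response (sample_response and its example.com URLs are assumptions; only the all_items and thumbUrl keys come from the code at the top of this post) and skips entries that lack a thumbnail. The function name extract_thumb_urls is hypothetical.

```python
import json

# Made-up response that mimics the shape seen under Preview
sample_response = '''
{
  "all_items": [
    {"thumbUrl": "https://example.com/img/0.jpg"},
    {"thumbUrl": "https://example.com/img/1.jpg"},
    {"title": "entry without a thumbnail"}
  ]
}
'''

def extract_thumb_urls(text):
    """Return the thumbUrl of every entry in all_items that has one."""
    data = json.loads(text)
    return [item['thumbUrl']
            for item in data.get('all_items', [])
            if 'thumbUrl' in item]

thumb_urls = extract_thumb_urls(sample_response)
print(thumb_urls)
```

Using data.get('all_items', []) keeps the function from raising a KeyError if the response has a different shape than expected.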

Once the real image URLs are found, the problem becomes much easier. For more details, see the code at the top of this post or leave a comment ~


Origin www.cnblogs.com/v-fan/p/12503094.html