Teach a complete beginner how to crawl goddess photos with Python in five minutes!

Today we are going to crawl Huaban (花瓣网, "Petal Net") at https://huaban.com/,
a paradise where designers look for inspiration. It offers a large number of image materials to download and is a high-quality image inspiration library.

This time we use requests to log in to Huaban and crawl the result pages, then use regular expressions and json to extract the useful information, and finally save the pictures locally.

1. Technology used

  • Python basics
  • requests: log in to get a user session, and download pictures
  • Regular expressions: extract useful information from the page
  • json: parse the picture data in the page

2. Target page

https://huaban.com/search/?q=女神&category=photography

3. Results

4. Install the necessary libraries

  • Press Win+R to open the Run dialog
  • Type cmd to open the console
  • Install requests
pip install  requests 

5. Analyze the page

 

  1. Page pattern
    Clicking the paging buttons shows that only the last parameter of the URL, page, changes.
    First page: https://huaban.com/search/?q=女神&category=photography&page=1
    Second page: https://huaban.com/search/?q=女神&category=photography&page=2
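
The page pattern above can be sketched in a few lines. This is only an illustration and not part of the original script: the helper name build_search_url is made up here, and it assumes the search URL keeps the q, category and page parameters shown above.

```python
from urllib.parse import urlencode

BASE_URL = "https://huaban.com/search/"  # search address taken from the target page


def build_search_url(keyword, page, category="photography"):
    """Return the search URL for one result page (hypothetical helper)."""
    # urlencode percent-encodes non-ASCII text such as Chinese keywords
    query = urlencode({"q": keyword, "category": category, "page": page})
    return BASE_URL + "?" + query


if __name__ == "__main__":
    # Print the URLs of the first two result pages
    for page in range(1, 3):
        print(build_search_url("女神", page))
```

Building the query with urlencode means the keyword can be written in Chinese in the code instead of being pasted in already percent-encoded.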

2. Log in

Using Fiddler, we inspect the address and parameters of the login request:

# Address
https://huaban.com/auth/
# Parameters
 "email": "******",
 "password": "*****",
 "_ref":"frame"

I decided to use a requests session (requests.session()) so that the session information is kept after the user logs in.
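
To make the idea concrete, here is a rough sketch of logging in and then reusing the same session. It assumes the login address and the three form fields captured with Fiddler above; the function names login and fetch are just for illustration, and the session object is passed in so the functions work with anything that behaves like a requests session.

```python
LOGIN_URL = "https://huaban.com/auth/"  # login address seen in Fiddler


def login(session, email, password):
    """Post the login form through `session` and return the response.

    `session` is expected to behave like requests.session(): cookies set by
    the login response stay on it and are sent with later requests.
    """
    login_data = {
        "email": email,        # plain address; form encoding turns "@" into "%40"
        "password": password,
        "_ref": "frame",
    }
    return session.post(LOGIN_URL, data=login_data)


def fetch(session, url):
    """Fetch a page with the same session, so the login cookies are reused."""
    return session.get(url)
```

In real use you would create the session with requests.session(), call login(session, "you@example.com", "your-password") once, and then call fetch for every page. Whether the login actually succeeded still has to be checked on the returned response.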

3. Page information

Right-click the page and view its source: the picture data is not in the HTML tags but stored in JavaScript, in an assignment to app.page["pins"]. We will use a regular expression to pull that data out of the page.
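
The extraction step can be tried offline on a small sample first. The sketch below assumes the page source contains an assignment to app.page["pins"] followed on the next line by app.page["page"], which is what the full script searches for; the SAMPLE_HTML text and the helper name extract_pins are made up for the example.

```python
import json
import re

# Made-up sample of the page source; the real page is much longer.
SAMPLE_HTML = (
    'app.page["pins"] = [{"file": {"key": "abc123", "type": "image/jpeg"}, '
    '"raw_text": "demo"}];\n'
    'app.page["page"] = 1;'
)

# re.S lets "." match newlines, in case the JSON spans several lines
PINS_PATTERN = re.compile(r'app\.page\["pins"\] =(.*);\napp\.page\["page"\]', re.S)


def extract_pins(html):
    """Return the list of picture records embedded in the page, or [] if absent."""
    match = PINS_PATTERN.search(html)
    if match is None:
        return []
    # The captured text is a JSON array, so json.loads turns it into a list
    return json.loads(match.group(1))


if __name__ == "__main__":
    for pin in extract_pins(SAMPLE_HTML):
        print(pin["file"]["key"], pin["file"]["type"], pin["raw_text"])
```

Returning an empty list when nothing matches keeps the crawler from crashing on a page whose layout has changed or when the login did not go through.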

6. Full code

#-*- coding:utf-8 -*- 
import requests 
import re 
import json 
# Import requests, re (regular expressions) and json 

''' 
login 
Log in to Huaban and get a session 
''' 
def login(): 
    login_url = 'https://huaban.com/auth/' 
    # Login address 
    headers = { 
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0", 
        "Accept": "application/json", 
        "Content-type": "application/x-www-form-urlencoded; charset=utf-8", 
        "Referer": "https://huaban.com/", 
    } 
    # Request headers 

    session = requests.session() 
    # Session object: keeps the login cookies for later requests 

    login_data = { 
        # Write the address with a plain "@": requests form-encodes the data itself 
        "email": "zengmumu@126.com", 
        "password": "zmm123", 
        "_ref":"frame" 
    } 

    response = session.post(login_url, data=login_data, headers=headers,verify=False) 
    # Send the login request 
    getPic(session,5) 
    # Get the pictures from the first 5 pages 


''' 
getPic 
Parse the picture addresses in the pages 
session: the session information 
num: the maximum number of pages 
''' 
def getPic(session,num): 
    for i in range(1, num+1): 
        response = session.get("https://huaban.com/search/?q=%E5%A5%B3%E7%A5%9E&category=photography&page="+str(i)) 
        # Get the page (the URL-encoded form of "女神" is "%E5%A5%B3%E7%A5%9E")
        data = re.search(r'app\.page\["pins"\] =(.*);\napp\.page\["page"\]', response.text, re.M | re.I | re.S) 
        # Extract the information for all pictures on the current page 
        data = json.loads(data.group(1)) 
        # Convert the JSON string to a list 
        for item in data: 
            url = "https://hbimg.huabanimg.com/" + item["file"]["key"] 
            # Build the picture address 
            index = item["file"]["type"].rfind("/") 
            type = "." + item["file"]["type"][index + 1:] 
            # Get the picture's extension: the part of the type after the last "/" 
            file_name = item["raw_text"] 
            # Get the Chinese name of the picture 
            download_img(url, file_name, type) 
            # Download the picture 

''' 
Download a picture 
url: the address of the picture 
name: the Chinese name of the picture 
type: the picture type (file extension) 
''' 
def download_img(url,name,type): 
    response = requests.get(url,verify=False) 
    # Use requests to download the picture 
    index = url.rfind('/') 
    file_name = name+url[index + 1:]+type
    # File name: Chinese name + hash value of the picture (last part of the URL) + extension 
    print("Download picture:" + file_name) 
    # Print the picture name 
    save_name = "./photo/" + file_name 
    # Path where the picture is saved (note: you need to create a photo folder in the same directory as the current .py file) 
    with open(save_name, "wb") as f: 
        f.write(response.content) 
        # Write the picture to the local disk 

''' 
Define the main function 
''' 
def main(): 
    login() 

# If the module name is __main__, run the main function 
if __name__ == '__main__': 
    main()
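
One practical note on saving: the script expects the photo folder to exist already, and it uses the picture title directly as part of the file name. As an optional extra that is not in the original code, the sketch below creates the folder when it is missing and replaces characters that Windows does not allow in file names. The helper name safe_save_path and its parameters are made up for this example.

```python
import os
import re
import tempfile


def safe_save_path(folder, name, hash_part, ext):
    """Build a save path, creating `folder` and cleaning up the file name."""
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    # Replace characters that are not allowed in Windows file names,
    # plus whitespace runs such as newlines inside a title
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', "_", name).strip("_")
    return os.path.join(folder, cleaned + hash_part + ext)


if __name__ == "__main__":
    # Try it in a temporary directory so nothing is left behind
    with tempfile.TemporaryDirectory() as tmp:
        print(safe_save_path(os.path.join(tmp, "photo"), "a/b: c?", "abc123", ".jpeg"))
```

The returned path could be used in place of save_name in download_img, passing the title, the last part of the picture URL and the extension.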

Simple, isn't it? All the explanations are in the code comments.

 


Origin blog.csdn.net/weixin_43881394/article/details/109072117