Today we are going to crawl Huaban ("Petal Net", https://huaban.com/), a paradise for designers looking for inspiration. It is a high-quality image inspiration library with a large number of image materials available for download.
This time we use requests to log in to Huaban and crawl the search pages, then use regular expressions and json to extract the useful information, and finally save the images to the local disk.
1. Technology used
- Python basics
- requests: log in to get a user session, and download the pictures
- Regular expressions: extract the useful information from the page
- json: parse the picture data embedded in the page
2. Target page
https://huaban.com/search/?q=女神&category=photography
3. Results
4. Install the necessary libraries
- Press Win+R to open the Run dialog
- Type cmd to open the console
- Install requests:
pip install requests
5. Analyze the page
1. Page pattern
Clicking the paging buttons shows that only the last parameter of the URL, page, changes from one page to the next.
The first page: https://huaban.com/search/?q=女神&category=photography&page=1
The second page: https://huaban.com/search/?q=女神&category=photography&page=2
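The paging pattern above can be sketched as a small helper. This is only an illustration: build_page_url and BASE_URL are names made up for this example, and the standard library's urlencode is used here to show how the search keyword ends up percent-encoded in the final address.

```python
from urllib.parse import urlencode

# Search address taken from the target page above
BASE_URL = "https://huaban.com/search/"


def build_page_url(query, category, page):
    """Return the search URL for one result page (page numbers start at 1)."""
    params = {"q": query, "category": category, "page": page}
    # urlencode percent-encodes non-ASCII text as UTF-8,
    # so "女神" becomes "%E5%A5%B3%E7%A5%9E"
    return BASE_URL + "?" + urlencode(params)


# Build the addresses of the first two pages
page_urls = [build_page_url("女神", "photography", n) for n in range(1, 3)]
for u in page_urls:
    print(u)
```

Only the page value differs between the two printed addresses, which is the part the crawler loops over.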
2. Log in
Using Fiddler, we inspect the address and parameters of the login request:
# Address
https://huaban.com/auth/
# Parameters
"email": "******", "password": "*****", "_ref": "frame"
We use the session() function of requests to keep the session information after the user logs in.
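One detail is worth checking before sending these parameters. When a dictionary is passed as data= to session.post(), requests form-encodes it much like the standard library's urlencode does, so the email should be written with a plain "@". The sketch below uses placeholder credentials (user@example.com and secret are made up for this example) and only prints the encoded body; it does not contact the site.

```python
from urllib.parse import urlencode

# Placeholder credentials (hypothetical); substitute your own account
login_data = {
    "email": "user@example.com",
    "password": "secret",
    "_ref": "frame",
}

# Form-encode the parameters; "@" is escaped once for us
body = urlencode(login_data)
print(body)  # email=user%40example.com&password=secret&_ref=frame

# Pre-encoding the address by hand escapes the "%" a second time
double_encoded = urlencode({"email": "user%40example.com"})
print(double_encoded)  # email=user%2540example.com
```

The second result is probably not what the server expects, so it is safer to leave the encoding to the library.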
3. Page information
Right-clicking and viewing the page source shows that the data is stored in JavaScript, so we use a regular expression to extract the page information.
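To make the extraction step concrete, here is a minimal sketch that runs on a tiny hand-written sample instead of the real page. The sample_html string is hypothetical and heavily shortened, and extract_pins and PINS_PATTERN are names chosen for this example; the field names (raw_text, file, key, type) are the ones the full code reads.

```python
import json
import re

# Hypothetical, heavily shortened page source; the real page holds many more fields
sample_html = (
    'app.page["pins"] = [{"raw_text": "demo", '
    '"file": {"key": "abc123", "type": "image/jpeg"}}];\n'
    'app.page["page"] = 1;'
)

# Capture everything between the "pins" assignment and the "page" assignment
PINS_PATTERN = re.compile(r'app\.page\["pins"\] =(.*);\napp\.page\["page"\]', re.S)


def extract_pins(html):
    """Return the list of pin dictionaries, or an empty list if nothing matches."""
    match = PINS_PATTERN.search(html)
    if match is None:
        return []
    return json.loads(match.group(1))


pins = extract_pins(sample_html)
for pin in pins:
    # "image/jpeg" -> ".jpeg"
    ext = "." + pin["file"]["type"].rsplit("/", 1)[-1]
    print(pin["raw_text"], pin["file"]["key"], ext)
```

Returning an empty list when nothing matches keeps the crawler from crashing if the page layout changes.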
6. Full code
#-*- coding:utf-8 -*-
import json
import re

import requests
# Import requests, re (regular expressions) and json


def login():
    '''Log in to Huaban and get a session.'''
    login_url = 'https://huaban.com/auth/'  # Login address
    # Request header information
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0",
        "Accept": "application/json",
        "Content-type": "application/x-www-form-urlencoded; charset=utf-8",
        "Referer": "https://huaban.com/",
    }
    session = requests.session()  # Session that keeps the login cookies
    # Fill in your own account; write the email with a plain "@", requests encodes it
    login_data = {
        "email": "******",
        "password": "*****",
        "_ref": "frame",
    }
    response = session.post(login_url, data=login_data, headers=headers, verify=False)  # Log in
    getPic(session, 5)  # Get the pictures from the first 5 pages


def getPic(session, num):
    '''
    Parse the picture addresses in the search result pages.
    session: the logged-in session
    num: the number of pages to crawl
    '''
    for i in range(1, num + 1):
        # Get the page ("女神" URL-encoded is "%E5%A5%B3%E7%A5%9E")
        response = session.get("https://huaban.com/search/?q=%E5%A5%B3%E7%A5%9E&category=photography&page=" + str(i))
        # Extract all the picture information on the current page
        data = re.search(r'app\.page\["pins"\] =(.*);\napp\.page\["page"\]', response.text, re.M | re.I | re.S)
        if data is None:  # Nothing found on this page, skip it
            continue
        data = json.loads(data.group(1))  # Convert the string to a list
        for item in data:
            url = "https://hbimg.huabanimg.com/" + item["file"]["key"]  # Build the picture address
            index = item["file"]["type"].rfind("/")
            type = "." + item["file"]["type"][index + 1:]  # Get the type of the picture
            file_name = item["raw_text"]  # Get the Chinese name of the picture
            download_img(url, file_name, type)  # Download the picture


def download_img(url, name, type):
    '''
    Download a picture.
    url: the address of the picture
    name: the Chinese name of the picture
    type: the picture type (file extension)
    '''
    response = requests.get(url, verify=False)  # Use requests to download the picture
    index = url.rfind('/')
    file_name = name + url[index + 1:] + type  # Name + hash value of the picture + extension
    print("Download picture:" + file_name)  # Print the picture name
    # Where the picture is saved (create a "photo" folder next to this .py file first)
    save_name = "./photo/" + file_name
    with open(save_name, "wb") as f:
        f.write(response.content)  # Write the picture to the local disk


def main():
    '''Define the main function.'''
    login()


# Run main() only when this module is executed as __main__
if __name__ == '__main__':
    main()
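One thing the code above does not handle: the picture description is used directly as part of the file name, and a description may contain characters that are not allowed in file names (slashes, colons, line breaks). Below is a minimal sketch of one way to clean it up first. safe_file_name and its max_len parameter are hypothetical additions for illustration, not part of the original script.

```python
import re


def safe_file_name(raw_text, key, ext, max_len=50):
    """Build a file name from a pin description, its key, and its extension."""
    # Replace characters that Windows forbids in file names, plus whitespace runs
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', "_", raw_text).strip("_")
    # Keep the description short; the key keeps the names unique
    return cleaned[:max_len] + "_" + key + ext


print(safe_file_name('sunset / beach: "golden hour"\n', "abc123", ".jpeg"))
# sunset_beach_golden_hour_abc123.jpeg
```

In download_img, a helper like this could replace the plain string concatenation that builds file_name.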
Pretty simple, isn't it? All the explanations are in the code comments.