Preface
Related environment configuration
- python 3.6
- pycharm
- requests
- parsel
Both third-party modules can be installed with pip (pip install requests parsel).
Determine the target page
I want to crawl these material images, so I will work backwards: first I need to find the download address of each material.
Clicking a material opens its detail page, where the local download address is shown. Copying a few of these download links gives:
http://www.aiimg.com/sucai.php?open=1&aid=126632&uhash=70a6d2ffc358f79d9cf71392
http://www.aiimg.com/sucai.php?open=1&aid=126630&uhash=99b07c347dc24533ccc1c144
http://www.aiimg.com/sucai.php?open=1&aid=126634&uhash=d7e8f7f02f57568e280190b4
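Before hunting for uhash in the page data, the copied links can be compared programmatically. This is just a quick standard-library check (not part of the crawler itself) that makes the varying parameters obvious:

```python
# Split each copied download link and compare its query parameters.
from urllib.parse import urlparse, parse_qs

links = [
    'http://www.aiimg.com/sucai.php?open=1&aid=126632&uhash=70a6d2ffc358f79d9cf71392',
    'http://www.aiimg.com/sucai.php?open=1&aid=126630&uhash=99b07c347dc24533ccc1c144',
    'http://www.aiimg.com/sucai.php?open=1&aid=126634&uhash=d7e8f7f02f57568e280190b4',
]
for link in links:
    params = parse_qs(urlparse(link).query)
    # 'open' is always 1; 'aid' and 'uhash' differ for every material
    print(params['aid'][0], params['uhash'][0])
```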
The aid of each link is different; this should be the ID of each material. But what is the uhash parameter that follows it?
My first thought was that the page might load some interface (API) data containing this parameter, but searching in the developer tools turned up nothing. The next step is to check whether the download link itself appears in the page's HTML source.
If the link is there, we can simply extract it and download the file directly.
The routine procedure:
1. Open the developer tools and check whether the webpage response contains the data you want.
It turns out the data we need is in the webpage's HTML itself, so requesting the page returns it directly:
import requests

url = 'http://www.aiimg.com/list.php?tid=1&ext=0&free=2&TotalResult=5853&PageNo=1'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}
response = requests.get(url=url, headers=headers)
print(response.text)
Parse the web page data
import parsel

# Parse the listing page: each li contains a link to a material detail page
selector = parsel.Selector(response.text)
lis = selector.css('.imglist_d ul li a::attr(href)').getall()
for li in lis:
    # The detail-page URL ends in '<id>.html'; extract the numeric id
    num_id = li.replace('.html', '').split('/')[-1]
    new_url = 'http://www.aiimg.com/sucai.php?aid={}'.format(num_id)
    # Request the detail page, then pull out the download link and the title
    response_2 = requests.get(url=new_url, headers=headers)
    selector_2 = parsel.Selector(response_2.text)
    data_url = selector_2.css('.downlist a.down1::attr(href)').get()
    title = selector_2.css('.toart a::text').get()
    download_url = 'http://www.aiimg.com' + data_url
Save the data
The materials are saved as zip archives containing psd, ai or cdr files.
def download(url, title):
    # '路径' (= "path") is a placeholder: replace it with your save directory
    path = '路径' + title + '.zip'
    response = requests.get(url=url, headers=headers)
    with open(path, mode='wb') as f:
        f.write(response.content)
    print('{} download complete'.format(title))
This only crawls a single page; to crawl multiple pages, change the PageNo parameter in the listing URL and loop.
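The multi-page loop can be sketched as follows, assuming the listing URL only changes its PageNo query parameter (as the URL above suggests). The parsing and download calls from the single-page code would go inside the loop; page_urls is a hypothetical helper name, not from the original article:

```python
# Sketch of multi-page crawling: only PageNo changes between listing pages,
# so build each page's URL and reuse the single-page code for each one.
LIST_URL = ('http://www.aiimg.com/list.php?tid=1&ext=0&free=2'
            '&TotalResult=5853&PageNo={}')

def page_urls(last_page):
    """Return the listing URL for pages 1..last_page."""
    return [LIST_URL.format(page) for page in range(1, last_page + 1)]

for url in page_urls(3):
    print(url)
    # here: response = requests.get(url, headers=headers), then parse the
    # material links with parsel and call download() as shown above
```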