Crawling a design-material website with Python: download the materials for free, no money required

Preface

The text and images in this article are from the Internet and are for learning and communication purposes only; they are not for any commercial use. If you have any questions, please contact us.


Related environment configuration

  • python 3.6
  • pycharm
  • requests
  • parsel

The third-party modules can be installed with pip: pip install requests parsel

Determine the target page


I want to crawl these material images, so I work backwards from the goal: first I need to know the download address of each material.


Clicking a material opens its detail page, where you can see the local download address. Copy a few download links to compare:

http://www.aiimg.com/sucai.php?open=1&aid=126632&uhash=70a6d2ffc358f79d9cf71392
http://www.aiimg.com/sucai.php?open=1&aid=126630&uhash=99b07c347dc24533ccc1c144
http://www.aiimg.com/sucai.php?open=1&aid=126634&uhash=d7e8f7f02f57568e280190b4

The aid in each link is different, so it should be the ID of each material. But what is the uhash parameter after it? I first wondered whether some interface data in the page contained this parameter, but searching in the developer tools turned up nothing. So let's check whether the download link itself appears in the page's HTML source.
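The check described above can be sketched with a regular expression. This is a minimal offline sketch: the HTML fragment below is an illustrative guess at the detail page's markup (class names taken from the parsing code later in this article), not the site's actual source.

```python
import re

# Hypothetical fragment of a material detail page containing the download link.
# The surrounding markup is assumed; the link format matches the copied URLs above.
html = '''<div class="downlist">
  <a class="down1" href="/sucai.php?open=1&aid=126632&uhash=70a6d2ffc358f79d9cf71392">Local download</a>
</div>'''

# If the link appears verbatim in the source, aid and uhash can be lifted directly
match = re.search(r'/sucai\.php\?open=1&aid=(\d+)&uhash=([0-9a-f]+)', html)
if match:
    aid, uhash = match.groups()
    print(aid, uhash)  # → 126632 70a6d2ffc358f79d9cf71392
```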


If the link is in the source, we can simply extract it and download the file.

Routine operation:

1. Open the developer tools and check whether the webpage returns the data you want to obtain.

You can see that the data we need is in the page's HTML itself, so we just request the page and work with the returned data:

import requests
url = 'http://www.aiimg.com/list.php?tid=1&ext=0&free=2&TotalResult=5853&PageNo=1'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}
response = requests.get(url=url, headers=headers)
print(response.text)
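One practical note: if response.text comes back garbled on a Chinese site, the page is probably GBK-encoded while requests defaulted to ISO-8859-1. Setting the encoding before reading .text fixes it. The sketch below is offline and uses a hand-built Response object; that this particular site serves GBK is an assumption, so check the real page's meta charset.

```python
import requests

# Simulate a GBK-encoded Chinese page body (no network needed for this sketch)
resp = requests.models.Response()
resp._content = '<title>AI素材</title>'.encode('gbk')

# Tell requests the correct encoding before decoding the body.
# Alternatively: resp.encoding = resp.apparent_encoding (auto-detection).
resp.encoding = 'gbk'
print(resp.text)  # → <title>AI素材</title>
```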

Parse web page data

import parsel
selector = parsel.Selector(response.text)
# Each list item links to a detail page whose URL ends in '<id>.html';
# that id is the aid parameter of the download page
lis = selector.css('.imglist_d ul li a::attr(href)').getall()
for li in lis:
    num_id = li.replace('.html', '').split('/')[-1]
    new_url = 'http://www.aiimg.com/sucai.php?aid={}'.format(num_id)
    response_2 = requests.get(url=new_url, headers=headers)
    selector_2 = parsel.Selector(response_2.text)
    data_url = selector_2.css('.downlist a.down1::attr(href)').get()
    title = selector_2.css('.toart a::text').get()
    if data_url is None:  # skip entries without a local download link
        continue
    download_url = 'http://www.aiimg.com' + data_url

Save data

The materials are psd, ai or cdr files, saved as zip archives

def download(url, title):
    # Save the zip archive under a folder of your choice
    path = 'path/to/save/' + title + '.zip'
    response = requests.get(url=url, headers=headers)
    with open(path, mode='wb') as f:
        f.write(response.content)
        print('{} download complete'.format(title))
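Since the title is used directly in the file path, it is worth cleaning it first: titles scraped from a page can contain characters that are illegal in Windows filenames. A small hedged helper (the exact character set to strip is my assumption, not from the original article):

```python
import re

def safe_filename(title):
    """Replace characters that are illegal in Windows filenames with '_'."""
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip()

print(safe_filename('a/b:c?'))  # → a_b_c_
```

Call safe_filename(title) before building the path in download().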

This only crawls a single page, but it can easily be extended to multiple pages.
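A minimal sketch of the multi-page extension: vary the PageNo parameter of the list URL and run the same parsing loop on each page. The total page count is not given in the article, so the upper bound here is a placeholder you should confirm against the site's pagination.

```python
# List-page URL from the single-page example; only PageNo changes per page
base = 'http://www.aiimg.com/list.php?tid=1&ext=0&free=2&TotalResult=5853&PageNo={}'

def page_urls(last_page):
    """Yield list-page URLs for pages 1..last_page."""
    for page in range(1, last_page + 1):
        yield base.format(page)

for url in page_urls(3):
    print(url)  # feed each URL into the request/parse code above
```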


Origin blog.csdn.net/weixin_43881394/article/details/109059650