Previous articles in this series
Python crawler tutorial for beginners (1): crawling Douban movie rankings
Python crawler tutorial for beginners (2): crawling novels
Python crawler tutorial for beginners (3): crawling Lianjia second-hand housing data
Python crawler tutorial for beginners (4): crawling 51job.com job listings
Python crawler tutorial for beginners (5): crawling Bilibili danmaku (bullet comments)
Python crawler tutorial for beginners (6): making word clouds
Python crawler tutorial for beginners (7): crawling Tencent Video danmaku
Python crawler tutorial for beginners (8): crawling forum articles and saving them as PDF
Python crawler tutorial for beginners (9): a multi-threaded crawler walkthrough
Basic development environment
- Python 3.6
- PyCharm
Modules used
- requests
- parsel
Install Python and add it to your PATH environment variable, then install the required modules with pip (pip install requests parsel).
1. Define the requirements
The wallpapers on netbian.com are, in my opinion, really beautiful. They can be crawled for free, but if you can afford it, consider paying for them. It is not expensive: 30 yuan buys unrestricted downloads from the whole site.
2. Web page data analysis
Click through a picture step by step and you will see that the URL of its details page is made up of the picture ID and the picture resolution.
Click once more and you reach the real address of the picture. If you only want a single wallpaper, save the picture from there and you have a high-definition wallpaper.
If you want to crawl the pictures, note that the details page contains the link to the picture.
So all we need from the list page is the ID of each picture. The resolution we crawl is always 1920x1080, so that part of the URL is fixed: take the image ID and build the details-page URL from it.
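The URL stitching described above can be sketched as a small helper. The function name build_detail_url is made up for illustration; the URL pattern and the sample ID 23177 are the ones that appear in the code later in this article.

```python
def build_detail_url(image_id, resolution='1920x1080'):
    """Build the details-page URL from an image ID and a fixed resolution."""
    return f'http://www.netbian.com/desk/{image_id}-{resolution}.htm'


# 23177 is the sample ID taken from the href /desk/23177.htm
detail_url = build_detail_url('23177')
print(detail_url)
```

Keeping the resolution as a parameter with a default makes it easy to try another size later, although this article only uses 1920x1080.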
There is one thing to watch out for here:
On every list page, two of the entries are not wallpapers and should be skipped.
The entries we need have a title and the ones we don't need have none, so we can filter on that.
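The title check can be tried offline before touching the site. The sketch below is only an illustration: it uses the standard-library html.parser instead of parsel so that it runs without extra installs, and SAMPLE_HTML is a made-up fragment that imitates a list page, not the site's real markup.

```python
from html.parser import HTMLParser

# Hypothetical list-page fragment: the second entry has no title and must be skipped.
SAMPLE_HTML = """
<div class="list"><ul>
<li><a href="/desk/23177.htm" title="Sample wallpaper A"><img src="a.jpg" alt="Sample wallpaper A"></a></li>
<li><a href="https://example.com/ad"><img src="ad.jpg"></a></li>
<li><a href="/desk/23180.htm" title="Sample wallpaper B"><img src="b.jpg" alt="Sample wallpaper B"></a></li>
</ul></div>
"""


class TitledLinkParser(HTMLParser):
    """Collect the image ID of every <a> tag that carries a title attribute."""

    def __init__(self):
        super().__init__()
        self.image_ids = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attributes = dict(attrs)
        # Entries without a title are not wallpapers, so ignore them
        if attributes.get('title'):
            href = attributes.get('href', '')
            # /desk/23177.htm -> 23177
            self.image_ids.append(href.replace('.htm', '').split('/')[-1])


parser = TitledLinkParser()
parser.feed(SAMPLE_HTML)
image_ids = parser.image_ids
print(image_ids)
```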
3. Code implementation
Get web page data
import requests


def get_response(html_url):
    """Fetch a page (or an image) and return the response."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    }
    response = requests.get(url=html_url, headers=headers)
    return response
Get the ID of each wallpaper
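import parsel


def main(html_url):
    """Collect the image IDs from a list page and save each wallpaper."""
    response = get_response(html_url)
    selector = parsel.Selector(response.text)
    image_info = selector.css('.list ul li')
    for link in image_info:
        image_title = link.css('a::attr(title)').get()
        # Simple check: only entries with a title are wallpapers
        if image_title:
            id_info = link.css('a::attr(href)').get()
            # e.g. /desk/23177.htm -> 23177
            image_id = id_info.replace('.htm', '').split('/')[-1]
            # get_image_url() and save() are defined in the next two steps
            image_url = get_image_url(image_id)
            save(image_url, image_title)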
Get the url address of each wallpaper
def get_image_url(image_id):
    """Return the real URL of the image from its details page."""
    page_url = f'http://www.netbian.com/desk/{image_id}-1920x1080.htm'
    response = get_response(page_url)
    selector = parsel.Selector(response.text)
    image_url = selector.css('#endimg a::attr(href)').get()
    return image_url
Save wallpaper
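import os

def save(image_url, title):
    """Download the image and save it as images/<title>.jpg."""
    # Create the output folder if it does not exist yet
    os.makedirs('images', exist_ok=True)
    filename = os.path.join('images', title + '.jpg')
    image_content = get_response(image_url).content
    with open(filename, mode='wb') as f:
        f.write(image_content)
    print('Saving:', title)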
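When I finished writing the code and ran it, I found a problem.
The pictures are saved, but their titles (used as the file names) come out garbled, so one line has to be added to get_response, the function that fetches the page data, right before return response:
    # Catch-all fix: use the encoding detected from the response body
    response.encoding = response.apparent_encoding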
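Why the titles were garbled can be reproduced without any network access. The sketch below assumes, for illustration only, that the page is served as GBK and that the text was decoded as ISO-8859-1, which is the fallback requests uses for text responses whose Content-Type header declares no charset. The sample title is made up.

```python
# Hypothetical title as it would arrive over the wire if the page is GBK-encoded
raw_bytes = '风景壁纸'.encode('gbk')

# Decoding with the wrong charset never fails for ISO-8859-1, it just yields mojibake
wrong_title = raw_bytes.decode('iso-8859-1')

# Decoding with the charset the bytes were written in restores the title
right_title = raw_bytes.decode('gbk')

print(repr(wrong_title))
print(right_title)
```

Setting response.encoding to response.apparent_encoding asks requests to guess the charset from the body, so response.text is decoded with a detected charset rather than the fallback.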
After this change the titles are displayed correctly in Chinese, but a new problem appears.
Each title is followed by an update time. Going back to the page source shows why.
The title attribute of the a tag includes the update time, while the alt attribute of the img tag does not, so the title should be taken from alt instead.
After the change:
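def main(html_url):
    """Collect the image IDs from a list page and save each wallpaper."""
    response = get_response(html_url)
    selector = parsel.Selector(response.text)
    image_info = selector.css('.list ul li')
    for link in image_info:
        # The a title is only used to tell wallpapers from other entries
        image_title = link.css('a::attr(title)').get()
        if image_title:
            # The img alt holds the title without the update time
            title = link.css('a img::attr(alt)').get()
            id_info = link.css('a::attr(href)').get()
            # e.g. /desk/23177.htm -> 23177
            image_id = id_info.replace('.htm', '').split('/')[-1]
            image_url = get_image_url(image_id)
            save(image_url, title)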
The wallpapers are fairly large files. To crawl more of them, just loop over the page numbers of the list pages. For multi-threaded crawling, see the previous article in this series (part 9).
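Turning pages by page number could look roughly like the sketch below. The helper build_page_urls and the URL pattern in PAGE_TEMPLATE are placeholders invented for this example; check the site's actual pagination links and substitute the real pattern.

```python
def build_page_urls(template, first_page, last_page):
    """Return the list-page URLs for first_page..last_page (inclusive)."""
    return [template.format(page=page) for page in range(first_page, last_page + 1)]


# Hypothetical URL pattern, not the site's real one
PAGE_TEMPLATE = 'https://example.com/list_{page}.htm'

page_urls = build_page_urls(PAGE_TEMPLATE, 2, 4)
for page_url in page_urls:
    # In the crawler, call main(page_url) here instead of printing
    print(page_url)
```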