I like collecting wallpapers. I found the wallpapers under the "beautiful" (weimei) category of the Netbian desktop wallpaper site and liked them very much, so I wrote a crawler for that category. I later noticed that the page structure is basically the same across the whole site, so with a little extra code the crawler can download the HD wallpapers of the entire site.
Article Directory
- Part 1: Overview
- Part 2: Environment Preparation
- Part 3: Analyzing the Page Structure
- Part 4: Code Walkthrough
  - Step 1: Set up global variables
  - Step 2: Get the filtered content list of a page
  - Step 3: Get the URLs of all categories
  - Step 4: Get the URLs of all pages under a category
  - Step 5: Get the image URLs on a page
  - Step 6: Locate the 1920 × 1080 image
  - Step 7: Download the images
- Part 5: Fault Tolerance
- Part 6: Complete Code
Part 1: Overview
On your computer, create a folder to store the crawled wallpapers. Under it there are 25 folders corresponding to the categories; under each category folder there are several folders corresponding to page numbers; the image files are stored inside the page-number folders.
Part 2: Environment Preparation
- Environment setup: how to write Python code with VSCode (Win10 x64 system)
You also need three third-party packages (see their official documentation if you are interested):
- requests: fetches pages via HTTP requests (official documentation)
- lxml: a Python parsing library that supports HTML and XML as well as XPath parsing, with very high efficiency (official documentation)
- Beautiful Soup 4: extracts data from HTML or XML files (official documentation)
Enter the following pip commands in a terminal to install them:
python -m pip install beautifulsoup4
python -m pip install lxml
python -m pip install requests
Part 3: Analyzing the Page Structure
- Because my computer's resolution is 1920 × 1080, that is the resolution of the images I crawl.
- The Netbian site offers many categories to browse: calendar, anime, scenery, beauties, games, movies, animated, aesthetic, design... The wallpapers under the 4K category are an important source of the site's revenue, and I have no need for 4K wallpapers, so that category is not crawled.
CSS selector: #header > div.head > ul > li:nth-child(1) > div > a, which locates the a tags wrapping the categories.
I will use the wallpapers under the weimei (beautiful) category to explain how to crawl the images.
- There are 73 pages in total; every page except the last has 18 images.
In the code, though, it is better to obtain the total page count automatically. Fortunately, the structure of the Netbian site is very consistent: the HTML of every page is essentially the same.
CSS selector: div.page a; only 6 a tags wrap page numbers.
- The third image on every page is the same advertisement; it must be filtered out in the code.
- The hyperlink of each page is very clear: http://www.netbian.com/weimei/index_x.htm, where x is exactly the page number.
- Note: the images you see under a category are low-resolution thumbnails; to get the 1920 × 1080 version you have to follow two jumps.
The following figure is an example.
On the category page we can obtain the URL of an image directly, but unfortunately its resolution is unsatisfactory; inspection shows that every image displayed on the category page points to another hyperlink.
CSS selector: div#main div.list ul li a, which locates the a tag wrapping the image.
Clicking an image jumps to a new link (the first jump), and the page shows the following content:
CSS selector: div#main div.endpage div.pic div.pic-down a, which locates the a tag wrapping the image.
Clicking the "download wallpaper (1920 × 1080)" button jumps to a new link (the second jump) and finally reaches the goal: the image displayed at that link has a resolution of 1920 × 1080.
After some twists and turns, I finally found the 1920 × 1080 high-resolution version of the image.
CSS selector: div#main table a img, which locates the img tag of the image.
In my crawl tests, a very few images failed to download due to miscellaneous problems, and a small number came in other resolutions even though the site offered a 1920 × 1080 download button.
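The page-URL pattern described above is simple enough to capture in a small helper function (a sketch of my own; the article's code in Step 4 builds these URLs inline):

```python
def page_url(category_url, page):
    """Build the URL of a given page under a category.

    Page 1 is index.htm; page n (n > 1) is index_n.htm,
    e.g. http://www.netbian.com/weimei/index_2.htm.
    """
    if page == 1:
        return category_url + 'index.htm'
    return category_url + 'index_%d.htm' % page
```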
Part 4: Code Walkthrough
- As you follow the explanation, adjust the values marked below (red bold in the original post) to your own situation.
Step 1: Set up global variables
index = 'http://www.netbian.com' # site root URL
interval = 10 # interval between image downloads (seconds)
firstDir = 'D:/zgh/Pictures/netbian' # root directory for the downloads
classificationDict = {} # information about the site's category sub-pages
- index: the root URL of the site; the code needs it to stitch together the complete URL of each image.
- interval: when crawling a site we must consider the load its server can bear; fetching a large amount of content in a short time puts great pressure on the server, so we set an interval between requests (unit: seconds). Because I want to crawl high-resolution images from the whole site, a concentrated crawl would both stress the server and likely get my connection forcibly closed, so I set the interval between images to 10 seconds; if you only crawl a few images, you can use a shorter interval.
- firstDir: the root path on your computer where the crawled images are stored; the code creates folders under it, organized by page number under each category.
- classificationDict: stores, for each category on the site, the URL it points to and the path of the corresponding folder.
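For illustration, after init_classification() (Step 3) runs, classificationDict holds entries shaped roughly like this. The weimei entry is reconstructed from the URLs shown later in the article; the category name '唯美' and the other details are assumptions:

```python
firstDir = 'D:/zgh/Pictures/netbian'  # download root, as set above

# shape of classificationDict after initialization (illustrative entry only)
classificationDict = {
    '唯美': {                                      # the "weimei" (beautiful) category
        'path': firstDir + '/唯美',                # folder for this category
        'url': 'http://www.netbian.com/weimei/',  # category sub-page URL
    },
}
```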
Step 2: Get the content list after the page is filtered
Write a function to get the filtered content array of the page
- Two parameters are passed in
url: the url of the webpage
select: selector (seamlessly docked with the selector in CSS, I like it very much, locate the corresponding element in HTML) - Return a list
import requests
from bs4 import BeautifulSoup
import UserAgent # helper module shipped with the complete code (Part 6)

def screen(url, select):
    html = requests.get(url = url, headers = UserAgent.get_headers()) # pick a random set of headers
    html.encoding = 'gbk'
    html = html.text
    soup = BeautifulSoup(html, 'lxml')
    return soup.select(select)
- headers pretends the request comes from a real user's browser; to improve the crawler's success rate, a random header is picked for every page fetched.
- encoding: the site's encoding is gbk.
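The UserAgent module is shipped as a separate file with the complete code (Part 6) and is not shown in the article. A minimal stand-in, assuming it only needs to return a random User-Agent header, could look like this:

```python
import random

# a few real browser User-Agent strings; extend the list as you like
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/605.1.15 '
    '(KHTML, like Gecko) Version/13.0 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:71.0) Gecko/20100101 Firefox/71.0',
]

def get_headers():
    """Return a headers dict with a randomly chosen User-Agent."""
    return {'User-Agent': random.choice(USER_AGENTS)}
```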
Step 3: Get the URLs of all categories
# store the category sub-page information in a dictionary
def init_classification():
    url = index
    select = '#header > div.head > ul > li:nth-child(1) > div > a'
    classifications = screen(url, select)
    for c in classifications:
        href = c.get('href') # a relative address
        text = c.string # the category name
        if(text == '4k壁纸'): # the 4k category cannot be crawled due to permission issues; skip it
            continue
        secondDir = firstDir + '/' + text # category directory
        url = index + href # category sub-page url
        global classificationDict
        classificationDict[text] = {
            'path': secondDir,
            'url': url
        }
In the code that follows, I will use the wallpapers under the weimei category to show how to crawl the HD images through the two link jumps.
Step 4: Get the URLs of all pages under a category
Most categories have more than 6 pages, so you could directly use the screen function defined above with select set to div.page a; the sixth element of the list it returns would then give the number of the last page.
However, some categories have fewer than 6 pages, so we need to write another filter function that reaches the page count through a sibling element.
# get the page number
def screenPage(url, select):
    html = requests.get(url = url, headers = UserAgent.get_headers())
    html.encoding = 'gbk'
    html = html.text
    soup = BeautifulSoup(html, 'lxml')
    return soup.select(select)[0].next_sibling.text
Get the URLs of all pages under the category:
url = 'http://www.netbian.com/weimei/'
select = '#main > div.page > span.slh'
pageIndex = screenPage(url, select)
lastPagenum = int(pageIndex) # the number of the last page
for i in range(lastPagenum):
    if i == 0:
        url = 'http://www.netbian.com/weimei/index.htm'
    else:
        url = 'http://www.netbian.com/weimei/index_%d.htm' %(i+1)
Since the site's HTML structure is very clear, the code is simple to write.
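To see why next_sibling works, here is the same trick run against a minimal HTML snippet that mimics the pagination bar (the snippet is an assumption, reconstructed from the selectors in the article):

```python
from bs4 import BeautifulSoup

# minimal mock of the pagination bar: the last page number sits
# right after the '...' span, so we reach it via next_sibling
html = ('<div id="main"><div class="page">'
        '<a href="index.htm">1</a><a href="index_2.htm">2</a>'
        '<span class="slh">...</span><a href="index_73.htm">73</a>'
        '</div></div>')

soup = BeautifulSoup(html, 'html.parser')
pageIndex = soup.select('#main > div.page > span.slh')[0].next_sibling.text
lastPagenum = int(pageIndex)
```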
Step 5: Get the image URLs on a page
Inspection shows that the URL obtained here is a relative address, which needs to be converted into an absolute one.
select = 'div#main div.list ul li a'
imgUrls = screen(url, select)
The values in the list obtained by these two lines of code are like this:
<a href="/desk/21237.htm" target="_blank" title="星空 女孩 观望 唯美夜景壁纸 更新时间:2019-12-06"><img alt="星空 女孩 观望 唯美夜景壁纸" src="http://img.netbian.com/file/newc/e4f018f89fe9f825753866abafee383f.jpg"/><b>星空 女孩 观望 唯美夜景壁纸</b></a>
- The obtained list needs processing
- Take the href attribute of each a tag and convert it into an absolute address; that is the URL needed for the first jump
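As a small demonstration, the a tag shown above can be parsed and its href turned into an absolute address like this (the absolutize helper is my sketch of the check used in Step 6; the sample tag is a shortened copy of the one from the article):

```python
from bs4 import BeautifulSoup

index = 'http://www.netbian.com'  # site root, as in Step 1

def absolutize(href):
    """A very few images give an absolute address directly; the rest are relative."""
    return href if href.startswith('http://') else index + href

# shortened copy of the sample a tag shown in the article
tag = BeautifulSoup('<a href="/desk/21237.htm" target="_blank"></a>',
                    'html.parser').a
first_jump_url = absolutize(tag.get('href'))
```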
Step 6: Locate the 1920 × 1080 image
# locate the 1920 × 1080 image
def handleImgs(links, path):
    for link in links:
        href = link.get('href')
        if(href == 'http://pic.netbian.com/'): # filter out the image advertisement
            continue

        # first jump
        if('http://' in href): # a very few images do not give a correct relative address
            url = href
        else:
            url = index + href
        select = 'div#main div.endpage div.pic div.pic-down a'
        link = screen(url, select)
        if(link == []):
            print(url + ' image not found, crawl failed')
            continue
        href = link[0].get('href')

        # second jump
        url = index + href

        # we have reached the image
        select = 'div#main table a img'
        link = screen(url, select)
        if(link == []):
            print(url + ' this image requires login to crawl, crawl failed')
            continue
        name = link[0].get('alt').replace('\t', '').replace('|', '').replace(':', '').replace('\\', '').replace('/', '').replace('*', '').replace('?', '').replace('"', '').replace('<', '').replace('>', '')
        print(name) # print the file name of the image being downloaded
        src = link[0].get('src')
        if(requests.get(src).status_code == 404):
            print(url + ' download link returned 404, crawl failed')
            print()
            continue
        print()
        download(src, name, path)
        time.sleep(interval)
Step 7: Download the images
The download operation:
def download(src, name, path):
    if(isinstance(src, str)):
        response = requests.get(src)
        path = path + '/' + name + '.jpg'
        while(os.path.exists(path)): # if the file name already exists
            path = path.split(".")[0] + str(random.randint(2, 17)) + '.' + path.split(".")[1]
        with open(path, 'wb') as pic:
            for chunk in response.iter_content(128):
                pic.write(chunk)
Part 5: Fault Tolerance
One: filter out the image advertisement
if(href == 'http://pic.netbian.com/'): # filter out the image advertisement
    continue
Two: after the first jump, the page does not contain the link we need
On the Netbian site, the link on the first-jump page is normally given as a relative address. However, a very few images give an absolute address directly, or give the category URL instead, so two checks are needed.
if('http://' in href):
    url = href
else:
    url = index + href
...
if(link == []):
    print(url + ' image not found, crawl failed')
    continue
The following problems appear on the page reached by the second jump.
Three: the image cannot be crawled due to permission issues
if(link == []):
    print(url + ' this image requires login to crawl, crawl failed')
    continue
Four: the alt of the img is used as the file name of the downloaded image, but it may contain \t or characters that are not allowed in file names:
- In Python, '\t' is the tab escape character
- In Windows, a file name may not contain any of these 9 special characters: \ / : * ? " < > |
name = link[0].get('alt').replace('\t', '').replace('|', '').replace(':', '').replace('\\', '').replace('/', '').replace('*', '').replace('?', '').replace('"', '').replace('<', '').replace('>', '')
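The chain of replace calls works but is long; an equivalent, shorter alternative (my suggestion, not the article's code) is a single regular expression that strips tabs and all nine forbidden characters at once:

```python
import re

def safe_name(alt):
    """Strip \\t and the characters Windows forbids in file names: \\ / : * ? " < > |"""
    return re.sub(r'[\t\\/:*?"<>|]', '', alt)
```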
Five: the alt of the img, used as the file name, may collide with an existing file name
path = path + '/' + name + '.jpg'
while(os.path.exists(path)): # if the file name already exists
    path = path.split(".")[0] + str(random.randint(2, 17)) + '.' + path.split(".")[1]
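One caveat: path.split(".") breaks if the directory part of the path ever contains a dot. A slightly more robust variant of the same idea, using os.path.splitext, shown here as a sketch (the exists parameter is injectable so the logic can be tested without touching the disk):

```python
import os
import random

def unique_path(path, exists=os.path.exists):
    """Append a random suffix to the stem until the name no longer collides."""
    while exists(path):
        stem, ext = os.path.splitext(path)
        path = stem + str(random.randint(2, 17)) + ext
    return path
```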
Six: the image download link returns 404; for example:
if(requests.get(src).status_code == 404):
    print(url + ' download link returned 404, crawl failed')
    print()
    continue
Part 6: Complete Code
- Lanzou cloud link: Python crawler, I want all the HD beautiful images (Netbian desktop wallpapers). After downloading and unzipping the archive, there are two Python files.
Original blog post: https://blog.csdn.net/Zhangguohao666/article/details/105131503