When traveling, we often see tourists in all sorts of costumes posing for photos at scenic spots, and we rarely give it a second thought. Two days ago, though, while browsing the web, I happened upon a site full of girls in Hanfu. Whether for work or for writing copy, pictures this beautiful make excellent material. So if we need them: crawl them, crawl them, crawl them!
Without further ado, let's go through the image crawl in detail.
Analyze the website
The URL is as follows:
'http://www.aihanfu.com/zixun/tushang-1/'
This is the URL of the first page. By observation, the URL of the second page simply changes the trailing 1 to 2, and so on, so incrementing that number reaches every page.
As the screenshot shows, we need to get each sub-page's link (the URL in the href attribute), then visit each one, find the image URLs inside, and download them.
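That pagination rule can be sketched as a tiny URL generator; note the page count of 3 here is just an example for illustration, not the site's real page count:

```python
# The listing pages share one URL template; only the page number changes.
BASE = 'http://www.aihanfu.com/zixun/tushang-{}/'

def page_urls(n_pages):
    """Yield the listing-page URLs for pages 1..n_pages."""
    for page in range(1, n_pages + 1):
        yield BASE.format(page)

urls = list(page_urls(3))
```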
Getting the sub-links
To extract the data shown above, we could use BeautifulSoup, re, or xpath. In this article I use xpath: a locating function finds each sub-page's link and hands it back to the main function. There is a small trick in the for loop, a generator built with yield; take a look!
import os
import re

import requests
from lxml import etree


def get_menu(url, headers):
    """
    Given the URL of one listing page,
    yield the sub-page URL behind each entry.
    params: url -- listing-page URL
    """
    r = requests.get(url, headers=headers)
    if r.status_code == 200:
        r.encoding = r.apparent_encoding
        html = etree.HTML(r.text)
        html = etree.tostring(html)
        html = etree.fromstring(html)
        # Find the link of each sub-page, then yield it
        children_url = html.xpath('//div[@class="news_list"]//article/figure/a/@href')
        for child in children_url:
            yield child
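To check what that xpath expression pulls out without touching the network, here is a minimal offline sketch using the stdlib ElementTree; the sample HTML is invented to mirror the structure described above:

```python
import xml.etree.ElementTree as ET

# Invented stand-in for one listing page, mirroring the article's structure.
SAMPLE = '''<div class="news_list">
  <article><figure><a href="http://www.aihanfu.com/a/1/">post 1</a></figure></article>
  <article><figure><a href="http://www.aihanfu.com/a/2/">post 2</a></figure></article>
</div>'''

root = ET.fromstring(SAMPLE)
# ElementTree has no @href step in its path syntax,
# so read the attribute from each matched <a> instead.
links = [a.get('href') for a in root.findall('.//article/figure/a')]
```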
Getting the title and the image addresses
To collect as much data as possible, we grab the title and the image addresses. Of course, if another project also needs the publisher and the post time, more fields can be added; this article will not expand on that.
Clicking into one of the links, as shown in the figure above, we can see that the title sits in the header node; we will use it when creating the folder.
The code is as follows:
def get_page(url, headers):
    """
    Given a sub-page link, collect the image addresses and download them.
    params: url -- sub-page URL
    """
    r = requests.get(url, headers=headers)
    if r.status_code == 200:
        r.encoding = r.apparent_encoding
        html = etree.HTML(r.text)
        html = etree.tostring(html)
        html = etree.fromstring(html)
        # Get the title
        title = html.xpath(r'//*[@id="main_article"]/header/h1/text()')
        # Get the image addresses
        img = html.xpath(r'//div[@class="arc_body"]//figure/img/@src')
        # Preprocess the title
        title = ''.join(title)
        title = re.sub(r'【|】', '', title)
        print(title)
        save_img(title, img, headers)
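The title preprocessing above, joining the xpath text() nodes and stripping the 【】 brackets, can be exercised in isolation; the sample title below is made up:

```python
import re

def clean_title(text_nodes):
    """Join xpath text() results and drop 【】 brackets, as get_page does."""
    title = ''.join(text_nodes)
    return re.sub(r'【|】', '', title)

cleaned = clean_title(['【汉服】', '图赏'])
```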
Saving the pictures
While paging through, we need to save the pictures behind each sub-link. Note the checks on the request status and on the save path here.
def save_img(title, img, headers):
    """
    Create a sub-folder named after the title,
    then download every image link into it.
    params: title -- the page title
    params: img   -- the list of image addresses
    """
    if not os.path.exists(title):
        os.mkdir(title)
    # Download
    for i, j in enumerate(img):  # iterate over the image URLs
        r = requests.get(j, headers=headers)
        if r.status_code == 200:
            with open(title + '/' + str(i) + '.png', 'wb') as fw:
                fw.write(r.content)
            print(title, ': image', str(i), 'downloaded!')
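Two fragile spots in the saving step are the hand-concatenated file path and the separate exists-then-mkdir check. A small sketch of the same step using os.path.join and os.makedirs(exist_ok=True) instead; the function name and the .png assumption are mine:

```python
import os
import tempfile

def save_bytes(folder, index, data):
    """Create the folder if needed and write one image's bytes as <index>.png."""
    os.makedirs(folder, exist_ok=True)  # no race between an exists check and mkdir
    path = os.path.join(folder, '{}.png'.format(index))
    with open(path, 'wb') as fw:
        fw.write(data)
    return path

# Demo against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    p = save_bytes(os.path.join(tmp, 'demo'), 0, b'\x89PNG')
    written = open(p, 'rb').read()
name = os.path.basename(p)
```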
Main function
if __name__ == '__main__':
    # Walk the listing pages one by one
    path = '/Users/********/汉服/'
    if not os.path.exists(path):
        os.mkdir(path)
    os.chdir(path)
    # url = 'http://www.aihanfu.com/zixun/tushang-1/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
                      ' AppleWebKit/537.36 (KHTML, like Gecko)'
                      ' Chrome/81.0.4044.129 Safari/537.36'}
    for page in range(1, 50):
        url = 'http://www.aihanfu.com/zixun/tushang-{}/'.format(page)
        for child in get_menu(url, headers):
            get_page(child, headers)  # handle one sub-page
With that, every step is complete. I have covered crawlers more than once, partly so that everyone becomes familiar with crawling techniques, and partly because I believe crawlers are the foundation of data analysis and data mining: without a crawler to fetch the data, how could any analysis happen? To help you keep learning and growing with crawlers, students who want the complete code can click the link below to get it.
Recommended reading
- Fun in depth! 10 interesting and easy-to-use AI projects (with Python source code)
- Python analyzed 5 years of Shanghai Index data; this crop of leeks is not so easy to cut
- Mosaic becomes HD in seconds: this method, called PULSE, is on fire
- GitHub Hot List | 5 high-quality Python gadgets, and the last one is a bonus!
For more great content, follow the WeChat public account "Python learning and data mining".
To make technical exchange easier, this account has opened a discussion group. If you have questions, add the assistant's WeChat: connect_we, noting that you came from CSDN. Reprints and favorites are welcome. Writing takes effort, so if you liked the article, please give it a like. Thanks!