Scraping Beauty Photos from 美桌网 (win4000.com) with Python

For work I needed some photos of beautiful women as illustrations, so I searched around and found 美桌网 (win4000.com), which hosts some high-resolution photos under its meinvtag*** tags. They were exactly what I needed, so I wrote a simple crawler to download them.

First, look at how the site is structured. Take the meinvtag2 tag as an example: it spans 5 pages with URLs like http://www.win4000.com/meinvtag2_1.html, where the trailing number is the page number. This is the friendliest URL scheme a crawler could ask for, so each page can simply be handled by its own thread. The HTML source of each listing page contains the URL of every album on it, and opening an album lets you view its images one by one. The image-page URLs are equally simple, e.g. http://www.win4000.com/meinv198397_2.html, but there is no obvious way to tell how many images an album holds. My approach is to loop the image index from 1 to 50 and break out of the loop as soon as a page fails to load. A quick check of the selectors is shown below, followed by the full script.
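To confirm where the album links live before running the whole crawl, here is a minimal sketch. The tab_box class and taking the second match come from inspecting the page source, so treat them as assumptions that will break if the site is redesigned:

import requests
from bs4 import BeautifulSoup

# Fetch the first listing page and print every album URL on it
html = requests.get('http://www.win4000.com/meinvtag2_1.html', timeout=10)
html.raise_for_status()
bs = BeautifulSoup(html.text, 'lxml')
boxes = bs.find_all('div', {'class': 'tab_box'})
for a in boxes[1].find_all('a'):  # the second tab_box holds the album links
    print(a.get('href'))

With the selectors confirmed, the full script is as follows.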

import requests
from bs4 import BeautifulSoup
import threading

def download_img_from_url(path, url):
    # Save the image at url to the local file at path
    with open(path, 'wb') as f:
        f.write(requests.get(url, timeout=10).content)

def get_BS(url):
    # Return a parsed page, or None when the request fails
    # (a 404 past the last image of an album ends up here)
    try:
        html = requests.get(url, timeout=10)
        html.raise_for_status()
        return BeautifulSoup(html.text, "lxml")
    except requests.RequestException:
        return None

def download(i):
    # Handle one listing page: collect its albums, then probe each album's
    # image pages until one stops loading
    page_url = page_url_format.format(i)
    bs = get_BS(page_url)
    if bs is None:
        return
    boxes = bs.find_all("div", {'class': 'tab_box'})
    tags = boxes[1].find_all('a')  # the second tab_box holds the album links
    for tag in tags:
        album_url = tag.get('href')
        album_url = album_url[0:-5] + '_'  # strip ".html" so the image index can be appended
        for idx in range(1, 51):  # probe image indexes from 1 to 50
            img_page_url = album_url + str(idx) + ".html"
            bs2 = get_BS(img_page_url)
            if bs2 is None:
                break  # past the last image of this album
            img = bs2.find("img", class_='pic-large')
            if img is None:
                break
            img_url = img.get('data-original')  # full-size image URL
            name = img_page_url.split('/')[-1][:-5]  # e.g. "meinv198397_2"
            download_img_from_url(save_path.format(name), img_url)

page_url_format = 'http://www.win4000.com/meinvtag2_{}.html'
save_path = 'D:\\image\\{}.jpg'
threads = []
for i in range(1, 6):  # one thread per listing page
    # pass i via args; a bare lambda would capture the loop variable late,
    # so every thread could end up crawling the same page
    thread = threading.Thread(target=download, args=(i,))
    thread.start()
    threads.append(thread)

for thread in threads:
    thread.join()
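
The manual thread bookkeeping can also be delegated to the standard library. A sketch with concurrent.futures, reusing the same download function as above:

from concurrent.futures import ThreadPoolExecutor

# One worker per listing page; the with block waits for all downloads
# to finish, and passing page numbers through map avoids the
# loop-variable capture pitfall entirely
with ThreadPoolExecutor(max_workers=5) as pool:
    pool.map(download, range(1, 6))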