Crawling the Meizitu image site with Python in 8 steps

This post uses a Python crawler to grab every image on the Meizitu site. Meizitu is essentially a static site; crawling dynamic sites, using Douban and Duitang as examples, will be covered in a follow-up post.
1. First, import the libraries this crawl needs. Code below:

import os
import random
import time

import requests
from requests import RequestException
from bs4 import BeautifulSoup   # the 'lxml' parser below also needs the lxml package installed

2. To cope with the site's anti-crawling measures, we spoof the request headers and set proxy IPs. User-Agent strings are easy to find online; for proxies I used the domestic high-anonymity list (http://www.xicidaili.com/nn/). Most of the IPs work, but the downside is that they stop working quickly. The best approach is to build a proxy pool and let Python fetch usable IPs automatically, so you don't have to keep editing the list by hand. Code below:

Agent = ['Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)',
         'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.4.3.4000 Chrome/30.0.1599.101 Safari/537.36',
         'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36',
         'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
         'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36'
         ]
http = ['https://106.122.170.69:818',
        'https://121.231.168.225:6666',
        'https://27.40.145.102:61234',
        'https://1.196.161.172:9999',
        'https://117.36.103.170:8118',
        'https://221.228.17.172:8181',
        'https://1.196.161.170:9999'
        ]
# pick one User-Agent and one proxy at random for the whole session
headers = {"User-Agent": random.choice(Agent)}
proxies = {"https": random.choice(http)}

3. Build the page request with requests.get(), passing the headers and proxies as parameters. Since we don't know which IP still works, and the crawl may get blocked by the site's anti-crawling measures, wrap the request in try...except. Code below:


def get_page_url(url):
    try:
        r = requests.get(url, headers=headers, proxies=proxies)
        r.encoding = "utf-8"
        return r.text
    except RequestException as e:
        print(u'request failed', e)
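
Since any single proxy may already be dead, a small retry loop that draws a fresh User-Agent and proxy on each failure makes the request more resilient. A sketch (the retry count, timeout, and function name are assumptions):

def get_page_url_retry(url, retries=3):
    # hypothetical variant of get_page_url: retry with a freshly chosen
    # User-Agent and proxy on each failure, up to `retries` attempts
    for _ in range(retries):
        try:
            r = requests.get(url,
                             headers={"User-Agent": random.choice(Agent)},
                             proxies={"https": random.choice(http)},
                             timeout=10)
            r.encoding = "utf-8"
            return r.text
        except RequestException as e:
            print(u'request failed, retrying with a new proxy', e)
    return None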

4. Parse the page: work out where the images sit and which tags wrap them, then use find_all to collect every image tag. Code below:

def get_pic_link(html):
    soup = BeautifulSoup(html, 'lxml')
    # every album thumbnail on the index page is a lazy-loaded <img class="lazy">
    tags = soup.find_all('img', class_='lazy')
    return tags
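
For reference, each matched tag looks roughly like the sample below (the markup is an assumption, modelled on the data-original and alt attributes that step 8 reads):

sample = '<img class="lazy" data-original="http://imgs.douqq.com:88/abc123/1.jpg" alt="sample album">'
for tag in BeautifulSoup(sample, 'lxml').find_all('img', class_='lazy'):
    print(tag.get('data-original'), tag.get('alt'))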
5. Since I want to crawl the whole site rather than a single page, I need to find all the page links in each album, so Python can simulate the result of our mouse clicks while crawling. Code below:

def get_pic_page_url(pic_link):
    reg = get_page_url(pic_link)
    soup = BeautifulSoup(reg, 'lxml')
    pager = soup.find('div', class_='n_page')
    # the pager text carries the album's total page count at a fixed offset
    page_data = pager.get_text()[7:9]
    return page_data
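
The [7:9] slice assumes the page count always sits at the same character offset in the pager text, which is brittle. A regex-based variant is less position-dependent (a sketch; the function name is hypothetical, and it assumes the total appears as the largest number in the pager text):

import re

def get_pic_page_count(pic_link):
    # hypothetical variant of get_pic_page_url using a regex instead of a fixed slice
    soup = BeautifulSoup(get_page_url(pic_link), 'lxml')
    text = soup.find('div', class_='n_page').get_text()
    numbers = re.findall(r'\d+', text)
    return max(int(x) for x in numbers) if numbers else 1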

6. Define a storage function. First set a storage path: everything crawled goes into a test1 folder, with one sub-folder per album, named after the album title. Check whether the folder already exists; create it if not, otherwise print a notice. Code below:

def mk_dir(pic_save_name):
    pic_save_name = pic_save_name.strip()
    path = r"C:\**\**\**\test1\{}".format(pic_save_name)
    if not os.path.exists(path):
        os.mkdir(path)
    else:
        print('folder already exists')
    return path
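
If the intermediate directories might not exist yet, os.makedirs with exist_ok=True (Python 3.2+) creates the whole chain in one call and never complains about an existing folder. A sketch (mk_dir_v2 is a hypothetical name; the placeholder path is kept from the original):

def mk_dir_v2(pic_save_name):
    # hypothetical variant of mk_dir: also creates missing parent directories
    path = os.path.join(r"C:\**\**\**\test1", pic_save_name.strip())
    os.makedirs(path, exist_ok=True)
    return path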

7. Define a downloader; its parameters are exactly the values returned by the functions above. Code below:

def down_load_pic(pic_download_url, save_path, pic_save_name, n):
    try:
        pic_response = requests.get(pic_download_url, headers=headers, proxies=proxies)
        with open(save_path, "wb") as f:
            f.write(pic_response.content)
        print(u"downloading {}, image {}".format(pic_save_name, n))
    except Exception as e:
        print("download failed", e)

8. Finally, define a run function. It really shouldn't be written like this, piling everything together is messy, but laziness won out (in real work, absolutely don't do this). Code below:

def main():
    for D in range(1, 69):   # the index has 68 pages
        url = 'http://mm.douqq.com/index.php?page={}'.format(D)
        html = get_page_url(url)
        tags = get_pic_link(html)
        print(tags)
        for pic_url in tags:
            # the album id is the second-to-last segment of the thumbnail URL
            pic = pic_url.get("data-original").split('/')[-2]
            pic_link = 'http://mm.douqq.com/' + pic + '.html'
            pic_save_name = pic_url.get("alt")
            path = mk_dir(pic_save_name)
            page = get_pic_page_url(pic_link)
            for n in range(1, int(page) + 1):
                pic_download_url = 'http://imgs.douqq.com:88/{}/{}.jpg'.format(pic, n)
                save_path = path + '\\' + str(n) + '.jpg'
                # reuse the downloader from step 7 instead of repeating it inline
                down_load_pic(pic_download_url, save_path, pic_save_name, n)
        time.sleep(2.0)   # pause between index pages to go easy on the site

main()
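
Following up on the caveat in step 8, one way to untangle main() is to pull the per-album work into its own helper (a sketch; crawl_album is a hypothetical name, not from the original post):

def crawl_album(pic_tag):
    # hypothetical helper: download every image of one album using the steps above
    pic = pic_tag.get("data-original").split('/')[-2]
    pic_save_name = pic_tag.get("alt")
    path = mk_dir(pic_save_name)
    page = get_pic_page_url('http://mm.douqq.com/' + pic + '.html')
    for n in range(1, int(page) + 1):
        url = 'http://imgs.douqq.com:88/{}/{}.jpg'.format(pic, n)
        down_load_pic(url, path + '\\' + str(n) + '.jpg', pic_save_name, n)

def main():
    for D in range(1, 69):
        html = get_page_url('http://mm.douqq.com/index.php?page={}'.format(D))
        for pic_tag in get_pic_link(html):
            crawl_album(pic_tag)
        time.sleep(2.0)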

Reposted from blog.csdn.net/qq_39001049/article/details/81429590