Downloading all the images from a Tieba forum

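This is the first little tool on my blog, haha. I think it is reasonably useful: it downloads every image posted in the threads of a Baidu Tieba forum.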

Three functions are involved:

First, grab all the images in a single thread:

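import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from urllib.request import urlretrieve

def collect_in_page(url, path='', name='picture', start=1):
    # Spoofed request header so the request looks like a desktop browser
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
    }
    try:
        r = requests.get(url, headers=header, timeout=3)
    except requests.RequestException:
        print('open web page <', url, ' > failed or timed out! i will quit.')
        return start
    count = start  # image counter, also used in the file name
    soup = BeautifulSoup(r.text, 'html.parser')  # parse once, reuse for images and links
    for img in soup.find_all('img', class_='BDE_Image'):  # images posted in the current page, saved as .jpg
        try:
            link = img.get('src')
            urlretrieve(link, os.path.join(path, '{}_{}.jpg'.format(name, count)))
            print(link, ' downloaded.')
        except Exception:
            print('picture <', img, '> error during download.')
        else:
            count += 1
    for a in soup.find_all('a'):  # pagination, handled recursively
        if a.get_text() == '下一页':  # the "next page" link
            newurl = urljoin(url, str(a.get('href')))
            print('goto :', newurl)
            return collect_in_page(newurl, path, name, count)
        elif a.get_text() == '尾页':  # the "last page" link
            # We are on the last page when that link points at the current URL
            if urljoin(url, str(a.get('href'))) == url:
                print('end to last page :', url)
                return count

    print('page :', url, ' collect finish!')
    return count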
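
The recursive page-turning above adds one stack frame per page, so a very long thread could in principle run into Python's default recursion limit. As a rough sketch of the same idea written as a loop, assuming a hypothetical `next_page` callable (not part of the original code) that returns the next URL or `None`:

```python
def walk_pages(first_url, next_page, max_pages=10000):
    """Visit pages one after another without recursion.

    next_page is a hypothetical callable: it takes the current URL and
    returns the URL of the following page, or None on the last page.
    """
    visited = []
    url = first_url
    while url is not None and len(visited) < max_pages:
        visited.append(url)
        url = next_page(url)
    return visited


# Tiny stand-in for a three-page thread, purely for illustration.
fake_thread = {'p1': 'p2', 'p2': 'p3', 'p3': None}
print(walk_pages('p1', fake_thread.get))
```

In the real function, the body of the loop would download the images of `url` and `next_page` would look for the 下一页 link.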

Second, send a request with parameters to a forum and get the links to all the threads on the current list page:

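def enter(keyword, pgn):
    params = {
        'ie': 'utf-8',
        'kw': keyword,
        'pn': 50 * pgn,  # each list page holds 50 thread links
    }
    # kw is the keyword, i.e. the forum name you would type into the Baidu Tieba search box
    host = 'https://tieba.baidu.com/f'
    url = 'https://tieba.baidu.com/'
    r = requests.get(host, params=params, timeout=3)
    r.encoding = 'utf-8'
    links = []
    # thread-title links carry the class j_th_tit
    for a in BeautifulSoup(r.text, 'html.parser').find_all('a', class_='j_th_tit'):
        links.append(urljoin(url, str(a.get('href'))))
    # A generator would suit large amounts of data better, but Tieba does not seem to have that much...
    return links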
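
For reference, `requests` turns the `params` dict into a query string appended to the list URL. Here is a small sketch of roughly what that URL looks like, using only the standard library. `build_list_url` is a made-up helper name, and `requests` itself may differ in minor encoding details:

```python
from urllib.parse import urlencode


def build_list_url(keyword, pgn):
    """Build the thread-list URL for list page number pgn (0-based)."""
    params = {
        'ie': 'utf-8',
        'kw': keyword,    # forum name, percent-encoded as UTF-8
        'pn': 50 * pgn,   # offset: 50 threads per list page
    }
    return 'https://tieba.baidu.com/f?' + urlencode(params)


print(build_list_url('李毅', 2))
# https://tieba.baidu.com/f?ie=utf-8&kw=%E6%9D%8E%E6%AF%85&pn=100
```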

The third one is easy: it just makes the two functions above work together:

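def auto_collect(ba_name, path='', picname='tieba', start=1, limit=100):
    # start is the first sequence number used in file names; limit is the maximum
    # number of images to fetch, but only as a rough bound, not an exact value
    for pgn in range(0, 100):  # fetch the links of one list page at a time, at most 100 pages
        try:
            links = enter(ba_name, pgn)
        except requests.RequestException:
            print('failed to open list page', pgn)
            return
        if not links:
            print('end of tieba.')
            return
        for link in links:
            start = collect_in_page(link, path, picname, start)
            if start > limit:
                print('out of limit you collected.')
                return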

Each link is visited once: grab its images and move on...
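
Because `enter` returns a whole list per page, the crawl could also be expressed as a generator that hands out thread links lazily and stops at the first empty page. A minimal sketch, where `iter_thread_links` and its `fetch_page` argument are hypothetical names (in the real script `fetch_page` would be something like `lambda pgn: enter(ba_name, pgn)`):

```python
def iter_thread_links(fetch_page, max_pages=100):
    """Yield thread links page by page, stopping at the first empty page."""
    for pgn in range(max_pages):
        links = fetch_page(pgn)
        if not links:  # an empty page is taken to mean the end of the forum
            return
        for link in links:
            yield link


# Stand-in for the network call so the sketch runs offline.
fake_pages = [['t1', 't2'], ['t3'], []]
print(list(iter_thread_links(lambda pgn: fake_pages[pgn])))
```

The caller can then stop iterating as soon as the image limit is reached, and no further list pages are requested.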

Usage looks roughly like this:

auto_collect('李毅','F://pic//','mypic',1,800)

That goes off and downloads images from the 李毅 forum. Note that the folder on drive F has to exist beforehand.
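
If you would rather not create the folder by hand, the standard library can do it before the download starts. A small sketch (`ensure_dir` is a made-up helper, not part of the original script):

```python
import os


def ensure_dir(path):
    """Create path (and any missing parent folders) if it does not exist yet."""
    if path:  # an empty path means "current directory", nothing to create
        os.makedirs(path, exist_ok=True)
    return path
```

Calling `ensure_dir('F://pic//')` right before `auto_collect(...)` should then be enough.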

PS: As of 2018/4/28 this program still works.
