sumafan: A small Python exercise on multithreaded web scraping (with solution)

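Scrape the URLs of the news article pages linked from the home page of https://www.cnbeta.com/.

Example of an article URL to collect: https://hot.cnbeta.com/articles/game/825125

Put the collected article URLs into a list, then use multiple threads to write the content of every article page to a file. Each file is named after the news ID; for the example above, that is 825125.html.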
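
The naming rule can be sketched as a small helper. This is only an illustration: the function names `news_id_from_url` and `filename_for` are hypothetical, and the pattern assumes the news ID is the last run of digits in the URL path, optionally followed by a `.htm` or `.html` suffix.

```python
import re


def news_id_from_url(url):
    # Assumption: the ID is the final numeric path segment of the URL,
    # optionally followed by ".htm" or ".html".
    match = re.search(r"/(\d+)(?:\.html?)?$", url)
    return match.group(1) if match else None


def filename_for(url):
    # Build "<news id>.html", or return None when the URL carries no ID.
    news_id = news_id_from_url(url)
    return news_id + ".html" if news_id else None


print(filename_for("https://hot.cnbeta.com/articles/game/825125"))  # 825125.html
```

Extracting the ID with a pattern rather than a fixed slice keeps the file name correct when the URL length or suffix changes.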


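I had not used threads in a long time, so my mind went blank when I first saw this exercise (stay calm, don't panic!). I was honestly rattled, ha ha.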

import os
import re
import threading

import requests
from lxml import etree

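# Scraper class
class News(object):
    # Fetch a page and return its HTML as text
    def get_content(self, url):
        # A timeout keeps one stalled request from hanging its thread forever
        r = requests.get(url, timeout=10)
        html = r.content.decode("utf-8")
        # with open('./news.html','w',encoding='utf-8') as f:
        #     f.write(html)
        return html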

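    # Extract the article URLs and their news IDs from the home page
    def get_data(self, html):
        # Parse the HTML into an element tree
        tree = etree.HTML(html)
        detailurl = tree.xpath("//div[@class='item']/dl/a/@href")
        # The news ID is the trailing run of digits in the URL, with or
        # without a .htm/.html suffix. A fixed slice such as [-10:-4] only
        # works for one URL shape and breaks on the example URL above.
        id_pattern = re.compile(r'/(\d+)(?:\.html?)?$')
        urllist = []
        urllistname = []
        for url in detailurl:
            # Skip relative and protocol-relative links
            if not url.startswith('http'):
                continue
            found = id_pattern.search(url)
            if found:
                urllist.append(url)
                urllistname.append(found.group(1))
        return urllistname, urllist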

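    # Download one article page and save it as ./text/<news id>.html
    def write_data(self, name, url):
        print(name, url)
        html = self.get_content(url)
        # Create the output directory if it does not exist yet
        os.makedirs("./text", exist_ok=True)
        with open("./text/" + name + '.html', 'w', encoding='utf-8') as f:
            f.write(html)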

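if __name__ == '__main__':
    news = News()
    html = news.get_content("https://www.cnbeta.com/")
    urllistname, urllist = news.get_data(html)
    # Start one thread per article page. The threads are not daemons,
    # so none of them is killed when the main thread finishes.
    threads = []
    for name, url in zip(urllistname, urllist):
        t = threading.Thread(target=news.write_data, args=(name, url))
        t.start()
        threads.append(t)
    # Wait for every thread, not just the last one started
    for t in threads:
        t.join()

    print('ok')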

Result: opening any of the saved files shows the downloaded article page.
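
As an alternative to starting one thread per page, the same job can be sketched with a bounded thread pool. This is not the solution above, just a variant under stated assumptions: `fetch_html`, `save_page`, and `save_all` are hypothetical names, and `urllib.request` from the standard library stands in for `requests` only to keep the sketch dependency-free.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch_html(url):
    # Stand-in for News.get_content, using the standard library.
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8")


def save_page(name, url, out_dir, fetch=fetch_html):
    # Write one page to <out_dir>/<name>.html and return the file path.
    path = os.path.join(out_dir, name + ".html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(fetch(url))
    return path


def save_all(names, urls, out_dir, fetch=fetch_html, max_workers=8):
    os.makedirs(out_dir, exist_ok=True)
    # The pool caps the number of concurrent downloads at max_workers.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(save_page, n, u, out_dir, fetch)
                   for n, u in zip(names, urls)]
        # result() blocks until the task is done and re-raises its exception.
        return [f.result() for f in futures]
```

Calling `save_all(urllistname, urllist, "./text")` would take the place of the thread loop. The pool limits how many requests hit the site at once, and a failed download surfaces as an exception instead of disappearing inside a thread.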
