[Python Web Scraping] Requesting pages with Requests and saving the results to an Excel file

  • Scraping job listings from 51job (前程无忧)

    This time we use a simple spider to show how to save the extracted information to an Excel file; a minimal openpyxl sketch follows below. (PS: install the openpyxl module first, otherwise close this page now and search for how to install it.)
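
For reference, here is a minimal, standalone sketch of the openpyxl calls the spider relies on (Workbook, the active worksheet, append, save); the file name demo.xlsx is only an illustration:

from openpyxl import Workbook

wb = Workbook()                       # create a new workbook in memory
ws = wb.active                        # the default (active) worksheet
ws.append(['job', 'company'])         # each append() writes one row
ws.append(['crawler engineer', 'Example Co.'])
wb.save('demo.xlsx')                  # write the workbook to disk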

The 51job site was chosen as the example because the author has recently been browsing job postings, so it seemed natural to write a spider to make it easier to find positions that match specific requirements. To keep the spider simple, the job category and region are selected on the 51job page beforehand. Here the filters are Hangzhou + Internet/E-commerce, and the corresponding URL is the 51job listing page used as self.url in the code below; a quick check of the URL pattern is sketched right after this paragraph.
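
Before writing the full spider it is worth confirming that the page number really is the only part of the URL that changes between result pages. A minimal sketch (the placeholder name 页数 simply mirrors the code below):

base = 'https://search.51job.com/list/080200,000000,0000,32,9,99,%2B,2,{页数}.html'
for i in range(1, 4):
    print(base.format(页数=i))  # prints the URLs of the first three result pages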

import requests
from lxml import etree

from openpyxl import Workbook
class Job():
    def __init__(self):
        self.url = 'https://search.51job.com/list/080200,000000,0000,32,9,99,%2B,2,{页数}.html'
        self.header = {
            "User-Agent": "User-Agent:Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
            "Connection": "keep - alive",
        }
        self.wb = Workbook()#   class实例化
        self.ws = self.wb.active#   激活工具表
        self.ws.append(['job', 'company', 'locate', 'salary', 'time'])# 添加对应的表头
    #   Build the list of page URLs to visit
    def geturl(self):
        # The page number is the only part of the URL that changes between pages
        url = [self.url.format(页数=i) for i in range(1, 9)]
        return url
    #   Request a page and return the decoded HTML
    def parse_url(self, url):
        response = requests.get(url, headers=self.header, timeout=5)
        # 51job pages are encoded in gbk
        return response.content.decode('gbk', 'ignore')
    #   Extract the fields we care about from the page
    def get_list(self, html_str):
        html = etree.HTML(html_str)
        connect_list = []
        lists = html.xpath("//div[@class='el']")
        for node in lists:
            item = {}
            item['job'] = ''.join(node.xpath("./p/span/a/@title"))
            item['company'] = ''.join(node.xpath("./span[@class='t2']/a/@title"))
            item['locate'] = ''.join(node.xpath("./span[@class='t3']/text()"))
            item['salary'] = ''.join(node.xpath("./span[@class='t4']/text()"))
            item['time'] = ''.join(node.xpath("./span[@class='t5']/text()"))
            #   ''.join(...) turns each XPath result (a list) into a string so it can be written to Excel
            connect_list.append(item)
        return connect_list
    #   Save the extracted records into the worksheet (written out later as .xlsx)
    def save_list(self, connects):
        for connect in connects:
            self.ws.append([connect['job'], connect['company'], connect['locate'], connect['salary'], connect['time']])
        print('Saved one page of job listings')
    #   Run the spider
    def run(self):
        url_list = self.geturl()
        for url in url_list:
            html_url = self.parse_url(url)
            connects = self.get_list(html_url)
            #   The first four items extracted on each page are blank, so skip them
            self.save_list(connects[4:])
        self.wb.save('job.xlsx')
        #   The workbook is saved here, after the loop, so every page ends up in the file;
        #   calling wb.save() inside save_list would rewrite job.xlsx after every single page
        #   instead of just appending rows to the in-memory worksheet and saving once.

if __name__ == '__main__':
    spider = Job()
    spider.run()
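
Once the spider has run, job.xlsx can be opened in Excel directly, or read back with openpyxl to verify the saved rows; a minimal sketch (iter_rows with values_only requires a reasonably recent openpyxl):

from openpyxl import load_workbook

wb = load_workbook('job.xlsx')
ws = wb.active
for row in ws.iter_rows(min_row=2, max_row=6, values_only=True):
    print(row)  # show the first few saved records (row 1 is the header)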


Note: feedback and suggestions are welcome; let's learn from each other~


Reposted from blog.csdn.net/qq_39884947/article/details/82758666