XPath practice: scraping the Qidian Chinese Network (起点中文网)

import xlwt
import requests
from lxml import etree
import time


# Initialize the list that will hold the scraped data
all_info_list = []


# Define the function that scrapes one listing page
def get_info(url):
    html = requests.get(url)
    selector = etree.HTML(html.text)

    # Locate the <li> node for each novel, then loop over them
    infos = selector.xpath('//ul[@class="all-img-list cf"]/li')

    for info in infos:
        title = info.xpath('div[2]/h4/a/text()')[0]
        author = info.xpath('div[2]/p[1]/a[1]/text()')[0]
        style_1 = info.xpath('div[2]/p[1]/a[2]/text()')[0]
        style_2 = info.xpath('div[2]/p[1]/a[3]/text()')[0]
        style = style_1 + '·' + style_2
        complete = info.xpath('div[2]/p[1]/span/text()')[0]
        introduce = info.xpath('div[2]/p[2]/text()')[0].strip()
        word = info.xpath('div[2]/p[3]/span/text()')[0].strip('万字')
        info_list = [title, author, style, complete, introduce, word]
        # Append the record to the shared list
        all_info_list.append(info_list)
    # Sleep for 1 second after each page
    time.sleep(1)
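
As an aside (not part of the original script), the relative XPath expressions in get_info() can be sanity-checked against a small, hypothetical HTML fragment that mimics the Qidian listing layout. The fragment and the printed values below are assumptions for illustration only:

from lxml import etree

# Hypothetical markup mirroring the structure the script expects
sample = etree.HTML('''
<ul class="all-img-list cf">
  <li>
    <div class="book-img-box"></div>
    <div class="book-mid-info">
      <h4><a>Example Title</a></h4>
      <p><a>Author</a><a>Fantasy</a><a>Epic</a><span>连载</span></p>
      <p>  A short introduction.  </p>
      <p><span>12.34万字</span></p>
    </div>
  </li>
</ul>''')

li = sample.xpath('//ul[@class="all-img-list cf"]/li')[0]
print(li.xpath('div[2]/h4/a/text()')[0])        # Example Title
print(li.xpath('div[2]/p[1]/a[1]/text()')[0])   # Author
print(li.xpath('div[2]/p[3]/span/text()')[0])   # 12.34万字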


# Program entry point
if __name__ == '__main__':

    # Listing pages 1-4
    urls = ['http://a.qidian.com/?page={}'.format(str(i)) for i in range(1, 5)]
    # Scrape every page
    for url in urls:
        get_info(url)

    # Define the header row
    header = ['title', 'author', 'style', 'complete', 'introduce', 'word']
    # Create the workbook
    book = xlwt.Workbook(encoding='utf-8')
    # Create the worksheet
    sheet = book.add_sheet('Sheet1')
    for h in range(len(header)):
        # Write the header row
        sheet.write(0, h, header[h])

    i = 1  # row index
    for info_list in all_info_list:
        j = 0  # column index
        # Write one scraped record per row
        for data in info_list:
            sheet.write(i, j, data)
            j += 1
        i += 1
    # Save the workbook
    book.save('D://pytext/xiaoshuo.xls')
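
To spot-check the result, the workbook can be read back with xlrd. This read-back is an assumption for illustration; it is not part of the original script and presumes xlrd is installed and the save path above exists:

import xlrd

book = xlrd.open_workbook('D://pytext/xiaoshuo.xls')
sheet = book.sheet_by_name('Sheet1')
print(sheet.row_values(0))   # the header row
print(sheet.nrows - 1)       # number of novels scraped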


Reposted from www.cnblogs.com/yang16/p/13406491.html