Chapter 4.2: Simple two-level page crawling and writing Word files with docx

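I originally crawled this site to help my son with his studies. Classical Chinese matters, and I suspect every high school student suffers over it: the old texts are hard to make sense of, so it is better to start memorizing them early. Thanks to the site's contributors, we can write a crawler and pull the texts straight down instead of typing them out character by character or buying a thick book.
The crawler code is simple. A few notes:
The parser='html' argument is usually unnecessary, but it is needed when the page declares xmlns="http://www.w3.org/1999/xhtm: otherwise pyquery cannot select nodes by tag name. Thanks to the post "python中pyquery无法获取标签名的dom节点" (pyquery in Python cannot get DOM nodes by tag name); this problem had puzzled me for a long time.
The first request fetches the links from the list page; each link is then followed to its detail page.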
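
The namespace problem behind the parser='html' note can be illustrated without pyquery. The following is a minimal sketch that uses the standard library's xml.etree.ElementTree instead of pyquery, with a made-up sample page and the standard XHTML namespace URI as assumptions. When a page with a default namespace is parsed as XML, every tag name becomes namespace-qualified, so a plain tag name such as h1 matches nothing. An HTML parser ignores the declaration, which is presumably why parser='html' helps in pyquery.

```python
import xml.etree.ElementTree as ET

# Assumption: the standard XHTML namespace URI; the markup is a made-up sample.
XHTML_NS = "http://www.w3.org/1999/xhtml"
page = (
    '<html xmlns="' + XHTML_NS + '">'
    "<body><h1>Sample title</h1></body>"
    "</html>"
)

root = ET.fromstring(page)

# A plain tag name does not match: the element's real name is "{namespace}h1".
plain = root.find(".//h1")

# The namespace-qualified name does match.
qualified = root.find(".//{" + XHTML_NS + "}h1")

print(plain)           # None
print(qualified.text)  # Sample title
```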

import scrapy
from pyquery import PyQuery as pq

from life_example.items import ArticleItem


class GuWenSpider(scrapy.Spider):
    name = "guwen"
    start_urls = [
        "https://so.gushiwen.org/wenyan/guanzhi.aspx",
    ]

    def parse(self, response):
        # First level: the list page. parser='html' avoids the XHTML namespace problem.
        soup = pq(response.text, parser='html')
        links = soup('.sons .typecont span > a')
        for link in links:
            href = pq(link).attr('href')
            if not href:
                continue
            # Second level: follow each article link; urljoin also resolves relative hrefs.
            yield scrapy.Request(response.urljoin(href), callback=self.parse_link)

    def parse_link(self, response):
        # Detail page: extract title, author and body text into an item.
        soup = pq(response.text, parser='html')
        item = ArticleItem()
        item['title'] = soup('div.cont h1').text()
        item['author'] = soup('div.cont p.source').eq(0).text()
        item['content'] = soup('div.cont .contson').eq(0).text()
        yield item
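
scrapy.Request needs an absolute URL, so it may be worth checking what happens if the href values on the list page are relative. The sketch below uses only urllib.parse.urljoin from the standard library; the two href values are hypothetical and are not taken from the site. Scrapy's response.urljoin(href) does the same kind of resolution against the URL of the current response.

```python
from urllib.parse import urljoin

# The list page from start_urls.
list_page = "https://so.gushiwen.org/wenyan/guanzhi.aspx"

# Hypothetical href values, for illustration only.
relative_href = "/guwen/example.aspx"
absolute_href = "https://so.gushiwen.org/guwen/example.aspx"

# A relative href is resolved against the page it was found on.
resolved_relative = urljoin(list_page, relative_href)

# An href that is already absolute is returned unchanged.
resolved_absolute = urljoin(list_page, absolute_href)

print(resolved_relative)
print(resolved_absolute)
```

Resolving every link this way means the spider works whether the site emits relative or absolute links.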

A pipeline writes each article into a Word file. No template is used here, because a template can only generate one document. See: python-docx tutorial, python-docx-template.

import os
from life_example.items import ArticleItem
from docx import Document
from docx.shared import Pt
from docx.oxml.ns import qn
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.enum.section import WD_SECTION
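class ArticlePipeline(object):

    def process_item(self, item, spider):
        # Only ArticleItem objects are written to the Word file.
        if isinstance(item, ArticleItem):
            self.rw_file(item)
        # Return the item so that any later pipeline still receives it.
        return item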

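    def rw_file(self, item):
        doc_name = 'G:\\dzmfile\\pythonwork\\life_example\\life_example\\files\\文章\\guwen.docx'
        # Append to the existing document if there is one, otherwise start a new one.
        if os.path.exists(doc_name):
            doc = Document(doc_name)
        else:
            doc = Document()
        # Use the 宋体 (SimSun) font in the default style, including for East Asian text.
        doc.styles['Normal'].font.name = '宋体'
        doc.styles['Normal']._element.rPr.rFonts.set(qn('w:eastAsia'), '宋体')
        # Centered level-1 heading with the article title.
        h = doc.add_heading(item['title'], 1)
        h.alignment = WD_ALIGN_PARAGRAPH.CENTER
        # One Word paragraph per line of the article, indented with a tab.
        paras = item['content'].split('\n')
        for para in paras:
            p = doc.add_paragraph()
            run = p.add_run('\t' + para)
            run.font.size = Pt(20)
        # doc.add_section(start_type=WD_SECTION.CONTINUOUS)

        # Each article ends with a page break; the file is saved after every item.
        doc.add_page_break()
        doc.save(doc_name)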
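
Scrapy only calls a pipeline that is enabled in the project settings. A minimal sketch of that setting follows, assuming ArticlePipeline is defined in life_example/pipelines.py (the file name is not given above, so the module path is a guess). The number is the pipeline's order, where lower values run first.

```python
# settings.py (fragment)
# Assumption: ArticlePipeline is defined in life_example/pipelines.py.
ITEM_PIPELINES = {
    "life_example.pipelines.ArticlePipeline": 300,
}
```

With the pipeline enabled, running scrapy crawl guwen from the project directory should start the spider and pass each article to the pipeline.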
