Copyright notice: please credit the source when reposting. https://blog.csdn.net/qq_31468321/article/details/83999755
Foreword
Once we all had dreams: of literature, of love, of a journey across the world. Now we drink deep into the night, and every clink of our glasses is the sound of those dreams breaking.
Once, poetry left me spellbound; now the passion has dimmed and the storms no longer rise. Alas! Let me gather a page of verse to console these fleeting years.
Volume One
A few days ago I stumbled on the Gushiwen site (a classical Chinese poetry archive) and treasured the find. On a whim, I crawled every article under its poetry column. This post records how.
Volume Two
The whole crawl went like a burglary: methodical and quick. Here is the plan, step by step.
- First, collect the URL of every tag under the poetry column; then enter each tag page and collect the URLs of all poem detail pages.
- From each detail page, scrape the fields of interest: title, author, and body text.
- Save the scraped data to the database.
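The link-collection step can be sketched on a static snippet first. The markup below is a hypothetical stand-in mimicking the site's `sons`/`cont` structure, not a copy of the real page:

```python
from lxml import etree

# Hypothetical HTML imitating the tag-list structure the crawler targets
html_text = """
<div class="sons">
  <div class="cont">
    <a href="/gushi/tangshi.aspx">唐诗三百首</a>
    <a href="/gushi/songci.aspx">宋词三百首</a>
  </div>
</div>
"""

html = etree.HTML(html_text)
# Same XPath pattern the crawler uses: every <a> under div.sons > div.cont
links = html.xpath(".//div[@class='sons']/div[@class='cont']/a")
hrefs = [a.xpath('./@href')[0] for a in links]
print(hrefs)  # -> ['/gushi/tangshi.aspx', '/gushi/songci.aspx']
```

On the live site the same XPath runs against `res.text` from `requests.get` instead of a literal string.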
Volume Three
Import the required packages:

```python
# HTTP requests
import requests
# HTML parsing
from lxml import etree
# Database helper class, implemented in a separate file (see Volume Four)
from write_database import Write_databases
```
The crawler class and its constructor:

```python
class GuShiWen():
    def __init__(self):
        self.main_url = 'https://www.gushiwen.org/'
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
        }
        self.bash_url = 'https://so.gushiwen.org/'
        # Initialize the database helper
        self.database = Write_databases()
```
First, collect the URL of every tag in the poetry column:

```python
    def get_all_shiw_urls(self):
        res = requests.get(self.main_url, headers=self.headers)
        html = etree.HTML(res.text)
        # The first div.sons holds the tag links; drop the last two, which are not tags
        sons_div_lists = html.xpath(".//div[@class='sons'][1]/div[@class='cont']/a")[:-2]
        for a_info in sons_div_lists:
            a_href = a_info.xpath('./@href')[0]
            a_text = a_info.xpath('./text()')  # tag name, kept for debugging
            self.get_all_content_urls(a_href)
```
Collect every poem URL inside a tag page and build usable absolute URLs:

```python
    def get_all_content_urls(self, urls):
        text_html = requests.get(urls, headers=self.headers)
        html = etree.HTML(text_html.text)
        text_title = html.xpath('.//div[@class="title"][1]/h1/text()')
        text_dev = html.xpath('.//div[@class="sons"][1]/div')
        for item in text_dev:
            text_span = item.xpath('./span')
            for span_item in text_span:
                try:
                    text_a_href = span_item.xpath('./a/@href')[0]
                    text_a_text = span_item.xpath('.//text()')
                except IndexError:
                    # Some spans carry no link; skip them
                    continue
                self.get_poetry(self.bash_url + text_a_href)
```
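One fragile spot above: `self.bash_url + text_a_href` only works while every href on the page is site-relative. The standard library's `urllib.parse.urljoin` handles relative and absolute hrefs alike, so a sketch like this is safer (the URLs below mirror the ones used in the class; the `shiwenv_…` path is an illustrative placeholder):

```python
from urllib.parse import urljoin

base_url = 'https://so.gushiwen.org/'

# A relative href is appended, just as plain concatenation would do
print(urljoin(base_url, 'shiwenv_abc123.aspx'))
# -> https://so.gushiwen.org/shiwenv_abc123.aspx

# An absolute href passes through instead of producing a broken double URL
print(urljoin(base_url, 'https://www.gushiwen.org/gushi/tangshi.aspx'))
# -> https://www.gushiwen.org/gushi/tangshi.aspx
```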
Scrape each poem's details and write them to the database:

```python
    def get_poetry(self, url):
        poetry_html = requests.get(url, headers=self.headers)
        html = etree.HTML(poetry_html.text)
        poetry_div = html.xpath('.//div[@class="sons"]/div')[0]
        poetry_title = poetry_div.xpath('./h1/text()')[0]
        # The <p> holds the author line; the second child <div> holds the body
        poetry_author = " ".join(poetry_div.xpath('./p//text()'))
        poetry_cont = " ".join(poetry_div.xpath('./div[2]//text()'))
        print("=" * 57)
        print(poetry_title)
        print(poetry_author)
        print(poetry_cont)
        self.write_database(poetry_title, poetry_author, poetry_cont)

    def write_database(self, title, author, cont):
        self.database.insert_data(title, author, cont)
```
Finally, the main function, with an entry-point guard so the crawl actually starts when the script is run:

```python
def main():
    gusw = GuShiWen()
    gusw.get_all_shiw_urls()

if __name__ == '__main__':
    main()
```
Volume Four
The database class. It connects to the database, writes scraped records, reads a random poem back out, and closes the connection.
```python
import pymysql
import random

class Write_databases():
    def __init__(self):
        self.db = pymysql.connect(
            host='127.0.0.1',
            user='root',
            password='root',
            database='gushiw',
            port=3306
        )
        self.cursor = self.db.cursor()

    def insert_data(self, title, author, cont):
        sql = '''
        insert into gushiw_table(id, poetry_title, poetry_author, poetry_cont)
        values(null, %s, %s, %s)
        '''
        self.cursor.execute(sql, (title, author, cont))
        self.db.commit()

    def read_data(self):
        # 127-4017 is simply the id range that ended up in my table
        id = random.randint(127, 4017)
        print(id)
        sql = 'select * from gushiw_table where id = %s'
        self.cursor.execute(sql, (id,))
        value = self.cursor.fetchall()
        print(value)
        title = value[0][1]
        author = value[0][2]
        cont = value[0][3]
        print(title, author, cont)

    def close_databases(self):
        self.db.close()
```
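The class assumes a table named `gushiw_table` already exists. Its exact schema isn't shown in the post; inferred from the INSERT statement above, it could be created with something like this (the column types are my guess, not the original definition):

```sql
CREATE TABLE gushiw_table (
    id INT PRIMARY KEY AUTO_INCREMENT,
    poetry_title VARCHAR(255),
    poetry_author VARCHAR(255),
    poetry_cont TEXT
);
```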
To be continued: next up is a small program that reads a random poem from the database and displays it.