Copyright notice: please credit the source when reposting. https://blog.csdn.net/qq_31468321/article/details/83999755
Foreword
Once we all had dreams: of literature, of love, of a journey across the world. Now we drink deep into the night, and every clink of our glasses is the sound of those dreams breaking.
Once, poetry left me spellbound; now the passion has dimmed and the storms no longer rise. Alas! Let me gather a page of verse to console these fleeting years.
Volume One
A few days ago I stumbled on the Gushiwen site (a classical Chinese poetry archive) and treasured the find. On a whim, I crawled every article under its poetry column. This post records how.
Volume Two
The whole crawl went like a burglary: methodical and quick. Here is the plan, step by step.
- First, collect the URL of every tag under the poetry column; then enter each tag page and collect the URLs of all poem detail pages.
- From each detail page, scrape the fields of interest: title, author, and body text.
- Save the scraped data to the database.
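The link-collection step can be sketched on a static snippet first. The markup below is a hypothetical stand-in mimicking the site's `sons`/`cont` structure, not a copy of the real page:

```python
from lxml import etree

# Hypothetical HTML imitating the tag-list structure the crawler targets
html_text = """
<div class="sons">
  <div class="cont">
    <a href="/gushi/tangshi.aspx">唐诗三百首</a>
    <a href="/gushi/songci.aspx">宋词三百首</a>
  </div>
</div>
"""

html = etree.HTML(html_text)
# Same XPath pattern the crawler uses: every <a> under div.sons > div.cont
links = html.xpath(".//div[@class='sons']/div[@class='cont']/a")
hrefs = [a.xpath('./@href')[0] for a in links]
print(hrefs)  # -> ['/gushi/tangshi.aspx', '/gushi/songci.aspx']
```

On the live site the same XPath runs against `res.text` from `requests.get` instead of a literal string.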
Volume Three
Import the required packages:

```python
# HTTP requests
import requests
# HTML parsing
from lxml import etree
# Database helper class, implemented in a separate file (see Volume Four)
from write_database import Write_databases
```
The crawler class and its constructor:

```python
class GuShiWen():
    def __init__(self):
        self.main_url = 'https://www.gushiwen.org/'
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
        }
        self.bash_url = 'https://so.gushiwen.org/'
        # Initialize the database helper
        self.database = Write_databases()
```
First, collect the URL of every tag in the poetry column:

```python
    def get_all_shiw_urls(self):
        res = requests.get(self.main_url, headers=self.headers)
        html = etree.HTML(res.text)
        # The first div.sons holds the tag links; drop the last two, which are not tags
        sons_div_lists = html.xpath(".//div[@class='sons'][1]/div[@class='cont']/a")[:-2]
        for a_info in sons_div_lists:
            a_href = a_info.xpath('./@href')[0]
            a_text = a_info.xpath('./text()')  # tag name, kept for debugging
            self.get_all_content_urls(a_href)
```
Collect every poem URL inside a tag page and build usable absolute URLs:

```python
    def get_all_content_urls(self, urls):
        text_html = requests.get(urls, headers=self.headers)
        html = etree.HTML(text_html.text)
        text_title = html.xpath('.//div[@class="title"][1]/h1/text()')
        text_dev = html.xpath('.//div[@class="sons"][1]/div')
        for item in text_dev:
            text_span = item.xpath('./span')
            for span_item in text_span:
                try:
                    text_a_href = span_item.xpath('./a/@href')[0]
                    text_a_text = span_item.xpath('.//text()')
                except IndexError:
                    # Some spans carry no link; skip them
                    continue
                self.get_poetry(self.bash_url + text_a_href)
```
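One fragile spot above: `self.bash_url + text_a_href` only works while every href on the page is site-relative. The standard library's `urllib.parse.urljoin` handles relative and absolute hrefs alike, so a sketch like this is safer (the URLs below mirror the ones used in the class; the `shiwenv_…` path is an illustrative placeholder):

```python
from urllib.parse import urljoin

base_url = 'https://so.gushiwen.org/'

# A relative href is appended, just as plain concatenation would do
print(urljoin(base_url, 'shiwenv_abc123.aspx'))
# -> https://so.gushiwen.org/shiwenv_abc123.aspx

# An absolute href passes through instead of producing a broken double URL
print(urljoin(base_url, 'https://www.gushiwen.org/gushi/tangshi.aspx'))
# -> https://www.gushiwen.org/gushi/tangshi.aspx
```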
Scrape each poem's details and write them to the database:

```python
    def get_poetry(self, url):
        poetry_html = requests.get(url, headers=self.headers)
        html = etree.HTML(poetry_html.text)
        poetry_div = html.xpath('.//div[@class="sons"]/div')[0]
        poetry_title = poetry_div.xpath('./h1/text()')[0]
        # The <p> holds the author line; the second child <div> holds the body
        poetry_author = " ".join(poetry_div.xpath('./p//text()'))
        poetry_cont = " ".join(poetry_div.xpath('./div[2]//text()'))
        print("=" * 57)
        print(poetry_title)
        print(poetry_author)
        print(poetry_cont)
        self.write_database(poetry_title, poetry_author, poetry_cont)

    def write_database(self, title, author, cont):
        self.database.insert_data(title, author, cont)
```
Finally, the main function, with an entry-point guard so the crawl actually starts when the script is run:

```python
def main():
    gusw = GuShiWen()
    gusw.get_all_shiw_urls()

if __name__ == '__main__':
    main()
```
Volume Four
The database class. It connects to the database, writes scraped records, reads a random poem back out, and closes the connection.
```python
import pymysql
import random

class Write_databases():
    def __init__(self):
        self.db = pymysql.connect(
            host='127.0.0.1',
            user='root',
            password='root',
            database='gushiw',
            port=3306
        )
        self.cursor = self.db.cursor()

    def insert_data(self, title, author, cont):
        sql = '''
        insert into gushiw_table(id, poetry_title, poetry_author, poetry_cont)
        values(null, %s, %s, %s)
        '''
        self.cursor.execute(sql, (title, author, cont))
        self.db.commit()

    def read_data(self):
        # 127-4017 is simply the id range that ended up in my table
        id = random.randint(127, 4017)
        print(id)
        sql = 'select * from gushiw_table where id = %s'
        self.cursor.execute(sql, (id,))
        value = self.cursor.fetchall()
        print(value)
        title = value[0][1]
        author = value[0][2]
        cont = value[0][3]
        print(title, author, cont)

    def close_databases(self):
        self.db.close()
```
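The class assumes a table named `gushiw_table` already exists. Its exact schema isn't shown in the post; inferred from the INSERT statement above, it could be created with something like this (the column types are my guess, not the original definition):

```sql
CREATE TABLE gushiw_table (
    id INT PRIMARY KEY AUTO_INCREMENT,
    poetry_title VARCHAR(255),
    poetry_author VARCHAR(255),
    poetry_cont TEXT
);
```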
To be continued: next up is a small program that reads a random poem from the database and displays it.