Scraping a Novel from a Website with Python

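Lately I have noticed quite a few friends reading e-books. Since I have been working with Python scrapers for a while now, I decided to find a novel to practice on, haha.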

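I picked a novel more or less at random on a fiction site. The first step is to check whether the chapter page URLs follow any pattern:
http://book.zongheng.com/chapter/774770/43742964.html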
http://book.zongheng.com/chapter/774770/43764713.html
http://book.zongheng.com/chapter/774770/43790004.html
http://book.zongheng.com/chapter/774770/43801354.html

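These are the URLs of the first four chapters. Unfortunately, there is no obvious pattern from one chapter to the next. I then noticed the table-of-contents link in the upper-left corner. Clicking it leads to http://book.zongheng.com/showchapter/774770.html, and from that page any chapter can be opened directly, so the chapter URLs we are looking for are probably there. Right-click and choose "Inspect Element", and sure enough, there they are.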
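
To make the "no pattern" observation concrete, here is a small sketch that pulls the numeric chapter ID out of each of the four URLs above and looks at the gaps between neighbours. The regular expression and the variable names are my own choices for illustration, not something the site documents.

```python
import re

# The four chapter URLs listed above
chapter_urls = [
    'http://book.zongheng.com/chapter/774770/43742964.html',
    'http://book.zongheng.com/chapter/774770/43764713.html',
    'http://book.zongheng.com/chapter/774770/43790004.html',
    'http://book.zongheng.com/chapter/774770/43801354.html',
]

# Take the last run of digits before ".html" as the chapter ID
chapter_ids = [int(re.search(r'/(\d+)\.html$', u).group(1)) for u in chapter_urls]

# Difference between each chapter ID and the next one
gaps = [b - a for a, b in zip(chapter_ids, chapter_ids[1:])]

print(chapter_ids)
print(gaps)
```

The gaps all differ, so the next chapter URL cannot simply be computed from the current one, which is why the table-of-contents page is the better starting point.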

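At this point victory seemed within reach. What remains is to use requests and BeautifulSoup to locate each link and collect its href, then loop over the URLs and write the chapter text to a txt file. The code is as follows: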

# Import the libraries

import requests
from bs4 import BeautifulSoup

# Fetch the table-of-contents page and look at its source

url = 'http://book.zongheng.com/showchapter/774770.html'
wb_data = requests.get(url)
soup = BeautifulSoup(wb_data.text, 'lxml')

print(soup)

# Select the chapter links we want

urls = soup.select('body > div.container > div:nth-of-type(2) > div.volume-list > div:nth-of-type(2) > ul > li > a')

# Extract the href of each link

pages = []
for link in urls:
    pages.append(link.get('href'))
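
The loop above stores each href exactly as it appears in the page. If some of them turned out to be relative paths, missing, or repeated, requests.get would fail or fetch a chapter twice. Below is a minimal sketch of a cleanup step using only the standard library. The helper name normalize_links and the sample hrefs (including the relative one) are hypothetical; the chapter URLs shown earlier are all absolute.

```python
from urllib.parse import urljoin


def normalize_links(base_url, hrefs):
    """Return absolute URLs in page order, skipping empty values and duplicates."""
    seen = set()
    result = []
    for href in hrefs:
        if not href:  # link.get('href') returns None when the attribute is missing
            continue
        absolute = urljoin(base_url, href.strip())
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


base = 'http://book.zongheng.com/showchapter/774770.html'
# Hypothetical sample: one relative href, one absolute, one missing, one duplicate
sample_hrefs = [
    '/chapter/774770/43742964.html',
    'http://book.zongheng.com/chapter/774770/43764713.html',
    None,
    '/chapter/774770/43742964.html',
]

chapter_urls = normalize_links(base, sample_hrefs)
print(chapter_urls)
```

In the script, this could be applied as pages = normalize_links(url, pages), while url still holds the table-of-contents address.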

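# All the chapter URLs we want are now collected; time to prepare for anti-scraping measures.

import time
import random

headers = {'User-Agent': 'enter your own User-Agent here'}

# requests sends each key/value pair as one cookie, so use your real cookie names and values
cookies = {'cookie': 'enter your own cookies here'}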

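# Loop over every URL and write the novel text to a txt file

for page_url in pages:
    wb_data = requests.get(page_url, headers=headers, cookies=cookies)
    time.sleep(random.randint(5, 9) * .1 + 1)  # pause 1.5 to 1.9 seconds after each chapter
    soup = BeautifulSoup(wb_data.text, 'lxml')
    txts = soup.select('#readerFt > div > div.content > p')  # locate the paragraphs of the chapter
    # utf-8 is set explicitly so the output does not depend on the system default encoding
    with open('求生日记.txt', 'a', encoding='utf-8') as f:
        for txt in txts:
            # '\xa0' (non-breaking space) causes errors when converted to GBK, so strip it
            f.write(txt.get_text().replace(u'\xa0', u'') + '\n')  # one paragraph per line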
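
Two details of the loop are easy to check offline: the range of the pause, and why '\xa0' gets replaced. The sketch below needs no network access. The helper names chapter_delay and gbk_safe and the sample string are made up for illustration.

```python
import random


def chapter_delay():
    """Same expression as in the loop: a pause between 1.5 and 1.9 seconds."""
    return random.randint(5, 9) * 0.1 + 1


# randint(5, 9) can return 5, 6, 7, 8 or 9, so these are all the possible pauses
possible_delays = [round(n * 0.1 + 1, 1) for n in range(5, 10)]


def gbk_safe(text):
    """Return True if the text can be encoded as GBK, False otherwise."""
    try:
        text.encode('gbk')
        return True
    except UnicodeEncodeError:
        return False


raw = '\xa0\xa0第一章'              # sample text with leading non-breaking spaces
cleaned = raw.replace('\xa0', '')   # same replacement as in the loop

print(possible_delays)
print(gbk_safe(raw), gbk_safe(cleaned))
```

If the output file is opened with a UTF-8 encoding instead of GBK, the non-breaking space no longer triggers an encoding error, so the replacement then only serves to tidy up the text.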
