A simple crawler example with the requests and bs4 modules: crawling a novel from Biquge
Section 1: Disclaimer
1. The content crawled in this article is freely downloadable content.
2. These are my own study notes and will not be used commercially.
3. If anything in this article infringes on your rights, please contact me and I will delete the article!
Section 2: The idea
While studying bs4, I looked for crawler examples. For novels, most of what I found saved each crawled chapter to a separate txt file, so I wondered whether I could put all the chapters into a single txt file instead. That is why I wrote this article. (And I get a few novels to read along the way, hehe.)
Section 3: Download link analysis
First open the Biquge website, pick a novel you want to crawl, then right-click, choose Inspect, and look for a pattern.
Looking for the first chapter, I found the link I wanted, but I also noticed several extra nodes above it (circled in blue in the figure). These are the "latest chapters" entries, which are repeated further down the page; they are the links that need to be removed later.
After finding the link, open it.
Inspect again and you will find the chapter text. Its container has the id content, so we can extract the text from there.
To make the novel convenient to read, we should not have to open a separate txt file for each chapter. Instead, we can collect all the content in a single list and then write it out in one go.
See the next step.
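The id-based selection described above can be sketched offline with bs4, using a minimal HTML snippet (hypothetical markup that mimics the chapter page):

```python
import bs4

# A tiny stand-in for a Biquge chapter page (hypothetical markup).
html = '''
<div class="bookname"><h1>Chapter 1</h1></div>
<div id="content">First line.&nbsp;Second line.</div>
'''

soup = bs4.BeautifulSoup(html, 'html.parser')
title = soup.select('.bookname h1')[0].get_text()  # select by class, then tag
body = soup.select('#content')[0].get_text()       # '#content' matches id="content"
print(title)  # Chapter 1
print(body)   # note: &nbsp; comes back as the character '\xa0'
```

The `#content` selector is a CSS id selector, so it finds the same node the browser inspector showed us.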
Section 4: Writing the code
(The code below is commented, including which step solves which of the problems described above.)
1. Import the packages
import requests
import bs4
import os
2. Build the request header
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.116 Safari/537.36'
}
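A detail worth noting: header names must not contain spaces, and requests matches them case-insensitively but will not repair a misspelled key. A small sketch with requests' own `CaseInsensitiveDict` (the structure requests uses to store headers) shows why `'user - agent'` with spaces would silently fail:

```python
from requests.structures import CaseInsensitiveDict

# requests stores headers case-insensitively, so 'User-Agent' and
# 'user-agent' are the same key -- but 'user - agent' (with spaces)
# is a different, invalid header name that servers will not recognize.
headers = CaseInsensitiveDict({'User-Agent': 'Mozilla/5.0'})
print(headers['user-agent'])      # Mozilla/5.0 -- case does not matter
print('user - agent' in headers)  # False -- the spaced key is a different name
```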
3. Create a folder to save the novel
# Create the folder
if not os.path.exists('D:/爬虫--笔趣阁'):
    os.mkdir('D:/爬虫--笔趣阁')
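As a side note, `os.makedirs` with `exist_ok=True` folds the existence check and the creation into one call. A sketch using a temporary directory so it runs on any machine (the folder name is just an example):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, '爬虫--笔趣阁')
    os.makedirs(target, exist_ok=True)  # creates the folder if missing
    os.makedirs(target, exist_ok=True)  # no error when it already exists
    print(os.path.isdir(target))  # True
```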
4. Build a function to get the novel name and chapter link
def get_name_lists(url):  # get the novel name and the list of chapter links
    response = requests.get(url=url, headers=headers)
    html = response.text
    soup = bs4.BeautifulSoup(html, 'html.parser')
    novel_lists = soup.select('#list dd a')  # get the chapter links
    novel_name = soup.select('#info h1')[0].string  # get the novel name
    novel_lists = novel_lists[12:]  # drop the 12 duplicate "latest chapter" links at the top
    return novel_name, novel_lists
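The selection and slicing above can be checked offline against a shortened, hypothetical version of the page structure (here only one duplicate "latest chapter" link instead of twelve, so the slice is `[1:]`):

```python
import bs4

# Hypothetical, shortened page structure: the real page repeats the 12
# newest chapters at the top of #list, hence novel_lists[12:] in the crawler.
html = '''
<div id="info"><h1>示例小说</h1></div>
<div id="list">
  <dd><a href="ch_3.html">Latest: Chapter 3</a></dd>
  <dd><a href="ch_1.html">Chapter 1</a></dd>
  <dd><a href="ch_2.html">Chapter 2</a></dd>
  <dd><a href="ch_3.html">Chapter 3</a></dd>
</div>
'''

soup = bs4.BeautifulSoup(html, 'html.parser')
name = soup.select('#info h1')[0].string  # novel title
links = soup.select('#list dd a')[1:]     # drop the one duplicate link here
print(name)                               # 示例小说
print([a.get('href') for a in links])     # ['ch_1.html', 'ch_2.html', 'ch_3.html']
```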
5. Build a function to get the chapter name and chapter content
def get_content(url):  # get the chapter title and chapter text
    response = requests.get(url=url, headers=headers)
    html = response.text
    soup = bs4.BeautifulSoup(html, 'html.parser')
    name = soup.select(".bookname h1")[0].get_text()  # get the chapter title
    text = soup.select("#content")[0].get_text().replace('\xa0', '').replace(' ', '')  # '\xa0' is the &nbsp; character
    text = text.replace('笔趣阁 www.52bqg.net,最快更新万古第一神最新章节! ', '')  # get the chapter text; use replace to strip the ad line and whitespace
    return name, text
6. Build a download function
def text_save(filename, data):  # filename: output file; data: list of strings to write
    file = open(filename, 'w', encoding='utf-8')
    for i in range(len(data)):
        s = str(data[i]).replace('[', '').replace(']', '')  # remove square brackets
        s = s.replace("'", '').replace(',', '') + '\n'  # remove quotes and commas, end each entry with a newline
        file.write(s)  # write the list entries to the file one by one
    file.close()
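To see exactly what text_save writes (note that the replace calls strip quotes and commas from the chapter text itself, not just from list punctuation), here is a self-contained round trip against a temporary file:

```python
import os
import tempfile

def text_save(filename, data):  # same logic as above: one line per list entry
    file = open(filename, 'w', encoding='utf-8')
    for i in range(len(data)):
        s = str(data[i]).replace('[', '').replace(']', '')
        s = s.replace("'", '').replace(',', '') + '\n'
        file.write(s)
    file.close()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'novel.txt')
    text_save(path, ['Chapter 1', "It's raining, hard."])
    with open(path, encoding='utf-8') as f:
        print(f.read())  # Chapter 1 / Its raining hard. -- quotes and commas are gone
```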
7. Build the main function
def main():
    list_all = list()  # an empty list to collect all the content
    base_url = 'https://www.52bqg.net/book_126836/'
    novel_name, novel_lists = get_name_lists(base_url)  # call the function
    text_name = 'D:/爬虫--笔趣阁/' + '{}.txt'.format(novel_name)
    # for i in range(len(novel_lists)):  # use this loop to crawl the whole novel
    for i in range(0, 2):  # these are study notes, so only the first two chapters are crawled
        novel_url = base_url + novel_lists[i].get("href")
        name, novel = get_content(novel_url)  # call the function
        list_all.append(name)
        list_all.append(novel)
        print(name, 'downloaded successfully!')
    text_save(text_name, list_all)  # call the function
    print('All chapters of the novel have been downloaded!')
8. Complete code
import requests
import bs4
import os

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.116 Safari/537.36'
}

# Create the folder
if not os.path.exists('D:/爬虫--笔趣阁'):
    os.mkdir('D:/爬虫--笔趣阁')

def get_name_lists(url):  # get the novel name and the list of chapter links
    response = requests.get(url=url, headers=headers)
    html = response.text
    soup = bs4.BeautifulSoup(html, 'html.parser')
    novel_lists = soup.select('#list dd a')  # get the chapter links
    novel_name = soup.select('#info h1')[0].string  # get the novel name
    novel_lists = novel_lists[12:]  # drop the 12 duplicate "latest chapter" links at the top
    return novel_name, novel_lists

def get_content(url):  # get the chapter title and chapter text
    response = requests.get(url=url, headers=headers)
    html = response.text
    soup = bs4.BeautifulSoup(html, 'html.parser')
    name = soup.select(".bookname h1")[0].get_text()  # get the chapter title
    text = soup.select("#content")[0].get_text().replace('\xa0', '').replace(' ', '')  # '\xa0' is the &nbsp; character
    text = text.replace('笔趣阁 www.52bqg.net,最快更新万古第一神最新章节! ', '')  # get the chapter text; use replace to strip the ad line and whitespace
    return name, text

def text_save(filename, data):  # filename: output file; data: list of strings to write
    file = open(filename, 'w', encoding='utf-8')
    for i in range(len(data)):
        s = str(data[i]).replace('[', '').replace(']', '')  # remove square brackets
        s = s.replace("'", '').replace(',', '') + '\n'  # remove quotes and commas, end each entry with a newline
        file.write(s)  # write the list entries to the file one by one
    file.close()

def main():
    list_all = list()  # an empty list to collect all the content
    base_url = 'https://www.52bqg.net/book_126836/'
    novel_name, novel_lists = get_name_lists(base_url)  # call the function
    text_name = 'D:/爬虫--笔趣阁/' + '{}.txt'.format(novel_name)
    # for i in range(len(novel_lists)):  # use this loop to crawl the whole novel
    for i in range(0, 2):  # these are study notes, so only the first two chapters are crawled
        novel_url = base_url + novel_lists[i].get("href")
        name, novel = get_content(novel_url)  # call the function
        list_all.append(name)
        list_all.append(novel)
        print(name, 'downloaded successfully!')
    text_save(text_name, list_all)  # call the function
    print('All chapters of the novel have been downloaded!')

if __name__ == '__main__':
    main()
Section 5: Running results
Since this is a study exercise, I only downloaded two chapters. The part to change in order to download the whole novel is explained in the section above.
Section 6: Reference blog posts and learning links
1. The source of the idea of collecting everything in a list
Reference blog post: click here
2. Some usage notes on soup.select
Reference blog post: click here