Python crawler -- novel crawling -- Biquge

A simple crawler example using the requests + bs4 modules: crawling a novel from Biquge

section1: Statement

1. The content crawled in this article is freely downloadable content.
2. These are my own study notes and will not be used commercially.
3. If anything in this article infringes your rights, please contact me and I will delete the article!!!

section2: Ideas

While studying bs4, I looked around for crawler examples. For novels, most of what I found saved each crawled chapter to a separate txt file, so I wondered whether I could put all the chapters into a single txt file instead. That is what this article does. (And I get a few novels to read along the way, hehe.)

section3: Download link analysis

First open the Biquge website, pick a novel you want to crawl, then right-click, choose Inspect, and look for the pattern in the page structure.

Looking for the first chapter, I found the link I want to get, but I also noticed several nodes above it (the area circled in blue in the screenshot). These correspond to the "latest chapters" block at the top of the page and are repeated later in the full chapter list, so they are exactly what needs to be removed later.
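To see this duplication for yourself, you can print the first few links under #list before slicing them off. This is a quick check of my own (using the same book URL as in the code below):

import requests
import bs4

headers = {'user-agent': 'Mozilla/5.0'}
html = requests.get('https://www.52bqg.net/book_126836/', headers=headers).text
soup = bs4.BeautifulSoup(html, 'html.parser')
for a in soup.select('#list dd a')[:14]:  # the first 12 entries mirror the "latest chapters" block
    print(a.get('href'), a.string)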

After finding the link, open it.

Then inspect again and you will find where the chapter content sits. Note that its element has the id content, so we can extract the content from there.

But to make the novel comfortable to read, there should be no need to open a separate txt file for each chapter. We can gather all the content into one list and then write it out in a single download.

See the next step

section4: Code writing

(The code is commented throughout, including notes on which step solves which of the problems described above.)

1. Import the packages

import requests
import bs4
import os

2. Build the request header

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.116 Safari/537.36'
}
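One note of my own: requests guesses the page encoding from the HTTP headers, and Chinese sites like this one are sometimes GBK-encoded. If response.text comes back garbled, letting requests sniff the encoding from the body usually fixes it:

response = requests.get('https://www.52bqg.net/book_126836/', headers=headers)
response.encoding = response.apparent_encoding  # re-decode using the encoding detected in the body
html = response.text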

3. Create a folder to save the novel

# create the folder that will hold the novel
if not os.path.exists('D:/爬虫--笔趣阁'):
    os.mkdir('D:/爬虫--笔趣阁')
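A small aside of my own: os.makedirs with exist_ok=True folds the existence check and the creation into one call (and also creates any missing parent directories):

import os
os.makedirs('D:/爬虫--笔趣阁', exist_ok=True)  # replaces the os.path.exists() guard above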

4. Build a function to get the novel name and chapter link

def get_name_lists(url):  # get the novel name and the list of chapter links
    response = requests.get(url=url, headers=headers)
    html = response.text
    soup = bs4.BeautifulSoup(html, 'html.parser')
    novel_lists = soup.select('#list dd a')  # grab the chapter links
    novel_name = soup.select('#info h1')[0].string  # grab the novel name
    novel_lists = novel_lists[12:]  # drop the first 12 links, which duplicate the "latest chapters" block
    return novel_name, novel_lists
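A quick usage check of my own, to confirm the title and that the slice really removed the duplicated links:

novel_name, novel_lists = get_name_lists('https://www.52bqg.net/book_126836/')
print(novel_name)                  # the novel's title
print(novel_lists[0].get('href'))  # should now be chapter 1, not a "latest chapters" entry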

5. Build a function to get the chapter name and chapter content

def get_content(url):  # get the chapter name and chapter content
    response = requests.get(url=url, headers=headers)
    html = response.text
    soup = bs4.BeautifulSoup(html, 'html.parser')
    name = soup.select(".bookname h1")[0].get_text()  # chapter name
    # the first replace target was lost from the original post; '\xa0' (non-breaking space) is assumed here
    text = soup.select("#content")[0].get_text().replace('\xa0', '').replace('     ', '')
    text = text.replace('笔趣阁 www.52bqg.net,最快更新万古第一神最新章节!    ', '')  # strip the site's ad line and extra spaces
    return name, text
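And a similar check for this function (again my own addition; novel_lists comes from get_name_lists above):

base_url = 'https://www.52bqg.net/book_126836/'
chapter_url = base_url + novel_lists[0].get('href')
name, text = get_content(chapter_url)
print(name)        # chapter title
print(text[:100])  # first 100 characters of the chapter body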

6. Build a download function

def text_save(filename, data):  # filename: the output file; data: the list of items to write
    with open(filename, 'w', encoding='utf-8') as file:
        for item in data:
            s = str(item).replace('[', '').replace(']', '')  # strip square brackets
            s = s.replace("'", '').replace(',', '') + '\n'  # strip single quotes and commas, end each item with a newline
            file.write(s)  # write the list items to the file one by one
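Since data here ends up being a flat list of strings, the bracket and quote stripping is mostly defensive; a join-based variant of my own (here called text_save_simple, not part of the original post) does the same job more directly:

def text_save_simple(filename, data):  # hypothetical variant: one line per list item
    with open(filename, 'w', encoding='utf-8') as file:
        file.write('\n'.join(str(item) for item in data) + '\n')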

7. Build the main function

def main():
    list_all = list()  # start with an empty list to collect all the content
    base_url = 'https://www.52bqg.net/book_126836/'
    novel_name, novel_lists = get_name_lists(base_url)  # call the function above
    text_name = 'D:/爬虫--笔趣阁/' + '{}.txt'.format(novel_name)
    # for i in range(len(novel_lists)):   # this loop would crawl the whole novel
    for i in range(0, 2):  # these are study notes, so only the first two chapters are crawled
        novel_url = base_url + novel_lists[i].get("href")
        name, novel = get_content(novel_url)  # call the function above
        list_all.append(name)
        list_all.append(novel)
        print(name, 'downloaded successfully!!!')
    text_save(text_name, list_all)  # call the function above
    print('All chapters of the novel have been downloaded!!!')
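One caution of my own: if you switch to the commented-out full-novel loop, it is friendlier to the server (and less likely to get you blocked) to pause between requests, e.g. inside main():

import time

for i in range(len(novel_lists)):  # full-novel loop instead of range(0, 2)
    novel_url = base_url + novel_lists[i].get("href")
    name, novel = get_content(novel_url)
    list_all.append(name)
    list_all.append(novel)
    time.sleep(1)  # wait a second between chapter requests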

8. Complete code

import requests
import bs4
import os
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.116 Safari/537.36'
}
# create the folder that will hold the novel
if not os.path.exists('D:/爬虫--笔趣阁'):
    os.mkdir('D:/爬虫--笔趣阁')


def get_name_lists(url):  # get the novel name and the list of chapter links
    response = requests.get(url=url, headers=headers)
    html = response.text
    soup = bs4.BeautifulSoup(html, 'html.parser')
    novel_lists = soup.select('#list dd a')  # grab the chapter links
    novel_name = soup.select('#info h1')[0].string  # grab the novel name
    novel_lists = novel_lists[12:]  # drop the first 12 links, which duplicate the "latest chapters" block
    return novel_name, novel_lists


def get_content(url):  # get the chapter name and chapter content
    response = requests.get(url=url, headers=headers)
    html = response.text
    soup = bs4.BeautifulSoup(html, 'html.parser')
    name = soup.select(".bookname h1")[0].get_text()  # chapter name
    # the first replace target was lost from the original post; '\xa0' (non-breaking space) is assumed here
    text = soup.select("#content")[0].get_text().replace('\xa0', '').replace('     ', '')
    text = text.replace('笔趣阁 www.52bqg.net,最快更新万古第一神最新章节!    ', '')  # strip the site's ad line and extra spaces
    return name, text


def text_save(filename, data):  # filename: the output file; data: the list of items to write
    with open(filename, 'w', encoding='utf-8') as file:
        for item in data:
            s = str(item).replace('[', '').replace(']', '')  # strip square brackets
            s = s.replace("'", '').replace(',', '') + '\n'  # strip single quotes and commas, end each item with a newline
            file.write(s)  # write the list items to the file one by one


def main():
    list_all = list()  # start with an empty list to collect all the content
    base_url = 'https://www.52bqg.net/book_126836/'
    novel_name, novel_lists = get_name_lists(base_url)  # call the function above
    text_name = 'D:/爬虫--笔趣阁/' + '{}.txt'.format(novel_name)
    # for i in range(len(novel_lists)):   # this loop would crawl the whole novel
    for i in range(0, 2):  # these are study notes, so only the first two chapters are crawled
        novel_url = base_url + novel_lists[i].get("href")
        name, novel = get_content(novel_url)  # call the function above
        list_all.append(name)
        list_all.append(novel)
        print(name, 'downloaded successfully!!!')
    text_save(text_name, list_all)  # call the function above
    print('All chapters of the novel have been downloaded!!!')


if __name__ == '__main__':
    main()

section5: Running results


Because these are study notes, I only downloaded two chapters. The part that needs changing to download the whole novel is pointed out in the comments in the code above.

section6: Reference blog posts and learning links

1. The source of the idea of using a list

Reference blog post: click here to get

2. Some notes on how to use soup.select

Reference blog post: click here to get

Origin: blog.csdn.net/qq_44921056/article/details/113828803