Python crawler beginner tutorial (2): crawling novels

Preface

The text and pictures in this article come from the Internet and are for learning and communication purposes only; they are not for any commercial use. If you have any questions, please contact us to resolve them.

Free video tutorials on Python crawlers, data analysis, website development, and other case studies are available online:

https://space.bilibili.com/523606542

Previous article: Python crawler beginner tutorial (1): crawling Douban movie ranking information

Basic development environment

  • Python 3.6
  • PyCharm

Related modules used

  • requests
  • parsel

Install Python and add it to your environment variables, then use pip to install the required modules.
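If requests and parsel are not installed yet, they can be installed from the command line with pip (tqdm, used later for a progress bar, can be added the same way):

pip install requests parsel tqdm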


Single chapter crawl


One, clarify the requirements

Crawl the novel content and save it locally:

  • Novel name
  • Novel chapter name
  • Novel content
Two, request the web page

import requests

# url address of the first chapter of the novel
url = 'http://www.biquges.com/52_52642/25585323.html'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}
response = requests.get(url=url, headers=headers)
print(response.text)

The data returned by the request contains garbled characters, so we need to fix the encoding.

Adding one line of code sets the encoding automatically:

response.encoding = response.apparent_encoding
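Since every request needs the same headers and the same encoding fix, it is convenient to wrap them into a reusable helper. This is the get_response function that the later snippets call (it also appears in the complete code at the end):

import requests


def get_response(html_url):
    # request the page with a browser User-Agent and fix the encoding
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    }
    response = requests.get(url=html_url, headers=headers)
    response.encoding = response.apparent_encoding
    return response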


Three, parse the data

With CSS selectors, the novel title and content can be extracted directly:

import parsel


def get_one_novel(html_url):
    # call the helper that requests the page data
    response = get_response(html_url)
    # turn the response into a parsel Selector object
    selector = parsel.Selector(response.text)
    # extract the chapter title
    title = selector.css('.bookname h1::text').get()
    # extract the novel content; getall() returns a list
    content_list = selector.css('#content::text').getall()
    # ''.join(list) joins the list into one string
    content_str = ''.join(content_list)
    print(title, content_str)


if __name__ == '__main__':
    url = 'http://www.biquges.com/52_52642/25585323.html'
    get_one_novel(url)
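A note on the parsel calls used above: css('...::text').get() returns the first matching text node (or None when nothing matches), while getall() returns a list of every match. A minimal standalone sketch, using toy HTML rather than the real page:

import parsel

html = '<div class="bookname"><h1>Chapter One</h1></div>'  # toy HTML for illustration
selector = parsel.Selector(html)
print(selector.css('.bookname h1::text').get())     # Chapter One
print(selector.css('.bookname h1::text').getall())  # ['Chapter One']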

Four, save the data (data persistence)

Use the usual saving method: with open.

import os


def save(title, content):
    """
    Save the novel
    :param title: chapter title of the novel
    :param content: novel content
    :return:
    """
    # folder path named after the chapter title
    dirname = f'{title}\\'
    # os is a built-in module; create the folder if it does not already exist
    if not os.path.exists(dirname):
        os.makedirs(dirname)
    # remember the .txt suffix; mode 'a' appends, encoding sets the file encoding
    with open(dirname + title + '.txt', mode='a', encoding='utf-8') as f:
        # write the chapter title
        f.write(title)
        # newline
        f.write('\n')
        # write the novel content
        f.write(content)
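Note that mode='a' appends, so running the script twice will write the same chapter twice; delete the old .txt file before re-running if you want a clean copy.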


That is all it takes to save a single chapter. What if you want to save the entire novel?

Whole novel crawler

Now that you know how to crawl a single chapter, you only need to collect the url addresses of all the chapters to crawl the entire novel.

The url addresses of all the chapters sit inside dd tags on the catalogue page, but each address is incomplete (a relative path), so it must be spliced onto the site domain when crawling:

def get_all_url(html_url):
    # call the helper that requests the page data
    response = get_response(html_url)
    # turn the response into a parsel Selector object
    selector = parsel.Selector(response.text)
    # every chapter url is in the href attribute of an a tag
    dds = selector.css('#list dd a::attr(href)').getall()
    for dd in dds:
        novel_url = 'http://www.biquges.com' + dd
        print(novel_url)


if __name__ == '__main__':
    url = 'http://www.biquges.com/52_52642/index.html'
    get_all_url(url)
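The string concatenation above assumes every href is an absolute path on the same site. As a sketch of a slightly more robust alternative (not required by the tutorial), the standard library's urllib.parse.urljoin resolves both absolute and relative links against a base url:

from urllib.parse import urljoin

base_url = 'http://www.biquges.com/52_52642/index.html'
# both forms resolve to http://www.biquges.com/52_52642/25585323.html
print(urljoin(base_url, '/52_52642/25585323.html'))
print(urljoin(base_url, '25585323.html'))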

This gives us the url addresses of all the novel's chapters.

Complete crawler code

import requests
import parsel
from tqdm import tqdm


def get_response(html_url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    }
    response = requests.get(url=html_url, headers=headers)
    response.encoding = response.apparent_encoding
    return response


def save(novel_name, title, content):
    """
    保存小说
    :param title: 小说章节标题
    :param content: 小说内容
    :return:
    """
    filename = f'{novel_name}' + '.txt'
    # 一定要记得加后缀 .txt  mode 保存方式 a 是追加保存  encoding 保存编码
    with open(filename, mode='a', encoding='utf-8') as f:
        # 写入标题
        f.write(title)
        # 换行
        f.write('\n')
        # 写入小说内容
        f.write(content)


def get_one_novel(name, novel_url):
    # call the helper that requests the page data
    response = get_response(novel_url)
    # turn the response into a parsel Selector object
    selector = parsel.Selector(response.text)
    # extract the chapter title
    title = selector.css('.bookname h1::text').get()
    # extract the novel content; getall() returns a list
    content_list = selector.css('#content::text').getall()
    # ''.join(list) joins the list into one string
    content_str = ''.join(content_list)
    save(name, title, content_str)


def get_all_url(html_url):
    # call the helper that requests the page data
    response = get_response(html_url)
    # turn the response into a parsel Selector object
    selector = parsel.Selector(response.text)
    # every chapter url is in the href attribute of an a tag
    dds = selector.css('#list dd a::attr(href)').getall()
    # the name of the novel
    novel_name = selector.css('#info h1::text').get()
    for dd in tqdm(dds):
        novel_url = 'http://www.biquges.com' + dd
        get_one_novel(novel_name, novel_url)


if __name__ == '__main__':
    novel_id = input('Enter the book ID: ')
    url = f'http://www.biquges.com/{novel_id}/index.html'
    get_all_url(url)
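To run it, enter the book ID taken from the catalogue page url; for the example chapter used earlier the ID is 52_52642, so the script requests http://www.biquges.com/52_52642/index.html, walks every chapter link with a tqdm progress bar, and appends each chapter to a single .txt file named after the novel.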

Origin blog.csdn.net/m0_48405781/article/details/113118597