How to automatically generate an epub e-book

E-books are now easy to obtain, and the cost of producing them keeps falling. This article shares one way to generate epub e-books automatically, in the hope that it helps you.


The main structure of an e-book is:

  • Basic metadata: title, author, introduction, cover image, publisher, publication date, and similar information.
  • Table of contents: lets the reader jump to any chapter and quickly find their place.
  • Chapter body: the actual text of the book.

That is the core structure of an e-book. Some reading software supports additional information, but these three parts are enough to make a usable e-book.
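
To make the three parts concrete, here is a minimal sketch of how they could be modeled as plain data. The class and field names (`Book`, `Chapter`, `table_of_contents`) are illustrative assumptions, not part of the epub format or of any library.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Chapter:
    title: str
    file_name: str  # hypothetical file name inside the e-book, e.g. "chap_01.xhtml"
    html: str       # chapter body


@dataclass
class Book:
    # Part 1: basic metadata
    title: str
    author: str
    language: str = "en"
    description: str = ""
    cover_image: Optional[str] = None
    publisher: str = ""
    published: str = ""
    # Part 3: chapter bodies
    chapters: List[Chapter] = field(default_factory=list)

    def table_of_contents(self):
        """Part 2: table of contents, derived here as (title, file_name) pairs."""
        return [(c.title, c.file_name) for c in self.chapters]


book = Book(
    title="Sample Book",
    author="Sample Author",
    chapters=[Chapter("Introduction", "chap_01.xhtml", "<p>Hello</p>")],
)
print(book.table_of_contents())
```

Deriving the table of contents from the chapter list keeps the two from drifting apart, which is one reasonable design choice among several.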

A long time ago I wrote a crawler project, a tool for generating epub e-books automatically. Its main features are:

  • Implemented in Python 3, with support for custom rules for crawling target web pages.
  • Built-in examples for GitBook sites, WordPress blogs, and a JavaScript tutorial website, which you can use as references.

At first the implementation was fairly simple; it was written purely as a crawler exercise. Anyone can build on it to implement something better.

Today, let's walk through how this feature is implemented.

epub production project

  • ebook_spider project: a general-purpose crawler project for making e-books.

This project encapsulates the basic steps of making an e-book, so you only need to implement the corresponding methods to generate one automatically.

1. EbookLib library

This is a Python library for making epub format e-books.

  • Project page: it does not have many stars, but it is very easy to use.
  • Documentation: you will need to consult it during development.

The epub files in this project are produced with the ebooklib library; for simplicity, an e-book base class, Ebook, is wrapped around it.

2. Ebook crawler base class

Ebook is a crawler base class that generates an epub file from the content of the web pages whose links match your rules.

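The general workflow is:

  • Start from the e-book content URL and request it over HTTPS.
  • Is there a chapter list? True: there are chapters, so fetch the chapter list. False: there is no chapter list.
  • Is there a section list? True: there are sections, so fetch the text content. False: there are no sections, so exit with an error.
  • Output the epub file.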

After inheriting from the Ebook base class, you implement the rules for fetching the chapter list, the section list, and the text to suit your needs; the content behind the target URL is then turned into an epub file automatically. A built-in plug-in downloads img image resources and packages them into the epub file, which is very convenient.
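
The subclassing pattern described above can be sketched roughly as follows. This is a hypothetical stand-in, not the project's actual Ebook class: `EbookSketch`, `collect`, `DemoEbook`, and the injected `fetch` callable are all assumptions made so the example runs without network access, and it stops short of writing the epub file.

```python
from abc import ABC, abstractmethod


class EbookSketch(ABC):
    """Hypothetical stand-in for a crawler base class (not the project's Ebook)."""

    def __init__(self, url, fetch):
        self.url = url
        self.fetch = fetch  # callable: url -> HTML text, injected to avoid real requests

    @abstractmethod
    def fetch_chapter_list(self, text):
        """Return a list of chapter page URLs (may be empty)."""

    @abstractmethod
    def fetch_section_list(self, text):
        """Return a list of (title, url) pairs."""

    @abstractmethod
    def fetch_content(self, text):
        """Return the HTML body for one section page."""

    def collect(self):
        """Walk URL -> chapters -> sections -> content; return [(title, html), ...]."""
        index_html = self.fetch(self.url)
        chapter_urls = self.fetch_chapter_list(index_html)
        # Assumption: without a chapter list, look for sections on the start page itself.
        pages = [self.fetch(u) for u in chapter_urls] if chapter_urls else [index_html]
        sections = [s for page in pages for s in self.fetch_section_list(page)]
        if not sections:
            raise RuntimeError("no section list found; nothing to package")
        return [(title, self.fetch_content(self.fetch(url))) for title, url in sections]


class DemoEbook(EbookSketch):
    """Toy rules: the index page lists one 'title|url' pair per line."""

    def fetch_chapter_list(self, text):
        return []  # single-page site, no chapter list

    def fetch_section_list(self, text):
        return [tuple(line.split("|")) for line in text.splitlines() if "|" in line]

    def fetch_content(self, text):
        return "<p>" + text + "</p>"


# Fake site used instead of real HTTP requests
PAGES = {
    "https://example.com/": "First post|https://example.com/p1\nSecond post|https://example.com/p2",
    "https://example.com/p1": "Hello",
    "https://example.com/p2": "World",
}

demo = DemoEbook("https://example.com/", PAGES.__getitem__)
print(demo.collect())
```

Keeping the crawl order in the base class and the site-specific rules in three small methods is what lets one driver serve very different websites.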

WordPress eBook Creation

Many personal blogs are built with WordPress, so let's see how to package such a blog into an epub e-book file.

The WordPressEbook class inherits from Ebook and implements the chapter and content extraction rules according to the layout of a WordPress site. The epub file itself is then produced by the ebooklib library. It really is that simple; find a WordPress blog and try it out.

Refer to the wp_ebook.py file in the project:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#########################################################################
# Author: zioer
# mail: [email protected]
# Created Time: 2020-01-02 (Thursday) 18:46:24
#########################################################################


from bs4 import BeautifulSoup as bs
from multiprocessing.dummy import Pool as ThreadPool
import multiprocessing as mp


from Base.EbookBase import Ebook


default_headers = {
    # Adjacent string literals are joined, so no stray indentation ends up in the header
    'User-Agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 '
                   'Safari/537.36')
}




class WordPressEbook(Ebook):
    # E-book production for WordPress blogs
    def __init__(self, params, outdir="./", proxy=""):
        if not isinstance(params, dict):
            raise TypeError("params type error! not dict type.")
        self.url = params['url']
        self.book_name = params['book_name']
        self.author = params['author']
        self.lang = params.get('lang', 'zh')
        self.identifier = params.get('id', 'id0001')
        self.page_num = params.get('page_num', 5)   # added parameter: number of listing pages, used to build `{url}/page/{page}` requests
        self.outdir = outdir
        self.proxy = proxy
        self.opts = {}
        self.plugin = None

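    def fetch_chapter_list(self, text):
        '''
        Build the list of chapter URLs.
        params:
            text:  HTML of the chapter list page (unused here; URLs are derived from page_num)
        return
            list  [url, ...]
        '''
        # Each listing page `{url}/page/{n}` is treated as one chapter
        base = self.url.rstrip('/')  # avoid a double slash when url ends with '/'
        a_list = ['{}/page/{}'.format(base, n) for n in range(1, self.page_num + 1)]
        return a_list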

    def fetch_section_list(self, text):
        '''
        Extract the list of sections (title, url).
        params:
            text:  HTML of one listing page
        return
            list  [(title, url), ...]
        '''
        soup = bs(text, "lxml")
        # Post titles on the listing page are links directly inside <h2>
        a_list = soup.select(r'h2 > a')
        section_list = [(a.get_text(), a.get('href')) for a in a_list]
        return section_list

    def fetch_content(self, text):
        """
        Extract the content of one section.
        params:
            text: HTML of the section page
        return:
            content: page title plus the post body as HTML, or None on failure
        """
        try:
            soup = bs(text, 'lxml')
            title = '<h1>' + soup.title.text + '</h1>\n'
            content = soup.find('div', 'entry-content')
            if content is None:
                content = soup.find('article')
            return title + content.prettify()
        except Exception as e:
            print(e)
        return None


if __name__ == '__main__':
    start_urls = [
        {
            'url': 'https://www.learnhard.cn/',
            'page_num': 3,
            'book_name': '悟空的修炼笔记',
            'author': 'learnhard.cn',
            'id': 'learnhard',
            'lang': 'zh'
        },
    ]
    ctx = mp.get_context('fork')  # the 'fork' start method is POSIX-only (not available on Windows)
    p_list = []
    for params in start_urls[:]:
        ebook = WordPressEbook(params, outdir="./")
        p = ctx.Process(target=ebook.fetch_book)
        p.start()
        p_list.append(p)
    for p in p_list:
        p.join()
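
The `h2 > a` rule in fetch_section_list can be tried out without installing anything. The sketch below swaps BeautifulSoup for the standard library's `html.parser`, so it is an approximation: it matches any link inside an `<h2>`, not only direct children. The names `HeadingLinkParser` and `section_list`, and the sample HTML, are assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HeadingLinkParser(HTMLParser):
    """Collect (title, href) for <a> elements found inside <h2> headings."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._in_h2 = False
        self._href = None   # href of the link currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
        elif tag == "a" and self._in_h2:
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None
        elif tag == "h2":
            self._in_h2 = False


def section_list(html, base_url):
    """Return [(title, absolute_url), ...] for the post links on a listing page."""
    parser = HeadingLinkParser()
    parser.feed(html)
    return [(title, urljoin(base_url, href)) for title, href in parser.links]


SAMPLE = (
    '<h2><a href="/post-1">Post 1</a></h2>'
    '<p><a href="/about">About</a></p>'
    '<h2><a href="https://example.com/post-2"> Post 2 </a></h2>'
)
sections = section_list(SAMPLE, "https://example.com/")
print(sections)
```

Resolving each href with `urljoin` is a small extra step the original code does not take; it only matters for sites that emit relative links.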

Summary

This project combines two needs: web crawling and epub e-book production. Once the chapter and content crawling rules are set, an epub e-book is generated automatically; it can then be converted to mobi or azw3 format and saved to a Kindle for reading.
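
The article does not say which tool performs the conversion. One common option is Calibre's `ebook-convert` command-line tool, which infers the output format from the file extension. The helper names `convert_command` and `convert` below are assumptions, and Calibre must be installed separately for `convert` to work.

```python
import shutil
import subprocess
from pathlib import Path


def convert_command(epub_path, target_format="mobi"):
    """Build the `ebook-convert` command that converts an epub to `target_format`."""
    src = Path(epub_path)
    dst = src.with_suffix("." + target_format)
    return ["ebook-convert", str(src), str(dst)]


def convert(epub_path, target_format="mobi"):
    """Run the conversion; requires Calibre's command-line tools on PATH."""
    if shutil.which("ebook-convert") is None:
        raise FileNotFoundError("ebook-convert not found; install Calibre first")
    cmd = convert_command(epub_path, target_format)
    subprocess.run(cmd, check=True)
    return cmd[-1]


print(convert_command("book.epub", "azw3"))
```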

Doing it yourself lets you read blog posts as e-books and improves your programming skills along the way. What do you think of this project?

If you like this project, follow me and remember to star it on GitHub.


Origin blog.csdn.net/dragonballs/article/details/126615204