Build your own novel reader with Python

Original link: https://www.jianshu.com/u/8f2987e2f9fb

Some time ago, during a dry spell with nothing to read, I came across an audiobook of the novel King for Mercy on the Himalaya (Ximalaya) app. The narration is funny and very entertaining, but only the first 200 chapters are free; the rest are paid, at 0.2 yuan per chapter. I considered buying it, but the narration progresses slowly and the whole book runs to more than 1,300 chapters, so besides costing more than 200 yuan there was no telling when the recording would catch up. So I looked for the text version instead. Text has its own drawback: it keeps both your hands and your eyes occupied. When you are busy with something else but cannot resist the urge to keep reading, that is a real dilemma. I searched around online and found no other audio version. The apps I used to rely on for reading aloud, such as WeChat Reading and Zhuishu Shenqi, have started charging too. So what to do? Can this stump a programmer? Of course not. After all, I use the best programming language in the world: Python.

So, in the spirit of doing it yourself, let's build our own novel reader.

Choosing a speech synthesis service
To read text aloud, we need speech synthesis. There are many speech synthesis products today; iFlyTek and Baidu are two of the better ones. Here we use the Baidu speech synthesis API.

Creating a speech synthesis application
First register a Baidu account, then log in to the Baidu AI open platform and create an application.
Fill in the application name, description, and other details, then submit.
Make a note of the AppID, API Key, and Secret Key, since they are needed when calling the API. Then check the technical documentation and install the SDK with pip install baidu-aip. The documentation includes detailed sample code, so it is easy to get started. Several parameters can be adjusted, such as volume, pitch, speed, and voice. With speech synthesis in place, we have what we need to read aloud; the next step is to get the content of the novel.
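
As a minimal sketch of how those parameters might be passed, the helper below builds the options dictionary for AipSpeech.synthesis. The keys spd, pit, vol, and per and the value ranges in the comments are my reading of the Baidu documentation and should be checked against the current docs. build_options and synthesize are hypothetical helper names, and client is assumed to be an AipSpeech instance created with your own credentials.

```python
def build_options(speed=5, pitch=5, volume=5, voice=0):
    """Build the options dict for AipSpeech.synthesis.

    Assumed ranges (verify against the Baidu docs):
    speed 0-9, pitch 0-9, volume 0-15, voice = speaker id (this article uses 0).
    """
    if not 0 <= speed <= 9:
        raise ValueError("speed must be between 0 and 9")
    if not 0 <= pitch <= 9:
        raise ValueError("pitch must be between 0 and 9")
    if not 0 <= volume <= 15:
        raise ValueError("volume must be between 0 and 15")
    return {"spd": speed, "pit": pitch, "vol": volume, "per": voice}


def synthesize(client, text, **settings):
    """Return mp3 bytes for text, or raise if the API reports an error."""
    result = client.synthesis(text, "zh", 1, build_options(**settings))
    # The SDK returns audio bytes on success and a dict describing the error otherwise.
    if isinstance(result, dict):
        raise RuntimeError("speech synthesis failed: %r" % (result,))
    return result
```

With a real client, a call might look like synthesize(client, "你好", speed=4, voice=0).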

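Getting the novel's content

We get the novel's content from the Biquge website: it is free, and it has no anti-scraping measures. Find the novel's index page, https://www.biquge.info/40_40289/, and fetching it with requests is all it takes. A quick look at the page shows the following.
All chapter entries sit under dd elements, and the links follow a regular pattern, so the whole chapter list can be pulled directly with XPath.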

def get_chapters(self):
    url = "https://www.biquge.info/40_40289/"
    r = self.session.get(url)
    r.encoding = chardet.detect(r.content).get("encoding", "utf-8")
    html = etree.HTML(r.text)
    for item in html.xpath("//dl/dd/a"):
        yield item.attrib["title"], url + item.attrib["href"]
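
The snippet builds each chapter link with url + href, which works as long as every href is a bare file name relative to the index page. A slightly more defensive sketch uses urllib.parse.urljoin from the standard library, which also copes with root-relative and absolute links. chapter_url is a hypothetical helper, and the href values are made-up examples.

```python
from urllib.parse import urljoin

INDEX_URL = "https://www.biquge.info/40_40289/"


def chapter_url(href, base=INDEX_URL):
    """Resolve a chapter href against the index page URL."""
    return urljoin(base, href)


# Made-up href values, for illustration only
print(chapter_url("1.html"))
print(chapter_url("/40_40289/2.html"))
print(chapter_url("https://www.biquge.info/40_40289/3.html"))
```

All three forms resolve to a full URL under the index page, so the rest of the crawler does not need to care which form the site uses.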

Fetching a chapter's content is just as simple, so I will skip the analysis.

def get_content(self, url):
    r = self.session.get(url)
    r.encoding = chardet.detect(r.content).get("encoding", "utf-8")
    html = etree.HTML(r.text)
    # Chapter title, taken from the heading inside the "bookname" block
    title = html.xpath(r'//*[@class="bookname"]/h1')[0].text
    for info in html.xpath("//div[@id='content']"):
        # string(.) returns the text of the node and all of its descendants
        text = info.xpath("string(.)")

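One thing to note here: the chapter content contains HTML elements. XPath provides string(.), which extracts the text of all child nodes at once and is very handy.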
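
To illustrate the difference, here is a small sketch on a made-up HTML fragment (not the real chapter page): the .text attribute stops at the first child element, while string(.) concatenates the text of all descendants.

```python
from lxml import etree

# A made-up fragment standing in for a chapter page
html = etree.HTML('<div id="content">First line.<br/>Second <b>bold</b> line.</div>')
div = html.xpath("//div[@id='content']")[0]

first_text = div.text               # only the text before the first child element
full_text = div.xpath("string(.)")  # text of the element and all its descendants

print(repr(first_text))
print(repr(full_text))
```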

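Synthesizing and saving
With the novel's content in hand, combining it with speech synthesis gives us a first prototype of the reader. A simple implementation: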

import chardet
import requests
from lxml import etree
from aip import AipSpeech


class CollectNovels:
    def __init__(self):
        self.session = requests.session()
        self.session.headers["user-agent"] = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"

        # Your own APP_ID, API Key, and Secret Key (placeholders shown here)
        APP_ID = 'your-app-id'
        API_KEY = 'your-api-key'
        SECRET_KEY = 'your-secret-key'
        self.client = AipSpeech(APP_ID, API_KEY, SECRET_KEY)
        self.count = 0  # running index used to name the generated mp3 files

    def get_chapters(self, url):
        r = self.session.get(url)
        r.encoding = chardet.detect(r.content).get("encoding", "utf-8")
        html = etree.HTML(r.text)
        for item in html.xpath("//dl/dd/a"):
            yield item.attrib["title"], url + item.attrib["href"]

    def get_content(self, url):
        r = self.session.get(url)
        r.encoding = chardet.detect(r.content).get("encoding", "utf-8")
        html = etree.HTML(r.text)
        for info in html.xpath("//div[@id='content']"):
            text = info.xpath("string(.)")
            for line in text.split("。"):
                line = line.strip()
                if not line:
                    continue  # skip empty fragments
                content = self.client.synthesis(line, 'zh', 1, {"per": 0})
                # synthesis returns bytes on success and an error dict on failure
                if isinstance(content, dict):
                    continue
                self.count += 1
                # Write in binary mode, one numbered file per sentence
                with open("audio_%05d.mp3" % self.count, "wb") as fp:
                    fp.write(content)


if __name__ == '__main__':
    novel = CollectNovels()
    home_url = "https://www.biquge.info/40_40289/"
    for title, url in novel.get_chapters(home_url):
        novel.get_content(url)
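
Splitting only on 。 leaves empty fragments and can produce pieces longer than the API accepts. The sketch below splits on common Chinese sentence-ending punctuation and packs whole sentences into chunks under a size limit. split_text is a hypothetical helper, and the default limit of 1024 GBK-encoded bytes is an assumption about the Baidu API that should be checked against its documentation.

```python
import re


def split_text(text, max_bytes=1024, encoding="gbk"):
    """Split text into chunks of whole sentences that stay under max_bytes.

    A single sentence longer than the limit is kept whole rather than cut.
    """
    # Keep each sentence together with its closing punctuation mark.
    sentences = re.findall(r"[^。!?]+[。!?]?", text)
    chunks, current = [], ""
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue
        candidate = current + sentence
        if current and len(candidate.encode(encoding, "ignore")) >= max_bytes:
            # Adding this sentence would reach the limit: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be passed to the synthesis call in place of the individual lines, which also reduces the number of API requests.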

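This produces mp3 files, one per sentence; once they have been joined with audio-merging software, we can put the result anywhere and listen to it. There is a drawback, though: everything has to be generated in advance before a player can play it, which is not very convenient. Wouldn't it be better to play the audio as it is generated?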

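Playing the synthesized speech
We can use Python's pygame library. Several other libraries I tried did not work well, and some have gone unmaintained for years, so I did not use them.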

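import time
import pygame
from io import BytesIO

frequency = 16000  # sample rate in Hz of the synthesized mp3

pygame_mixer = pygame.mixer
pygame_mixer.init(frequency=frequency)
# content is the bytes object returned by client.synthesis(...)
byte_obj = BytesIO()
byte_obj.write(content)
byte_obj.seek(0, 0)
pygame_mixer.music.load(byte_obj)
pygame_mixer.music.play()
# Block until playback has finished
while pygame_mixer.music.get_busy():
    time.sleep(0.1)
pygame_mixer.stop()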

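Here BytesIO keeps the synthesized audio bytes in memory, so there is no need to save a local mp3 file. One thing to watch is pygame_mixer.init(frequency=frequency): the frequency parameter is the audio sample rate. If it is not set, the default is 22050 in pygame 1.x (44100 since pygame 2.0), and the playback sounds very different from the same mp3 in an ordinary player. For a long time I thought the library was at fault and tried several others; some did not work at all and others had problems of their own. Only later did I discover that this parameter needs to be set. So where does the value come from? Check the properties of the mp3 file generated earlier.
Then set the frequency to 16000.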
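
Instead of opening the file properties, the sample rate can also be read from the mp3 data itself. The sketch below is a simplified parser, not a full MP3 reader: it skips an optional ID3v2 tag, finds the first byte sequence that looks like a frame header, and looks up the sample rate from the version and sample-rate bits of the MPEG audio frame header. mp3_sample_rate is a hypothetical helper name.

```python
# Sample rates in Hz, indexed by the version bits and then the sample-rate bits
SAMPLE_RATES = {
    0b11: (44100, 48000, 32000),  # MPEG-1
    0b10: (22050, 24000, 16000),  # MPEG-2
    0b00: (11025, 12000, 8000),   # MPEG-2.5
}


def mp3_sample_rate(data):
    """Return the sample rate of the first MPEG audio frame in data, or None."""
    pos = 0
    # Skip an ID3v2 tag: a 10-byte header whose last 4 bytes are a sync-safe size
    if data[:3] == b"ID3" and len(data) >= 10:
        size = 0
        for byte in data[6:10]:
            size = (size << 7) | (byte & 0x7F)
        pos = 10 + size
    while pos + 2 < len(data):
        b0, b1, b2 = data[pos], data[pos + 1], data[pos + 2]
        if b0 == 0xFF and (b1 & 0xE0) == 0xE0:  # all 11 sync bits are set
            version = (b1 >> 3) & 0x03
            layer = (b1 >> 1) & 0x03
            rate_index = (b2 >> 2) & 0x03
            # Reject reserved values, which indicate a false sync
            if version in SAMPLE_RATES and layer != 0 and rate_index != 3:
                return SAMPLE_RATES[version][rate_index]
        pos += 1
    return None
```

The returned value can be passed to pygame.mixer.init(frequency=...), for example after calling mp3_sample_rate on the bytes returned by the synthesis call.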

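Final touches

A few problems remain before we have a reader we can actually use.

  • Synthesizing one sentence and then playing it causes pauses, so synthesis and playback need to run in parallel.

  • If the synthesized audio is not stored, it has to be synthesized again next time, so we store the synthesized speech in a database.

  • Fetching chapters, synthesizing speech, and playing audio each need to be handled as separate steps.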
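
The three points above can be sketched roughly as follows: a small SQLite-backed cache so that each sentence is synthesized only once, and a producer thread that synthesizes ahead while the main thread plays. AudioCache and run_pipeline are hypothetical names, synthesize and play are placeholders for the Baidu call and the pygame playback, and the table layout is an assumption.

```python
import hashlib
import queue
import sqlite3
import threading


class AudioCache:
    """Stores synthesized audio in SQLite, keyed by a hash of the text."""

    def __init__(self, path=":memory:"):
        # check_same_thread=False lets the worker thread use the connection;
        # the lock serializes access to it.
        self.conn = sqlite3.connect(path, check_same_thread=False)
        self.lock = threading.Lock()
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS audio (key TEXT PRIMARY KEY, data BLOB)"
        )

    def get_or_create(self, text, synthesize):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        with self.lock:
            row = self.conn.execute(
                "SELECT data FROM audio WHERE key = ?", (key,)
            ).fetchone()
        if row is not None:
            return bytes(row[0])  # cache hit: no synthesis needed
        data = synthesize(text)
        with self.lock:
            self.conn.execute(
                "INSERT OR REPLACE INTO audio (key, data) VALUES (?, ?)",
                (key, data),
            )
            self.conn.commit()
        return data


def run_pipeline(sentences, synthesize, play, cache, buffer_size=5):
    """Synthesize in a background thread while the calling thread plays."""
    audio_queue = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def producer():
        try:
            for sentence in sentences:
                audio_queue.put(cache.get_or_create(sentence, synthesize))
        finally:
            audio_queue.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = audio_queue.get()
        if item is done:
            break
        play(item)
```

The bounded queue keeps the synthesizer a few sentences ahead of playback without running through the whole chapter at once, and a file path passed to AudioCache would keep the audio between runs.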

Future plans

Later on, a front-end page could be added to choose which novel to crawl, show synthesis and playback progress, and select which chapter to play.


Origin blog.csdn.net/qdPython/article/details/102757377