Python crawler to play by yourself: Use a small python crawler to crawl the novels you want on the Internet

This article uses the PyCharm IDE and builds the program function by function. It is a beginner-level tutorial: read the article and the code carefully and you will be able to reproduce it. Without further ado, on to the tutorial.

The source code for each step is attached below it. If you need the complete .py file, please leave a comment or add QQ: 1284309379

Overview:

I wrote all of this code myself after studying tutorials online, improving it step by step based on my own understanding. It is certainly not perfect, and many parts could be enriched. For example, the book titles could be displayed in a GUI, turning the script into a small application, but that would require a further area of knowledge. Since this is a beginner tutorial, I won't add any bells and whistles.

First, let's sketch the skeleton of the program: declare each function we will need, then fill them in one by one:

import re
import requests
from requests import RequestException

def get_page(url):
    # Fetch the page while disguised as a browser
    pass


def get_list(page):
    # Get the URL and title of every novel
    pass

def get_chapter(novel_url):
    # Get the chapter links and chapter names of the selected novel
    pass

def get_content(chapter):
    # Get the text of every chapter and write it to a txt file
    pass

def write_tofile(chapter_content, chapter_name):
    pass

if __name__ == '__main__':
    url = 'http://www.xbiquge.la/xiaoshuodaquan/'
    page = get_page(url)
    novel_list = get_list(page)
    name = '斗罗大陆4终极斗罗'

    for item in novel_list:
        if item[1] == name:
            novel_chapter = get_chapter(item[0])
            for chapter in novel_chapter:
                get_content(chapter)

First, some preparation

Step one: install the requests library needed for this project. With the Python environment variables configured, press Win+R, type cmd to open a command prompt, then run the following command and wait for the install to finish:
pip install requests
Alternatively, you can download the requests package directly in PyCharm: link: download package in pycharm
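
If you are not sure whether the install succeeded, a quick optional sanity check (not part of the tutorial code) is to print the library's version:

import requests
print(requests.__version__)  # prints the installed version if the import works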
Step two: open PyCharm and import the packages we need

import re
import requests
from requests import RequestException

Now, let's break down the task

Step 1: Many websites now have an anti-crawling mechanism that blocks a bare Python crawler outright. So what do we do? We disguise our crawler, pretending that we are visiting the page with a browser, and then we can scrape what we want. Many people say a website's anti-crawling mechanism is useless; in fact that is not the case, and this trick alone stops plenty of novices.

Next, we use code to give our crawler a new "head". In the browser that ships with Windows 10, press F12 on the web page (in QQ Browser, right-click and choose Inspect). In the panel that opens, click Network in the menu, then click one of the requests (for example a file ending in .css), find the User-Agent field, and copy its value. With the User-Agent in hand, the next step is to "change the head": replace the crawler's request headers so the site sees ordinary browser traffic, then pull down the full source of the page. The specific code is as follows:

def get_page(url):
    # Fetch the page while disguised as a browser
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
        }
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            response.encoding = response.apparent_encoding
            return response.text
        else:
            return None
    except RequestException:
        return None
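
As a quick check that the disguise works, here is a minimal usage sketch, assuming the function above is defined:

# Fetch the novel-list page; get_page() returns None on a
# non-200 response or a network error
page = get_page('http://www.xbiquge.la/xiaoshuodaquan/')
if page is None:
    print('fetch failed - check the network or the headers')
else:
    print('fetched', len(page), 'characters of HTML')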

Step 2: We have disguised ourselves as a browser and fetched the full source of the page. Now comes the most important part of any crawler: study the page source and work out the right regular expression. First, the starting URL: http://www.xbiquge.la/xiaoshuodaquan/
Right-click on a blank area of the page and choose View Page Source.
[screenshot: page source of the novel-list page]

    '''
    You can see that each book's link and title are written in the
    fixed form
        <li><a href="(link)">(title)</a></li>
    so all we have to do is pick out each book's URL and title.
    '''

The specific code is as follows. Input parameter: the page source of Biquge's novel-list page.

Write the regular expression described above, use the re.compile() method to turn it into a pattern object, then call findall() to grab everything in the page source that matches it. This returns a list whose elements are tuples of (link, book title). Careful inspection of the fetched list shows that the first 10 tuples are irrelevant to our needs, so we drop them before returning.

def get_list(page):
    # Get the URL and title of every novel
    pat = '<li><a href="(.*?)">(.*?)</a></li>'
    novels = re.compile(pat).findall(page)
    return novels[10:]
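
To see what comes back, a small sketch (the exact link format depends on the site, so the comment below is only indicative):

page = get_page('http://www.xbiquge.la/xiaoshuodaquan/')
novels = get_list(page)           # assumes the fetch above succeeded
print(len(novels), 'novels found')
# each element is a (link, title) tuple; the main block below
# passes the link straight to get_chapter()
print(novels[:3])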

Step 3: We now have the link and name of every novel; next we pick out the one we want. Input parameter: the URL of the selected novel.
Open the novel's URL to reach the table of contents of the novel you want to crawl, and analyze the page source as before
[screenshot: page source of the chapter list]
You will find that the chapter names and links also follow a regular pattern:

<dd><a href ='(URL)' >(chapter name)</a></dd>

In the same way, write a regular expression to extract each chapter's link and name, then return the result. (This step is prone to failing when the network is slow, so wrap it in a try statement.)

def get_chapter(novel_url):
    # Get the chapter links and chapter names of the selected novel
    try:
        html = get_page(novel_url)
        pat = "<dd><a href='(.*?)' >(.*?)</a></dd>"
        chapters = re.compile(pat).findall(html)
        return chapters
    except:
        print("Failed to fetch the chapter list")

Step 4: Through step 3 we obtained the chapter links, and this step is the most important in the whole article: fetching the novel's text.
First complete each chapter link into a full URL, then fetch it with our disguised get_page() method.
Analyzing the page source, you can clearly see that all of the text sits inside HTML of this shape:

<div id="content">正文<p>

Having found the pattern, the next step is to write a regular expression to extract the text.
[screenshot: page source of a chapter page]
The specific implementation source code is as follows:

def get_content(chapter):
    # Get the chapter text and hand it to the txt writer
    chapter_url = 'http://www.xbiquge.la' + chapter[0]
    html = get_page(chapter_url)
    if html is None:
        # get_page() returns None on failure; skip this chapter
        return
    pat = '<div id="content">(.*?)<p>'
    chapter_content = re.compile(pat).findall(html)
    write_tofile(chapter_content, chapter[1])
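
One caveat about the pattern: by default the dot in a regular expression does not match newline characters, so it only matches if the whole chapter body sits on one source line (which holds here, because line breaks are encoded as <br />). If the source ever spreads the body over several lines, pass the re.S flag as a defensive variant:

# re.S (DOTALL) lets '.' match newlines too, in case the chapter
# body spans several lines of page source
chapter_content = re.compile(pat, re.S).findall(html)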

Step 5: The last step. The text has been extracted; now strip the unwanted markup from it, and one file operation finishes the job.
Input parameters: the chapter name and the chapter text.
Use the built-in string method str.replace() to remove the redundant markup, then save.
When writing to the txt file, be sure to open it in append mode.
The code is as follows:

def write_tofile(chapter_content, chapter_name):
    for content in chapter_content:
        # Strip the indentation entities and <br /> tags
        print(chapter_name)  # progress indicator
        content = content.replace("&nbsp;&nbsp;&nbsp;&nbsp;", "").replace("<br />", "")
        with open('斗罗大陆4终极斗罗.txt', 'a', encoding='utf-8') as f:
            f.write(chapter_name+'\n\n\n\n'+content+'\n\n\n\n\n\n\n')
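
One consequence of append mode: running the script a second time appends the whole novel again. If you want a clean file on each run, you can truncate it once before the crawl starts, for example with an optional line at the top of the main block (assuming the same file name as above):

# Empty the output file so re-running the script does not duplicate chapters
open('斗罗大陆4终极斗罗.txt', 'w', encoding='utf-8').close()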

Extra: the code above saves all chapters into a single txt file. Another option is to create a folder and save each crawled chapter in its own txt file.
To save this way, you also have to import the os library:

import os
def write_tofile(chapter_content, chapter_name):
    # name is the book title defined in the main block below
    if not os.path.exists(name):
        os.mkdir(name)
    for content in chapter_content:
        # Strip the indentation entities and <br /> tags
        content = content.replace("&nbsp;&nbsp;&nbsp;&nbsp;", "").replace("<br />", "")
        with open(os.path.join(name, '{}.txt'.format(chapter_name)), 'w', encoding='utf-8') as f:
            f.write(content)
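
One thing to watch when saving one file per chapter: chapter names scraped from the web can contain characters that are illegal in file names (such as ? or * on Windows). A small hypothetical helper, safe_name(), sketches one way to strip them before calling open():

import re

def safe_name(chapter_name):
    # replace characters that Windows forbids in file names
    return re.sub(r'[\\/:*?"<>|]', '_', chapter_name)

You would then open the file with os.path.join(name, '{}.txt'.format(safe_name(chapter_name))).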

Step 6:
With all the functions written, assemble them and write the main entry point

if __name__ == '__main__':
    url = 'http://www.xbiquge.la/xiaoshuodaquan/'
    page = get_page(url)
    novel_list = get_list(page)
    print(novel_list)
    name = '斗罗大陆4终极斗罗'

    for item in novel_list:
        if item[1] == name:
            novel_chapter = get_chapter(item[0])
            for chapter in novel_chapter:
                get_content(chapter)
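
Because the site can be slow (see the note in step 3), you may want the main loop to pause briefly between chapters and skip occasional failures instead of crashing. A hedged variant of the loop above, assuming the functions already defined:

import time

for item in novel_list:
    if item[1] == name:
        novel_chapter = get_chapter(item[0])
        if not novel_chapter:  # get_chapter() returns None on failure
            continue
        for chapter in novel_chapter:
            try:
                get_content(chapter)
            except Exception as e:
                print('skipping', chapter[1], 'because of:', e)
            time.sleep(0.5)  # pause briefly to be gentle on the server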

Summary

At this point, this little Python crawler is finished. The result is shown in the figure: the novel is crawled out in full, more than 400 chapters of it. From now on I can read the novel directly instead of hunting for resources (manual laugh).
[screenshot: the crawled novel in the txt file]
All in all, this is just a small crawler, but it is a very valuable exercise for getting started with crawlers. Fetching what you want, by your own means, is a wonderful thing.

Source: blog.csdn.net/weixin_44440669/article/details/97660889