My First Web Crawler Project (1)


1. Tools

  1. Development environment: Python 3.7 + Visual Studio Code
  2. Browser: Google Chrome
  3. Python libraries: requests (used to fetch HTML pages) and BeautifulSoup (used to parse HTML documents); the code below also uses lxml as the BeautifulSoup parser. All of them can be installed with pip.

2. Main content

2.1 URL

Goal: crawl the novel 《雪中悍刀行》 by 烽火戏诸侯 from the Xinbiquge website.
Table-of-contents page address: http://www.xbiquge.la/0/745/

2.2 Ideas

  1. Analyze the source code of the table-of-contents page and find the tag that holds each chapter's URL: right-click the page and choose Inspect (N), or press F12, to open the page source. To locate the chapter URLs, hover over one of the chapter entries, right-click, choose Inspect, and the source view jumps to the corresponding element. It turns out that the chapter URLs are stored inside the div tag with the attribute id="list", so you only need to find the div with that attribute and then collect all the a tags it contains to get every chapter URL.
  2. Analyze the chapter page: right-click the body text, choose Inspect, and the source view jumps to the corresponding element. The text is stored inside the div tag with the attribute id="content", so you need to find the div with that attribute and then output its content.
  3. General idea: get the URL of each chapter from the table-of-contents page, fetch each chapter's content via its URL, then write the content into a text file (.txt).

2.3 Implementation

2.3.1 Get the URL of each chapter

    def __get_Link_chapter(self):
        '''Get the URL of each chapter from the table-of-contents page.

        Parse the contents page, locate the div tag that holds the
        chapter URLs by its attributes, and return all chapter links.

        '''
        # If the request fails (connection or timeout error), wait 1 s and retry
        req_lp = None
        for try_counter in range(10):
            try:
                req_lp = requests.get(self.__url_lp, timeout=10)
                break
            except requests.exceptions.ConnectionError:
                print('Fetching contents page, ConnectionError on attempt %d' % (try_counter+1))
            except requests.exceptions.Timeout:
                print('Fetching contents page, Timeout on attempt %d' % (try_counter+1))
            except Exception:
                print('Fetching contents page, other error on attempt %d' % (try_counter+1))
            time.sleep(1)

        if req_lp is None:
            print('Failed to fetch the contents page')
            return
        else:
            try:
                req_lp.encoding = req_lp.apparent_encoding
                # Build a BeautifulSoup object with the lxml parser
                bs_lp = BeautifulSoup(req_lp.text, 'lxml')
                # Find all div tags with the matching attributes
                div_list = bs_lp.find_all('div', attrs=self.__attrs_div_lp)
                # Collect all the a tags inside them
                link_chapter = []
                for div in div_list:
                    link_chapter += div.find_all('a')
                return link_chapter
            except TypeError:
                print('Error while parsing the contents page: TypeError')
                return
            except Exception:
                print('Error while parsing the contents page: other error')
                return

2.3.2 Get the content of a chapter

    def __get_content_chapter(self, link):
        '''Get the content of one chapter.

        :param link: an a tag obtained by parsing the contents page,
                     containing the chapter name and URL

        '''
        name_chapter = link.string
        url_chapter = self.__url_ws + link['href']  # join to get the chapter page URL
        # If an exception occurs while connecting or reading,
        # retry up to 10 times, waiting 1 s between attempts
        req_ct = None
        for try_counter in range(10):
            try:
                # timeout set to 10 s
                req_ct = requests.get(url_chapter, timeout=10)
                break
            except requests.exceptions.ConnectionError:
                print('Fetching chapter page, ConnectionError on attempt %d' % (try_counter+1))
            except requests.exceptions.Timeout:
                print('Fetching chapter page, Timeout on attempt %d' % (try_counter+1))
            except Exception:
                print('Fetching chapter page, other error on attempt %d' % (try_counter+1))
            time.sleep(1)

        if req_ct is None:
            print('Failed to fetch chapter: ' + name_chapter)
            return name_chapter + '\n\n'
        else:
            try:
                req_ct.encoding = self.__encode
                # Build a BeautifulSoup object
                bs_ct = BeautifulSoup(req_ct.text, 'lxml')
                # Convert the matching div to text,
                # replacing &nbsp; (non-breaking space) with a normal space
                # and br tags with newlines
                content = bs_ct.find('div', attrs=self.__attrs_div_ct)
                content = str(content).replace('<br/>', '\n').replace('\xa0', ' ')
                content = BeautifulSoup(content, 'lxml').get_text()
                return name_chapter + '\n\n' + content + '\n\n'
            except TypeError:
                print('Error while parsing chapter page: TypeError ' + name_chapter)
                return name_chapter + '\n\n'
            except Exception:
                print('Error while parsing chapter page: other error ' + name_chapter)
                return name_chapter + '\n\n'

2.3.3 Write content

    def write(self, path_save):
        '''Write the downloaded book to the given path.

        :param path_save: the directory to save the file in

        '''
        # Create a text file named after the book in the given directory
        path_save = path_save + '\\' + self.__name + '.txt'
        # Get the URL of each chapter
        link_chapter = self.__get_Link_chapter()
        if link_chapter is None:
            pass
        else:
            # Open the output file
            with open(path_save, 'w+', encoding=self.__encode) as file:
                for chapter, link in enumerate(link_chapter):
                    # Get the chapter content
                    content_chapter = self.__get_content_chapter(link)
                    file.write(content_chapter)
                    sys.stdout.write('Download progress: %.1f%%' % float(
                        (chapter+1)/len(link_chapter)*100) + '\r')
        print('<<' + self.__name + '>> download finished')

2.3.4 Complete code

from bs4 import BeautifulSoup
import requests
import time
import sys



class fiction():

    def __init__(self, name, url_ws, url_lp, encode, attrs_div_lp={}, attrs_div_ct={}):
        self.__name = name  # book name
        self.__url_ws = url_ws  # website URL
        self.__url_lp = url_lp  # URL of the table-of-contents (link) page
        self.__attrs_div_lp = attrs_div_lp  # attributes of the div holding the chapter links on the contents page
        self.__attrs_div_ct = attrs_div_ct  # attributes of the div holding the body text on a chapter page
        self.__encode = encode  # encoding to use

    def Update(self, name, url_ws, url_lp, encode, attrs_div_lp={}, attrs_div_ct={}):
        '''Reset the parameters.

        All parameters must be reset at the same time, otherwise errors may occur.

        '''
        self.__name = name  # book name
        self.__url_ws = url_ws  # website URL
        self.__url_lp = url_lp  # URL of the table-of-contents (link) page
        self.__attrs_div_lp = attrs_div_lp  # attributes of the div holding the chapter links on the contents page
        self.__attrs_div_ct = attrs_div_ct  # attributes of the div holding the body text on a chapter page
        self.__encode = encode  # encoding to use

    def __get_Link_chapter(self):
        '''Get the URL of each chapter from the table-of-contents page.

        Parse the contents page, locate the div tag that holds the
        chapter URLs by its attributes, and return all chapter links.

        '''
        # If the request fails (connection or timeout error), wait 1 s and retry
        req_lp = None
        for try_counter in range(10):
            try:
                req_lp = requests.get(self.__url_lp, timeout=10)
                break
            except requests.exceptions.ConnectionError:
                print('Fetching contents page, ConnectionError on attempt %d' % (try_counter+1))
            except requests.exceptions.Timeout:
                print('Fetching contents page, Timeout on attempt %d' % (try_counter+1))
            except Exception:
                print('Fetching contents page, other error on attempt %d' % (try_counter+1))
            time.sleep(1)

        if req_lp is None:
            print('Failed to fetch the contents page')
            return
        else:
            try:
                req_lp.encoding = req_lp.apparent_encoding
                # Build a BeautifulSoup object with the lxml parser
                bs_lp = BeautifulSoup(req_lp.text, 'lxml')
                # Find all div tags with the matching attributes
                div_list = bs_lp.find_all('div', attrs=self.__attrs_div_lp)
                # Collect all the a tags inside them
                link_chapter = []
                for div in div_list:
                    link_chapter += div.find_all('a')
                return link_chapter
            except TypeError:
                print('Error while parsing the contents page: TypeError')
                return
            except Exception:
                print('Error while parsing the contents page: other error')
                return

    def __get_content_chapter(self, link):
        '''Get the content of one chapter.

        :param link: an a tag obtained by parsing the contents page,
                     containing the chapter name and URL

        '''
        name_chapter = link.string
        url_chapter = self.__url_ws + link['href']  # join to get the chapter page URL
        # If an exception occurs while connecting or reading,
        # retry up to 10 times, waiting 1 s between attempts
        req_ct = None
        for try_counter in range(10):
            try:
                # timeout set to 10 s
                req_ct = requests.get(url_chapter, timeout=10)
                break
            except requests.exceptions.ConnectionError:
                print('Fetching chapter page, ConnectionError on attempt %d' % (try_counter+1))
            except requests.exceptions.Timeout:
                print('Fetching chapter page, Timeout on attempt %d' % (try_counter+1))
            except Exception:
                print('Fetching chapter page, other error on attempt %d' % (try_counter+1))
            time.sleep(1)

        if req_ct is None:
            print('Failed to fetch chapter: ' + name_chapter)
            return name_chapter + '\n\n'
        else:
            try:
                req_ct.encoding = self.__encode
                # Build a BeautifulSoup object
                bs_ct = BeautifulSoup(req_ct.text, 'lxml')
                # Convert the matching div to text,
                # replacing &nbsp; (non-breaking space) with a normal space
                # and br tags with newlines
                content = bs_ct.find('div', attrs=self.__attrs_div_ct)
                content = str(content).replace('<br/>', '\n').replace('\xa0', ' ')
                content = BeautifulSoup(content, 'lxml').get_text()
                return name_chapter + '\n\n' + content + '\n\n'
            except TypeError:
                print('Error while parsing chapter page: TypeError ' + name_chapter)
                return name_chapter + '\n\n'
            except Exception:
                print('Error while parsing chapter page: other error ' + name_chapter)
                return name_chapter + '\n\n'

    def write(self, path_save):
        '''Write the downloaded book to the given path.

        :param path_save: the directory to save the file in

        '''
        # Create a text file named after the book in the given directory
        path_save = path_save + '\\' + self.__name + '.txt'
        # Get the URL of each chapter
        link_chapter = self.__get_Link_chapter()
        if link_chapter is None:
            pass
        else:
            # Open the output file
            with open(path_save, 'w+', encoding=self.__encode) as file:
                for chapter, link in enumerate(link_chapter):
                    # Get the chapter content
                    content_chapter = self.__get_content_chapter(link)
                    file.write(content_chapter)
                    sys.stdout.write('Download progress: %.1f%%' % float(
                        (chapter+1)/len(link_chapter)*100) + '\r')
        print('<<' + self.__name + '>> download finished')


if __name__ == '__main__':
    start = time.time()
    f = fiction(name='雪中悍刀行',
                url_ws='http://www.xbiquge.la',
                url_lp='http://www.xbiquge.la/0/745/',
                attrs_div_lp={'id': 'list'},
                attrs_div_ct={'id': 'content'},
                encode='utf-8')
    f.write(r'C:\Users\HP\Desktop\pytxt')
    stop = time.time()
    print('Elapsed time: %ds' % (stop-start))

3. Summary

3.1 HTML basics

HTML basics: the Runoob HTML tutorial

3.2 The requests library

  1. get(): fetches the content of the server's response. Note: it is best to set a timeout; otherwise, if the server never responds, the request keeps waiting, the program stays stuck at that point, and no exception or error is raised (see the sketch after this list).
  2. Regarding encoding: it is best to specify the encoding explicitly. The encoding guessed by requests may be wrong, which can cause exceptions when the text is written out later. See the Zhihu discussion in the references for a detailed analysis.
    The charset declared in the HTML head specifies the source encoding, but the declared encoding sometimes does not match the actual one; in that case, use Response.apparent_encoding as the encoding:

req_lp.encoding = req_lp.apparent_encoding

  3. For more details on the requests library, see the official documentation: Quickstart, Advanced Usage, and the Developer Interface.
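
The following is a minimal sketch combining the two notes above: it fetches the contents page used in this article with a timeout and falls back to the detected encoding. The error messages are only illustrative, not part of the original code.

import requests

url = 'http://www.xbiquge.la/0/745/'  # the contents page used in this article
try:
    # set a timeout so the request cannot hang forever
    resp = requests.get(url, timeout=10)
    # the declared charset may be wrong, so fall back to the detected encoding
    resp.encoding = resp.apparent_encoding
    print(resp.text[:200])  # print the first 200 characters of the decoded HTML
except requests.exceptions.Timeout:
    print('The server did not respond within 10 seconds')
except requests.exceptions.ConnectionError:
    print('Could not connect to the server')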

3.3 The BeautifulSoup library

  1. Creating a BeautifulSoup object:
bs_lp = BeautifulSoup(req_lp.text, 'lxml')

The first parameter is the HTML document to be parsed, and the second specifies the document parser.
  2. The find_all() method finds all tags under the current tag that match the filter conditions and returns a bs4.element.ResultSet, which can be used like a list (see the sketch after this list).
  3. Encoding: after parsing with BeautifulSoup, the document is converted to Unicode regardless of whether the original HTML was GBK, UTF-8, or some other encoding, presumably because Unicode can represent all languages.
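
A small standalone sketch of find_all(); the HTML string here is made up for illustration and is not copied from the real site:

from bs4 import BeautifulSoup

html = '<div id="list"><a href="/0/745/1.html">Chapter 1</a><a href="/0/745/2.html">Chapter 2</a></div>'
soup = BeautifulSoup(html, 'lxml')
# find_all returns a bs4.element.ResultSet, which behaves like a list
div = soup.find_all('div', attrs={'id': 'list'})[0]
for a in div.find_all('a'):
    print(a.string, a['href'])  # chapter name and relative URL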

3.4 File writing

Python file operations: the Runoob Python 3 File methods tutorial
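
A minimal file-writing sketch in the same spirit as the write() method above; the file name and text are placeholders, and the encoding is given explicitly to avoid garbled output:

# open the file with an explicit encoding and write two lines of text
with open('test.txt', 'w', encoding='utf-8') as file:
    file.write('Chapter title\n\n')
    file.write('Chapter text...\n\n')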

3.5 Chapter URLs

The chapter URLs on the table-of-contents page are incomplete (relative), so each one has to be joined with the website URL to form a complete URL:

url_chapter = self.__url_ws + link['href']  # join to get the chapter page URL
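
Plain string concatenation, as above, works for this site; urllib.parse.urljoin from the standard library is an alternative (not used in the original code) that handles relative links more generally. The href value below is illustrative only:

from urllib.parse import urljoin

url_ws = 'http://www.xbiquge.la'
href = '/0/745/1.html'  # illustrative relative link

print(url_ws + href)          # plain concatenation, as used in the crawler
print(urljoin(url_ws, href))  # same result here, and also handles other relative forms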

4. References

  1. Python3 web crawler quick-start practical analysis
  2. How to solve the Chinese garbled-text problem in a Python3 crawler? (BeautifulSoup4), Zhihu

First revision: replaced the br tags with newline characters.

Origin: blog.csdn.net/qq_36439722/article/details/106028350