Python crawler: scrape a novel ranking list and import it into Excel for easy filtering

requests + bs4 + xlwt: a simple crawler example for the Biquge novel ranking list

Companion article on crawling the novel content itself: see the link in the original post.

section1: disclaimer

1. The content crawled in this article is freely available for users to view.
2. This article is my own study notes and will not be used commercially.
3. If anything in this article infringes your rights, please contact me and I will delete it!

section2: crawl analysis

Since we need to crawl the ranking list and then filter its data, we obviously have to crawl multiple pages. So first, let's look at how the URL changes from page to page; there is usually a pattern to follow.
(Screenshots comparing the page URLs omitted.)
Comparing the URLs quickly reveals the pattern, so we can generate them with a loop:

    for i in range(1, 5):  # adjust the range for however many pages you want to crawl
        url = 'https://www.52bqg.net/top/allvisit/{}.html'.format(i)

Next, inspect and analyze a single page.
(Screenshot of the page's HTML structure omitted.)
The pattern here is also very clear: each li node corresponds to one novel, so the fields can be extracted easily with bs4.
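To see the extraction in isolation, here is a minimal sketch run on a hand-written li snippet that mimics the page structure (the `.s1` to `.s5` class names come from the real page; the sample data is made up):

```python
import bs4

# A hand-written snippet mimicking one <li> of the ranking list
# (the .s1-.s5 class names match the real page; the data is made up).
html = '''
<div class="novelslistss">
  <li>
    <span class="s1">[Fantasy]</span>
    <span class="s2"><a href="#">Some Novel</a></span>
    <span class="s3"><a href="#">Chapter 100</a></span>
    <span class="s4">Some Author</span>
    <span class="s5">02-14</span>
  </li>
</div>
'''
soup = bs4.BeautifulSoup(html, 'html.parser')
for novel in soup.select('.novelslistss li'):
    row = [novel.select('.s1')[0].string,
           novel.select('.s2 a')[0].string,
           novel.select('.s3 a')[0].string,
           novel.select('.s4')[0].string,
           novel.select('.s5')[0].string]
    print(row)
```

This is exactly the per-novel loop that `get_content()` in the complete code runs over the live page.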
Once the fields are extracted, they need to be saved. The xlwt module imports the content into Excel; here is the code:

list_all = list()
path = 'D:/笔趣阁目录.xls'
workbook = xlwt.Workbook(encoding='utf-8', style_compression=0)
worksheet = workbook.add_sheet('小说目录', cell_overwrite_ok=True)  # overwritable cells; sets the sheet name
col = ('小说类型', '小说名', '最新章节', '作者', '最新更新时间')
for i in range(0, 5):
    worksheet.write(0, i, col[i])  # write the column headers
for i in range(1, 5):  # adjust the range for however many pages you want to crawl
    url = 'https://www.52bqg.net/top/allvisit/{}.html'.format(i)
    data_list = get_content(url)
    list_all.append([data_list])
for i in range(len(list_all)):  # i = 0..3, one entry per page
    sleep(0.5)  # pause 0.5 s between pages
    print('Downloading page {} of the list =====>  please wait'.format(i + 1))
    data_s = list_all[i]
    for j in range(len(data_s)):  # j = 0, since each page was wrapped in an extra list
        data = data_s[j]
        for k in range(len(data)):  # k = 0..49, one row per novel
            data_simple = data[k]
            for m in range(0, 5):  # m = 0..4, one column per field
                worksheet.write(1 + i * 50 + k, m, data_simple[m])
workbook.save(path)
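Note that `list_all.append([data_list])` wraps each page in an extra list, which is what forces the triple nesting above. A flatter equivalent appends the page list directly and computes the same row index `1 + i * 50 + k`; here is a sketch with dummy rows standing in for `get_content()` and a dict standing in for `worksheet.write`:

```python
# Sketch: same row layout as the xlwt loops above, but with the extra
# list wrapping removed. Dummy 5-field rows stand in for get_content().
pages = []
for i in range(1, 5):
    page = [['type', 'name{}'.format(k), 'chap', 'author', 'time'] for k in range(50)]
    pages.append(page)  # append the page list directly, no extra [page] wrapping

cells = {}  # (row, col) -> value, standing in for worksheet.write
for i, page in enumerate(pages):
    for k, row in enumerate(page):
        for m in range(5):
            cells[(1 + i * 50 + k, m)] = row[m]

# 4 pages x 50 rows x 5 columns, starting at row 1 (row 0 holds the headers)
print(len(cells))                # 1000
print(max(r for r, _ in cells))  # 200
```

The indexing arithmetic is unchanged, so the Excel layout comes out identical with one less loop level.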

(I only just learned xlwt through this example; the article I referenced is listed at the end.)

section3: complete code

import requests
import bs4
import xlwt
from time import sleep
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.116 Safari/537.36'
}


def get_content(url):
    res = requests.get(url=url, headers=headers)
    html = res.text
    soup = bs4.BeautifulSoup(html, 'html.parser')
    soup = soup.select('.novelslistss li')  # one li node per novel
    list_all = []
    for novel in soup[0: 50]:  # at most 50 novels per page
        novel_type = novel.select('.s1')[0].string
        novel_name = novel.select('.s2 a')[0].string
        latest_chapters = novel.select('.s3 a')[0].string
        author = novel.select('.s4')[0].string
        update_time = novel.select('.s5')[0].string
        list_all.append([novel_type, novel_name, latest_chapters, author, update_time])
    return list_all


def main():
    list_all = list()
    path = 'D:/笔趣阁目录.xls'
    workbook = xlwt.Workbook(encoding='utf-8', style_compression=0)
    worksheet = workbook.add_sheet('小说目录', cell_overwrite_ok=True)  # overwritable cells; sets the sheet name
    col = ('小说类型', '小说名', '最新章节', '作者', '最新更新时间')
    for i in range(0, 5):
        worksheet.write(0, i, col[i])  # write the column headers
    for i in range(1, 5):  # adjust the range for however many pages you want to crawl
        url = 'https://www.52bqg.net/top/allvisit/{}.html'.format(i)
        data_list = get_content(url)
        list_all.append([data_list])
    for i in range(len(list_all)):  # i = 0..3, one entry per page
        sleep(0.5)  # pause 0.5 s between pages
        print('Downloading page {} of the list =====>  please wait'.format(i + 1))
        data_s = list_all[i]
        for j in range(len(data_s)):  # j = 0, since each page was wrapped in an extra list
            data = data_s[j]
            for k in range(len(data)):  # k = 0..49, one row per novel
                data_simple = data[k]
                for m in range(0, 5):  # m = 0..4, one column per field
                    worksheet.write(1 + i * 50 + k, m, data_simple[m])
    workbook.save(path)
    print('All pages of the list retrieved =======>   saved successfully!')


if __name__ == '__main__':
    main()

section4: running results

(Screenshots of the resulting Excel sheet omitted.)
You can then use Excel's filter feature to screen for your favorite novels on multiple conditions!
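If you prefer to filter before writing to Excel, the same multi-condition screening can be done on the scraped rows in Python. A sketch with made-up rows in the 5-field format that `get_content()` returns:

```python
# Sketch: multi-condition filtering of scraped rows before writing them
# out. The rows mimic get_content()'s 5-field format; the data is made up.
rows = [
    ['玄幻小说', 'Novel A', 'Chapter 12', 'Author X', '02-14'],
    ['都市小说', 'Novel B', 'Chapter 99', 'Author Y', '02-13'],
    ['玄幻小说', 'Novel C', 'Chapter 3',  'Author X', '02-12'],
]

# Keep only fantasy novels (玄幻小说) written by Author X
wanted = [r for r in rows if r[0] == '玄幻小说' and r[3] == 'Author X']
for r in wanted:
    print(r[1])
```

Excel's built-in filters are more convenient for interactive browsing, but filtering in code is handy if the list is regenerated often.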

section5: Reference articles and learning links

1. Writing to Excel with the Python xlwt module

Reference article: see the link in the original post

2. Crawling the ranking of the best universities in China (an example)

Reference article: see the link in the original post

Origin blog.csdn.net/qq_44921056/article/details/113832307