[Python] [Web Crawler] Crawling the e-book "Red Star Over China"

1. Task

The goal is to crawl the e-book "Red Star Over China" from the classics-reading website dudj.net.

2. Principle

This website has no anti-crawling measures, and its HTML page structure is simple and clear, so it is well suited for beginners to practice on. (That said, the author only decided to crawl this book in order to complete a reading assignment for a history course...)
If you are not clear about the basic principles and methods of crawlers, I recommend the MOOC "Web Crawler and Information Extraction" from Beijing Institute of Technology. Personally, I find it basic, systematic, and clear~

3. Code

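from bs4 import BeautifulSoup
import requests

passage = ''
pages = []
for i in range(1703, 1763):
    # Build the URL of each page to crawl
    pages.append('http://www.dudj.net/renwuzhuanji/22/' + str(i) + '.html')

for p in pages:
    page = requests.get(p)
    # Decode with the encoding detected from the page content; otherwise the text comes out garbled
    page.encoding = page.apparent_encoding
    soup = BeautifulSoup(page.text, 'html5lib')
    # Get the section title and drop the trailing "更新时间..." ("update time ...") part
    title = soup.find('h2').text
    # The leading '----' makes the titles easy to find in Word later, when fixing their format
    passage += '----' + title[0:title.index('更新时间')] + '\n'
    paragraphs = soup.find('div', class_='zw').find_all('p')
    for para in paragraphs:
        # get_text() also works when a <p> contains nested tags, where .string would be None
        passage += para.get_text() + '\n'
    print(len(passage))  # progress indicator: total characters collected so far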
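
The URL construction and title trimming in the loop above can also be pulled out into two small helpers. This is only a sketch: the names build_page_urls and clean_title are made up for illustration and are not part of the original script. Unlike str.index, str.partition does not raise an error when the "更新时间" marker is missing, and the extra strip() is an addition the original code does not perform.

```python
# Hypothetical helpers; the names and default arguments are assumptions for illustration.

def build_page_urls(first_id, last_id, base='http://www.dudj.net/renwuzhuanji/22/'):
    """Return the page URLs for ids first_id..last_id (both inclusive)."""
    return [f'{base}{i}.html' for i in range(first_id, last_id + 1)]


def clean_title(raw_title, marker='更新时间'):
    """Keep only the part of the heading before the marker.

    partition() returns the whole string as the first element when the
    marker is absent, so a heading without the marker is kept as-is
    instead of raising ValueError the way str.index would.
    """
    title, _, _ = raw_title.partition(marker)
    return title.strip()


if __name__ == '__main__':
    urls = build_page_urls(1703, 1762)  # same pages as range(1703, 1763)
    print(len(urls), urls[0], urls[-1])
    print(clean_title('第一章 更新时间:2020-01-01'))
```

Note that the upper bound here is inclusive, so build_page_urls(1703, 1762) covers the same pages as range(1703, 1763) in the script.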

Next, save the crawled text to a file:

# Set the encoding explicitly; otherwise writing the file may raise an encoding error
with open('C:\\Red_Star_over_China.txt', mode='w', encoding='utf-8') as passage_file:
    passage_file.write(passage)
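
A small illustration of why the explicit encoding matters. It assumes (the post does not say so) that the error came from a system default encoding such as GBK, which is common on Chinese-language Windows. Scraped HTML often contains characters like the non-breaking space U+00A0 that GBK cannot represent, while UTF-8 can encode any Unicode text. The helper can_encode is a made-up name for this sketch.

```python
# Hypothetical demo; the choice of GBK as the default encoding is an assumption.

def can_encode(text, encoding):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == '__main__':
    # '\xa0' is the non-breaking space that &nbsp; turns into after parsing
    sample = '第一章\xa0探寻红色中国'
    print(can_encode(sample, 'gbk'))    # GBK has no mapping for U+00A0
    print(can_encode(sample, 'utf-8'))  # UTF-8 covers all of Unicode
```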

4. Reflection

1. The author did not look further into how to write the text directly into a docx file, and simply copied and pasted the contents of the txt file instead >.<
2. In addition, the author found that the code only crawls the section titles within each chapter and does not get the top-level chapter titles. To get them, just crawl the chapter title information from the page http://www.dudj.net/renwuzhuanji/22/ ~ The author did not do this extra work because it does not hinder completing the reading assignment.
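
For the first point, one possible way to skip the copy-and-paste step is the third-party python-docx package. The sketch below is not what the author did; it assumes python-docx is installed, and the function names split_chapters and write_docx are invented for illustration. It relies on the '----' prefix that the crawler puts in front of every title, so a body paragraph that happens to start with '----' would be mistaken for a title.

```python
# Hypothetical sketch; requires the third-party package python-docx for write_docx.

def split_chapters(passage, marker='----'):
    """Split the crawled text into (title, [paragraphs]) pairs.

    A line starting with the marker opens a new section; every other
    non-empty line is a paragraph of the current section.
    """
    chapters = []
    for line in passage.splitlines():
        if line.startswith(marker):
            chapters.append((line[len(marker):].strip(), []))
        elif line.strip() and chapters:
            chapters[-1][1].append(line)
    return chapters


def write_docx(chapters, path):
    """Write each title as a level-1 heading followed by its paragraphs."""
    from docx import Document  # imported here so split_chapters works without python-docx

    doc = Document()
    for title, paragraphs in chapters:
        doc.add_heading(title, level=1)
        for text in paragraphs:
            doc.add_paragraph(text)
    doc.save(path)


if __name__ == '__main__':
    sample = '----Section A\nfirst paragraph\nsecond paragraph\n----Section B\nthird paragraph\n'
    print(split_chapters(sample))
    # write_docx(split_chapters(sample), 'Red_Star_over_China.docx')  # needs python-docx
```

Because the titles become real Word headings, the find-and-replace pass on '----' is no longer needed.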

Feel free to leave a comment~

Origin blog.csdn.net/why_not_study/article/details/105264372