A crawler that scrapes a novel site and writes each chapter to its own txt file

First, determine the website link

The link used in the code below points to a serialized novel chosen from the home page of https://www.biqukan.com

from bs4 import BeautifulSoup
import requests

link = 'https://www.biqukan.com/1_1094'

Second, view the page source

Inspecting the source, we find:
1. The site uses GBK encoding.
2. The page contains <a> tags that are not chapter links, so that part has to be filtered out.
3. The chapters we want start from the main body volume (正文卷), so the idea is to slice the list of links to keep only that range (see the sketch after this list).
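
Both findings are easy to confirm interactively before writing the real loop. A minimal sketch (passing 'html.parser' explicitly is my addition; the original code relies on BeautifulSoup's default parser):

from bs4 import BeautifulSoup
import requests

res = requests.get('https://www.biqukan.com/1_1094')
print(res.apparent_encoding)  # requests' charset guess; a GBK-family encoding is expected for this site
res.encoding = 'gbk'

# Enumerate every <a> tag to see where the chapter list actually starts and ends;
# this is how slice bounds like 42 and -13 are chosen (they are layout-specific)
soup = BeautifulSoup(res.text, 'html.parser')
for i, a in enumerate(soup.find_all('a')):
    print(i, a.string, a.attrs.get('href'))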

# Fetch the page; the result needs GBK decoding (this site is GBK-encoded)
res = requests.get(link)
res.encoding = 'gbk'

# Parse the page text with BeautifulSoup
soup = BeautifulSoup(res.text, 'html.parser')
lis = soup.find_all('a')    # collect every <a> tag on the page
lis = lis[42:-13]           # drop everything that is not a chapter link
# Store every {chapter name: link} pair in urldict
urldict = {}

# Looking at the chapter URLs and the code below, all we need here is the site root: split_link = 'https://www.biqukan.com/'
tmp = link.split("/")
split_link = "{0}//{1}/".format(tmp[0], tmp[2])
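# Worked example: for link = 'https://www.biqukan.com/1_1094',
# link.split("/") yields ['https:', '', 'www.biqukan.com', '1_1094'],
# so split_link becomes 'https://www.biqukan.com/'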

# Form {chapter name: full link} pairs and add them to the big dictionary urldict
for i in range(len(lis)):
    print({lis[i].string: split_link + lis[i].attrs['href']})
    urldict.update({lis[i].string: split_link + lis[i].attrs['href']})

from tqdm import tqdm
for key in tqdm(urldict.keys()):
    tmplink = urldict[key]          # the chapter's link
    res = requests.get(tmplink)     # fetch the chapter page
    res.encoding = 'gbk'

    soup = BeautifulSoup(res.text, 'html.parser')      # parse the chapter HTML
    content = soup.find_all('div', id='content')[0]    # the div holding the chapter body text

    with open('text{}.txt'.format(key), 'a+', encoding='utf8') as f:
        f.write(content.text.replace('\xa0', ''))      # write it out, stripping non-breaking spaces
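
One caveat: chapter titles can contain characters that are not legal in file names on some systems. A minimal sketch of a sanitizing helper (the name safe_name and the replacement set are my assumptions, not part of the original code):

import re

def safe_name(title):
    # Replace characters that are illegal in Windows file names (assumption:
    # this conservative set is enough for chapter titles on this site)
    return re.sub(r'[\\/:*?"<>|]', '_', title)

# Usage: open('text{}.txt'.format(safe_name(key)), 'a+', encoding='utf8')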

Origin: blog.csdn.net/weixin_43469047/article/details/104188941