1. Development environment

Python 3.6.0 |Anaconda 4.3.0 (64-bit)| (default, Dec 23 2016, 11:57:41) [MSC v.1900 64 bit (AMD64)] on win32

2. Encoding

The site is encoded in gb2312:

<meta http-equiv="Content-Type" content="text/html; charset=gb2312">

So when fetching the page, set the response encoding explicitly:

req = requests.get(url=target)
req.encoding = 'gb2312'
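As a rough illustration of what setting the encoding amounts to, the sketch below decodes raw bytes by hand, without touching the network. The helper name `decode_page` and the sample bytes are hypothetical. It uses `errors="replace"`, which is also how requests builds `.text`, so bytes that are not valid gb2312 show up as `'\ufffd'` in the decoded string.

```python
def decode_page(raw: bytes, encoding: str = "gb2312") -> str:
    """Decode raw page bytes, substituting U+FFFD for undecodable bytes."""
    return raw.decode(encoding, errors="replace")


# Stand-in for req.content: valid gb2312 bytes followed by one stray byte.
sample = "中文".encode("gb2312") + b"\xff"
text = decode_page(sample)
print(repr(text))
```

This is one plausible source of the `'\ufffd'` characters that are stripped out before writing the file.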

Writing the txt file:

with open("test.txt","a",encoding='gb2312') as f:

Some characters in the page cannot be encoded as gb2312, so writing the txt file raises an error:

UnicodeEncodeError: 'gb2312' codec can't encode character '\xa0' in position 5217: illegal multibyte sequence
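To find out in advance which characters will trip the codec, one option is to try encoding each character on its own. This is only a sketch; the helper name `unencodable_chars` is made up for illustration and is not part of the original script.

```python
def unencodable_chars(text, encoding="gb2312"):
    """Return the sorted distinct characters that `encoding` cannot represent."""
    bad = set()
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            bad.add(ch)
    return sorted(bad)


# '\xa0' is the non-breaking space reported in the traceback above.
print(unencodable_chars("a\xa0b中"))
```

Running this over the scraped text gives the list of characters that need to be replaced or dropped before writing.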

So replace all of them:

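with open("test.txt", "a") as f:  # the with block closes the file, so no f.close() is needed
    # \xa0   -> non-breaking space (&nbsp;)
    # \ufffd -> Unicode replacement character
    # \u30fb -> katakana middle dot
    # paragraph breaks (from <br><br>): a newline plus a first-line indent
    f.write(text_delete_bmp.replace('\ufffd', '')
            .replace('\u30fb', '')
            .replace('\xa0', '')
            .replace('　　', "\n  ")
            .replace('\n\n', "\n  "))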
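The same chain of replacements can be pulled into a small function so it can be checked without writing a file. This is a sketch, not the original script: `clean_text` and `REPLACEMENTS` are hypothetical names, and the full-width indent is assumed to be two U+3000 ideographic spaces.

```python
# Ordered (old, new) pairs; order matters because later rules see earlier results.
REPLACEMENTS = [
    ("\ufffd", ""),              # Unicode replacement character
    ("\u30fb", ""),              # katakana middle dot
    ("\xa0", ""),                # non-breaking space (&nbsp;)
    ("\u3000\u3000", "\n  "),    # assumed: two ideographic spaces marking a paragraph
    ("\n\n", "\n  "),            # blank line -> newline plus indent
]


def clean_text(text):
    """Apply each replacement in order and return the cleaned text."""
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    return text


print(repr(clean_text("a\xa0b\ufffd\u30fbc")))
```

Keeping the rules in a list makes it easy to add another character when a new `UnicodeEncodeError` turns up.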

3. Removing specific strings

The text contains some strings that are not needed, for example:

{ewcMVIMAGE,MVIMAGE, !09100020_0014_1.bmp}{ewc MVIMAGE,MVIMAGE, !09100020_0015_1.bmp}

Remove them all with a regular expression.
Pattern: each string starts with "{ewc" and ends with ".bmp}".

text_delete_bmp = re.sub(r'\{ewc.*?\.bmp\}', "", text_context[0].text)
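A small self-contained check of this pattern, using a made-up sample string, shows why the non-greedy `.*?` matters when two markers sit close together. The function name `strip_bmp_markers` is hypothetical.

```python
import re

# Non-greedy: each match stops at the first ".bmp}" after "{ewc".
BMP_MARKER = re.compile(r"\{ewc.*?\.bmp\}")


def strip_bmp_markers(text):
    """Remove every '{ewc ... .bmp}' marker from the text."""
    return BMP_MARKER.sub("", text)


sample = "A{ewc MVIMAGE,MVIMAGE, !1.bmp}B{ewc MVIMAGE,MVIMAGE, !2.bmp}C"
print(strip_bmp_markers(sample))

# For comparison, a greedy pattern spans from the first "{ewc" to the last ".bmp}".
greedy_result = re.sub(r"\{ewc.*\.bmp\}", "", sample)
print(greedy_result)
```

With the greedy form, the text between the two markers is lost as well. If a marker can span several lines, passing `re.DOTALL` would be needed, since `.` does not match a newline by default.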

4. Full code

Download

2021-7-3 Web scraping 22 - Scraping a novel and saving it to txt (python3.6, static page, requests.get, removing specific strings)
