2021-7-3 Web scraping 22: scraping a novel and saving it to txt (Python 3.6, static page, requests.get, removing specific strings)


1. Development environment

Python 3.6.0 |Anaconda 4.3.0 (64-bit)| (default, Dec 23 2016, 11:57:41) [MSC v.1900 64 bit (AMD64)] on win32

2. Encoding

The site is encoded as gb2312, as declared in the page header:

<meta http-equiv="Content-Type" content="text/html; charset=gb2312">

So when fetching the page, set the response encoding explicitly:

req = requests.get(url=target)
req.encoding = 'gb2312'
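
Here is a small sketch of why the explicit encoding matters. It uses only the standard library so it runs offline, and the sample string and variable names are illustrative assumptions, not taken from the site. When a text/html response has no charset in its HTTP header, requests falls back to ISO-8859-1, which would decode gb2312 bytes into mojibake along these lines:

```python
# Bytes as a gb2312 server would send them (the sample text is an assumption).
raw = "小说".encode("gb2312")

# Decoding with the codec the page declares recovers the original text.
decoded_ok = raw.decode("gb2312")

# Decoding with ISO-8859-1 never raises, but every byte becomes an unrelated character.
decoded_wrong = raw.decode("iso-8859-1")

print(decoded_ok == "小说")
print(ascii(decoded_wrong))
```

Setting req.encoding before reading req.text has the same effect as choosing the right codec in the first decode call above.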

Writing the txt file:

with open("test.txt","a",encoding='gb2312') as f:

Some characters in the page cannot be encoded as gb2312, so writing the txt file raises an error:

UnicodeEncodeError: 'gb2312' codec can't encode character '\xa0' in position 5217: illegal multibyte sequence
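
To find out which characters trigger this error before deciding what to strip, one option is to try encoding each character individually. The helper name unencodable_chars and the sample string below are hypothetical; this is a sketch, not the author's code:

```python
def unencodable_chars(text, encoding="gb2312"):
    """Return the sorted distinct characters of text that the codec cannot encode."""
    bad = set()
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            bad.add(ch)
    return sorted(bad)


# Sample text (an assumption) with a non-breaking space and a replacement character.
sample = "小说\xa0正文\ufffd"
found = unencodable_chars(sample)
print([ascii(ch) for ch in found])
```

Running this over the scraped text lists every offending character at once, instead of revealing them one error at a time.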

The fix is to replace all of them before writing:

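with open("test.txt","a") as f:
    # \xa0   -> non-breaking space (&nbsp; in the HTML)
    # \ufffd -> Unicode replacement character
    # \u30fb -> katakana middle dot
    # Replace each pair of <br><br> with a line break plus a first-line paragraph indent
    f.write(text_delete_bmp.replace('\ufffd','').\
        replace('\u30fb','').\
        replace('\xa0', '').\
        replace('  ',"\n  ").\
        replace('\n\n',"\n  "))  # the with block closes the file automatically, so no f.close() is needed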
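
As an alternative to listing each character by hand, the codec itself can drop whatever it cannot represent. This swaps in a different technique from the replace chain above: it relies on the errors="ignore" option of str.encode, and the function name drop_unencodable is hypothetical.

```python
def drop_unencodable(text, encoding="gb2312"):
    """Remove every character that the target codec cannot encode."""
    # errors="ignore" skips unencodable characters without raising;
    # decoding the result gives back a clean str.
    return text.encode(encoding, errors="ignore").decode(encoding)


# Sample text (an assumption) with a non-breaking space and a replacement character.
sample = "小说\xa0正文\ufffd"
cleaned = drop_unencodable(sample)
print(cleaned == "小说正文")
```

The trade-off is that characters disappear silently, whereas the explicit replace calls document exactly what is being removed.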

3. Removing specific strings

The text contains certain strings that are not wanted, for example:

{ewcMVIMAGE,MVIMAGE, !09100020_0014_1.bmp}{ewc MVIMAGE,MVIMAGE, !09100020_0015_1.bmp}

Remove them all with a regular expression.
Pattern: the string starts with "{ewc" and ends with ".bmp}".

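import re

# Non-greedy match from "{ewc" to the nearest ".bmp}"; braces are escaped to keep them literal
text_delete_bmp = re.sub(r'\{ewc.*?\.bmp\}', "", text_context[0].text)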
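
Here is a quick standalone check of this pattern on a made-up chapter fragment. The surrounding sample text is an assumption, while the two markers are the ones quoted above:

```python
import re

# Hypothetical chapter text with two adjacent image markers in the middle.
sample = (
    "first paragraph"
    "{ewc MVIMAGE,MVIMAGE, !09100020_0014_1.bmp}"
    "{ewc MVIMAGE,MVIMAGE, !09100020_0015_1.bmp}"
    "second paragraph"
)

# Non-greedy ".*?" stops at the first ".bmp}" after each "{ewc".
MARKER = re.compile(r"\{ewc.*?\.bmp\}")
cleaned = MARKER.sub("", sample)
print(cleaned)
```

Note that "." does not match a newline by default, so if a marker could span several lines, the pattern would need the re.DOTALL flag.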

4. Full code

Download


Reposted from blog.csdn.net/weixin_42555985/article/details/118439662