When I was crawling today, after crawling 20 programs, I got stuck inexplicably. I thought it was blocked by the server. I also added a user-agent pool and randomly obtained user-agent to form headers. I didn't expect that there was a problem with the last file naming. An illegal character appears in the string used for naming. Find information on the Internet and construct a function to remove illegal characters in the string through regular expressions:
import re
def validateTitle(title):
rstr = r"[\/\\\:\*\?\"\<\>\|]" # '/ \ : * ? " < > |'
new_title = re.sub(rstr, "_", title) # 替换为下划线
return new_title
Successfully solved the problem!
reference:
https://www.polarxiong.com/archives/Python-%E6%9B%BF%E6%8D%A2%E6%88%96%E5%8E%BB%E9%99%A4%E4%B8%8D%E8%83%BD%E7%94%A8%E4%BA%8E%E6%96%87%E4%BB%B6%E5%90%8D%E7%9A%84%E5%AD%97%E7%AC%A6.html
————————————————
Copyright Statement: This article is the original article of CSDN blogger "Burette_Lee", and it follows the CC 4.0 BY-SA copyright agreement. Please attach the original source link and this statement for reprinting .
Original link: https://blog.csdn.net/qq_29303759/article/details/81944733