How to decode the mail: Python's email module has dedicated tools for this.
Advice on the Internet says to decode the mail as UTF-8 first and then parse it. I found that this causes many bugs. Since the mail arrives as bytes, I parse it as bytes.
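Once the mail connection handle is configured (it is called p in the code below), fetch message number i and parse it:
import email
mail=p.retr(i)[1]  # retr() returns (response, lines, octets); [1] is the list of byte lines
mail=b'\n'.join(mail)  # join the lines back into one bytes object
msg=email.message_from_bytes(mail)  # parse directly from bytes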
This gives you the parsed message.
Here msg is an email.message.Message object, the class used for working with the mail.
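The parsing step can be tried without a mail server. The sketch below fakes the list of byte lines that p.retr(i)[1] would return; the sample headers and body are invented purely for illustration.

```python
import email

# Stand-in for p.retr(i)[1]: a list of byte lines without line endings.
# The message content is made up for this example.
lines = [
    b"From: sender@example.com",
    b"Subject: test message",
    b"Content-Type: text/plain; charset=utf-8",
    b"",
    b"hello",
]

raw = b"\n".join(lines)              # same join as in the code above
msg = email.message_from_bytes(raw)  # parse from bytes, no manual decoding

print(type(msg).__name__)            # class of the parsed object
print(msg["Subject"])                # header lookup by name
print(msg.is_multipart())            # False for this single-part message
```

No decode() call is needed before parsing; the parser works on the raw bytes.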
Traverse all parts of the message with the walk() method of the message object.
Next comes get_payload. People on the Internet say you have to check for multipart first, but in my experience there is no problem when parsing from bytes, because walk() flattens the structure.
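Take a directory tree as an example:
C:\
C:\d1
C:\d2
C:\d3\s1
After walking it you get d1, d2, s1 in order, so the multipart nesting can be ignored. (walk() also yields the multipart containers themselves, but the content-type check in the code below skips them.)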
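The flattening can be checked on a small nested message. This is only a sketch: the message is built with the standard email.mime classes and made-up content, then serialized and parsed back from bytes to mirror the POP3 case.

```python
import email
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Nested structure: mixed -> (alternative -> plain, html), plain
alt = MIMEMultipart("alternative")
alt.attach(MIMEText("plain body", "plain", "utf-8"))
alt.attach(MIMEText("<p>html body</p>", "html", "utf-8"))

outer = MIMEMultipart("mixed")
outer.attach(alt)
outer.attach(MIMEText("a second text part", "plain", "utf-8"))

# Round-trip through bytes, as if the mail had come from the server
msg = email.message_from_bytes(outer.as_bytes())

# walk() visits the message itself first, then every subpart depth-first
all_types = [part.get_content_type() for part in msg.walk()]
# Dropping the containers leaves only the leaf parts, in order
leaf_types = [part.get_content_type() for part in msg.walk()
              if not part.is_multipart()]

print(all_types)
print(leaf_types)
```

The containers appear in the first list but carry no text of their own, so filtering on content type (or on is_multipart()) is enough and no recursion is needed.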
get_content_charset() returns the charset declared for the part.
Then call get_payload(decode=True) to get the body as bytes, and decode those bytes with that charset to get the text body.
def get_file(msg):
    data_char=''
    for part in msg.walk():
        part_charset=part.get_content_charset()  # may be None if no charset is declared
        print(part_charset)
        part_type=part.get_content_type()
        #print(part_type)
        if part_type=="text/plain" or part_type=='text/html':
            data=part.get_payload(decode=True)  # body as bytes
            try:
                data=data.decode(part_charset,errors="replace")
            except (LookupError,TypeError):
                # unknown or missing charset: fall back to gb2312
                data=data.decode('gb2312',errors="replace")
            data=html_to_plain_text(data)
            data_char=data_char+'\n'+data
    return data_char+'\n'
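The charset handling inside get_file can be isolated into a small helper. The sketch below is one possible way to write it, not the only one: decode_part is a hypothetical name, and the gb2312 fallback simply mirrors the one used above.

```python
import email
from email.mime.text import MIMEText

def decode_part(part, fallback="gb2312"):
    """Return the text of one part, or '' if the part has no payload bytes."""
    data = part.get_payload(decode=True)  # bytes; None for multipart containers
    if data is None:
        return ""
    # get_content_charset() returns None when no charset is declared
    charset = part.get_content_charset() or fallback
    try:
        return data.decode(charset, errors="replace")
    except LookupError:                   # charset name unknown to Python
        return data.decode(fallback, errors="replace")

# Made-up sample: a UTF-8 text part, round-tripped through bytes
part = email.message_from_bytes(MIMEText("你好", "plain", "utf-8").as_bytes())
print(part.get_content_charset())
print(decode_part(part))
```

Catching LookupError instead of using a bare except keeps unrelated errors visible while still covering charset names that Python does not recognize.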
When a mail is processed, you usually get two parts: the HTML part and the plain-text part.
The function below converts HTML into plain text.
I don't understand it in much detail myself, so I won't explain it...
import re
from html import unescape
# This function converts HTML to plain text; it does a reasonably good job...
def html_to_plain_text(html):
    # drop the whole <head> section
    text = re.sub('<head.*?>.*?</head>', ' ', html, flags=re.M | re.S | re.I)
    # mark where links were
    text = re.sub(r'<a\s.*?>', ' HYPERLINK ', text, flags=re.M | re.S | re.I)
    # strip all remaining tags
    text = re.sub('<.*?>', ' ', text, flags=re.M | re.S)
    # collapse runs of blank lines
    text = re.sub(r'(\s*\n)+', '\n', text, flags=re.M | re.S)
    return unescape(text)
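Regex stripping works for simple mails but can trip over unusual markup. As an alternative sketch (not the method used in this post), the standard library's html.parser can do the same job; TextExtractor and html_to_text are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping head, script and style content."""
    SKIPPED = ("head", "script", "style")

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turns &amp; into & and so on
        self.chunks = []
        self._skip = 0                           # depth inside skipped tags

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # keep non-empty text that is not inside a skipped tag
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()                               # flush any buffered text
    return "\n".join(parser.chunks)

sample = ("<html><head><title>t</title></head>"
          "<body><p>Hello &amp; bye</p><p>second</p></body></html>")
print(html_to_text(sample))
```

Unlike the regex version, this sketch does not insert a HYPERLINK marker; that could be added in handle_starttag if it is wanted.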
The result of processing a mail with get_file and html_to_plain_text is shown here (I removed some sensitive information for privacy):
服 <[email protected]>
<[email protected]>
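Hello! As installation and maintenance staff for Gaoxin Expert, and given the earlier situation in the village, please help restore the original permissions. Thank you.
Dear provincial company leaders:
Hello!
As installation and maintenance staff for Gaoxin Expert, and given the earlier situation in the village, please help restore the original permissions. Thank you.
Hello! As installation and maintenance staff for Gaoxin Expert, and given the earlier situation in the village, please help restore the original permissions. Thank you.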
As you can see, the text is duplicated, because the same content appears in both the HTML part and the plain-text part.
If you only want unformatted text, just read text/plain. I did not want to miss any information, so I collect both the HTML and the plain-text parts.
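If the duplication matters more than completeness, one option is to prefer text/plain and use the HTML part only when no plain part exists. The sketch below assumes that trade-off; preferred_text is a hypothetical name, the sample message is made up, and the one-line tag stripping is a crude stand-in for a real HTML-to-text step.

```python
import email
import re
from html import unescape
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def preferred_text(msg, fallback="gb2312"):
    """Return plain-text parts if any exist, otherwise stripped HTML parts."""
    plain, html = [], []
    for part in msg.walk():
        ctype = part.get_content_type()
        if ctype not in ("text/plain", "text/html"):
            continue                      # skips multipart containers too
        data = part.get_payload(decode=True)
        if data is None:
            continue
        charset = part.get_content_charset() or fallback
        try:
            text = data.decode(charset, errors="replace")
        except LookupError:               # charset name unknown to Python
            text = data.decode(fallback, errors="replace")
        (plain if ctype == "text/plain" else html).append(text)
    if plain:
        return "\n".join(plain)
    # Crude tag stripping, only for illustration
    return "\n".join(unescape(re.sub(r"<[^>]+>", " ", h)).strip() for h in html)

alt = MIMEMultipart("alternative")
alt.attach(MIMEText("hello there", "plain", "utf-8"))
alt.attach(MIMEText("<p>hello there</p>", "html", "utf-8"))
msg = email.message_from_bytes(alt.as_bytes())
print(preferred_text(msg))                # the body appears once, not twice
```

The cost is the one mentioned above: anything that exists only in the HTML part is lost whenever a plain part is present.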