How Python automatically processes emails (4): getting the email text with Python

How do you transcode the message? The email package has dedicated tools for this.
A common suggestion online is to decode the message as UTF-8 first and then read it. I found that approach buggy. Since the mail arrives as bytes, I parse it as bytes.

Once the handle is configured (p in the code below), fetch message i and parse it:

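        mail=p.retr(i)[1]                   # list of byte lines for message i
        mail=b'\n'.join(mail)               # join the lines into one bytes object
        msg=email.message_from_bytes(mail)  # parse the bytes into a message object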

This gives you a parsed message.

Here msg is an instance of the class that the email package uses for processing mail.
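
To make the parsing step concrete, here is a minimal sketch that runs without a mail server. The list named lines is a hypothetical stand-in for what p.retr(i)[1] returns; the addresses and text in it are made up for illustration.

```python
import email

# Hypothetical sample: a list of byte lines, shaped like p.retr(i)[1].
lines = [
    b"From: sender@example.com",
    b"Subject: Test",
    b"Content-Type: text/plain; charset=utf-8",
    b"",
    b"Hello from the mail body.",
]

raw = b"\n".join(lines)              # same join as in the snippet above
msg = email.message_from_bytes(raw)  # parse the raw bytes into a message

print(type(msg).__name__)            # Message
print(msg["Subject"])                # Test
print(msg.get_content_type())        # text/plain
```

Headers are read with dictionary-style access, and the content type comes from the Content-Type header.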

Traverse all parts of the message with the walk() method of the message object.
Then call get_payload() on each part.
Some people online say you need to check for multipart before calling get_payload(), but in my experience there is no problem when the message was parsed from bytes: walk() flattens the structure.
Take a directory tree as an example:

C:\
C:\d1
c:\d2
c:\d3\s1

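After walking it you end up with d1, d2, and s1 (the containers themselves are filtered out later by the content-type check), so you do not have to handle the multipart nesting yourself.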
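
The flattening can be checked on a small nested message. This is only a sketch: the message is built by hand with the standard MIME classes, and its structure is a made-up example. Note that walk() also yields the multipart containers; filtering on is_multipart() (or on the content type) leaves only the leaf parts.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Hypothetical nested message: mixed -> (alternative -> plain, html), plain
inner = MIMEMultipart("alternative")
inner.attach(MIMEText("plain body", "plain", "utf-8"))
inner.attach(MIMEText("<p>html body</p>", "html", "utf-8"))

outer = MIMEMultipart("mixed")
outer.attach(inner)
outer.attach(MIMEText("second text part", "plain", "utf-8"))

# walk() visits the message depth first, containers included.
types = [part.get_content_type() for part in outer.walk()]
print(types)

# Skipping the containers leaves only the leaf parts, in order.
leaves = [part.get_content_type() for part in outer.walk()
          if not part.is_multipart()]
print(leaves)
```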

get_content_charset() returns the character encoding declared by the part.
Then call get_payload(decode=True) to get the part's content as bytes, and decode those bytes with that charset to get the text body.
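
For a single part, the two calls fit together roughly as follows. decode_part is a hypothetical helper name, and the gb2312 fallback is an assumption borrowed from the author's own fallback; adjust it to whatever your mail usually uses.

```python
from email.mime.text import MIMEText

def decode_part(part, fallback="gb2312"):
    """Return the text of one non-multipart part (hypothetical helper)."""
    raw = part.get_payload(decode=True)               # bytes, transfer encoding undone
    charset = part.get_content_charset() or fallback  # None if no charset is declared
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:                               # charset name unknown to Python
        return raw.decode(fallback, errors="replace")

# Build a small gb2312 part by hand to try the helper out.
part = MIMEText("你好", "plain", "gb2312")
print(part.get_content_charset())  # gb2312
print(decode_part(part))           # 你好
```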

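def get_file(msg):
    # Collect the decoded text of every text/plain and text/html part.
    data_char=''
    for part in msg.walk():
        part_charset=part.get_content_charset()  # None if no charset is declared
        print(part_charset)  # debug output
        part_type=part.get_content_type()
        if part_type=="text/plain" or part_type=='text/html':
            data=part.get_payload(decode=True)  # bytes, transfer encoding undone
            try:
                data=data.decode(part_charset,errors="replace")
            except (LookupError,TypeError):
                # Unknown or missing charset: fall back to gb2312.
                data=data.decode('gb2312',errors="replace")
            data=html_to_plain_text(data)  # defined below; strips any HTML tags
            data_char=data_char+'\n'+data
    return data_char+'\n'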

When a message is processed this way, you usually get two parts: the HTML part and the plain-text part.
The function below converts HTML into plain text.
I don't understand its internals well myself, so I won't explain it in detail.

import re
from html import unescape

# This function converts HTML into plain text; it does a reasonably good job.
def html_to_plain_text(html):
    text = re.sub('<head.*?>.*?</head>', ' ', html, flags=re.M | re.S | re.I)
    text = re.sub(r'<a\s.*?>', ' HYPERLINK ', text, flags=re.M | re.S | re.I)
    text = re.sub('<.*?>', ' ', text, flags=re.M | re.S)
    text = re.sub(r'(\s*\n)+', '\n', text, flags=re.M | re.S)
    return unescape(text)
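
As a rough reading of what each substitution does, the same patterns can be applied one at a time to a small sample. The sample HTML string is made up for illustration, and the comments are my interpretation of the regular expressions, not the author's explanation.

```python
import re
from html import unescape

# Hypothetical sample input.
sample = ('<html><head><title>t</title></head><body>'
          '<p>Hi &amp; welcome</p>'
          '<a href="http://example.com">link</a></body></html>')

flags = re.M | re.S | re.I
step1 = re.sub('<head.*?>.*?</head>', ' ', sample, flags=flags)  # drop the whole <head> block
step2 = re.sub(r'<a\s.*?>', ' HYPERLINK ', step1, flags=flags)   # mark each link opening tag
step3 = re.sub('<.*?>', ' ', step2, flags=re.M | re.S)           # strip all remaining tags
step4 = re.sub(r'(\s*\n)+', '\n', step3, flags=re.M | re.S)      # collapse runs of blank lines
result = unescape(step4)                                         # turn &amp; back into &

print(result.split())
```

Because the approach is regex-based, it is a heuristic: it works on typical mail HTML but is not a full HTML parser.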

A mail processed this way produces the following result (I removed some sensitive information):

<[email protected]>
 <[email protected]>
你好!为高新专家装维人员,因前期村的情况,还请协助恢复原有权限,谢谢



省公司领导:
  您好!
为高新专家装维人员,因前期村的情况,还请协助恢复原有权限,谢谢
 您好!  为高新专家装维人员,因前期村的情况,还请协助恢复原有权限,谢谢   


As you can see, the text is duplicated, because the same content appears in both the plain-text part and the HTML part.

If you only want the unformatted text, read just the text/plain parts. I don't want to miss any information, so I collect both the HTML and the plain-text parts.
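
A plain-text-only variant might look like the sketch below. get_plain_text is a hypothetical name, the gb2312 fallback is the same assumption as in get_file, and the test message is built by hand. Keep in mind that some mails carry only an HTML part, in which case this returns an empty string.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def get_plain_text(msg, fallback="gb2312"):
    """Collect only the text/plain parts (hypothetical variant of get_file)."""
    chunks = []
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue  # skip containers, HTML parts, and attachments
        raw = part.get_payload(decode=True)
        charset = part.get_content_charset() or fallback
        try:
            chunks.append(raw.decode(charset, errors="replace"))
        except LookupError:  # charset name unknown to Python
            chunks.append(raw.decode(fallback, errors="replace"))
    return "\n".join(chunks)

# Hypothetical message carrying the same content twice.
msg = MIMEMultipart("alternative")
msg.attach(MIMEText("Hello, plain version", "plain", "utf-8"))
msg.attach(MIMEText("<p>Hello, html version</p>", "html", "utf-8"))

print(get_plain_text(msg))  # Hello, plain version
```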

Origin blog.csdn.net/weixin_45642669/article/details/113592016