Parsing eml mail files with Java

Basic introduction

Email requirements usually come down to sending or receiving mail, and the technologies I have chosen for that in the past include Java Mail, Apache Commons Email, and Spring Mail. A recent work requirement was different: parsing files in eml format. This article records what I learned about parsing eml files with Java. The eml format is a file format used by Microsoft in Outlook that follows RFC822 and its subsequent extensions, and it has become a common format that mail software uses to store messages as local files. The name is short for "E-mail". An eml file can be opened by Outlook or by other local mail clients such as Foxmail and Notes.

After consulting several sources, I found that Java Mail and Mime4J (a sub-project of Apache James) are the main options for parsing. Apache James has a modular architecture based on a rich set of modern and efficient components, which together provide a complete, stable, secure and scalable mail server. James is composed of internal projects (Server, Mailet, Mailbox, Protocols, MPT) and external projects (Hupa, Mime4J, jSieve, jSPF, jDKIM). Of these, Mime4J is the one that implements parsing of mail data, as shown in the following figure:

Apache James Mime4J provides a parser for email message streams in plain RFC822 and MIME format. The parser uses a callback mechanism to report parsing events, such as the start of an entity header or the start of a body. If you are familiar with the SAX XML parser interface, you should have no problem getting started with Mime4J. Mime4J can also build a tree representation of an email using its message classes. In that mode it automatically handles the decoding of fields and bodies and uses temporary files for large attachments.
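To make the callback idea concrete without depending on the library, here is a minimal sketch that uses only the JDK. The MailEvents interface and the TinyEventParser class are hypothetical names invented for this illustration. They are not Mime4J's API, and the sketch only handles a simple, non-multipart message with no encodings.

```java
/** Hypothetical callback interface; a real parser reports many more events. */
interface MailEvents {
    void field(String name, String value);
    void body(String text);
}

/** Hypothetical event-style parser for a simple, non-multipart RFC822 message. */
final class TinyEventParser {

    static void parse(String raw, MailEvents events) {
        String[] lines = raw.split("\r?\n", -1);
        String pending = null; // header line being assembled
        int i = 0;
        // Headers run until the first empty line
        for (; i < lines.length && !lines[i].isEmpty(); i++) {
            String line = lines[i];
            if (pending != null && (line.startsWith(" ") || line.startsWith("\t"))) {
                pending += " " + line.trim(); // continuation of a folded header
            } else {
                emit(pending, events);
                pending = line;
            }
        }
        emit(pending, events);
        // Everything after the first empty line is reported as the body
        StringBuilder body = new StringBuilder();
        for (i++; i < lines.length; i++) {
            body.append(lines[i]);
            if (i < lines.length - 1) {
                body.append('\n');
            }
        }
        events.body(body.toString());
    }

    private static void emit(String headerLine, MailEvents events) {
        if (headerLine == null) {
            return;
        }
        int colon = headerLine.indexOf(':');
        if (colon > 0) {
            events.field(headerLine.substring(0, colon).trim(), headerLine.substring(colon + 1).trim());
        }
    }
}
```

The caller never sees a message object. It only receives events in document order, which is the same trade-off a SAX parser makes for XML.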

Parsing implementation

(1) Use QQ Mail to compose an email, send it, and export it as a local eml file. The information (parameters that can be extracted) in the email includes the following:

A. Email subject;

B. Email content: the body may be plain text or rich text, and rich text includes HTML, local pictures embedded in the content area, etc.;

C. Recipients: there can be several, and each one has a nickname and an actual email address;

D. CC, same as the recipients;

E. Bcc, same as the recipients;

F. Attachments: there can be multiple attachment files;

G. Sending time, including whether the time zone is handled correctly;

H. Mail size, i.e. the size of the mail file;

I. Message-ID, the unique ID of the mail;
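Items C to E separate a nickname from the actual address. As a rough illustration of what that split means, the sketch below handles only the two simplest header forms, `Name <address>` and a bare address, using the JDK alone. AddressSketch is a hypothetical name. It does not cover groups, comments, or commas inside quoted names, which is why the parsing example in this article relies on Mime4J's own address classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AddressSketch {

    // Optional display name (optionally quoted) followed by <address>
    private static final Pattern NAMED =
            Pattern.compile("^\\s*\"?([^\"<]*?)\"?\\s*<([^>]+)>\\s*$");

    /** Returns {nickname, address} pairs; the nickname is "" when absent. */
    static List<String[]> parse(String headerValue) {
        List<String[]> result = new ArrayList<>();
        for (String item : headerValue.split(",")) {
            if (item.trim().isEmpty()) {
                continue;
            }
            Matcher m = NAMED.matcher(item);
            if (m.matches()) {
                result.add(new String[] {m.group(1).trim(), m.group(2).trim()});
            } else {
                result.add(new String[] {"", item.trim()});
            }
        }
        return result;
    }
}
```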

(2) The test email was constructed deliberately: it has multiple recipients and CCs, multiple attachments, and a body that contains HTML rich text paragraphs and local image files, as shown in the following figure:

(3) Import the Maven dependency (version 0.8.9 was released in early January 2023). In practice, importing the apache-mime4j-examples coordinates pulls in the several modules it depends on directly. A real application should consider depending on the individual modules instead, or excluding the examples and commons-logging dependencies as needed. The reference coordinates are as follows:

<!-- https://mvnrepository.com/artifact/org.apache.james/apache-mime4j-examples -->

<dependency>
    <groupId>org.apache.james</groupId>
    <artifactId>apache-mime4j-examples</artifactId>
    <version>0.8.9</version>
</dependency>
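If only the parsing classes are needed, a narrower dependency may be enough. The fragment below is a sketch that assumes the apache-mime4j-dom artifact, which brings in apache-mime4j-core transitively, covers the Message, Entity and address classes used in the parsing example. The other libraries used there (Lombok, fastjson, Guava, commons-lang3) would still need their own coordinates, so verify against your own build before relying on it.

```xml
<dependency>
    <groupId>org.apache.james</groupId>
    <artifactId>apache-mime4j-dom</artifactId>
    <version>0.8.9</version>
</dependency>
```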

(4) Parsing implementation example:

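package cn.chendd.eml;

import com.alibaba.fastjson.annotation.JSONField;
import com.google.common.collect.Lists;
import lombok.Data;
import org.apache.commons.lang3.tuple.MutableTriple;
import org.apache.commons.lang3.tuple.Pair;
import org.apache.james.mime4j.dom.Message;

import java.io.InputStream;
import java.util.List;

/**
 * Data object holding the values parsed from an eml file
 *
 * @author chendd
 * @date 2023/2/11 21:40
 */
@Data
public class EmlEntry {

    /**
     * Original message object (excluded from JSON output)
     */
    @JSONField(serialize = false)
    private Message message;

    /**
     * Message ID
     */
    private String messageId;

    /**
     * Mail subject
     */
    private String subject;

    /**
     * Plain text mail content
     */
    private String textContent;

    /**
     * Rich text (HTML) mail content
     */
    private String htmlContent;

    /**
     * Mail attachments: file name, size and content stream
     */
    private List<MutableTriple<String, Long, InputStream>> attachments = Lists.newArrayList();

    /**
     * Sender
     */
    private String from;

    /**
     * Recipients: nickname and address
     */
    private List<Pair<String, String>> to;

    /**
     * CC recipients: nickname and address
     */
    private List<Pair<String, String>> cc;

    /**
     * Bcc recipients: nickname and address
     */
    private List<Pair<String, String>> bcc;

    /**
     * Sending time
     */
    private String dateTime;

}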

package cn.chendd.eml;

import com.alibaba.fastjson.JSON;
import com.google.common.collect.Lists;
import org.apache.commons.io.IOUtils;
import org.apache.commons.lang3.StringUtils;
import org.apache.commons.lang3.tuple.MutableTriple;
import org.apache.commons.lang3.tuple.Pair;
import org.apache.james.mime4j.dom.BinaryBody;
import org.apache.james.mime4j.dom.Body;
import org.apache.james.mime4j.dom.Entity;
import org.apache.james.mime4j.dom.Header;
import org.apache.james.mime4j.dom.Message;
import org.apache.james.mime4j.dom.Multipart;
import org.apache.james.mime4j.dom.SingleBody;
import org.apache.james.mime4j.dom.TextBody;
import org.apache.james.mime4j.dom.address.AddressList;
import org.apache.james.mime4j.dom.address.Mailbox;
import org.apache.james.mime4j.dom.address.MailboxList;
import org.apache.james.mime4j.dom.field.ContentDispositionField;
import org.apache.james.mime4j.dom.field.ContentTypeField;
import org.apache.james.mime4j.dom.field.FieldName;
import org.apache.james.mime4j.stream.Field;

import java.io.IOException;
import java.io.InputStream;
import java.text.SimpleDateFormat;
import java.util.Base64;
import java.util.Collections;
import java.util.List;
import java.util.TimeZone;

/**
 * Basic eml file parsing example
 *
 * @author chendd
 * @date 2023/2/11 19:26
 */
public class EmlBasicTest {

    public static void main(String[] args) {

        try (InputStream inputStream = EmlBasicTest.class.getResourceAsStream("/Java解析Eml格式文件示例.eml")) {
            Message message = Message.Builder.of(inputStream).build();
            EmlEntry entry = new EmlEntry();
            entry.setMessage(message);
            entry.setMessageId(message.getMessageId());
            entry.setSubject(message.getSubject());
            entry.setFrom(address2String(message.getFrom()));
            entry.setTo(address2List(message.getTo()));
            entry.setCc(address2List(message.getCc()));
            entry.setBcc(address2List(message.getBcc()));
            // Format the sending time in GMT; the Date header may be missing
            if (message.getDate() != null) {
                SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
                sdf.setTimeZone(TimeZone.getTimeZone("GMT"));
                entry.setDateTime(sdf.format(message.getDate()));
            }
            // Mail content and attachments; a non-multipart message is handled as a single part
            Body body = message.getBody();
            List<Entity> bodyParts = body instanceof Multipart
                    ? ((Multipart) body).getBodyParts()
                    : Collections.<Entity>singletonList(message);
            outputContentAndAttachments(bodyParts, entry);
            System.out.println(JSON.toJSONString(entry, true));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /**
     * Recursively processes attachments (files in the attachment area and base64 images
     * embedded in the content) and mail content (plain text, HTML rich text)
     * @param bodyParts body parts of the mail
     * @param entry data object
     * @throws IOException if a part cannot be read
     */
    private static void outputContentAndAttachments(List<Entity> bodyParts, EmlEntry entry) throws IOException {
        for (Entity bodyPart : bodyParts) {
            Body bodyContent = bodyPart.getBody();
            String dispositionType = bodyPart.getDispositionType();
            if (ContentDispositionField.DISPOSITION_TYPE_ATTACHMENT.equals(dispositionType)
                    && bodyContent instanceof SingleBody) {
                // Regular attachment file; it may be a text body as well as a binary body
                SingleBody singleBody = (SingleBody) bodyContent;
                entry.getAttachments().add(MutableTriple.of(bodyPart.getFilename(), singleBody.size(), singleBody.getInputStream()));
                continue;
            }
            if (bodyContent instanceof TextBody) {
                // Text content; the charset can be read from textBody.getMimeCharset() if a conversion is needed
                TextBody textBody = (TextBody) bodyContent;
                String mimeType = bodyPart.getMimeType();
                if ("text/plain".equals(mimeType)) {
                    entry.setTextContent(IOUtils.toString(textBody.getReader()));
                } else if ("text/html".equals(mimeType)) {
                    entry.setHtmlContent(IOUtils.toString(textBody.getReader()));
                }
            } else if (bodyContent instanceof Multipart) {
                outputContentAndAttachments(((Multipart) bodyContent).getBodyParts(), entry);
            } else if (bodyContent instanceof BinaryBody) {
                outputContentInAttachment(bodyPart.getHeader(), (BinaryBody) bodyContent, entry);
            } else {
                System.err.println("Is there another content type scenario that is not covered yet?");
            }
        }
    }

    /**
     * Processes an image embedded in the content
     *
     * @param header      header of the embedded part
     * @param binaryBody  body of the embedded part
     * @param entry       parsed data object
     */
    private static void outputContentInAttachment(Header header, BinaryBody binaryBody, EmlEntry entry) throws IOException {
        Field contentIdField = header.getField(FieldName.CONTENT_ID);
        Field typeField = header.getField(FieldName.CONTENT_TYPE);
        // Without a Content-ID or already parsed HTML there is nothing to substitute
        if (contentIdField == null || entry.getHtmlContent() == null || !(typeField instanceof ContentTypeField)) {
            return;
        }
        ContentTypeField contentTypeField = (ContentTypeField) typeField;
        if ("image".equalsIgnoreCase(contentTypeField.getMediaType())) {
            try (InputStream inputStream = binaryBody.getInputStream()) {
                String base64 = Base64.getEncoder().encodeToString(IOUtils.toByteArray(inputStream));
                String cid = StringUtils.substringBetween(contentIdField.getBody(), "<", ">");
                // Replace the cid: reference in the HTML with an inline data: URI
                String content = StringUtils.replace(entry.getHtmlContent(),
                        "cid:" + cid, "data:" + contentTypeField.getMimeType() + ";base64," + base64);
                entry.setHtmlContent(content);
            }
        }
    }

    /**
     * Converts the sender list to a String (only the first sender is used)
     * @param addressList mail contacts
     * @return String value
     */
    private static String address2String(MailboxList addressList) {
        if (addressList == null || addressList.isEmpty()) {
            return StringUtils.EMPTY;
        }
        return addressList.get(0).toString();
    }

    /**
     * Converts mail contacts to a list of nickname and address pairs
     * @param addressList mail contacts
     * @return list of pairs
     */
    private static List<Pair<String, String>> address2List(AddressList addressList) {
        List<Pair<String, String>> list = Lists.newArrayList();
        if (addressList == null) {
            return list;
        }
        // flatten() expands address groups so that every element is a Mailbox
        for (Mailbox mailbox : addressList.flatten()) {
            list.add(Pair.of(mailbox.getName(), mailbox.getAddress()));
        }
        return list;
    }
}
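The example above formats the sending time in GMT, which relates to the time zone question in item G. As a small illustration of how the same instant reads in different zones, here is a JDK-only sketch. DateSketch is a hypothetical name and the sample header value in the usage note is made up. It assumes the Date header is in the usual RFC822/RFC1123 form with a numeric offset.

```java
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

final class DateSketch {

    private static final DateTimeFormatter OUT = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    /** Parses a Date header value and renders the same instant in the given zone. */
    static String format(String dateHeader, String zoneId) {
        ZonedDateTime sent = ZonedDateTime.parse(dateHeader.trim(), DateTimeFormatter.RFC_1123_DATE_TIME);
        return sent.withZoneSameInstant(ZoneId.of(zoneId)).format(OUT);
    }
}
```

For a header such as "Sat, 11 Feb 2023 19:26:00 +0800", calling format with "GMT" gives 2023-02-11 11:26:00, while "Asia/Shanghai" gives 2023-02-11 19:26:00. Which of the two to store is a design decision, but storing the zone alongside the text avoids ambiguity.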

Parsing result

(parsed JSON result)

(HTML paragraphs are saved as files)

Other notes

(1) An eml file is a plain text file that can be opened with Notepad, Notepad++ and similar tools, so once you have seen its content you can also write custom parsing as needed;

(2) Real-world use is considerably more complicated than this. This article is only an example that touches on several details;

(3) In some scenarios attachment names need special decoding: the actual attachment name is split into multiple segments in the mail source, and the segments have to be merged and decoded;
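One common form of such segmented names is a sequence of RFC 2047 encoded words (`=?charset?B?...?=` or `=?charset?Q?...?=`). The following JDK-only sketch merges adjacent encoded words and decodes them. EncodedWordSketch is a hypothetical name. The sketch decodes each word on its own, so it would fail on a multi-byte character split across two words, and it does not cover RFC 2231 parameter continuations. In real code a library decoder is preferable.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EncodedWordSketch {

    // =?charset?B-or-Q?payload?=
    private static final Pattern WORD = Pattern.compile("=\\?([^?]+)\\?([bBqQ])\\?([^?]*)\\?=");

    static String decode(String value) {
        // Whitespace between two adjacent encoded words is not part of the text
        String joined = value.replaceAll("\\?=\\s+=\\?", "?==?");
        Matcher m = WORD.matcher(joined);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            Charset charset = Charset.forName(m.group(1));
            byte[] bytes = "B".equalsIgnoreCase(m.group(2))
                    ? Base64.getDecoder().decode(m.group(3))
                    : decodeQ(m.group(3));
            m.appendReplacement(out, Matcher.quoteReplacement(new String(bytes, charset)));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Q encoding: "_" means space and "=XX" is a byte in hexadecimal. */
    private static byte[] decodeQ(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '_') {
                bytes.write(' ');
            } else if (c == '=' && i + 2 < text.length()) {
                bytes.write(Integer.parseInt(text.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                bytes.write(c);
            }
        }
        return bytes.toByteArray();
    }
}
```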

(4) A mail may contain several exchanges, as with replies and forwarded mails, and these need special handling;

(5) The body of a mail can contain many content types, so several parsing adapters are needed (for example, some mails carry an image in the signature that is mixed into the content area, and the rich text has to be parsed first and the binary body parts afterwards, etc.);
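One way to avoid depending on the order in which the rich text and the binary parts appear is to collect the inline images first and substitute them only after the whole message has been walked. The sketch below shows that second step with the JDK alone. InlineImageSketch and InlineImage are hypothetical names, and the map stands in for whatever was collected by Content-ID during traversal.

```java
import java.util.Base64;
import java.util.Map;

final class InlineImageSketch {

    /** Hypothetical holder for an inline part collected during traversal. */
    static final class InlineImage {
        final String mimeType;
        final byte[] data;

        InlineImage(String mimeType, byte[] data) {
            this.mimeType = mimeType;
            this.data = data;
        }
    }

    /** Replaces every cid: reference in the HTML with a data: URI once all parts are known. */
    static String inline(String html, Map<String, InlineImage> imagesByContentId) {
        if (html == null) {
            return null;
        }
        String result = html;
        for (Map.Entry<String, InlineImage> e : imagesByContentId.entrySet()) {
            String dataUri = "data:" + e.getValue().mimeType + ";base64,"
                    + Base64.getEncoder().encodeToString(e.getValue().data);
            result = result.replace("cid:" + e.getKey(), dataUri);
        }
        return result;
    }
}
```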

(6) The code in this article is for reference only and should not be used directly. Parsing by content type should be implemented as multiple parsing adapters built on the factory pattern. For the source code of the sample project, see the personal site article linked below;
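As a rough idea of what such a factory could look like, the sketch below keeps a registry of handlers keyed by MIME type. It is not taken from the sample project. PartHandler and PartHandlerFactory are hypothetical names, and the handlers work on plain strings only to keep the sketch independent of any mail library.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical handler for one kind of body part. */
interface PartHandler {
    String handle(String content);
}

final class PartHandlerFactory {

    private final Map<String, PartHandler> handlers = new HashMap<>();
    private final PartHandler fallback;

    PartHandlerFactory(PartHandler fallback) {
        this.fallback = fallback;
    }

    void register(String mimeType, PartHandler handler) {
        handlers.put(mimeType.toLowerCase(Locale.ROOT), handler);
    }

    /** Looks up the exact type first, then the "type/*" wildcard, then the fallback. */
    PartHandler forType(String mimeType) {
        String key = mimeType.toLowerCase(Locale.ROOT);
        PartHandler exact = handlers.get(key);
        if (exact != null) {
            return exact;
        }
        int slash = key.indexOf('/');
        PartHandler wildcard = slash > 0 ? handlers.get(key.substring(0, slash) + "/*") : null;
        return wildcard != null ? wildcard : fallback;
    }
}
```

With this shape, supporting a new content type means registering one more handler instead of adding another branch to the traversal code.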

(7) For more information, please visit: https://www.chendd.cn/blog/article/1624252901639442434.html

Origin blog.csdn.net/haiyangyiba/article/details/129086959