JavaCC Tutorial 5: An Example of Lexical States

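In the previous part we introduced the token manager and touched on lexical states. This part works through an example that focuses on how lexical states are used.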

Here is a mail file that contains the full text of several messages exchanged on a mailing list:

BABYL OPTIONS:
Version: 5
Labels:
Note:   This is the header of an rmail file.
Note:   If you are seeing it in rmail,
Note:    it means the file has no messages in it.

1, filed,,
Summary-line: 11-Jan       [email protected]  #A note on using RE's matching the empty string
Return-Path: <[email protected]>
Received: from Eng.Sun.COM by schizophrenia.Eng.Sun.COM (SMI-8.6/SMI-SVR4)
	id HAA21134; Sat, 11 Jan 1997 07:47:28 -0800
Received: from sunmail1.Sun.COM by Eng.Sun.COM (SMI-8.6/SMI-5.3)
	id HAA02652; Sat, 11 Jan 1997 07:44:26 -0800
Received: from Eng.Sun.COM by sunmail1.Sun.COM (SMI-8.6/SMI-4.1)
	id HAA06974; Sat, 11 Jan 1997 07:44:26 -0800
Received: from suntest.Eng.Sun.COM by Eng.Sun.COM (SMI-8.6/SMI-5.3)
	id HAA02640; Sat, 11 Jan 1997 07:44:24 -0800
Received: from asap.Eng.Sun.COM by suntest.Eng.Sun.COM (SMI-8.6/SMI-SVR4)
	id HAA21295; Sat, 11 Jan 1997 07:44:18 -0800
Received: from Eng.Sun.COM by asap.Eng.Sun.COM (SMI-8.6/SMI-SVR4)
	id HAA24684; Sat, 11 Jan 1997 07:44:19 -0800
Received: from mercury.Sun.COM by Eng.Sun.COM (SMI-8.6/SMI-5.3)
	id HAA02630; Sat, 11 Jan 1997 07:44:17 -0800
Received: from cs.albany.edu (cs.albany.edu [169.226.2.22]) by mercury.Sun.COM (SMI-8.6/mail.byaddr) with SMTP id HAA01828 for <[email protected]>; Sat, 11 Jan 1997 07:44:18 -0800
Received: from bhaskara.cs.albany.edu ([email protected] [169.226.2.60]) by cs.albany.edu (8.7.4/HUB03) with ESMTP id KAA01464; Sat, 11 Jan 1997 10:44:06 -0500 (EST)
Received: (from sreeni@localhost) by bhaskara.cs.albany.edu (8.7.4/CLI2) id KAA09608; Sat, 11 Jan 1997 10:43:58 -0500 (EST)
From: Sreenivasa Rao Viswanadha <[email protected]>
Date: Sat, 11 Jan 1997 10:43:58 -0500 (EST)
Message-Id: <[email protected]>
To: [email protected]
Subject: A note on using RE's matching the empty string
Cc: [email protected]
X-Sun-Charset: US-ASCII
Content-Type: text
Content-Length: 1639
X-Lines: 32
Status: RO

*** EOOH ***
Return-Path: <[email protected]>
From: Sreenivasa Rao Viswanadha <[email protected]>
Date: Sat, 11 Jan 1997 10:43:58 -0500 (EST)
To: [email protected]
Subject: A note on using RE's matching the empty string
Cc: [email protected]
X-Sun-Charset: US-ASCII
Content-Type: text
Content-Length: 1639
X-Lines: 32

In the last couple of days, we had seen a couple of users facing problems with
regular expressions that match "". There is a minor bug in the way it is 
implemented in 0.6.-9. We will fix it.

But the purpose of this mail is to suggest you should be careful when you use
RE's that match the "" string. Consider the following example of string literals
where two consecutive "" are interpreted as the literal " (equivalent to \" in
Java).

< STRING_LITERAL: ( "\"" (~["\""])* "\"" )* >

This will work in general. But, if this a part of a lot of other lexical rules,
then if there a lexical error, say a char is given that cannot be the first one
of any token, then, the lexer decides to use the empty string "" and match it
as STRING_LITERAL without actually giving the lexical error. And since this is
the empty string, no character will be consumed and you will start getting the
same STRING_LITERAL token (with "" as the image) infinite number of times. In
fact, if this was the only lexical rule, then if you give a input that starts
with any char other than the ", you will get into an infinite loop.

So a better alternative is to use the + operator which will not match the empty
string. As a matter of fact, I don't know any practical grammar where matching
"" is useful.

In version 0.5, the lexer generated implicitly treated it as + (which is not
totally right). But in 0.6.-9, it does it right and so there is a chance that
your grammar that used to work with 0.5 will not work with 0.6.-9. So if you
have any top-level lexical rule with ? or *, please change those rules so that
they don't match the empty string "".

Sreeni.

1,,
Summary-line: 11-Jan         [email protected]  #Re: Looking for HTML.jack
Return-Path: <[email protected]>
Received: from Eng.Sun.COM by schizophrenia.Eng.Sun.COM (SMI-8.6/SMI-SVR4)
	id QAA21339; Sat, 11 Jan 1997 16:41:53 -0800
Received: from sunmail1.Sun.COM by Eng.Sun.COM (SMI-8.6/SMI-5.3)
	id QAA18458; Sat, 11 Jan 1997 16:38:44 -0800
Received: from Eng.Sun.COM by sunmail1.Sun.COM (SMI-8.6/SMI-4.1)
	id QAA16642; Sat, 11 Jan 1997 16:38:51 -0800
Received: from suntest.Eng.Sun.COM by Eng.Sun.COM (SMI-8.6/SMI-5.3)
	id QAA18449; Sat, 11 Jan 1997 16:38:42 -0800
Received: from asap.Eng.Sun.COM by suntest.Eng.Sun.COM (SMI-8.6/SMI-SVR4)
	id QAA23127; Sat, 11 Jan 1997 16:38:42 -0800
Received: from Eng.Sun.COM by asap.Eng.Sun.COM (SMI-8.6/SMI-SVR4)
	id QAA24956; Sat, 11 Jan 1997 16:38:41 -0800
Received: from mercury.Sun.COM by Eng.Sun.COM (SMI-8.6/SMI-5.3)
	id QAA18438; Sat, 11 Jan 1997 16:38:36 -0800
Received: from chmls01.highway1.com (ne.highway1.com [24.128.1.82]) by mercury.Sun.COM (SMI-8.6/mail.byaddr) with SMTP id QAA22125 for <[email protected]>; Sat, 11 Jan 1997 16:38:42 -0800
Received: from papa ([24.128.36.164]) by chmls01.highway1.com
          (Netscape Mail Server v2.0) with SMTP id AAA17669;
          Sat, 11 Jan 1997 19:38:31 -0400
Message-ID: <[email protected]>
Date: Sat, 11 Jan 1997 19:38:32 -0500
From: [email protected] (Kimbo Mundy)
X-Mailer: Mozilla 3.0Gold (WinNT; U)
MIME-Version: 1.0
To: Rupert Nagler <[email protected]>
CC: [email protected]
Subject: Re: Looking for HTML.jack
References: <[email protected]>
Content-Transfer-Encoding: 7bit
X-Lines: 13
Status: RO
Content-Type: text/plain; charset="us-ascii"
Content-Length: 447

*** EOOH ***
Return-Path: <[email protected]>
Date: Sat, 11 Jan 1997 19:38:32 -0500
From: [email protected] (Kimbo Mundy)
X-Mailer: Mozilla 3.0Gold (WinNT; U)
MIME-Version: 1.0
To: Rupert Nagler <[email protected]>
CC: [email protected]
Subject: Re: Looking for HTML.jack
References: <[email protected]>
Content-Transfer-Encoding: 7bit
X-Lines: 13
Content-Type: text/plain; charset="us-ascii"
Content-Length: 447

Rupert Nagler wrote:
> 
> I am very impressed by the Jack-Concept and I am looking for a "HTML.jack".
> Is there anybody out there who has an example of a Jack-Definition file for
> HTML 3.2?

I previously sent a message entitled "A first cut at an HTML grammar".
Did people not get it?  If not, see:
http://www.tiac.net/users/kimbo/jack/HTML.jack

> Is there a way to construct a *.jack file out of a *.sgml file?

Sorry, I can't help with this.

1,,
Summary-line: 13-Jan         [email protected]  #Re: HTML?
Return-Path: <[email protected]>
Received: from suntest.Eng.Sun.COM by schizophrenia.Eng.Sun.COM (SMI-8.6/SMI-SVR4)
	id XAA21746; Sun, 12 Jan 1997 23:06:34 -0800
Received: from Eng.Sun.COM by suntest.Eng.Sun.COM (SMI-8.6/SMI-SVR4)
	id XAA29422; Sun, 12 Jan 1997 23:03:31 -0800
Received: from mercury.Sun.COM by Eng.Sun.COM (SMI-8.6/SMI-5.3)
	id XAA07269; Sun, 12 Jan 1997 23:03:30 -0800
Received: from chmls01.highway1.com (ne.highway1.com [24.128.1.82]) by mercury.Sun.COM (SMI-8.6/mail.byaddr) with SMTP id XAA18021 for <[email protected]>; Sun, 12 Jan 1997 23:03:30 -0800
Received: from papa ([24.128.36.164]) by chmls01.highway1.com
          (Netscape Mail Server v2.0) with SMTP id AAA20062
          for <[email protected]>; Mon, 13 Jan 1997 02:03:25 -0400
Message-ID: <[email protected]>
Date: Mon, 13 Jan 1997 02:03:28 -0500
From: [email protected] (Kimbo Mundy)
X-Mailer: Mozilla 3.0Gold (WinNT; U)
MIME-Version: 1.0
To: Sriram Sankar <[email protected]>
Subject: Re: HTML?
References: <[email protected]>
Content-Transfer-Encoding: 7bit
X-Lines: 34
Status: RO
Content-Type: text/plain; charset="us-ascii"
Content-Length: 1420

*** EOOH ***
Return-Path: <[email protected]>
Date: Mon, 13 Jan 1997 02:03:28 -0500
From: [email protected] (Kimbo Mundy)
X-Mailer: Mozilla 3.0Gold (WinNT; U)
MIME-Version: 1.0
To: Sriram Sankar <[email protected]>
Subject: Re: HTML?
References: <[email protected]>
Content-Transfer-Encoding: 7bit
X-Lines: 34
Content-Type: text/plain; charset="us-ascii"
Content-Length: 1420

Well, I finally got an HTML grammar out there (at
http://www.tiac.net/users/kimbo/jack/HTML.jack).  I hope you saw it, I
got some mailer errors, that seemed like the kind that you could ignore,
but at least one person didn't receive my first posting.

I'd be interested to know if this is the sort of thing people are
looking for, or do they want the full set of tags enumerated in the
grammar as well?  Also, if there you have any desire to bundle this with
Jack (possibly after upgrades and/or integration with other people's
work), please feel free.

I must say Jack is an amazing tool.  It was really easy to learn.  I
love how readable the grammars are, and I love being able to pass info
up and down the productions as the parser runs.  I never want to have to
settle for LALR(1) again!  Thanks for writing it!

For every message in the file above we need to extract its summary entry, subject, sender, and date.

The final result looks like this:

DIGEST OF RECENT MESSAGES FROM THE JAVACC MAILING LIST
----------------------------------------------------------------------

MESSAGE SUMMARY:

1. A note on using RE's matching the empty string
2. Re: Looking for HTML.jack
3. Re: HTML?

----------------------------------------------------------------------

Message 1:

Subject: A note on using RE's matching the empty string
From: Sreenivasa Rao Viswanadha <[email protected]>
Date: Sat, 11 Jan 1997 10:43:58 -0500 (EST)

In the last couple of days, we had seen a couple of users facing problems with
regular expressions that match "". There is a minor bug in the way it is 
implemented in 0.6.-9. We will fix it.

But the purpose of this mail is to suggest you should be careful when you use
RE's that match the "" string. Consider the following example of string literals
where two consecutive "" are interpreted as the literal " (equivalent to \" in
Java).

< STRING_LITERAL: ( "\"" (~["\""])* "\"" )* >

This will work in general. But, if this a part of a lot of other lexical rules,
then if there a lexical error, say a char is given that cannot be the first one
of any token, then, the lexer decides to use the empty string "" and match it
as STRING_LITERAL without actually giving the lexical error. And since this is
the empty string, no character will be consumed and you will start getting the
same STRING_LITERAL token (with "" as the image) infinite number of times. In
fact, if this was the only lexical rule, then if you give a input that starts
with any char other than the ", you will get into an infinite loop.

So a better alternative is to use the + operator which will not match the empty
string. As a matter of fact, I don't know any practical grammar where matching
"" is useful.

In version 0.5, the lexer generated implicitly treated it as + (which is not
totally right). But in 0.6.-9, it does it right and so there is a chance that
your grammar that used to work with 0.5 will not work with 0.6.-9. So if you
have any top-level lexical rule with ? or *, please change those rules so that
they don't match the empty string "".

Sreeni.

----------------------------------------------------------------------

Message 2:

Subject: Re: Looking for HTML.jack
From: [email protected] (Kimbo Mundy)
Date: Sat, 11 Jan 1997 19:38:32 -0500

Rupert Nagler wrote:
> 
> I am very impressed by the Jack-Concept and I am looking for a "HTML.jack".
> Is there anybody out there who has an example of a Jack-Definition file for
> HTML 3.2?

I previously sent a message entitled "A first cut at an HTML grammar".
Did people not get it?  If not, see:
http://www.tiac.net/users/kimbo/jack/HTML.jack

> Is there a way to construct a *.jack file out of a *.sgml file?

Sorry, I can't help with this.

----------------------------------------------------------------------

Message 3:

Subject: Re: HTML?
From: [email protected] (Kimbo Mundy)
Date: Mon, 13 Jan 1997 02:03:28 -0500

Well, I finally got an HTML grammar out there (at
http://www.tiac.net/users/kimbo/jack/HTML.jack).  I hope you saw it, I
got some mailer errors, that seemed like the kind that you could ignore,
but at least one person didn't receive my first posting.

I'd be interested to know if this is the sort of thing people are
looking for, or do they want the full set of tags enumerated in the
grammar as well?  Also, if there you have any desire to bundle this with
Jack (possibly after upgrades and/or integration with other people's
work), please feel free.

I must say Jack is an amazing tool.  It was really easy to learn.  I
love how readable the grammars are, and I love being able to pass info
up and down the productions as the parser runs.  I never want to have to
settle for LALR(1) again!  Thanks for writing it!

----------------------------------------------------------------------

Message summary

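First we need to produce the summary of these messages, that is, the "MESSAGE SUMMARY" part. We use the "Subject" header of each message as its summary entry. Looking at the file, the part of each message we care about begins at the line "*** EOOH ***". The token manager starts in the DEFAULT state, and in that state it simply eats the characters it scans. So in the DEFAULT state we define a rule that switches to another state when "*** EOOH ***" is seen, and in the states that follow we go on to match the fields we need.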

We define the file Digest.jj with the following content:

PARSER_BEGIN(Digest)
package com.github.gambo.javacc.mail.digest;
import java.io.*;

public class Digest {

  static int count = 0;


  public static void main(String args[]) throws Exception {
    FileInputStream input = new FileInputStream("../sampleMailFile");
    Digest parser = new Digest(input);
    System.out.println("DIGEST OF RECENT MESSAGES FROM THE JAVACC MAILING LIST");
    System.out.println("----------------------------------------------------------------------");
    System.out.println("");
    System.out.println("MESSAGE SUMMARY:");
    System.out.println("");
    parser.MailFile();
    if (count == 0) {
      System.out.println("There have been no messages since the last digest posting.");
      System.out.println("");
      System.out.println("----------------------------------------------------------------------");
    } else {
      System.out.println("");
      System.out.println("----------------------------------------------------------------------");
      System.out.println("");
    }
  }

}

PARSER_END(Digest)


// PARSER SPECIFICATIONS BEGIN HERE

void MailFile() :
	{
	}
{
  (
	{
	  count++;
	}
    MailMessage()
  )*
  <EOF>
}

void MailMessage() :
	{
	  Token subj=null, body;
	}
{
  ( subj=<SUBJECT> )+
	{
	  System.out.println(count + ". " + ((subj==null) ? "no subject" : subj.image));
	}
  ( body=<BODY>)*
  <END>
}


// LEXICAL SPECIFICATIONS BEGIN HERE

TOKEN:
{
  <#EOL: "\n" | "\r" | "\r\n">
|
  <#TWOEOLS: (("\n"|"\r\n") <EOL>) | ("\r\r" [ "\n" ])>
|
  <#NOT_EOL: ~["\n","\r"]>
}

<DEFAULT>
SKIP:
{
  < <EOL> "*** EOOH ***" <EOL> > : MAILHEADER
|
  <~[]>
}

<MAILHEADER>
SKIP:
{
  <_TWOEOLS: <TWOEOLS>> : MAILBODY
|
  "Subject: " : MAILSUBJECT
|
  <~[]>
}

<MAILSUBJECT>
TOKEN:
{
  <SUBJECT: ( <NOT_EOL> )+>
}

<MAILSUBJECT>
SKIP:
{
  <_EOL1: <EOL>> : MAILHEADER
}


<MAILBODY>
TOKEN:
{
  <BODY: (~["\n","\r","\u001f"])* <EOL>>
|
  <END: "\u001f"> : DEFAULT
}

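At first glance the file looks somewhat involved. Let us start with the group of reusable regular expressions EOL, TWOEOLS, and NOT_EOL. The leading # marks each of them as a private regular expression: it can be referenced from other expressions but never produces a token by itself.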

TOKEN:
{
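  <#EOL: "\n" | "\r" | "\r\n"> // line terminators used on different platforms
|
  <#TWOEOLS: (("\n"|"\r\n") <EOL>) | ("\r\r" [ "\n" ])> // two consecutive line terminators: the blank line between a message's header and its body
|
  <#NOT_EOL: ~["\n","\r"]> // any single character that is not a line terminator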
}
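
To get a feel for what these three expressions accept, the sketch below restates them with java.util.regex. This is only an illustration: the class name EolPatterns and the regex translations are assumptions made here, and JavaCC itself does not use java.util.regex.

```java
import java.util.regex.Pattern;

final class EolPatterns {
    // <#EOL: "\n" | "\r" | "\r\n">
    static final Pattern EOL = Pattern.compile("\n|\r|\r\n");
    // <#TWOEOLS: (("\n"|"\r\n") <EOL>) | ("\r\r" [ "\n" ])>
    static final Pattern TWOEOLS = Pattern.compile("(?:\n|\r\n)(?:\n|\r|\r\n)|\r\r\n?");
    // <#NOT_EOL: ~["\n","\r"]>
    static final Pattern NOT_EOL = Pattern.compile("[^\n\r]");

    // Each helper asks whether the WHOLE string is one instance of the expression.
    static boolean isEol(String s)     { return EOL.matcher(s).matches(); }
    static boolean isTwoEols(String s) { return TWOEOLS.matcher(s).matches(); }
    static boolean isNotEol(String s)  { return NOT_EOL.matcher(s).matches(); }

    public static void main(String[] args) {
        System.out.println(isEol("\r\n"));         // true: one Windows line break
        System.out.println(isTwoEols("\r\n"));     // false: still only one line break
        System.out.println(isTwoEols("\r\n\r\n")); // true: a blank line
        System.out.println(isNotEol("x"));         // true
    }
}
```

Note that a single "\r\n" counts as one EOL, not as TWOEOLS, which is why the blank line after the header can be told apart from an ordinary line end.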

The lexical analysis proceeds as follows:

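In the <DEFAULT> lexical state, the token manager simply eats characters until it sees the marker for the start of a message: < <EOL> "*** EOOH ***" <EOL> >. At that point it switches to the <MAILHEADER> state.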
```
<DEFAULT>
SKIP:
{
  < <EOL> "*** EOOH ***" <EOL> > : MAILHEADER
|
  <~[]>
}
```
In the <MAILHEADER> state, when we match the string "Subject: ", we switch to the <MAILSUBJECT> state.
```
<MAILHEADER>
SKIP:
{
  <_TWOEOLS: <TWOEOLS>> : MAILBODY
|
  "Subject: " : MAILSUBJECT
|
  <~[]>
}
```
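
Which of the three alternatives fires at a given position is decided by JavaCC's longest-match rule: the token manager takes the longest match, and on a tie the alternative that appears first in the grammar. That is why "Subject: " wins over the single-character <~[]>. The sketch below imitates this for the literal alternatives only; the class LongestMatchSketch and its rule table are hypothetical, and only one spelling of <TWOEOLS> is included.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class LongestMatchSketch {
    // Literal alternatives of the MAILHEADER state in grammar order, mapped to the next state.
    private static final Map<String, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put("\n\n", "MAILBODY");          // one spelling of <TWOEOLS>
        RULES.put("Subject: ", "MAILSUBJECT");
    }

    /** Returns the state entered after the match at pos; a lone character keeps MAILHEADER. */
    static String nextState(String input, int pos) {
        String best = null;
        for (String literal : RULES.keySet()) {
            // Strictly longer wins, so on equal length the earlier rule is kept.
            if (input.startsWith(literal, pos) && (best == null || literal.length() > best.length())) {
                best = literal;
            }
        }
        return best == null ? "MAILHEADER" : RULES.get(best); // fallback models <~[]>
    }

    public static void main(String[] args) {
        System.out.println(nextState("Subject: Re: HTML?", 0)); // MAILSUBJECT
        System.out.println(nextState("Status: RO", 0));         // MAILHEADER
        System.out.println(nextState("X-Lines: 13\n\nbody", 11)); // MAILBODY
    }
}
```
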
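In the <MAILSUBJECT> state we match the rest of the line that follows "Subject: " and emit it as the <SUBJECT> token; once the parser has received <SUBJECT>, it prints the token's content. When a single line terminator <EOL> is matched, we switch back to <MAILHEADER>.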
```
<MAILSUBJECT>
TOKEN:
{
  <SUBJECT: ( <NOT_EOL> )+>
}

<MAILSUBJECT>
SKIP:
{
  <_EOL1: <EOL>> : MAILHEADER
}
```
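Back in the <MAILHEADER> state, when we match two consecutive line terminators, that is <TWOEOLS>, we jump to <MAILBODY>, where the message body is matched.
In the <MAILBODY> state, each line of the body is matched as <BODY: (~["\n","\r","\u001f"])* <EOL>>: zero or more characters that are neither a line terminator nor the end marker, followed by a line terminator. The character "\u001f" marks the end of a message. When it is seen, the state is reset to <DEFAULT> and a new round of parsing begins.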
```
<MAILBODY>
TOKEN:
{
  <BODY: (~["\n","\r","\u001f"])* <EOL>>
|
  <END: "\u001f"> : DEFAULT
}
```
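
Because every <BODY> token includes its line terminator, concatenating the token images reproduces the body text unchanged. A minimal sketch of the same matching with java.util.regex follows; the class BodyLinesSketch is hypothetical and only mirrors the shape of the <BODY> expression.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BodyLinesSketch {
    // Same shape as <BODY: (~["\n","\r","\u001f"])* <EOL>>; CRLF is listed first so it is preferred over a lone CR.
    private static final Pattern BODY = Pattern.compile("[^\n\r\u001f]*(?:\r\n|\n|\r)");

    /** Collects BODY-like tokens from the start of text until one no longer matches. */
    static List<String> bodyTokens(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = BODY.matcher(text);
        int pos = 0;
        while (pos < text.length()) {
            m.region(pos, text.length());
            if (!m.lookingAt()) break; // stops at the end marker or at an unterminated last line
            tokens.add(m.group());
            pos = m.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> tokens = bodyTokens("first line\n\nsecond line\r\n\u001fnext message");
        System.out.println(tokens.size()); // 3: two text lines and one empty line
    }
}
```
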

The state transitions are shown in the diagram below:

      <DEFAULT> ---> <MAILHEADER> --+--> <MAILSUBJECT> -->+
       ^                |    ^                            |
       |                |    |                            |
       |                |    |                            |
       +- <MAILBODY> <--+    +----------------------------+
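
The diagram can be read as an ordinary state machine. As a rough illustration only, and not a description of the token manager that JavaCC generates, the sketch below walks the input line by line with a hand-written enum of the same states and collects the subjects. The names StateSketch and subjects are hypothetical, it assumes "\n" line endings, and it only recognises "Subject: " at the start of a line, which is stricter than the grammar.

```java
import java.util.ArrayList;
import java.util.List;

final class StateSketch {
    // MAILSUBJECT is folded into MAILHEADER here because the sketch handles a whole line at once.
    enum State { DEFAULT, MAILHEADER, MAILBODY }

    static List<String> subjects(String mailFile) {
        List<String> subjects = new ArrayList<>();
        State state = State.DEFAULT;
        for (String line : mailFile.split("\n", -1)) {
            switch (state) {
                case DEFAULT:
                    // Everything is skipped until the marker line is seen.
                    if (line.equals("*** EOOH ***")) state = State.MAILHEADER;
                    break;
                case MAILHEADER:
                    if (line.isEmpty()) {
                        state = State.MAILBODY; // blank line, the <TWOEOLS> case
                    } else if (line.startsWith("Subject: ")) {
                        subjects.add(line.substring("Subject: ".length())); // the <SUBJECT> image
                    }
                    break;
                case MAILBODY:
                    if (line.indexOf('\u001f') >= 0) state = State.DEFAULT; // the <END> character
                    break;
            }
        }
        return subjects;
    }

    public static void main(String[] args) {
        String sample = "Subject: skipped, still in DEFAULT\n"
                + "*** EOOH ***\n"
                + "Subject: Re: HTML?\n"
                + "\n"
                + "Subject: skipped, this is body text\n"
                + "\u001f\n";
        System.out.println(subjects(sample)); // [Re: HTML?]
    }
}
```

The sample shows why the states matter: the same text "Subject: " is ignored in DEFAULT and in MAILBODY and only produces a result in MAILHEADER.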

The output is as follows:

DIGEST OF RECENT MESSAGES FROM THE JAVACC MAILING LIST
----------------------------------------------------------------------

MESSAGE SUMMARY:

1. A note on using RE's matching the empty string
2. Re: Looking for HTML.jack
3. Re: HTML?

----------------------------------------------------------------------

Message body

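Next we go on to parse the messages themselves: on top of what we already have, we also output each message's subject, sender, date, and body text. Building on the previous step, we add a few more lexical state transitions: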

      <DEFAULT> ---> <MAILHEADER> --+--> <MAILSUBJECT> -->+
       ^                |    ^      |                     |
       |                |    |      |                     |
       |                |    |      +--> <MAILFROM> ----->+
       +- <MAILBODY> <--+    |      |                     |
                             |      |                     |
                             |      +--> <MAILDATE> ----->+
                             |                            |
                             |                            |
                             +----------------------------+

Here we need to define how the contents of the From and Date headers are parsed:

<MAILFROM>
TOKEN:
{
  <FROM: ( <NOT_EOL> )+>
}

<MAILFROM>
SKIP:
{
  <_EOL2: <EOL>> : MAILHEADER
}

<MAILDATE>
TOKEN:
{
  <DATE: ( <NOT_EOL> )+>
}

<MAILDATE>
SKIP:
{
  <_EOL3: <EOL>> : MAILHEADER
}

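As with the <MAILSUBJECT> state, the non-terminator characters that follow a specific prefix ("From: " or "Date: ") become the token, and a line terminator switches back to the <MAILHEADER> state. The <MAILHEADER> state itself gains two matching rules, "From: " : MAILFROM and "Date: " : MAILDATE, which appear in the complete file at the end.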
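
The three header states follow one pattern: a prefix selects the state, the rest of the line becomes the token, and the end of the line returns to the header. A plain-Java sketch of that pattern is given below. The class HeaderFieldsSketch and the sample address are hypothetical, and the sketch only looks for a prefix at the start of a line, whereas the grammar would also match it in the middle of a line.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class HeaderFieldsSketch {
    private static final String[] PREFIXES = { "Subject: ", "From: ", "Date: " };

    /** Maps "Subject", "From" and "Date" to the rest of their header line. */
    static Map<String, String> fields(String header) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String line : header.split("\r\n|\n|\r")) {
            for (String prefix : PREFIXES) {
                // The token needs at least one character, like ( <NOT_EOL> )+.
                if (line.startsWith(prefix) && line.length() > prefix.length()) {
                    String key = prefix.substring(0, prefix.length() - 2); // drop ": "
                    result.put(key, line.substring(prefix.length()));      // a repeated header overwrites
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String header = "From: someone@example.org\n"
                + "Date: Sat, 11 Jan 1997 19:38:32 -0500\n"
                + "X-Lines: 13\n"
                + "Subject: Re: HTML?\n";
        System.out.println(fields(header).keySet()); // [From, Date, Subject]
    }
}
```
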

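Because the message contents have to be printed as a whole after the summary, every parsed message is stored in a buffer and printed once parsing has finished. This is done in the Java actions of the parser production: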

void MailMessage() :
	{
	  Token subj=null, from=null, date=null, body;
	}
{
  ( subj=<SUBJECT> | from=<FROM> | date=<DATE> )+
	{
	  System.out.println(count + ". " + ((subj==null) ? "no subject" : subj.image));
	  buffer += "\n";
	  buffer += "Message " + count + ":\n";
	  buffer += "\n";
	  buffer += "Subject: " + ((subj==null) ? "no subject" : subj.image) + "\n";
	  buffer += "From: " + ((from==null) ? "" : from.image) + "\n";
	  buffer += "Date: " + ((date==null) ? "" : date.image) + "\n";
	  buffer += "\n";
	}
  ( body=<BODY>
	{
	  buffer += body.image;
	}
  )*
  <END>
	{
	  buffer += "\n";
	  buffer += "----------------------------------------------------------------------\n";
	}
}
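
The grammar appends to a static String with +=, which is fine for three messages. For larger mail files a StringBuilder avoids the repeated copying. The sketch below shows the same per-message formatting as a plain Java helper; the class DigestBuffer and its method names are hypothetical and not part of the tutorial code.

```java
final class DigestBuffer {
    private static final String RULE =
            "----------------------------------------------------------------------";
    private final StringBuilder buffer = new StringBuilder();

    /** Appends the heading and the three header lines, as the first action in MailMessage() does. */
    void beginMessage(int count, String subject, String from, String date) {
        buffer.append('\n')
              .append("Message ").append(count).append(":\n\n")
              .append("Subject: ").append(subject == null ? "no subject" : subject).append('\n')
              .append("From: ").append(from == null ? "" : from).append('\n')
              .append("Date: ").append(date == null ? "" : date).append("\n\n");
    }

    /** Appends one <BODY> image; the image already ends with its line terminator. */
    void bodyLine(String image) {
        buffer.append(image);
    }

    /** Appends the separator that closes a message. */
    void endMessage() {
        buffer.append('\n').append(RULE).append('\n');
    }

    @Override
    public String toString() {
        return buffer.toString();
    }

    public static void main(String[] args) {
        DigestBuffer digest = new DigestBuffer();
        digest.beginMessage(1, "Re: HTML?", null, "Mon, 13 Jan 1997 02:03:28 -0500");
        digest.bodyLine("Well, I finally got an HTML grammar out there\n");
        digest.endMessage();
        System.out.print(digest);
    }
}
```

In the grammar, each `buffer += ...` line in the actions would then become a call on one shared DigestBuffer instance.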

The complete Digest.jj is as follows:

PARSER_BEGIN(Digest)
package com.github.gambo.javacc.mail.digest;
import java.io.*;

public class Digest {

  static int count = 0;

  static String buffer = "";

  public static void main(String args[]) throws Exception {
    FileInputStream input = new FileInputStream("../sampleMailFile");
    Digest parser = new Digest(input);
    System.out.println("DIGEST OF RECENT MESSAGES FROM THE JAVACC MAILING LIST");
    System.out.println("----------------------------------------------------------------------");
    System.out.println("");
    System.out.println("MESSAGE SUMMARY:");
    System.out.println("");
    parser.MailFile();
    if (count == 0) {
      System.out.println("There have been no messages since the last digest posting.");
      System.out.println("");
      System.out.println("----------------------------------------------------------------------");
    } else {
      System.out.println("");
      System.out.println("----------------------------------------------------------------------");
      System.out.println("");
      System.out.println(buffer);
    }
  }

}

PARSER_END(Digest)


// PARSER SPECIFICATIONS BEGIN HERE

void MailFile() :
	{
	}
{
  (
	{
	  count++;
	}
    MailMessage()
  )*
  <EOF>
}

void MailMessage() :
	{
	  Token subj=null, from=null, date=null, body;
	}
{
  ( subj=<SUBJECT> | from=<FROM> | date=<DATE> )+
	{
	  System.out.println(count + ". " + ((subj==null) ? "no subject" : subj.image));
	  buffer += "\n";
	  buffer += "Message " + count + ":\n";
	  buffer += "\n";
	  buffer += "Subject: " + ((subj==null) ? "no subject" : subj.image) + "\n";
	  buffer += "From: " + ((from==null) ? "" : from.image) + "\n";
	  buffer += "Date: " + ((date==null) ? "" : date.image) + "\n";
	  buffer += "\n";
	}
  ( body=<BODY>
	{
	  buffer += body.image;
	}
  )*
  <END>
	{
	  buffer += "\n";
	  buffer += "----------------------------------------------------------------------\n";
	}
}


// LEXICAL SPECIFICATIONS BEGIN HERE

TOKEN:
{
  <#EOL: "\n" | "\r" | "\r\n">
|
  <#TWOEOLS: (("\n"|"\r\n") <EOL>) | ("\r\r" [ "\n" ])>
|
  <#NOT_EOL: ~["\n","\r"]>
}

<DEFAULT>
SKIP:
{
  < <EOL> "*** EOOH ***" <EOL> > : MAILHEADER
|
  <~[]>
}

<MAILHEADER>
SKIP:
{
  <_TWOEOLS: <TWOEOLS>> : MAILBODY
    // We cannot have just a reference to a regular expression in a
    // lexical specification - i.e., we cannot simply have <TWOEOLS>.
|
  "Subject: " : MAILSUBJECT
|
  "From: " : MAILFROM
|
  "Date: " : MAILDATE
|
  <~[]>
}

<MAILSUBJECT>
TOKEN:
{
  <SUBJECT: ( <NOT_EOL> )+>
}

<MAILSUBJECT>
SKIP:
{
  <_EOL1: <EOL>> : MAILHEADER
}

<MAILFROM>
TOKEN:
{
  <FROM: ( <NOT_EOL> )+>
}

<MAILFROM>
SKIP:
{
  <_EOL2: <EOL>> : MAILHEADER
}

<MAILDATE>
TOKEN:
{
  <DATE: ( <NOT_EOL> )+>
}

<MAILDATE>
SKIP:
{
  <_EOL3: <EOL>> : MAILHEADER
}

<MAILBODY>
TOKEN:
{
  <BODY: (~["\n","\r","\u001f"])* <EOL>>
|
  <END: "\u001f"> : DEFAULT
}

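There is one more example for this part. It takes the mail file as input and generates a mail FAQ in HTML: it produces an "index.html" file that contains all the message subjects together with links to further HTML files such as "1.html", "2.html", and so on. Its lexical state transitions are the same as in the example above; only the actions are implemented differently, so it is easy to follow once this example is understood. For reasons of space we will not go through it here. Interested readers can study it in the tutorial code.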

Sample code: GitHub - ziyiyu/javacc-tutorial (JavaCC tutorial)