JavaCC Tutorial 5: An Example of Lexical States

In the previous section we introduced the token manager, which touched on lexical states. In this section we work through an example that focuses on how lexical states are used.

We have a mail file here (an rmail file, as its header says) that contains several emails with their full headers and bodies. The file is as follows:

BABYL OPTIONS:
Version: 5
Labels:
Note:   This is the header of an rmail file.
Note:   If you are seeing it in rmail,
Note:    it means the file has no messages in it.

1, filed,,
Summary-line: 11-Jan       [email protected]  #A note on using RE's matching the empty string
Return-Path: <[email protected]>
Received: from Eng.Sun.COM by schizophrenia.Eng.Sun.COM (SMI-8.6/SMI-SVR4)
	id HAA21134; Sat, 11 Jan 1997 07:47:28 -0800
Received: from sunmail1.Sun.COM by Eng.Sun.COM (SMI-8.6/SMI-5.3)
	id HAA02652; Sat, 11 Jan 1997 07:44:26 -0800
Received: from Eng.Sun.COM by sunmail1.Sun.COM (SMI-8.6/SMI-4.1)
	id HAA06974; Sat, 11 Jan 1997 07:44:26 -0800
Received: from suntest.Eng.Sun.COM by Eng.Sun.COM (SMI-8.6/SMI-5.3)
	id HAA02640; Sat, 11 Jan 1997 07:44:24 -0800
Received: from asap.Eng.Sun.COM by suntest.Eng.Sun.COM (SMI-8.6/SMI-SVR4)
	id HAA21295; Sat, 11 Jan 1997 07:44:18 -0800
Received: from Eng.Sun.COM by asap.Eng.Sun.COM (SMI-8.6/SMI-SVR4)
	id HAA24684; Sat, 11 Jan 1997 07:44:19 -0800
Received: from mercury.Sun.COM by Eng.Sun.COM (SMI-8.6/SMI-5.3)
	id HAA02630; Sat, 11 Jan 1997 07:44:17 -0800
Received: from cs.albany.edu (cs.albany.edu [169.226.2.22]) by mercury.Sun.COM (SMI-8.6/mail.byaddr) with SMTP id HAA01828 for <[email protected]>; Sat, 11 Jan 1997 07:44:18 -0800
Received: from bhaskara.cs.albany.edu ([email protected] [169.226.2.60]) by cs.albany.edu (8.7.4/HUB03) with ESMTP id KAA01464; Sat, 11 Jan 1997 10:44:06 -0500 (EST)
Received: (from sreeni@localhost) by bhaskara.cs.albany.edu (8.7.4/CLI2) id KAA09608; Sat, 11 Jan 1997 10:43:58 -0500 (EST)
From: Sreenivasa Rao Viswanadha <[email protected]>
Date: Sat, 11 Jan 1997 10:43:58 -0500 (EST)
Message-Id: <[email protected]>
To: [email protected]
Subject: A note on using RE's matching the empty string
Cc: [email protected]
X-Sun-Charset: US-ASCII
Content-Type: text
Content-Length: 1639
X-Lines: 32
Status: RO

*** EOOH ***
Return-Path: <[email protected]>
From: Sreenivasa Rao Viswanadha <[email protected]>
Date: Sat, 11 Jan 1997 10:43:58 -0500 (EST)
To: [email protected]
Subject: A note on using RE's matching the empty string
Cc: [email protected]
X-Sun-Charset: US-ASCII
Content-Type: text
Content-Length: 1639
X-Lines: 32


In the last couple of days, we had seen a couple of users facing problems with
regular expressions that match "". There is a minor bug in the way it is 
implemented in 0.6.-9. We will fix it.

But the purpose of this mail is to suggest you should be careful when you use
RE's that match the "" string. Consider the following example of string literals
where two consecutive "" are interpreted as the literal " (equivalent to \" in
Java).

< STRING_LITERAL: ( "\"" (~["\""])* "\"" )* >

This will work in general. But, if this a part of a lot of other lexical rules,
then if there a lexical error, say a char is given that cannot be the first one
of any token, then, the lexer decides to use the empty string "" and match it
as STRING_LITERAL without actually giving the lexical error. And since this is
the empty string, no character will be consumed and you will start getting the
same STRING_LITERAL token (with "" as the image) infinite number of times. In
fact, if this was the only lexical rule, then if you give a input that starts
with any char other than the ", you will get into an infinite loop.

So a better alternative is to use the + operator which will not match the empty
string. As a matter of fact, I don't know any practical grammar where matching
"" is useful.

In version 0.5, the lexer generated implicitly treated it as + (which is not
totally right). But in 0.6.-9, it does it right and so there is a chance that
your grammar that used to work with 0.5 will not work with 0.6.-9. So if you
have any top-level lexical rule with ? or *, please change those rules so that
they don't match the empty string "".

Sreeni.


1,,
Summary-line: 11-Jan         [email protected]  #Re: Looking for HTML.jack
Return-Path: <[email protected]>
Received: from Eng.Sun.COM by schizophrenia.Eng.Sun.COM (SMI-8.6/SMI-SVR4)
	id QAA21339; Sat, 11 Jan 1997 16:41:53 -0800
Received: from sunmail1.Sun.COM by Eng.Sun.COM (SMI-8.6/SMI-5.3)
	id QAA18458; Sat, 11 Jan 1997 16:38:44 -0800
Received: from Eng.Sun.COM by sunmail1.Sun.COM (SMI-8.6/SMI-4.1)
	id QAA16642; Sat, 11 Jan 1997 16:38:51 -0800
Received: from suntest.Eng.Sun.COM by Eng.Sun.COM (SMI-8.6/SMI-5.3)
	id QAA18449; Sat, 11 Jan 1997 16:38:42 -0800
Received: from asap.Eng.Sun.COM by suntest.Eng.Sun.COM (SMI-8.6/SMI-SVR4)
	id QAA23127; Sat, 11 Jan 1997 16:38:42 -0800
Received: from Eng.Sun.COM by asap.Eng.Sun.COM (SMI-8.6/SMI-SVR4)
	id QAA24956; Sat, 11 Jan 1997 16:38:41 -0800
Received: from mercury.Sun.COM by Eng.Sun.COM (SMI-8.6/SMI-5.3)
	id QAA18438; Sat, 11 Jan 1997 16:38:36 -0800
Received: from chmls01.highway1.com (ne.highway1.com [24.128.1.82]) by mercury.Sun.COM (SMI-8.6/mail.byaddr) with SMTP id QAA22125 for <[email protected]>; Sat, 11 Jan 1997 16:38:42 -0800
Received: from papa ([24.128.36.164]) by chmls01.highway1.com
          (Netscape Mail Server v2.0) with SMTP id AAA17669;
          Sat, 11 Jan 1997 19:38:31 -0400
Message-ID: <[email protected]>
Date: Sat, 11 Jan 1997 19:38:32 -0500
From: [email protected] (Kimbo Mundy)
X-Mailer: Mozilla 3.0Gold (WinNT; U)
MIME-Version: 1.0
To: Rupert Nagler <[email protected]>
CC: [email protected]
Subject: Re: Looking for HTML.jack
References: <[email protected]>
Content-Transfer-Encoding: 7bit
X-Lines: 13
Status: RO
Content-Type: text/plain; charset="us-ascii"
Content-Length: 447

*** EOOH ***
Return-Path: <[email protected]>
Date: Sat, 11 Jan 1997 19:38:32 -0500
From: [email protected] (Kimbo Mundy)
X-Mailer: Mozilla 3.0Gold (WinNT; U)
MIME-Version: 1.0
To: Rupert Nagler <[email protected]>
CC: [email protected]
Subject: Re: Looking for HTML.jack
References: <[email protected]>
Content-Transfer-Encoding: 7bit
X-Lines: 13
Content-Type: text/plain; charset="us-ascii"
Content-Length: 447

Rupert Nagler wrote:
> 
> I am very impressed by the Jack-Concept and I am looking for a "HTML.jack".
> Is there anybody out there who has an example of a Jack-Definition file for
> HTML 3.2?

I previously sent a message entitled "A first cut at an HTML grammar".
Did people not get it?  If not, see:
http://www.tiac.net/users/kimbo/jack/HTML.jack

> Is there a way to construct a *.jack file out of a *.sgml file?

Sorry, I can't help with this.


1,,
Summary-line: 13-Jan         [email protected]  #Re: HTML?
Return-Path: <[email protected]>
Received: from suntest.Eng.Sun.COM by schizophrenia.Eng.Sun.COM (SMI-8.6/SMI-SVR4)
	id XAA21746; Sun, 12 Jan 1997 23:06:34 -0800
Received: from Eng.Sun.COM by suntest.Eng.Sun.COM (SMI-8.6/SMI-SVR4)
	id XAA29422; Sun, 12 Jan 1997 23:03:31 -0800
Received: from mercury.Sun.COM by Eng.Sun.COM (SMI-8.6/SMI-5.3)
	id XAA07269; Sun, 12 Jan 1997 23:03:30 -0800
Received: from chmls01.highway1.com (ne.highway1.com [24.128.1.82]) by mercury.Sun.COM (SMI-8.6/mail.byaddr) with SMTP id XAA18021 for <[email protected]>; Sun, 12 Jan 1997 23:03:30 -0800
Received: from papa ([24.128.36.164]) by chmls01.highway1.com
          (Netscape Mail Server v2.0) with SMTP id AAA20062
          for <[email protected]>; Mon, 13 Jan 1997 02:03:25 -0400
Message-ID: <[email protected]>
Date: Mon, 13 Jan 1997 02:03:28 -0500
From: [email protected] (Kimbo Mundy)
X-Mailer: Mozilla 3.0Gold (WinNT; U)
MIME-Version: 1.0
To: Sriram Sankar <[email protected]>
Subject: Re: HTML?
References: <[email protected]>
Content-Transfer-Encoding: 7bit
X-Lines: 34
Status: RO
Content-Type: text/plain; charset="us-ascii"
Content-Length: 1420

*** EOOH ***
Return-Path: <[email protected]>
Date: Mon, 13 Jan 1997 02:03:28 -0500
From: [email protected] (Kimbo Mundy)
X-Mailer: Mozilla 3.0Gold (WinNT; U)
MIME-Version: 1.0
To: Sriram Sankar <[email protected]>
Subject: Re: HTML?
References: <[email protected]>
Content-Transfer-Encoding: 7bit
X-Lines: 34
Content-Type: text/plain; charset="us-ascii"
Content-Length: 1420

Well, I finally got an HTML grammar out there (at
http://www.tiac.net/users/kimbo/jack/HTML.jack).  I hope you saw it, I
got some mailer errors, that seemed like the kind that you could ignore,
but at least one person didn't receive my first posting.

I'd be interested to know if this is the sort of thing people are
looking for, or do they want the full set of tags enumerated in the
grammar as well?  Also, if there you have any desire to bundle this with
Jack (possibly after upgrades and/or integration with other people's
work), please feel free.

I must say Jack is an amazing tool.  It was really easy to learn.  I
love how readable the grammars are, and I love being able to pass info
up and down the productions as the parser runs.  I never want to have to
settle for LALR(1) again!  Thanks for writing it!


We need to extract the subject, sender, sending date, and body of each email in the file above, preceded by a summary list of all subjects.

The final output looks like this:

DIGEST OF RECENT MESSAGES FROM THE JAVACC MAILING LIST
----------------------------------------------------------------------

MESSAGE SUMMARY:

1. A note on using RE's matching the empty string
2. Re: Looking for HTML.jack
3. Re: HTML?

----------------------------------------------------------------------


Message 1:

Subject: A note on using RE's matching the empty string
From: Sreenivasa Rao Viswanadha <[email protected]>
Date: Sat, 11 Jan 1997 10:43:58 -0500 (EST)


In the last couple of days, we had seen a couple of users facing problems with
regular expressions that match "". There is a minor bug in the way it is 
implemented in 0.6.-9. We will fix it.

But the purpose of this mail is to suggest you should be careful when you use
RE's that match the "" string. Consider the following example of string literals
where two consecutive "" are interpreted as the literal " (equivalent to \" in
Java).

< STRING_LITERAL: ( "\"" (~["\""])* "\"" )* >

This will work in general. But, if this a part of a lot of other lexical rules,
then if there a lexical error, say a char is given that cannot be the first one
of any token, then, the lexer decides to use the empty string "" and match it
as STRING_LITERAL without actually giving the lexical error. And since this is
the empty string, no character will be consumed and you will start getting the
same STRING_LITERAL token (with "" as the image) infinite number of times. In
fact, if this was the only lexical rule, then if you give a input that starts
with any char other than the ", you will get into an infinite loop.

So a better alternative is to use the + operator which will not match the empty
string. As a matter of fact, I don't know any practical grammar where matching
"" is useful.

In version 0.5, the lexer generated implicitly treated it as + (which is not
totally right). But in 0.6.-9, it does it right and so there is a chance that
your grammar that used to work with 0.5 will not work with 0.6.-9. So if you
have any top-level lexical rule with ? or *, please change those rules so that
they don't match the empty string "".

Sreeni.


----------------------------------------------------------------------

Message 2:

Subject: Re: Looking for HTML.jack
From: [email protected] (Kimbo Mundy)
Date: Sat, 11 Jan 1997 19:38:32 -0500

Rupert Nagler wrote:
> 
> I am very impressed by the Jack-Concept and I am looking for a "HTML.jack".
> Is there anybody out there who has an example of a Jack-Definition file for
> HTML 3.2?

I previously sent a message entitled "A first cut at an HTML grammar".
Did people not get it?  If not, see:
http://www.tiac.net/users/kimbo/jack/HTML.jack

> Is there a way to construct a *.jack file out of a *.sgml file?

Sorry, I can't help with this.


----------------------------------------------------------------------

Message 3:

Subject: Re: HTML?
From: [email protected] (Kimbo Mundy)
Date: Mon, 13 Jan 1997 02:03:28 -0500

Well, I finally got an HTML grammar out there (at
http://www.tiac.net/users/kimbo/jack/HTML.jack).  I hope you saw it, I
got some mailer errors, that seemed like the kind that you could ignore,
but at least one person didn't receive my first posting.

I'd be interested to know if this is the sort of thing people are
looking for, or do they want the full set of tags enumerated in the
grammar as well?  Also, if there you have any desire to bundle this with
Jack (possibly after upgrades and/or integration with other people's
work), please feel free.

I must say Jack is an amazing tool.  It was really easy to learn.  I
love how readable the grammars are, and I love being able to pass info
up and down the productions as the parser runs.  I never want to have to
settle for LALR(1) again!  Thanks for writing it!


----------------------------------------------------------------------

Email summary

First we need to produce the summary of these emails, which is the "MESSAGE SUMMARY" part. We use the "Subject" field of each email as its summary entry. Observe that in each email a line containing "*** EOOH ***" marks the start of the headers we are interested in, and that the initial state of the token manager is DEFAULT. So in the DEFAULT state we let the token manager simply consume (skip) the characters it scans. When it encounters "*** EOOH ***", it switches to another state, and in that state it goes on to capture the fields we need.

We define the Digest.jj file with the following contents:

PARSER_BEGIN(Digest)
package com.github.gambo.javacc.mail.digest;
import java.io.*;

public class Digest {

  static int count = 0;


  public static void main(String args[]) throws Exception {
    FileInputStream input = new FileInputStream("../sampleMailFile");
    Digest parser = new Digest(input);
    System.out.println("DIGEST OF RECENT MESSAGES FROM THE JAVACC MAILING LIST");
    System.out.println("----------------------------------------------------------------------");
    System.out.println("");
    System.out.println("MESSAGE SUMMARY:");
    System.out.println("");
    parser.MailFile();
    if (count == 0) {
      System.out.println("There have been no messages since the last digest posting.");
      System.out.println("");
      System.out.println("----------------------------------------------------------------------");
    } else {
      System.out.println("");
      System.out.println("----------------------------------------------------------------------");
      System.out.println("");
    }
  }

}

PARSER_END(Digest)


// PARSER SPECIFICATIONS BEGIN HERE

void MailFile() :
	{
	}
{
  (
	{
	  count++;
	}
    MailMessage()
  )*
  <EOF>
}

void MailMessage() :
	{
	  Token subj=null, body;
	}
{
  ( subj=<SUBJECT> )+
	{
	  System.out.println(count + ". " + ((subj==null) ? "no subject" : subj.image));
	}
  ( body=<BODY>)*
  <END>
}


// LEXICAL SPECIFICATIONS BEGIN HERE

TOKEN:
{
  <#EOL: "\n" | "\r" | "\r\n">
|
  <#TWOEOLS: (("\n"|"\r\n") <EOL>) | ("\r\r" [ "\n" ])>
|
  <#NOT_EOL: ~["\n","\r"]>
}

<DEFAULT>
SKIP:
{
  < <EOL> "*** EOOH ***" <EOL> > : MAILHEADER
|
  <~[]>
}

<MAILHEADER>
SKIP:
{
  <_TWOEOLS: <TWOEOLS>> : MAILBODY
|
  "Subject: " : MAILSUBJECT
|
  <~[]>
}

<MAILSUBJECT>
TOKEN:
{
  <SUBJECT: ( <NOT_EOL> )+>
}

<MAILSUBJECT>
SKIP:
{
  <_EOL1: <EOL>> : MAILHEADER
}


<MAILBODY>
TOKEN:
{
  <BODY: (~["\n","\r","\u001f"])* <EOL>>
|
  <END: "\u001f"> : DEFAULT
}

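At first glance this file looks a bit cumbersome. Let's start with a set of reusable regular expressions: EOL, TWOEOLS, and NOT_EOL. The leading # marks them as private regular expressions, which can be referenced from other regular expressions but are never produced as tokens themselves.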

TOKEN:
{
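  <#EOL: "\n" | "\r" | "\r\n"> // line terminators on different platforms
|
  <#TWOEOLS: (("\n"|"\r\n") <EOL>) | ("\r\r" [ "\n" ])> // two consecutive line terminators (a blank line); marks the end of the mail header
|
  <#NOT_EOL: ~["\n","\r"]> // any character other than a line terminator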
}
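
For readers more familiar with java.util.regex, the three private expressions correspond roughly to the patterns below. This is only an illustrative sketch: the class name EolPatterns and the helper isWhole are hypothetical, and the patterns are checked against whole strings because JavaCC picks the longest match while Java regex alternation is ordered.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the tutorial's grammar.
class EolPatterns {
    // <#EOL: "\n" | "\r" | "\r\n">
    static final Pattern EOL = Pattern.compile("\r\n|\n|\r");
    // <#TWOEOLS: (("\n"|"\r\n") <EOL>) | ("\r\r" [ "\n" ])>
    static final Pattern TWO_EOLS =
        Pattern.compile("(?:\n|\r\n)(?:\r\n|\n|\r)|\r\r\n?");
    // <#NOT_EOL: ~["\n","\r"]>
    static final Pattern NOT_EOL = Pattern.compile("[^\n\r]");

    // True if the whole string s is matched by pattern p.
    static boolean isWhole(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isWhole(EOL, "\r\n"));      // true
        System.out.println(isWhole(TWO_EOLS, "\n\n")); // true
        System.out.println(isWhole(TWO_EOLS, "\r\n")); // false: a single line terminator
        System.out.println(isWhole(NOT_EOL, "x"));     // true
    }
}
```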

The process of lexical analysis is as follows:

  1. In the <DEFAULT> lexical state, the token manager simply consumes characters until it sees the marker for the beginning of a message: < <EOL> "*** EOOH ***" <EOL> >. At this point, it switches to the <MAILHEADER> state.
    <DEFAULT>
    SKIP:
    {
      < <EOL> "*** EOOH ***" <EOL> > : MAILHEADER
    |
      <~[]>
    }
  2. In the <MAILHEADER> state, when we match the string "Subject: ", we skip it and transition to the <MAILSUBJECT> state. All other header characters are skipped.
    <MAILHEADER>
    SKIP:
    {
      <_TWOEOLS: <TWOEOLS>> : MAILBODY
    |
      "Subject: " : MAILSUBJECT
    |
      <~[]>
    }
  3. In the <MAILSUBJECT> state, we match the rest of the line that follows "Subject: " and return it as a <SUBJECT> token; the parser prints the content of the <SUBJECT> token it receives. When a line terminator (<EOL>) is matched, the state switches back to <MAILHEADER>.
    <MAILSUBJECT>
    TOKEN:
    {
      <SUBJECT: ( <NOT_EOL> )+>
    }
    
    <MAILSUBJECT>
    SKIP:
    {
      <_EOL1: <EOL>> : MAILHEADER
    }
  4. In the <MAILHEADER> state, when we match two consecutive line terminators, that is, <TWOEOLS>, we switch to <MAILBODY>, where we can match the body of the email.
  5. In the <MAILBODY> state, each line of the email body is matched by <BODY: (~["\n","\r","\u001f"])* <EOL>>: zero or more characters that are neither line terminators nor the end-of-message character, followed by a line terminator. When the character "\u001f" is encountered, the email has ended: the token manager returns <END>, resets to the <DEFAULT> state, and a new round of analysis begins.
    <MAILBODY>
    TOKEN:
    {
      <BODY: (~["\n","\r","\u001f"])* <EOL>>
    |
      <END: "\u001f"> : DEFAULT
    }

The state transition diagram is as follows:

      <DEFAULT> ---> <MAILHEADER> --+--> <MAILSUBJECT> -->+
       ^                |    ^                            |
       |                |    |                            |
       |                |    |                            |
       +- <MAILBODY> <--+    +----------------------------+
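
To make the diagram concrete, the same transitions can be mimicked in plain Java. The following is only a rough, line-based sketch and not the code JavaCC generates: the class LexStateSketch and the method subjects are hypothetical names, the <MAILSUBJECT> state is folded into the handling of a "Subject: " line, and unlike the grammar it only recognizes "Subject: " at the start of a line.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the lexical states as a hand-written state machine.
class LexStateSketch {
    enum State { DEFAULT, MAILHEADER, MAILBODY }

    // Returns the subject of every message found in the given mail text.
    static List<String> subjects(String text) {
        List<String> result = new ArrayList<>();
        State state = State.DEFAULT;
        // Split on any of the three line terminators, keeping empty lines.
        for (String line : text.split("\r\n|\n|\r", -1)) {
            switch (state) {
                case DEFAULT:
                    // Skip everything until the start-of-message marker.
                    if (line.equals("*** EOOH ***")) state = State.MAILHEADER;
                    break;
                case MAILHEADER:
                    if (line.isEmpty()) {
                        // A blank line (two line terminators) ends the header.
                        state = State.MAILBODY;
                    } else if (line.startsWith("Subject: ")) {
                        // Plays the role of MAILSUBJECT: keep the rest of the line.
                        result.add(line.substring("Subject: ".length()));
                    }
                    break;
                case MAILBODY:
                    // The unit separator character ends the message.
                    if (line.indexOf('\u001f') >= 0) state = State.DEFAULT;
                    break;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "junk\n*** EOOH ***\nFrom: a\nSubject: Hello\n\nbody line\n\u001f\n"
            + "more junk\n*** EOOH ***\nSubject: Re: Hello\n\nbody\n\u001f";
        System.out.println(subjects(sample)); // prints [Hello, Re: Hello]
    }
}
```

Writing the states by hand like this shows what the grammar saves us: each `: STATE` suffix in the .jj file replaces one of the explicit assignments to the state variable.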

The output is as follows:

DIGEST OF RECENT MESSAGES FROM THE JAVACC MAILING LIST
----------------------------------------------------------------------

MESSAGE SUMMARY:

1. A note on using RE's matching the empty string
2. Re: Looking for HTML.jack
3. Re: HTML?

----------------------------------------------------------------------

Email message body

Next, we go on to handle the rest of each email: in addition to the summary list, we output each email's subject, sender, date, and body. Building on the previous step, we add some lexical states and transitions:

      <DEFAULT> ---> <MAILHEADER> --+--> <MAILSUBJECT> -->+
       ^                |    ^      |                     |
       |                |    |      |                     |
       |                |    |      +--> <MAILFROM> ----->+
       +- <MAILBODY> <--+    |      |                     |
                             |      |                     |
                             |      +--> <MAILDATE> ----->+
                             |                            |
                             |                            |
                             +----------------------------+

We need to define how the FROM and DATE content is matched here.

<MAILFROM>
TOKEN:
{
  <FROM: ( <NOT_EOL> )+>
}

<MAILFROM>
SKIP:
{
  <_EOL2: <EOL>> : MAILHEADER
}

<MAILDATE>
TOKEN:
{
  <DATE: ( <NOT_EOL> )+>
}

<MAILDATE>
SKIP:
{
  <_EOL3: <EOL>> : MAILHEADER
}

These states work like <MAILSUBJECT>: once the prefix "From: " or "Date: " has been skipped in <MAILHEADER>, the remaining characters up to the end of the line are returned as a single token. When a line terminator is encountered, the token manager switches back to the <MAILHEADER> state.

Since the message details must be printed after the complete summary list, the output for each parsed email is appended to a buffer, and the buffer is printed in one go once parsing is complete. This is done in the Java actions of the MailMessage production:

void MailMessage() :
	{
	  Token subj=null, from=null, date=null, body;
	}
{
  ( subj=<SUBJECT> | from=<FROM> | date=<DATE> )+
	{
	  System.out.println(count + ". " + ((subj==null) ? "no subject" : subj.image));
	  buffer += "\n";
	  buffer += "Message " + count + ":\n";
	  buffer += "\n";
	  buffer += "Subject: " + ((subj==null) ? "no subject" : subj.image) + "\n";
	  buffer += "From: " + ((from==null) ? "" : from.image) + "\n";
	  buffer += "Date: " + ((date==null) ? "" : date.image) + "\n";
	  buffer += "\n";
	}
  ( body=<BODY>
	{
	  buffer += body.image;
	}
  )*
  <END>
	{
	  buffer += "\n";
	  buffer += "----------------------------------------------------------------------\n";
	}
}
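
The production above accumulates its output with String +=, which is fine for a small sample file. For larger mail files, a StringBuilder avoids copying the whole buffer on every append. The following is a minimal sketch of that variation, not part of the original grammar: the class DigestBuffer and its method names are hypothetical, and the text it builds follows the same layout as the actions above.

```java
// Hypothetical replacement for the static String buffer used by the parser actions.
class DigestBuffer {
    private final StringBuilder buffer = new StringBuilder();

    // Appends the header block of one message; null fields get the same defaults as in the grammar.
    void startMessage(int count, String subject, String from, String date) {
        buffer.append("\nMessage ").append(count).append(":\n\n");
        buffer.append("Subject: ").append(subject == null ? "no subject" : subject).append('\n');
        buffer.append("From: ").append(from == null ? "" : from).append('\n');
        buffer.append("Date: ").append(date == null ? "" : date).append("\n\n");
    }

    // Appends one BODY token image, which already ends with its line terminator.
    void addBodyLine(String image) {
        buffer.append(image);
    }

    // Appends the separator that closes a message.
    void endMessage() {
        buffer.append('\n');
        buffer.append("----------------------------------------------------------------------\n");
    }

    @Override
    public String toString() {
        return buffer.toString();
    }

    public static void main(String[] args) {
        DigestBuffer digest = new DigestBuffer();
        digest.startMessage(1, "Re: HTML?", null, "Mon, 13 Jan 1997 02:03:28 -0500");
        digest.addBodyLine("Well, I finally got an HTML grammar out there\n");
        digest.endMessage();
        System.out.print(digest);
    }
}
```

In the grammar, the actions would then call methods such as startMessage(count, ...) with the token images instead of concatenating onto the static buffer field.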

The overall Digest.jj implementation is as follows:

PARSER_BEGIN(Digest)
package com.github.gambo.javacc.mail.digest;
import java.io.*;

public class Digest {

  static int count = 0;

  static String buffer = "";

  public static void main(String args[]) throws Exception {
    FileInputStream input = new FileInputStream("../sampleMailFile");
    Digest parser = new Digest(input);
    System.out.println("DIGEST OF RECENT MESSAGES FROM THE JAVACC MAILING LIST");
    System.out.println("----------------------------------------------------------------------");
    System.out.println("");
    System.out.println("MESSAGE SUMMARY:");
    System.out.println("");
    parser.MailFile();
    if (count == 0) {
      System.out.println("There have been no messages since the last digest posting.");
      System.out.println("");
      System.out.println("----------------------------------------------------------------------");
    } else {
      System.out.println("");
      System.out.println("----------------------------------------------------------------------");
      System.out.println("");
      System.out.println(buffer);
    }
  }

}

PARSER_END(Digest)


// PARSER SPECIFICATIONS BEGIN HERE

void MailFile() :
	{
	}
{
  (
	{
	  count++;
	}
    MailMessage()
  )*
  <EOF>
}

void MailMessage() :
	{
	  Token subj=null, from=null, date=null, body;
	}
{
  ( subj=<SUBJECT> | from=<FROM> | date=<DATE> )+
	{
	  System.out.println(count + ". " + ((subj==null) ? "no subject" : subj.image));
	  buffer += "\n";
	  buffer += "Message " + count + ":\n";
	  buffer += "\n";
	  buffer += "Subject: " + ((subj==null) ? "no subject" : subj.image) + "\n";
	  buffer += "From: " + ((from==null) ? "" : from.image) + "\n";
	  buffer += "Date: " + ((date==null) ? "" : date.image) + "\n";
	  buffer += "\n";
	}
  ( body=<BODY>
	{
	  buffer += body.image;
	}
  )*
  <END>
	{
	  buffer += "\n";
	  buffer += "----------------------------------------------------------------------\n";
	}
}


// LEXICAL SPECIFICATIONS BEGIN HERE

TOKEN:
{
  <#EOL: "\n" | "\r" | "\r\n">
|
  <#TWOEOLS: (("\n"|"\r\n") <EOL>) | ("\r\r" [ "\n" ])>
|
  <#NOT_EOL: ~["\n","\r"]>
}

<DEFAULT>
SKIP:
{
  < <EOL> "*** EOOH ***" <EOL> > : MAILHEADER
|
  <~[]>
}

<MAILHEADER>
SKIP:
{
  <_TWOEOLS: <TWOEOLS>> : MAILBODY
    // We cannot have just a reference to a regular expression in a
    // lexical specification - i.e., we cannot simply have <TWOEOLS>.
|
  "Subject: " : MAILSUBJECT
|
  "From: " : MAILFROM
|
  "Date: " : MAILDATE
|
  <~[]>
}

<MAILSUBJECT>
TOKEN:
{
  <SUBJECT: ( <NOT_EOL> )+>
}

<MAILSUBJECT>
SKIP:
{
  <_EOL1: <EOL>> : MAILHEADER
}

<MAILFROM>
TOKEN:
{
  <FROM: ( <NOT_EOL> )+>
}

<MAILFROM>
SKIP:
{
  <_EOL2: <EOL>> : MAILHEADER
}

<MAILDATE>
TOKEN:
{
  <DATE: ( <NOT_EOL> )+>
}

<MAILDATE>
SKIP:
{
  <_EOL3: <EOL>> : MAILHEADER
}

<MAILBODY>
TOKEN:
{
  <BODY: (~["\n","\r","\u001f"])* <EOL>>
|
  <END: "\u001f"> : DEFAULT
}

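This chapter has one more example, which takes the mail file as input and generates a mail FAQ in HTML format. It produces an "index.html" file that contains all the email subjects along with links to further HTML files such as "1.html", "2.html", and so on. Its lexical state transitions are the same as in the example in this article; only the actions differ, so if you can follow this article's example you will easily understand that one too. For reasons of space it is not covered here. Interested readers can study it further in the tutorial code.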

Sample code: GitHub - ziyiyu/javacc-tutorial: javacc tutorial

Origin blog.csdn.net/gambool/article/details/134205659