How to make Scanner return delimiters as tokens

Danilo Piazzalunga :

I am trying to use java.util.Scanner to tokenize an arithmetic expression, where the delimiters can be either:

  • Whitespace (\s+ or \p{Space}+), which should be discarded
  • Punctuation (\p{Punct}), which should be returned as tokens

Example

Given this expression:

12 + (ab-bc*3)

I would like Scanner to return these tokens:

  • 12
  • +
  • (
  • ab
  • -
  • bc
  • *
  • 3
  • )

Code

So far, I have only been able to:

  • Eat up all of the punctuation characters (not what I wanted):
    • new Scanner("12 + (ab-bc*3)").useDelimiter("\\p{Space}+|\\p{Punct}").tokens().collect(Collectors.toList())
    • Result: "12", "", "", "", "ab", "bc", "3"
  • Achieve partial success using positive lookahead:
    • new Scanner("12 + (ab-bc*3)").useDelimiter("\\p{Space}+|(?=\\p{Punct})").tokens().collect(Collectors.toList())
    • Result: "12", "+", "(ab", "-bc", "*3", ")"

But now I am stuck.

Wiktor Stribiżew :

A matching approach allows you to use a much simpler regex here:

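import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

String text = "12 + (ab-bc*3)";
// Matcher.results() requires Java 9 or later
List<String> results = Pattern.compile("\\p{Punct}|\\w+").matcher(text)
    .results()
    .map(MatchResult::group)
    .collect(Collectors.toList());
System.out.println(results);
// => [12, +, (, ab, -, bc, *, 3, )]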

See Java demo.

The regex matches

  • \p{Punct} - a single punctuation or symbol char
  • | - or
  • \w+ - one or more letters, digits or _ chars.
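
If Matcher.results() is not available (it was added in Java 9), the same regex can be driven with a plain find() loop. The following is only a minimal sketch; the class name MatchTokenizer and the method name tokenize are made up for illustration and are not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named here only for illustration.
class MatchTokenizer {
    // Same regex as above: one punctuation char, or a run of word chars.
    private static final Pattern TOKEN = Pattern.compile("\\p{Punct}|\\w+");

    // Collects every match in order; anything the regex does not match
    // (such as whitespace) is skipped by find().
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("12 + (ab-bc*3)"));
        // prints [12, +, (, ab, -, bc, *, 3, )]
    }
}
```

One detail to be aware of: _ belongs to both \p{Punct} and \w. Because the \p{Punct} alternative is tried first, a leading underscore ("_ab") comes back as its own token, while an underscore inside a word ("a_b") stays part of that word.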

See the regex demo (converted to PCRE for demo purposes).
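
If you would rather stay with Scanner, one possible variation on the lookahead attempt from the question is to also split at the zero-width position right after a punctuation char, using a lookbehind. This is a sketch under that assumption, not a tested recommendation; the class name ScannerTokenizer and the method name tokenize are hypothetical, and Scanner.tokens() needs Java 9 or later.

```java
import java.util.List;
import java.util.Scanner;
import java.util.stream.Collectors;

// Hypothetical helper class, named here only for illustration.
class ScannerTokenizer {
    // Split on whitespace runs, or at the zero-width boundary
    // just before or just after any punctuation char.
    private static final String DELIMITER =
        "\\p{Space}+|(?=\\p{Punct})|(?<=\\p{Punct})";

    static List<String> tokenize(String text) {
        // The stream is fully consumed by collect() before the Scanner closes.
        try (Scanner scanner = new Scanner(text)) {
            return scanner.useDelimiter(DELIMITER)
                .tokens()
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenize("12 + (ab-bc*3)"));
    }
}
```

The lookahead alone only cuts in front of a punctuation char, which is why "(ab" and "-bc" stayed glued together in the question; the added lookbehind cuts behind it as well. The matching approach is still simpler to reason about, since it describes the tokens directly instead of the gaps between them.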

Origin http://43.154.161.224:23101/article/api/json?id=327541&siteId=1