How does \G work in .split?

Kevin Cruijssen :

I like to do code-golfing in Java (even though Java is way too verbose to be competitive), which means completing a given challenge in as few bytes as possible. In one of my answers I had the following piece of code:

for(var p:"A4;B8;CU;EM;EW;E3;G6;G9;I1;L7;NZ;O0;R2;S5".split(";"))

This loops over the 2-char Strings after the input has been converted into a String array with .split. Someone suggested I could golf it to this instead to save 4 bytes:

for(var p:"A4B8CUEMEWE3G6G9I1L7NZO0R2S5".split("(?<=\\G..)"))

The functionality is still the same. It loops over the 2-char Strings.

However, neither of us was 100% sure how this works, hence this question.


What I know:

I know .split("(?<= ... )") is used to split, but keep the trailing delimiter.
There is also a way to keep a leading delimiter, or delimiter as separated item:

"a;b;c;d".split("(?<=;)")            // Results in ["a;", "b;", "c;", "d"]
"a;b;c;d".split("(?=;)")             // Results in ["a", ";b", ";c", ";d"]
"a;b;c;d".split("((?<=;)|(?=;))")    // Results in ["a", ";", "b", ";", "c", ";", "d"]

I know \G is used to stop after a non-match is encountered.
EDIT: \G is used to indicate the position where the last match ended (or the start of the string for the first run). Corrected definition thanks to @SebastianProske.

int count = 0;
java.util.regex.Pattern pattern = java.util.regex.Pattern.compile("match,");
java.util.regex.Matcher matcher = pattern.matcher("match,match,match,blabla,match,match,");
while(matcher.find())
  count++;
System.out.println(count); // Results in 5

count = 0;
pattern = java.util.regex.Pattern.compile("\\Gmatch,");
matcher = pattern.matcher("match,match,match,blabla,match,match,");
while(matcher.find())
  count++;
System.out.println(count); // Results in 3

But how does .split("(?<=\\G..)") work exactly when using \G inside the split?
And why does .split("(?=\\G..)") not work?

Here is a "Try it online" link for all code snippets described above to see them in action.

T.J. Crowder :

how does .split("(?<=\\G..)") work

(?<=X) is a zero-width positive lookbehind for X. \G is the end of the previous match (not some kind of stop instruction) or beginning of input, and of course .. is two individual characters. So (?<=\G..) is a zero-width lookbehind for the end of the previous match plus two characters. Since this is split and we're describing a delimiter, making the entire thing a zero-width assertion means we only use it to identify where to break the string, not to actually consume any characters.

So let's walk through ABCDEF:

  1. \G matches beginning of input, and .. matches AB, so (?<=\G..) finds the zero-width space between AB and CD because this is a lookbehind: That is, the first point at which there is \G.. prior to the regex cursor is the point between AB and CD. So split between AB and CD.
  2. \G marks the location just after AB so (?<=\G..) finds the zero-width space between CD and EF, because as the regex cursor goes forward, that's the first place where \G.. matches: \G matching the location between AB and CD and .. matching CD. So split between CD and EF.
  3. Same again: \G marks the location just after CD so (?<=\G..) finds the zero-width space between EF and end-of-input. So split between EF and end-of-input.
  4. Create an array with all of the substrings between those split points, except the empty one at the end (because this is split with an implicit limit = 0, which discards trailing empty strings).

Result { "AB", "CD", "EF" }.
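The walkthrough above can be reproduced with a small sketch. The class name GSplitTrace and the helpers splitPoints and chunk are made up for illustration; only Pattern, Matcher and String.split come from the JDK. It records where each zero-width match lands and then performs the same split.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GSplitTrace {
    // Hypothetical helper: collects the positions at which (?<=\G..) matches.
    static List<Integer> splitPoints(String input) {
        Matcher m = Pattern.compile("(?<=\\G..)").matcher(input);
        List<Integer> points = new ArrayList<>();
        while (m.find()) {
            points.add(m.start()); // zero-width match, so start() == end()
        }
        return points;
    }

    // Hypothetical helper: the split from the question, with the default limit of 0.
    static String[] chunk(String input) {
        return input.split("(?<=\\G..)");
    }

    public static void main(String[] args) {
        System.out.println(splitPoints("ABCDEF"));            // [2, 4, 6]
        System.out.println(Arrays.toString(chunk("ABCDEF"))); // [AB, CD, EF]
        System.out.println(Arrays.toString(chunk("ABCDE")));  // [AB, CD, E]
    }
}
```

The match at index 6 is the one that would produce the trailing empty string from step 4. With an odd-length input such as ABCDE, the leftover single character simply becomes the last element.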

And why does .split("(?=\\G..)") not work?

Because (?=X) is a positive lookahead. The end of the previous match will never be ahead of the regex cursor. It can only be behind it.
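To see the lookahead variant fail, here is a minimal sketch (the class name LookaheadGDemo and its helpers are hypothetical). As far as I can tell, the only place (?=\G..) can match is index 0, where \G and the regex cursor coincide. After that zero-width match the cursor has to move forward while \G stays at 0, so nothing else matches and split hands back the whole input.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadGDemo {
    // Hypothetical helper: counts how often (?=\G..) matches in the input.
    static int countMatches(String input) {
        Matcher m = Pattern.compile("(?=\\G..)").matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Hypothetical helper: the non-working split from the question.
    static String[] chunk(String input) {
        return input.split("(?=\\G..)");
    }

    public static void main(String[] args) {
        System.out.println(countMatches("ABCDEF"));           // 1
        System.out.println(Arrays.toString(chunk("ABCDEF"))); // [ABCDEF]
    }
}
```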
