I like to do code-golfing in Java (even though Java is way too verbose to be competitive), which is completing a certain challenge in as few bytes as possible. In one of my answers I had the following piece of code:
for(var p:"A4;B8;CU;EM;EW;E3;G6;G9;I1;L7;NZ;O0;R2;S5".split(";"))
This basically loops over the 2-char Strings after converting the input into a String array with .split. Someone suggested I could golf it to this instead to save 4 bytes:
for(var p:"A4B8CUEMEWE3G6G9I1L7NZO0R2S5".split("(?<=\\G..)"))
The functionality is still the same. It loops over the 2-char Strings.
However, neither of us was 100% sure how this works, hence this question.
What I know:
I know .split("(?<= ... )") is used to split, but keep the trailing delimiter.
There is also a way to keep a leading delimiter, or to keep the delimiter as a separate item:
"a;b;c;d".split("(?<=;)") // Results in ["a;", "b;", "c;", "d"]
"a;b;c;d".split("(?=;)") // Results in ["a", ";b", ";c", ";d"]
"a;b;c;d".split("((?<=;)|(?=;))") // Results in ["a", ";", "b", ";", "c", ";", "d"]
I know \G is used to stop after a non-match is encountered.
EDIT: \G is used to indicate the position where the last match ended (or the start of the string for the first run). Corrected definition thanks to @SebastianProske.
int count = 0;
java.util.regex.Pattern pattern = java.util.regex.Pattern.compile("match,");
java.util.regex.Matcher matcher = pattern.matcher("match,match,match,blabla,match,match,");
while(matcher.find())
  count++;
System.out.println(count); // Results in 5
count = 0;
pattern = java.util.regex.Pattern.compile("\\Gmatch,");
matcher = pattern.matcher("match,match,match,blabla,match,match,");
while(matcher.find())
  count++;
System.out.println(count); // Results in 3
But how does .split("(?<=\\G..)") work exactly when using \G inside the split?
And why does .split("(?=\\G..)") not work?
Here is a "Try it online"-link for all code-snippets described above to see them in action.
How does .split("(?<=\\G..)") work?
(?<=X) is a zero-width positive lookbehind for X. \G is the end of the previous match (not some kind of stop instruction) or the beginning of input, and of course .. is two individual characters. So (?<=\G..) is a zero-width lookbehind for the end of the previous match plus two characters. Since this is split and we're describing a delimiter, making the entire thing a zero-width assertion means we only use it to identify where to break the string, not to actually consume any characters.
So let's walk through ABCDEF:
- \G matches the beginning of input, and .. matches AB, so (?<=\G..) finds the zero-width space between AB and CD because this is a lookbehind: that is, the first point at which there is \G.. prior to the regex cursor is the point between AB and CD. So split between AB and CD.
- \G marks the location just after AB, so (?<=\G..) finds the zero-width space between CD and EF, because as the regex cursor goes forward, that's the first place where \G.. matches: \G matching the location between AB and CD and .. matching CD. So split between CD and EF.
- Same again: \G marks the location just after CD, so (?<=\G..) finds the zero-width space between EF and end-of-input. So split between EF and end-of-input.
- Create an array with all of the resulting substrings except the empty one at the end (because this is split with an implicit limit = 0, which discards empty strings at the end).
Result: { "AB", "CD", "EF" }.
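The walkthrough above can be checked with a small sketch. The class name SplitWalkthrough and the helper matchPositions are made up for illustration; the helper simply records each index where the zero-width pattern matches, which should line up with the split points described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitWalkthrough {
    // Hypothetical helper: collects every index at which the pattern matches.
    static List<Integer> matchPositions(String regex, String input) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            positions.add(m.start()); // start == end for a zero-width match
        }
        return positions;
    }

    public static void main(String[] args) {
        String regex = "(?<=\\G..)";
        // Split points: after AB, after CD, and at end-of-input.
        System.out.println(matchPositions(regex, "ABCDEF"));           // [2, 4, 6]
        // The trailing empty string is dropped by the implicit limit of 0.
        System.out.println(Arrays.toString("ABCDEF".split(regex)));    // [AB, CD, EF]
        // With an odd length, the last chunk is the single leftover character.
        System.out.println(Arrays.toString("ABCDE".split(regex)));     // [AB, CD, E]
    }
}
```

Passing an explicit negative limit, as in "ABCDEF".split(regex, -1), should keep the trailing empty string, which is one way to see the final match at end-of-input.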
And why does .split("(?=\\G..)") not work?
Because (?=X) is a positive lookahead. The end of the previous match will never be ahead of the regex cursor. It can only be behind it.
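To see the difference side by side, here is a minimal sketch, assuming Java 8 or later (the class name LookaheadVsLookbehind and the wrapper splitWith are made up for illustration). As far as I can tell, the lookahead form can only match once, at index 0, where \G coincides with the beginning of input; a zero-width match at the very start does not produce a leading empty string, so the input comes back as a single element.

```java
import java.util.Arrays;

public class LookaheadVsLookbehind {
    // Hypothetical thin wrapper so both variants can be compared side by side.
    static String[] splitWith(String regex, String input) {
        return input.split(regex);
    }

    public static void main(String[] args) {
        String input = "ABCDEF";
        // Lookbehind: \G.. is found behind the cursor every two characters.
        System.out.println(Arrays.toString(splitWith("(?<=\\G..)", input))); // [AB, CD, EF]
        // Lookahead: \G must be at the cursor itself, which only holds at index 0.
        System.out.println(Arrays.toString(splitWith("(?=\\G..)", input)));  // [ABCDEF]
    }
}
```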