The task must be solved using regular expressions, without using container classes.
Input: text (may consist of Latin and Cyrillic letters; does not contain _)
Output: the source text, with every repeated word preceded by an underscore _
A word is a sequence containing only letters (all other characters are not part of a word). Create a static convert method that converts the input to the output.
Method to complete:
public static String convert (String input) {
...
}
Input example:
This is a test
And this is also a test
And these are also tests
test
Это тест
Это также тест
И это также тесты
Output example:
This _is _a _test
_And this _is _also _a _test
_And these are _also tests
_test
_Это _тест
_Это _также _тест
И это _также тесты
My attempt:
public static void convert(String input) {
Pattern p = Pattern.compile("(\\b\\w+\\b)(?=[\\s\\S]*\\b\\1\\b[\\s\\S]*\\b\\1\\b)", Pattern.UNICODE_CHARACTER_CLASS);
String res = p.matcher(input+" "+input).replaceAll("_$1");
res = res.substring(0, res.length() - 1 - p.matcher(input).replaceAll("_$1").length());
System.out.println(res);
}
My output:
This _is _a _test
_And this _is _also _a test
_And these are _also tests
_test
_Это _тест
_Это _также _тест
И это _также тесты
The word "test" in the second row has no "_", but I need "_test".
You may collect all repeated words and then prepend them with _:
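// Java 9+ (uses java.util.regex.Pattern and java.util.regex.MatchResult)
String s = "This is a test\nAnd this is also a test\nAnd these are also tests\ntest\nЭто тест\nЭто также тест\nИ это также тесты";
// Matches every word that occurs again somewhere later in the text
String rx = "(?sU)\\b(\\w+)\\b(?=.*\\b\\1\\b)";
// Collect the words that have a later copy
String[] results = Pattern.compile(rx).matcher(s).results().map(MatchResult::group).toArray(String[]::new);
// Prepend _ to every whole-word occurrence of the collected words
System.out.println(s.replaceAll("(?U)\\b(?:" + String.join("|", results) + ")\\b", "_$0"));
// Java 8 (uses java.util.ArrayList, java.util.List, java.util.regex.Matcher and java.util.regex.Pattern)
String s = "This is a test\nAnd this is also a test\nAnd these are also tests\ntest\nЭто тест\nЭто также тест\nИ это также тесты";
String rx = "(?sU)\\b(\\w+)\\b(?=.*\\b\\1\\b)";
// Collect the words that have a later copy
List<String> matches = new ArrayList<>();
Matcher m = Pattern.compile(rx).matcher(s);
while (m.find()) {
matches.add(m.group());
}
// Prepend _ to every whole-word occurrence of the collected words
System.out.println(s.replaceAll("(?U)\\b(?:" + String.join("|", matches) + ")\\b", "_$0"));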
See the Java demo online and the second snippet demo. Output:
This _is _a _test
_And this _is _also _a _test
_And these are _also tests
_test
_Это _тест
_Это _также _тест
И это _также тесты
Note I replaced the [\s\S] workaround construct with . combined with the s (DOTALL) embedded flag option (so that . could match line breaks, too), used the Java 9+ .results() method to return all matches, and built the final pattern out of the found matches joined with the | (OR) alternation operator.
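The two steps described above can also be wrapped into the convert signature from the question. The following is only a sketch under a few assumptions: the class name RepeatedWordsSketch is hypothetical, and a StringBuilder is used instead of an array or list, which may sit closer to the "no container classes" requirement. It also returns the input unchanged when nothing is repeated, because an empty alternation group (?:) would otherwise match at every word boundary.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this sketch.
class RepeatedWordsSketch {

    // Matches a word only if the same word occurs again later in the text.
    private static final Pattern HAS_LATER_COPY =
            Pattern.compile("(?sU)\\b(\\w+)\\b(?=.*\\b\\1\\b)");

    public static String convert(String input) {
        // Join the found words with | in a StringBuilder instead of a collection.
        StringBuilder alternation = new StringBuilder();
        Matcher m = HAS_LATER_COPY.matcher(input);
        while (m.find()) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            // \w+ never contains regex metacharacters, so no quoting is needed.
            alternation.append(m.group(1));
        }
        // No repeated words: an empty (?:) group would match at every
        // word boundary, so return the text as it is.
        if (alternation.length() == 0) {
            return input;
        }
        // Prepend _ to every whole-word occurrence of a repeated word.
        return input.replaceAll("(?U)\\b(?:" + alternation + ")\\b", "_$0");
    }

    public static void main(String[] args) {
        System.out.println(convert("This is a test\nAnd this is also a test"));
    }
}
```

Note that \w also matches digits and underscores, so this follows the answer's patterns rather than the stricter "letters only" definition of a word given in the task.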
Details
(?sU)\b(\w+)\b(?=.*\b\1\b):
(?sU) - an embedded DOTALL (makes . match line breaks, too) and UNICODE_CHARACTER_CLASS (makes all shorthands Unicode aware) flag options
\b - word boundary
(\w+) - Group 1: 1+ word chars (letters, digits or _)
\b - word boundary
(?=.*\b\1\b) - immediately to the right, there must be any 0+ chars, as many as possible, followed with the same value as in Group 1 as a whole word.
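To see why the first pattern is only used for collecting words, it may help to apply it directly as a replacement. The sketch below (the class and method names are hypothetical) shows that the lookahead can only see copies to the right, so the last occurrence of each repeated word is left unmarked.

```java
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
class LookaheadOnlyDemo {

    // The first pattern from the answer: a word with another copy later on.
    static final Pattern HAS_LATER_COPY =
            Pattern.compile("(?sU)\\b(\\w+)\\b(?=.*\\b\\1\\b)");

    // Marks only the occurrences that are followed by another copy.
    static String markWithLookaheadOnly(String text) {
        return HAS_LATER_COPY.matcher(text).replaceAll("_$1");
    }

    public static void main(String[] args) {
        // The final "and" and the final "test" have no copy to their right.
        System.out.println(markWithLookaheadOnly("test and test and test"));
        // prints: _test _and _test and test
    }
}
```

This is why the matched words are collected first and then replaced everywhere with a second pattern.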
"(?U)\\b(?:" + String.join("|", results) + ")\\b": this pattern will look like (?U)\b(?:test|is|Это|тест|также)\b
(?U) - an embedded UNICODE_CHARACTER_CLASS flag option
\b - word boundary
(?:test|is|Это|тест|также) - a non-capturing alternation group
\b - word boundary
The replacement is _$0 for the second regex, as the _ is prepended to the whole match value, $0.
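As a small illustration of the $0 reference, here is a sketch (the class and method names are hypothetical) that applies the second pattern with a hand-written alternation. Because the group is non-capturing, there is no Group 1 to refer to, and $0 stands for the whole match.

```java
// Hypothetical class name, used only for this illustration.
class WholeMatchReplacementDemo {

    // Prepends _ to every whole-word match of the given alternation.
    static String underscore(String text, String alternation) {
        // (?:...) does not capture, so "_$1" would fail with "No group 1";
        // $0 always refers to the entire match.
        return text.replaceAll("(?U)\\b(?:" + alternation + ")\\b", "_$0");
    }

    public static void main(String[] args) {
        System.out.println(underscore("is this is it", "is"));
        // prints: _is this _is it
    }
}
```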