How to change my regex to match/replace on 2nd, 3rd, ... words, but not on first one?

Vanua Aunav :

The task must be solved using regular expressions without using container classes.

Input: text (may consist of Latin and Cyrillic letters, does not contain _)

Output: the source text, with every occurrence of a repeated word preceded by an underscore _

A word is a sequence consisting only of letters (all other characters are not part of a word). Create a static convert method that converts the input to the output.

Method to complete:

public static String convert (String input) {
    ...
}

Input example:

This is a test
And this is also a test
And these are also tests
test
Это тест
Это также тест
И это также тесты

Output example:

This _is _a _test
_And this _is _also _a _test
_And these are _also tests
_test
_Это _тест
_Это _также _тест
И это _также тесты

My attempt:

public static void convert(String input) {
        Pattern p = Pattern.compile("(\\b\\w+\\b)(?=[\\s\\S]*\\b\\1\\b[\\s\\S]*\\b\\1\\b)", Pattern.UNICODE_CHARACTER_CLASS);
        String res = p.matcher(input+" "+input).replaceAll("_$1");
        res = res.substring(0, res.length() - 1 - p.matcher(input).replaceAll("_$1").length());
        System.out.println(res);
    }

My output:

This _is _a _test
_And this _is _also _a test
_And these are _also tests
_test
_Это _тест
_Это _также _тест
И это _также тесты

The word "test" in the second row has no "_", but I need "_test".

Wiktor Stribiżew :

You may collect all repeated words first and then prepend _ to every occurrence of them:

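// Java 9+
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

String s = "This is a test\nAnd this is also a test\nAnd these are also tests\ntest\nЭто тест\nЭто также тест\nИ это также тесты";
// Matches each whole word that occurs again somewhere later in the text
String rx = "(?sU)\\b(\\w+)\\b(?=.*\\b\\1\\b)";
String[] results = Pattern.compile(rx).matcher(s).results().map(MatchResult::group).toArray(String[]::new);
// Prepend _ to every occurrence of the collected words
System.out.println(s.replaceAll("(?U)\\b(?:" + String.join("|", results) + ")\\b", "_$0"));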

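// Java 8
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "This is a test\nAnd this is also a test\nAnd these are also tests\ntest\nЭто тест\nЭто также тест\nИ это также тесты";
// Matches each whole word that occurs again somewhere later in the text
String rx = "(?sU)\\b(\\w+)\\b(?=.*\\b\\1\\b)";
List<String> matches = new ArrayList<>();
Matcher m = Pattern.compile(rx).matcher(s);
while (m.find()) {
    matches.add(m.group());
}
// Prepend _ to every occurrence of the collected words
System.out.println(s.replaceAll("(?U)\\b(?:" + String.join("|", matches) + ")\\b", "_$0"));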

Both snippets print:

This _is _a _test
_And this _is _also _a _test
_And these are _also tests
_test
_Это _тест
_Это _также _тест
И это _также тесты

Note that I replaced the [\s\S] workaround with . combined with the embedded s (DOTALL) flag, so that . matches line breaks too. I also used the Java 9+ .results() method to collect all matches, and built the final pattern by joining the found matches with the | alternation operator.
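The two steps above can be wrapped into the convert method the question asks for. The following is only a sketch under a few assumptions: the class name RepeatedWords is made up for illustration, a StringBuilder is assumed to be acceptable under the "no container classes" rule, and the early return for input without repeated words is an addition (an empty alternation would otherwise match at every word boundary).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RepeatedWords {
    // Same first regex as in the answer: a whole word that occurs again later in the text.
    private static final Pattern REPEATED =
            Pattern.compile("(?sU)\\b(\\w+)\\b(?=.*\\b\\1\\b)");

    public static String convert(String input) {
        // Collect the repeated words directly into an alternation string
        // (assumption: a StringBuilder does not count as a container class).
        StringBuilder alternation = new StringBuilder();
        Matcher m = REPEATED.matcher(input);
        while (m.find()) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            // \w+ under (?U) yields no regex metacharacters, so no quoting is applied here.
            alternation.append(m.group());
        }
        // Added guard: with no repeated words, \b(?:)\b would match at every word boundary.
        if (alternation.length() == 0) {
            return input;
        }
        // Prepend _ to every occurrence of any collected word.
        return input.replaceAll("(?U)\\b(?:" + alternation + ")\\b", "_$0");
    }

    public static void main(String[] args) {
        String s = "This is a test\nAnd this is also a test\nAnd these are also tests\ntest\n"
                + "Это тест\nЭто также тест\nИ это также тесты";
        System.out.println(convert(s));
    }
}
```

Note that \w also accepts digits and _, while the task defines a word as letters only; for the sample input this makes no difference.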

Details

  • (?sU)\b(\w+)\b(?=.*\b\1\b):
    • (?sU) - an embedded DOTALL (makes . match linebreaks, too) and UNICODE_CHARACTER_CLASS (makes all shorthands Unicode aware) flag options
    • \b - word boundary
    • (\w+) - Group 1: 1+ word chars, letters, digits or _s
    • \b - word boundary
    • (?=.*\b\1\b) - immediately to the right, there must be any 0+ chars, as many as possible, followed with the same value as in Group 1 as a whole word.
  • "(?U)\\b(?:" + String.join("|", results) + ")\\b": for the sample input this pattern will look like (?U)\b(?:is|a|test|And|also|test|Это|тест|также)\b
    • (?U) - an embedded UNICODE_CHARACTER_CLASS flag option
    • \b - word boundary
    • (?:is|a|test|And|also|test|Это|тест|также) - a non-capturing alternation group
    • \b - word boundary
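To see what the two embedded flags from the list above change, here is a small illustrative check. The class and method names (FlagDemo, isAsciiWord, isUnicodeWord, dotCrossesLine) are made up for this sketch and are not part of the answer.

```java
public class FlagDemo {
    // Default \w is ASCII-only: [a-zA-Z_0-9].
    public static boolean isAsciiWord(String s) {
        return s.matches("\\w+");
    }

    // With (?U), i.e. UNICODE_CHARACTER_CLASS, \w also covers non-Latin letters.
    public static boolean isUnicodeWord(String s) {
        return s.matches("(?U)\\w+");
    }

    // With (?s), i.e. DOTALL, the dot also matches line terminators.
    public static boolean dotCrossesLine(String s, boolean dotAll) {
        return s.matches((dotAll ? "(?s)" : "") + "a.b");
    }

    public static void main(String[] args) {
        System.out.println(isAsciiWord("тест"));            // false
        System.out.println(isUnicodeWord("тест"));          // true
        System.out.println(dotCrossesLine("a\nb", false));  // false
        System.out.println(dotCrossesLine("a\nb", true));   // true
    }
}
```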

The replacement for the second regex is _$0: the _ is prepended to the whole match value, $0.
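As a quick illustration of $0, the sketch below marks one fixed word in two equivalent ways: with $0 and no capturing group, and with an explicit group and $1. The class ReplacementDemo and its methods are hypothetical helpers, and Pattern.quote is used only because the word here is an arbitrary parameter.

```java
import java.util.regex.Pattern;

public class ReplacementDemo {
    // $0 refers to the whole match, so no capturing group is needed.
    public static String markWholeMatch(String text, String word) {
        return text.replaceAll("(?U)\\b" + Pattern.quote(word) + "\\b", "_$0");
    }

    // Equivalent variant with an explicit capturing group and $1.
    public static String markGroup(String text, String word) {
        return text.replaceAll("(?U)\\b(" + Pattern.quote(word) + ")\\b", "_$1");
    }

    public static void main(String[] args) {
        System.out.println(markWholeMatch("a test, tests, test", "test")); // a _test, tests, _test
        System.out.println(markGroup("a test, tests, test", "test"));      // a _test, tests, _test
    }
}
```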
