Java regex: use one pattern instead of two

N4D4V :

I have a text file which contains lines, and some of them are in the following format:

  • 3 tabs,
  • followed by a few words, with a line break at the end.

I need to catch the words in these lines, one by one (with the index of each word in the text).

I thought about a solution using 2 regex patterns and 2 loops (added the code below), but I would like to know if there is a better solution using only one regex pattern.

Here are some example lines from the text:

            Hello I am studying regex!
            This is a line in the text.
                Don't need to add this line
        nor this line.
            But this line should be included.
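
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

Map<String, Integer> wordsMap = new HashMap<>();

// Outer pattern: three tabs, then the rest of the line up to the line break
Pattern p = Pattern.compile("\\t{3}(.*)\\n");
Matcher m = p.matcher(text);

// Inner pattern: a single word (a run of non-whitespace characters)
Pattern p2 = Pattern.compile("(\\S+)");
Matcher m2 = p2.matcher("");

while (m.find()) {
    m2.reset(m.group(1));
    while (m2.find()) {
        // Index in the whole text = line content offset + offset within the line
        wordsMap.put(m2.group(1), m.start(1) + m2.start(1));
    }
}
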
Wiktor Stribiżew :

You may use

(?:\G(?!^)\h+|^\t{3})(\S+)

See the regex demo. Compile the pattern with the Pattern.MULTILINE flag so that ^ matches at the start of every line.

Each word is captured in Group 1.

Details

  • (?:\G(?!^)\h+|^\t{3}) - either the end of the previous match (\G), provided that position is not the start of a line ((?!^)), followed by 1+ horizontal whitespace chars (\h+); or three tabs at the start of a line (^\t{3})
  • (\S+) - Group 1: 1+ non-whitespace chars.
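
To see what the (?!^) guard contributes, here is a small sketch that runs the pattern with and without it. The class name AnchorGuardDemo, the helper words, and the sample input are made up for illustration; the input's first line starts with a space rather than three tabs. Since \G also matches at the very start of the input on the first find() call, the unguarded variant should pick up the words of that first line, while the guarded one should not.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
final class AnchorGuardDemo {

    // Collects Group 1 of every match of the given regex, compiled with MULTILINE.
    static List<String> words(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex, Pattern.MULTILINE).matcher(text);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed input: line 1 starts with a space, line 2 with three tabs.
        String text = " skip me\n\t\t\tkeep me";

        // With the guard, \G cannot start a match at the beginning of a line.
        System.out.println(words("(?:\\G(?!^)\\h+|^\\t{3})(\\S+)", text));

        // Without the guard, \G matches at index 0 and the first line leaks in.
        System.out.println(words("(?:\\G\\h+|^\\t{3})(\\S+)", text));
    }
}
```

With this input the guarded pattern should yield [keep, me], and the unguarded one [skip, me, keep, me].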

Java demo:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "\t\t\tHello I am studying regex!\n\t\t\tThis is a line in the text.\n\t\t\t\tDon't need to add this line\n\t\tnor this line.\n\t\t\tBut this line should be included.";
// MULTILINE makes ^ match at the start of every line, not just the start of the input
Pattern pattern = Pattern.compile("(?:\\G(?!^)\\h+|^\\t{3})(\\S+)", Pattern.MULTILINE);
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    // Group 1 holds the word; start(1) is its index in s
    System.out.println("Match: '" + matcher.group(1) + "', Start: " + matcher.start(1));
}

Output:

Match: 'Hello', Start: 3
Match: 'I', Start: 9
Match: 'am', Start: 11
Match: 'studying', Start: 14
Match: 'regex!', Start: 23
Match: 'This', Start: 33
Match: 'is', Start: 38
Match: 'a', Start: 41
Match: 'line', Start: 43
Match: 'in', Start: 48
Match: 'the', Start: 51
Match: 'text.', Start: 55
Match: 'But', Start: 113
Match: 'this', Start: 117
Match: 'line', Start: 122
Match: 'should', Start: 127
Match: 'be', Start: 134
Match: 'included.', Start: 137
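
If the goal is to store the matches rather than print them, something along these lines may work. Note that a Map<String, Integer> keyed by word, as in the question, would keep only one index for a repeated word such as "line", so this sketch keys the map by start index instead. The WordIndexer class, its index method, and the shortened sample text are assumptions for illustration, not part of the original code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named here only for illustration.
final class WordIndexer {

    // Same single pattern as above, compiled once and reused.
    private static final Pattern WORD =
            Pattern.compile("(?:\\G(?!^)\\h+|^\\t{3})(\\S+)", Pattern.MULTILINE);

    // Maps the start index of each word to the word itself, in match order.
    static Map<Integer, String> index(String text) {
        Map<Integer, String> byStart = new LinkedHashMap<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            byStart.put(m.start(1), m.group(1));
        }
        return byStart;
    }

    public static void main(String[] args) {
        // Assumed sample: the middle line has only two tabs and is skipped.
        String s = "\t\t\tThis line\n\t\tnot this\n\t\t\tthat line";
        System.out.println(index(s));
    }
}
```

Keying by index keeps both occurrences of "line" in the sample; if a word-to-index lookup is really needed, a Map<String, List<Integer>> would be one way to avoid losing repeats.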
