How to skip some part of Regex in Java?

nazar_art :

I have a PDF file, and the program reads it line by line.

Here is a snippet from the file:

enter image description here

I need to extract:

12000

The parsed line looks like the following:

Bolighus fullverdi 4 374 720 12 000 11 806

I can't find a way to skip the first 7 digits (4 374 720).

I tried to play with some matching like:

(\d+ ){3}

It finds 2 matches:

enter image description here

This regex gets the value in this case:

\d+ 000

But I want to avoid hard-coding 000 in the regex, because it will fail on other documents.

How to solve this issue?

Maybe you can suggest some other solution to this problem?

UPDATE:

With @PushpeshKumarRajwanshi's answer, almost everything works:

public static String groupNumbers(String pageLine) {
    String transformedLine = pageLine.replaceAll(" (?=\\d{3})", StringUtils.EMPTY);
    log.info("TRANSFORMED LINE: \n[{}]\nFrom ORIGINAL: \n[{}]", transformedLine, pageLine);
    return transformedLine;
}

public static List<String> getGroupedNumbersFromLine(String pageLine) {
    String groupedLine = groupNumbers(pageLine);
    List<String> numbers = Arrays.stream(groupedLine.split(" "))
            .filter(StringUtils::isNumeric)
            .collect(Collectors.toList());
    log.info("Get list of numbers: \n{}\nFrom line: \n[{}]", numbers, pageLine);
    return numbers;
}

However, I found one critical issue.

Sometimes the PDF file looks like the following:

enter image description here

Here the last 3 digits are a separate number.

The parsed line ends with:

313 400 6 000 370

Which produces an incorrect result:

313400, 6000370

instead of

313400, 6000, 370

UPDATE 2

Consider the next case:

enter image description here

Our line will look like:

Innbo Ekstra Nordea 1 500 000 1 302

It produces 3 groups as a result:

1500000
1
302

In fact, only the second group is missing from the input. Is it possible to make the regex more flexible so that it copes with a missing second group?

How to fix this behaviour?

Pushpesh Kumar Rajwanshi :

Your numbers follow a pattern that can be exploited here. Any space that is immediately followed by three digits is a thousands separator inside a number, so removing those spaces joins the pieces back into the actual numbers. That turns this string,

Bolighus fullverdi 4 374 720 12 000 11 806

to this,

Bolighus fullverdi 4374720 12000 11806

And thus you can capture the second number easily by using this regex,

.*\d+\s+(\d+)\s+\d+

and reading capture group 1 (the only capturing group in the pattern).

Here is sample Java code for this,

public static void main(String[] args) {
    String s = "Bolighus fullverdi 4 374 720 12 000 11 806";
    s = s.replaceAll(" (?=\\d{3})", "");
    System.out.println("Transformed string: " + s);
    Pattern p = Pattern.compile(".*\\d+\\s+(\\d+)\\s+\\d+");
    Matcher m = p.matcher(s);
    if (m.find()) {
        System.out.println(m.group(1));
    } else {
        System.out.println("Didn't match");
    }
}

Which outputs,

Transformed string: Bolighus fullverdi 4374720 12000 11806
12000
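A quick way to see where this space-removal step works and where it does not is to run it on two lines. This is only a sketch: the class and method names are made up for illustration, and the second input uses the "Andre bygninger" line that appears later in this answer.

```java
public class TransformCheck {
    // Removes every space that is directly followed by three digits.
    // Note the lookahead only checks that three digits follow the space;
    // it does not check what comes before the space.
    static String joinDigitGroups(String line) {
        return line.replaceAll(" (?=\\d{3})", "");
    }

    public static void main(String[] args) {
        // First number starts with a single digit, so the label stays separate.
        System.out.println(joinDigitGroups("Bolighus fullverdi 4 374 720 12 000 11 806"));
        // First number starts with three digits, so it is glued to the label,
        // and the standalone 370 is glued to 6000.
        System.out.println(joinDigitGroups("Andre bygninger 313 400 6 000 370"));
    }
}
```

The first call gives "Bolighus fullverdi 4374720 12000 11806". The second gives "Andre bygninger313400 6000370", which shows why the transform alone is not enough for every line.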

Hope this helps!

Edit:

Here is the explanation of the regex \D*\d+\s+(\d+)\s+\d+ for capturing the required data from the transformed string.

Bolighus fullverdi 4374720 12000 11806
  • \D* --> Matches any non-digit data before the numbers; here it matches "Bolighus fullverdi "
  • \d+ --> Matches one or more digits; here it matches 4374720
  • \s+ --> Matches the one or more whitespace characters between the numbers.
  • (\d+) --> Matches one or more digits and captures them in group 1; here it matches 12000
  • \s+ --> Matches the one or more whitespace characters between the numbers.
  • \d+ --> Matches one or more digits; here it matches 11806

Since the OP wanted the second number, I only grouped the second \d+ (put parentheses around the part to capture). If you want the first or third number as well, you can simply group them too, like this,

\D*(\d+)\s+(\d+)\s+(\d+)

Then in java code, calling,

m.group(1) would give group 1 number which is 4374720

m.group(2) would give group 2 number which is 12000

m.group(3) would give group 3 number which is 11806
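Put together, the three-group version could look like the sketch below. The class and method names are hypothetical; the two regexes are the ones shown above, and the transform step is assumed to run first.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ThreeGroups {
    // Returns the three numbers of a line, or an empty array if the line does not fit.
    static String[] capture(String line) {
        // Step 1: join thousands groups, e.g. "4 374 720" becomes "4374720".
        String joined = line.replaceAll(" (?=\\d{3})", "");
        // Step 2: skip the non-digit label, then capture three whitespace-separated numbers.
        Matcher m = Pattern.compile("\\D*(\\d+)\\s+(\\d+)\\s+(\\d+)").matcher(joined);
        if (!m.find()) {
            return new String[0];
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        for (String number : capture("Bolighus fullverdi 4 374 720 12 000 11 806")) {
            System.out.println(number);
        }
    }
}
```

For the sample line this prints 4374720, 12000 and 11806 on separate lines.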

Hope this clarifies and let me know if you need anything further.

Edit2

For covering the case for following string,

Andre bygninger 313 400 6 000 370

so that it captures 313400, 6000 and 370, I have to change the approach. Instead of transforming the string first, the regex captures each number together with its internal spaces, and the spaces are removed once all three numbers have been captured. This works for the old string as well as for the new one above, where the last three digits (370) should be the third number. But suppose we have the following case,

Andre bygninger 313 400 6 000 370 423

where a further block, 423, follows in the string. Then the following numbers are captured,

313400, 6000370, 423

because the regex cannot know whether 370 belongs with 6 000 or with 423. So I have written the solution so that the last block of three digits is captured as the third number.

Here is Java code that you can use.

public static void main(String[] args) throws Exception {
    Pattern p = Pattern
            .compile(".*?(\\d{1,3}(?:\\s+\\d{3})*)\\s*(\\d{1,3}(?:\\s+\\d{3})*)\\s*(\\d{1,3}(?:\\s+\\d{3})*)");
    List<String> list = Arrays.asList("Bolighus fullverdi 4 374 720 12 000 11 806",
            "Andre bygninger 313 400 6 000 370");

    for (String s : list) {
        Matcher m = p.matcher(s);
        if (m.matches()) {
            System.out.println("For string: " + s);
            System.out.println(m.group(1).replaceAll(" ", ""));
            System.out.println(m.group(2).replaceAll(" ", ""));
            System.out.println(m.group(3).replaceAll(" ", ""));
        } else {
            System.out.println("For string: '" + s + "' Didn't match");
        }
        System.out.println();
    }
}

This code prints the following output, as you wanted,

For string: Bolighus fullverdi 4 374 720 12 000 11 806
4374720
12000
11806

For string: Andre bygninger 313 400 6 000 370
313400
6000
370
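The ambiguous four-block case mentioned earlier can be checked with the same pattern. This is a small sketch; the class name and the describe helper are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AmbiguousTail {
    // Same three-group pattern as in the code above.
    private static final Pattern P = Pattern.compile(
            ".*?(\\d{1,3}(?:\\s+\\d{3})*)\\s*(\\d{1,3}(?:\\s+\\d{3})*)\\s*(\\d{1,3}(?:\\s+\\d{3})*)");

    // Returns the three captured numbers joined with ", ", or "no match".
    static String describe(String line) {
        Matcher m = P.matcher(line);
        if (!m.matches()) {
            return "no match";
        }
        return m.group(1).replaceAll(" ", "") + ", "
                + m.group(2).replaceAll(" ", "") + ", "
                + m.group(3).replaceAll(" ", "");
    }

    public static void main(String[] args) {
        // The second group is greedy, so it keeps 370 and leaves only 423 for group 3.
        System.out.println(describe("Andre bygninger 313 400 6 000 370 423"));
    }
}
```

This prints 313400, 6000370, 423, so a trailing extra block changes which digits end up in the second number.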

Here is the explanation for regex,

.*?(\\d{1,3}(?:\\s+\\d{3})*)\\s*(\\d{1,3}(?:\\s+\\d{3})*)\\s*(\\d{1,3}(?:\\s+\\d{3})*)
  • .*? --> Matches and consumes any input before the numbers
  • (\\d{1,3}(?:\\s+\\d{3})*) --> Captures the first number: one to three digits, followed by zero or more repetitions of "whitespace plus exactly three digits".
  • \\s* --> Matches zero or more whitespace characters

After that, the same group (\\d{1,3}(?:\\s+\\d{3})*) is repeated two more times, so the numbers are captured in three groups.

Because there are three capturing groups, all three must capture something for the match to succeed. For example, here is how this input is matched,

Andre bygninger 313 400 6 000 370

First, .*? matches "Andre bygninger ". Then the first group (\\d{1,3}(?:\\s+\\d{3})*) first matches 313 (because of \\d{1,3}), and (?:\\s+\\d{3})* then matches a space and 400. It stops there because what follows is a space and 6, which is one digit rather than three.

Similarly, the second group (\\d{1,3}(?:\\s+\\d{3})*) first matches 6 (because of \\d{1,3}), and (?:\\s+\\d{3})* then matches 000. Being greedy, it initially takes 370 as well, but the engine backtracks and gives 370 back, because group 3 needs something to match or the whole match fails.

Finally, the third group matches 370, the only data left: \\d{1,3} matches 370 and (?:\\s+\\d{3})* matches nothing, since it allows zero repetitions.

Hope that clarifies. Let me know if you still have any query.

Edit 22 Dec 2018 for grouping numbers in two groups only

As you want to group the data from this string,

Innbo Ekstra Nordea 1 500 000 1 302

into two groups of numbers, 1500000 and 1302, your regex needs only two groups. As I replied in the comment, it becomes this,

.*?(\\d{1,3}(?:\\s+\\d{3})*)\\s*(\\d{1,3}(?:\\s+\\d{3})*)

Here is the Java code for this,

public static void main(String[] args) throws Exception {
    Pattern p = Pattern
            .compile(".*?(\\d{1,3}(?:\\s+\\d{3})*)\\s*(\\d{1,3}(?:\\s+\\d{3})*)");
    List<String> list = Arrays.asList("Innbo Ekstra Nordea 1 500 000 1 302");

    for (String s : list) {
        Matcher m = p.matcher(s);
        if (m.matches()) {
            System.out.println("For string: " + s);
            System.out.println(m.group(1).replaceAll(" ", ""));
            System.out.println(m.group(2).replaceAll(" ", ""));
        } else {
            System.out.println("For string: '" + s + "' Didn't match");
        }
        System.out.println();
    }
}

This prints what you expect.

For string: Innbo Ekstra Nordea 1 500 000 1 302
1500000
1302
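If some lines have two numbers and others three, one option is to build the pattern from the expected number of groups instead of keeping two hard-coded regexes. The sketch below assumes the caller already knows how many numbers a line should contain (for example from the table layout); the regex cannot work this out by itself, because "1 302" is also a valid pair of numbers. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupedNumbers {
    // One number: 1-3 digits followed by any number of "whitespace + three digits" blocks.
    private static final String NUMBER = "(\\d{1,3}(?:\\s+\\d{3})*)";

    // Builds the pattern for a line expected to hold `count` numbers,
    // repeating the same group and joining the copies with \s*.
    static Pattern patternFor(int count) {
        return Pattern.compile(".*?" + String.join("\\s*", Collections.nCopies(count, NUMBER)));
    }

    // Returns the numbers with inner whitespace removed, or an empty list if nothing matches.
    static List<String> extract(String line, int count) {
        Matcher m = patternFor(count).matcher(line);
        List<String> result = new ArrayList<>();
        if (m.matches()) {
            for (int i = 1; i <= count; i++) {
                result.add(m.group(i).replaceAll("\\s+", ""));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(extract("Innbo Ekstra Nordea 1 500 000 1 302", 2));
        System.out.println(extract("Andre bygninger 313 400 6 000 370", 3));
    }
}
```

With these inputs it prints [1500000, 1302] and [313400, 6000, 370]. Asking for three groups on the first line would still split 1 302 into 1 and 302, which is why the count has to come from outside the regex.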

Origin http://43.154.161.224:23101/article/api/json?id=107052&siteId=1