I have a PDF file, and my program reads it line by line.
Here is a snippet from the file:
I need to extract:
12000
The parsed line looks like the following:
Bolighus fullverdi 4 374 720 12 000 11 806
I can't find a way to skip the first 7 digits (4 374 720).
I tried playing with patterns like:
(\d+ ){3}
It finds 2 matches:
A regex that does get the value in this case:
\d+ 000
But I want to omit 000 from the regex, because it will fail on other documents.
How to solve this issue?
Maybe you can suggest some other solution to this problem?
UPDATE:
With @PushpeshKumarRajwanshi's answer, almost everything works:
    public static String groupNumbers(String pageLine) {
        String transformedLine = pageLine.replaceAll(" (?=\\d{3})", StringUtils.EMPTY);
        log.info("TRANSFORMED LINE: \n[{}]\nFrom ORIGINAL: \n[{}]", transformedLine, pageLine);
        return transformedLine;
    }

    public static List<String> getGroupedNumbersFromLine(String pageLine) {
        String groupedLine = groupNumbers(pageLine);
        List<String> numbers = Arrays.stream(groupedLine.split(" "))
                .filter(StringUtils::isNumeric)
                .collect(Collectors.toList());
        log.info("Get list of numbers: \n{}\nFrom line: \n[{}]", numbers, pageLine);
        return numbers;
    }
However, I found one critical issue.
Sometimes the PDF file looks like the following:
where the last 3 digits are a separate number.
And the parsed line ends with:
313 400 6 000 370
Which produces an incorrect result:
313400, 6000370
instead of
313400, 6000, 370
UPDATE 2
Consider the next case:
Our line will look like:
Innbo Ekstra Nordea 1 500 000 1 302
It produces 3 groups as a result:
1500000
1
302
In fact, only the second group is missing from the input. Is it possible to make the regex flexible enough to handle a missing second group?
How can I fix this behaviour?
Your numbers follow a pattern that can be exploited. Any space in this string that is followed by three digits can be removed to join the pieces into the actual number, which turns this string,
Bolighus fullverdi 4 374 720 12 000 11 806
to this,
Bolighus fullverdi 4374720 12000 11806
And thus you can capture the second number easily by using this regex,
.*\d+\s+(\d+)\s+\d+
and reading capture group 1.
Here is sample Java code for the same,
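    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class Main {
        public static void main(String[] args) {
            String s = "Bolighus fullverdi 4 374 720 12 000 11 806";
            // Remove every space that is followed by three digits
            s = s.replaceAll(" (?=\\d{3})", "");
            System.out.println("Transformed string: " + s);
            // Group 1 is the second of the three numbers
            Pattern p = Pattern.compile(".*\\d+\\s+(\\d+)\\s+\\d+");
            Matcher m = p.matcher(s);
            if (m.find()) {
                System.out.println(m.group(1));
            } else {
                System.out.println("Didn't match");
            }
        }
    }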
Which outputs,
Transformed string: Bolighus fullverdi 4374720 12000 11806
12000
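One caveat with the lookahead above: `(?=\\d{3})` is satisfied by three or more digits and does not look at what precedes the space, so a space between a word and a three-digit number is removed as well. The following is only a sketch of a stricter variant, assuming thousands groups are always separated by a single plain space; the class and method names are made up for illustration.

```java
final class ThousandsJoiner {
    // Remove a space only when it is preceded by a digit and followed by
    // exactly three digits (the negative lookahead rejects a fourth digit).
    static String join(String line) {
        return line.replaceAll("(?<=\\d) (?=\\d{3}(?!\\d))", "");
    }

    public static void main(String[] args) {
        System.out.println(join("Bolighus fullverdi 4 374 720 12 000 11 806"));
        System.out.println(join("Andre bygninger 313 400 6 000 370"));
    }
}
```

For a line such as "Andre bygninger 313 400 6 000 370", the original lookahead would also glue the text to the first number ("bygninger313400"), while this variant leaves the text alone. It does not resolve the question of where one number ends and the next begins: 6 000 370 is still joined into 6000370.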
Hope this helps!
Edit:
Here is an explanation of the regex \D*\d+\s+(\d+)\s+\d+ used to capture the required data from the transformed string. It is the pattern above with \D* in place of the leading .*, so that the first \d+ matches the whole first number.
Bolighus fullverdi 4374720 12000 11806
\D*
--> Matches any non-digit data before the numbers; here it matches "Bolighus fullverdi "
\d+
--> Matches one or more digits; here it matches 4374720
\s+
--> Matches the one or more spaces between the numbers
(\d+)
--> Matches one or more digits and captures them in group 1; here it matches 12000
\s+
--> Matches the one or more spaces between the numbers
\d+
--> Matches one or more digits; here it matches 11806
As OP wanted to capture the second number, I only grouped (put parentheses around) the second \d+. If you want to capture the first or third number as well, you can simply group them too, like this,
\D*(\d+)\s+(\d+)\s+(\d+)
Then in java code, calling,
m.group(1)
would give group 1 number which is 4374720
m.group(2)
would give group 2 number which is 12000
m.group(3)
would give group 3 number which is 11806
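To make the three-group variant concrete, here is a small sketch that applies \D*(\d+)\s+(\d+)\s+(\d+) to the already transformed string and returns all three numbers. The class and method names are placeholders, not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ThreeGroups {
    private static final Pattern P = Pattern.compile("\\D*(\\d+)\\s+(\\d+)\\s+(\\d+)");

    // Returns the three captured numbers, or an empty list if the line does not fit.
    static List<String> extract(String transformedLine) {
        List<String> out = new ArrayList<>();
        Matcher m = P.matcher(transformedLine);
        if (m.find()) {
            for (int i = 1; i <= 3; i++) {
                out.add(m.group(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(extract("Bolighus fullverdi 4374720 12000 11806"));
    }
}
```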
Hope this clarifies and let me know if you need anything further.
Edit2
To cover the following string,
Andre bygninger 313 400 6 000 370
so that it captures 313400, 6000 and 370, I have to change the approach. Instead of transforming the string, this approach captures the digits together with their spaces and, once all three numbers are captured, removes the spaces inside each one. It works for the old string as well as for the new string above, where the last three digits, 370, should be captured as the third number. But suppose we have the following case,
Andre bygninger 313 400 6 000 370 423
where the string contains the further digits 423. Then it will be captured as the following numbers,
313400, 6000370, 423
as the regex cannot know whether 370 belongs to 6000 or to 423. So I have written the solution so that the last three digits are captured as the third number.
Here is Java code that you can use.
    import java.util.Arrays;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class Main {
        public static void main(String[] args) throws Exception {
            // Three groups, each one a number whose thousands are separated by whitespace
            Pattern p = Pattern
                    .compile(".*?(\\d{1,3}(?:\\s+\\d{3})*)\\s*(\\d{1,3}(?:\\s+\\d{3})*)\\s*(\\d{1,3}(?:\\s+\\d{3})*)");
            List<String> list = Arrays.asList("Bolighus fullverdi 4 374 720 12 000 11 806",
                    "Andre bygninger 313 400 6 000 370");
            for (String s : list) {
                Matcher m = p.matcher(s);
                if (m.matches()) {
                    System.out.println("For string: " + s);
                    // Strip the spaces inside each captured number
                    System.out.println(m.group(1).replaceAll(" ", ""));
                    System.out.println(m.group(2).replaceAll(" ", ""));
                    System.out.println(m.group(3).replaceAll(" ", ""));
                } else {
                    System.out.println("For string: '" + s + "' Didn't match");
                }
                System.out.println();
            }
        }
    }
This code prints the following output, as you wanted,
For string: Bolighus fullverdi 4 374 720 12 000 11 806
4374720
12000
11806
For string: Andre bygninger 313 400 6 000 370
313400
6000
370
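The ambiguous case mentioned earlier (a trailing 423 after 370) can be checked with the same pattern. This is only an illustrative sketch; the class name, method name and the `NUM` constant are made up here, and the pattern is assembled from the same three pieces used above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AmbiguousTail {
    // One number: 1-3 digits, then any number of "whitespace + exactly three digits"
    private static final String NUM = "(\\d{1,3}(?:\\s+\\d{3})*)";
    private static final Pattern P = Pattern.compile(".*?" + NUM + "\\s*" + NUM + "\\s*" + NUM);

    // Returns the three numbers with inner whitespace removed, or an empty list on no match.
    static List<String> extract(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = P.matcher(line);
        if (m.matches()) {
            for (int i = 1; i <= 3; i++) {
                out.add(m.group(i).replaceAll("\\s+", ""));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The second group is greedy, so it keeps 370 and leaves only 423 for group 3
        System.out.println(extract("Andre bygninger 313 400 6 000 370 423"));
    }
}
```

The second group takes as many three-digit chunks as it can while still leaving something for the third group, which is why 370 ends up attached to 6000 here.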
Here is the explanation of the regex,
.*?(\\d{1,3}(?:\\s+\\d{3})*)\\s*(\\d{1,3}(?:\\s+\\d{3})*)\\s*(\\d{1,3}(?:\\s+\\d{3})*)
.*?
--> Matches and consumes any input before the numbers
(\\d{1,3}(?:\\s+\\d{3})*)
--> Captures the first number: one to three digits, followed by zero or more repetitions of "whitespace plus exactly three digits"
\\s*
--> Followed by zero or more whitespace characters
After that, the same group (\\d{1,3}(?:\\s+\\d{3})*) is repeated two more times, so the numbers are captured in three groups.
Since there are three capturing groups, all three must capture something for the match to succeed. For example, here is how this input is matched,
Andre bygninger 313 400 6 000 370
First, .*? matches "Andre bygninger ". Then the first group (\\d{1,3}(?:\\s+\\d{3})*) first matches 313 (because of \\d{1,3}), and then (?:\\s+\\d{3})* matches a space and 400. It stops there because what follows is a space and 6, which is just one digit and not three.
Similarly, the second group (\\d{1,3}(?:\\s+\\d{3})*) first matches 6 (because of \\d{1,3}), and then (?:\\s+\\d{3})* matches a space and 000. It stops there (the regex engine backtracks and gives up 370) because it has to leave some data for group 3, otherwise the overall match would fail.
Finally, the third group matches 370, as that is the only data left: \\d{1,3} matches 370, and (?:\\s+\\d{3})* matches nothing, since it allows zero repetitions.
Hope that clarifies. Let me know if you still have any questions.
Edit 22 Dec 2018 for grouping numbers in two groups only
As you want to split the data from this string,
Innbo Ekstra Nordea 1 500 000 1 302
into two groups of numbers, 1500000 and 1302, your regex needs only two groups and becomes the following, as I replied in the comment,
.*?(\\d{1,3}(?:\\s+\\d{3})*)\\s*(\\d{1,3}(?:\\s+\\d{3})*)
Here is the Java code for the same,
    import java.util.Arrays;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class Main {
        public static void main(String[] args) throws Exception {
            // Same number group as before, but only two of them
            Pattern p = Pattern
                    .compile(".*?(\\d{1,3}(?:\\s+\\d{3})*)\\s*(\\d{1,3}(?:\\s+\\d{3})*)");
            List<String> list = Arrays.asList("Innbo Ekstra Nordea 1 500 000 1 302");
            for (String s : list) {
                Matcher m = p.matcher(s);
                if (m.matches()) {
                    System.out.println("For string: " + s);
                    System.out.println(m.group(1).replaceAll(" ", ""));
                    System.out.println(m.group(2).replaceAll(" ", ""));
                } else {
                    System.out.println("For string: '" + s + "' Didn't match");
                }
                System.out.println();
            }
        }
    }
Which prints this, as you expect,
For string: Innbo Ekstra Nordea 1 500 000 1 302
1500000
1302
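Since the two-group and three-group patterns differ only in how often the number group is repeated, the pattern can be built from the expected number of values. The sketch below assumes the caller knows how many numbers a given line should contain (the line alone does not say, as the 6 000 370 case shows); `GroupedNumbers`, `patternFor` and `extract` are hypothetical names, not part of the original code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupedNumbers {
    // One number: 1-3 digits, then any number of "whitespace + exactly three digits"
    private static final String NUMBER = "(\\d{1,3}(?:\\s+\\d{3})*)";

    // Builds ".*?" followed by 'count' number groups separated by optional whitespace.
    static Pattern patternFor(int count) {
        StringBuilder sb = new StringBuilder(".*?");
        for (int i = 0; i < count; i++) {
            if (i > 0) {
                sb.append("\\s*");
            }
            sb.append(NUMBER);
        }
        return Pattern.compile(sb.toString());
    }

    // Returns 'count' numbers with inner whitespace removed, or an empty list on no match.
    static List<String> extract(String line, int count) {
        Matcher m = patternFor(count).matcher(line);
        if (!m.matches()) {
            return Collections.emptyList();
        }
        List<String> out = new ArrayList<>();
        for (int i = 1; i <= count; i++) {
            out.add(m.group(i).replaceAll("\\s+", ""));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(extract("Innbo Ekstra Nordea 1 500 000 1 302", 2));
        System.out.println(extract("Andre bygninger 313 400 6 000 370", 3));
    }
}
```

Asking for three values from the Innbo line would still split 1 302 into 1 and 302, so the expected count has to come from somewhere outside the line itself, for example from the kind of row being parsed.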