Detecting fullwidth and halfwidth whitespace: regex vs. Character.isWhitespace()

anndexi99 :

My task is to detect whether strings read from a CSV file in a server app contain any whitespace. The CSV content is a mix of Japanese, English, some symbols and numerals. Whitespace in English is halfwidth and whitespace in Japanese is fullwidth, so the two differ in both display width and byte size.

I am coding in Java 8 and would prefer not to use third-party libraries.

I am considering two approaches; the following is pseudocode.

Regex:

targetStr.matches("\\s+");

Character.isWhitespace():

targetStr.codePoints()
             .filter(c -> Character.isWhitespace(c))
             .count() > 0

Would either of the above do the task?

Which is more efficient for my case?

Holger :

First of all, targetStr.matches("\\s+") and targetStr.codePoints().filter(c -> Character.isWhitespace(c)).count() > 0 implement entirely different logic.

String.matches requires the entire string to match, so with \s+ it has to consist of white-space entirely. In contrast, count() > 0 is satisfied if you have at least one white-space character, so it’s an inefficient and verbose version of targetStr.codepoints().anyMatch(Character::isWhitespace).

If you want to check whether all characters are white-space, you should use allMatch instead.
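The difference between the three checks can be sketched as follows. This is only an illustration; the class and method names are made up for the example. Note one edge case: allMatch returns true for an empty string, while matches("\\s+") returns false for it.

```java
final class MatchSemanticsDemo {
    // True only if the string is non-empty and consists solely of [ \t\n\x0B\f\r].
    static boolean entirelyRegexWhitespace(String s) {
        return s.matches("\\s+");
    }

    // True as soon as one code point is white-space according to Character.isWhitespace.
    static boolean anyJavaWhitespace(String s) {
        return s.codePoints().anyMatch(Character::isWhitespace);
    }

    // True if every code point is white-space (also true for the empty string).
    static boolean allJavaWhitespace(String s) {
        return s.codePoints().allMatch(Character::isWhitespace);
    }

    public static void main(String[] args) {
        String mixed = "a b";
        System.out.println(entirelyRegexWhitespace(mixed)); // false
        System.out.println(anyJavaWhitespace(mixed));       // true
        System.out.println(allJavaWhitespace(mixed));       // false
    }
}
```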

Furthermore, the two approaches use different definitions of white-space:

Character.isWhitespace:

Determines if the specified character (Unicode code point) is white space according to Java. A character is a Java whitespace character if and only if it satisfies one of the following criteria:

  • It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F').
  • It is '\t', U+0009 HORIZONTAL TABULATION.
  • It is '\n', U+000A LINE FEED.
  • It is '\u000B', U+000B VERTICAL TABULATION.
  • It is '\f', U+000C FORM FEED.
  • It is '\r', U+000D CARRIAGE RETURN.
  • It is '\u001C', U+001C FILE SEPARATOR.
  • It is '\u001D', U+001D GROUP SEPARATOR.
  • It is '\u001E', U+001E RECORD SEPARATOR.
  • It is '\u001F', U+001F UNIT SEPARATOR.

The \s pattern (by default):

\s A whitespace character: [ \t\n\x0B\f\r]

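So there’s a significant difference. In particular, the fullwidth (ideographic) space U+3000 is a SPACE_SEPARATOR, so Character.isWhitespace accepts it, whereas the default \s does not match it.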
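To see the two definitions side by side, here is a small sketch that tests a halfwidth space (U+0020), a fullwidth space (U+3000) and a non-breaking space (U+00A0) against both. The class and helper names are hypothetical.

```java
final class WhitespaceDefinitions {
    // Default \s without any flags: only [ \t\n\x0B\f\r].
    static boolean matchesDefaultRegexSpace(int codePoint) {
        return new String(Character.toChars(codePoint)).matches("\\s");
    }

    public static void main(String[] args) {
        int[] samples = {0x0020, 0x3000, 0x00A0};
        for (int cp : samples) {
            System.out.printf("U+%04X isWhitespace=%b default \\s=%b%n",
                    cp, Character.isWhitespace(cp), matchesDefaultRegexSpace(cp));
        }
    }
}
```

The halfwidth space satisfies both checks, the fullwidth space satisfies only Character.isWhitespace, and the non-breaking space satisfies neither.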

As explained in this answer, you can make \s match all Unicode white-space characters, or use a pattern that matches all Unicode white-space characters in the first place. Alternatively, you can refer to the logic of Character.isWhitespace explicitly in the pattern; note that the two definitions are not exactly the same.

If you want to strictly apply the logic of Character.isWhitespace, you can use

  • To check that all characters are white-space
    • string.codePoints().allMatch(Character::isWhitespace)
    • string.matches("\\p{javaWhitespace}+")
    • string.isBlank() (JDK 11 and later)
  • To match when there’s at least one white-space character
    • string.codePoints().anyMatch(Character::isWhitespace)
    • string.matches(".*\\p{javaWhitespace}.*")
    • Pattern.compile("\\p{javaWhitespace}").matcher(string).find()
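As a rough sketch of the "at least one white-space character" variants above, the following compares them on a field containing a fullwidth space. The class, method and constant names are invented for the example. The pattern is compiled once and reused, which is the usual way to avoid recompiling it for every CSV field.

```java
import java.util.regex.Pattern;

final class JavaWhitespaceCheck {
    // Compiled once; matches a single character accepted by Character.isWhitespace.
    private static final Pattern JAVA_WS = Pattern.compile("\\p{javaWhitespace}");

    static boolean containsViaFind(String s) {
        return JAVA_WS.matcher(s).find();
    }

    static boolean containsViaMatches(String s) {
        return s.matches(".*\\p{javaWhitespace}.*");
    }

    static boolean containsViaStream(String s) {
        return s.codePoints().anyMatch(Character::isWhitespace);
    }

    public static void main(String[] args) {
        String field = "Tokyo\u3000Station"; // contains a fullwidth space
        System.out.println(containsViaFind(field));    // true
        System.out.println(containsViaMatches(field)); // true
        System.out.println(containsViaStream(field));  // true
    }
}
```

One caveat with the matches(".*…​.*") form: without the DOTALL flag, "." does not match line terminators, so a field with two or more embedded line breaks (for example "a\nb c\n") is reported as containing no white-space. The find() and stream variants do not have this problem.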

As the first bullet of the Character.isWhitespace documentation states, it returns false for the non-breaking space characters ('\u00A0', '\u2007', '\u202F'), even though they have the white-space Unicode property. If you want to match them as white-space, you can use

  • To check that all characters are white-space
    • string.matches("(?U)\\s+")
    • string.matches("\\p{IsWhiteSpace}+")
  • To match when there’s at least one white-space character
    • string.matches("(?U).*\\s.*")
    • string.matches(".*\\p{IsWhiteSpace}.*")
    • Pattern.compile("\\p{IsWhitespace}").matcher(string).find()
    • Pattern.compile("\\s", Pattern.UNICODE_CHARACTER_CLASS).matcher(string).find()
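The effect of the Unicode-aware variants on non-breaking spaces can be illustrated with a short sketch. The class, method and constant names are assumptions made for the example; the patterns are the ones listed above.

```java
import java.util.regex.Pattern;

final class UnicodeWhitespaceCheck {
    // \s with UNICODE_CHARACTER_CLASS follows the Unicode White_Space property.
    private static final Pattern UNICODE_WS =
            Pattern.compile("\\s", Pattern.UNICODE_CHARACTER_CLASS);

    static boolean containsUnicodeWhitespace(String s) {
        return UNICODE_WS.matcher(s).find();
    }

    static boolean containsJavaWhitespace(String s) {
        return s.codePoints().anyMatch(Character::isWhitespace);
    }

    static boolean allUnicodeWhitespace(String s) {
        return s.matches("(?U)\\s+");
    }

    public static void main(String[] args) {
        String nbsp = "a\u00A0b"; // non-breaking space between the letters
        System.out.println(containsUnicodeWhitespace(nbsp)); // true
        System.out.println(containsJavaWhitespace(nbsp));    // false
    }
}
```

For the fullwidth space U+3000 both checks agree; they differ only for characters such as the non-breaking spaces, so the choice depends on whether those should count as white-space in the CSV data.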
