Java - \pL [\x00-\x7F]+ regex fails to get non English characters using String.match

user7294900 :

I need to validate a name, stored in a String, which can be in any language and may contain spaces, using \p{L}:

You can match a single character belonging to the "letter" category with \p{L}

I tried to use String.matches, but it failed to match non-English characters, even for a single character, for example:

String name = "อั";
boolean isMatch = name.matches("[\\p{L}]+"); // returns false

I tried with and without brackets, and with + for multiple letters, but it always fails to match non-English characters.

Is there an issue using String.matches with \p{L}?

I also failed using [\\x00-\\x7F]+, as suggested in the Pattern documentation:

\p{ASCII} All ASCII:[\x00-\x7F]

Wiktor Stribiżew :

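Keep in mind that Java regex matches a string one Unicode code point at a time, not one user-perceived character (grapheme cluster) at a time. \p{L} matches a single letter code point; it does not match the combining marks (diacritics) attached to a letter. In your example, อ is a letter, but ั is a nonspacing mark (category Mn), so \p{L}+ cannot match the whole string.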
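To see why the sample fails, one can print the general category of each code point in the string. This is only a small sketch: the class name CodePointCategories and the helper isMark are made up for illustration, and the escapes \u0E2D\u0E31 are assumed to spell the question's sample string.

```java
public class CodePointCategories {
    // Hypothetical helper: true if the code point is in a mark category (Mn, Mc or Me),
    // i.e. what \p{M} matches.
    static boolean isMark(int cp) {
        int type = Character.getType(cp);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    public static void main(String[] args) {
        // Assumed to be the question's sample: THAI CHARACTER O ANG + THAI CHARACTER MAI HAN-AKAT
        String name = "\u0E2D\u0E31";
        // Walk the string code point by code point and report letter/mark status.
        name.codePoints().forEach(cp ->
            System.out.printf("U+%04X letter=%b mark=%b%n",
                cp, Character.isLetter(cp), isMark(cp)));
    }
}
```

The second code point is reported as a mark rather than a letter, which is why a pattern built only from \p{L} rejects the string.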

Since your input can contain both letters and diacritics, you should at least use both the \p{L} and \p{M} Unicode property classes in your character class:

String regex = "[\\p{L}\\p{M}]+";
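As a quick check of the difference, the following sketch compares the letters-only pattern from the question with the letters-and-marks pattern. The class and method names are hypothetical, and the sample string is again assumed to be the one from the question.

```java
public class LetterMarkMatch {
    // Pattern from the question: letters only.
    static boolean lettersOnly(String s) {
        return s.matches("\\p{L}+");
    }

    // Pattern suggested in the answer: letters and combining marks.
    static boolean lettersAndMarks(String s) {
        return s.matches("[\\p{L}\\p{M}]+");
    }

    public static void main(String[] args) {
        String name = "\u0E2D\u0E31"; // assumed sample: Thai letter followed by a combining mark
        System.out.println(lettersOnly(name));     // false
        System.out.println(lettersAndMarks(name)); // true
    }
}
```

Note that neither pattern accepts spaces yet, so a name such as "John Smith" is still rejected.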

If the input can contain words separated by whitespace, add the \s shorthand class. To make \s match any kind of Unicode whitespace, compile the regex with the Pattern.UNICODE_CHARACTER_CLASS flag (or use its embedded form, (?U)):

String regex = "(?U)[\\p{L}\\p{M}\\s]+";
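The effect of the flag can be illustrated with a no-break space (U+00A0), which is Unicode whitespace but is not covered by the default \s. This is a sketch with made-up class and method names; the sample strings are assumptions chosen for the demonstration.

```java
public class UnicodeWhitespaceMatch {
    // Without (?U), \s only covers the ASCII whitespace characters.
    static boolean asciiWhitespace(String s) {
        return s.matches("[\\p{L}\\p{M}\\s]+");
    }

    // With (?U), i.e. Pattern.UNICODE_CHARACTER_CLASS, \s covers Unicode whitespace.
    static boolean unicodeWhitespace(String s) {
        return s.matches("(?U)[\\p{L}\\p{M}\\s]+");
    }

    public static void main(String[] args) {
        String plain = "Jos\u00E9 \u00C1lvarez";  // words separated by an ordinary space
        String nbsp = "Jos\u00E9\u00A0\u00C1lvarez"; // words separated by a no-break space
        System.out.println(asciiWhitespace(plain));  // true
        System.out.println(asciiWhitespace(nbsp));   // false
        System.out.println(unicodeWhitespace(nbsp)); // true
    }
}
```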

Note that this regex accepts diacritics, letters and whitespace in any order. If you need a stricter regex (e.g. diacritics allowed only after a base letter), you may consider something like

String regex = "(?U)\\s*(?>\\p{L}\\p{M}*+)+(?:\\s+(?>\\p{L}\\p{M}*+)+)*\\s*";

Here, (?>\\p{L}\\p{M}*+)+ matches one or more letters, each followed by zero or more diacritics, \s* matches zero or more whitespace characters and \s+ matches one or more whitespace characters.
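A few sample inputs show how the stricter pattern behaves. The sketch below wraps the regex in a hypothetical NameValidator class; the inputs are assumptions picked to exercise each part of the pattern.

```java
import java.util.regex.Pattern;

public class NameValidator {
    // The stricter regex from above: words made of base letters, each optionally
    // followed by marks, separated by whitespace, with optional outer whitespace.
    private static final Pattern NAME = Pattern.compile(
        "(?U)\\s*(?>\\p{L}\\p{M}*+)+(?:\\s+(?>\\p{L}\\p{M}*+)+)*\\s*");

    static boolean isValid(String s) {
        return NAME.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("\u0E2D\u0E31"));   // true: letter followed by a mark
        System.out.println(isValid("\u0E31\u0E2D"));   // false: mark with no base letter before it
        System.out.println(isValid(" Anna  Maria "));  // true: words separated by whitespace
        System.out.println(isValid(""));               // false: at least one letter is required
    }
}
```

Compiling the pattern once and reusing it avoids recompiling on every call, which String.matches would do.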

\p{IsAlphabetic} vs. [\p{L}\p{M}]

If you check the source code, \p{IsAlphabetic} checks whether Character.isAlphabetic(ch) is true. That holds if the character belongs to any of the following general categories: UPPERCASE_LETTER, LOWERCASE_LETTER, TITLECASE_LETTER, MODIFIER_LETTER, OTHER_LETTER, LETTER_NUMBER, or if it has the contributory property Other_Alphabetic. In Unicode terms, Alphabetic is derived from Lu + Ll + Lt + Lm + Lo + Nl + Other_Alphabetic.

While all those L subclasses form the general L class, note that \p{IsAlphabetic} also includes the letter-number class Nl, and that Other_Alphabetic is not the same set as \p{M}: it contains characters that are outside \p{M}. See this reference (although it is in German, the categories and char names are in English).

So, \p{IsAlphabetic} covers characters that [\p{L}\p{M}] does not (letter numbers, for example), and you should make the right decision based on the languages you want to support.
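One way to compare the two classes is to test individual code points against each. The sketch below uses hypothetical helper names; the sample code points (LATIN CAPITAL LETTER A, THAI CHARACTER MAI HAN-AKAT and ROMAN NUMERAL ONE, which is in category Nl) are assumptions chosen for illustration.

```java
public class AlphabeticVsLetterMark {
    // True if the single code point matches \p{IsAlphabetic}.
    static boolean isAlphabetic(int cp) {
        return new String(Character.toChars(cp)).matches("\\p{IsAlphabetic}");
    }

    // True if the single code point matches [\p{L}\p{M}].
    static boolean isLetterOrMark(int cp) {
        return new String(Character.toChars(cp)).matches("[\\p{L}\\p{M}]");
    }

    public static void main(String[] args) {
        int[] samples = {0x0041, 0x0E31, 0x2160};
        for (int cp : samples) {
            // Print how each class treats the code point.
            System.out.printf("U+%04X IsAlphabetic=%b [L M]=%b%n",
                cp, isAlphabetic(cp), isLetterOrMark(cp));
        }
    }
}
```

ROMAN NUMERAL ONE (U+2160) is accepted by \p{IsAlphabetic} but rejected by [\p{L}\p{M}], so the choice matters if such characters should not appear in a name.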
