Relationship between Alnum and IsAlphabetic character classes in Java RegEx patterns

toniedzwiedz :

Looking at the Javadoc for java.util.regex.Pattern

\p{Alnum} An alphanumeric character: [\p{IsAlphabetic}\p{IsDigit}]

it appears that every character that matches \p{IsAlphabetic} should also match \p{Alnum}.

However, it does not seem to be the case when the character has an accent. For example, the following assertion fails:

assertEquals("é".matches("\\p{IsAlphabetic}+"),"é".matches("\\p{Alnum}+"));

The same thing happens for other accented characters such as ą, ó, ł, ź and ż. All of them match \p{IsAlphabetic}+ but not \p{Alnum}+.

Am I misinterpreting the Javadoc? Or is this a bug in the documentation or the implementation?

Joachim Sauer :

By default, \p{Alnum} is treated as a POSIX character class, which means it will only ever match ASCII characters. It will match a and 1, but not ä or ١.

The passage you quote only applies when the UNICODE_CHARACTER_CLASS flag is used.

Slightly oversimplified, this flag turns the "old" POSIX-style character classes into their equivalent Unicode character classes.
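To make the difference concrete, here is a minimal sketch that compares the default behaviour with the flag switched on. The class name AlnumDemo and its helper methods are made up for illustration; Pattern.UNICODE_CHARACTER_CLASS and the embedded (?U) flag are the standard java.util.regex ones (Java 7 and later). The test strings use Unicode escapes so the result does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only the Pattern API calls come from the JDK.
final class AlnumDemo {

    // Default mode: \p{Alnum} is the POSIX class, i.e. ASCII letters and digits only.
    static final Pattern POSIX_ALNUM = Pattern.compile("\\p{Alnum}+");

    // With the flag, \p{Alnum} behaves like [\p{IsAlphabetic}\p{IsDigit}].
    static final Pattern UNICODE_ALNUM =
            Pattern.compile("\\p{Alnum}+", Pattern.UNICODE_CHARACTER_CLASS);

    // Same flag, enabled inline via the embedded expression (?U).
    static final Pattern UNICODE_ALNUM_INLINE = Pattern.compile("(?U)\\p{Alnum}+");

    static boolean matchesPosix(String s) {
        return POSIX_ALNUM.matcher(s).matches();
    }

    static boolean matchesUnicode(String s) {
        return UNICODE_ALNUM.matcher(s).matches();
    }

    static boolean matchesUnicodeInline(String s) {
        return UNICODE_ALNUM_INLINE.matcher(s).matches();
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9";      // é
        String arabicOne = "\u0661";   // ١ (ARABIC-INDIC DIGIT ONE)

        System.out.println(matchesPosix("a1"));               // ASCII input
        System.out.println(matchesPosix(eAcute));             // default mode
        System.out.println(matchesUnicode(eAcute));           // flag via compile()
        System.out.println(matchesUnicodeInline(eAcute));     // flag via (?U)
        System.out.println(matchesPosix(arabicOne));          // default mode
        System.out.println(matchesUnicode(arabicOne));        // flag via compile()
        System.out.println(eAcute.matches("\\p{IsAlphabetic}+")); // not affected by the flag
    }
}
```

For the assertion in the question, this suggests comparing against "(?U)\\p{Alnum}+" rather than "\\p{Alnum}+", since String.matches gives no other way to pass flags.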
