Regular expressions have a long history

Readers with a basic understanding of regular expressions will be familiar with expressions such as "\d" and "[a-z]+". The former matches a digit character, and the latter matches one or more lowercase English letters. However, if you have used Linux/Unix tools such as vi, grep, awk, and sed, you may have found that although these tools support regular expressions, their syntax is quite different. Copying an expression such as "[a-z]+" into them often fails: it is either not recognized or matches the wrong text. The tools also differ among themselves: the same construct must be escaped in some and not in others. Why is this?

The reason is that most Unix/Linux tools follow the POSIX specification, and the POSIX specification itself defines two flavors of regular expression. So the first step is to understand the POSIX specification.

POSIX specification

The regular expression notation most people know actually comes from Perl. More precisely, it belongs to a prominent flavor that originated in Perl, known as PCRE (Perl Compatible Regular Expressions); notation such as "\d", "\w", and "\s" is characteristic of this flavor. Besides PCRE there are other flavors of regular expressions, such as those defined by the POSIX specification introduced below.

POSIX stands for Portable Operating System Interface for uniX. It is a series of specifications defining the functionality that a UNIX operating system should support, and a specification for regular expressions is among them. It defines two flavors: BRE (Basic Regular Expression) and ERE (Extended Regular Expression). On POSIX-compliant UNIX systems, tools such as grep and egrep follow the POSIX specification, as do the regular expressions in some database systems.

BRE

Among common Linux/Unix tools, grep, vi, and sed belong to the BRE flavor, whose syntax looks strange at first. The metacharacters "(", ")", "{", and "}" take on their special meaning only when escaped. So the regular expression "(a)b" matches only the string (a)b, not the string ab; the regular expression "a{1,2}" matches only the string a{1,2}, whereas "a\{1,2\}" matches the string a or aa.

The syntax is this awkward because these tools appeared very early, while many regular expression features evolved gradually. These characters originally had no special meaning, so to preserve backward compatibility their special forms could only be introduced through escaping. Some features are missing altogether: BRE as defined by POSIX has no "+" or "?" quantifiers and no alternation "(...|...)". Backreferences such as "\1" and "\2", by contrast, are part of POSIX BRE.

Today, however, pure BRE is rare. Everyone now expects regular expressions to support features such as alternation and backreferences, and doing without them is genuinely inconvenient. So although vi belongs to the BRE flavor, it provides these features. GNU has also extended BRE with "+", "?", and "|", which must be written as "\+", "\?", and "\|", and it supports backreferences such as "\1" and "\2". As a result, tools such as GNU grep, though nominally BRE, are more precisely described as GNU BRE.
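To make the escaping difference concrete, here is a minimal Python sketch that rewrites a GNU BRE pattern into the PCRE-style syntax of Python's re module by swapping which form of "(", ")", "{", "}", "+", "?", "|" acts as the operator. The name gnu_bre_to_python is made up for this illustration, and the sketch is deliberately incomplete: it does not treat bracket expressions, anchors, or a leading "*" specially.

```python
import re

# Characters that are ordinary when bare in GNU BRE and become
# operators only when preceded by a backslash.
SWAPPED = set("(){}+?|")


def gnu_bre_to_python(pattern: str) -> str:
    """Translate a simplified GNU BRE pattern into Python `re` syntax.

    Illustrative assumption: only the escaping swap is handled.
    """
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            # \( \) \{ \} \+ \? \| are operators in GNU BRE: drop the backslash.
            # Every other escape (for example \. or \1) is passed through as is.
            out.append(nxt if nxt in SWAPPED else ch + nxt)
            i += 2
        elif ch in SWAPPED:
            # A bare ( ) { } + ? | is an ordinary character in BRE: escape it.
            out.append("\\" + ch)
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


print(gnu_bre_to_python("(a)b"))            # \(a\)b   (literal parentheses)
print(gnu_bre_to_python(r"a\{1,2\}"))       # a{1,2}   (a real quantifier)
print(gnu_bre_to_python(r"\(ab\|cd\)\+"))   # (ab|cd)+
```

Read in the other direction, the same swap shows why a PCRE-style pattern pasted into a BRE tool silently changes meaning instead of raising an error.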

ERE

Among common Linux/Unix tools, egrep and awk belong to the ERE flavor. Although BRE is called "Basic" and ERE "Extended", ERE is not required to be compatible with BRE syntax; it is a self-contained system. In ERE, metacharacters need no escaping (a backslash before a metacharacter removes its special meaning), so "(ab|cd)" matches the string ab or cd, and the quantifiers "+", "?", and "{n,m}" can be used directly. ERE does not explicitly define backreferences, but many tools support "\1", "\2", and so on.

GNU tools such as egrep belong to the ERE flavor (more precisely, GNU ERE). But because GNU has extended BRE so heavily, GNU ERE is in effect just a different spelling: it offers the same features as GNU BRE, except that metacharacters need no escaping.

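The table below briefly summarizes the differences between the POSIX flavors [1]. (In practice, today's BRE and ERE barely differ in features; the main difference lies in how metacharacters are escaped.)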

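The POSIX flavors

Flavor   Escaping and features                                           Tools
-------  --------------------------------------------------------------  ------------------------------------------------------------
BRE      ( ) { } need a backslash to be special; + ? | not supported     grep, sed, vi (vi does support alternation and backreferences)
GNU BRE  ( ) { } + ? | all need a backslash to be special                GNU grep, GNU sed
ERE      No escaping; + ? ( ) { } | used directly; \1 \2 not guaranteed  egrep, awk
GNU ERE  No escaping; + ? ( ) { } | used directly; \1 \2 supported       grep -E, GNU awk (awk itself has no backreferences in patterns)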


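For easy reference, the next table lists how basic regular expression features are written in common tools; the GNU versions of the tools are taken as the reference.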

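Notation in common Linux/Unix tools

PCRE      vi/vim      grep        awk            sed
--------  ----------  ----------  -------------  ----------
*         *           *           *              *
+         \+          \+          +              \+
?         \=          \?          ?              \?
{m,n}     \{m,n}      \{m,n\}     {m,n}          \{m,n\}
\b (*)    \< \>       \< \>       \y \< \>       \< \>
(…|…)     \(…\|…\)    \(…\|…\)    (…|…)          \(…\|…\)
(…)       \(…\)       \(…\)       (…)            \(…\)
\1 \2     \1 \2       \1 \2       not supported  \1 \2


Note (*): PCRE commonly uses \b for "the start or end position of a word", but Linux/Unix tools usually use \< to match the start of a word and \> to match the end of a word. In GNU awk, \y matches both positions.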
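The difference between the single \b and the directional \< and \> can be sketched in Python, whose re module only has \b. The two constants below are hypothetical names for this illustration; they emulate \< and \> by combining \b with a lookaround, on the assumption that a "word" means a run of \w characters.

```python
import re

# \b matches at either edge of a word. Adding a lookaround pins down which edge.
WORD_START = r"\b(?=\w)"    # plays the role of \<  (a word character follows)
WORD_END = r"\b(?<=\w)"     # plays the role of \>  (a word character precedes)

text = "cat concat cats"

# "cat" only as a whole word
whole_words = re.findall(WORD_START + "cat" + WORD_END, text)

# offsets where "cat" begins a word, and where "cat" ends a word
begins_word = [m.start() for m in re.finditer(WORD_START + "cat", text)]
ends_word = [m.start() for m in re.finditer("cat" + WORD_END, text)]

print(whole_words)   # ['cat']
print(begins_word)   # [0, 11]
print(ends_word)     # [0, 7]
```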

POSIX character classes

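In some documentation you will also find notation such as "[:digit:]" and "[:lower:]". These are not hard to understand (digit means digits, lower means lowercase), but they look odd. They are POSIX character classes. They appear not only in common Linux/Unix tools but even in some programming languages, so a brief introduction is in order to avoid confusion.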

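In the POSIX specification, notation such as "[a-z]" and "[aeiou]" remains valid and means the same as a character class in PCRE. Its proper name is a POSIX bracket expression, and it is used mainly on Unix/Linux systems. The most important difference from a PCRE character class is that inside a POSIX bracket expression the backslash \ does not escape anything. The POSIX bracket expression "[\d]" therefore matches only the two characters \ and d, not the digit characters that "[0-9]" matches.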

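To deal with characters that have a special meaning inside a bracket expression, POSIX specifies the following. To include the character ] itself (rather than have it close the expression), place it immediately after the opening bracket, so in POSIX the regular expression "[]a]" matches the characters ] and a. To include the character - itself (rather than have it denote a range), place it immediately before the closing bracket ], so "[a-]" matches the characters a and -.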
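These rules can be checked by rewriting a POSIX bracket expression as the equivalent Python character class, where the backslash does escape. The helper posix_bracket_to_python below is a hypothetical name for this sketch; it assumes a complete, simple bracket expression with no [:class:] elements inside.

```python
import re


def posix_bracket_to_python(expr: str) -> str:
    """Rewrite a simple POSIX bracket expression as a Python character class.

    Assumption for illustration: `expr` starts with "[" and ends with "]"
    and contains only plain characters and ranges.
    """
    if len(expr) < 3 or expr[0] != "[" or expr[-1] != "]":
        raise ValueError("expected a bracket expression such as [abc]")
    body = expr[1:-1]
    negated = body.startswith("^")
    if negated:
        body = body[1:]
    # In POSIX a backslash, a leading "]" and a "[" are ordinary members.
    # Python needs them escaped to stay literal; "^" is escaped too so that
    # a literal caret is never read as negation. "-" is left alone, so a
    # range stays a range and a trailing "-" stays literal.
    escaped = "".join("\\" + c if c in "\\]^[" else c for c in body)
    return "[" + ("^" if negated else "") + escaped + "]"


print(posix_bracket_to_python(r"[\d]"))   # [\\d]  (backslash or d, not a digit)
print(posix_bracket_to_python("[]a]"))    # [\]a]  (] or a)
print(posix_bracket_to_python("[a-]"))    # [a-]   (a or -)
```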

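The POSIX specification also defines POSIX character classes, which are roughly equivalent to the PCRE shorthand character classes: each uses an intuitive name to stand for a set of characters, for example digit for digit characters and alpha for letters.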

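POSIX has one more concept worth noting: the locale. A locale is a set of language- and culture-related settings, including date format, currency, character encoding, and so on. The meaning of a POSIX character class changes with the locale. The table below lists the common POSIX character classes and their meaning in an ASCII locale and in a Unicode locale, for reference.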

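POSIX character classes

Class       Description                                    ASCII locale                        Unicode locale
----------  ---------------------------------------------  ----------------------------------  ------------------
[:alnum:]   Letters and digits                             [a-zA-Z0-9]                         [\p{L&}\p{Nd}]
[:alpha:]   Letters                                        [a-zA-Z]                            \p{L&}
[:ascii:]*  ASCII characters                               [\x00-\x7F]                         \p{InBasicLatin}
[:blank:]   Space and tab                                  [ \t]                               [\p{Zs}\t]
[:cntrl:]   Control characters                             [\x00-\x1F\x7F]                     \p{Cc}
[:digit:]   Digits                                         [0-9]                               \p{Nd}
[:graph:]   Characters other than whitespace               [\x21-\x7E]                         [^\p{Z}\p{C}]
[:lower:]   Lowercase letters                              [a-z]                               \p{Ll}
[:print:]   Like [:graph:], plus the space character       [\x20-\x7E]                         \P{C}
[:punct:]   Punctuation                                    [][!"#$%&'()*+,./:;<=>?@\^_`{|}~-]  [\p{P}\p{S}]
[:space:]   Whitespace                                     [ \t\r\n\v\f]                       [\p{Z}\t\r\n\v\f]
[:upper:]   Uppercase letters                              [A-Z]                               \p{Lu}
[:word:]*   Word characters (letters, digits, underscore)  [A-Za-z0-9_]                        [\p{L}\p{N}\p{Pc}]
[:xdigit:]  Hexadecimal digits                             [A-Fa-f0-9]                         [A-Fa-f0-9]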


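Note 1: Classes marked with * are not part of the POSIX specification, but they are widely used, are available in most languages, and appear in documentation.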

Note 2: For the corresponding Unicode properties, see the part of this series already published on Unicode.

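POSIX character classes are also used differently. The main difference is that a PCRE shorthand can appear on its own, outside brackets, whereas a POSIX character class must appear inside a bracket expression. To match a digit character on its own, you can write "\d" in PCRE, but with POSIX character classes you must write "[[:digit:]]".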

Linux/Unix tools generally accept POSIX character classes directly, while most do not support the PCRE shorthands such as "\w" and "\d". So do not be surprised to see "[[:space:]]" instead of "\s".
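Python's re module is one environment that does not understand POSIX character classes, which makes it a convenient place to sketch what they stand for. The snippet below substitutes each [:name:] with its ASCII-locale equivalent before compiling. ASCII_CLASSES and expand_posix_classes are names invented for this illustration; [:punct:] is written as four ASCII ranges, and the sketch assumes every [:name:] already sits inside a bracket expression.

```python
import re

# ASCII-locale expansions, written as the inside of a Python character class.
ASCII_CLASSES = {
    "alnum": r"a-zA-Z0-9",
    "alpha": r"a-zA-Z",
    "blank": r" \t",
    "cntrl": r"\x00-\x1F\x7F",
    "digit": r"0-9",
    "graph": r"\x21-\x7E",
    "lower": r"a-z",
    "print": r"\x20-\x7E",
    "punct": r"!-/:-@\[-`{-~",   # the four ASCII punctuation ranges
    "space": r" \t\r\n\v\f",
    "upper": r"A-Z",
    "xdigit": r"A-Fa-f0-9",
}


def expand_posix_classes(pattern: str) -> str:
    """Replace every [:name:] in `pattern` with its ASCII expansion."""

    def replace(match):
        name = match.group(1)
        if name not in ASCII_CLASSES:
            raise ValueError(f"unknown POSIX class: [:{name}:]")
        return ASCII_CLASSES[name]

    # A function replacement is inserted verbatim, so the backslashes in the
    # expansions are not reinterpreted by re.sub.
    return re.sub(r"\[:([a-z]+):\]", replace, pattern)


identifier = expand_posix_classes("[[:alpha:]_][[:alnum:]_]*")
print(expand_posix_classes("[[:digit:]]+"))       # [0-9]+
print(bool(re.fullmatch(identifier, "var_1")))    # True
```

The outer pair of brackets survives the substitution, which is exactly why "[[:digit:]]" needs two pairs: the inner pair belongs to the class name, the outer pair to the bracket expression.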

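Among common programming languages, Java, PHP, and Ruby also support POSIX character classes. In Java and PHP they are matched according to the ASCII locale. Ruby is a little more complicated: Ruby 1.8 matches according to the ASCII locale and does not support "[:word:]" and "[:alnum:]", while Ruby 1.9 matches according to the Unicode locale and supports both "[:word:]" and "[:alnum:]".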
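The practical effect of ASCII versus Unicode matching can be seen with an analogy in Python. Python's re module implements neither POSIX locales nor POSIX character classes, so this sketch uses the shorthand \d as a stand-in for a digit class and the re.ASCII flag as a stand-in for an ASCII locale; it illustrates the idea and is not a description of how Java, PHP, or Ruby behave.

```python
import re

# ARABIC-INDIC DIGIT THREE: a decimal digit (Unicode category Nd) outside ASCII.
arabic_indic_three = "\u0663"

# For str patterns Python matches with Unicode semantics by default,
# so \d accepts any character in category Nd.
unicode_match = re.fullmatch(r"\d", arabic_indic_three)

# With re.ASCII, \d is restricted to [0-9].
ascii_match = re.fullmatch(r"\d", arabic_indic_three, re.ASCII)

print(unicode_match is not None)   # True
print(ascii_match is not None)     # False
```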

