Getting Started with Regular Expressions (Regular Expression, RegEx)

1. Overview

A regular expression (Regular Expression, RegEx) is a matching pattern: it describes a set of features that a piece of text may have.

Just as natural language uses words such as "tall" and "sturdy" to describe abstracted features of things, a regular expression is a highly abstract string of characters used to describe the features of strings.

Regular expressions usually do not exist on their own. Various programming languages and tools act as host languages that provide regular expression support, and each trims or extends the syntax according to its own characteristics.

Getting started with regular expressions is easy, and the limited set of grammar rules is easy to master. Even so, adoption is still not high, mainly because there are many regular expression flavors and the documentation of each host language pays too much attention to its own details, which beginners usually do not need to care about.

Of course, if you want an in-depth understanding of regular expressions, these details do matter, but that is for later. Let's start with the basics and enter the world of regular expressions.

2. Regular Expression Basics

2.1 Basic Concepts

2.1.1 Strings

The string a5 consists of two characters, a and 5, and three positions: before a, between a and 5, and after 5. Keeping both characters and positions in mind is important for understanding how regular expression matching works.
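The three positions in a5 can be made visible with a small sketch. It assumes Python's built-in re module as the host language (the article itself does not prescribe one) and uses an empty pattern, which matches at every position without consuming a character.

```python
import re

text = "a5"

# An empty pattern consumes no characters, so it matches once at each position.
positions = [m.start() for m in re.finditer(r"", text)]
print(positions)   # [0, 1, 2]

# The characters sit between those positions.
characters = list(text)
print(characters)  # ['a', '5']
```

Two characters give three positions; in general a string of n characters has n + 1 positions.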

2.1.2 Consuming Characters and Zero Width

During matching, if a subexpression matches characters rather than a position, and the matched content is saved in the final match result, the subexpression is said to consume characters. If a subexpression matches only a position, or the content it matches is not saved in the final match result, the subexpression is said to be zero width.

Whether a subexpression consumes characters or is zero width is defined by whether what it matches is saved in the final match result.

Consuming is mutually exclusive; zero width is not. That is, a character can be matched by only one subexpression at a time, whereas a position can be matched by several zero-width subexpressions at the same time.
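As a rough illustration of this distinction, the following sketch (again assuming Python's re module) stacks three zero-width subexpressions on the same position and follows them with one subexpression that consumes a character. The metacharacters used here are explained in later sections.

```python
import re

text = "a5"

# ^, \b and (?=a) are all zero width and all succeed at position 0;
# only the final a consumes a character.
m = re.search(r"^\b(?=a)a", text)
print(m.group())  # a
print(m.span())   # (0, 1)

# A zero-width expression on its own matches a position, so the result is empty.
boundary = re.search(r"\b", text)
print(repr(boundary.group()), boundary.span())  # '' (0, 0)
```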

2.1.3 Composition of a Regular Expression

A regular expression consists of two kinds of characters. One kind is metacharacters, which have a special meaning in a regular expression; the other is ordinary text characters.

A metacharacter may be a single character, such as ^, or a sequence of characters, such as \w.

2.2 Metacharacters (Meta Character)

2.2.1 Character Sets (Character Classes)

A character set [ ] matches any one of the characters it contains. It can be any of them, but only one.

A character set supports the hyphen - to indicate a range. When - forms a range with the characters before and after it, the code point of the preceding character must not be greater than the code point of the following character.

[^…] is a negated character set. A negated character set matches any character that is not listed, and again only one. A negated character set also supports the hyphen - to indicate a range.

Expression    Explanation
[abc]    Matches a, b, or c
[0-9]    Matches any one digit from 0 to 9; equivalent to [0123456789]
[\u4e00-\u9fa5]    Matches any one Chinese character
[^a1<]    Matches any one character other than a, 1, and <
[^a-z]    Matches any one character other than a lowercase letter

Example:
[0-9][0-9] matched against Windows 2003 succeeds, and the match result is 20.
[^inW] matched against Windows 2003 succeeds, and the match result is d.
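The two examples above can be checked with a short sketch, assuming Python's re module as the host language. The last part shows how this particular engine reacts to a reversed range; other engines may report the error differently.

```python
import re

text = "Windows 2003"

two_digits = re.search(r"[0-9][0-9]", text).group()
not_inW = re.search(r"[^inW]", text).group()
print(two_digits)  # 20
print(not_inW)     # d

# A range whose start has a larger code point than its end is rejected.
try:
    re.compile(r"[9-0]")
    bad_range_rejected = False
except re.error:
    bad_range_rejected = True
print(bad_range_rejected)  # True
```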

2.2.2 Shorthand for Common Character Ranges

Some character ranges, such as digits, are used so often that even a character set like [0-9] feels cumbersome, so a few metacharacters are defined to represent these common ranges.

Expression    Explanation
\d    Any one digit; equivalent to [0-9]
\w    Any one letter, digit, or underscore; equivalent to [a-zA-Z0-9_]
\s    Any one whitespace character; equivalent to [ \r\n\f\t\v]
\D    Any one non-digit character, the negation of \d; equivalent to [^0-9]
\W    The negation of \w; equivalent to [^a-zA-Z0-9_]
\S    Any one non-whitespace character, the negation of \s; equivalent to [^ \r\n\f\t\v]

Example:
\w\s\d matched against Windows 2003 succeeds, and the match result is "s 2".
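A minimal sketch of the example above, assuming Python's re module. In Python 3 these shorthands also cover non-ASCII characters by default, so the re.ASCII flag is passed here to get the ASCII-only equivalences listed in the table.

```python
import re

text = "Windows 2003"

# \w matches "s", \s matches the space, \d matches "2".
m = re.search(r"\w\s\d", text, flags=re.ASCII)
print(repr(m.group()))  # 's 2'

# A shorthand and its negation together cover every character of the text.
digits = re.findall(r"\d", text, flags=re.ASCII)
non_digits = re.findall(r"\D", text, flags=re.ASCII)
print(digits)                                       # ['2', '0', '0', '3']
print(len(digits) + len(non_digits) == len(text))   # True
```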

2.2.3 The Dot

The dot matches any character except the newline \n. To match every character including \n, the usual approach is to use [\s\S], or to enable the (?s) matching mode for the dot.

Expression    Explanation
.    Matches any one character except the newline \n
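The difference between the three forms can be seen in a short sketch, assuming Python's re module, where (?s) corresponds to the re.DOTALL flag.

```python
import re

text = "ab\ncd"

# The plain dot stops at the newline, so each line is matched separately.
plain = re.findall(r".+", text)
print(plain)       # ['ab', 'cd']

# [\s\S] covers whitespace and non-whitespace, hence every character.
any_char = re.findall(r"[\s\S]+", text)
print(any_char)    # ['ab\ncd']

# (?s) switches the dot itself to match the newline as well.
dotall = re.findall(r"(?s).+", text)
print(dotall)      # ['ab\ncd']
```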

2.2.4 Other Metacharacters

Expression    Explanation
^    Matches the position at the start of the string; matches no character
$    Matches the position at the end of the string; matches no character
\b    Matches a word boundary; matches no character

Example:

^a matched against cba fails, because the expression requires the character a immediately after the start position, which cba does not satisfy.
\d$ matched against 123 succeeds, and the match result is 3. The expression requires a digit at the end of the string; if the string does not end with a digit, as in 123abc, the match fails.
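These position metacharacters can be tried out with the following sketch, assuming Python's re module. The \bcat\b line is an extra illustration that is not in the original text.

```python
import re

starts_with_a = re.search(r"^a", "cba")
print(starts_with_a)             # None

ends_with_digit = re.search(r"\d$", "123")
print(ends_with_digit.group())   # 3

no_final_digit = re.search(r"\d$", "123abc")
print(no_final_digit)            # None

# \b keeps "cat" from matching inside "concat".
whole_words = re.findall(r"\bcat\b", "cat concat cat.")
print(whole_words)               # ['cat', 'cat']
```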

2.2.5 Escape Characters

Some invisible characters, and metacharacters that have a special meaning in a regular expression, must be escaped with \ when you want to match the character itself.

Expression    Explanation
\r \n    Carriage return and line feed
\\    Matches \ itself
\^ \$ \.    Match ^, $, and . respectively

The following characters usually need to be escaped when they are to match themselves. In practice, depending on the circumstances, more characters than those listed here may need escaping:
 . $ ^ { [ ( | ) * + ? \
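The effect of escaping can be seen in a small sketch, assuming Python's re module. The helper re.escape is specific to Python; other host languages offer similar functions under different names.

```python
import re

text = "1+1=2"

# Unescaped, + is a quantifier: the pattern means "one or more 1, then 1".
unescaped = re.search(r"1+1", text)
print(unescaped)          # None

# Escaped, \+ matches the plus sign itself.
escaped = re.search(r"1\+1", text)
print(escaped.group())    # 1+1

# re.escape adds the backslashes for text that should be matched literally.
auto_pattern = re.escape("1+1")
print(auto_pattern)                            # 1\+1
print(re.search(auto_pattern, text).group())   # 1+1
```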

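2.2.6 Quantifiers (Quantifier)

A quantifier specifies how many times a subexpression may match. A quantifier can modify a single character, a character set, or a subexpression enclosed in (). Some commonly used quantifiers are defined as standalone metacharacters.

Expression    Explanation    Example
{m}    The expression matches exactly m times    \d{3} is equivalent to \d\d\d; (abc){2} is equivalent to abcabc
{m,n}    The expression matches at least m times and at most n times    \d{2,3} can match 12, 321, and other numbers of 2 to 3 digits
{m,}    The expression matches at least m times    [a-z]{8,} means at least 8 letters
?    The expression matches 0 times or 1 time; equivalent to {0,1}    ab? can match a or ab
*    The expression matches 0 times or any number of times; equivalent to {0,}    In <[^>]*>, [^>]* means zero or more characters that are not >
+    The expression matches 1 time or any number of times, at least once; equivalent to {1,}    \d\s+\d means two digits with at least one whitespace character between them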

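Note: in regular expressions that are not generated dynamically, do not use quantifiers such as {1}. For example, \w{1} is equivalent to \w in its result, but it reduces matching efficiency and readability and adds nothing.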
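The quantifiers listed above can be exercised with the following sketch, assuming Python's re module. The sample strings are made up for illustration.

```python
import re

# {m}: exactly m repetitions of a character class or of a group.
three_digits = re.fullmatch(r"\d{3}", "123") is not None
abc_twice = re.fullmatch(r"(abc){2}", "abcabc") is not None
print(three_digits, abc_twice)   # True True

# {m,n}: between m and n repetitions.
two_or_three = re.findall(r"\d{2,3}", "12 321 4")
print(two_or_three)              # ['12', '321']

# ?: the b is optional.
optional_b = re.findall(r"ab?", "a ab abb")
print(optional_b)                # ['a', 'ab', 'ab']

# *: zero or more characters that are not >.
tag = re.search(r"<[^>]*>", "text <b> more").group()
print(tag)                       # <b>

# +: at least one whitespace character between two digits.
spaced = re.search(r"\d\s+\d", "1   2").group()
print(repr(spaced))              # '1   2'
```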

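2.2.7 Branch Structure (Alternation)

When a substring of a string can take several possible forms, use alternation to match it. | expresses an "or" relationship between subexpressions. The scope of | is bounded by (). If there is no () around | to limit its scope, its scope is everything to its left and everything to its right.

Expression    Explanation
|    An "or" relationship between subexpressions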

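Example:
^aa|b$ matched against cccb succeeds, and the match result is b, because the expression means "match ^aa or b$", and b$ does match in cccb.
^(aa|b)$ matched against cccb fails, because the expression means that only aa or b may appear between the start and end positions, which cccb does not satisfy.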
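The scope difference in the two examples above can be reproduced with a short sketch, assuming Python's re module.

```python
import re

# Without parentheses the alternatives are "^aa" and "b$".
loose = re.search(r"^aa|b$", "cccb")
print(loose.group())     # b

# With parentheses both anchors apply to either alternative.
strict = re.search(r"^(aa|b)$", "cccb")
print(strict)            # None

strict_ok = re.search(r"^(aa|b)$", "aa")
print(strict_ok.group()) # aa
```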

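3. Advanced Regular Expressions

3.1 Capture Groups (Capture Group)

A capture group saves the content matched by a subexpression of the regular expression into a group in memory, identified by a number or by an explicitly given name, so that it can be referenced later.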

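Expression    Explanation
(Expression)    Ordinary capture group; saves the content matched by the subexpression Expression in a numbered group
(?<name>Expression)    Named capture group; saves the content matched by the subexpression Expression in a group named name

Ordinary capture groups (called simply capture groups where no ambiguity arises) are numbered in the order in which their opening ( appears from left to right, starting from 1. Normally, group 0 represents the content matched by the whole expression.

A named capture group lets you refer to the captured content by group name rather than by number. This is more convenient: there is no need to keep track of group numbers, and no risk that changing part of the expression makes a reference point to the wrong group.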
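The numbering rule and named groups can be tried out with the sketch below. It assumes Python's re module, which spells a named group (?P<name>Expression) rather than the (?<name>Expression) form shown in the table; the group names year and month are made up for illustration.

```python
import re

# Groups are numbered by the position of their opening parenthesis.
m = re.search(r"((a)(b))", "xaby")
numbered = [m.group(0), m.group(1), m.group(2), m.group(3)]
print(numbered)          # ['ab', 'ab', 'a', 'b']

# Named groups (Python syntax) can be read by name or by number.
named = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", "Date: 2019-03")
year = named.group("year")
month = named.group(2)
print(year, month)       # 2019 03
```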

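3.2 Non-Capturing Groups

In some expressions ( ) has to be used, yet the content matched by the subexpression inside ( ) does not need to be saved. In that case a non-capturing group can be used to cancel the side effect of using ( ).

Expression    Explanation
(?:Expression)    Matches the subexpression Expression and includes the matched content in the final result of the whole expression, but does not save the content matched by Expression in a separate group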
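A minimal sketch of the difference, assuming Python's re module: both patterns match the same text, but only the capturing version stores what the first pair of parentheses matched.

```python
import re

text = "ababc"

capturing = re.search(r"(ab)+(c)", text)
print(capturing.group(0))       # ababc
print(capturing.groups())       # ('ab', 'c')

non_capturing = re.search(r"(?:ab)+(c)", text)
print(non_capturing.group(0))   # ababc
print(non_capturing.groups())   # ('c',)
```

With the non-capturing group, c becomes group 1 because (?:ab) no longer takes a number.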

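3.3 Backreferences

The content matched by a capture group can be referenced from the program outside the regular expression, and it can also be referenced inside the expression itself. A reference inside the expression is called a backreference.

Backreferences are typically used to find repeated substrings, or to require that a certain substring appears in pairs.

Expression    Explanation
\1 \2    Backreferences to the capture groups numbered 1 and 2
\k<name>    Backreference to the capture group named name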

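Example:
(a|b)\1 matched against abaa succeeds, and the match result is aa. Although (a|b) can match either a or b when a match is attempted, by the time the backreference is evaluated the content matched by the corresponding () is already fixed.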
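The example above, plus two typical uses of backreferences, can be sketched as follows. This assumes Python's re module, where a named backreference is written (?P=name) instead of \k<name>; the group name quote and the sample strings are made up for illustration.

```python
import re

# The backreference must repeat whatever group 1 actually matched.
m = re.search(r"(a|b)\1", "abaa")
print(m.group(), m.span())   # aa (2, 4)

# Finding doubled words.
doubled = re.findall(r"\b(\w+) \1\b", "this is is a test test")
print(doubled)               # ['is', 'test']

# Requiring that the closing quote is the same as the opening quote.
quoted = re.search(r"(?P<quote>[\"']).*?(?P=quote)", "say \"hi\" now")
print(quoted.group())        # "hi"
```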

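3.4 Lookaround (Look Around)

A lookaround only tests whether its subexpression matches; the matched content is not included in the final match result, so a lookaround is zero width.

By direction, lookarounds are divided into lookahead and lookbehind; by whether a match is required or forbidden, they are divided into positive and negative. Combined, this gives four kinds of lookaround. A lookaround amounts to an additional condition on the position where it stands.

Expression    Explanation
(?<=Expression)    Positive lookbehind: the text to the left of the position must match Expression
(?<!Expression)    Negative lookbehind: the text to the left of the position must not match Expression
(?=Expression)    Positive lookahead: the text to the right of the position must match Expression
(?!Expression)    Negative lookahead: the text to the right of the position must not match Expression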

Example:
(?<=Windows )\d+ matched against Windows 2003 succeeds, and the match result is 2003. \d+ matches one or more digits, and (?<=Windows ) acts as an additional condition requiring that the text to the left of the position be "Windows "; what the lookbehind matches is not included in the result. The same expression matched against Office 2003 fails, because no run of digits has "Windows " to its left.

(?!1)\d+ matched against 123 succeeds, and the match result is 23. \d+ matches one or more digits, but the additional condition (?!1) requires that the text to the right of the position is not 1, so the match succeeds starting at the position before 2.
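All four kinds of lookaround can be tried with the sketch below, assuming Python's re module. Note that this engine only accepts lookbehind subexpressions of fixed width, which "Windows " and \$ both are. The last two sample strings are made up for illustration.

```python
import re

# Positive lookbehind: digits preceded by "Windows ".
windows = re.search(r"(?<=Windows )\d+", "Windows 2003")
office = re.search(r"(?<=Windows )\d+", "Office 2003")
print(windows.group())   # 2003
print(office)            # None

# Negative lookahead: the match may not start where a 1 follows.
not_one = re.search(r"(?!1)\d+", "123")
print(not_one.group())   # 23

# Positive lookahead: digits followed by " USD", which is not part of the result.
amount = re.search(r"\d+(?= USD)", "price: 42 USD")
print(amount.group())    # 42

# Negative lookbehind: a whole number that is not preceded by a dollar sign.
plain = re.search(r"(?<!\$)\b\d+", "$10 and 20")
print(plain.group())     # 20
```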

3.5 Lazy and Greedy


Unfinished ......

Origin www.cnblogs.com/sinicheveen/p/12009355.html