[编程语言] - 正则表达式

一、正则表达式的概念

形而上者谓之道，形而下者谓之器！在学习和使用正则表达式时，知其然，知其所以然，不然的话，理解的不深刻，就会学着累，用着烦！

1.1 正则表达式很重要！

字符串是编程时涉及到的最多的一种数据结构，对字符串进行操作的需求无处不在。通过正则表达式可以精确匹配特定模式的字符串，进而更好的完成字符串操作，比如模式校验、检索、替换等等。

所以，程序员必须熟练使用正则表达式！

1.2 正则表达式是一种形式语言（formal language）

语言有其普遍性的特征和规律。从应用角度出发，由下而上的去学习和理解正则表达式往往事倍功半；而从形式语言的角度，自上而下的去学习正则表达式，则事半功倍。

正则表达式是一种由符号和规则组成的形式语言。形式语言是一种用简洁的、抽象的、形式化的数学公式来表达复杂语言语法的特殊语言，它只研究语言的组成规律，不涉及语义！

https://en.wikipedia.org/wiki/Regular_expression

有兴趣的同学可以了解一下形式语言（fromal language），形式逻辑（formal logic）相关的知识。

Metacharacter ：元字符

A metacharacter is a charachter that has a special meaning (instead of a literal meaning) to a computer program, such as a shell interpreter or a regex engine.

Regular character: 常规字符

regular character has a literal meaning

Regular expression：

Together, metacharacter and regular character can be used to identify text of a given pattern or process a number of instances of it.

wildcard character: has fewer metacharacters and a simple language-base.

In software, a wildcard character is a kind of placeholder represented by a single character, such as an asterisk(*), which can be interpreted as a number of liberal characters or an empty string.

regex 与 wildcard 比较：
regex processor： NFA(不确定的有穷自动机)，DFA(确定的有穷自动机)

1.3 正则表达式的应用

正则表达式就是字符串的模式匹配，常用于：

校验：检查一个字符串中是否包含指定模式的子字符串
替换：将匹配的子字符串替换为给定的新字符串
捕获：提取符合匹配模式的字符串

二、正则表达式的理解和使用

正则表达式是一个表达式，一个 expression
表达式由若干个 pattern 组成
pattern 由符号组成，且遵循特定的规则

本质上，一个模式单元（pattern unit）要做的事情就是来回答：谁 + 多少 + 在哪里

想一想，如果是我们自己来设计正则表达式的语言规则，我们应该如何设计呢？首先，至少需要三类符号来指代“谁”、“多少”、“在哪里”，更语言化的说法是，我们需要相关的代词，量词，和定位介词！

Patten = 代词 + 量词 + 定位介词

2.1 代词

代词，用来指代一个特定字符！一定要明白，代词限定了“一个”字符的范围！

常用的代词如下：

\d：数字
\D：非数字
\w： == [0-9a-zA-Z_]
\W： == [^0-9a-zA-Z_]
\s：空格
\S：非空格
. 任意字符
[0-9a-zA-Z_] : 字符集
[^xyz] :
[[:alpha:]] 任何字母
[[:digit:]] 任何数字
[[:alnum:]] 任何字母和数字
[[:space:]] 任何空白字符
[[:upper:]] 任何大写字母
[[:lower:]] 任何小写字母
[[:punct:]] 任何标点符号
[[:xdigit:]] 任何16进制的数字，相当于[0-9a-fA-F]

2.2 量词：Quantification

A quantifier after a token or group specifies how oftern that a preceding element is allowed to occur.

量词，就是有多少个代词。

* : >= 0
+ : >= 1
? : == 1 or == 0
{n} : == n
{n, m} : >=n and <= m
*? : 非贪婪
+? : 非贪婪

2.3 定位词

^ : 行首
$ : 行尾
\b : 单词边界
\B : 非单词边界

2.4 选择器:

用来捕获匹配字符串，或者组合多种模式进行匹配

(pattern)
(?:pattern) ：匹配 pattern 但不获取匹配结果，hello|Hello 可以写成 (H|h)ello
(?=pattern) ：正向肯定预查（look ahead positive assert）, Windows(?!95|98|NT|2000) 不匹配 Windows3.1
(?!pattern) ：正向否定预查(negative assert), Windows(?!95|98|NT|2000) 匹配 Windows3.1
(?<=pattern) ：反向(look behind)肯定预查
(?<!pattern) ：反向否定预查

捕获组是通过从左到右计算其开括号来编码，例如，在表达式 ((A)((B©)))，对应以下捕获组：

((A)(B©))
(A)
(B©)
©

2.5反向引用：

\1

二义性的符号 ^ : 在代词里表示非，在定位词中表示行首
二义性的符号？：在代词后面表示量词，在量词后面表示非贪婪，在选择器中表示非捕获

2.6 说明

/abcde?fg/

没有量词的代词，要精确匹配，比如例子中除了 e 之外的字符都要精确匹配，e 有量词？，所以表达式可以匹配 abcdefg, abcdfg

abcdefg[1-9]{3}

匹配 abcdefg 后面跟三个数字，比如 abcdefg123

/\bHel/

匹配 Hello，不匹配 oHell

/llo\b/

匹配 Hello，不匹配 helloy

三、 Java/Python/JS 中正则表达式的使用示例

3.1 JavaScript

正则表达式的用途：查找，替换，匹配，分割。
匹配包含完全匹配和子匹配（就是选择器中的pattern对应的匹配）！

// 模式定义的两种方式
var pattern = /cde/i
var pattern = new RegExp('cde', "i")

// 字符串： search, replace, split, match
var str = "abcxefcygcz"
str.search(/cxe/) // 2
str.replace(/cxe/, "xxx") // abxxxfcygcz
'a_*_b_#_c'.split(/_[#*]_/) // ['a', 'b', 'c']
"abcxefcygcz".split(/c[xyz]/) // ["ab", "ef", "g", ""]
"abcxefcygcz".match(/c(x|y|z)g([\w]+)/g)// ["cygcz"]
"abcxefcygcz".match(/c(x|y|z)/g) // ["cx", "cy", "cz"]
"abcxefcygcz".match(/c(x|y|z)/) // ["cx", "x", index: 2, input: "abcxefcygcz", groups: undefined]
"abcxefcygcz".match(/c(x|y|z)g([\w]+)/) // ["cygcz", "y", "cz", index: 6, input: "abcxefcygcz", groups: undefined]

// 模式： test，exec
/cxe/i.test("abcxefcygcz") // true
/c(x|y|z)g([\w]+)/g.exec("abcxefcygcz") // ["cygcz", "y", "cz", index: 6, input: "abcxefcygcz", groups: undefined]

修饰符	描述
i	执行对大小写不敏感的匹配。
g	执行全局匹配（查找所有匹配而非在找到第一个匹配后停止）。
m	执行多行匹配。

/*是否带有小数*/
/^\d+\.\d+$/ 

/*校验是否中文名称组成 */
/^[\u4E00-\u9FA5]{2,4}$/

/*校验是否全由8位数字组成 */
/^[0-9]{8}$/

/*校验电话码格式 */
/^((0\d{2,3}-\d{7,8})|(1[3584]\d{9}))$/

/*校验邮件地址是否合法 */
/^\w+@[a-zA-Z0-9]{2,10}(?:\.[a-z]{2,4}){1,3}$/

提取匹配的子字符串

3.2 Java

java.util.regex 包主要包括以下三个类：

Pattern
Matcher
PatternSyntaxException

import java.util.regex.*;
 
class RegexDemo{
   public static void main(String args[]){
        String content ="abcxefcygcz";
        String pattern = "cxe";
 
        boolean isMatch = Pattern.matches(pattern, content); // true


        Pattern r = Pattern.compile("c(x|y|z)g([\w]+)");
        Matcher m = r.matcher("abcxefcygcz")

        if (m.find()) {
            System.out.println("Found value: " + m.group(0) );
            System.out.println("Found value: " + m.group(1) );
            System.out.println("Found value: " + m.group(2) );
        } 

        m.rereplaceAll('xxx') // # "abxxxfcygcz"
   }
}

在 Java 中，\\ 表示：我要插入一个正则表达式的反斜线，所以其后的字符具有特殊的意义。

3.3 Python

re.match : 至匹配字符串的开始，失败则返回 None
re.search: 匹配整个字符串

正则表达式对象

re.RegexObject
re.MatchObject

flags	描述
re.I	执行对大小写不敏感的匹配。
re.M	执行多行匹配。

import re

re.match(r'abc', "abcxefcygcz").span() # (0, 3) 
re.match(r'abc', "abcxefcygcz").group() # abc
re.match(r'bcx', "abcxefcygcz") # None

g = re.search(r'c(x|y|z)g([\w]+)', "abcxefcygcz")
g.group() # "cygcz"
g.group(1) #  "y"
g.group(2) # "cz"

# sub : 字符串替换
re.sub('cxe', 'xxx', "abcxefcygcz") # "abxxxfcygcz"

def m_upper(m):
    return m.group().upper()

re.sub('cxe', m_upper, "abcxefcygcz") # abCXEfcygcz

# compile
pattern = re.compile(r'([a-z]+) ([a-z]+)', re.I)   # re.I 表示忽略大小写
m = pattern.match('Hello World Wide Web')

# findall : 返回一个列表，如果没有找到匹配的，则返回空列表。
re.findall(r'\d+', 'runoob 123 google 456') # ['123', '456']

# finditer : 返回一个迭代器
it = re.finditer(r"\d+","12a32bc43jf3") 
for match in it: 
    print (match.group() )

# split : 
re.split('c[xyz]', "abcxefcygcz") # ["ab", "ef", "g", ""]