Regular Expressions (Regex) in Detail

1. What is a regular expression

A regular expression is a formula for operating on strings: a combination of predefined special characters and ordinary characters that forms a "rule string", and this rule string expresses the filtering logic to apply to text. Given a regular expression and another string, you can check whether the string fits the rule and extract the specific part you want from it. Regular expressions are flexible, logical, and powerful, and they make complex string handling possible with very little code, but they can look obscure to anyone meeting them for the first time. Because regular expressions operate mainly on text, they are used in text editors of all kinds.

2. Concept

Regular expressions (often abbreviated in code as regex, regexp, or RE) are a concept in computer science. They are typically used to find and replace text that fits a given pattern (rule). Many programming languages support regular expressions for string operations; Perl, for example, has a powerful regular expression engine built in. The concept was first popularized by Unix tools such as sed and grep. The usual abbreviation is "regex"; the singular forms are regexp and regex, and the plural forms are regexps, regexes, and regexen.

3. Syntax

You have probably used the ? and * wildcards to find files on your hard disk. The ? wildcard stands for a single character in a file name (in some environments it can also match none), and the * wildcard matches zero or more characters. Here is a simple example to start with.

Example 1

Regular expression: ^[0-9]+abc$

^ matches the start of the input string.

[0-9]+ matches one or more digits: [0-9] matches a single digit, and + repeats the preceding item (here [0-9]) one or more times.

abc$ matches the letters abc at the end of the string: abc matches those letters literally, and $ matches the end of the input string.

This expression matches 123abc and 0abc, but not 123abc d or abc, because they do not follow the rule.
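
Example 1 can be checked with a few lines of Python. This is a minimal sketch that assumes Python's built-in re module; the helper name matches_example1 is made up for illustration and is not part of the original text.

```python
import re

# Pattern from Example 1: one or more digits, then "abc", anchored at both ends.
PATTERN = re.compile(r"^[0-9]+abc$")


def matches_example1(text: str) -> bool:
    """Return True when the whole string fits ^[0-9]+abc$."""
    return PATTERN.search(text) is not None


# The four sample strings discussed in Example 1.
for sample in ["123abc", "0abc", "123abc d", "abc"]:
    print(f"{sample!r}: {matches_example1(sample)}")
```

The first two samples print True and the last two print False, which agrees with the description above.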

Example 2

[Pattern for Example 2 not preserved]

The regular expression in this example matches runoob, runoob1, and run_oob, but not ru, which is shorter than the three-character minimum, and not runoob$, which contains a special character.

Online regular expression testing tools

https://c.runoob.com/front-end/854

https://regex101.com/ (recommended)

A regular expression describes a pattern that matches a set of strings. It can be used to check whether a string contains a given substring, to replace matching substrings, or to extract the substrings that meet some condition.

For example:

  • runoo+b matches runoob, runooob, runoooooob, and so on. The + means the preceding character must appear at least once (one or more times).

  • runoo*b matches runob, runoob, runoooooob, and so on. The * means the preceding character may be absent or may appear one or more times (zero or more times).

  • colou?r matches color or colour. The ? means the preceding character may appear at most once (zero or one time).
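
The three quantifiers can be tried side by side. The sketch below assumes Python's re module; the names CASES and full_match are illustrative, and the samples "runob" (for +) and "colouur" (for ?) are extra negative cases added here, not taken from the text.

```python
import re

# Patterns from the three bullets, each with a few samples to try.
# "runob" (under +) and "colouur" (under ?) are added negative samples.
CASES = {
    r"runoo+b": ["runoob", "runooob", "runob"],
    r"runoo*b": ["runob", "runoob", "runoooooob"],
    r"colou?r": ["color", "colour", "colouur"],
}


def full_match(pattern: str, text: str) -> bool:
    """Return True when the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


for pattern, samples in CASES.items():
    for sample in samples:
        print(f"{pattern!r} vs {sample!r}: {full_match(pattern, sample)}")
```

Only the two added negative samples fail: runoo+b needs at least one more o after "runo", and colou?r allows at most one u.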

A regular expression is built the same way as a mathematical expression: metacharacters and operators combine small expressions into larger ones. The components of a regular expression can be a single character, a character set, a character range, a choice between alternatives, or any combination of these.

A regular expression is a text pattern made up of ordinary characters (for example, the letters a to z) and special characters (called "metacharacters"). The pattern describes one or more strings to match when searching text. The regular expression acts as a template that is matched against the string being searched.
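
To illustrate how small expressions combine into a larger one, here is a hedged sketch in Python. The date-like format and every name in it (DIGIT, SEPARATOR, YEAR, MONTH_OR_DAY, DATE) are hypothetical choices made for this example; the pattern only checks the shape of the text, not whether the date is real.

```python
import re

# Hypothetical building blocks; each one is a small expression on its own.
DIGIT = r"[0-9]"                  # a character range
SEPARATOR = r"[-/]"               # a character set: hyphen or slash
YEAR = DIGIT + r"{4}"             # a range plus a quantifier
MONTH_OR_DAY = DIGIT + r"{1,2}"   # one or two digits

# Combine the pieces into one larger expression.
DATE = rf"^{YEAR}{SEPARATOR}{MONTH_OR_DAY}{SEPARATOR}{MONTH_OR_DAY}$"


def looks_like_date(text: str) -> bool:
    """Return True when text has the shape YYYY-M-D or YYYY/M/D."""
    return re.search(DATE, text) is not None


print(DATE)
print(looks_like_date("2019-09-02"))
print(looks_like_date("19-09-02"))
```

Naming the pieces keeps the final expression readable, at the cost of having to look up each part to see the full pattern.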

4. Metacharacter reference

Character    Description
\    Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, "n" matches the character "n", while "\n" matches a newline. The sequence "\\" matches "\", and "\(" matches "(".
^    Matches the start of the input string. If the RegExp object's Multiline property is set, ^ also matches the position after "\n" or "\r".
$    Matches the end of the input string. If the RegExp object's Multiline property is set, $ also matches the position before "\n" or "\r".
*    Matches the preceding subexpression zero or more times. For example, "zo*" matches "z" and "zoo". * is equivalent to {0,}.
+    Matches the preceding subexpression one or more times. For example, "zo+" matches "zo" and "zoo", but not "z". + is equivalent to {1,}.
?    Matches the preceding subexpression zero or one time. For example, "do(es)?" matches "do" or "does". ? is equivalent to {0,1}.
{n}    n is a non-negative integer. Matches exactly n times. For example, "o{2}" does not match the "o" in "Bob", but matches the two o's in "food".
{n,}    n is a non-negative integer. Matches at least n times. For example, "o{2,}" does not match the "o" in "Bob", but matches all the o's in "foooood". "o{1,}" is equivalent to "o+", and "o{0,}" is equivalent to "o*".
{n,m}    m and n are non-negative integers with n <= m. Matches at least n and at most m times. For example, "o{1,3}" matches the first three o's in "fooooood". "o{0,1}" is equivalent to "o?". Note that there must be no space between the comma and the two numbers.
?    When this character immediately follows any other quantifier (*, +, ?, {n}, {n,}, {n,m}), the match is non-greedy. A non-greedy pattern matches as little of the searched string as possible, whereas the default greedy pattern matches as much as possible. For example, against the string "oooo", "o+?" matches a single "o", while "o+" matches all the o's.
.    Matches any single character except "\n". To match any character including "\n", use a pattern such as "(.|\n)".
(pattern)    Matches pattern and captures the match. The captured match can be retrieved from the resulting Matches collection, using the SubMatches collection in VBScript and the $0...$9 properties in JScript. To match parenthesis characters, use "\(" or "\)".
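(?:pattern)    Matches pattern but does not capture the match; it is a non-capturing match that is not stored for later use. This is useful when combining parts of a pattern with the "or" character (|). For example, "industr(?:y|ies)" is a more concise expression than "industry|industries".
(?=pattern)    Positive lookahead. Matches the search string at any position where a string matching pattern begins. This is a non-capturing match. For example, "Windows(?=95|98|NT|2000)" matches the "Windows" in "Windows2000", but not the "Windows" in "Windows3.1". A lookahead does not consume characters: after a match, the search for the next match starts immediately after the last match, not after the characters that made up the lookahead.
(?!pattern)    Negative lookahead. Matches the search string at any position where a string not matching pattern begins. This is a non-capturing match. For example, "Windows(?!95|98|NT|2000)" matches the "Windows" in "Windows3.1", but not the "Windows" in "Windows2000". A lookahead does not consume characters: after a match, the search for the next match starts immediately after the last match, not after the characters that made up the lookahead.
(?<=pattern)    Positive lookbehind. Like positive lookahead, but in the opposite direction. For example, "(?<=95|98|NT|2000)Windows" matches the "Windows" in "2000Windows", but not the "Windows" in "3.1Windows".
(?<!pattern)    Negative lookbehind. Like negative lookahead, but in the opposite direction. For example, "(?<!95|98|NT|2000)Windows" matches the "Windows" in "3.1Windows", but not the "Windows" in "2000Windows".
x|y    Matches x or y. For example, "z|food" matches "z" or "food". "(z|f)ood" matches "zood" or "food".
[xyz]    Character set. Matches any one of the enclosed characters. For example, "[abc]" matches the "a" in "plain".
[^xyz]    Negated character set. Matches any character that is not enclosed. For example, "[^abc]" matches the "p" in "plain".
[a-z]    Character range. Matches any character in the specified range. For example, "[a-z]" matches any lowercase letter from "a" to "z".
[^a-z]    Negated character range. Matches any character that is not in the specified range. For example, "[^a-z]" matches any character outside the range "a" to "z".
\b    Matches a word boundary, that is, the position between a word and a space. For example, "er\b" matches the "er" in "never", but not the "er" in "verb".
\B    Matches a non-word boundary. "er\B" matches the "er" in "verb", but not the "er" in "never".
\cx    Matches the control character indicated by x. For example, \cM matches Control-M, that is, a carriage return. The value of x must be one of A-Z or a-z; otherwise, c is treated as a literal "c" character.
\d    Matches a digit character. Equivalent to [0-9].
\D    Matches a non-digit character. Equivalent to [^0-9].
\f    Matches a form feed. Equivalent to \x0c and \cL.
\n    Matches a newline. Equivalent to \x0a and \cJ.
\r    Matches a carriage return. Equivalent to \x0d and \cM.
\s    Matches any whitespace character, including space, tab, form feed, and so on. Equivalent to [ \f\n\r\t\v].
\S    Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t    Matches a tab. Equivalent to \x09 and \cI.
\v    Matches a vertical tab. Equivalent to \x0b and \cK.
\w    Matches any word character, including the underscore. Equivalent to "[A-Za-z0-9_]".
\W    Matches any non-word character. Equivalent to "[^A-Za-z0-9_]".
\xn    Matches n, where n is a hexadecimal escape value. A hexadecimal escape value must be exactly two digits long. For example, "\x41" matches "A". "\x041" is equivalent to "\x04" followed by "1". This allows ASCII codes to be used in regular expressions.
\num    Matches num, where num is a positive integer. This is a reference back to a captured match. For example, "(.)\1" matches two consecutive identical characters.
\n    Identifies an octal escape value or a backreference. If \n is preceded by at least n captured subexpressions, n is a backreference. Otherwise, if n is an octal digit (0-7), n is an octal escape value.
\nm    Identifies an octal escape value or a backreference. If \nm is preceded by at least nm captured subexpressions, nm is a backreference. If \nm is preceded by at least n captures, n is a backreference followed by the literal m. If neither condition holds and n and m are both octal digits (0-7), \nm matches the octal escape value nm.
\nml    If n is an octal digit (0-3) and m and l are both octal digits (0-7), matches the octal escape value nml.
\un    Matches n, where n is a Unicode character expressed as four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©).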
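
A few rows of the table are easier to follow when run. The sketch below uses Python's re module, which is an assumption: the table describes the VBScript/JScript engine, and other engines differ in details. Lookbehind is left out because Python's re only accepts fixed-width lookbehind patterns, which the table's 95|98|NT|2000 alternation is not.

```python
import re

# Greedy vs. non-greedy: o+ takes every "o", o+? takes as few as possible.
greedy = re.search(r"o+", "oooo").group()
lazy = re.search(r"o+?", "oooo").group()

# Capturing vs. non-capturing groups: only (y|ies) stores what it matched.
capturing = re.search(r"industr(y|ies)", "industries").groups()
non_capturing = re.search(r"industr(?:y|ies)", "industries").groups()

# Lookahead: the version suffix is checked but is not part of the match.
positive = re.search(r"Windows(?=95|98|NT|2000)", "Windows2000")
negative = re.search(r"Windows(?!95|98|NT|2000)", "Windows2000")

# Backreference: \1 repeats whatever group 1 captured.
doubled = re.search(r"(.)\1", "food").group()

print(greedy, lazy)              # oooo o
print(capturing, non_capturing)  # ('ies',) ()
print(positive.group())          # Windows
print(negative)                  # None
print(doubled)                   # oo
```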

5. Common regular expressions

Username    /^[a-z0-9_-]{3,16}$/
Password    /^[a-z0-9_-]{6,18}$/
Hex value    /^#?([a-f0-9]{6}|[a-f0-9]{3})$/
Email    /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/
Email (alternative)    /^[a-z\d]+(\.[a-z\d]+)*@([\da-z](-[\da-z])?)+(\.{1,2}[a-z]+)+$/
URL    /^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/
IP address    /((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?)/
IP address (alternative)    /^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$/
HTML tag    /^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$/
Remove // comments from code    (?<!http:|\S)//.*$
Range of Chinese characters in Unicode    /^[\u2E80-\u9FFF]+$/
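
Three of the patterns above can be exercised as follows. This is a sketch under assumptions: it uses Python's re module, drops the JavaScript-style /.../ delimiters, and introduces the names USERNAME, HEX_COLOR, IPV4, and is_valid purely for illustration. The patterns are lowercase-only because no ignore-case flag is set.

```python
import re

# Patterns copied from the list above, without the /.../ delimiters.
USERNAME = re.compile(r"^[a-z0-9_-]{3,16}$")
HEX_COLOR = re.compile(r"^#?([a-f0-9]{6}|[a-f0-9]{3})$")
IPV4 = re.compile(
    r"^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}"
    r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$"
)


def is_valid(pattern: re.Pattern, text: str) -> bool:
    """Return True when text matches the given compiled pattern."""
    return pattern.search(text) is not None


print(is_valid(USERNAME, "kevin_yang"))   # True
print(is_valid(USERNAME, "ab"))           # False: shorter than 3 characters
print(is_valid(HEX_COLOR, "#1a2b3c"))     # True
print(is_valid(IPV4, "192.168.0.1"))      # True
print(is_valid(IPV4, "256.1.1.1"))        # False: 256 is out of range
```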

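6. Why use regular expressions?

A typical search-and-replace operation requires you to supply the exact text that matches the result you expect. That technique may be adequate for simple search-and-replace tasks on static text, but it lacks flexibility; searching dynamic text this way is difficult, if not impossible.

With regular expressions, you can:

  • Test for a pattern within a string.
    For example, you can test an input string to see whether a telephone number pattern or a credit card number pattern occurs in it. This is called data validation.
  • Replace text.
    You can use a regular expression to identify specific text in a document and either remove it completely or replace it with other text.
  • Extract a substring from a string based on a pattern match.
    You can find specific text within a document or an input field.

For example, you may need to search an entire website, remove outdated material, and replace some HTML formatting tags. In this case, you can use a regular expression to determine whether the material or the HTML formatting tags appear in each file. This narrows the list of affected files to those that contain material to be removed or changed. You can then use a regular expression to remove the outdated material. Finally, you can use a regular expression to search for and replace the tags.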
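
The three uses listed in this section (testing, replacing, extracting) can be sketched in Python as follows. Everything concrete here is an assumption made for illustration: the NNN-NNNN phone format, the choice of the <b> tag, and the function names.

```python
import re

# 1. Test for a pattern (data validation).
#    The NNN-NNNN phone format is a hypothetical example.
PHONE = re.compile(r"^\d{3}-\d{4}$")


def looks_like_phone(text: str) -> bool:
    return PHONE.search(text) is not None


# 2. Replace text: drop <b> and </b> tags but keep what they enclose.
def strip_bold(html: str) -> str:
    return re.sub(r"</?b>", "", html)


# 3. Extract substrings: collect every run of digits in a string.
def extract_numbers(text: str) -> list:
    return re.findall(r"\d+", text)


print(looks_like_phone("555-1234"))         # True
print(strip_bold("a <b>bold</b> word"))     # a bold word
print(extract_numbers("order 66, item 7"))  # ['66', '7']
```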

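7. History

The "ancestors" of regular expressions can be traced back to early research on how the human nervous system works. Warren McCulloch and Walter Pitts, two neurophysiologists, developed a mathematical way of describing these neural networks.

In 1956, a mathematician named Stephen Kleene, building on the earlier work of McCulloch and Pitts, published a paper titled "Representation of Events in Nerve Nets and Finite Automata" that introduced the concept of regular expressions. Regular expressions were the expressions used to describe what he called "the algebra of regular sets", hence the term "regular expression".

This work was later found to be applicable to some early research on computational search algorithms by Ken Thompson, the principal inventor of Unix. The first practical application of regular expressions was the qed editor in Unix.

The rest, as they say, is history. Regular expressions have been an important part of text-based editors and search tools ever since.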

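8. Applications

Regular expressions are now widely used in software: they appear in operating systems such as *nix (Linux, Unix, and so on) and HP, in development environments such as PHP, C#, and Java, and in many application programs.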

9. Recommended books

References:

https://www.runoob.com/regexp/regexp-tutorial.html

https://www.w3cschool.cn/zhengzebiaodashi/regexp-tutorial.html

https://regex101.com/

Origin www.cnblogs.com/Kevin-Yang/p/11444118.html