Getting Started with Regular Expressions

Who this post is for: beginners with regular expressions, and programmers who want to get started with them

 

    A regular expression describes a string-matching pattern that can be used to check whether a string conforms to a certain rule. Most programming languages have regular expressions built in. Although the different flavors make the syntax vary slightly from language to language, the basic ideas and the common syntax carry over between them.

    Regular expressions are used widely in everyday programming and maintenance: validating form content in JavaScript, matching URLs when Nginx acts as a reverse proxy, searching with the grep command on Linux, and countless other cases of matching and replacing text.

    When I first came across regular expressions, I resisted them. They looked like this: ^[a-zA-Z]\w{5,17}$, or like this: (^\d{15}$)|(^\d{18}$)|(^\d{17}(\d|X|x)$). A mess... no logic to it... like a book from heaven... what on earth... But you know, before you learn English, "How are you" is a mess too. Before you learn Java, System.out.println("Hello World"); is a mess too. So regular expressions are nothing special: if you can't read them, it only means you haven't learned them yet.

    This post will get you started with regular expressions, so that you can read and write them yourself. Let's get to the point!

 

    First, the keywords. A regular expression is made up of ordinary characters and metacharacters. Here I call the metacharacters "keywords". They are:

 

^ beginning of text
$ end of text
[] character group
[-] range within a character group
* match 0 or more times
+ match 1 or more times
. any character
| or
() grouping
? match 0 or 1 times
\n backreference
{m,n},{n,},{n} interval quantifier
[^] negated character group
\ escape
\d digit
\D non-digit
\s whitespace
\S non-whitespace
\w letter, digit, or underscore
\W anything other than a letter, digit, or underscore
\b word boundary
\B non-word boundary

After mastering these keywords, you can read and write your own regular expressions. Don't worry, let's go through them one by one.

  1. ^ matches the beginning of the text. It does not match a character, only a position. For example, ^a matches text starting with a, such as ab and abc, while ^ab matches text whose first character is a and whose next character is b, such as ab and abck. Note: read a regular expression one character at a time, so ^ab is best understood as "starts with the character a, followed by the character b", not as "starts with the unit ab".
  2. $ matches the end of the text. As with ^, abc$ matches text whose last character is c, whose second-to-last character is b, and whose third-to-last character is a, such as aabc and asabc.
  3. [] is a character group and matches any one of the characters inside it. For example, ^a[abc]a$ matches text that starts with the character a, followed by a or b or c, followed by a as the final character, such as aba, aaa, and aca.
  4. [-] is the range connector. For example, [a-z] matches any one of a, b, c ... x, y, z; [a-zA-Z] matches any one of a ... z or A ... Z; [0-9] matches any one of 0, 1, 2 ... 9. Note: - acts as a connector only inside a character group [] and only when it is not the first element. In [-12], the - is an ordinary character, so the group matches any one of -, 1, 2. A combined example: a-b[-a-z0-9] matches text consisting of the character a, then -, then b, then any one of -, a ... z, 0 ... 9.
  5. * matches the previous expression 0 or more times. A few examples: ab*c matches the character a, then n occurrences of b (n may be any natural number, including 0), then the character c, such as ac, abc, abbc, and abbbc. a[abc]*c matches the character a, then n characters each of which is a, b, or c, then the character c, such as ac, abc, abac, and abbacac. More complicated: in (a[abc]*c)* the expression inside () may appear n times, such as acabcabacabbacac.
  6. + matches the previous expression 1 or more times. It works like *, except that * may match 0 times while + must match at least once.
  7. ? matches the previous expression 0 or 1 times. It works like * and +.
  8. . matches any single character (in most flavors, any character except a newline by default). For example, a.b matches the character a, then any character, then the character b, such as ahb and a&b. Note: inside [] the . does not mean "any character"; it is just the ordinary character . and only outside [] does it match any character.
  9. | means "or" and matches either the expression on its left or the expression on its right. For example, grep|grap matches grep or grap. Note that | does not apply to the two characters next to it but to the entire expression on each side. () can limit its scope: gr(e|a)p also matches grep or grap.
  10. () groups part of an expression, so that a quantifier or | applies to the whole group. It also captures the text it matched, which backreferences (item 11) rely on.
  11. \n (where n is a positive integer) is a backreference and matches the same text that the nth () matched earlier. For example, ([a-z]+) \1 matches repeated words such as "you you" and "ha ha". Here \1 matches the text that ([a-z]+) has already matched.
  12. {m,n}, {n,}, {n} are interval quantifiers and work like *, +, and ?. {m,n} matches m to n times, {n} matches exactly n times, and {n,} matches at least n times. For example, [a-zA-Z]{1,5} matches text made of 1 to 5 letters in either case, such as a and aAa.
  13. [^] is a negated character group and matches any character that is not listed in []. For example, [^0-9] matches any character that is not a digit. Note: ^ negates the group only when it appears inside [] as the first character. The ^ in [0-9^] does not negate anything; it is an ordinary character, so the group matches any one of 0 to 9 or ^.
  14. \d matches a digit, equivalent to [0-9].
  15. \D matches a non-digit, equivalent to [^0-9].
  16. \ is the escape character: \. matches the character . and \* matches the character *.
  17. \s matches any whitespace character, such as a space, tab, or carriage return.
  18. \S matches any non-whitespace character.
  19. \w matches a letter, digit, or underscore, equivalent to [A-Za-z0-9_].
  20. \W matches any character other than a letter, digit, or underscore, equivalent to [^A-Za-z0-9_].
  21. \b matches a word boundary. For example, er\b matches the er in never, but not the er in verb.
  22. \B matches a non-word boundary, the opposite of \b. For example, er\B matches the er in verb, but not the er in never.
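To try the keywords out, here is a minimal sketch using Python's re module (the choice of Python and the helper names check_cases and find_er_at_word_end are mine, not part of the original post). Each case pairs a pattern from the list with a sample text and the result I would expect; re.fullmatch requires the whole text to match, so ^ and $ are implied.

```python
import re

# Each entry: (pattern, text, whether the whole text should match).
CASES = [
    (r"a[abc]a", "aba", True),        # character group
    (r"a[abc]a", "ada", False),
    (r"a-b[-a-z0-9]", "a-b-", True),  # a leading - inside [] is literal
    (r"ab*c", "ac", True),            # * allows zero repetitions
    (r"ab+c", "ac", False),           # + needs at least one b
    (r"ab?c", "abbc", False),         # ? allows at most one b
    (r"a.b", "a&b", True),            # . matches any single character
    (r"gr(e|a)p", "grap", True),      # () limits the scope of |
    (r"([a-z]+) \1", "ha ha", True),  # \1 repeats what group 1 matched
    (r"([a-z]+) \1", "ha ho", False),
    (r"[a-zA-Z]{1,5}", "aAa", True),  # interval quantifier
    (r"[^0-9]", "7", False),          # negated character group
]


def check_cases(cases=CASES):
    """Return the cases whose actual result differs from the expected one."""
    return [(p, t) for p, t, want in cases if bool(re.fullmatch(p, t)) != want]


def find_er_at_word_end(text):
    """Return the start offsets where 'er' is followed by a word boundary."""
    return [m.start() for m in re.finditer(r"er\b", text)]


if __name__ == "__main__":
    print(check_cases())                      # an empty list means all cases behaved as expected
    print(find_er_at_word_end("never verb"))  # only the 'er' in "never" qualifies
```

Changing a pattern or a sample text and rerunning is a quick way to test your reading of each keyword.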

Well, that's all of them. Once you understand the 22 keywords above, you are off to a good start.

Let’s practice it and try to analyze several regular expressions. Let’s take the two expressions mentioned at the beginning as an example:

1. ^[a-zA-Z]\w{5,17}$

    The first character is a letter, uppercase or lowercase, followed by 5 to 17 letters, digits, or underscores (\w is equivalent to [A-Za-z0-9_]). In other words, this expression checks whether a password of 6 to 18 characters is valid.
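A small sketch of this check in Python (the function name is_valid_password is hypothetical). The re.ASCII flag is my addition: without it, Python's \w also accepts non-ASCII letters and digits, which would not match the [A-Za-z0-9_] reading used above.

```python
import re

# re.ASCII restricts \w to [A-Za-z0-9_], as described in the text.
PASSWORD_RE = re.compile(r"^[a-zA-Z]\w{5,17}$", re.ASCII)


def is_valid_password(text):
    """True if text starts with a letter and has 6 to 18 word characters in total."""
    return PASSWORD_RE.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ["a12345", "a1234", "1abcdef", "abc_123", "abc-123"]:
        print(candidate, is_valid_password(candidate))
```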

 

2. (^\d{15}$)|(^\d{18}$)|(^\d{17}(\d|X|x)$)

    The expression is made of several alternatives joined by "or", so taking it apart, it is equivalent to ^\d{15}$ or ^\d{18}$ or ^\d{17}(\d|X|x)$. That is, it matches text made of 15 digits, or text made of 18 digits, or text made of 17 digits followed by one more digit or an x or an X. Put plainly, the expression matches an ID card number.
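The three alternatives can be told apart by looking at which capture group took part in the match. The sketch below is only an illustration (id_branch is a hypothetical helper), and it checks the shape of the number only; it does not verify the check digit.

```python
import re

ID_RE = re.compile(r"(^\d{15}$)|(^\d{18}$)|(^\d{17}(\d|X|x)$)", re.ASCII)


def id_branch(text):
    """Return 1, 2 or 3 for the alternative that matched, or None if none did."""
    m = ID_RE.fullmatch(text)
    if m is None:
        return None
    for index in (1, 2, 3):
        # A group that did not participate in the match is reported as None.
        if m.group(index) is not None:
            return index
    return None


if __name__ == "__main__":
    print(id_branch("1" * 15))        # first alternative
    print(id_branch("1" * 17 + "X"))  # third alternative
    print(id_branch("1" * 16))        # no alternative matches
```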

 

Next, use it and try it out:

1. First write an expression to check the QQ number:

    First, we need to know that a QQ number consists of 5 to 11 digits, starting from 10000. The first character cannot be 0, so the expression for the first character is ^[1-9]. The remaining 4 to 10 characters are digits, so we extend it to ^[1-9]\d{4,10}$, where the final $ makes sure nothing follows the digits. Done!
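One detail worth trying out is the end anchor. The sketch below (names are hypothetical) compares the expression with and without a trailing $; without it, a 12-digit string still matches, because the first 11 digits already satisfy the pattern.

```python
import re

QQ_PREFIX_RE = re.compile(r"^[1-9]\d{4,10}", re.ASCII)   # no end anchor
QQ_FULL_RE = re.compile(r"^[1-9]\d{4,10}$", re.ASCII)    # anchored at both ends


def looks_like_qq(text):
    """True if the whole text is 5 to 11 digits and does not start with 0."""
    return QQ_FULL_RE.fullmatch(text) is not None


if __name__ == "__main__":
    too_long = "123456789012"  # 12 digits
    print(QQ_PREFIX_RE.match(too_long) is not None)  # the unanchored pattern accepts a prefix
    print(looks_like_qq(too_long))
    print(looks_like_qq("10000"))
```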

2. Write another expression that matches dates. Suppose we want to match date strings in the format of yyyy-mm-dd:

    Let's match the year first. In theory the year can be any four digits; here we decide to match dates between 2000 and 2999, so the year is written as ^2\d{3}. Next comes a hyphen - followed by the month. The month is 1 to 12, so its first digit can be 0 or 1 and its second digit can be any digit from 0 to 9, which gives ^2\d{3}-[01]\d. Next comes another hyphen - followed by the day. The day is 1 to 31, so its first digit can be 0, 1, 2, or 3 and its second digit can be any digit from 0 to 9, which gives ^2\d{3}-[01]\d-[0123]\d. Got it! (Note: \d is equivalent to [0-9].)
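This expression checks the shape of a date, not whether the date exists: it also accepts strings such as 2024-19-39. A hedged sketch of one way to combine the two checks in Python follows. The helper names are mine, the trailing $ is my addition, and handing the calendar check to datetime.strptime is one option among several.

```python
import re
from datetime import datetime

# The pattern from the text, with a $ added so nothing may follow the day.
DATE_SHAPE_RE = re.compile(r"^2\d{3}-[01]\d-[0123]\d$", re.ASCII)


def has_date_shape(text):
    """True if text looks like 2yyy-mm-dd according to the regular expression."""
    return DATE_SHAPE_RE.fullmatch(text) is not None


def is_real_date(text):
    """True if text has the right shape and is also a real calendar date."""
    if not has_date_shape(text):
        return False
    try:
        datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    for candidate in ["2024-02-29", "2024-02-30", "2024-19-39", "1999-01-01"]:
        print(candidate, has_date_shape(candidate), is_real_date(candidate))
```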

 

Once you understand the above, you have the basics, and the rest is up to you!

Note: Because there are different flavors, regular expression support differs slightly from language to language, although it is broadly the same. For the specific differences, consult the relevant documentation when you use it.
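As one concrete instance of such a difference, in Python 3's re module \d on a text pattern matches any Unicode decimal digit unless the re.ASCII flag is given, so it is broader than [0-9]. The sketch below illustrates this (digit_match is a hypothetical helper); other languages may behave differently, so check their documentation.

```python
import re

ARABIC_INDIC_123 = "\u0661\u0662\u0663"  # the digits 1, 2, 3 in Arabic-Indic script


def digit_match(text, ascii_only):
    """True if the whole text matches \\d+, optionally restricted to ASCII digits."""
    flags = re.ASCII if ascii_only else 0
    return re.fullmatch(r"\d+", text, flags) is not None


if __name__ == "__main__":
    print(digit_match("123", ascii_only=False))
    print(digit_match(ARABIC_INDIC_123, ascii_only=False))  # accepted by default in Python 3
    print(digit_match(ARABIC_INDIC_123, ascii_only=True))   # rejected with re.ASCII
```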

 

Here are some common examples:

1. Expressions for validating numbers

  • Numbers: ^[0-9]*$
  • n-digit number: ^\d{n}$
  • At least n digits : ^\d{n,}$
  • m to n digits: ^\d{m,n}$
  • Zero, or a number that does not start with 0: ^(0|[1-9][0-9]*)$
  • Numbers that do not start with 0, with up to two decimal places: ^([1-9][0-9]*)+(\.[0-9]{1,2})?$
  • Positive or negative numbers with 1-2 decimal places: ^(\-)?\d+(\.\d{1,2})$
  • Positive, negative, and decimal numbers: ^(\-|\+)?\d+(\.\d+)?$
  • Positive real numbers with two decimal places: ^[0-9]+(\.[0-9]{2})?$
  • Positive real numbers with 1~3 decimal places: ^[0-9]+(\.[0-9]{1,3})?$
  • Non-zero positive integer: ^[1-9]\d*$ or ^([1-9][0-9]*){1,3}$ or ^\+?[1-9][0-9]*$
  • Non-zero negative integer: ^\-[1-9][0-9]*$ or ^-[1-9]\d*$
  • Non-negative integer: ^\d+$ or ^([1-9]\d*|0)$
  • Non-positive integer: ^(-[1-9]\d*|0)$ or ^((-\d+)|(0+))$
  • Non-negative floating point number: ^\d+(\.\d+)?$ or ^([1-9]\d*\.\d*|0\.\d*[1-9]\d*|0?\.0+|0)$
  • Non-positive floating point number: ^((-\d+(\.\d+)?)|(0+(\.0+)?))$ or ^((-([1-9]\d*\.\d*|0\.\d*[1-9]\d*))|0?\.0+|0)$
  • Positive floating point number: ^([1-9]\d*\.\d*|0\.\d*[1-9]\d*)$ or ^(([0-9]+\.[0-9]*[1-9][0-9]*)|([0-9]*[1-9][0-9]*\.[0-9]+)|([0-9]*[1-9][0-9]*))$
  • Negative floating point number: ^-([1-9]\d*\.\d*|0\.\d*[1-9]\d*)$ or ^(-(([0-9]+\.[0-9]*[1-9][0-9]*)|([0-9]*[1-9][0-9]*\.[0-9]+)|([0-9]*[1-9][0-9]*)))$
  • Floating point number: ^(-?\d+)(\.\d+)?$ or ^-?([1-9]\d*\.\d*|0\.\d*[1-9]\d*|0?\.0+|0)$
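A few of these can be tried out directly. The sketch below uses Python; the constant and function names are hypothetical. The last two patterns show a pitfall that applies to any expression of the form ^A|B$: because | has the lowest precedence, ^ then belongs only to the first alternative and $ only to the last, so the alternatives need a group if both anchors are meant to apply to each of them.

```python
import re


def n_digits(n):
    """Build the 'n-digit number' pattern for a concrete n."""
    return re.compile(r"^\d{%d}$" % n, re.ASCII)


INTEGER_NO_LEADING_ZERO = re.compile(r"^(0|[1-9][0-9]*)$")
SIGNED_DECIMAL = re.compile(r"^(\-|\+)?\d+(\.\d+)?$", re.ASCII)

# Without a group, this reads as (^[1-9]\d*) or (0$).
UNGROUPED = re.compile(r"^[1-9]\d*|0$", re.ASCII)
# With a group, both anchors apply to both alternatives.
GROUPED = re.compile(r"^([1-9]\d*|0)$", re.ASCII)


if __name__ == "__main__":
    print(bool(n_digits(4).search("2024")))
    print(bool(INTEGER_NO_LEADING_ZERO.search("007")))
    print(bool(SIGNED_DECIMAL.search("-3.14")))
    print(bool(UNGROUPED.search("12abc")), bool(GROUPED.search("12abc")))
```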

2. Expressions for validating characters

  • Chinese characters: ^[\u4e00-\u9fa5]{0,}$
  • English and numbers: ^[A-Za-z0-9]+$ or ^[A-Za-z0-9]{4,40}$
  • All characters of length 3-20: ^.{3,20}$
  • A string of English letters: ^[A-Za-z]+$
  • A string of uppercase English letters: ^[A-Z]+$
  • A string of lowercase English letters: ^[a-z]+$
  • A string of digits and English letters: ^[A-Za-z0-9]+$
  • A string of digits, English letters, or underscores: ^\w+$ or ^\w{3,20}$
  • Chinese, English, numbers including underscore: ^[\u4E00-\u9FA5A-Za-z0-9_]+$
  • Chinese, English, numbers but not including underscores and other symbols: ^[\u4E00-\u9FA5A-Za-z0-9]+$ or ^[\u4E00-\u9FA5A-Za-z0-9]{2,20}$
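  • Input that must not contain characters such as %&',;=?$\": [^%&',;=?$\x22]+ (the leading ^ negates the character group, so this matches runs of characters other than %, &, ', the comma, ;, =, ?, $, and ")
  • Input that must not contain ~: [^~\x22]+ (as written, this also excludes ")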
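The sketch below tries a few of the character checks in Python. The dictionary keys and the passes helper are hypothetical names; the patterns are taken from the list. Python's re module understands the \uXXXX escapes inside a text pattern, which may not hold for every flavor.

```python
import re

CHECKS = {
    "chinese": re.compile(r"^[\u4e00-\u9fa5]{0,}$"),
    "letters": re.compile(r"^[A-Za-z]+$"),
    "chinese_alnum_underscore": re.compile(r"^[\u4E00-\u9FA5A-Za-z0-9_]+$"),
    "length_3_20": re.compile(r"^.{3,20}$"),
}


def passes(kind, text):
    """True if the whole text satisfies the named check."""
    return CHECKS[kind].fullmatch(text) is not None


if __name__ == "__main__":
    print(passes("chinese", "\u4e2d\u6587"))        # two Chinese characters
    print(passes("chinese", ""))                    # {0,} also accepts the empty string
    print(passes("chinese_alnum_underscore", "\u7528\u6237_01"))
    print(passes("length_3_20", "ab"))
```

Note that {0,} lets the empty string through; replace it with + if at least one character is required.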

 

3. Expressions for special needs

  • Email address: ^\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$
  • Domain name: [a-zA-Z0-9][-a-zA-Z0-9]{0,62}(\.[a-zA-Z0-9][-a-zA-Z0-9]{0,62})+\.?
  • Internet URL: [a-zA-Z]+://[^\s]* or ^http://([\w-]+\.)+[\w-]+(/[\w./?%&=-]*)?$
  • Mobile number: ^(13[0-9]|14[57]|15[0-35-9]|18[0-35-9])\d{8}$
  • Telephone number ("XXX-XXXXXXX", "XXXX-XXXXXXXX", "XXX-XXXXXXXX", "XXXXXXX" and "XXXXXXXX"): ^(\(\d{3,4}\)|\d{3,4}-)?\d{7,8}$
  • Domestic telephone number (0511-4405222, 021-87888822): \d{3}-\d{8}|\d{4}-\d{7}
  • Telephone number (supports mobile numbers, 3-4 digit area codes, 7-8 digit direct-dial numbers, and 1-4 digit extensions): ((\d{11})|^((\d{7,8})|(\d{4}|\d{3})-(\d{7,8})|(\d{4}|\d{3})-(\d{7,8})-(\d{4}|\d{3}|\d{2}|\d{1})|(\d{7,8})-(\d{4}|\d{3}|\d{2}|\d{1}))$)
  • ID card number (15 or 18 characters; the last one is a check digit, which may be a digit or the letter X): (^\d{15}$)|(^\d{18}$)|(^\d{17}(\d|X|x)$)
  • Valid account name (starts with a letter, 5-16 characters, letters, digits, and underscores allowed): ^[a-zA-Z][a-zA-Z0-9_]{4,15}$
  • Password (starts with a letter, 6 to 18 characters long, only letters, digits, and underscores): ^[a-zA-Z]\w{5,17}$
  • Strong password (must combine uppercase letters, lowercase letters, and digits, no special characters, 8-10 characters long): ^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])[a-zA-Z0-9]{8,10}$
  • Date format: ^\d{4}-\d{1,2}-\d{1,2}
  • The 12 months of a year (01 to 09 and 1 to 12): ^(0?[1-9]|1[0-2])$
  • The 31 days of a month (01 to 09 and 1 to 31): ^((0?[1-9])|((1|2)[0-9])|30|31)$
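  • Money input format:
    1. There are four representations of money we can accept: "10000.00" and "10,000.00", and, without cents, "10000" and "10,000": ^[1-9][0-9]*$
    2. This matches any number that does not start with 0, but it also means the single character "0" fails, so we use the following form instead: ^(0|[1-9][0-9]*)$
    3. That is a single 0, or a number that does not start with 0. We can also allow a leading minus sign: ^(0|-?[1-9][0-9]*)$
    4. That is a 0, or a possibly negative number that does not start with 0. Let's allow the user to start with 0 after all, and drop the minus sign too, because money can't really be negative. What we add next is the optional decimal part: ^[0-9]+(\.[0-9]+)?$
    5. Note that there must be at least one digit after the decimal point, so "10." fails, while "10" and "10.2" pass: ^[0-9]+(\.[0-9]{2})?$
    6. This requires exactly two digits after the decimal point. If you think that is too strict, you can use: ^[0-9]+(\.[0-9]{1,2})?$
    7. This lets the user write just one decimal digit. Next we should consider commas in the number, which we can handle like this: ^[0-9]{1,3}(,[0-9]{3})*(\.[0-9]{1,2})?$
    8. That is 1 to 3 digits followed by any number of "comma plus 3 digits" groups. To make the commas optional rather than required: ^([0-9]+|[0-9]{1,3}(,[0-9]{3})*)(\.[0-9]{1,2})?$
    9. Note: this is the final result. Don't forget that "+" can be replaced with "*" if you think an empty string is acceptable (strange, but why?). Finally, watch how the backslash is handled when you pass the expression to a function (for example, string-literal escaping in the host language); most mistakes happen there.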
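  • XML file name: ^([a-zA-Z]+-?)+[a-zA-Z0-9]+\.[xX][mM][lL]$
  • Chinese characters: [\u4e00-\u9fa5]
  • Double-byte characters: [^\x00-\xff] (including Chinese characters; can be used to compute the length of a string, counting a double-byte character as 2 and an ASCII character as 1)
  • Blank lines: \n\s*\r (can be used to delete blank lines)
  • HTML tags: <(\S*?)[^>]*>.*?|<.*? />
  • Leading and trailing whitespace: ^\s*|\s*$ or (^\s*)|(\s*$) (can be used to delete whitespace at the beginning and end of a line, including spaces, tabs, form feeds, and so on; a very useful expression)
  • Tencent QQ number: [1-9][0-9]{4,} (Tencent QQ numbers start from 10000)
  • Chinese postal code: [1-9]\d{5}(?!\d) (Chinese postal codes are 6 digits)
  • IP address: ((?:(?:25[0-5]|2[0-4]\d|[01]?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d?\d))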