Frequently asked questions and solutions

The basic idea of using regular expressions to solve a problem follows a few relatively fixed methods. Decompose the problem into several small problems and solve each one accordingly: if several different characters may appear at a position, use a character group (character class); if several different strings may appear at a position, use the multi-select structure (alternation); if the number of occurrences is not fixed, use quantifiers; if there are requirements on where the match appears, use anchors to pin the position.

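If certain characters must not appear in the content being searched, the situation is relatively simple: use a negated character group, written with ^ right after the opening square bracket. For example, [^aeiou] matches any single character that is not a lowercase vowel. Note that this includes digits, spaces, and punctuation, not only consonant letters.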
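As a small illustration of the negated character group, the following Python sketch compares [^aeiou] with an explicit consonant-only group. The sample string and variable names are made up for this example.

```python
import re

sample = "a1b e"  # hypothetical sample text

# [^aeiou] matches ANY character that is not a lowercase vowel,
# so the digit and the space are matched as well.
not_vowel = re.findall(r"[^aeiou]", sample)

# To match lowercase consonant letters only, list the ranges explicitly.
consonants = re.findall(r"[b-df-hj-np-tv-z]", sample)

print(not_vowel)   # ['1', 'b', ' ']
print(consonants)  # ['b']
```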

1. Match numbers

Matching numbers is relatively simple and can be solved with the character groups and quantifiers we have already learned.

  • A single digit can be represented by \d or [0-9].
  • For a run of consecutive digits, use \d+ or [0-9]+.
  • For exactly n digits, use \d{n}.
  • For at least n digits, use \d{n,}.
  • For m to n digits, use \d{m,n}.
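The quantifier forms listed above can be tried out with a short Python sketch. The sample text is invented for illustration. Note that without anchors or assertions, \d{4} also matches the first four digits of a longer run.

```python
import re

text = "Order 42 shipped 2023-08-29, ref 1234567"  # hypothetical sample

runs = re.findall(r"\d+", text)            # every run of digits
exactly_four = re.findall(r"\d{4}", text)  # exactly 4 digits per match
at_least_five = re.findall(r"\d{5,}", text)
two_to_four = re.findall(r"\d{2,4}", text)

print(runs)           # ['42', '2023', '08', '29', '1234567']
print(exactly_four)   # ['2023', '1234']
print(at_least_five)  # ['1234567']
print(two_to_four)    # ['42', '2023', '08', '29', '1234', '567']
```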

2. Match positive numbers, negative numbers and decimals

If you want the regular expression to match numbers such as 3, 3.14, -3.3, and +2.7, note that the sign at the beginning may or may not be present, so it can be written as [-+]?. The decimal point and the digits after it may also be absent, so they can be written as (?:\.\d+)?. The regular expression for matching positive numbers, negative numbers, and decimals can therefore be written as [-+]?\d+(?:\.\d+)?.

Non-negative integers, including 0 and positive integers, can be expressed as [1-9]\d*|0.

Non-positive integers, including 0 and negative integers, can be expressed as -[1-9]\d*|0.
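The three patterns in this section can be checked against whole strings with re.fullmatch. This is a minimal Python sketch; the constant names and the helper is_full are hypothetical, chosen only for this example.

```python
import re

SIGNED_NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")
NON_NEGATIVE_INT = re.compile(r"[1-9]\d*|0")
NON_POSITIVE_INT = re.compile(r"-[1-9]\d*|0")

def is_full(pattern, s):
    """Return True if the whole string s matches pattern."""
    return pattern.fullmatch(s) is not None

for s in ["3", "3.14", "-3.3", "+2.7"]:
    print(s, is_full(SIGNED_NUMBER, s))  # all True

print(is_full(SIGNED_NUMBER, ".5"))      # False: a leading digit is required
print(is_full(NON_NEGATIVE_INT, "007"))  # False: leading zeros are rejected
print(is_full(NON_POSITIVE_INT, "-12"))  # True
```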

3. Floating point numbers

Negative floating point number representation: -\d+(?:\.\d+)?.

Positive floating point number representation: \+?(?:\d+(?:\.\d+)?|\.\d+).
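A quick Python check of the two floating point patterns, using whole-string matching. The constant names are assumptions for this sketch. It also shows one asymmetry: the positive pattern accepts a bare fraction such as .5, while the negative pattern as written does not accept -.5.

```python
import re

NEGATIVE_FLOAT = re.compile(r"-\d+(?:\.\d+)?")
POSITIVE_FLOAT = re.compile(r"\+?(?:\d+(?:\.\d+)?|\.\d+)")

print(bool(POSITIVE_FLOAT.fullmatch("3.14")))  # True
print(bool(POSITIVE_FLOAT.fullmatch("+2")))    # True
print(bool(POSITIVE_FLOAT.fullmatch(".5")))    # True
print(bool(NEGATIVE_FLOAT.fullmatch("-3.3")))  # True
print(bool(NEGATIVE_FLOAT.fullmatch("-.5")))   # False
```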

4. Hexadecimal numbers

In addition to 0-9, hexadecimal numbers also use a-f (or A-F) to represent the six values 10 to 15, so the regular expression can be written as [0-9A-Fa-f]+.
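One way to use the hexadecimal pattern is to validate a string before converting it. The helper parse_hex below is a hypothetical name for this Python sketch. Whole-string matching is used because letters such as c or a inside ordinary words are also valid hexadecimal digits.

```python
import re

HEX = re.compile(r"[0-9A-Fa-f]+")

def parse_hex(s):
    """Return the integer value of a hex string, or None if it is not hex."""
    if HEX.fullmatch(s) is None:
        return None
    return int(s, 16)

print(parse_hex("ff"))   # 255
print(parse_hex("1F"))   # 31
print(parse_hex("xyz"))  # None
```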

5. Mobile phone number

We can use character groups and multi-select branches to match mobile phone number segments fairly precisely. If you only constrain the first 2 digits, the pattern is 1[3-9]\d{9}. To be more precise, constrain the first three digits, for example 1(?:3\d|4[5-9]|5[0-35-9]|6[2567]|7[0-8]|8\d|9[1389])\d{8}. If you want to be accurate to 4 or even 5 digits, you can write the pattern yourself from the published number segment information. Note, however, that the more precise the pattern is, the harder it is to maintain: whenever a new number segment is released, the pattern has to be changed. In actual use, you may also need to handle numbers with a prefix such as +86 or 0086.
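The following Python sketch compares the two-digit and three-digit patterns, and adds one possible way to allow an optional +86 or 0086 prefix. The prefix pattern and all names are assumptions for illustration, not part of the original course material, and the sample numbers are invented.

```python
import re

LOOSE = re.compile(r"1[3-9]\d{9}")
STRICT = re.compile(
    r"1(?:3\d|4[5-9]|5[0-35-9]|6[2567]|7[0-8]|8\d|9[1389])\d{8}"
)
# Assumption: the prefix, when present, is directly followed by the number.
WITH_PREFIX = re.compile(r"(?:\+86|0086)?1[3-9]\d{9}")

print(bool(LOOSE.fullmatch("13812345678")))   # True
print(bool(STRICT.fullmatch("13812345678")))  # True
print(bool(LOOSE.fullmatch("15412345678")))   # True
print(bool(STRICT.fullmatch("15412345678")))  # False: 154 is not in the list
print(bool(WITH_PREFIX.fullmatch("+8613812345678")))  # True
```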

6. ID number

China's resident ID card number comes in two generations: the first generation has 15 digits and the second generation has 18. In the 18-digit form, the last character can be X (or x). The first 15 digits can be written as [1-9]\d{14}. The second generation has 3 more characters than the first, which can be made optional with a quantifier of 0 to 1 occurrences, so the pattern is written as

[1-9]\d{14}(\d\d[0-9Xx])?.
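A minimal Python check of the ID number pattern, using made-up sample numbers. The pattern only checks the shape of the string (15 characters, or 18 with an optional trailing X); it does not validate the region code, the birth date, or the check digit.

```python
import re

ID_NUMBER = re.compile(r"[1-9]\d{14}(\d\d[0-9Xx])?")

def looks_like_id(s):
    """Hypothetical helper: True if s has the shape of a 15- or 18-character ID."""
    return ID_NUMBER.fullmatch(s) is not None

print(looks_like_id("110101900101123"))     # True  (15 digits)
print(looks_like_id("11010119900101123X"))  # True  (18 characters, ends in X)
print(looks_like_id("11010119900101123"))   # False (17 digits)
print(looks_like_id("010101900101123"))     # False (starts with 0)
```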

7. Postal code

A postal code is generally a 6-digit number, which is simple and can be written as \d{6}. However, a run of 6 digits is also very likely to appear in other contexts, such as inside a mobile phone number or an ID number, so for data extraction you generally need to add lookaround assertions that forbid a digit on either side: (?<!\d)\d{6}(?!\d).
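The effect of the assertions can be seen in a short Python sketch. The sample text is invented for illustration. Without the lookarounds, the first six digits of the phone number are extracted as if they were a postal code.

```python
import re

text = "Zip 100084, phone 13812345678, code 200030."  # hypothetical sample

naive = re.findall(r"\d{6}", text)
guarded = re.findall(r"(?<!\d)\d{6}(?!\d)", text)

print(naive)    # ['100084', '138123', '200030']
print(guarded)  # ['100084', '200030']
```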

8. Tencent QQ number

Currently, QQ numbers cannot start with 0. The longest is 10 digits, and the shortest starts from 10000 (5 digits). From these rules, the first digit is 1-9, followed by 4 to 9 more digits, so the pattern can be written as [1-9][0-9]{4,9}.
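A small Python sketch of the QQ number rule, checking the length boundaries with whole-string matching. The helper name is hypothetical and the sample values are invented.

```python
import re

QQ = re.compile(r"[1-9][0-9]{4,9}")

def is_qq(s):
    """Hypothetical helper: True if s is 5 to 10 digits and does not start with 0."""
    return QQ.fullmatch(s) is not None

print(is_qq("10000"))        # True  (shortest allowed)
print(is_qq("1234567890"))   # True  (10 digits)
print(is_qq("9999"))         # False (too short)
print(is_qq("12345678901"))  # False (too long)
print(is_qq("0123456"))      # False (starts with 0)
```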

9. Chinese characters

Chinese characters are multi-byte Unicode characters. One option is to match them by Unicode property, but some languages do not support such properties. Another method is to use a range of code points: the range 4E00 - 9FFF covers most Chinese characters in daily use.

Different languages differ somewhat in how they write code points. For example, in Python, Java, and JavaScript, a code point can be written as \u followed by the code value, so the regular expression matching Chinese can be written as [\u4E00-\u9FFF]. In PHP, the code point needs to be written in the form \u{code value} (inside a double-quoted string, with the u modifier on the pattern).
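In Python, the code point range can be used directly in a pattern string. The sample text below is made up for this sketch.

```python
import re

CHINESE = re.compile(r"[\u4E00-\u9FFF]+")

text = "Hello 世界, regex 正则表达式!"  # hypothetical sample
words = CHINESE.findall(text)

print(words)  # ['世界', '正则表达式']
```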

10. IPv4 address

IPv4 addresses are usually written in a format like 27.86.1.226: four numbers separated by dots, each in the range 0-255. When extracting IPs from logs, for example, if you do not need that level of accuracy, \d{1,3}(\.\d{1,3}){3} is generally enough. Note that the dot needs to be escaped.

If we write an IPv4 address as X.X.X.X, where X is a number from 0 to 255, we can use a quantifier and write it as (?:X\.){3}X or X(?:\.X){3}. Since the pattern for X is itself a multi-select branch, it must be wrapped in parentheses, so the IPv4 regular expression should be written as

(?:1\d\d|2[0-4]\d|25[0-5]|0?[1-9]\d|0{0,2}\d)(?:\.(?:1\d\d|2[0-4]\d|25[0-5]|0?[1-9]\d|0{0,2}\d)){3}.
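To keep the long pattern readable, the X part can be defined once and reused. This Python sketch builds the X(?:\.X){3} form from a single octet pattern; the names OCTET, IPV4, and LOOSE are hypothetical. Whole-string matching is used here, since a plain search could still find a valid-looking address inside a longer run of digits.

```python
import re

# One number in the range 0-255, optionally zero-padded to three digits.
OCTET = r"(?:1\d\d|2[0-4]\d|25[0-5]|0?[1-9]\d|0{0,2}\d)"

# X(?:\.X){3}; the doubled braces produce a literal {3} in the f-string.
IPV4 = re.compile(rf"{OCTET}(?:\.{OCTET}){{3}}")

# The loose version only checks the digit counts.
LOOSE = re.compile(r"\d{1,3}(\.\d{1,3}){3}")

print(bool(IPV4.fullmatch("27.86.1.226")))       # True
print(bool(IPV4.fullmatch("256.1.1.1")))         # False
print(bool(LOOSE.fullmatch("999.999.999.999")))  # True
```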

11. Date and time

Assuming the date format is yyyy-mm-dd, the pattern can be written as \d{4}-(?:1[0-2]|0?[1-9])-(?:[12]\d|3[01]|0?[1-9]).

For a time such as 23:34 on a 24-hour clock, the hours are 0-23 and the minutes are 0-59, so the pattern can be written as (?:2[0-3]|1\d|0?\d):(?:[1-5]\d|0?\d).
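Both patterns can be exercised with a few whole-string checks in Python. The names DATE and TIME are assumptions for this sketch. The date pattern only constrains the ranges of the month and day fields, so an impossible date such as February 31 still matches.

```python
import re

DATE = re.compile(r"\d{4}-(?:1[0-2]|0?[1-9])-(?:[12]\d|3[01]|0?[1-9])")
TIME = re.compile(r"(?:2[0-3]|1\d|0?\d):(?:[1-5]\d|0?\d)")

print(bool(DATE.fullmatch("2023-08-29")))  # True
print(bool(DATE.fullmatch("2023-13-01")))  # False: month 13
print(bool(DATE.fullmatch("2023-02-31")))  # True: the regex cannot check month length
print(bool(TIME.fullmatch("23:34")))       # True
print(bool(TIME.fullmatch("24:00")))       # False: hour 24
```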

12. Email

An email address is relatively complex. The format is username@hostname. The username part is usually composed of English letters, digits, underscores, dots, and so on, but a dot cannot appear at the beginning, nor can dots appear consecutively. The full address syntax defined in RFC 5322 is too complex to capture perfectly in a practical regular expression, so we can implement a simplified version, such as [a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+.
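The simplified email pattern can be tried in Python as follows. The helper name and sample addresses are invented for this sketch. Because the pattern is a simplification, it does not enforce the rules about dots in the username, as the last check shows.

```python
import re

EMAIL = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+")

def looks_like_email(s):
    """Hypothetical helper: True if s matches the simplified pattern."""
    return EMAIL.fullmatch(s) is not None

print(looks_like_email("user.name+tag@example.co.uk"))  # True
print(looks_like_email("no-at-sign.example.com"))       # False
print(looks_like_email("a@b"))                          # False: no dot in the host
print(looks_like_email(".user@example.com"))            # True: leading dot is not rejected
```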

13. Web page tags

To match a specific tag, such as title, remember that web page tags are generally not case-sensitive, so we can use (?i)<title>.*?</title>. To extract the content inside quotation marks, use [^"]+; to extract the content inside angle brackets, use [^>]+; and so on.
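These three techniques can be sketched in Python on a small made-up HTML string. The capture groups are added here only to pull out the inner text and are not part of the original patterns. Note that . does not match a newline by default, so a title spanning several lines would need the re.DOTALL flag.

```python
import re

# Hypothetical sample page
html = '<html><TITLE>Regex Notes</TITLE><a href="https://example.com">link</a></html>'

# Case-insensitive tag match; the group captures the text between the tags.
title = re.search(r"(?i)<title>(.*?)</title>", html).group(1)

# Content inside quotation marks.
href = re.search(r'href="([^"]+)"', html).group(1)

# Content inside angle brackets.
tags = re.findall(r"<([^>]+)>", html)

print(title)  # Regex Notes
print(href)   # https://example.com
print(tags)
```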

This article is a study note for August Day 29. The content comes from Geek Time's "Introduction to Regular Expressions Course". This course is recommended.

Origin blog.csdn.net/key_3_feng/article/details/132571422