Regular expressions: study notes

A regular expression (regex) is a pattern that describes rules for matching strings. It is used to implement search and replace.

Basic elements

  • Characters: ordinary literal characters, most commonly digits and English letters.
  • Metacharacters: special characters that carry special semantics.

Character match

Single character (one-to-one match)

The simplest regular expression consists of plain digits and letters with no special semantics: each character in the pattern matches exactly that character in the text, one to one. The escape character (backslash) changes this. Placed before certain ordinary characters, it gives them a special meaning, for example \n for a newline. Conversely, a backslash before a metacharacter (for example \.) makes it match literally.

Special character    Regular expression    Mnemonic
Newline              \n                    new line
Form feed            \f                    form feed
Carriage return      \r                    return
Whitespace           \s                    space
Tab                  \t                    tab

Multiple characters (one-to-many match)

Sets, ranges, and wildcards make one-to-many matching possible. A set is defined with square brackets, and a range inside it with a hyphen, for example [0-9]. Even with sets and ranges, matching several kinds of characters at once still means listing them one by one. This is inefficient, so regular expressions provide a batch of shorthand expressions that each match a whole class of characters.

Matches                                                  Regular expression    Mnemonic
Any character except newline                             .                     period: anything but the line end
Single digit, [0-9]                                      \d                    digit
Any non-digit, [^0-9]                                    \D                    not digit
Single word character, [A-Za-z0-9_]                      \w                    word
Any non-word character                                   \W                    not word
Whitespace (space, tab, form feed, newline)              \s                    space
Any non-whitespace character                             \S                    not space
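
The shorthand classes in the table can be tried out with a short sketch. It assumes Python's built-in re module (the notes themselves write patterns as /.../ literals); the sample strings are made up for illustration. Note that for str patterns Python's \d and \w also accept non-ASCII digits and letters unless the re.ASCII flag is passed, so the [0-9] and [A-Za-z0-9_] equivalences hold exactly only for ASCII text.

```python
import re

sample = "id_7 = 42\n"  # hypothetical sample text

digits = re.findall(r"\d", sample)         # each single digit
words = re.findall(r"\w+", sample)         # runs of letters, digits, underscore
spaces = re.findall(r"\s", sample)         # each whitespace character
non_spaces = re.findall(r"\S+", sample)    # runs of non-whitespace
non_digits = re.findall(r"\D+", "a1b22c")  # runs of non-digits
any_char = re.findall(r".", "a\nb")        # . skips the newline by default

print(digits)      # ['7', '4', '2']
print(words)       # ['id_7', '42']
print(spaces)      # [' ', ' ', '\n']
print(non_spaces)  # ['id_7', '=', '42']
print(non_digits)  # ['a', 'b', 'c']
print(any_char)    # ['a', 'b']
```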

Repetition (matching multiple characters)

Matching several characters in a row means repeating a rule. By repetition count, the quantifiers fall into four cases: zero or one time, zero or more times, one or more times, and a specific number of times.

  • 0 or 1: ?
    The metacharacter ? matches the preceding element zero or one time.

To match both color and colour, the character u must be optional, so the regular expression is:
/colou?r/
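
A quick check of this pattern, as a minimal sketch assuming Python's re module; the misspelling colouur is a made-up counterexample showing that ? allows at most one u.

```python
import re

pattern = re.compile(r"colou?r")

# 'colouur' has two u's, so it is not matched: ? allows zero or one.
found = pattern.findall("color colour colouur")
print(found)  # ['color', 'colour']
```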

  • >= 0: *
    The metacharacter * matches the preceding element zero or more times. It is typically used for parts of a string that are optional and may repeat.
  • >= 1: +
    The metacharacter + matches the preceding element one or more times.
  • Specific number of times: {}
    To match a specific number of repetitions, use the curly-brace metacharacters to give the exact count or range.

For example, matching a exactly 3 times is written as:
/a{3}/
Syntax rules:

  • {x}: exactly x times
  • {min,max}: between min and max times, inclusive
  • {min,}: at least min times
  • {0,max}: at most max times
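
The quantifiers above can be compared side by side. This is a small sketch assuming Python's re module; matches_fully is a hypothetical helper that uses re.fullmatch so that the whole test string, not just a part of it, has to satisfy the pattern.

```python
import re

def matches_fully(pattern: str, text: str) -> bool:
    """Return True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

print(matches_fully(r"ab*c", "ac"))       # True:  * allows zero b's
print(matches_fully(r"ab+c", "ac"))       # False: + needs at least one b
print(matches_fully(r"ab+c", "abbbc"))    # True
print(matches_fully(r"a{3}", "aaa"))      # True:  exactly three
print(matches_fully(r"a{3}", "aa"))       # False
print(matches_fully(r"a{2,4}", "aaaaa"))  # False: five is above the maximum
print(matches_fully(r"a{2,}", "aaaaa"))   # True:  at least two
print(matches_fully(r"a{0,2}", ""))       # True:  zero repetitions allowed
```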

Position and boundary matching

When searching long text, you often need to restrict where in the text a match may occur.

Word boundary (\b)

Words are the basic unit of sentences and articles. A common usage scenario is to find out specific words in articles or sentences.

Suppose you want to find the word cat in the sentence "The cat scattered his food all over the room". The plain pattern /cat/ matches both the word cat and the cat inside scattered. To match only the whole word, use the word-boundary metacharacter \b (b for boundary). In the regex engine, \b matches a position rather than a character: the position between a word character (\w) and a non-word character (\W), or between a word character and the start or end of the string.
/\bcat\b/
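
The difference can be observed by listing where each pattern matches. This is a minimal sketch assuming Python's re module; the \B variant is added only to show the opposite case, a cat that sits strictly inside a longer word.

```python
import re

sentence = "The cat scattered his food all over the room"

# Start offsets of every match for each pattern.
plain = [m.start() for m in re.finditer(r"cat", sentence)]
whole_word = [m.start() for m in re.finditer(r"\bcat\b", sentence)]
inside_word = [m.start() for m in re.finditer(r"\Bcat\B", sentence)]

print(plain)        # [4, 9]  the word cat and the cat in scattered
print(whole_word)   # [4]     only the standalone word
print(inside_word)  # [9]     only the cat inside scattered
```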

String boundary

Boundary or flag     Regular expression    Mnemonic
Word boundary        \b                    boundary
Non-word boundary    \B                    not boundary
Start of string      ^
End of string        $
Multi-line mode      m                     multiple
Ignore case          i                     ignore
Global mode          g                     global
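
The anchors and flags can be sketched as follows, assuming Python's re module, where the i and m flags are spelled re.IGNORECASE and re.MULTILINE. Python has no g flag: re.search returns only the first match, while re.findall and re.finditer return all of them, which plays the role of global mode. The sample text is made up for illustration.

```python
import re

text = "Cat one\ncat two\ndog three"  # hypothetical three-line sample

start_default = re.findall(r"^cat", text)                   # ^ = start of string, case-sensitive
start_icase = re.findall(r"^cat", text, re.IGNORECASE)      # i: ignore case
start_multi = re.findall(r"^cat", text, re.MULTILINE)       # m: ^ matches at each line start
start_both = re.findall(r"^cat", text, re.I | re.M)         # i and m combined
end_default = re.findall(r"\w+$", text)                     # $ = end of string
end_multi = re.findall(r"\w+$", text, re.MULTILINE)         # m: $ matches at each line end
first_only = re.search(r"cat", text, re.IGNORECASE).group() # no "global": first match only

print(start_default)  # []
print(start_icase)    # ['Cat']
print(start_multi)    # ['cat']
print(start_both)     # ['Cat', 'cat']
print(end_default)    # ['three']
print(end_multi)      # ['one', 'two', 'three']
print(first_only)     # Cat
```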

Sub-expression

Nesting and self-reference make regular expressions considerably more powerful. Building a complex regular expression out of simple ones usually relies on three ideas: grouping, back-references, and logical processing. With these three rules, arbitrarily complex regular expressions can be derived.

Grouping

A regular expression enclosed in the parenthesis metacharacters forms a group. Each group is a sub-expression, and sub-expressions are the basis of advanced regular expressions. Used on its own, the plain (regex) syntax matches exactly what the ungrouped expression matches. To exploit its power, combine it with back-references.
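
As a minimal sketch of what a group adds (assuming Python's re module, with made-up sample strings): a quantifier placed after a group repeats the whole sub-expression, and each group's matched text can be retrieved by number, with group 0 standing for the entire match.

```python
import re

# Without a group, + applies only to the single character before it.
single = re.fullmatch(r"ab+", "ababab") is not None     # False
# With a group, + repeats the whole sub-expression ab.
grouped = re.fullmatch(r"(ab)+", "ababab") is not None  # True

# Each parenthesized sub-expression is captured and numbered from 1.
m = re.search(r"(\d+)-(\d+)", "pages 10-25")
whole, first, second = m.group(0), m.group(1), m.group(2)

print(single, grouped)       # False True
print(whole, first, second)  # 10-25 10 25
```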

Back reference

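A back-reference lets a later part of the pattern refer to a substring that an earlier group has already matched. The syntax \1 refers to the text matched by the first sub-expression, \2 to the second, and so on. Group 0 conventionally denotes the entire match, but how it is written depends on the tool: inside a pattern, engines such as JavaScript and Python treat \0 as the NUL character rather than as a back-reference.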

Example: find two consecutive identical words in the following sentence.
Hello what what is the first thing, and I am am 007.
\b(\w+)\s\1
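
A sketch of this example, assuming Python's re module. It runs the pattern on the sample sentence and also shows a stricter variant with a trailing \b, which is an addition of this sketch and not part of the original notes: without it, the back-reference may match just the beginning of a longer word, as the made-up input "the theory" shows.

```python
import re

text = "Hello what what is the first thing, and I am am 007."

loose = re.compile(r"\b(\w+)\s\1")      # pattern from the notes
strict = re.compile(r"\b(\w+)\s\1\b")   # assumed variant: second word must end too

doubles = [m.group(0) for m in loose.finditer(text)]
words = [m.group(1) for m in loose.finditer(text)]
deduped = strict.sub(r"\1", text)       # keep one copy of each doubled word

print(doubles)  # ['what what', 'am am']
print(words)    # ['what', 'am']
print(deduped)  # Hello what is the first thing, and I am 007.

# The trailing \b matters for inputs like this one:
print(loose.search("the theory") is not None)   # True  (matches 'the the')
print(strict.search("the theory") is not None)  # False
```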

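Lookahead

Lookahead restricts the suffix of a match. The sub-expression inside (?=regex) must match immediately after the preceding expression, but the text it matches is not included in the result. The negative form (?!regex) requires that the sub-expression does not match at that position.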

Given the text: happy happily

  • Match happ only in the adverb (happ followed by ily): happ(?=ily)
  • Match happ only where it is not followed by ily (this excludes the adverb): happ(?!ily)
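
Both lookahead forms can be checked on this text. This is a minimal sketch assuming Python's re module; it records the start offset and the matched text to show that the ily suffix is tested but never consumed.

```python
import re

text = "happy happily"

# (start offset, matched text) for each lookahead variant
positive = [(m.start(), m.group()) for m in re.finditer(r"happ(?=ily)", text)]
negative = [(m.start(), m.group()) for m in re.finditer(r"happ(?!ily)", text)]

print(positive)  # [(6, 'happ')]  the happ in happily; ily is not part of the match
print(negative)  # [(0, 'happ')]  the happ in happy
```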

Lookbehind

Lookbehind is the mirror image of lookahead: it restricts the prefix of a match. The sub-expression inside (?<=regex) must match immediately before the current position, and the negative form (?<!regex) requires that it does not. In both cases the text checked by the lookbehind is not part of the result.

Given the text: apple people
Goal: match only the ple in apple.

  • Method 1: /(?<=ap)ple/ (in apple, the letters directly before ple are ap)
  • Method 2: /(?<!peo)ple/
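
A sketch of both methods, assuming Python's re module (which requires lookbehind sub-expressions of fixed width, as these are). Since apple is spelled a-p-p-l-e, the letters directly before ple are ap, so the positive lookbehind is written (?<=ap). A lookbehind of app finds nothing in this text, which the last line demonstrates.

```python
import re

text = "apple people"

method_1 = [m.start() for m in re.finditer(r"(?<=ap)ple", text)]   # ple preceded by ap
method_2 = [m.start() for m in re.finditer(r"(?<!peo)ple", text)]  # ple not preceded by peo
too_long = [m.start() for m in re.finditer(r"(?<=app)ple", text)]  # would need 'appple'

print(method_1)  # [2]
print(method_2)  # [2]
print(too_long)  # []
```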

Logical processing

Logic    Regex metacharacter
AND      the default: consecutive sub-expressions are implicitly ANDed
NOT      [^regex] (^ means negation only when used inside []) and ! (as in (?!regex) and (?<!regex))
OR       |
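
The three logical relations can be illustrated with a short sketch, assuming Python's re module and made-up sample strings. The (?:...) non-capturing group is not covered in these notes; it is used here only so that re.findall returns whole matches.

```python
import re

# OR: | matches either alternative.
either = re.findall(r"cat|dog", "cat dog bird")
# OR inside a group limits the alternation to part of the pattern.
spelling = re.findall(r"gr(?:a|e)y", "gray grey griy")
# NOT: ^ inside [] negates the set.
not_digits = re.findall(r"[^0-9]+", "a1b22c")
# AND: concatenation, here a digit that must be followed by a word character.
digit_then_word = re.findall(r"\d\w", "7a b2 9_")

print(either)           # ['cat', 'dog']
print(spelling)         # ['gray', 'grey']
print(not_digits)       # ['a', 'b', 'c']
print(digit_then_word)  # ['7a', '9_']
```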



Origin blog.csdn.net/studyeboy/article/details/107980721