Java Review (XXII, regular expressions)


Regular expressions are a powerful string-handling tool: with them you can search, extract, split, and replace strings. The String class itself provides several methods that take a regular expression:

  • boolean matches(String regex): tests whether the whole string matches the specified regular expression.
  • String replaceAll(String regex, String replacement): replaces every substring that matches regex with replacement.
  • String replaceFirst(String regex, String replacement): replaces the first substring that matches regex with replacement.
  • String[] split(String regex): splits the string into substrings, using regex as the separator.
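
As a quick illustration of the four String methods above, the following sketch applies each one to a small input. The class and helper names are made up for this example, and the patterns (\\d for a digit, \\s*,\\s* for a comma with optional surrounding whitespace) use syntax that is explained later in this article.

```java
import java.util.Arrays;

public class StringRegexMethods {
    // Hypothetical helpers; only the String methods they call come from the text.

    // matches(): the whole string must match the expression
    static boolean isThreeDigits(String s) {
        return s.matches("\\d{3}");
    }

    // replaceAll(): every matching substring is replaced
    static String maskDigits(String s) {
        return s.replaceAll("\\d", "#");
    }

    // replaceFirst(): only the first matching substring is replaced
    static String maskFirstDigit(String s) {
        return s.replaceFirst("\\d", "#");
    }

    // split(): the expression is used as the separator
    static String[] splitOnCommas(String s) {
        return s.split("\\s*,\\s*");
    }

    public static void main(String[] args) {
        System.out.println(isThreeDigits("123"));     // true
        System.out.println(isThreeDigits("1234"));    // false
        System.out.println(maskDigits("a1b22"));      // a#b##
        System.out.println(maskFirstDigit("a1b22"));  // a#b22
        System.out.println(Arrays.toString(splitOnCommas("x, y ,z"))); // [x, y, z]
    }
}
```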

These methods all rely on Java's regular expression support. In addition, Java provides two classes dedicated to regular expressions: Pattern and Matcher.


Creating regular expressions

A regular expression is a template for matching strings, and one expression can match many strings. Creating a regular expression therefore just means writing a special string.

In other languages, \\ means: insert a plain (literal) backslash into the regular expression, and give it no special meaning. In Java, \\ means: insert a regular-expression backslash, so the character that follows it has a special meaning.
So in other languages (such as Perl) a single backslash \ is enough to escape, whereas in a Java string literal you need two backslashes to get the same escaping effect. Put simply, two \ in a Java regular expression string stand for one \ in other languages. That is why the regular expression for a digit is written \\d, and the one for a literal backslash is written \\\\.
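
The doubling can be checked directly. In this minimal sketch (the class and constant names are illustrative only), the Java literal "\\d" is a two-character string that the regex engine reads as a digit class, and "\\\\" is a two-character string that the engine reads as one literal backslash.

```java
public class BackslashDemo {
    // Java source "\\d" -> the 2-character string: backslash, d
    static final String DIGIT = "\\d";
    // Java source "\\\\" -> the 2-character string: backslash, backslash
    static final String BACKSLASH = "\\\\";

    public static void main(String[] args) {
        System.out.println(DIGIT.length());              // 2
        System.out.println("7".matches(DIGIT));          // true
        System.out.println(BACKSLASH.length());          // 2
        // "\\" in Java source is the one-character string containing a backslash
        System.out.println("\\".matches(BACKSLASH));     // true
        System.out.println("C:\\temp".replaceAll(BACKSLASH, "/")); // C:/temp
    }
}
```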

The legal characters that regular expressions support are shown in Table 1:


Table 1: Legal characters supported by regular expressions

[table image]

Regular expressions also have some special characters, each of which has its own special purpose in a regular expression:


Table 2: Special characters in regular expressions

[table image]

Combining several of the characters above produces a regular expression. For example:

"\u0041\\\\"   // matches A\
"\u0061\t"     // matches a<tab>
"\\?\\["       // matches ?[

In the regular expressions above, each element can still match only one specific character, because they do not yet use "wildcards". A "wildcard" in a regular expression is a special character that can match any of several characters. Regular expression "wildcards" go far beyond ordinary wildcards; they are called predefined characters:


Table 3: Predefined characters

[table image]

The seven predefined characters above are easy to remember: d stands for digit, s stands for space (whitespace), and w stands for word (word characters). The uppercase forms D, S, and W match exactly the opposite sets.

With these predefined characters you can write more powerful regular expressions. For example:

c\\wt            // matches cat, cbt, cct, c0t, c9t, and many similar strings
\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d       // matches phone numbers of the form 000-000-0000
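
The two expressions above can be tried out with String.matches(). This is a small sketch with illustrative class and method names; note that \w also accepts the underscore, which is easy to overlook.

```java
public class PredefinedCharDemo {
    // c, then any single word character, then t
    static boolean isCWordT(String s) {
        return s.matches("c\\wt");
    }

    // three digits, dash, three digits, dash, four digits
    static boolean isPhone(String s) {
        return s.matches("\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d");
    }

    public static void main(String[] args) {
        System.out.println(isCWordT("cat"));           // true
        System.out.println(isCWordT("c9t"));           // true
        System.out.println(isCWordT("c_t"));           // true: underscore is a word character
        System.out.println(isCWordT("c-t"));           // false
        System.out.println(isPhone("000-000-0000"));   // true
        System.out.println(isPhone("000-0000-000"));   // false
        // Uppercase S is the opposite of s: every non-whitespace character
        System.out.println("a b".replaceAll("\\S", "x")); // x x
    }
}
```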

To match only the letters a to f, or all lowercase letters except a and b, or Chinese characters, you need a bracket expression:

Table 4: Bracket expressions

[table image]

Regular expressions also support parentheses, which group several expressions into one subexpression; the OR operator (|) can be used inside the parentheses. For example, the regular expression "((public)|(protected)|(private))" matches one of Java's three access modifiers.
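
The three cases mentioned above, plus the access-modifier alternation, might look like this in code. This is only a sketch: the range, intersection (&&), and negation (^) syntax is standard java.util.regex, but the code-point range used for Chinese characters is an assumption (a commonly used CJK range), and all class and method names are illustrative.

```java
public class BracketDemo {
    // A range: one letter from a to f
    static boolean isAToF(String s) {
        return s.matches("[a-f]");
    }

    // Intersection with a negated class: lowercase letters except a and b
    static boolean isLowerExceptAB(String s) {
        return s.matches("[a-z&&[^ab]]");
    }

    // Assumed range for common Chinese characters (U+4E00 to U+9FA5)
    static boolean isChineseChar(String s) {
        return s.matches("[\\u4e00-\\u9fa5]");
    }

    // Parentheses plus the OR operator
    static boolean isAccessModifier(String s) {
        return s.matches("((public)|(protected)|(private))");
    }

    public static void main(String[] args) {
        System.out.println(isAToF("c"));                      // true
        System.out.println(isAToF("g"));                      // false
        System.out.println(isLowerExceptAB("c"));             // true
        System.out.println(isLowerExceptAB("a"));             // false
        System.out.println(isChineseChar("\u4e2d"));          // true
        System.out.println(isChineseChar("z"));               // false
        System.out.println(isAccessModifier("protected"));    // true
        System.out.println(isAccessModifier("static"));       // false
    }
}
```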

Java regular expressions also support boundary matchers:

Table 5: Boundary matchers

[table image]
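
As a hedged illustration of boundary matchers, the sketch below uses three that java.util.regex is known to support: ^ (start of line), $ (end of line), and \b (word boundary). The helper names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundaryDemo {
    // Index of the first occurrence of word as a whole word, or -1 if there is none
    static int firstWholeWordIndex(String text, String word) {
        Matcher m = Pattern.compile("\\b" + Pattern.quote(word) + "\\b").matcher(text);
        return m.find() ? m.start() : -1;
    }

    // ^ anchors the match to the start of the input
    static String replaceLeadingWord(String text, String replacement) {
        return text.replaceAll("^\\w+", replacement);
    }

    // $ anchors the match to the end of the input
    static String replaceTrailingWord(String text, String replacement) {
        return text.replaceAll("\\w+$", replacement);
    }

    public static void main(String[] args) {
        // The "is" inside "This" is skipped because it is not a whole word
        System.out.println(firstWholeWordIndex("This is it", "is"));        // 5
        System.out.println(replaceLeadingWord("Java is easy", "Regex"));    // Regex is easy
        System.out.println(replaceTrailingWord("Java is easy", "hard"));    // Java is hard
    }
}
```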
The earlier example used \\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d to match phone numbers of the form 000-000-0000, which looks cumbersome. Regular expressions also provide quantifiers, which come in the following modes:

  • Greedy: quantifiers are greedy by default unless marked otherwise. A greedy expression keeps matching until it can match no further. If a match result differs from what you expected, the likely cause is that you assumed the expression would match only the first few characters, when in fact it is greedy and keeps on matching.
  • Reluctant: indicated by a question-mark suffix (?); it matches as few characters as possible. Also called minimal matching.
  • Possessive: indicated by a plus-sign suffix (+); it matches as much as possible and never gives characters back by backtracking. Java supports it, as do some other engines such as PCRE; it is used relatively rarely.

The quantifiers for the three modes are shown in Table 6.

Table 6: Quantifiers in the three modes

[table image]
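
The difference between the three modes is easiest to see on one input. The sketch below is illustrative only (the class name, helper, and sample string are invented here) and assumes the standard java.util.regex quantifier syntax: *, *?, *+, and {n}.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierDemo {
    // Returns the first substring matched by regex, or null if nothing matches
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<b>one</b><b>two</b>";
        // Greedy: .* runs to the end, then backtracks to the last </b>
        System.out.println(firstMatch("<b>.*</b>", html));   // <b>one</b><b>two</b>
        // Reluctant: .*? stops at the first </b>
        System.out.println(firstMatch("<b>.*?</b>", html));  // <b>one</b>
        // Possessive: .*+ takes everything and never backtracks, so </b> cannot match
        System.out.println(firstMatch("<b>.*+</b>", html));  // null
        // A counted quantifier shortens the phone-number expression
        System.out.println("555-123-4567".matches("\\d{3}-\\d{3}-\\d{4}")); // true
    }
}
```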

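Using regular expressions

Once a regular expression is defined in a program, you can apply it through Pattern and Matcher.

A Pattern object is the compiled, in-memory representation of a regular expression. A regular expression string must therefore first be compiled into a Pattern object, which is then used to create the corresponding Matcher object. The state involved in performing a match lives in the Matcher object, and multiple Matcher objects can share the same Pattern object.

A typical call sequence is therefore:

// Compile a string into a Pattern object
Pattern p = Pattern.compile("a*b");
// Use the Pattern object to create a Matcher object
Matcher m = p.matcher("aaaaab");
boolean b = m.matches(); // returns true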

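The Pattern object defined above can be reused many times. If a regular expression is needed only once, you can call the static matches() method of the Pattern class directly; it compiles the given string into an anonymous Pattern object and performs the match, as shown here:

boolean b = Pattern.matches("a*b", "aaaaab"); // returns true

Pattern is an immutable class and is safe for use by multiple concurrent threads.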
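
One way to take advantage of this is to compile the Pattern once and let every caller create its own Matcher. The sketch below is an illustration under that assumption, not a prescribed design; the class, the constant, and the use of a parallel stream to simulate concurrent callers are all choices made for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class SharedPatternDemo {
    // Compiled once and shared; Pattern is immutable, so this is safe across threads
    private static final Pattern A_STAR_B = Pattern.compile("a*b");

    // Each call creates its own Matcher, which holds the per-match state
    static boolean isAStarB(String s) {
        return A_STAR_B.matcher(s).matches();
    }

    // Counts matching inputs, possibly on several threads at once
    static long countMatches(List<String> inputs) {
        return inputs.parallelStream().filter(SharedPatternDemo::isAStarB).count();
    }

    public static void main(String[] args) {
        System.out.println(countMatches(Arrays.asList("aaaaab", "b", "abc", ""))); // 2
        // One-off check without keeping a Pattern around
        System.out.println(Pattern.matches("a*b", "aaaaab"));                      // true
    }
}
```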

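The Matcher class provides the following commonly used methods:

  • find(): returns whether the target string contains a substring that matches the Pattern.
  • group(): returns the substring matched by the previous match.
  • start(): returns the start index, in the target string, of the previously matched substring.
  • end(): returns the end index of the previously matched substring plus 1, that is, the index just past its last character.
  • lookingAt(): returns whether the beginning of the target string matches the Pattern.
  • matches(): returns whether the entire target string matches the Pattern.
  • reset(): resets the Matcher; the overload reset(CharSequence) applies the existing Matcher object to a new character sequence.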
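
A compact sketch of these methods on tiny inputs follows. The class and helper names are invented for this illustration; the half-open "[start,end)" notation in the output reflects the fact that end() is one past the last matched character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatcherMethodsDemo {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Describes the first match as text[start,end), or "none" if nothing matches
    static String firstMatch(String input) {
        Matcher m = DIGITS.matcher(input);
        if (!m.find()) {
            return "none";
        }
        return m.group() + "[" + m.start() + "," + m.end() + ")";
    }

    // lookingAt(): only the beginning of the input has to match
    static boolean startsWithDigits(String input) {
        return DIGITS.matcher(input).lookingAt();
    }

    // matches(): the entire input has to match
    static boolean isAllDigits(String input) {
        return DIGITS.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("ab123cd"));        // 123[2,5)
        System.out.println(startsWithDigits("123abc"));   // true
        System.out.println(isAllDigits("123abc"));        // false

        Matcher m = DIGITS.matcher("123abc");
        m.reset("456");                                   // same Matcher, new input
        System.out.println(m.matches());                  // true
    }
}
```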

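With the find() and group() methods of the Matcher class you can pull specific substrings (those matching the regular expression) out of the target string one after another. Web crawlers, for example, can automatically pick out all the phone numbers in a web page. The following program shows how to find phone numbers in a large block of text: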

FindGroup.java

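import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindGroup
{
	public static void main(String[] args)
	{
		// Use a string to simulate page source fetched from the web
		String str = "I want to buy a copy of 《***》, contact me ASAP at 13500006666"
			+ " Looking for friends, my phone number is 13611125565"
			+ " Selling a used computer, contact 15899903312";
		// Create a Pattern object and use it to build a Matcher object.
		// This regular expression only picks up mobile numbers in the 13X and 15X ranges;
		// to pick up other numbers, just change the regular expression.
		Matcher m = Pattern.compile("((13\\d)|(15\\d))\\d{8}")
			.matcher(str);
		// Print every substring (phone number) that matches the regular expression
		while(m.find())
		{
			System.out.println(m.group());
		}
	}
}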

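Output:

[image: program output]

The find() method searches the string for successive substrings that match the Pattern; once a substring has been found, the next call to find() continues searching from there.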
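
The resumption behavior can be sketched by collecting every match in a loop. The same sketch also shows the find(int) overload, which resets the matcher and starts searching at the given index. Class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindLoopDemo {
    // Each find() call resumes after the previous match, so the loop visits every match
    static List<String> allMatches(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    // find(int) resets the matcher and searches from the given index onward
    static String firstMatchFrom(String regex, String input, int from) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find(from) ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "a1b22c333";
        System.out.println(allMatches("\\d+", text));         // [1, 22, 333]
        System.out.println(firstMatchFrom("\\d+", text, 4));  // 2
        System.out.println(firstMatchFrom("\\d+", text, 5));  // 333
    }
}
```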

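The find() method can also take an int argument; find(int) searches forward starting from that index. The start() and end() methods are mainly used to determine where a substring lies in the target string, as the following program shows: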

StartEnd.java

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StartEnd
{
	public static void main(String[] args)
	{
		// Create a Pattern object and use it to build a Matcher object
		String regStr = "Java is very easy!";
		System.out.println("Target string: " + regStr);
		Matcher m = Pattern.compile("\\w+")
			.matcher(regStr);
		while(m.find())
		{
			System.out.println(m.group() + " substring start position: "
				+ m.start() + ", end position: " + m.end());
		}
	}
}

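Output:

[image: program output]

The matches() and lookingAt() methods are somewhat similar, except that matches() returns true only when the whole string matches the Pattern, while lookingAt() returns true as long as the string begins with a match for the Pattern. The reset() method applies an existing Matcher object to a new character sequence. Consider the following program: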

MatchesTest.java

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchesTest
{
	public static void main(String[] args)
	{
		String[] mails =
		{
			"[email protected]",
			"[email protected]",
			"[email protected]",
			"[email protected]"
		};
		String mailRegEx = "\\w{3,20}@\\w+\\.(com|org|cn|net|gov)";
		Pattern mailPattern = Pattern.compile(mailRegEx);
		Matcher matcher = null;
		for (String mail : mails)
		{
			// Create the Matcher once, then reuse it for each new input
			if (matcher == null)
			{
				matcher = mailPattern.matcher(mail);
			}
			else
			{
				matcher.reset(mail);
			}
			String result = mail + (matcher.matches() ? " is" : " is not")
				+ " a valid email address!";
			System.out.println(result);
		}
	}
}

Output:

[image: program output]

In addition, regular expressions can be used to split, search, and replace within a target string. Consider the following program:

ReplaceTest.java

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReplaceTest
{
	public static void main(String[] args)
	{
		String[] msgs =
		{
			"Java has regular expressions in 1.4",
			"regular expressions now expressing in Java",
			"Java represses oracular expressions"
		};
		Pattern p = Pattern.compile("re\\w*");
		Matcher matcher = null;
		for (int i = 0 ; i < msgs.length ; i++)
		{
			// Create the Matcher once, then reuse it for each new input
			if (matcher == null)
			{
				matcher = p.matcher(msgs[i]);
			}
			else
			{
				matcher.reset(msgs[i]);
			}
			// Replace every substring that starts with "re" and runs to the end of the word
			System.out.println(matcher.replaceAll("haha:)"));
		}
	}
}

Output:

[image: program output]
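
The program above covers replacement only. As a complementary sketch, splitting with Pattern.split() and replacing just the first match with Matcher.replaceFirst() might look like the following; the separator pattern, class name, and helper names are assumptions made for this example.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class SplitReplaceDemo {
    // A comma or semicolon with optional whitespace around it
    private static final Pattern SEPARATOR = Pattern.compile("\\s*[,;]\\s*");
    // "re" followed by any number of word characters
    private static final Pattern RE_WORD = Pattern.compile("re\\w*");

    // Pattern.split() cuts the input at every match of the pattern
    static String[] splitFields(String line) {
        return SEPARATOR.split(line);
    }

    // Matcher.replaceFirst() replaces only the first match
    static String replaceFirstReWord(String text, String replacement) {
        return RE_WORD.matcher(text).replaceFirst(replacement);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitFields("a , b;c")));  // [a, b, c]
        System.out.println(replaceFirstReWord("Java represses oracular expressions", "X"));
        // Java X oracular expressions
    }
}
```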


API: java.util.regex.Matcher

API: java.util.regex.Pattern




Origin blog.csdn.net/sinat_40770656/article/details/102924047