

1. Regular Expressions

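A regular expression is a way to describe a set of strings based on common characteristics shared by each string in the set. Regular expressions can be used to search, edit, or manipulate text and data. You must learn a specific syntax to create regular expressions, one that goes beyond the normal syntax of the Java programming language. Regular expressions vary in complexity, but once you understand the basics of how they are constructed, you will be able to decipher (or create) any regular expression.
This chapter covers the regular expression syntax supported by the java.util.regex API and presents several working examples to illustrate how the various objects interact. In the world of regular expressions, there are many flavors to choose from, such as grep, Perl, Tcl, Python, PHP, and awk. The regular expression syntax in the java.util.regex API is most similar to that of Perl.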


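The java.util.regex package consists primarily of three classes: Pattern, Matcher, and PatternSyntaxException.
- A Pattern object is a compiled representation of a regular expression. The Pattern class provides no public constructors. To create a pattern, you must first invoke one of its public static compile methods, which returns a Pattern object. These methods accept a regular expression as the first argument; the following pages of this chapter cover the required syntax.
- A Matcher object is the engine that interprets the pattern and performs match operations against an input string. Like the Pattern class, Matcher defines no public constructors. You obtain a Matcher object by invoking the matcher method on a Pattern object.
- A PatternSyntaxException object is an unchecked exception that indicates a syntax error in a regular expression pattern.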
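
The interplay of the three classes described above can be sketched as follows. This is a minimal illustration, not code from the chapter: the class name RegexBasics and the helper methods containsMatch and isValidRegex are hypothetical names chosen for this example, while Pattern.compile, Pattern.matcher, Matcher.find, and PatternSyntaxException are the API elements named in the text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class; only the java.util.regex calls come from the API.
class RegexBasics {

    // Returns true if the regex occurs anywhere in the input.
    static boolean containsMatch(String regex, String input) {
        Pattern pattern = Pattern.compile(regex);  // no public constructor: use the static factory
        Matcher matcher = pattern.matcher(input);  // a Matcher is obtained from the Pattern
        return matcher.find();
    }

    // Returns true if the regex compiles, false if its syntax is invalid.
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            // Unchecked exception: catching it is optional and done here for illustration.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(containsMatch("foo", "xfoox")); // true
        System.out.println(isValidRegex("[abc"));          // false: unclosed character class
    }
}
```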


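This section defines a reusable test harness, RegexTestHarness.java, for exploring the regular expression constructs supported by this API. The command to run this code is java RegexTestHarness; no command-line arguments are accepted. The application loops repeatedly, prompting the user for a regular expression and an input string. Using this test harness is optional, but you may find it convenient for exploring the test cases discussed in the following sections.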
import java.io.Console;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTestHarness {

  public static void main(String[] args) {
    Console console = System.console();
    if (console == null) {
      System.err.println("No console.");
      System.exit(1);
    }
    // Loop forever, prompting for a regex and an input string.
    while (true) {
      Pattern pattern =
          Pattern.compile(console.readLine("%nEnter your regex: "));
      Matcher matcher =
          pattern.matcher(console.readLine("Enter input string to search: "));
      boolean found = false;
      // Report every match together with its start and end indices.
      while (matcher.find()) {
        console.format("I found the text \"%s\" starting " +
                       "at index %d and ending at " +
                       "index %d.%n", matcher.group(),
                       matcher.start(), matcher.end());
        found = true;
      }
      if (!found) {
        console.format("No match found.%n");
      }
    }
  }
}

Enter your regex: foo

Enter input string to search: foo

I found the text "foo" starting at index 0 and ending at index 3.





Figure 13-1  The string literal "foo", with its numbered cells and index values marked



Enter your regex: foo

Enter input string to search: foofoofoo

I found the text "foo" starting at index 0 and ending at index 3.

I found the text "foo" starting at index 3 and ending at index 6.

I found the text "foo" starting at index 6 and ending at index 9.



Enter your regex: cat.

Enter input string to search: cats

I found the text "cats" starting at index 0 and ending at index 4.


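The metacharacters supported by this API are: ( [ { \ ^ - $ | } ] ) ? * + .

Note: In certain situations the special characters listed above are not treated as metacharacters. You will encounter this as you learn more about how regular expressions are constructed. You can, however, use this list to check whether a specific character is ever treated as a metacharacter. For example, the characters !, @, and # never carry a special meaning.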


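There are two ways to force a metacharacter to be treated as an ordinary character:
- precede the metacharacter with a backslash, or
- enclose it within \Q (which starts the quote) and \E (which ends it).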
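
Both quoting techniques can be tried in a short sketch. The class name QuotingDemo and the helper found are hypothetical names for this illustration. Note that inside a Java string literal every backslash of the regular expression has to be doubled.

```java
import java.util.regex.Pattern;

// Hypothetical demo class illustrating the two quoting styles.
class QuotingDemo {

    // Returns true if the regex occurs anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Unquoted, the dot is a metacharacter and matches any character.
        System.out.println(found("a.c", "abc"));        // true
        // Quoted with a backslash, the dot matches only a literal dot.
        System.out.println(found("a\\.c", "abc"));      // false
        System.out.println(found("a\\.c", "a.c"));      // true
        // \Q...\E quotes everything in between.
        System.out.println(found("\\Qa.c\\E", "abc"));  // false
        System.out.println(found("\\Qa.c\\E", "a.c"));  // true
    }
}
```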





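[abc]           a, b, or c (simple class)
[^abc]          Any character except a, b, or c (negation)
[a-zA-Z]        a through z or A through Z, inclusive (range)
[a-d[m-p]]      a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]]    d, e, or f (intersection)
[a-z&&[^bc]]    a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]]   a through z, and not m through p: [a-lq-z] (subtraction)

Note: The word "class" in the phrase "character class" does not refer to a .class file. In the context of regular expressions, a character class is a set of characters enclosed within square brackets. It specifies the characters that will successfully match a single character from a given input string.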
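
The character class forms listed above can be checked with a small sketch. The class name CharacterClassDemo and the helper accepts are hypothetical; Pattern.matches is the standard convenience method that tests whether the entire input matches the expression, which here means the input must be a single character accepted by the class.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; each call tests one character against one class.
class CharacterClassDemo {

    // True if the whole input is a single character accepted by the class.
    static boolean accepts(String charClass, String input) {
        return Pattern.matches(charClass, input);
    }

    public static void main(String[] args) {
        System.out.println(accepts("[abc]", "b"));          // true  (simple class)
        System.out.println(accepts("[^abc]", "b"));         // false (negation)
        System.out.println(accepts("[a-d[m-p]]", "n"));     // true  (union)
        System.out.println(accepts("[a-z&&[def]]", "e"));   // true  (intersection)
        System.out.println(accepts("[a-z&&[^bc]]", "b"));   // false (subtraction)
        System.out.println(accepts("[a-z&&[^m-p]]", "q"));  // true  (subtraction)
    }
}
```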



Enter your regex: [bcr]at

Enter input string to search: bat

I found the text "bat" starting at index 0 and ending at index 3.

Enter your regex: [bcr]at

Enter input string to search: cat

I found the text "cat" starting at index 0 and ending at index 3.

Enter your regex: [bcr]at

Enter input string to search: rat

I found the text "rat" starting at index 0 and ending at index 3.

Enter your regex: [bcr]at

Enter input string to search: hat

No match found.


a. Negation


Enter your regex: [^bcr]at

Enter input string to search: bat

No match found.

Enter your regex: [^bcr]at

Enter input string to search: cat

No match found.

Enter your regex: [^bcr]at

Enter input string to search: rat

No match found.

Enter your regex: [^bcr]at

Enter input string to search: hat

I found the text "hat" starting at index 0 and ending at index 3.


b. Ranges



Enter your regex: [a-c]

Enter input string to search: a

I found the text "a" starting at index 0 and ending at index 1.

Enter your regex: [a-c]

Enter input string to search: b

I found the text "b" starting at index 0 and ending at index 1.

Enter your regex: [a-c]

Enter input string to search: c

I found the text "c" starting at index 0 and ending at index 1.

Enter your regex: [a-c]

Enter input string to search: d

No match found.

Enter your regex: foo[1-5]

Enter input string to search: foo1

I found the text "foo1" starting at index 0 and ending at index 4.

Enter your regex: foo[1-5]

Enter input string to search: foo5

I found the text "foo5" starting at index 0 and ending at index 4.

Enter your regex: foo[1-5]

Enter input string to search: foo6

No match found.

Enter your regex: foo[^1-5]

Enter input string to search: foo1

No match found.

Enter your regex: foo[^1-5]

Enter input string to search: foo6

I found the text "foo6" starting at index 0 and ending at index 4.

c. Unions


Enter your regex: [0-4[6-8]]

Enter input string to search: 0

I found the text "0" starting at index 0 and ending at index 1.

Enter your regex: [0-4[6-8]]

Enter input string to search: 5

No match found.

Enter your regex: [0-4[6-8]]

Enter input string to search: 6

I found the text "6" starting at index 0 and ending at index 1.

Enter your regex: [0-4[6-8]]

Enter input string to search: 8

I found the text "8" starting at index 0 and ending at index 1.

Enter your regex: [0-4[6-8]]

Enter input string to search: 9

No match found.

d. Intersections


Enter your regex: [0-9&&[345]]

Enter input string to search: 3

I found the text "3" starting at index 0 and ending at index 1.

Enter your regex: [0-9&&[345]]

Enter input string to search: 4

I found the text "4" starting at index 0 and ending at index 1.

Enter your regex: [0-9&&[345]]

Enter input string to search: 5

I found the text "5" starting at index 0 and ending at index 1.

Enter your regex: [0-9&&[345]]

Enter input string to search: 2

No match found.

Enter your regex: [0-9&&[345]]

Enter input string to search: 6

No match found.


Enter your regex: [2-8&&[4-6]]

Enter input string to search: 3

No match found.

Enter your regex: [2-8&&[4-6]]

Enter input string to search: 4

I found the text "4" starting at index 0 and ending at index 1.

Enter your regex: [2-8&&[4-6]]

Enter input string to search: 5

I found the text "5" starting at index 0 and ending at index 1.

Enter your regex: [2-8&&[4-6]]

Enter input string to search: 6

I found the text "6" starting at index 0 and ending at index 1.

Enter your regex: [2-8&&[4-6]]

Enter input string to search: 7

No match found.

e. Subtraction


Enter your regex: [0-9&&[^345]]

Enter input string to search: 2

I found the text "2" starting at index 0 and ending at index 1.

Enter your regex: [0-9&&[^345]]

Enter input string to search: 3

No match found.

Enter your regex: [0-9&&[^345]]

Enter input string to search: 4

No match found.

Enter your regex: [0-9&&[^345]]

Enter input string to search: 5

No match found.

Enter your regex: [0-9&&[^345]]

Enter input string to search: 6

I found the text "6" starting at index 0 and ending at index 1.

Enter your regex: [0-9&&[^345]]

Enter input string to search: 9

I found the text "9" starting at index 0 and ending at index 1.



The Pattern API contains a number of useful predefined character classes, which offer convenient shorthands for commonly used regular expressions.


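.     Any character (may or may not match line terminators)
\d    A digit: [0-9]
\D    A non-digit: [^0-9]
\s    A whitespace character: [ \t\n\x0B\f\r]
\S    A non-whitespace character: [^\s]
\w    A word character: [a-zA-Z_0-9]
\W    A non-word character: [^\w]

Constructs beginning with a backslash are called escaped constructs. Escaped constructs were introduced briefly in subsection 2 of Section 3.1.2, which mentioned the backslash, \Q, and \E for quotation. If you use an escaped construct within a string literal, you must precede the backslash with another backslash for the string to compile. For example:

private final String REGEX = "\\d"; // a single digit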



Enter your regex: .

Enter input string to search: @

I found the text "@" starting at index 0 and ending at index 1.

Enter your regex: .

Enter input string to search: 1

I found the text "1" starting at index 0 and ending at index 1.

Enter your regex: .

Enter input string to search: a

I found the text "a" starting at index 0 and ending at index 1.

Enter your regex: \d

Enter input string to search: 1

I found the text "1" starting at index 0 and ending at index 1.

Enter your regex: \d

Enter input string to search: a

No match found.

Enter your regex: \D

Enter input string to search: 1

No match found.

Enter your regex: \D

Enter input string to search: a

I found the text "a" starting at index 0 and ending at index 1.

Enter your regex: \s

Enter input string to search:

I found the text " " starting at index 0 and ending at index 1.

Enter your regex: \s

Enter input string to search: a

No match found.

Enter your regex: \S

Enter input string to search:

No match found.

Enter your regex: \S

Enter input string to search: a

I found the text "a" starting at index 0 and ending at index 1.

Enter your regex: \w

Enter input string to search: a

I found the text "a" starting at index 0 and ending at index 1.

Enter your regex: \w

Enter input string to search: !

No match found.

Enter your regex: \W

Enter input string to search: a

No match found.

Enter your regex: \W

Enter input string to search: !

I found the text "!" starting at index 0 and ending at index 1.


- \d matches all digits.
- \s matches whitespace.
- \w matches word characters.

- \D matches non-digits.
- \S matches non-whitespace.
- \W matches non-word characters.
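
The predefined classes and their negated counterparts can be compared side by side in a short sketch. The class name PredefinedClassDemo, the helper count, and the sample input are assumptions made for this illustration. Each regular expression is written as a Java string literal, so the backslash is doubled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class that counts matches of each predefined class.
class PredefinedClassDemo {

    // Counts the non-overlapping matches of regex in input.
    static int count(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String input = "a1 b2_c!";  // 8 characters: a 1 space b 2 _ c !
        System.out.println(count("\\d", input)); // 2: the digits 1 and 2
        System.out.println(count("\\D", input)); // 6: everything except the digits
        System.out.println(count("\\s", input)); // 1: the single space
        System.out.println(count("\\S", input)); // 7: everything except the space
        System.out.println(count("\\w", input)); // 6: a 1 b 2 _ c
        System.out.println(count("\\W", input)); // 2: the space and !
    }
}
```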


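Quantifiers allow you to specify the number of occurrences to match against. For convenience, the three sections of the Pattern API specification describing greedy, reluctant, and possessive quantifiers are presented in Table 13-3. At first glance it may appear that the quantifiers X?, X??, and X?+ do exactly the same thing, since they all promise to match "X, once or not at all." There are subtle implementation differences, which are explained near the end of this section.

Greedy quantifiers
X?        X, once or not at all
X*        X, zero or more times
X+        X, one or more times
X{n}      X, exactly n times
X{n,}     X, at least n times
X{n,m}    X, at least n but not more than m times
Reluctant quantifiers
X??       X, once or not at all
X*?       X, zero or more times
X+?       X, one or more times
X{n}?     X, exactly n times
X{n,}?    X, at least n times
X{n,m}?   X, at least n but not more than m times
Possessive quantifiers
X?+       X, once or not at all
X*+       X, zero or more times
X++       X, one or more times
X{n}+     X, exactly n times
X{n,}+    X, at least n times
X{n,m}+   X, at least n but not more than m times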


Enter your regex: a?

Enter input string to search:

I found the text "" starting at index 0 and ending at index 0.

Enter your regex: a*

Enter input string to search:

I found the text "" starting at index 0 and ending at index 0.

Enter your regex: a+

Enter input string to search:

No match found.

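13.6.1  Zero-Length Matches

In the preceding examples, the match succeeds in the first two cases because the expressions a? and a* both allow for zero occurrences of the letter a. You will also notice that the start and end indices are both zero, which is unlike any of the examples we have seen so far. The empty input string "" has no length, so the test simply matches nothing at index 0. Matches of this sort are known as zero-length matches. A zero-length match can occur in several cases: in an empty input string, at the beginning of an input string, after the last character of an input string, or between any two characters of an input string. Zero-length matches are easily identifiable because they always start and end at the same index position.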
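
Zero-length matches can be spotted programmatically by comparing start and end indices. The sketch below is an illustration under assumed names (the class ZeroLengthDemo and the helper spans are hypothetical); it records each match as "start-end", so a zero-length match shows the same number twice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class that lists the index span of every match.
class ZeroLengthDemo {

    // Returns each match as "start-end"; equal numbers mean a zero-length match.
    static List<String> spans(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> result = new ArrayList<>();
        while (m.find()) {
            result.add(m.start() + "-" + m.end());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(spans("a?", ""));     // [0-0]
        System.out.println(spans("a*", "aab"));  // [0-2, 2-2, 3-3]
        System.out.println(spans("a+", ""));     // []
    }
}
```

In the second call, the zero-length match at 2-2 sits before the letter b, and the one at 3-3 sits after the last character of the input.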


Enter your regex: a?

Enter input string to search: a

I found the text "a" starting at index 0 and ending at index 1.

I found the text "" starting at index 1 and ending at index 1.

Enter your regex: a*

Enter input string to search: a

I found the text "a" starting at index 0 and ending at index 1.

I found the text "" starting at index 1 and ending at index 1.

Enter your regex: a+

Enter input string to search: a

I found the text "a" starting at index 0 and ending at index 1.



Enter your regex: a?

Enter input string to search: aaaaa

I found the text "a" starting at index 0 and ending at index 1.

I found the text "a" starting at index 1 and ending at index 2.

I found the text "a" starting at index 2 and ending at index 3.

I found the text "a" starting at index 3 and ending at index 4.

I found the text "a" starting at index 4 and ending at index 5.

I found the text "" starting at index 5 and ending at index 5.

Enter your regex: a*

Enter input string to search: aaaaa

I found the text "aaaaa" starting at index 0 and ending at index 5.

I found the text "" starting at index 5 and ending at index 5.

Enter your regex: a+

Enter input string to search: aaaaa

I found the text "aaaaa" starting at index 0 and ending at index 5.




Enter your regex: a?

Enter input string to search: ababaaaab

I found the text "a" starting at index 0 and ending at index 1.

I found the text "" starting at index 1 and ending at index 1.

I found the text "a" starting at index 2 and ending at index 3.

I found the text "" starting at index 3 and ending at index 3.

I found the text "a" starting at index 4 and ending at index 5.

I found the text "a" starting at index 5 and ending at index 6.

I found the text "a" starting at index 6 and ending at index 7.

I found the text "a" starting at index 7 and ending at index 8.

I found the text "" starting at index 8 and ending at index 8.

I found the text "" starting at index 9 and ending at index 9.

Enter your regex: a*

Enter input string to search: ababaaaab

I found the text "a" starting at index 0 and ending at index 1.

I found the text "" starting at index 1 and ending at index 1.

I found the text "a" starting at index 2 and ending at index 3.

I found the text "" starting at index 3 and ending at index 3.

I found the text "aaaa" starting at index 4 and ending at index 8.

I found the text "" starting at index 8 and ending at index 8.

I found the text "" starting at index 9 and ending at index 9.

Enter your regex: a+

Enter input string to search: ababaaaab

I found the text "a" starting at index 0 and ending at index 1.

I found the text "a" starting at index 2 and ending at index 3.

I found the text "aaaa" starting at index 4 and ending at index 8.



Enter your regex: a{3}

Enter input string to search: aa

No match found.

Enter your regex: a{3}

Enter input string to search: aaa

I found the text "aaa" starting at index 0 and ending at index 3.

Enter your regex: a{3}

Enter input string to search: aaaa

I found the text "aaa" starting at index 0 and ending at index 3.


Enter your regex: a{3}

Enter input string to search: aaaaaaaaa

I found the text "aaa" starting at index 0 and ending at index 3.

I found the text "aaa" starting at index 3 and ending at index 6.

I found the text "aaa" starting at index 6 and ending at index 9.


Enter your regex: a{3,}

Enter input string to search: aaaaaaaaa

I found the text "aaaaaaaaa" starting at index 0 and ending at index 9.



Enter your regex: a{3,6}  // find at least 3 (but no more than 6) a's in a row

Enter input string to search: aaaaaaaaa

I found the text "aaaaaa" starting at index 0 and ending at index 6.

I found the text "aaa" starting at index 6 and ending at index 9.
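
The bounded quantifier sessions above can be reproduced outside the test harness. This is a minimal sketch with hypothetical names (the class BoundedQuantifierDemo and the helper matches); it collects the text of every match in order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for the X{n}, X{n,} and X{n,m} quantifiers.
class BoundedQuantifierDemo {

    // Returns the text of every match, in the order found.
    static List<String> matches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> result = new ArrayList<>();
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "aaaaaaaaa";  // nine a's
        System.out.println(matches("a{3}", input));    // [aaa, aaa, aaa]
        System.out.println(matches("a{3,}", input));   // [aaaaaaaaa]
        System.out.println(matches("a{3,6}", input));  // [aaaaaa, aaa]
    }
}
```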


13.6.2  Capturing Groups and Character Classes with Quantifiers



Enter your regex: (dog){3}

Enter input string to search: dogdogdogdogdogdog

I found the text "dogdogdog" starting at index 0 and ending at index 9.

I found the text "dogdogdog" starting at index 9 and ending at index 18.

Enter your regex: dog{3}

Enter input string to search: dogdogdogdogdogdog

No match found.



Enter your regex: [abc]{3}

Enter input string to search: abccabaaaccbbbc

I found the text "abc" starting at index 0 and ending at index 3.

I found the text "cab" starting at index 3 and ending at index 6.

I found the text "aaa" starting at index 6 and ending at index 9.

I found the text "ccb" starting at index 9 and ending at index 12.

I found the text "bbc" starting at index 12 and ending at index 15.

Enter your regex: abc{3}

Enter input string to search: abccabaaaccbbbc

No match found.
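
The sessions above suggest that a quantifier binds to the single element immediately before it: a group, a character class, or just one character. The following sketch checks that reading. The class name QuantifierScopeDemo and the helper wholeMatch are hypothetical, and the inputs doggg and abccc are additional examples not taken from the sessions.

```java
import java.util.regex.Pattern;

// Hypothetical demo class showing what a quantifier attaches to.
class QuantifierScopeDemo {

    // True if the entire input matches the regex.
    static boolean wholeMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // The quantifier applies to the whole group.
        System.out.println(wholeMatch("(dog){3}", "dogdogdog"));  // true
        // Without parentheses it applies only to the letter g.
        System.out.println(wholeMatch("dog{3}", "dogdogdog"));    // false
        System.out.println(wholeMatch("dog{3}", "doggg"));        // true
        // The quantifier applies to the whole character class.
        System.out.println(wholeMatch("[abc]{3}", "cab"));        // true
        // Without brackets it applies only to the letter c.
        System.out.println(wholeMatch("abc{3}", "abccc"));        // true
    }
}
```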


13.6.3  Differences Among Greedy, Reluctant, and Possessive Quantifiers






Enter your regex: .*foo  // greedy quantifier

Enter input string to search: xfooxxxxxxfoo

I found the text "xfooxxxxxxfoo" starting at index 0 and ending at index 13.

Enter your regex: .*?foo  // reluctant quantifier

Enter input string to search: xfooxxxxxxfoo

I found the text "xfoo" starting at index 0 and ending at index 4.

I found the text "xxxxxxfoo" starting at index 4 and ending at index 13.

Enter your regex: .*+foo // possessive quantifier

Enter input string to search: xfooxxxxxxfoo

No match found.
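
The three sessions above can be reproduced in code to compare the quantifier kinds directly. This is a minimal sketch; the class name QuantifierKindsDemo and the helper matches are hypothetical names for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing greedy, reluctant and possessive matching.
class QuantifierKindsDemo {

    // Returns the text of every match, in the order found.
    static List<String> matches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> result = new ArrayList<>();
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        // Greedy: .* first takes the whole input, then backs off until foo fits.
        System.out.println(matches(".*foo", input));   // [xfooxxxxxxfoo]
        // Reluctant: .*? takes as little as possible before each foo.
        System.out.println(matches(".*?foo", input));  // [xfoo, xxxxxxfoo]
        // Possessive: .*+ takes the whole input and never backs off.
        System.out.println(matches(".*+foo", input));  // []
    }
}
```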






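A capturing group is a way to treat multiple characters as a single unit. A capturing group is created by placing the characters to be grouped inside a set of parentheses. For example, the regular expression (dog) creates a single group containing the letters "d", "o", and "g". The portion of the input string that matches the capturing group is saved in memory for later recall through backreferences (discussed in Section 13.7.2).

13.7.1  Numbering

As described in the Pattern API, capturing groups are numbered by counting their opening parentheses from left to right. In the expression ((A)(B(C))), for example, there are four such groups: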

(1) ((A)(B(C)))

(2) (A)

(3) (B(C))

(4) (C)


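There is also a special group, group 0, which always represents the entire expression. This group is not included in the total reported by groupCount. Groups beginning with (? are pure, non-capturing groups that do not capture text and do not count toward the group total. (Section 13.9 gives examples of non-capturing groups.)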


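- public int start(int group): returns the start index of the subsequence captured by the given group during the previous match operation.
- public int end(int group): returns the index of the last character, plus one, of the subsequence captured by the given group during the previous match operation.
- public String group(int group): returns the input subsequence captured by the given group during the previous match operation.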
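
The numbering rule and the three group-specific methods can be seen together in a short sketch that uses the expression ((A)(B(C))) from the text. The class name GroupNumberingDemo, the helpers describeGroups and countGroups, and the expression (?:A)(B) are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class that reports every group of the first match.
class GroupNumberingDemo {

    // Describes groups 0..groupCount() of the first match as "n=text@start-end".
    static List<String> describeGroups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> result = new ArrayList<>();
        if (m.find()) {
            for (int i = 0; i <= m.groupCount(); i++) {
                result.add(i + "=" + m.group(i) + "@" + m.start(i) + "-" + m.end(i));
            }
        }
        return result;
    }

    // Number of capturing groups in the regex; group 0 is not counted.
    static int countGroups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        // Prints, one per line: 0=ABC@0-3  1=ABC@0-3  2=A@0-1  3=BC@1-3  4=C@2-3
        for (String line : describeGroups("((A)(B(C)))", "ABC")) {
            System.out.println(line);
        }
        System.out.println(countGroups("((A)(B(C)))"));  // 4
        System.out.println(countGroups("(?:A)(B)"));     // 1: (?:A) does not capture
    }
}
```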

13.7.2  Backreferences



Enter your regex: (\d\d)\1

Enter input string to search: 1212

I found the text "1212" starting at index 0 and ending at index 4.


Enter your regex: (\d\d)\1

Enter input string to search: 1234

No match found.
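
The two sessions above can be written as a short Java sketch. The backreference \1 must match exactly the text that group 1 captured, so two digits have to be followed by the same two digits. The class name BackreferenceDemo and the helper wholeMatch are hypothetical; in the string literal each backslash is doubled.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for the backreference \1.
class BackreferenceDemo {

    // True if the entire input matches the regex.
    static boolean wholeMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String regex = "(\\d\\d)\\1";  // the regular expression (\d\d)\1
        System.out.println(wholeMatch(regex, "1212"));  // true: 12 followed by 12
        System.out.println(wholeMatch(regex, "1234"));  // false: 34 differs from 12
    }
}
```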








