POSIX.2 Regular Expressions Explained (the only way to be certain that a regular expression works is to test it)

codingstandards


The syntax of the regular expressions used in Linux/Bash (POSIX.2 regular expressions) is documented in man 7 regex.

Advanced Bash-Scripting Guide, 18.1. A Brief Introduction to Regular Expressions, says:

An expression is a string of characters. Those characters having an interpretation above and beyond their literal meaning are called metacharacters.

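A regular expression is a special kind of string. Some of its characters carry a special meaning rather than their literal one; these characters are called metacharacters.

How well a command on a given system supports regular expressions usually depends on its implementation, hence the saying: "the only way to be certain that a regular expression works is to test it".

Advanced Bash-Scripting Guide, 18.1. A Brief Introduction to Regular Expressions, says: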

The only way to be certain that a particular RE works is to test it.
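
In that spirit, a small helper makes it easy to try an expression against sample input. The sketch below is only an illustration: re_test is a hypothetical name, and it assumes Bash, whose [[ string =~ regex ]] operator matches with extended REs.

```bash
#!/usr/bin/env bash
# re_test is a hypothetical helper (not part of Bash): it reports whether
# the extended RE in $1 matches the string in $2, and what it matched.
re_test() {
    local re=$1 str=$2
    if [[ $str =~ $re ]]; then
        printf 'match: "%s"\n' "${BASH_REMATCH[0]}"
    else
        printf 'no match\n'
    fi
}

re_test 'b+' 'abbbc'    # prints: match: "bbb"
re_test 'x+' 'abbbc'    # prints: no match
```

Keeping the RE in a variable (here $re) rather than writing it inline avoids surprises with shell quoting.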

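The rest of this post walks through the POSIX.2 regular expressions described in man 7 regex.

POSIX.2 regular expressions (REs) come in two forms:

modern REs, also called extended REs, such as those supported by the egrep command;

obsolete REs, also called basic REs, such as those supported by the ed command.

man 7 regex says: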

Regular expressions (‘‘RE’’s), as defined in POSIX.2, come in two forms: modern REs (roughly those of egrep;
1003.2 calls these ‘‘extended’’ REs) and obsolete REs (roughly those of ed(1); 1003.2 ‘‘basic’’ REs). Obsolete
REs mostly exist for backward compatibility in some old programs; they will be discussed at the end. 1003.2
leaves some aspects of RE syntax and semantics open; ‘(!)’ marks decisions on these aspects that may not be
fully portable to other 1003.2 implementations.

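The following describes modern REs. A modern RE consists of one or more branches separated by vertical bars (|). A string matches the RE if it matches any one of the branches.

For example, abc|def matches both abc and def.

man 7 regex says: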

A (modern) RE is one(!) or more non-empty(!) branches, separated by ‘|’. It matches anything that matches one
of the branches.
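
The branch rule can be checked with grep -E, the current spelling of egrep. This is a minimal sketch; ere_match is a hypothetical helper name introduced only for the example.

```bash
#!/usr/bin/env bash
# ere_match is a hypothetical helper: it succeeds if the line in $2
# contains a match for the extended RE in $1.
ere_match() { printf '%s\n' "$2" | grep -Eq -- "$1"; }

for s in abc def abd; do
    if ere_match 'abc|def' "$s"; then
        echo "$s: match"
    else
        echo "$s: no match"
    fi
done
# prints:
# abc: match
# def: match
# abd: no match
```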

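A branch is one or more pieces concatenated together. A piece is an atom, optionally followed by a modifier. The pieces are matched one after another, in order.

A modifier specifies how many times the atom may occur:

* the preceding atom occurs 0 or more times;

+ the preceding atom occurs 1 or more times;

? the preceding atom occurs 0 or 1 time.

man 7 regex says: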

A branch is one(!) or more pieces, concatenated. It matches a match for the first, followed by a match for the
second, etc.
A piece is an atom possibly followed by a single(!) ‘*’, ‘+’, ‘?’, or bound. An atom followed by ‘*’ matches a
sequence of 0 or more matches of the atom. An atom followed by ‘+’ matches a sequence of 1 or more matches of
the atom. An atom followed by ‘?’ matches a sequence of 0 or 1 matches of the atom.
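
A quick way to see the three modifiers side by side is to anchor each RE with ^ and $ so that the whole line has to match. The helper name ere_match is an assumption made for this sketch.

```bash
#!/usr/bin/env bash
# ere_match is a hypothetical helper: succeed if the line in $2 contains
# a match for the extended RE in $1.
ere_match() { printf '%s\n' "$2" | grep -Eq -- "$1"; }

ere_match '^ab*c$' 'ac'   && echo "ab*c matches ac (zero b's)"
ere_match '^ab*c$' 'abbc' && echo "ab*c matches abbc (two b's)"
ere_match '^ab+c$' 'ac'   || echo "ab+c rejects ac (needs at least one b)"
ere_match '^ab?c$' 'abc'  && echo "ab?c matches abc (one b)"
ere_match '^ab?c$' 'abbc' || echo "ab?c rejects abbc (at most one b)"
```

All five messages are printed.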

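A modifier can also be a bound, written with braces ({}), which gives the allowed number of occurrences of the preceding atom:

{n} the preceding atom occurs exactly n times; n must lie between 0 and RE_DUP_MAX, where RE_DUP_MAX is 255 in the man page (a value it marks as not fully portable);

{n,} the preceding atom occurs n or more times;

{n,m} the preceding atom occurs between n and m times (inclusive); n <= m is required.

man 7 regex says: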

A bound is ‘{’ followed by an unsigned decimal integer, possibly followed by ‘,’ possibly followed by another
unsigned decimal integer, always followed by ‘}’. The integers must lie between 0 and RE_DUP_MAX (255(!))
inclusive, and if there are two of them, the first may not exceed the second. An atom followed by a bound
containing one integer i and no comma matches a sequence of exactly i matches of the atom. An atom followed by a
bound containing one integer i and a comma matches a sequence of i or more matches of the atom. An atom
followed by a bound containing two integers i and j matches a sequence of i through j (inclusive) matches of the
atom.
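
The three bound forms can be tried the same way. This is a sketch with a hypothetical ere_match helper; the REs are anchored so that the count applies to the whole line.

```bash
#!/usr/bin/env bash
# ere_match is a hypothetical helper: succeed if the line in $2 contains
# a match for the extended RE in $1.
ere_match() { printf '%s\n' "$2" | grep -Eq -- "$1"; }

ere_match '^a{3}$'   'aaa'   && echo "a{3} matches aaa"
ere_match '^a{3}$'   'aaaa'  || echo "a{3} rejects aaaa (exactly 3 required)"
ere_match '^a{2,}$'  'aaaaa' && echo "a{2,} matches aaaaa"
ere_match '^a{2,}$'  'a'     || echo "a{2,} rejects a (at least 2 required)"
ere_match '^a{2,4}$' 'aaa'   && echo "a{2,4} matches aaa"
ere_match '^a{2,4}$' 'aaaaa' || echo "a{2,4} rejects aaaaa (at most 4 allowed)"
```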

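Next, what counts as an atom. An atom is one of the following:

(RE) a regular expression in parentheses (a subexpression); matches whatever that RE matches

() matches the null string

[CHAR-SET] a bracket expression; matches any single character from the given set

. matches any single character

^ matches the beginning of a line

$ matches the end of a line

\ followed by one of ^.[$()|*+?{\ escapes that metacharacter, so it loses its special meaning and matches itself

\ followed by any other character matches that character itself

any other single character matches itself

{ followed by a non-digit character is an ordinary character, not the start of a bound

an RE that ends with \ is illegal

man 7 regex says: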

An atom is a regular expression enclosed in ‘()’ (matching a match for the regular expression), an empty set of
‘()’ (matching the null string)(!), a bracket expression (see below), ‘.’ (matching any single character), ‘^’
(matching the null string at the beginning of a line), ‘$’ (matching the null string at the end of a line), a
‘\’ followed by one of the characters ‘^.[$()|*+?{\’ (matching that character taken as an ordinary character),
a ‘\’ followed by any other character(!) (matching that character taken as an ordinary character, as if the
‘\’ had not been present(!)), or a single character with no other significance (matching that character). A
‘{’ followed by a character other than a digit is an ordinary character, not the beginning of a bound(!). It
is illegal to end an RE with ‘\’.
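
A few of these atoms in action, as a rough sketch. The helper name ere_match is hypothetical, and printf is used for the messages so that the backslashes in them are printed as-is.

```bash
#!/usr/bin/env bash
# ere_match is a hypothetical helper: succeed if the line in $2 contains
# a match for the extended RE in $1.
ere_match() { printf '%s\n' "$2" | grep -Eq -- "$1"; }

ere_match 'a.c'     'abc'    && printf '%s\n' 'a.c matches abc (. is any character)'
ere_match 'a\.c'    'abc'    || printf '%s\n' 'a\.c rejects abc (\. is a literal dot)'
ere_match 'a\.c'    'a.c'    && printf '%s\n' 'a\.c matches a.c'
ere_match '1\+1'    '1+1'    && printf '%s\n' '1\+1 matches 1+1 (\+ is a literal plus)'
ere_match '1+1'     '1+1'    || printf '%s\n' '1+1 rejects 1+1 (it looks for two or more 1s in a row)'
ere_match '^(ab)+$' 'ababab' && printf '%s\n' '(ab)+ matches ababab (a subexpression is an atom)'
ere_match '^$'      ''       && printf '%s\n' '^$ matches an empty line'
```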

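Square brackets [ ] enclose a set of characters; the set may not be empty.

If the set begins with ^, the expression matches any single character that is not in the set.

A form like a-z specifies a range of characters, but a form like a-c-e is illegal.

For example, [0-9] matches a digit, and [^0-9] matches any single character that is not a digit.

man 7 regex says: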

A bracket expression is a list of characters enclosed in ‘[]’. It normally matches any single character from
the list (but see below). If the list begins with ‘^’, it matches any single character (but see below) not
from the rest of the list. If two characters in the list are separated by ‘-’, this is shorthand for the full
range of characters between those two (inclusive) in the collating sequence, e.g. ‘[0-9]’ in ASCII matches any
decimal digit. It is illegal(!) for two ranges to share an endpoint, e.g. ‘a-c-e’. Ranges are very
collating-sequence-dependent, and portable programs should avoid relying on them.
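
Because ranges depend on the collating sequence, the sketch below forces the C locale with LC_ALL=C before trying [0-9] and [^0-9]. The helper name ere_match is hypothetical.

```bash
#!/usr/bin/env bash
# ere_match is a hypothetical helper: succeed if the line in $2 contains
# a match for the extended RE in $1. LC_ALL=C pins the collating sequence
# so that the range 0-9 means exactly the ASCII digits.
ere_match() { printf '%s\n' "$2" | LC_ALL=C grep -Eq -- "$1"; }

ere_match '^[0-9]+$'  '2011' && echo "[0-9]+ matches 2011"
ere_match '^[0-9]+$'  '20x1' || echo "[0-9]+ rejects 20x1"
ere_match '^[^0-9]+$' 'abc'  && echo "[^0-9]+ matches abc"
ere_match '[^0-9]'    '2011' || echo "[^0-9] finds no non-digit in 2011"
ere_match '^[aeiou]$' 'e'    && echo "[aeiou] matches e"
```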

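Inside [ ]:

to include ] in the set, make it the first character, as in []]; likewise, [^]] matches any single character other than ];

to include - in the set, make it the first or the last character (or the second endpoint of a range); [-] matches -, and [^-] matches any single character other than -.

man 7 regex says: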

To include a literal ‘]’ in the list, make it the first character (following a possible ‘^’). To include a
literal ‘-’, make it the first or last character, or the second endpoint of a range. To use a literal ‘-’ as
the first endpoint of a range, enclose it in ‘[.’ and ‘.]’ to make it a collating element (see below). With
the exception of these and some combinations using ‘[’ (see next paragraphs), all other special characters,
including ‘\’, lose their special significance within a bracket expression.
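
These placement rules are easy to get wrong, so here is a small sketch that exercises them. The helper name ere_match is hypothetical, and printf is used so that the backslash in the last message is printed literally.

```bash
#!/usr/bin/env bash
# ere_match is a hypothetical helper: succeed if the line in $2 contains
# a match for the extended RE in $1.
ere_match() { printf '%s\n' "$2" | LC_ALL=C grep -Eq -- "$1"; }

ere_match '[]]'           'a]b'   && printf '%s\n' '[]] finds the ] in a]b'
ere_match '^[^]]+$'       'a]b'   || printf '%s\n' '[^]]+ rejects a]b (it contains a ])'
ere_match '^[-+]?[0-9]+$' '-42'   && printf '%s\n' '[-+] with - first matches the sign in -42'
ere_match '^[0-9+-]+$'    '1+2-3' && printf '%s\n' '[0-9+-] with - last matches 1+2-3'
ere_match '[\.]'          'a\b'   && printf '%s\n' '[\.] finds the backslash in a\b (\ is literal inside brackets)'
```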

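A multi-character collating element is written [.chars.]; for example, [[.ch.]] can match ch, provided the locale's collating sequence defines ch as a collating element.

man 7 regex says: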

Within a bracket expression, a collating element (a character, a multi-character sequence that collates as if
it were a single character, or a collating-sequence name for either) enclosed in ‘[.’ and ‘.]’ stands for the
sequence of characters of that collating element. The sequence is a single element of the bracket expression’s
list. A bracket expression containing a multi-character collating element can thus match more than one
character, e.g. if the collating sequence includes a ‘ch’ collating element, then the RE ‘[[.ch.]]*c’ matches the
first five characters of ‘chchcc’.

An equivalence class is written [=c=]. I have not yet worked out how to use equivalence classes in practice.

man 7 regex says:

Within a bracket expression, a collating element enclosed in ‘[=’ and ‘=]’ is an equivalence class, standing
for the sequences of characters of all collating elements equivalent to that one, including itself. (If there
are no other equivalent collating elements, the treatment is as if the enclosing delimiters were ‘[.’ and
‘.]’.) For example, if o and ^ are the members of an equivalence class, then ‘[[=o=]]’, ‘[[=^=]]’, and ‘[o^]’
are all synonymous. An equivalence class may not(!) be an endpoint of a range.

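Inside [ ], a character class can be given in the form [:class:]; for example, [[:digit:]] matches a digit and [[:alpha:]] matches a letter. Commonly used standard character classes:

alnum letters and digits

alpha letters

blank blank characters: space and tab

digit digits

lower lowercase letters

space whitespace: space, tab, vertical tab, newline, carriage return and form feed (note the difference from blank)

upper uppercase letters

xdigit hexadecimal digits

These classes are decided the same way as by the C character classification functions; for example, C uses isalpha(c) to test for a letter, and so on for the other classes.

man 7 regex says: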

       Within a bracket expression, the name of a character class enclosed in ‘[:’ and ‘:]’ stands for the list of all
       characters belonging to that class. Standard character class names are:

              alnum       digit       punct
              alpha       graph       space
              blank       lower       upper
              cntrl       print       xdigit

       These stand for the character classes defined in wctype(3). A locale may provide others. A character class
       may not be used as an endpoint of a range.
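
The following sketch tries several of the standard classes and shows the difference between blank and space using a vertical tab. The helper name ere_match and the variable names are hypothetical; the last two checks use Bash's own [[ =~ ]] operator because grep works line by line.

```bash
#!/usr/bin/env bash
# ere_match is a hypothetical helper: succeed if the line in $2 contains
# a match for the extended RE in $1 (C locale, so classes are ASCII-only).
ere_match() { printf '%s\n' "$2" | LC_ALL=C grep -Eq -- "$1"; }

ere_match '^[[:digit:]]+$'            '4758'     && echo "[[:digit:]]+ matches 4758"
ere_match '^[[:alpha:]][[:alnum:]]*$' 'x86'      && echo "letter-then-alnum matches x86"
ere_match '^[[:alpha:]][[:alnum:]]*$' '86x'      || echo "letter-then-alnum rejects 86x"
ere_match '^[[:xdigit:]]+$'           'DEADbeef' && echo "[[:xdigit:]]+ matches DEADbeef"
ere_match '^[[:upper:]][[:lower:]]+$' 'Linux'    && echo "upper-then-lower matches Linux"

# blank vs. space: a vertical tab belongs to space but not to blank.
vt=$'\v'
re_space='[[:space:]]'
re_blank='[[:blank:]]'
[[ $vt =~ $re_space ]] && echo "vertical tab is in [:space:]"
[[ $vt =~ $re_blank ]] || echo "vertical tab is not in [:blank:]"
```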

The C character classification functions are described as follows.

man 3 isalpha says:

       isalnum()
              checks for an alphanumeric character; it is equivalent to (isalpha(c) || isdigit(c)).
       isalpha()
              checks for an alphabetic character; in the standard "C" locale, it is equivalent to (isupper(c) ||
              islower(c)). In some locales, there may be additional characters for which isalpha() is true—letters
              which are neither upper case nor lower case.
       isascii()
              checks whether c is a 7-bit unsigned char value that fits into the ASCII character set.
       isblank()
              checks for a blank character; that is, a space or a tab.
       iscntrl()
              checks for a control character.
       isdigit()
              checks for a digit (0 through 9).
       isgraph()
              checks for any printable character except space.
       islower()
              checks for a lower-case character.
       isprint()
              checks for any printable character including space.
       ispunct()
              checks for any printable character which is not a space or an alphanumeric character.
       isspace()
              checks for white-space characters. In the "C" and "POSIX" locales, these are: space, form-feed ('\f'),
              newline ('\n'), carriage return ('\r'), horizontal tab ('\t'), and vertical tab ('\v').
       isupper()
              checks for an uppercase letter.
       isxdigit()
              checks for a hexadecimal digits, i.e. one of
              0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F.

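Note that POSIX.2 regular expressions do not define the shorthand character classes found in Java, where \d matches a digit and \w matches a letter, digit or underscore (individual tools may offer some of them as extensions).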
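
In portable REs the usual substitutes are bracket expressions. The sketch below assumes the C locale, where [[:digit:]] and [[:alnum:]_] cover the same ASCII characters as Java's default \d and \w; the helper and variable names are hypothetical.

```bash
#!/usr/bin/env bash
# ere_match is a hypothetical helper: succeed if the line in $2 contains
# a match for the extended RE in $1.
ere_match() { printf '%s\n' "$2" | LC_ALL=C grep -Eq -- "$1"; }

digits='^[[:digit:]]+$'     # stands in for ^\d+$
word='^[[:alnum:]_]+$'      # stands in for ^\w+$

ere_match "$digits" '4758'    && echo "4758 is all digits"
ere_match "$digits" '47a8'    || echo "47a8 is not all digits"
ere_match "$word"   'user_01' && echo "user_01 is all word characters"
ere_match "$word"   'user-01' || echo "user-01 contains a non-word character"
```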

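Matching starts at the earliest possible position in the string and, from there, takes the longest possible match: the longer the match, the better. This is often called greedy matching.

man 7 regex says: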

In the event that an RE could match more than one substring of a given string, the RE matches the one starting
earliest in the string. If the RE could match more than one substring starting at that point, it matches the
longest. Subexpressions also match the longest possible substrings, subject to the constraint that the whole
match be as long as possible, with subexpressions starting earlier in the RE taking priority over ones starting
later. Note that higher-level subexpressions thus take priority over their lower-level component
subexpressions.

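Match lengths are measured in characters. Even a match of the null string counts as longer than no match at all. For example:

bb* matches the three middle characters of abbbc;

(wee|week)(knights|nights) matches the whole string weeknights;

(.*).* matches abc, with (.*) matching abc and the remaining .* matching the null string;

(a*)* matches against bc, with both (a*)* and (a*) matching only the null string.

man 7 regex says: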

Match lengths are measured in characters, not collating elements. A null string is considered longer than no
match at all. For example, ‘bb*’ matches the three middle characters of ‘abbbc’, ‘(wee|week)(knights|nights)’
matches all ten characters of ‘weeknights’, when ‘(.*).*’ is matched against ‘abc’ the parenthesized
subexpression matches all three characters, and when ‘(a*)*’ is matched against ‘bc’ both the whole RE and the
parenthesized subexpression match the null string.
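
The four examples from the man page can be reproduced in Bash, which exposes the whole match in BASH_REMATCH[0] and the first parenthesized subexpression in BASH_REMATCH[1]. This is a sketch; show is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# show is a hypothetical helper: match the string in $2 against the
# extended RE in $1 and print the whole match and the first subexpression.
show() {
    local re=$1 str=$2
    if [[ $str =~ $re ]]; then
        printf 'whole=[%s] group1=[%s]\n' "${BASH_REMATCH[0]}" "${BASH_REMATCH[1]}"
    else
        printf 'no match\n'
    fi
}

show 'bb*' 'abbbc'                               # whole=[bbb]
show '(wee|week)(knights|nights)' 'weeknights'   # whole=[weeknights]
show '(.*).*' 'abc'                              # whole=[abc] group1=[abc]
show '(a*)*' 'bc'                                # whole=[] (a null match still counts as a match)
```

The comments list only the parts the man page states; the exact split between the two groups in the weeknights case is left for the reader to test on their own system.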

On case-independent matching: x matches both x and X, as if it were [xX], and [^x] behaves like [^xX].

man 7 regex says:

If case-independent matching is specified, the effect is much as if all case distinctions had vanished from the
alphabet. When an alphabetic that exists in multiple cases appears as an ordinary character outside a bracket
expression, it is effectively transformed into a bracket expression containing both cases, e.g. ‘x’ becomes
‘[xX]’. When it appears inside a bracket expression, all case counterparts of it are added to the bracket
expression, so that (e.g.) ‘[x]’ becomes ‘[xX]’ and ‘[^x]’ becomes ‘[^xX]’.
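
With grep, case-independent matching is requested with the -i option. A short sketch, with ere_imatch as a hypothetical helper name:

```bash
#!/usr/bin/env bash
# ere_imatch is a hypothetical helper: like a plain grep -E test, but with
# -i so that case distinctions are ignored.
ere_imatch() { printf '%s\n' "$2" | LC_ALL=C grep -Eiq -- "$1"; }

ere_imatch 'x'     'X' && echo "x matches X (as if it were [xX])"
ere_imatch '^[x]$' 'X' && echo "[x] matches X (as if it were [xX])"
ere_imatch '[^x]'  'X' || echo "[^x] rejects X (as if it were [^xX])"
ere_imatch '[^x]'  'Y' && echo "[^x] matches Y"
```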

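On length limits: no particular limit is imposed on the length of an RE, but portable programs should keep REs to 256 bytes or fewer, because an implementation may refuse longer ones and still be POSIX-compliant.

man 7 regex says: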

No particular limit is imposed on the length of REs(!). Programs intended to be portable should not employ REs
longer than 256 bytes, as an implementation can refuse to accept such REs and remain POSIX-compliant.

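Finally, the differences between obsolete (basic) REs and the modern (extended) REs described above:

in a basic RE, the vertical bar (|), plus sign (+) and question mark (?) are ordinary characters;

bounds are written with \{ \}, while { } by themselves are ordinary characters;

subexpressions are written with \( \), while ( ) by themselves are ordinary characters;

^ is an ordinary character when it is not at the beginning, $ is an ordinary character when it is not at the end, and * is an ordinary character when it is at the beginning;

\ followed by a non-zero digit is a back reference to the substring matched by the corresponding earlier subexpression; for example, \([bc]\)\1 matches bb or cc but not bc.

man 7 regex says: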

Obsolete (‘‘basic’’) regular expressions differ in several respects. ‘|’, ‘+’, and ‘?’ are ordinary characters
and there is no equivalent for their functionality. The delimiters for bounds are ‘\{’ and ‘\}’, with ‘{’ and
‘}’ by themselves ordinary characters. The parentheses for nested subexpressions are ‘\(’ and ‘\)’, with ‘(’
and ‘)’ by themselves ordinary characters. ‘^’ is an ordinary character except at the beginning of the RE
or(!) the beginning of a parenthesized subexpression, ‘$’ is an ordinary character except at the end of the RE
or(!) the end of a parenthesized subexpression, and ‘*’ is an ordinary character if it appears at the beginning
of the RE or the beginning of a parenthesized subexpression (after a possible leading ‘^’). Finally, there is
one new type of atom, a back reference: ‘\’ followed by a non-zero decimal digit d matches the same sequence of
characters matched by the dth parenthesized subexpression (numbering subexpressions by the positions of their
opening parentheses, left to right), so that (e.g.) ‘\([bc]\)\1’ matches ‘bb’ or ‘cc’ but not ‘bc’.
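
Plain grep (without -E) uses basic REs, so it can be used to check these differences. The sketch below assumes that default; bre_match is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# bre_match is a hypothetical helper: succeed if the line in $2 contains
# a match for the basic RE in $1 (grep without -E).
bre_match() { printf '%s\n' "$2" | LC_ALL=C grep -q -- "$1"; }

bre_match 'a+b'          'a+b' && printf '%s\n' 'BRE a+b matches the literal text a+b'
bre_match '^a+b$'        'aab' || printf '%s\n' 'BRE a+b rejects aab (+ is an ordinary character)'
bre_match 'a|b'          'a'   || printf '%s\n' 'BRE a|b rejects a (| is an ordinary character)'
bre_match '^a\{2\}$'     'aa'  && printf '%s\n' 'BRE a\{2\} matches aa'
bre_match 'a{2}'         'a{2}' && printf '%s\n' 'BRE a{2} matches the literal text a{2}'
bre_match '^\([bc]\)\1$' 'bb'  && printf '%s\n' 'BRE \([bc]\)\1 matches bb'
bre_match '^\([bc]\)\1$' 'cc'  && printf '%s\n' 'BRE \([bc]\)\1 matches cc'
bre_match '^\([bc]\)\1$' 'bc'  || printf '%s\n' 'BRE \([bc]\)\1 rejects bc'
```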

To repeat what was said earlier: "the only way to be certain that a regular expression works is to test it".

Advanced Bash-Scripting Guide, 18.1. A Brief Introduction to Regular Expressions, says:

The only way to be certain that a particular RE works is to test it.

Permalink: http://codingstandards.iteye.com/blog/1195592

2011-10-14 09:12