`

POSIX.2正则表达式说明(确认一个正则表达式是否正确的唯一方法就是去测试它)

阅读更多

POSIX.2正则表达式说明

关于在Linux/Bash中正则表达式(POSIX.2 regular expressions)的语法形式,可以使用 man 7 regex 去查看。

Advanced Bash-Scripting Guide: 18.1. A Brief Introduction to Regular Expressions 写道
An expression is a string of characters. Those characters having an interpretation above and beyond their literal meaning are called metacharacters.

正则表达式是特殊的字符串。在正则表达式中,有些字符具有特殊含义,而不是它本身的字面含义,这种字符称之为元字符(metacharacter)。

 

一个系统中的命令对正则表达式的支持程度,往往取决于其具体实现,所以有这么一种说法:“确认一个正则表达式是否正确的唯一方法就是去测试它 ”。

Advanced Bash-Scripting Guide: 18.1. A Brief Introduction to Regular Expressions 写道
The only way to be certain that a particular RE works is to test it.

 

下面说明一下 man 7 regex 所描述的 POSIX.2 正则表达式。

 

POSIX.2 正则表达式(Regular expressions, 简称RE),有两种形式:

一种是 modern RE, 或者称之为 extended RE,比如 egrep命令所支持的;

一种是 obsolete RE, 或者称之为 basic RE,比如 ed命令所支持的。

man 7 regex 写道
Regular expressions (‘‘RE’’s), as defined in POSIX.2, come in two forms: modern REs (roughly those of egrep;
1003.2 calls these ‘‘extended’’ REs) and obsolete REs (roughly those of ed(1); 1003.2 ‘‘basic’’ REs). Obsolete
REs mostly exist for backward compatibility in some old programs; they will be discussed at the end. 1003.2
leaves some aspects of RE syntax and semantics open; ‘(!)’ marks decisions on these aspects that may not be
fully portable to other 1003.2 implementations.
  

下面讲的是 modern RE,一个 modern RE 由一个或多个 branch 用 竖线(|) 分隔。一个字符串只需要匹配其中一个 branch 就认为是匹配该 RE。

比如:abc|def 既可匹配 abc 也可匹配 def。

man 7 regex 写道
A (modern) RE is one(!) or more non-empty(!) branches, separated by ‘|’. It matches anything that matches one
of the branches.

 

一个 branch 由一个或多个 piece 串接而成:一个 piece 是由 atom 或者 atom 加上 modifier 组成。在匹配的时候是依次匹配。

modifier的作用是指定 atom 的出现次数,比如:

*  前面的 atom 出现 0次或多次;

+ 前面的 atom 出现 1次或多次;

? 前面的 atom 出现 0次或1次;

man 7 regex 写道
A branch is one(!) or more pieces, concatenated. It matches a match for the first, followed by a match for the
second, etc.
A piece is an atom possibly followed by a single(!) ‘*’, ‘+’, ‘?’, or bound. An atom followed by ‘*’ matches a
sequence of 0 or more matches of the atom. An atom followed by ‘+’ matches a sequence of 1 or more matches of
the atom. An atom followed by ‘?’ matches a sequence of 0 or 1 matches of the atom.
 

modifier也可以是 bound ({}),即指定范围,前面的 atom 出现的次数在{}内指定,如下:

{n} 前面的 atom 刚好出现 n次,n必须在 0 到 RE_DUP_MAX 之间,其中 RE_DUP_MAX 最大为255;

{n,} 前面的 atom 出现 n次及以上;

{n,m} 前面的 atom 出现 n次到m次,必须 n <= m。

man 7 regex 写道
A bound is ‘{’ followed by an unsigned decimal integer, possibly followed by ‘,’ possibly followed by another
unsigned decimal integer, always followed by ‘}’. The integers must lie between 0 and RE_DUP_MAX (255(!))
inclusive, and if there are two of them, the first may not exceed the second. An atom followed by a bound con-
taining one integer i and no comma matches a sequence of exactly i matches of the atom. An atom followed by a
bound containing one integer i and a comma matches a sequence of i or more matches of the atom. An atom fol-
lowed by a bound containing two integers i and j matches a sequence of i through j (inclusive) matches of the
atom.
 

下面讲到 atom 是指哪些东西,一个 atom 可以如下之一:

(RE)               匹配一个正则表达式,子表达式

()                   匹配一个空串

[CHAR-SET]    匹配指定字符集合中的任意字符

.                    匹配任意单个字符

^                   匹配行首

$                   匹配行尾

\跟上^.[$()|*+?{\之一    转义,使这些元字符的特殊含义丧失,匹配这些字符本身

\跟上其他字符      匹配就是这些字符本身

其他单个字符     匹配这些字符本身

{跟上非数字字符     此时{是个普通字符

结尾为\           非法

man 7 regex 写道
An atom is a regular expression enclosed in ‘()’ (matching a match for the regular expression), an empty set of
‘()’ (matching the null string)(!), a bracket expression (see below), ‘.’ (matching any single character), ‘^’
(matching the null string at the beginning of a line), ‘$’ (matching the null string at the end of a line), a
‘\’ followed by one of the characters ‘^.[$()|*+?{\’ (matching that character taken as an ordinary character),
a ‘\’ followed by any other character(!) (matching that character taken as an ordinary character, as if the
‘\’ had not been present(!)), or a single character with no other significance (matching that character). A
‘{’ followed by a character other than a digit is an ordinary character, not the beginning of a bound(!). It
is illegal to end an RE with ‘\’.
 

 

方括号 [ ] 中可以指定一个字符的集合,并且不能是空集合。

如果这个集合以 ^ 开头,那么表示不匹配该集合中的字符。

类似 a-z 的形式可以指定字符的范围,但是 a-c-e 这种形式是非法的。

比如 [0-9] 表示匹配数字字符,[^0-9] 表示不匹配数字字符。

man 7 regex 写道
A bracket expression is a list of characters enclosed in ‘[]’. It normally matches any single character from
the list (but see below). If the list begins with ‘^’, it matches any single character (but see below) not
from the rest of the list. If two characters in the list are separated by ‘-’, this is shorthand for the full
range of characters between those two (inclusive) in the collating sequence, e.g. ‘[0-9]’ in ASCII matches any
decimal digit. It is illegal(!) for two ranges to share an endpoint, e.g. ‘a-c-e’. Ranges are very collating-
sequence-dependent, and portable programs should avoid relying on them.
 

在 [ ] 中,

如果字符集合需要包含 ] 呢,可以写成 []],即 ]为集合中的第一个字符,而 [^]] 表示不匹配 ];

如果字符集合需要包含 - 呢,必须把 - 放在第一个字符的位置或者最后一个字符的位置,[-] 表示匹配 -,[^-] 表示不匹配 - ;

man 7 regex 写道
To include a literal ‘]’ in the list, make it the first character (following a possible ‘^’). To include a
literal ‘-’, make it the first or last character, or the second endpoint of a range. To use a literal ‘-’ as
the first endpoint of a range, enclose it in ‘[.’ and ‘.]’ to make it a collating element (see below). With
the exception of these and some combinations using ‘[’ (see next paragraphs), all other special characters,
including ‘\’, lose their special significance within a bracket expression.
 

关于多字符序列,形式为 [.chars.],比如 [[.ch.,]] 可以匹配 ch。

man 7 regex 写道
Within a bracket expression, a collating element (a character, a multi-character sequence that collates as if
it were a single character, or a collating-sequence name for either) enclosed in ‘[.’ and ‘.]’ stands for the
sequence of characters of that collating element. The sequence is a single element of the bracket expression’s
list. A bracket expression containing a multi-character collating element can thus match more than one charac-
ter, e.g. if the collating sequence includes a ‘ch’ collating element, then the RE ‘[[.ch.]]*c’ matches the
first five characters of ‘chchcc’.
 

关于等价类,形式为 [=c=],但这个等价类目前我还没有明白怎么用法。

man 7 regex 写道
Within a bracket expression, a collating element enclosed in ‘[=’ and ‘=]’ is an equivalence class, standing
for the sequences of characters of all collating elements equivalent to that one, including itself. (If there
are no other equivalent collating elements, the treatment is as if the enclosing delimiters were ‘[.’ and
‘.]’.) For example, if o and ^ are the members of an equivalence class, then ‘[[=o=]]’, ‘[[=^=]]’, and ‘[o^]’
are all synonymous. An equivalence class may not(!) be an endpoint of a range.
 

 

 

在 [ ] 中,可以指定字符类,形式为 [:class:],比如 [[:digit:]] 匹配数字,[[:alpha:]] 匹配字母,常用的标准字符类如下:

alnum    字母和数字

alpha     字母

blank     空白,包括空格、制表符等

digit      数字

lower     小写字母

space   空白,包括空格、制表符、竖向制表符、换行、回车,注意与 blank 类的区别

upper    大写字母

xdigit    十六进制数字字符

这些字符类的判断方式与C语言中的字符类判断是一样的,比如在C语言中用 isalpha(c) 来判断是否字母,以此类推。

man 7 regex 写道
       Within a bracket expression, the name of a character class enclosed in ‘[:’ and ‘:]’ stands for the list of all
       characters belonging to that class.  Standard character class names are:

              alnum       digit       punct
              alpha       graph       space
              blank       lower       upper
              cntrl       print       xdigit

       These  stand  for  the character classes defined in wctype(3).  A locale may provide others.  A character class
       may not be used as an endpoint of a range.
 

C语言中关于字符类的判断函数说明。

man 3 isalpha 写道
       isalnum()
              checks for an alphanumeric character; it is equivalent to (isalpha(c) || isdigit(c)).
       isalpha()
              checks  for  an  alphabetic  character;  in  the standard "C" locale, it is equivalent to (isupper(c) ||
              islower(c)).  In some locales, there may be additional characters for which  isalpha()  is  true—letters
              which are neither upper case nor lower case.
       isascii()
              checks whether c is a 7-bit unsigned char value that fits into the ASCII character set.
       isblank()
              checks for a blank character; that is, a space or a tab.
       iscntrl()
              checks for a control character.
       isdigit()
              checks for a digit (0 through 9).
       isgraph()
              checks for any printable character except space.
       islower()
              checks for a lower-case character.
       isprint()
              checks for any printable character including space.
       ispunct()
              checks for any printable character which is not a space or an alphanumeric character.
       isspace()
              checks  for white-space characters.  In the "C" and "POSIX" locales, these are: space, form-feed ( ?[1m\f ?,
              newline ( ?[1m\n ?, carriage return ( ?[1m\r ?, horizontal tab ( ?[1m\t ?, and vertical tab ( ?[1m\v ?.
       isupper()
              checks for an uppercase letter.
       isxdigit()
              checks for a hexadecimal digits, i.e. one of
              0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F.
 

要注意的是,POSIX.2正则表达式不支持类似Java中的字符类的写法,比如在Java中 \d表示匹配数字,\w表示匹配字母数字下划线。

 

正则表达式的匹配,从字符串中最早匹配的位置开始,到最长匹配结束,是匹配的长度越长越好,即贪婪匹配。

man 7 regex 写道
In the event that an RE could match more than one substring of a given string, the RE matches the one starting
earliest in the string. If the RE could match more than one substring starting at that point, it matches the
longest. Subexpressions also match the longest possible substrings, subject to the constraint that the whole
match be as long as possible, with subexpressions starting earlier in the RE taking priority over ones starting
later. Note that higher-level subexpressions thus take priority over their lower-level component subexpres-
sions.
 

匹配长度以字符数计算。即使只匹配空串,也被认为比完全不匹配要长。比如:

bb* 匹配 abbbc 的中间三个字符;

(wee|week)(khights|nights)  匹配 weeknights 整个串;

(.*).*  匹配 abc,其中(.*) 匹配 abc,剩下的 .* 匹配空串;

(a*)*  匹配 bc,其中 (a*)* 和 (a*) 都只匹配空串。

man 7 regex 写道
Match lengths are measured in characters, not collating elements. A null string is considered longer than no
match at all. For example, ‘bb*’ matches the three middle characters of ‘abbbc’, ‘(wee|week)(knights|nights)’
matches all ten characters of ‘weeknights’, when ‘(.*).*’ is matched against ‘abc’ the parenthesized subexpres-
sion matches all three characters, and when ‘(a*)*’ is matched against ‘bc’ both the whole RE and the parenthe-
sized subexpression match the null string.
 

关于不区分大小匹配的说明。x 匹配 x和X,相当于 [xX],而 [^x] 相当于 [^xX]。

man 7 regex 写道
If case-independent matching is specified, the effect is much as if all case distinctions had vanished from the
alphabet. When an alphabetic that exists in multiple cases appears as an ordinary character outside a bracket
expression, it is effectively transformed into a bracket expression containing both cases, e.g. ‘x’ becomes
‘[xX]’. When it appears inside a bracket expression, all case counterparts of it are added to the bracket
expression, so that (e.g.) ‘[x]’ becomes ‘[xX]’ and ‘[^x]’ becomes ‘[^xX]’.
 

正则表达式的长度限制,一般不超过256字节,但具体实现也可以不限定长度。

man 7 regex 写道
No particular limit is imposed on the length of REs(!). Programs intended to be portable should not employ REs
longer than 256 bytes, as an implementation can refuse to accept such REs and remain POSIX-compliant.
 

最后来讲 Obsolete RE (或 basic RE) 与 前面的 modern RE (或 extended RE)的区别:

在 basic RE 中, 竖线(|)、加号(+)、问号(?)是普通字符;

范围用 \{ \} 来表示,而 { } 只是普通字符;

子表达式用 \( \) 来表示,而 ( ) 只是普通字符;

当^不是开头、$不是结尾、*在开头时,它们是普通字符;

\非0数字的形式表示对前面匹配的子串的引用,比如 \([bc]\)\1 匹配 bb 或 cc,但不匹配 bc 。

man 7 regex 写道
Obsolete (‘‘basic’’) regular expressions differ in several respects. ‘|’, ‘+’, and ‘?’ are ordinary characters
and there is no equivalent for their functionality. The delimiters for bounds are ‘\{’ and ‘\}’, with ‘{’ and
‘}’ by themselves ordinary characters. The parentheses for nested subexpressions are ‘\(’ and ‘\)’, with ‘(’
and ‘)’ by themselves ordinary characters. ‘^’ is an ordinary character except at the beginning of the RE
or(!) the beginning of a parenthesized subexpression, ‘$’ is an ordinary character except at the end of the RE
or(!) the end of a parenthesized subexpression, and ‘*’ is an ordinary character if it appears at the beginning
of the RE or the beginning of a parenthesized subexpression (after a possible leading ‘^’). Finally, there is
one new type of atom, a back reference: ‘\’ followed by a non-zero decimal digit d matches the same sequence of
characters matched by the dth parenthesized subexpression (numbering subexpressions by the positions of their
opening parentheses, left to right), so that (e.g.) ‘\([bc]\)\1’ matches ‘bb’ or ‘cc’ but not ‘bc’.
 

重复前面说过的:“确认一个正则表达式是否正确的唯一方法就是去测试它 ”。

Advanced Bash-Scripting Guide: 18.1. A Brief Introduction to Regular Expressions 写道
The only way to be certain that a particular RE works is to test it.

 

本文链接:http://codingstandards.iteye.com/blog/1195592

 

 

3
3
分享到:
评论

相关推荐

    精通正则表达式~~~

    精通正则表达式第三版 搜集于网络 前言..........I 第1章:正则表达式入门.... 1 解决实际问题... 2 作为编程语言的正则表达式... 4 以文件名做类比... 4 以语言做类比... 5 正则表达式的知识框架... 6 对于...

    Indesign_GREP正则表达式

    ### Indesign_GREP正则表达式详解 #### 1. GREP正则表达式概述 在Adobe InDesign软件中,GREP(Global Regular Expression Print)正则表达式的使用能够极大地提高文档编辑效率,特别是在处理大量文本时。通过精确...

    linux下的C语言POSIX正则表达式头文件和源文件: regex.h regex.cpp

    POSIX正则表达式是符合IEEE Std 1003.1标准的一套规则,它为编程提供了强大的文本模式匹配功能。在本主题中,我们将探讨`regex.h`头文件和`regex.cpp`源文件,以及如何在Visual Studio 2010或2012环境下编译它们。 ...

    Oracle数据库正则表达式

    元字符提供算法来确定 Oracle 如何处理组成一个正则表达式的字符。常见的元字符有: * `.` 匹配一个正规表达式中的任意字符(除了换行符) * `^` 使表达式定位至一行的开头 * `$` 使表达式定位至一行的末尾 * `*` ...

    关于正则表达式的应用(正则表达式)

    正则表达式是一种强大的文本处理工具,用于在字符串中寻找匹配特定模式的子串。它在编程语言如PHP中有着广泛的应用,包括数据验证、文本提取、搜索与替换等任务。正则表达式通过一系列特殊字符(元字符)和结构来...

    php正则表达式.txt

    假设我们要验证一个电子邮件地址是否符合标准格式,可以使用以下正则表达式: ```php function validateEmail($email){ return preg_match("/^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/", $email); } ```...

    Oracle正则表达式详解(用法+实例)

    #### 三、常用的POSIX正则表达式元字符 - `^`: 匹配字符串的开始位置。 - `$`: 匹配字符串的结尾位置。 - `.`, `?`, `+`, `*`, `|`, `()`, `[]`, `{}`, `\`: 这些是基本的元字符,用于定义匹配规则。 - 特殊字符簇 ...

    Go-POSIX基本正则表达式伪随机字符串生成器

    本篇文章将详细讲解如何在Go语言中实现一个基于POSIX基本正则表达式的伪随机字符串生成器。POSIX基本正则表达式是正则表达式的一种标准形式,它为文本匹配提供了基础规则。 POSIX基本正则表达式包括一系列的元字符...

    php正则表达式深入浅出.docx

    条件测试是正则表达式中的一个重要概念,用于描述字符串的条件测试。 16. 为正则表达式添加注释 注释是正则表达式中的一个重要概念,用于描述正则表达式的注释。 深入浅出之正则表达式前言: 本文的作者在半年前...

    unix下的正则表达式

    2. **扩展正则表达式**(Extended Regular Expression, ERE):这是更现代的形式,由POSIX标准定义,提供了更多的元字符和功能,如egrep和awk支持的正则表达式。 #### 二、元字符及其功能 元字符是正则表达式的...

    [小小明]Python正则表达式全套笔记v0.3(1.8万字干货)

    正则表达式的本质就是用一些特定字符的组合,组成一个“规则字符串”表达对字符串的一种过滤逻辑,可以很方便的从指定的字符串中提取出我们想要的内容。 2.2 正则匹配规则表 2.2.1 基本字符规则 基本字符规则包括...

    JAVA正则表达式语法.txt

    根据给定文件的信息,我们可以详细地探讨JAVA正则表达式的语法和使用方法,这将包括基本的字符类、预定义类、POSIX类、特殊字符、Unicode支持以及匹配模式的行为等。 ### 1. 字符转义 在JAVA中,正则表达式中的...

    正则表达式 电子书 教程 chm

    本教程是专为初学者设计的,旨在提供一个简单易懂的入口点,帮助读者快速掌握正则表达式的基本概念和用法。 在正则表达式(Regular Expression,简称regex)的世界里,字符集、元字符和量词是核心概念。字符集包含...

    Oracle正则表达式函数全面解析

    本文将详细介绍Oracle数据库中支持的四个主要正则表达式函数:`REGEXP_LIKE`、`REGEXP_INSTR`、`REGEXP_SUBSTR`和`REGEXP_REPLACE`,以及如何使用POSIX正则表达式。 #### 二、Oracle正则表达式基础 ##### 1. POSIX...

    精通正则表达式(第三版)中文

    例如,`[a-zA-Z]+` 这个正则表达式用来匹配一个或多个连续出现的字母。 ### 本书主要内容 #### 第一部分:基础篇 - **第1章:介绍** 本章介绍了正则表达式的概念、历史和发展,以及它们的应用场景。 - **第2章:...

    shell正则表达式.txt

    =digit)`是一个正向前瞻断言,它检查其后的字符是否为数字,但不会匹配任何字符。 #### 替代与替换 在Shell脚本中,我们可以使用`sed`或`awk`命令结合正则表达式来进行文本的查找与替换。例如,`sed 's/old/new/g'...

    正则表达式在PHP中的应用.pdf

    正则表达式不仅仅是一个单一的字符串,而是定义了一个匹配规则集。例如,".php"可以匹配所有以.php结尾的字符串。 2. 正则表达式在PHP中的实现 PHP中,正则表达式主要通过内置的函数来实现,如`preg_match`用于...

    正则表达式在C语言中的应用

    正则表达式是一种强大的文本处理工具,用于匹配、查找、替换和分析字符串模式。...同时,为了更深入学习,建议研究POSIX正则表达式的其他高级特性,如反向引用、预查、非贪婪量词等,以及在不同情况下的优化策略。

    java正则表达式详解(PDF)

    它是基于Perl和POSIX正则表达式的实现,提供了一种灵活且强大的方式来处理文本数据。本文件"java正则表达式详解(PDF)"深入探讨了这一主题,下面将对其中的主要知识点进行详细介绍。 1. **正则表达式基本概念** -...

    正则表达式入门30分钟

    虽然原生支持并不像其他某些语言(如Perl或JavaScript)那么直接,但通过库函数如POSIX标准的`regcomp`和`regexec`,或是PCRE(Perl Compatible Regular Expressions)库,我们可以有效地利用正则表达式功能。...

Global site tag (gtag.js) - Google Analytics