1.6 Python
Python provides a rich, Perl-like regular expression syntax in the re module. The re module uses a Traditional NFA match engine. For an explanation of the rules behind an NFA engine, see Section 1.2.

This chapter covers the version of re included with Python 2.2, although the module has been available in similar form since Python 1.5.
1.6.1 Supported Metacharacters
The re module supports the metacharacters and metasequences listed in Table 1-21 through Table 1-25. For expanded definitions of each metacharacter, see Section 1.2.1.
Table 1-21. Character representations
Sequence | Meaning
\a | Alert (bell), x07.
\b | Backspace, x08, supported only in character class.
\n | Newline, x0A.
\r | Carriage return, x0D.
\f | Form feed, x0C.
\t | Horizontal tab, x09.
\v | Vertical tab, x0B.
\octal | Character specified by up to three octal digits.
\xhh | Character specified by a two-digit hexadecimal code.
\uhhhh | Character specified by a four-digit hexadecimal code.
\Uhhhhhhhh | Character specified by an eight-digit hexadecimal code.
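A few of the character representations from Table 1-21 can be tried out as follows. This is a minimal sketch that assumes a current Python 3 interpreter rather than the Python 2.2 release described here; the sample strings and variable names are made up for illustration.

```python
import re

# The patterns are raw strings, so the re module itself (not the Python
# string literal) interprets \t and \x41.
tab_match = re.search(r'a\tb', 'a\tb')
hex_match = re.search(r'\x41', 'ABC')        # x41 is the letter A

# Inside a character class \b is a backspace (x08); outside a class it
# is the word-boundary anchor.
backspace_match = re.search(r'[\b]', 'x\x08y')
boundary_match = re.search(r'\bcat\b', 'a cat sat')

print(repr(tab_match.group()), hex_match.group(), boundary_match.span())
```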
Table 1-22. Character classes and class-like constructs
Class | Meaning
[...] | Any character listed or contained within a listed range.
[^...] | Any character that is not listed and is not contained within a listed range.
. | Any character, except a newline (unless DOTALL mode).
\w | Word character, [a-zA-Z0-9_] (unless LOCALE or UNICODE mode).
\W | Non-word character, [^a-zA-Z0-9_] (unless LOCALE or UNICODE mode).
\d | Digit character, [0-9].
\D | Non-digit character, [^0-9].
\s | Whitespace character, [ \t\n\r\f\v].
\S | Nonwhitespace character, [^ \t\n\r\f\v].
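The class shorthands in Table 1-22 can be compared side by side on one string. The sketch below assumes Python 3 and deliberately uses ASCII-only sample text, so the results do not depend on whether \w and \d follow the ASCII or the Unicode definitions; the sample text is invented.

```python
import re

text = 'Order 66: ship_to=Naboo\n'

words = re.findall(r'\w+', text)          # runs of word characters
digits = re.findall(r'\d+', text)         # runs of digits
non_space = re.findall(r'\S+', text)      # runs of nonwhitespace
negated = re.findall(r'[^a-z\s]+', text)  # neither lowercase nor whitespace

# Dot stops at a newline unless DOTALL mode is on.
dot_default = re.search(r'.+', 'a\nb').group()
dot_all = re.search(r'.+', 'a\nb', re.DOTALL).group()

print(words, digits, non_space, negated)
```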
Table 1-23. Anchors and zero-width tests
Sequence | Meaning
^ | Start of string, or after any newline if in MULTILINE match mode.
\A | Start of search string, in all match modes.
$ | End of search string or before a string-ending newline, or before any newline in MULTILINE match mode.
\Z | End of string, in any match mode.
\b | Word boundary.
\B | Not-word-boundary.
(?=...) | Positive lookahead.
(?!...) | Negative lookahead.
(?<=...) | Positive lookbehind.
(?<!...) | Negative lookbehind.
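The effect of MULTILINE on ^ and $, and the zero-width nature of the lookaround tests in Table 1-23, can be seen in a short experiment. This is an illustrative sketch assuming Python 3; the sample strings are made up.

```python
import re

text = 'first line\nsecond line\n'

# ^ matches only at the start of the string unless MULTILINE is set;
# \A ignores MULTILINE.
starts = re.findall(r'^\w+', text)
starts_multiline = re.findall(r'^\w+', text, re.MULTILINE)
starts_absolute = re.findall(r'\A\w+', text, re.MULTILINE)

# Without MULTILINE, $ matches only at the end of the string or just
# before the string-ending newline.
last_line = re.search(r'line$', text)

# Lookarounds test the surrounding text without consuming it.
after_dollar = re.findall(r'(?<=\$)\d+', 'cost $40, was $55, 12 left')
before_colon = re.findall(r'\w+(?=:)', 'key: value')
not_followed = re.search(r'foo(?!bar)', 'foobar foobaz')

print(starts, starts_multiline, last_line.span(), after_dollar)
```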
Table 1-24. Comments and mode modifiers
Modifier/sequence | Mode character | Meaning
I or IGNORECASE or (?i) | i | Case-insensitive matching.
L or LOCALE or (?L) | L | Cause \w, \W, \b, and \B to use current locale's definition of alphanumeric.
M or MULTILINE or (?m) | m | ^ and $ match next to embedded \n.
S or DOTALL or (?s) | s | Dot (.) matches newline.
U or UNICODE or (?u) | u | Cause \w, \W, \b, and \B to use Unicode definition of alphanumeric.
X or VERBOSE or (?x) | x | Ignore whitespace and allow comments (#) in pattern.
(?mode) | | Turn listed modes (iLmsux) on for the entire regular expression.
(?#...) | | Treat substring as a comment.
#... | | Treat rest of line as a comment in VERBOSE mode.
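Modes from Table 1-24 can be set either through the flags argument or inline at the start of the pattern. The following sketch assumes Python 3; the phone-number pattern and sample strings are hypothetical and only serve to show the syntax.

```python
import re

# The same mode, given as a flag and as an inline modifier.
flag_match = re.search(r'spider-man', 'SPIDER-MAN', re.IGNORECASE)
inline_match = re.search(r'(?i)spider-man', 'SPIDER-MAN')

# Several modes are combined with the | operator.
b_words = re.findall(r'^b\w+', 'Apple\nBanana\nberry',
                     re.IGNORECASE | re.MULTILINE)

# VERBOSE mode ignores pattern whitespace and allows # comments.
phone = re.compile(r'''
    (\d{3})    # exchange
    [-.\s]     # separator
    (\d{4})    # line number
''', re.VERBOSE)
phone_match = phone.search('call 555-1234 now')

# (?#...) embeds a comment without needing VERBOSE mode.
commented = re.search(r'ab(?#this is ignored)c', 'abc')

print(b_words, phone_match.groups())
```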
Table 1-25. Grouping, capturing, conditional, and control
Sequence | Meaning
(...) | Group subpattern and capture submatch into \1, \2, ...
(?P<name>...) | Group subpattern and capture submatch into named capture group, name.
(?P=name) | Match text matched by earlier named capture group, name.
\n | Contains the results of the nth earlier submatch.
(?:...) | Groups subpattern, but does not capture submatch.
...|... | Try subpatterns in alternation.
* | Match 0 or more times.
+ | Match 1 or more times.
? | Match 1 or 0 times.
{n} | Match exactly n times.
{x,y} | Match at least x times but no more than y times.
*? | Match 0 or more times, but as few times as possible.
+? | Match 1 or more times, but as few times as possible.
?? | Match 0 or 1 time, but as few times as possible.
{x,y}? | Match at least x times, no more than y times, and as few times as possible.
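The grouping constructs and the greedy versus minimal quantifiers in Table 1-25 are easiest to compare on small inputs. This is a minimal sketch assuming Python 3, with invented sample strings.

```python
import re

# Named capture group plus a named backreference: find a doubled word.
doubled = re.search(r'\b(?P<word>\w+)\s+(?P=word)\b', 'it was the the best')

# Numbered backreference: the closing quote must equal the opening one.
quoted = re.search(r'''(['"])(.*?)\1''', 'say "hi" now')

# (?:...) groups for the quantifier without creating a capture.
repeated = re.findall(r'(?:ab){2}', 'ababab')

# Greedy quantifiers take as much as possible, minimal ones as little.
html = '<b>bold</b> and <i>italic</i>'
greedy = re.findall(r'<.+>', html)
minimal = re.findall(r'<.+?>', html)

bounded = re.findall(r'\d{2,3}', '1 22 333 4444')
bounded_minimal = re.findall(r'\d{2,3}?', '1 22 333 4444')

print(doubled.group('word'), quoted.group(2), minimal)
```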
1.6.2 re Module Objects and Functions
The re module defines all regular expression functionality. Pattern matching is done directly through module functions, or patterns are compiled into regular expression objects that can be used for repeated pattern matching. Information about the match, including captured groups, is retrieved through match objects.

Python's raw string syntax, r'' or r"", allows you to specify regular expression patterns without having to escape embedded backslashes. The raw-string pattern, r'\n', is equivalent to the regular string pattern, '\\n'. Python also provides triple-quoted raw strings for multiline regular expressions: r'''text''' and r"""text""".
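The equivalence between raw and regular string patterns can be checked directly. The sketch below assumes Python 3; the Windows-style path and the variable names are illustrative only.

```python
import re

raw = r'\n'
regular = '\\n'

# Both literals produce the same two-character pattern: a backslash and an n.
same_pattern = (raw == regular)
pattern_length = len(raw)

# The re module then reads that pattern as "match a newline".
newline_match = re.search(raw, 'line one\nline two')

# Matching one literal backslash takes two in a raw string and four in a
# regular string.
raw_backslash = re.search(r'\\', 'C:\\temp')
regular_backslash = re.search('\\\\', 'C:\\temp')

# A triple-quoted raw string keeps a multiline pattern readable.
number = re.compile(r'''
    \d+        # integer part
    (?:\.\d+)? # optional fraction
''', re.VERBOSE)

print(same_pattern, pattern_length, number.search('pi is 3.14').group())
```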
The re module defines the following functions and one exception.

compile(pattern [, flags])
Return a regular expression object with the optional mode modifiers, flags.

match(pattern, string [, flags])
Search for pattern at starting position of string, and return a match object or None if no match.

search(pattern, string [, flags])
Search for pattern in string, and return a match object or None if no match.

split(pattern, string [, maxsplit=0])
Split string on pattern. Limit the number of splits to maxsplit. Submatches from capturing parentheses are also returned.

sub(pattern, repl, string [, count=0])
Return a string with all or up to count occurrences of pattern in string replaced with repl. repl may be either a string or a function that takes a match object argument.

subn(pattern, repl, string [, count=0])
Perform sub( ) but return a tuple of the new string and the number of replacements.

findall(pattern, string)
Return matches of pattern in string. If pattern has capturing groups, returns a list of submatches or a list of tuples of submatches.

finditer(pattern, string)
Return an iterator over matches of pattern in string. For each match, the iterator returns a match object.

escape(string)
Return string with nonalphanumerics backslashed so that string can be matched literally.

exception error
Exception raised if an error occurs during compilation or matching. This is common if a string passed to a function is not a valid regular expression.
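The module-level functions listed above can be exercised in a few lines. This is a minimal sketch assuming Python 3; the helper function double and all sample strings are hypothetical.

```python
import re

text = 'one, two;three'

# split: capturing parentheses in the pattern put the separators in the result.
parts = re.split(r'[,;]\s*', text)
parts_with_seps = re.split(r'([,;])\s*', text)
first_split_only = re.split(r'[,;]\s*', text, maxsplit=1)

# sub: repl can be a function that receives each match object.
def double(match):
    return str(int(match.group()) * 2)

doubled = re.sub(r'\d+', double, '3 apples, 10 pears')

# subn: the same substitution, plus the number of replacements made.
new_string, replacements = re.subn(r'a', 'o', 'banana')

# findall returns tuples when the pattern has more than one group;
# finditer yields match objects instead.
pairs = re.findall(r'(\w+)=(\d+)', 'a=1 b=2')
positions = [m.start() for m in re.finditer(r'a', 'banana')]

# escape: match text literally even though it contains metacharacters.
literal = re.search(re.escape('1+1=2'), 'so 1+1=2 holds')

# error: raised when the pattern is not a valid regular expression.
try:
    re.compile(r'(unclosed')
    compile_failed = False
except re.error:
    compile_failed = True

print(parts_with_seps, doubled, (new_string, replacements))
```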
Regular expression objects are created with the re.compile function.

flags
Return the flags argument used when the object was compiled or 0.

groupindex
Return a dictionary that maps symbolic group names to group numbers.

pattern
Return the pattern string used when the object was compiled.

match(string [, pos [, endpos]])
search(string [, pos [, endpos]])
split(string [, maxsplit=0])
sub(repl, string [, count=0])
subn(repl, string [, count=0])
findall(string)
Same as the re module functions, except pattern is implied. pos and endpos give start and end string indexes for the match.
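A compiled object exposes its pattern and group names, and its match and search methods accept pos and endpos. The sketch below assumes Python 3, where flags may report extra bits beyond the ones passed in, so it only tests for the IGNORECASE bit; the key=value sample text is invented.

```python
import re

regex = re.compile(r'(?P<key>\w+)=(?P<value>\d+)', re.IGNORECASE)

source_pattern = regex.pattern
value_group_number = regex.groupindex['value']
ignores_case = bool(regex.flags & re.IGNORECASE)

text = 'a=1 b=2 c=3'

# pos sets where matching starts; match() is anchored at pos, not at index 0.
from_pos = regex.search(text, 4)
anchored_at_pos = regex.match(text, 4)
later = regex.search(text, 5)

# endpos limits how much of the string is visible to the pattern.
truncated = regex.search(text, 0, 2)

all_pairs = regex.findall(text)

print(from_pos.group(), later.group(), truncated, all_pairs)
```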
Match objects are created by the match and search functions.

pos
endpos
Value of pos or endpos passed to search or match.

re
The regular expression object whose match or search returned this object.

string
String passed to match or search.

group([g1, g2, ...])
Return one or more submatches from capturing groups. Groups may be either numbers corresponding to capturing groups or strings corresponding to named capturing groups. Group zero corresponds to the entire match. If no arguments are provided, this function returns the entire match. Capturing groups that did not match have a result of None.

groups([default])
Return a tuple of the results of all capturing groups. Groups that did not match have the value None or default.

groupdict([default])
Return a dictionary of named capture groups, keyed by group name. Groups that did not match have the value None or default.

start([group])
Index of start of substring matched by group (or start of entire matched string if no group).

end([group])
Index of end of substring matched by group (or end of entire matched string if no group).

span([group])
Return a tuple of starting and ending indexes of group (or matched string if no group).

expand(template)
Return a string obtained by doing backslash substitution on template. Character escapes, numeric backreferences, and named backreferences are expanded.

lastgroup
Name of the last matching capture group, or None if no match or if the group had no name.

lastindex
Index of the last matching capture group, or None if no match.
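The match object attributes and methods can be inspected on a pattern with one optional group that does not participate in the match. This is an illustrative sketch assuming Python 3; the date pattern and the sample sentence are made up.

```python
import re

regex = re.compile(
    r'(?P<month>\d\d)[-/](?P<day>\d\d)(?:[-/](?P<year>\d{2,4}))?')
text = 'Due 12/30 sharp'
m = regex.search(text)

whole = m.group()                 # same as m.group(0)
month_day = m.group(1, 2)         # several groups at once give a tuple
missing_year = m.group('year')    # the optional group did not match

all_groups = m.groups()
with_default = m.groups('????')
as_dict = m.groupdict('n/a')

indexes = (m.start(), m.end(), m.span('day'))

# expand: numeric and named backreferences in the template are filled in.
reordered = m.expand(r'\g<day>/\1')

last = (m.lastindex, m.lastgroup)
origin = (m.re is regex, m.string == text, m.pos, m.endpos)

print(whole, all_groups, reordered, last)
```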
1.6.3 Unicode Support
re provides limited Unicode support. Strings may contain Unicode characters, and individual Unicode characters can be specified with \u. Additionally, the UNICODE flag causes \w, \W, \b, and \B to recognize all Unicode alphanumerics. However, re does not provide support for matching Unicode properties, blocks, or categories.
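The effect of the Unicode definition of alphanumeric on \w can be sketched as follows. The example assumes Python 3, which differs from the Python 2.2 behavior described above: string patterns there use the Unicode definitions by default, and the re.ASCII flag (not part of the version covered here) restores the narrower ASCII definition for comparison. The sample text is invented.

```python
import re

# \u escapes in the string literal specify individual Unicode characters.
text = 'caf\u00e9 na\u00efve'

# With the Unicode definition, accented letters count as word characters.
unicode_words = re.findall(r'\w+', text, re.UNICODE)

# Assumption: re.ASCII is a Python 3 flag used here only for contrast.
ascii_words = re.findall(r'\w+', text, re.ASCII)

# A single character given by its code point can be searched for directly.
e_acute = re.search('\u00e9', text)

print(unicode_words, ascii_words, e_acute.start())
```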
1.6.4 Examples
Example 1-13. Simple match
# Match Spider-Man, Spiderman, SPIDER-MAN, etc.
import re
dailybugle = 'Spider-Man Menaces City!'
pattern = r'spider[- ]?man.'
if re.match(pattern, dailybugle, re.IGNORECASE):
    print(dailybugle)
Example 1-14. Match and capture group
# Match dates formatted like MM/DD/YYYY, MM-DD-YY,...
import re
date = '12/30/1969'
regex = re.compile(r'(\d\d)[-/](\d\d)[-/](\d\d(?:\d\d)?)')
match = regex.match(date)
if match:
    month = match.group(1)  # 12
    day = match.group(2)    # 30
    year = match.group(3)   # 1969
Example 1-15. Simple substitution
# Convert <br> to <br /> for XHTML compliance
import re
text = 'Hello world. <br>'
regex = re.compile(r'<br>', re.IGNORECASE)
repl = r'<br />'
result = regex.sub(repl, text)
Example 1-16. Harder substitution
# urlify - turn URLs into HTML links
import re
text = 'Check the website, http://www.oreilly.com/catalog/repr.'
pattern = r'''
    \b                         # start at word boundary
    (                          # capture to \1
    (https?|telnet|gopher|file|wais|ftp) :
                               # resource and colon
    [\w/#~:.?+=&%@!\-] +?      # one or more valid chars
                               # take as little as possible
    )
    (?=                        # lookahead
    [.:?\-] *                  # for possible punc
    (?: [^\w/#~:.?+=&%@!\-]    # invalid character
    | $ )                      # or end of string
    )'''
regex = re.compile(pattern, re.IGNORECASE | re.VERBOSE)
result = regex.sub(r'<a href="\1">\1</a>', text)
1.6.5 Other Resources