DiveIntoPython (6)

English book URL:
http://diveintopython.org/toc/index.html

Chapter 7. Regular Expressions
7.1.Diving In
Strings have methods for searching (index, find, and count), replacing (replace), and parsing (split), but they are limited to the simplest of cases. The search methods look for a single, hard-coded substring, and they are always case-sensitive. To do case-insensitive searches of a string s, you must call s.lower() or s.upper() and make sure your search strings are the appropriate case to match. The replace and split methods have the same limitations.
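
As a small illustration of the case-sensitivity limitation described above (the sample address is my own, not from the book), the sketch below searches a mixed-case string with str.find, first directly and then after normalizing the case:

```python
s = '100 North Main Road'

exact = s.find('ROAD')            # case-sensitive: the uppercase needle is not found
folded = s.upper().find('ROAD')   # normalize the haystack to one case first

print(exact, folded)   # -1 15
```

The workaround does the job, but you have to remember to normalize both sides every time.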

7.2.Case Study:Street Addresses
example 7.1.Matching at the End of a String
>>> s = '100 NORTH MAIN ROAD'
>>> s.replace('ROAD','RD.')
'100 NORTH MAIN RD.'
>>> s = '100 NORTH BROAD ROAD'
>>> s.replace('ROAD','RD.')
'100 NORTH BRD. RD.'
>>> s[:-4] + s[-4:].replace('ROAD','RD.')
'100 NORTH BROAD RD.'
>>> import re
>>> re.sub('ROAD$','RD.',s)
'100 NORTH BROAD RD.'

My goal is to standardize a street address so that 'ROAD' is always abbreviated as 'RD.'. At first glance, I thought this was simple enough that I could just use the string method replace. After all, all the data was already uppercase, so case mismatches would not be a problem. And the search string, 'ROAD', was a constant. And in this deceptively simple example, s.replace does indeed work.

The problem here is that 'ROAD' appears twice in the address, once as part of the street name 'BROAD' and once as its own word. The replace method sees these two occurrences and blindly replaces both of them; meanwhile, I see my addresses getting destroyed.

Take a look at the first parameter: 'ROAD$'. This is a simple regular expression that matches 'ROAD' only when it occurs at the end of a string. The $ means “end of the string”. (There is a corresponding character, the caret ^, which means “beginning of the string”.)

Using the re.sub function, you search the string s for the regular expression 'ROAD$' and replace it with 'RD.'. This matches the ROAD at the end of the string s, but does not match the ROAD that's part of the word BROAD, because that's in the middle of s.
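
To see the two approaches side by side, here is a minimal sketch using the same address as the example. The variable names naive and anchored are mine, chosen only for illustration.

```python
import re

s = '100 NORTH BROAD ROAD'

naive = s.replace('ROAD', 'RD.')       # replaces every occurrence, including the one in BROAD
anchored = re.sub('ROAD$', 'RD.', s)   # replaces only the occurrence at the end of the string

print(naive)      # 100 NORTH BRD. RD.
print(anchored)   # 100 NORTH BROAD RD.
```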

example 7.2.Matching Whole Words
>>> s = '100 BROAD'
>>> re.sub('ROAD$','RD.',s)
'100 BRD.'
>>> re.sub('\\bROAD$','RD.',s)
'100 BROAD'
>>> re.sub(r'\bROAD$','RD.',s)
'100 BROAD'
>>> s = '100 BROAD ROAD APT. 3'
>>> re.sub(r'\bROAD$','RD.',s)
'100 BROAD ROAD APT. 3'
>>> re.sub(r'\bROAD\b','RD.',s)
'100 BROAD RD. APT. 3'

What I really wanted was to match 'ROAD' when it was at the end of the string and it was its own whole word, not a part of some larger word. To express this in a regular expression, you use \b, which means “a word boundary must occur right here”. In Python, this is complicated by the fact that the '\' character in a string must itself be escaped. This is sometimes referred to as the backslash plague, and it is one reason why regular expressions are easier in Perl than in Python.

To work around the backslash plague, you can use what is called a raw string, by prefixing the string with the letter r. This tells Python that nothing in this string should be escaped; '\t' is a tab character, but r'\t' is really the backslash character \ followed by the letter t. I recommend always using raw strings when dealing with regular expressions; otherwise, things get too confusing too quickly.
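
The difference is easy to check by looking at string lengths. The sketch below (variable names are my own) compares a plain '\b', the doubled-backslash form, and the raw-string form, then uses each in re.sub:

```python
import re

plain = '\b'      # one character: a backspace
escaped = '\\b'   # two characters: a backslash and the letter b
raw = r'\b'       # the same two characters, written as a raw string

print(len(plain), len(escaped), len(raw))   # 1 2 2

s = '100 BROAD ROAD APT. 3'
with_raw = re.sub(r'\bROAD\b', 'RD.', s)    # \b reaches the regex engine as "word boundary"
with_plain = re.sub('\bROAD\b', 'RD.', s)   # pattern contains backspace characters, so nothing matches

print(with_raw)     # 100 BROAD RD. APT. 3
print(with_plain)   # 100 BROAD ROAD APT. 3
```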

7.3.Case Study:Roman Numerals
In Roman numerals, there are seven characters that are repeated and combined in various ways to represent numbers.
I = 1
V = 5
X = 10
L = 50
C = 100
D = 500
M = 1000

Characters are additive. I is 1, II is 2, and III is 3. VI is 6 (literally, “5 and 1”), VII is 7, and VIII is 8.

The tens characters (I, X, C, and M) can be repeated up to three times. At 4, you need to subtract from the next highest fives character. You can't represent 4 as IIII; instead, it is represented as IV (“1 less than 5”). The number 40 is written as XL (10 less than 50), 41 as XLI, 42 as XLII, 43 as XLIII, and then 44 as XLIV (10 less than 50, then 1 less than 5).

The fives characters can not be repeated. The number 10 is always represented as X, never as VV. The number 100 is always C, never LL.

Roman numerals are always written highest to lowest, and read left to right, so the order of the characters matters very much. DC is 600; CD is a completely different number (400, 100 less than 500). CI is 101; IC is not even a valid Roman numeral (because you can't subtract 1 directly from 100; you would need to write it as XCIX, for 10 less than 100, then 1 less than 10).
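
The additive and subtractive rules can be sketched as a tiny converter. This is not code from the book at this point; roman_to_int and VALUES are hypothetical names of my own, and the function assumes its input is already a well-formed numeral (it does no validation, which is what the regular expressions below are for).

```python
# Character values from the table above.
VALUES = {'I': 1, 'V': 5, 'X': 10, 'L': 50, 'C': 100, 'D': 500, 'M': 1000}

def roman_to_int(numeral):
    """Add up the characters, subtracting any that precede a larger one."""
    total = 0
    for i, ch in enumerate(numeral):
        value = VALUES[ch]
        # Value of the next character, or 0 when ch is the last one.
        following = VALUES[numeral[i + 1]] if i + 1 < len(numeral) else 0
        total += -value if value < following else value
    return total

print(roman_to_int('VIII'))   # 8
print(roman_to_int('XLIV'))   # 44
print(roman_to_int('DC'))     # 600
print(roman_to_int('CD'))     # 400
print(roman_to_int('XCIX'))   # 99
```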

7.3.1.Checking for Thousands
example 7.3.Checking for Thousands
>>> import re
>>> pattern = '^M?M?M?$'
>>> re.search(pattern,'M')
<_sre.SRE_Match object at 0x01373E90>
>>> re.search(pattern,'MM')
<_sre.SRE_Match object at 0x01373E58>
>>> re.search(pattern,'MMM')
<_sre.SRE_Match object at 0x01373E20>
>>> re.search(pattern,'MMMM')
>>> re.search(pattern,'')
<_sre.SRE_Match object at 0x01373E58>

This pattern has three parts:

^ matches what follows only at the beginning of the string. If this were not specified, the pattern would match no matter where the M characters were, which is not what you want. You want to make sure that the M characters, if they're there, are at the beginning of the string.

M? optionally matches a single M character. Since this is repeated three times, you're matching anywhere from zero to three M characters in a row.

$ matches what precedes only at the end of the string. When combined with the ^ character at the beginning, this means that the pattern must match the entire string, with no other characters before or after the M characters.

The essence of the re module is the search function, which takes a regular expression (pattern) and a string ('M') to try to match against the regular expression. If a match is found, search returns an object which has various methods to describe the match; if no match is found, search returns None, the Python null value.
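
Because search returns either a match object or None, the usual way to use it is to test the result. A minimal sketch (the results dictionary and the loose variable are my own additions):

```python
import re

pattern = '^M?M?M?$'

# Map each candidate to True/False depending on whether search found a match.
results = {}
for candidate in ['', 'M', 'MMM', 'MMMM']:
    results[candidate] = re.search(pattern, candidate) is not None
print(results)

# Without ^ and $, the pattern can match an empty stretch anywhere,
# so even a string with no M characters produces a match object.
loose = re.search('M?M?M?', 'XYZ')
print(loose is not None, repr(loose.group()))   # True ''
```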

7.3.2.Checking for Hundreds
100 = C
200 = CC
300 = CCC
400 = CD
500 = D
600 = DC
700 = DCC
800 = DCCC
900 = CM

example 7.4.Checking for Hundreds
>>> import re
>>> pattern = '^M?M?M?(CM|CD|D?C?C?C?)$'
>>> re.search(pattern,'MCM')
<_sre.SRE_Match object at 0x0132F4A0>
>>> re.search(pattern,'MD')
<_sre.SRE_Match object at 0x013752E0>
>>> re.search(pattern,'MMMCCC')
<_sre.SRE_Match object at 0x0132F4A0>
>>> re.search(pattern,'MCMC')

This pattern starts out the same as the previous one, checking for the beginning of the string (^) and then the thousands place (M?M?M?). Then it has the new part, in parentheses, which defines a set of three mutually exclusive patterns, separated by vertical bars: CM, CD, and D?C?C?C? (which is an optional D followed by zero to three optional C characters). The regular expression parser checks for each of these patterns in order (from left to right), takes the first one that matches, and ignores the rest.
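
To see which alternative was taken, you can ask the match object for the text captured by the parentheses. The sketch below uses group(1), which returns what the first parenthesized group matched; the hundreds helper is a hypothetical name of mine.

```python
import re

pattern = '^M?M?M?(CM|CD|D?C?C?C?)$'

def hundreds(numeral):
    """Return the hundreds part captured by the group, or None if there is no match."""
    match = re.search(pattern, numeral)
    return match.group(1) if match else None

print(hundreds('MCM'))      # CM
print(hundreds('MD'))       # D
print(hundreds('MMMCCC'))   # CCC
print(hundreds('MCMC'))     # None
```

In 'MCMC' the first three characters match as M plus CM, but the trailing C is left over, and no other alternative can account for the whole string.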

7.4.Using the {n,m} Syntax
example 7.6.The New Way:From n to m
>>> pattern = '^M{0,3}$'
>>> re.search(pattern,'M')
<_sre.SRE_Match object at 0x01373F70>
>>> re.search(pattern,'MMM')
<_sre.SRE_Match object at 0x0137A058>

This pattern says: “Match the start of the string, then anywhere from zero to three M characters, then the end of the string.” The 0 and 3 can be any numbers; if you want to match at least one but no more than three M characters, you could say M{1,3}.
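
A quick way to convince yourself that M{0,3} behaves like M?M?M? is to run both against the same inputs. This is only a sketch; the variable names are my own.

```python
import re

old_way = '^M?M?M?$'
new_way = '^M{0,3}$'
at_least_one = '^M{1,3}$'

candidates = ['', 'M', 'MM', 'MMM', 'MMMM']

# Both spellings agree on every candidate.
agree = all(
    bool(re.search(old_way, c)) == bool(re.search(new_way, c))
    for c in candidates
)
print(agree)   # True

# With {1,3} the empty string no longer matches.
print(re.search(at_least_one, '') is None)   # True
print(re.search(at_least_one, 'MM') is not None)   # True
```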

7.4.1.Checking for Tens and Ones
example 7.7.Checking for Tens
>>> pattern = '^M?M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)$'
>>> re.search(pattern, 'MCMXL')   

example 7.8.Validating Roman Numerals with {n,m}
>>> pattern = '^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$'
>>> re.search(pattern, 'MDLV')
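
Wrapped in a function, the pattern from example 7.8 becomes a small validator. The sketch below assumes nothing beyond the pattern itself; is_roman and ROMAN are hypothetical names I picked for illustration.

```python
import re

# Same pattern as example 7.8, compiled once for reuse.
ROMAN = re.compile('^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$')

def is_roman(text):
    """Return True if text matches the Roman numeral pattern."""
    return ROMAN.search(text) is not None

print(is_roman('MDLV'))               # True
print(is_roman('MCMLXXXIX'))          # True
print(is_roman('MMMMDCCCLXXXVIII'))   # True
print(is_roman('IIII'))               # False
print(is_roman('IC'))                 # False
print(is_roman(''))                   # True
```

Note the last result: every part of the pattern is optional, so the empty string matches too. A real validator would need to reject it separately.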

7.5.Verbose Regular Expressions
So far you've just been dealing with what I'll call “compact” regular expressions. As you've seen, they are difficult to read, and even if you figure out what one does, that's no guarantee that you'll be able to understand it six months later. What you really need is inline documentation.

A verbose regular expression is different from a compact regular expression in two ways:
Whitespace is ignored. Spaces, tabs, and carriage returns are not matched as spaces, tabs, and carriage returns. They're not matched at all. (If you want to match a space in a verbose regular expression, you'll need to escape it by putting a backslash in front of it.)

Comments are ignored. A comment in a verbose regular expression is just like a comment in Python code: it starts with a # character and goes until the end of the line. In this case it's a comment within a multi-line string instead of within your source code, but it works the same way.

example 7.9.Regular Expressions with Inline Comments
>>> pattern = """
    ^                   # beginning of string
    M{0,4}              # thousands - 0 to 4 M's
    (CM|CD|D?C{0,3})    # hundreds - 900 (CM), 400 (CD), 0-300 (0 to 3 C's),
                        #            or 500-800 (D, followed by 0 to 3 C's)
    (XC|XL|L?X{0,3})    # tens - 90 (XC), 40 (XL), 0-30 (0 to 3 X's),
                        #        or 50-80 (L, followed by 0 to 3 X's)
    (IX|IV|V?I{0,3})    # ones - 9 (IX), 4 (IV), 0-3 (0 to 3 I's),
                        #        or 5-8 (V, followed by 0 to 3 I's)
    $                   # end of string
    """
>>> re.search(pattern, 'M', re.VERBOSE)
<_sre.SRE_Match object at 0x01376110>
>>> re.search(pattern, 'MCMLXXXIX', re.VERBOSE)
<_sre.SRE_Match object at 0x01321F20>
>>> re.search(pattern, 'MMMMDCCCLXXXVIII', re.VERBOSE)
<_sre.SRE_Match object at 0x01376070>
>>> re.search(pattern, 'M')  

The most important thing to remember when using verbose regular expressions is that you need to pass an extra argument when working with them: re.VERBOSE is a constant defined in the re module that signals that the pattern should be treated as a verbose regular expression.
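
The sketch below shows both rules with patterns shortened for illustration (these are my own, not the book's): the same commented pattern matches with re.VERBOSE and fails without it, and a space has to be escaped with a backslash if it should be matched literally.

```python
import re

pattern = r"""
    ^         # beginning of string
    M{0,4}    # thousands: 0 to 4 M characters
    $         # end of string
"""

with_flag = re.search(pattern, 'MM', re.VERBOSE)
without_flag = re.search(pattern, 'MM')   # whitespace and comments are taken literally

print(with_flag is not None)   # True
print(without_flag is None)    # True

# In verbose mode only the backslash-escaped space is matched;
# the unescaped spaces around it are ignored.
spaced = re.compile(r"""
    MAIN \  ROAD    # the backslash keeps one literal space
""", re.VERBOSE)

print(spaced.search('100 MAIN ROAD') is not None)   # True
print(spaced.search('100 MAINROAD') is None)        # True
```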

7.6.Case study:Parsing Phone Numbers
So far you've concentrated on matching whole patterns. Either the pattern matches, or it doesn't. But regular expressions are much more powerful than that. When a regular expression does match, you can pick out specific pieces of it. You can find out what matched where.

Here are the phone numbers I needed to be able to accept:
800-555-1212
800 555 1212
800.555.1212
(800) 555-1212
1-800-555-1212
800-555-1212-1234
800-555-1212x1234
800-555-1212 ext. 1234
work 1-(800) 555.1212 #1234

Quite a variety! In each of these cases, I need to know that the area code was 800, the trunk was 555, and the rest of the phone number was 1212. For those with an extension, I need to know that the extension was 1234.

example 7.10.Finding Numbers
>>> phonePattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})$')
>>> phonePattern.search('800-555-1212').groups()
('800', '555', '1212')

What's \d{3}? Well, the {3} means “match exactly three numeric digits”; it's a variation on the {n,m} syntax you saw earlier. \d means “any numeric digit” (0 through 9). Putting it in parentheses means “match exactly three numeric digits, and then remember them as a group that I can ask for later”.
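
Besides groups(), the match object also lets you ask for one group at a time. Here is a minimal sketch; note that search returns None when nothing matches, so calling groups() directly on the result would raise an AttributeError, and it is safer to check first.

```python
import re

phonePattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})$')

match = phonePattern.search('800-555-1212')
print(match.groups())   # ('800', '555', '1212')
print(match.group(1))   # 800
print(match.group(0))   # 800-555-1212  (group 0 is the whole match)

# This number has an extension, which the pattern does not allow for.
no_match = phonePattern.search('800-555-1212-1234')
print(no_match)         # None
if no_match is not None:
    print(no_match.groups())
```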

example 7.11.Finding the Extension
>>> phonePattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})-(\d+)$')
>>> phonePattern.search('800-555-1212-12345').groups()
('800', '555', '1212', '12345')
>>> phonePattern.search('800-555-1212')
>>>

example 7.12.Handling Different Separators
>>> phonePattern = re.compile(r'^(\d{3})\D+(\d{3})\D+(\d{4})\D+(\d+)$')
>>> phonePattern.search('800 500 1211 1234').groups()
('800', '500', '1211', '1234')
>>> phonePattern.search('800-500-1211-1234').groups()
('800', '500', '1211', '1234')

\D+. What the heck is that? Well, \D matches any character except a numeric digit, and + means “1 or more”. So \D+ matches one or more characters that are not digits.

Using \D+ instead of - means you can now match phone numbers where the parts are separated by spaces instead of hyphens.

example 7.13.Handling Numbers Without Separators
>>> phonePattern = re.compile(r'^(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')
>>> phonePattern.search('80055512121234').groups()
('800', '555', '1212', '1234')
>>> phonePattern.search('800.555.1212 x1234').groups()
('800', '555', '1212', '1234')
>>> phonePattern.search('800-555-1212').groups()
('800', '555', '1212', '')
>>> phonePattern.search('(800)5551212 x1234')
>>>

Remember that + means “1 or more”? Well, * means “zero or more”. So now you should be able to parse phone numbers even when there is no separator character at all.
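
To compare the two quantifiers directly, the sketch below runs the \D+ pattern from example 7.12 and the \D* pattern from example 7.13 over the same inputs. The helper groups_or_none is a hypothetical name of mine.

```python
import re

plus = re.compile(r'^(\d{3})\D+(\d{3})\D+(\d{4})\D+(\d+)$')
star = re.compile(r'^(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')

def groups_or_none(pattern, text):
    """Return the captured groups, or None when the pattern does not match."""
    match = pattern.search(text)
    return match.groups() if match else None

for text in ['800 555 1212 1234', '80055512121234', '800-555-1212']:
    print(text, groups_or_none(plus, text), groups_or_none(star, text))
```

The + version requires at least one separator character and at least one extension digit, so it rejects the last two inputs; the * version accepts all three.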

example 7.14.Handling Leading Characters
>>> phonePattern = re.compile(r'^\D*(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')
>>> phonePattern.search('(800)5551212 ext. 1234').groups()
('800', '555', '1212', '1234')
>>> phonePattern.search('800-555-1212').groups()
('800', '555', '1212', '')
>>> phonePattern.search('work 1-(800) 555.1212 #1234')
>>>

Why doesn't this phone number match? Because there's a 1 before the area code, but you assumed that all the leading characters before the area code were non-numeric characters (\D*). Aargh.

example 7.15.Phone Number,Wherever I May Find Ye
>>> phonePattern = re.compile(r'(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')
>>> phonePattern.search('work 1-(800) 555.1212 #1234').groups()
('800', '555', '1212', '1234')
>>> phonePattern.search('800-555-1212')
<_sre.SRE_Match object at 0x0145A338>
>>> phonePattern.search('800-555-1212').groups()
('800', '555', '1212', '')
>>> phonePattern.search('80055512121234')
<_sre.SRE_Match object at 0x0145A338>
>>> phonePattern.search('80055512121234').groups()
('800', '555', '1212', '1234')

Note the lack of ^ in this regular expression. You are not matching the beginning of the string anymore. There's nothing that says you need to match the entire input with your regular expression. The regular expression engine will do the hard work of figuring out where the input string starts to match, and go from there.
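
The match object can tell you where the engine ended up starting. The sketch below compares the anchored pattern from example 7.14 with the unanchored one from example 7.15; the names anchored and floating are my own.

```python
import re

anchored = re.compile(r'^\D*(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')
floating = re.compile(r'(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')

text = 'work 1-(800) 555.1212 #1234'

print(anchored.search(text))   # None: the leading 1 is a digit, so \D* cannot skip it

match = floating.search(text)
print(match.groups())   # ('800', '555', '1212', '1234')
print(match.start())    # 8  (the index of the '8' in '800')
print(match.group(0))   # 800) 555.1212 #1234
```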

example 7.16.Parsing Phone Numbers(Final Version)
>>> phonePattern = re.compile(r'''
                # don't match beginning of string, number can start anywhere
    (\d{3})     # area code is 3 digits (e.g. '800')
    \D*         # optional separator is any number of non-digits
    (\d{3})     # trunk is 3 digits (e.g. '555')
    \D*         # optional separator
    (\d{4})     # rest of number is 4 digits (e.g. '1212')
    \D*         # optional separator
    (\d*)       # extension is optional and can be any number of digits
    $           # end of string
    ''', re.VERBOSE)
>>> phonePattern.search('work 1-(800) 555.1212 #1234').groups()       
('800', '555', '1212', '1234')
>>> phonePattern.search('800-555-1212').groups()
('800', '555', '1212', '')
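
As a final check, the sketch below runs the final pattern against every phone number format listed at the start of this section. The parse_phone helper and the samples list are my own additions for illustration.

```python
import re

phonePattern = re.compile(r"""
                # no ^ here, the number can start anywhere
    (\d{3})     # area code is 3 digits
    \D*         # optional separator
    (\d{3})     # trunk is 3 digits
    \D*         # optional separator
    (\d{4})     # rest of number is 4 digits
    \D*         # optional separator
    (\d*)       # extension is optional
    $           # end of string
    """, re.VERBOSE)

def parse_phone(text):
    """Return (area code, trunk, rest, extension), or None if nothing matches."""
    match = phonePattern.search(text)
    return match.groups() if match else None

samples = [
    '800-555-1212',
    '800 555 1212',
    '800.555.1212',
    '(800) 555-1212',
    '1-800-555-1212',
    '800-555-1212-1234',
    '800-555-1212x1234',
    '800-555-1212 ext. 1234',
    'work 1-(800) 555.1212 #1234',
]

for sample in samples:
    print(sample, '->', parse_phone(sample))
```

The first five samples come back with an empty extension and the last four with '1234'; in every case the first three groups are '800', '555', and '1212'.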

You should now be familiar with the following techniques:
^        matches the beginning of a string.
$        matches the end of a string.
\b       matches a word boundary.
\d       matches any numeric digit.
\D       matches any non-numeric character.
x?       matches an optional x character (in other words, it matches an x zero or one times).
x*       matches x zero or more times.
x+       matches x one or more times.
x{n,m}   matches an x character at least n times, but not more than m times.
(a|b|c)  matches either a or b or c.
(x)      in general is a remembered group. You can get the value of what matched by using the groups() method of the object returned by re.search.

Regular expressions are extremely powerful, but they are not the correct solution for every problem. You should learn enough about them to know when they are appropriate, when they will solve your problems, and when they will cause more problems than they solve.
