Matching Chinese Text with Regular Expressions in Python

djangofan

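Matching Chinese text with regular expressions in Python works without problems, but there is one key point: the Chinese characters in the pattern must use the same encoding as the string being matched. The following example illustrates this: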

# -*- coding: utf-8 -*-

'''
Contents of test.html:

<div id='author_' >作　　者：（美）埃克尔著，陈昊鹏译</div>
<div id='publisher_'>出版社：机械工业出版社</div>
<ul >
<li>出版时间： 2007-6-1</li>
<li>字　　数： </li>
<li>版　　次： 1</li>
<li>页　　数： 880</li>
<li>印刷时间： 2007-6-1</li>
<li>开　　本： </li>
<li>印　　次： </li>
<li>纸　　张：胶版纸</li>
<li>I S B N ： 9787111213826</li>
<li>包　　装：平装</li>
</ul>
'''

import re

import chardet  # third-party package, used to detect the encoding of a byte string


# Read the file as raw bytes
def readContent():
    with open(r'/home/fzhong/test.html', 'rb') as f:
        return f.read()


# Detect the encoding of a byte string
def checkEncoding(data):
    return chardet.detect(data)['encoding']


# Return the first captured group of regx in content, with surrounding whitespace stripped.
# The original version read self.dataStr, which does not exist in a plain function,
# so the content is passed in explicitly.
def extractAttrValue(regx, content):
    p = re.compile(regx)
    attrValue = p.search(content).group(1).strip()
    return attrValue


if __name__ == '__main__':
    content = readContent()

    # test.html is gb2312-encoded, so encoding should be gb2312 here
    encoding = checkEncoding(content)

    # Each pattern is written as unicode, converted to the same encoding
    # as content, and only then used for matching
    p_isbn = u'<li>I S B N ：(.*?)</li>'.encode(encoding)
    isbn = extractAttrValue(p_isbn, content)

    p_pub_date = u'<li>出版时间：(.*)</li>'.encode(encoding)
    pubDate = extractAttrValue(p_pub_date, content)

    p_edition_num = u'<li>版　　次：(.*?)</li>'.encode(encoding)
    editionNum = extractAttrValue(p_edition_num, content)

    p_page_num = u'<li>页　　数：(.*?)</li>'.encode(encoding)
    pageNum = extractAttrValue(p_page_num, content)

    # These patterns contain no backslashes, so a plain u'' literal is enough
    p_author = u'作　　者：(.*?)</div>'.encode(encoding)
    author = extractAttrValue(p_author, content)

    p_publisher = u'出版社：(.*?)</div>'.encode(encoding)
    publisher = extractAttrValue(p_publisher, content)
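
To see why every pattern in the script goes through .encode(encoding), here is a minimal sketch. It assumes Python 3 and a hard-coded gb2312 encoding instead of chardet detection; the names data, good, and bad are illustrative and not part of the script. The same Unicode pattern is encoded two ways, and only the one that shares the data's encoding matches:

```python
# -*- coding: utf-8 -*-
import re

# Illustrative sample: one line of the page, stored as gb2312 bytes.
data = u'<li>出版时间： 2007-6-1</li>'.encode('gb2312')

pattern = u'<li>出版时间：(.*?)</li>'

# Pattern encoded like the data: the byte sequences line up, so it matches.
good = re.search(pattern.encode('gb2312'), data)

# Pattern encoded as UTF-8: the Chinese characters become different bytes,
# so the search finds nothing.
bad = re.search(pattern.encode('utf-8'), data)

print(good is not None)  # True
print(bad is None)       # True
```

The ASCII parts of the pattern (`<li>`, `</li>`) are identical in both encodings; it is only the Chinese characters that differ, which is why the mismatch shows up only for patterns containing Chinese.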

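The key point here is this line:

p_pub_date = u'<li>出版时间：(.*)</li>'.encode(encoding)

It converts the pattern from unicode to the encoding of the content.

Of course, the script above could also write it this way (in Python 2, a plain string literal in a UTF-8 source file holds UTF-8 bytes, so it is decoded to unicode first):

p_pub_date = '<li>出版时间：(.*)</li>'.decode('UTF-8').encode(encoding)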
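
The two ways of building the pattern should give the same bytes. A small sketch to check this, assuming Python 3 and a fixed gb2312 encoding; since Python 3 has no str.decode, the UTF-8 bytes that a plain Python 2 literal would hold are built explicitly, and the names direct, via_utf8, and line are illustrative:

```python
# -*- coding: utf-8 -*-
import re

encoding = 'gb2312'  # assumed here; the script detects it with chardet

# Form 1: start from a Unicode literal and encode it.
direct = u'<li>出版时间：(.*)</li>'.encode(encoding)

# Form 2: start from UTF-8 bytes (what a plain Python 2 literal holds in a
# UTF-8 source file), decode to Unicode, then encode to the target encoding.
utf8_bytes = u'<li>出版时间：(.*)</li>'.encode('utf-8')
via_utf8 = utf8_bytes.decode('UTF-8').encode(encoding)

# Both routes end in the same byte string.
same = (direct == via_utf8)

# Either form can then be matched against gb2312-encoded content.
line = u'<li>出版时间： 2007-6-1</li>'.encode(encoding)
found = re.search(via_utf8, line).group(1).strip()
print(same, found)
```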

2010-03-23 15:53