Posted: 2009-12-30
Last modified: 2009-12-30
Last time I ran into a bug in SGMLParser (see "Python sgmlparser bug"), so I decided to try HTMLParser instead and tested it against the same HTML.
The test code is as follows:
    # Python 2 code: the HTMLParser module was renamed html.parser in Python 3.
    from HTMLParser import HTMLParser


    class postparser(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self.pieces = []  # text nodes collected by handle_data

        def handle_starttag(self, tag, attrs):
            # Print the tag name, then each attribute on its own indented line.
            print "start tag name: " + tag
            for k, v in attrs:
                print "\t" + k + " : " + v

        def handle_endtag(self, tag):
            print "end tag name:" + tag

        def handle_data(self, data):
            self.pieces.append(data)

        def gethtmltext(self):
            # Join all collected text nodes into one string.
            return "".join(self.pieces)

        def reset(self):
            HTMLParser.reset(self)


    def testmyparser(htmldata):
        parser = postparser()
        parser.feed(htmldata)
        print parser.gethtmltext()
        parser.reset()


    if __name__ == "__main__":
        # htmldata = urllib.urlopen("http://www.sogou.com").read().decode("gbk")
        htmldata = """<html><head>
    <title>Google Page</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    <link rel="stylesheet" href="#" type="text/css">
    </head><body>
    <table id="tab">
    <tr id="tr1"><td id="tr1td1">tr1 td1</td><td>tr1 td2</td><td>tr1 td3</td></tr>
    <tr id="tr2"><td id="tr2td1">tr2 td1</td><td>tr2 td2</td><td>tr2 td3</td></tr>
    </table>
    <br/>
    <p onmousemove="javascript:alert('>p<');"> this is a paragraph.</p>
    <img src="http://www.baidu.com/img/baidu_logo.gif" id="baidulogo" /><br/>
    <a href="http://baidu.com">baidu</a><br/>
    <b>bold font</b><br/>
    <script language="javascript">alert("hello, world ");</script>
    <style>#tab{background-color:#fcdad5;}</style>
    </body></html>
    """
        testmyparser(htmldata)
The output is as follows:
    start tag name: html
    start tag name: head
    start tag name: title
    end tag name:title
    start tag name: meta
        http-equiv : Content-Type
        content : text/html; charset=utf-8
    start tag name: link
        rel : stylesheet
        href : #
        type : text/css
    end tag name:head
    start tag name: body
    start tag name: table
        id : tab
    start tag name: tr
        id : tr1
    start tag name: td
        id : tr1td1
    end tag name:td
    start tag name: td
    end tag name:td
    start tag name: td
    end tag name:td
    end tag name:tr
    start tag name: tr
        id : tr2
    start tag name: td
        id : tr2td1
    end tag name:td
    start tag name: td
    end tag name:td
    start tag name: td
    end tag name:td
    end tag name:tr
    end tag name:table
    start tag name: br
    end tag name:br
    start tag name: p
        onmousemove : javascript:alert('>p<');
    end tag name:p
    start tag name: img
        src : http://www.baidu.com/img/baidu_logo.gif
        id : baidulogo
    end tag name:img
    start tag name: br
    end tag name:br
    start tag name: a
        href : http://baidu.com
    end tag name:a
    start tag name: br
    end tag name:br
    start tag name: b
    end tag name:b
    start tag name: br
    end tag name:br
    start tag name: script
        language : javascript
    end tag name:script
    start tag name: style
    end tag name:style
    end tag name:body
    end tag name:html
    Google Page
    tr1 td1tr1 td2tr1 td3
    tr2 td1tr2 td2tr2 td3
     this is a paragraph.
    baidu
    bold font
    alert("hello, world ");
    #tab{background-color:#fcdad5;}
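The last lines of the output show that handle_data also receives the bodies of the script and style elements, not just visible text. For readers on Python 3, where the HTMLParser module is named html.parser, the text-collecting part of the test can be sketched roughly as follows. TextCollector, extract_text and the shortened sample markup are illustrative names and data, not part of the original test.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects every text node that the parser passes to handle_data."""

    def __init__(self):
        super().__init__()
        self.pieces = []

    def handle_data(self, data):
        self.pieces.append(data)

    def get_text(self):
        return "".join(self.pieces)


def extract_text(html):
    """Feed the markup to a fresh parser and return the joined text nodes."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.get_text()


# Hypothetical, shortened sample in the spirit of the test page.
sample = (
    "<p>tr1 td1</p>"
    '<script>alert("hi");</script>'
    "<style>#tab{color:red;}</style>"
)
text = extract_text(sample)
```

If only visible text is wanted, the script and style bodies would have to be filtered out separately, for example by tracking the current tag in handle_starttag and handle_endtag.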
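Judging from the test results, HTMLParser does noticeably better than SGMLParser here: it does not have the bug that prevented standalone (self-closing) tags from being parsed, and the `>` and `<` inside the onmousemove attribute are parsed correctly as well.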
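Both observations can be checked in isolation. The sketch below is a Python 3 approximation, assuming html.parser (the Python 3 name for the HTMLParser module used in the test); EventCollector and collect_events are illustrative names, and the snippet is a cut-down version of the test markup.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Records start/end tag events instead of printing them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def collect_events(html):
    """Parse the markup and return the recorded tag events in order."""
    parser = EventCollector()
    parser.feed(html)
    parser.close()
    return parser.events


# A paragraph whose attribute value contains '>' and '<', then a self-closing tag.
snippet = '<p onmousemove="javascript:alert(\'>p<\');">text</p><br/>'
events = collect_events(snippet)

for event in events:
    print(event)
```

The self-closing `<br/>` is reported as a start event followed by an end event, which matches the "start tag name: br" / "end tag name:br" pairs in the output above, and the quoted attribute value comes through intact.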