Python crawler: scraping the featured posts on the SAE forum

tsface


Category: Python

Tags: python, crawler, SAE

I normally write Java, but I have been learning Python lately, so I wrote a simple Python script for practice.

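   How do you find every featured (digest) post on the SAE forum? Alone and bored over the weekend, I looked into Python's urllib2. Here is how I went about collecting the featured posts:
      1. Request the listing pages of each board and read the HTML that comes back.
      2. Process the HTML and locate the markup that marks a post as featured.
      3. Extract the href attribute and the title text enclosed by the <a></a> tag.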

      The idea is simple and the code is not complicated either. My code is below.

#! /usr/bin/env python
#coding=utf-8
#Python 2.7 
#author: tsface
#date: 2014-04-20
#since:1.0


import urllib2
import time
import sys
import re

reload(sys)
sys.setdefaultencoding('utf-8')

# Constants
BASE_PATH='http://cloudbbs.org/'
_URL='http://cloudbbs.org/forum.php?mod=forumdisplay'
# The HTTP header name is "User-Agent" (hyphen, not underscore)
headers ={'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.69 Safari/537.36'}
# Number of listing pages requested per board
pageSize=30
# Ids of the boards to crawl
FIDS=[41,37,39,40,51,52,57,58,60,63,64,65,54,46,62]


# Fetch a page and return its HTML
#since:1.0
def getHtml(url):
    realUrl = url
    consoleLog('request url: '+realUrl)
    req = urllib2.Request(realUrl,headers = headers)
    res = urllib2.urlopen(req)
    # Decode the UTF-8 response and re-encode it as gb2312;
    # characters that gb2312 cannot represent are dropped
    html = res.read().decode('utf-8').encode('gb2312','ignore')
    return html


# Parse the featured posts out of a listing page and return them as a list of
# {"address": ..., "title": ...} dicts
def parseExcellentContent(html):
    digestList=[]
    if html:
        consoleLog('start parse DOM...')
        regDigest = '(<a[^>]*class="xst" >[\\d\\D]*?</a>)'
        # Regular expression that matches the title cell of a thread
        regContent ='(<th[^>]*class="[\\d\\D].*?">[\\d\\D]*?</th>)'
        contentList = re.findall(regContent,html,re.S)
        for item in contentList:
            # Featured posts are marked with alt="digest"
            _digest = re.findall('alt="digest"',item,re.S)
            if _digest:
                _digestContent=re.findall(regDigest,item,re.S)
                if not _digestContent:
                    continue
                _href=re.findall('<a href="(.*?)"',_digestContent[0],re.S)
                _bbsName=re.findall('<a.*?>(.*?)</a>',_digestContent[0],re.S)
                # Skip cells where the link or the title could not be extracted
                if _href and _bbsName:
                    digestList.append({"address":_href[0],"title":_bbsName[0]})
        consoleLog('parse successfully...')
    else:
        consoleLog('Nothing to parse')
    # Always return a list so that the caller can iterate over the result
    return digestList
    
# Write the collected text to a file
def writeInfo2File(fileName,info):
    _file=open(fileName,'w+')
    _file.write(info)
    _file.close()
    consoleLog(fileName+' has been built...')
    
# Print a log message prefixed with a timestamp
def consoleLog(log):
    print '['+getCurrentFormatTime()+']'+log

# Return the local time as a formatted string
def getCurrentFormatTime():
    return time.strftime("%Y-%m-%d %A %X %Z", time.localtime())


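# Replace HTML entities in an href with the characters they stand for
def replaceSpecial(_href):
    replacedUrl=_href
    # "&amp;" goes last so that text such as "&amp;lt;" is not unescaped twice
    replaceTab = [("&lt;","<"),("&gt;",">"),("&nbsp;"," "),("&amp;","&")]
    for repl in replaceTab:
        replacedUrl=replacedUrl.replace(repl[0],repl[1])
    return replacedUrl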

# Collect the featured posts of every board on the SAE forum
def getSAEBBSExcellentInfo():
    info=""
    for fid in FIDS:
        reqUrl = _URL+'&fid='+str(fid)
        for i in range(1,pageSize+1):
            _sigleReqUrl=reqUrl+'&page='+str(i)
            print _sigleReqUrl
            objs = parseExcellentContent(getHtml(_sigleReqUrl))
            for obj in objs:
                info+=(BASE_PATH+replaceSpecial(obj['address']))+(" "*10)+obj['title']+'\n'
        # Separator line carrying the id of the board just finished
        info+=('='*50+str(fid)+'\n')
        # Rewrite the file after each board, so the results gathered so far are on disk
        writeInfo2File('test.txt',info)


# Run the crawl and report the elapsed time in seconds
startTime = time.time()
getSAEBBSExcellentInfo()
endTime = time.time()
consoleLog('the py spend '+str(endTime-startTime)+' s')
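
The script above targets Python 2.7 (`urllib2`, the `print` statement, `reload(sys)`). As a rough sketch only, and assuming Python 3, the request and the entity replacement might be written with the standard library as follows. The names `USER_AGENT`, `build_request`, `fetch_html` and `unescape_href` are hypothetical, and the 10-second timeout is an arbitrary choice.

```python
import html
import urllib.request

# Same browser-like User-Agent string as in the script above.
USER_AGENT = ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/30.0.1599.69 Safari/537.36')


def build_request(url):
    """Build (but do not send) a request that carries the User-Agent header."""
    return urllib.request.Request(url, headers={'User-Agent': USER_AGENT})


def fetch_html(url, timeout=10):
    """Send the request and return the body decoded as UTF-8 text."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as res:
        return res.read().decode('utf-8', 'ignore')


def unescape_href(href):
    """Turn HTML entities such as &amp; back into plain characters."""
    return html.unescape(href)
```

For example, `unescape_href('forum.php?mod=viewthread&amp;tid=1')` gives `forum.php?mod=viewthread&tid=1`. One difference from the hand-written table: `html.unescape` turns `&nbsp;` into a non-breaking space (U+00A0) rather than an ordinary space.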
