chenjinlai

Programming Collective Intelligence: reading notes, part 3

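searchengine:
A simplified search engine skeleton... it follows links from the start page down to a given depth...
I tried it out, but the office network is too slow and the crawl barely gets anywhere.
...the bottlenecks of a search engine are bandwidth and concurrent computation.
Python's BeautifulSoup library is very good. PHP and Ruby have no library this good, so there you have to write your own matching function to extract links.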
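
To give a feel for what "writing your own matching function" involves, here is a rough sketch of link extraction without BeautifulSoup, using only the Python 3 standard library. It is not from the book; the names `LinkCollector` and `extract_links` are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase.
        if tag != 'a':
            return
        for name, value in attrs:
            # value is None for attributes written without a value.
            if name == 'href' and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in html, resolved against base_url."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

With BeautifulSoup the same job is roughly `soup('a')` plus a check for `href`, which is why the library feels so convenient.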

>>> import searchengine
>>> c=searchengine.crawler('')
>>> c.crawl()
indexing http://kiwitobes.com/wiki/
indexing http://kiwitobes.com/wiki/Citeseer.html
indexing http://kiwitobes.com/wiki/Insert_%2528SQL%2529.html
indexing http://kiwitobes.com/wiki/Spacecraft_propulsion.html
indexing http://kiwitobes.com/wiki/Noctis.html
indexing http://kiwitobes.com/wiki/Methods.html
indexing http://kiwitobes.com/wiki/32_%2528number%2529.html
...
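
The session above downloads one page at a time, so every request waits on the network. Since the notes name bandwidth and concurrency as the bottleneck, one possible mitigation is to fetch several pages in parallel. The following is only a sketch under that assumption, written for Python 3 rather than the Python 2 of the listing; `fetch`, `fetch_all`, and the `workers` parameter are hypothetical names, not part of the book's code.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one page; return (url, bytes), or (url, None) on failure."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return url, response.read()
    except Exception:
        # Mirror the crawler's behaviour: skip pages that cannot be opened.
        return url, None


def fetch_all(urls, fetcher=fetch, workers=8):
    """Fetch several pages concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetcher, urls))
```

Threads help here because the work is dominated by waiting on the network. They do nothing for a slow link itself, so this would not necessarily have rescued the crawl on the office connection.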


searchengine.py
------------------------
# Python 2 code (urllib2, BeautifulSoup 3)
import urllib2
from BeautifulSoup import *
from urlparse import urljoin

# Common words to skip when indexing
ignorewords=set(['the','of','to','and','a','in','is','it'])
# Default start pages for the crawl
baseUrl = set(['http://kiwitobes.com/wiki/'])

class crawler:

    # The methods below are stubs to be filled in later
    def __init__(self,dbname):
        pass
    def __del__(self):
        pass
    def dbcommit(self):
        pass
    def getentryid(self,table):
        return None
    def gettextonly(self,soup):
        return None
    def separatewords(self,text):
        return None
    def addlinkref(self,urlFrom,urlTo,linkText):
        pass
    def createindextables(self):
        pass
    def addtoindex(self,url,soup):
        print 'indexing %s' %url
    def isindexed(self,url):
        return False

    # Crawl level by level: index each page, collect its links,
    # then repeat on the newly found pages, up to the given depth
    def crawl(self,pages=baseUrl,depth=2):
        for i in range(depth):
            newpages = set()
            for page in pages:
                try:
                    c = urllib2.urlopen(page)
                except:
                    print "could not open %s" %page
                    continue
                soup=BeautifulSoup(c.read())
                self.addtoindex(page,soup)

                links=soup('a')
                for link in links:
                    if ('href' in dict(link.attrs)):
                        url = urljoin(page,link['href'])
                        if not self.isindexed(url):
                            newpages.add(url)
                        text = self.gettextonly(link)
                        self.addlinkref(page,url,text)
                self.dbcommit()
            pages=newpages

------------------------
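
The listing leaves `separatewords` as a stub and defines `ignorewords` without using it yet. As a guess at how the two might fit together, here is a minimal Python 3 sketch that assumes a simple regex split is good enough; `indexable_words` is a hypothetical helper and does not appear in the listing.

```python
import re

# Same word list as in the listing above.
ignorewords = set(['the', 'of', 'to', 'and', 'a', 'in', 'is', 'it'])


def separatewords(text):
    """Split text on runs of non-word characters and lowercase each word."""
    return [word.lower() for word in re.split(r'\W+', text) if word]


def indexable_words(text):
    """Keep only the words that are not in ignorewords."""
    return [word for word in separatewords(text) if word not in ignorewords]
```

Note that `\W` treats the underscore as a word character, so `snake_case` would stay as one token. Whether that matters depends on the pages being indexed.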
chenjinlai
2008-05-07