searchengine:
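A simplified search-engine skeleton from Programming Collective Intelligence. The crawler follows links from a start page down to a fixed depth.
I gave it a try, but the network at my company is too slow for it to crawl much.
The real bottlenecks of a search engine are bandwidth and concurrent computation.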
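Since slow downloads are the bottleneck, one common way to ease it is to fetch several pages at once with a thread pool. The following is only a sketch under assumptions: it is written in Python 3 with the standard library, `fetch_all` and `fake_fetch` are hypothetical names that do not appear in the book's code, and the fetch function is passed in so the example runs without a network.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def fetch_all(urls, fetch, max_workers=8):
    """Fetch every URL concurrently; return {url: content, or None on failure}."""
    urls = list(urls)

    def safe_fetch(url):
        try:
            return fetch(url)
        except Exception:
            # Plays the role of the crawler's "could not open" branch.
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps the input order, so zip pairs each URL with its result.
        return dict(zip(urls, pool.map(safe_fetch, urls)))


def fake_fetch(url):
    """Hypothetical stand-in for a slow network request."""
    time.sleep(0.05)
    if url.endswith("bad"):
        raise IOError("unreachable")
    return "<html>%s</html>" % url


pages = fetch_all(["page1", "page2", "page3bad"], fake_fetch)
for url, content in pages.items():
    print(url, "->", content)
```

In a real crawler, `fetch` would wrap an HTTP request, and the pool size would be tuned to the available bandwidth.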
Python's BeautifulSoup library is very good. As far as I know, PHP and Ruby have nothing this convenient, so there you have to write your own matching functions to extract links.
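To show what link extraction involves, here is a minimal sketch that collects the `href` of every `a` tag and resolves it against the page URL. It assumes Python 3 and uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages. `LinkCollector` and `extract_links` are hypothetical names for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page they were found on.
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


sample = ('<p><a href="Citeseer.html">Citeseer</a> '
          '<a name="top">no href</a> '
          '<a href="/about">About</a></p>')
print(extract_links(sample, "http://kiwitobes.com/wiki/"))
```

The anchor without an `href` is skipped, which is the same check the crawler makes with `'href' in dict(link.attrs)`.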
>>> import searchengine
>>> c=searchengine.crawler('')
>>> c.crawl()
indexing http://kiwitobes.com/wiki/
indexing http://kiwitobes.com/wiki/Citeseer.html
indexing http://kiwitobes.com/wiki/Insert_%2528SQL%2529.html
indexing http://kiwitobes.com/wiki/Spacecraft_propulsion.html
indexing http://kiwitobes.com/wiki/Noctis.html
indexing http://kiwitobes.com/wiki/Methods.html
indexing http://kiwitobes.com/wiki/32_%2528number%2529.html
...
searchengine.py
------------------------
# Python 2 code: uses urllib2, BeautifulSoup 3 and the print statement
import urllib2
from BeautifulSoup import *
from urlparse import urljoin

# Words to ignore when indexing
ignorewords=set(['the','of','to','and','a','in','is','it'])
# Default start page for the crawl
baseUrl = set(['http://kiwitobes.com/wiki/'])

class crawler:
    # Most methods are stubs for now; only crawl() does real work
    def __init__(self,dbname):
        pass

    def __del__(self):
        pass

    def dbcommit(self):
        pass

    def getentryid(self,table):
        return None

    def gettextonly(self,soup):
        return None

    def separatewords(self,text):
        return None

    def addlinkref(self,urlFrom,urlTo,linkText):
        pass

    def createindextables(self):
        pass

    def addtoindex(self,url,soup):
        print 'indexing %s' %url

    def isindexed(self,url):
        return False

    # Index the start pages, then the pages they link to, repeated depth times
    def crawl(self,pages=baseUrl,depth=2):
        for i in range(depth):
            newpages = set()
            for page in pages:
                try:
                    c = urllib2.urlopen(page)
                except:
                    print "could not open %s" %page
                    continue
                soup=BeautifulSoup(c.read())
                self.addtoindex(page,soup)

                links=soup('a')
                for link in links:
                    if ('href' in dict(link.attrs)):
                        # Resolve relative links against the current page
                        url = urljoin(page,link['href'])
                        if not self.isindexed(url):
                            newpages.add(url)
                        text = self.gettextonly(link)
                        self.addlinkref(page,url,text)

                self.dbcommit()
            # The links found in this pass become the pages of the next pass
            pages=newpages
--------------
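The `depth` argument of `crawl` is easier to follow on a small in-memory link graph. The sketch below is an illustration under assumptions, not the book's code: it is Python 3, `crawl_graph` and the sample `graph` are hypothetical, and it keeps a `seen` set so that no page is indexed twice, whereas the stub `isindexed` above always returns False.

```python
def crawl_graph(graph, start_pages, depth=2):
    """Walk a dict {page: [linked pages]} level by level, depth passes in total.

    Returns the pages in the order they were indexed.
    """
    indexed = []
    seen = set()
    pages = set(start_pages)
    for _ in range(depth):
        newpages = set()
        for page in sorted(pages):  # sorted only to make the order repeatable
            if page not in graph:
                # Plays the role of "could not open".
                continue
            if page in seen:
                continue
            seen.add(page)
            indexed.append(page)
            for url in graph[page]:
                if url not in seen:
                    newpages.add(url)
        # Links collected in this pass are the pages of the next pass.
        pages = newpages
    return indexed


graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["D"],
    "D": ["E"],
    "E": [],
}
print(crawl_graph(graph, ["A"], depth=2))
print(crawl_graph(graph, ["A"], depth=3))
```

With `depth=2` only the start page and the pages it links to directly are indexed. Each extra level adds one more ring of links, which is why the number of pages to download grows so quickly.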
chenjinlai
2008-05-07