saiyaren

Deploying and Using Nutch 1.4


 

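Nutch 1.4 was officially released on 26 November 2011. It updates some content and configuration but differs little from 1.3; the gap from 1.2 and earlier is much larger. From Nutch 1.3 onward, indexes are built with Solr and queries also go through Solr, so the web search application that shipped with Nutch up to 1.2 is no longer needed.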

First, download the latest release, Nutch 1.4, from the Nutch website.

The address is:

http://www.apache.org/dyn/closer.cgi/nutch/

 

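Either apache-nutch-1.4-bin.zip or apache-nutch-1.4-bin.tar.gz will do.

After downloading, unpack the archive. This article covers use on Linux; the next one will cover Nutch development in Eclipse.

Unpacking produces the Nutch directory tree.


Next, go into nutch/runtime/local. It contains a conf folder that holds the configuration files.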



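Only two of these files matter here:

nutch-default.xml and regex-urlfilter.txt

nutch-default.xml is Nutch's configuration file.

regex-urlfilter.txt holds the rules that decide which URLs Nutch crawls.

This is a first test crawl, so no other settings need tuning. Only the following is required:

In nutch-default.xml, find the http.agent.name property and fill in its value: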

 

<!-- HTTP properties -->

<property>
  <name>http.agent.name</name>
  <value>jdodrc</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty - 
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

	http.robots.agents
	http.agent.description
	http.agent.url
	http.agent.email
	http.agent.version

  and set their values appropriately.

  </description>
</property>

 

 

Without a value for this property, running Nutch fails with the following error:

 

Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
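Nutch also reads conf/nutch-site.xml, and values there override nutch-default.xml, so the agent name can be set without touching the defaults file. The following is a minimal sketch, assuming it is run from nutch/runtime/local; NUTCH_CONF_DIR is a hypothetical variable introduced here only to make the target directory adjustable, and the script replaces any existing nutch-site.xml after saving a .bak copy.

```shell
#!/bin/sh
# Assumption: the current directory is nutch/runtime/local.
# NUTCH_CONF_DIR is a hypothetical override, not something Nutch itself reads here.
conf_dir="${NUTCH_CONF_DIR:-conf}"
site_file="$conf_dir/nutch-site.xml"

mkdir -p "$conf_dir"

# Keep a copy of any existing site file before replacing it.
if [ -f "$site_file" ]; then
  cp "$site_file" "$site_file.bak"
fi

# Write a site file that sets only http.agent.name (same value as in the article).
cat > "$site_file" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>jdodrc</value>
  </property>
</configuration>
EOF

# Show the line that carries the value.
grep '<value>' "$site_file"
```

Editing nutch-default.xml directly, as this article does, works as well; the site file simply keeps local changes separate from the shipped defaults.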
 

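With the property set, the crawl rules come next. Suppose we want to crawl www.163.com but not every link found on it; Sohu advertisements, for example, are of no interest, and only 163 content should be fetched. That calls for crawl rules, which are written as regular expressions (regular expressions themselves are not explained here).

 

Where do the rules go?

 

They go in regex-urlfilter.txt: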

 

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

This rule filters out URLs by file extension.

 

Crawling dynamic pages

 

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]  (comment this rule out, as done here, to crawl dynamic pages)
-[~]
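To see what the commented-out rule would have excluded, the character class can be tried against a few URLs. This is only a sketch: would_be_skipped is a hypothetical helper, and it uses grep -E (POSIX extended regular expressions) rather than the Java regex engine Nutch uses, which behave the same for this simple character class. The sample URLs are taken from the crawl output later in this article.

```shell
#!/bin/sh
# would_be_skipped is a hypothetical helper: it succeeds when the stock rule
# -[?*!@=] would match the given URL, i.e. when Nutch would have dropped it.
would_be_skipped() {
  printf '%s\n' "$1" | grep -Eq '[?*!@=]'
}

for url in \
  'http://tech.163.com/' \
  'http://blog.163.com/passportIn.do?entry=163' \
  'http://reg.163.com/Main.jsp?username=pInfo' \
  'http://sports.163.com/nba/'
do
  if would_be_skipped "$url"; then
    echo "skip $url"
  else
    echo "keep $url"
  fi
done
```

With the rule commented out, URLs carrying query strings are kept, which is why they appear in the fetch log.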

 

Link filtering rules; the following restricts the crawl to the 163 site:

 

 

# accept anything else
#+^http://([a-z0-9]*\.)*(.*\.)*.*/
+^http://([a-z0-9]*\.)*163\.com
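The regex URL filter reads the rules top to bottom: the first pattern that matches a URL decides its fate ('+' accepts, '-' rejects), and a URL that matches no rule is rejected. The sketch below imitates that behaviour so the three rules above can be tried out without running a crawl. filter_url is a hypothetical helper, and grep -E stands in for Nutch's Java regex engine, which is an assumption that holds for these particular patterns but not for every Java regex.

```shell
#!/bin/sh
# The three active rules from this article, in file order.
rules='-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
-[~]
+^http://([a-z0-9]*\.)*163\.com'

# filter_url (hypothetical helper) prints "accept" or "reject" for one URL.
filter_url() {
  url=$1
  while IFS= read -r rule; do
    case $rule in
      +*) verdict=accept ;;
      -*) verdict=reject ;;
      *)  continue ;;          # ignore blank lines and comments
    esac
    pattern=${rule#?}          # drop the leading + or -
    # The first rule whose pattern matches decides the outcome.
    if printf '%s\n' "$url" | grep -Eq -e "$pattern"; then
      echo "$verdict"
      return
    fi
  done <<EOF
$rules
EOF
  echo reject                  # no rule matched
}

for u in \
  'http://www.163.com/' \
  'http://bbs.163.com/rank/' \
  'http://www.sohu.com/' \
  'http://www.163.com/logo.gif' \
  'http://www.163.com.example.org/'
do
  printf '%s %s\n' "$(filter_url "$u")" "$u"
done
```

The last sample URL is accepted because the accept pattern is not anchored after 163\.com. Ending the pattern with / would close that gap, at the cost of rejecting URLs such as http://help.163.com?b01abh1 that have no slash after the host.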
 

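For a test crawl, changing this filter rule is all that is needed.

 

Once http.agent.name in nutch-default.xml and the regular-expression rules in regex-urlfilter.txt are configured, make the scripts under runtime/local/bin executable on Linux.

 

Change into the bin directory and run:

chmod +x *.sh

This makes every .sh file executable. The launcher used later in this article is bin/nutch, which has no .sh suffix, so make sure it is executable as well (chmod +x nutch).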

 

Now let's run a test:

 

Under runtime/local, create a urls directory containing a file named test, and put the entry URL of the site to crawl into that file:

 

http://www.163.com/
 

Save the file. The local directory now contains a urls directory with one seed file in it.
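The same seed file can be created from the command line. A minimal sketch, assuming the current directory is nutch/runtime/local:

```shell
#!/bin/sh
# Create the seed directory and a seed file holding one URL per line.
mkdir -p urls
printf '%s\n' 'http://www.163.com/' > urls/test

# Show the result.
cat urls/test
```

Further entry points can be appended to the file, one URL per line.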

Now for the test itself.

Before running it, it helps to understand the crawl command's arguments:

Crawl <urlDir> -solr <solrURL> [-dir d] [-threads n] [-depth i] [-topN N]

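Arguments in [] are optional.

urlDir is the location of the seed (entry) files.

-solr <solrUrl> is the address of Solr (leave it out if there is none).

-dir is where the crawl data is stored.

-threads is the number of fetch threads (more is not always better; use as many as the job requires; default 10).

-depth is the crawl depth (default 5).

-topN is the crawl breadth, that is, the maximum number of URLs fetched in each round (default Long.MAX_VALUE).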

 

The bin directory contains a shell script named nutch; its crawl argument launches the crawl class.

Now run a test crawl. The current directory is nutch/runtime/local.

 

 

bin/nutch crawl urls -solr http://localhost:8080/solr/ -dir crawl -depth 2 -threads 5 -topN 100
 

To keep the output for later inspection, append >& followed by a log file path to the end of the command.
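The command and the log redirection can be combined in a small wrapper. This is a sketch under stated assumptions: crawl.log is an assumed file name, and RUN_CRAWL is a hypothetical switch added here so that the script only prints the command unless it is explicitly asked to run it. It uses > file 2>&1, which sends both standard output and standard error to the file and works in any POSIX shell, whereas >& is specific to shells such as bash and csh.

```shell
#!/bin/sh
# Values taken from the command above; LOG_FILE is an assumed name.
SEED_DIR=urls
SOLR_URL=http://localhost:8080/solr/
CRAWL_DIR=crawl
LOG_FILE=crawl.log

cmd="bin/nutch crawl $SEED_DIR -solr $SOLR_URL -dir $CRAWL_DIR -depth 2 -threads 5 -topN 100"

if [ "${RUN_CRAWL:-0}" = "1" ]; then
  # Run the crawl; stdout and stderr both go to the log file.
  # $cmd is left unquoted on purpose so it splits into separate arguments.
  $cmd > "$LOG_FILE" 2>&1
else
  # Dry run: only show what would be executed.
  echo "$cmd > $LOG_FILE 2>&1"
fi
```

Running it as RUN_CRAWL=1 sh followed by the script name starts the actual crawl from nutch/runtime/local; progress can then be followed with tail -f crawl.log.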

 

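Solr has to be configured separately; I will cover its deployment in a separate Solr article. For the -solr option here, just supply the URL of the Solr instance.


To learn about Solr deployment, see the Solr deployment article.

 

To test or develop on Windows, install Cygwin first; I will cover installing Cygwin in the article on deploying Nutch 1.4 in Eclipse.

 

Test output (this log comes from a run that used urls/test.txt as the seed path and 10 fetch threads, so its first lines differ from the command above):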

 

crawl started in: crawl
rootUrlDir = urls/test.txt
threads = 10
depth = 2
solrUrl=http://localhost:8080/solr/
topN = 100
Injector: starting at 2012-02-07 14:21:20
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls/test.txt
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2012-02-07 14:21:25, elapsed: 00:00:04
Generator: starting at 2012-02-07 14:21:25
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 100
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20120207142128
Generator: finished at 2012-02-07 14:21:30, elapsed: 00:00:05
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2012-02-07 14:21:30
Fetcher: segment: crawl/segments/20120207142128
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
fetching http://www.163.com/
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Fetcher: throughput threshold: -1
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold retries: 5
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2012-02-07 14:21:36, elapsed: 00:00:05
ParseSegment: starting at 2012-02-07 14:21:36
ParseSegment: segment: crawl/segments/20120207142128
Parsing: http://www.163.com/
ParseSegment: finished at 2012-02-07 14:21:39, elapsed: 00:00:03
CrawlDb update: starting at 2012-02-07 14:21:39
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20120207142128]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2012-02-07 14:21:42, elapsed: 00:00:03
Generator: starting at 2012-02-07 14:21:42
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 100
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20120207142145
Generator: finished at 2012-02-07 14:21:48, elapsed: 00:00:05
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2012-02-07 14:21:48
Fetcher: segment: crawl/segments/20120207142145
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
Using queue mode : byHost
QueueFeeder finished: total 97 records + hit by time limit :0
Using queue mode : byHost
fetching http://bbs.163.com/
Using queue mode : byHost
fetching http://bbs.163.com/rank/
Using queue mode : byHost
fetching http://tech.163.com/cnstock/
Using queue mode : byHost
fetching http://tech.163.com/
Using queue mode : byHost
fetching http://tech.163.com/digi/nb/
Using queue mode : byHost
Using queue mode : byHost
fetching http://g.163.com/a?CID=10625&Values=3331479594&Redirect=http:/www.edu-163.com/Item/list.asp?id=1164
fetching http://g.163.com/r?site=netease&affiliate=homepage&cat=homepage&type=textlinkhouse&location=1
Using queue mode : byHost
fetching http://g.163.com/a?CID=10627&Values=896009995&Redirect=http:/www.dv37.com/jiaoyu/xiaoxinxing/
Using queue mode : byHost
fetching http://g.163.com/a?CID=10635&Values=1012801948&Redirect=http:/www.worldwayhk.com/
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
fetching http://g.163.com/a?CID=12392&Values=441270714&Redirect=http:/www.qinzhe.com/chinese/index.htm
fetching http://g.163.com/a?CID=10634&Values=2943411042&Redirect=http:/www.kpeng.com.cn/
fetching http://g.163.com/a?CID=12337&Values=3289604641&Redirect=http:/www.offcn.com/zg/2011ms/index.html
fetching http://g.163.com/a?CID=10633&Values=1745739655&Redirect=http:/www.edu-163.com/aidi/aidinj1.htm
fetching http://g.163.com/a?CID=12307&Values=3388898846&Redirect=http:/www.offcn.com/zg/2011ms/index.html
fetching http://g.163.com/a?CID=10629&Values=740233954&Redirect=http:/www.embasjtu.com/
fetching http://g.163.com/a?CID=10632&Values=715626766&Redirect=http:/www.edu-163.com/aidi/aidimg.htm
fetching http://g.163.com/a?CID=12259&Values=3180311081&Redirect=http:/www.gpkdtx.com/
fetching http://g.163.com/a?CID=12271&Values=904657751&Redirect=http:/www.vipabc.com/count.asp?code=QnfF0agFbn
fetching http://g.163.com/a?CID=10628&Values=2735701856&Redirect=http:/www.wsi.com.cn
fetching http://g.163.com/a?CID=10623&Values=1704187161&Redirect=http:/www.wsi.com.cn
fetching http://g.163.com/a?CID=12267&Values=608079303&Redirect=http:/edu.163.com/special/official/
fetching http://g.163.com/a?CID=10631&Values=3773655455&Redirect=http:/www.xinhaowei.cn/zt/sasheng-new/
fetching http://g.163.com/a?CID=10630&Values=4025376053&Redirect=http:/www.bwpx.com/
fetching http://g.163.com/a?CID=12283&Values=1441209353&Redirect=http:/www.zyqm.org/
fetching http://mobile.163.com/
fetching http://mobile.163.com/app/
fetching http://reg.vip.163.com/enterMail.m?enterVip=true-----------
fetching http://product.tech.163.com/mobile/
fetching http://hea.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=68
fetching http://reg.email.163.com/mailregAll/reg0.jsp?from=163&regPage=163
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=67
fetching http://yuehui.163.com/
fetching http://auto.163.com/
fetching http://auto.163.com/buy/
fetching http://gongyi.163.com/
fetching http://reg.163.com/Main.jsp?username=pInfo
fetching http://reg.163.com/Logout.jsp?username=accountName&url=http:/www.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=61
fetching http://money.163.com/fund/
fetching http://money.163.com/stock/
fetching http://money.163.com/hkstock/
fetching http://money.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=57
fetching http://blog.163.com/passportIn.do?entry=163
fetching http://blog.163.com/?fromNavigation
fetching http://pay.163.com/
fetching http://baby.163.com/
fetching http://discovery.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=52
fetching http://p.mail.163.com/mailinfo/shownewmsg_www_0819.htm
fetching http://help.163.com?b01abh1
fetching http://www.163.com/rss/
fetching http://home.163.com/
fetching http://product.auto.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=47
fetching http://ecard.163.com/
fetching http://photo.163.com/?username=pInfo
fetching http://photo.163.com/pp/square/
fetching http://email.163.com/
fetching http://m.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=42
fetching http://edu.163.com/
fetching http://edu.163.com/liuxue/
fetching http://xf.house.163.com/gz/
fetching http://game.163.com/
fetching http://travel.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=37
fetching http://baoxian.163.com/?from=index
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=36
fetching http://zx.caipiao.163.com?from=shouye
fetching http://entry.mail.163.com/coremail/fcg/ntesdoor2?verifycookie=1&lightweight=1
fetching http://biz.163.com/
fetching http://t.163.com/rank?f=163dh
fetching http://t.163.com/chat?f=163dh
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=31
fetching http://t.163.com/?f=wstopmicoblogmsg
fetch of http://zx.caipiao.163.com?from=shouye failed with: org.apache.nutch.protocol.http.api.HttpException: bad status line '<html>': For input string: "<html>"
fetching http://t.163.com/rank/daren?f=163dh
fetching http://t.163.com/?f=wstopmicoblogmsg.enter
fetching http://t.163.com/
fetching http://sports.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=26
fetching http://sports.163.com/nba/
fetching http://sports.163.com/cba/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=24
fetching http://sports.163.com/yc/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=23
fetching http://vipmail.163.com/
fetching http://digi.163.com/
fetching http://lady.163.com/beauty/
fetching http://lady.163.com/
fetching http://lady.163.com/sense/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=18
fetching http://house.163.com/
fetching http://news.163.com/review/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=16
fetching http://news.163.com/photo/
fetching http://news.163.com/
fetching http://v.163.com/doc/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=13
fetching http://v.163.com/zongyi/
fetching http://v.163.com/
fetching http://v.163.com/focus/
fetching http://fushi.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=9
fetching http://yc.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=8
fetching http://mall.163.com/
fetching http://ent.163.com/movie/
fetching http://ent.163.com/
fetching http://ent.163.com/music/
fetching http://ent.163.com/tv/
fetching http://war.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=2
* queue: http://fashion.163.com
  maxThreads    = 10
  inProgress    = 0
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1328595704430
  now           = 1328595728444
  0. http://fashion.163.com/
* queue: http://book.163.com
  maxThreads    = 10
  inProgress    = 0
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1328595704430
  now           = 1328595728445
  0. http://book.163.com/
fetching http://fashion.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=1
* queue: http://book.163.com
  maxThreads    = 10
  inProgress    = 0
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1328595704430
  now           = 1328595729445
  0. http://book.163.com/
fetching http://book.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=9
-finishing thread FetcherThread, activeThreads=8
-activeThreads=8, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2012-02-07 14:22:20, elapsed: 00:00:32
ParseSegment: starting at 2012-02-07 14:22:20
ParseSegment: segment: crawl/segments/20120207142145
Parsing: http://auto.163.com/
Parsing: http://auto.163.com/buy/
Parsing: http://baby.163.com/
Parsing: http://baoxian.163.com/?from=index
Parsing: http://bbs.163.com/
Parsing: http://bbs.163.com/rank/
Parsing: http://biz.163.com/
Parsing: http://blog.163.com/?fromNavigation
Parsing: http://book.163.com/
Parsing: http://digi.163.com/
Parsing: http://discovery.163.com/
Parsing: http://edu.163.com/
Parsing: http://edu.163.com/liuxue/
Parsing: http://email.163.com/
Parsing: http://ent.163.com/
Parsing: http://ent.163.com/movie/
Parsing: http://ent.163.com/music/
Parsing: http://ent.163.com/tv/
Parsing: http://fashion.163.com/
Parsing: http://fushi.163.com/
Parsing: http://g.163.com/a?CID=10623&Values=1704187161&Redirect=http:/www.wsi.com.cn
Error parsing: http://g.163.com/a?CID=10623&Values=1704187161&Redirect=http:/www.wsi.com.cn: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=10625&Values=3331479594&Redirect=http:/www.edu-163.com/Item/list.asp?id=1164
Error parsing: http://g.163.com/a?CID=10625&Values=3331479594&Redirect=http:/www.edu-163.com/Item/list.asp?id=1164: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=10627&Values=896009995&Redirect=http:/www.dv37.com/jiaoyu/xiaoxinxing/
Error parsing: http://g.163.com/a?CID=10627&Values=896009995&Redirect=http:/www.dv37.com/jiaoyu/xiaoxinxing/: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=10628&Values=2735701856&Redirect=http:/www.wsi.com.cn
Error parsing: http://g.163.com/a?CID=10628&Values=2735701856&Redirect=http:/www.wsi.com.cn: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=10629&Values=740233954&Redirect=http:/www.embasjtu.com/
Error parsing: http://g.163.com/a?CID=10629&Values=740233954&Redirect=http:/www.embasjtu.com/: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=10630&Values=4025376053&Redirect=http:/www.bwpx.com/
Error parsing: http://g.163.com/a?CID=10630&Values=4025376053&Redirect=http:/www.bwpx.com/: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=10631&Values=3773655455&Redirect=http:/www.xinhaowei.cn/zt/sasheng-new/
Error parsing: http://g.163.com/a?CID=10631&Values=3773655455&Redirect=http:/www.xinhaowei.cn/zt/sasheng-new/: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=10632&Values=715626766&Redirect=http:/www.edu-163.com/aidi/aidimg.htm
Parsing: http://g.163.com/a?CID=10633&Values=1745739655&Redirect=http:/www.edu-163.com/aidi/aidinj1.htm
Parsing: http://g.163.com/a?CID=10634&Values=2943411042&Redirect=http:/www.kpeng.com.cn/
Error parsing: http://g.163.com/a?CID=10634&Values=2943411042&Redirect=http:/www.kpeng.com.cn/: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=10635&Values=1012801948&Redirect=http:/www.worldwayhk.com/
Error parsing: http://g.163.com/a?CID=10635&Values=1012801948&Redirect=http:/www.worldwayhk.com/: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=12259&Values=3180311081&Redirect=http:/www.gpkdtx.com/
Error parsing: http://g.163.com/a?CID=12259&Values=3180311081&Redirect=http:/www.gpkdtx.com/: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=12267&Values=608079303&Redirect=http:/edu.163.com/special/official/
Error parsing: http://g.163.com/a?CID=12267&Values=608079303&Redirect=http:/edu.163.com/special/official/: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=12271&Values=904657751&Redirect=http:/www.vipabc.com/count.asp?code=QnfF0agFbn
Error parsing: http://g.163.com/a?CID=12271&Values=904657751&Redirect=http:/www.vipabc.com/count.asp?code=QnfF0agFbn: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=12283&Values=1441209353&Redirect=http:/www.zyqm.org/
Error parsing: http://g.163.com/a?CID=12283&Values=1441209353&Redirect=http:/www.zyqm.org/: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=12307&Values=3388898846&Redirect=http:/www.offcn.com/zg/2011ms/index.html
Parsing: http://g.163.com/a?CID=12337&Values=3289604641&Redirect=http:/www.offcn.com/zg/2011ms/index.html
Parsing: http://g.163.com/a?CID=12392&Values=441270714&Redirect=http:/www.qinzhe.com/chinese/index.htm
Parsing: http://g.163.com/r?site=netease&affiliate=homepage&cat=homepage&type=textlinkhouse&location=1
Parsing: http://game.163.com/
Parsing: http://gongyi.163.com/
Parsing: http://hea.163.com/
Parsing: http://home.163.com/
Parsing: http://house.163.com/
Parsing: http://lady.163.com/
Parsing: http://lady.163.com/beauty/
Parsing: http://lady.163.com/sense/
Parsing: http://mall.163.com/
Parsing: http://mobile.163.com/
Parsing: http://mobile.163.com/app/
Parsing: http://money.163.com/
Parsing: http://money.163.com/fund/
Parsing: http://money.163.com/hkstock/
Parsing: http://money.163.com/stock/
Parsing: http://news.163.com/
Parsing: http://news.163.com/photo/
Parsing: http://news.163.com/review/
Parsing: http://p.mail.163.com/mailinfo/shownewmsg_www_0819.htm
Parsing: http://pay.163.com/
Parsing: http://photo.163.com/pp/square/
Parsing: http://product.auto.163.com/
Parsing: http://product.tech.163.com/mobile/
Parsing: http://reg.163.com/Logout.jsp?username=accountName&url=http:/www.163.com/
Parsing: http://reg.163.com/Main.jsp?username=pInfo
Parsing: http://reg.email.163.com/mailregAll/reg0.jsp?from=163&regPage=163
Parsing: http://reg.vip.163.com/enterMail.m?enterVip=true-----------
Parsing: http://sports.163.com/
Parsing: http://sports.163.com/cba/
Parsing: http://sports.163.com/nba/
Parsing: http://sports.163.com/yc/
Parsing: http://t.163.com/chat?f=163dh
Parsing: http://t.163.com/rank/daren?f=163dh
Parsing: http://t.163.com/rank?f=163dh
Parsing: http://tech.163.com/
Parsing: http://tech.163.com/cnstock/
Parsing: http://tech.163.com/digi/nb/
Parsing: http://travel.163.com/
Parsing: http://v.163.com/
Parsing: http://v.163.com/doc/
Parsing: http://v.163.com/focus/
Parsing: http://vipmail.163.com/
Parsing: http://war.163.com/
Parsing: http://www.163.com/rss/
Parsing: http://xf.house.163.com/gz/
Parsing: http://yc.163.com/
Parsing: http://yuehui.163.com/
ParseSegment: finished at 2012-02-07 14:22:26, elapsed: 00:00:06
CrawlDb update: starting at 2012-02-07 14:22:26
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20120207142145]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2012-02-07 14:22:30, elapsed: 00:00:04
crawl finished: crawl
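A saved log of this kind can be summarized with grep. The sketch below builds a tiny stand-in log from a few lines of the output above (sample.log is a hypothetical file name) and counts fetch attempts, failed fetches, and parse errors by their line prefixes.

```shell
#!/bin/sh
# A few lines copied from the output above stand in for a full saved log.
cat > sample.log <<'EOF'
fetching http://t.163.com/?f=wstopmicoblogmsg
fetch of http://zx.caipiao.163.com?from=shouye failed with: org.apache.nutch.protocol.http.api.HttpException: bad status line '<html>': For input string: "<html>"
fetching http://sports.163.com/
Parsing: http://g.163.com/a?CID=10623&Values=1704187161&Redirect=http:/www.wsi.com.cn
Error parsing: http://g.163.com/a?CID=10623&Values=1704187161&Redirect=http:/www.wsi.com.cn: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://game.163.com/
EOF

# Count lines by the prefixes the fetcher and parser print.
fetched=$(grep -c '^fetching ' sample.log)
fetch_failed=$(grep -c '^fetch of .* failed with' sample.log)
parse_errors=$(grep -c '^Error parsing: ' sample.log)

echo "fetch attempts: $fetched, fetch failures: $fetch_failed, parse errors: $parse_errors"
```

In the run shown above, the parse errors all come from g.163.com redirect URLs served as application/octet-stream, for which no Tika parser is available.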


 
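Comments
#25 oMoChi_10 2013-04-22
Is importing Nutch 1.6 into MyEclipse done the same way? I'm working on my graduation project, and my advisor also wants me to modify the source code. I have no idea how to do that. I'm looking here to see whether any experts work in this area.
#24 saiyaren 2013-03-11
shantouyyt wrote:
Is there more?
Where did you cover Nutch 1.4 development in Eclipse?

I have my own notes but never posted them. I later stopped working on this, so the blog series was not continued. I'll post them when I have time.
#23 shantouyyt 2013-03-07
Is there more?
Where did you cover Nutch 1.4 development in Eclipse?
#22 青花瓷101 2012-08-02
Great write-up. I'll keep following.
#21 saiyaren 2012-05-23
youzhibing wrote:
I configured everything as you described, so why do I get this output?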
[yzb@www local]$ bin/nutch crawl urls -solr http://localhost:8080/solr/ -dir crawl -depth 2 -threads 5 -topN 100
crawl started in: crawl
rootUrlDir = urls
threads = 5
depth = 2
solrUrl=http://localhost:8080/solr/
topN = 100
Injector: starting at 2012-05-22 20:50:14
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
        at org.apache.nutch.crawl.Crawl.run(Crawl.java:127)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
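How do I fix this? Please reply as soon as you see it. Thanks!

Check whether plugin.folders in nutch-default.xml has been changed to ./src/plugin (the default is plugins);
after that change it starts normally.
This is usually a plugin path problem.
#20 youzhibing 2012-05-22
I configured everything as you described, so why do I get this output?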
[yzb@www local]$ bin/nutch crawl urls -solr http://localhost:8080/solr/ -dir crawl -depth 2 -threads 5 -topN 100
crawl started in: crawl
rootUrlDir = urls
threads = 5
depth = 2
solrUrl=http://localhost:8080/solr/
topN = 100
Injector: starting at 2012-05-22 20:50:14
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
        at org.apache.nutch.crawl.Crawl.run(Crawl.java:127)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
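How do I fix this? Please reply as soon as you see it. Thanks!
#19 saiyaren 2012-04-12
yaochanghong wrote:
You're a lifesaver. I've been struggling with this for many days without getting a good result. Please upload your follow-up documents soon. We're all waiting.

OK. I've been busy with work these past few days and pulled an all-nighter on Monday, so I haven't had time to update. Sorry about that.
#18 yaochanghong 2012-04-11
You're a lifesaver. I've been struggling with this for many days without getting a good result. Please upload your follow-up documents soon. We're all waiting.
#17 saiyaren 2012-04-11
youzhibing wrote:
saiyaren wrote:
youzhibing wrote:
The environment seems to be set up, and Solr is deployed too. After Nutch has crawled, how do I search the results? The interface Solr provides returns XML. How do I get an ordinary search page (just a query box) whose results are formatted like those of a typical search engine?

I have all of this written up already.

Where? Could I take a look? Please share whatever material you have on this so I can learn from it.

It's in a Word document on my machine that I never published.
#16 youzhibing 2012-04-10
saiyaren wrote:
youzhibing wrote:
The environment seems to be set up, and Solr is deployed too. After Nutch has crawled, how do I search the results? The interface Solr provides returns XML. How do I get an ordinary search page (just a query box) whose results are formatted like those of a typical search engine?

I have all of this written up already.

Where? Could I take a look? Please share whatever material you have on this so I can learn from it.
#15 saiyaren 2012-04-10
youzhibing wrote:
The environment seems to be set up, and Solr is deployed too. After Nutch has crawled, how do I search the results? The interface Solr provides returns XML. How do I get an ordinary search page (just a query box) whose results are formatted like those of a typical search engine?

I have all of this written up already.
#14 youzhibing 2012-04-09
The environment seems to be set up, and Solr is deployed too. After Nutch has crawled, how do I search the results? The interface Solr provides returns XML. How do I get an ordinary search page (just a query box) whose results are formatted like those of a typical search engine?
#13 saiyaren 2012-04-06
youzhibing wrote:
saiyaren wrote:
youzhibing wrote:
Could you squeeze in some time to write up configuring Nutch 1.4 in Eclipse? I'd be very grateful!

Sure, I'll write it tonight after I get home if nothing else comes up, and I'll also post some of the problems I ran into earlier.

Thank you very much.

No problem. Quite a few people in the group are waiting for it too. I've just been too busy lately to find the time.
#12 youzhibing 2012-04-06
saiyaren wrote:
youzhibing wrote:
Could you squeeze in some time to write up configuring Nutch 1.4 in Eclipse? I'd be very grateful!

Sure, I'll write it tonight after I get home if nothing else comes up, and I'll also post some of the problems I ran into earlier.

Thank you very much.
#11 saiyaren 2012-04-06
youzhibing wrote:
Could you squeeze in some time to write up configuring Nutch 1.4 in Eclipse? I'd be very grateful!

Sure, I'll write it tonight after I get home if nothing else comes up, and I'll also post some of the problems I ran into earlier.
#10 youzhibing 2012-04-05
Could you squeeze in some time to write up configuring Nutch 1.4 in Eclipse? I'd be very grateful!
#9 youzhibing 2012-03-31
youzhibing wrote:
youzhibing wrote:
saiyaren wrote:
youzhibing wrote:
Where did you cover Nutch 1.4 development in Eclipse?

I'll write it tonight. I've been busy changing jobs lately.

Thank you very much!!

I got back too late last night and didn't write it. I'll see when I have time and get it written as soon as I can. I started my new job today.

Take care of your own things first. I'm not in a hurry.
#8 saiyaren 2012-03-31
youzhibing wrote:
saiyaren wrote:
youzhibing wrote:
Where did you cover Nutch 1.4 development in Eclipse?

I'll write it tonight. I've been busy changing jobs lately.

Thank you very much!!

I got back too late last night and didn't write it. I'll see when I have time and get it written as soon as I can. I started my new job today.
#7 youzhibing 2012-03-30
saiyaren wrote:
youzhibing wrote:
Where did you cover Nutch 1.4 development in Eclipse?

I'll write it tonight. I've been busy changing jobs lately.

Thank you very much!!
#6 youzhibing 2012-03-30
saiyaren wrote:
youzhibing wrote:
Where did you cover Nutch 1.4 development in Eclipse?

I'll write it tonight. I've been busy changing jobs lately.

Is it written yet? Also, which operating system is your Eclipse setup on?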
