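Nutch 1.4 was officially released on November 26, 2011. It updates some content and configuration, but it differs little from 1.3 while differing considerably from 1.2 and earlier. Starting with Nutch 1.3, the index is built with Solr and queries also go through Solr, so the web search application that shipped with 1.2 and earlier is no longer needed.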
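First, download the latest release, Nutch 1.4, from the Nutch website.
The address is:
http://www.apache.org/dyn/closer.cgi/nutch/
Either apache-nutch-1.4-bin.zip or apache-nutch-1.4-bin.tar.gz will do.
Extract the archive once it has downloaded. This article covers usage on Linux; the next one will cover Nutch development in Eclipse.
After extracting, go into the nutch/runtime/local directory. It contains a conf folder, which holds the configuration files.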
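Only two of these files matter here:
nutch-default.xml and regex-urlfilter.txt
nutch-default.xml is Nutch's configuration file.
regex-urlfilter.txt holds the rules that control which URLs Nutch crawls.
Since this is a first test crawl, no other settings need tuning; the following is enough:
In nutch-default.xml, find the http.agent.name property and fill in its value: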
<!-- HTTP properties -->
<property>
  <name>http.agent.name</name>
  <value>jdodrc</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version

  and set their values appropriately.
  </description>
</property>
Without this value, running Nutch fails with the following error:
Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
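As an alternative to editing nutch-default.xml in place, Nutch also reads conf/nutch-site.xml, and values set there override the defaults. The following is a minimal sketch under the assumption that it is run from runtime/local; CONF_DIR is a hypothetical variable introduced for this example, and jdodrc is simply the agent name used above. The script overwrites nutch-site.xml after saving a .bak copy.

```shell
#!/bin/sh
# Sketch: set http.agent.name in nutch-site.xml instead of nutch-default.xml.
# CONF_DIR is a hypothetical variable for this example; it defaults to ./conf.
CONF_DIR="${CONF_DIR:-conf}"
SITE_FILE="$CONF_DIR/nutch-site.xml"

mkdir -p "$CONF_DIR"

# Keep a backup if a site file already exists, since the next step overwrites it.
if [ -f "$SITE_FILE" ]; then
  cp "$SITE_FILE" "$SITE_FILE.bak"
fi

# The quoted 'EOF' delimiter stops the shell from expanding anything in the XML.
cat > "$SITE_FILE" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>jdodrc</value>
  </property>
</configuration>
EOF

# Print the value back as a quick sanity check.
grep '<value>' "$SITE_FILE"
```

Keeping local changes in nutch-site.xml leaves the shipped defaults untouched, which makes later upgrades easier to compare.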
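With the property set, the crawl rules come next. Say we want to crawl www.163.com but not every link found on it: Sohu ads, for example, are of no interest, and only 163 content should be fetched. That calls for crawl rules, which are written as regular expressions (regular expressions themselves are not covered here).
So where do the rules go?
They go in the regex-urlfilter.txt file: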
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
These are the file extensions that are filtered out.
Crawling dynamic pages
# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]
-[~]
To crawl dynamic pages, comment out the -[?*!@=] rule, as done here.
Link filter rule; the following restricts the crawl to the 163 site:
# accept anything else
#+^http://([a-z0-9]*\.)*(.*\.)*.*/
+^http://([a-z0-9]*\.)*163\.com
For a test run, changing the filter rule is all that is needed.
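To get a feel for how these rules treat a given URL, they can be approximated on the command line. In regex-urlfilter.txt the first matching rule decides (+ accepts, - rejects), and a URL that matches no rule is dropped. The sketch below imitates that with grep; check_url is a hypothetical helper and not part of Nutch, the suffix list is shortened, and grep's extended regular expressions only approximate the Java regex engine that Nutch uses.

```shell
#!/bin/sh
# Sketch: approximate how the three rules above treat a URL.
# check_url is a hypothetical helper, not part of Nutch.
check_url() {
  url="$1"
  # Rule 1 (-): reject static resources by file extension (shortened list).
  if printf '%s\n' "$url" | grep -Eq '\.(gif|jpg|png|ico|css|zip|exe|jpeg|bmp|js)$'; then
    echo "reject"
  # Rule 2 (-): reject URLs containing "~".
  elif printf '%s\n' "$url" | grep -q '[~]'; then
    echo "reject"
  # Rule 3 (+): accept URLs on 163.com and its subdomains.
  elif printf '%s\n' "$url" | grep -Eq '^http://([a-z0-9]*\.)*163\.com'; then
    echo "accept"
  # No rule matched: the URL is dropped.
  else
    echo "reject"
  fi
}

for u in \
  "http://www.163.com/" \
  "http://news.163.com/photo/" \
  "http://www.163.com/logo.gif" \
  "http://www.sohu.com/"
do
  printf '%s -> %s\n' "$u" "$(check_url "$u")"
done
```

Because the accept pattern is not anchored at the end, any URL that starts with a 163.com host passes, including query URLs such as http://g.163.com/a?... that redirect elsewhere.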
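Once http.agent.name is set in nutch-default.xml
and the regular-expression rules are in place in regex-urlfilter.txt,
make the scripts under runtime/local/bin executable on Linux.
Change into the bin directory and run:
chmod +x *.sh
This makes every .sh file executable. The launcher used below, bin/nutch, has no .sh extension, so if it is not already executable, run chmod +x nutch as well.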
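Now for a test.
Under runtime/local, create a urls directory containing a file named test, and put the entry URL of the site to crawl in that file:
http://www.163.com/
Save the file. The local directory now has a urls directory holding one seed file.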
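The same seed setup can be done from the shell. A minimal sketch, assuming the current directory is nutch/runtime/local; the file name test follows the text above.

```shell
#!/bin/sh
# Sketch: create the seed directory and entry file described above.
# Run from nutch/runtime/local.
mkdir -p urls
printf '%s\n' "http://www.163.com/" > urls/test

# Show what the injector will read: one URL per line.
cat urls/test
```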
Now we can run the test.
Before that, a look at the parameters of the Nutch crawl command:
Crawl <urlDir> -solr <solrURL> [-dir d] [-threads n] [-depth i] [-topN N]
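The items in [] are optional.
urlDir is the path of the seed (entry) file or directory.
-solr <solrURL> is the address of the Solr server (leave it empty if there is none).
-dir is where the crawl data is stored.
-threads is the number of fetcher threads (more is not always better; use just enough to meet your needs; the default is 10).
-depth is the crawl depth (default 5).
-topN is the crawl breadth (default Long.MAX_VALUE).
The bin directory contains a shell script named nutch, and its crawl command launches the crawl class.
Let's run a test crawl. The current directory is nutch/runtime/local.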
bin/nutch crawl urls -solr http://localhost:8080/solr/ -dir crawl -depth 2 -threads 5 -topN 100
To keep the output for later review, append >& followed by the output file to the end of the command.
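As a sketch of the redirect: >& works in bash and csh, while > file 2>&1 is the portable spelling that also works in plain sh. LOG_FILE and CRAWL_ARGS are hypothetical names introduced here, and the script only runs the crawl when bin/nutch is actually present and executable; otherwise it prints the command it would have run.

```shell
#!/bin/sh
# Sketch: run the test crawl and keep stdout and stderr in a log file.
# LOG_FILE and CRAWL_ARGS are hypothetical names for this example.
SOLR_URL="http://localhost:8080/solr/"
LOG_FILE="crawl.log"
CRAWL_ARGS="crawl urls -solr $SOLR_URL -dir crawl -depth 2 -threads 5 -topN 100"

if [ -x bin/nutch ]; then
  # Word splitting of $CRAWL_ARGS is intended here.
  bin/nutch $CRAWL_ARGS > "$LOG_FILE" 2>&1
  echo "crawl output written to $LOG_FILE"
else
  echo "bin/nutch not found here; would run: bin/nutch $CRAWL_ARGS > $LOG_FILE 2>&1"
fi
```

While a crawl is running, tail -f crawl.log in a second terminal follows the progress.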
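Solr has to be set up separately; I will explain how to deploy it in a separate Solr article. Here, -solr only needs the URL of the Solr instance.
For details on deploying Solr, see the Solr deployment article.
To test or develop on Windows, install Cygwin first; I will cover that in the article on setting up Nutch 1.4 in Eclipse.
Test output: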
crawl started in: crawl
rootUrlDir = urls/test.txt
threads = 10
depth = 2
solrUrl=http://localhost:8080/solr/
topN = 100
Injector: starting at 2012-02-07 14:21:20
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls/test.txt
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2012-02-07 14:21:25, elapsed: 00:00:04
Generator: starting at 2012-02-07 14:21:25
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 100
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20120207142128
Generator: finished at 2012-02-07 14:21:30, elapsed: 00:00:05
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2012-02-07 14:21:30
Fetcher: segment: crawl/segments/20120207142128
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
fetching http://www.163.com/
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Fetcher: throughput threshold: -1
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold retries: 5
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2012-02-07 14:21:36, elapsed: 00:00:05
ParseSegment: starting at 2012-02-07 14:21:36
ParseSegment: segment: crawl/segments/20120207142128
Parsing: http://www.163.com/
ParseSegment: finished at 2012-02-07 14:21:39, elapsed: 00:00:03
CrawlDb update: starting at 2012-02-07 14:21:39
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20120207142128]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2012-02-07 14:21:42, elapsed: 00:00:03
Generator: starting at 2012-02-07 14:21:42
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 100
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20120207142145
Generator: finished at 2012-02-07 14:21:48, elapsed: 00:00:05
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2012-02-07 14:21:48
Fetcher: segment: crawl/segments/20120207142145
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
Using queue mode : byHost
QueueFeeder finished: total 97 records + hit by time limit :0
Using queue mode : byHost
fetching http://bbs.163.com/
Using queue mode : byHost
fetching http://bbs.163.com/rank/
Using queue mode : byHost
fetching http://tech.163.com/cnstock/
Using queue mode : byHost
fetching http://tech.163.com/
Using queue mode : byHost
fetching http://tech.163.com/digi/nb/
Using queue mode : byHost
Using queue mode : byHost
fetching http://g.163.com/a?CID=10625&Values=3331479594&Redirect=http:/www.edu-163.com/Item/list.asp?id=1164
fetching http://g.163.com/r?site=netease&affiliate=homepage&cat=homepage&type=textlinkhouse&location=1
Using queue mode : byHost
fetching http://g.163.com/a?CID=10627&Values=896009995&Redirect=http:/www.dv37.com/jiaoyu/xiaoxinxing/
Using queue mode : byHost
fetching http://g.163.com/a?CID=10635&Values=1012801948&Redirect=http:/www.worldwayhk.com/
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
fetching http://g.163.com/a?CID=12392&Values=441270714&Redirect=http:/www.qinzhe.com/chinese/index.htm
fetching http://g.163.com/a?CID=10634&Values=2943411042&Redirect=http:/www.kpeng.com.cn/
fetching http://g.163.com/a?CID=12337&Values=3289604641&Redirect=http:/www.offcn.com/zg/2011ms/index.html
fetching http://g.163.com/a?CID=10633&Values=1745739655&Redirect=http:/www.edu-163.com/aidi/aidinj1.htm
fetching http://g.163.com/a?CID=12307&Values=3388898846&Redirect=http:/www.offcn.com/zg/2011ms/index.html
fetching http://g.163.com/a?CID=10629&Values=740233954&Redirect=http:/www.embasjtu.com/
fetching http://g.163.com/a?CID=10632&Values=715626766&Redirect=http:/www.edu-163.com/aidi/aidimg.htm
fetching http://g.163.com/a?CID=12259&Values=3180311081&Redirect=http:/www.gpkdtx.com/
fetching http://g.163.com/a?CID=12271&Values=904657751&Redirect=http:/www.vipabc.com/count.asp?code=QnfF0agFbn
fetching http://g.163.com/a?CID=10628&Values=2735701856&Redirect=http:/www.wsi.com.cn
fetching http://g.163.com/a?CID=10623&Values=1704187161&Redirect=http:/www.wsi.com.cn
fetching http://g.163.com/a?CID=12267&Values=608079303&Redirect=http:/edu.163.com/special/official/
fetching http://g.163.com/a?CID=10631&Values=3773655455&Redirect=http:/www.xinhaowei.cn/zt/sasheng-new/
fetching http://g.163.com/a?CID=10630&Values=4025376053&Redirect=http:/www.bwpx.com/
fetching http://g.163.com/a?CID=12283&Values=1441209353&Redirect=http:/www.zyqm.org/
fetching http://mobile.163.com/
fetching http://mobile.163.com/app/
fetching http://reg.vip.163.com/enterMail.m?enterVip=true-----------
fetching http://product.tech.163.com/mobile/
fetching http://hea.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=68
fetching http://reg.email.163.com/mailregAll/reg0.jsp?from=163&regPage=163
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=67
fetching http://yuehui.163.com/
fetching http://auto.163.com/
fetching http://auto.163.com/buy/
fetching http://gongyi.163.com/
fetching http://reg.163.com/Main.jsp?username=pInfo
fetching http://reg.163.com/Logout.jsp?username=accountName&url=http:/www.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=61
fetching http://money.163.com/fund/
fetching http://money.163.com/stock/
fetching http://money.163.com/hkstock/
fetching http://money.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=57
fetching http://blog.163.com/passportIn.do?entry=163
fetching http://blog.163.com/?fromNavigation
fetching http://pay.163.com/
fetching http://baby.163.com/
fetching http://discovery.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=52
fetching http://p.mail.163.com/mailinfo/shownewmsg_www_0819.htm
fetching http://help.163.com?b01abh1
fetching http://www.163.com/rss/
fetching http://home.163.com/
fetching http://product.auto.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=47
fetching http://ecard.163.com/
fetching http://photo.163.com/?username=pInfo
fetching http://photo.163.com/pp/square/
fetching http://email.163.com/
fetching http://m.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=42
fetching http://edu.163.com/
fetching http://edu.163.com/liuxue/
fetching http://xf.house.163.com/gz/
fetching http://game.163.com/
fetching http://travel.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=37
fetching http://baoxian.163.com/?from=index
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=36
fetching http://zx.caipiao.163.com?from=shouye
fetching http://entry.mail.163.com/coremail/fcg/ntesdoor2?verifycookie=1&lightweight=1
fetching http://biz.163.com/
fetching http://t.163.com/rank?f=163dh
fetching http://t.163.com/chat?f=163dh
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=31
fetching http://t.163.com/?f=wstopmicoblogmsg
fetch of http://zx.caipiao.163.com?from=shouye failed with: org.apache.nutch.protocol.http.api.HttpException: bad status line '<html>': For input string: "<html>"
fetching http://t.163.com/rank/daren?f=163dh
fetching http://t.163.com/?f=wstopmicoblogmsg.enter
fetching http://t.163.com/
fetching http://sports.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=26
fetching http://sports.163.com/nba/
fetching http://sports.163.com/cba/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=24
fetching http://sports.163.com/yc/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=23
fetching http://vipmail.163.com/
fetching http://digi.163.com/
fetching http://lady.163.com/beauty/
fetching http://lady.163.com/
fetching http://lady.163.com/sense/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=18
fetching http://house.163.com/
fetching http://news.163.com/review/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=16
fetching http://news.163.com/photo/
fetching http://news.163.com/
fetching http://v.163.com/doc/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=13
fetching http://v.163.com/zongyi/
fetching http://v.163.com/
fetching http://v.163.com/focus/
fetching http://fushi.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=9
fetching http://yc.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=8
fetching http://mall.163.com/
fetching http://ent.163.com/movie/
fetching http://ent.163.com/
fetching http://ent.163.com/music/
fetching http://ent.163.com/tv/
fetching http://war.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=2
* queue: http://fashion.163.com
  maxThreads = 10
  inProgress = 0
  crawlDelay = 5000
  minCrawlDelay = 0
  nextFetchTime = 1328595704430
  now = 1328595728444
  0. http://fashion.163.com/
* queue: http://book.163.com
  maxThreads = 10
  inProgress = 0
  crawlDelay = 5000
  minCrawlDelay = 0
  nextFetchTime = 1328595704430
  now = 1328595728445
  0. http://book.163.com/
fetching http://fashion.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=1
* queue: http://book.163.com
  maxThreads = 10
  inProgress = 0
  crawlDelay = 5000
  minCrawlDelay = 0
  nextFetchTime = 1328595704430
  now = 1328595729445
  0. http://book.163.com/
fetching http://book.163.com/
-activeThreads=10, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=9
-finishing thread FetcherThread, activeThreads=8
-activeThreads=8, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2012-02-07 14:22:20, elapsed: 00:00:32
ParseSegment: starting at 2012-02-07 14:22:20
ParseSegment: segment: crawl/segments/20120207142145
Parsing: http://auto.163.com/
Parsing: http://auto.163.com/buy/
Parsing: http://baby.163.com/
Parsing: http://baoxian.163.com/?from=index
Parsing: http://bbs.163.com/
Parsing: http://bbs.163.com/rank/
Parsing: http://biz.163.com/
Parsing: http://blog.163.com/?fromNavigation
Parsing: http://book.163.com/
Parsing: http://digi.163.com/
Parsing: http://discovery.163.com/
Parsing: http://edu.163.com/
Parsing: http://edu.163.com/liuxue/
Parsing: http://email.163.com/
Parsing: http://ent.163.com/
Parsing: http://ent.163.com/movie/
Parsing: http://ent.163.com/music/
Parsing: http://ent.163.com/tv/
Parsing: http://fashion.163.com/
Parsing: http://fushi.163.com/
Parsing: http://g.163.com/a?CID=10623&Values=1704187161&Redirect=http:/www.wsi.com.cn
Error parsing: http://g.163.com/a?CID=10623&Values=1704187161&Redirect=http:/www.wsi.com.cn: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=10625&Values=3331479594&Redirect=http:/www.edu-163.com/Item/list.asp?id=1164
Error parsing: http://g.163.com/a?CID=10625&Values=3331479594&Redirect=http:/www.edu-163.com/Item/list.asp?id=1164: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=10627&Values=896009995&Redirect=http:/www.dv37.com/jiaoyu/xiaoxinxing/
Error parsing: http://g.163.com/a?CID=10627&Values=896009995&Redirect=http:/www.dv37.com/jiaoyu/xiaoxinxing/: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=10628&Values=2735701856&Redirect=http:/www.wsi.com.cn
Error parsing: http://g.163.com/a?CID=10628&Values=2735701856&Redirect=http:/www.wsi.com.cn: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=10629&Values=740233954&Redirect=http:/www.embasjtu.com/
Error parsing: http://g.163.com/a?CID=10629&Values=740233954&Redirect=http:/www.embasjtu.com/: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=10630&Values=4025376053&Redirect=http:/www.bwpx.com/
Error parsing: http://g.163.com/a?CID=10630&Values=4025376053&Redirect=http:/www.bwpx.com/: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=10631&Values=3773655455&Redirect=http:/www.xinhaowei.cn/zt/sasheng-new/
Error parsing: http://g.163.com/a?CID=10631&Values=3773655455&Redirect=http:/www.xinhaowei.cn/zt/sasheng-new/: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=10632&Values=715626766&Redirect=http:/www.edu-163.com/aidi/aidimg.htm
Parsing: http://g.163.com/a?CID=10633&Values=1745739655&Redirect=http:/www.edu-163.com/aidi/aidinj1.htm
Parsing: http://g.163.com/a?CID=10634&Values=2943411042&Redirect=http:/www.kpeng.com.cn/
Error parsing: http://g.163.com/a?CID=10634&Values=2943411042&Redirect=http:/www.kpeng.com.cn/: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=10635&Values=1012801948&Redirect=http:/www.worldwayhk.com/
Error parsing: http://g.163.com/a?CID=10635&Values=1012801948&Redirect=http:/www.worldwayhk.com/: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=12259&Values=3180311081&Redirect=http:/www.gpkdtx.com/
Error parsing: http://g.163.com/a?CID=12259&Values=3180311081&Redirect=http:/www.gpkdtx.com/: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=12267&Values=608079303&Redirect=http:/edu.163.com/special/official/
Error parsing: http://g.163.com/a?CID=12267&Values=608079303&Redirect=http:/edu.163.com/special/official/: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=12271&Values=904657751&Redirect=http:/www.vipabc.com/count.asp?code=QnfF0agFbn
Error parsing: http://g.163.com/a?CID=12271&Values=904657751&Redirect=http:/www.vipabc.com/count.asp?code=QnfF0agFbn: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=12283&Values=1441209353&Redirect=http:/www.zyqm.org/
Error parsing: http://g.163.com/a?CID=12283&Values=1441209353&Redirect=http:/www.zyqm.org/: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
Parsing: http://g.163.com/a?CID=12307&Values=3388898846&Redirect=http:/www.offcn.com/zg/2011ms/index.html
Parsing: http://g.163.com/a?CID=12337&Values=3289604641&Redirect=http:/www.offcn.com/zg/2011ms/index.html
Parsing: http://g.163.com/a?CID=12392&Values=441270714&Redirect=http:/www.qinzhe.com/chinese/index.htm
Parsing: http://g.163.com/r?site=netease&affiliate=homepage&cat=homepage&type=textlinkhouse&location=1
Parsing: http://game.163.com/
Parsing: http://gongyi.163.com/
Parsing: http://hea.163.com/
Parsing: http://home.163.com/
Parsing: http://house.163.com/
Parsing: http://lady.163.com/
Parsing: http://lady.163.com/beauty/
Parsing: http://lady.163.com/sense/
Parsing: http://mall.163.com/
Parsing: http://mobile.163.com/
Parsing: http://mobile.163.com/app/
Parsing: http://money.163.com/
Parsing: http://money.163.com/fund/
Parsing: http://money.163.com/hkstock/
Parsing: http://money.163.com/stock/
Parsing: http://news.163.com/
Parsing: http://news.163.com/photo/
Parsing: http://news.163.com/review/
Parsing: http://p.mail.163.com/mailinfo/shownewmsg_www_0819.htm
Parsing: http://pay.163.com/
Parsing: http://photo.163.com/pp/square/
Parsing: http://product.auto.163.com/
Parsing: http://product.tech.163.com/mobile/
Parsing: http://reg.163.com/Logout.jsp?username=accountName&url=http:/www.163.com/
Parsing: http://reg.163.com/Main.jsp?username=pInfo
Parsing: http://reg.email.163.com/mailregAll/reg0.jsp?from=163&regPage=163
Parsing: http://reg.vip.163.com/enterMail.m?enterVip=true-----------
Parsing: http://sports.163.com/
Parsing: http://sports.163.com/cba/
Parsing: http://sports.163.com/nba/
Parsing: http://sports.163.com/yc/
Parsing: http://t.163.com/chat?f=163dh
Parsing: http://t.163.com/rank/daren?f=163dh
Parsing: http://t.163.com/rank?f=163dh
Parsing: http://tech.163.com/
Parsing: http://tech.163.com/cnstock/
Parsing: http://tech.163.com/digi/nb/
Parsing: http://travel.163.com/
Parsing: http://v.163.com/
Parsing: http://v.163.com/doc/
Parsing: http://v.163.com/focus/
Parsing: http://vipmail.163.com/
Parsing: http://war.163.com/
Parsing: http://www.163.com/rss/
Parsing: http://xf.house.163.com/gz/
Parsing: http://yc.163.com/
Parsing: http://yuehui.163.com/
ParseSegment: finished at 2012-02-07 14:22:26, elapsed: 00:00:06
CrawlDb update: starting at 2012-02-07 14:22:26
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20120207142145]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2012-02-07 14:22:30, elapsed: 00:00:04
crawl finished: crawl
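A log of this size is easier to judge from a few counts than by reading it line by line. The sketch below assumes the output was saved to a file with one log entry per line; LOG_FILE and the file name crawl-sample.log are hypothetical, and when the file does not exist the script writes a small sample, copied from the output above, so that it runs on its own.

```shell
#!/bin/sh
# Sketch: summarize a saved crawl log.
# LOG_FILE is a hypothetical name; point it at your own saved output.
LOG_FILE="${LOG_FILE:-crawl-sample.log}"

# Write a small sample (lines copied from the output above) if no log exists yet.
if [ ! -f "$LOG_FILE" ]; then
cat > "$LOG_FILE" <<'EOF'
fetching http://www.163.com/
fetching http://zx.caipiao.163.com?from=shouye
fetch of http://zx.caipiao.163.com?from=shouye failed with: org.apache.nutch.protocol.http.api.HttpException: bad status line '<html>': For input string: "<html>"
Parsing: http://www.163.com/
Parsing: http://g.163.com/a?CID=10623&Values=1704187161&Redirect=http:/www.wsi.com.cn
Error parsing: http://g.163.com/a?CID=10623&Values=1704187161&Redirect=http:/www.wsi.com.cn: failed(2,0): Can't retrieve Tika parser for mime-type application/octet-stream
EOF
fi

# grep -c prints the number of matching lines; "|| true" keeps a zero count
# from aborting scripts that run under "set -e".
FETCHED=$(grep -c '^fetching ' "$LOG_FILE" || true)
FETCH_FAILED=$(grep -c '^fetch of .* failed with' "$LOG_FILE" || true)
PARSE_ERRORS=$(grep -c '^Error parsing: ' "$LOG_FILE" || true)

echo "fetch attempts: $FETCHED"
echo "fetch failures: $FETCH_FAILED"
echo "parse errors:   $PARSE_ERRORS"
```

In the run above, all of the parse errors come from g.163.com redirect URLs, which suggests tightening the accept rule rather than treating them as crawler faults.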
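Comments
Where did you cover the part about developing Nutch 1.4 in Eclipse?
I have a document for it but never posted it. I stopped working on this afterwards, so the blog series was not continued. I will post it when I find the time.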
[yzb@www local]$ bin/nutch crawl urls -solr http://localhost:8080/solr/ -dir crawl -depth 2 -threads 5 -topN 100
crawl started in: crawl
rootUrlDir = urls
threads = 5
depth = 2
solrUrl=http://localhost:8080/solr/
topN = 100
Injector: starting at 2012-05-22 20:50:14
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:127)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
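How do I fix this? Please reply quickly if you see it, thanks!
Check whether plugin.folders in nutch-default.xml has been changed to ./src/plugin; the default is plugins.
After changing it, the crawl starts normally.
This is usually a plugin path problem!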
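OK. I have been swamped with work these past few days and pulled an all-nighter on Monday, so there has been no time to update. Really sorry about that...
I have all of this written up already.
Where? Could I have a look? I would like to borrow whatever material you have on this topic!
It is in Word on my own machine; I never published it.
OK, I will write it up tonight when I get home, if nothing else comes up, and then also post some of the problems I ran into earlier, haha.
Thanks a lot.
No problem, plenty of people in the group are waiting too... I have just been too busy lately to find the time, haha.
I will write it tonight... I have been busy changing jobs lately...
Thank you very much!!
I got back too late last night, so I did not write it. I will see how much time I have and get it written as soon as I can; I started my new job today.
Take care of your own things first, I am in no great hurry!
Is it written yet? Also, which operating system is your Eclipse setup on?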