Nutch fetch fails with "failed with: Http code=403"

qjwujian


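For my graduation project I plan to build a search engine for our campus network.

I downloaded Nutch 1.2, did some basic configuration, and gave it a trial run.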

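Step 1: In the unpacked nutch1.2 directory, create a urls directory. Inside it, create a file named url.txt and write the address of the site to crawl into it: http://www.ujs.edu.cn/

Step 2: In the nutch1.2 directory, create a logs directory to hold the log files, then create an empty test.log file inside it.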
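The first two steps can be sketched as a few shell commands. This is only a minimal sketch; it assumes the current directory is the unpacked Nutch 1.2 directory.

```shell
# Assumption: run from inside the unpacked Nutch 1.2 directory.
# Create the seed directory and the log directory (no error if they exist).
mkdir -p urls logs

# Write the single seed URL, one URL per line.
printf '%s\n' 'http://www.ujs.edu.cn/' > urls/url.txt

# Create (or truncate) the empty log file.
: > logs/test.log
```

The empty test.log is not strictly needed, since the shell redirection in the crawl command creates the file, but the logs directory itself must exist before the redirect.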

Step 3: Go into the conf directory and edit nutch-site.xml. This file mainly holds the identifying information for your spider.

My nutch-site.xml looks like this:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
<property>
<name>http.agent.name</name>
<value>mynutch</value>
<description>test
</description>
</property>
<property>
<name>http.agent.description</name>
<value>spider</value>
<description> spider
</description>
</property>
<property>
<name>http.agent.url</name>
<value>http://www.xxx.com </value>
<description>http://www.xxx.com
</description>
</property>
<property>
<name>http.agent.email</name>
<value>MyEmail</value>
<description>test@gmail.com
</description>
</property>
</configuration>

Step 4: Edit crawl-urlfilter.txt in the conf directory. Find the line "# accept hosts in MY.DOMAIN.NAME" and change the line immediately below it to "+http://www.ujs.edu.cn"
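To rule out the URL filter as the reason nothing was crawled, the accept rule can be checked against the seed URL. The sketch below is an approximation only: `matches_rule` is a hypothetical helper, it uses `grep -E` rather than the Java regex engine Nutch uses, and it assumes a rule accepts a URL when its pattern is found anywhere in that URL.

```shell
# Hypothetical helper: succeed if the pattern part of a "+pattern" rule
# is found somewhere in the URL. grep -E only approximates Java regexes.
matches_rule() {
  pattern=$1
  url=$2
  printf '%s\n' "$url" | grep -Eq -- "$pattern"
}

# The pattern from step 4 (without the leading "+") against the seed URL.
if matches_rule 'http://www.ujs.edu.cn' 'http://www.ujs.edu.cn/'; then
  echo "seed URL passes the accept rule"
else
  echo "seed URL is rejected by the accept rule"
fi
```

Note that the dots in this pattern are unescaped, so each one matches any character. A stricter pattern would escape them and anchor the start, for example `^http://www\.ujs\.edu\.cn/`.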

Step 5: I am on Ubuntu, so I opened a shell, changed into the nutch1.2 directory, and ran the crawl command:

bin/nutch crawl urls/url.txt -dir crawled >logs/test.log

The crawl finished after about a minute, but it fetched no data at all. The log is as follows:

test.log

crawl started in: crawled
rootUrlDir = urls/url.txt
threads = 10
depth = 5
indexer=lucene
Injector: starting at 2011-04-18 20:19:19
Injector: crawlDb: crawled/crawldb
Injector: urlDir: urls/url.txt
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2011-04-18 20:19:23, elapsed: 00:00:03
Generator: starting at 2011-04-18 20:19:23
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawled/segments/20110418201927
Generator: finished at 2011-04-18 20:19:28, elapsed: 00:00:05
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2011-04-18 20:19:28
Fetcher: segment: crawled/segments/20110418201927
Fetcher: threads: 10
QueueFeeder finished: total 1 records + hit by time limit :0
fetching http://www.ujs.edu.cn/
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
fetch of http://www.ujs.edu.cn/ failed with: Http code=403, url=http://www.ujs.edu.cn/
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2011-04-18 20:19:33, elapsed: 00:00:04
CrawlDb update: starting at 2011-04-18 20:19:33
CrawlDb update: db: crawled/crawldb
CrawlDb update: segments: [crawled/segments/20110418201927]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2011-04-18 20:19:36, elapsed: 00:00:02
Generator: starting at 2011-04-18 20:19:36
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting at 2011-04-18 20:19:37
LinkDb: linkdb: crawled/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/home/hello/nutch-1.2/crawled/segments/20110418201927
LinkDb: finished at 2011-04-18 20:19:39, elapsed: 00:00:01
Indexer: starting at 2011-04-18 20:19:39
Indexer: finished at 2011-04-18 20:19:43, elapsed: 00:00:03
Dedup: starting at 2011-04-18 20:19:43
Dedup: adding indexes in: crawled/indexes
Dedup: finished at 2011-04-18 20:19:48, elapsed: 00:00:05
IndexMerger: starting at 2011-04-18 20:19:48
IndexMerger: merging indexes to: crawled/index
Adding file:/home/hello/nutch-1.2/crawled/indexes/part-00000
IndexMerger: finished at 2011-04-18 20:19:48, elapsed: 00:00:00
crawl finished: crawled

The log contains this error: fetch of http://www.ujs.edu.cn/ failed with: Http code=403, url=http://www.ujs.edu.cn/
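In a longer log, lines like this are easy to miss. A small filter can list every failed fetch together with its HTTP status code. This is a sketch only; `failed_fetches` is a hypothetical helper, and it assumes failure lines have exactly the shape shown in the log above.

```shell
# Hypothetical helper: print "<status code> <url>" for each failed fetch
# in a crawl log whose failure lines look like the one in test.log.
failed_fetches() {
  sed -n 's/^fetch of \(.*\) failed with: Http code=\([0-9][0-9]*\),.*$/\2 \1/p' "$1"
}

# Run it against the log from step 5 if that file is present.
if [ -f logs/test.log ]; then
  failed_fetches logs/test.log
fi
```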

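I tried several times and got the same result every time, yet http://www.ujs.edu.cn opens normally in a browser. A 403 error means the server refused the request because the client has no permission to read the content, and I do not understand why that would happen here. I searched online and found nothing useful. Can anyone tell me what I got wrong?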
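One way to narrow this down is to check whether the server answers differently depending on the User-Agent header, since that is an obvious difference between a browser and the crawler. The post does not confirm this as the cause, so treat the following as a diagnostic sketch. Both helpers are hypothetical; the agent format `name/version (description; url; email)` and the version value `Nutch-1.2` are assumptions about what Nutch 1.2 sends with the configuration above.

```shell
# Assumed format of the crawler's User-Agent header:
#   name/version (description; url; email)
nutch_agent() {
  printf '%s/%s (%s; %s; %s)\n' "$1" "$2" "$3" "$4" "$5"
}

# Hypothetical helper: fetch a URL with the given User-Agent and print
# only the HTTP status code (requires curl and network access).
status_with_agent() {
  curl -s -o /dev/null -w '%{http_code}' -A "$1" "$2"
}

# Agent string built from the values in nutch-site.xml; "Nutch-1.2" is an
# assumed default for http.agent.version.
agent=$(nutch_agent mynutch Nutch-1.2 spider 'http://www.xxx.com' MyEmail)
echo "$agent"
```

Running `status_with_agent "$agent" http://www.ujs.edu.cn/` and then the same call with a browser-like agent string would show whether only the crawler's identity gets a 403. If both return the same code, the cause lies elsewhere.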


2011-04-18 20:27