Nutch 1.4: Setting Up Scheduled Crawls

peigang

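Combined with the Linux job scheduler, Nutch 1.4 can crawl automatically on a fixed schedule. For background on Linux scheduled tasks, see "Linux定时执行任务命令：at和crontab" (Linux commands for scheduled tasks: at and crontab).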

The steps are as follows:

1. First, check the current user's crontab by running:

crontab -l
Output:
no crontab for ***
This means no crontab has been defined for this user.

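2. Edit the crontab:

crontab -e
*/10 * * * * /home/*/*.sh

This entry runs every 10 minutes. The *.sh file contains the Nutch crawl commands, such as crawl. Do not put a comment after the command on the same line: cron treats everything after the schedule fields as part of the command.

Pay attention to the account the job runs under. root is used here; for any other account, change the account name accordingly. Also give the *.sh file execute permission.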
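The entry can also be added without opening an editor. The following is a minimal sketch, not part of the original procedure: `add_cron_entry` is a hypothetical helper that reads an existing crontab on standard input and appends the entry only if it is not already there, so running it twice does not duplicate the job.

```shell
#!/bin/sh
# add_cron_entry ENTRY
# Hypothetical helper: reads an existing crontab on stdin and writes it to
# stdout, appending ENTRY only when an identical line is not already present.
add_cron_entry() {
  entry=$1
  existing=$(cat)
  # Re-emit the existing entries unchanged (skip if the crontab was empty).
  if [ -n "$existing" ]; then
    printf '%s\n' "$existing"
  fi
  # -F: fixed string, -x: whole-line match, -q: quiet.
  if ! printf '%s\n' "$existing" | grep -Fxq -- "$entry"; then
    printf '%s\n' "$entry"
  fi
}

# Usage (the path is the elided placeholder from the post, fill in your own):
# crontab -l 2>/dev/null | add_cron_entry '*/10 * * * * /home/*/*.sh' | crontab -
```

`crontab -` installs the crontab read from standard input, and `2>/dev/null` hides the "no crontab for ..." message when none exists yet.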

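If the *.sh script relies on system environment variables, it will fail to run properly under cron, because cron does not load them (see the related article: http://peigang.iteye.com/blog/1567706). Use the following form instead: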

crontab -e
*/10 * * * * . /etc/profile;/bin/sh /home/*/*.sh

The ". /etc/profile;/bin/sh" prefix loads the environment variables from /etc/profile before the script runs.
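To see the difference before waiting for cron to fire, a command can be run by hand in a stripped-down environment. This is a rough sketch under assumptions: `run_like_cron` is a hypothetical helper, and the variables it keeps (HOME, PATH, SHELL) only approximate what a typical cron daemon provides.

```shell
#!/bin/sh
# run_like_cron COMMAND_STRING
# Hypothetical helper: runs COMMAND_STRING under /bin/sh with almost every
# environment variable removed, roughly approximating a cron job's environment.
run_like_cron() {
  env -i HOME="${HOME:-/}" PATH=/usr/bin:/bin SHELL=/bin/sh /bin/sh -c "$1"
}

# Usage: compare the two results. The first typically prints an empty value,
# the second prints whatever /etc/profile exports on your machine.
# run_like_cron 'echo "JAVA_HOME=$JAVA_HOME"'
# run_like_cron '. /etc/profile; echo "JAVA_HOME=$JAVA_HOME"'
```

If the crawl script works in a normal shell but fails under `run_like_cron`, a missing environment variable is the likely cause.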

3. Run sudo apt-get install libnotify-bin

4. Restart the cron process:

~#sudo /etc/init.d/cron restart

Observe the result. The restart may not succeed; in that case, stop and start the service separately, as in this session:

15:40:34^O^bin$ sudo /etc/init.d/cron stop
 [sudo] password for sniffer: 
 Rather than invoking init scripts through /etc/init.d, use the service(8)
 utility, e.g. service cron stop

 Since the script you are attempting to invoke has been converted to an
 Upstart job, you may also use the stop(8) utility, e.g. stop cron
 cron stop/waiting
 15:40:49^O^bin$ ps -A | grep cron
 15:40:54^O^bin$ sudo /etc/int.d/cron start
 sudo: /etc/int.d/cron: command not found
 15:41:11^O^bin$ sudo /etc/init.d/cron start
 Rather than invoking init scripts through /etc/init.d, use the service(8)
 utility, e.g. service cron start

 Since the script you are attempting to invoke has been converted to an
 Upstart job, you may also use the start(8) utility, e.g. start cron
 cron start/running, process 14362
 15:41:19^O^bin$ ps -A | grep cron
 14362 ?        00:00:00 cron
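Once cron is running again, it helps to have each scheduled run leave a trace, since cron jobs have no terminal to print to. The sketch below is an assumption-laden example rather than part of the original setup: `run_logged` is a hypothetical helper, and the log path and crawl arguments in the usage comment are placeholders.

```shell
#!/bin/sh
# run_logged LOGFILE CMD [ARGS...]
# Hypothetical helper: runs CMD, appends its stdout and stderr to LOGFILE,
# records the exit status in LOGFILE, and returns that same status.
run_logged() {
  logfile=$1
  shift
  echo "[$(date '+%Y-%m-%d %H:%M:%S')] start: $*" >> "$logfile"
  "$@" >> "$logfile" 2>&1
  status=$?
  echo "[$(date '+%Y-%m-%d %H:%M:%S')] exit status: $status" >> "$logfile"
  return "$status"
}

# Usage inside the *.sh file called by cron (assumed layout, adjust to your install):
# run_logged "$HOME/crawl.log" "$NUTCH_HOME/bin/nutch" crawl urls -dir crawl -depth 3 -topN 50
```

A run that exits early then shows up in the log with a nonzero exit status and whatever the command wrote to stderr.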

Note: if the nutch script cannot find JAVA_HOME, modify the following part of the script to fix it:

if [ "$JAVA_HOME" = "" ]; then
  #echo "Error: JAVA_HOME is not set."
  #exit 1
  JAVA_HOME="***"
fi
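An alternative that leaves the nutch script untouched is to set the variable in the *.sh wrapper before calling nutch. This is only a sketch: `ensure_java_home` is a hypothetical helper, and the fallback directory is a placeholder for wherever the JDK lives on your machine.

```shell
#!/bin/sh
# ensure_java_home FALLBACK_DIR
# Hypothetical helper: keeps an existing JAVA_HOME; otherwise sets it to
# FALLBACK_DIR. Either way the variable is exported for child processes.
ensure_java_home() {
  if [ -z "${JAVA_HOME:-}" ]; then
    JAVA_HOME=$1
  fi
  export JAVA_HOME
}

# Usage (placeholder path, replace with your JDK location):
# ensure_java_home /path/to/your/jdk
```

Because the variable is exported, the nutch script started afterwards from the same wrapper sees it and passes its own JAVA_HOME check.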

2012-06-13 15:03
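Comment #2, peigang, 2015-06-25

Try tracing the script; it is most likely an environment variable problem.

Comment #1, zhangmj10, 2015-05-26

Hello, I see this post is from a long time ago, so I am not sure you will see this, but I hope you can help. When cron runs my Nutch crawl command, it always exits early, with no error message and no log. I traced the script, and it seems to exit early inside the crawl script, yet the crawl script runs fine when I execute it directly. Could you help me sort this out?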
