Hadoop Notes

gushuizerotoone

Blog category: hadoop

Hadoop Workflow Eclipse BBS HBase

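Hadoop technical forum: http://bbs.hadoopor.com/index.php
1. Setting up a Hadoop 0.20.0 + Eclipse environment: http://bbs.hadoopor.com/thread-43-1-1.html
A very good write-up by someone in Taiwan on setting up Hadoop 0.20.0 + Eclipse: http://trac.nchc.org.tw/cloud/wiki/waue/2009/0617 It also shows how to package the project as a jar.
Note that in its Makefile, the command lines such as "jar -cvf ${JarFile} -C bin/ ." and "hadoop jar ${JarFile} ${MainFunc} input output" must start with a tab, not spaces. I commented out everything under the help target by prefixing it with "#", because I don't yet know how to use it.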

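1. How a client transfers data to Hadoop in a cluster setup: http://bbs.hadoopor.com/thread-362-1-1.html
Use the DFSClient tool. The client does not need a Hadoop deployment to upload data; having the DFSClient tool installed is enough.
bin/hadoop fs -put accesses HDFS "remotely" by way of DFSClient (even when it is run locally).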

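2. Hadoop support for MySQL: http://bbs.hadoopor.com/thread-132-1-2.html
lance(274105045) 09:48:43
It seems 0.20 provides DB input and output.
hadoopor(784027584) 09:48:50
But to run Jobs in parallel you cannot use the default scheduler; the FairScheduler provided by Facebook supports parallel scheduling of Jobs. ?????
Spork(47986766) 09:49:16
Not "it seems"; it does. It's just that the databases currently supported well are all open source, such as MySQL.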

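3. Introduction to SequenceFile: http://bbs.hadoopor.com/thread-144-1-1.html

4. JobTracker.JobInProgress (http://bbs.hadoopor.com/thread-212-1-1.html) is used to monitor the scheduling of a Job. A Job is split into N Tasks, which are assigned to TaskTracker nodes in the cluster; the TaskTracker nodes then execute those Tasks.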

========== Found by searching Nabble Hadoop ==========
1. Hadoop 0.17 schedules jobs FIFO. If it doesn't,
that is a bug. http://old.nabble.com/Hadoop-job-scheduling-issue-td19659938.html#a19659938

2. Can jobs be configured to be sequential? That is, jobs in Group1 execute first and jobs in Group2 execute later, with Group2 jobs depending on Group1 jobs. The jobs within Group1 or within Group2 are independent of each other.
http://old.nabble.com/Can-jobs-be-configured-to-be-sequential-td20043257.html#a20043257
I recommend that you look at http://cascading.org as
an abstraction layer for managing these kinds of workflows. We've
found it quite useful.
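
Without a workflow layer, the Group1-then-Group2 ordering can also be approximated in a plain shell script. This is only a minimal sketch, assuming each job is a command that exits non-zero on failure. The helper name run_group and the echo stand-ins are hypothetical; a real script would pass "hadoop jar ..." command lines instead.

```shell
#!/bin/sh
# run_group: start every command in the background, then wait for all of them.
# Returns non-zero if any command in the group failed.
# (Hypothetical helper; not part of Hadoop.)
run_group() {
    pids=""
    failed=0
    for cmd in "$@"; do
        sh -c "$cmd" &          # jobs inside one group are independent
        pids="$pids $!"
    done
    for pid in $pids; do
        wait "$pid" || failed=1 # wait returns that job's exit status
    done
    return "$failed"
}

# Group 2 starts only after every Group 1 job has succeeded.
# The echo commands are stand-ins for real job submissions.
if run_group "echo group1-jobA" "echo group1-jobB"; then
    run_group "echo group2-jobC"
else
    echo "Group 1 failed; skipping Group 2" >&2
    exit 1
fi
```

This only covers two fixed groups; for anything with richer dependencies, a tool such as Cascading (recommended in the thread) is likely the better fit.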

3. Sequence of Streaming Jobs: if you are using sh or bash, the variable $? holds the exit status of the last command executed.

hadoop jar streaming.jar ...
if [ $? -ne 0 ]; then
    echo "My job failed" >&2
    exit 1
fi

Caution: $? holds the exit status of the most recently executed command only. It is easy to run another command before testing and then test the wrong command's exit status.
http://old.nabble.com/Sequence-of-Streaming-Jobs-td23336043.html#a23351848
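
One way to avoid testing the wrong exit status is to copy $? into a variable on the very next step. A minimal sketch, assuming the job is an ordinary command; run_job and job_status are hypothetical names, and "false" stands in for a failing streaming job.

```shell
#!/bin/sh
# run_job: run a command and save its exit status in job_status right away.
# (Hypothetical helper; not part of Hadoop.)
run_job() {
    job_status=0
    "$@" || job_status=$?   # $? here still belongs to "$@"
    echo "finished: $*"     # this echo resets $? to 0
    return "$job_status"
}

# 'false' stands in for a failing 'hadoop jar streaming.jar ...' run.
if ! run_job false; then
    echo "My job failed (exit status $job_status)" >&2
fi
```

Because the status is saved before the echo runs, logging or cleanup commands can follow the job without hiding its failure.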

4.mapred.tasktracker.map.tasks.maximum
mapred.tasktracker.reduce.tasks.maximum
mapred.map.multithreadedrunner.threads

5. http://old.nabble.com/Linking-2-MarReduce-jobs-together--td18756178.html#a18756178
Is it possible to use the output of the reduce phase of job 1
as the input to job 2?
Well, your data has to be somewhere between the two jobs... So I'd say yes, put it in HBase or HDFS to reuse it.
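
In the HDFS case, linking the two jobs comes down to passing job 1's output directory as job 2's input directory. A minimal sketch under assumptions: the jar names, main classes and the chain_jobs helper are hypothetical, and the script is a dry run by default, printing the commands rather than submitting anything.

```shell
#!/bin/sh
# Dry run by default: prints the commands instead of submitting jobs.
# Set HADOOP_CMD=hadoop to run against a real installation.
HADOOP_CMD="${HADOOP_CMD:-echo hadoop}"

# chain_jobs INPUT INTERMEDIATE OUTPUT
# job1.jar/Job1Main and job2.jar/Job2Main are hypothetical names.
chain_jobs() {
    # Job 1 writes its reduce output to the intermediate directory.
    $HADOOP_CMD jar job1.jar Job1Main "$1" "$2" || return 1
    # Job 2 reads the directory that job 1 wrote.
    $HADOOP_CMD jar job2.jar Job2Main "$2" "$3"
}

chain_jobs input intermediate output
```

The second job starts only if the first one exits with status 0, so a failed job 1 never feeds partial output into job 2.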

6. Chapter 8 of "Pro Hadoop" covers this topic.

7.http://old.nabble.com/Customizing-machines-to-use-for-different-jobs-td23864519.html#a23864519
Customizing machines to use for different jobs:
Unfortunately there is no built-in way of doing this. You'd have to
instantiate two entirely separate Hadoop clusters to accomplish what you're
trying to do, which isn't an uncommon thing to do.

I'm not sure why you're hoping to have this behavior, but the fair share
scheduler might be helpful to you. It lets you essentially divvy up your
cluster into queues, where each queue has its own "chunk" of the cluster.
When resources are available outside of the "chunk," then jobs can span into
other queues' space.

Cloudera's Distribution for Hadoop (<http://www.cloudera.com/hadoop>)
includes the fair share scheduler. I recommend using our distribution,
otherwise here is the fair share JIRA:

<http://issues.apache.org/jira/browse/HADOOP-3746>

8. http://old.nabble.com/How-to-run-many-jobs-at-the-same-time--td23151917.html#a23151917
How to run many jobs at the same time? The thread gives a JobControl example.

9.http://issues.apache.org/jira/browse/HADOOP-5170
Set max map/reduce tasks on a per-job basis, either per-node or cluster-wide

Once the tasktracker starts, the maximum number of tasks
per node cannot be changed. In my case, I've solved this challenge by stopping and starting mapred (stop-mapred.sh, start-mapred.sh) between jobs.
There is a JIRA for this, so it may change in the future: HADOOP-5170 (http://issues.apache.org/jira/browse/HADOOP-5170)
(This may already have been fixed.)

10.Oozie, Hadoop Workflow System
https://issues.apache.org/jira/browse/HADOOP-5303

11.http://kdpeterson.net/blog/2009/11/hadoop-workflow-tools-survey.html
Hadoop Workflow Tools Survey
Very clear about job scheduling.
A video: http://developer.yahoo.net/blogs/theater/archives/2009/08/hadoop_summit_workflow_oozie.html

12.http://wiki.dspace.org/index.php/Creating_and_Applying_Patches_in_Eclipse
Creating and Applying Patches in Eclipse
http://www.ibm.com/developerworks/cn/opensource/os-eclipse-galileopatch/

13.JobControl:http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/mapred/jobcontrol/JobControl.html

14. http://kdpeterson.net/blog/2009/11/hadoop-workflow-tools-survey.html
By default, Hadoop uses a FIFO scheduler, but there are two more advanced schedulers which are widely used. The Capacity Scheduler is focused on guaranteeing that various users of a cluster will have access to their guaranteed number of slots while making it [...], and the Fair Scheduler is focused on providing good latency for small jobs while long-running large jobs share the same cluster. These schedulers closely parallel processor scheduling, with Hadoop jobs corresponding to processes and the map and reduce tasks corresponding to time slices.

2010-01-13 14:31