Hadoop technical forum: http://bbs.hadoopor.com/index.php
1. Hadoop 0.20.0 + Eclipse environment setup: http://bbs.hadoopor.com/thread-43-1-1.html
A very good write-up on the same setup by an author from Taiwan, which also shows how to package the code into a jar: http://trac.nchc.org.tw/cloud/wiki/waue/2009/0617
Note the Makefile in that tutorial: recipe lines such as "jar -cvf ${JarFile} -C bin/ ." and "hadoop jar ${JarFile} ${MainFunc} input output" must start with a tab, not spaces. I commented out the lines under the help target by prefixing them with "#", since I don't yet know what they are for.
2. How a client uploads data to a Hadoop cluster: http://bbs.hadoopor.com/thread-362-1-1.html
With the DFSClient tool, the client machine does not need a full Hadoop deployment to upload data; having DFSClient installed is enough.
"bin/hadoop fs -put" is exactly this kind of "remote" DFSClient access to HDFS (it works locally as well); a programmatic sketch follows below.
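A minimal sketch of the same upload done through the FileSystem API; the NameNode host/port and the paths are placeholders, and only the Hadoop client jars are needed on the uploading machine:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PutDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point the client at the remote NameNode (placeholder address);
        // no Hadoop daemons need to run on this machine.
        conf.set("fs.default.name", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);
        // Programmatic equivalent of "bin/hadoop fs -put local.txt /user/me/":
        fs.copyFromLocalFile(new Path("local.txt"), new Path("/user/me/local.txt"));
      }
    }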
3. Hadoop support for MySQL: http://bbs.hadoopor.com/thread-132-1-2.html (QQ chat excerpt)
lance (274105045) 09:48:43: I believe 0.20 ships DB input and output formats.
hadoopor (784027584) 09:48:50: But to run jobs in parallel you cannot use the default scheduler; the FairScheduler contributed by Facebook supports parallel job scheduling. (?????)
Spork (47986766) 09:49:16: Not "I believe"; it definitely does. It's just that the databases with good support so far are the open-source ones, e.g. MySQL. (See the sketch below.)
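A minimal sketch of the 0.20 DB input support (org.apache.hadoop.mapred.lib.db), assuming a hypothetical MySQL table "users" with columns id and name; the JDBC URL and credentials are placeholders:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.db.DBConfiguration;
    import org.apache.hadoop.mapred.lib.db.DBInputFormat;
    import org.apache.hadoop.mapred.lib.db.DBWritable;

    // One row of the hypothetical "users" table, as seen by DBInputFormat.
    public class UserRecord implements Writable, DBWritable {
      long id;
      String name;

      public void readFields(ResultSet rs) throws SQLException {    // from JDBC
        id = rs.getLong(1);
        name = rs.getString(2);
      }
      public void write(PreparedStatement ps) throws SQLException {  // to JDBC
        ps.setLong(1, id);
        ps.setString(2, name);
      }
      public void readFields(DataInput in) throws IOException {      // Hadoop wire format
        id = in.readLong();
        name = in.readUTF();
      }
      public void write(DataOutput out) throws IOException {
        out.writeLong(id);
        out.writeUTF(name);
      }

      public static void configure(JobConf job) {
        job.setInputFormat(DBInputFormat.class);
        DBConfiguration.configureDB(job, "com.mysql.jdbc.Driver",
            "jdbc:mysql://localhost/mydb", "user", "password");
        // Table, WHERE conditions, ORDER BY column, then the selected columns:
        DBInputFormat.setInput(job, UserRecord.class, "users", null, "id",
            "id", "name");
      }
    }

DBOutputFormat.setOutput(job, "users", "id", "name") is the symmetric call for writing job output back to a table.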
4. Introduction to SequenceFile: http://bbs.hadoopor.com/thread-144-1-1.html (a write/read sketch follows below)
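A self-contained write/read sketch against the 0.20 API; the path and record contents are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SeqFileDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("demo.seq");   // placeholder path

        // Write a few key/value records.
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, path, IntWritable.class, Text.class);
        for (int i = 0; i < 3; i++) {
          writer.append(new IntWritable(i), new Text("record-" + i));
        }
        writer.close();

        // Read them back in order.
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        IntWritable key = new IntWritable();
        Text value = new Text();
        while (reader.next(key, value)) {
          System.out.println(key + "\t" + value);
        }
        reader.close();
      }
    }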
5. JobTracker.JobInProgress: http://bbs.hadoopor.com/thread-212-1-1.html. It is used to monitor the scheduling of a single job. A job is decomposed into N tasks; these tasks are assigned to TaskTracker nodes in the cluster, and the TaskTracker nodes execute them.
========== Found via Nabble Hadoop searches ==========
1. Hadoop 0.17 schedules jobs FIFO; if it doesn't, that is a bug. http://old.nabble.com/Hadoop-job-scheduling-issue-td19659938.html#a19659938
2. Can jobs be configured to be sequential? That is, jobs in Group 1 execute first and jobs in Group 2 execute later; Group 2 jobs depend on Group 1 jobs, while the jobs within each group are independent.
http://old.nabble.com/Can-jobs-be-configured-to-be-sequential-td20043257.html#a20043257
Answer: I recommend that you look at http://cascading.org as an abstraction layer for managing these kinds of workflows. We've found it quite useful.
3. Sequence of streaming jobs: in sh or bash, the variable $? holds the exit status of the last command executed.

    hadoop jar streaming.jar ...
    status=$?                    # capture the exit status immediately
    if [ $status -ne 0 ]; then
      echo "My job failed" >&2   # report to stderr, not stdout
      exit 1
    fi

Caution: $? is the exit status of the very last command executed. It is easy to run another command before the test and end up testing the wrong command's exit status; capturing it into a variable right away, as above, avoids that.
http://old.nabble.com/Sequence-of-Streaming-Jobs-td23336043.html#a23351848
4. Relevant configuration properties (see the sketch below for where each one is set):
mapred.tasktracker.map.tasks.maximum
mapred.tasktracker.reduce.tasks.maximum
mapred.map.multithreadedrunner.threads
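A brief sketch of where each knob lives, as I understand it: the first two are read by each TaskTracker when the daemon starts, so they belong in mapred-site.xml on the worker nodes, while the threads property is per-job and pairs with MultithreadedMapRunner:

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.MultithreadedMapRunner;

    public class MultithreadedSetup {
      public static void configure(JobConf conf) {
        // mapred.tasktracker.map.tasks.maximum and
        // mapred.tasktracker.reduce.tasks.maximum are daemon-level settings;
        // setting them here would not affect a running TaskTracker.
        // The multithreaded map runner, by contrast, is configured per job:
        conf.setMapRunnerClass(MultithreadedMapRunner.class);
        conf.setInt("mapred.map.multithreadedrunner.threads", 10);
      }
    }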
5. Linking two MapReduce jobs together: http://old.nabble.com/Linking-2-MarReduce-jobs-together--td18756178.html#a18756178
Q: Is it possible to make the output of job 1's reduce phase the input of job 2?
A: Your data has to live somewhere between the two jobs, so yes: put it in HBase or HDFS and reuse it from there. (See the sketch below.)
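A minimal sketch of chaining via HDFS with the 0.20 mapred API; the mapper/reducer classes and all paths are placeholders:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class ChainedJobs {
      public static void main(String[] args) throws Exception {
        Path input = new Path("in");             // original input (placeholder)
        Path handoff = new Path("tmp/job1-out"); // intermediate directory on HDFS
        Path output = new Path("out");           // final output (placeholder)

        JobConf job1 = new JobConf(ChainedJobs.class);
        job1.setJobName("job1");
        // job1.setMapperClass(...); job1.setReducerClass(...);  // your classes
        FileInputFormat.setInputPaths(job1, input);
        FileOutputFormat.setOutputPath(job1, handoff);
        JobClient.runJob(job1);                  // blocks until job1 completes

        JobConf job2 = new JobConf(ChainedJobs.class);
        job2.setJobName("job2");
        FileInputFormat.setInputPaths(job2, handoff);  // job1's output feeds job2
        FileOutputFormat.setOutputPath(job2, output);
        JobClient.runJob(job2);
      }
    }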
6. Chapter 8 of the book "Pro Hadoop" covers this topic.
7. Customizing machines to use for different jobs: http://old.nabble.com/Customizing-machines-to-use-for-different-jobs-td23864519.html#a23864519
Unfortunately there is no built-in way of doing this. You'd have to instantiate two entirely separate Hadoop clusters to accomplish what you're trying to do, which isn't an uncommon thing to do.
I'm not sure why you want this behavior, but the fair share scheduler might be helpful. It lets you essentially divvy up your cluster into queues, where each queue has its own "chunk" of the cluster. When resources are available outside a "chunk," jobs can spill into other queues' space.
Cloudera's Distribution for Hadoop (http://www.cloudera.com/hadoop) includes the fair share scheduler. I recommend using our distribution; otherwise, here is the fair share JIRA: http://issues.apache.org/jira/browse/HADOOP-3746
8. How to run many jobs at the same time? http://old.nabble.com/How-to-run-many-jobs-at-the-same-time--td23151917.html#a23151917
The answer points to a JobControl example; a sketch follows below.
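A sketch of the JobControl pattern, assuming confA and confB are two fully configured JobConfs for independent jobs:

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.jobcontrol.Job;
    import org.apache.hadoop.mapred.jobcontrol.JobControl;

    public class ParallelJobs {
      public static void runAll(JobConf confA, JobConf confB) throws Exception {
        Job a = new Job(confA);
        Job b = new Job(confB);
        // b.addDependingJob(a);             // uncomment to force a before b

        JobControl control = new JobControl("my-job-group");
        control.addJob(a);
        control.addJob(b);

        Thread runner = new Thread(control); // JobControl implements Runnable
        runner.start();                      // submits jobs whose deps are met
        while (!control.allFinished()) {
          Thread.sleep(1000);                // poll until everything completes
        }
        control.stop();
      }
    }

JobControl submits each job as soon as all of the jobs it depends on have completed, so jobs with no dependencies between them run concurrently.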
9. Set max map/reduce tasks on a per-job basis, either per-node or cluster-wide: http://issues.apache.org/jira/browse/HADOOP-5170
Once a TaskTracker starts, the maximum number of tasks per node cannot be changed. In my case, I've worked around this by stopping and restarting MapReduce (stop-mapred.sh, start-mapred.sh) between jobs.
There is a JIRA, so this may change in the future: HADOOP-5170 (http://issues.apache.org/jira/browse/HADOOP-5170). It may already have been fixed.
10. Oozie, a Hadoop workflow system: https://issues.apache.org/jira/browse/HADOOP-5303
11. Hadoop Workflow Tools Survey: http://kdpeterson.net/blog/2009/11/hadoop-workflow-tools-survey.html
Very clear on job scheduling. A related talk video: http://developer.yahoo.net/blogs/theater/archives/2009/08/hadoop_summit_workflow_oozie.html
12. Creating and Applying Patches in Eclipse: http://wiki.dspace.org/index.php/Creating_and_Applying_Patches_in_Eclipse
A related article in Chinese: http://www.ibm.com/developerworks/cn/opensource/os-eclipse-galileopatch/
13. JobControl javadoc: http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/mapred/jobcontrol/JobControl.html
14. From the same survey (http://kdpeterson.net/blog/2009/11/hadoop-workflow-tools-survey.html):
By default, Hadoop uses a FIFO scheduler, but there are two more advanced schedulers which are widely used. The Capacity Scheduler focuses on guaranteeing that the various users of a cluster have access to their guaranteed numbers of slots, while the Fair Scheduler (contributed by Facebook) divides the cluster into pools and shares spare capacity fairly among running jobs.