
Hadoop knowledge notes

Hadoop technical forum: http://bbs.hadoopor.com/index.php
Hadoop 0.20.0 + Eclipse environment setup: http://bbs.hadoopor.com/thread-43-1-1.html
Another very good Hadoop 0.20.0 + Eclipse setup guide, written by someone in Taiwan: http://trac.nchc.org.tw/cloud/wiki/waue/2009/0617 It explains how to package the code into a jar file.
Note the Makefile in that guide: command lines such as "jar -cvf ${JarFile} -C bin/ ." and "hadoop jar ${JarFile} ${MainFunc} input output" must start with a tab, not spaces. I commented out the lines under the help target by prefixing them with "#", because I do not know how to use them yet.


1. How a client uploads data to a Hadoop cluster: http://bbs.hadoopor.com/thread-362-1-1.html
Use the DFSClient tool: the client machine does not need a full Hadoop deployment to upload data; having the DFSClient tool available is enough.
bin/hadoop fs -put accesses HDFS in exactly this "remote" DFSClient fashion (it works locally as well).
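As a rough Java sketch of the same idea (my own example, not from the thread), a client can use the FileSystem API to push a file into HDFS; the namenode address hdfs://namenode:9000 and both paths below are placeholders:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPut {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Only the namenode address is needed; the client box does not need a full Hadoop install.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
    // Roughly what "bin/hadoop fs -put /local/data.txt /user/hadoop/data.txt" does.
    fs.copyFromLocalFile(new Path("/local/data.txt"), new Path("/user/hadoop/data.txt"));
    fs.close();
  }
}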

2. Hadoop support for MySQL: http://bbs.hadoopor.com/thread-132-1-2.html
lance (274105045) 09:48:43
It seems 0.20 provides input and output formats for databases.
hadoopor (784027584) 09:48:50
But to run jobs in parallel you cannot use the default scheduler; Facebook's FairScheduler supports scheduling jobs in parallel. ?????
Spork (47986766) 09:49:16
Not just "seems" -- it is really there; for now the databases with good support are the open-source ones, such as MySQL.
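A minimal sketch of what that 0.20 DB support looks like with the old mapred API (org.apache.hadoop.mapred.lib.db); the connection string, table name and field names are made up for illustration, and the MySQL JDBC driver jar must be on the classpath:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;
import org.apache.hadoop.mapred.lib.db.DBOutputFormat;

public class MysqlOutputSetup {
  // Configure an existing job to write its reduce output into a MySQL table.
  public static void configure(JobConf conf) {
    DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
        "jdbc:mysql://dbhost:3306/stats", "user", "password");
    conf.setOutputFormat(DBOutputFormat.class);
    // The job's output key class must implement DBWritable.
    DBOutputFormat.setOutput(conf, "word_counts", "word", "count");
  }
}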

3. An introduction to SequenceFile: http://bbs.hadoopor.com/thread-144-1-1.html
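As a quick illustration (mine, not from the thread): a SequenceFile stores binary key/value pairs; the sketch below writes a few Text/IntWritable records to an arbitrary example path and reads them back:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SeqFileDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/tmp/demo.seq");  // example path
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, path, Text.class, IntWritable.class);
    for (int i = 0; i < 3; i++) {
      writer.append(new Text("key-" + i), new IntWritable(i));
    }
    writer.close();
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    Text key = new Text();
    IntWritable value = new IntWritable();
    while (reader.next(key, value)) {       // records come back in the order written
      System.out.println(key + " = " + value);
    }
    reader.close();
  }
}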

4. JobTracker.JobInProgress: http://bbs.hadoopor.com/thread-212-1-1.html Used to monitor the scheduling of a single Job. A Job is broken into N Tasks, which are assigned to TaskTracker nodes in the cluster; the TaskTracker nodes then execute those Tasks.

========== Found by searching Nabble Hadoop ==========
1. Hadoop 0.17 schedules jobs FIFO; if it doesn't, that is a bug.
http://old.nabble.com/Hadoop-job-scheduling-issue-td19659938.html#a19659938

2. Can jobs be configured to be sequential? That is, jobs in Group1 execute first and jobs in Group2 execute later; the Group2 jobs depend on the Group1 jobs, while the jobs within Group1 or Group2 are independent of each other.
http://old.nabble.com/Can-jobs-be-configured-to-be-sequential-td20043257.html#a20043257
I recommend that you look at http://cascading.org as
an abstraction layer for managing these kinds of workflows. We've
found it quite useful.
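A sketch of the Group1/Group2 dependency using Hadoop's own JobControl (org.apache.hadoop.mapred.jobcontrol) rather than Cascading; the JobConf arguments jobA1, jobA2 and jobB1 stand in for real job configurations:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class GroupedJobs {
  public static void run(JobConf jobA1, JobConf jobA2, JobConf jobB1) throws Exception {
    Job a1 = new Job(jobA1);             // Group1 jobs are independent of each other
    Job a2 = new Job(jobA2);
    Job b1 = new Job(jobB1);             // Group2 job must wait for all of Group1
    b1.addDependingJob(a1);
    b1.addDependingJob(a2);

    JobControl control = new JobControl("group1-then-group2");
    control.addJob(a1);
    control.addJob(a2);
    control.addJob(b1);

    Thread runner = new Thread(control); // JobControl is a Runnable that submits jobs as they become ready
    runner.start();
    while (!control.allFinished()) {
      Thread.sleep(5000);
    }
    control.stop();
  }
}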

3. Sequence of streaming jobs: if you are using sh or bash, the variable $? holds the exit status of the last command executed.

hadoop jar streaming.jar ...
if [ $? -ne 0 ]; then
    # report the failure on stderr and stop
    echo "My job failed" 1>&2
    exit 1
fi

Caution: $? holds the exit status of the very last command executed. It is easy to run another command before the test and end up testing the wrong command's exit status.
http://old.nabble.com/Sequence-of-Streaming-Jobs-td23336043.html#a23351848

4.mapred.tasktracker.map.tasks.maximum
mapred.tasktracker.reduce.tasks.maximum
mapred.map.multithreadedrunner.threads
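For reference (my reading, not from a thread): the first two properties are per-node slot limits that a TaskTracker reads from mapred-site.xml, while the third is a per-job knob for the multithreaded map runner. A hedged sketch of the per-job part with the old API, using an example thread count:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.MultithreadedMapRunner;

public class MultithreadedMapSetup {
  public static void configure(JobConf conf) {
    // Run each map task with a pool of threads calling the Mapper.
    conf.setMapRunnerClass(MultithreadedMapRunner.class);
    conf.setInt("mapred.map.multithreadedrunner.threads", 10);  // example value
    // mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum
    // are cluster-side settings and cannot be changed per job here.
  }
}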

5.http://old.nabble.com/Linking-2-MarReduce-jobs-together--td18756178.html#a18756178
Is it possible to use the output of job 1's reduce phase as the input to job 2?
Well, your data has to live somewhere between the two jobs... so I'd say yes: put it in HBase or HDFS and reuse it from there.
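A sketch of the HDFS route with the old API (my example): job 1 writes to an intermediate directory that job 2 then reads; the paths and the two JobConf objects are placeholders:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class TwoStageRunner {
  public static void run(JobConf job1, JobConf job2) throws Exception {
    Path intermediate = new Path("/tmp/stage1-output");   // example intermediate directory
    FileOutputFormat.setOutputPath(job1, intermediate);
    JobClient.runJob(job1);                               // blocks; throws an exception if job 1 fails

    FileInputFormat.setInputPaths(job2, intermediate);    // job 1's reduce output feeds job 2
    FileOutputFormat.setOutputPath(job2, new Path("/tmp/stage2-output"));
    JobClient.runJob(job2);
  }
}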

6. The book <Pro Hadoop>, chapter 8, covers this topic.

7.http://old.nabble.com/Customizing-machines-to-use-for-different-jobs-td23864519.html#a23864519
Customizing machines to use for different jobs:
Unfortunately there is no built-in way of doing this.  You'd have to
instantiate two entirely separate Hadoop clusters to accomplish what you're
trying to do, which isn't an uncommon thing to do.

I'm not sure why you're hoping to have this behavior, but the fair share
scheduler might be helpful to you.  It lets you essentially divvy up your
cluster into queues, where each queue has its own "chunk" of the cluster.
When resources are available outside of the "chunk," then jobs can span into
other queues' space.

Cloudera's Distribution for Hadoop (<http://www.cloudera.com/hadoop>)
includes the fair share scheduler.  I recommend using our distribution,
otherwise here is the fair share JIRA:

<http://issues.apache.org/jira/browse/HADOOP-3746>
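With the fair scheduler installed, jobs are usually steered into pools through a job property. The sketch below assumes the cluster sets mapred.fairscheduler.poolnameproperty to pool.name, which is a common convention but by no means universal, and the pool name "research" is made up:

import org.apache.hadoop.mapred.JobConf;

public class PoolAssignment {
  public static void usePool(JobConf conf) {
    // Assumes mapred-site.xml maps mapred.fairscheduler.poolnameproperty to "pool.name".
    conf.set("pool.name", "research");  // hypothetical pool for illustration
  }
}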

8.http://old.nabble.com/How-to-run-many-jobs-at-the-same-time--td23151917.html#a23151917
How to run many jobs at the same time? Answer: a JobControl example.
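Besides JobControl (sketched under item 2 above), another pattern I find plausible here is to submit jobs asynchronously with JobClient.submitJob and then poll them; whether they actually run at the same time still depends on the scheduler and free slots. A rough sketch:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class ParallelSubmit {
  public static void runAll(List<JobConf> confs) throws Exception {
    List<RunningJob> running = new ArrayList<RunningJob>();
    for (JobConf conf : confs) {
      // submitJob returns immediately, so every job is queued before any finishes.
      running.add(new JobClient(conf).submitJob(conf));
    }
    for (RunningJob job : running) {
      job.waitForCompletion();           // block until this job is done
      if (!job.isSuccessful()) {
        throw new RuntimeException("Job failed: " + job.getJobName());
      }
    }
  }
}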

9.http://issues.apache.org/jira/browse/HADOOP-5170
Set max map/reduce tasks on a per-job basis, either per-node or cluster-wide

Once the tasktracker starts, the maximum number of tasks per node cannot be changed. In my case, I've solved this by stopping and starting mapred (stop-mapred.sh, start-mapred.sh) between jobs.
There is a JIRA, so this may change in the future: HADOOP-5170 (http://issues.apache.org/jira/browse/HADOOP-5170)
It may already have been fixed.

10.Oozie, Hadoop Workflow System
https://issues.apache.org/jira/browse/HADOOP-5303

11.http://kdpeterson.net/blog/2009/11/hadoop-workflow-tools-survey.html
Hadoop Workflow Tools Survey
Very clear on job scheduling.
A video: http://developer.yahoo.net/blogs/theater/archives/2009/08/hadoop_summit_workflow_oozie.html

12.http://wiki.dspace.org/index.php/Creating_and_Applying_Patches_in_Eclipse
Creating and Applying Patches in Eclipse
http://www.ibm.com/developerworks/cn/opensource/os-eclipse-galileopatch/

13.JobControl:http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/mapred/jobcontrol/JobControl.html

14. http://kdpeterson.net/blog/2009/11/hadoop-workflow-tools-survey.html
By default, Hadoop uses a FIFO scheduler, but there are two more advanced schedulers which are widely used. The Capacity Scheduler focuses on guaranteeing that each group of users gets access to its guaranteed number of slots while making unused capacity available to others; the Fair Scheduler divides the cluster into pools and aims to give every job a fair share of slots over time.