Hadoop standalone and pseudo-distributed operation, as seen from the official documentation

xiaoluobo6666


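As is well known, Hadoop can run on a single machine in two modes: plain standalone mode, and pseudo-distributed mode, in which several daemon processes on one machine simulate a cluster. I spent the last couple of days setting up Hadoop, so here are my thoughts on how the two modes run and how they differ.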

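Starting from the configuration in the official documentation
1. Standalone mode
The documentation describes standalone operation as follows: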
[
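By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. This is useful for debugging.

The following example copies the unpacked conf directory to use as input and then finds and displays every match of the given regular expression. Output is written to the given output directory.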
$ mkdir input
$ cp conf/*.xml input
$ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
$ cat output/*

]
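As you can see, standalone mode first creates an input directory ($ mkdir input), then copies the raw data (some XML files that ship with Hadoop, namely the files under conf) into it ($ cp conf/*.xml input).
Next it runs the examples jar that ships with Hadoop, hadoop-*-examples.jar. That name matches Hadoop 0.19.2, the version the Chinese documentation covers; the jar I used was hadoop-examples-0.20.203.0.jar, so the name varies by version. The command ($ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+') passes four arguments after the jar: grep, input, output, and 'dfs[a-z.]+'. In order, they are the name of the example program to run from the jar, the input directory (created in the first step), the output directory (created by the job itself; it must not already exist), and the regular expression to match.
Finally, the results are displayed ($ cat output/*). If the job succeeds, the result files appear under the output directory.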
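
To make the grep step concrete, here is a minimal plain-Java sketch of roughly what the example computes: every substring that matches the regular expression is extracted and counted. It does not use Hadoop at all, and GrepSketch and countMatches are hypothetical names chosen for illustration, so treat it as a picture of the idea rather than the actual implementation.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: GrepSketch is a hypothetical class, not part of Hadoop.
class GrepSketch {
    // Extract every match of the regex from every line and count how often each occurs.
    static Map<String, Integer> countMatches(List<String> lines, String regex) {
        Pattern pattern = Pattern.compile(regex);
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher matcher = pattern.matcher(line);
            while (matcher.find()) {
                counts.merge(matcher.group(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample lines standing in for the contents of conf/*.xml.
        List<String> lines = List.of(
            "<name>dfs.replication</name>",
            "<name>fs.default.name</name>",
            "<description>see dfs.replication</description>");
        System.out.println(countMatches(lines, "dfs[a-z.]+"));
    }
}
```

The real example additionally sorts the matches by count before writing them to the output directory; that step is left out here to keep the sketch short.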

So how does the standalone version actually run? Let's look at the source behind hadoop-examples-0.20.203.0.jar.
Open src/examples/org/apache/hadoop/examples/Grep.java (the class being invoked), and you will find:
[
public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new Grep(), args);
    System.exit(res);
}
]
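At run time it simply calls ToolRunner.run, passing in a Configuration object, a Grep instance, and the command-line arguments (args). That is not very transparent, so let's look at it from another angle: WordCount.java in the same directory.

Its main function is as follows: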
[
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {...}

    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    ...
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
]

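(Some content omitted.)
As you can see, WordCount first creates a Configuration object (nothing has been configured yet; we will come back to it when analyzing pseudo-distributed mode), then extracts the remaining arguments (otherArgs, the input and output directories). It then creates a Job, sets a Mapper and a Reducer on it (the two halves of the famous MapReduce model), and finally runs the Job. The Mapper and Reducer set on the Job are both static nested classes of WordCount, so running the job amounts to calling their map and reduce methods to process the data. Because no site configuration has been written at this point, the defaults apply: the file system is the local one and the job runs in-process, which is why standalone mode is a single Java process.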
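
The map and reduce calls described above can be pictured with a small plain-Java sketch. It assumes nothing from Hadoop: WordCountSketch and its methods are hypothetical stand-ins for the Mapper, the Reducer, and the framework's grouping step, written only to show the data flow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Illustration only: a single-process imitation of the word count data flow.
class WordCountSketch {
    // "Map" step: split one line into words and emit a (word, 1) pair for each.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            pairs.add(Map.entry(tokens.nextToken(), 1));
        }
        return pairs;
    }

    // "Reduce" step: sum all the values that were emitted for one word.
    static int reduce(String word, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Driver: map every line, group the pairs by word, then reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("hello hadoop", "hello world")));
    }
}
```

In Hadoop the grouping in run is done by the framework between the two phases; user code supplies only the map and reduce parts.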

2. Pseudo-distributed mode

Again, start with the official documentation, which defines pseudo-distributed mode as follows:
【
Hadoop can also be run on a single node in a pseudo-distributed mode, where each Hadoop daemon runs in a separate Java process.
】
The first step toward pseudo-distributed mode is configuring the XML files:
【

Configuration

Use the following:

conf/core-site.xml:

<configuration>
     <property>
         <name>fs.default.name</name>
         <value>hdfs://localhost:9000</value>
     </property>
</configuration>

conf/hdfs-site.xml:

<configuration>
     <property>
         <name>dfs.replication</name>
         <value>1</value>
     </property>
</configuration>

conf/mapred-site.xml:

<configuration>
     <property>
         <name>mapred.job.tracker</name>
         <value>localhost:9001</value>
     </property>
</configuration>

】

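What are these settings for, and what does each parameter do (so many localhost entries...)? To be honest, I don't fully understand them yet, so what follows is my own reading.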
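
For what it is worth, the values themselves are easy to take apart. fs.default.name is a URI naming the default file system, here the NameNode on localhost port 9000; mapred.job.tracker is the host and port of the JobTracker (its default value, local, means jobs run in-process); dfs.replication is the number of copies kept of each block, set to 1 because there is only one DataNode. The plain-Java sketch below only illustrates how such values can be read. EndpointSketch and its methods are hypothetical names, not Hadoop APIs.

```java
import java.net.URI;

// Illustration only: interprets the two address-like settings with the JDK alone.
class EndpointSketch {
    // Describe where file system calls go for a given fs.default.name value.
    static String describeFileSystem(String fsDefaultName) {
        URI uri = URI.create(fsDefaultName);
        if ("hdfs".equals(uri.getScheme())) {
            return "HDFS NameNode at " + uri.getHost() + ":" + uri.getPort();
        }
        if ("file".equals(uri.getScheme())) {
            return "local file system";
        }
        return "other file system: " + uri.getScheme();
    }

    // Describe where jobs are submitted for a given mapred.job.tracker value.
    static String describeJobTracker(String value) {
        if ("local".equals(value)) {
            return "in-process local runner";
        }
        int colon = value.lastIndexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("expected host:port, got " + value);
        }
        String host = value.substring(0, colon);
        int port = Integer.parseInt(value.substring(colon + 1));
        return "JobTracker at " + host + ":" + port;
    }

    public static void main(String[] args) {
        // Defaults (standalone mode) versus the values from the XML above.
        System.out.println(describeFileSystem("file:///"));
        System.out.println(describeJobTracker("local"));
        System.out.println(describeFileSystem("hdfs://localhost:9000"));
        System.out.println(describeJobTracker("localhost:9001"));
    }
}
```

Seen this way, the three files mostly answer one question: which processes, on which host and port, should a client talk to.
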
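Next, set up ssh so that you can log in to the local machine.
Then run start-all.sh to start Hadoop.
Let's look at what start-all.sh contains.
It includes these lines: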
【
. "$bin"/hadoop-config.sh

# start dfs daemons
"$bin"/start-dfs.sh --config $HADOOP_CONF_DIR

# start mapred daemons
"$bin"/start-mapred.sh --config $HADOOP_CONF_DIR
】
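As you can see, it sources hadoop-config.sh and then runs start-dfs.sh and start-mapred.sh.
In short, start-all.sh starts the NameNode, DataNode, JobTracker, TaskTracker, and other daemon processes.
So how do these daemons interact with our WordCount class?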

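Look at WordCount again... we seem to have skipped something at the start. Right, the Configuration class.
It contains the following static block: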

【
static{
    //print deprecation warning if hadoop-site.xml is found in classpath
    ClassLoader cL = Thread.currentThread().getContextClassLoader();
    if (cL == null) {
      cL = Configuration.class.getClassLoader();
    }
    if(cL.getResource("hadoop-site.xml")!=null) {
      LOG.warn("DEPRECATED: hadoop-site.xml found in the classpath. " +
          "Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, "
          + "mapred-site.xml and hdfs-site.xml to override properties of " +
          "core-default.xml, mapred-default.xml and hdfs-default.xml " +
          "respectively");
    }
    addDefaultResource("core-default.xml");
    addDefaultResource("core-site.xml");
}
】
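As you can see, the first thing the Configuration class does is check whether hadoop-site.xml is on the classpath, but only to print a deprecation warning if it is found. It then registers core-default.xml and core-site.xml as default resources. The warning also names mapred-site.xml and hdfs-site.xml, yet this block does not load them (I could not find where that happens when I first looked); in the 0.20 line they appear to be registered through the same addDefaultResource call by the MapReduce and HDFS classes themselves, for example in a static block of JobConf. Together these are exactly the three XML files we configured for pseudo-distributed mode. With the Configuration in hand, WordCount contacts the corresponding daemons when it runs the job; in pseudo-distributed mode those are the processes configured on localhost. And if the addresses were changed to the IPs of real servers, wouldn't that be true distributed mode?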
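
The way a site file overrides a default file can be sketched in a few lines of plain Java. This is only a model of the idea, assuming resources are consulted in the order they were added and the last one to define a property wins; LayeredConfigSketch is a hypothetical class, and the real Configuration class has more features (XML parsing, final properties, variable expansion) that are ignored here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustration only: a toy model of default resources overridden by site resources.
class LayeredConfigSketch {
    private final List<Map<String, String>> resources = new ArrayList<>();

    // Register a resource; later resources take precedence over earlier ones.
    void addResource(Map<String, String> resource) {
        resources.add(resource);
    }

    // Return the value from the last resource that defines the property, or null.
    String get(String name) {
        String value = null;
        for (Map<String, String> resource : resources) {
            if (resource.containsKey(name)) {
                value = resource.get(name);
            }
        }
        return value;
    }

    public static void main(String[] args) {
        LayeredConfigSketch conf = new LayeredConfigSketch();
        // Stand-in for the shipped defaults.
        conf.addResource(Map.of("fs.default.name", "file:///", "dfs.replication", "3"));
        // Stand-in for the site files edited for pseudo-distributed mode.
        conf.addResource(Map.of("fs.default.name", "hdfs://localhost:9000",
                                "dfs.replication", "1"));
        System.out.println(conf.get("fs.default.name"));
    }
}
```

With an empty site layer the defaults show through, which corresponds to standalone mode; filling in the site layer is what switches the same program over to the daemons.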

Over.
