搭建hadoop环境,执行wordcount

 

1. Machine selection: with no extra resources available, I used the one server I have on hand and set up a pseudo-distributed deployment.

2. Hadoop version: Hadoop comes in a 1.x line and a 2.x line, and the differences between them are significant; installation and configuration are not the same, so decide up front which one you need (see http://younglibin.iteye.com/blog/1921385 for a comparison). This post uses the older 1.2.1 release.

Since I only have a single server, I went with a pseudo-distributed deployment (reference: http://hadoop.apache.org/docs/r1.2.1/single_node_setup.html).

3. Start the setup. Server: 172.16.236.11

         Create a user libin with password password

         Copy the Hadoop package downloaded locally up to the server: scp hadoop-1.2.1.tar.gz libin@172.16.236.11:~/

4. ssh libin@172.16.236.11, then unpack the tarball:

tar -zxvf hadoop-1.2.1.tar.gz 

 

Next comes the Hadoop configuration.

5. vi conf/core-site.xml: define the URI and port of the Hadoop master (the NameNode).

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

 

6. vi conf/hdfs-site.xml: set the number of replicas kept for each data block.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>

 

7. vi conf/mapred-site.xml: set the host and port on which the JobTracker runs.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>

 

8. Set up passwordless SSH:

This is required: Hadoop's scripts use SSH to reach the nodes (even on a single machine) when starting daemons and shipping files around, and without passwordless login you will be prompted for a password over and over.

Now check that you can ssh to the localhost without a passphrase:

$ ssh localhost

 

If you cannot ssh to localhost without a passphrase, execute the following commands:

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
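
If login still prompts for a password after this, the permissions on ~/.ssh are a frequent culprit; on most systems something like the following fixes it:

$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys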

 

9. Format the NameNode

bin/hadoop namenode -format

When you see:

13/08/14 13:48:12 INFO common.Storage: Storage directory /tmp/hadoop-libin/dfs/name has been successfully formatted.

 

the NameNode has been formatted successfully.

 

10. Start the Hadoop cluster

libin@d03:~/hadoop-1.2.1$ bin/start-all.sh
starting namenode, logging to /home/libin/hadoop-1.2.1/libexec/../logs/hadoop-libin-namenode-d03.out
localhost: starting datanode, logging to /home/libin/hadoop-1.2.1/libexec/../logs/hadoop-libin-datanode-d03.out
localhost: Error: JAVA_HOME is not set.

The error says JAVA_HOME is not set; configure it in conf/hadoop-env.sh:

export JAVA_HOME=/home/libin/jdk1.6.0_31

After configuring it, start Hadoop again. If jps shows the following processes, Hadoop started successfully:

libin@d03:~/hadoop-1.2.1$ jps
22002 TaskTracker
22119 Jps
21706 DataNode
21841 SecondaryNameNode
20710 NameNode
20967 JobTracker

 

11. Use the hadoop command to look at files in HDFS

We typically create an input directory and an output directory: the files a job reads go under input, and its results are written under output. (As it turns out below, pre-creating output is actually a mistake, because Hadoop insists on creating the job's output directory itself.)

libin@d03:~/hadoop-1.2.1$ hadoop fs -ls
Warning: $HADOOP_HOME is deprecated.

ls: Cannot access .: No such file or directory.


libin@d03:~/hadoop-1.2.1$ hadoop fs -mkdir input
Warning: $HADOOP_HOME is deprecated.

libin@d03:~/hadoop-1.2.1$ hadoop fs -mkdir output
Warning: $HADOOP_HOME is deprecated.

libin@d03:~/hadoop-1.2.1$ hadoop fs -ls
Warning: $HADOOP_HOME is deprecated.

Found 2 items
drwxr-xr-x - libin supergroup 0 2013-08-14 13:57 /user/libin/input
drwxr-xr-x - libin supergroup 0 2013-08-14 13:57 /user/libin/output

 

 

12. Create a file locally and upload it to the HDFS input directory; a sketch of creating the local file follows.
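
The original contents of input-local/libin are not shown in the post; judging from the word counts in the final output it was just a few short words. Purely as an illustration (the contents below are made up), such a file could be created like this:

$ mkdir -p input-local
$ cat > input-local/libin <<'EOF'
a a a a
libin is li
c d tmp?
EOF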

libin@d03:~/hadoop-1.2.1$ hadoop fs -put input-local/libin input/libin
Warning: $HADOOP_HOME is deprecated.

libin@d03:~/hadoop-1.2.1$ hadoop fs -ls
Warning: $HADOOP_HOME is deprecated.

Found 2 items
drwxr-xr-x - libin supergroup 0 2013-08-14 14:00 /user/libin/input
drwxr-xr-x - libin supergroup 0 2013-08-14 13:57 /user/libin/output
libin@d03:~/hadoop-1.2.1$ hadoop fs -ls /user/libin/input
Warning: $HADOOP_HOME is deprecated.

Found 1 items
-rw-r--r-- 1 libin supergroup 29 2013-08-14 14:00 /user/libin/input/libin
libin@d03:~/hadoop-1.2.1$ hadoop fs -ls input
Warning: $HADOOP_HOME is deprecated.

Found 1 items
-rw-r--r-- 1 libin supergroup 29 2013-08-14 14:00 /user/libin/input/libin
libin@d03:~/hadoop-1.2.1$

 

13. Run one of the examples that ship with Hadoop

Running the examples jar with no arguments lists the example programs it contains:

libin@d03:~/hadoop-1.2.1$ hadoop jar hadoop-examples-1.2.1.jar
Warning: $HADOOP_HOME is deprecated.

An example program must be given as the first argument.
Valid program names are:
aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
dbcount: An example job that count the pageview counts from a database.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using monte-carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sleep: A job that sleeps at each map and reduce task.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort
terasort: Run the terasort
teravalidate: Checking results of terasort
wordcount: A map/reduce program that counts the words in the input files.
libin@d03:~/hadoop-1.2.1$

14. Run the classic wordcount, the "hello world" of Hadoop

libin@d03:~/hadoop-1.2.1$ hadoop jar hadoop-examples-1.2.1.jar wordcount
Warning: $HADOOP_HOME is deprecated.

Usage: wordcount <in> <out>
libin@d03:~/hadoop-1.2.1$ hadoop jar hadoop-examples-1.2.1.jar wordcount input/libin output
Warning: $HADOOP_HOME is deprecated.

13/08/14 14:02:14 INFO mapred.JobClient: Cleaning up the staging area hdfs://localhost:9000/tmp/hadoop-libin/mapred/staging/libin/.staging/job_201308141349_0001
13/08/14 14:02:14 ERROR security.UserGroupInformation: PriviledgedActionException as:libin cause:org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory output already exists
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory output already exists
at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:137)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:973)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:550)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:580)
at org.apache.hadoop.examples.WordCount.main(WordCount.java:82)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:64)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:160)

Run with no arguments, it reports that two parameters are required: <in> and <out>.

 

Run again with both arguments, it complains that "Output directory output already exists". Hadoop creates the job's output directory itself and refuses to run if it already exists: re-running a job into an existing directory would let later results overwrite earlier ones, so the job simply will not start. Either point the job at a fresh output path (as in the next run, which uses output/wordcount) or delete the existing directory first, as shown below.
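
Deleting the old output directory is one command with the Hadoop 1.x file system shell:

$ hadoop fs -rmr output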

 

libin@d03:~/hadoop-1.2.1$ hadoop jar hadoop-examples-1.2.1.jar wordcount input/libin output/wordcount
Warning: $HADOOP_HOME is deprecated.

13/08/14 14:02:27 INFO input.FileInputFormat: Total input paths to process : 1
13/08/14 14:02:27 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/08/14 14:02:27 WARN snappy.LoadSnappy: Snappy native library not loaded
13/08/14 14:02:27 INFO mapred.JobClient: Running job: job_201308141349_0002
13/08/14 14:02:28 INFO mapred.JobClient: map 0% reduce 0%
13/08/14 14:02:32 INFO mapred.JobClient: map 100% reduce 0%
13/08/14 14:02:40 INFO mapred.JobClient: map 100% reduce 100%
13/08/14 14:02:40 INFO mapred.JobClient: Job complete: job_201308141349_0002
13/08/14 14:02:40 INFO mapred.JobClient: Counters: 29
13/08/14 14:02:40 INFO mapred.JobClient: Job Counters
13/08/14 14:02:40 INFO mapred.JobClient: Launched reduce tasks=1
13/08/14 14:02:40 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=3336
13/08/14 14:02:40 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/08/14 14:02:40 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/08/14 14:02:40 INFO mapred.JobClient: Launched map tasks=1
13/08/14 14:02:40 INFO mapred.JobClient: Data-local map tasks=1
13/08/14 14:02:40 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=8179
13/08/14 14:02:40 INFO mapred.JobClient: File Output Format Counters
13/08/14 14:02:40 INFO mapred.JobClient: Bytes Written=37
13/08/14 14:02:40 INFO mapred.JobClient: FileSystemCounters
13/08/14 14:02:40 INFO mapred.JobClient: FILE_BYTES_READ=71
13/08/14 14:02:40 INFO mapred.JobClient: HDFS_BYTES_READ=138
13/08/14 14:02:40 INFO mapred.JobClient: FILE_BYTES_WRITTEN=110523
13/08/14 14:02:40 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=37
13/08/14 14:02:40 INFO mapred.JobClient: File Input Format Counters
13/08/14 14:02:40 INFO mapred.JobClient: Bytes Read=29
13/08/14 14:02:40 INFO mapred.JobClient: Map-Reduce Framework
13/08/14 14:02:40 INFO mapred.JobClient: Map output materialized bytes=71
13/08/14 14:02:40 INFO mapred.JobClient: Map input records=8
13/08/14 14:02:40 INFO mapred.JobClient: Reduce shuffle bytes=71
13/08/14 14:02:40 INFO mapred.JobClient: Spilled Records=14
13/08/14 14:02:40 INFO mapred.JobClient: Map output bytes=69
13/08/14 14:02:40 INFO mapred.JobClient: CPU time spent (ms)=1460
13/08/14 14:02:41 INFO mapred.JobClient: Total committed heap usage (bytes)=401997824
13/08/14 14:02:41 INFO mapred.JobClient: Combine input records=10
13/08/14 14:02:41 INFO mapred.JobClient: SPLIT_RAW_BYTES=109
13/08/14 14:02:41 INFO mapred.JobClient: Reduce input records=7
13/08/14 14:02:41 INFO mapred.JobClient: Reduce input groups=7
13/08/14 14:02:41 INFO mapred.JobClient: Combine output records=7
13/08/14 14:02:41 INFO mapred.JobClient: Physical memory (bytes) snapshot=311259136
13/08/14 14:02:41 INFO mapred.JobClient: Reduce output records=7
13/08/14 14:02:41 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1118924800
13/08/14 14:02:41 INFO mapred.JobClient: Map output records=10

 

15. Check the results

libin@d03:~/hadoop-1.2.1$ hadoop fs -cat output/wordcount/part-r-00000
Warning: $HADOOP_HOME is deprecated.

a 4
c 1
d 1
is 1
li 1
libin 1
tmp? 1
libin@d03:~/hadoop-1.2.1$

 

 

 

Done! The next step is to develop new MapReduce programs on top of this setup; a sketch of that workflow follows.
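
As a rough sketch of that workflow (the class name MyWordCount.java and the jar name are hypothetical), a job written against the Hadoop 1.2.1 API can be compiled against hadoop-core-1.2.1.jar from inside the hadoop-1.2.1 directory and then submitted with the same hadoop jar mechanism used above:

$ mkdir classes
$ javac -classpath hadoop-core-1.2.1.jar -d classes MyWordCount.java
$ jar cf mywordcount.jar -C classes .
$ hadoop jar mywordcount.jar MyWordCount input/libin output/my-wordcount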

 

The IP addresses configured here would ideally be hostnames, but using hostnames also drags in DNS reverse resolution, so I did not bother for this single-node setup. If you are building a real cluster, do set up proper DNS, including reverse resolution.
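
For a small cluster without a DNS server, a common workaround is to keep identical /etc/hosts entries on every node (the hostnames below are hypothetical); the hosts file then answers both forward and reverse lookups for those names:

172.16.236.11   hadoop-master
172.16.236.12   hadoop-slave1
172.16.236.13   hadoop-slave2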

 
