Hadoop Installation
0. Install the Java environment (JDK).
1. Download Hadoop version 1.2.1.
2. Extract Hadoop into the target directory with tar zxvf (see the command sketch after this list).
3. Files to configure (six in total): hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml, masters, slaves.
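The download and extraction steps above, as a minimal command sketch. The archive.apache.org mirror and the /usr/local/hadoop install path (chosen to match the HADOOP_HOME used later) are assumptions; adjust them to your environment:
$ wget http://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz
$ mkdir -p /usr/local/hadoop
$ tar zxvf hadoop-1.2.1.tar.gz -C /usr/local/hadoop
$ ln -s /usr/local/hadoop/hadoop-1.2.1 /usr/local/hadoop/hadoop    # so the path matches HADOOP_HOME below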
4. File descriptions:
hadoop-env.sh: environment settings used when starting Hadoop
core-site.xml: core configuration (filesystem URI and temporary directory)
hdfs-site.xml: configuration for HDFS, the distributed file system
mapred-site.xml: configuration for MapReduce jobs
masters: lists the master machine(s) (the master runs the NameNode; the slaves run DataNodes)
slaves: lists the slave machines
5. hadoop-env.sh configuration in detail:
# Set Hadoop-specific environment variables here.
# The only required environment variable is JAVA_HOME. All others are
# optional. When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.
# The java implementation to use. Required. (Set this to your JDK install directory.)
export JAVA_HOME=/usr/local/jdk/jdk1.6.0_45
# Extra Java CLASSPATH elements. Optional. (Optional extra CLASSPATH entries.)
# export HADOOP_CLASSPATH=
# The maximum amount of heap to use, in MB. Default is 1000. (JVM heap size for the Hadoop daemons.)
# export HADOOP_HEAPSIZE=2000
# Extra Java runtime options. Empty by default. (JVM mode: use -server here. The client/server flag selects the HotSpot JVM mode, which tunes memory management and garbage collection; client mode is intended for low-memory machines.)
# export HADOOP_OPTS=-server
# Command specific options appended to HADOOP_OPTS when specified (per-daemon JMX management settings)
export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_NAMENODE_OPTS"
export HADOOP_SECONDARYNAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_SECONDARYNAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_DATANODE_OPTS"
export HADOOP_BALANCER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_BALANCER_OPTS"
export HADOOP_JOBTRACKER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_JOBTRACKER_OPTS"
# export HADOOP_TASKTRACKER_OPTS=
# The following applies to multiple commands (fs, dfs, fsck, distcp etc)
# export HADOOP_CLIENT_OPTS
# Extra ssh options. Empty by default.
# export HADOOP_SSH_OPTS="-o ConnectTimeout=1 -o SendEnv=HADOOP_CONF_DIR"
# Where log files are stored. $HADOOP_HOME/logs by default.
# export HADOOP_LOG_DIR=${HADOOP_HOME}/logs
# File naming remote slave hosts. $HADOOP_HOME/conf/slaves by default.
# export HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves
# host:path where hadoop code should be rsync'd from. Unset by default.
# export HADOOP_MASTER=master:/home/$USER/src/hadoop
# Seconds to sleep between slave commands. Unset by default. This
# can be useful in large clusters, where, e.g., slave rsyncs can
# otherwise arrive faster than the master can service them.
# export HADOOP_SLAVE_SLEEP=0.1
# The directory where pid files are stored. /tmp by default.
# NOTE: this should be set to a directory that can only be written to by
# the users that are going to run the hadoop daemons. Otherwise there is
# the potential for a symlink attack.
# export HADOOP_PID_DIR=/var/hadoop/pids
# A string representing this instance of hadoop. $USER by default.
# export HADOOP_IDENT_STRING=$USER
# The scheduling priority for daemon processes. See 'man nice'.
# export HADOOP_NICENESS=10
Because the settings above reference ${HADOOP_HOME}, it must be defined in /etc/profile:
export HADOOP_HOME=/usr/local/hadoop/hadoop    # the directory where Hadoop is installed
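A minimal sketch of the full /etc/profile additions, assuming the JDK and Hadoop paths used above; adding the bin directories to PATH is optional but convenient:
export JAVA_HOME=/usr/local/jdk/jdk1.6.0_45
export HADOOP_HOME=/usr/local/hadoop/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin
Run source /etc/profile (or log in again) for the changes to take effect.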
6. core-site.xml (core configuration):
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<!-- fs.default.name sets the NameNode URI and port. "hadoop1" is this machine's hostname and must be mapped in /etc/hosts; an IP address can be used instead. -->
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://hadoop1:9000</value>
</property>
<!-- base directory where the file system data is stored -->
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/hadoop/tmp</value>
</property>
</configuration>
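Creating the hadoop.tmp.dir directory ahead of time avoids permission surprises at startup; a one-line sketch using the path configured above (run as the user that will run the Hadoop daemons):
$ mkdir -p /usr/local/hadoop/hadoop/tmp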
7. Configure hdfs-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<!-- whether HDFS permission checking is enabled -->
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<!-- number of block replicas -->
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
8. Configure mapred-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<!-- JobTracker address and port; the TaskTrackers on the slave nodes connect to it -->
<property>
<name>mapred.job.tracker</name>
<value>hdfs://hadoop1:9001</value>
</property>
</configuration>
9. masters configuration:
List which machines are masters; hostnames or IP addresses both work.
For example:
192.168.100.101
192.168.100.102
10. slaves configuration:
List which machines are slaves, for example:
192.168.100.101
192.168.100.102
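As noted for core-site.xml, hostnames such as hadoop1 must resolve on every node. A sketch of the /etc/hosts entries; mapping 192.168.100.101 to hadoop1 and 192.168.100.102 to hadoop2 is an assumption based on the example addresses above:
192.168.100.101 hadoop1
192.168.100.102 hadoop2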
11. Configure passwordless SSH login
Passwordless SSH setup (correct permissions matter).
Log in as the hadoop user and create the .ssh directory: mkdir ~/.ssh
Now check whether you can ssh to the local machine without entering a password:
$ ssh namenode
If you cannot ssh to namenode without a password, run the following command:
$ ssh-keygen -t rsa -f ~/.ssh/id_rsa
Press Enter at the prompts; you may set a passphrase or leave it empty. For security this example sets the passphrase to "hadoop" and then relies on ssh-agent so that logins to the other machines in the cluster stay password-free.
The private key is written to the file given by the -f option, e.g. ~/.ssh/id_rsa. The public key goes to a file with the same name plus a '.pub' suffix, here ~/.ssh/id_rsa.pub.
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
This stores the SSH public key in ~/.ssh/authorized_keys on the namenode machine.
Then use scp to distribute authorized_keys and id_rsa.pub to the same directory on the other machines. For example, to copy the public key to the .ssh directory on datanode1:
$ scp ~/.ssh/id_rsa.pub hadoop@datanode1:/home/hadoop/.ssh
Then on datanode1 run: cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
In principle you distribute the public key to each machine and build the authorized_keys file there; in practice, distributing the authorized_keys file directly also works.
Finally set the permissions: the .ssh directory on each machine should be 711 and authorized_keys should be 644. Permissions that are too loose are rejected (ssh/ssh-add will not accept the key), and permissions that are too strict also break passwordless login. The chmod commands are sketched below.
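The permission settings just described, as commands (run them as the hadoop user on every machine):
$ chmod 711 ~/.ssh
$ chmod 644 ~/.ssh/authorized_keys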
12. Run start-all.sh to start the Hadoop cluster. The slave machines are brought up automatically over passwordless SSH; the NameNode host starts the DataNode processes on the child nodes. (See the start sequence sketched below.)
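A minimal start sequence, run from $HADOOP_HOME on the master. Formatting the NameNode is required once before the very first start and erases any existing HDFS metadata, so it is only for a fresh install:
$ bin/hadoop namenode -format
$ bin/start-all.sh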
13. Check with jps
On the master machine:
[root@hadoop1 conf]# jps
32365 JobTracker
32090 NameNode
25900 Jps
3017 Bootstrap
32269 SecondaryNameNode
On a slave machine:
[root@hadoop2 ~]# jps
10009 TaskTracker
9901 DataNode
27852 Jps
Congratulations, the cluster has started successfully.
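Beyond jps, HDFS and MapReduce can be checked with the command and default Hadoop 1.x web UIs below; the hadoop1 hostname follows the configuration above:
$ bin/hadoop fs -ls /        # lists the (initially empty) HDFS root
NameNode web UI:   http://hadoop1:50070
JobTracker web UI: http://hadoop1:50030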
HBase Installation
HBase and Hadoop versions must match; since Hadoop 1.2.1 is used here, the corresponding HBase release is hbase-0.94.23.
1. Extract hbase-0.94.23.tar.gz with tar zxvf.
2. Edit the configuration files hbase-env.sh and hbase-site.xml.
3. hbase-env.sh
#
#/**
# * Copyright 2007 The Apache Software Foundation
# *
# * Licensed to the Apache Software Foundation (ASF) under one
# * or more contributor license agreements. See the NOTICE file
# * distributed with this work for additional information
# * regarding copyright ownership. The ASF licenses this file
# * to you under the Apache License, Version 2.0 (the
# * "License"); you may not use this file except in compliance
# * with the License. You may obtain a copy of the License at
# *
# * http://www.apache.org/licenses/LICENSE-2.0
# *
# * Unless required by applicable law or agreed to in writing, software
# * distributed under the License is distributed on an "AS IS" BASIS,
# * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# * See the License for the specific language governing permissions and
# * limitations under the License.
# */
# Set environment variables here.
# This script sets variables multiple times over the course of starting an hbase process,
# so try to keep things idempotent unless you want to take an even deeper look
# into the startup scripts (bin/hbase, etc.)
# The java implementation to use. Java 1.6 required. (Set this to your JDK install directory.)
export JAVA_HOME=/usr/local/jdk/jdk1.6.0_45/
# Extra Java CLASSPATH elements. Optional.
# export HBASE_CLASSPATH=
# The maximum amount of heap to use, in MB. Default is 1000.
# export HBASE_HEAPSIZE=1000
# Extra Java runtime options.
# Below are what we set by default. May only work with SUN JVM.
# For more on why as well as other possible settings,
# see http://wiki.apache.org/hadoop/PerformanceTuning
export HBASE_OPTS="-XX:+UseConcMarkSweepGC"
# Uncomment one of the below three options to enable java garbage collection logging for the server-side processes.
# This enables basic gc logging to the .out file.
# export SERVER_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
# This enables basic gc logging to its own file.
# If FILE-PATH is not replaced, the log file(.gc) would still be generated in the HBASE_LOG_DIR .
# export SERVER_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:<FILE-PATH>"
# This enables basic GC logging to its own file with automatic log rolling. Only applies to jdk 1.6.0_34+ and 1.7.0_2+.
# If FILE-PATH is not replaced, the log file(.gc) would still be generated in the HBASE_LOG_DIR .
# export SERVER_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:<FILE-PATH> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=1 -XX:GCLogFileSize=512M"
# Uncomment one of the below three options to enable java garbage collection logging for the client processes.
# This enables basic gc logging to the .out file.
# export CLIENT_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
# This enables basic gc logging to its own file.
# If FILE-PATH is not replaced, the log file(.gc) would still be generated in the HBASE_LOG_DIR .
# export CLIENT_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:<FILE-PATH>"
# This enables basic GC logging to its own file with automatic log rolling. Only applies to jdk 1.6.0_34+ and 1.7.0_2+.
# If FILE-PATH is not replaced, the log file(.gc) would still be generated in the HBASE_LOG_DIR .
# export CLIENT_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:<FILE-PATH> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=1 -XX:GCLogFileSize=512M"
# Uncomment below if you intend to use the EXPERIMENTAL off heap cache.
# export HBASE_OPTS="$HBASE_OPTS -XX:MaxDirectMemorySize="
# Set hbase.offheapcache.percentage in hbase-site.xml to a nonzero value.
# Uncomment and adjust to enable JMX exporting
# See jmxremote.password and jmxremote.access in $JRE_HOME/lib/management to configure remote password access.
# More details at: http://java.sun.com/javase/6/docs/technotes/guides/management/agent.html
#
# export HBASE_JMX_BASE="-Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false"
# export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS $HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10101"
# export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS $HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10102"
# export HBASE_THRIFT_OPTS="$HBASE_THRIFT_OPTS $HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10103"
# export HBASE_ZOOKEEPER_OPTS="$HBASE_ZOOKEEPER_OPTS $HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10104"
# File naming hosts on which HRegionServers will run. $HBASE_HOME/conf/regionservers by default.
# export HBASE_REGIONSERVERS=${HBASE_HOME}/conf/regionservers
# File naming hosts on which backup HMaster will run. $HBASE_HOME/conf/backup-masters by default.
# export HBASE_BACKUP_MASTERS=${HBASE_HOME}/conf/backup-masters
# Extra ssh options. Empty by default.
# export HBASE_SSH_OPTS="-o ConnectTimeout=1 -o SendEnv=HBASE_CONF_DIR"
# Where log files are stored. $HBASE_HOME/logs by default.
# export HBASE_LOG_DIR=${HBASE_HOME}/logs
# Enable remote JDWP debugging of major HBase processes. Meant for Core Developers
# export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8070"
# export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8071"
# export HBASE_THRIFT_OPTS="$HBASE_THRIFT_OPTS -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8072"
# export HBASE_ZOOKEEPER_OPTS="$HBASE_ZOOKEEPER_OPTS -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8073"
# A string representing this instance of hbase. $USER by default.
# export HBASE_IDENT_STRING=$USER
# The scheduling priority for daemon processes. See 'man nice'.
# export HBASE_NICENESS=10
# The directory where pid files are stored. /tmp by default.
# export HBASE_PID_DIR=/var/hadoop/pids
# Seconds to sleep between slave commands. Unset by default. This
# can be useful in large clusters, where, e.g., slave rsyncs can
# otherwise arrive faster than the master can service them.
# export HBASE_SLAVE_SLEEP=0.1
# Tell HBase whether it should manage it's own instance of Zookeeper or not.
# export HBASE_MANAGES_ZK=true
4. Configure hbase-site.xml
<configuration>
<!-- where HBase stores its data (on HDFS, matching fs.default.name in core-site.xml) -->
<property>
<name>hbase.rootdir</name>
<value>hdfs://hadoop1:9000/hbase</value>
</property>
<!-- run HBase in fully distributed mode -->
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.master</name>
<value>hadoop1:60000</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>hadoop1,hadoop2</value>
</property>
</configuration>
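With Hadoop already running, HBase can be started and checked as sketched below. The install path /usr/local/hadoop/hbase/hbase-0.94.23 is an assumption based on the paths used earlier, and conf/regionservers should list the region-server hosts (here hadoop1 and hadoop2), analogous to Hadoop's slaves file:
$ cd /usr/local/hadoop/hbase/hbase-0.94.23
$ bin/start-hbase.sh
$ bin/hbase shell        # then try the status and list commands
jps should now additionally show HMaster on the master, and HRegionServer (plus HQuorumPeer when HBase manages ZooKeeper) on the region servers.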