
Setting Up a Hadoop Pseudo-Distributed Environment

 

Hadoop developers usually test their scripts and code in a pseudo-distributed environment (also known as a single node setup): a single machine, often a virtual machine, that runs all of the Hadoop daemons simultaneously. This allows you to quickly write scripts and test them on limited data sets without having to connect to a remote cluster or pay the expense of EC2. If you're learning Hadoop, you'll probably also want to set up a pseudo-distributed environment to facilitate your understanding of the various Hadoop daemons.

These instructions will help you install a pseudo-distributed environment with Hadoop 2.5.2 on Ubuntu 14.04.

 

Quick Start

There are a couple of options that will allow you to get up and running quickly if you are not familiar with systems administration on Linux or do not wish to work through the process of installing Hadoop yourself. District Data Labs has provided a Virtual Machine Disk (VMDK) configured exactly as in the instructions below, available for you to download directly. You can then use this VMDK in the virtualization software of your choice (e.g. VirtualBox or VMware Fusion). Alternatively, both Hortonworks and Cloudera supply virtual machines for quick download. Be aware that if you do use a Cloudera or Hortonworks distribution, the environment may be subtly different from the one described below.

Click here to download the VMDK we have put together.

If you are using the VMDK supplied by District Data Labs, log in to the machine using the username and password as follows:

username: student
password: password

If you're brave enough to set up the environment yourself, go ahead and move to the next section!

 

Setting up Linux

Before you can get started installing Hadoop, you'll need to have a Linux environment configured and ready to use. These instructions assume that you can get an Ubuntu 14.04 distribution installed on the machine of your choice, either in a dual-boot configuration or using a virtual machine. Whether you use Ubuntu Server or Ubuntu Desktop is left to your preference, since you'll need to be comfortable working with the command line either way. Personally, I prefer to use Ubuntu Server, since it's more lightweight, and to SSH into it from my host operating system.

Base Environment: Ubuntu x64 Desktop 14.04 LTS

Make sure your system is fully up to date and has the required packages by running the following commands:

~$ sudo apt-get update && sudo apt-get upgrade
~$ sudo apt-get install build-essential ssh lzop git rsync curl
~$ sudo apt-get install python-dev python-setuptools
~$ sudo apt-get install libcurl4-openssl-dev
~$ sudo easy_install pip
~$ sudo pip install virtualenv virtualenvwrapper python-dateutil

 

Creating a Hadoop User

In order to secure our Hadoop services, we will make sure that Hadoop is run as a Hadoop-specific user and group. This user would be able to initiate SSH connections to other nodes in a cluster, but not have administrative access to do damage to the operating system upon which the service was running. Implementing Linux permissions also helps secure HDFS and is the start of preparing a secure computing cluster.

This tutorial is not meant for operational implementation. However, as a data scientist, these permissions may save you some headache in the long run, so it is helpful to have the permissions in place on your development environment. This will also ensure that the Hadoop installation is separate from other software applications and will help organize the maintenance of the machine.

Create the hadoop user and group, then add the student user to the Hadoop group:

~$ sudo addgroup hadoop
~$ sudo useradd -m -g hadoop hadoop
~$ sudo usermod -a -G hadoop student

Once you have logged out and logged back in (or restarted the machine) you should be able to see that you've been added to the hadoop group by issuing the groups command. Note that the -m flag creates a home directory for the new hadoop user.
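
For example, after logging back in, the groups command should now list hadoop among your groups (the exact list below is illustrative and will vary with your installation):

~$ groups
student adm cdrom sudo plugdev lpadmin sambashare hadoop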

 

Configuring SSH

SSH is required and must be installed on your system to use Hadoop (and to better manage the virtual environment, especially if you're using a headless Ubuntu). Generate some ssh keys for the Hadoop user by issuing the following commands:

~$ sudo su hadoop
~$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub. [... snip ...]

Simply hit enter at all the prompts to accept the defaults and to create a key that does not require a password to authenticate (this is required for Hadoop). In order to allow the key to be used to SSH into the box, copy the public key to the authorized_keys file with the following command:

~$ cat /home/hadoop/.ssh/id_rsa.pub >> /home/hadoop/.ssh/authorized_keys
~$ chmod 600 /home/hadoop/.ssh/authorized_keys

You should also be able to download this key and use it to SSH into the Ubuntu environment from your host machine. To test the SSH key, issue the following command:

~$ ssh -l hadoop localhost

If this completes successfully without asking you for a password, then you have successfully configured SSH for Hadoop. Exit the SSH session by typing exit; you should be returned to the hadoop user. Exit the hadoop user by typing exit again, and you should now be in a terminal window that says student@ubuntu.

 

Installing Java

Hadoop and most of the Hadoop ecosystem require Java to run. Hadoop requires Java™ 1.6.x or greater and used to recommend particular Oracle Java™ versions; these days the project instead maintains a list of the various JDKs that are known to work well with Hadoop. Ubuntu does not carry the Oracle JDK in its repositories because it is proprietary code, so instead we will install OpenJDK. For more information on supported Java™ versions, see Hadoop Java Versions, and for information about installing different versions on Ubuntu, please see Installing Java on Ubuntu.

~$ sudo apt-get install openjdk-7-*

Do a quick check to ensure the right version of Java™ is installed:

~$ java -version
java version "1.7.0_65" OpenJDK Runtime Environment (IcedTea 2.5.3) (7u71-2.5.3-0ubuntu0.14.04.1) OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)

Hadoop is currently built and tested on both OpenJDK and Oracle's JDK/JRE.

 

Disabling IPv6

It has been reported for a while now that Hadoop running on Ubuntu has a conflict with IPv6, and ever since Hadoop 0.20, Ubuntu users have been disabling IPv6 on their clustered boxes. It is unclear whether or not this is still a bug in the latest versions of Hadoop, however in a single-node or pseudo-distributed environment we will have no need for IPv6, so it is best to simply disable it and not worry about any potential problems.

Edit the /etc/sysctl.conf file by executing the following command:

~$ gksu gedit /etc/sysctl.conf

Then add the following lines to the end of the file:

# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

For this change to take effect, reboot your computer. Once it has rebooted, check the status with the following command:

~$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6

If the output is 0, then IPv6 is enabled. If it is 1, then we have successfully disabled IPv6.
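
If you would rather not reboot right away, you can load the new settings from /etc/sysctl.conf immediately and then run the same check above:

~$ sudo sysctl -p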

 

Installing Hadoop

To get Hadoop, you'll need to download the release of your choice from one of the Apache Download Mirrors. These instructions will download the current stable version of Hadoop with YARN at the time of this writing, Hadoop 2.5.2.

After you've selected a mirror, type the following command into a Terminal window, replacing http://apache.mirror.com/hadoop-2.5.2/ with the mirror URL that you selected for your region:

~/Downloads$ curl -O http://apache.mirror.com/hadoop-2.5.2/hadoop-2.5.2.tar.gz

You can verify the download by ensuring that its md5sum matches the checksum published at the mirror:

~/Downloads$ md5sum hadoop-2.5.2.tar.gz
74a7581893a8224540a9417a4c2630da  hadoop-2.5.2.tar.gz

Of course, you can use any mechanism you wish to download Hadoop - wget or a browser will work just fine.

 

Unpacking

After obtaining the compressed tarball, the next step is to unpack it. You can use an Archive Manager or simply follow the instructions below. The most significant decision you have to make is where to unpack Hadoop.

The Linux operating system depends upon a hierarchical directory structure to function. At the root, many directories that you've heard of have specific purposes:

  • /etc is used to store configuration files
  • /home is used to store user specific files
  • /bin and /sbin include programs that are vital for the OS
  • /usr/sbin are for programs that are not vital but are system wide
  • /usr/local is for locally installed programs
  • /var is used for program data including caches and logs

You can read more about these directories in this Stack Exchange post.

Two good candidates for Hadoop are the /opt and /srv directories.

  • /opt contains non-packaged programs, usually source. A lot of developers stick their code there for deployments.
  • The /srv directory stands for services. Hadoop, HBase, Hive and others run as services on your machine, so this seems like a great place to put things, and it's a standard location that's easy to get to. So let's stick everything there!

Enter the following commands:

~/Downloads$ tar -xzf hadoop-2.5.2.tar.gz
~/Downloads$ sudo mv hadoop-2.5.2 /srv/
~/Downloads$ sudo chown -R hadoop:hadoop /srv/hadoop-2.5.2
~/Downloads$ sudo chmod g+w -R /srv/hadoop-2.5.2
~/Downloads$ sudo ln -s /srv/hadoop-2.5.2 /srv/hadoop

These commands unpack Hadoop, move it to the services directory where we will keep all of our Hadoop and cluster services, and then set permissions. Finally, we create a symlink to the version of Hadoop that we would like to use; this will make it easy to upgrade our Hadoop distribution in the future.
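
For example, a future upgrade would just mean unpacking the new release alongside the old one and repointing the symlink (the 2.6.0 version below is purely hypothetical):

~$ sudo ln -sfn /srv/hadoop-2.6.0 /srv/hadoop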

 

Environment

In order to ensure everything executes correctly, we are going to set some environment variables so that Hadoop runs in its correct context. Enter the following command on the command line to open the hadoop user's profile in a text editor and change the environment variables:

/srv$ gksu gedit /home/hadoop/.bashrc

Add the following lines to this file:

# Set the Hadoop Related Environment variables
export HADOOP_HOME=/srv/hadoop
export PATH=$PATH:$HADOOP_HOME/bin

# Set the JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

We'll also add some convenience functionality to the student user environment. Open the student user bash profile file with the following command:

~$ gedit ~/.profile

Add the following contents to that file:

# Set the Hadoop Related Environment variables
export HADOOP_HOME=/srv/hadoop
export HADOOP_STREAMING=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.5.2.jar
export PATH=$PATH:$HADOOP_HOME/bin

# Set the JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

# Helpful Aliases
alias ..="cd .."
alias ...="cd ../.."
alias hfs="hadoop fs"
alias hls="hfs -ls"

These simple aliases may save you a lot of typing in the long run! Feel free to add any other helpers that you think might be useful in your development work.
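
Note that .profile is normally only read at login, so either log out and back in or reload it in your current shell before continuing:

~$ source ~/.profile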

Check that your environment configuration has worked by running a Hadoop command:

~$ hadoop version
Hadoop 2.5.2
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r cc72e9b000545b86b75a61f4835eb86d57bfafc0
Compiled by jenkins on 2014-11-14T23:45Z
Compiled with protoc 2.5.0
From source with checksum df7537a4faa4658983d397abf4514320
This command was run using /srv/hadoop-2.5.2/share/hadoop/common/hadoop-common-2.5.2.jar

If that ran with no errors and displayed an output similar to the one above, then everything has been configured correctly up to this point.

 

Hadoop Configuration

The penultimate step in setting up Hadoop as a pseudo-distributed node is to edit the configuration files for the Hadoop environment, the MapReduce site, the HDFS site, and the YARN site.

Edit the hadoop-env.sh file by entering the following on the command line.

~$ gedit $HADOOP_HOME/etc/hadoop/hadoop-env.sh

The most important part of this configuration is to change the following line:

# The java implementation to use.
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

Next, edit the core site configuration file:

~$ gedit $HADOOP_HOME/etc/hadoop/core-site.xml

Replace the <configuration></configuration> with the following:

<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/var/app/hadoop/data</value>
    </property>
</configuration>

Edit the MapReduce site configuration by first copying the template and then opening the copy for editing:

~$ cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template \
     $HADOOP_HOME/etc/hadoop/mapred-site.xml
~$ gedit $HADOOP_HOME/etc/hadoop/mapred-site.xml

Replace the <configuration></configuration> with the following:

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

Now edit the HDFS site configuration by editing the following file:

~$ gedit $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Replace the <configuration></configuration> with the following:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

Finally, edit the YARN site configuration file:

~$ gedit $HADOOP_HOME/etc/hadoop/yarn-site.xml

And update the configuration as follows:

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>localhost:8025</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>localhost:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>localhost:8050</value>
    </property>
</configuration>

With these files edited, Hadoop should be fully configured as a pseudo-distributed environment.
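
As an optional sanity check that Hadoop is picking up these files, you can ask it to echo back a configured value; this only reads the configuration, so no daemons need to be running. (You may also see a warning that fs.default.name is deprecated in favor of fs.defaultFS; either key should return the value we set above.)

~$ hdfs getconf -confKey fs.default.name
hdfs://localhost:9000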

 

Formatting the Namenode

The final step before we can turn Hadoop on is to format the namenode. The namenode is in charge of HDFS, the distributed file system. The namenode on this machine is going to keep its files in the /var/app/hadoop/data directory. We need to initialize this directory and then format the namenode to properly use it.

~$ sudo mkdir -p /var/app/hadoop/data
~$ sudo chown hadoop:hadoop -R /var/app/hadoop
~$ sudo su hadoop
~$ hadoop namenode -format

You should see a bunch of Java messages scrolling down the page if the namenode has executed successfully. There should now be directories inside of the /var/app/hadoop/data directory, including a dfs directory. If that is what you see, then Hadoop should be all set up and ready to use!
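
A quick way to confirm the format step worked is to list the data directory; the exact contents may differ slightly between versions, but you should at least see a dfs subdirectory:

~$ ls /var/app/hadoop/data
dfs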

 

Starting Hadoop

At this point we can start and run our Hadoop daemons. When you formatted the namenode, you switched to being the hadoop user with the sudo su hadoop command. If you're still that user, go ahead and execute the following commands:

~$ $HADOOP_HOME/sbin/start-dfs.sh
~$ $HADOOP_HOME/sbin/start-yarn.sh

The daemons should start up and issue messages about where they are logging to and other important information. If you are asked to confirm the authenticity of the host when SSH connects for the first time, just type yes at the prompt. You can see the processes that are running via the jps command:

~$ jps
4801 Jps
4468 ResourceManager
4583 NodeManager
4012 NameNode
4318 SecondaryNameNode
4150 DataNode

If the processes are not running, then something has gone wrong. You can also access the Hadoop cluster administration site by opening a browser and pointing it to http://localhost:8088. This should bring up a page with the Hadoop logo and a table of applications.

To wrap up the configuration, prepare a space on HDFS for our student account to store data and to run analytical jobs on:

~$ hadoop fs -mkdir -p /user/student
~$ hadoop fs -chown student:student /user/student

You can now exit from the hadoop user's shell with the exit command.
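
As a quick smoke test, back as the student user (with the .profile changes from earlier in effect), you can copy a small file into your new HDFS home directory; the file name here is just an example:

~$ echo "hello hadoop" > test.txt
~$ hadoop fs -put test.txt /user/student/
~$ hadoop fs -ls /user/student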

 

Restarting Hadoop

If you reboot your machine, the Hadoop daemons will stop running and will not automatically be restarted. If you are attempting to run a Hadoop command and you get a "connection refused" message, it is likely because the daemons are not running. You can check this by issuing the jps command as sudo:

~$ sudo jps

To restart Hadoop in the case that it shuts down, issue the following commands:

~$ sudo -H -u hadoop $HADOOP_HOME/sbin/start-dfs.sh
~$ sudo -H -u hadoop $HADOOP_HOME/sbin/start-yarn.sh

The processes should start up again as the dedicated hadoop user and you'll be back on your way!
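
Similarly, if you want to shut the daemons down cleanly (for example before powering off the virtual machine), the matching stop scripts live in the same sbin directory:

~$ sudo -H -u hadoop $HADOOP_HOME/sbin/stop-yarn.sh
~$ sudo -H -u hadoop $HADOOP_HOME/sbin/stop-dfs.sh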

 

Installing Hive

For the most part, installing services on Hadoop (e.g. Hive, HBase, or others) will consist of the following in the environment we have set up:

  1. Download the release tarball of the service
  2. Unpack the release to /srv/ and create a symlink from the release to a simple name
  3. Configure environment variables with the new paths
  4. Configure the service to run in pseudo-distributed mode

Hive also follows this pattern. Find the Hive release you wish to download from the Apache Hive downloads page. At the time of this writing, Hive release 0.14.0 is current. Once you have selected a mirror, download the apache-hive-0.14.0-bin.tar.gz file to your downloads directory. Then issue the following commands in the terminal to unpack it:

~$ tar -xzf apache-hive-0.14.0-bin.tar.gz
~$ sudo mv apache-hive-0.14.0-bin /srv
~$ sudo chown -R hadoop:hadoop /srv/apache-hive-0.14.0-bin
~$ sudo ln -s /srv/apache-hive-0.14.0-bin /srv/hive

Edit your ~/.profile with these environment variables by adding the following to the bottom of the .profile:

# Configure Hive environment
export HIVE_HOME=/srv/hive
export PATH=$PATH:$HIVE_HOME/bin

No other configuration for Hive is required, although you can find other configuration details in $HIVE_HOME/conf, including the Hive environment shell file and the Hive site configuration XML.
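
If you want a quick check that Hive works (assuming HDFS and YARN are running and your .profile has been reloaded), you can run a trivial query; on first use Hive will create a local Derby metastore in the directory you run it from, and the output should include the default database:

~$ hive -e "SHOW DATABASES;"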

 

Installing Spark

Installing Spark is also pretty straightforward, and we'll install it similarly to how we installed Hive. Find the Spark release you wish to download from the Apache Spark downloads page. The Spark release at the time of this writing is 1.1.0. You should choose the package type "Pre-built for Hadoop 2.4" and the download type "Direct Download". Then unpack it as follows:

~$ tar -xzf spark-1.1.0-bin-hadoop2.4.tgz
~$ sudo mv spark-1.1.0-bin-hadoop2.4 /srv
~$ sudo chown -R hadoop:hadoop /srv/spark-1.1.0-bin-hadoop2.4
~$ sudo ln -s /srv/spark-1.1.0-bin-hadoop2.4 /srv/spark

Edit your ~/.profile with the following environment variables at the bottom of the file:

# Configure Spark environment
export SPARK_HOME=/srv/spark
export PATH=$SPARK_HOME/bin:$PATH

After you source your .profile or restart your terminal, you should be able to run a pyspark interpreter locally. You can now use pyspark and spark-submit commands to run Spark jobs.
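
For a quick check of the Spark installation, you can start the interpreter and run a tiny local computation (the sc SparkContext is created for you when pyspark starts):

~$ pyspark
>>> sc.parallelize(range(10)).sum()
45
>>> exit()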

 

Conclusion

 

At this point you should now have a fully configured Hadoop setup ready for development in pseudo-distributed mode on Ubuntu with HDFS, MapReduce on YARN, Hive, and Spark all ready to go as well as a simple methodology for installing other services. 


 
