Tips and Guidelines
Selecting Appropriate JAR files for your MRv1 and YARN Jobs
Each implementation of the CDH4 MapReduce framework (MRv1 and YARN) consists of the artifacts (JAR files) that provide core MapReduce functionality, plus auxiliary utility artifacts used during the course of a MapReduce job. Whether you submit a job explicitly (using the Hadoop launcher script) or implicitly (from Java code), make sure you reference the utility artifacts that come with the same version of the MapReduce implementation running on your cluster. The following table summarizes the names and locations of these artifacts:
Artifact | MRv1 location | YARN location
--- | --- | ---
streaming | /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh<version>.jar | /usr/lib/hadoop-mapreduce/hadoop-streaming.jar
rumen | N/A | /usr/lib/hadoop-mapreduce/hadoop-rumen.jar
hadoop examples | /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar | /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
distcp v1 | /usr/lib/hadoop-0.20-mapreduce/hadoop-tools.jar | /usr/lib/hadoop-mapreduce/hadoop-extras.jar
distcp v2 | N/A | /usr/lib/hadoop-mapreduce/hadoop-distcp.jar
hadoop archives | /usr/lib/hadoop-0.20-mapreduce/hadoop-tools.jar | /usr/lib/hadoop-mapreduce/hadoop-archives.jar
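For example, to run a streaming job on a YARN cluster you would point the launcher at the YARN-side artifact from the table above; the input, output, mapper, and reducer values here are hypothetical placeholders:
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -input /user/joe/words -output /user/joe/wordcount \
    -mapper /bin/cat -reducer /usr/bin/wc
On an MRv1 cluster, the same job would instead reference /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh<version>.jar.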
Improving Performance
This section provides solutions to some performance problems, and describes configuration best practices.
If you are running CDH over 10Gbps Ethernet, improperly set network configuration or improperly applied NIC firmware or drivers can noticeably degrade performance. Work with your network engineers and hardware vendors to make sure that you have the proper NIC firmware, drivers, and configurations in place and that your network performs properly. Cloudera recognizes that network setup and upgrade are challenging problems, and will make best efforts to share any helpful experiences.
Disabling Transparent Hugepage Compaction
Most Linux platforms supported by CDH4 include a feature called transparent hugepage compaction which interacts poorly with Hadoop workloads and can seriously degrade performance.
Symptom: top and other system monitoring tools show a large percentage of the CPU usage classified as "system CPU". If system CPU usage is 30% or more of the total CPU usage, your system may be experiencing this issue.
The defrag file (referred to below as defrag_file_pathname) is located at:
- Red Hat/CentOS: /sys/kernel/mm/redhat_transparent_hugepage/defrag
- Ubuntu/Debian, OEL, SLES: /sys/kernel/mm/transparent_hugepage/defrag
- To see whether transparent hugepage compaction is enabled, run the following command and check the output:
$ cat defrag_file_pathname
- [always] never means that transparent hugepage compaction is enabled.
- always [never] means that transparent hugepage compaction is disabled.
- To disable transparent hugepage compaction, add the following command to /etc/rc.local:
echo never > defrag_file_pathname
You can also disable transparent hugepage compaction interactively (but remember this will not survive a reboot).
# echo 'never' > defrag_file_pathname
To disable transparent hugepage compaction temporarily using sudo:
$ sudo sh -c "echo 'never' > defrag_file_pathname"
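If your cluster mixes operating systems, a single /etc/rc.local fragment can cover both defrag file locations listed above. The following is a minimal sketch, not a Cloudera-provided script:
if [ -e /sys/kernel/mm/redhat_transparent_hugepage/defrag ]; then
  echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag   # Red Hat/CentOS path
elif [ -e /sys/kernel/mm/transparent_hugepage/defrag ]; then
  echo never > /sys/kernel/mm/transparent_hugepage/defrag          # Ubuntu/Debian, OEL, SLES path
fi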
Setting the vm.swappiness Linux Kernel Parameter
vm.swappiness is a Linux kernel parameter that controls how aggressively memory pages are swapped to disk. It can be set to a value between 0 and 100; the higher the value, the more aggressive the kernel is in seeking out inactive memory pages and swapping them to disk.
You can see what value vm.swappiness is currently set to by looking at /proc/sys/vm; for example:
cat /proc/sys/vm/swappiness
On most systems, it is set to 60 by default. This is not suitable for Hadoop cluster nodes, because it can cause processes to get swapped out even when there is free memory available. This can affect stability and performance, and may cause problems such as lengthy garbage collection pauses for important system daemons. Cloudera recommends that you set this parameter to 0; for example:
# sysctl -w vm.swappiness=0
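The sysctl -w command changes the setting only for the running system. A common way to make it persist across reboots, assuming a standard /etc/sysctl.conf, is to add the parameter there and reload:
$ sudo sh -c 'echo "vm.swappiness = 0" >> /etc/sysctl.conf'
$ sudo sysctl -p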
Performance Enhancements in Shuffle Handler and IFile Reader
As of CDH4.1, the MapReduce shuffle handler and IFile reader use native Linux calls (posix_fadvise(2) and sync_file_range(2)) on Linux systems with Hadoop native libraries installed. The subsections that follow provide details.
Shuffle Handler
You can improve MapReduce shuffle handler performance by enabling shuffle readahead. This causes the TaskTracker or NodeManager to pre-fetch map output before sending it over the socket to the reducer.
- To enable this feature for YARN, set the mapreduce.shuffle.manage.os.cache property to true (default). To further tune performance, adjust the value of the mapreduce.shuffle.readahead.bytes property. The default value is 4MB.
- To enable this feature for MRv1, set the mapred.tasktracker.shuffle.fadvise property to true (default). To further tune performance, adjust the value of the mapred.tasktracker.shuffle.readahead.bytes property. The default value is 4MB.
IFile Reader
Enabling IFile readahead increases the performance of merge operations. To enable this feature for either MRv1 or YARN, set the mapreduce.ifile.readahead property to true (default). To further tune performance, adjust the value of the mapreduce.ifile.readahead.bytes property. The default value is 4 MB.
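As an illustration, a mapred-site.xml fragment that spells out the MRv1 shuffle and IFile readahead settings described above might look like the following; the values shown are simply the documented defaults (true, and 4 MB expressed in bytes):
<property>
  <name>mapred.tasktracker.shuffle.fadvise</name>
  <value>true</value>
</property>
<property>
  <name>mapred.tasktracker.shuffle.readahead.bytes</name>
  <value>4194304</value>
</property>
<property>
  <name>mapreduce.ifile.readahead</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.ifile.readahead.bytes</name>
  <value>4194304</value>
</property>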
Best Practices for MapReduce Configuration
The configuration settings described below can reduce inherent latencies in MapReduce execution. You set these values in mapred-site.xml.
Send a heartbeat as soon as a task finishes
Set the mapreduce.tasktracker.outofband.heartbeat property to true to let the TaskTracker send an out-of-band heartbeat on task completion to reduce latency; the default value is false:
<property>
  <name>mapreduce.tasktracker.outofband.heartbeat</name>
  <value>true</value>
</property>
Reduce the interval for JobClient status reports on single node systems
The jobclient.progress.monitor.poll.interval property defines the interval (in milliseconds) at which JobClient reports status to the console and checks for job completion. The default value is 1000 milliseconds; you may want to set this to a lower value to make tests run faster on a single-node cluster. Adjusting this value on a large production cluster may lead to unwanted client-server traffic.
<property>
  <name>jobclient.progress.monitor.poll.interval</name>
  <value>10</value>
</property>
Tune the JobTracker heartbeat interval
Tuning the minimum interval for the TaskTracker-to-JobTracker heartbeat to a smaller value may improve MapReduce performance on small clusters.
<property>
  <name>mapreduce.jobtracker.heartbeat.interval.min</name>
  <value>10</value>
</property>
Start MapReduce JVMs immediately
The mapred.reduce.slowstart.completed.maps property specifies the proportion of Map tasks in a job that must be completed before any Reduce tasks are scheduled. For small jobs that require fast turnaround, setting this value to 0 can improve performance; larger values (as high as 50%) may be appropriate for larger jobs.
<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0</value>
</property>
Best practices for HDFS Configuration
This section indicates changes you may want to make in hdfs-site.xml.
Improve Performance for Local Reads
This capability, also known as short-circuit local reads, is particularly useful for HBase and Cloudera Impala™. It improves the performance of node-local reads by providing a fast path that is used when the client reads data stored on the same node. It requires libhadoop.so (the Hadoop Native Library) to be accessible to both the server and the client.
libhadoop.so is not available if you have installed from a tarball. You must install from an .rpm, .deb, or parcel in order to use short-circuit local reads.
Configure the following properties in hdfs-site.xml as shown:
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <name>dfs.client.read.shortcircuit.streams.cache.size</name>
  <value>1000</value>
</property>
<property>
  <name>dfs.client.read.shortcircuit.streams.cache.size.expiry.ms</name>
  <value>1000</value>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/run/hadoop-hdfs/dn._PORT</value>
</property>
The text _PORT appears just as shown; you do not need to substitute a number.
If /var/run/hadoop-hdfs/ is group-writable, make sure its group is root.
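A quick way to check the socket directory's ownership and mode before restarting the DataNodes (the path is the one used in dfs.domain.socket.path above):
$ ls -ld /var/run/hadoop-hdfs
If the listing shows the directory as group-writable with a group other than root, change the group as the note above requires:
$ sudo chgrp root /var/run/hadoop-hdfs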
Tips and Best Practices for Jobs
This section describes changes you can make at the job level.
Use the Distributed Cache to Transfer the Job JAR
Use the distributed cache to transfer the job JAR rather than using the JobConf(Class) constructor and the JobConf.setJar() and JobConf.setJarByClass() methods.
To add JARs to the classpath, use -libjars <jar1>,<jar2>, which will copy the local JAR files to HDFS and then use the distributed cache mechanism to make sure they are available on the task nodes and are added to the task classpath.
The advantage of this over JobConf.setJar is that if the JAR is on a task node it won't need to be copied again if a second task from the same job runs on that node, though it will still need to be copied from the launch machine to HDFS.
-libjars works only if your MapReduce driver uses ToolRunner. If it doesn't, you would need to use the DistributedCache APIs (Cloudera does not recommend this).
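As a sketch, a submission that relies on ToolRunner's generic option parsing might look like this; the JAR name, driver class, dependency paths, and input/output directories are hypothetical:
$ hadoop jar my-job.jar com.example.MyDriver \
    -libjars /local/libs/commons-foo.jar,/local/libs/bar-utils.jar \
    /user/joe/input /user/joe/output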
For more information, see item 1 in the blog post How to Include Third-Party Libraries in Your MapReduce Job.
Changing the Logging Level on a Job (MRv1)
You can change the logging level for an individual job. You do this by setting the following properties in the job configuration (JobConf):
- mapreduce.map.log.level
- mapreduce.reduce.log.level
Valid values are NONE, INFO, WARN, DEBUG, TRACE, and ALL.
Example:
JobConf conf = new JobConf();
...
conf.set("mapreduce.map.log.level", "DEBUG");
conf.set("mapreduce.reduce.log.level", "TRACE");
...
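If your driver uses ToolRunner (see the -libjars discussion above), the same properties can also be passed at submission time with the generic -D option; the JAR name, class, and paths below are hypothetical:
$ hadoop jar my-job.jar com.example.MyDriver \
    -D mapreduce.map.log.level=DEBUG \
    -D mapreduce.reduce.log.level=TRACE \
    /user/joe/input /user/joe/output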