Hadoop: 10 tips (repost)

10 MapReduce Tips
This piece is based on the talk “Practical MapReduce” that I gave at Hadoop User Group UK on April 14.

1. Use an appropriate MapReduce language
There are many languages and frameworks that sit on top of MapReduce, so it’s worth thinking up front about which one to use for a particular problem. There is no one-size-fits-all language; each has different strengths and weaknesses.

Java: Good for: speed; control; binary data; working with existing Java or MapReduce libraries.
Pipes: Good for: working with existing C++ libraries.
Streaming: Good for: writing MapReduce programs in scripting languages.
Dumbo (Python), Happy (Jython), Wukong (Ruby), mrtoolkit (Ruby): Good for: Python/Ruby programmers who want quick results, and are comfortable with the MapReduce abstraction.
Pig, Hive, Cascading: Good for: higher-level abstractions; joins; nested data.
While there are no hard and fast rules, in general, we recommend using pure Java for large, recurring jobs, Hive for SQL-style analysis and data warehousing, and Pig or Streaming for ad-hoc analysis.



2. Consider your input data “chunk” size
Are you generating large, unbounded files, like log files? Or lots of small files, like image files? How frequently do you need to run jobs?

Answers to these questions determine how you store and process data using HDFS. For large unbounded files, one approach (until HDFS appends are working) is to write files in batches and merge them periodically. For lots of small files, see The Small Files Problem. HBase is a good abstraction for some of these problems too, so it may be worth considering.
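A rough way to see why chunk size matters is to estimate how many map tasks each layout produces. The sketch below assumes the default behavior of one map task per HDFS block and at least one per file. The 64 MB block size, the file count and sizes, and the class and method names are illustrative assumptions, and the real split computation has extra details that are ignored here.

```java
public class MapTaskEstimate {

    /** Approximate map tasks for one file: one per block, at least one per non-empty file. */
    static long mapsForFile(long fileBytes, long blockBytes) {
        if (fileBytes <= 0) {
            return 0;
        }
        return (fileBytes + blockBytes - 1) / blockBytes; // ceiling division
    }

    /** Approximate map tasks for fileCount files that all have the same size. */
    static long mapsForFiles(long fileCount, long fileBytes, long blockBytes) {
        return fileCount * mapsForFile(fileBytes, blockBytes);
    }

    public static void main(String[] args) {
        long blockBytes = 64L * 1024 * 1024; // assumed HDFS block size
        long smallFile = 100L * 1024;        // assumed 100 KB per small file
        long fileCount = 10_000;             // assumed number of small files

        long unmerged = mapsForFiles(fileCount, smallFile, blockBytes);
        long merged = mapsForFile(fileCount * smallFile, blockBytes);

        System.out.println("map tasks, 10,000 small files: " + unmerged);
        System.out.println("map tasks, one merged file:    " + merged);
    }
}
```

With these example numbers the unmerged layout needs about 10,000 map tasks, while the same data merged into one file needs about 16.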

3. Use SequenceFile and MapFile containers
SequenceFiles are a very useful tool. They are:

Splittable. So they work well with MapReduce: each map gets an independent split to work on.
Compressible. By using block compression you get the benefits of compression (less disk space used, faster reads and writes), while still keeping the file splittable.
Compact. SequenceFiles are usually used with Hadoop Writable objects, which have a pretty compact format.
A MapFile is an indexed SequenceFile, useful if you want to do look-ups by key.

However, both are Java-centric, so you can’t read them with non-Java tools. The Thrift and Avro projects are the places to look for language-neutral container file formats. (For example, see Avro’s DataFileWriter, although there is no MapReduce integration yet.)

4. Implement the Tool interface
If you are writing a Java driver, then consider implementing the Tool interface to get the following options for free:

-D to pass in arbitrary properties (e.g. -D mapred.reduce.tasks=7 sets the number of reducers to 7)
-files to put files into the distributed cache
-archives to put archives (tar, tar.gz, zip, jar) into the distributed cache
-libjars to put JAR files on the task classpath
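
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJob extends Configured implements Tool {

  public int run(String[] args) throws Exception {
    // getConf() already reflects any -D, -files, -archives and -libjars options
    JobConf job = new JobConf(getConf(), MyJob.class);
    // configure the job (input, output, mapper, reducer) ...
    JobClient.runJob(job); // waits for completion; throws if the job fails
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner parses the generic options before calling run()
    int res = ToolRunner.run(new Configuration(), new MyJob(), args);
    System.exit(res);
  }
}

By taking this step you also make your driver more testable, since you can inject arbitrary configurations using Configured’s setConf() method.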

5. Chain your jobs
It’s often natural to split a problem into multiple MapReduce jobs. The benefit is a better decomposition of the problem into smaller, more easily understood (and more easily tested) steps. It can also boost reusability. Also, by using the Fair Scheduler, you can run a small job promptly, and not worry that it will be stuck in a long queue of (other people’s) jobs.

ChainMapper and ChainReducer (in 0.20.0) are worth checking out too, as they allow you to use smaller units within one job, effectively allowing multiple mappers before and after the (single) reducer: M+RM*.
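The M+RM* shape can be pictured with a small in-memory simulation. This is only a sketch of the data flow using plain Java collections, not Hadoop's ChainMapper or ChainReducer API. The class name, the word-count style steps, and the sample input are all made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

public class ChainSketch {

    /** Runs lines through two chained map steps, one reduce step, and one post-reduce map step. */
    static Map<String, String> run(List<String> lines) {
        // M1: split each line into words (one record in, many records out).
        Function<String, List<String>> tokenize =
            line -> Arrays.asList(line.trim().split("\\s+"));
        // M2: normalize each word.
        Function<String, String> lowerCase = String::toLowerCase;

        // Shuffle and R: group by word and count the occurrences.
        Map<String, Long> counts = lines.stream()
            .flatMap(line -> tokenize.apply(line).stream())
            .filter(word -> !word.isEmpty())
            .map(lowerCase)
            .collect(Collectors.groupingBy(
                Function.identity(), TreeMap::new, Collectors.counting()));

        // M3 (after the reducer): reformat each reduced value.
        Map<String, String> formatted = new TreeMap<>();
        counts.forEach((word, n) -> formatted.put(word, n + "x"));
        return formatted;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("Pig Hive pig", "hive PIG")));
    }
}
```

Each step stays small enough to test on its own, which is the main appeal of chaining; in a real job the grouping step is the shuffle between the map and reduce phases.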

Pig and Hive do this kind of thing all the time, and it can be instructive to understand what they are doing behind the scenes by using EXPLAIN, or even by reading their source code, to make you a better MapReduce programmer. Of course, you could always use Pig or Hive in the first place…

6. Favor multiple partitions
We’re used to thinking that the output data is contained in one file. This is OK for small datasets, but if the output is large (more than a few tens of gigabytes, say) then it’s normally better to have a partitioned file, so you take advantage of the cluster parallelism for the reducer tasks. Conceptually, you should think of your output/part-* files as a single “file”: the fact that it is broken up is an implementation detail. Often, the output forms the input to another MapReduce job, so it is naturally processed as a partitioned output by specifying output as the input path to the second job.

In some cases the partitioning can be exploited. CompositeInputFormat, for example, uses the partitioning to do joins efficiently on the map-side. Another example: if your output is a MapFile, you can use MapFileOutputFormat’s getReaders() method to do lookups on the partitioned output.
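To see why the part-* files behave like one logical file, it helps to look at how keys are assigned to partitions. The sketch below applies the rule used by a hash partitioner (the key's hash code, masked to be non-negative, modulo the number of reducers) to plain strings. The class name, the sample keys, and the file-naming helper are illustrative assumptions, and real jobs hash Writable keys, so actual partition numbers will differ.

```java
public class PartitionSketch {

    /** Hash partitioning rule: non-negative hash code modulo the partition count. */
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** Conventional name of the output file written by the given reducer. */
    static String partFileName(int partition) {
        return String.format("part-%05d", partition);
    }

    public static void main(String[] args) {
        int numPartitions = 4; // assumed number of reducers
        for (String key : new String[] {"apple", "banana", "cherry"}) {
            int p = partitionFor(key, numPartitions);
            System.out.println(key + " -> " + partFileName(p));
        }
    }
}
```

Because the rule depends only on the key and the partition count, two datasets written with the same partitioner and the same number of reducers place any given key in the same partition number, which is the property that map-side joins and partitioned lookups can exploit.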

For small outputs you can merge the partitions into a single file, either by setting the number of reducers to 1 (the default), or by using the handy -getmerge option on the filesystem shell:

% hadoop fs -getmerge hdfs-output-dir local-file

This concatenates the HDFS files hdfs-output-dir/part-* into a single local file.

7. Report progress
If your task reports no progress for 10 minutes (see the mapred.task.timeout property) then it will be killed by Hadoop. Most tasks don’t encounter this situation since they report progress implicitly by reading input and writing output. However, some jobs which don’t process records in this way may fall foul of this behavior and have their tasks killed. Simulations are a good example, since they do a lot of CPU-intensive processing in each map and typically only write the result at the end of the computation. They should be written in such a way as to report progress on a regular basis (more frequently than every 10 minutes). This may be achieved in a number of ways:

Call setStatus() on Reporter to set a human-readable description of the task’s progress
Call incrCounter() on Reporter to increment a user counter
Call progress() on Reporter to tell Hadoop that your task is still there (and making progress)
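
For a CPU-bound task, the usual pattern is to report from inside the main loop at a fixed interval. The sketch below shows that pattern. So that it compiles without Hadoop on the classpath, TaskReporter is a hypothetical stand-in for the setStatus() and progress() methods of Hadoop's Reporter named above; the placeholder computation, the class names, and the reporting interval are likewise assumptions for illustration.

```java
public class ProgressSketch {

    /** Hypothetical stand-in for the two Reporter methods used in this sketch. */
    interface TaskReporter {
        void setStatus(String status);
        void progress();
    }

    /** Runs a CPU-bound loop, reporting every reportEvery iterations, and returns the result. */
    static long simulate(long iterations, long reportEvery, TaskReporter reporter) {
        long state = 1;
        for (long i = 1; i <= iterations; i++) {
            state = (state * 31 + i) % 1_000_003; // placeholder for the real work
            if (i % reportEvery == 0) {
                reporter.setStatus("completed " + i + " of " + iterations + " iterations");
                reporter.progress(); // signals that the task is still alive
            }
        }
        return state;
    }

    public static void main(String[] args) {
        final long[] calls = {0};
        TaskReporter counting = new TaskReporter() {
            public void setStatus(String status) { System.out.println(status); }
            public void progress() { calls[0]++; }
        };
        simulate(1_000, 250, counting);
        System.out.println("progress() calls: " + calls[0]);
    }
}
```

In a real task the interval should be chosen so that the gap between reports stays well below the mapred.task.timeout value.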

8. Debug with status and counters
Using the Reporter’s setStatus() and incrCounter() methods is a simple but effective way to debug your jobs. Counters are often better than printing to standard error since they are aggregated centrally, and allow you to see how many times a condition has occurred.

Status descriptions are shown on the web UI so you can monitor a job and keep an eye on the statuses (as long as all the tasks fit on a single page). You can send extra debugging information to standard error, which you can then retrieve through the web UI (click through to the task attempt, and find the stderr file).

You can do more advanced debugging with debug scripts.

9. Tune at the job level before the task level
Before you start profiling tasks there are a number of job-level checks to run through:

Have you set the optimal number of mappers and reducers?
  The number of mappers is by default set to one per HDFS block. This is usually a good default, but see tip 2.
  The number of reducers is best set to be the number of reduce slots in the cluster (minus a few to allow for failures). This allows the reducers to complete in a single wave.
Have you set a combiner (if your algorithm allows it)?
Have you enabled intermediate compression? (See JobConf.setCompressMapOutput(), or equivalently mapred.compress.map.output).
If using custom Writables, have you provided a RawComparator?
Finally, there are a number of low-level MapReduce shuffle parameters that you can tune to get improved performance.
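
The single-wave rule for reducers is simple arithmetic. The sketch below computes a reducer count from cluster capacity and shows how many waves a given count would need. The node count, the slots per node, and the size of the failure reserve are made-up example values, and the class and method names are illustrative.

```java
public class ReducerCount {

    /** Reduce slots in the cluster minus a small reserve for failed tasks. */
    static int recommendedReducers(int nodes, int reduceSlotsPerNode, int reserve) {
        int totalSlots = nodes * reduceSlotsPerNode;
        return Math.max(1, totalSlots - reserve);
    }

    /** Number of waves needed to run the given reducers on the available slots. */
    static int waves(int reducers, int totalSlots) {
        return (reducers + totalSlots - 1) / totalSlots; // ceiling division
    }

    public static void main(String[] args) {
        int nodes = 20;        // assumed cluster size
        int slotsPerNode = 2;  // assumed reduce slots per node
        int reserve = 3;       // assumed headroom for failures
        int totalSlots = nodes * slotsPerNode;

        int reducers = recommendedReducers(nodes, slotsPerNode, reserve);
        System.out.println("reducers: " + reducers + ", waves: " + waves(reducers, totalSlots));
        System.out.println("waves with 50 reducers: " + waves(50, totalSlots));
    }
}
```

For this example cluster the sketch suggests 37 reducers, which fit in one wave, whereas 50 reducers would need two. The chosen value can then be supplied through the mapred.reduce.tasks property.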

10. Let someone else do the cluster administration
Getting a cluster up and running can be decidedly non-trivial, so use some of the free tools to get started. For example, Cloudera provides an online configuration tool, RPMs, and Debian packages to set up Hadoop on your own hardware, as well as scripts to run on Amazon EC2.

Do you have a MapReduce tip to share? Please let us know in the comments.

Monday, May 18th, 2009 at 1:40 pm by Tom White, filed under general, hadoop, mapreduce