This piece is based on the talk “Practical MapReduce” that I gave at Hadoop User Group UK on April 14.
1. Use an appropriate MapReduce language
There are many languages and frameworks that sit on top of MapReduce, so it’s worth thinking up-front which one to use for a particular problem. There is no one-size-fits-all language; each has different strengths and weaknesses.
- Java: Good for: speed; control; binary data; working with existing Java or MapReduce libraries.
- Pipes: Good for: working with existing C++ libraries.
- Streaming: Good for: writing MapReduce programs in scripting languages.
- Dumbo (Python), Happy (Jython), Wukong (Ruby), mrtoolkit (Ruby): Good for: Python/Ruby programmers who want quick results, and are comfortable with the MapReduce abstraction.
- Pig, Hive, Cascading: Good for: higher-level abstractions; joins; nested data.
While there are no hard and fast rules, in general, we recommend using pure Java for large, recurring jobs, Hive for SQL-style analysis and data warehousing, and Pig or Streaming for ad-hoc analysis.
2. Consider your input data “chunk” size
Are you generating large, unbounded files, like log files? Or lots of small files, like image files? How frequently do you need to run jobs?
Answers to these questions determine how you store and process data using HDFS. For large unbounded files, one approach (until HDFS appends are working) is to write files in batches and merge them periodically. For lots of small files, see The Small Files Problem. HBase is a good abstraction for some of these problems too, so it may be worth considering.
3. Use SequenceFile and MapFile containers
SequenceFiles are a very useful tool. They are:
- Splittable. So they work well with MapReduce: each map gets an independent split to work on.
- Compressible. By using block compression you get the benefits of compression (less disk space, faster reads and writes) while still keeping the file splittable.
- Compact. SequenceFiles are usually used with Hadoop Writable objects, which have a pretty compact format.
A MapFile is an indexed SequenceFile, useful if you want to do look-ups by key.
However, both are Java-centric, so you can’t read them with non-Java tools. The Thrift and Avro projects are the places to look for language-neutral container file formats. (For example, see Avro’s DataFileWriter, although there is no MapReduce integration yet.)
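As a minimal sketch of the idea (the path and record types are illustrative, using the pre-0.20 API), here is how you might write a block-compressed SequenceFile and read it back:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("demo.seq"); // illustrative path

    // Write a block-compressed SequenceFile of (Text, IntWritable) records.
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, path,
        Text.class, IntWritable.class, SequenceFile.CompressionType.BLOCK);
    try {
      for (int i = 0; i < 100; i++) {
        writer.append(new Text("key" + i), new IntWritable(i));
      }
    } finally {
      writer.close();
    }

    // Read the records back.
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      Text key = new Text();
      IntWritable value = new IntWritable();
      while (reader.next(key, value)) {
        System.out.println(key + "\t" + value);
      }
    } finally {
      reader.close();
    }
  }
}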
4. Implement the Tool interface
If you are writing a Java driver, then consider implementing the Tool interface to get the following options for free:
- -D to pass in arbitrary properties (e.g. -D mapred.reduce.tasks=7 sets the number of reducers to 7)
- -files to put files into the distributed cache
- -archives to put archives (tar, tar.gz, zip, jar) into the distributed cache
- -libjars to put JAR files on the task classpath
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJob extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    JobConf job = new JobConf(getConf(), MyJob.class);
    // configure input/output paths, mapper, reducer, etc., then run the job
    JobClient.runJob(job);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new MyJob(), args);
    System.exit(res);
  }
}
By taking this step you also make your driver more testable, since you can inject arbitrary configurations using Configured’s setConf() method.
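As a rough sketch of what that buys you (the property and paths below are illustrative, and it assumes run() wires args[0] and args[1] to the input and output paths), a test or another driver can pass in a pre-built Configuration via ToolRunner:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;

public class MyJobSmokeTest {
  public static void main(String[] args) throws Exception {
    // Build a configuration in code rather than on the command line.
    Configuration conf = new Configuration();
    conf.set("mapred.reduce.tasks", "1"); // keep the test job small

    // ToolRunner calls setConf() on MyJob before invoking run().
    int exitCode = ToolRunner.run(conf, new MyJob(),
        new String[] { "test-input", "test-output" }); // illustrative paths
    System.out.println("Job exit code: " + exitCode);
  }
}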
5. Chain your jobs
It’s often natural to split a problem into multiple MapReduce jobs. The benefits are a better decomposition of the problem into smaller, more-easily understood (and more easily tested) steps. It can also boost re-usability. Also, by using the Fair Scheduler, you can run a small job promptly, and not worry that it will be stuck in a long queue of (other people’s) jobs.
ChainMapper and ChainReducer (in 0.20.0) are worth checking out too, as they allow you to use smaller units within one job, effectively allowing multiple mappers before and after the (single) reducer: M+RM*.
Pig and Hive do this kind of thing all the time, and it can be instructive to understand what they are doing behind the scenes by using EXPLAIN, or even by reading their source code, to make you a better MapReduce programmer. Of course, you could always use Pig or Hive in the first place…
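As a minimal sketch of chaining (the class name and paths are illustrative, using the pre-0.20 API, with mapper and reducer settings elided), a driver can simply run two jobs back to back, feeding the first job’s output directory to the second:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class TwoStepDriver {
  public static void main(String[] args) throws Exception {
    Path input = new Path(args[0]);
    Path intermediate = new Path(args[1]);
    Path output = new Path(args[2]);

    // Step 1: first MapReduce job writes to an intermediate directory.
    JobConf step1 = new JobConf(TwoStepDriver.class);
    step1.setJobName("step1");
    FileInputFormat.setInputPaths(step1, input);
    FileOutputFormat.setOutputPath(step1, intermediate);
    // step1.setMapperClass(...); step1.setReducerClass(...);
    JobClient.runJob(step1); // blocks until step 1 completes

    // Step 2: second job consumes the partitioned output of step 1.
    JobConf step2 = new JobConf(TwoStepDriver.class);
    step2.setJobName("step2");
    FileInputFormat.setInputPaths(step2, intermediate);
    FileOutputFormat.setOutputPath(step2, output);
    // step2.setMapperClass(...); step2.setReducerClass(...);
    JobClient.runJob(step2);
  }
}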
6. Favor multiple partitions
We’re used to thinking that the output data is contained in one file. This is OK for small datasets, but if the output is large (more than a few tens of gigabytes, say) then it’s normally better to have a partitioned file, so you take advantage of the cluster parallelism for the reducer tasks. Conceptually, you should think of your output/part-* files as a single “file”: the fact that it is broken up is an implementation detail. Often, the output forms the input to another MapReduce job, so it is naturally processed as a partitioned output by specifying the output directory as the input path to the second job.
In some cases the partitioning can be exploited. CompositeInputFormat, for example, uses the partitioning to do joins efficiently on the map side. Another example: if your output is a MapFile, you can use MapFileOutputFormat’s getReaders() method to do lookups on the partitioned output.
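Here is a rough sketch of such a lookup (the output directory, key type, and value type are illustrative; it assumes the job used the default HashPartitioner and wrote Text/IntWritable pairs):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.MapFileOutputFormat;
import org.apache.hadoop.mapred.Partitioner;
import org.apache.hadoop.mapred.lib.HashPartitioner;

public class MapFileLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path outputDir = new Path("output"); // directory containing the part-* MapFiles

    // One reader per partition (part-00000, part-00001, ...).
    MapFile.Reader[] readers = MapFileOutputFormat.getReaders(fs, outputDir, conf);

    // The same partitioner the job used tells us which partition holds the key.
    Partitioner<Text, IntWritable> partitioner = new HashPartitioner<Text, IntWritable>();
    Text key = new Text(args[0]);
    IntWritable value = new IntWritable();

    Writable entry = MapFileOutputFormat.getEntry(readers, partitioner, key, value);
    System.out.println(entry == null ? "not found" : key + "\t" + value);
  }
}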
For small outputs you can merge the partitions into a single file, either by setting the number of reducers to 1 (the default), or by using the handy -getmerge option on the filesystem shell:
% hadoop fs -getmerge hdfs-output-dir local-file
This concatenates the HDFS files hdfs-output-dir/part-* into a single local file.
7. Report progress
If your task reports no progress for 10 minutes (see the mapred.task.timeout property) then it will be killed by Hadoop. Most tasks don’t encounter this situation since they report progress implicitly by reading input and writing output. However, some jobs which don’t process records in this way may fall foul of this behavior and have their tasks killed. Simulations are a good example, since they do a lot of CPU-intensive processing in each map and typically only write the result at the end of the computation. They should be written in such a way as to report progress on a regular basis (more frequently than every 10 minutes). This may be achieved in a number of ways (see the mapper sketch after this list):
- Call setStatus() on Reporter to set a human-readable description of the task’s progress
- Call incrCounter() on Reporter to increment a user counter
- Call progress() on Reporter to tell Hadoop that your task is still there (and making progress)
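For instance, here is a minimal sketch of a CPU-bound mapper (the work loop is a stand-in for a real simulation) that keeps the task alive by reporting progress periodically:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SimulationMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    int iterations = 1000000; // stand-in for a long-running simulation
    for (int i = 0; i < iterations; i++) {
      // ... CPU-intensive work for this step ...
      if (i % 10000 == 0) {
        reporter.progress(); // tell Hadoop the task is still alive
        reporter.setStatus("Completed " + i + " of " + iterations + " steps");
      }
    }
    output.collect(new Text("result"), value); // write the result at the end
  }
}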
8. Debug with status and counters
Using the Reporter’s setStatus() and incrCounter() methods is a simple but effective way to debug your jobs. Counters are often better than printing to standard error since they are aggregated centrally, and allow you to see how many times a condition has occurred.
Status descriptions are shown on the web UI so you can monitor a job and keep an eye on the statuses (as long as all the tasks fit on a single page). You can send extra debugging information to standard error which you can then retrieve through the web UI (click through to the task attempt, and find the stderr file).
You can do more advanced debugging with debug scripts.
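A minimal sketch of the technique (the record format and field count are illustrative): count malformed records with a user counter, expose the last one seen via the task status, and dump the raw record to stderr.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ParseMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  enum Records { MALFORMED }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    String[] fields = value.toString().split("\t");
    if (fields.length < 2) {
      reporter.incrCounter(Records.MALFORMED, 1);              // aggregated across all tasks
      reporter.setStatus("Malformed record at offset " + key); // visible on the web UI
      System.err.println("Malformed record: " + value);        // shows up in the task's stderr
      return;
    }
    output.collect(new Text(fields[0]), new IntWritable(1));
  }
}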
9. Tune at the job level before the task level
Before you start profiling tasks there are a number of job-level checks to run through (a configuration sketch after this list shows several of them):
- Have you set the optimal number of mappers and reducers?
  - The number of mappers is by default set to one per HDFS block. This is usually a good default, but see tip 2.
  - The number of reducers is best set to be the number of reduce slots in the cluster (minus a few to allow for failures). This allows the reducers to complete in a single wave.
- Have you set a combiner (if your algorithm allows it)?
- Have you enabled intermediate compression? (See JobConf.setCompressMapOutput(), or equivalently mapred.compress.map.output.)
- If you are using custom Writables, have you provided a RawComparator?
- Finally, there are a number of low-level MapReduce shuffle parameters that you can tune to get improved performance.
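A rough sketch of wiring several of these settings into a driver (the reducer count and combiner choice are illustrative; LongSumReducer only makes sense if your reduce is a sum over LongWritable values):

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.LongSumReducer;

public class TuningExample {
  public static JobConf configure(JobConf job) {
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);

    job.setNumReduceTasks(28);                  // e.g. number of reduce slots minus a few
    job.setCombinerClass(LongSumReducer.class); // only if the reduce is associative and commutative
    job.setReducerClass(LongSumReducer.class);

    job.setCompressMapOutput(true);             // compress intermediate map output
    job.setMapOutputCompressorClass(DefaultCodec.class);
    return job;
  }
}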
10. Let someone else do the cluster administration
Getting a cluster up and running can be decidedly non-trivial, so use some of the free tools to get started. For example, Cloudera provides an online configuration tool, RPMs, and Debian packages to set up Hadoop on your own hardware, as well as scripts to run on Amazon EC2.
Do you have a MapReduce tip to share? Please let us know in the comments.