http://hadoop.apache.org/common/docs/r1.0.3/mapred_tutorial.html
Mapper
The mapper's job is to turn input key-value pairs into intermediate key-value pairs, which become the reducer's input.
The MapReduce framework spawns one map task for each InputSplit, and all the map tasks run fully in parallel.
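As a minimal sketch in the old mapred API that this tutorial covers (the word-count logic and class names are illustrative, not part of the notes above):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Word-count-style mapper: turns each input line into (word, 1) pairs.
public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, ONE); // emit an intermediate key-value pair
        }
    }
}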
Before the mapper's output is sent to the reducers, it sometimes goes through a combine step. This runs on the node where the mapper ran, and its purpose is to cut down the amount of data shipped to the reducer nodes.
You can specify your own combiner: JobConf.setCombinerClass(Class)
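A hedged driver sketch showing where the combiner is wired in (WordCountDriver and WordCountReducer are hypothetical names; a reducer can double as the combiner only when its function is commutative and associative, e.g. summing):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountDriver.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(WordCountMapper.class);
        // Hypothetical sum reducer reused as combiner: pre-aggregates
        // (word, 1) pairs locally on the map node before the shuffle.
        conf.setCombinerClass(WordCountReducer.class);
        conf.setReducerClass(WordCountReducer.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}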
The key-value pairs produced by the mappers are grouped by key, and different groups are dispatched to different reducers. You can supply your own comparator for this grouping via JobConf.setOutputKeyComparatorClass(Class).
Each reducer corresponds to one partition. The MapReduce framework ships with a simple default rule that partitions the mappers' output by key and dispatches each partition to the corresponding reducer. You can also plug in your own partitioner by implementing the Partitioner interface.
HashPartitioner is the default Partitioner.
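For illustration, a custom Partitioner under the old mapred API might look like the sketch below (the first-character routing rule is made up for this example):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Routes each key by its first character, so keys sharing an initial
// character always land on the same reducer.
public class FirstCharPartitioner implements Partitioner<Text, IntWritable> {

    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.getLength() == 0) {
            return 0;
        }
        // The mask keeps the index non-negative, as HashPartitioner does.
        return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
    }

    public void configure(JobConf job) {
        // no configuration needed for this example
    }
}

It would be registered with conf.setPartitionerClass(FirstCharPartitioner.class).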
How many mappers do we need?
It depends on the size of the input files and the block size; as a rule, each block gets one mapper.
num of mappers = filesize / blocksize
In general, around 10-100 map tasks per node is the right level of parallelism.
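As a hypothetical worked example: 10 GB of input with a 64 MB block size gives 10240 MB / 64 MB = 160 blocks, hence roughly 160 map tasks.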
Reducer
The reducer's job is to turn each key-list<value> group into key-value pairs and write the results to HDFS.
The three phases of a reducer's lifecycle:
shuffle: fetch the data belonging to this partition from all the mappers over HTTP.
sort: each mapper's output is already sorted, but since the data comes from different mappers, all the values belonging to the same key still have to be merged together.
The two phases above run concurrently.
secondary sort: if you specify your own rule for how data is grouped by key, a secondary sort is performed (see the comparator sketch after this list):
JobConf.setOutputValueGroupingComparator(Class)
JobConf.setOutputKeyComparatorClass(Class)
reduce: the output is not sorted (as far as I recall, it may be partially sorted).
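A hedged sketch of what the grouping comparator could look like, assuming a made-up composite Text key of the form "userId#timestamp", where the full key drives the sort order but only the userId prefix drives the grouping:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Groups composite keys by their natural-key prefix (before the '#'),
// so all values for one userId reach a single reduce() call even though
// the full keys differ by timestamp.
public class NaturalKeyGroupingComparator extends WritableComparator {

    public NaturalKeyGroupingComparator() {
        super(Text.class, true); // true => instantiate keys for comparing
    }

    public int compare(WritableComparable a, WritableComparable b) {
        String left = a.toString().split("#", 2)[0];
        String right = b.toString().split("#", 2)[0];
        return left.compareTo(right);
    }
}

Wired up with the two calls from the list above:

conf.setOutputKeyComparatorClass(Text.Comparator.class);                   // sort by full key
conf.setOutputValueGroupingComparator(NaturalKeyGroupingComparator.class); // group by prefix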
Number of reducers:
0.95 or 1.75 * (<no. of nodes> * mapred.tasktracker.reduce.tasks.maximum).
With 0.95, all the reducers can start working right away.
(tag: our cluster presumably needs a controller that coordinates all jobs)
With 1.75, the fastest reducers finish their first round of work and can move on to a second round, which balances the load better. (preferred)
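A small sketch of how a driver could apply this rule of thumb (the node count of 10 is an assumed value; 2 is the stock default for mapred.tasktracker.reduce.tasks.maximum in Hadoop 1.x):

import org.apache.hadoop.mapred.JobConf;

public class ReducerCountExample {

    // factor 0.95: every reducer launches as soon as map output is ready;
    // factor 1.75: fast nodes finish a first wave and start a second one.
    static int reducersFor(JobConf conf, int nodes, double factor) {
        int slotsPerNode = conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2);
        return (int) (factor * nodes * slotsPerNode);
    }

    public static void main(String[] args) {
        JobConf conf = new JobConf(ReducerCountExample.class);
        conf.setNumReduceTasks(reducersFor(conf, 10, 1.75)); // 10 nodes assumed
    }
}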