[Notes] Hadoop tutorial - Reducer

 
Quote:
Reducer reduces a set of intermediate values which share a key to a smaller set of values.

Number of reducers
The number of reducers can be set with the following method:
JobConf.setNumReduceTasks(int);

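Alternatively, set the mapred.reduce.tasks parameter; its default value is 1.
The official documentation recommends one of the following formulas: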
  • 0.95 * NUMBER_OF_NODES * mapred.tasktracker.reduce.tasks.maximum
  • 1.75 * NUMBER_OF_NODES * mapred.tasktracker.reduce.tasks.maximum

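With 0.95, all reduce tasks can launch immediately and start transferring map outputs as the map tasks finish. With 1.75, some reduce tasks have to wait for a second wave, which starts on the nodes that finish their first round earliest; this gives better load balancing.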
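The two formulas above can be sketched in plain Java. This is only an illustration: the class `ReducerCount`, its method names, and the example cluster size are hypothetical, and rounding to the nearest integer is an assumption, since the formulas do not say how to round.

```java
// Hypothetical helper; only the two factors and the formula come from the notes.
class ReducerCount {
    static final double SINGLE_WAVE = 0.95; // all reduces fit in one wave
    static final double TWO_WAVES = 1.75;   // faster nodes run a second wave

    // factor * NUMBER_OF_NODES * mapred.tasktracker.reduce.tasks.maximum
    // Assumption: round to the nearest integer.
    static int recommended(double factor, int nodes, int reduceSlotsPerNode) {
        return (int) Math.round(factor * nodes * reduceSlotsPerNode);
    }

    public static void main(String[] args) {
        int nodes = 10;             // made-up cluster size
        int reduceSlotsPerNode = 2; // made-up value of the per-node maximum
        System.out.println(recommended(SINGLE_WAVE, nodes, reduceSlotsPerNode));
        System.out.println(recommended(TWO_WAVES, nodes, reduceSlotsPerNode));
        // In a real job the result would be passed to JobConf.setNumReduceTasks(int).
    }
}
```

The smaller factor keeps the reducer count just under the total number of reduce slots, while the larger one deliberately oversubscribes them so that a second wave can even out the load.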
Quote:
It is legal to set the number of reduce-tasks to zero if no reduction is desired.
In this case the outputs of the map-tasks go directly to the FileSystem, into the output path set by setOutputPath(Path). The framework does not sort the map-outputs before writing them out to the FileSystem.

Work done in the Reducer phases
Quote:
Reducer has 3 primary phases: shuffle, sort and reduce.
  • Shuffle
  • Input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers, via HTTP.
  • Sort
  • The framework groups Reducer inputs by keys (since different mappers may have output the same key) in this stage.
    The shuffle and sort phases occur simultaneously; while map-outputs are being fetched they are merged.
  • Secondary Sort
  • If equivalence rules for grouping the intermediate keys are required to be different from those for grouping keys before reduction, then one may specify a Comparator via JobConf.setOutputValueGroupingComparator(Class). Since JobConf.setOutputKeyComparatorClass(Class) can be used to control how intermediate keys are grouped, these can be used in conjunction to simulate secondary sort on values.
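    Secondary sort is useful when the values of a key must arrive sorted. Sorting all values of a key directly in memory can cause an OOM error. With secondary sort, the key and value are wrapped into a composite key: JobConf.setOutputKeyComparatorClass(Class) controls sorting and JobConf.setOutputValueGroupingComparator(Class) controls grouping. A custom Partitioner is also needed so that all records of a group go to the same reducer. Each reducer writes its own part file, so when the keys are spread over many reducers the output files may need to be concatenated. (My understanding: when a key has few values, sort them directly in the reducer; when a key has very many values, use secondary sort.)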
  • Reduce
  • In this phase the reduce(WritableComparable, Iterator, OutputCollector, Reporter) method is called for each <key, (list of values)> pair in the grouped inputs.
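The secondary sort described above can be simulated in plain Java without any Hadoop classes. This is a minimal sketch, not the framework's implementation: `SecondarySortSketch`, `CompositeKey`, `SORT`, `GROUP`, `partition` and `reduceSide` are hypothetical names, and the in-memory lists exist only to make the simulation observable (a real job streams the already-sorted values through an iterator).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of secondary sort; all names here are hypothetical.
class SecondarySortSketch {

    // Composite key: the natural key plus the value that must be ordered.
    static final class CompositeKey {
        final String naturalKey;
        final int value;

        CompositeKey(String naturalKey, int value) {
            this.naturalKey = naturalKey;
            this.value = value;
        }
    }

    // Plays the role of JobConf.setOutputKeyComparatorClass: full sort order.
    static final Comparator<CompositeKey> SORT =
        Comparator.comparing((CompositeKey k) -> k.naturalKey)
                  .thenComparingInt(k -> k.value);

    // Plays the role of JobConf.setOutputValueGroupingComparator:
    // keys with the same natural key belong to one reduce call.
    static final Comparator<CompositeKey> GROUP =
        Comparator.comparing((CompositeKey k) -> k.naturalKey);

    // Plays the role of a custom Partitioner: use the natural key only,
    // so every record of a group reaches the same reducer.
    static int partition(CompositeKey key, int numReducers) {
        return (key.naturalKey.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Simulates one reducer: sort its records, then walk them once and
    // start a new group whenever GROUP reports a different key.
    static Map<String, List<Integer>> reduceSide(List<CompositeKey> records) {
        List<CompositeKey> sorted = new ArrayList<>(records);
        sorted.sort(SORT);
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        CompositeKey previous = null;
        for (CompositeKey current : sorted) {
            if (previous == null || GROUP.compare(previous, current) != 0) {
                groups.put(current.naturalKey, new ArrayList<>());
            }
            groups.get(current.naturalKey).add(current.value);
            previous = current;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<CompositeKey> records = List.of(
            new CompositeKey("b", 3), new CompositeKey("a", 9),
            new CompositeKey("a", 1), new CompositeKey("b", 2));
        System.out.println(reduceSide(records)); // prints {a=[1, 9], b=[2, 3]}
    }
}
```

Because the sort comparator already orders records by value within each natural key, the reducer never has to sort a group's values itself, which is what avoids the memory problem mentioned in the notes.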