Apache Spark之 reduceByKey() 函数 -

Lixh1986

浏览: 1124335 次

最近访客更多访客>>

Z865785437029

akingde

ningzong

蕃薯耀

博主相关

博客

微博

相册

留言

关于我

文章分类

社区版块

存档分类

Apache Spark之 reduceByKey() 函数

博客分类：

Apache Spark

spark reduceByKey

一、背景知识

    RDD ： Resilient Distributed DataSet

           回弹性分布式数据集合

          什么是 Resilient /rɪˈzɪliənt/ ？英文解释是这样的：

          able to recoil or spring back into shape after bending, stretching,
          or being compressed.（
          其形状能够在弯曲、拉伸、被压缩后恢复回原来的模样）

     那么 Resilient
     应用到 RDD 中是什么意思？这需要引入 RDD 操作的另一个概念：Map。
     Map 是什么？ Java 中有 HashMap 集合类（使用Hash算法实现的Map）
     HashMap 的对象是 (Key, Value) 格式的。即：一个键映射一个值。

     这里的 Map 也是这个意思：映射。把一个 RDD 映射为一个新的 RDD，
     通过某种映射规则（函数）。 RDD 映射完成后，还可以恢复到映射前的样子。



二、reduceByKey() 函数

在 Apache Spark 的 Quick Start 页面有这么一行代码，是关于 MapReduce 的概念及用法的。

//
//该行代码用来统计 ReadMe.md 文件中，每个单词出现的次数。
//

scala> val textFile = sc.textFile("README.md")

scala> val wordCounts = textFile
                        .flatMap(line => line.split(" "))
                        .map(word => (word, 1))
                        .reduceByKey((a, b) => a + b)


scala> wordCounts.collect()
//res6: Array[(String, Int)] = Array((means,1), (under,2), (this,3), 
//                       (Because,1), (Python,2), (agree,1), (cluster.,1), ...)
//
//

其中有这么一个函数： reduceByKey((a, b) => a + b)

经过查看 API，现记录如下：

________________________________________________________________________________

- reduceByKey(func, [numTasks])
________________________________________________________________________________

    When called on a dataset of (K, V) pairs,
    returns a dataset of (K, V) pairs where the values for each key
    are aggregated using the given reduce function func, which must
    be of type (V,V) => V.

    Like in groupByKey, the number of reduce tasks is configurable
    through an optional second argument.

------------------------------------------------------------------------------
解释：

    reduceByKey 函数应用于（Key，Value）格式的数据集。
    reduceByKey 函数的作用是把 key 相同的合并。
    reduceByKey 函数同样返回一个（Key，Value）格式的数据集。

    reduceByKey 函数的参数 func 的格式：(V,V) => V.
    即：两个 value 作为参数；必须返回一个 Value 格式的数据。

________________________________________________________________________________

故：reduceByKey((a, b) => a + b) 的完整写法应该是这样的（大概意思）：

function reduceByKey( function(value_a, value_b){
    return value_a + value_b;
});

最后：

关于 Lambda 表达式，请参看我的这篇技术博文：

Java 8 之新特性介绍及 Lambda 表达式

引用：

Apache Spark - Quick Start
http://spark.apache.org/docs/latest/quick-start.html

Apache Spark - Programming Guid: Transformations
http://spark.apache.org/docs/latest/programming-guide.html#transformations

-
转载请注明，
原文出处：http://lixh1986.iteye.com/blog/2345420

-

分享到：