[Spark 43] Logical Execution Graphs of RDD Operators, Part 3

bit1129


Blog category: Spark

1. intersection

1. Sample code

package spark.examples

import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.SparkContext._

object SparkRDDIntersection {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SparkRDDIntersection").setMaster("local")
    val sc = new SparkContext(conf)
    val rdd1 = sc.parallelize(List(1,8,2,1,4,2,7,6,2,3,3,1), 3) // 3 partitions
    val rdd2 = sc.parallelize(List(1,8,7,9,6,2,1), 2)           // 2 partitions
    val pairs = rdd1.intersection(rdd2)

    // Write the result to a fresh local directory, one part file per partition
    pairs.saveAsTextFile("file:///D:/intersection" + System.currentTimeMillis())

    // Print the RDD lineage
    println(pairs.toDebugString)
  }

}

1.1 RDD dependencies (output of toDebugString):

(3) MappedRDD[7] at intersection at SparkRDDIntersection.scala:13 []
 |  FilteredRDD[6] at intersection at SparkRDDIntersection.scala:13 []
 |  MappedValuesRDD[5] at intersection at SparkRDDIntersection.scala:13 []
 |  CoGroupedRDD[4] at intersection at SparkRDDIntersection.scala:13 []
 +-(3) MappedRDD[2] at intersection at SparkRDDIntersection.scala:13 []
 |  |  ParallelCollectionRDD[0] at parallelize at SparkRDDIntersection.scala:11 []
 +-(2) MappedRDD[3] at intersection at SparkRDDIntersection.scala:13 []
    |  ParallelCollectionRDD[1] at parallelize at SparkRDDIntersection.scala:12 []

1.2 Result:

part-00000: 6

part-00001: 1 7

part-00002: 8 2
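
The placement of elements in the part files above can be reproduced with a small sketch. It assumes that the shuffle inside intersection uses Spark's default HashPartitioner with 3 partitions (the larger of the two input partition counts), which maps a key to the non-negative remainder of its hash code divided by the partition count. The object and method names below are hypothetical and are not part of Spark.

```scala
object PartitionSketch {
  // Non-negative modulo, so negative hash codes still map into [0, numPartitions).
  def partitionFor(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  def main(args: Array[String]): Unit = {
    // The elements of the intersection computed in the sample program.
    val result = Seq(1, 2, 6, 7, 8)
    // Group the elements by the partition they are assumed to hash to.
    result.groupBy(partitionFor(_, 3)).toSeq.sortBy(_._1).foreach { case (p, ks) =>
      println(s"partition $p: ${ks.mkString(" ")}")
    }
  }
}
```

Under this assumption, 6 goes to partition 0, 1 and 7 go to partition 1, and 2 and 8 go to partition 2, which is the same assignment of elements to part files as in the result above (the order within a file aside).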

2. RDD dependency graph

[Figure: RDD dependency graph of intersection; image not available]

3. Source code of intersection

  /**
   * Return the intersection of this RDD and another one. The output will not contain any duplicate
   * elements, even if the input RDDs did.
   *
   * Note that this method performs a shuffle internally.
   */
  def intersection(other: RDD[T]): RDD[T] = {
    this.map(v => (v, null)).cogroup(other.map(v => (v, null)))
        .filter { case (_, (leftGroup, rightGroup)) => leftGroup.nonEmpty && rightGroup.nonEmpty }
        .keys
  }

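3.1 The intersection operator is built on cogroup: for each key, the values from each input RDD are first gathered into one collection per RDD, and the result is then filtered to keep only the keys for which both collections are non-empty.

3.2 Duplicate elements in the input RDDs do not appear in the result: cogroup produces exactly one record per key, and only the key is kept.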
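
The two points above can be illustrated without a cluster. The following is a minimal sketch on plain Scala collections that mirrors the map, cogroup, filter, keys pipeline of the source code. IntersectionSketch and its local cogroup are hypothetical stand-ins written for illustration; they are not Spark APIs, and they ignore partitioning and shuffling entirely.

```scala
object IntersectionSketch {
  // Local stand-in for cogroup: for every key seen on either side, collect
  // that key's values from the left input and from the right input.
  def cogroup[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Map[K, (Seq[V], Seq[W])] = {
    val keys = (left.map(_._1) ++ right.map(_._1)).distinct
    keys.map { k =>
      val ls = left.filter(_._1 == k).map(_._2)
      val rs = right.filter(_._1 == k).map(_._2)
      (k, (ls, rs))
    }.toMap
  }

  // Same shape as the Spark source: pair each element with null, cogroup,
  // keep the keys present on both sides, then drop the values.
  def intersection[T](left: Seq[T], right: Seq[T]): Set[T] =
    cogroup(left.map(v => (v, null)), right.map(v => (v, null)))
      .filter { case (_, (l, r)) => l.nonEmpty && r.nonEmpty }
      .keys
      .toSet

  def main(args: Array[String]): Unit = {
    val rdd1 = Seq(1, 8, 2, 1, 4, 2, 7, 6, 2, 3, 3, 1)
    val rdd2 = Seq(1, 8, 7, 9, 6, 2, 1)
    // Each common element appears once, however often it occurs in the inputs.
    println(intersection(rdd1, rdd2).toSeq.sorted.mkString(" "))
  }
}
```

Because every key yields a single cogroup record, the duplicates of 1 and 2 in the inputs collapse to one element each, matching the five elements in the result files of the sample program.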

2. join

1. Sample code:

package spark.examples

import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.SparkContext._

object SparkRDDJoin {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SparkRDDJoin").setMaster("local")
    val sc = new SparkContext(conf)

    // The first argument is the collection, the second is the number of partitions
    val rdd1 = sc.parallelize(List((1,2),(2,3), (3,4),(4,5),(5,6)), 3)
    val rdd2 = sc.parallelize(List((3,6),(2,8)), 2)

    // join requires RDDs whose elements are key/value pairs
    val pairs = rdd1.join(rdd2)

    // Write the result to a fresh local directory, one part file per partition
    pairs.saveAsTextFile("file:///D:/join" + System.currentTimeMillis())

    // Print the RDD lineage
    println(pairs.toDebugString)
  }

}

1.1 RDD dependencies (output of toDebugString):

(3) FlatMappedValuesRDD[4] at join at SparkRDDJoin.scala:17 []
 |  MappedValuesRDD[3] at join at SparkRDDJoin.scala:17 []
 |  CoGroupedRDD[2] at join at SparkRDDJoin.scala:17 []
 +-(3) ParallelCollectionRDD[0] at parallelize at SparkRDDJoin.scala:13 []
 +-(2) ParallelCollectionRDD[1] at parallelize at SparkRDDJoin.scala:14 []

1.2 Result:

part-00000: (3,(4,6))

part-00001: (empty)

part-00002: (2,(3,8))

2. RDD dependency graph

[Figure: RDD dependency graph of join; image not available]

3. Source code of join

  /**
   * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
   * pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
   * (k, v2) is in `other`. Uses the given Partitioner to partition the output RDD.
   */
  def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = {
    this.cogroup(other, partitioner).flatMapValues( pair =>
      for (v <- pair._1; w <- pair._2) yield (v, w)
    )
  }

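3.1 The source code confirms that the process depicted in the figure is correct: for a given key K, if RDD1 contains m pairs (K, V) and RDD2 contains n pairs (K, V'), the result contains m*n pairs (K, (V, V')).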
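
The m*n behaviour can be checked with a minimal sketch on plain Scala collections that follows the same two steps as the source code: group the values of both inputs by key, then emit every combination of a left value and a right value. JoinSketch is a hypothetical name used for illustration only; it is not a Spark API, and it ignores partitioning.

```scala
object JoinSketch {
  def join[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] = {
    val keys = (left.map(_._1) ++ right.map(_._1)).distinct
    // Step 1 (cogroup): per key, the values from the left and from the right.
    val cogrouped: Seq[(K, (Seq[V], Seq[W]))] = keys.map { k =>
      (k, (left.filter(_._1 == k).map(_._2), right.filter(_._1 == k).map(_._2)))
    }
    // Step 2 (flatMapValues): all (v, w) combinations per key. A key with an
    // empty side contributes nothing.
    cogrouped.flatMap { case (k, (vs, ws)) =>
      for (v <- vs; w <- ws) yield (k, (v, w))
    }
  }

  def main(args: Array[String]): Unit = {
    // Key "a" has m = 2 values on the left and n = 3 values on the right.
    val left = Seq(("a", 1), ("a", 2), ("b", 3))
    val right = Seq(("a", "x"), ("a", "y"), ("a", "z"), ("c", "w"))
    join(left, right).foreach(println)
  }
}
```

In this example key "a" contributes 2 * 3 = 6 pairs, while "b" and "c" contribute none because each is missing from one side. Applied to the data of the sample program, the sketch yields (3,(4,6)) and (2,(3,8)), the same pairs as in the result files.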


2015-02-06 14:37