1.overview (WordCount as the running example)
-tips to remember:
Job > Stage > RDD > Dependency; RDDs are linked to each other through Dependencies (see the driver sketch below).
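A minimal WordCount driver matching the lineage these notes walk through; a sketch only: the input path is a placeholder, and the RDD ids ([0]..[4]) assume nothing else was created first.

  import org.apache.spark.{SparkConf, SparkContext}

  object WordCount {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
      val counts = sc.textFile("hdfs:///tmp/input.txt")   // HadoopRDD[0] -> MapPartitionsRDD[1]
        .flatMap(_.split(" "))                            // MapPartitionsRDD[2]
        .map(word => (word, 1))                           // MapPartitionsRDD[3]
        .reduceByKey(_ + _)                               // ShuffledRDD[4], adds a ShuffleDependency
      counts.collect()                                    // action -> one job: ShuffleMapStage + ResultStage
      sc.stop()
    }
  }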
2.terms
-RDDs are associated through Dependencies, i.e. a Dependency is a wrapper around an RDD.
a Stage holds its corresponding RDD, and a Dependency likewise holds its parent RDD (see the sketch below).
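A simplified paraphrase of the Dependency side of this (class and field names follow org.apache.spark.Dependency, but this is not the exact source):

  import org.apache.spark.rdd.RDD

  // every Dependency wraps exactly one parent RDD
  abstract class Dependency[T] extends Serializable {
    def rdd: RDD[T]
  }

  // narrow case: child partition i depends only on parent partition i
  class OneToOneDependency[T](parent: RDD[T]) extends Dependency[T] {
    override def rdd: RDD[T] = parent
  }
  // ShuffleDependency additionally carries the Partitioner, the serializer and a shuffleId.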
-a Stage is a wrapper around a set of tasks that all run the same function; it is also the unit of scheduling used by the DAGScheduler.
-job.numTasks = resultStage.numPartitions. If ‘spark.default.parallelism’ is not set, resultStage.numPartitions = ShuffledRDD.partitions = HadoopRDD's splits,
so resultStage.numPartitions is determined by the partitioner that ShuffledRDD#getDependencies() wraps into the ShuffleDependency (see the sketch below).
TODO check in page
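Roughly how that partition count is chosen: a hedged paraphrase of Partitioner.defaultPartitioner (the real method can also reuse an existing parent partitioner, which is omitted here):

  import org.apache.spark.{HashPartitioner, Partitioner}
  import org.apache.spark.rdd.RDD

  def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
    val largest = (rdd +: others).maxBy(_.partitions.length)
    if (rdd.context.getConf.contains("spark.default.parallelism"))
      new HashPartitioner(rdd.context.defaultParallelism)    // explicit parallelism wins
    else
      new HashPartitioner(largest.partitions.length)         // else follow the widest upstream RDD (the HadoopRDD splits here)
  }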
-ShuffleMapStage.numPartitions = rdd.partitions (i.e. MapPartitionsRDD[3] here); decided the same way as the resultStage's, except that the ‘spark.default.parallelism’ case does not apply.
-the ShuffledRDD ends up in the ResultStage, not in the ShuffleMapStage: the stage boundary sits at the ShuffleDependency, on the parent side of the ShuffledRDD.
critical concepts
-spark.default.parallelism determines how many result tasks are run; see PairRDDFunctions#reduceByKey() (usage sketch below).
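The reduceByKey overloads at the call site; only the first form falls back to the default partitioner and therefore to spark.default.parallelism (reusing sc from the driver sketch above; the input path is a placeholder):

  import org.apache.spark.HashPartitioner

  val pairs = sc.textFile("hdfs:///tmp/input.txt").flatMap(_.split(" ")).map(w => (w, 1))
  pairs.reduceByKey(_ + _)                           // partitions come from defaultPartitioner
  pairs.reduceByKey(_ + _, 4)                        // explicit numPartitions
  pairs.reduceByKey(new HashPartitioner(4), _ + _)   // explicit Partitioner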
-each stage has exactly one directly corresponding RDD (the last RDD inside the stage).
-spark.cores.max: in standalone mode, this property caps the total number of cores the application can use across the cluster (config sketch below).
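A config sketch for the standalone case (master URL and numbers are placeholders):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setMaster("spark://master:7077")       // standalone master, placeholder host
    .setAppName("core-capped-app")
    .set("spark.cores.max", "8")            // at most 8 cores in total across the cluster
    .set("spark.executor.cores", "2")       // optional: cores per executor, so at most 4 executors here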
-does ShuffleDependency.shuffleId identify exactly one ShuffleMapStage? yes, i.e. ShuffleDependency : ShuffleMapStage = 1:1;
see shuffleToMapStage(shuffleDep.shuffleId) = stage in the DAGScheduler (a toy model of this bookkeeping follows).
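A toy model of that bookkeeping (not the DAGScheduler source; the class names and counters below are illustrative): one stage is created and cached per shuffleId, which is what makes the mapping 1:1.

  import scala.collection.mutable

  case class ShuffleDep(shuffleId: Int)            // stand-in for ShuffleDependency
  case class MapStage(id: Int, shuffleId: Int)     // stand-in for ShuffleMapStage

  val shuffleToMapStage = mutable.HashMap[Int, MapStage]()
  var nextStageId = 0

  def getShuffleMapStage(dep: ShuffleDep): MapStage =
    shuffleToMapStage.getOrElseUpdate(dep.shuffleId, {
      val stage = MapStage(nextStageId, dep.shuffleId)   // created once per shuffleId, then reused
      nextStageId += 1
      stage
    })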