1.overview (WordCount as the running example)
-tips to remember:
Job > Stage > RDD > Dependency; RDDs are linked to each other through Dependencies (see the driver sketch below).
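A minimal WordCount driver matching the lineage these notes walk through; a sketch only: the input path is a placeholder, and the RDD ids ([0]..[4]) assume nothing else was created first.

  import org.apache.spark.{SparkConf, SparkContext}

  object WordCount {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
      val counts = sc.textFile("hdfs:///tmp/input.txt")   // HadoopRDD[0] -> MapPartitionsRDD[1]
        .flatMap(_.split(" "))                            // MapPartitionsRDD[2]
        .map(word => (word, 1))                           // MapPartitionsRDD[3]
        .reduceByKey(_ + _)                               // ShuffledRDD[4], adds a ShuffleDependency
      counts.collect()                                    // action -> one job: ShuffleMapStage + ResultStage
      sc.stop()
    }
  }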
2.terms
-RDDs are associated through Dependencies, i.e. a Dependency is a wrapper around an RDD.
a Stage holds its corresponding RDD, and a Dependency likewise holds its parent RDD (see the sketch below).
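A simplified paraphrase of the Dependency side of this (class and field names follow org.apache.spark.Dependency, but this is not the exact source):

  import org.apache.spark.rdd.RDD

  // every Dependency wraps exactly one parent RDD
  abstract class Dependency[T] extends Serializable {
    def rdd: RDD[T]
  }

  // narrow case: child partition i depends only on parent partition i
  class OneToOneDependency[T](parent: RDD[T]) extends Dependency[T] {
    override def rdd: RDD[T] = parent
  }
  // ShuffleDependency additionally carries the Partitioner, the serializer and a shuffleId.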
-a Stage is a wrapper around a set of tasks that all run the same function; it is also the unit of scheduling used by the DAGScheduler.
-job.numTasks = resultStage.numPartitions. If ‘spark.default.parallelism’ is not set, resultStage.numPartitions = ShuffledRDD.partitions = HadoopRDD's splits,
so resultStage.numPartitions is determined by the partitioner that ShuffledRDD#getDependencies() wraps into the ShuffleDependency (see the sketch below).
TODO check in page
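Roughly how that partition count is chosen: a hedged paraphrase of Partitioner.defaultPartitioner (the real method can also reuse an existing parent partitioner, which is omitted here):

  import org.apache.spark.{HashPartitioner, Partitioner}
  import org.apache.spark.rdd.RDD

  def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
    val largest = (rdd +: others).maxBy(_.partitions.length)
    if (rdd.context.getConf.contains("spark.default.parallelism"))
      new HashPartitioner(rdd.context.defaultParallelism)    // explicit parallelism wins
    else
      new HashPartitioner(largest.partitions.length)         // else follow the widest upstream RDD (the HadoopRDD splits here)
  }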
-ShuffleMapStage.numPartitions = rdd.partitions (i.e. MapPartitionsRDD[3] here); decided the same way as the resultStage's, except that the ‘spark.default.parallelism’ case does not apply.
-the ShuffledRDD ends up in the ResultStage, not in the ShuffleMapStage: the stage boundary sits at the ShuffleDependency, on the parent side of the ShuffledRDD.
critical concepts
-spark.default.parallelism determines how many result tasks are run; see PairRDDFunctions#reduceByKey() (usage sketch below).
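The reduceByKey overloads at the call site; only the first form falls back to the default partitioner and therefore to spark.default.parallelism (reusing sc from the driver sketch above; the input path is a placeholder):

  import org.apache.spark.HashPartitioner

  val pairs = sc.textFile("hdfs:///tmp/input.txt").flatMap(_.split(" ")).map(w => (w, 1))
  pairs.reduceByKey(_ + _)                           // partitions come from defaultPartitioner
  pairs.reduceByKey(_ + _, 4)                        // explicit numPartitions
  pairs.reduceByKey(new HashPartitioner(4), _ + _)   // explicit Partitioner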
-each stage has exactly one directly corresponding RDD (the last RDD inside the stage).
-spark.cores.max: in standalone mode, this property caps the total number of cores the application can use across the cluster (config sketch below).
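A config sketch for the standalone case (master URL and numbers are placeholders):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setMaster("spark://master:7077")       // standalone master, placeholder host
    .setAppName("core-capped-app")
    .set("spark.cores.max", "8")            // at most 8 cores in total across the cluster
    .set("spark.executor.cores", "2")       // optional: cores per executor, so at most 4 executors here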
-does ShuffleDependency.shuffleId identify exactly one ShuffleMapStage? yes, i.e. ShuffleDependency : ShuffleMapStage = 1:1;
see shuffleToMapStage(shuffleDep.shuffleId) = stage in the DAGScheduler (a toy model of this bookkeeping follows).
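A toy model of that bookkeeping (not the DAGScheduler source; the class names and counters below are illustrative): one stage is created and cached per shuffleId, which is what makes the mapping 1:1.

  import scala.collection.mutable

  case class ShuffleDep(shuffleId: Int)            // stand-in for ShuffleDependency
  case class MapStage(id: Int, shuffleId: Int)     // stand-in for ShuffleMapStage

  val shuffleToMapStage = mutable.HashMap[Int, MapStage]()
  var nextStageId = 0

  def getShuffleMapStage(dep: ShuffleDep): MapStage =
    shuffleToMapStage.getOrElseUpdate(dep.shuffleId, {
      val stage = MapStage(nextStageId, dep.shuffleId)   // created once per shuffleId, then reused
      nextStageId += 1
      stage
    })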