[spark-src-core] 6. checkpoint in spark

leibnitz

Category: spark

As in other big data technologies, checkpointing is a well-known way to keep a snapshot of data in order to speed up failover: the job restores the most recent checkpointed state, so you do not need to recompute the RDD from the start of its lineage.

In fact, the checkpoint operation cuts off the RDD's references to all of its parent RDDs, so the checkpointed RDD becomes the end of the lineage chain. To achieve this, the current RDD is made to derive from a CheckpointRDD, which reads the saved data. Moreover, RDDCheckpointData is a wrapper that holds the checkpoint state of an RDD together with its CheckpointRDD.
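One way to observe the lineage being cut is to print the RDD's lineage before and after a checkpoint. The following is a minimal sketch, not taken from the original post: the object name, the local master setting, and the checkpoint path are assumptions chosen for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object; names and paths are illustrative only.
object LineageTruncationSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("lineage-truncation-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Assumed local directory; on a cluster this would normally be an HDFS path.
      sc.setCheckpointDir("/tmp/spark-checkpoint-sketch")

      val rdd = sc.parallelize(1 to 10).map(_ + 1).filter(_ % 2 == 0)

      // Lineage before checkpointing: the full chain of parent RDDs.
      println(rdd.toDebugString)

      rdd.checkpoint() // only marks the RDD
      rdd.count()      // an action triggers the actual checkpoint write

      // Lineage after checkpointing: the parents should be replaced by a checkpoint RDD.
      println(rdd.toDebugString)
      rdd.dependencies.foreach(d => println(d.rdd))
    } finally {
      sc.stop()
    }
  }
}
```

The exact text printed by toDebugString differs between Spark versions, so compare the two outputs rather than relying on a specific format.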

1. How to

In Spark (1.4.1), checkpointing is done in the following steps:

a. Set the checkpoint directory with SparkContext.setCheckpointDir(xx)
b. Mark the RDD whose state should be snapshotted: rdd.checkpoint()
c. The real checkpoint operation is performed at the end of a job (by default)
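The three steps can be sketched as a small driver program. This is an illustrative sketch under stated assumptions rather than code from the post: the object name, the local master, and the /tmp path are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object; names and paths are illustrative only.
object CheckpointStepsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("checkpoint-steps-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Step a: set the checkpoint directory (assumed local path).
      sc.setCheckpointDir("/tmp/spark-checkpoint-sketch")

      val rdd = sc.parallelize(1 to 100).map(_ * 2)

      // Step b: mark the RDD for checkpointing; nothing is written yet.
      rdd.checkpoint()
      println(s"after checkpoint(): isCheckpointed = ${rdd.isCheckpointed}")

      // Step c: the first action runs the job; the checkpoint is written when the job finishes.
      rdd.count()
      println(s"after count(): isCheckpointed = ${rdd.isCheckpointed}")
      println(s"checkpoint file: ${rdd.getCheckpointFile}")
    } finally {
      sc.stop()
    }
  }
}
```

Calling checkpoint() before setCheckpointDir() throws a SparkException, as the source quoted in this post shows.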

Now let's look at the steps in more detail.

Step b is implemented by the following code:

  /**
   * Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint
   * directory set with SparkContext.setCheckpointDir() and all references to its parent
   * RDDs will be removed. This function must be called before any job has been
   * executed on this RDD. It is strongly recommended that this RDD is persisted in
   * memory, otherwise saving it on a file will require recomputation.
   * (blog author's note: compare RDDCheckpointData#doCheckpoint())
   */
  def checkpoint() {
    if (context.checkpointDir.isEmpty) {
      throw new SparkException("Checkpoint directory has not been set in the SparkContext")
    } else if (checkpointData.isEmpty) {
      checkpointData = Some(new RDDCheckpointData(this))
      checkpointData.get.markForCheckpoint()
    }
  }

The comment raises a question: why is it necessary to persist the RDD, and why in memory?

Diving into the source, we see that the checkpoint operation is really a separate job that runs one more time on this RDD to save the result to a file, so the RDD is computed a second time if it has not been persisted.

On the other hand, why is it recommended to keep this RDD in memory rather than on disk? In practice there is little difference between data persisted in memory and data persisted to a file (perhaps only the data format), so I think the author of the comment is emphasizing the act of persisting, not where to persist.
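The extra computation can be made visible with a side effect inside the transformation. The sketch below is an assumption-laden illustration, not code from the post: it relies on local mode so that println output from tasks appears on the driver console, and the object name, path, and storage level are hypothetical choices.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical demo object; names and paths are illustrative only.
object PersistBeforeCheckpointSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("persist-before-checkpoint-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      sc.setCheckpointDir("/tmp/spark-checkpoint-sketch") // assumed local path

      // The println stands in for an expensive computation.
      val rdd = sc.parallelize(1 to 5, 1).map { x =>
        println(s"computing $x")
        x * 2
      }

      // Persist first so the checkpoint job can read cached partitions
      // instead of recomputing them. Comment this line out to compare:
      // each "computing" line should then appear twice.
      rdd.persist(StorageLevel.MEMORY_AND_DISK)

      rdd.checkpoint()
      rdd.count() // runs the job, then the separate checkpoint job

      // Once the checkpoint is written, the cached copy is no longer needed.
      rdd.unpersist()
    } finally {
      sc.stop()
    }
  }
}
```

MEMORY_AND_DISK is used here only to show that the storage level is a free choice, which matches the point above that persisting matters more than where the data is persisted.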

2. FAQ

a. How to use a checkpoint to restore data

StreamingContext has a function named getOrCreate(...) that takes a previously configured checkpoint directory. If snapshot data exists in that directory, it is read back in; otherwise a new context is created.
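A typical use of getOrCreate looks roughly like the sketch below. It is a minimal illustration under assumptions: the object name, the checkpoint path, the batch interval, and the socket source on localhost:9999 are all hypothetical and not taken from the post.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical demo object; names, paths, and the input source are illustrative only.
object StreamingRecoverySketch {
  val checkpointDir = "/tmp/streaming-checkpoint-sketch" // assumed path

  // Called only when no usable checkpoint exists in checkpointDir.
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("streaming-recovery-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint(checkpointDir) // enable checkpointing for this context

    // Assumed text source; any DStream definition would go here.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.count().print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Rebuild the context from checkpoint data if present, else create a fresh one.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```

All DStream setup belongs inside the creating function, because on recovery the graph is restored from the checkpoint and that function is not called.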

b. Why not save the computed results when the RDD is run the first time?

Hmm... no doubt, the checkpoint operation really is a second, identical job run on this RDD. So why not save the results to a file at the same time as the first run?

First, every runJob(..) variant takes only a single anonymous function, so nothing besides the user function can be passed in.

Second, keeping the user function separate from the checkpoint save operation makes both easier to debug and maintain.

2016-10-19 17:14