From [1] we know that Spark executes a job in two steps: a. launch executors, and b. the driver assigns tasks to those executors. So how the master assigns executors to workers is very important!
For standalone mode, when you dive into the source in Master#receiveWithLogging() for case RequestSubmitDriver, you will figure it out.
1.what happens in the code
As you step further, you will see the details in the code path below:
  /**
   * Schedule executors to be launched on the workers. (Compared with Hadoop's MR slots, Spark's
   * executor allocation is smarter: it works from the app's total core/memory demand across the
   * whole cluster, and workers with more free resources receive more executors; it is clearly not
   * driven by the number of input splits as in Hadoop.) Note: this does not clear out the assigned app.
   *
   * vip ==> spread out purpose:
   * There are two modes of launching executors. The first attempts to spread out an application's
   * executors on as many workers as possible, while the second does the opposite (i.e. launch them
   * on as few workers as possible). The former is usually better for data locality purposes and is
   * the default. <==
   *
   * The number of cores assigned to each executor is configurable. When this is explicitly set,
   * multiple executors from the same application may be launched on the same worker if the worker
   * has enough cores and memory. Otherwise, each executor grabs all the cores available on the
   * worker by default, in which case only one executor may be launched on each worker.
   */
  private def startExecutorsOnWorkers(): Unit = {
    // Right now this is a very simple <<FIFO scheduler>>. We keep trying to fit in the first app
    // in the queue, then the second app, etc.
    if (spreadOutApps) { //- the meaning of 'spread out' is realized by the while() loop below
      // Try to spread out each app among all the workers, until it has all its cores
      for (app <- waitingApps if app.coresLeft > 0) { //- depth deferred
        //-1 keep only workers whose free memory and free cores can satisfy at least one
        //  executor's demand; sort in reverse order so the workers with the most free cores
        //  come first, to balance worker load
        val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
          .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB &&
            worker.coresFree >= app.desc.coresPerExecutor.getOrElse(1))
          .sortBy(_.coresFree).reverse
        //-2 balance: spread the requested cores across the cluster as far as possible
        val numUsable = usableWorkers.length
        val assigned = new Array[Int](numUsable) // Number of cores to give on each node
        //- if app.coresLeft > sum of the workers' free cores, the remainder is assigned in a
        //  later round (more than one executor may then land on one worker); app.coresLeft can
        //  be thought of as spark.cores.max per app
        var toAssign = math.min(app.coresLeft, usableWorkers.map(_.coresFree).sum)
        var pos = 0
        //- spread the target cores (spark.cores.max) across the cluster for load balance
        while (toAssign > 0) { //- coresFree is not changed here, so a worker's cores may be fully used up
          if (usableWorkers(pos).coresFree - assigned(pos) > 0) { //- no check that the share is a multiple of one executor's cores
            toAssign -= 1
            assigned(pos) += 1
          }
          pos = (pos + 1) % numUsable
        }
        //-3 Now that we've decided how many cores to give on each node, let's actually give them.
        //  Executors are created strictly within each worker's own cores and memory; if a worker's
        //  share is not enough for one executor, nothing is allocated on it.
        for (pos <- 0 until numUsable if assigned(pos) > 0) { //- breadth-first (horizontal)
          //- worker free mem (mainly), worker.coresFree and app.coresLeft are all decreased below
          allocateWorkerResourceToExecutors(app, assigned(pos), usableWorkers(pos))
        }
      }
    } else {
      //- e.g. spark.deploy.spreadOut=false once launched 25 executors (i.e. 50 cores, the same as specified)
      // Pack each app into as few workers as possible until we've assigned all its cores.
      //- allocate one worker at a time; if it is not enough, move on to the next worker.
      //- worker.coresFree is decreased inside allocateWorkerResourceToExecutors(). Note that with
      //  spreadOut=false, only memory is really checked when placing executors on a worker, and
      //  cores are only accounted for in the next round, so the executors on a single worker may
      //  occupy more cores than that worker's quota.
      for (worker <- workers if worker.coresFree > 0 && worker.state == WorkerState.ALIVE) { //- breadth deferred
        for (app <- waitingApps if app.coresLeft > 0) { //- depth-first (vertical): keep filling one worker
          allocateWorkerResourceToExecutors(app, app.coresLeft, worker)
        }
      }
    }
  }

  /**
   * Allocate a worker's resources to one or more executors (i.e. several executors may run on the
   * same worker), sized by worker.memoryFree and the coresToAllocate argument (core and mem per
   * executor as the unit).
   * For spreadOut=true, cores are allocated strictly against the worker's real free resources;
   * for spreadOut=false, only memory is considered here, and cores are only accounted for in the
   * next scheduling round, see startExecutorsOnWorkers().
   * @param app the info of the application which the executors belong to
   * @param coresToAllocate cores on this worker to be allocated to this application (total cores
   *                        to be assigned to this worker)
   * @param worker the worker info
   */
  private def allocateWorkerResourceToExecutors(
      app: ApplicationInfo,
      coresToAllocate: Int,
      worker: WorkerInfo): Unit = {
    val memoryPerExecutor = app.desc.memoryPerExecutorMB
    val coresPerExecutor = app.desc.coresPerExecutor.getOrElse(coresToAllocate)
    var coresLeft = coresToAllocate
    //- stop as soon as either the cores or the memory condition fails
    while (coresLeft >= coresPerExecutor && worker.memoryFree >= memoryPerExecutor) {
      val exec = app.addExecutor(worker, coresPerExecutor) //- decreases app.coresGranted, i.e. coresLeft
      coresLeft -= coresPerExecutor
      launchExecutor(worker, exec) //- decreases the worker's free cores and memory
      app.state = ApplicationState.RUNNING
    }
  }
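To make the two-phase spread-out logic concrete, here is a minimal Python sketch of it (round-robin core assignment, then per-worker executor creation). The function names and the simplified worker model are illustrative, not Spark's actual API:

```python
def spread_out_cores(cores_left, workers_free_cores):
    """Phase 1: round-robin one core at a time across the usable workers."""
    n = len(workers_free_cores)
    assigned = [0] * n
    to_assign = min(cores_left, sum(workers_free_cores))
    pos = 0
    while to_assign > 0:
        if workers_free_cores[pos] - assigned[pos] > 0:
            to_assign -= 1
            assigned[pos] += 1
        pos = (pos + 1) % n
    return assigned

def executors_on_worker(cores_to_allocate, mem_free, cores_per_exec, mem_per_exec):
    """Phase 2: carve executors out of one worker's share, stopping as soon as
    either the cores or the memory condition fails."""
    launched = 0
    while cores_to_allocate >= cores_per_exec and mem_free >= mem_per_exec:
        cores_to_allocate -= cores_per_exec
        mem_free -= mem_per_exec
        launched += 1
    return launched

# 4 workers with 16 free cores / 16g each; app asks for 10 cores, 2 cores + 2g per executor
assigned = spread_out_cores(10, [16, 16, 16, 16])
print(assigned)  # [3, 3, 2, 2]
total = sum(executors_on_worker(a, 16, 2, 2) for a in assigned)
print(total)     # 4 -- only 8 of the 10 granted cores become executors
```

Note how each odd per-worker share strands one core: the app is granted 10 cores but only 4 executors (8 cores) actually launch. This is exactly the waste discussed in the next section.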
Its meaning is illustrated by the figure below:
2.how a computation bug arises
The annotation quoted from the Spark source says that 'spark.cores.max' is the number of cores to be allocated to one app, as many as possible. That means there is a computation bug in Spark (with spreadOut=true):
| case | spark.cores.max | #workers | #worker cores | #worker mem | coresPerExecutor | memPerExecutor | result |
|------|-----------------|----------|---------------|-------------|------------------|----------------|--------|
| 1 | 10 | 10 | 16 | 16g | 2  | 2g  | failed: no executors allocated, i.e. 10/10 = 1 core per worker < 2 coresPerExecutor |
| 2 | 20 | 10 | 16 | 16g | 2  | 2g  | 10 executors allocated in one wave: a. 20/10 = 2 >= coresPerExecutor, 2/2 = 1 executor per worker; b. 16 cores >= 2x1; c. 16g >= 2g x 1 |
| 3 | 40 | 10 | 16 | 16g | 2  | 2g  | 20 executors in one wave |
| 4 | 40 | 10 | 16 | 16g | 2  | 16g | 10 executors in one wave, 10 in another wave, 20 in total |
| 5 | 40 | 10 | 16 | 16g | 16 | 2g  | similar to the above |
| 6 | 40 | 10 | 16 | 16g | 20 | 2g  | failed: #worker cores < 20 |
| 7 | 40 | 10 | 16 | 16g | 2  | 20g | failed: #worker mem < 20g |
| 8 | 15 | 10 | 16 | 16g | 2  | 2g  | only 5 executors allocated: 15/10 leaves 5 workers with 2 cores and 5 with 1, so only 10 of the 15 cores are actually assigned |
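The failures in cases 1 and 8 can be reproduced with a small Python simulation of one scheduling wave of the spreadOut=true path (the function name and parameters are illustrative; cores in units, memory in GB):

```python
def executors_in_one_wave(cores_max, n_workers, worker_cores, worker_mem,
                          cores_per_exec, mem_per_exec):
    """Simulate one wave of Master.startExecutorsOnWorkers with spreadOut=true,
    assuming all workers are identical and fully free."""
    # a worker is usable only if it can fit at least one executor
    if worker_cores < cores_per_exec or worker_mem < mem_per_exec:
        return 0
    # phase 1: round-robin cores_max across the usable workers, one core at a time
    assigned = [0] * n_workers
    to_assign = min(cores_max, n_workers * worker_cores)
    pos = 0
    while to_assign > 0:
        if worker_cores - assigned[pos] > 0:
            to_assign -= 1
            assigned[pos] += 1
        pos = (pos + 1) % n_workers
    # phase 2: carve whole executors out of each worker's share
    total = 0
    for share in assigned:
        total += min(share // cores_per_exec, worker_mem // mem_per_exec)
    return total

print(executors_in_one_wave(10, 10, 16, 16, 2, 2))  # 0  -- case 1: nothing launches
print(executors_in_one_wave(15, 10, 16, 16, 2, 2))  # 5  -- case 8: 5 cores wasted
print(executors_in_one_wave(20, 10, 16, 16, 2, 2))  # 10 -- case 2: as expected
```

The cluster in case 1 has 160 cores and 160g free, far more than the 10 cores requested, yet zero executors launch because each worker's 1-core share is below coresPerExecutor.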
So from cases 1 and 8 we know that although the cluster has enough resources to allocate executors, in fact no executors (or an unreasonably small number of executors) get launched. Then you will see something weird occur:
16/11/18 14:07:10 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
16/11/18 14:07:25 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/11/18 14:07:40 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
3.workarounds
a. use a reasonable number of cores, i.e. choose cores.max so that
cores.max / (#workers x coresPerExecutor) is a natural number
b. embed a code block that checks cores.max,
i.e. after cores have been assigned to workers and before executors are allocated, check whether each worker's share collapses into whole executors, no matter whether that share is more or less than coresPerExecutor.
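Workaround (a) can be sketched as a pre-submit sanity check in Python; the function and variable names are my own, not part of Spark:

```python
def clean_cores_max(cores_max, n_workers, cores_per_exec):
    """Largest value <= cores_max whose even spread over n_workers wastes no
    cores, i.e. every worker's share is a whole multiple of cores_per_exec.
    Returns 0 when no positive waste-free value exists (case 1 in the table),
    meaning cores.max must be raised or coresPerExecutor lowered."""
    unit = n_workers * cores_per_exec
    return (cores_max // unit) * unit

print(clean_cores_max(40, 10, 2))  # 40 -- already clean: 4 cores = 2 executors per worker
print(clean_cores_max(45, 10, 2))  # 40 -- round 45 down to the nearest clean value
print(clean_cores_max(15, 10, 2))  # 0  -- raise cores.max to 20 or lower coresPerExecutor
```

Running this check before submitting the app tells you up front whether the requested cores.max will be fully converted into executors in one wave, instead of discovering the silent waste from the TaskSchedulerImpl warnings later.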
4.conclusion
No doubt the property 'spark.cores.max' may give rise to certain misunderstandings, but you can avoid this case by adopting the solutions above. (Newer Spark releases rework this scheduling to assign cores in multiples of coresPerExecutor, which removes the pitfall.)
Generally speaking, this property lets Spark allocate executors more intelligently and dynamically than computation frameworks running on YARN and the like.
ref:
[1] [spark-src-core] 4.2 communications b/t certain kernel components
[2] Spark scheduling series, part 1: how the Master allocates the resources of each executor on the workers in Spark standalone mode