Spark Source Code Analysis: Catalyst (Draft)

baishuo491


object Optimizer extends RuleExecutor[LogicalPlan] {
  val batches =
    Batch("ConstantFolding", Once,
      ConstantFolding,
      BooleanSimplification,
      SimplifyFilters,
      SimplifyCasts) ::
    Batch("Filter Pushdown", Once,
      CombineFilters,
      PushPredicateThroughProject,
      PushPredicateThroughInnerJoin,
      ColumnPruning) :: Nil
}

SimplifyFilters

object SimplifyFilters extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // The condition is always true: the Filter is redundant, keep only its child.
    case Filter(Literal(true, BooleanType), child) =>
      child
    // The condition is null or always false: no row can pass the Filter.
    case Filter(Literal(null, _), child) =>
      LocalRelation(child.output)
    case Filter(Literal(false, BooleanType), child) =>
      LocalRelation(child.output)
  }
}
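These cases trim away part of the filter logic: a Filter whose condition is the literal true is replaced by its child, and a Filter whose condition is false or null is replaced by LocalRelation(child.output), which keeps the child's output attributes but carries no rows. So where do patterns such as Literal(true, BooleanType) come from? Looking at the Optimizer's batches, they are produced by BooleanSimplification, the rule that runs just before SimplifyFilters in the same batch (the batch itself is named "ConstantFolding").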
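The interplay between the two rules can be sketched with a toy model. This is only an illustration: BoolLit, Col, Relation, EmptyRelation and the functions below are hypothetical stand-ins written for this example, not Catalyst classes, and the real rules use transform over the full expression and plan trees.

```scala
object MiniCatalyst {
  // Hypothetical, heavily simplified stand-ins for Catalyst expressions.
  sealed trait Expr
  case class BoolLit(value: Boolean) extends Expr
  case class Col(name: String) extends Expr
  case class And(left: Expr, right: Expr) extends Expr
  case class Or(left: Expr, right: Expr) extends Expr

  // Hypothetical, heavily simplified stand-ins for logical plans.
  sealed trait Plan
  case class Relation(name: String) extends Plan
  case object EmptyRelation extends Plan
  case class Filter(condition: Expr, child: Plan) extends Plan

  // Step 1: simplify boolean conditions bottom-up; this is where literals appear.
  def simplifyBool(e: Expr): Expr = e match {
    case And(l, r) => (simplifyBool(l), simplifyBool(r)) match {
      case (BoolLit(false), _) | (_, BoolLit(false)) => BoolLit(false)
      case (BoolLit(true), x) => x
      case (x, BoolLit(true)) => x
      case (x, y) => And(x, y)
    }
    case Or(l, r) => (simplifyBool(l), simplifyBool(r)) match {
      case (BoolLit(true), _) | (_, BoolLit(true)) => BoolLit(true)
      case (BoolLit(false), x) => x
      case (x, BoolLit(false)) => x
      case (x, y) => Or(x, y)
    }
    case other => other
  }

  def booleanSimplification(p: Plan): Plan = p match {
    case Filter(c, child) => Filter(simplifyBool(c), booleanSimplification(child))
    case other => other
  }

  // Step 2: drop filters whose condition has become a literal.
  def simplifyFilters(p: Plan): Plan = p match {
    case Filter(BoolLit(true), child) => simplifyFilters(child)
    case Filter(BoolLit(false), _) => EmptyRelation
    case Filter(c, child) => Filter(c, simplifyFilters(child))
    case other => other
  }

  // Rule order matters: the second rule only fires on what the first one produced.
  def optimize(p: Plan): Plan = simplifyFilters(booleanSimplification(p))
}
```

For example, optimize(Filter(Or(Col("a"), BoolLit(true)), Relation("t"))) first rewrites the condition to BoolLit(true) and then removes the Filter altogether, leaving Relation("t").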

SQLContext.createSchemaRDD(RDD<A>, TypeTag<A>) line: 90
BaiJoin$.main(String[]) line: 26
BaiJoin.main(String[]) line: not available

Look at this frame: SQLContext.createSchemaRDD(RDD<A>, TypeTag<A>)
The breakpoint had stopped on the new SchemaRDD line:
implicit def createSchemaRDD[A <: Product: TypeTag](rdd: RDD[A]) =
    new SchemaRDD(this, SparkLogicalPlan(ExistingRdd.fromProductRdd(rdd)))
At that point the debugger's Variables view showed this variable: evidence$1 TypeTags$TypeTagImpl<T> (id=107)
Its value is TypeTag[com.ailk.test.sql.tb], so we can roughly say that A is com.ailk.test.sql.tb (a case class type).
The rdd is: MappedRDD[2] at map at BaiJoin.scala:16
             MappedRDD[1] at textFile at BaiJoin.scala:16
                 HadoopRDD[0] at textFile at BaiJoin.scala:16
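The evidence$1 variable seen in the debugger comes from the context bound A <: Product: TypeTag, which the Scala compiler rewrites into an extra implicit parameter. A minimal sketch of that equivalence follows; Tb, describe and describeExplicit are hypothetical names made up for this example (Tb stands in for com.ailk.test.sql.tb), and scala-reflect is assumed to be on the classpath.

```scala
import scala.reflect.runtime.universe._

object ContextBoundDemo {
  // Hypothetical stand-in for the case class com.ailk.test.sql.tb.
  case class Tb(id: Int, name: String)

  // Context-bound form, the same shape as createSchemaRDD[A <: Product: TypeTag].
  def describe[A <: Product: TypeTag](rows: Seq[A]): String =
    typeOf[A].typeSymbol.name.toString

  // Roughly what the compiler generates: the TypeTag travels as an implicit
  // parameter (the compiler names it evidence$1).
  def describeExplicit[A <: Product](rows: Seq[A])(implicit evidence: TypeTag[A]): String =
    evidence.tpe.typeSymbol.name.toString
}
```

Both methods return the simple name of the element type for a Seq[Tb], which is why the debugger can show the concrete case class even though A itself is erased at runtime.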

def fromProductRdd[A <: Product : TypeTag](productRdd: RDD[A]) = {
    ExistingRdd(ScalaReflection.attributesFor[A], productToRowRdd(productRdd))
}
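ScalaReflection.attributesFor[A] takes every field of A and collects them into a list, i.e. all the columns defined by com.ailk.test.sql.tb.
So the result of ScalaReflection.attributesFor[A] is a Seq[Attribute]; it becomes the output of ExistingRdd, whose execute simply returns an RDD[Row]: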
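How a list of columns can be derived from a case class type is sketched below. This is not the code of ScalaReflection.attributesFor; Column, Tb and columnsFor are hypothetical names for this example, and the sketch assumes Scala 2.11 or later with scala-reflect on the classpath and a case class with a single constructor.

```scala
import scala.reflect.runtime.universe._

object AttributesSketch {
  // Hypothetical stand-in for com.ailk.test.sql.tb.
  case class Tb(id: Int, name: String)

  // Hypothetical stand-in for Catalyst's Attribute: a column name plus its type.
  case class Column(name: String, dataType: String)

  // Read the primary constructor's parameters: for a case class these are
  // exactly its fields, in declaration order.
  def columnsFor[A <: Product: TypeTag]: Seq[Column] = {
    val constructor = typeOf[A].decl(termNames.CONSTRUCTOR).asMethod
    constructor.paramLists.head.map { p =>
      Column(p.name.toString, p.typeSignature.toString)
    }
  }
}
```

Calling AttributesSketch.columnsFor[AttributesSketch.Tb] yields one Column per constructor parameter, so the schema follows directly from the case class definition.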
case class ExistingRdd(output: Seq[Attribute], rdd: RDD[Row]) extends LeafNode {
  override def execute() = rdd
}
productToRowRdd takes an RDD[A] as input and produces an RDD[Row]:
def productToRowRdd[A <: Product](data: RDD[A]): RDD[Row] = {
    data.mapPartitions { iterator =>
      if (iterator.isEmpty) {
        Iterator.empty
      } else {
        val bufferedIterator = iterator.buffered
        val mutableRow = new GenericMutableRow(bufferedIterator.head.productArity)

        bufferedIterator.map { r =>
          var i = 0
          while (i < mutableRow.length) {
            mutableRow(i) = r.productElement(i)
            i += 1
          }

          mutableRow
        }
      }
    }
}
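Note that the function above allocates one GenericMutableRow per partition and overwrites it for every element, so the iterator keeps returning the same object. A Spark-free sketch of that pattern and its consequence follows; MutableRow and toRows are hypothetical names for this example.

```scala
object RowReuseSketch {
  // Hypothetical stand-in for GenericMutableRow.
  final class MutableRow(val length: Int) {
    private val values = new Array[Any](length)
    def update(i: Int, v: Any): Unit = values(i) = v
    def toVector: Vector[Any] = values.toVector // immutable snapshot of the current contents
  }

  // Same shape as productToRowRdd's per-partition function: one row object,
  // refilled for every input element.
  def toRows[A <: Product](data: Iterator[A]): Iterator[MutableRow] =
    if (data.isEmpty) {
      Iterator.empty
    } else {
      val buffered = data.buffered
      val row = new MutableRow(buffered.head.productArity)
      buffered.map { r =>
        var i = 0
        while (i < row.length) {
          row(i) = r.productElement(i)
          i += 1
        }
        row // always the same instance
      }
    }
}
```

Consuming the rows one at a time (or copying each one, as with toVector) gives the expected values, while materializing the iterator directly, for example with toList, yields several references to one row that holds only the last element. The reuse avoids allocating a row per record, at the cost of requiring downstream code to copy any row it wants to keep.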

/////////////////////////////////////////////////////////////////////
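heap, JIT compiler, GC
dfs3
Memory allocation must be an atomic operation. Per-thread approach: TLAB, one allocation buffer for each thread; free list; bump-the-pointer
Copying algorithm
What gets copied into S0 and S1 are the objects in Eden that are still alive
Mark-sweep algorithm: causes memory fragmentation
Mark-compact algorithm: involves heavy memory copying

Choice of GC roots: class, thread, stack local, JNI local, monitor, "held by JVM"
dfs3 marking method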
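
The mark-sweep idea from the notes above can be sketched as a depth-first traversal from the roots followed by a sweep over the heap. This is a toy model under stated assumptions: Obj, mark and sweep are hypothetical names, the "heap" is just a Scala collection, and a real JVM collector works on raw memory rather than on objects like these.

```scala
object MarkSweepSketch {
  import scala.collection.mutable

  // Hypothetical heap object: outgoing references plus a mark bit.
  final class Obj(val name: String) {
    val refs = mutable.ArrayBuffer.empty[Obj]
    var marked = false
  }

  // Mark phase: depth-first traversal from the roots with an explicit stack.
  def mark(roots: Seq[Obj]): Unit = {
    var stack = roots.toList
    while (stack.nonEmpty) {
      val o = stack.head
      stack = stack.tail
      if (!o.marked) {
        o.marked = true
        stack = o.refs.toList ::: stack
      }
    }
  }

  // Sweep phase: keep marked objects, drop the rest, reset marks for the next cycle.
  def sweep(heap: Seq[Obj]): Seq[Obj] = {
    val live = heap.filter(_.marked)
    live.foreach(o => o.marked = false)
    live
  }
}
```

Objects that only reference each other but are unreachable from any root are never marked, so the sweep discards them even though they form a cycle. The survivors stay where they are, which is where the fragmentation mentioned in the notes comes from.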
