hive ppd

bupt04406

浏览: 351920 次
性别:
来自: 杭州

最近访客更多访客>>

rotkNirvana

zhangyi0618

xuhai0605

pengcong90

博主相关

博客

微博

相册

留言

关于我

文章分类

社区版块

存档分类

博客分类：

Hive

hive

Implement predicate push down for hive queries
https://issues.apache.org/jira/browse/HIVE-279
FilterOperator is applied twice with ppd on.
https://issues.apache.org/jira/browse/HIVE-1538 .

ppd（谓词下推）在HIVE-279中实现，在HIVE-1538中去除了冗余，ppd把一些过滤条件直接下推到紧挨着TableScanOperator，先过滤掉无用的数据。
org.apache.hadoop.hive.conf.HiveConf:
HIVEOPTPPD("hive.optimize.ppd", true), // predicate pushdown 默认是打开的。

org.apache.hadoop.hive.ql.ppd.PredicatePushDown中有解释：
/**
* Implements predicate pushdown. Predicate pushdown is a term borrowed from relational
* databases even though for Hive it is predicate pushup.
* The basic idea is to process expressions as early in the plan as possible. The default plan
* generation adds filters where they are seen but in some instances some of the filter expressions
* can be pushed nearer to the operator that sees this particular data for the first time.
* e.g.
* select a.*, b.*
* from a join b on (a.col1 = b.col1)
* where a.col1 > 20 and b.col2 > 40
*
* For the above query, the predicates (a.col1 > 20) and (b.col2 > 40), without predicate pushdown,
* would be evaluated after the join processing has been done. Suppose the two predicates filter out
* most of the rows from a and b, the join is unnecessarily processing these rows.
* With predicate pushdown, these two predicates will be processed before the join.
*
* Predicate pushdown is enabled by setting hive.optimize.ppd to true. It is disable by default.
*
* The high-level algorithm is describe here
* - An operator is processed after all its children have been processed
* - An operator processes its own predicates and then merges (conjunction) with the processed
*     predicates of its children. In case of multiple children, there are combined using
*     disjunction (OR).
* - A predicate expression is processed for an operator using the following steps
*    - If the expr is a constant then it is a candidate for predicate pushdown
*    - If the expr is a col reference then it is a candidate and its alias is noted
*    - If the expr is an index and both the array and index expr are treated as children
*    - If the all child expr are candidates for pushdown and all of the expression reference
*        only one alias from the operator's RowResolver then the current expression is also a
*        candidate
*   One key thing to note is that some operators (Select, ReduceSink, GroupBy, Join etc) change
*   the columns as data flows through them. In such cases the column references are replaced by
*   the corresponding expression in the input data.
*/

下面可以看到打开关闭ppd，以及HIVE-1538打上的效果：
SQL：来自 hive-trunk/ql/src/test/queries/clientpositive/ppd_gby.q

EXPLAIN
SELECT src1.c1
FROM
(SELECT src.value as c1, count(src.key) as c2 from src where src.value > 'val_10' group by src.value) src1
WHERE src1.c1 > 'val_200' and (src1.c2 > 30 or src1.c1 < 'val_400')

set hive.optimize.ppd = false;

hive> set hive.optimize.ppd = false;
hive> EXPLAIN
    > SELECT src1.c1
    > FROM
    > (SELECT src.value as c1, count(src.key) as c2 from src where src.value > 'val_10' group by src.value) src1
    > WHERE src1.c1 > 'val_200' and (src1.c2 > 30 or src1.c1 < 'val_400');
OK
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_TABREF src)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL src) value) c1) (TOK_SELEXPR (TOK_FUNCTION count (. (TOK_TABLE_OR_COL src) key)) c2)) (TOK_WHERE (> (. (TOK_TABLE_OR_COL src) value) 'val_10')) (TOK_GROUPBY (. (TOK_TABLE_OR_COL src) value)))) src1)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL src1) c1))) (TOK_WHERE (and (> (. (TOK_TABLE_OR_COL src1) c1) 'val_200') (or (> (. (TOK_TABLE_OR_COL src1) c2) 30) (< (. (TOK_TABLE_OR_COL src1) c1) 'val_400'))))))

STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage

STAGE PLANS:
Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        src1:src
          TableScan
            alias: src
            Filter Operator
              predicate:
                  expr: (value > 'val_10')
                  type: boolean
              Select Operator
                expressions:
                      expr: key
                      type: string
                      expr: value
                      type: string
                outputColumnNames: key, value
                Group By Operator
                  aggregations:
                        expr: count(key)
                  bucketGroup: false
                  keys:
                        expr: value
                        type: string
                  mode: hash
                  outputColumnNames: _col0, _col1
                  Reduce Output Operator
                    key expressions:
                          expr: _col0
                          type: string
                    sort order: +
                    Map-reduce partition columns:
                          expr: _col0
                          type: string
                    tag: -1
                    value expressions:
                          expr: _col1
                          type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(VALUE._col0)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Select Operator
            expressions:
                  expr: _col0
                  type: string
                  expr: _col1
                  type: bigint
            outputColumnNames: _col0, _col1
            Filter Operator
              predicate:
                  expr: ((_col0 > 'val_200') and ((_col1 > 30) or (_col0 < 'val_400')))
                  type: boolean
              Select Operator
                expressions:
                      expr: _col0
                      type: string
                outputColumnNames: _col0
                File Output Operator
                  compressed: false
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

Stage: Stage-0
    Fetch Operator
      limit: -1

    > set hive.ppd.remove.duplicatefilters=false;
hive> EXPLAIN
    > SELECT src1.c1
    > FROM
    > (SELECT src.value as c1, count(src.key) as c2 from src where src.value > 'val_10' group by src.value) src1
    > WHERE src1.c1 > 'val_200' and (src1.c2 > 30 or src1.c1 < 'val_400')
    > ;
OK
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_TABREF src)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL src) value) c1) (TOK_SELEXPR (TOK_FUNCTION count (. (TOK_TABLE_OR_COL src) key)) c2)) (TOK_WHERE (> (. (TOK_TABLE_OR_COL src) value) 'val_10')) (TOK_GROUPBY (. (TOK_TABLE_OR_COL src) value)))) src1)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL src1) c1))) (TOK_WHERE (and (> (. (TOK_TABLE_OR_COL src1) c1) 'val_200') (or (> (. (TOK_TABLE_OR_COL src1) c2) 30) (< (. (TOK_TABLE_OR_COL src1) c1) 'val_400'))))))

STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage

STAGE PLANS:
Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        src1:src
          TableScan
            alias: src
            Filter Operator
              predicate:
                  expr: ((value > 'val_10') and (value > 'val_200'))
                  type: boolean
              Filter Operator
                predicate:
                    expr: (value > 'val_10')
                    type: boolean
                Select Operator
                  expressions:
                        expr: key
                        type: string
                        expr: value
                        type: string
                  outputColumnNames: key, value
                  Group By Operator
                    aggregations:
                          expr: count(key)
                    bucketGroup: false
                    keys:
                          expr: value
                          type: string
                    mode: hash
                    outputColumnNames: _col0, _col1
                    Reduce Output Operator
                      key expressions:
                            expr: _col0
                            type: string
                      sort order: +
                      Map-reduce partition columns:
                            expr: _col0
                            type: string
                      tag: -1
                      value expressions:
                            expr: _col1
                            type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(VALUE._col0)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Select Operator
            expressions:
                  expr: _col0
                  type: string
                  expr: _col1
                  type: bigint
            outputColumnNames: _col0, _col1
            Filter Operator
              predicate:
                  expr: ((_col0 > 'val_200') and ((_col1 > 30) or (_col0 < 'val_400')))
                  type: boolean
              Select Operator
                expressions:
                      expr: _col0
                      type: string
                outputColumnNames: _col0
                File Output Operator
                  compressed: false
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

Stage: Stage-0
    Fetch Operator
      limit: -1

Time taken: 0.208 seconds

hive> set hive.ppd.remove.duplicatefilters=true;
hive> EXPLAIN
    > SELECT src1.c1
    > FROM
    > (SELECT src.value as c1, count(src.key) as c2 from src where src.value > 'val_10' group by src.value) src1
    > WHERE src1.c1 > 'val_200' and (src1.c2 > 30 or src1.c1 < 'val_400');
OK
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_TABREF src)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL src) value) c1) (TOK_SELEXPR (TOK_FUNCTION count (. (TOK_TABLE_OR_COL src) key)) c2)) (TOK_WHERE (> (. (TOK_TABLE_OR_COL src) value) 'val_10')) (TOK_GROUPBY (. (TOK_TABLE_OR_COL src) value)))) src1)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL src1) c1))) (TOK_WHERE (and (> (. (TOK_TABLE_OR_COL src1) c1) 'val_200') (or (> (. (TOK_TABLE_OR_COL src1) c2) 30) (< (. (TOK_TABLE_OR_COL src1) c1) 'val_400'))))))

STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage

STAGE PLANS:
Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        src1:src
          TableScan
            alias: src
            Filter Operator
              predicate:
                  expr: ((value > 'val_10') and (value > 'val_200'))
                  type: boolean
              Select Operator
                expressions:
                      expr: key
                      type: string
                      expr: value
                      type: string
                outputColumnNames: key, value
                Group By Operator
                  aggregations:
                        expr: count(key)
                  bucketGroup: false
                  keys:
                        expr: value
                        type: string
                  mode: hash
                  outputColumnNames: _col0, _col1
                  Reduce Output Operator
                    key expressions:
                          expr: _col0
                          type: string
                    sort order: +
                    Map-reduce partition columns:
                          expr: _col0
                          type: string
                    tag: -1
                    value expressions:
                          expr: _col1
                          type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(VALUE._col0)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Filter Operator
            predicate:
                expr: ((_col0 > 'val_200') and ((_col1 > 30) or (_col0 < 'val_400')))
                type: boolean
            Select Operator
              expressions:
                    expr: _col0
                    type: string
                    expr: _col1
                    type: bigint
              outputColumnNames: _col0, _col1
              Select Operator
                expressions:
                      expr: _col0
                      type: string
                outputColumnNames: _col0
                File Output Operator
                  compressed: false
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

Stage: Stage-0
    Fetch Operator
      limit: -1

分享到：

hive hiveconf 配置 | hive genPlan

2011-07-26 00:24
浏览 2239
评论(0)
分类:开源软件
查看更多

发表评论

您还没有登录,请您登录后再发表评论

最近访客更多访客>>

博主相关

文章分类

社区版块

存档分类

最新评论

hive ppd

评论

发表评论

相关推荐

最近访客 更多访客>>

博主相关

文章分类

社区版块

存档分类

最新评论

hive ppd

评论

发表评论

相关推荐

hive rename table name

hive的distribute by如何partition long型的数据

hive like vs rlike vs regexp

hive sql where条件很简单，但是太多

insert into时(string->bigint)自动类型转换

通过复合结构来优化udf的调用

RegexSerDe

Hive 的 OutputCommitter

hive LATERAL VIEW 行转列

hive complex type

hive转义字符

hive 两个不同类型的columns进行比较

lateral view

udf 中获得 FileSystem

hive union mapjoin

hive eclipse

hive join filter

hive limit

hive convertMapJoin MapJoinProcessor

hive hive.merge.mapfiles hive.merge.mapredfiles

最近访客更多访客>>