tobyqiu
Hive Join Optimization (Translation)

 


Translated from https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization#LanguageManualJoinOptimization-AutoConversiontoSMBMapJoin

Contents

Join Optimization
	Improvements to the Hive Optimizer
	Star Join Optimization
		Star Schema Example
		Prior Support for MAPJOIN
			Limitations of Prior Implementation
		Enhancements for Star Joins
			Optimize Chains of Map Joins
				Current and Future Optimizations
			Optimize Auto Join Conversion
				Current Optimization
				Auto Conversion to SMB Map Join
			Generate Hash Tables on the Task Side
				Pros and Cons of Client-Side Hash Tables
				Task-Side Generation of Hash Tables
					Further Options for Optimization

 

Improvements to the Hive Optimizer

 

Hive automatically recognizes various use cases and optimizes for them. Hive 0.11 improves the optimizer for these cases:

  • Joins where one side fits in memory. In the new optimization:
    • that side is loaded into memory as a hash table
    • only the larger table needs to be scanned
    • fact tables have a smaller footprint in memory
  • Star-schema joins
  • Hints are no longer needed for many cases.
  • Map joins are automatically picked up by the optimizer.
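The automatic pickup mentioned above is governed by a single flag; a minimal sketch (showing only this one knob, with the Hive 0.11 default; related size parameters appear later in this article):

```sql
-- Enables automatic conversion of qualifying joins into map joins
-- (default true as of Hive 0.11, per HIVE-3297).
set hive.auto.convert.join = true;
```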

Star Join Optimization

Star Schema Example

 

Select count(*) cnt
From store_sales ss
     join household_demographics hd on (ss.ss_hdemo_sk = hd.hd_demo_sk)
     join time_dim t on (ss.ss_sold_time_sk = t.t_time_sk)
     join store s on (s.s_store_sk = ss.ss_store_sk)
Where
     t.t_hour = 8
     and t.t_minute >= 30
     and hd.hd_dep_count = 2
order by cnt;
 

 

This is the classic star schema used in data warehouses; it is a basic BI concept, so no further explanation is given here.

Prior Support for MAPJOIN

 

The default value for hive.auto.convert.join was false in Hive 0.10.0. Hive 0.11.0 changed the default to true (HIVE-3297).

MAPJOINs are processed by loading the smaller table into an in-memory hash map and matching keys with the larger table as it is streamed through. The prior implementation has this division of labor:

  • Local work:
    • read records via standard table scan (including filters and projections) from the source on the local machine
    • build the hashtable in memory
    • write the hashtable to local disk
    • upload the hashtable to DFS
    • add the hashtable to the distributed cache
  • Map task:
    • read the hashtable from local disk (distributed cache) into memory
    • match records' keys against the hashtable keys
    • combine matches and write to output
  • No reduce task (a defining trait of map joins)
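A sketch of the hinted form that triggers this prior implementation, using tables from the star-schema example above (the MAPJOIN hint syntax is the same one used later in this article):

```sql
-- Hint the optimizer to build an in-memory hashtable from time_dim;
-- store_sales is then streamed through a map-only job with no reducer.
select /*+ MAPJOIN(time_dim) */ count(*)
from store_sales
join time_dim on (ss_sold_time_sk = t_time_sk)
where t_hour = 8;
```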

 

Limitations of Prior Implementation

  • The mapjoin operator can only handle one key at a time; that is, it can perform a multi-table join, but only if all the tables are joined on the same key. (Typical star schema joins do not fall into this category.)
  • Hints are cumbersome for users to apply correctly, and auto conversion doesn't have enough logic to consistently predict whether a MAPJOIN will fit into memory or not.
  • A chain of MAPJOINs is not coalesced into a single map-only job, unless the query is written as a cascading sequence of mapjoin(table, subquery(mapjoin(table, subquery...))). Auto conversion never produces a single map-only job.
  • The hashtable for the mapjoin operator has to be regenerated for each run of the query, which involves downloading all the data to the Hive client machine as well as uploading the generated hashtable files.

Enhancements for Star Joins

The join optimizations can be grouped into three parts:

  • Execute chains of mapjoins in the operator tree in a single map-only job, when map-join hints are used.
  • Extend the optimization to the auto-conversion case (generating an appropriate backup plan when optimizing).
  • Generate the in-memory hashtable completely on the task side. (Future work. Today the hashtable is built on the client, uploaded to HDFS, and then handed to each map task via the distributed cache.)

Optimize Chains of Map Joins

The following query will produce two separate map-only jobs when executed:

select /*+ MAPJOIN(time_dim, date_dim) */ count(*) from
store_sales 
join time_dim on (ss_sold_time_sk = t_time_sk) 
join date_dim on (ss_sold_date_sk = d_date_sk)
where t_hour = 8 and d_year = 2002

It is likely, though, that for small dimension tables the needed parts of both tables would fit into memory at the same time. This reduces the time needed to execute the query dramatically, as the fact table is read only once instead of being read twice and written to HDFS to communicate between the jobs.

Current and Future Optimizations
  1. Merge M*-MR patterns (a chain of map-only jobs plus an MR job) into a single MR job.
  2. Merge MJ->MJ into a single MJ when possible.
  3. Merge MJ* patterns into a single map stage as a chain of MJ operators. (Not yet implemented: the example above is still split into two map-only jobs rather than one chain.)

If hive.auto.convert.join is set to true, the optimizer not only converts joins to mapjoins but also merges MJ* patterns as much as possible.

 

Optimize Auto Join Conversion

When auto join is enabled, there is no longer a need to provide map-join hints in the query. The auto join option can be enabled with two configuration parameters:

set hive.auto.convert.join.noconditionaltask = true;
set hive.auto.convert.join.noconditionaltask.size = 10000;
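As a hedged note on the second parameter: the size is specified in bytes, so the 10000 in the example above is only about 10 KB; a more practical sketch might look like this (the 10 MB figure is an illustrative assumption, not a recommendation from the original article):

```sql
-- noconditionaltask.size is in bytes; e.g. allow up to roughly 10 MB
-- of small-table hashtables in total per map-only job.
set hive.auto.convert.join.noconditionaltask = true;
set hive.auto.convert.join.noconditionaltask.size = 10000000;
```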

The default for hive.auto.convert.join.noconditionaltask is true, which means auto conversion is enabled.

The size configuration enables the user to control what size of table can fit in memory. This value represents the sum of the sizes of all tables that can be converted to hashmaps fitting in memory.

Currently, n-1 tables of the join have to fit in memory for the map-join optimization to take effect; the remaining (largest) table is streamed from disk and matched against them.

There is no check of whether a table is compressed or what its potential in-memory size would be. (Translator's note: the size presumably comes straight from the file size on HDFS.)

The effect of this assumption on the results is discussed in the next section.

For example, the previous query simply becomes:

select count(*) from
store_sales 
join time_dim on (ss_sold_time_sk = t_time_sk)
join date_dim on (ss_sold_date_sk = d_date_sk)
where t_hour = 8 and d_year = 2002

 

If time_dim and date_dim fit in the size configuration provided, the respective joins are converted to map-joins. (Translator's note: I believe the size referred to here is hive.smalltable.filesize, which defaults to 25 MB; it is odd that this value is larger than noconditionaltask.size.)

If the sum of the sizes of the tables fits in the configured size, then the two map-joins are combined into a single map-join.

This reduces the number of MR jobs required and significantly boosts the speed of execution of the query.

This example can be easily extended to multi-way joins as well and will work as expected.
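A sketch of such a multi-way extension, reusing the store dimension table from the star-schema example earlier, under the assumption that all three dimension tables together fit within noconditionaltask.size:

```sql
select count(*) from
store_sales
join time_dim on (ss_sold_time_sk = t_time_sk)
join date_dim on (ss_sold_date_sk = d_date_sk)
join store    on (ss_store_sk = s_store_sk)
where t_hour = 8 and d_year = 2002;
-- If size(time_dim) + size(date_dim) + size(store) <= noconditionaltask.size,
-- all three map-joins are merged into a single map-only job.
```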

Outer joins offer more challenges. Since a map-join operator can only stream one table, the streamed table needs to be the one from which all of the rows are required.

For the left outer join, this is the table on the left side of the join; for the right outer join, the table on the right side, etc.

This means that even though an inner join can be converted to a map-join regardless of which side is hashed, an outer join cannot always be converted. An outer join can only be converted if the table(s) apart from the one that needs to be streamed fit in the size configuration.

A full outer join cannot be converted to a map-join at all, since both tables would need to be streamed.
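For instance, a sketch using the tables from earlier (the behavior described follows from the streaming rule above):

```sql
-- Every store_sales row must appear in the result, so store_sales is the
-- streamed side. The join can become a map-join only if time_dim fits in
-- the configured size and is loaded as the hashtable.
select count(*)
from store_sales
left outer join time_dim on (ss_sold_time_sk = t_time_sk);
```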

Auto join conversion also affects sort-merge-bucket joins.

Current Optimization

Group as many MJ operators as possible into one MJ.

 

Auto Conversion to SMB Map Join 

 

Sort-Merge-Bucket (SMB) joins can be converted to SMB map joins as well.

SMB joins are used wherever the tables are sorted and bucketed.

The join boils down to just merging the already sorted tables, allowing this operation to be faster than an ordinary map-join.

However, if the tables are partitioned, there could be a slowdown, as each mapper would need to fetch a very small chunk of a partition that has a single key.

The following configuration settings enable the conversion of an SMB join to a map-join SMB join:

 

set hive.auto.convert.sortmerge.join=true;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;
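For SMB joins to apply at all, both tables must be bucketed and sorted on the join key with the same bucket count; a sketch of such DDL (table and column names are illustrative, not from the original article):

```sql
-- Both tables bucketed and sorted on the join key, same bucket count.
create table big_t   (key int, value string)
  clustered by (key) sorted by (key) into 32 buckets;
create table small_t (key int, value string)
  clustered by (key) sorted by (key) into 32 buckets;
-- With the four flags above set, a join between big_t and small_t on key
-- is a candidate for auto conversion to an SMB map join.
```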
 

 

There is an option to set the big-table selection policy using the following configuration:

 

set hive.auto.convert.sortmerge.join.bigtable.selection.policy 
    = org.apache.hadoop.hive.ql.optimizer.TableSizeBasedBigTableSelectorForAutoSMJ;
 

 

By default, the selection policy is average partition size. The big-table selection policy helps determine which table to choose for streaming only, as opposed to being hashed and streamed.

The available selection policies are:

org.apache.hadoop.hive.ql.optimizer.AvgPartitionSizeBasedBigTableSelectorForAutoSMJ (default)

org.apache.hadoop.hive.ql.optimizer.LeftmostBigTableSelectorForAutoSMJ

org.apache.hadoop.hive.ql.optimizer.TableSizeBasedBigTableSelectorForAutoSMJ

The names describe their uses. This is especially useful for fact-to-fact joins.

 

Generate Hash Tables on the Task Side 

Future work will make it possible to generate in-memory hashtables completely on the task side. (Currently they are generated on the client.)

 

Pros and Cons of Client-Side Hash Tables

 

Generating the hashtable (or multiple hashtables for multi-table joins) on the client machine has drawbacks. (The client machine is the host used to run the Hive client and submit jobs.)

Data locality: the client machine typically is not a data node. All the data accessed is remote and has to be read via the network.

Specs: for the same reason, it is not clear what the specifications of the machine running this processing will be. It might have limitations in memory, hard drive, or CPU that the task nodes do not have.

HDFS upload: the data has to be brought back to the cluster and replicated via the distributed cache to be used by task nodes.

 

Pre-processing the hashtables on the client machine also has some benefits:

1. What is stored in the distributed cache is likely to be smaller than the original table (thanks to filtering and projection).

2. In contrast, loading hashtables directly on the task nodes using the distributed cache means larger objects in the cache, potentially reducing the opportunities for using MAPJOIN.

 

Task-Side Generation of Hash Tables

 

When the hashtables are generated completely on the task side, all task nodes have to access the original data source to generate their hashtables (accessing the same resource simultaneously).

Since in the normal case this happens in parallel, it will not affect latency; but Hive has the concept of storage handlers, and having many tasks access the same external data source (HBase, a database, etc.) might overwhelm or slow down that source.

Further Options for Optimization

1. Increase the replication factor on dimension tables.

2. Use the distributed cache to hold dimension tables.

 

 

 

 

 

 
