`
zjeers
  • 浏览: 37712 次
  • 性别: Icon_minigender_1
  • 来自: 成都
社区版块
存档分类
最新评论

SQL Left Join,Hash Join,Merge Join对比

阅读更多

ref:http://support.microsoft.com/kb/197297/en-us

INF: Comparison of Join Techniques

Article ID : 197297
Last Review : April 21, 1999
Revision : 1.0
This article was previously published under Q197297

SUMMARY

Microsoft SQL Server version 7.0 includes some new join techniques, such as the hash join and sort-merge join, to supplement nested-loop joins. This article describes these new techniques and compares and contrasts them.

Back to the top

MORE INFORMATION

Microsoft SQL Server version 6.5 and earlier had only one technique for joining records: the nested loops join. SQL Server 7.0 implements hash and sort-based algorithms. The decision by the query engine on whether to use hash versus sort depends on relative sizes of the inputs, whether they are sorted or not, whether the hash table will fit entirely in memory, and so on. The optimizer will always choose the best join plan to execute your query, and that may be any one of the three options. The difference between hash joins and merge joins is often small, but there are certain situations where one is better than the other. There are considerable differences between these two join types and the nested loops join, as shown below. The example in this article is provided only as an illustration. The table at the end of this article is an illustration of some decisions that may affect the join type used. It is generally recommended that you not force join types, overriding the optimizer.

Back to the top

Nested Loop Joins

In a nested loops join, one of the tables is scanned; for every row in that table, a lookup is performed on the second table and the matching rows are retrieved. We will call these tables outer and inner input. The basic algorithm is to scan the inner input once for each value of outer input, looking for a match. Depending on the presence of indexes on the inner input the lookup can run quite efficiently. Indexed nested loops join is obviously more efficient than non-indexed nested loops. Also, if the tables have a 1-1 relationship, the scan of the inner table can stop as soon as the row is found that matches the row in the outer table.

Any query can be processed using this methodology including joins on inequality conditions.

Nested loops example:
SELECT * FROM reserves AS r1 JOIN sailors AS s1 ON r1.sid=s1.sid option
(loop join)
Reserves has 529 pages and 100000 rows
Sailors has 195 pages 40000 rows

Table 'sailors'. Scan count 100000, logical reads 207596, physical reads 3, read-ahead reads 0.
Table 'reserves'. Scan count 1, logical reads 529, physical reads 0, read- ahead reads 0.
|--Nested Loops(Inner Join)
       |--Sort(ORDER BY:([r1].[sid] ASC))
       |    |--Table Scan(OBJECT:([Northwind].[dbo].[reserves] AS [r1]))
       |--Clustered Index Seek(OBJECT:([Northwind].[dbo].[sailors].[I2] AS
[s1]), SEEK:([s1].[sid]=[r1].[sid]) ORDERED)

Back to the top

Hash Joins

In a hash match, a hash table is created in memory for one of the inputs (build input) using a repeatable randomizing function and this table is searched for matching values from the other input (probe input). The hash table performs like an index. The data from the build input is stored in memory. If the build input does not fit in memory it is partitioned recursively until it fits in memory. The probe input is then hashed and used to search for matches. Unlike with nested loops, the presence or absence of indexes is not particularly a concern in this case. Hash joins are CPU-intensive in comparison to nested loops and are affected by available memory. Hash joins are better when there is a significant difference in the sizes of the tables being joined.

In a hash join the build input determines the recursion depth because it can stop partitioning when the input fits in memory. A single hash join can perform both grouping and join at the same time when the grouping attribute is also the join attribute. The result of this join is not in any particular order. Inequality conditions cannot be satisfied by this type of join.
|--Hash Match(Inner Join, HASH:([s1].[sid])=([r1].[sid]),
RESIDUAL:([s1].[sid]=[r1].[sid]))
       |--Clustered Index Scan(OBJECT:([Northwind].[dbo].[sailors].[I2] AS [s1]))
       |--Table Scan(OBJECT:([Northwind].[dbo].[reserves] AS [r1]))
Table 'reserves'. Scan count 1, logical reads 529, physical reads 0, read- ahead reads 0.
Table 'sailors'. Scan count 1, logical reads 208, physical reads 0, read- ahead reads 0.

Back to the top

Sort-Merge Join

In a sort-merge, sorting the larger input dominates a large portion of the total cost. In a merge join, the inputs are sorted on the join column and the sorted pair is merged into one, eliminating the column values that do not match from the sorted result set.

The algorithm first examines the tuple of one input, scans the second sorted input until a row with equal or larger join value is found. Match up tuple in R1 with all tuples in R2 with matching join value. Repeat for all tuples in R1.

Merge join requires that both inputs be in memory (as opposed to a hash join, where only the smaller input is in memory). The merge join is obviously more efficient if the inputs are already sorted. Also the result of the join is sorted on the join attribute. Inequality conditions cannot be satisfied by this type of join.
|--Merge Join(Inner Join, MANY-TO-MANY MERGE:([r1].[sid])=([s1].[sid]),
RESIDUAL:([s1].[sid]=[r1].[sid]))
       |--Sort(ORDER BY:([r1].[sid] ASC))
       |    |--Table Scan(OBJECT:([Northwind].[dbo].[reserves] AS [r1]))
       |--Clustered Index Scan(OBJECT:([Northwind].[dbo].[sailors].[I2] AS s1]), ORDERED)

Table 'sailors'. Scan count 1, logical reads 208, physical reads 0, read- ahead reads 0.
Table 'reserves'. Scan count 1, logical reads 529, physical reads 0, read- ahead reads 0.

By changing the size of the input such that the relative sizes of the inputs vary, (in our example reserves had only 500 rows and sailors 50,000 rows) merge join for the same query as above took much longer than hash as can be seen below:

Merge join:
SQL Server Execution Times:
CPU time = 1081 ms, elapsed time = 2211 ms.

Hash join:
SQL Server Execution Times:
CPU time = 181 ms, elapsed time = 479 ms.

If an index exists on the join key, the data does not need to be sorted and therefore a merge join will be the best choice.
==========================================================================
Property                 |  Nested loop   | Hash or merge
=========================|=============================================
Only limited memory      | Better         |Worse,
                                          |memory is needed for sorting
                                          |or building hash table

Need first row fast      | Better         |Worse,
                                          |need to sort or hash
                                          |before a row is returned

Selecting only a few     | Better         |Worse
rows from inner table    |                |

Parallelism              |Works best      |Will work

Index available on inner |Works best      |Will work

Without one equality     |Works           |Needs at least 1 equality

2 Very large inputs      |Worse           |Not much better,
                                          |hash and sort will have
                                          |almost equal cost

One very small           |Worse           |Hash better than merge
and large input          |                |

Back to the top

分享到:
评论

相关推荐

    sql-jion 用法

    4. **Hash Join**和**Merge Join**: 这是两种优化性能的连接方法,由查询优化器根据数据量和索引情况选择使用。 - **Hash Join**:当一个表相对较小,可以完全放入内存时,查询优化器可能会选择哈希连接。它创建...

    sql面试题收集.pdf

    hash join、merge join、nest loop(cluster join)和index join是四种不同的连接算法,用于.optimizing查询性能。 hash join是一种基于哈希表的连接算法,用于快速连接两个表。 merge join是一种基于排序的连接...

    Apache Flink 维表关联实战.pdf

    Streaming SQL Join 面向无界数据集的 SQL,无法缓存历史所有数据,因此像 Sort-Merge Join 需要对数据进行排序是无法做到的,Nested-loop Join 和 Hash Join 经过一定的改良则可以满足实时 SQL 的要求。 Flink 中...

    Teradata-SQL.rar_teradata_teradata sql

    5. **JOIN操作**:Teradata支持各种JOIN类型,包括INNER JOIN、LEFT JOIN、RIGHT JOIN和FULL OUTER JOIN,以及优化的HASH JOIN和MERGE JOIN。 6. **SQL聚合函数**:如SUM、AVG、MAX、MIN和COUNT等,Teradata提供了...

    行业-92 深入探索多表关联的SQL语句到底是如何执行的?(2).rar

    2. 算子应用:使用相应的关联算法(如Nested Loop Join、Merge Join、Hash Join)找到匹配的行。 3. 结果合并:将匹配的行组合成新的记录集。 Nested Loop Join是最基础的关联方法,适合小规模的表或部分数据。...

    sql多表查询优化的研究

    通常有三种主要的连接类型:嵌套循环连接(Nested Loop Join)、排序合并连接(Sort-Merge Join)和哈希连接(Hash Join)。每种连接类型都有其适用场景和优缺点。嵌套循环连接适合于小表连接大表,排序合并连接适用...

    查询优化:sql2000中的连接两个表的查询语句的执行路径对性能的影响比较

    - **使用JOIN hint**:通过强制执行特定的连接算法(如NOCOUNT, LOOP, MERGE, HASH),有时可以优化性能,但这可能导致其他查询的性能下降,因此需谨慎使用。 - **查询重写**:有时通过子查询、临时表或者嵌套循环,...

    数据库面试题目

    在Oracle数据库中,常见的JOIN类型包括hash join、merge join、nest loop join(又称为cluster join)、index join等。举例来说,以下是一个使用hash join的SQL语句: ```sql SELECT * FROM a, b WHERE a.id = b.id...

    sql的update语句功能非常强大.docx

    `OPTION (query_hint [...n])`可以添加查询提示来进一步优化查询执行计划,如`HASH JOIN`、`MERGE JOIN`等。 10. **变量更新** 除了列,UPDATE还可以更新存储过程或函数中的变量,如`@variable = value`。 11. *...

    数据库面试题目研究

    - **Hash Join / Merge Join / Nest Loop (Cluster Join) / Index Join**:这些是不同的物理连接方法,Oracle会根据具体情况选择最合适的连接算法。 - **Hash Join**:适用于大型表之间的连接,通过创建哈希表来...

    ORACLE数据库SQL优化---表连接类型.docx

    - 排序合并连接(Sort Merge Join):两个已排序的表通过比较键值进行连接,适合大型表且有索引的情况。 - 嵌套循环连接(Nested Loops Join):驱动表的每一行与被驱动表的每一行进行比较,适合小表连接大表的...

    SAS ADV最强机经

    - 描述中的“leftjoin”表明了数据集连接的概念,这里特指了SAS中的左连接(LEFT JOIN),这是一种SQL连接类型,它允许将一个数据集与另一个数据集中的匹配项组合起来,如果没有匹配项,结果中将包含左侧数据集中的...

    SQL多表连接查询优化的相关研究ppt课件.ppt

    常见的连接类型有Nested Loop Join、Sort-Merge Join和Hash Join等。每种连接类型都有其优缺点,选择合适的连接类型可以显著地提高查询效率。 4. 解决方案空间 解决方案空间是多表连接查询优化的核心问题之一。...

    ORACLE面试锦集

    - **Hash Join**: 通过创建哈希表来进行连接操作,适用于大数据量的情况。 - **Merge Join**: 使用排序后的数据进行连接,适用于有序数据。 - **Nested Loop**: 最基本的连接方式,性能较低,适合小数据量。 #### ...

    filter优化

    - 考虑使用`MERGE JOIN`或`NESTED LOOP`等不同的连接策略,以减少不必要的`HASH JOIN`操作。 3. **建立有效的索引**: - 为涉及频繁过滤的列建立索引,尤其是`SO_STAFF_ID`、`PARTY_ID`、`BO_ID`和`PROD_ID`等列...

    数据库面试题目,经典面试

    5. **其他连接方式**:包括哈希连接(HASH JOIN)、合并连接(MERGE JOIN)和嵌套循环连接(NESTED LOOP JOIN)。这些连接方法在不同的数据分布和查询场景下有不同的性能表现。 接下来,我们讨论如何不借助第三方...

    Oracle 面试及答案-经典.doc

    本文档主要讨论 Oracle 面试中的基础概念和执行计划,包括表连接方式、等连接、非等连接、自连接、外连接、hash join、merge join、nest loop、index join 等,并对各种连接方式进行了详细的解释和示例。 一、基础...

    mysql课件00000000000000000000000000000

    1. JOIN操作:理解并掌握INNER JOIN、LEFT JOIN、RIGHT JOIN和FULL JOIN,用于合并多个表的数据。 2. 分组与聚合函数:GROUP BY用于数据分组,COUNT、SUM、AVG、MAX和MIN等聚合函数用于统计分析。 3. 子查询:学习...

Global site tag (gtag.js) - Google Analytics