Cardinality对表Join的影响 - 飞飞和贝贝爸的博客 - ITeye博客

`

oliver_peng

浏览: 44892 次
性别:
来自: 深圳

最近访客更多访客>>

limars

raines8231

Cannel_2020

xwins

博主相关

博客

微博

相册

收藏

留言

关于我

文章分类

社区版块

存档分类

最新评论

sevk： #让ruby1.8兼容1.9的写法 if RUBY_ ...
升级到Ruby 1.9 笔记二
lmxbitihero：最大亮点当然是性能提高了好几倍。
升级到Ruby 1.9 笔记一
skyblue1984：好像变复杂了.......
升级到Ruby 1.9 笔记二
doabit：想升1.9，只是RAILS在1.9下有编码问题，，又跑回1.8 ...
升级到Ruby 1.9 笔记一
gfdice：性能是最大亮点。反而这个unicode，我还是习惯程序员自己去 ...
升级到Ruby 1.9 笔记一

Cardinality对表Join的影响

博客分类：

MySQL

阅读更多

如果你执行SHOW INDEX FROM TABLE_NAME察看所引信息，中间会有一列Cardinality，MySQL在解析多重Join的时候会根据Cardinality的信息决定选择什么路径来执行Join，这种算法的前提条件是索引数据时很平均分配的，但如果索引中的数据非常不均衡，会导致MySQL做出错误的选择。下面是一个完整的例子：

首先创建三个测试表：

CREATE TABLE customer (
-- Unique ID
customer_id integer NOT NULL,

CONSTRAINT PK_customer PRIMARY KEY(customer_id)
);

CREATE TABLE my_order (
-- Unique ID
order_id integer NOT NULL,

product_id integer NOT NULL,

customer_id integer NOT NULL,

CONSTRAINT PK_order PRIMARY KEY(product_id, customer_id ),
CONSTRAINT FK_order2customer_id FOREIGN KEY (customer_id) REFERENCES customer(customer_id),
CONSTRAINT UQ_order_id UNIQUE(order_id)

);

CREATE TABLE delivery (

order_id int NOT NULL,

time datetime NOT NULL,

CONSTRAINT PK_order PRIMARY KEY(order_id, time),
CONSTRAINT FK_order FOREIGN KEY (order_id) REFERENCES my_order(order_id)
);

然后加入数据：

set @N=0;

insert into customer select @N:=@N+1 from mysql.help_topic LIMIT 1000;

set @N=0;

insert into my_order select @N:=@N+1, @N, 1 from mysql.help_topic a, mysql.help_topic b LIMIT 100000;

set @I=1;

insert into my_order select @N:=@N+1, @N, @I:=@I+1 from mysql.help_topic a, mysql.help_topic b LIMIT 600;

set @N=0;

insert into delivery select @N:=@N+1, '2010-05-10 15:22:02' from mysql.help_topic a, mysql.help_topic b LIMIT 600;

注意在my_order表中大多数记录的customer_id的值是1。

执行ANALYZE TABLE来更新Cardinality：

ANALYZE TABLE my_order;

ANALYZE TABLE delivery;

下面我们来执行一个Query：

mysql> SELECT count(*) FROM my_order a JOIN my_order b ON a.customer_id = b.customer_id and a.product_id = 123 JOIN delivery ON b.order_id = delivery.order_id AND delivery.time = '2010-05-10 15:22:02';
+----------+
| count(*) |
+----------+
|      600 |
+----------+
1 row in set (0.51 sec)

然后用Explain分析一下：

mysql> explain SELECT count(*) FROM my_order a JOIN my_order b ON a.customer_id = b.customer_id and a.product_id = 123 JOIN delivery ON b.order_id = delivery.order_id AND delivery.time = '2010-05-10 15:22:02';
+----+-------------+----------+--------+----------------------------------+----------------------+---------+------------------------------+------+-------------+
| id | select_type | table    | type   | possible_keys                    | key                  | key_len | ref                          | rows | Extra       |
+----+-------------+----------+--------+----------------------------------+----------------------+---------+------------------------------+------+-------------+
| 1 | SIMPLE      | a        | ref    | PRIMARY,FK_order2customer_id     | PRIMARY              | 4       | const                        |    1 | Using index |
| 1 | SIMPLE      | b        | ref    | UQ_order_id,FK_order2customer_id | FK_order2customer_id | 4       | oliver_test.a.customer_id    | 167 |             |
| 1 | SIMPLE      | delivery | eq_ref | PRIMARY                          | PRIMARY              | 12      | oliver_test.b.order_id,const |    1 | Using index |
+----+-------------+----------+--------+----------------------------------+----------------------+---------+------------------------------+------+-------------+
3 rows in set (0.00 sec)

从explain的结果来看，MySQL选择先把my_order自己做Join然后再去Join表delivery。第一个Join需要访问167行数据。但实际情况是如何呢？

mysql> SELECT count(*) FROM my_order a JOIN my_order b ON a.customer_id = b.customer_id and a.product_id = 123 ;
+----------+
| count(*) |
+----------+
|   100000 |
+----------+
1 row in set (0.06 sec)

如果我们只执行第一个Join，实际将访问100000条记录，远远高于167，相反如果我们先做第二个Join，我们只需要访问600行：

mysql> select count(*) from delivery JOIN my_order ON my_order.order_id = delivery.order_id AND delivery.time = '2010-05-10 15:22:02';
+----------+
| count(*) |
+----------+
|      600 |
+----------+
1 row in set (0.00 sec)

为什么MySQL会选择错误的顺序来执行呢？首先察看一下my_order表的索引：

mysql> show index from my_order;
+----------+------------+----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table    | Non_unique | Key_name             | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+----------+------------+----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| my_order |          0 | PRIMARY              |            1 | product_id | A         |      100600 |     NULL | NULL   |      | BTREE      |         |
| my_order |          0 | PRIMARY              |            2 | customer_id | A         |      100600 |     NULL | NULL   |      | BTREE      |         |
| my_order |          0 | UQ_order_id          |            1 | order_id    | A         |      100600 |     NULL | NULL   |      | BTREE      |         |
| my_order |          1 | FK_order2customer_id |            1 | customer_id | A         |         602 |     NULL | NULL   |      | BTREE      |         |
+----------+------------+----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
4 rows in set (0.00 sec)

我们可以看到对于索引FK_order2customer_id，Cardinality的值是602，MySQL在估算第一个Join的时候，假设索引是平均分布的，用总行数（100600）除以Cardinality，所以得到167，由于167小于第二个Join需要访问的行数600，所以选择先执行第一个Join。

如何解决这个问题？

修改Query中Join的顺序并用STRAIGHT_JOIN强制MySQL按照这个顺序执行，下面是新的Query:

mysql> SELECT count(*) FROM delivery STRAIGHT_JOIN my_order a ON a.order_id = delivery.order_id AND delivery.time = '2010-05-10 15:22:02' JOIN my_order b ON a.customer_id = b.customer_id and b.product_id = 123 ;
+----------+
| count(*) |
+----------+
|      600 |
+----------+
1 row in set (0.00 sec)

你可以看到查询时间从0.51秒变成0秒。

用Explain察看新的执行顺序：

mysql> explain SELECT count(*) FROM delivery STRAIGHT_JOIN my_order a ON a.order_id = delivery.order_id AND delivery.time = '2010-05-10 15:22:02' JOIN my_order b ON a.customer_id = b.customer_id and b.product_id = 123 ;
+----+-------------+----------+--------+----------------------------------+-------------+---------+-------------------------------+------+---------------------------------------------+
| id | select_type | table    | type   | possible_keys                    | key         | key_len | ref                           | rows | Extra                                       |
+----+-------------+----------+--------+----------------------------------+-------------+---------+-------------------------------+------+---------------------------------------------+
| 1 | SIMPLE      | b        | ref    | PRIMARY,FK_order2customer_id     | PRIMARY     | 4       | const                         |    1 | Using index                                 |
| 1 | SIMPLE      | delivery | index | PRIMARY                          | PRIMARY     | 12      | NULL                          | 600 | Using where; Using index; Using join buffer |
| 1 | SIMPLE      | a        | eq_ref | UQ_order_id,FK_order2customer_id | UQ_order_id | 4       | oliver_test.delivery.order_id |    1 | Using where                                 |
+----+-------------+----------+--------+----------------------------------+-------------+---------+-------------------------------+------+---------------------------------------------+
3 rows in set (0.00 sec)

另外很重要的一点是这个问题在开始阶段通常不会注意，因为数据量少不会影响性能，但随着数据不断增加，一旦达到一定数量，就会突然出现并影响系统的性能。

分享到：

如何释放已删除的InnoDB磁盘空间 | Apache mod_fcgid 配置实现fcgi process自 ...

2010-05-21 03:59
浏览 1041
评论(0)
论坛回复 / 浏览 (0 / 2914)
分类:数据库
查看更多

评论

发表评论

您还没有登录,请您登录后再发表评论

相关推荐

A Preprocessing Scheme for High-Cardinality Categorical Attributes: 这类问题通常被称为“高基数”（high cardinality）问题，因为“基数”指的是类别变量的不同取值数量。"A Preprocessing Scheme for High-Cardinality Categorical Attributes"这个主题探讨的就是如何有效地处理这类...

简单解析MySQL中的cardinality异常: 当Cardinality异常时，可能导致数据库查询效率降低，进而影响应用程序的性能。在上述问题中，报警显示php-fpm进程数量超过阈值，这通常是由于长时间运行的SQL查询阻塞了资源，导致进程堆积。在排查过程中，发现一...

Oracle优化器 Cardinality基数与Selectivity选择性: 例如，如果表T有1000行，列COL1上有500个非重复值，没有直方图和空值，那么查询"WHERE COL1=某值"时，优化器可能会假设数据均匀分布，预计返回2行，即Cardinality为2。准确的Cardinality有助于生成更高效的执行计划...

Stream summarizer and cardinality estimator..zip: "Stream summarizer" 和 "cardinality estimator" 是流处理中的两个重要概念，特别是在大数据分析和实时计算中。 **流概括器(Stream Summarizer)** 流概括器是用于在数据流上执行轻量级聚合操作的工具，它能够快速...

cardinality: 总的来说，无论是CSS中的选择器权重，数据库设计中的关联比例，还是集合论中的元素数量，"cardinality"都是一个关键概念，它影响着系统的设计、效率和正确性。理解并掌握这些概念对于任何IT专业人士来说都是非常重要...

论文研究-On Function Period and Coding-Alphabet Cardinality of Genetic Algorithms.pdf: 本文主要探讨了遗传算法的编码元数（coding alphabet cardinality）与函数周期（function period）之间的关系，并比较了不同编码元数的遗传算法在处理单周期函数和多周期函数时的优化效果。研究建立了一个基于一阶...

low-cardinality-index-effectiveness: 低基数列索引有效性PostgreSQL 和 DB2 的示例程序。DB2 备忘录将 DB2 本机库的路径配置为LD_LIBRARY_PATH 。例子： LD_LIBRARY_PATH=/opt/ibm/db2/V10.5/lib64在 CLP 中使用分号作为分隔符。 db2 -t

cardinality:用于确定和检查任何可迭代对象大小的 Python 库: 基数cardinality是一个 Python 库，用于确定和检查任何可迭代对象的大小。文档： : Python 包索引 (PyPI)： ://pypi.python.org/pypi/cardinality/ 源代码和问题跟踪器： :

New Cardinality Estimation Algorithms for HyperLogLog Sketches (1702.01284)-计算机科学: New cardinality estimation algorithms for HyperLogLog sketchesOtmar Ertl otmar.ertl@gmail.comFebruary 27, 2017This paper presents new methods to estimate the cardinalities of data sets recorded by ...

HyperLogLog - The Analysis of a Near-Optimal Cardinality Estimation Algorithm (914-3045-2-PB)-计算机科学: cardinality estimation algorithmPhilippe Flajolet1 and Éric Fusy1 and Olivier Gandouet2 and Frédéric Meunier11Algorithms Project, INRIA–Rocquencourt, F78153 Le Chesnay (France) 2LIRMM, 161 rue Ada...

PowerDesigner应用教程(深蓝居): Sybase PowerDesigner（简称PD）是目前市场占有率第一的数据库建模工具，其最新版本为15.1，不仅支持最新的SQL Server 2008数据库，而且在用户界面和功能上都有所增强。PowerDesigner主要被用于数据库建模，并且提供...

藏经阁-ENGINEERING FAST INDEXES (DEEP DIVE).pdf: 本资源摘要信息是关于快速索引的深入探讨，涵盖了Roaring模型、bitset和数组的使用、 cardinality TRACKING、位操作、 population count 等多方面的知识点。 Roaring模型 Roaring模型是一种混合模型，集合了数组、...

RFID Cardinality Estimation with Blocker Tags: ### RFID Cardinality Estimation with Blocker Tags #### 摘要与引言本文提出了一种新的射频识别（RFID）系统中的标签数量估计方案——REB（RFID Estimation with Blocker Tags），该方案在包含遮挡标签...

matlab开发-MaximumCardinalitymatching: 在MATLAB环境中进行软件开发时，经常会涉及到各种算法的实现，比如图论中的经典问题——最大基数匹配。本文将深入探讨“MaximumCardinalitymatching”这一主题，以及如何在MATLAB中构建非加权的最大基数匹配。...

Better with Fewer Bits - Improving the Performance of Cardinality Estimation of Large Data Streams (INFOCOM2017)-计算机科学: Better with Fewer Bits - Improving the Performance of Cardinality Estimation of Large Data Streams (INFOCOM2017)-计算机科学

Global site tag (gtag.js) - Google Analytics