如果你执行SHOW INDEX FROM TABLE_NAME察看所引信息,中间会有一列Cardinality,MySQL在解析多重Join的时候会根据Cardinality的信息决定选择什么路径来执行Join,这种算法的前提条件是索引数据时很平均分配的,但如果索引中的数据非常不均衡,会导致MySQL做出错误的选择。 下面是一个完整的例子:
首先创建三个测试表:
CREATE TABLE customer (
-- Unique ID
customer_id integer NOT NULL,
CONSTRAINT PK_customer PRIMARY KEY(customer_id)
);
CREATE TABLE my_order (
-- Unique ID
order_id integer NOT NULL,
product_id integer NOT NULL,
customer_id integer NOT NULL,
CONSTRAINT PK_order PRIMARY KEY(product_id, customer_id ),
CONSTRAINT FK_order2customer_id FOREIGN KEY (customer_id) REFERENCES customer(customer_id),
CONSTRAINT UQ_order_id UNIQUE(order_id)
);
CREATE TABLE delivery (
order_id int NOT NULL,
time datetime NOT NULL,
CONSTRAINT PK_order PRIMARY KEY(order_id, time),
CONSTRAINT FK_order FOREIGN KEY (order_id) REFERENCES my_order(order_id)
);
然后加入数据:
set @N=0;
insert into customer select @N:=@N+1 from mysql.help_topic LIMIT 1000;
set @N=0;
insert into my_order select @N:=@N+1, @N, 1 from mysql.help_topic a, mysql.help_topic b LIMIT 100000;
set @I=1;
insert into my_order select @N:=@N+1, @N, @I:=@I+1 from mysql.help_topic a, mysql.help_topic b LIMIT 600;
set @N=0;
insert into delivery select @N:=@N+1, '2010-05-10 15:22:02' from mysql.help_topic a, mysql.help_topic b LIMIT 600;
注意在my_order表中大多数记录的customer_id的值是1。
执行ANALYZE TABLE来更新Cardinality:
ANALYZE TABLE my_order;
ANALYZE TABLE delivery;
下面我们来执行一个Query:
mysql> SELECT count(*) FROM my_order a JOIN my_order b ON a.customer_id = b.customer_id and a.product_id = 123 JOIN delivery ON b.order_id = delivery.order_id AND delivery.time = '2010-05-10 15:22:02';
+----------+
| count(*) |
+----------+
| 600 |
+----------+
1 row in set (0.51 sec)
然后用Explain分析一下:
mysql> explain SELECT count(*) FROM my_order a JOIN my_order b ON a.customer_id = b.customer_id and a.product_id = 123 JOIN delivery ON b.order_id = delivery.order_id AND delivery.time = '2010-05-10 15:22:02';
+----+-------------+----------+--------+----------------------------------+----------------------+---------+------------------------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+--------+----------------------------------+----------------------+---------+------------------------------+------+-------------+
| 1 | SIMPLE | a | ref | PRIMARY,FK_order2customer_id | PRIMARY | 4 | const | 1 | Using index |
| 1 | SIMPLE | b | ref | UQ_order_id,FK_order2customer_id | FK_order2customer_id | 4 | oliver_test.a.customer_id | 167 | |
| 1 | SIMPLE | delivery | eq_ref | PRIMARY | PRIMARY | 12 | oliver_test.b.order_id,const | 1 | Using index |
+----+-------------+----------+--------+----------------------------------+----------------------+---------+------------------------------+------+-------------+
3 rows in set (0.00 sec)
从explain的结果来看,MySQL选择先把my_order自己做Join然后再去Join表delivery。第一个Join需要访问167行数据。 但实际情况是如何呢?
mysql> SELECT count(*) FROM my_order a JOIN my_order b ON a.customer_id = b.customer_id and a.product_id = 123 ;
+----------+
| count(*) |
+----------+
| 100000 |
+----------+
1 row in set (0.06 sec)
如果我们只执行第一个Join,实际将访问100000条记录,远远高于167,相反如果我们先做第二个Join, 我们只需要访问600行:
mysql> select count(*) from delivery JOIN my_order ON my_order.order_id = delivery.order_id AND delivery.time = '2010-05-10 15:22:02';
+----------+
| count(*) |
+----------+
| 600 |
+----------+
1 row in set (0.00 sec)
为什么MySQL会选择错误的顺序来执行呢?首先察看一下my_order表的索引:
mysql> show index from my_order;
+----------+------------+----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+----------+------------+----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| my_order | 0 | PRIMARY | 1 | product_id | A | 100600 | NULL | NULL | | BTREE | |
| my_order | 0 | PRIMARY | 2 | customer_id | A | 100600 | NULL | NULL | | BTREE | |
| my_order | 0 | UQ_order_id | 1 | order_id | A | 100600 | NULL | NULL | | BTREE | |
| my_order | 1 | FK_order2customer_id | 1 | customer_id | A | 602 | NULL | NULL | | BTREE | |
+----------+------------+----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
4 rows in set (0.00 sec)
我们可以看到对于索引FK_order2customer_id,Cardinality的值是602,MySQL在估算第一个Join的时候,假设索引是平均分布的,用总行数(100600)除以Cardinality,所以得到167,由于167小于第二个Join需要访问的行数600,所以选择先执行第一个Join。
如何解决这个问题?
修改Query中Join的顺序并用STRAIGHT_JOIN强制MySQL按照这个顺序执行,下面是新的Query:
mysql> SELECT count(*) FROM delivery STRAIGHT_JOIN my_order a ON a.order_id = delivery.order_id AND delivery.time = '2010-05-10 15:22:02' JOIN my_order b ON a.customer_id = b.customer_id and b.product_id = 123 ;
+----------+
| count(*) |
+----------+
| 600 |
+----------+
1 row in set (0.00 sec)
你可以看到查询时间从0.51秒变成0秒。
用Explain察看新的执行顺序:
mysql> explain SELECT count(*) FROM delivery STRAIGHT_JOIN my_order a ON a.order_id = delivery.order_id AND delivery.time = '2010-05-10 15:22:02' JOIN my_order b ON a.customer_id = b.customer_id and b.product_id = 123 ;
+----+-------------+----------+--------+----------------------------------+-------------+---------+-------------------------------+------+---------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+--------+----------------------------------+-------------+---------+-------------------------------+------+---------------------------------------------+
| 1 | SIMPLE | b | ref | PRIMARY,FK_order2customer_id | PRIMARY | 4 | const | 1 | Using index |
| 1 | SIMPLE | delivery | index | PRIMARY | PRIMARY | 12 | NULL | 600 | Using where; Using index; Using join buffer |
| 1 | SIMPLE | a | eq_ref | UQ_order_id,FK_order2customer_id | UQ_order_id | 4 | oliver_test.delivery.order_id | 1 | Using where |
+----+-------------+----------+--------+----------------------------------+-------------+---------+-------------------------------+------+---------------------------------------------+
3 rows in set (0.00 sec)
另外很重要的一点是这个问题在开始阶段通常不会注意,因为数据量少不会影响性能,但随着数据不断增加,一旦达到一定数量,就会突然出现并影响系统的性能。
分享到:
相关推荐
这类问题通常被称为“高基数”(high cardinality)问题,因为“基数”指的是类别变量的不同取值数量。"A Preprocessing Scheme for High-Cardinality Categorical Attributes"这个主题探讨的就是如何有效地处理这类...
当Cardinality异常时,可能导致数据库查询效率降低,进而影响应用程序的性能。 在上述问题中,报警显示php-fpm进程数量超过阈值,这通常是由于长时间运行的SQL查询阻塞了资源,导致进程堆积。在排查过程中,发现一...
例如,如果表T有1000行,列COL1上有500个非重复值,没有直方图和空值,那么查询"WHERE COL1=某值"时,优化器可能会假设数据均匀分布,预计返回2行,即Cardinality为2。准确的Cardinality有助于生成更高效的执行计划...
"Stream summarizer" 和 "cardinality estimator" 是流处理中的两个重要概念,特别是在大数据分析和实时计算中。 **流概括器(Stream Summarizer)** 流概括器是用于在数据流上执行轻量级聚合操作的工具,它能够快速...
总的来说,无论是CSS中的选择器权重,数据库设计中的关联比例,还是集合论中的元素数量,"cardinality"都是一个关键概念,它影响着系统的设计、效率和正确性。理解并掌握这些概念对于任何IT专业人士来说都是非常重要...
本文主要探讨了遗传算法的编码元数(coding alphabet cardinality)与函数周期(function period)之间的关系,并比较了不同编码元数的遗传算法在处理单周期函数和多周期函数时的优化效果。研究建立了一个基于一阶...
低基数列索引有效性PostgreSQL 和 DB2 的示例程序。DB2 备忘录将 DB2 本机库的路径配置为LD_LIBRARY_PATH 。 例子: LD_LIBRARY_PATH=/opt/ibm/db2/V10.5/lib64在 CLP 中使用分号作为分隔符。 db2 -t
基数cardinality是一个 Python 库,用于确定和检查任何可迭代对象的大小。 文档: : Python 包索引 (PyPI): ://pypi.python.org/pypi/cardinality/ 源代码和问题跟踪器: :
New cardinality estimation algorithms for HyperLogLog sketchesOtmar Ertl otmar.ertl@gmail.comFebruary 27, 2017This paper presents new methods to estimate the cardinalities of data sets recorded by ...
cardinality estimation algorithmPhilippe Flajolet1 and Éric Fusy1 and Olivier Gandouet2 and Frédéric Meunier11Algorithms Project, INRIA–Rocquencourt, F78153 Le Chesnay (France) 2LIRMM, 161 rue Ada...
Sybase PowerDesigner(简称PD)是目前市场占有率第一的数据库建模工具,其最新版本为15.1,不仅支持最新的SQL Server 2008数据库,而且在用户界面和功能上都有所增强。PowerDesigner主要被用于数据库建模,并且提供...
本资源摘要信息是关于快速索引的深入探讨,涵盖了Roaring模型、bitset和数组的使用、 cardinality TRACKING、位操作、 population count 等多方面的知识点。 Roaring模型 Roaring模型是一种混合模型,集合了数组、...
### RFID Cardinality Estimation with Blocker Tags #### 摘要与引言 本文提出了一种新的射频识别(RFID)系统中的标签数量估计方案——REB(RFID Estimation with Blocker Tags),该方案在包含遮挡标签...
在MATLAB环境中进行软件开发时,经常会涉及到各种算法的实现,比如图论中的经典问题——最大基数匹配。本文将深入探讨“MaximumCardinalitymatching”这一主题,以及如何在MATLAB中构建非加权的最大基数匹配。...
Better with Fewer Bits - Improving the Performance of Cardinality Estimation of Large Data Streams (INFOCOM2017)-计算机科学