论坛首页 综合技术论坛

Cardinality对表Join的影响

浏览 2908 次
精华帖 (0) :: 良好帖 (0) :: 新手帖 (0) :: 隐藏帖 (0)
作者 正文
   发表时间:2010-05-21  
如果你执行SHOW INDEX FROM TABLE_NAME察看所引信息,中间会有一列Cardinality,MySQL在解析多重Join的时候会根据Cardinality的信息决定选择什么路径来执行Join,这种算法的前提条件是索引数据时很平均分配的,但如果索引中的数据非常不均衡,会导致MySQL做出错误的选择。 下面是一个完整的例子:

首先创建三个测试表:

CREATE TABLE customer (
-- Unique ID
customer_id integer NOT NULL,

CONSTRAINT PK_customer PRIMARY KEY(customer_id)
);

CREATE TABLE my_order (
-- Unique ID
order_id integer NOT NULL,

product_id integer NOT NULL,

customer_id integer NOT NULL,

CONSTRAINT PK_order PRIMARY KEY(product_id, customer_id ),
CONSTRAINT FK_order2customer_id FOREIGN KEY (customer_id) REFERENCES customer(customer_id),
CONSTRAINT UQ_order_id UNIQUE(order_id)

);


CREATE TABLE delivery (
 
  order_id int NOT NULL,

  time datetime NOT NULL,

  CONSTRAINT PK_order PRIMARY KEY(order_id, time),
  CONSTRAINT FK_order FOREIGN KEY (order_id) REFERENCES my_order(order_id)
);

然后加入数据:

set @N=0;

insert into customer select @N:=@N+1 from mysql.help_topic  LIMIT 1000;

set @N=0;

insert into my_order select @N:=@N+1, @N, 1 from mysql.help_topic a, mysql.help_topic b LIMIT 100000;

set @I=1;

insert into my_order select @N:=@N+1, @N, @I:=@I+1  from mysql.help_topic a, mysql.help_topic b LIMIT 600;

set @N=0;

insert into delivery select @N:=@N+1, '2010-05-10 15:22:02'  from mysql.help_topic a, mysql.help_topic b LIMIT 600;

注意在my_order表中大多数记录的customer_id的值是1。

执行ANALYZE TABLE来更新Cardinality:

ANALYZE TABLE my_order;

ANALYZE TABLE delivery;

下面我们来执行一个Query:

mysql> SELECT count(*) FROM my_order a JOIN my_order b ON  a.customer_id = b.customer_id and a.product_id = 123 JOIN delivery ON b.order_id = delivery.order_id AND delivery.time = '2010-05-10 15:22:02';
+----------+
| count(*) |
+----------+
|      600 |
+----------+
1 row in set (0.51 sec)

然后用Explain分析一下:

mysql> explain SELECT count(*) FROM my_order a JOIN my_order b ON  a.customer_id = b.customer_id and a.product_id = 123 JOIN delivery ON b.order_id = delivery.order_id AND delivery.time = '2010-05-10 15:22:02';
+----+-------------+----------+--------+----------------------------------+----------------------+---------+------------------------------+------+-------------+
| id | select_type | table    | type   | possible_keys                    | key                  | key_len | ref                          | rows | Extra       |
+----+-------------+----------+--------+----------------------------------+----------------------+---------+------------------------------+------+-------------+
|  1 | SIMPLE      | a        | ref    | PRIMARY,FK_order2customer_id     | PRIMARY              | 4       | const                        |    1 | Using index |
|  1 | SIMPLE      | b        | ref    | UQ_order_id,FK_order2customer_id | FK_order2customer_id | 4       | oliver_test.a.customer_id    |  167 |             |
|  1 | SIMPLE      | delivery | eq_ref | PRIMARY                          | PRIMARY              | 12      | oliver_test.b.order_id,const |    1 | Using index |
+----+-------------+----------+--------+----------------------------------+----------------------+---------+------------------------------+------+-------------+
3 rows in set (0.00 sec)

从explain的结果来看,MySQL选择先把my_order自己做Join然后再去Join表delivery。第一个Join需要访问167行数据。 但实际情况是如何呢?

mysql> SELECT count(*) FROM my_order a JOIN my_order b ON  a.customer_id = b.customer_id and a.product_id = 123 ;
+----------+
| count(*) |
+----------+
|   100000 |
+----------+
1 row in set (0.06 sec)

如果我们只执行第一个Join,实际将访问100000条记录,远远高于167,相反如果我们先做第二个Join, 我们只需要访问600行:

mysql> select count(*) from delivery JOIN my_order ON my_order.order_id = delivery.order_id AND delivery.time = '2010-05-10 15:22:02';
+----------+
| count(*) |
+----------+
|      600 |
+----------+
1 row in set (0.00 sec)

为什么MySQL会选择错误的顺序来执行呢?首先察看一下my_order表的索引:

mysql> show index from my_order;
+----------+------------+----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table    | Non_unique | Key_name             | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+----------+------------+----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| my_order |          0 | PRIMARY              |            1 | product_id  | A         |      100600 |     NULL | NULL   |      | BTREE      |         |
| my_order |          0 | PRIMARY              |            2 | customer_id | A         |      100600 |     NULL | NULL   |      | BTREE      |         |
| my_order |          0 | UQ_order_id          |            1 | order_id    | A         |      100600 |     NULL | NULL   |      | BTREE      |         |
| my_order |          1 | FK_order2customer_id |            1 | customer_id | A         |         602 |     NULL | NULL   |      | BTREE      |         |
+----------+------------+----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
4 rows in set (0.00 sec)

我们可以看到对于索引FK_order2customer_id,Cardinality的值是602,MySQL在估算第一个Join的时候,假设索引是平均分布的,用总行数(100600)除以Cardinality,所以得到167,由于167小于第二个Join需要访问的行数600,所以选择先执行第一个Join。

如何解决这个问题?

修改Query中Join的顺序并用STRAIGHT_JOIN强制MySQL按照这个顺序执行,下面是新的Query:

mysql> SELECT count(*) FROM delivery STRAIGHT_JOIN my_order a ON a.order_id = delivery.order_id AND delivery.time = '2010-05-10 15:22:02' JOIN my_order b ON  a.customer_id = b.customer_id and b.product_id = 123 ;
+----------+
| count(*) |
+----------+
|      600 |
+----------+
1 row in set (0.00 sec)

你可以看到查询时间从0.51秒变成0秒。

用Explain察看新的执行顺序:

mysql> explain SELECT count(*) FROM delivery STRAIGHT_JOIN my_order a ON a.order_id = delivery.order_id AND delivery.time = '2010-05-10 15:22:02' JOIN my_order b ON  a.customer_id = b.customer_id and b.product_id = 123 ;
+----+-------------+----------+--------+----------------------------------+-------------+---------+-------------------------------+------+---------------------------------------------+
| id | select_type | table    | type   | possible_keys                    | key         | key_len | ref                           | rows | Extra                                       |
+----+-------------+----------+--------+----------------------------------+-------------+---------+-------------------------------+------+---------------------------------------------+
|  1 | SIMPLE      | b        | ref    | PRIMARY,FK_order2customer_id     | PRIMARY     | 4       | const                         |    1 | Using index                                 |
|  1 | SIMPLE      | delivery | index  | PRIMARY                          | PRIMARY     | 12      | NULL                          |  600 | Using where; Using index; Using join buffer |
|  1 | SIMPLE      | a        | eq_ref | UQ_order_id,FK_order2customer_id | UQ_order_id | 4       | oliver_test.delivery.order_id |    1 | Using where                                 |
+----+-------------+----------+--------+----------------------------------+-------------+---------+-------------------------------+------+---------------------------------------------+
3 rows in set (0.00 sec)

另外很重要的一点是这个问题在开始阶段通常不会注意,因为数据量少不会影响性能,但随着数据不断增加,一旦达到一定数量,就会突然出现并影响系统的性能。

论坛首页 综合技术版

跳转论坛:
Global site tag (gtag.js) - Google Analytics