Cardinality对表Join的影响

全部 Linux 数据库敏捷编程数据结构软件测试项目管理 Oracle 编程综合互联网 Erlang MySQL

浏览 2913 次

锁定老帖子主题：Cardinality对表Join的影响精华帖 (0) :: 良好帖 (0) :: 新手帖 (0) :: 隐藏帖 (0)
作者	正文
oliver_peng 等级: 初级会员性别: 文章: 12 积分: 90 来自: 深圳	发表时间：2010-05-21 相关推荐: what is Cardinality? mysql join 选择_0111mysql如何选择Join的顺序 join mysql 索引顺序_探究MySQL优化器对索引和JOIN顺序的选择 join mysql 索引顺序_0104探究MySQL优化器对索引和JOIN顺序的选择 Impala 如何确定 Join 策略更多相关推荐 MySQL 数据库如果你执行SHOW INDEX FROM TABLE_NAME察看所引信息，中间会有一列Cardinality，MySQL在解析多重Join的时候会根据Cardinality的信息决定选择什么路径来执行Join，这种算法的前提条件是索引数据时很平均分配的，但如果索引中的数据非常不均衡，会导致MySQL做出错误的选择。下面是一个完整的例子：首先创建三个测试表： CREATE TABLE customer ( -- Unique ID customer_id integer NOT NULL, CONSTRAINT PK_customer PRIMARY KEY(customer_id) ); CREATE TABLE my_order ( -- Unique ID order_id integer NOT NULL, product_id integer NOT NULL, customer_id integer NOT NULL, CONSTRAINT PK_order PRIMARY KEY(product_id, customer_id ), CONSTRAINT FK_order2customer_id FOREIGN KEY (customer_id) REFERENCES customer(customer_id), CONSTRAINT UQ_order_id UNIQUE(order_id) ); CREATE TABLE delivery ( order_id int NOT NULL, time datetime NOT NULL, CONSTRAINT PK_order PRIMARY KEY(order_id, time), CONSTRAINT FK_order FOREIGN KEY (order_id) REFERENCES my_order(order_id) ); 然后加入数据： set @N=0; insert into customer select @N:=@N+1 from mysql.help_topic LIMIT 1000; set @N=0; insert into my_order select @N:=@N+1, @N, 1 from mysql.help_topic a, mysql.help_topic b LIMIT 100000; set @I=1; insert into my_order select @N:=@N+1, @N, @I:=@I+1 from mysql.help_topic a, mysql.help_topic b LIMIT 600; set @N=0; insert into delivery select @N:=@N+1, '2010-05-10 15:22:02' from mysql.help_topic a, mysql.help_topic b LIMIT 600; 注意在my_order表中大多数记录的customer_id的值是1。执行ANALYZE TABLE来更新Cardinality： ANALYZE TABLE my_order; ANALYZE TABLE delivery; 下面我们来执行一个Query： mysql> SELECT count() FROM my_order a JOIN my_order b ON a.customer_id = b.customer_id and a.product_id = 123 JOIN delivery ON b.order_id = delivery.order_id AND delivery.time = '2010-05-10 15:22:02'; +----------+ \| count() \| +----------+ \| 600 \| +----------+ 1 row in set (0.51 sec) 然后用Explain分析一下： mysql> explain SELECT count() FROM my_order a JOIN my_order b ON a.customer_id = b.customer_id and a.product_id = 123 JOIN delivery ON b.order_id = delivery.order_id AND delivery.time = '2010-05-10 15:22:02'; +----+-------------+----------+--------+----------------------------------+----------------------+---------+------------------------------+------+-------------+ \| id \| select_type \| table \| type \| possible_keys \| key \| key_len \| ref \| rows \| Extra \| +----+-------------+----------+--------+----------------------------------+----------------------+---------+------------------------------+------+-------------+ \| 1 \| SIMPLE \| a \| ref \| PRIMARY,FK_order2customer_id \| PRIMARY \| 4 \| const \| 1 \| Using index \| \| 1 \| SIMPLE \| b \| ref \| UQ_order_id,FK_order2customer_id \| FK_order2customer_id \| 4 \| oliver_test.a.customer_id \| 167 \| \| \| 1 \| SIMPLE \| delivery \| eq_ref \| PRIMARY \| PRIMARY \| 12 \| oliver_test.b.order_id,const \| 1 \| Using index \| +----+-------------+----------+--------+----------------------------------+----------------------+---------+------------------------------+------+-------------+ 3 rows in set (0.00 sec) 从explain的结果来看，MySQL选择先把my_order自己做Join然后再去Join表delivery。第一个Join需要访问167行数据。但实际情况是如何呢？ mysql> SELECT count() FROM my_order a JOIN my_order b ON a.customer_id = b.customer_id and a.product_id = 123 ; +----------+ \| count() \| +----------+ \| 100000 \| +----------+ 1 row in set (0.06 sec) 如果我们只执行第一个Join，实际将访问100000条记录，远远高于167，相反如果我们先做第二个Join，我们只需要访问600行： mysql> select count() from delivery JOIN my_order ON my_order.order_id = delivery.order_id AND delivery.time = '2010-05-10 15:22:02'; +----------+ \| count() \| +----------+ \| 600 \| +----------+ 1 row in set (0.00 sec) 为什么MySQL会选择错误的顺序来执行呢？首先察看一下my_order表的索引： mysql> show index from my_order; +----------+------------+----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+ \| Table \| Non_unique \| Key_name \| Seq_in_index \| Column_name \| Collation \| Cardinality \| Sub_part \| Packed \| Null \| Index_type \| Comment \| +----------+------------+----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+ \| my_order \| 0 \| PRIMARY \| 1 \| product_id \| A \| 100600 \| NULL \| NULL \| \| BTREE \| \| \| my_order \| 0 \| PRIMARY \| 2 \| customer_id \| A \| 100600 \| NULL \| NULL \| \| BTREE \| \| \| my_order \| 0 \| UQ_order_id \| 1 \| order_id \| A \| 100600 \| NULL \| NULL \| \| BTREE \| \| \| my_order \| 1 \| FK_order2customer_id \| 1 \| customer_id \| A \| 602 \| NULL \| NULL \| \| BTREE \| \| +----------+------------+----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+ 4 rows in set (0.00 sec) 我们可以看到对于索引FK_order2customer_id，Cardinality的值是602，MySQL在估算第一个Join的时候，假设索引是平均分布的，用总行数（100600）除以Cardinality，所以得到167，由于167小于第二个Join需要访问的行数600，所以选择先执行第一个Join。如何解决这个问题？修改Query中Join的顺序并用STRAIGHT_JOIN强制MySQL按照这个顺序执行，下面是新的Query: mysql> SELECT count() FROM delivery STRAIGHT_JOIN my_order a ON a.order_id = delivery.order_id AND delivery.time = '2010-05-10 15:22:02' JOIN my_order b ON a.customer_id = b.customer_id and b.product_id = 123 ; +----------+ \| count() \| +----------+ \| 600 \| +----------+ 1 row in set (0.00 sec) 你可以看到查询时间从0.51秒变成0秒。用Explain察看新的执行顺序： mysql> explain SELECT count() FROM delivery STRAIGHT_JOIN my_order a ON a.order_id = delivery.order_id AND delivery.time = '2010-05-10 15:22:02' JOIN my_order b ON a.customer_id = b.customer_id and b.product_id = 123 ; +----+-------------+----------+--------+----------------------------------+-------------+---------+-------------------------------+------+---------------------------------------------+ \| id \| select_type \| table \| type \| possible_keys \| key \| key_len \| ref \| rows \| Extra \| +----+-------------+----------+--------+----------------------------------+-------------+---------+-------------------------------+------+---------------------------------------------+ \| 1 \| SIMPLE \| b \| ref \| PRIMARY,FK_order2customer_id \| PRIMARY \| 4 \| const \| 1 \| Using index \| \| 1 \| SIMPLE \| delivery \| index \| PRIMARY \| PRIMARY \| 12 \| NULL \| 600 \| Using where; Using index; Using join buffer \| \| 1 \| SIMPLE \| a \| eq_ref \| UQ_order_id,FK_order2customer_id \| UQ_order_id \| 4 \| oliver_test.delivery.order_id \| 1 \| Using where \| +----+-------------+----------+--------+----------------------------------+-------------+---------+-------------------------------+------+---------------------------------------------+ 3 rows in set (0.00 sec) 另外很重要的一点是这个问题在开始阶段通常不会注意，因为数据量少不会影响性能，但随着数据不断增加，一旦达到一定数量，就会突然出现并影响系统的性能。声明：ITeye文章版权属于作者，受法律保护。没有作者书面许可不得转载。推荐链接
返回顶楼

论坛首页 → 综合技术版

跳转论坛: