[翻译]——MySQL 8.0 Histograms

2022-10-13,,,

前言: 本文是对这篇博客mysql 8.0 histograms的翻译,翻译如有不当的地方,敬请谅解,请尊重原创和翻译劳动成果,转载的时候请注明出处。谢谢!

 

英文原文地址:https://lefred.be/content/mysql-8-0-histograms/

 

翻译原文地址:

 

 

在mysql 8.0之前,mysql缺失了其它关系数据库中一个众所周知的功能:优化器的直方图

 

优化器团队(optimizer team)在越来越多的mysql dba的呼声中实现了这个功能。

 

 

直方图定义

 

但什么是直方图呢?我们来看维基百科的定义吧,直方图是数值数据分布的准确表示。 对于rdbms来说,直方图是特定列内数据分布的近似值。因此在mysql中,直方图能够帮助优化器找到最有效的执行计划。

 

直方图例子

 

为了说明直方图是如何影响优化器工作的,我会用dbt3生成的数据来演示。

 

我们准备了一个简单查询:

 

select * from orders  

  join customer on o_custkey = c_custkey 

where o_orderdate < '1993-01-01' 

  and c_mktsegment = "automobile"\g

 

让我们看一下传统的执行计划的explain输出,以及可视化方式(visual one):

 

mysql> explain select * from orders  

       join customer on o_custkey = c_custkey 

       where o_orderdate < '1993-01-01' and c_mktsegment = "automobile"\g

*************************** 1. row ***************************

           id: 1

  select_type: simple

        table: customer

   partitions: null

         type: all

possible_keys: primary

          key: null

      key_len: null

          ref: null

         rows: 149050

     filtered: 10.00

        extra: using where

*************************** 2. row ***************************

           id: 1

  select_type: simple

        table: orders

   partitions: null

         type: ref

possible_keys: i_o_custkey,i_o_orderdate

          key: i_o_custkey

      key_len: 5

          ref: dbt3.customer.c_custkey

         rows: 14

     filtered: 30.62

        extra: using where

2 rows in set, 1 warning (0.28 sec)

 

我们看到mysql首先对customer表做了一个全表扫描,并且它的选择估计记录(过滤)是10%;

 

 

 

接下来让我们运行这个查询(我使用了count(*)),然后我们来看看有多少行记录

 

mysql> select count(*) from orders  

       join customer on o_custkey = c_custkey 

       where o_orderdate < '1993-01-01' and c_mktsegment = "automobile"\g

*************************** 1. row ***************************

count(*): 45127

1 row in set (49.98 sec)

 

 

创建直方图

 

现在,我将在表customer上的字段c_mktsegment上创建一个直方图

 

mysql> analyze table customer update histogram on c_mktsegment with 1024 buckets;

+---------------+-----------+----------+---------------------------------------------------------+

| table         | op        | msg_type | msg_text                                                |

+---------------+-----------+----------+---------------------------------------------------------+

| dbt3.customer | histogram | status   | histogram statistics created for column 'c_mktsegment'. |

+---------------+-----------+----------+---------------------------------------------------------+

 

接下来,我们来验证查询的执行计划:

 

mysql> explain select * from orders  

               join customer on o_custkey = c_custkey 

               where o_orderdate < '1993-01-01' and c_mktsegment = "automobile"\g

*************************** 1. row ***************************

           id: 1

  select_type: simple

        table: orders

   partitions: null

         type: all

possible_keys: i_o_custkey,i_o_orderdate

          key: null

      key_len: null

          ref: null

         rows: 1494230

     filtered: 30.62

        extra: using where

*************************** 2. row ***************************

           id: 1

  select_type: simple

        table: customer

   partitions: null

         type: eq_ref

possible_keys: primary

          key: primary

      key_len: 4

          ref: dbt3.orders.o_custkey

         rows: 1

     filtered: 19.84

        extra: using where

2 rows in set, 1 warning (1.06 sec)

 

 

现在,使用直方图后,我们可以看到customer表的吸引力降低了,因为order表按条件过滤的行的百分比(30.62)几乎是customer表按条件过滤行的百分比的两倍(19.84%),这将导致低order表进行查找。

 

注意:这段感觉没有翻译恰当,英文原文如下,如果感觉翻译比较生硬,参考原文

 

now with the histogram we can see that it becomes less attractive to start with customer table since almost twice as many rows (19.84%) will cause look-ups into the order table.

 

 

 

优化器选择对order表进行全表扫描(full sacn),此时执行计划的代价看起来似乎还高一些,,让我们看一下sql的执行时间:

 

 

mysql> select count(*) from orders  

       join customer on o_custkey = c_custkey 

       where o_orderdate < '1993-01-01' and c_mktsegment = "automobile"\g

*************************** 1. row ***************************

count(*): 45127

1 row in set (6.35 sec)

 

sql语句的执行时间更短,明显比之前要快了

 

 

 

查看数据的分布

 

 

直方图数据存贮在information_schema.column_statistics表中,这个表的定义如下

 

 

+-------------+-------------+------+-----+---------+-------+

| field       | type        | null | key | default | extra |

+-------------+-------------+------+-----+---------+-------+

| schema_name | varchar(64) | no   |     | null    |       |

| table_name  | varchar(64) | no   |     | null    |       |

| column_name | varchar(64) | no   |     | null    |       |

| histogram   | json        | no   |     | null    |       |

+-------------+-------------+------+-----+---------+-------+

 

 

它的一条记录类似下面这样:

 

select schema_name, table_name, column_name, json_pretty(histogram) 

from information_schema.column_statistics 

where column_name = 'c_mktsegment'\g

*************************** 1. row ***************************

           schema_name: dbt3

            table_name: customer

           column_name: c_mktsegment

json_pretty(histogram): {

  "buckets": [

    [

      "base64:type254:qvvut01pqklmrq==",

      0.19837010534684954

    ],

    [

      "base64:type254:qlvjterjtkc=",

      0.3983104750546611

    ],

    [

      "base64:type254:rlvstkluvvjf",

      0.5978433710991851

    ],

    [

      "base64:type254:se9vu0vit0xe",

      0.799801232359372

    ],

    [

      "base64:type254:tufdselorvjz",

      1.0

    ]

  ],

  "data-type": "string",

  "null-values": 0.0,

  "collation-id": 255,

  "last-updated": "2018-03-02 20:21:48.271523",

  "sampling-rate": 0.6709158000670916,

  "histogram-type": "singleton",

  "number-of-buckets-specified": 1024

}

 

而且可以查看分布

 

select from_base64(substring_index(v, ':', -1)) value, concat(round(c*100,1),'%') cumulfreq, 

       concat(round((c - lag(c, 1, 0) over()) * 100,1), '%') freq  

from information_schema.column_statistics, json_table(histogram->'$.buckets', 

     '$[*]' columns(v varchar(60) path '$[0]', c double path '$[1]')) hist  

where schema_name  = 'dbt3' and table_name = 'customer' and column_name = 'c_mktsegment';

+------------+-----------+-------+

| value      | cumulfreq | freq  |

+------------+-----------+-------+

| automobile | 19.8%     | 19.8% |

| building   | 39.9%     | 20.1% |

| furniture  | 59.9%     | 19.9% |

| household  | 79.9%     | 20.1% |

| machinery  | 100.0%    | 20.1% |

+------------+-----------+-------+

 

你也可以用下面语法删除直方图信息。

 

 

mysql> analyze table customer drop histogram on c_mktsegment;

+---------------+-----------+----------+---------------------------------------------------------+

| table         | op        | msg_type | msg_text                                                |

+---------------+-----------+----------+---------------------------------------------------------+

| dbt3.customer | histogram | status   | histogram statistics removed for column 'c_mktsegment'. |

+---------------+-----------+----------+---------------------------------------------------------+

1 row in set (0.00 sec)

 

 

buckets

 

你会注意到,当我们创建一个直方图时,我们需要指定buckets的数量,事实上,数据被分成包含特定值以及他们基数(cardinality)的一组buckets,如果在上一个例子中检查直方图的类型,你会发现它是等宽直方图(singleton)

 

 

"histogram-type": "singleton",

 

 

这种类型的直方图最好的,因为基数是针对单个特定值。 如果这次我仅使用2个存储桶(buckets)来重新创建直方图(请记住,在c_mktsegment列中有4个不同的值):

 

 

mysql> analyze table customer update histogram on c_mktsegment with 2 buckets;

+---------------+-----------+----------+---------------------------------------------------------+

| table         | op        | msg_type | msg_text                                                |

+---------------+-----------+----------+---------------------------------------------------------+

| dbt3.customer | histogram | status   | histogram statistics created for column 'c_mktsegment'. |

+---------------+-----------+----------+---------------------------------------------------------+

 

如果我检查直方图的类型:

 

 

mysql> select schema_name, table_name, column_name, 

              json_pretty(histogram) 

       from information_schema.column_statistics 

      where column_name = 'c_mktsegment'\g

*************************** 1. row ***************************

           schema_name: dbt3

            table_name: customer

           column_name: c_mktsegment

json_pretty(histogram): {

  "buckets": [

    [

      "base64:type254:qvvut01pqklmrq==",

      "base64:type254:rlvstkluvvjf",

      0.5996992690844636,

      3

    ],

    [

      "base64:type254:se9vu0vit0xe",

      "base64:type254:tufdselorvjz",

      1.0,

      2

    ]

  ],

  "data-type": "string",

  "null-values": 0.0,

  "collation-id": 255,

  "last-updated": "2018-03-02 20:42:26.165898",

  "sampling-rate": 0.6709158000670916,

  "histogram-type": "equi-height",

  "number-of-buckets-specified": 2

}

 

现在的直方图类型是等高直方图,这意味着将连续范围的值分组到存储桶中,以使落入每个存储桶的数据项的数量相同。

 

 

结论:

 

直方图对那些不是索引中第一列的列非常有用,这些列用于join、in子查询(in-subqueries)或order by…limit的查询的where条件下使用。  

 

另外, 可以考虑尝试使用足够的存储通来获取等宽直方图。

《[翻译]——MySQL 8.0 Histograms.doc》

下载本文的Word格式文档,以方便收藏与打印。