X

MySQL and HeatWave

How to Create an Index Based on an Expression, with Generated Columns

Mark Lewin
MySQL Curriculum Developer

One of the many great new features included in MySQL 5.7 is the generated column. A generated column is a column in a table where the data is calculated for you, based on an expression that you provide.

To understand the benefits of using generated columns, let's consider a very simple example. Imagine that I want to find out how many new hires my organization had in the year 2000. Here's my table:

mysql> DESC employees;
+------------+---------------+------+-----+---------+-------+
| Field      | Type          | Null | Key | Default | Extra |	
+------------+---------------+------+-----+---------+-------+
| emp_no     | int(11)       | NO   | PRI | NULL    |       |
| birth_date | date          | NO   |     | NULL    |       |
| first_name | varchar(14)   | NO   |     | NULL    |       |
| last_name  | varchar(16)   | NO   |     | NULL    |       |
| gender     | enum('M','F') | NO   |     | NULL    |       |
| hire_date  | date          | NO   |     | NULL    |       |
+------------+---------------+------+-----+---------+-------+
6 rows in set (#.## sec)

It's a simple enough query: I just use the YEAR() function to extract the year from the hire_date column. However, the use of the YEAR() function prevents the optimizer from using hire_date as an index and results in a full table scan:

mysql> EXPLAIN SELECT COUNT(emp_no)
    -> FROM employees WHERE YEAR(hire_date)=2000\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: employees
   partitions: NULL
        type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 299733
     filtered: 100.00
        Extra: Using where
1 row in set, 1 warning (#.## sec)

In a trivial query like this one, I might just choose to live with the overhead. But imagine a frequently-executed query that examines millions of records. I would want to optimize it as much as possible, and to do that I need to be able to use the year as an index. I could create a new column and populate it with the year portion of the date. But then I'd need to create a trigger on the table that keeps this new column up to date when the data changes. That sounds like a lot of work to me. A much better approach would be to use a generated column, which will create and maintain the data on my behalf:

mysql> ALTER TABLE employees 
    -> ADD COLUMN hire_year SMALLINT
    -> GENERATED ALWAYS AS (YEAR(hire_date)) STORED;
Query OK, 300024 rows affected (#.## sec)
Records: 300024  Duplicates: 0  Warnings: 0

In the above statement, the GENERATED ALWAYS clause tells MySQL that this is a generated column, and provides the expression used to create the data in the column. The STORED keyword specifies that this data should be stored with the table and can be indexed. The default option is VIRTUAL, which does not store the column values but evaluates them each time a row is read. Generated columns of the VIRTUAL type can be used only for secondary indexes.

I can now add an index based on this new column. The optimizer can use it to improve the performance of my query:

mysql> CREATE INDEX idx_hire_year 
    -> ON employees(hire_year);
Query OK, 0 rows affected (#.## sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> DESC employees;
+------------+---------------+------+-----+---------+------------------+
| Field      | Type          | Null | Key | Default | Extra            |
+------------+---------------+------+-----+---------+------------------+
| emp_no     | int(11)       | NO   | PRI | NULL    |                  |
| birth_date | date          | NO   |     | NULL    |                  |
| first_name | varchar(14)   | NO   |     | NULL    |                  |
| last_name  | varchar(16)   | NO   |     | NULL    |                  |
| gender     | enum('M','F') | NO   |     | NULL    |                  |
| hire_date  | date          | NO   |     | NULL    |                  |
| hire_year  | smallint(6)   | YES  | MUL | NULL    | STORED GENERATED |
+------------+---------------+------+-----+---------+------------------+
7 rows in set (#.## sec)

mysql> EXPLAIN SELECT COUNT(emp_no) 
    -> FROM employees WHERE hire_year=2000\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: employees
   partitions: NULL
         type: ref
possible_keys: idx_hire_year
          key: idx_hire_year
      key_len: 3
          ref: const
         rows: 13
     filtered: 100.00
        Extra: Using index
1 row in set, 1 warning (#.## sec)

As I hope this brief post demonstrates, generated columns are great for working with columns of data that cannot be indexed directly, including indexing fields in JSON documents. They are also useful for "centralizing" complex conditions that multiple queries rely on and ensuring that every query uses exactly the same condition.

Find out more about MySQL Generated Columns in the MySQL Reference Manual.

Be the first to comment

Comments ( 0 )
Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.