MySQL HeatWave Best Practices Series: Data load to HeatWave with MySQL Autopilot

 

In the first blog of the best practices series, we discussed the best practices for schema design. In this blog, we go into details of best practices of loading data to HeatWave. This is the first criteria of running queries in HeatWave for query acceleration. Go to HeatWave documentation to see the pre-requisites of running query in HeatWave.

Before we dive into data load, let’s talk about the differences of how data is stored in MySQL and HeatWave.

 

Background

MySQL HeatWave is a fully managed service that enables developers and database administrators to run OLTP, OLAP and machine learning workloads directly from their MySQL database without the need of ETL. HeatWave accelerates MySQL query performance by order of magnitudes and allows users to analyze real-time data.

To provide a single service for 2 different workloads, MySQL HeatWave consists of two query processing engines with two completely different physical data layouts, query processing models, supported query processing algorithms and performance characteristics:

  1. MySQL InnoDB storage engine – a row major, single threaded, transactional processing database engine
  2. HeatWave – a columnar major, in-memory, vectorized, and massively parallel pushed based query processing engine.

 

Analytics or complex queries which require significant query acceleration are offloaded to HeatWave, while rest of the queries are executed on the MySQL InnoDB storage engine.

MySQL HeatWave

Figure 1 – InnoDB & HeatWave as primary & secondary storage engines

InnoDB storage engine is modelled as the primary storage engine. HeatWave is modelled as a secondary storage engine and an up-to-date shadow copy of data is maintained in HeatWave. Before HeatWave can be used for query acceleration, InnoDB tables must be transformed and loaded to HeatWave.

 

HeatWave Data Format

HeatWave stores tables in hybrid columnar format to facilitate efficient query processing, change propagation, and other data management operations. A table in InnoDB is transformed into HeatWave partitions, each partition is sliced into data chunks of fixed sizes and each such chunk contains several continuous memory buffers called vector buffers. Vector buffers contain data from columns of the table. There may be several such vector buffers per column, per chunk and each vector buffer is encoded or compressed using HeatWave specific algorithms.

MySQL HeatWave

Figure 2: A table with three columns, C1, C2, & C3, in InnoDB & HeatWave data formats

 

HeatWave Data Load Flow

There are 2 commands to load table to HeatWave:

  1. ALTER TABLE <table> SECEONDARY_ENGINE=RAPID This command specifies the table to use HeatWave as the secondary engine
  2. ALTER TABLE <table> SECONDARY_LOAD – This command triggers the actual load operation from InnoDB to HeatWave

MySQL HeatWave

Figure 3: Preparation phase for load – table is marked as secondary engine

 

ALTER TABLE <table> SECONDARY_ENGINE=RAPID is a DDL operation in MySQL. This command takes an exclusive lock on the table and updates the table related metadata in the data dictionary of MySQL server. In absence of any contention, the operation returns instantaneously.

MySQL HeatWave

Fig. 4: Actual load operation for the table which has been marked as secondary engine

 

ALTER TABLE <table> SECONDARY LOAD operation is also a DDL operation in MySQL, and thus it takes an exclusive lock on the table for small duration of time, eventually downgrading it to a read lock of the table. The load operation involves following steps:

  1. Table data is read from InnoDB in parallel, using parallel scan interface. The default degree of parallelism is determined based on MySQL instance type and shape
  2. InnoDB table data is converted into HeatWave format; it includes row major to column major transformation, encoding and compression
  3. After that, the table data in HeatWave format is sent over the network to be stored in HeatWave cluster; the data is encrypted before transferred over the network
  4. Finally, the table data is persisted to Scale-out Storage Layer for efficient data management

MySQL HeatWave

Figure 5: Steps in load operation to load data from InnoDB to HeatWave

 

Loading data to HeatWave

Before loading data to HeatWave, consider which tables or columns are needed for query to accelerate in HeatWave. You can minimize the memory/cluster size needed, hence lower your cost, if you don’t need to load all your MySQL data to HeatWave.

 

There are 2 ways to load data to HeatWave:

  1. Loading data manually – We have extended ALTER TABLE statement of MySQL to support load of tables from primary storage engine to secondary storage engine. Once the desired tables are loaded to HeatWave engine, they are kept up to date automatically.
  2. Loading data using MySQL Parallel Load – MySQL Parallel Load is one of the features of MySQL AutoPilot which automates and optimizes the data load process using advanced machine learning techniques. It minimizes or eliminates manual steps during the load process, helps in optimizing load performance and memory utilization. Coming up with values of critical load parameters, such as degree of parallelism, type of encoding etc., are taken care by this module.

We recommend using Auto Parallel Load to load data to HeatWave. Below sections describe the details and benefits.

 

Loading tables to Heatwave using Auto Parallel Load

Auto Parallel Load eliminates the laborious and manual tasks such as:

  1. Excluding schemas, tables, and columns that cannot be loaded
  2. Specifying HeatWave as secondary engine for tables
  3. Estimating memory needed for tables set in task 2 above
  4. Checking memory capacity in HeatWave for load to complete
  5. Optimizing load parallelism based on machine-learning models
  6. Loading the data with right set of parameters with values determined in above steps

Example: Load TPCH schema using Auto Parallel Load

To load all tables under TPCH schema, run

CALL sys.heatwave_load(JSON_ARRAY('tpch1'), NULL);

 

Panels with useful information show details regarding the load operation:

  1. Offload Analysis – This panel shows information about the schema, whether there are unsupported tables/columns, and the total number of tables/columns that are being loaded to HeatWave.

MySQL HeatWave

Fig. 6: Offload Analysis panel during Auto Parallel Load execution

 

  1. Capacity estimation – this panel shows memory estimation required to load the data to the MySQL node and to the HeatWave nodes. This allows you to see if there is enough memory available in both the MySQL node and HeatWave nodes. It also shows other relevant information such as number of string columns encoded as DICTIONARY/VARLEN and an estimated load time.

MySQL HeatWave

Figure 7: Capacity Estimation panel during Auto Parallel Load execution

 

Tracking Load Progress

Auto Parallel Load shows per-table load progress while it loads multiple tables. You can also track load progress using performance_schema.rpd_tables:

SELECT statement in a separate session: SELECT NAME, ID, LOAD_STATUS, LOAD_PROGRESS FROM performance_schema.rpd_tables NATURAL JOIN performance_schema.rpd_table_id;

 

MySQL HeatWave

 

Figure 8. Show load progress

 

Auto Parallel Load Dryrun Mode

Executing Auto Parallel Load in ‘dryrun’ mode allows you to review information regarding the load without actually loading the data to HeatWave.

CALL sys.heatwave_load(JSON_ARRAY('tpch1'),JSON_OBJECT('mode','dryrun'))

 

You can get individual load commands by running the following:

SELECT log->>”$.sql” AS “Load Script” FROM sys.heatwave_autopilot_report WHERE type = “sql” ORDER BY id;

 

MySQL HeatWave

Figure 9: Load script from Auto Parallel Load

 

You can also see the number of innodb_parallel_read_threads set for each table. This is calculated based on advanced machine learning model to provide maximum load speed through parallelism and minimize memory used for the table in HeatWave.

 

Load schema with unsupported tables and columns with unsupported data type

Auto parallel load can detect and exclude unsupported tables or columns with unsupported data type to load to HeatWave automatically. Following our example above, another schema “test” with unsupported tables and columns is being loaded to HeatWave.

 

NOTE: Auto Parallel Load can load multiple schemas in one command:

CALL sys.heatwave_load(JSON_ARRAY(‘tpch1′,’test’),NULL);

MySQL HeatWave

Figure 10: Auto Parallel Load automatically detects and excludes unsupported tables/columns

 

To identify the exact issue with the unsupported tables/columns, execute

SELECT * FROM sys.heatwave_autopilot_report WHERE type=’warn’;

which prints out all warnings generated by Auto Parallel Load.

MySQL HeatWave

Figure 11: Fetching all warnings generated by Auto Parallel Load after execution.

Table `t1` cannot be loaded because it does not have a PRIMARY KEY which is required for table to load to HeatWave. Table `t3` contains column with data type `blob`. This table can be partially loaded by disabling column unsupported_col_name. Auto Parallel Load automatically excludes `t1` and excludes unsupported_col_name from `t3` from loading to HeatWave.

To see the commands for excluding the column, run:

SELECT log->>”$.sql” AS “Load Script” FROM sys.heatwave_autopilot_report WHERE type = “sql” ORDER BY id;

MySQL HeatWave

Figure. 12: Auto Parallel Load generated commands to disable unsupported tables/columns.

 

Conclusion

MySQL HeatWave supports easy and intuitive way to load tables from InnoDB to HeatWave for query acceleration using a familiar set of `ALTER TABLE` commands. The entire load experience is further enhanced by MySQL AutoPilot (Auto Parallel Load). MySQL AutoPilot uses advanced machine learning techniques to take complex optimization decisions during load and helps in eliminating manual error prone steps.

 

Please stay turn for the next couple topics of the MySQL HeatWave best practices on query performance and monitoring.

Addition Resources