Tips for bulk loading
By Henrik Ingo on Jan 08, 2009
A collague at Sun asked me for tips on how to tune MySQL to get fast bulk loads of csv files. (The use case being, insert data into a data warehouse during a nightly load window.) Considering that I spend most of my time working with MySQL Cluster, I was amazed at how many tips I could already come up with both for MyISAM and InnoDB. So I thought it might be interesting to share, and also: Do you have any more tips to add?
[A Sun partner] have requested to do a POC to test mysql's bulk loading capabilities from CSV files. They have about 20GB of compressed CSV files, and they want to see how long it takes to load them.
They haven't specified which storage engine they intend to use yet.
Good start, well defined PoC. The storage engine question actually is significant here.
- MyISAM typically may be up to twice as fast for bulk loads compared to InnoDB. (But there may be some tuning that makes InnoDB closer to MyISAM, see below.) MyISAM should also be faster on reads.
- On the downside, MyISAM doesn't offer transactions which also implies no crash recovery!
I would recommend MyISAM if:
- Customer/Partner understands the below points, ie is not completely ignorant of DB/MySQL concepts. If he is/appears ignorant, then using InnoDB is simpler and safer.
- Apart from the bulk load, there is no transactional activity on the database. (not at all, or very little)
- It is feasible for the customer to load the data for instance at night time with something like LOCK TABLE...; LOAD DATA INFILE...; UNLOCK TABLE; In other words, nobody will read data or generate reports during the load period, eg it could be a nightly window or something. (http://dev.mysql.com/doc/refman/5.0/en/lock-tables.html)
- Crash recovery is not an issue. This basically means that in case MySQL (or the whole server) crashes and the DB cannot be repaired, there is always a backup, ZFS snapshot or other means available to get back to the state after the most recent successful bulk load. And there were no transactions after the bulk load or it is ok for them to be lost. (Note that if there is no write activity, then a crash is \*not\* a problem, problems only happen if there is a crash during LOAD/INSERT/UPDATE.)
Note that InnoDB is not a bad choice for this task either, and we do have big datawarehouses running on InnoDB.
I've googled, and have found the following parameters:
MyISAM: bulk_insert_buffer_size, myisam_sort_buffer_size, key_buffer_size
InnoDB: innodb_buffer_pool_size, innodb_log_file_size, innodb_flush_logs_at_trx_commit, innodb_flush_method, innodb_doublewrite
Besides the parameters above, are there any others that I should be looking at? Also, do you have any pointers to how I can derive the optimum values of the parameters above?
Actually, I'm not an expert on InnoDB and MyISAM parameters. What you've found seem to be what people generally tune, yes. But I might be missing something myself.
In addition, I would think about the following:
- For both, but especially for MyISAM, try to find ways how to make the inserts happen concurrently. Obviously you should have each table in a separate csv file, but also data going into the same table can be split into multiple parts. Then you can LOAD DATA INFILE... simultaneously for each csv file.
- For MyISAM, use the CONCURRENT keyword to enable inserts to happen concurrently, and if needed use "SET GLOBAL concurrent_insert=2":
- If you don't use replication (you probably don't), turn off the binary log by commenting out log-bin in the my.cnf file.
- It is a big advantage if the data in the CSV files is already ordered by primary key. (Because the InnoDB primary key is a clustered index, so it will organize the table physically to be in primary key order anyway.)
- For the bulk insert, you should consider turning off foreign key checking and unique index checking.
- Using InnoDB plugin, you can speed things up by inserting data into a table without indexes (only define primary key, of course), and then create the indexes separately with alter table. (on an existing table you can also consider dropping existing indexes, the benefit of this would depend case by case).
InnoDB plugin is not supplied by Sun but directly from Oracle. So we cannot sell it with the commercial (oem) license, but for customer's that use it for internal use we do support it on a best effort basis (as we do for other non-standard MySQL uses too).
Also came across this while writing this: