Tips for bulk loading

A colleague at Sun asked me for tips on how to tune MySQL to get fast bulk loads of CSV files. (The use case being: insert data into a data warehouse during a nightly load window.) Considering that I spend most of my time working with MySQL Cluster, I was amazed at how many tips I could come up with for both MyISAM and InnoDB. So I thought it might be interesting to share. And also: do you have any more tips to add?

[A Sun partner] has requested a PoC to test MySQL's bulk loading capabilities from CSV files. They have about 20GB of compressed CSV files, and they want to see how long it takes to load them.

They haven't specified which storage engine they intend to use yet.

A good start, and a well-defined PoC. The storage engine question is actually significant here.

- MyISAM can typically be up to twice as fast for bulk loads as InnoDB. (But there is some tuning that brings InnoDB closer to MyISAM; see below.) MyISAM should also be faster on reads.

- On the downside, MyISAM doesn't offer transactions, which also implies no crash recovery!

I would recommend MyISAM if:

- The customer/partner understands the points below, i.e. is not completely new to DB/MySQL concepts. If they are (or appear to be), then using InnoDB is simpler and safer.

- Apart from the bulk load, there is no transactional activity on the database (none at all, or very little).

- It is feasible for the customer to load the data, for instance at night, with something like LOCK TABLES ...; LOAD DATA INFILE ...; UNLOCK TABLES;. In other words, nobody will read data or generate reports during the load period; e.g. it could be a nightly window.

- Crash recovery is not an issue. This basically means that in case MySQL (or the whole server) crashes and the DB cannot be repaired, there is always a backup, ZFS snapshot or other means available to get back to the state after the most recent successful bulk load. And there were no transactions after the bulk load, or it is OK for them to be lost. (Note that if there is no write activity, then a crash is *not* a problem; problems only happen if there is a crash during LOAD/INSERT/UPDATE.)
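As a sketch, the nightly load window described above could look something like this (the table and file names are invented for illustration, and the FIELDS/LINES clauses would have to match the actual CSV format):

```sql
-- Lock out readers for the duration of the load (MyISAM):
LOCK TABLES fact_sales WRITE;

-- Load one night's worth of data from a CSV file:
LOAD DATA INFILE '/data/loads/sales_20090108.csv'
  INTO TABLE fact_sales
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
  LINES TERMINATED BY '\n';

-- Let readers back in:
UNLOCK TABLES;
```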

Note that InnoDB is not a bad choice for this task either, and we do have big data warehouses running on InnoDB.

I've googled, and have found the following parameters:

MyISAM: bulk_insert_buffer_size, myisam_sort_buffer_size, key_buffer_size

InnoDB: innodb_buffer_pool_size, innodb_log_file_size, innodb_flush_log_at_trx_commit, innodb_flush_method, innodb_doublewrite

Besides the parameters above, are there any others that I should be looking at? Also, do you have any pointers to how I can derive the optimum values of the parameters above?

Actually, I'm not an expert on InnoDB and MyISAM parameters. What you've found seems to be what people generally tune, yes. But I might be missing something myself.
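For what it's worth, a load-oriented my.cnf fragment covering those parameters might look something like this. The values are rough guesses for a dedicated server with, say, 8 GB of RAM; they are a starting point to benchmark against, not recommendations:

```ini
[mysqld]
# MyISAM
key_buffer_size          = 2G    # cache for index blocks
bulk_insert_buffer_size  = 256M  # per-connection cache used by bulk inserts
myisam_sort_buffer_size  = 512M  # used when creating/repairing indexes

# InnoDB
innodb_buffer_pool_size         = 4G
innodb_log_file_size            = 512M  # bigger logs = fewer checkpoints during load
innodb_flush_log_at_trx_commit  = 2     # relaxed durability during the load window
innodb_flush_method             = O_DIRECT
innodb_doublewrite              = 0     # only if backups/snapshots cover recovery

# If replication is not used, leave the binary log off:
# log-bin
```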

In addition, I would think about the following:

- For both engines, but especially for MyISAM, try to find ways to make the inserts happen concurrently. Obviously you should have each table in a separate CSV file, but data going into the same table can also be split into multiple parts. Then you can run LOAD DATA INFILE ... simultaneously for each CSV file.
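A toy demonstration of the splitting step (file and table names are invented for illustration):

```shell
# Generate a small sample CSV, then split it into 4 chunks of 25 rows
# so that several LOAD DATA INFILE clients can run in parallel.
seq 1 100 > orders.csv
split -l 25 orders.csv orders_part_      # produces orders_part_aa .. orders_part_ad

# Each chunk can then be loaded from its own connection, e.g.:
#   mysql dbname -e "LOAD DATA INFILE '/path/orders_part_aa' INTO TABLE orders" &
ls orders_part_* | wc -l
```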

- For MyISAM, use the CONCURRENT keyword to enable inserts to happen concurrently, and if needed use "SET GLOBAL concurrent_insert=2".
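For instance (a sketch; the file and table names are invented):

```sql
-- Allow concurrent inserts even when the table has holes from deletes:
SET GLOBAL concurrent_insert = 2;

-- CONCURRENT lets readers proceed while the load appends rows:
LOAD DATA CONCURRENT INFILE '/data/loads/part1.csv' INTO TABLE fact_sales;
```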

- If you don't use replication (you probably don't), turn off the binary log by commenting out log-bin in the my.cnf file.

For InnoDB:

- It is a big advantage if the data in the CSV files is already ordered by primary key. (Because the InnoDB primary key is a clustered index, the table will be physically organized in primary key order anyway.)

- For the bulk insert, you should consider turning off foreign key checking and unique index checking for the duration of the load.
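Something along these lines, within the loading session (and remember to turn the checks back on afterwards; the file and table names are invented):

```sql
-- Skip FK and unique-index checking while loading:
SET foreign_key_checks = 0;
SET unique_checks = 0;

LOAD DATA INFILE '/data/loads/part1.csv' INTO TABLE fact_sales;

-- Restore normal checking:
SET unique_checks = 1;
SET foreign_key_checks = 1;
```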

- With the InnoDB plugin, you can speed things up by inserting data into a table without secondary indexes (define only the primary key, of course), and then creating the indexes separately with ALTER TABLE. (On an existing table you can also consider dropping the existing indexes; the benefit of this depends on the case.)
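A sketch of that pattern (the table definition and file name are invented for illustration):

```sql
-- Load into a table that has only the primary key...
CREATE TABLE fact_sales (
  id       BIGINT NOT NULL,
  sold_at  DATE,
  amount   DECIMAL(10,2),
  PRIMARY KEY (id)
) ENGINE=InnoDB;

LOAD DATA INFILE '/data/loads/sales.csv' INTO TABLE fact_sales;

-- ...then build the secondary index afterwards, which the InnoDB
-- plugin's fast index creation does in a single pass:
ALTER TABLE fact_sales ADD INDEX idx_sold_at (sold_at);
```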

The InnoDB plugin is not supplied by Sun but directly by Oracle. So we cannot sell it with the commercial (OEM) license, but for customers who use it internally we do support it on a best-effort basis (as we do for other non-standard MySQL uses too).

Also came across this while writing this:



beware LOAD DATA has evil twists...for example, if you have a NULL in the CSV, but the column is NOT NULL, LOAD DATA INFILE will insert a 0!!!


Other ways of doing it:
- If you are sure the CSV file does not contain NULLs, you can try to create a CSV table and overwrite its data file with the source CSV file. There is some black magic to that, which involves locking and flushing the table... I am sure the server devs can fill you in on the exact details. Then, you can read from the CSV table to insert into the target table.

- The method I like most is Matt Casters' Kettle (a.k.a. Pentaho Data Integration). Look here for a parallel CSV load:

Nice thing about this approach is that you can keep scaling by throwing more nodes at the problem.

See also this related talk (on pulling this trick in a cloud) at the next UC:

Hope this helps

Posted by Roland Bouman on January 08, 2009 at 05:08 PM EET #

Hi Roland, really appreciate your input...

The CSV trick, I never tried it but I have a slide which instructs you to do:

mysql> create table … engine=csv;

[mysql@linux1 gim]$ ls -l
total 785848
-rw-rw---- 1 mysql mysql 35 May 1 client_detail_csv.CSM
-rw-rw---- 1 mysql mysql 0 May 1 client_detail_csv.CSV
-rw-rw---- 1 mysql mysql 291367350000 May 1 flatdata.dat

[mysql@linux1 gim]$ mv flatdata.dat client_detail_csv.CSV

mysql> flush tables;
Query OK, 0 rows affected (0.00 sec)‏
mysql> select count(*) from client_detail_csv;
+------------+
| count(*)   |
+------------+
| 1000000000 |
+------------+


It would actually be interesting to try what happens after this: is it straightforward to INSERT INTO realinnodbtable SELECT * FROM csvtable, or will there also be surprises with NULLs or something else? I can parallelize LOAD DATA INFILE by breaking the source files into smaller chunks; does concurrent insert still work when I insert from a CSV table? (Of course, I could still have broken the source files into parts and ended up with many small CSV tables.) But it would be worth trying; I'll pass the tip along.

Posted by Henrik Ingo on January 09, 2009 at 01:00 AM EET #

You might be disappointed by InnoDB performance for inserts on a big server, and the workaround for this is to use 1 very fast (SSD) disk.

Has anyone published results for bulk loading into MyISAM? Multiple connections loading one table might not be as fast as expected because of the global MyISAM mutex.

Posted by Mark Callaghan on January 09, 2009 at 09:37 AM EET #


The people of the MySQL Telecom team write about developments around MySQL and MySQL Cluster and how these products are used by our Communications industry customers. (Image jasmic@Flickr)
