Sunday Feb 15, 2009

Date handling in MySQL

I see my colleague Anders wrote some thoughts on date handling in MySQL. I recently had to explain this down to the fine details for a customer, so Anders, everyone else, here are my conclusions...

All the fine details of allowed ways to enter dates are explained here:
http://dev.mysql.com/doc/refman/5.1/en/datetime.html

With dates, just as in many other cases, MySQL is very flexible in how you can input data. But amid the multitude of formats that are allowed, MySQL does *not* by default let you enter a non-existent date, e.g. "31st of February 2009". I know some other DB products allow that - they then store it as a corrected date automatically (3rd of March 2009). This is mentioned at the end of the above manual page. It can be enabled with an SQL_MODE setting, and I have attached a lengthy example about that separately below (Listing 1).


*********************************************************************************
* LISTING 1
*********************************************************************************
Short explanation in English:

1. Create a table with DATE field, verify that a normal date insert works fine.

2. sql_mode is not set, using the default "liberal" mode

3. Insert with id=2 "succeeds" but with a warning. In fact, only a zero value was inserted. This is consistent with sql_mode.

4. When sql_mode is set to one of the "strict" modes, the same insert (id=3) fails with an error. No new row was inserted.

5. Adding the option ALLOW_INVALID_DATES to sql_mode makes the same insert work (id=4).

6. Note that when selecting, the date is returned in the same invalid form it was inserted. However, if it is used in an operation, it is converted to a valid date, rolling forward or backward in time.

mysql> CREATE TABLE t (id INT PRIMARY KEY, d1 DATE);
Query OK, 0 rows affected (0.00 sec)

mysql> SHOW VARIABLES LIKE "%sql_mode%";
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| sql_mode      |       |
+---------------+-------+
1 row in set (0.01 sec)

mysql> INSERT INTO t VALUES (1, "2008-12-21");
Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO t VALUES (2, "2008-02-31");
Query OK, 1 row affected, 1 warning (0.00 sec)

mysql> SHOW WARNINGS;
+---------+------+-----------------------------------------+
| Level   | Code | Message                                 |
+---------+------+-----------------------------------------+
| Warning | 1265 | Data truncated for column 'd1' at row 1 |
+---------+------+-----------------------------------------+
1 row in set (0.00 sec)

mysql> SELECT * FROM t;
+----+------------+
| id | d1         |
+----+------------+
|  1 | 2008-12-21 |
|  2 | 0000-00-00 |
+----+------------+
2 rows in set (0.00 sec)

mysql> SET SESSION sql_mode='STRICT_ALL_TABLES';
Query OK, 0 rows affected (0.00 sec)

mysql> INSERT INTO t VALUES (3, "2008-02-31");
ERROR 1292 (22007): Incorrect date value: '2008-02-31' for column 'd1' at row 1

mysql> SELECT * FROM t;
+----+------------+
| id | d1         |
+----+------------+
|  1 | 2008-12-21 |
|  2 | 0000-00-00 |
+----+------------+
2 rows in set (0.00 sec)

mysql> SHOW VARIABLES LIKE "%sql_mode%";
+---------------+-------------------+
| Variable_name | Value             |
+---------------+-------------------+
| sql_mode      | STRICT_ALL_TABLES |
+---------------+-------------------+
1 row in set (0.00 sec)

mysql> SET SESSION sql_mode='STRICT_ALL_TABLES,ALLOW_INVALID_DATES';
Query OK, 0 rows affected (0.00 sec)

mysql> SHOW VARIABLES LIKE "%sql_mode%";
+---------------+---------------------------------------+
| Variable_name | Value                                 |
+---------------+---------------------------------------+
| sql_mode      | STRICT_ALL_TABLES,ALLOW_INVALID_DATES |
+---------------+---------------------------------------+
1 row in set (0.00 sec)

mysql> INSERT INTO t VALUES (4, "2008-02-31");
Query OK, 1 row affected (0.00 sec)

mysql> SELECT * FROM t;
+----+------------+
| id | d1         |
+----+------------+
|  1 | 2008-12-21 |
|  2 | 0000-00-00 |
|  4 | 2008-02-31 |
+----+------------+
3 rows in set (0.01 sec)

mysql> SELECT d1 + INTERVAL 0 DAY FROM t;
+---------------------+
| d1 + INTERVAL 0 DAY |
+---------------------+
| 2008-12-21          |
| NULL                |
| 2008-03-02          |
+---------------------+
3 rows in set (0.01 sec)

mysql> -- Note: 2008 was a leap year so above is correct!
mysql>

**********************************************************************************
* END OF LISTING 1
**********************************************************************************

Lessons learned from the example:
1. The MySQL default mode will often continue operations without stopping with an error - this happens in many other cases besides this one. In the old days, without the ETL tools we have today, this was probably a very convenient feature if you had to insert "dirty" data into your Data Warehouse.

2. You may/should consider setting sql_mode='TRADITIONAL' (or another strict mode) so that such mistakes stop with an error and get caught.

3. Set sql_mode="ALLOW_INVALID_DATES" to allow out-of-range dates, as is usual in some other DB products.
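For example, switching the session to traditional mode (using the table from Listing 1) turns the silent zero-date insert into a hard error:

```sql
-- Reject invalid dates with an error instead of a zero-date insert
SET SESSION sql_mode = 'TRADITIONAL';

INSERT INTO t VALUES (5, '2008-02-31');
-- ERROR 1292 (22007): Incorrect date value: '2008-02-31' for column 'd1' at row 1
```

To make this the default for all connections, set sql_mode under the [mysqld] section of my.cnf.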

Links about sql_mode:
http://dev.mysql.com/doc/refman/5.1/en/faqs-sql-modes.html
http://dev.mysql.com/doc/refman/5.1/en/server-sql-mode.html

Another thing that may also come up is that Oracle allows dates to be expressed in textual forms, such as "31-jan-2009".

This has never been supported in MySQL, and it seems the situation is still the same in 5.1. There actually is a variable "date_format" in place, but in practice it only accepts a few values, and setting it has no effect on output - you cannot really change the format:


************************************************************
mysql> show variables like "%date%";
+------------------------------+-------------------+
| Variable_name                | Value             |
+------------------------------+-------------------+
| date_format                  | %Y-%m-%d          |
| datetime_format              | %Y-%m-%d %H:%i:%s |
+------------------------------+-------------------+

mysql> SET SESSION date_format="%Y-%M-%d";
ERROR 1231 (42000): Variable 'date_format' can't be set to the value of
'%Y-%M-%d'

mysql> SET SESSION date_format="%Y-%m-%d";
Query OK, 0 rows affected (0.00 sec)

mysql> SET SESSION date_format="%Y/%m/%d";
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT * FROM t;
+----+------------+
| id | d1         |
+----+------------+
|  1 | 2008-12-21 |
|  2 | 0000-00-00 |
|  4 | 2008-02-31 |
+----+------------+
3 rows in set (0.00 sec)

mysql>
*********************************************************

You can of course always use the DATE_FORMAT() and STR_TO_DATE() functions explicitly if you have a need to format textual date formats.

http://dev.mysql.com/doc/refman/5.1/en/date-and-time-functions.html#function_date-format
http://dev.mysql.com/doc/refman/5.1/en/date-and-time-functions.html#function_str-to-date
See also:
http://dev.mysql.com/doc/refman/5.1/en/date-and-time-functions.html#function_get-format
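As a sketch, parsing and producing an Oracle-style "31-jan-2009" with those functions looks like this (%d, %b and %Y are the standard format specifiers for day, abbreviated month name and four-digit year):

```sql
-- Parse a textual date on the way in
SELECT STR_TO_DATE('31-jan-2009', '%d-%b-%Y');
-- -> 2009-01-31

-- Produce a textual date on the way out
SELECT DATE_FORMAT('2009-01-31', '%d-%b-%Y');
-- -> 31-Jan-2009
```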

Thursday Jan 08, 2009

Tips for bulk loading

A colleague at Sun asked me for tips on how to tune MySQL to get fast bulk loads of CSV files. (The use case being: insert data into a data warehouse during a nightly load window.) Considering that I spend most of my time working with MySQL Cluster, I was amazed at how many tips I could already come up with, both for MyISAM and InnoDB. So I thought it might be interesting to share, and also: Do you have any more tips to add?



[A Sun partner] has requested a PoC to test MySQL's bulk loading capabilities from CSV files. They have about 20GB of compressed CSV files, and they want to see how long it takes to load them.

They haven't specified which storage engine they intend to use yet.

Good start, well defined PoC. The storage engine question actually is significant here.

- MyISAM may typically be up to twice as fast as InnoDB for bulk loads. (But some tuning can bring InnoDB closer to MyISAM, see below.) MyISAM should also be faster on reads.

- On the downside, MyISAM doesn't offer transactions, which also implies no crash recovery!

I would recommend MyISAM if:

- Customer/Partner understands the points below, i.e. is not completely ignorant of DB/MySQL concepts. If he is, or appears, ignorant, then using InnoDB is simpler and safer.

- Apart from the bulk load, there is no transactional activity on the database. (not at all, or very little)

- It is feasible for the customer to load the data at night time, for instance, with something like LOCK TABLES ...; LOAD DATA INFILE ...; UNLOCK TABLES;. In other words, nobody will read data or generate reports during the load period, e.g. a nightly window. (http://dev.mysql.com/doc/refman/5.0/en/lock-tables.html)

- Crash recovery is not an issue. This basically means that in case MySQL (or the whole server) crashes and the DB cannot be repaired, there is always a backup, ZFS snapshot or other means available to get back to the state after the most recent successful bulk load. And there were no transactions after the bulk load, or it is OK for them to be lost. (Note that if there is no write activity, a crash is *not* a problem; problems only happen if there is a crash during LOAD/INSERT/UPDATE.)
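A nightly MyISAM load window could then be as simple as the following sketch (the table and file names are made up):

```sql
LOCK TABLES facts WRITE;

LOAD DATA INFILE '/data/facts_20090108.csv'
  INTO TABLE facts
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
  LINES TERMINATED BY '\n';

UNLOCK TABLES;
```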

Note that InnoDB is not a bad choice for this task either, and we do have big data warehouses running on InnoDB.


I've googled, and have found the following parameters:

MyISAM: bulk_insert_buffer_size, myisam_sort_buffer_size, key_buffer_size

InnoDB: innodb_buffer_pool_size, innodb_log_file_size, innodb_flush_log_at_trx_commit, innodb_flush_method, innodb_doublewrite

Besides the parameters above, are there any others that I should be looking at? Also, do you have any pointers to how I can derive the optimum values of the parameters above?

Actually, I'm not an expert on InnoDB and MyISAM parameters. What you've found seem to be what people generally tune, yes. But I might be missing something myself.

In addition, I would think about the following:

- For both, but especially for MyISAM, try to find ways to make the inserts happen concurrently. Obviously you should have each table in a separate CSV file, but data going into the same table can also be split into multiple parts. Then you can LOAD DATA INFILE... simultaneously for each CSV file.

- For MyISAM, use the CONCURRENT keyword to enable inserts to happen concurrently, and if needed use "SET GLOBAL concurrent_insert=2":
http://dev.mysql.com/doc/refman/5.1/en/concurrent-inserts.html

- If you don't use replication (you probably don't), turn off the binary log by commenting out log-bin in the my.cnf file.
http://www.mysqlperformanceblog.com/2008/05/14/
http://dev.mysql.com/doc/refman/5.1/en/replication-options-binary-log.html
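For example, with the data for one table split into two CSV files, two client sessions could load them in parallel (a sketch with made-up file names):

```sql
-- session 1:
LOAD DATA CONCURRENT INFILE '/data/facts_part1.csv' INTO TABLE facts
  FIELDS TERMINATED BY ',';

-- session 2, running at the same time:
LOAD DATA CONCURRENT INFILE '/data/facts_part2.csv' INTO TABLE facts
  FIELDS TERMINATED BY ',';

-- If the table has deleted rows in the middle of the data file, also:
SET GLOBAL concurrent_insert = 2;
```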

For InnoDB:

- It is a big advantage if the data in the CSV files is already ordered by primary key. (Because the InnoDB primary key is a clustered index, so it will organize the table physically to be in primary key order anyway.)

- For the bulk insert, you should consider turning off foreign key checking and unique index checking:
SET UNIQUE_CHECKS=0;
SET FOREIGN_KEY_CHECKS=0;

- Using the InnoDB Plugin, you can speed things up by inserting data into a table without secondary indexes (only define the primary key, of course), and then creating the indexes separately with ALTER TABLE. (On an existing table you can also consider dropping existing indexes first; the benefit would vary case by case.)
http://www.innodb.com/doc/innodb_plugin-1.0/innodb-create-index.html
http://www.mysqlperformanceblog.com/2008/04/23/testing-innodb-barracuda-format-with-compression/
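Put together, an InnoDB load session could be wrapped roughly like this (a sketch; the table, file and index names are made up):

```sql
SET UNIQUE_CHECKS = 0;
SET FOREIGN_KEY_CHECKS = 0;

LOAD DATA INFILE '/data/facts.csv' INTO TABLE facts
  FIELDS TERMINATED BY ',';

SET UNIQUE_CHECKS = 1;
SET FOREIGN_KEY_CHECKS = 1;

-- With the InnoDB Plugin, add secondary indexes after the load
-- (fast index creation) instead of maintaining them row by row:
ALTER TABLE facts ADD INDEX idx_created (created);
```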

The InnoDB Plugin is not supplied by Sun but directly by Oracle. So we cannot sell it with the commercial (OEM) license, but for customers that use it internally we do support it on a best-effort basis (as we do for other non-standard MySQL uses too).

Also came across this while writing this:
http://www.mysqlperformanceblog.com/2008/07/03/how-to-load-large-files-safely-into-innodb-with-load-data-infile/

Sunday Sep 28, 2008

Accessing your MySQL data whatever way you want it (Part 2, InnoDB)

In the previous post we had a look at the MySQL Cluster NDB API and how it enables direct access to the MySQL Cluster data nodes, and therefore also enables access through other protocols than SQL.

I've often asked myself: Since NDB is so great for MySQL Cluster, is there anything similar for MySQL Server (the not-cluster version...)? A couple of months ago Kazuho Oku did something like that and wrote about it in his blog.

The context for Kazuho's work is the social network use case: 1) You have users. 2) Some users are linked to each other as friends. 3) When a user logs in, he should see a timeline of events/messages from his friends. In a previous post he had already tested the difference between a "pull" and "push" approach. (Taking a small sidetrack here, imho both approaches are wrong: the messages/events should first be submitted only to the originating user's table, then copied to each recipient by an asynchronous background process. This would give you the best of both worlds: the submission speed of the pull model and the retrieval speed of the push model. Anyway...)

For the test we are talking about now, Kazuho is exploring the pull model: when a user logs in, a query is executed to fetch messages/events from all of the user's friends. Kazuho then compared 3 ways to do this: by SQL from the app, by a stored procedure that does everything at once, and by a MySQL User Defined Function. (A UDF is something you write and compile in C and install as a plugin to the server. It can then be called simply as "SELECT functionname(...);".) The UDF accesses the InnoDB data structures directly using MySQL and InnoDB internal functions, so it is reminiscent of using the NDB API to bypass SQL in MySQL Cluster.

Kazuho's results are clear:






Building Timelines on MySQL (timelines / sec):

  SQL                        56.7
  Stored Procedure          136
  UDF using Direct Access  1,710

1) This is a good example of a use case where using a stored procedure gives you an advantage over raw SQL. Never mind whether you think MySQL stored procedures are inefficient or not, these numbers are clear: the stored procedure approach is 2.5 times more efficient.

2) The UDF rocks! Accessing InnoDB structures directly, it is 10+ times faster than the stored procedure.

There is one drawback, though. Accessing the storage engine directly inside MySQL is a bit dangerous. There is no well-defined API, so there is no guarantee that the next version will not break your UDF. Well, I guess it usually won't, but in theory at least. And a UDF is something you have to install in the server; it is not a client API in that sense. But getting 10 times better performance is something to think about, if you're ready to get your hands dirty.

PS. I hear the social networking problem is an especially interesting one from this point of view, in that it doesn't map easily to plain old relational databases and SQL. Getting what you want is a bit inefficient with SQL. Kazuho's UDFs show that it can be done tenfold more efficiently, by accessing the data in a more optimal way. This is of course exactly the point of native data access.

Monday Sep 01, 2008

Accessing your MySQL data whatever way you want it

One way to look at a database is that


  1. there is data, and

  2. there are ways to access data.

This dichotomy was actually coined (whether intentional or not) by Vinay in the MySQL telecom team when discussing the MySQL Cluster vision some months ago.

Even if you typically think of MySQL Cluster as just a clustered version of the plain old MySQL Server, it is actually more accurate to say the opposite, if you consider the architecture and history of MySQL Cluster. The original cluster was just the data store called Network DataBase, or NDB as we familiarly know it still. Then MySQL Server was integrated on top of that to provide an SQL interface. The original and "native" NDB interface is still there though, and many prefer to use direct C++ or Java access to their NDB-clustered data. It is faster in general, but especially applications with real-time requirements will benefit from omitting the MySQL Server: no additional network hop and no parsing of SQL, just direct access to your data. Sometimes you might also benefit from being able to do things with the NDB API that cannot be efficiently expressed in SQL at all.

But did you know that in addition to the SQL and NDB APIs there are actually several more ways to interface with MySQL Cluster available:


  • LDAP was presented at a MySQL User Conference BOF last spring and is actually available now as an Option in the Carrier Grade Edition. The LDAP interface is actually an OpenLDAP server node, using a native NDB backend in the slapd.

  • mod_ndb is a REST Web Services API for MySQL Cluster. This one is (obviously) implemented as an Apache module. Although produced by our own John 'JD' Duncan, it is not a Sun supported product.

  • We also know of a case where MySQL Cluster is accessed through the RADIUS protocol, although I don't think this implementation is publicly available.

  • And someone also does it with DIAMETER, a successor to RADIUS.

I don't know the details of the last two, but at least the first two use NDB directly. That is much more efficient and convenient than, for instance, doing LDAP->SQL conversions when SQL really isn't needed in the first place. Moreover, did you realize that all these interfaces are equal citizens with mysqld - they are all just API nodes. Meaning, you could have one big cluster and access that same data with SQL, LDAP, HTTP, RADIUS and DIAMETER, and of course directly from your application code with the NDB C++ or Java API. Which brings us back to the title of this blog post: you have data and you have ways to access the data. Whatever way suits you best.

Then of course the interesting question: Are there more protocols/APIs out there for MySQL Cluster that we don't know about? (Or that I omitted by mistake?) Are there protocols that would be interesting to implement? Let us know at hingo at mysql dot com (or comment below)!

Wednesday Aug 13, 2008

MySQL perspectives from a SQL Server guru

Ben Kaufman at SQLServerCentral introduces MySQL to the SQL Server DBA crowd. All in all his views seem to be fairly positive, in particular the MySQL Cluster experience:


NDB is the gem of MySQL. Originally developed by Ericsson to track cell phone calls, this is a shared-nothing cluster engine stored in memory. This is a true cluster that supports both high availability and load balancing. [...]
This engine is similar to synchronous mirroring in SQL Server in that it is a 2-phase commit, the difference being the commit is done in memory at the data layer not the log. Logs are hardened at a later time, with the theory being that since data is committed on multiple nodes the data is safe and doesn't require a log flush as part of the transaction. [...]
For pure performance the cluster is comparable with a single instance of SQL Server. I've found on selects it beats SQL Server slightly as long as the data on SQL Server is in memory. This gap widens as more connections are made. Writes depend on whether SQL Server is using write back cache on its controller, in the case it is, it beats NDB, due to NDBs 2-phase commit. Without the controller cache NDB beat SQL. However this is not apples to apples. When compared to SQL Server synchronous mirroring NDB wins hands down. The cost associated with NDB is that it resides in memory (5.1 allows you to keep non indexed data on disk), and therefore your dataset is limited by the amount of memory you use. [...]
With all the negatives put aside if you have an application that requires redundancy, and fast inserts and selects on a single table, NDB is the best product out there. We've been running it for almost 18 months and it's been rock solid. Compared with other SQL Server and other MySQL implementations this has required the least amount of attention. One final note this will not run on Windows.

Also other storage engines get a good evaluation:


Myisam whether intentional or not is built and optimized for read-only datasets. Some of the features that make this the case is the fact that it doesn't support transactions, so taking user updates would be dangerous, but you don't incur the overhead of the transactional log. It performs locking at the table level not the row or page, which is not the best for active OLTP systems, unless the system is only doing inserts. On an application only performing inserts it performs well because it has the ability to perform concurrent inserts and selects. This is because data is appended to the end of the file, and a table lock is not issued, allowing for selects of the data not to be blocked.

Wow, even I had missed this fact about MyISAM. There's always so much talk about MyISAM performing badly on "concurrent writes" that I didn't realize pure INSERTs are not a problem, only UPDATEs are. I immediately tested this, and on a 2-core system (my laptop) the performance of bulk insert doubled when using 10 threads instead of 1.

Thanks Ben, this is useful info. There are many networking use cases where you have to collect lots of data from network elements into some kind of aggregator or mediator, where the data is then queried, forwarded and deleted later on, but never UPDATEd. It seems MyISAM is a good fit for this kind of use case after all, not just pure data warehousing. (Especially with PARTITIONing, where you can delete full tables/partitions and not just a bunch of rows.)
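A partition-based purge like the one mentioned above could be sketched like this (hypothetical table and date ranges):

```sql
CREATE TABLE events (
  id      BIGINT NOT NULL,
  created DATE   NOT NULL
) ENGINE=MyISAM
PARTITION BY RANGE (TO_DAYS(created)) (
  PARTITION p200807 VALUES LESS THAN (TO_DAYS('2008-08-01')),
  PARTITION p200808 VALUES LESS THAN (TO_DAYS('2008-09-01')),
  PARTITION pmax    VALUES LESS THAN MAXVALUE
);

-- Purging a whole month is one quick DROP, not millions of row deletes:
ALTER TABLE events DROP PARTITION p200807;
```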

About

The people of the MySQL Telecom team write about developments around MySQL and MySQL Cluster and how these products are used by our Communications industry customers.
