Common problems and mistakes when working with UTF-8 data using MySQL Connector/ODBC
By BOGDAN DEGTYARIOV on Apr 30, 2014
Starting from the version 5.2 the MySQL Connector/ODBC comes in two flavors
ANSI and UNICODE. The main difference between these two versions is the way of
processing of the string and text data:
- ANSI or (A) version mainly works with the single-byte character sets such as
Latin-1 (a.k.a. ISO-8859-1), Windows-1251 (a.k.a. CP1251) and so on.
This puts a very important requirement of using 0x00 byte as a string
termination character. Any string returned to the client application is
a sequence of characters each of them is SQLCHAR type. SQLCHAR is mapped
to the single-byte char C type. ANSI driver cannot process most of
UNICODE strings because multi-byte UNICODE characters can contain 0x00
byte, which will be immediately treated as the end of the string. The
ANSI-style function can be explicitly called by adding A to the ODBC
function name. For instance:
SQLDriverConnectA() for SQLDriverConnect()
SQLColAttributeA() for SQLColAttribute()
- UNICODE or (W) version works with the multi-byte encodings. In this case
(W) stands for Wide characters that consist of several bytes: UTF-16, UCS-2
and others. Zero byte 0x00 is no longer a string terminator because it can
be a part of a multi-byte character. The string is terminated by a character
which has all its bytes set to zero (such as 0x0000 for UTF-16). The text
string is a sequence of such wide characters of SQLWCHAR type.
NOTE: The mapping of SQLWCHAR type is somewhat complicated and it can
create hidden pitfalls for programmers porting their code from
Windows to Linux. Usually a SQLWCHAR character is a 16-bit unit and
we will not consider the exotic cases when SQLWCHAR is different.
Windows uses UTF-16 and maps SQLWCHAR to 16-bit wchar_t type.
However, many Linux versions such as Ubuntu Linux use UTF-32 as an
internal character set and therefore their 32-bit wchar_t is not
compatible with SQLWCHAR, which is always 16-bit.
The UNICODE-style function calls are made through adding W to the function
SQLDriverConnectW() for SQLDriverConnect()
SQLColAttributeW() for SQLColAttribute()
So far it was all clear and straightforward because we reviewed only fixed-size
character sets that consisted of 1 byte (Latin-1), 2 bytes (UTF-16) or 4 bytes
(UTF-32). The situation is different with UTF-8 because it is a variable-length
character set with characters that take from 1 to 3 bytes per character. UTF-8
possesses all properties of UNICODE character set except that the UTF-8 string
termination character is a single-byte 0x00 (same as for single-byte character
NOTE: We are talking about MySQL implementation of UTF-8, which is using
1 to 3 bytes per character.
It is easy to see that UTF-8 data stream has more in common with single-byte
character sets than with multi-byte ones. In fact, the binary representation
of UTF-8 string written with only Latin characters ("Hello World") would be
exactly the same as the binary representation of the single-byte ISO-8859-01.
The zero byte 0x00 is the mark for the end of a UTF-8 string, which makes the
single-byte SQLCHAR ODBC type suitable for representing UTF-8 data.
With the wide-character type SQLWCHAR things get more complicated because
0x00 combined with other non-zero byte gives a valid wide character instead of
terminating the string, which can cause the data corruption.
Here is an example with data stream containing two strings "string 1" and
ANSI ODBC driver and SQLCHAR data (single-byte units)
[s][t][r][i][n][g][ ][\0][s][t][r][i][n][g][ ][\0]
|__(String #1) |___(String #2)
When ODBC driver is processing String #1 it sees the zero byte and stops
reading data after it.
UNICODE ODBC driver and SQLWCHAR data (two-byte units)
[st][ri][ng][ 1][\0s][tr][in][g ][2\0]
|__(String #1) |___(String #2)
In the above example two strings collided inside one SQLWCHAR.
The [\0s] is not a terminating character for SQLWCHAR string.
For SQLWCHAR it has to be [\0\0], but two consequent zero bytes in UTF-8 would
mean that the second string is empty.
If the client program requires UTF-8 data it is better to use the ANSI version
of MySQL Connector/ODBC.