Wednesday Apr 30, 2014

Common problems and mistakes when working with UTF-8 data using MySQL Connector/ODBC

Starting from the version 5.2 the MySQL Connector/ODBC comes in two flavors
ANSI and UNICODE. The main difference between these two versions is the way of
processing of the string and text data:

ANSI or (A) version mainly works with the single-byte character sets such as
   Latin-1 (a.k.a. ISO-8859-1), Windows-1251 (a.k.a. CP1251) and so on.
   This puts a very important requirement of using
0x00 byte as a string
   termination character. Any string returned to the client application is
   a sequence of characters each of them is
SQLCHAR type. SQLCHAR is mapped
   to the single-byte char C type.
ANSI driver cannot process most of
   UNICODE strings because multi-byte UNICODE characters can contain 0x00
   byte, which will be immediately treated as the end of the string. The
   ANSI-style function can be explicitly called by adding A to the ODBC
   function name. For instance:

       SQLDriverConnectA() for SQLDriverConnect()
       SQLColAttributeA() for SQLColAttribute()

UNICODE or (W) version works with the multi-byte encodings. In this case
   (W) stands for Wide characters that consist of several bytes:
UTF-16, UCS-2
   and others. Zero byte 0x00 is no longer a string terminator because it can
   be a part of a multi-byte character. The string is terminated by a character
   which has all its bytes set to zero (such as
0x0000 for UTF-16). The text
   string is a sequence of such wide characters of

   NOTE: The mapping of
SQLWCHAR type is somewhat complicated and it can
         create hidden pitfalls for programmers porting their code from
         Windows to Linux. Usually a
SQLWCHAR character is a 16-bit unit and
         we will not consider the exotic cases when
SQLWCHAR is different.
         Windows uses
UTF-16 and maps SQLWCHAR to 16-bit wchar_t type.
         However, many Linux versions such as Ubuntu Linux use
UTF-32 as an
         internal character set and therefore their 32-bit wchar_t is not
         compatible with
SQLWCHAR, which is always 16-bit.

UNICODE-style function calls are made through adding W to the function

       SQLDriverConnectW() for SQLDriverConnect()
       SQLColAttributeW() for SQLColAttribute()

So far it was all clear and straightforward because we reviewed only fixed-size
character sets that consisted of 1 byte (Latin-1), 2 bytes (
UTF-16) or 4 bytes
UTF-32). The situation is different with UTF-8 because it is a variable-length
character set with characters that take from 1 to 3 bytes per character.
possesses all properties of
UNICODE character set except that the UTF-8 string
termination character is a single-byte
0x00 (same as for single-byte character

NOTE: We are talking about MySQL implementation of
UTF-8, which is using
           1 to 3 bytes per character.

It is easy to see that
UTF-8 data stream has more in common with single-byte
character sets than with multi-byte ones. In fact, the binary representation
UTF-8 string written with only Latin characters ("Hello World") would be
exactly the same as the binary representation of the single-byte ISO-8859-01.
The zero byte
0x00 is the mark for the end of a UTF-8 string, which makes the
SQLCHAR ODBC type suitable for representing UTF-8 data.

With the wide-character type
SQLWCHAR things get more complicated because
0x00 combined with other non-zero byte gives a valid wide character instead of
terminating the string, which can cause the data corruption.

Here is an example with data stream containing two strings "string 1" and
"string 2":

ANSI ODBC driver and SQLCHAR data (single-byte units)

  [s][t][r][i][n][g][ ][1][\0][s][t][r][i][n][g][ ][2][\0]
   ^                           ^                            
   |__(String #1)              |___(String #2)              

When ODBC driver is processing String #1 it sees the zero byte and stops
reading data after it.

UNICODE ODBC driver and SQLWCHAR data (two-byte units)

  [st][ri][ng][ 1][\0s][tr][in][g ][2\0]  
   ^                 ^                    
   |__(String #1)    |___(String #2)      

In the above example two strings collided inside one
The [\0s] is not a terminating character for
SQLWCHAR string.
SQLWCHAR it has to be [\0\0], but two consequent zero bytes in UTF-8 would
mean that the second string is empty.

If the client program requires
UTF-8 data it is better to use the ANSI version
of MySQL Connector/ODBC.

Wednesday Oct 30, 2013

How to deal with MySQL Connector/ODBC error "Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock'"

I am sure many users run into a mysterious problem when perfectly working ODBC configurations started failing with errors like:
Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock'
The above error message might be preceded with something like [nxDc[yQ]. At the same time odbc.ini specifies in its DSN different SOCKET=/tmp/mysql.sock or a TCP connection SERVER=<remote_host_or_ip>. The question is, what had happened that the ODBC driver started to ignore the DSN options? The clue lies in the corrupted string [nxDc[yQ], which actually was [UnixODBC][MySQL] with each 2nd symbol removed. This is the case of bad conversion from SQLCHAR to SQLWCHAR. The UnixODBC driver manager took a single-byte character string from the client application and tried to convert it into the wide (multi-byte) characters for the Unicode version of MyODBC driver: Initially the piece of the connection string was represented by 1-byte chars like:
after the bad conversion to wide chars (commonly 2-byte UTF-16)
instead of
Naturally, the MyODBC driver could not parse the bad string and tried to use the default connection type (SOCKET) with the default value (/var/lib/mysql/mysql.sock) Now we know what happened, but why it happened? In most cases it happened because of using ODBCManageDataSourcesQ4 utility or its older analog ODBCConfig. When registering ODBC drivers they put lots of additional options and one of these options badly affects the UnixODBC driver manager itself. The solution is simple - remove or comment out the option in odbcinst.ini file (it is empty by default) set for the driver:
[MySQL ODBC 5.2.6 Driver]
Description    =
Driver         = /home/dbs/myodbc526/lib/
Driver64       = /home/dbs/myodbc526/lib/
Setup          = /home/dbs/myodbc526/lib/
Setup64        = /home/dbs/myodbc526/lib/
UsageCount     = 1
CPTimeout      = 0
CPTimeToLive   = 0
IconvEncoding  =  # <--------- remove this line
Trace          =
TraceFile      =
TraceLibrary   =
After applying this simple solution (remove the line with IconvEncoding = ) everything came to normal. Prior to removing that line I tried putting different encoding names there, but the result was not good, so I really don't know how to properly use it. Unfortunately, UnixODBC manuals say nothing about it. Therefore, removing this option was the only way to get things done.

Phil Wang-Oracle


« October 2015