Common problems and mistakes when working with UTF-8 data using MySQL Connector/ODBC

Starting from the version 5.2 the MySQL Connector/ODBC comes in two flavors
ANSI and UNICODE. The main difference between these two versions is the way of
processing of the string and text data:

 -
ANSI or (A) version mainly works with the single-byte character sets such as
   Latin-1 (a.k.a. ISO-8859-1), Windows-1251 (a.k.a. CP1251) and so on.
   This puts a very important requirement of using
0x00 byte as a string
   termination character. Any string returned to the client application is
   a sequence of characters each of them is
SQLCHAR type. SQLCHAR is mapped
   to the single-byte char C type.
ANSI driver cannot process most of
   UNICODE strings because multi-byte UNICODE characters can contain 0x00
   byte, which will be immediately treated as the end of the string. The
   ANSI-style function can be explicitly called by adding A to the ODBC
   function name. For instance:

       SQLDriverConnectA() for SQLDriverConnect()
       SQLColAttributeA() for SQLColAttribute()
       etc.

 -
UNICODE or (W) version works with the multi-byte encodings. In this case
   (W) stands for Wide characters that consist of several bytes:
UTF-16, UCS-2
   and others. Zero byte 0x00 is no longer a string terminator because it can
   be a part of a multi-byte character. The string is terminated by a character
   which has all its bytes set to zero (such as
0x0000 for UTF-16). The text
   string is a sequence of such wide characters of
SQLWCHAR type.

   NOTE: The mapping of
SQLWCHAR type is somewhat complicated and it can
         create hidden pitfalls for programmers porting their code from
         Windows to Linux. Usually a
SQLWCHAR character is a 16-bit unit and
         we will not consider the exotic cases when
SQLWCHAR is different.
         Windows uses
UTF-16 and maps SQLWCHAR to 16-bit wchar_t type.
         However, many Linux versions such as Ubuntu Linux use
UTF-32 as an
         internal character set and therefore their 32-bit wchar_t is not
         compatible with
SQLWCHAR, which is always 16-bit.

   The
UNICODE-style function calls are made through adding W to the function
   name:

       SQLDriverConnectW() for SQLDriverConnect()
       SQLColAttributeW() for SQLColAttribute()
       etc.

So far it was all clear and straightforward because we reviewed only fixed-size
character sets that consisted of 1 byte (Latin-1), 2 bytes (
UTF-16) or 4 bytes
(
UTF-32). The situation is different with UTF-8 because it is a variable-length
character set with characters that take from 1 to 3 bytes per character.
UTF-8
possesses all properties of
UNICODE character set except that the UTF-8 string
termination character is a single-byte
0x00 (same as for single-byte character
sets).

NOTE: We are talking about MySQL implementation of
UTF-8, which is using
           1 to 3 bytes per character.

It is easy to see that
UTF-8 data stream has more in common with single-byte
character sets than with multi-byte ones. In fact, the binary representation
of
UTF-8 string written with only Latin characters ("Hello World") would be
exactly the same as the binary representation of the single-byte ISO-8859-01.
The zero byte
0x00 is the mark for the end of a UTF-8 string, which makes the
single-byte
SQLCHAR ODBC type suitable for representing UTF-8 data.

With the wide-character type
SQLWCHAR things get more complicated because
0x00 combined with other non-zero byte gives a valid wide character instead of
terminating the string, which can cause the data corruption.

Here is an example with data stream containing two strings "string 1" and
"string 2":

ANSI ODBC driver and SQLCHAR data (single-byte units)

  [s][t][r][i][n][g][ ][1][\0][s][t][r][i][n][g][ ][2][\0]
   ^                           ^                            
   |__(String #1)              |___(String #2)              

When ODBC driver is processing String #1 it sees the zero byte and stops
reading data after it.

UNICODE ODBC driver and SQLWCHAR data (two-byte units)

  [st][ri][ng][ 1][\0s][tr][in][g ][2\0]  
   ^                 ^                    
   |__(String #1)    |___(String #2)      


In the above example two strings collided inside one
SQLWCHAR.
The [\0s] is not a terminating character for
SQLWCHAR string.
For
SQLWCHAR it has to be [\0\0], but two consequent zero bytes in UTF-8 would
mean that the second string is empty.

If the client program requires
UTF-8 data it is better to use the ANSI version
of MySQL Connector/ODBC.

Comments:

Post a Comment:
  • HTML Syntax: NOT allowed
About

Phil Wang-Oracle

Search

Categories
Archives
« May 2015
SunMonTueWedThuFriSat
     
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
      
Today