UTF-16 and UTF-32 support in GCC added
By Elena on Apr 21, 2008
Our team made its first contribution to GCC last week. This is
quite exciting news for us, and hopefully it will be the first of many
contributions to come. Kris Van Hees has implemented support for the
UTF-16 and UTF-32 character data types for the C and C++ languages in GCC.
His work is based on the ISO/IEC draft technical report for C (ISO/IEC
JTC1 SC22 WG14 N1040) and the proposal for C++ (ISO/IEC JTC1 SC22 WG21 N2249).
Neither proposal defines a specific encoding for UTF-16. This
implementation uses the target endianness to determine whether
UTF-16BE or UTF-16LE will be used.
Support was added for the following wide character data types (internal
for C, fundamental types for C++) with the given underlying data types:
* char16_t: short unsigned int
* char32_t: unsigned int
Support was added to the tokenizer to accept the following new
character and string literal notations:
* u'c-char-sequence': char16_t character literal (UTF-16)
* U'c-char-sequence': char32_t character literal (UTF-32)
* u"s-char-sequence": array of char16_t (UTF-16)
* U"s-char-sequence": array of char32_t (UTF-32)
Support was also added to the C parser and the C++ parser to handle the
following concatenations of string literals:
* "a" u"b" -> u"ab"
* u"a" "b" -> u"ab"
* u"a" u"b" -> u"ab"
* "a" U"b" -> U"ab"
* U"a" "b" -> U"ab"
* U"a" U"b" -> U"ab"
This behaviour is only available in the gnu99, c++0x, and gnu++0x dialects.
See the initial patch submitted upstream:
And the revisions:
The patch was committed to the GCC trunk on Apr 18th, 2008.
Thanks to Jason Merrill, Joseph Myers, Andrew Pinski and Tom Tromey for the reviews.