Introduction

Your choice of database character set is important for several reasons:

  • Changing the Oracle database character set usually requires creating a new database, and the data conversion can be non-trivial
  • Runtime translation of character set data can be expensive, and a bad choice can lead to data loss
  • Due to company mergers and the rise of popular web-based systems, the need for international, multilingual databases has increased

Good and bad database character sets


If you have the choice, always use the Unicode character set AL32UTF8 and avoid UTF8:

  • Oracle UTF8 does not fully implement Unicode UTF-8 and is not recommended for new databases
  • Oracle AL32UTF8 fully implements Unicode UTF-8
  • Oracle AL32UTF8 is a superset of Oracle UTF8
  • AL32UTF8 has the best feature support for the Oracle database
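As a quick sanity check, you can see which character sets an existing database was created with by querying the data dictionary. A minimal sketch (the values shown are what a modern default install would typically report):

-- Show the database character set and the national character set
SELECT parameter, value
  FROM nls_database_parameters
 WHERE parameter IN ('NLS_CHARACTERSET', 'NLS_NCHAR_CHARACTERSET');

-- Typical output on a recent release:
-- NLS_CHARACTERSET         AL32UTF8
-- NLS_NCHAR_CHARACTERSET   AL16UTF16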

The Oracle Database supports multilingual applications with country-specific conventions for language, dates, times, currency, numbers and calendars. This is achieved via Oracle Database Globalization Support using the Oracle NLS [National Language Support] runtime library [NLSRTL].
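For example, the NLS settings can be changed per session to switch language and territory conventions. A small sketch using standard NLS session parameters:

-- Render dates using French language and territory conventions
ALTER SESSION SET NLS_LANGUAGE = 'FRENCH';
ALTER SESSION SET NLS_TERRITORY = 'FRANCE';

SELECT TO_CHAR(SYSDATE, 'Day DD Month YYYY') AS today FROM dual;
-- e.g. Lundi 14 Juillet 2025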

The origins of the many different character sets that the Oracle database supports are complex.  Older computer hardware, operating systems and versions of the Oracle database used to constrain the choices for the database character set.  Since Oracle 7, your choices for the database character set have increased. Since Oracle 12.2, the default and recommended database character set has been AL32UTF8.

This blog looks at the past and present of character sets, Unicode, universal character sets and your choices for the Oracle database character set.


Some character encoding pre-history

Historically, computers had very limited language support. EBCDIC [Extended Binary Coded Decimal Interchange Code] has been used on IBM System/360 mainframes since the early 1960s, and the ASCII [American Standard Code for Information Interchange] standard was first published in 1963.

EBCDIC and ASCII code tables


These character encodings were both created in the USA and were based on the English alphabet.  From the 1950s to the mid 1970s, punch cards were the main input medium for [mainframe] computers.  The ability to cram alphanumeric encodings into 6 and eventually 8 bits was critical because those computers had very limited main memory.

punch card


While EBCDIC and ASCII are still supported today, they are not very useful for most of the world’s top 20 languages:

Language            Speakers       Characters
English             1.3 billion    < 128
Mandarin Chinese    1.1 billion    Thousands
Hindi               600 million    < 128
Spanish             543 million    < 128
Standard Arabic     274 million    < 128
Bengali             268 million    < 128
French              267 million    < 128
Russian             258 million    < 128
Portuguese          258 million    < 128
Urdu                230 million    < 128
Indonesian          199 million    < 128
Standard German     135 million    < 128
Japanese            126 million    Thousands [Kanji]
Marathi             99 million     < 128
Telugu              96 million     < 128
Turkish             88 million     < 128
Tamil               85 million     < 128
Yue Chinese         85 million     Thousands
Wu Chinese          82 million     Thousands
Korean              82 million     < 128


NRCS and dumb terminals

In 1983, to avoid the cost of creating separate [dumb] terminals for each country, Digital Equipment Corporation invented NRCS [National Replacement Character Sets] for its VT200 series [green screen] terminals. NRCS enabled different national character sets to be used on the same terminal hardware, and showed that many languages can be encoded in 7 bits [128 characters] or 8 bits [256 characters].  These NRCS-capable terminals for minicomputers were popular: by 1987, over one million of these ‘dumb’ terminals had been sold.

VT-200 green screen dumb terminal


ISO/IEC 8859

The success of NRCS on VT200 style terminals led to the ISO/IEC 8859 set of standards for 8 bit character encodings. These standards were designed for character encoding, not typography.

ISO/IEC 8859 standards

Part      Name                              Description
Part 1    Latin-1 Western European          Includes Danish, Dutch, English, Faroese, Finnish, French, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Rhaeto-Romanic, Scottish Gaelic and Spanish
Part 2    Latin-2 Central European          Includes Bosnian, Polish, Croatian, Czech, Slovak, Slovene, Serbian and Hungarian
Part 3    Latin-3 South European            Includes Turkish, Maltese and Esperanto
Part 4    Latin-4 North European            Includes Estonian, Latvian, Lithuanian, Greenlandic and Sami
Part 5    Latin/Cyrillic                    Includes Belarusian, Bulgarian, Macedonian, Russian, Serbian and Ukrainian
Part 6    Latin/Arabic                      Includes the most common Arabic language characters
Part 7    Latin/Greek                       Includes modern Greek and some ancient Greek
Part 8    Latin/Hebrew                      Includes modern Hebrew
Part 9    Latin-5 Turkish                   Mostly ISO/IEC 8859-1 with Turkish letters replacing the infrequently used Icelandic letters
Part 10   Latin-6 Nordic                    A rearrangement of Latin-4
Part 11   Latin/Thai                        Includes the characters needed for the Thai language
Part 12   Latin/Devanagari                  The encoding for Devanagari was abandoned in 1997
Part 13   Latin-7 Baltic Rim                Includes some characters missing from Latin-4 and Latin-6
Part 14   Latin-8 Celtic                    Includes the Gaelic and Breton languages
Part 15   Latin-9                           A modification of ISO/IEC 8859-1 that replaces some symbols with €, Š, š, Ž, ž, Œ, œ, and Ÿ
Part 16   Latin-10 South-Eastern European   Includes Albanian, Croatian, Hungarian, Italian, Polish, Romanian, Slovene, Finnish, French, German and Irish Gaelic

The various ISO/IEC 8859 standards meant that computers could enter, store, retrieve and display characters in many different local languages.  The problem was that these standards were largely incompatible for data exchange, because it is hard to define the translation or mapping between the different character encodings.


IBM PC with MS-DOS and code pages

IBM PC compatible computers running MS-DOS used 8 bit character sets known as code pages, some of which covered the same repertoires as the ISO/IEC 8859 standards.  The original IBM PC used Code page 437 [DOS Latin US].

IBM PC running MS-DOS

These IBM PC compatible computers could also load different code pages to enable various character sets, although with limited colors and fonts.

What was really needed was a ‘universal way to represent all character encodings’.  One of the challenges was that 8 bits is not enough to represent the huge number of Japanese, Korean, Mandarin, Yue or Wu Chinese characters. This means that more than one byte [8 bits] may be needed to encode a character, and more sophisticated hardware than dumb terminals is needed to render these [non ASCII] characters.


The birth of Unicode

Microcomputers with ‘high resolution’ bit-mapped displays from Xerox PARC, Apple and Microsoft enabled the display or rendering of complex characters.

Xerox PARC Alto, MS Windows and Apple Macintosh

Representatives from Xerox PARC and Apple started investigating a universal character set in 1987. By 1988, a document called Unicode 88 described a model which used 16 bits to represent the character encodings for modern languages. With more input from RLG [Research Libraries Group], Sun Microsystems and NeXT, the newly formed Unicode Consortium published the first Unicode standard in 1991. By the end of 1991 there were about 61 million IBM compatible PCs, 7 million Apple Macintoshes and about two thousand Xerox Alto computers.


The Unicode Standard

Unicode is focused on the [numeric] encoding of characters [graphemes] rather than the various renderings [glyphs] of those characters.

The Unicode standard defines the Unicode Transformation Formats (UTF): UTF-8, UTF-16, and UTF-32. In China, GB18030 is widely used; it fully implements Unicode although it is not an official Unicode standard.

Unicode logo

Even though Unicode started as a 16 bit encoding, the first 256 code points use the same 8 bit codes as ISO/IEC 8859-1 [Latin-1] to make it simple to convert existing western text to Unicode.
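You can see this overlap from inside an Oracle database: UNISTR takes a Unicode code point escape and returns the character in the database character set. A minimal sketch, assuming an AL32UTF8 database:

-- U+00E9 has the same code point in Unicode as 'é' has in ISO/IEC 8859-1
SELECT UNISTR('\00E9') AS latin_small_e_acute FROM dual;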


Unicode UTF-8

UTF-8 is a variable width encoding and can represent all 1,112,064 valid Unicode code points.

Variable width UTF8 encoding

To minimize space, UTF-8 uses 1 to 4 bytes to encode a character.  The first 128 Unicode characters take a single byte and have the same byte encodings as ASCII.
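You can observe this variable width behaviour with the LENGTHB and DUMP functions. A sketch, assuming an AL32UTF8 database character set:

-- 'A' [ASCII] is 1 byte, 'é' [Latin] is 2 bytes and '€' is 3 bytes in UTF-8
SELECT LENGTHB('A') AS ascii_bytes,
       LENGTHB('é') AS latin_bytes,
       LENGTHB('€') AS euro_bytes
  FROM dual;

-- DUMP shows the actual stored bytes in hex
SELECT DUMP('€', 1016) AS euro_encoding FROM dual;
-- Typ=96 Len=3 CharacterSet=AL32UTF8: e2,82,ac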

Technologies using UTF8

UTF-8 is widely used and is part of the specifications for HTTP headers, HTML, CSS and XML; it is also a character set registered with IANA.


Unicode UTF-16

UTF-16 is a variable width encoding, using one or two 16 bit code units to represent the same 1,112,064 valid Unicode code points.

Variable width UTF16

UTF-16 is used internally by Java, JavaScript and Microsoft Windows.
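In the Oracle database, UTF-16 appears as the AL16UTF16 national character set used by the NCHAR data types. A sketch, assuming the default AL16UTF16 national character set:

-- '€' is one 16 bit code unit [2 bytes] in UTF-16
SELECT LENGTHB(N'€') AS euro_utf16_bytes FROM dual;   -- 2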

Java, JavaScript and Windows use UTF16


Unicode UTF-32

UTF-32 is a fixed width encoding, using four bytes (32 bits) for each Unicode code point.

Fixed width UTF32

Unicode Transformation Formats (UTF) are a space-time trade-off:

  • UTF-8 minimizes space, but finding the offset of the nth character is more complex because the width is variable
  • UTF-16 tends to use more space than UTF-8, and character offsets are still tricky because of surrogate pairs
  • UTF-32 uses the most space [four bytes per character], but finding character offsets is trivial
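Oracle exposes these trade-offs through its family of length functions. A sketch, assuming an AL32UTF8 database; U+1D11E [a musical symbol] is a supplementary character, written here as a UTF-16 surrogate pair:

SELECT LENGTHB(UNISTR('\D834\DD1E')) AS utf8_bytes,    -- 4 bytes in UTF-8
       LENGTH2(UNISTR('\D834\DD1E')) AS utf16_units,   -- 2 UTF-16 code units [a surrogate pair]
       LENGTH4(UNISTR('\D834\DD1E')) AS code_points    -- 1 Unicode code point
  FROM dual;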


Eventual dominance of Unicode

UTF-8 has been the most common HTML encoding since 2008 and now accounts for 97.5% of all web pages.

Character encoding popularity

Although Java, JavaScript, Python and MS Windows use UTF-16, more recently there has been a move towards UTF-8 in MS Windows, Go, Rust, Swift, C++ and Julia.


Unicode Character sets in the Oracle Database

The Oracle Database has supported five different Unicode character sets.

Unicode character sets in Oracle Database

Things to note about Unicode support in the Oracle Database:

  • AL16UTF16 is NOT a supported database character set. It is only supported as a national character set
  • AL24UTFFSS is no longer supported
  • UTF8 only supports Unicode 3.0 and does not support the supplementary characters needed for Japanese and Chinese character sets
  • UTFE is only useful if the system is EBCDIC based
  • This leaves AL32UTF8 as the Unicode database character set of choice; the query below shows what your release supports
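You can list the character sets that your release still supports by querying V$NLS_VALID_VALUES. A quick sketch:

SELECT value AS character_set
  FROM v$nls_valid_values
 WHERE parameter = 'CHARACTERSET'
 ORDER BY value;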


How Unicode data is stored in an Oracle Database

The Oracle Database uses the following encoding for UTF-8:

Oracle Database UTF8 encoding

Comments on Oracle UTF-8 encoding:

  • ASCII characters use 1 byte

  • European (except ASCII), Arabic, and Hebrew characters require 2 bytes

  • Indic, Thai, Chinese, Japanese, and Korean characters as well as certain symbols such as the euro symbol require 3 bytes

  • Characters in the Private Use Area #1 require 3 bytes

  • Supplementary characters require 4 bytes

  • Characters in the Private Use Area #2 require 4 bytes

  • The AL32UTF8 character set supports 1-byte, 2-byte, 3-byte, and 4-byte values

  • The UTF8 character set supports 1-byte, 2-byte, and 3-byte values, but not 4-byte values

  • AL32UTF8 is a superset of UTF8 as it can support 4-byte values
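These byte counts matter when sizing columns. By default, VARCHAR2 lengths are measured in bytes, so multi-byte characters reduce the number of characters that fit; character length semantics avoids the surprise. A sketch with an illustrative table name, assuming an AL32UTF8 database:

CREATE TABLE charset_demo (
  c_byte VARCHAR2(9 BYTE),  -- up to 9 bytes: holds only three '€' characters [3 bytes each]
  c_char VARCHAR2(9 CHAR)   -- up to 9 characters, however many bytes each needs
);

INSERT INTO charset_demo (c_char) VALUES ('€€€€€€€€€');  -- 9 characters [27 bytes]: fits
-- Inserting the same string into c_byte would raise ORA-12899 [value too large for column]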


Choosing an Oracle Database character set

The character encoding scheme used by the database is defined as part of the CREATE DATABASE statement. All SQL CHAR data type columns (CHAR, CLOB, VARCHAR2, and LONG), including columns in the data dictionary, have their data stored in the database character set. In addition, the choice of database character set determines which characters can name objects in the database. SQL NCHAR data type columns (NCHAR, NCLOB, and NVARCHAR2) use the national character set.

Oracle Database uses the database character set for:

  • Data stored in SQL CHAR data types (CHAR, VARCHAR2, CLOB, and LONG)

  • Identifiers such as table names, column names, and PL/SQL variables

  • Entering and storing SQL and PL/SQL source code
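The split between the database character set and the national character set is visible at the column level. A sketch; the table and column names are illustrative:

CREATE TABLE customers (
  name  VARCHAR2(100 CHAR),  -- stored in the database character set, e.g. AL32UTF8
  notes NVARCHAR2(100)       -- stored in the national character set, e.g. AL16UTF16
);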


Consider the following questions when you choose an Oracle Database character set for the database:

  • What languages does the database need to support now?

  • What languages will the database need to support in the future?

  • Is the character set available on the operating system?

  • What character sets are used on clients?

  • How well does the application handle the character set?

  • What are the performance implications of the character set?

  • What are the restrictions associated with the character set?


Recommended character set

Oracle recommends using Unicode for all new system deployments, and migrating legacy systems to Unicode is also recommended. Deploying your systems in Unicode offers many advantages in usability, compatibility, and extensibility. Oracle Database enables you to deploy high-performing systems faster and more easily while utilizing the advantages of Unicode. Even if you do not need to support multilingual data today and have no other requirement for Unicode, it is still likely to be the best choice for a new system: in the long run it will save you time and money and give you a competitive advantage.

If in doubt, use AL32UTF8:
  • AL32UTF8 is a superset of UTF8
  • AL32UTF8 is a universal character set
    • It can represent all known language character encodings
  • AL32UTF8 is space efficient
    • ASCII uses one byte for storage
    • European, Arabic and Hebrew use two bytes for storage
    • Indic, Thai and common Chinese, Japanese and Korean use three bytes for storage
    • Additional Chinese, Japanese, Korean and historic languages use four bytes


OUI [Oracle Universal Installer] character set

When you create an Oracle database, you must choose both the database character set and the national character set:

  • Choose AL32UTF8 as the database character set
  • Choose AL16UTF16 as the national character set
  • Choose your default language as appropriate
  • Choose your default territory as appropriate
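If you create the database manually rather than through OUI or DBCA, the same choices appear as explicit clauses of CREATE DATABASE. A deliberately minimal sketch; a real statement would normally include datafile, logfile and other clauses:

CREATE DATABASE mydb
   CHARACTER SET AL32UTF8
   NATIONAL CHARACTER SET AL16UTF16;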


Summary

  • There are many choices for the Oracle Database character set
  • Your hardware, operating system and data will constrain your choices for the database character set
  • Unicode is now the dominant character set encoding
  • Unicode UTF-8 is the dominant encoding for HTML
  • AL32UTF8 is the best supported Unicode encoding for the Oracle Database character set
  • AL32UTF8 has been the default database character set since 12.2
  • AL32UTF8 is the recommended Oracle database character set


Disclaimer: These are my personal thoughts and do not represent Oracle’s official viewpoint in any way, shape, or form.