Introduction

Your choice of database character set is important for several reasons:

  • Changing the Oracle database character set usually requires creating a new database, and the data conversion can be non-trivial
  • Runtime translation of character set data can be expensive, and a bad choice can lead to data loss
  • Due to company mergers and the rise of popular web-based systems, the need for international, multilingual databases has increased

Good and bad database character sets


If you have the choice, always use the Unicode character set AL32UTF8 and avoid UTF8:

  • Oracle UTF8 does not fully implement Unicode UTF-8 and is not recommended for new databases
  • Oracle AL32UTF8 fully implements Unicode UTF-8
  • Oracle AL32UTF8 is a superset of Oracle UTF8
  • AL32UTF8 has the best feature support for the Oracle database
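As a quick sanity check, you can see which character sets an existing database was created with by querying the data dictionary. A minimal sketch (the values shown are what a modern default install would typically report):

-- Show the database character set and the national character set
SELECT parameter, value
  FROM nls_database_parameters
 WHERE parameter IN ('NLS_CHARACTERSET', 'NLS_NCHAR_CHARACTERSET');

-- Typical output on a recent release:
-- NLS_CHARACTERSET         AL32UTF8
-- NLS_NCHAR_CHARACTERSET   AL16UTF16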

The Oracle Database supports multilingual applications with country-specific conventions for language, dates, times, currency, numbers and calendars. This is achieved via Oracle Database Globalization Support using the Oracle NLS [National Language Support] runtime library [NLSRTL].
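For example, the NLS settings can be changed per session to switch language and territory conventions. A small sketch using standard NLS session parameters:

-- Render dates using French language and territory conventions
ALTER SESSION SET NLS_LANGUAGE = 'FRENCH';
ALTER SESSION SET NLS_TERRITORY = 'FRANCE';

SELECT TO_CHAR(SYSDATE, 'Day DD Month YYYY') AS today FROM dual;
-- e.g. Lundi 14 Juillet 2025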

The origins of the many different character sets that the Oracle database supports are complex.  Older computer hardware, operating systems and versions of the Oracle database used to constrain the choices for the database character set.  Since Oracle 7, your choices for the database character set have increased. Since Oracle 12.2, the default and recommended database character set has been AL32UTF8.

This blog looks at the past and present of character sets, Unicode, universal character sets and your choices for the Oracle database character set.


Some character encoding pre-history

Historically, computers had very limited language support. EBCDIC [Extended Binary Coded Decimal Interchange Code] has been used on IBM System/360 mainframes since the early 1960s, and the ASCII [American Standard Code for Information Interchange] standard was first published in 1963.

EBCDIC and ASCII code tables


These character encodings were both created in the USA and were based on the English alphabet.  From the 1950s to the mid 1970s, punch cards were the main input medium for [mainframe] computers.  The ability to cram alphanumeric encodings into 6 and eventually 8 bits was critical because those computers had very limited main memory.

punch card


While EBCDIC and ASCII are still supported today, they are not very useful for most of the world’s top 20 languages:

Language            Speakers       Characters
English             1.3 billion    < 128
Mandarin Chinese    1.1 billion    Thousands
Hindi               600 million    < 128
Spanish             543 million    < 128
Standard Arabic     274 million    < 128
Bengali             268 million    < 128
French              267 million    < 128
Russian             258 million    < 128
Portuguese          258 million    < 128
Urdu                230 million    < 128
Indonesian          199 million    < 128
Standard German     135 million    < 128
Japanese            126 million    Thousands [Kanji]
Marathi             99 million     < 128
Telugu              96 million     < 128
Turkish             88 million     < 128
Tamil               85 million     < 128
Yue Chinese         85 million     Thousands
Wu Chinese          82 million     Thousands
Korean              82 million     < 128


NRCS and dumb terminals

In 1983, to avoid the cost of creating separate [dumb] terminals for each country, Digital Equipment Corporation invented NRCS [National Replacement Character Sets] for its VT200 series [green screen] terminals. NRCS enabled different national character sets to be used on the same terminal hardware, and showed that many languages can be encoded in 7 bits [128 characters] or 8 bits [256 characters].  These NRCS-capable terminals for minicomputers were popular: by 1987, over one million of these ‘dumb’ terminals had been sold.

VT-200 green screen dumb terminal


ISO/IEC 8859

The success of NRCS on VT200 style terminals led to the ISO/IEC 8859 set of standards for 8 bit character encodings. These standards were designed for character encoding, not typography.

ISO/IEC 8859 standards

Part      Name                              Description
Part 1    Latin-1 Western European          Includes Danish, Dutch, English, Faroese, Finnish, French, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Rhaeto-Romanic, Scottish Gaelic and Spanish
Part 2    Latin-2 Central European          Includes Bosnian, Polish, Croatian, Czech, Slovak, Slovene, Serbian and Hungarian
Part 3    Latin-3 South European            Includes Turkish, Maltese and Esperanto
Part 4    Latin-4 North European            Includes Estonian, Latvian, Lithuanian, Greenlandic and Sami
Part 5    Latin/Cyrillic                    Includes Belarusian, Bulgarian, Macedonian, Russian, Serbian and Ukrainian
Part 6    Latin/Arabic                      Includes the most common Arabic language characters
Part 7    Latin/Greek                       Includes modern Greek and some ancient Greek
Part 8    Latin/Hebrew                      Includes modern Hebrew
Part 9    Latin-5 Turkish                   Mostly ISO/IEC 8859-1 with Turkish letters replacing the infrequently used Icelandic letters
Part 10   Latin-6 Nordic                    A rearrangement of Latin-4
Part 11   Latin/Thai                        Includes the characters needed for the Thai language
Part 12   Latin/Devanagari                  The encoding for Devanagari was abandoned in 1997
Part 13   Latin-7 Baltic Rim                Includes some characters missing from Latin-4 and Latin-6
Part 14   Latin-8 Celtic                    Includes the Gaelic and Breton languages
Part 15   Latin-9                           A modification of ISO/IEC 8859-1 that replaces some symbols with €, Š, š, Ž, ž, Œ, œ, and Ÿ
Part 16   Latin-10 South-Eastern European   Includes Albanian, Croatian, Hungarian, Italian, Polish, Romanian, Slovene, Finnish, French, German and Irish Gaelic

The various ISO/IEC 8859 standards meant that computers could enter, store, retrieve and display characters in many different local languages.  The problem was that these standards were largely incompatible for data exchange, because it is hard to define the translation or mapping between the different character encodings.


IBM PC with MS-DOS and code pages

IBM PC compatible computers running MS-DOS used 8 bit character sets known as code pages, some of which covered the same repertoires as the ISO/IEC 8859 standards.  The original IBM PC used Code page 437 [DOS Latin US].

IBM PC running MS-DOS

These IBM PC compatible computers could also load different code pages to enable various character sets, although with limited colors and fonts.

What was really needed was a ‘universal way to represent all character encodings’.  One of the challenges was that 8 bits is not enough to represent the huge number of Japanese, Korean, Mandarin, Yue or Wu Chinese characters. This means that more than one byte [8 bits] may be needed to encode a character, and more sophisticated hardware than dumb terminals is needed to render these [non ASCII] characters.


The birth of Unicode

Microcomputers with ‘high resolution’ bit-mapped displays from Xerox PARC, Apple and Microsoft enabled the display or rendering of complex characters.

Xerox PARC Alto, MS Windows and Apple Macintosh

Representatives from Xerox PARC and Apple started investigating a universal character set in 1987. By 1988, a document called Unicode 88 described a model which used 16 bits to represent the character encodings for modern languages. With more input from RLG [Research Libraries Group], Sun Microsystems and NeXT, the newly formed Unicode Consortium published the first Unicode standard in 1991. By the end of 1991 there were about 61 million IBM compatible PCs, 7 million Apple Macintoshes and about two thousand Xerox Alto computers.


The Unicode Standard

Unicode is focused on the [numeric] encoding of characters [graphemes] rather than the various renderings [glyphs] of those characters.

The Unicode standard defines the Unicode Transformation Formats (UTF): UTF-8, UTF-16, and UTF-32. In China, GB18030 is widely used; it fully implements Unicode although it is not an official Unicode standard.

Unicode logo

Even though Unicode started as a 16 bit encoding, the first 256 code points use the same 8 bit codes as ISO/IEC 8859-1 [Latin-1] to make it simple to convert existing western text to Unicode.
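You can see this overlap from inside an Oracle database: UNISTR takes a Unicode code point escape and returns the character in the database character set. A minimal sketch, assuming an AL32UTF8 database:

-- U+00E9 has the same code point in Unicode as 'é' has in ISO/IEC 8859-1
SELECT UNISTR('\00E9') AS latin_small_e_acute FROM dual;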


Unicode UTF-8

UTF-8 is a variable width encoding and can represent all 1,112,064 valid Unicode code points.

Variable width UTF8 encoding

To minimize space, UTF-8 uses 1 to 4 bytes to encode a character.  The first 128 Unicode characters take a single byte and have the same byte encodings as ASCII.
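You can observe this variable width behaviour with the LENGTHB and DUMP functions. A sketch, assuming an AL32UTF8 database character set:

-- 'A' [ASCII] is 1 byte, 'é' [Latin] is 2 bytes and '€' is 3 bytes in UTF-8
SELECT LENGTHB('A') AS ascii_bytes,
       LENGTHB('é') AS latin_bytes,
       LENGTHB('€') AS euro_bytes
  FROM dual;

-- DUMP shows the actual stored bytes in hex
SELECT DUMP('€', 1016) AS euro_encoding FROM dual;
-- Typ=96 Len=3 CharacterSet=AL32UTF8: e2,82,ac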

Technologies using UTF8

UTF-8 is widely used and is part of the specifications for HTTP headers, HTML, CSS and XML; it is also a character set registered with IANA.


Unicode UTF-16

UTF-16 is a variable width encoding, using one or two 16 bit code units to represent the same 1,112,064 valid Unicode code points.

Variable width UTF16

UTF-16 is used internally by Java, JavaScript and Microsoft Windows.
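In the Oracle database, UTF-16 appears as the AL16UTF16 national character set used by the NCHAR data types. A sketch, assuming the default AL16UTF16 national character set:

-- '€' is one 16 bit code unit [2 bytes] in UTF-16
SELECT LENGTHB(N'€') AS euro_utf16_bytes FROM dual;   -- 2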

Java, JavaScript and Windows use UTF16


Unicode UTF-32

UTF-32 is a fixed width encoding, using four bytes (32 bits) for each Unicode code point.

Fixed width UTF32

Unicode Transformation Formats (UTF) are a space-time trade-off:

  • UTF-8 minimizes space, but finding the offset of the nth character is more complex because the width is variable
  • UTF-16 tends to use more space than UTF-8, and character offsets are still tricky because of surrogate pairs
  • UTF-32 uses the most space [four bytes per character], but finding character offsets is trivial
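Oracle exposes these trade-offs through its family of length functions. A sketch, assuming an AL32UTF8 database; U+1D11E [a musical symbol] is a supplementary character, written here as a UTF-16 surrogate pair:

SELECT LENGTHB(UNISTR('\D834\DD1E')) AS utf8_bytes,    -- 4 bytes in UTF-8
       LENGTH2(UNISTR('\D834\DD1E')) AS utf16_units,   -- 2 UTF-16 code units [a surrogate pair]
       LENGTH4(UNISTR('\D834\DD1E')) AS code_points    -- 1 Unicode code point
  FROM dual;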


Eventual dominance of Unicode

UTF-8 has been the most common HTML encoding since 2008 and now accounts for 97.5% of all web pages.

Character encoding popularity

Although Java, JavaScript, Python and MS Windows use UTF-16, more recently there has been a move towards UTF-8 in MS Windows, Go, Rust, Swift, C++ and Julia.


Unicode Character sets in the Oracle Database

The Oracle Database has supported five different Unicode character sets.

Unicode character sets in Oracle Database

Things to note about Unicode support in the Oracle Database:

  • AL16UTF16 is NOT a supported database character set. It is only supported as a national character set
  • AL24UTFFSS is no longer supported
  • UTF8 only supports Unicode 3.0 and does not support the supplementary characters needed for Japanese and Chinese character sets
  • UTFE is only useful if the system is EBCDIC based
  • This leaves AL32UTF8 as the Unicode database character set of choice; the query below shows what your release supports
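You can list the character sets that your release still supports by querying V$NLS_VALID_VALUES. A quick sketch:

SELECT value AS character_set
  FROM v$nls_valid_values
 WHERE parameter = 'CHARACTERSET'
 ORDER BY value;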


How Unicode data is stored in an Oracle Database

The Oracle Database uses the following encoding for UTF-8:

Oracle Database UTF8 encoding

Comments on Oracle UTF-8 encoding:

  • ASCII characters use 1 byte

  • European (except ASCII), Arabic, and Hebrew characters require 2 bytes

  • Indic, Thai, Chinese, Japanese, and Korean characters as well as certain symbols such as the euro symbol require 3 bytes

  • Characters in the Private Use Area #1 require 3 bytes

  • Supplementary characters require 4 bytes

  • Characters in the Private Use Area #2 require 4 bytes

  • The AL32UTF8 character set supports 1-byte, 2-byte, 3-byte, and 4-byte values

  • The UTF8 character set supports 1-byte, 2-byte, and 3-byte values, but not 4-byte values

  • AL32UTF8 is a superset of UTF8 as it can support 4-byte values
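These byte counts matter when sizing columns. By default, VARCHAR2 lengths are measured in bytes, so multi-byte characters reduce the number of characters that fit; character length semantics avoids the surprise. A sketch with an illustrative table name, assuming an AL32UTF8 database:

CREATE TABLE charset_demo (
  c_byte VARCHAR2(9 BYTE),  -- up to 9 bytes: holds only three '€' characters [3 bytes each]
  c_char VARCHAR2(9 CHAR)   -- up to 9 characters, however many bytes each needs
);

INSERT INTO charset_demo (c_char) VALUES ('€€€€€€€€€');  -- 9 characters [27 bytes]: fits
-- Inserting the same string into c_byte would raise ORA-12899 [value too large for column]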


Choosing an Oracle Database character set

The character encoding scheme used by the database is defined as part of the CREATE DATABASE statement. All SQL CHAR data type columns (CHAR, CLOB, VARCHAR2, and LONG), including columns in the data dictionary, have their data stored in the database character set. In addition, the choice of database character set determines which characters can name objects in the database. SQL NCHAR data type columns (NCHAR, NCLOB, and NVARCHAR2) use the national character set.

Oracle Database uses the database character set for:

  • Data stored in SQL CHAR data types (CHAR, VARCHAR2, CLOB, and LONG)

  • Identifiers such as table names, column names, and PL/SQL variables

  • Entering and storing SQL and PL/SQL source code
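The split between the database character set and the national character set is visible at the column level. A sketch; the table and column names are illustrative:

CREATE TABLE customers (
  name  VARCHAR2(100 CHAR),  -- stored in the database character set, e.g. AL32UTF8
  notes NVARCHAR2(100)       -- stored in the national character set, e.g. AL16UTF16
);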


Consider the following questions when you choose an Oracle Database character set for the database:

  • What languages does the database need to support now?

  • What languages will the database need to support in the future?

  • Is the character set available on the operating system?

  • What character sets are used on clients?

  • How well does the application handle the character set?

  • What are the performance implications of the character set?

  • What are the restrictions associated with the character set?


Recommended character set

Oracle recommends using Unicode for all new system deployments, and migrating legacy systems to Unicode is also recommended. Deploying your systems in Unicode offers many advantages in usability, compatibility, and extensibility. Oracle Database enables you to deploy high-performing systems faster and more easily while utilizing the advantages of Unicode. Even if you do not need to support multilingual data today and have no other requirement for Unicode, it is still likely to be the best choice for a new system: in the long run it will save you time and money and give you a competitive advantage.

If in doubt, use AL32UTF8:
  • AL32UTF8 is a superset of UTF8
  • AL32UTF8 is a universal character set
    • It can represent all known language character encodings
  • AL32UTF8 is space efficient
    • ASCII uses one byte for storage
    • European, Arabic and Hebrew use two bytes for storage
    • Indic, Thai and common Chinese, Japanese and Korean use three bytes for storage
    • Additional Chinese, Japanese, Korean and historic languages use four bytes


OUI [Oracle Universal Installer] character set

When you create an Oracle database, you must choose both the database character set and the national character set:

  • Choose AL32UTF8 as the database character set
  • Choose AL16UTF16 as the national character set
  • Choose your default language as appropriate
  • Choose your default territory as appropriate
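If you create the database manually rather than through OUI or DBCA, the same choices appear as explicit clauses of CREATE DATABASE. A deliberately minimal sketch; a real statement would normally include datafile, logfile and other clauses:

CREATE DATABASE mydb
   CHARACTER SET AL32UTF8
   NATIONAL CHARACTER SET AL16UTF16;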


Summary

  • There are many choices for the Oracle Database character set
  • Your hardware, operating system and data will constrain your choices for the database character set
  • Unicode is now the dominant character set encoding
  • Unicode UTF-8 is the dominant encoding for HTML
  • AL32UTF8 is the best supported Unicode encoding for the Oracle Database character set
  • AL32UTF8 has been the default database character set since 12.2
  • AL32UTF8 is the recommended Oracle database character set


Disclaimer: These are my personal thoughts and do not represent Oracle’s official viewpoint in any way, shape, or form.