Oracle Text and Accented Characters

October 3, 2023 | 8 minute read
Roger Ford
Principal Product Manager
Text Size 100%:

Oracle Text provides easy full-text search on text in the Oracle Database. In most cases, you can create a simple default text index, which will work straight out of the box.

But sometimes, you need to customize the index to deal with unusual text, or to handle particular search requirements. One common example is how searches should deal with accented characters. Should a search for 'café' also match 'cafe'? This is a more complicated topic than it may initially appear, so read on for all the gory details.

The lexer does the initial part of processing text into indexable units. There are a number of lexer attributes in Oracle Text that affect how accented characters are dealt with. We're going to look at each of these and consider the effects they have, alone and together.

Before we look into them, a quick aside on typing accented characters if you don't have them natively on your keyboard.

Most of the world seems to have largely settled on UTF-8 as a way of representing text. UTF-8 is compatible with 7-bit ASCII for the common Roman letters a-z, A-Z, numbers and common punctuations. For other characters, such as accented characters and non-Roman alphabets (e.g. Greek, Cyrillic, Arabic, Chinese), it uses multiple bytes - up to four bytes per character. However, Microsoft Windows (at least in the Western world) uses the WIN1252 character set as an input character set.

If I want to type one of the accented characters in that set, I can hold down the ALT key in most Windows applications, then enter the four-digit decimal code for that character on my numeric keypad. So, to enter a lower-case o-umlaut character ("o" with two dots above it: "ö"), I would hold down ALT and type 0246 on the numeric keypad.

If you use Emacs you can use CTRL+x 8 followed by a punctuation mark and a letter, so o-umlaut would be CTRL+x 8 " o.

If you use vi, there's probably some arcane way of doing it, but I'm afraid I really don't know how.

If your aim is to insert characters into the Oracle Database, then there are a few pitfalls. SQL*Net attempts to convert characters from the client character set into the database character set, and many client programs assume a default character set of US7ASCII. You can avoid conversion by setting an environment variable, for example if your input character set and database character set are UTF8, you can set the following variable on your client:

      export NLS_LANG=american_america.al32utf8

Alternatively you can avoid any risk of ambiguity by using the SQL UNISTR() function. That accepts Unicode (UCS-2 not UTF-8, sorry!) hexadecimal character values preceded by backslash, and outputs the correct character. So 'café' could be inserted using either:

      INSERT INTO tab VALUES ('caf'||UNISTR('\\00E9'));

or

      INSERT INTO tab VALUES (UNISTR('caf\00E9'));

UNISTR() is strongly recommended when writing scripts due to the avoidance of any client-side character set issues. It is somewhat harder to read, of course.

Now let's take a look at the various lexer attributes used for index customization, and the effects that they have.

BASE_LETTER

Let's start with perhaps the simplest - the BASE_LETTER attribute.If I set this to "true" then characters with accents in the text will effectively be indexed as their base form.

So for example if I index the German word "schön" (meaning beautiful in English) then the actual indexed token will be "schon".That means if the user searches for " schön" or "schon" then the word will be found (since the query term is processed in the same way).

That makes life much easier for people who don't have the oh-umlaut character “ö” on their keyboard, and don't know the ALT trick mentioned above (or don't want to have to look up the necessary character code).

So that's simple - might as well set that option on always, right? Well, not necessarily. Because in German, the word "schon" actually has a quite different meaning to "schön", and German users wouldn't want to find the unaccented word if they specifically meant to look for the accented word.
So the simple rule of thumb is this: If you think that most of your users will be searching for words for which they have the correct keyboard (for example German users searching German text) then it's best to set BASE_LETTER to false. But if you think users might want to search foreign words for which they do not have the correct keyboard (for example English, or multi-national users searching German or mixed text) then it's best to set BASE_LETTER to TRUE.

ALTERNATE_SPELLING

Now it starts to get a little more complex. Some languages - notably German again - allow you to avoid the use of accented characters by spelling the words differently. In German, an accented character is replaced by the un-accented form of the same letter, followed by an "e". So "schön" could equally validly be spelled "schoen", and every German user would recognize it as the same word. Equally "Muenchen" (the city English speakers call Munich) would be recognized as the same as "München".

So that we can treat these alternate spelling forms as equivalent Oracle Text has the ALTERNATE_SPELLING attribute. When set to "german", Oracle Text will look for the "alternate form" of "vowel + e" and index both that and the accented form of the word. When processing a query, it will equally translate "oe" into "ö", etc, thus ensuring that the word can always be found, regardless of which of the alternate forms is searched for, and which is in the indexed text.

Aside:anyone following closely might ask "why does it index both the alternate form "schoen" and the accented form "schön"? Surely it would be sufficient to just index the "shcön" form? Well, mostly it would. But what happens if the word in question was actually an English word in the middle of the German text, such as "poet"? OK, it would be indexed as “pöt” and anybody searching for "poet" would still find it (since the transformation is applied to the searchterm as well). But what if they used a wildcard and searched for "po%"?They would still expect to find the term, but if only the accented form was indexed, they wouldn't. Hence the reason we index both forms, just in case.

Combining ALTERNATE_SPELLING and BASE_LETTER

OK, so we want ALTERNATE_SPELLING set to "german" for our German text. But we also know that people with non-German keyboards are often going to search it. So we set BASE_LETTER to true as well. What happens now?

If the indexer comes across "schön", ALTERNATE_SPELLING would normally index that without any change. But BASE_LETTER forces it to the unaccented form, and "schon" is actually indexed. If the indexer comes across "schoen", then ALTERNATE_SPELLING decides it should be indexed as both "schoen" and "schön". But before the tokens are written to the index, BASE_LETTER is applied, so the tokens "schoen" and "schon" are indexed.

That all works fine, and we can find either term by searching for "schon", "schön" or "schoen". Great!

But (there's always a but) what happens if we index the French word "Rouède" (the name of a town near the Spanish border)?"uè" is not a candidate for ALTERNATE_SPELLING, so it is left alone. Then BASE_LETTER is applied, and the word "Rouede" is written to the index. If the user searches for "Rouède" then the query-time processing works the same, the search is converted to "Rouede" and the query works fine. However, if the user searches for the base letter form "Rouede", things don't go so well. This time ALTERNATE_SPELLING does get applied to the query term (since the query processor has no way of knowing that the "e" character should be accented) and the searchterm is converted to "Roüde". BASE_LETTER is then applied, and it looks for "Roude" in the index. But the indexed term is "Rouede",so nothing is found.

OVERRIDE_BASE_LETTER

To solve this problem, the OVERRIDE_BASE_LETTER attribute was introduced.

If you set OVERRIDE_BASE_LETTER to "true", then ALTERNATE_SPELLING will "mask" BASE_LETTER. That means that if we meet accented characters in the text that have an alternate form (such as "ö"), we will index them in their original, accented form and also in their alternate form. If we meet them in their alternate form (eg Muenchen) we will index ONLY the alternate form and not transform them.  Accented characters that do not have an alternate form (such as "è") have BASE_LETTER processing applied to them to transform them to their equivalent unaccented character.  Then at query time, we apply only ALTERNATE_SPELLING to any appropriate accented search terms, and BASE_LETTER to all others. This has the positive effect that our previous example "rouède" can be found, if searched for with or without the accent on the "e" character.

It does have the negative effect that base letter searches no longer work on German words - we can't search for "schon" anymore, only "schön" or "schoen" will work. So OVERRIDE_BASE_LETTER makes sense if we want to perform ALTERNATE_SPELLING on German (or other specified language) words and BASE_LETTER on all other languages.

More Information

You can read the full specifications for these options in the Oracle Text Developers Guide under the following topics:
   Basic Lexer settings
   Alternate Spelling

Appendix: Test Script

This is the script I used to test the effects of the various options. To avoid any issues with character set translation, I used the UNISTR() function to create Unicode characters for my accented characters. Note the German words are prefixed by two-letter codes "wa" - with accent, "na" - no accent and "af" alternate form. That allowed me to distinguish in the index between the index terms derived from "schön" and those derived from "schon".

begin 
   ctx_ddl.drop_preference   ( 'my_lexer'); 
end; 
/

begin 
   ctx_ddl.create_preference ( 'my_lexer', 'BASIC_LEXER' ); 
   ctx_ddl.set_attribute     ( 'my_lexer', 'BASE_LETTER', 'true' ); 
   ctx_ddl.set_attribute     ( 'my_lexer', 'OVERRIDE_BASE_LETTER', 'true'); 
   ctx_ddl.set_attribute     ( 'my_lexer', 'ALTERNATE_SPELLING','german' ); 
end; 
/

drop table tt; 
create table tt(a1 number primary key,text varchar2(45));

-- town name "Rouède", accent on the e 
insert into tt values (1,'rou'||unistr('\00E8')||'de');
-- schön with accent (wa) 
insert into tt values (2,'wasch'||unistr('\00F6')||'n');
-- schon no accent (na) 
insert into tt values (3,'naschon');
-- muenchen alternate form (af) 
insert into tt values (4,'afmuenchen');
commit;
 
select * from tt;

create index tta on tt(text) indextype is ctxsys.context parameters ( 'lexermy_lexer' );

set feedback 2
select token_text, token_type from dr$tta$i;

PROMPT searching for the base letter form, without accent on the first e select * from tt where contains(text,'Rouede')>0;

PROMPT and with the accent select * from tt where contains(text,'Rou'||unistr('\00E8')||'de') > 0;
--select * from tt where contains(text,'m'||unistr('\00FC')||'nchen')>0; --select * from tt where contains(text,'muenchen') > 0;

set echo on
--select * from tt where contains(text,'naschoen') > 0; --select * from tt where contains(text,'naschon') > 0; --select * from tt where contains(text,'na'||unistr('\00F6')||'n') > 0;
select * from tt where contains(text,'waschon') > 0; select * from tt where contains(text,'waschoen') > 0; select * from tt where contains(text,'wa'||unistr('\00F6')||'n') > 0;
set echo off

-- The following section shows how to see how the query has been transformed - it shows -- the actual words looked for in the index.

drop table test_explain; create table test_explain( explain_id varchar2(30), id number, parent_id number, operation varchar2(30), options varchar2(30), object_name varchar2(64), position number, cardinality number);

begin 
   ctx_query.explain( 
      index_name    => 'tta', 
      text_query    => 'wasch'||unistr('\00F6')||'n', 
      explain_table => 'test_explain', 
      sharelevel    => 0, 
      explain_id    => 'Test'); 
end; 
/

col explain_id for a10 col id for 99 col parent_id for 99 col operation for a10 col options for a10 col object_name for a20 col position for 99

select explain_id, id, parent_id, operation, options, object_name, position from test_explain order by id;

 

Roger Ford

Principal Product Manager

Roger Ford has been at Oracle since 1987. He started writing code for Oracle 6, and has worked in Development, Support, Technical Marketing and Product Management.


Previous Post

Oracle Database Appliance X10 – Much More Powerful Database-Optimized Entry-Level Engineered Systems

Christian Craft | 3 min read

Next Post


How to configure Oracle Database API for MongoDB for Autonomous Databases with private endpoint

Hermann Baer | 12 min read