Annals of Oracle's Improbable Errors: Unicode

Showing posts with label Unicode. Show all posts

Tuesday, August 06, 2013

DECOMPOSE this: when what you see is not exactly what you think you see

Always check out the original article at http://www.oraclequirks.com for latest comments, fixes and updates.

I was trying to call function DECOMPOSE while investigating a strange problem I had with some Unicode strings that could not be properly indexed by Oracle Text and, amazingly, I found out that most pages where usage examples of such DECOMPOSE function are given, most likely never attempted to go beyond its default form as given by SQL Reference and I'd dare to say that they didn't even try to execute this function once.

Even more amazing is the fact that the syntax diagrams of DECOMPOSE in *all* versions up until 11.2 are wrong.

But let's start from the beginning.
As per SQL reference example, if you call this

select DECOMPOSE('Crónicas') d from dual;

you should get this result:

---------

Cro´nicas

Unfortunately what you see in SQL Developer (or Apex SQL Workshop) is instead:

-------- 
Crónicas

So, first of all, apparently either this function doesn't work as advertised (my db is AL32UTF8) or the functional description lacks some important detail.
After various attempts I realized that the "client" software must be playing a hidden role in this messy situation.
If, instead of displaying the resulting string, I get the length of the original string and its decomposed equivalent, I see that the latter is one character longer.

select length('Crónicas') l0

     , length(DECOMPOSE('Crónicas')) l1

  from dual;

        L0         L1
---------- ----------
         8          9

Indeed, if I get a hex dump, I can spot the difference:

select rawtohex('Crónicas') s0
     , rawtohex(DECOMPOSE('Crónicas')) s1

  from dual;

S0                 S1                 
------------------ --------------------
4372C3B36E69636173 43726FCC816E69636173

The decomposed string contains two characters:

6F is "o"
CC81 is a special Unicode character called "Combining acute accent"

So, it turns out that the result displayed in the Oracle SQL Reference is fictitious because the operating system may kick in and recombine the two distinct characters in order to display them according to the Unicode standard and you can't visually spot the difference.

Unfortunately, Oracle Text in 10.2 doesn't seem to cope with these combined accented characters, so you might be looking at two apparently identical strings on your monitor that are the result of distinct unicode character combinations which makes somewhat difficult to understand why one was found by a search query, while the latter wasn't.

But this was just the first part of the intriguing story...

As I already said, DECOMPOSE's syntax diagrams are dead wrong.
According to them, you should be able to call this functions also in the following forms:

select DECOMPOSE('Crónicas' CANONICAL) d from dual;

but all you get is the infamous ORA-00907:

ORA-00907: missing right parenthesis

According to the book "PLSQL Programming", you should be able to call DECOMPOSE as follows:

select DECOMPOSE('Crónicas', CANONICAL) from dual; -- added the comma

but even in this case you hit another error:

ORA-00904: "CANONICAL": invalid identifier

It turns out that the correct form is instead:

select DECOMPOSE('Crónicas','CANONICAL') from dual;

select DECOMPOSE('Crónicas','COMPATIBILITY') from dual;

I was unable to quickly find an example returning different results for the two modes, may be I'll find them later.

As a final note, if you specify a wrong literal as the second parameter, you'll get the following error:

ORA-12702: invalid NLS parameter string used in SQL function

Wednesday, September 26, 2007

ASCII, ASCIISTR, CHR, UNISTR: many ways of converting characters

Let me state first that this is not meant to be an academic discussion about Unicode, UTF-8, UCS-2 and other interesting encoding problems.

If you are looking for a concise yet funny guide to the differences between UTF-8, UTF-16, UCS-2, UCS-4 and so on, you'd better off checking out "the Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets".

So, this morning i needed to find out the Unicode value for the special punctuation character "…".

This character is catalogued in the Unicode character index by name as horizontal ellipsis and its hex code point is 2026.

As the database character set is defined as "AL32UTF8", my first temptation was to convert this hex number into its decimal equivalent:

select chr(8230) c from dual;

C
-
&

But what I've got here is the ASCII character 38.

At first one could think that this caused by a wrong NLS setting or by the fact that perhaps function CHR does not return multibyte characters.

Nothing of the kind.

This result is easily explained trying out the following:

select chr(mod(8230,256)) c
from dual;

C
-
&

Moreover, if the db character set is AL32UTF8:

select chr(14844070) c
from dual;

C
-
…

So, CHR can return multibyte characters, indeed it returns any character in the database character set.

But what has to do 14844070 with 8230?
My short answer is nothing.

I presume 14844070 is the numeric code in the internal UCS2 representation of this Unicode character that I've got from:

select ascii('…') n
from dual;

N
-
14844070

Anyway, while it could be interesting to know where this 14844070 comes from in the end, there is an easy way to get a character from its unicode code point:

select unistr('\2026') c
from dual;

C
-
…

Then, if for some reason you need to know the numeric code of a character starting from its unicode code point, the easiest way is to run:

select ascii(unistr('\2026')) n
from dual;

N
-
14844070

Again, the numeric value returned may vary with the database character set, although i could not try this out because i couldn't find any database in a different character set.

Another interesting function is ASCIISTR that is converting a string into a plain ASCII string where non-ASCII characters are represented by their UTF-16 code unit.

select asciistr('hello world…') s
from dual;

S
-
hello world\2026

This function can be especially useful if you need to dump data into a plain ASCII file (i.e. not UTF-8 or UTF-16 encoded).

For an Oracle official document discussing the topic, especially for the maximum string length implications for database columns and PL/SQL variables, see the Oracle Globalization Support Guide.

PS: on Windows another way of getting a unicode code point (or copying the character to the clipboard directly for pasting into a SQL source...) is to search for the desired character using the accessories/system/character map.

Annals of Oracle's Improbable Errors

Tuesday, August 06, 2013

DECOMPOSE this: when what you see is not exactly what you think you see

Wednesday, September 26, 2007

ASCII, ASCIISTR, CHR, UNISTR: many ways of converting characters

yes you can!

latest articles