Wednesday, September 26, 2007

ASCII, ASCIISTR, CHR, UNISTR: many ways of converting characters

Let me state first that this is not meant to be an academic discussion about Unicode, UTF-8, UCS-2 and other interesting encoding problems.

If you are looking for a concise yet funny guide to the differences between UTF-8, UTF-16, UCS-2, UCS-4 and so on, you'd better off checking out "the Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets".

So, this morning i needed to find out the Unicode value for the special punctuation character "…".

This character is catalogued in the Unicode character index by name as horizontal ellipsis and its hex code point is 2026.

As the database character set is defined as "AL32UTF8", my first temptation was to convert this hex number into its decimal equivalent:
select chr(8230) c from dual;

C
-
&
But what I've got here is the ASCII character 38.

At first one could think that this caused by a wrong NLS setting or by the fact that perhaps function CHR does not return multibyte characters.

Nothing of the kind.

This result is easily explained trying out the following:
select chr(mod(8230,256)) c
from dual;

C
-
&
Moreover, if the db character set is AL32UTF8:
select chr(14844070) c
from dual;

C
-

So, CHR can return multibyte characters, indeed it returns any character in the database character set.

But what has to do 14844070 with 8230?
My short answer is nothing.

I presume 14844070 is the numeric code in the internal UCS2 representation of this Unicode character that I've got from:
select ascii('…') n
from dual;

N
-
14844070
Anyway, while it could be interesting to know where this 14844070 comes from in the end, there is an easy way to get a character from its unicode code point:
select unistr('\2026') c
from dual;

C
-

Then, if for some reason you need to know the numeric code of a character starting from its unicode code point, the easiest way is to run:
select ascii(unistr('\2026')) n
from dual;

N
-
14844070
Again, the numeric value returned may vary with the database character set, although i could not try this out because i couldn't find any database in a different character set.

Another interesting function is ASCIISTR that is converting a string into a plain ASCII string where non-ASCII characters are represented by their UTF-16 code unit.
select asciistr('hello world…') s
from dual;

S
-
hello world\2026
This function can be especially useful if you need to dump data into a plain ASCII file (i.e. not UTF-8 or UTF-16 encoded).

For an Oracle official document discussing the topic, especially for the maximum string length implications for database columns and PL/SQL variables, see the Oracle Globalization Support Guide.


PS: on Windows another way of getting a unicode code point (or copying the character to the clipboard directly for pasting into a SQL source...) is to search for the desired character using the accessories/system/character map.

4 comments:

Oliver said...

Thanks for the helpful information! :=

Anonymous said...

The value that Oracle shows for ASCII('…') (Unicode ellipsis character, U+2026) is 14844070. In hexadecimal, this is 0xE280A6. This corresponds to the UTF-8 encoding of this character taken as a big-endian number.

Cheers, Adrian

Byte64 said...

Thanks for the information Adrian.

swati said...

Database1: UTF8
------------------------
select chr(233) from dual --> null
select chr(50089) from dual --> é
select ascii('é') from dual --> 50089
select ascii(tablename.fieldname) from tablename --> 50089


Database2: AL16UTF16
-----------------------
select chr(233) from dual --> null
select chr(50089) from dual --> é
select ascii('é') from dual --> 50089
select ascii(substr(product_description,2,1)) from product_master_data where product_code = '6' --> 233

why only in case of query its showing 233 I don't understand.

yes you can!

Two great ways to help us out with a minimal effort. Click on the Google Plus +1 button above or...
We appreciate your support!

latest articles