Annals of Oracle's Improbable Errors: UNISTR

Showing posts with label UNISTR. Show all posts

Thursday, January 19, 2012

How to remove the thousands group separator in numbers representing years from apex flash charts

Always check out the original article at http://www.oraclequirks.com for latest comments, fixes and updates.

A quick tip for formatting years when they show up as labels in flash charts inside Oracle Application Express (Apex).
The problem is in that Anychart is too smart, so when it encounters a label that looks like a number, it will format it with the separator for the thousands, i.e. "2012" will be displayed either as "2,012" or "2.012" depending on the application locale settings.

As I don't really liked to see years formatted that way, I thought I had to find a way to suppress the thousands separator without having to resort to using a custom XML chart source file. After a few attempts, it turned out that by prepending a non-breaking space to the string, I could fool anychart's super-smart engine eventually. The technique applies to UTF-8 encoded pages.

UNISTR('\00A0') || TO_CHAR(gl_date, 'YYYY')

Not a big deal but something that may come in handy if you are in a hurry and short of more brilliant ideas. May be there are other ways to achieve the same result, this was just the first one that occurred to me.

The chart clearly displays my annual income as apex freelance, just in case you are wondering :-P

Monday, March 30, 2009

LENGTHB and SUBSTRB: when smaller is bigger

Always check out the original article at http://www.oraclequirks.com for latest comments, fixes and updates.

Lately I'm working a lot with UTF-8 files and while doing some test i came across the following situation, which sounds rather odd at a first sight but it's perfectly legitimate in the way Oracle treats multi-byte strings, so i thought it could be an interesting story for those who rarely deal with this class of problems.

In short, given a multibyte string (i.e. AL32UTF8), when you take a byte-wise substring, a bigger chunk of bytes can result in a shorter string character-wise.

Standard functions SUBSTR and LENGTH have their byte-wise counterparts called SUBSTRB and LENGHTB respectively.
Now, let's take a multibyte string made up of two alternating Unicode characters, "A" = ascii 41h and the ellipsis "..." = Unicode 2026h.
As i explained some time ago, in order to convert back and forth Unicode character codes, you need to call dedicated string functions like UNISTR and ASCIISTR. For instance, the internal code for Unicode character 2026h can be obtained with ASCII(UNISTR('\2026')) which returns the magic number 14844070, that you see in the source code below.

The following three blocks of PL/SQL show this strange phenomenon, where the standard LENGTH function initially returns 6 for a substring of 10 bytes, then 5 for a substring of 11 bytes, then again 6 for a substring of 12 bytes. The relevant parameters and values are highligthed in color.

declare
s varchar2(50) := rpad(chr(14844070),10, 'A'||chr(14844070));
t varchar2(16) := substrb(s,1,10);
begin
for i in 1..length(s) loop
dbms_output.put(ascii(substr(s,i)));
dbms_output.put('|');
end loop;
dbms_output.new_line;
for i in 1..length(t) loop
dbms_output.put(ascii(substr(t,i)));
dbms_output.put('|');
end loop;
dbms_output.new_line;
dbms_output.put_line('length:'||length(t));
dbms_output.put_line('bytes:'||lengthb(t));
end;
/

14844070|65|14844070|65|14844070|65|14844070|65|14844070|65|
14844070|65|14844070|65|32|32|
length:6
bytes:10

declare
 s varchar2(50) := rpad(chr(14844070),10, 'A'||chr(14844070));
 t varchar2(16) := substrb(s,1,11);
begin
 for i in 1..length(s) loop
 dbms_output.put(ascii(substr(s,i)));
 dbms_output.put('|');
 end loop;
 dbms_output.new_line;
 for i in 1..length(t) loop
 dbms_output.put(ascii(substr(t,i)));
 dbms_output.put('|');
 end loop;
 dbms_output.new_line;
 dbms_output.put_line('length:'||length(t));
 dbms_output.put_line('bytes:'||lengthb(t));
end;
/

14844070|65|14844070|65|14844070|65|14844070|65|14844070|65|
14844070|65|14844070|65|14844070|
length:5
bytes:11

declare
 s varchar2(50) := rpad(chr(14844070),10, 'A'||chr(14844070));
 t varchar2(16) := substrb(s,1,12);
begin
 for i in 1..length(s) loop
 dbms_output.put(ascii(substr(s,i)));
 dbms_output.put('|');
 end loop;
 dbms_output.new_line;
 for i in 1..length(t) loop
 dbms_output.put(ascii(substr(t,i)));
 dbms_output.put('|');
 end loop;
 dbms_output.new_line;
 dbms_output.put_line('length:'||length(t));
 dbms_output.put_line('bytes:'||lengthb(t));
end;
/

14844070|65|14844070|65|14844070|65|14844070|65|14844070|65|
14844070|65|14844070|65|14844070|65|
length:6
bytes:12

These somewhat bizarre values are easily explained. When Oracle cuts a multi-byte string at some point falling inside a multibyte character, it replaces the meaningless fractional character portion with blanks. You can see this occurring in the first block, where the last two characters are blanks replacing 2/3 of the bytes originally belonging to the ellipsis character taking 3-bytes.

In conclusion when cutting multi-byte strings:

you may get spare blanks that were not part of the original string
you will get a varying length character-wise, depending on the cutting place.

Not a big deal, but something to be aware of.

Wednesday, September 26, 2007

ASCII, ASCIISTR, CHR, UNISTR: many ways of converting characters

Let me state first that this is not meant to be an academic discussion about Unicode, UTF-8, UCS-2 and other interesting encoding problems.

If you are looking for a concise yet funny guide to the differences between UTF-8, UTF-16, UCS-2, UCS-4 and so on, you'd better off checking out "the Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets".

So, this morning i needed to find out the Unicode value for the special punctuation character "…".

This character is catalogued in the Unicode character index by name as horizontal ellipsis and its hex code point is 2026.

As the database character set is defined as "AL32UTF8", my first temptation was to convert this hex number into its decimal equivalent:

select chr(8230) c from dual;

C
-
&

But what I've got here is the ASCII character 38.

At first one could think that this caused by a wrong NLS setting or by the fact that perhaps function CHR does not return multibyte characters.

Nothing of the kind.

This result is easily explained trying out the following:

select chr(mod(8230,256)) c
from dual;

C
-
&

Moreover, if the db character set is AL32UTF8:

select chr(14844070) c
from dual;

C
-
…

So, CHR can return multibyte characters, indeed it returns any character in the database character set.

But what has to do 14844070 with 8230?
My short answer is nothing.

I presume 14844070 is the numeric code in the internal UCS2 representation of this Unicode character that I've got from:

select ascii('…') n
from dual;

N
-
14844070

Anyway, while it could be interesting to know where this 14844070 comes from in the end, there is an easy way to get a character from its unicode code point:

select unistr('\2026') c
from dual;

C
-
…

Then, if for some reason you need to know the numeric code of a character starting from its unicode code point, the easiest way is to run:

select ascii(unistr('\2026')) n
from dual;

N
-
14844070

Again, the numeric value returned may vary with the database character set, although i could not try this out because i couldn't find any database in a different character set.

Another interesting function is ASCIISTR that is converting a string into a plain ASCII string where non-ASCII characters are represented by their UTF-16 code unit.

select asciistr('hello world…') s
from dual;

S
-
hello world\2026

This function can be especially useful if you need to dump data into a plain ASCII file (i.e. not UTF-8 or UTF-16 encoded).

For an Oracle official document discussing the topic, especially for the maximum string length implications for database columns and PL/SQL variables, see the Oracle Globalization Support Guide.

PS: on Windows another way of getting a unicode code point (or copying the character to the clipboard directly for pasting into a SQL source...) is to search for the desired character using the accessories/system/character map.

Annals of Oracle's Improbable Errors