Monday, March 30, 2009

LENGTHB and SUBSTRB: when smaller is bigger

Always check out the original article at http://www.oraclequirks.com for latest comments, fixes and updates.

Lately I'm working a lot with UTF-8 files and while doing some test i came across the following situation, which sounds rather odd at a first sight but it's perfectly legitimate in the way Oracle treats multi-byte strings, so i thought it could be an interesting story for those who rarely deal with this class of problems.

In short, given a multibyte string (i.e. AL32UTF8), when you take a byte-wise substring, a bigger chunk of bytes can result in a shorter string character-wise.

Standard functions SUBSTR and LENGTH have their byte-wise counterparts called SUBSTRB and LENGHTB respectively.
Now, let's take a multibyte string made up of two alternating Unicode characters, "A" = ascii 41h and the ellipsis "..." = Unicode 2026h.
As i explained some time ago, in order to convert back and forth Unicode character codes, you need to call dedicated string functions like UNISTR and ASCIISTR. For instance, the internal code for Unicode character 2026h can be obtained with ASCII(UNISTR('\2026')) which returns the magic number 14844070, that you see in the source code below.

The following three blocks of PL/SQL show this strange phenomenon, where the standard LENGTH function initially returns 6 for a substring of 10 bytes, then 5 for a substring of 11 bytes, then again 6 for a substring of 12 bytes. The relevant parameters and values are highligthed in color.

declare
s varchar2(50) := rpad(chr(14844070),10, 'A'||chr(14844070));
t varchar2(16) := substrb(s,1,10);
begin
for i in 1..length(s) loop
dbms_output.put(ascii(substr(s,i)));
dbms_output.put('|');
end loop;
dbms_output.new_line;
for i in 1..length(t) loop
dbms_output.put(ascii(substr(t,i)));
dbms_output.put('|');
end loop;
dbms_output.new_line;
dbms_output.put_line('length:'||length(t));
dbms_output.put_line('bytes:'||lengthb(t));
end;
/

14844070|65|14844070|65|14844070|65|14844070|65|14844070|65|
14844070|65|14844070|65|32|32|
length:6
bytes:10

declare
s varchar2(50) := rpad(chr(14844070),10, 'A'||chr(14844070));
t varchar2(16) := substrb(s,1,11);
begin
for i in 1..length(s) loop
dbms_output.put(ascii(substr(s,i)));
dbms_output.put('|');
end loop;
dbms_output.new_line;
for i in 1..length(t) loop
dbms_output.put(ascii(substr(t,i)));
dbms_output.put('|');
end loop;
dbms_output.new_line;
dbms_output.put_line('length:'||length(t));
dbms_output.put_line('bytes:'||lengthb(t));
end;
/

14844070|65|14844070|65|14844070|65|14844070|65|14844070|65|
14844070|65|14844070|65|14844070|
length:5
bytes:11

declare
s varchar2(50) := rpad(chr(14844070),10, 'A'||chr(14844070));
t varchar2(16) := substrb(s,1,12);
begin
for i in 1..length(s) loop
dbms_output.put(ascii(substr(s,i)));
dbms_output.put('|');
end loop;
dbms_output.new_line;
for i in 1..length(t) loop
dbms_output.put(ascii(substr(t,i)));
dbms_output.put('|');
end loop;
dbms_output.new_line;
dbms_output.put_line('length:'||length(t));
dbms_output.put_line('bytes:'||lengthb(t));
end;
/

14844070|65|14844070|65|14844070|65|14844070|65|14844070|65|
14844070|65|14844070|65|14844070|65|
length:6
bytes:12
These somewhat bizarre values are easily explained. When Oracle cuts a multi-byte string at some point falling inside a multibyte character, it replaces the meaningless fractional character portion with blanks. You can see this occurring in the first block, where the last two characters are blanks replacing 2/3 of the bytes originally belonging to the ellipsis character taking 3-bytes.

In conclusion when cutting multi-byte strings:
  1. you may get spare blanks that were not part of the original string
  2. you will get a varying length character-wise, depending on the cutting place.
Not a big deal, but something to be aware of.

No comments:

yes you can!

Two great ways to help us out with a minimal effort. Click on the Google Plus +1 button above or...
We appreciate your support!

latest articles