Minor UTF-8 decoding inconsistency
(1) By Florian Balmer (florian.balmer) on 2019-12-19 16:06:34
Description
There seems to be a minor inconsistency in the re_next_char() function, which decodes UTF-8 sequences to UTF-32 code points. The function rejects overlong UTF-8 sequences and code points from the UTF-16 surrogate range, returning U+FFFD REPLACEMENT CHARACTER instead. But there's a small loophole: the 3-byte overlong UTF-8 forms of the 11-bit code points U+0400..U+07FF are accepted.
The test in src/regexp.c on line 112 should be changed from:
if( c<=0x3ff || (c>=0xd800 && c<=0xdfff) ) c = 0xfffd;
To:
if( c<0x800 || (c>=0xd800 && c<=0xdfff) ) c = 0xfffd;
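To make the effect of the changed bound easier to see in isolation, here is a minimal sketch of a 3-byte decoder. It is not Fossil's re_next_char(): the name decode3() and the min_cp parameter are assumptions made for illustration, with min_cp=0x400 modeling the current test (c<=0x3ff) and min_cp=0x800 modeling the proposed one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not Fossil's re_next_char(): decode one 3-byte
** UTF-8 sequence and return its code point, or 0xFFFD if the result is
** below min_cp (overlong) or in the UTF-16 surrogate range.
** The caller guarantees that z points to at least 3 readable bytes. */
static unsigned decode3(const unsigned char *z, unsigned min_cp){
  unsigned c;
  assert( z!=NULL );
  if( (z[0]&0xf0)!=0xe0 || (z[1]&0xc0)!=0x80 || (z[2]&0xc0)!=0x80 ){
    return 0xfffd;  /* not a 3-byte lead byte plus two continuation bytes */
  }
  /* 4 payload bits from the lead byte, 6 from each continuation byte */
  c = ((unsigned)(z[0]&0x0f)<<12) | ((unsigned)(z[1]&0x3f)<<6) | (z[2]&0x3f);
  if( c<min_cp || (c>=0xd800 && c<=0xdfff) ) c = 0xfffd;
  return c;
}
```

With the bytes 0xe0 0x90 0x80 (overlong U+0400), decode3(z, 0x400) returns 0x0400, while decode3(z, 0x800) returns 0xFFFD.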
An alternative approach may be to accept all overlong UTF-8 forms, but I'm not sure whether this might yield false-positive matches with the SQLite REGEXP operator and the grep command for binary files/blobs.
Vaguely Related Issue
See the following comment in src/sqlite3.c:
/* When converting from UTF-16, the maximum growth results from
* translating a 2-byte character to a 4-byte UTF-8 character.
2-byte (1-word) UTF-16 characters are in the range U+0000..U+FFFF on the BMP, and can always be represented by UTF-8 sequences of at most 3 bytes, so the maximum growth factor is 1.5 (instead of 2). 4-byte (2-word) UTF-16 characters are in the range U+10000..U+10FFFF on the Supplementary Planes, and are always represented by 4-byte UTF-8 sequences (growth factor 1).
Maybe a tiny memory-saving optimization for SQLite? ;-)
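The growth-factor argument can be checked with a small length calculator. This is only a sketch: utf8_len_of_utf16() is a made-up name, not an SQLite function, and it assumes that an unpaired surrogate is counted as 3 bytes (the size of U+FFFD).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative helper (not part of SQLite): number of UTF-8 bytes needed
** to encode n UTF-16 code units. Assumption: an unpaired surrogate is
** counted as 3 bytes, as if it were replaced by U+FFFD. */
static size_t utf8_len_of_utf16(const uint16_t *z, size_t n){
  size_t i = 0, nByte = 0;
  assert( z!=NULL || n==0 );
  while( i<n ){
    unsigned c = z[i++];
    if( c<0x80 ){
      nByte += 1;                       /* 2 bytes in, 1 byte out */
    }else if( c<0x800 ){
      nByte += 2;                       /* 2 bytes in, 2 bytes out */
    }else if( c>=0xd800 && c<=0xdbff
           && i<n && z[i]>=0xdc00 && z[i]<=0xdfff ){
      i++;                              /* consume the low surrogate */
      nByte += 4;                       /* 4 bytes in, 4 bytes out */
    }else{
      nByte += 3;                       /* 2 bytes in, 3 bytes out */
    }
  }
  return nByte;
}
```

The worst case is the last branch: one 2-byte code unit becomes 3 bytes, which gives the factor of 1.5. A surrogate pair goes from 4 bytes to 4 bytes.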
TL;DR
The rest of the post has detailed test cases, code, and results.
Test Cases
U+03FF GREEK CAPITAL REVERSED DOTTED LUNATE SIGMA SYMBOL
- UTF-32:
00000000 00000000 00000011 11111111 = 0x000003ff
- UTF-8 2-byte normal:
11001111 10111111 = 0xcf 0xbf
- UTF-8 3-byte overlong:
11100000 10001111 10111111 = 0xe0 0x8f 0xbf
- UTF-8 4-byte overlong:
11110000 10000000 10001111 10111111 = 0xf0 0x80 0x8f 0xbf
U+0400 CYRILLIC CAPITAL LETTER IE WITH GRAVE
- UTF-32:
00000000 00000000 00000100 00000000 = 0x00000400
- UTF-8 2-byte normal:
11010000 10000000 = 0xd0 0x80
- UTF-8 3-byte overlong:
11100000 10010000 10000000 = 0xe0 0x90 0x80
- UTF-8 4-byte overlong:
11110000 10000000 10010000 10000000 = 0xf0 0x80 0x90 0x80
U+07FF NKO TAMAN SIGN
- UTF-32:
00000000 00000000 00000111 11111111 = 0x000007ff
- UTF-8 2-byte normal:
11011111 10111111 = 0xdf 0xbf
- UTF-8 3-byte overlong:
11100000 10011111 10111111 = 0xe0 0x9f 0xbf
- UTF-8 4-byte overlong:
11110000 10000000 10011111 10111111 = 0xf0 0x80 0x9f 0xbf
U+0800 SAMARITAN LETTER ALAF
- UTF-32:
00000000 00000000 00001000 00000000 = 0x00000800
- UTF-8 3-byte normal:
11100000 10100000 10000000 = 0xe0 0xa0 0x80
- UTF-8 4-byte overlong:
11110000 10000000 10100000 10000000 = 0xf0 0x80 0xa0 0x80
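The byte sequences listed above can be reproduced with a small encoder that deliberately skips the shortest-form check. This is a sketch for generating test input only; encode_utf8_n() is a hypothetical name and is not part of Fossil or SQLite.

```c
#include <assert.h>

/* Hypothetical helper: write code point cp as an n-byte UTF-8 pattern
** (n = 2, 3 or 4) without checking for the shortest form, so that
** overlong sequences can be produced for testing. Returns n. */
static int encode_utf8_n(unsigned cp, int n, unsigned char *out){
  static const unsigned char lead[] = { 0, 0, 0xc0, 0xe0, 0xf0 };
  int i;
  assert( n>=2 && n<=4 && out!=0 );
  assert( cp < (1u<<(5*n+1)) );  /* payload is 11, 16 or 21 bits */
  /* Fill the continuation bytes from the end, 6 bits at a time */
  for(i=n-1; i>0; i--){
    out[i] = (unsigned char)(0x80 | (cp & 0x3f));
    cp >>= 6;
  }
  /* The remaining high bits go into the lead byte */
  out[0] = (unsigned char)(lead[n] | cp);
  return n;
}
```

For example, encode_utf8_n(0x3ff, 3, buf) writes 0xe0 0x8f 0xbf, and encode_utf8_n(0x800, 4, buf) writes 0xf0 0x80 0xa0 0x80.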
Test Code
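struct ReInput utf8test = {
/* U+03FF: 2-byte normal, 3-byte overlong, 4-byte overlong */
"\xcf\xbf"
"\xe0\x8f\xbf"
"\xf0\x80\x8f\xbf"
/* U+0400: 2-byte normal, 3-byte overlong, 4-byte overlong */
"\xd0\x80"
"\xe0\x90\x80"
"\xf0\x80\x90\x80"
/* U+07FF: 2-byte normal, 3-byte overlong, 4-byte overlong */
"\xdf\xbf"
"\xe0\x9f\xbf"
"\xf0\x80\x9f\xbf"
/* U+0800: 3-byte normal, 4-byte overlong */
"\xe0\xa0\x80"
"\xf0\x80\xa0\x80",
0, 32
};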
fossil_print("U+03FF:\n=======\n");
fossil_print("UTF-8 normal (2-byte): U+%04X\n",re_next_char(&utf8test));
fossil_print("UTF-8 overlong (3-byte): U+%04X\n",re_next_char(&utf8test));
fossil_print("UTF-8 overlong (4-byte): U+%04X\n\n",re_next_char(&utf8test));
fossil_print("U+0400:\n=======\n");
fossil_print("UTF-8 normal (2-byte): U+%04X\n",re_next_char(&utf8test));
fossil_print("UTF-8 overlong (3-byte): U+%04X\n",re_next_char(&utf8test));
fossil_print("UTF-8 overlong (4-byte): U+%04X\n\n",re_next_char(&utf8test));
fossil_print("U+07FF:\n=======\n");
fossil_print("UTF-8 normal (2-byte): U+%04X\n",re_next_char(&utf8test));
fossil_print("UTF-8 overlong (3-byte): U+%04X\n",re_next_char(&utf8test));
fossil_print("UTF-8 overlong (4-byte): U+%04X\n\n",re_next_char(&utf8test));
fossil_print("U+0800:\n=======\n");
fossil_print("UTF-8 normal (3-byte): U+%04X\n",re_next_char(&utf8test));
fossil_print("UTF-8 overlong (4-byte): U+%04X\n",re_next_char(&utf8test));
Test Results (without the Fix)
U+03FF:
=======
UTF-8 normal (2-byte): U+03FF
UTF-8 overlong (3-byte): U+FFFD
UTF-8 overlong (4-byte): U+FFFD
U+0400:
=======
UTF-8 normal (2-byte): U+0400
UTF-8 overlong (3-byte): U+0400 (?)
UTF-8 overlong (4-byte): U+FFFD
U+07FF:
=======
UTF-8 normal (2-byte): U+07FF
UTF-8 overlong (3-byte): U+07FF (?)
UTF-8 overlong (4-byte): U+FFFD
U+0800:
=======
UTF-8 normal (3-byte): U+0800
UTF-8 overlong (4-byte): U+FFFD
Test Results (with the Fix)
U+03FF:
=======
UTF-8 normal (2-byte): U+03FF
UTF-8 overlong (3-byte): U+FFFD
UTF-8 overlong (4-byte): U+FFFD
U+0400:
=======
UTF-8 normal (2-byte): U+0400
UTF-8 overlong (3-byte): U+FFFD (!)
UTF-8 overlong (4-byte): U+FFFD
U+07FF:
=======
UTF-8 normal (2-byte): U+07FF
UTF-8 overlong (3-byte): U+FFFD (!)
UTF-8 overlong (4-byte): U+FFFD
U+0800:
=======
UTF-8 normal (3-byte): U+0800
UTF-8 overlong (4-byte): U+FFFD
(2) By Florian Balmer (florian.balmer) on 2019-12-20 13:38:32 in reply to 1
I'm sorry, copy-pasting from the code editor to the markdown editor went out of sync. The ReInput::mx member should be initialized with a length of 35, not 32.
This doesn't affect the relevant test results, only some of my additional experiments with the re_next_char() function and overlong UTF-8 sequences.