Minor UTF-8 decoding inconsistency

(1) By Florian Balmer (florian.balmer) on 2019-12-19 16:06:34

Description

There seems to be a minor inconsistency in the re_next_char() function, which decodes UTF-8 sequences to UTF-32 code points. The function rejects overlong UTF-8 sequences and code points from the UTF-16 surrogate range, and returns U+FFFD REPLACEMENT CHARACTER instead. But there's a small loophole: the 3-byte overlong UTF-8 forms of the 11-bit code points U+0400..U+07FF are accepted.

The test in src/regexp.c on line 112 should be changed from:

if( c<=0x3ff || (c>=0xd800 && c<=0xdfff) ) c = 0xfffd;

To:

if( c<0x800 || (c>=0xd800 && c<=0xdfff) ) c = 0xfffd;

An alternative approach may be to accept all overlong UTF-8 forms, but I'm not sure whether this might yield false positive matches with the SQLite REGEXP operator, and the grep command, for binary files/blobs.
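To make the effect of the threshold concrete, here is a minimal standalone sketch. It is not Fossil's re_next_char(); the names decode3, decode3_current and decode3_proposed are hypothetical, and only the 3-byte case is modelled. The final range check mirrors the two variants of the test quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, NOT Fossil's re_next_char(): decodes one 3-byte
** UTF-8 sequence. Results below min_cp, or inside the UTF-16 surrogate
** range, are replaced by U+FFFD. */
unsigned decode3(const unsigned char *z, unsigned min_cp){
  unsigned c;
  assert( z!=NULL );
  if( (z[0]&0xf0)!=0xe0 || (z[1]&0xc0)!=0x80 || (z[2]&0xc0)!=0x80 ){
    return 0xfffd;  /* not shaped like a 3-byte sequence */
  }
  c = ((unsigned)(z[0]&0x0f)<<12) | ((unsigned)(z[1]&0x3f)<<6)
      | (unsigned)(z[2]&0x3f);
  if( c<min_cp || (c>=0xd800 && c<=0xdfff) ) c = 0xfffd;
  return c;
}

/* Same bound as the current test: c<=0x3ff, i.e. c<0x400. */
unsigned decode3_current(const unsigned char *z){
  return decode3(z, 0x400);
}

/* Same bound as the proposed test: c<0x800. */
unsigned decode3_proposed(const unsigned char *z){
  return decode3(z, 0x800);
}
```

With the overlong form 0xe0 0x90 0x80, decode3_current() yields 0x0400 while decode3_proposed() yields 0xfffd. The shortest 3-byte form 0xe0 0xa0 0x80 (U+0800) is accepted by both.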

Vaguely Related Issue

See the following comment in src/sqlite3.c:

/* When converting from UTF-16, the maximum growth results from
 * translating a 2-byte character to a 4-byte UTF-8 character.

2-byte (1-word) UTF-16 characters are in the range U+0000..U+FFFF on the BMP, and can always be represented by UTF-8 sequences of at most 3 bytes, so the maximum growth factor is 1.5 (instead of 2). 4-byte (2-word) UTF-16 characters are in the range U+10000..U+10FFFF on the Supplementary Planes, and are always represented by 4-byte UTF-8 sequences (growth factor 1).

Maybe a tiny memory-saving optimization for SQLite? ;-)
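The growth factor argued above can be checked by brute force. The following is only a sketch with hypothetical helper names (utf8_len, utf16_len, max_utf16_to_utf8_growth); it is not taken from SQLite and says nothing about how SQLite sizes its buffers.

```c
#include <assert.h>

/* Hypothetical helper: bytes needed to encode cp in UTF-8. */
unsigned utf8_len(unsigned cp){
  assert( cp<=0x10ffff );
  if( cp<0x80 ) return 1;
  if( cp<0x800 ) return 2;
  if( cp<0x10000 ) return 3;
  return 4;
}

/* Hypothetical helper: bytes needed to encode cp in UTF-16. */
unsigned utf16_len(unsigned cp){
  assert( cp<=0x10ffff );
  return cp<0x10000 ? 2 : 4;
}

/* Largest ratio of UTF-8 size to UTF-16 size over all code points,
** skipping the surrogate range, which encodes no characters. */
double max_utf16_to_utf8_growth(void){
  double mx = 0.0;
  unsigned cp;
  for(cp=0; cp<=0x10ffff; cp++){
    double r;
    if( cp>=0xd800 && cp<=0xdfff ) continue;
    r = (double)utf8_len(cp) / (double)utf16_len(cp);
    if( r>mx ) mx = r;
  }
  return mx;
}
```

The maximum is reached for U+0800..U+FFFF, where 2 bytes of UTF-16 become 3 bytes of UTF-8.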

TL;DR

The rest of the post has detailed test cases, code, and results.

Test Cases

U+03FF GREEK CAPITAL REVERSED DOTTED LUNATE SIGMA SYMBOL
- UTF-32: 00000000 00000000 00000011 11111111 = 0x000003ff
- UTF-8 2-byte normal: 11001111 10111111 = 0xcf 0xbf
- UTF-8 3-byte overlong: 11100000 10001111 10111111 = 0xe0 0x8f 0xbf
- UTF-8 4-byte overlong: 11110000 10000000 10001111 10111111 = 0xf0 0x80 0x8f 0xbf
U+0400 CYRILLIC CAPITAL LETTER IE WITH GRAVE
- UTF-32: 00000000 00000000 00000100 00000000 = 0x00000400
- UTF-8 2-byte normal: 11010000 10000000 = 0xd0 0x80
- UTF-8 3-byte overlong: 11100000 10010000 10000000 = 0xe0 0x90 0x80
- UTF-8 4-byte overlong: 11110000 10000000 10010000 10000000 = 0xf0 0x80 0x90 0x80
U+07FF NKO TAMAN SIGN
- UTF-32: 00000000 00000000 00000111 11111111 = 0x000007ff
- UTF-8 2-byte normal: 11011111 10111111 = 0xdf 0xbf
- UTF-8 3-byte overlong: 11100000 10011111 10111111 = 0xe0 0x9f 0xbf
- UTF-8 4-byte overlong: 11110000 10000000 10011111 10111111 = 0xf0 0x80 0x9f 0xbf
U+0800 SAMARITAN LETTER ALAF
- UTF-32: 00000000 00000000 00001000 00000000 = 0x00000800
- UTF-8 3-byte normal: 11100000 10100000 10000000 = 0xe0 0xa0 0x80
- UTF-8 4-byte overlong: 11110000 10000000 10100000 10000000 = 0xf0 0x80 0xa0 0x80
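
The byte patterns listed above can be reproduced with a small encoder that deliberately does not insist on the shortest form. This is a sketch only; utf8_encode_n is a hypothetical name, and the function is meant for generating test data, not for producing valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: writes cp as an n-byte UTF-8 bit pattern
** (n = 2, 3 or 4) into out[0..n-1]. It does NOT check that n is the
** shortest possible length, so it can emit overlong sequences. */
void utf8_encode_n(unsigned cp, int n, unsigned char *out){
  int i;
  assert( out!=NULL && n>=2 && n<=4 );
  /* Payload capacity is 11, 16 and 21 bits for n = 2, 3 and 4. */
  assert( cp < (1u<<(5*n+1)) );
  /* Continuation bytes, filled from the end: 10xxxxxx. */
  for(i=n-1; i>0; i--){
    out[i] = (unsigned char)(0x80 | (cp&0x3f));
    cp >>= 6;
  }
  /* Lead byte: n one-bits, a zero bit, then the remaining payload. */
  out[0] = (unsigned char)(((0xffu<<(8-n)) & 0xff) | cp);
}
```

For example, utf8_encode_n(0x400, 3, buf) fills buf with 0xe0 0x90 0x80, the 3-byte overlong form of U+0400 from the list above.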

Test Code

struct ReInput utf8test = {
  "\xcf\xbf"
  "\xe0\x8f\xbf"
  "\xf0\x80\x8f\xbf"
  "\xd0\x80"
  "\xe0\x90\x80"
  "\xf0\x80\x90\x80"
  "\xdf\xbf"
  "\xe0\x9f\xbf"
  "\xf0\x80\x9f\xbf"
  "\xe0\xa0\x80"
  "\xf0\x80\xa0\x80",
  0, 32
};
fossil_print("U+03FF:\n=======\n");
fossil_print("UTF-8 normal   (2-byte): U+%04X\n",re_next_char(&utf8test));
fossil_print("UTF-8 overlong (3-byte): U+%04X\n",re_next_char(&utf8test));
fossil_print("UTF-8 overlong (4-byte): U+%04X\n\n",re_next_char(&utf8test));
fossil_print("U+0400:\n=======\n");
fossil_print("UTF-8 normal   (2-byte): U+%04X\n",re_next_char(&utf8test));
fossil_print("UTF-8 overlong (3-byte): U+%04X\n",re_next_char(&utf8test));
fossil_print("UTF-8 overlong (4-byte): U+%04X\n\n",re_next_char(&utf8test));
fossil_print("U+07FF:\n=======\n");
fossil_print("UTF-8 normal   (2-byte): U+%04X\n",re_next_char(&utf8test));
fossil_print("UTF-8 overlong (3-byte): U+%04X\n",re_next_char(&utf8test));
fossil_print("UTF-8 overlong (4-byte): U+%04X\n\n",re_next_char(&utf8test));
fossil_print("U+0800:\n=======\n");
fossil_print("UTF-8 normal   (3-byte): U+%04X\n",re_next_char(&utf8test));
fossil_print("UTF-8 overlong (4-byte): U+%04X\n",re_next_char(&utf8test));

Test Results (without the Fix)

U+03FF:
=======
UTF-8 normal   (2-byte): U+03FF
UTF-8 overlong (3-byte): U+FFFD
UTF-8 overlong (4-byte): U+FFFD

U+0400:
=======
UTF-8 normal   (2-byte): U+0400
UTF-8 overlong (3-byte): U+0400 (?)
UTF-8 overlong (4-byte): U+FFFD

U+07FF:
=======
UTF-8 normal   (2-byte): U+07FF
UTF-8 overlong (3-byte): U+07FF (?)
UTF-8 overlong (4-byte): U+FFFD

U+0800:
=======
UTF-8 normal   (3-byte): U+0800
UTF-8 overlong (4-byte): U+FFFD

Test Results (with the Fix)

U+03FF:
=======
UTF-8 normal   (2-byte): U+03FF
UTF-8 overlong (3-byte): U+FFFD
UTF-8 overlong (4-byte): U+FFFD

U+0400:
=======
UTF-8 normal   (2-byte): U+0400
UTF-8 overlong (3-byte): U+FFFD (!)
UTF-8 overlong (4-byte): U+FFFD

U+07FF:
=======
UTF-8 normal   (2-byte): U+07FF
UTF-8 overlong (3-byte): U+FFFD (!)
UTF-8 overlong (4-byte): U+FFFD

U+0800:
=======
UTF-8 normal   (3-byte): U+0800
UTF-8 overlong (4-byte): U+FFFD

(2) By Florian Balmer (florian.balmer) on 2019-12-20 13:38:32 in reply to 1

I'm sorry, copy-pasting from the code editor to the markdown editor went out of sync. The ReInput::mx member should be initialized with a length of 35, not 32.

This doesn't affect the relevant test results, only some of my additional experiments with the re_next_char() function and overlong UTF-8 sequences.