Minor UTF-8 decoding inconsistency

(1) By Florian Balmer (florian.balmer) on 2019-12-19 16:06:34

## Description

There seems to be a minor inconsistency in the [`re_next_char()` function][0], which decodes UTF-8 sequences to UTF-32 code points. The function rejects overlong UTF-8 sequences and code points in the UTF-16 surrogate range, and returns U+FFFD REPLACEMENT CHARACTER instead. But there's a small loophole: the 3-byte overlong UTF-8 forms of the 11-bit code points U+0400..U+07FF are accepted.

The test in [src/regexp.c on line 112][1] should be changed from:

```
if( c<=0x3ff || (c>=0xd800 && c<=0xdfff) ) c = 0xfffd;
```

To:

```
if( c<0x800 || (c>=0xd800 && c<=0xdfff) ) c = 0xfffd;
```

An alternative approach may be to accept all overlong UTF-8 forms, but I'm not sure whether this might yield false-positive matches with the SQLite `REGEXP` operator, and the `grep` command, on binary files/blobs.

## Vaguely Related Issue

See the following [comment in src/sqlite3.c][2]:

```
/* When converting from UTF-16, the maximum growth results from
 * translating a 2-byte character to a 4-byte UTF-8 character.
```

2-byte (1-word) UTF-16 characters are in the range U+0000..U+FFFF in the BMP, and can always be represented by UTF-8 sequences of at most 3 bytes, so the maximum growth factor is 1.5 (instead of 2). 4-byte (2-word) UTF-16 characters are in the range U+10000..U+10FFFF in the Supplementary Planes, and are always represented by 4-byte UTF-8 sequences (growth factor 1).

Maybe a tiny memory-saving optimization for SQLite? ;-)

[0]: https://fossil-scm.org/index.html/artifact/2b7a91970e?ln=95-124
[1]: https://fossil-scm.org/index.html/artifact/2b7a91970e?ln=112
[2]: https://fossil-scm.org/index.html/artifact/907eda6236?ln=30289-30290

## TL;DR

The rest of the post has detailed test cases, code, and results.

## Test Cases

* U+03FF GREEK CAPITAL REVERSED DOTTED LUNATE SIGMA SYMBOL
 * UTF-32: `00000000 00000000 00000011 11111111 = 0x000003ff`
 * UTF-8 2-byte normal: `11001111 10111111 = 0xcf 0xbf`
 * UTF-8 3-byte overlong: `11100000 10001111 10111111 = 0xe0 0x8f 0xbf`
 * UTF-8 4-byte overlong: `11110000 10000000 10001111 10111111 = 0xf0 0x80 0x8f 0xbf`

* U+0400 CYRILLIC CAPITAL LETTER IE WITH GRAVE
 * UTF-32: `00000000 00000000 00000100 00000000 = 0x00000400`
 * UTF-8 2-byte normal: `11010000 10000000 = 0xd0 0x80`
 * UTF-8 3-byte overlong: `11100000 10010000 10000000 = 0xe0 0x90 0x80`
 * UTF-8 4-byte overlong: `11110000 10000000 10010000 10000000 = 0xf0 0x80 0x90 0x80`

* U+07FF NKO TAMAN SIGN
 * UTF-32: `00000000 00000000 00000111 11111111 = 0x000007ff`
 * UTF-8 2-byte normal: `11011111 10111111 = 0xdf 0xbf`
 * UTF-8 3-byte overlong: `11100000 10011111 10111111 = 0xe0 0x9f 0xbf`
 * UTF-8 4-byte overlong: `11110000 10000000 10011111 10111111 = 0xf0 0x80 0x9f 0xbf`

* U+0800 SAMARITAN LETTER ALAF
 * UTF-32: `00000000 00000000 00001000 00000000 = 0x00000800`
 * UTF-8 3-byte normal: `11100000 10100000 10000000 = 0xe0 0xa0 0x80`
 * UTF-8 4-byte overlong: `11110000 10000000 10100000 10000000 = 0xf0 0x80 0xa0 0x80`

## Test Code

```
struct ReInput utf8test = {
  "\xcf\xbf"
  "\xe0\x8f\xbf"
  "\xf0\x80\x8f\xbf"
  "\xd0\x80"
  "\xe0\x90\x80"
  "\xf0\x80\x90\x80"
  "\xdf\xbf"
  "\xe0\x9f\xbf"
  "\xf0\x80\x9f\xbf"
  "\xe0\xa0\x80"
  "\xf0\x80\xa0\x80",
  0, 32
};
fossil_print("U+03FF:\n=======\n");
fossil_print("UTF-8 normal   (2-byte): U+%04X\n",re_next_char(&utf8test));
fossil_print("UTF-8 overlong (3-byte): U+%04X\n",re_next_char(&utf8test));
fossil_print("UTF-8 overlong (4-byte): U+%04X\n\n",re_next_char(&utf8test));
fossil_print("U+0400:\n=======\n");
fossil_print("UTF-8 normal   (2-byte): U+%04X\n",re_next_char(&utf8test));
fossil_print("UTF-8 overlong (3-byte): U+%04X\n",re_next_char(&utf8test));
fossil_print("UTF-8 overlong (4-byte): U+%04X\n\n",re_next_char(&utf8test));
fossil_print("U+07FF:\n=======\n");
fossil_print("UTF-8 normal   (2-byte): U+%04X\n",re_next_char(&utf8test));
fossil_print("UTF-8 overlong (3-byte): U+%04X\n",re_next_char(&utf8test));
fossil_print("UTF-8 overlong (4-byte): U+%04X\n\n",re_next_char(&utf8test));
fossil_print("U+0800:\n=======\n");
fossil_print("UTF-8 normal   (3-byte): U+%04X\n",re_next_char(&utf8test));
fossil_print("UTF-8 overlong (4-byte): U+%04X\n",re_next_char(&utf8test));
```

## Test Results (without the Fix)

```
U+03FF:
=======
UTF-8 normal   (2-byte): U+03FF
UTF-8 overlong (3-byte): U+FFFD
UTF-8 overlong (4-byte): U+FFFD

U+0400:
=======
UTF-8 normal   (2-byte): U+0400
UTF-8 overlong (3-byte): U+0400 (?)
UTF-8 overlong (4-byte): U+FFFD

U+07FF:
=======
UTF-8 normal   (2-byte): U+07FF
UTF-8 overlong (3-byte): U+07FF (?)
UTF-8 overlong (4-byte): U+FFFD

U+0800:
=======
UTF-8 normal   (3-byte): U+0800
UTF-8 overlong (4-byte): U+FFFD
```

## Test Results (with the Fix)

```
U+03FF:
=======
UTF-8 normal   (2-byte): U+03FF
UTF-8 overlong (3-byte): U+FFFD
UTF-8 overlong (4-byte): U+FFFD

U+0400:
=======
UTF-8 normal   (2-byte): U+0400
UTF-8 overlong (3-byte): U+FFFD (!)
UTF-8 overlong (4-byte): U+FFFD

U+07FF:
=======
UTF-8 normal   (2-byte): U+07FF
UTF-8 overlong (3-byte): U+FFFD (!)
UTF-8 overlong (4-byte): U+FFFD

U+0800:
=======
UTF-8 normal   (3-byte): U+0800
UTF-8 overlong (4-byte): U+FFFD
```

(2) By Florian Balmer (florian.balmer) on 2019-12-20 13:38:32 in reply to 1

I'm sorry, copy-pasting from the code editor to the markdown editor went out of sync. The `ReInput::mx` member should be initialized with a length of `35`, not `32`.

This doesn't affect the relevant test results, but it does affect some of my additional experiments with the `re_next_char()` function and overlong UTF-8 sequences.