## Description
There seems to be a minor inconsistency in the [`re_next_char()` function][0], which decodes UTF-8 sequences to UTF-32 code points. The function rejects overlong UTF-8 sequences and code points from the UTF-16 surrogate range, returning U+FFFD REPLACEMENT CHARACTER instead. But there's a small loophole: the 3-byte branch only rejects results up to U+03FF, although every code point below U+0800 fits in 2 bytes or fewer, so the 3-byte overlong forms of the 11-bit code points U+0400..U+07FF are accepted.
The test in [src/regexp.c on line 112][1] should be changed from:
```
if( c<=0x3ff || (c>=0xd800 && c<=0xdfff) ) c = 0xfffd;
```
To:
```
if( c<0x800 || (c>=0xd800 && c<=0xdfff) ) c = 0xfffd;
```
An alternative approach may be to accept all UTF-8 overlong forms, but I'm not sure whether this might yield false positive matches with the SQLite `REGEXP` operator and the `grep` command on binary files/blobs.
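To make the proposed check concrete, here is a minimal standalone sketch of a 3-byte decoder using the corrected bound. The helper name `decode_utf8_3byte()` is made up for illustration and is not part of Fossil or SQLite; it only mirrors the 3-byte branch discussed above.

```c
#include <stddef.h>

/* Hypothetical helper (not part of Fossil): decode one 3-byte UTF-8
** sequence starting at z[0], where n is the number of bytes available.
** Returns 0xfffd for malformed input, overlong forms and surrogates. */
static unsigned decode_utf8_3byte(const unsigned char *z, size_t n){
  unsigned c;
  if( n<3 || (z[0]&0xf0)!=0xe0
   || (z[1]&0xc0)!=0x80 || (z[2]&0xc0)!=0x80 ){
    return 0xfffd;
  }
  c = (unsigned)(z[0]&0x0f)<<12
    | (unsigned)(z[1]&0x3f)<<6
    | (unsigned)(z[2]&0x3f);
  /* A 3-byte sequence is the shortest form only for U+0800..U+FFFF,
  ** so everything below 0x800 is overlong. */
  if( c<0x800 || (c>=0xd800 && c<=0xdfff) ) c = 0xfffd;
  return c;
}
```

With this bound, `0xe0 0x90 0x80` (overlong U+0400) and `0xe0 0x9f 0xbf` (overlong U+07FF) both decode to U+FFFD, while `0xe0 0xa0 0x80` still decodes to U+0800.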
## Vaguely Related Issue
See the following [comment in src/sqlite3.c][2]:
```
/* When converting from UTF-16, the maximum growth results from
* translating a 2-byte character to a 4-byte UTF-8 character.
```
2-byte (1-word) UTF-16 characters are in the range U+0000..U+FFFF on the BMP and need at most 3 bytes in UTF-8, so the maximum growth factor is 1.5 (instead of 2). 4-byte (2-word) UTF-16 characters are in the range U+10000..U+10FFFF on the Supplementary Planes and are always represented by 4-byte UTF-8 sequences (growth factor 1).
Maybe a tiny memory-saving optimization for SQLite? ;-)
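The growth-factor argument can be checked by brute force. The following is only a sketch with made-up helper names (`utf8_len()`, `utf16_len()`, `growth_bound_holds()`); it is not SQLite code and says nothing about how SQLite actually sizes its buffers.

```c
/* Bytes needed for the shortest UTF-8 encoding of code point c. */
static int utf8_len(unsigned long c){
  if( c<0x80 ) return 1;
  if( c<0x800 ) return 2;
  if( c<0x10000 ) return 3;
  return 4;
}

/* Bytes needed for the UTF-16 encoding of code point c. */
static int utf16_len(unsigned long c){
  return c<0x10000 ? 2 : 4;
}

/* Return 1 if no code point grows by more than a factor of 1.5
** (that is, 2*utf8 <= 3*utf16) when going from UTF-16 to UTF-8.
** Surrogate values are simply treated as 2-byte units here. */
static int growth_bound_holds(void){
  unsigned long c;
  for(c=0; c<=0x10ffff; c++){
    if( 2*utf8_len(c) > 3*utf16_len(c) ) return 0;
  }
  return 1;
}
```

The worst case is any BMP code point from U+0800 upward: 2 bytes in UTF-16, 3 bytes in UTF-8.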
[0]: https://fossil-scm.org/index.html/artifact/2b7a91970e?ln=95-124
[1]: https://fossil-scm.org/index.html/artifact/2b7a91970e?ln=112
[2]: https://fossil-scm.org/index.html/artifact/907eda6236?ln=30289-30290
## TL;DR
The rest of the post has detailed test cases, code, and results.
## Test Cases
* U+03FF GREEK CAPITAL REVERSED DOTTED LUNATE SIGMA SYMBOL
    * UTF-32: `00000000 00000000 00000011 11111111 = 0x000003ff`
    * UTF-8 2-byte normal: `11001111 10111111 = 0xcf 0xbf`
    * UTF-8 3-byte overlong: `11100000 10001111 10111111 = 0xe0 0x8f 0xbf`
    * UTF-8 4-byte overlong: `11110000 10000000 10001111 10111111 = 0xf0 0x80 0x8f 0xbf`
* U+0400 CYRILLIC CAPITAL LETTER IE WITH GRAVE
    * UTF-32: `00000000 00000000 00000100 00000000 = 0x00000400`
    * UTF-8 2-byte normal: `11010000 10000000 = 0xd0 0x80`
    * UTF-8 3-byte overlong: `11100000 10010000 10000000 = 0xe0 0x90 0x80`
    * UTF-8 4-byte overlong: `11110000 10000000 10010000 10000000 = 0xf0 0x80 0x90 0x80`
* U+07FF NKO TAMAN SIGN
    * UTF-32: `00000000 00000000 00000111 11111111 = 0x000007ff`
    * UTF-8 2-byte normal: `11011111 10111111 = 0xdf 0xbf`
    * UTF-8 3-byte overlong: `11100000 10011111 10111111 = 0xe0 0x9f 0xbf`
    * UTF-8 4-byte overlong: `11110000 10000000 10011111 10111111 = 0xf0 0x80 0x9f 0xbf`
* U+0800 SAMARITAN LETTER ALAF
    * UTF-32: `00000000 00000000 00001000 00000000 = 0x00000800`
    * UTF-8 3-byte normal: `11100000 10100000 10000000 = 0xe0 0xa0 0x80`
    * UTF-8 4-byte overlong: `11110000 10000000 10100000 10000000 = 0xf0 0x80 0xa0 0x80`
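The byte sequences listed above can be reproduced mechanically. Below is a small sketch of an encoder that forces a code point into an n-byte UTF-8 form, whether or not that form is overlong. The function `utf8_forced()` is a hypothetical helper written for this post, and packing the bytes into one integer (first byte most significant) is just a convenience for comparing against the hex values in the list.

```c
#include <stdint.h>

/* Hypothetical helper: encode code point c as an n-byte UTF-8 form
** (n = 2..4), even if that form is overlong. The bytes are returned
** packed into one integer, first byte most significant, for example
** 0xe09080 for the 3-byte form of U+0400. Returns 0 if c does not fit
** into n bytes or n is out of range. */
static uint32_t utf8_forced(uint32_t c, int n){
  static const uint32_t lead[] = {0, 0, 0xc0, 0xe0, 0xf0};
  uint32_t packed = 0;
  int i;
  if( n<2 || n>4 ) return 0;
  /* Payload bits: 11 for 2 bytes, 16 for 3 bytes, 21 for 4 bytes. */
  if( c >= ((uint32_t)1 << (5*n+1)) ) return 0;
  /* Continuation bytes, last one first, 6 payload bits each. */
  for(i=0; i<n-1; i++){
    packed |= (0x80 | (c & 0x3f)) << (8*i);
    c >>= 6;
  }
  /* Lead byte carries the remaining high bits. */
  packed |= (lead[n] | c) << (8*(n-1));
  return packed;
}
```

For example, `utf8_forced(0x400, 3)` yields `0xe09080` and `utf8_forced(0x7ff, 4)` yields `0xf0809fbf`, matching the overlong forms in the list.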
## Test Code
```
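struct ReInput utf8test = {
  /* U+03FF: 2-byte normal, 3-byte overlong, 4-byte overlong */
  "\xcf\xbf"
  "\xe0\x8f\xbf"
  "\xf0\x80\x8f\xbf"
  /* U+0400: 2-byte normal, 3-byte overlong, 4-byte overlong */
  "\xd0\x80"
  "\xe0\x90\x80"
  "\xf0\x80\x90\x80"
  /* U+07FF: 2-byte normal, 3-byte overlong, 4-byte overlong */
  "\xdf\xbf"
  "\xe0\x9f\xbf"
  "\xf0\x80\x9f\xbf"
  /* U+0800: 3-byte normal, 4-byte overlong */
  "\xe0\xa0\x80"
  "\xf0\x80\xa0\x80",
  0, 34  /* start offset, total input length in bytes (3*9 + 7) */
};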
fossil_print("U+03FF:\n=======\n");
fossil_print("UTF-8 normal (2-byte): U+%04X\n",re_next_char(&utf8test));
fossil_print("UTF-8 overlong (3-byte): U+%04X\n",re_next_char(&utf8test));
fossil_print("UTF-8 overlong (4-byte): U+%04X\n\n",re_next_char(&utf8test));
fossil_print("U+0400:\n=======\n");
fossil_print("UTF-8 normal (2-byte): U+%04X\n",re_next_char(&utf8test));
fossil_print("UTF-8 overlong (3-byte): U+%04X\n",re_next_char(&utf8test));
fossil_print("UTF-8 overlong (4-byte): U+%04X\n\n",re_next_char(&utf8test));
fossil_print("U+07FF:\n=======\n");
fossil_print("UTF-8 normal (2-byte): U+%04X\n",re_next_char(&utf8test));
fossil_print("UTF-8 overlong (3-byte): U+%04X\n",re_next_char(&utf8test));
fossil_print("UTF-8 overlong (4-byte): U+%04X\n\n",re_next_char(&utf8test));
fossil_print("U+0800:\n=======\n");
fossil_print("UTF-8 normal (3-byte): U+%04X\n",re_next_char(&utf8test));
fossil_print("UTF-8 overlong (4-byte): U+%04X\n",re_next_char(&utf8test));
```
## Test Results (without the Fix)
```
U+03FF:
=======
UTF-8 normal (2-byte): U+03FF
UTF-8 overlong (3-byte): U+FFFD
UTF-8 overlong (4-byte): U+FFFD
U+0400:
=======
UTF-8 normal (2-byte): U+0400
UTF-8 overlong (3-byte): U+0400 (?)
UTF-8 overlong (4-byte): U+FFFD
U+07FF:
=======
UTF-8 normal (2-byte): U+07FF
UTF-8 overlong (3-byte): U+07FF (?)
UTF-8 overlong (4-byte): U+FFFD
U+0800:
=======
UTF-8 normal (3-byte): U+0800
UTF-8 overlong (4-byte): U+FFFD
```
## Test Results (with the Fix)
```
U+03FF:
=======
UTF-8 normal (2-byte): U+03FF
UTF-8 overlong (3-byte): U+FFFD
UTF-8 overlong (4-byte): U+FFFD
U+0400:
=======
UTF-8 normal (2-byte): U+0400
UTF-8 overlong (3-byte): U+FFFD (!)
UTF-8 overlong (4-byte): U+FFFD
U+07FF:
=======
UTF-8 normal (2-byte): U+07FF
UTF-8 overlong (3-byte): U+FFFD (!)
UTF-8 overlong (4-byte): U+FFFD
U+0800:
=======
UTF-8 normal (3-byte): U+0800
UTF-8 overlong (4-byte): U+FFFD
```