Fossil Forum

On the command line, if a multibyte word wraps in a comment, fossil breaks up the word as a single byte and destroys it.

(1) By anonymous on 2018-10-08 16:01:36 [link]

status and timeline command

On the command line, if a multibyte word wraps in a comment, fossil breaks up the word as a single byte and destroys it.

This occurs because the word-wrapping code does not check the UTF-8 high-order bit pattern.

(2) By anonymous on 2018-10-09 05:48:41 in reply to 1 [link]

[https://fossil-scm.org/fossil/artifact?txt=1&ln=301,302&name=4d18073b1a8e98bd]

src/comformat.c

Insert additional code to check for UTF-8 characters between lines 301 and 302.

(3) By Stephan Beal (stephan) on 2018-10-09 11:37:23 in reply to 1 [link]

For anyone interested in implementing a fix for this, here's a snippet/block (extracted from one of my trees) which uses the first byte of a UTF-8 char to determine the number of bytes in that char (assuming well-formed input), providing for a relatively fast UTF-8 strlen op:

(note that `x` is the current byte (unsigned char) in the string, `end` is the one-past-the-end byte, and `rc` is the running character count)

/* Derived from:
   http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html
*/
for( ; x < end; ++x, ++rc ){
    if(*x>127U){
        switch(0xF0 & *x) {
          case 0xF0: /* length 4 */
              x += 3;
              break;
          case 0xE0: /* length 3 */
              x+= 2;
              break;
          default: /* length 2 */
              x += 1;
              break;
        }
    }
}
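For experimenting with the loop above outside of its original tree, it can be wrapped in a small function. This is only a sketch: the name `utf8_strlen` and its signature are made up for illustration, and the clamp against `end` is an addition so that truncated input cannot move the cursor past the buffer. Like the snippet, it otherwise assumes well-formed input.

```c
#include <stddef.h>

/* Count the UTF-8 characters (code points) in the n bytes starting at s.
** Hypothetical helper: same lead-byte switch as the snippet above, plus
** a clamp so a truncated final sequence cannot overrun the buffer. */
size_t utf8_strlen(const char *s, size_t n){
  const unsigned char *x = (const unsigned char *)s;
  const unsigned char *end = x + n;
  size_t rc = 0;                    /* running character count */
  while( x < end ){
    size_t len = 1;                 /* plain ASCII byte */
    if( *x > 127U ){
      switch( 0xF0 & *x ){
        case 0xF0: len = 4; break;  /* length 4 */
        case 0xE0: len = 3; break;  /* length 3 */
        default:   len = 2; break;  /* length 2 */
      }
    }
    if( len > (size_t)(end - x) ) len = (size_t)(end - x);
    x += len;
    ++rc;
  }
  return rc;
}
```

Called as `utf8_strlen(z, strlen(z))`, it returns the number of code points rather than the number of bytes.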

(4) By Florian Balmer (florian.balmer) on 2018-10-09 12:14:42 in reply to 1 [link]

I'd like to add that a "character sequence" in UTF-8 can be quite different from a "user-perceived character", as the latter may consist of multiple combining (diacritical) "characters" to form a "grapheme".

On Windows, there's the built-in [CharNextW()](https://docs.microsoft.com/en-us/windows/desktop/api/winuser/nf-winuser-charnextw) function to find the next user-perceived character in a UTF-16 string, but unfortunately this function does not handle code points beyond the BMP. Getting this right for the full Unicode code space on all platforms may require quite some work, or resorting to a (big) library like [ICU](https://www-01.ibm.com/software/globalization/icu/).

Yet, your suggested solutions are nice, clean and practical, in any case better than the current situation, and maybe sufficient for the capabilities of most consoles. But Fossil output can also be redirected to text files for other uses, for example.

(5) By Stephan Beal (stephan) on 2018-10-09 12:21:15 in reply to 4 [link]

Fossil internally converts (if needed) all checkin comments and such to UTF-8, so Windows code pages and UTF-16/32/whatever are not a concern at that level.

(Unless i'm sorely mistaken, but i don't believe i am.)

(6) By Florian Balmer (florian.balmer) on 2018-10-09 12:33:47 in reply to 5 [link]

I agree, but to get proper word-wrapping for free, I would pay the cost of a temporary conversion UTF-8 → UTF-16 → UTF-8.

On Windows, Fossil converts console output fed to `WriteConsoleW()` to UTF-16, anyway, so maybe the sequence could even be abbreviated.

(7) By anonymous on 2018-10-10 01:24:49 in reply to 1 [link]

No conversion to UTF-16 is necessary;
see https://en.wikipedia.org/wiki/UTF-8#Description

(1) Shift each byte right by 6 bits and check whether the result is 0b11 or 0b10:
 lead byte of a multi-byte UTF-8 sequence
    byte >> 6 == 0b11
 continuation (trail) byte
    byte >> 6 == 0b10
(2) Bytes matching neither pattern are ASCII.
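As a rough C sketch of the bit test described above (the function names are hypothetical and not taken from Fossil): a proposed break offset is moved backwards while it points at a continuation byte, so a line break never lands inside a multi-byte sequence.

```c
#include <stddef.h>

/* Top two bits 11: lead byte of a multi-byte UTF-8 sequence. */
int is_utf8_lead(unsigned char c){ return (c >> 6) == 0x3; }

/* Top two bits 10: continuation (trail) byte. */
int is_utf8_trail(unsigned char c){ return (c >> 6) == 0x2; }

/* Given a proposed break offset i into the n bytes at z, return the
** nearest offset <= i that does not point at a continuation byte. */
size_t utf8_break_before(const char *z, size_t n, size_t i){
  while( i > 0 && i < n && is_utf8_trail((unsigned char)z[i]) ) i--;
  return i;
}
```

This only protects UTF-8 sequences; it knows nothing about combining characters or display width.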

(8) By Warren Young (wyoung) on 2018-10-10 01:47:59 in reply to 7 [link]

Unicode is [much more complicated][1] than that. 

That answer is written for Perl, a language that has pretty much unparalleled Unicode support, whereas Fossil is written in C, a pre-Unicode language to which the various standards committees have added a fairly weak set of add-on facilities to support Unicode. So, the situation's even worse in C.

For one thing, your plan ignores [combining characters][2].


[1]: https://stackoverflow.com/a/6163129/142454
[2]: https://en.wikipedia.org/wiki/Combining_character

(9) By Florian Balmer (florian.balmer) on 2018-10-10 06:58:30 in reply to 7 [link]

Consider the French word "développeur", represented in the [https://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms | Unicode Normalization Form NFD].

Checking for UTF-8 character boundaries yields:

<blockquote>
<verbatim>
d | e | ́ | v | e | l | o | p | p | e | u | r
</verbatim>
</blockquote>

Checking for "user-perceived" character boundaries yields:

<blockquote>
<verbatim>
d | é | v | e | l | o | p | p | e | u | r
</verbatim>
</blockquote>

Avoiding breaks inside (multi-byte) UTF-8 characters is not enough, as from the perspective of a user, the following result may still be wrong:

<blockquote>
<verbatim>
............. de
́veloppeur ....
</verbatim>
</blockquote>
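To make the difference concrete, here is a minimal sketch of a break adjustment that also keeps a base character together with the combining marks that follow it. It is an illustration only: the function names are hypothetical, and it recognizes just the Combining Diacritical Marks block U+0300..U+036F (UTF-8 bytes `CC 80`..`CD AF`), which is far from the full set of combining characters.

```c
#include <stddef.h>

/* True if z[i], z[i+1] encode a code point in U+0300..U+036F
** (UTF-8: CC 80..CC BF and CD 80..CD AF). Other combining
** characters are deliberately ignored in this sketch. */
static int is_combining_at(const unsigned char *z, size_t n, size_t i){
  if( i + 1 >= n ) return 0;
  if( z[i] == 0xCC ) return z[i+1] >= 0x80 && z[i+1] <= 0xBF;
  if( z[i] == 0xCD ) return z[i+1] >= 0x80 && z[i+1] <= 0xAF;
  return 0;
}

/* Move break offset i backwards so that the break neither splits a
** UTF-8 sequence nor separates a base character from its marks. */
size_t break_before_cluster(const char *zText, size_t n, size_t i){
  const unsigned char *z = (const unsigned char *)zText;
  /* Leave any UTF-8 sequence the offset may be inside of. */
  while( i > 0 && i < n && (z[i] & 0xC0) == 0x80 ) i--;
  /* Step back over combining marks to the start of the base character. */
  while( i > 0 && is_combining_at(z, n, i) ){
    i--;                                          /* last byte of previous char */
    while( i > 0 && (z[i] & 0xC0) == 0x80 ) i--;  /* back to its lead byte */
  }
  return i;
}
```

For the NFD string above, a break proposed between "e" and the combining acute accent is moved in front of the "e".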

The mentioned Windows API function <code>CharNextW()</code> can detect user-perceived character boundaries for UTF-16 strings (limited to BMP code points). The cost of a temporary conversion from UTF-8 to UTF-16 and back is probably swamped in the heavy-lifting work required by this function.

Also note that any UTF-8-based equivalent to <code>CharNextW()</code> is likely to require internal conversion of multi-byte UTF-8 code unit sequences to UTF-32 code points, to be able to perform lookups in Unicode character classification tables.
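The conversion of one UTF-8 sequence to a UTF-32 code point is short enough to sketch here. The function name and the choice to return U+FFFD for malformed input are assumptions made for this example; the sketch does not reject overlong forms or surrogates.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from the n bytes at z into *pCp and return the
** number of bytes consumed. Truncated or malformed input yields U+FFFD
** and consumes a single byte (0 bytes if n is 0). */
size_t utf8_decode(const unsigned char *z, size_t n, uint32_t *pCp){
  size_t len, i;
  uint32_t cp;
  if( n == 0 ){ *pCp = 0xFFFD; return 0; }
  if( z[0] < 0x80 ){ *pCp = z[0]; return 1; }
  else if( (z[0] & 0xE0) == 0xC0 ){ len = 2; cp = z[0] & 0x1F; }
  else if( (z[0] & 0xF0) == 0xE0 ){ len = 3; cp = z[0] & 0x0F; }
  else if( (z[0] & 0xF8) == 0xF0 ){ len = 4; cp = z[0] & 0x07; }
  else{ *pCp = 0xFFFD; return 1; }       /* stray trail byte or F8..FF */
  if( len > n ){ *pCp = 0xFFFD; return 1; }
  for( i = 1; i < len; i++ ){
    if( (z[i] & 0xC0) != 0x80 ){ *pCp = 0xFFFD; return 1; }
    cp = (cp << 6) | (uint32_t)(z[i] & 0x3F);  /* append 6 payload bits */
  }
  *pCp = cp;
  return len;
}
```

The resulting code point is what a classification table lookup would take as its key.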

I agree, this is probably not the most relevant thing for a command-line tool like Fossil, but I couldn't hold back my nitpicky comment that "handling Unicode" requires more than "handling UTF-8".

(10) By Florian Balmer (florian.balmer) on 2018-10-10 09:51:36 in reply to 9 [link]

It's by no means about picking on someone, but I just find the topic extremely exciting.

I have another example, why conversion between UTF-8, UTF-16 and UTF-32 is not "evil" or "slow":

The glyph lookup tables used by modern font technologies work with 32-bit code points (UTF-32). So anything that the text rendering engine is about to display needs to be converted to UTF-32.

This is true for UTF-16 text fed to `WriteConsoleW()` on Windows, and also true for strings fed to console output routines on UTF-8-oriented operating systems.

So this is a very common operation, continuously happening on Windows, Linux, and Mac.

Sure, displaying text is slow, but that is because the text rendering engine has to perform much more complex operations, such as reordering characters from logical to visual order, contextual shaping depending on the surrounding characters, combining characters with diacritics, and stacking or combining multiple characters into clusters. And that is before any graphics-related processing such as anti-aliasing. In this context, the cost of converting between the various UTFs is negligible.

(11) By Florian Balmer (florian.balmer) on 2018-10-17 14:25:43 in reply to 1

## Patch: Modify the comment formatter to avoid output of incomplete UTF-8 sequences, and to avoid line breaks inside UTF-8 sequences

Putting proper handling of combining characters aside, I have implemented the [suggestion by stephan](https://fossil-scm.org/forum/forumpost/56d88d9d8e), to avoid splitting of UTF-8 sequences.

<https://www.fossil-scm.org/index.html/timeline?&r=comment-formatter-utf8>

Tests with comments containing 2-, 3- and 4-byte UTF-8 sequences (run on Windows 10, output of patched version in the left column, unpatched to the right):

    > fossil init sample.fossil
    > fossil open sample.fossil
    > fossil ci --allow-empty -m "{01234567}Ä{01234567}☃{0123456}👻"
    > fossil co prev

    > fossil time -n 1 -W 21 --comfmtflags 0

    === 2018-10-17 ===                       === 2018-10-17 ===
    09:38:29 [2447c13a09]                    09:38:29 [2447c13a09]
              {01234567}Ä                              {01234567}Ã
             {01234567}☃{                            „{01234567}â
             0123456}👻 (u                            ˜ƒ{0123456}ð
             ser: Florian                             Ÿ‘» (user: F
              tags: trunk                             lorian tags:
             )                                         trunk)
    --- entry limit (1) reached ---          --- entry limit (1) reached ---

    > fossil time -n 1 -W 21 --comfmtflags 1

    === 2018-10-17 ===                       === 2018-10-17 ===
    09:38:29 [2447c13a09]                    09:38:29 [2447c13a09]
             {01234567}Ä{                             {01234567}Ä
             01234567}☃{0                            {01234567}â˜
             123456}👻                               ƒ{0123456}ðŸ
             (user:                                   ‘» (user:
             Florian                                  Florian
             tags: trunk)                             tags: trunk)
    --- entry limit (1) reached ---          --- entry limit (1) reached ---

I believe it's not safe to assume well-formed UTF-8 in Fossil. Check-in comments entered via a "forgiving" text editor (or even a hex editor), or sneaked-in through forged manifests, may contain invalid UTF-8 sequences. And it seems that [Fossil itself allows certain invalid sequences, such as the "overlong representations of the NUL character"](https://www.fossil-scm.org/index.html/artifact?&name=2f276c62fe&ln=168-181):

    c0 80
    e0 80 80
    f0 80 80 80

My approach was to treat well-formed, incomplete, and ill-formed sequences all the same, to make sure lead bytes are output with the appropriate number of trail bytes. So there's some extra effort to check if the expected trail bytes are really there, instead of just assuming their presence and number as deduced from the lead byte.
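The idea can be sketched as follows. This is not the code from the branch, only an illustration with a made-up function name: the lead byte gives the expected number of trail bytes, but the cursor only advances over trail bytes that are really present.

```c
#include <stddef.h>

/* Return how many of the n bytes at z belong together: the first byte
** plus however many of its expected trail bytes actually follow.
** Well-formed, incomplete, and ill-formed sequences are all treated
** the same way; no range checks are made on the trail bytes. */
size_t utf8_seq_len(const unsigned char *z, size_t n){
  size_t want, got;
  if( n == 0 ) return 0;
  if( (z[0] & 0xE0) == 0xC0 ) want = 1;       /* 2-byte lead */
  else if( (z[0] & 0xF0) == 0xE0 ) want = 2;  /* 3-byte lead */
  else if( (z[0] & 0xF8) == 0xF0 ) want = 3;  /* 4-byte lead */
  else return 1;             /* ASCII, stray trail byte, or F8..FF */
  for( got = 0; got < want && 1 + got < n; got++ ){
    if( (z[1 + got] & 0xC0) != 0x80 ) break;  /* expected trail byte missing */
  }
  return 1 + got;
}
```

A formatter built on this would only break lines or flush output at offsets reached by repeatedly adding the returned length.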

***

## Incomplete Sequences

* Otherwise well-formed sequences with missing trail bytes

## Ill-Formed Sequences

* Invalid lead bytes
 * Lead bytes `C0-C1` followed by any trail byte → overlong representation
 * Lead bytes `F5-F7` → 4-byte sequence with code point beyond the Unicode limit of U+10FFFF
 * Lead bytes `F8-FF` → theoretically starting 5- to 8-byte sequences (undefined)

* Disallowed ranges for the first trail byte following certain lead bytes
 * Lead byte `E0` followed by first trail byte `80-9F` → overlong representation
 * Lead byte `ED` followed by first trail byte `A0-BF` → code point from the UTF-16 surrogates range
 * Lead byte `F0` followed by first trail byte `80-8F` → overlong representation
 * Lead byte `F4` followed by first trail byte `90-BF` → code point beyond the Unicode limit of U+10FFFF
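The categories listed above can be expressed as a small classifier. This is a sketch with hypothetical names, not part of the patch (which deliberately treats all three categories alike). A stray trail byte in lead position is counted as ill-formed here, which is an assumption beyond the list.

```c
#include <stddef.h>

enum Utf8Class { UTF8_WELL_FORMED, UTF8_INCOMPLETE, UTF8_ILL_FORMED };

/* Classify the sequence that starts at z[0], with n bytes available. */
enum Utf8Class utf8_classify(const unsigned char *z, size_t n){
  unsigned char c, lo = 0x80, hi = 0xBF;  /* allowed range of next trail byte */
  size_t want, i;
  if( n == 0 ) return UTF8_INCOMPLETE;
  c = z[0];
  if( c < 0x80 ) return UTF8_WELL_FORMED;  /* ASCII */
  if( c < 0xC2 ) return UTF8_ILL_FORMED;   /* stray trail byte, or C0-C1 */
  if( c > 0xF4 ) return UTF8_ILL_FORMED;   /* F5-F7 and F8-FF */
  want = c < 0xE0 ? 1 : (c < 0xF0 ? 2 : 3);
  /* Restricted ranges for the first trail byte. */
  if( c == 0xE0 ) lo = 0xA0;               /* 80-9F would be overlong */
  else if( c == 0xED ) hi = 0x9F;          /* A0-BF would be surrogates */
  else if( c == 0xF0 ) lo = 0x90;          /* 80-8F would be overlong */
  else if( c == 0xF4 ) hi = 0x8F;          /* 90-BF exceeds U+10FFFF */
  for( i = 1; i <= want; i++ ){
    if( i >= n ) return UTF8_INCOMPLETE;   /* input ends too early */
    if( z[i] < lo || z[i] > hi ){
      /* A trail byte in a disallowed range is ill-formed; any other
      ** byte means the expected trail byte is simply missing. */
      return (z[i] & 0xC0) == 0x80 ? UTF8_ILL_FORMED : UTF8_INCOMPLETE;
    }
    lo = 0x80; hi = 0xBF;                  /* later trail bytes: full range */
  }
  return UTF8_WELL_FORMED;
}
```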

***

## Notes on String Width Measurement

Besides handling of combining characters, proper column-oriented word wrapping (for terminals with fixed-width fonts or text files) also requires granting two "display cells" to East Asian FullWidth characters. (The "Emoji" are a derivative of the FullWidth characters.)

There's a reference implementation of a `wcwidth()` function to measure "display cells" of a string, adding `0` for combining characters, `2` for FullWidth characters, and `1` for "normal-width" characters.

<https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c>

And here's an updated version with more recent Unicode data:

<https://git.tartarus.org/?p=simon/putty.git;a=commitdiff;h=4241734dde44caea5a9c73a79db9c6d4cae50861>

But this is probably beyond what even some text editors are able to handle.
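To give an idea of the 0/1/2 scheme, here is a toy width function. It is not the reference implementation linked above: the range table is a heavily abbreviated assumption (one combining block, a few East Asian ranges, and one emoji range picked for this example), the function names are hypothetical, and the decoder assumes well-formed UTF-8.

```c
#include <stddef.h>
#include <stdint.h>

/* Display cells for one code point, from a tiny illustrative table. */
static int cell_width(uint32_t cp){
  if( cp >= 0x0300 && cp <= 0x036F ) return 0;   /* combining diacritics */
  if( (cp >= 0x1100 && cp <= 0x115F)             /* Hangul Jamo */
   || (cp >= 0x2E80 && cp <= 0xA4CF && cp != 0x303F)  /* CJK ... Yi */
   || (cp >= 0xAC00 && cp <= 0xD7A3)             /* Hangul syllables */
   || (cp >= 0xFF00 && cp <= 0xFF60)             /* fullwidth forms */
   || (cp >= 0x1F300 && cp <= 0x1F64F)           /* some emoji (assumption) */
  ){
    return 2;
  }
  return 1;                                      /* everything else */
}

/* Decode one code point; assumes well-formed UTF-8. */
static size_t decode(const unsigned char *z, size_t n, uint32_t *pCp){
  size_t len = 1, i;
  uint32_t cp = z[0];
  if( (z[0] & 0xE0) == 0xC0 ){ len = 2; cp = z[0] & 0x1F; }
  else if( (z[0] & 0xF0) == 0xE0 ){ len = 3; cp = z[0] & 0x0F; }
  else if( (z[0] & 0xF8) == 0xF0 ){ len = 4; cp = z[0] & 0x07; }
  if( len > n ) len = n;                         /* never read past the end */
  for( i = 1; i < len; i++ ) cp = (cp << 6) | (uint32_t)(z[i] & 0x3F);
  *pCp = cp;
  return len;
}

/* Sum of display cells for the n bytes of UTF-8 text at zText. */
size_t display_width(const char *zText, size_t n){
  const unsigned char *z = (const unsigned char *)zText;
  size_t i = 0, w = 0;
  while( i < n ){
    uint32_t cp;
    i += decode(z + i, n - i, &cp);
    w += (size_t)cell_width(cp);
  }
  return w;
}
```

A column-oriented word wrapper would compare this width, rather than the byte or code point count, against the line limit.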

(13) By Florian Balmer (florian.balmer) on 2018-10-17 19:15:04 in reply to 11 [link]

I'm happy to report that the patch was also successfully tested on Ubuntu 18.10!

Also note that the Fossil executable for Windows downloaded from fossil-scm.org seems to be compiled with the `MBCS_COMMAND_LINE` option, so the tests from the previous post to create check-in comments with non-ASCII characters from the command line may only work with characters from the current console code page. In this case, the web ui can be used to edit the comments and add non-ASCII characters for testing.

(To the moderators: sorry for the anonymous double-post, please reject/ignore/delete.)

(14) By Florian Balmer (florian.balmer) on 2018-11-16 11:46:19 in reply to 11 [link]

I've done some more work on the comment printing functions to support UTF-8.

<https://www.fossil-scm.org/index.html/timeline?&r=comment-formatter-utf8>

Along the way, I've also added output buffering to the (non-legacy) comment printing function. On Windows, Fossil does not rely on output buffering provided by the C runtime library, but calls `WriteConsoleW()` directly (for better Unicode support), yet sometimes for very small chunks (i.e. single characters). The output buffering results in a user-perceptible speed improvement on Windows, and (surprisingly) also in a measurable speed improvement on Linux, possibly due to other overhead of (frequent) calls to `fossil_print()`.
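The general shape of such buffering can be sketched as below. This is not the code from the branch: `OutBuf`, its field names, and the buffer size are invented for illustration, and `fwrite()` stands in for whatever output routine is actually used.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical output buffer: collects small chunks and hands them to
** the output routine in larger pieces. */
typedef struct OutBuf {
  char a[400];      /* pending bytes */
  size_t n;         /* number of bytes used in a[] */
  FILE *out;        /* destination stream */
  unsigned nFlush;  /* number of flushes so far (for demonstration) */
} OutBuf;

/* Write all pending bytes with a single call and reset the buffer. */
void outbuf_flush(OutBuf *p){
  if( p->n > 0 ){
    fwrite(p->a, 1, p->n, p->out);
    p->n = 0;
    p->nFlush++;
  }
}

/* Append n bytes from z, flushing whenever the buffer becomes full. */
void outbuf_append(OutBuf *p, const char *z, size_t n){
  while( n > 0 ){
    size_t room = sizeof(p->a) - p->n;
    size_t k = n < room ? n : room;
    memcpy(p->a + p->n, z, k);
    p->n += k;
    z += k;
    n -= k;
    if( p->n == sizeof(p->a) ) outbuf_flush(p);
  }
}
```

Note that this sketch flushes at arbitrary byte offsets. For console output, a real version would have to flush only at UTF-8 sequence boundaries, which is the point of the patch discussed in this thread.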

The latest word breaking enhancements ensure identical layout of UTF-8 and ASCII text. (Results are analogous for 3- and 4-byte UTF-8 sequences, so they are omitted here.)

Tests:

    ./fossil test-comment-format --decode --indent 10 --width 20 "" "aaaaaaaaa \n aaaaaaaaa aaaaaaaaaaaa aaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaa"
    ./fossil test-comment-format --decode --indent 10 --width 20 "" "äääääääää \n äääääääää ääääääääääää ääääääääääää ääääääääääääääääääääääää ääääääääääää"
    ./fossil test-comment-format --decode --indent 10 --width 20 "" "aaaaaaaaa \n aaaaaaaaa aaaaaaaaaaaa aaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaa" --wordbreak
    ./fossil test-comment-format --decode --indent 10 --width 20 "" "äääääääää \n äääääääää ääääääääääää ääääääääääää ääääääääääääääääääääääää ääääääääääää" --wordbreak
    ./fossil test-comment-format --decode --indent 10 --width 20 "" "aaaaaaaaa \n aaaaaaaaa aaaaaaaaaaaa aaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaa" --origbreak
    ./fossil test-comment-format --decode --indent 10 --width 20 "" "äääääääää \n äääääääää ääääääääääää ääääääääääää ääääääääääääääääääääääää ääääääääääää" --origbreak
    ./fossil test-comment-format --decode --indent 10 --width 20 "" "aaaaaaaaa \n aaaaaaaaa aaaaaaaaaaaa aaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaa" --wordbreak --origbreak
    ./fossil test-comment-format --decode --indent 10 --width 20 "" "äääääääää \n äääääääää ääääääääääää ääääääääääää ääääääääääääääääääääääää ääääääääääää" --wordbreak --origbreak

The left column is the output of Fossil [&lbrack;29d3a2ed&rbrack;](https://www.fossil-scm.org/index.html/info/29d3a2ed) without the word breaking fixes, and the right column is the output of Fossil [&lbrack;c9ec3d18&rbrack;](https://www.fossil-scm.org/index.html/info/c9ec3d18) with the word breaking fixes applied. The layout for the "a" and the following "ä" variant should look identical, in the same column.

    Fossil [29d3a2ed]                         Fossil [c9ec3d18]

    aaaaaaaaa                                 aaaaaaaaa

               aaaaaaaaa                                 aaaaaaaaa
               aaaaaaaaa                                 aaaaaaaaa
              aaa aaaaaa                                aaa aaaaaa
              aaaaaa aaa                                aaaaaa aaa
              aaaaaaaaaa                                aaaaaaaaaa
              aaaaaaaaaa                                aaaaaaaaaa
              a aaaaaaaa                                a aaaaaaaa
              aaaa                                      aaaa
    (10 lines output)                         (10 lines output)
    äääääääää                                 äääääääää

               äääääääää                                 äääääääää
              ääääääääää                                 äääääääää
              ää ääääääää                               äää ääääää
              ääää ääääää                               ääääää äää
              ääääääääää                                ääääääääää
              ääääääää ää                               ääääääääää
              ääääääääää                                ä ääääääää
    (9 lines output)                                    ääää
                                              (10 lines output)
    aaaaaaaaa                                 aaaaaaaaa

               aaaaaaaaa                                 aaaaaaaaa
                        aaaaaaaaaa                                aaaaaaaaaa
              aa                                        aa
              aaaaaaaaaa                                aaaaaaaaaa
              aa                                        aa
              aaaaaaaaaa                                aaaaaaaaaa
              aaaaaaaaaa                                aaaaaaaaaa
              aaaa                                      aaaa
              aaaaaaaaaa                                aaaaaaaaaa
              aa                                        aa
    (12 lines output)                         (12 lines output)
    äääääääää                                 äääääääää

                        äääääääää                        äääääääää
              ääääääääää                                          ääääääääää
              ää                                        ää
              ääääääääää                                ääääääääää
              ää                                        ää
              ääääääääää                                ääääääääää
              ääääääääää                                ääääääääää
              ääää                                      ääää
              ääääääääää                                ääääääääää
              ää                                        ää
    (12 lines output)                         (12 lines output)
    aaaaaaaaa                                 aaaaaaaaa

               aaaaaaaaa                                 aaaaaaaaa
               aaaaaaaaa                                 aaaaaaaaa
              aaa aaaaaa                                aaa aaaaaa
              aaaaaa aaa                                aaaaaa aaa
              aaaaaaaaaa                                aaaaaaaaaa
              aaaaaaaaaa                                aaaaaaaaaa
              a aaaaaaaa                                a aaaaaaaa
              aaaa                                      aaaa
    (10 lines output)                         (10 lines output)
    äääääääää                                 äääääääää

               äääääääää                                 äääääääää
              ääääääääää                                 äääääääää
              ää ääääääää                               äää ääääää
              ääää ääääää                               ääääää äää
              ääääääääää                                ääääääääää
              ääääääää ää                               ääääääääää
              ääääääääää                                ä ääääääää
    (9 lines output)                                    ääää
                                              (10 lines output)
    aaaaaaaaa                                 aaaaaaaaa

               aaaaaaaaa                                 aaaaaaaaa
                        aaaaaaaaaa                                aaaaaaaaaa
              aa                                        aa
              aaaaaaaaaa                                aaaaaaaaaa
              aa                                        aa
              aaaaaaaaaa                                aaaaaaaaaa
              aaaaaaaaaa                                aaaaaaaaaa
              aaaa                                      aaaa
              aaaaaaaaaa                                aaaaaaaaaa
              aa                                        aa
    (12 lines output)                         (12 lines output)
    äääääääää                                 äääääääää

                        äääääääää                        äääääääää
              ääääääääää                                          ääääääääää
              ää                                        ää
              ääääääääää                                ääääääääää
              ää                                        ää
              ääääääääää                                ääääääääää
              ääääääääää                                ääääääääää
              ääää                                      ääää
              ääääääääää                                ääääääääää
              ää                                        ää
    (12 lines output)                         (12 lines output)

(15) By Florian Balmer (florian.balmer) on 2018-11-17 07:43:59 in reply to 11 updated by 15.1 [link]

A simple test case for the bug fixed with [&lbrack;70dd8f74&rbrack;](https://www.fossil-scm.org/index.html/info/70dd8f74) is the following command:

    fossil test-comment-format --trimspace --origbreak "" "abcdef ghijkl  " "ghijkl  "

The output is:

    abcdef
    gijkl
    (2 lines output)

But the expected output is:

    abcdef
    ghijkl
    (2 lines output)

The checks for spaces, which lead to increments of the line index, read the buffer at `(line index * 2)` instead of at `(line index)`.

Moreover, the location at `(line index * 2)` could potentially be past the end of the buffer.

(15.1) By Florian Balmer (florian.balmer) on 2018-11-17 08:08:21 edited from 15.0 in reply to 11 [link]

A simple test case for the bug fixed with [&lbrack;70dd8f74&rbrack;](https://www.fossil-scm.org/index.html/info/70dd8f74) is the following command:

    fossil test-comment-format --trimspace --origbreak "" "a bc d" "bc d"

The output is:

    a 
    b d
    (2 lines output)

But the expected output is:

    a
    bc d
    (2 lines output)

The checks for spaces, which lead to increments of the line index, read the buffer at `(line index * 2)` instead of at `(line index)`.

Moreover, the location at `(line index * 2)` could potentially be past the end of the buffer.

(16) By Richard Hipp (drh) on 2018-11-28 23:36:26 in reply to 14 [link]

> I've done some more work on the comment printing functions to support UTF-8.

So is the comment-formatter-utf8 branch ready to be merged onto trunk?  Other
than a few stylistic issues (which I will be patching shortly) I don't see
anything wrong with it, and it appears to work for me, though I haven't tested
on a repository that has a lot of non-ASCII characters in the comments.

(17) By Florian Balmer (florian.balmer) on 2018-11-29 09:15:28 in reply to 16 [link]

Yes, I think it's ready, thank you for your interest!

The regression tests pass, and they still pass if the test data from test/comment.test is modified to contain multi-byte characters.

The `printf` command mentioned in this mailing list post &lbrack;0&rbrack; can also be used to perform tests with invalid UTF-8 sequences:

    ./fossil test-comment-format "" $(printf '\xC3\xA4\xA4')

&lbrack;0&rbrack; <https://www.mail-archive.com/fossil-users@lists.fossil-scm.org/msg25484.html>

I'd like to emphasize that the added functionality is very simple: it's just moving the "cursor" outside of (valid or invalid) UTF-8 multi-byte sequences, so that word wrapping and sending chunks to the console (on Windows) won't split UTF-8 sequences.

I'm currently doing research and testing for a more advanced version, based on the `wcwidth` &lbrack;1&rbrack; function already mentioned, to handle diacritics and wide characters. (I'm fascinated by this topic, but my problem is finding the time.)

&lbrack;1&rbrack; <https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c>

Would it be possible to include the `wcwidth` function in Fossil? Or would the original author have to sign a contributor agreement? The file header says:

    * Markus Kuhn -- 2007-05-26 (Unicode 5.0)
    *
    * Permission to use, copy, modify, and distribute this software
    * for any purpose and without fee is hereby granted. The author
    * disclaims all warranties with regard to this software.

The Unicode data table required by `wcwidth` could be updated to version 11.0.0 or 12.0.0 Beta using the `uniset` &lbrack;2&rbrack; utility program. Would it be possible to include just the output of `uniset` in Fossil?

&lbrack;2&rbrack; <https://github.com/depp/uniset>

Still, proper Unicode width measurement and word wrapping involves even more heavy lifting than the `wcwidth` function is able to do:

* <http://userguide.icu-project.org/boundaryanalysis>
* <https://www.gnu.org/software/libunistring/manual/libunistring.html#uniwbrk_002eh>
* <https://www.gnu.org/software/libunistring/manual/libunistring.html#uniwidth_002eh>