Fossil Forum

On the command line, if a multibyte word wraps in a comment, fossil breaks up the word as a single byte and destroys it.

By anonymous on 2018-10-08 16:01:36 [link]

status and timeline command

On the command line, if a multibyte character falls on a line wrap inside a comment, Fossil splits it at a byte boundary and corrupts it.

This happens because the word-wrapping code does not check the UTF-8 high-order bit pattern.

By anonymous on 2018-10-09 05:48:41 [link]

https://fossil-scm.org/fossil/artifact?txt=1&ln=301,302&name=4d18073b1a8e98bd

src/comformat.c

Insert additional code to check for UTF-8 characters between lines 301 and 302.

By stephan on 2018-10-09 11:37:23 [link]

For anyone interested in implementing a fix for this, here's a snippet/block (extracted from one of my trees) which uses the first byte of a UTF-8 char to determine the number of bytes in that char (assuming well-formed input), providing for a relatively fast UTF-8 strlen op:

(note that `x` is the current byte (unsigned char) in the string, `end` is the one-past-the-end byte, and `rc` is the running character count)

/* Derived from:
   http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html
*/
for( ; x < end; ++x, ++rc ){
    if(*x>127U){
        switch(0xF0 & *x) {
          case 0xF0: /* length 4 */
              x += 3;
              break;
          case 0xE0: /* length 3 */
              x+= 2;
              break;
          default: /* length 2 */
              x += 1;
              break;
        }
    }
}
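
For experimenting with the loop above, it can be wrapped in a self-contained function. This is only a sketch: the name `utf8_strlen_fast` and its signature are made up for illustration, it uses an index instead of a moving pointer, and like the original it assumes well-formed input.

```c
#include <stddef.h>

/* Count the UTF-8 characters in the n bytes starting at z.
** Hypothetical helper, not Fossil code. Assumes well-formed UTF-8,
** so the lead byte alone decides how many trail bytes to skip. */
static size_t utf8_strlen_fast(const char *z, size_t n){
  const unsigned char *x = (const unsigned char*)z;
  size_t i;
  size_t rc = 0;   /* running character count */
  for(i=0; i<n; ++i, ++rc){
    if( x[i]>127U ){
      switch( 0xF0 & x[i] ){
        case 0xF0: i += 3; break;  /* lead byte of a 4-byte sequence */
        case 0xE0: i += 2; break;  /* lead byte of a 3-byte sequence */
        default:   i += 1; break;  /* lead byte of a 2-byte sequence */
      }
    }
  }
  return rc;
}
```

For example, the 10 bytes of "aÄ☃👻" (1 + 2 + 3 + 4 bytes) count as 4 characters.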

By florian.balmer on 2018-10-09 12:14:42 [link]

I'd like to add that a "character sequence" in UTF-8 can be quite different from a "user-perceived character", as the latter may consist of multiple combining (diacritical) "characters" to form a "grapheme".

On Windows, there's the built-in CharNextW() function to find the next user-perceived character in a UTF-16 string, but unfortunately this function does not handle code points beyond the BMP. Getting this right for the full Unicode code space on all platforms may require quite some work, or resorting to a (big) library like ICU.

Yet, your suggested solutions are nice, clean and practical, in any case better than the current situation, and maybe sufficient for the capabilities of most consoles. But Fossil output can also be redirected to text files for other uses, for example.

By stephan on 2018-10-09 12:21:15 [link]

Fossil internally converts (if needed) all checkin comments and such to UTF-8, so Windows code pages and UTF-16/32/whatever are not a concern at that level.

(Unless i'm sorely mistaken, but i don't believe i am.)

By florian.balmer on 2018-10-09 12:33:47 [link]

I agree, but to get proper word-wrapping for free, I would pay the cost of a temporary conversion UTF-8 → UTF-16 → UTF-8.

On Windows, Fossil converts console output fed to WriteConsoleW() to UTF-16, anyway, so maybe the sequence could even be abbreviated.

By anonymous on 2018-10-10 01:24:49 [link]

No conversion to UTF-16 is necessary; see https://en.wikipedia.org/wiki/UTF-8#Description

(1) Shift the byte right by 6 bits and look at the result:
    0b11: first (lead) byte of a UTF-8 sequence
    0b10: following (continuation) byte of a UTF-8 sequence
(2) Any byte that matches neither pattern is ASCII
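
The two-bit test described here might look as follows in C. The enum and function names are invented for this sketch; it only classifies single bytes and says nothing about whether a whole sequence is valid.

```c
/* Byte classes distinguished by the top two bits (illustrative names). */
enum Utf8ByteKind { UTF8_ASCII, UTF8_LEAD, UTF8_TRAIL };

/* Classify one byte by shifting it right by 6 bits. */
static enum Utf8ByteKind utf8_byte_kind(unsigned char c){
  switch( c>>6 ){
    case 3:  return UTF8_LEAD;   /* 0b11xxxxxx: starts a multi-byte sequence */
    case 2:  return UTF8_TRAIL;  /* 0b10xxxxxx: continues a sequence */
    default: return UTF8_ASCII;  /* 0b0xxxxxxx: single-byte character */
  }
}
```

A word-wrapping routine could use this to refuse a line break in front of any byte classified as `UTF8_TRAIL`.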

By wyoung on 2018-10-10 01:47:59 [link]

Unicode is much more complicated than that.

That answer is written for Perl, a language that has pretty much unparalleled Unicode support, whereas Fossil is written in C, a pre-Unicode language to which the various standards committees have added a fairly weak set of add-on facilities to support Unicode. So, the situation's even worse in C.

For one thing, your plan ignores combining characters.

By florian.balmer on 2018-10-10 06:58:30 [link]

Consider the French word "développeur", represented in the Unicode Normalization Form NFD.

Checking for UTF-8 character boundaries yields:

d | e | ́ | v | e | l | o | p | p | e | u | r

Checking for "user-perceived" character boundaries yields:

d | é | v | e | l | o | p | p | e | u | r
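
The difference between the two counts can be reproduced with a small sketch. Both function names are hypothetical, and the second function is a deliberate simplification: it only skips the Combining Diacritical Marks block U+0300..U+036F (encoded as CC 80 to CD AF), whereas real grapheme segmentation needs the full Unicode tables.

```c
#include <stddef.h>

/* Count code points in a well-formed, NUL-terminated UTF-8 string:
** every byte that is not a continuation byte starts a code point. */
static size_t count_code_points(const char *z){
  size_t n = 0;
  for( ; *z; ++z ){
    if( ((unsigned char)*z & 0xC0)!=0x80 ) n++;
  }
  return n;
}

/* Same, but do not count code points in U+0300..U+036F, so a base
** letter plus its combining accent counts once. Simplified sketch. */
static size_t count_base_chars(const char *z){
  const unsigned char *p = (const unsigned char*)z;
  size_t n = 0;
  for( ; *p; ++p ){
    if( (*p & 0xC0)==0x80 ) continue;                     /* continuation byte */
    if( *p==0xCC ) continue;                              /* U+0300..U+033F */
    if( *p==0xCD && p[1]>=0x80 && p[1]<=0xAF ) continue;  /* U+0340..U+036F */
    n++;
  }
  return n;
}
```

For "développeur" in NFD (the bytes "de\xCC\x81veloppeur"), the first function returns 12 and the second returns 11, matching the two boundary lists above.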

Avoiding breaks inside (multi-byte) UTF-8 characters is not enough, as from the perspective of a user, the following result may still be wrong:

............. de
́veloppeur ....

The mentioned Windows API function CharNextW() can detect user-perceived character boundaries for UTF-16 strings (limited to BMP code points). The cost of a temporary conversion from UTF-8 to UTF-16 and back is probably swamped in the heavy-lifting work required by this function.

Also note that any UTF-8-based equivalent to CharNextW() is likely to require internal conversion of multi-byte UTF-8 code unit sequences to UTF-32 code points, to be able to perform lookups in Unicode character classification tables.

I agree, this is probably not the most relevant thing for a command-line tool like Fossil, but I couldn't hold back my nitpicky comment that "handling Unicode" requires more than "handling UTF-8".

By florian.balmer on 2018-10-10 09:51:36 [link]

It's by no means about picking on someone, but I just find the topic extremely exciting.

I have another example, why conversion between UTF-8, UTF-16 and UTF-32 is not "evil" or "slow":

The glyph lookup tables used by modern font technologies work with 32-bit code points (UTF-32). So anything that the text rendering engine is about to display needs to be converted to UTF-32.

This is true for UTF-16 text fed to WriteConsoleW() on Windows, and also true for strings fed to console output routines on UTF-8-oriented operating systems.

So this is a very common operation, continuously happening on Windows, Linux, and Mac.

Sure, displaying text is slow, but because the text rendering engine has to perform much more complex operations like character reordering from logical to visual order, contextual shaping depending on the surrounding characters, combining of characters with diacritics, and stacking or combining multiple characters into clusters. And we're not yet talking about graphics-related processing like anti-aliasing, and the like. The conversion between the various UTFs is negligible, in this context.

By florian.balmer on 2018-10-17 14:25:43

Patch: Modify the comment formatter to avoid output of incomplete UTF-8 sequences, and to avoid line breaks inside UTF-8 sequences

Putting proper handling of combining characters aside, I have implemented the suggestion by stephan, to avoid splitting of UTF-8 sequences.

https://www.fossil-scm.org/index.html/timeline?&r=comment-formatter-utf8

Tests with comments containing 2-, 3- and 4-byte UTF-8 sequences (run on Windows 10, output of patched version in the left column, unpatched to the right):

> fossil init sample.fossil
> fossil open sample.fossil
> fossil ci --allow-empty -m "{01234567}Ä{01234567}☃{0123456}👻"
> fossil co prev

> fossil time -n 1 -W 21 --comfmtflags 0

=== 2018-10-17 ===                       === 2018-10-17 ===
09:38:29 [2447c13a09]                    09:38:29 [2447c13a09]
          {01234567}Ä                              {01234567}Ã
         {01234567}☃{                            „{01234567}â
         0123456}👻 (u                            ˜ƒ{0123456}ð
         ser: Florian                             Ÿ‘» (user: F
          tags: trunk                             lorian tags:
         )                                         trunk)
--- entry limit (1) reached ---          --- entry limit (1) reached ---

> fossil time -n 1 -W 21 --comfmtflags 1

=== 2018-10-17 ===                       === 2018-10-17 ===
09:38:29 [2447c13a09]                    09:38:29 [2447c13a09]
         {01234567}Ä{                             {01234567}Ä
         01234567}☃{0                            {01234567}â˜
         123456}👻                               ƒ{0123456}ðŸ
         (user:                                   ‘» (user:
         Florian                                  Florian
         tags: trunk)                             tags: trunk)
--- entry limit (1) reached ---          --- entry limit (1) reached ---

I believe it's not safe to assume well-formed UTF-8 in Fossil. Check-in comments entered via a "forgiving" text editor (or even a hex editor), or sneaked-in through forged manifests, may contain invalid UTF-8 sequences. And it seems that Fossil itself allows certain invalid sequences, such as the "overlong representations of the NUL character":

c0 80
e0 80 80
f0 80 80 80

My approach was to treat well-formed, incomplete, and ill-formed sequences all the same, to make sure lead bytes are output with the appropriate number of trail bytes. So there's some extra effort to check if the expected trail bytes are really there, instead of just assuming their presence and number as deduced from the lead byte.


Incomplete Sequences

  • Otherwise well-formed sequences with missing trail bytes

Ill-Formed Sequences

  • Invalid lead bytes

    • Lead bytes C0-C1 followed by any trail byte → overlong representation
    • Lead bytes F5-F7 → 4-byte sequence with code point beyond the Unicode limit of U+10FFFF
    • Lead bytes F8-FF → theoretically starting 5- to 8-byte sequences (undefined)
  • Disallowed ranges for the first trail byte following certain lead bytes

    • Lead byte E0 followed by first trail byte 80-9F → overlong representation
    • Lead byte ED followed by first trail byte A0-BF → code point from the UTF-16 surrogates range
    • Lead byte F0 followed by first trail byte 80-8F → overlong representation
    • Lead byte F4 followed by first trail byte 90-BF → code point beyond the Unicode limit of U+10FFFF
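
The idea of keeping a lead byte together with whatever trail bytes are really present, without judging validity, could be sketched like this. This is not the code from the branch: the function name is invented, and treating lead bytes F8-FF as single bytes is an assumption made for the example.

```c
#include <stddef.h>

/* Return how many of the n bytes at z should be kept together:
** one lead byte plus the trail bytes that actually follow it, up to
** the number announced by the lead byte. Well-formed, incomplete and
** ill-formed sequences are all handled the same way. Sketch only. */
static size_t utf8_seq_len(const unsigned char *z, size_t n){
  size_t want;   /* trail bytes announced by the lead byte */
  size_t i;
  if( n==0 ) return 0;
  if( z[0]<0xC0 ){
    return 1;                    /* ASCII, or a stray trail byte */
  }else if( z[0]<0xE0 ){
    want = 1;                    /* C0-DF: 2-byte sequence */
  }else if( z[0]<0xF0 ){
    want = 2;                    /* E0-EF: 3-byte sequence */
  }else if( z[0]<0xF8 ){
    want = 3;                    /* F0-F7: 4-byte sequence */
  }else{
    return 1;                    /* F8-FF: assumption, treat as one byte */
  }
  /* Accept trail bytes only while they are really there. */
  for(i=1; i<=want && i<n && (z[i] & 0xC0)==0x80; i++){}
  return i;
}
```

With this approach the overlong pair `c0 80` stays together as 2 bytes, an incomplete `e2 98` at the end of the input stays together as 2 bytes, and in `c3 a4 a4` the first two bytes form one unit while the extra `a4` is handled on its own.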

Notes on String Width Measurement

Besides handling of combining characters, proper column-oriented word wrapping (for terminals with fixed-width fonts or text files) also requires granting two "display cells" to East Asian FullWidth characters. (The "Emoji" are a derivative of the FullWidth characters.)

There's a reference implementation of a wcwidth() function to measure "display cells" of a string, adding 0 for combining characters, 2 for FullWidth characters, and 1 for "normal-width" characters.

https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c

And here's an updated version with more recent Unicode data:

https://git.tartarus.org/?p=simon/putty.git;a=commitdiff;h=4241734dde44caea5a9c73a79db9c6d4cae50861

But this is probably beyond what even some text editors are able to handle.
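
To make the 0/1/2 cell counting concrete, here is a toy version. It is not the referenced wcwidth() implementation: the function names are invented, and the ranges are a small hand-picked subset (one combining block and three wide blocks) chosen for illustration, where a real implementation uses tables generated from the Unicode data files.

```c
#include <stddef.h>

/* Display cells for one code point. Illustrative subset of ranges only. */
static int cell_width(unsigned long cp){
  if( cp>=0x0300 && cp<=0x036F ) return 0;    /* Combining Diacritical Marks */
  if( (cp>=0x1100 && cp<=0x115F)              /* Hangul Jamo leading consonants */
   || (cp>=0x4E00 && cp<=0x9FFF)              /* CJK Unified Ideographs */
   || (cp>=0xFF01 && cp<=0xFF60) ) return 2;  /* Fullwidth forms */
  return 1;                                   /* everything else: one cell */
}

/* Sum the display cells of n already-decoded code points. */
static int string_cells(const unsigned long *cp, size_t n){
  int w = 0;
  size_t i;
  for(i=0; i<n; i++) w += cell_width(cp[i]);
  return w;
}
```

Under these assumptions, the three code points d, e, U+0301 occupy 2 cells, and the two ideographs U+4E2D U+6587 occupy 4 cells.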

By florian.balmer on 2018-10-17 19:15:04 [link]

I'm happy to report that the patch was also successfully tested on Ubuntu 18.10!

Also note that the Fossil executable for Windows downloaded from fossil-scm.org seems to be compiled with the MBCS_COMMAND_LINE option, so the tests from the previous post to create check-in comments with non-ASCII characters from the command line may only work with characters from the current console code page. In this case, the web ui can be used to edit the comments and add non-ASCII characters for testing.

By florian.balmer on 2018-11-16 11:46:19 [link]

I've done some more work on the comment printing functions to support UTF-8.

https://www.fossil-scm.org/index.html/timeline?&r=comment-formatter-utf8

Along the way, I've also added output buffering to the (non-legacy) comment printing function. On Windows, Fossil does not rely on output buffering provided by the C runtime library, but calls WriteConsoleW() directly (for better Unicode support), yet sometimes for very small chunks (i.e. single characters). The output buffering results in a user perceptible speed improvement on Windows, and (surprisingly) also in a measurable speed improvement on Linux, possibly due to other overhead of (frequent) calls to fossil_print().
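
The general shape of such output buffering might be sketched as below. This is not the code from the branch: `OutBuf`, `outbuf_append`, `outbuf_flush` and the counting sink are hypothetical names, the 64-byte size is arbitrary, and the sink stands in for whatever actually writes to the console. Each appended chunk is kept intact, so as long as callers append whole UTF-8 sequences, a flush never splits one.

```c
#include <stddef.h>
#include <string.h>

/* Small output buffer with a pluggable sink (illustrative only). */
typedef struct OutBuf {
  char a[64];                           /* pending bytes */
  size_t n;                             /* number of pending bytes */
  void (*xWrite)(const char*, size_t);  /* where flushed bytes go */
} OutBuf;

/* Demo sink that only counts calls and bytes. */
static size_t nWrites = 0, nBytes = 0;
static void count_sink(const char *z, size_t n){
  (void)z;
  nWrites++;
  nBytes += n;
}

/* Hand all pending bytes to the sink in a single call. */
static void outbuf_flush(OutBuf *p){
  if( p->n>0 ){
    p->xWrite(p->a, p->n);
    p->n = 0;
  }
}

/* Queue n bytes; the chunk is never split across two sink calls. */
static void outbuf_append(OutBuf *p, const char *z, size_t n){
  if( n > sizeof(p->a) - p->n ) outbuf_flush(p);  /* not enough room left */
  if( n >= sizeof(p->a) ){
    p->xWrite(z, n);                              /* too large to buffer */
  }else{
    memcpy(p->a + p->n, z, n);
    p->n += n;
  }
}
```

Appending 100 single characters and then flushing results in 2 sink calls instead of 100, which is the kind of saving described above for very small chunks.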

The latest word breaking enhancements ensure identical layout of UTF-8 and ASCII text. (Results are analogous for 3- and 4-byte UTF-8 sequences, so they are omitted here.)

Tests:

./fossil test-comment-format --decode --indent 10 --width 20 "" "aaaaaaaaa \n aaaaaaaaa aaaaaaaaaaaa aaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaa"
./fossil test-comment-format --decode --indent 10 --width 20 "" "äääääääää \n äääääääää ääääääääääää ääääääääääää ääääääääääääääääääääääää ääääääääääää"
./fossil test-comment-format --decode --indent 10 --width 20 "" "aaaaaaaaa \n aaaaaaaaa aaaaaaaaaaaa aaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaa" --wordbreak
./fossil test-comment-format --decode --indent 10 --width 20 "" "äääääääää \n äääääääää ääääääääääää ääääääääääää ääääääääääääääääääääääää ääääääääääää" --wordbreak
./fossil test-comment-format --decode --indent 10 --width 20 "" "aaaaaaaaa \n aaaaaaaaa aaaaaaaaaaaa aaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaa" --origbreak
./fossil test-comment-format --decode --indent 10 --width 20 "" "äääääääää \n äääääääää ääääääääääää ääääääääääää ääääääääääääääääääääääää ääääääääääää" --origbreak
./fossil test-comment-format --decode --indent 10 --width 20 "" "aaaaaaaaa \n aaaaaaaaa aaaaaaaaaaaa aaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaa" --wordbreak --origbreak
./fossil test-comment-format --decode --indent 10 --width 20 "" "äääääääää \n äääääääää ääääääääääää ääääääääääää ääääääääääääääääääääääää ääääääääääää" --wordbreak --origbreak

The left column is the output of Fossil [29d3a2ed] without the word breaking fixes, and the right column is the output of Fossil [c9ec3d18] with the word breaking fixes applied. The layout for the "a" and the following "ä" variant should look identical, in the same column.

Fossil [29d3a2ed]                         Fossil [c9ec3d18]

aaaaaaaaa                                 aaaaaaaaa

           aaaaaaaaa                                 aaaaaaaaa
           aaaaaaaaa                                 aaaaaaaaa
          aaa aaaaaa                                aaa aaaaaa
          aaaaaa aaa                                aaaaaa aaa
          aaaaaaaaaa                                aaaaaaaaaa
          aaaaaaaaaa                                aaaaaaaaaa
          a aaaaaaaa                                a aaaaaaaa
          aaaa                                      aaaa
(10 lines output)                         (10 lines output)
äääääääää                                 äääääääää

           äääääääää                                 äääääääää
          ääääääääää                                 äääääääää
          ää ääääääää                               äää ääääää
          ääää ääääää                               ääääää äää
          ääääääääää                                ääääääääää
          ääääääää ää                               ääääääääää
          ääääääääää                                ä ääääääää
(9 lines output)                                    ääää
                                          (10 lines output)
aaaaaaaaa                                 aaaaaaaaa

           aaaaaaaaa                                 aaaaaaaaa
                    aaaaaaaaaa                                aaaaaaaaaa
          aa                                        aa
          aaaaaaaaaa                                aaaaaaaaaa
          aa                                        aa
          aaaaaaaaaa                                aaaaaaaaaa
          aaaaaaaaaa                                aaaaaaaaaa
          aaaa                                      aaaa
          aaaaaaaaaa                                aaaaaaaaaa
          aa                                        aa
(12 lines output)                         (12 lines output)
äääääääää                                 äääääääää

                    äääääääää                        äääääääää
          ääääääääää                                          ääääääääää
          ää                                        ää
          ääääääääää                                ääääääääää
          ää                                        ää
          ääääääääää                                ääääääääää
          ääääääääää                                ääääääääää
          ääää                                      ääää
          ääääääääää                                ääääääääää
          ää                                        ää
(12 lines output)                         (12 lines output)
aaaaaaaaa                                 aaaaaaaaa

           aaaaaaaaa                                 aaaaaaaaa
           aaaaaaaaa                                 aaaaaaaaa
          aaa aaaaaa                                aaa aaaaaa
          aaaaaa aaa                                aaaaaa aaa
          aaaaaaaaaa                                aaaaaaaaaa
          aaaaaaaaaa                                aaaaaaaaaa
          a aaaaaaaa                                a aaaaaaaa
          aaaa                                      aaaa
(10 lines output)                         (10 lines output)
äääääääää                                 äääääääää

           äääääääää                                 äääääääää
          ääääääääää                                 äääääääää
          ää ääääääää                               äää ääääää
          ääää ääääää                               ääääää äää
          ääääääääää                                ääääääääää
          ääääääää ää                               ääääääääää
          ääääääääää                                ä ääääääää
(9 lines output)                                    ääää
                                          (10 lines output)
aaaaaaaaa                                 aaaaaaaaa

           aaaaaaaaa                                 aaaaaaaaa
                    aaaaaaaaaa                                aaaaaaaaaa
          aa                                        aa
          aaaaaaaaaa                                aaaaaaaaaa
          aa                                        aa
          aaaaaaaaaa                                aaaaaaaaaa
          aaaaaaaaaa                                aaaaaaaaaa
          aaaa                                      aaaa
          aaaaaaaaaa                                aaaaaaaaaa
          aa                                        aa
(12 lines output)                         (12 lines output)
äääääääää                                 äääääääää

                    äääääääää                        äääääääää
          ääääääääää                                          ääääääääää
          ää                                        ää
          ääääääääää                                ääääääääää
          ää                                        ää
          ääääääääää                                ääääääääää
          ääääääääää                                ääääääääää
          ääää                                      ääää
          ääääääääää                                ääääääääää
          ää                                        ää
(12 lines output)                         (12 lines output)

By drh on 2018-11-28 23:36:26 [link]

I've done some more work on the comment printing functions to support UTF-8.

So is the comment-formatter-utf8 branch ready to be merged onto trunk? Other than a few stylistic issues (which I will be patching shortly) I don't see anything wrong with it, and it appears to work for me, though I haven't tested on a repository that has a lot of non-ASCII characters in the comments.

By florian.balmer on 2018-11-29 09:15:28 [link]

Yes, I think it's ready, thank you for your interest!

The regression tests pass, and they also pass if the test data from test/comment.test is modified to contain multi-byte characters.

The printf command mentioned in this mailing list post [0] can also be used to perform tests with invalid UTF-8 sequences:

./fossil test-comment-format "" $(printf '\xC3\xA4\xA4')

[0] https://www.mail-archive.com/fossil-users@lists.fossil-scm.org/msg25484.html

I'd like to emphasize that the added functionality is very simple: it's just moving the "cursor" outside of (valid or invalid) UTF-8 multi-byte sequences, so that word wrapping and sending chunks to the console (on Windows) won't split UTF-8 sequences.
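
As a rough illustration of "moving the cursor outside" a sequence, a proposed break position can be stepped back while it points at a trail byte. The function name is hypothetical and this is not the code from the branch. The limit of three steps is an assumption of this sketch: a sequence has at most three trail bytes, so a long run of stray trail bytes in ill-formed input cannot drag the break position arbitrarily far back.

```c
#include <stddef.h>

/* Given n bytes at z and a proposed break position i (the next line
** would start at z[i]), step i back while z[i] is a trail byte
** (0b10xxxxxx), so that the break does not land inside a sequence.
** At most three steps are taken. Sketch only. */
static size_t utf8_break_before(const char *z, size_t n, size_t i){
  int k = 0;   /* number of steps taken so far */
  if( i>n ) i = n;
  while( i>0 && i<n && k<3 && ((unsigned char)z[i] & 0xC0)==0x80 ){
    i--;
    k++;
  }
  return i;
}
```

For the 5 bytes of "a☃b", a break proposed at offset 2 or 3 (inside the snowman) is moved back to offset 1, while offsets 1, 4 and 5 are left unchanged.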

I'm currently doing research and testing for a more advanced version, based on the wcwidth [1] function already mentioned, to handle diacritics and wide characters. (I'm fascinated by this topic, but my problem is to find the time.)

[1] https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c

Would it be possible to include the wcwidth function in Fossil? Or would the original author have to sign a contributor agreement? The file header says:

* Markus Kuhn -- 2007-05-26 (Unicode 5.0)
*
* Permission to use, copy, modify, and distribute this software
* for any purpose and without fee is hereby granted. The author
* disclaims all warranties with regard to this software.

The Unicode data table required by wcwidth could be updated to version 11.0.0 or 12.0.0 Beta using the uniset [2] utility program. Would it be possible to include just the output of uniset in Fossil?

[2] https://github.com/depp/uniset

Still, proper Unicode width measurement and word wrapping involves even more heavy lifting than the wcwidth function is able to do:

By florian.balmer on 2018-11-17 07:43:59 and edited on 2018-11-17 08:08:21 [link]

A simple test case for the bug fixed with [70dd8f74] is the following command:

fossil test-comment-format --trimspace --origbreak "" "a bc d" "bc d"

The output is:

a 
b d
(2 lines output)

But the expected output is:

a
bc d
(2 lines output)

The tests for spaces leading to increments of the line index happen at (line index * 2), instead of at (line index).

Moreover, the location at (line index * 2) could potentially be past the end of the buffer.