Fossil Forum

encoding question

(1) By jvdh (veedeehjay) on 2023-03-05 17:35:50

I currently want to track my edits to a document written in troff (no, not a manpage ;)). troff does not understand utf-8 as an input encoding, so I am using en_US.ISO8859-15, which at least also covers diaereses (the German "Umlaut", specifically).

naturally, fossil complains during checkin that the document is not in utf-8. I understand this warning can be silenced so that's not a real issue.

but now I notice that the web interface of fossil apparently does not care about the encoding of the text files it displays and presumably simply expects utf-8. result: everything in the file outside 7-bit ASCII (the common subset of utf-8 and iso8859-15), such as the German umlauts mentioned above, is displayed incorrectly in diffs and file content listings.

question: is there a way to make the web gui use/recognize a different input encoding for the files in the repo? either some setting in the admin section of the web gui (I did not find anything) or some setting in the repo that is honoured by the web gui?

(2) By Stephan Beal (stephan) on 2023-03-05 17:41:32 in reply to 1

but now I notice that the web-interface of fossil obviously/seemingly does not care for the encoding of the text file it displays and presumably simply expects utf-8. result:

When rendering to HTML fossil has to make some assumptions about encoding, and the only types of re-coding it knows about are those supported by sqlite, namely UTF8 and UTF16 (though, to be honest, i'm not sure if it can render UTF16 in the HTML interface). So...

question: is there a way to make the web-gui use/recognize a different input encoding for the files in the repo?

Nope. It's pretty weird that troff/groff don't support UTF8 by now.

(3) By jvdh (veedeehjay) on 2023-03-05 19:45:33 in reply to 2

ok, thanks. bad luck, then.

utf-8 and groff: agreed that it is strange that it still is not really supported. IIRC there is some wrapper or similar which can mask this shortcoming (by silently doing the utf8 → iso8859 translation before feeding the input to groff proper), but that's it. so out of the box, groff only understands iso8859.

out of curiosity, I do not really understand "fossil has to make assumptions about encoding". file(1) is perfectly capable of identifying the document in question as "ISO-8859 text", e.g., and fossil, too, recognises at least the non-utf8 nature of the file during checkin. so what would prevent it from handling the issue (to some extent, at least) in the web gui?

whether it would be worth the trouble to implement, sure, that's a different question. but if it were "easy" to add some switch to the 'Admin' section of the web gui where one could select a specific encoding, I would find this useful for certain cases like this one. but I admit those might become scarcer over time :).

(4) By Warren Young (wyoung) on 2023-03-05 20:41:25 in reply to 3

file(1) is perfectly capable of identifying the document in question as "ISO-8859 text",

That can only be a heuristic match. There's no way it can tell one 8-bit encoding from another other than by guessing.

You can see the problem in the output you quote: which part of ISO-8859 does it mean? There are sixteen of them!

Add to that the zoo of alternative 8-bit encodings and you can see why modern minimalist software like Fossil might choose to set that aspect of the bad old world aside and say, "Unicode only, please."

add some switch to the 'Admin' section of the web-gui where one could select a specific encoding,

That's great until you want two 8-bit encodings. Someone hands you a Hebrew document. Are you going to ask for encoding-glob so you can say, "These docs over here are 8859-8, but these over here are 8859-1." 🤮

(5) By jvdh (veedeehjay) on 2023-03-05 21:11:05 in reply to 4

  1. I am all in favour of utf8.

  2. there are situations where you can't use it (I have given one example).

  3. the fossil web-ui does not render those encodings. this is undesirable independent of how many percent of repos it affects as long as utf8 is not the only encoding in use on the planet.

  4. if I direct my browser to file://mynonutf8file.txt it renders the text just fine in my utf8 locale. seems to be doable with some file(1)-like heuristics.

  5. a switch to select the encoding would resolve the issue in a useful way, irrespective of the ability to construe situations where a single encoding will not suffice to render all documents in the repo. it would suffice to have that switch and a menu to select the suitable encoding. as said: whether it is worth the trouble to implement is up to the developer(s) to decide. denying that it would be helpful or doable is beside the point and wrong.

  6. I am not interested in some semi-flame war about this issue and see no need for puke emojis. we might leave it at that.

(6) By schmitzu on 2023-03-06 15:09:44 in reply to 5

I can only second this!

I, too, have some legacy repos with ANSI/ISO encoding, and it's annoying to see many black question marks when looking at some source or HTML files. That said, I don't have a perfect solution, but...

There are some clever heuristics to detect the encoding of an unknown text file, if it contains at least one character >=0x80 (at least to distinguish between UTF-8 and ISO8859-x). So if we have a fossil setting saying "use this encoding if a text file is not detected as UTF-8" it would be of great help.
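A minimal sketch of that fallback rule, assuming Python for illustration (the helper name and the `fallback` setting are hypothetical; no such option exists in Fossil):

```python
def decode_with_fallback(data: bytes, fallback: str = "iso-8859-15") -> str:
    """Decode text as UTF-8 if it is valid UTF-8, otherwise apply a
    configured fallback 8-bit encoding. UTF-8's strict byte patterns
    make false positives rare for real-world 8-bit text."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode(fallback)

# A Latin-9 encoded German word is rejected as UTF-8 and falls back:
print(decode_with_fallback("Grüße".encode("iso-8859-15")))  # Grüße
```

Pure 7-bit ASCII files decode identically either way, so only files containing bytes >= 0x80 are affected by the choice of fallback.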

(7) By Stephan Beal (stephan) on 2023-03-06 15:33:03 in reply to 6

There are some clever heuristics to detect the encoding of an unknown text file,

Detection is only part of the problem. After guessing which encoding it possibly is, the recoding itself still has to be performed. That would add a good deal of code, and the related long-term maintenance burden, to cover only a handful of repositories. Adding a dependency on a 3rd-party encoding-conversion library is unlikely to happen, as we limit 3rd-party dependencies to only the absolute necessities.

(9) By Warren Young (wyoung) on 2023-03-06 15:38:49 in reply to 7

Adding a dependency on a 3rd-party encoding-conversion library

iconv(3) is POSIX, and Windows of course has rich encoding conversion APIs.

(12) By anonymous on 2023-03-07 00:09:40 in reply to 7

I am not sure that any of it is necessary. Add a configuration option to specify what character encoding to use (it is up to the administrator to specify the correct character encoding; UTF-8 can be the default setting for compatibility), and then serve the HTML files (and, if appropriate, raw text files) with that character encoding; that is good enough. The server need not convert character encodings; the client can do so. The only required assumption is that it is a superset of ASCII and that some of the codes of ASCII characters are not used for the other use. (It would then also assume that file names are in the configured encoding; on UNIX systems, the system does not care about character encoding of file names so it is not a problem; this might be a bit of a complication on Windows, though.)

Note that the above would require the entire repository to use a single character encoding. This is not always appropriate, but for simplicity it might be best to do it this way anyways.
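The single-repository-encoding proposal above could be sketched like this, assuming Python for illustration (`REPO_ENCODING` and `raw_file_response` are hypothetical names standing in for the proposed admin setting and serving code, neither of which exists in Fossil):

```python
# Sketch of the proposal: serve file bytes untouched and declare the
# repository-wide encoding in the HTTP header; the browser does the decoding.
REPO_ENCODING = "iso-8859-15"  # hypothetical admin-configured setting

def raw_file_response(data: bytes) -> tuple[dict, bytes]:
    """Build headers for a raw text file with no server-side re-coding."""
    headers = {"Content-Type": f"text/plain; charset={REPO_ENCODING}"}
    return headers, data  # bytes pass through unchanged
```

The appeal of this design is that the server never touches the bytes; the cost, as noted above, is that one charset must fit the whole repository.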

I use non-Unicode encodings myself (such as the PC character encoding, when writing DOS programs, some of which are maintained in a Fossil repository), so this does matter to me, too.

(13) By Warren Young (wyoung) on 2023-03-07 01:18:16 in reply to 12

It's a good point about the browser being able to handle some of this when told how to render the content, but I think you're overlooking the Markdown/Wiki parsing step, where Fossil does have to know something about the encoding. Not all 8-bit character encodings are compatible with ASCII to the point of giving — let us say — the square bracket the same code point. How then can the MD parser understand [this is a link](https://and.it.goes.here.example.com/)?

Even if you're willing to restrict yourself to the pure ASCII extensions like ISO 8859, you've got the inverse problem if, as I hope, this feature also lets you use UTF-8 text. The way the UTF-8 bit patterns work, you can end up with false code points if you mistreat it as 8859, for example, causing markup errors.
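The false-code-point effect described above is easy to demonstrate, assuming Python for illustration:

```python
# UTF-8 encodes "ü" as two bytes, 0xC3 0xBC. Misread as ISO 8859-1,
# those become two unrelated characters ("false code points").
utf8_bytes = "ü".encode("utf-8")             # b'\xc3\xbc'
print(utf8_bytes.decode("iso-8859-1"))       # Ã¼

# The reverse mistake is harsher: a Latin-1 "ü" (0xFC) is not valid
# UTF-8 at all, so a strict decoder rejects the document outright.
latin1_bytes = "ü".encode("iso-8859-1")      # b'\xfc'
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    print("invalid UTF-8:", exc.reason)
```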

This is a niche issue. Someone with this rare itch is going to have to step up and do this one, I fear. I suspect the regular developers all converted to some form of Unicode exclusively decades ago.

(14) By anonymous on 2023-03-07 03:26:17 in reply to 13

Not all 8-bit character encodings are compatible with ASCII to the point of giving — let us say — the square bracket the same code point.

That is why I suggested restricting it to only character encodings that represent ASCII characters with only the ASCII byte sequences and non-ASCII characters entirely with bytes that have the high bit set (although not necessarily a single byte per character).

Even if you're willing to restrict yourself to the pure ASCII extensions like ISO 8859, you've got the inverse problem if, as I hope, this feature also lets you use UTF-8 text.

Of course it should allow UTF-8 as well, since it satisfies the above conditions too. However, my suggestion is to (for simplicity) require the entire repository to use the same character encoding for the purpose of display in the web interface. This would mean the character encoding for all cards as well as HTML; the command-line interface (at least on UNIX systems) would assume that the I/O character set (and the character encoding of file names) is the same as the repository character set and therefore does not need to be converted.

However, there is a problem when implementing diffs that display differences in individual characters, since the diff might not know which bytes begin and end a character. One way to handle this (adding an option for it if necessary) is to treat any contiguous sequence of bytes with the high bit set as a single character for the purpose of diffs.
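That tokenization rule could be sketched as follows, assuming Python for illustration (`diff_tokens` is a hypothetical helper; note that it may lump several adjacent multi-byte characters into one token, which is exactly the approximation being proposed):

```python
import re

def diff_tokens(data: bytes) -> list[bytes]:
    """Split bytes for a character-level diff without knowing the encoding:
    each ASCII byte is its own token; any contiguous run of high-bit
    bytes is treated as one opaque 'character'."""
    return re.findall(rb"[\x80-\xff]+|[\x00-\x7f]", data)

print(diff_tokens(b"Gr\xc3\xbc\xdfe"))
# [b'G', b'r', b'\xc3\xbc\xdf', b'e']
```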

I suspect the regular developers all converted to some form of Unicode exclusively decades ago.

Some programmers have not: those who may wish to use Fossil for maintaining DOS programs or other software that uses extended character sets other than Unicode, or programmers who just do not like Unicode.

(If anything more complicated than the above is needed (e.g. different character encodings for different files, or files that are encoded as Shift-JIS) then hopefully libfossil would be suitable, but I don't know if there are other implementations of HTML interfaces using libfossil. You could still redirect some paths (e.g. /raw and /xfer) to the official fossil program easily enough, I think).

(16) By Kees Nuyt (knu) on 2023-03-07 10:32:15 in reply to 14

Some programmers, who may wish to use Fossil for maintaining DOS programs or other stuff ...

Wouldn't it be MUCH easier to maintain the sources in UTF-8, and add an iconv step in the build process?

Fossil does this (not for iconv, but for other reasons):

$ make fossil
cc -g -O2 -o bld/translate ./tools/translate.c
:
bld/translate ./src/foci.c >bld/foci_.c
bld/translate ./src/forum.c >bld/forum_.c
bld/translate ./src/fshell.c >bld/fshell_.c
:

(20) By anonymous on 2023-03-07 19:50:50 in reply to 16

I would prefer not to, and there are reasons for this, such as:

  • Avoid needing to add extra build steps (to convert character encoding, and if appropriate, line endings).

  • Avoid complexity, ambiguity, and security issues of Unicode.

  • Avoid the possibility of adding characters outside of the target character set by mistake.

  • In some cases (although not DOS programming), you might be using characters that are not even in Unicode anyways. (In such a case it is unlikely that it can be displayed on most web browsers anyways)

If I have to, I will compromise by just not displaying the files correctly in the web interface. (I can use encoding-glob and crlf-glob, but this only affects the command-line interface.)

(Note: GitHub can display some non-Unicode encodings correctly (I don't know if other programs such as Gitea are capable of such a thing). So, if the project is mirrored on GitHub then they can be displayed on GitHub. GitHub allows using the web interface to edit individual files, but it will re-encode them as UTF-8 if they aren't already UTF-8 (and, fortunately, it displays a warning message in this case). Due to this, I had to clone the entire repository (a fork of the B-Free repository) just to edit one file so that it would still retain the existing EUC-JP encoding.)

(15) By Stephan Beal (stephan) on 2023-03-07 06:44:06 in reply to 12

I use non-Unicode encodings myself (such as the PC character encoding, when writing DOS programs, some of which are maintained in a Fossil repository), so this does matter to me, too.

To add to what Warren said here:

This is a niche issue. Someone with this rare itch is going to have to step up and do this one, I fear.

Any such patcher would also have to convince our BDFL1, Richard, of the utility, as he has final say-so on which features make it into fossil. Convincing anyone, beyond a tiny handful of users, that the utility of this particular functionality justifies the long-term maintenance burden will be a "hard sell."


  1. ^ Benevolent Dictator For Life

(8) By Warren Young (wyoung) on 2023-03-06 15:36:51 in reply to 6

A single fallback encoding doesn't sound terrible to me. The method doesn't even cost us much in the case of Markdown/Wiki rendering, since we're already doing two O(n) passes over the document. (Once to parse it, once to translate the parse tree into HTML.) This method merely changes it to 3×O(n) worst case: one failed parse right at the end of the document, then a complete re-parse with the fallback encoding, with no further option to try.

The biggest thing I object to is having multiple 8-bit encodings. If two, then why not three? Why not ten? Why not all of them, and welcome back to 1989?

(10) By jvdh (veedeehjay) on 2023-03-06 17:07:53 in reply to 8

well, latin1 would be a start, then, in my view

(11) By anonymous on 2023-03-06 20:25:57 in reply to 1

I agree that it would be useful to specify non-Unicode encodings; I also would want this feature. The web page should not be required to assume that all files are UTF-8 (or even that they can be converted to UTF-8).

(17) By Vadim Goncharov (nuclight) on 2023-03-07 14:12:13 in reply to 1

But does groff really not support UTF-8 by now? Man pages in distros show different characters for e.g. list markers, and different languages are also supported: e.g. man vim shows my native language while in a UTF-8 locale, and man still calls groff on my system.

(18) By Stephan Beal (stephan) on 2023-03-07 14:37:40 in reply to 17

But does groff really not support UTF-8 by now?

Not according to StackOverflow and many other similar references.

https://man7.org/linux/man-pages/man7/groff_char.7.html says:

However, its input character set is restricted to that defined by the standards ISO Latin-1 (ISO 8859-1) and IBM code page 1047 (an EBCDIC arrangement of Latin-1). For ease of document maintenance in UTF-8 environments, it is advisable to use only the Unicode basic Latin code points, a subset of all of the foregoing historically referred to as US-ASCII, which has only 94 visible, printable code points.

(21.1) By Vadim Goncharov (nuclight) on 2023-03-08 22:07:04 edited from 21.0 in reply to 18

This seems strange. Once again, I have /usr/local/man/ru.UTF-8/man1/vim.1.gz (along with man pages in other locales from the same vim package) on my system, and it really does NOT have ISO-8859-5 but Unicode input inside it; that is, I copy-paste it as-is:

.TH VIM 1 "2002 Feb 22"
.SH ИМЯ
vim \- Vi IMproved (Улучшенный Vi), текстовый редактор для программистов
.SH КОМАНДНАЯ СТРОКА
.br
.B vim
[ключи] [файл ..]
.br
.B vim
[ключи] \-
.br
.B vim
[ключи] \-t метка
.br
.B vim
[ключи] \-q [файл ошибок]
.PP

...and you see that Fossil forum here shows readable Unicode.

(19) By jvdh (veedeehjay) on 2023-03-07 14:53:37 in reply to 17

groff sure can generate utf8 text output for tty devices. the issue at hand is the input encoding of the troff source file, see here.

so managing (say) manpage source files with fossil currently would lead to the reported rendering problems in the web gui. so I for one would be glad if some sort of fallback to iso8859-1 (aka latin1) or, preferably, iso8859-15 (aka latin9) became available in the future.

(22) By Vadim Goncharov (nuclight) on 2023-03-08 22:03:54 in reply to 19

It seems that you should dig deeper into the man system - as I showed in a sibling message, I see a Unicode man page, which fossil should not have any problems with. My example is from a FreeBSD system, vim package.

(23) By jvdh (veedeehjay) on 2023-03-08 22:30:23 in reply to 22

I was concerned with general (non-man) troff input (not using the "man" macros but the "ms" macros in my case). but anyway: if those manpages you refer to are in fact in utf8 encoding, groff proper cannot process them. it is of course perfectly possible that the "man" program on your system converts the document source to some encoding understood by groff on the fly, before feeding it to 'groff -man'.

(26) By Vadim Goncharov (nuclight) on 2023-03-18 15:06:30 in reply to 23

Well, then the only advice I currently have is to keep the files in another encoding and use commit/checkout hooks to convert: that is, UTF-8 in the repo, in the hope that groff will support UTF-8 one day, so the hooks could simply be disabled then.

(24) By Warren Young (wyoung) on 2023-03-08 22:35:07 in reply to 22

FreeBSD doesn't use groff for man pages, that being a GNU program. The limitation we're discussing is groff specific.

(25) By Vadim Goncharov (nuclight) on 2023-03-18 15:02:00 in reply to 24

Surprisingly, this is right: groff on my system existed as a dependency of some package, and the base system is now on mandoc instead... But there were times when groff really was in the FreeBSD base system, and if I recall correctly, I have been seeing non-English man pages for many years, and have seen such on Linux also. Then how could that be? Do all of them use some kludge to convert the codepage before feeding it to groff?..