Fossil Forum

Code page of files in web interface

(1) By anonymous on 2021-11-09 21:45:47 [link] [source]

Can the code page for displaying text files in the web interface be changed? I don't want files or file names to be treated as Unicode (treating them as Unicode has all sorts of security issues, for one thing; it is also really messy and has other problems).

(2) By Stephan Beal (stephan) on 2021-11-10 01:31:18 in reply to 1 [link] [source]

Can the code page for displaying text files in the web interface be changed?

No. Code pages are a Windows-ism and fossil is primarily a Unix application.

I don't want files or file names to be treated as Unicode (treating them as Unicode has all sorts of security issues, for one thing;

Citation needed. Very nearly every web page you visit uses UTF8 encoding.

it is also really messy and has other problems).

Far less/fewer than code pages.

Unicode has been The Standard for many years now and that's not going to change in the foreseeable future.

(3) By Bartek Jasicki (thindil) on 2021-11-10 06:27:07 in reply to 2 [link] [source]

I don't want files or file names to be treated as Unicode (treating them as Unicode has all sorts of security issues, for one thing;

Citation needed. Very nearly every web page you visit uses UTF8 encoding.

Probably reference to the new Trojan Source Attack: https://www.darkreading.com/dr-tech/3-ways-to-deal-with-the-trojan-source-attack

(4) By Stephan Beal (stephan) on 2021-11-10 08:09:55 in reply to 3 [link] [source]

Probably reference to the new Trojan Source Attack:

Presumably so, but that's just one, where the OP claims "Unicode has all sorts of security issues."

(6) By jamsek on 2021-11-10 09:47:13 in reply to 4 [link] [source]

Also:

To exploit this weakness, the adversary would need to have direct access to developers’ workstations, source code management system, or continuous integration pipelines.

“If an attacker has direct access to your source code management system, frankly, you probably definitely have bigger problems than this attack,” Rudis stated.

Can't say I disagree with Rudis. Emphasis and edits mine.

(7) By Warren Young (wyoung) on 2021-11-10 10:03:25 in reply to 6 [link] [source]

Redis accepts PRs on GitHub. Do you believe that, prior to this point, they were checking for this problem, especially given that the exploit's authors showed that typical developers' text editors would hide the problem as well?

You could get me to believe a claim like this about something like SQLite, where outside contributions are all but nonexistent, but any project that takes patches/bundles/diffs/PRs is vulnerable to the Trojan Source problem as long as tools silently obey bidi markers.

(8.2) By jamsek on 2021-11-10 10:27:11 edited from 8.1 in reply to 7 [link] [source]

Fair point: it's dependent on project policies. But subscribing to the cathedral development model, which is the norm around here, I put policies such as the one you cite (PRs) squarely in the camp of:

you probably definitely have bigger problems than this attack.

ETA: My opinion isn't an objection to implementing measures to prevent this, btw. And I think this suggestion[0] is definitely worth considering, if not doing immediately.

[0] "As for Trojan Source, following GitHub's lead and presenting a warning when these bidi markers are found is probably a good idea." (https://fossil-scm.org/forum/forumpost/db66d9443708c3c2?t=h)

(9) By Scott Robison (sdr) on 2021-11-10 18:07:28 in reply to 8.2 [link] [source]

There is truth to this, but sadly we all depend on bazaar style development at this point, whether it's government, web apps and services, our ISPs, or whoever else using software developed in that style. So even if we do not care for bazaar style development, we are very much impacted by it.

Linux recently reverted some contributions from the University of Minnesota for trying to sneak things into the kernel. The fact that Unicode trickery makes such things even harder to detect, if you don't know what you're looking for, just makes things even more complicated.

(5) By Warren Young (wyoung) on 2021-11-10 08:31:23 in reply to 3 [link] [source]

Oh, the pool goes much deeper than that. Any time you have an interpreter — for which UTF-8 to wchar_t and back most definitely qualifies — you have a chance of misconversion. Unicode also allows things like visual spoofing, fooling the human as well.

Not that I'm advocating for getting rid of Unicode, but we have to be realistic about it: doing Unicode properly is extremely complicated. That link is about Perl, which is one of the most capable Unicode-aware programming languages available. Now consider how much worse it is in C, where the language and library facilities aren't as rich.

The solution to all this isn't "let's go back to 8-bit encoding," it's to keep up with all of the change that's solving the problems we find when making our software cope with the full scope and sweep of human language. It's difficult, but it's worth doing.

As for Trojan Source, following GitHub's lead and presenting a warning when these bidi markers are found is probably a good idea.

(11) By Stephan Beal (stephan) on 2021-11-13 03:45:04 in reply to 5 [link] [source]

As for Trojan Source, following GitHub's lead and presenting a warning when these bidi markers are found is probably a good idea.

i'm wondering if we could pack that up in looks_like_utf8() (in lookslike.c). It seems like bidi checking would semantically fit right in there.

That said: bidi is a perfectly valid feature of Unicode, so i would hate to see us plaster "this file contains bidi..." warnings on pages. It's not our place to judge the semantic value of content in a repository, but to reproduce it exactly as it has been stored.

(12) By Scott Robison (sdr) on 2021-11-13 03:48:18 in reply to 11 [link] [source]

+1

(13) By Warren Young (wyoung) on 2021-11-13 04:16:02 in reply to 11 [link] [source]

It's not our place to judge the semantic value of content in a repository, but to reproduce it exactly as it has been stored.

I'll agree with that for wiki, the forum, and embedded docs, but is there any reason not to call foul when we see right-to-left text in the middle of a commit's diff view?

I for one am not in the habit of inserting Arabic text in the middle of my English source code. :)

I understand that one of the characteristic features of this attack is improperly balanced direction markers.

(14) By Stephan Beal (stephan) on 2021-11-13 04:32:55 in reply to 13 [link] [source]

I'll agree with that for wiki, the forum, and embedded docs, but is there any reason not to call foul when we see right-to-left text in the middle of a commit's diff view?

...

I understand that one of the characteristic features of this attack is improperly balanced direction markers.

By the same token (no pun intended) we don't validate balanced HTML/XML tags or string quoting in the diff view. The contents of the diff view are opaque to us, other than being "valid UTF8". Imbalanced characters do not invalidate content as UTF8; they just make it semantically questionable (but not necessarily wrong - that's not our place to judge).

i'm not outright against a warning marker of some sort, and won't object if someone wants to patch that, but making such judgements on a user's behalf seems to me like a slippery slope as well as a maintenance hassle.

In any case, if someone's going to add it, it looks to me like looks_like_utf8() would be a low-friction/low-impact place to do so. It would simply(?) need a new bitmask result value if it sees any bidi markers. IIRC, that check is run before feeding the content to the diff engine. In the case of bidi, we'd need to remember that the marker was seen and add a warning to the header. Adding a warning directly inline in the diff would mean that every diff generator (we currently have 6(?) of them: unified text/html, sbs text/html, JSON, TCL) would have to know about that and warn in a way suitable to that diff format (how do we do that with unified diffs without breaking them for purposes of applying patches? (Rhetorical question!)).
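For concreteness, here is a minimal sketch of the kind of check described above. It is not Fossil's actual code: the function name and the LOOK_BIDI flag value are invented, and it assumes the input has already been validated as UTF8.

    #define LOOK_BIDI 0x0800  /* hypothetical bitmask: content contains bidi controls */

    /* Scan n bytes of already-validated UTF-8 and return LOOK_BIDI if any
    ** Unicode bidi control code point (U+202A..U+202E, U+2066..U+2069)
    ** is present, else 0. */
    static int sketch_has_bidi(const unsigned char *z, int n){
      int i;
      for(i=0; i+2<n; i++){
        if( z[i]!=0xE2 ) continue;
        /* U+202A..U+202E encode as E2 80 AA..AE */
        if( z[i+1]==0x80 && z[i+2]>=0xAA && z[i+2]<=0xAE ) return LOOK_BIDI;
        /* U+2066..U+2069 encode as E2 81 A6..A9 */
        if( z[i+1]==0x81 && z[i+2]>=0xA6 && z[i+2]<=0xA9 ) return LOOK_BIDI;
      }
      return 0;
    }

A caller that already inspects the looks-like flags before handing content to the diff engine could OR this bit into its result and emit a single header notice when it is set, leaving the diff bodies themselves untouched.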

(15) By anonymous on 2021-11-13 06:30:00 in reply to 14 [link] [source]

how do we do that with unified diffs without breaking them for purposes of applying patches? (Rhetorical question!)

IIUC anything between diff and --- shall be ignored.
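For illustration (the file names and the wording of the notice are invented), patch(1) skips everything before the --- header, so a leading line like the one below would not break the patch:

    NOTE: this diff contains Unicode bidi control characters
    --- old/example.c
    +++ new/example.c
    @@ -1,3 +1,3 @@
     unchanged line
    -old line
    +new line
     unchanged line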

(16) By Scott Robison (sdr) on 2021-11-13 07:50:45 in reply to 13 [link] [source]

Could it not be the case that the bidi markers are found embedded in the middle of a quoted string, used as part of a program to generate UTF8 text? Should we also add parsers for all possible programming languages that might include such markers?

In some ways this smacks of the warnings that some modern compilers generate, either by default or because people activate "treat warnings as errors" because they once heard it was a good idea. Then those faux errors are reported to teams as though the policy of treat warnings as errors is an inviolate law of physics.

Is bidi sufficient? Should we start warning about non-ASCII Unicode characters that can look like ASCII characters? Or any two code points that can be confused with one another due to the appearance of their glyphs?

I'm not opposed to a warning, but it feels ... "icky" ... to me. It is as though anyone coding in a non-English language is a second class citizen. It's not a bad idea, but does this lead to the need for even more issues? I feel like a tool that allowed people to audit their source code would be a better use of time than shoehorning it into fossil.

(17) By Daniel Dumitriu (danield) on 2021-11-13 14:25:58 in reply to 16 [source]

bidi markers are found embedded in the middle of a quoted string, used as part of a program to generate UTF8 text? Should we also add parsers...

I think this decision - whether special (for some definition of 'special') Unicode characters are to be detected and/or allowed - should belong to the repo's users and admins. Fossil might help and provide means to do that - e.g. as options for add, diff, or commit, or some test-unicode-for-* command and/or a general (versionable?) setting.

IIUC anything between diff and --- shall be ignored.

Anything in the 'leading garbage' - basically before --- for unified diffs and *** for context diffs - is ignored by patch(1). Having said that and the above, I don't think a diff file should carry a warning; if anything, that should be shown on creation with the appropriate option.

(18) By Stephan Beal (stephan) on 2021-11-13 14:39:29 in reply to 17 [link] [source]

command and/or a general (versionable?) setting.

The problem with it being versionable is that the same villain who can commit Bad Bidi can also change that setting to keep their Bad Bidi from being reported. That's the reason symlink support is no longer versionable.

(19) By anonymous on 2021-11-13 18:32:16 in reply to 17 [link] [source]

I would want a mode to warn against all non-ASCII text (possibly limited to specified files), and a mode to interpret files as non-Unicode (in case the file isn't Unicode). For the latter it would be acceptable to simply not interpret any bytes outside the ASCII range at all, and perhaps to display hex codes for non-ASCII bytes and for most ASCII control codes (to avoid a different kind of attack, involving cursor positioning to hide stuff), if you tell it that the file isn't Unicode, since that would be the simplest implementation.

(20) By Marcelo Huerta (richieadler) on 2021-11-13 22:09:15 in reply to 19 [link] [source]

I would want a mode to warn against all non-ASCII text (possibly limited to specified files)

As a native speaker of a non-English language, this phrase makes me shiver in terror.

and a mode to interpret files as non-Unicode (in case the file isn't Unicode)

You probably know this, but for the benefit of those who don't: files aren't in Unicode; Unicode is some kind of Platonic ideal which needs to be represented in the file system (or in network data transmissions, but you get my meaning) with some encoding. The file is in that encoding (UTF-8, UTF-16, ISO-8859-1, ASCII...) not in "Unicode".

Speaking of "files saved in Unicode" is misleading.

(21) By anonymous on 2021-11-13 23:59:11 in reply to 20 [link] [source]

As a native speaker of a non-English language, this phrase makes me shiver in terror.

That is why I said it is limited to specified files, e.g. files containing program code. That way, you can put any non-ASCII text in other files (e.g. internationalization files). (Even so, I meant it to be an option, not mandatory!)

Speaking of "files saved in Unicode" is misleading.

OK, you are right. However, I mean encodings that aren't Unicode. For example, if one file (or all files) uses PC character encoding. The encoding might not even be the same in all files in a repository. However, simply being allowed to ignore it can be helpful; the implementation need not support all encodings, but simply be told that a file isn't necessarily UTF-8, and that this should not necessarily stop it from trying to display the file (if your computer is capable of doing so; you should still be allowed to tell it not to display non-ASCII characters, whether or not the file is UTF-8, since allowing them to be displayed might potentially be a security risk in some circumstances).

Although some people want to avoid everything other than English with ASCII, I am not one of them; I think that you can write in other languages, even those for which ASCII is unsuitable. Instead, I am one who thinks that Unicode is no good and that no character set can ever be suitable for all purposes.

(22) By Marcelo Huerta (richieadler) on 2021-11-14 00:40:11 in reply to 21 [link] [source]

That is why I said it is limited to specified files, e.g. files containing program code

Again, you're not making any sense. Program code for programs whose audience is outside of English-speaking countries would almost certainly include messages in non-ASCII encodings.

However, I mean encodings that aren't Unicode.

You don't seem to understand. There's no such thing as "Unicode encodings". Unicode is a description of the meaning of characters, and the assignment of "code points" in an ideal table. Encodings are ways to represent characters, but they are not Unicode themselves. The same Unicode codepoint is represented differently in UTF-8, UTF-16LE and UTF-16BE. None of those encodings "are" Unicode. I don't even know which encoding(s) you are talking about when you talk about "Unicode encodings".

For example, if one file (or all files) uses PC character encoding

If you mean "encoding used for text files in MS-DOS", there are a plethora of those. Which encoding do you mean? CP437? CP850? The multitude of other encodings for Slavic languages?

Or are you referring to ISO-8859-1 or that mutant creation, Windows-1252?

Although some people want to avoid everything other than English with ASCII, I am not one of them; I think that you can write in other languages, even those for which ASCII is unsuitable

How generous of you, allowing us to exist!

I am one who thinks that Unicode is no good and that no character set can ever be suitable for all purposes.

This phrase indicates that you completely misunderstand what Unicode is, because it doesn't make any sense. Again, the Unicode standard and the encoding of a file are two different things. I don't know how I can make this clearer.

I can't be sure if you're trolling or not. But given that you don't identify yourself by name, I'll refrain from further participation in the discussion.

(23.1) By Scott Robison (sdr) on 2021-11-14 01:12:45 edited from 23.0 in reply to 22 [link] [source]

Note: I'm just replying to the thread, not to any one person. Hopefully it will help.

Back in the Dark Ages(TM) there was Unicode and there was ISO/IEC 10646 (the standard for Universal Coded Character Set). The two groups started off separately but subsequently aligned to not have competing standards.

Since they aligned, ISO/IEC 10646 defines the code points for a vast set of characters representing many languages in a way that is better than the hodge podge of encodings that existed previously.

Unicode, on the other hand, takes that list of code points and defines additional semantic meaning on top of the basic code point value and name that ISO/IEC defines.

If one really believes that no character set can ever be suitable for all purposes, then they ought not be programming, because standardized languages define the character set they work with. Earlier C & C++ standards only supported a basic source character set of ASCII-compatible characters (though not mandating ASCII), but allowed that any encoding could be used as long as a program written in the basic source character set behaved identically to one written in an extended character set, which UCS certainly is.

Yes, Unicode is not exactly the same as the encoding, but ultimately they are the same. Any Unicode or UCS encoding or transformation format is meant to represent a sequence of code points that systems can use to exchange data. There are a ton of them, though UTF-8 & UTF-16 are the most common in use today.

The problem with the bidi markers, as I've read, is not with the programming language; it is with the editors that are used, some of which do the wrong thing when rendering those code points. I'll bet there are lots of programs with buggy handling of Unicode / UCS code points. Should fossil cater to all of them, or should it be a tool used to record / fossilize artifacts into an "immutable" fossil record so that they can be replayed / unfossilized later?

Certainly fossil already tries to accommodate certain file formats by sniffing for binary, utf-8, and utf-16 attributes, so it is not unprecedented that it might do more. I think that is the wrong approach in this case. UTF-8 is an encoding usable by any language, and supporting it is not a judgement call; it is an interoperability feature. Warning about certain code points is making a judgement call. Building bidi detection into fossil so it can warn about potentially harmful effects of certain code points raises concerns over what might be completely legitimate code. It also might lead some to infer that "if fossil doesn't warn about X then X must be okay".

Fossil is a tool. Can you use a screwdriver to drive a nail? Probably. Should you? I don't think so.

I think certain bazaar style projects would benefit from a tool to audit their code looking for potential issues. Much like lint used to do before every compiler started warning about things that are perfectly legal but could be used incorrectly.

Building bidi detection into fossil for warnings is like turning on "treat warnings as errors" on a random compiler and expecting / demanding that projects that use the feature properly stop using it so that their sensibilities will not be offended. Not as extreme, but once this one step is taken, I'm not sure where it ultimately leads.

In reality, I'm not that concerned about it. If it is built into fossil so be it. I think it is not a necessary feature for fossil and it should be handled at a different level, just like I don't think we should build a C compiler into fossil. If anything, allow people to hook into the sync & commit processes to check new artifacts for potentially troublesome features. Then it is handled in a way that doesn't impact people who don't want it, and if more is discovered in the future, it can be added per instance of fossil.

(24) By jamsek on 2021-11-14 02:09:56 in reply to 23.1 [link] [source]

+1

Great post, Scott. And I agree. Hooks can be used to audit code for bidi text by those with projects that enable such problems.
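To make that concrete, here is a rough sketch of the sort of external audit step a project could wire into whatever hook or CI mechanism it uses. It is not Fossil code; the behaviour and output format are invented for illustration: it reports the line of any bidi control character (U+202A..U+202E, U+2066..U+2069) and exits nonzero so the surrounding script can react.

    #include <stdio.h>

    int main(int argc, char **argv){
      int found = 0;
      for(int i=1; i<argc; i++){
        FILE *f = fopen(argv[i], "rb");
        if( f==0 ){ perror(argv[i]); continue; }
        int line = 1, c0, c1, c2;
        while( (c0 = fgetc(f)) != EOF ){
          if( c0=='\n' ){ line++; continue; }
          if( c0!=0xE2 ) continue;
          c1 = fgetc(f); c2 = fgetc(f);
          /* U+202A..U+202E = E2 80 AA..AE, U+2066..U+2069 = E2 81 A6..A9 */
          if( (c1==0x80 && c2>=0xAA && c2<=0xAE)
           || (c1==0x81 && c2>=0xA6 && c2<=0xA9) ){
            printf("%s:%d: bidi control character\n", argv[i], line);
            found = 1;
          }
        }
        fclose(f);
      }
      return found;  /* nonzero exit if anything suspicious was found */
    }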

(28) By ravbc on 2021-11-15 10:56:57 in reply to 24 [link] [source]

Hooks would be great for such auditing purposes, except for:

  1. fossil has no (easy) way to distribute hook scripts (partly because it's multi-platform)
  2. hooks on a central server could not block sync, because that would break syncing permanently
  3. importing bundles, fossil patchsets, or the "merge requests" proposed elsewhere would require testing them before the import, even though the source format is only readable by fossil itself

(29) By Scott Robison (sdr) on 2021-11-15 18:05:43 in reply to 28 [link] [source]

Point 1 is true of all distributed systems.

Point 2 doesn't need to happen. The hook should report that something nefarious might be afoot (if the project is concerned about such things). It could then post an alert to the repo owner, or perhaps it could add a tag to the branch in question, or whatever. The point of the hook is not to forbid, but to provide situational awareness.

Regarding point 3, I think fossil supports functionality to discard an imported bundle in the event it doesn't measure up. I'm not 100% up to speed on bundles.

The reality is that it is impossible to detect and stop every nefarious thing that might ever be done or attempted. Hence the fossil model of providing situational awareness so that bad or misguided things can be moved off the mainline of development and marked "do not use" or some such. Having it be in the fossil record doesn't mean it has to be part of the mainline of development.

(25) By Florian Balmer (florian.balmer) on 2021-11-14 09:12:37 in reply to 23.1 [link] [source]

... "if fossil doesn't warn about X then X must be okay" ...

I agree, and would like to emphasize it's important not to confuse Unicode with UTF-8!

Fossil is able to produce correct diffs of UTF-16 text files, both for the CLI and the Web UI. So the same "Visual Spoofing" attacks are possible here, and looks_like_utf8() is not the right place to do the checks, as for example suggested in this post.

Moreover, Fossil is able to produce correct diffs (just checked it!) for GB18030 (which also covers the entire Unicode range) text files in the CLI, but web browsers will display U+FFFD REPLACEMENT CHARACTER for code points (or sequences) invalid in the encoding specified by the Web UI HTML pages (UTF-8). So "Visual Spoofing" attacks are even easier here. Would Fossil also check GB18030 text files?

Half-hearted solutions only give a false feeling of security, and therefore I think such checks should be done in makefiles or commit hooks that are aware of the encodings used by the specific project's files. Aren't these "Project Policies", as frequently quoted here in the forum?

(10) By anonymous on 2021-11-10 20:03:37 in reply to 1 [link] [source]

For one thing, not all projects will be using Unicode even if some do. For example, some projects will be DOS programs, and a PC character encoding will more likely be used for those; a PC encoding might also be used for programs that aren't DOS programs (but this is less likely). Even if Fossil itself runs on UNIX, the project being made might be for any operating system.

Furthermore, the Han unification of Unicode is not always beneficial, in case you want to write in Japanese, Chinese, etc.

The Trojan Source Attack (which I had known about for years, actually) is just one of the issues, not all of them.

Of course some projects will use Unicode, but some won't. However, text direction overrides, homoglyphs, etc. are, I think, a possible reason to want to disable Unicode mode for projects that work in plain ASCII (which many do, including most of my own), simply for extra security. Of course, as many other people have said, it doesn't really provide much real security, but it is one step.

(26) By John Rouillard (rouilj) on 2021-11-15 01:07:11 in reply to 10 [link] [source]

A nice intro writeup is at: https://www.trojansource.codes/.

(27) By Scott Robison (sdr) on 2021-11-15 01:22:17 in reply to 26 [link] [source]

Code editors and repository frontends should make bidirectional control characters and mixed-script confusable characters perceptible with visual symbols or warnings.

I think warnings are the wrong approach, but it seems reasonable to me that any "invisible" character could benefit from some alternative markup to make them clear.
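As a rough illustration of that "alternative markup" idea (purely a sketch, not something Fossil does; the token format is invented), a renderer could substitute a visible escape for each bidi control before emitting the text:

    #include <stdio.h>
    #include <string.h>

    /* Print a UTF-8 string, replacing each bidi control code point
    ** (U+202A..U+202E, U+2066..U+2069) with a visible token like "{U+202E}". */
    static void print_with_visible_bidi(const char *z){
      size_t i, n = strlen(z);
      for(i=0; i<n; i++){
        unsigned char c0 = (unsigned char)z[i];
        if( c0==0xE2 && i+2<n ){
          unsigned char c1 = (unsigned char)z[i+1];
          unsigned char c2 = (unsigned char)z[i+2];
          if( (c1==0x80 && c2>=0xAA && c2<=0xAE)
           || (c1==0x81 && c2>=0xA6 && c2<=0xA9) ){
            unsigned int cp = ((c0&0x0F)<<12) | ((c1&0x3F)<<6) | (c2&0x3F);
            printf("{U+%04X}", cp);
            i += 2;  /* skip the continuation bytes just consumed */
            continue;
          }
        }
        putchar(c0);
      }
    }

In an HTML context the token would of course need styling and escaping, but the principle is the same: make the character perceptible instead of silently honouring it.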