Full UTF-8 Filename Supported Partially (Emojis)

(1) By jakedn on 2025-04-04 20:25:38 [link] [source]

Hello fossil World,

I just started using fossil the other day, and was excited to migrate my git repos to fossils.

The migrations passed all the tests I could come up with and then I ran into an issue; when adding a file with an emoji (utf-8 emoji) I get an Invalid Filename error.

The interesting thing is, my migration from git (with a repo containing emoji filenames) succeeded. As far as I can tell, edits and adds to existing files (with emoji) that came from the migration work flawlessly. To reproduce:

1. Make a git (oh man) repo, add a file named '👋hello-world', and commit.
2. Migrate to fossil: git fast-export --all | fossil import --git new-repo.fossil
3. Open the fossil into a new directory and edit the contents of '👋hello-world'.
4. Run 'fossil add .' && 'fossil commit -m "commit changed file (bad filename)"' <- this succeeds

I found a comment that states unicode chars above U+FFFF are not supported because they are not supported on windows-xp src, but I am not certain that is what is causing the error. Is there a setting I missed?
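For illustration, a restriction like the one that comment describes can be written as a check on UTF-8 lead bytes: every code point above U+FFFF is encoded as a four-byte sequence whose first byte is 0xF0 or higher. The following is only a minimal sketch under that assumption; `name_has_non_bmp` is a hypothetical helper name, not a function taken from the Fossil source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the Fossil source): return 1 if the
** NUL-terminated UTF-8 string zName contains a code point above
** U+FFFF, else 0. In UTF-8, such code points always begin with a
** lead byte of 0xF0 or higher. */
int name_has_non_bmp(const char *zName){
  const unsigned char *z = (const unsigned char*)zName;
  assert( zName!=NULL );
  for(; *z; z++){
    if( *z>=0xF0 ) return 1;
  }
  return 0;
}
```

With this check, '👋hello-world' (which starts with the bytes F0 9F 91 8B) is flagged, while a name containing only BMP characters such as '€.txt' is not.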

The way I see it (I am always open to the idea I am wrong) we have 2 options:

1. We can consider the fact that the migration works without any warnings about invalid filenames a bug.
2. We can consider the lack of support for new filenames with an emoji a bug, because I can just migrate to git, add, and then migrate back with no errors🤔.

I want to say I appreciate having access to fossil, and it is an incredibly interesting approach to SCM and in theory is my preferred SCM (I really love the idea of a single SQLite file repository). Unfortunately, I am not an experienced developer, so I couldn't find a solution myself.

I understand the idea of working cross-platform and not wanting to introduce breaking changes to older platforms. Within the last year I started using utf-8 emoji characters more and more, as all the modern systems I use support them (zfs, macOS, windows, syncing solutions etc). In this regard, consider point 1 above: I can still create a fossil (from a git repo) and work with it with no warnings, not noticing I have a bad filename. A month goes by, I send it to my friend who for legacy reasons uses windows-xp, he gets unsupported filenames, and I have no idea what happened. (I didn't test opening this fossil on windows-xp to see what happens.)

I would love to hear other people's thoughts about this.

(2) By Mike Swanson (chungy) on 2025-04-04 22:37:23 in reply to 1 [link] [source]

I found a comment that states unicode chars above U+FFFF are not supported because they are not supported on windows-xp src

Hmm. I'm not certain that comment is really true; NTFS and VFAT use UCS-2 for file names since the NT 3.x days (almost the whole OS uses UCS-2), and UTF-16 was designed specifically as a UCS-2 subset. It could probably assume UTF-16 on Windows and be fine. Microsoft does this, especially in versions starting with Vista: it assumes that strings are UTF-16 as long as they decode as UTF-16.

(3) By Florian Balmer (florian.balmer) on 2025-04-05 05:55:27 in reply to 1 [source]

Just tested the little program below on Windows XP, and it works.

But I think Fossil's file name restrictions are not only to support Windows XP, but for best compatibility with other operating systems lacking (full) Unicode support.

#include <windows.h>
#include <stdio.h>
#include <conio.h>  /* _getch() */
int main(void){
  /* One non-BMP character; the compiler stores it as a surrogate pair. */
  LPCWSTR zwFileName = L"\U0001F638"; /* U+1F638 Smiling Cat 😸 */
  HANDLE hFile;
  hFile = CreateFileW(
    zwFileName,GENERIC_READ,0,NULL,CREATE_ALWAYS,
    FILE_ATTRIBUTE_NORMAL,NULL);
  /* %S prints a wide string from narrow printf() (Microsoft CRT). */
  if( hFile==INVALID_HANDLE_VALUE ){
    printf("ERROR creating file: %S (0x%08lX).\n",zwFileName,GetLastError());
    return 1;
  }else{
    CloseHandle(hFile);
    printf("SUCCESS creating file: %S.\nPress any key to delete.\n",zwFileName);
    _getch();
    DeleteFileW(zwFileName);
  }
  return 0;
}

(4) By Florian Balmer (florian.balmer) on 2025-04-05 05:55:29 in reply to 2 [link] [source]

... UTF-16 was designed specifically as a UCS-2 subset ...

The other way round: UCS-2 once used to be what was known as "Unicode" and was restricted to the BMP. Then UTF-16 was invented as a superset of "Unicode" (which was then renamed to UCS-2) with support for characters outside the BMP (through surrogate pairs).
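The surrogate-pair mechanism mentioned here is simple arithmetic. As a minimal sketch (the function name `utf16_encode` is made up for this example), a code point outside the BMP has 0x10000 subtracted from it, and the remaining 20 bits are split into a high and a low surrogate:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode code point cp as UTF-16 into out[]. Returns the number of
** 16-bit code units written (1 or 2), or 0 if cp cannot be encoded
** (a surrogate value, or a value above U+10FFFF). */
int utf16_encode(uint32_t cp, uint16_t out[2]){
  assert( out!=NULL );
  if( cp>=0xD800 && cp<=0xDFFF ) return 0;
  if( cp>0x10FFFF ) return 0;
  if( cp<0x10000 ){
    out[0] = (uint16_t)cp;      /* BMP: one unit, same as UCS-2 */
    return 1;
  }
  cp -= 0x10000;                /* 20 bits remain */
  out[0] = (uint16_t)(0xD800 + (cp>>10));    /* high surrogate */
  out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* low surrogate */
  return 2;
}
```

For U+1F638, the character used in the test program, this yields the pair 0xD83D 0xDE38.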

I always believed NTFS accepts arbitrary binary sequences for file names, but when playing around with the program posted above, some invalid sequences seem to be rejected.

Still, Windows XP is aware of surrogate pairs and supports non-BMP characters, though it probably displays them as "I □ Unicode" squares.

(5) By Mike Swanson (chungy) on 2025-04-05 06:28:25 in reply to 4 [link] [source]

I'll maintain my original statement. UTF-16 is always valid UCS-2, but the other way around is not true; hence, UTF-16 is a subset of UCS-2.

When Unicode decided to expand from the original "16-bits ought to be enough" assumption, a bunch of code points that were previously in the private use area of the basic multilingual plane were re-assigned to be used as high and low surrogate pairs, such that most UCS-2 documents would become UTF-16 compliant documents for free. However, if documents used those private use characters in orders that did not comply with UTF-16's surrogate pair system, they wouldn't be UTF-16 documents.
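The ordering rule can be made concrete with a small check. This is a sketch only (`utf16_is_well_formed` is a name invented for the example): a sequence of 16-bit units is well-formed UTF-16 when every high surrogate is immediately followed by a low surrogate and no low surrogate appears on its own. Any sequence that fails the check is still a legal run of 16-bit units, just not UTF-16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return 1 if the n 16-bit units in a[] form well-formed UTF-16:
** every high surrogate (0xD800..0xDBFF) is immediately followed by a
** low surrogate (0xDC00..0xDFFF), and no low surrogate stands alone. */
int utf16_is_well_formed(const uint16_t *a, size_t n){
  size_t i;
  assert( a!=NULL || n==0 );
  for(i=0; i<n; i++){
    if( a[i]>=0xD800 && a[i]<=0xDBFF ){
      if( i+1>=n || a[i+1]<0xDC00 || a[i+1]>0xDFFF ) return 0;
      i++;  /* skip the low surrogate that completes the pair */
    }else if( a[i]>=0xDC00 && a[i]<=0xDFFF ){
      return 0;  /* low surrogate without a preceding high one */
    }
  }
  return 1;
}
```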

Windows NT was designed at an "inconvenient" time where Unicode believed that "16-bits ought to be enough" and the entire system is permeated with a two-bytes-per-character assumption. UTF-16 being a subset of that original design allows the system to pretend things are UTF-16, at least until that assumption is broken. Of course, for backwards compatibility reasons, this situation can never change. (Java and JavaScript also both suffer similar problems for their age of conception and backwards compatibility.)
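One place where the two-bytes-per-character assumption breaks is counting. In the sketch below (the helper name `utf16_code_points` is invented for illustration), software that equates "characters" with 16-bit units reports n, while a UTF-16-aware count treats each surrogate pair as a single code point:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the code points in the n 16-bit units of a[], treating a
** high surrogate followed by a low surrogate as one code point.
** Unpaired surrogates are counted as one code point each. */
size_t utf16_code_points(const uint16_t *a, size_t n){
  size_t i, count = 0;
  assert( a!=NULL || n==0 );
  for(i=0; i<n; i++){
    /* A low surrogate directly after a high surrogate belongs to
    ** the code point that was already counted. */
    if( i>0 && a[i]>=0xDC00 && a[i]<=0xDFFF
     && a[i-1]>=0xD800 && a[i-1]<=0xDBFF ){
      continue;
    }
    count++;
  }
  return count;
}
```

A file name consisting of a single emoji such as U+1F638 therefore has two units but only one code point.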

At any rate, file names on NTFS, VFAT, and exFAT all maintain the two-bytes-per-character design, which is UCS-2. Basically any multiple-of-2 sequence is valid, minus a few reserved characters (on NTFS, that's technically just U+0000 and U+002F, but the Win32 subsystem has additional restrictions on top of that...)

I believe Fossil stores names in UTF-8, and it can be trivially converted to UTF-16 as-needed on Windows. Given the age of Windows XP, it probably wouldn't break the OS, but neither do I consider it an important target anymore.

(6) By Stephan Beal (stephan) on 2025-04-05 08:43:09 in reply to 5 [link] [source]

I believe Fossil stores names in UTF-8, and it can be trivially converted to UTF-16 as-needed on Windows.

Fossil prefers UTF-8 for everything. A depressing amount of code in fossil is dedicated to doing UTF-16 conversion for Windows APIs.

(7) By Florian Balmer (florian.balmer) on 2025-04-06 06:18:41 in reply to 6 [link] [source]

You mean the two functions fossil_utf8_to_path() and fossil_path_to_utf8()? ;-)

But ... you are ... not ... planning ... to remove them, are you? ;-)

Joking aside, I believe the conversion UTF-32 ↔ UTF-8 ↔ UTF-16 (the first one also happens on *nix operating systems, all the time) is not a relevant slow-down. Also, from my measurements when working on the comment formatter to add new (heavy) checks to avoid splitting UTF-8 sequences, the impact was in the low, single-digit percent ranges.

Not trying to blame Cygwin, but just pointing out an interesting fact: the biggest part of the UTF-8 ↔ UTF-16 code in Fossil is Cygwin-specific, to support *nix-only characters in file names, by replacing (and restoring) them with private-use area code points.
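The private-use trick can be sketched roughly as follows. This is not Fossil's code: the function names are invented, and the exact set of replaced characters is an assumption based on the characters Win32 rejects in file names. Each such character c is stored as 0xF000 + c and mapped back when the name is read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return 1 if c is an ASCII character that Win32 file names reject.
** (Assumed set for this sketch; path separators are left alone.) */
static int is_win32_reserved(uint16_t c){
  if( c>0 && c<0x20 ) return 1;   /* ASCII control characters */
  switch( c ){
    case '"': case '*': case ':': case '<':
    case '>': case '?': case '|':
      return 1;
    default:
      return 0;
  }
}

/* Replace reserved characters in the n-unit UTF-16 name a[] with
** private-use code points (0xF000 + c), in place. */
void pua_escape(uint16_t *a, size_t n){
  size_t i;
  assert( a!=NULL || n==0 );
  for(i=0; i<n; i++){
    if( is_win32_reserved(a[i]) ) a[i] = (uint16_t)(a[i] + 0xF000);
  }
}

/* Inverse of pua_escape(): map 0xF000 + c back to c. */
void pua_unescape(uint16_t *a, size_t n){
  size_t i;
  assert( a!=NULL || n==0 );
  for(i=0; i<n; i++){
    if( a[i]>0xF000 && a[i]<0xF080
     && is_win32_reserved((uint16_t)(a[i] - 0xF000)) ){
      a[i] = (uint16_t)(a[i] - 0xF000);
    }
  }
}
```

One limitation of this scheme is that a name which genuinely contains one of those private-use code points is indistinguishable from an escaped name and will be changed by the reverse mapping.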

I'm a big admirer of UTF-8 and its inventors (I think they're almost as famous as D. Richard Hipp). Nevertheless I think programming with UTF-16 is way easier than with UTF-8.

(8) By Florian Balmer (florian.balmer) on 2025-04-06 06:18:42 in reply to 5 [link] [source]

So the nitty-gritty question is whether the code units later used for the UTF-16 surrogates really were in a private-use area in Unicode 1.1 (1993) / Unicode 2.0 (1996), or if they were unassigned, and hence never valid in UCS-2. I believe the latter applies, but my Internet-archeology skills are insufficient to provide citations this morning, so this remains speculation on my part.

(9) By Mike Swanson (chungy) on 2025-04-06 09:30:14 in reply to 8 [link] [source]

Looks like the latter, the surrogates were taken out of a reserved zone: https://www.unicode.org/versions/Unicode1.1.0/ch02.pdf

It doesn't really make a difference to the discussion, though. Especially in the context of Windows file APIs: they were designed in 1993 for sequences of 16-bit characters, the original assumption being that it would cover all of Unicode (and it did, until three years later). UTF-16 works in this same space (it was designed to, for compatibility purposes, after all), but old software won't necessarily render text correctly if it doesn't treat it as UTF-16.

In terms of having the full gamut of Unicode in Fossil-managed file names, it's probably worth making old software behave a little wonky. Emoji may be silly (besides that they are in a supplementary plane), but there's real language scripts being put into SMPs too.

Git, by comparison, is "dumb" in storage and just treats everything as sequences of 8-bit bytes, like most Unix file systems. That UTF-8 is a de facto encoding standard is convenient, and probably a good enough assumption to make in this day and age, but I might wonder what happens to a Fossil conversion if a Git tree is populated with non-UTF-8-compliant names first. At least that scenario hasn't seemed to spring up yet.
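To make the non-UTF-8 scenario concrete: a converter could test each name's raw bytes before accepting it. The sketch below is an assumption about how such a test might look, not a description of what Fossil's importer does, and `utf8_is_valid` is a hypothetical name. It rejects truncated sequences, overlong forms, encoded surrogates, and values above U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if the n bytes at z are well-formed UTF-8, else 0. */
int utf8_is_valid(const unsigned char *z, size_t n){
  size_t i = 0;
  assert( z!=NULL || n==0 );
  while( i<n ){
    unsigned char c = z[i];
    size_t nCont, k;       /* continuation bytes expected; loop index */
    unsigned long cp, min; /* decoded value; smallest non-overlong value */
    if( c<0x80 ){ i++; continue; }   /* ASCII */
    if( (c&0xE0)==0xC0 ){ nCont = 1; cp = c&0x1F; min = 0x80; }
    else if( (c&0xF0)==0xE0 ){ nCont = 2; cp = c&0x0F; min = 0x800; }
    else if( (c&0xF8)==0xF0 ){ nCont = 3; cp = c&0x07; min = 0x10000; }
    else return 0;                   /* stray continuation or bad lead */
    if( n-i<=nCont ) return 0;       /* sequence is truncated */
    for(k=1; k<=nCont; k++){
      if( (z[i+k]&0xC0)!=0x80 ) return 0;
      cp = (cp<<6) | (unsigned long)(z[i+k]&0x3F);
    }
    if( cp<min ) return 0;                     /* overlong form */
    if( cp>0x10FFFF ) return 0;                /* beyond Unicode */
    if( cp>=0xD800 && cp<=0xDFFF ) return 0;   /* encoded surrogate */
    i += nCont+1;
  }
  return 1;
}
```

A name written in Latin-1, for example the bytes of "café" with a single 0xE9 at the end, fails this check, whereas the four UTF-8 bytes of 👋 pass.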

(10) By jakedn on 2025-04-09 21:44:05 in reply to 3 [link] [source]

The fact it works is interesting. Windows XP was before my time (I started compsci in 2017), so I went by the comment quoted in the source code.

Unfortunately, keeping compatibility with operating systems that lack full Unicode support comes at the cost of not being able to use a git export stably (as I described above). I am surprised I even ran into this use case; I use git to version control my notes, and started using emoji for shorter, faster-to-spot names.

I am still planning on moving many of my personal code projects to fossil and if this issue (maybe issue) is solved I would also have my notes controlled by fossil.

(11) By Vadim Goncharov (nuclight) on 2025-06-24 17:15:13 in reply to 9 [link] [source]

Emoji may be silly (besides that they are in a supplementary plane), but there's real language scripts being put into SMPs too.

What are these real languages? I know that ancient dead languages are added (e.g. Phoenician) but can't recall real used ones, and it's doubtful that dead languages will be used for file names and not inside documents...

(12) By Warren Young (wyoung) on 2025-06-25 18:13:19 in reply to 11 [link] [source]

What are these real languages?

There are currently 161 SMPs in the Supplementary Multilingual Plane alone. There are two additional planes after that, primarily dedicated to modern CJK scripts.