UTF-8 BOM and Windows new lines...

(1) By Marcio Gabe (marciogabe) on 2020-08-17 20:21:00 [link] [source]

Hi all...

I've been using Fossil for over a year and it's working great!

There's a couple of things I would like to ask/share/contribute...

I use Fossil for this C#/Visual Studio/Winforms project in a Windows 10 environment, and whenever I have to do a merge in fossil, for most of the files it work's just fine... (those with status of ADDED or UPDATED_BY_MERGE).

The issue usually happens whenever there's a conflict or the file is edited by Fossil while merging...

It seems all my project files created in VS use this UTF-8 BOM at the beginning of the .cs source files. After a merge, when I compare them using "fossil diff" (Configured to use Beyond Compare external diff tool) with the edited file, it shows only with UTF-8 against the UTF-8 BOM checked in the repository Image here. The hex diff looks like the picture in this link Hex diff. (original file checked in fossil on the left and current file in working directory on the right). Apparently Fossil discards that 'EF BB BF' byte order mark at the start of the file when processing the edit in the merge command... It shouldn't be really a problem, but since all my project files are checked in with the BOM and newly created files in Visual Studio come with that I wonder if there's a way to fix Fossil not to drop it in case it's there....
The other thing, related to the same issue is when there's a merge conflict. Fossil not only removes the BOM, but the line endings get mixed up a bit. Looking at the file in Notepad++ it shows that the text Fossil inserts in the conflict area have only the LF character instead of the regular CR/LF pair used by windows: Image here. Again, this is not really that much of a problem since the file needs attention anyway to get the conflict solved, but I must be careful to remove all line endings inserted by Fossil so no single LF character is left... just to keep my source code normalized. It would be great if Fossil could detect what line endings are being used in the file and use the same... or maybe have it as a setting (maybe it already has and I just don't know yet).

Anyway... thanks again... and please let me know if I can contribute somehow... I love Fossil and will continue to use and promote it whenever I can!...

(2) By Stephan Beal (stephan) on 2020-08-17 20:44:25 in reply to 1 [link] [source]

i can't say much to that except to disable addition of BOMs in UTF8. The point of a BOM is to specify byte order, so they're useless, and often problematic, in UTF8, which has no byte ordering issues. The Unicode group recommends against their use in UTF8.

It would be great if Fossil could detect what line endings are being used in the file and use the same... or maybe have it as a setting (maybe it already has and I just don't know yet).

My platforms are all Unixy, so i'm going ny fallible memory here instead of experiencing, but my recollection is that it tries to retain the line endings, with the caveat that when it tries to determine the ending it will use the first format it sees.

(3) By Warren Young (wyoung) on 2020-08-17 20:45:56 in reply to 1 [link] [source]

Fossil inserts in the conflict area have only the LF character instead of the regular CR/LF pair used by windows

Fossil doesn't really keep track of the line ending style of files it manages. It's all binary data to Fossil.

It could do as you wish, either by having some way to mark the files (e.g. by extending the meaning of crlf-glob) or by looking at the file before inserting the merge conflict blocks. I expect this is mostly a matter of not having the Windows native users needed to "fix" all of this, rather than some wish for it to not exist.

Fossil comes out of the Unix/Linux world, so it just blindly chooses LF line endings when doing this, since that works in all of the cases that Fossil was originally designed to serve.

Do note that Visual Studio will quite happily use LF-only files without messing with the line endings. Almost all Windows text editors these days will, in fact, particularly those targeted at software developers.

The only common case I can think of where this doesn't happen is with Notepad.

In my largest Windows-friendly public FOSS project, we have a policy: README type files get CRLF line endings just in case someone hasn't yet mapped *.txt files to something better than Notepad and double-clicks one in Explorer. Everything else is LF-only, since surely even Windows users haven't got *.cpp mapped to Notepad.

let me know if I can contribute somehow...

Funnily enough...

(4) By Marcio Gabe (marciogabe) on 2020-08-17 22:21:21 in reply to 2 [link] [source]

I would agree with you and it would be nice to get rid of the problem altogether... I for one am not in favor of using this BOM at all, but.... there's always a but....

VS by default... and unless one goes and add an extra extension to it or take additional steps while saving any file... (and then in every developer's machine that work on the project) keeps adding this BOM back up... It's been screamed at Microsoft to change this... and unfortunately it seems no one there really cares to change it...

See here and here.

So... I just thought it would be not so bad or hard for Fossil (since so far it's been the only place I've noticed this issue) to just ignore this BOM (as it's already doing) or keep it there if it's there already, even when editing the file during merge time.... (both files being merged and common ancestor already have the BOM... why is Fossil stripping it out while merging?) Because if it takes it out and I end up checking in a source file without the BOM, no problem... then when I save it again in VS, even if unchanged, it puts the BOM back in and suddenly it's seen as a modified file by Fossil, since it's binary different. Because of this I take extra care every time i check in the files to make sure BOM's still there... (I end up HEX editing the file to put those 3 bytes back before checking in, and it's ok, since it only happens when having to merge files and all the hundreds of file already in Fossil have it)

Same goes with the line endings... maybe a simple check in the file before inserting the merged conflicting line could work... so Fossil would just work even for us mere mortal Windows users... (or rather stuck with it) lol. (I know VS works with LF only... but it pops this dialog up every time.... and I don't want to uncheck the box 'Always show this dialog' exactly because I want to keep line endings consistent...

I know I'm probably doing things the hard way.... but then again, Fossil could just work out these things and play nice with whatever's out there... even if VS doesn't.

And sorry if I'm being too picky! Like I said... this is not really a complain... I love Fossil and I'm very grateful for it's existence... I just think these could be some simple fixes that would help not only me, but many other Fossil users to be as well.

(5) By Stéphane Aulery (LkpPo) on 2020-08-17 22:22:50 in reply to 1 [link] [source]

Hello,

Maybe these tools can help you fix the problem in VS:

You post make me discover that Microsoft is pushing UTF-8 actively: Use the UTF-8 code page. Are we finally approaching a single encoding?

When it comes to line breaks, VS recognizes : CRLF, LF, NEL, LS, PS (not CR alone?). So you could use only LF and avoid this mistake.

Regards,

LkpPo

(6) By Marcio Gabe (marciogabe) on 2020-08-17 22:27:28 in reply to 3 [link] [source]

Funnily enough...

Well... I do have a cloned Fossil repository here and even managed to compile it on my Windows box once (a few months back in the timeline)... but I don't know if my C and C++ knowledge is enough already to really make changes that would be acceptable and not break it...

I may try though... :)

(7.1) By Stéphane Aulery (LkpPo) on 2020-08-17 22:37:46 edited from 7.0 in reply to 6 [link] [source]

If you are considering this, maybe even better to contribute, also, a PR on the VS repository to kill the BOM for UTF-8 source files.

(8) By Marcio Gabe (marciogabe) on 2020-08-17 22:42:37 in reply to 5 [link] [source]

Yes... thank you... I would definitely use only LF and not having the BOM on every file... I only wish this was the default VS setting when I started this project I'm working on three years ago... now hundreds of source files later (2,920 files in Fossil, just checked the stats) it's a hassle to change all of them back...

So.. like I said in my other reply, I'm not having any real issue here other than very minor annoyance when having to merge a fork or branch... the way I work here it's not that often even... I'm just trying to point these out, just because I thought it would not be hard to tweak fossil a little bit so that even these little details could just work...

(9) By Stéphane Aulery (LkpPo) on 2020-08-17 23:07:22 in reply to 8 [link] [source]

Don't know if it's hard. That could be a good thing.

(10) By Stephan Beal (stephan) on 2020-08-18 00:55:37 in reply to 4 [link] [source]

So... I just thought it would be not so bad or hard for Fossil (since so far it's been the only place I've noticed this issue) to just ignore this BOM (as it's already doing) or keep it there if it's there already, even when editing the file during merge time....

i feel your pain, but ignoring even a single bit of the input would result in fossil's db having a different hash for the data as what's on the filesystem. That's simply not something our architecture can support.

Same goes with the line endings...

With the same problem of different hashes. i thought, perhaps incorrectly, that fossil had an option to convert the line endings of the on-disk files before making a commit (such that the on-disk files and repo copies match 100%), but it certainly doesn't on merging or similar activities.

And sorry if I'm being too picky!

It's a genuine issue, but it's unfortunately not one which a hash-centric machine like fossil can just hand-wave away. Hashes are immutable identities, and to change a single bit of your data changes its identity for fossil's purposes. Precisely what you give it is what fossil has to store and restore, no exceptions. If what's in the repo is not what's on disk, all bets regarding data integrity fly out the window.

i'm unfortunately (well, actually fortunately) not well-versed enough in the vagaries of Windows to make any further suggestions or speculation about potential solutions, but some of our forum members are, and perhaps one will hop in and enlighten us both.

(11.2) By aitap on 2020-08-18 12:03:19 edited from 11.1 in reply to 10 [link] [source]

With all due respect, I don't agree that it's about ignoring input or making things not match their hashes.

Something in the merge procedure allegedly removes the EB BB BF bytes from the beginning of the file. It doesn't have to do that. Not removing BOM during merging (despite BOM being present being a bad idea) would not break the blockchain consistency promise, because the result of the merge is not even part of the blockchain unless someone commits it later.

(Edit: there used to be a tangent where I thought I found a bug, but I turned out to be wrong about that)

Something in the merge conflict logic is inserting conflict markers followed by LF line endings, where it could in theory use CR LF if someone was to contribute a patch to do that (do-ocracy and all that). Conflict markers are not supposed to be checked in anyway, so there is nothing to change the hash of.

Marcio, here is where conflict markers are added. It should be possible to use the blob interface to look for LF bytes in pPivot and see if they are preceded by CR, then use different line endings for conflict markers if that turns out to be true. (Bonus question: what happens if line endings are different between pPivot, pV1 and pV2?) contains_merge_marker will have to be changed too, since right now it looks for literal conflict markers terminated by LF.

(12) By aitap on 2020-08-18 12:01:22 in reply to 11.1 [link] [source]

Well, yes, BOM is removed by text_diff, which is called by blob_merge, so by the time the merge result is reconstructed from converted blobs, there is no BOM any more. Whether this could be changed to account for admittedly uncommon UTF-8 files with BOM could be up for a discussion.

The missing single-character diff seems to result from mtime of the file not changing when it is edited too quickly by a script. So no bug here, after all. Sorry.

(13) By Stephan Beal (stephan) on 2020-08-18 14:18:57 in reply to 11.2 [source]

With all due respect, I don't agree that it's about ignoring input or making things not match their hashes.

Something in the merge procedure allegedly removes the EB BB BF bytes from the beginning of the file.

In a merge, anything can change and that's not a problem. In a non-merge checkin, or adding a file, however, we can't modify the input unless fossil had a hypothetical "convert newlines?" feature which also modified the input file. (i really thought that it had such a feature, but now i'm not so sure. Since i haven't had to use Windows in many years, i'm thankfully unaffected by all such grief.)

Something in the merge conflict logic is inserting conflict markers followed by LF line endings, where it could in theory use CR LF if someone was to contribute a patch to do that (do-ocracy and all that). Conflict markers are not supposed to be checked in anyway, so there is nothing to change the hash of.

That's a hypothetical problem, but we've literally never had a complaint about the conflict markers using NL delimiters, so there seems little reason to add more code to check for CRNL's for a "problem" nobody has yet noticed and which, as you point out, is irrelevant in practice because nobody checks in conflicted files. (The commit command requires a separate flag to allow it to check in a conflicted file.)

Marcio, here is where conflict markers are added. It should be possible to use the blob interface to look for LF bytes in pPivot and see if they are preceded by CR, then use different line endings for conflict markers if that turns out to be true. (Bonus question: what happens if line endings are different between pPivot, pV1 and pV2?) contains_merge_marker will have to be changed too, since right now it looks for literal conflict markers terminated by LF.

Good questions. Next question? ;)

One of the reasons i love Unix is that it simply doesn't have to deal with that type of can of worms :).

(14) By Marcio Gabe (marciogabe) on 2020-08-18 15:35:55 in reply to 10 [link] [source]

i feel your pain, but ignoring even a single bit of the input would result in fossil's db having a different hash for the data as what's on the file-system. That's simply not something our architecture can support.

And that's not what I'm suggesting... I already have way over 2000 C# source files checked in Fossil and they've all been created with VS and this BOM. So the 3 bytes BOM on every file is already part of Fossil's hashes... and that's perfectly OK, Fossil handle them just fine for 98% of the operations... VS is happy... everybody's happy.

All I'm saying is that in this 2% case, when Fossil edit a file by merge, the resulting file on disk (not yet checked in) is left without the BOM, and if I run a DIFF against the one in the repository it shows not only the merged-in changes, but also that the one on disk (awaiting to be commited) doesn't have the BOM while the one already in Fossil's repository has it. Then, if I'm not careful I may end up check-in this file without the BOM (it's changed anyway, so this appears to be fine). No problem with hashes here... the newly checked-in file will have a new hash without the BOM. So it could be a good thing, right? Fossil is helping me to get rid of all the BOM's already there little by little... the problem is that next time I save this very same file in Visual Studio, even though it doesn't complain about this now missing BOM, it puts it back... so now... let's say I changed one line, decided to change back, saved it, the file should not have been changed right? Well... all of a sudden Fossil will see it as changed, not because the code or the text changed, but because Visual Studio put those 3 bytes back!

I'm not saying Fossil should ignore those 3 bytes while hashing... not at all... I'm just saying it could not have removed those 3 bytes before while doing the merge edit. Then those 3 bytes, as they were part of the hash before, would have continued to be part of it all the way...

With the same problem of different hashes. i thought, perhaps incorrectly, that fossil had an option to convert the line endings of the on-disk files before making a commit (such that the on-disk files and repo copies match 100%), but it certainly doesn't on merging or similar activities.

Like I said, there are plenty of reasons to try tweaking Visual Studio to not use BOM in source files, and I agree... I'm just saying that if fossil simply kept it alone when it did the merging, just as it already does when committing a new file, running diffs, updating, etc. this would not be a problem... Fossil already hashes files with those 3 bytes... and it all works great... Same thing goes with the CR-LF... fossil doesn't touch them, hashes them fine, commits and restore them bit by bit perfectly... not an issue with hashes... The only inconsistency is when Fossil has to do a merge... this is what I described in detail already... Fossil strips the BOM in the resulting file on disk and inserts the conflict markings using LF only in files that are all CR-LF...

Again... I would change all my source files... or even if not, I can live with Fossil the way it is right now... like I said, I'm a very happy Fossil user... this is not really painful as I don't do merges this often... I only started this thread in hopes that IF it was something feasible for Fossil to tweak the merging command a bit it could be an improvement not only for me but for many others Windows developers stuck with VS nonsense's, CR-LF's and things like that...

cheers!

(15.1) By aitap on 2020-08-18 16:02:41 edited from 15.0 in reply to 13 [link] [source]

In a non-merge checkin, or adding a file, however, we can't modify the input

I don't think that Marcio wanted Fossil to change files in non-merge checkins. My interpretation of OP was that Marcio would like Fossil to keep UTF8 BOM when merging:

Apparently Fossil discards that 'EF BB BF' byte order mark at the start of the file <...> I wonder if there's a way to fix Fossil not to drop it

I don't know whether it is by design, though. The code that drops the BOM has been originally added by drh to cope with UTF-16 documents.

hypothetical "convert newlines?" feature

It seems to exist, after all:

$ /bin/echo -e 'hello\r' > file.txt
$ fossil add file.txt
ADDED  file.txt
$ fossil commit -m 'hello'
./file.txt contains CR/LF line endings. Use --no-warnings or the "crlf-glob" setting to disable this warning.
Commit anyhow (a=all/c=convert/y/N)? c
one or more files were converted on your request; please re-test before committing

That's a hypothetical problem, but we've literally never had a complaint about the conflict markers using NL delimiters

My interpretation of OP is that Marcio has a problem with this:

the line endings get mixed up a bit. Looking at the file in Notepad++ it shows that the text Fossil inserts in the conflict area have only the LF character instead of the regular CR/LF pair

Hence my link to where conflict markers are inserted. Marcio, if you are interested in preparing a patch to fix this, feel free to ask more questions.

One of the reasons i love Unix is that it simply doesn't have to deal with that type of can of worms :).

I admit that the Unix experience is pretty nice for me, too, but I happen to be surrounded by Windows users, so I can't help but empathise with their problems.

(16) By Stephan Beal (stephan) on 2020-08-18 16:20:28 in reply to 14 [link] [source]

All I'm saying is that in this 2% case, when Fossil edit a file by merge,

Apologies, i fixated on the wrong issue there. Indeed, fossil "probably should" retain the BOM when merging. Possibly. Not sure. There was probably a conscious decision made there to drop it, but i have no insight as to what the driving factors, other than convenience, might have been.

IIRC, Jan N. did much or most of the initial heavy lifting on improving fossil's UTF8 support. Perhaps he's listening in and can comment on it (or perhaps he can prove my recollection to be incorrect).

(17) By Stephan Beal (stephan) on 2020-08-18 16:22:48 in reply to 15.1 [link] [source]

That's a hypothetical problem, but we've literally never had a complaint about the conflict markers using NL delimiters

My interpretation of OP is that Marcio has a problem with this:

Indeed, a poor choice of words on my part. "Literally never" ⇒ "literally never until this thread."

(18) By graham on 2020-08-18 16:49:49 in reply to 15.1 [link] [source]

I don't know whether it is by design, though. The code that drops the BOM has been originally added by drh to cope with UTF-16 documents.

Could it be related to this SQLite thread to do with BOMs? (I've not checked any code/time-lines; just remembered seeing a thread about removal of BOMs).

(19) By aitap on 2020-08-18 17:06:43 in reply to 18 [link] [source]

Could it be related to this SQLite thread to do with BOMs?

No, Fossil uses its own blob functions to remove the BOM, the code is newer than 2004 (2014), and the rationale is clearer: the idea was to handle UTF-16, which typically has a BOM, and the code needed to present the text in a convenient format, retaining the invariants assumed for ASCII and UTF-8 text.

(20.1) By Marcio Gabe (marciogabe) on 2020-08-18 21:29:03 edited from 20.0 in reply to 15.1 [link] [source]

First of all, let me thank you all for taking the time to consider these minor things... I didn't mean to start WWIII here...

My interpretation of OP was that Marcio would like Fossil to keep UTF8 BOM when merging

You interpreted it correctly... Fossil already keeps it for all other operations... so it shouldn't hurt to simply keep it in the produced merged file (or put it back if it needs to strip it for the purpose of performing the three way merge)... This way it would be consistent... Thanks for understanding me on this!

Marcio, if you are interested in preparing a patch to fix this, feel free to ask more questions.

I did managed to compile the current trunk version of fossil last night and found three possible files that I could change a bit, and I think I was in the right direction, as you posted today about the merge3.c and the blob.c files.

I used the following command on the x64 Native Tools Command Prompt for VS 2019:

D:\Projects\Fossil\build>nmake /f ../win/Makefile.msc FOSSIL_ENABLE_SSL=1 FOSSIL_BUILD_SSL=1 PERLDIR=C:\Perl64\bin

Microsoft (R) Program Maintenance Utility Version 14.26.28806.0
Copyright (C) Microsoft Corporation.  All rights reserved.

Using 'x64' platform for OpenSSL...
Building zlib from "..\compat\zlib"...
...

This is the result:

D:\Projects\Fossil\build>fossil.exe version -v
This is fossil version 2.13 [a703b4ce25] 2020-08-17 21:19:46 UTC
Compiled on Aug 17 2020 21:22:47 using msc-19.26 (64-bit)
Schema version 2015-01-24
Detected memory page size is 4096 bytes
zlib 1.2.11, loaded 1.2.11
hardened-SHA1 by Marc Stevens and Dan Shumow
SSL (OpenSSL 1.1.1g  21 Apr 2020)
FOSSIL_ENABLE_LEGACY_MV_RM
UNICODE_COMMAND_LINE
FOSSIL_STATIC_BUILD
SQLite 3.33.0 2020-08-14 13:23:32 fca8dc8b57
SQLITE_DEFAULT_FILE_FORMAT=4
SQLITE_DEFAULT_WAL_SYNCHRONOUS=1
SQLITE_ENABLE_DBSTAT_VTAB
SQLITE_ENABLE_FTS4
SQLITE_ENABLE_FTS5
SQLITE_ENABLE_JSON1
SQLITE_ENABLE_LOCKING_STYLE=0
SQLITE_ENABLE_STMTVTAB
SQLITE_LIKE_DOESNT_MATCH_BLOBS
SQLITE_MAX_EXPR_DEPTH=0
SQLITE_OMIT_DECLTYPE
SQLITE_OMIT_DEPRECATED
SQLITE_OMIT_LOAD_EXTENSION
SQLITE_OMIT_PROGRESS_CALLBACK
SQLITE_OMIT_SHARED_CACHE
SQLITE_THREADSAFE=0
SQLITE_USE_ALLOCA

So I guess I could try to learn some C here... I can read the code and for the most part understand what it's doing, even the part where it inserts the conflict markings here... I saw the \n here at the end of the lines... maybe I could try to detect if the line about to be inserted in the conflict area ends with \r\n and replace the single \n in the marking being inserted...

(Edit:)

I just read all the posts again and paying more attention to what @aitap had already written about fixing the LF problem around the merge markings I saw it was already just like what I now wrote above...

Bonus question: what happens if line endings are different between pPivot, pV1 and pV2?) contains_merge_marker will have to be changed too, since right now it looks for literal conflict markers terminated by LF.

The bonus question... I would see it like this: since pPivot is a common ancestor, pV1 and pV2 would only probably contain different line endings if having being changed in different environments... since usually windows projects tend to be windows only, there should not be any LF's alone being introduced accidentally. For cross-platform projects, then for sure I would think LF only should be the standard anyway... in this case, pPivot and even pV1 and pV2 should not end up having CR-LF endings. So, maybe it should be good enough to just elect one of them, like pPivot, and just check the last line of the about to be inserted in the conflict section for what line ending was used there.

For the contains_merge_marker function maybe there's no need to really check if the marker is on a line by itself, so one way would be to not check for any line endings at all... they are all full of special characters that should not appear anywhere else on the file anyway. I believe they are being used just to detect if all conflicts have been resolved prior to allow the user to check-in the file... Other way would be to check either for \r or \n whichever comes first in that z[i] there...

(end Edit.)

As for the BOM issue, I also saw some interesting code about it here. Maybe it's not a problem related to anything other than having to strip it out for the 3-way merge change detection code to work, or even just a small oversight here when initializing the about to be merged file. Since Fossil pretty much builds a new file from the other three, it's just inadvertently ignoring the fact that all of them had the BOM... maybe a quick detection using the starts_with_utf8_bom() in the pPivot source should be enough to just add those three bytes around after here.

If I do make it in such a way that it works without breaking anything else I'll be thrilled.

And thanks again @aitap for encouraging me in this direction... Let's see what a newbie in C like me can come up with...

(21) By Larry Brasfield (LarryBrasfield) on 2020-08-18 23:48:45 in reply to 16 [link] [source]

It seems clear to me that what Fossil should do is respect the user's prerogative as to whether a BOM is present or not. It is under their control, after all. Taking that control away is undoubtedly (to me) the wrong answer.

I would suggest that it would be great if Fossil had a mechanism for enforcing local check-in policy. Maybe it would be just a hook or small set of hooks that, if utilized per the administrator's configuration, would allow a proposed check-in to be rejected, auto-modified (in some cases), or simply trigger a warning that some policy is about to violated with an "abort check-in" option.

I think this would be adequate to deal with the BOM presence issue and many more, without adding yet another layer of complexity to Fossil's main job (which it does well in part because it is well-defined and easy to comprehend.)

(22.1) By Stephan Beal (stephan) on 2020-08-19 00:08:56 edited from 22.0 in reply to 21 [link] [source]

I would suggest that it would be great if Fossil had a mechanism for enforcing local check-in policy.

Edit: doh. See Warren's response below this one.

That's come up before, bit the problem is platform portability of the tool, which would need full-fledged programmability in order to enforce arbitrary policies.

"Is there a work order number or scrum task ID in the commit message?"

(That was a requirement in a commercial project about 5 years ago.)

Automating answers to questions like that requires a scripting environment.

That said...

The build tree includes jimtcl already, which is feature-rich tclsh alternative.

Alternately, such projects could require that fossil be built with tcl support, and fossil adds a hook to launch it pre-commit (post is too late). It's still platform-dependent to a degree, but fossil already has a way to hook up tcl and tcl is available in all(?) modern environments.

(23) By Warren Young (wyetr) on 2020-08-19 00:02:17 in reply to 21 [link] [source]

it would be great if Fossil had a mechanism for enforcing local check-in policy

Then Fossil is now great, because we got that feature in the latest release.

Now that we've achieved greatness, I suppose we'd best move on to excellence.

(24.1) By Marcio Gabe (marciogabe) on 2020-08-19 00:45:17 edited from 24.0 in reply to 23 [link] [source]

As per aitap and Warren Young suggestions, i'm already trying one change that might fix the BOM issue... It solved my problem, and imho is quite simple:

Just added this right below this line in merge3.c:

/* If both pV1 and pV2 start with a UTF-8 byte-order-mark (BOM),
** keep it in the output. This should be secure enough not to cause
** unintended changes to the merged file and consistent with what
** users are using in their source files.
*/
if( starts_with_utf8_bom(pV1, 0) && starts_with_utf8_bom(pV2, 0) ){
  blob_append(pOut, get_utf8_bom(0), -1);
}

Please check this image.

This is what I was talking about...

If someone could test it on non windows project or environment to see if it does not break anything else... please consider adding it to trunk...

(25) By Larry Brasfield (LarryBrasfield) on 2020-08-19 00:48:22 in reply to 22.1 [link] [source]

No excuse, but I've been busy enough lately to miss some of these later developments. It's good to see. (Thanks, Warren.)

My point, generalized, is that instead of endless discussion about various tweaks and enhancements to Fossil itself, forum replies may appear resembling "Here is a jimtcl or CLI shell routine which, when hooked, would solve the problem you present."

Is jimtcl in the published builds? (I guess not from "build tree includes ...".)

One of the great joys of using Emacs and its sibs is that extending it is relatively simple, and for those who get lost among a plethora of parentheses (in elisp), many useful extensions can be found. Is there a jimtcl script collection for Fossile somewhere, or examples of extending Fossil that way?

(26) By Stephan Beal (stephan) on 2020-08-19 00:57:03 in reply to 25 [link] [source]

Jim must be in the source distributions because it's used as a fallback for running the configure script if tclsh isn't around. As far as extending fossil that way goes: long story short, fossil is monolithic app to the very core, with no library-like places for scripts to link to, beyond the hooks (which i have not investigated). Any external automation is severely limited compared to the whole wide world of emacs scripting.

Please pardon brevity - on a tablet from bed.

(27) By Warren Young (wyetr) on 2020-08-19 01:40:37 in reply to 26 [link] [source]

I don't think it's too productive to be talking about the copy of Jim Tcl included in the Fossil source tree as part of the Autosetup infrastructure.

First, it's a stripped-down version, not full-strength Jim. It has only the parts needed by Autosetup, plus possibly some things its creators think you might find useful from an auto.def file. If you're writing hooks, there's a good chance you'll want a more featureful interpreter, particularly if you're doing cross-platform work. (This is why it's called jimsh0 when it's built as a fallback for tclsh being unavailable.)

Second, there's currently no way to embed it into Fossil, even with the existing embedded Tcl features. It'd be nice to have full-strength Jim available for both Autosetup and for Fossil, with the option to run hooks with that embedded interpreter, but that's currently just a wishlist item.

Third, it doesn't look like the hooks feature requires any particular interpreter at all. See fossil hook add --command ... I believe that means you can write your hooks in anything your host OS can execute, provided it has the capability to do the necessary I/O.

(28) By John Rouillard (rouilj) on 2020-08-19 02:54:54 in reply to 27 [link] [source]

I thought TH1 was always embedded in fossil. If you chose to use tcl it will replace the th1 (jim tcl) with regular tcl.

Even th1 has enough to do simple hooks like:

does a regular expression exist in some text
slice and dice text
loop over lines in a file and do something with them

so given access to input you could validate a ticket id exists in a checkin comment or other simple text validation.

(29) By Warren Young (wyoung) on 2020-08-19 03:09:01 in reply to 28 [link] [source]

I thought TH1 was always embedded in fossil.

It is. Fossil UI won't run without it.

it will replace the th1 (jim tcl) with regular tcl.

TH1 and Jim aren't the same thing, and the Tcl integration feature doesn't replace TH1 anyway. When you configure with --with-tcl, TH1 remains, it just gets some new commands to call down into "real" Tcl from TH1.

Even th1 has enough to do simple hooks

Critically, TH1 lacks exec(), which greatly limits the typical scope of SCM hook use cases it can cover. For instance, good luck tying it into a static analysis tool. (And that's a purposeful lack, else people could send you a Fossil skin that would amount to a massive security hole!)

Anyway, the current Fossil hooks system doesn't use TH1. To use it, you'd have to do something dirty like fossil test-th-eval SCRIPT.

(30) By Marcio Gabe (marciogabe) on 2020-08-19 04:34:10 in reply to 24.1 [link] [source]

Ok... got the CR/LF solved as well...

All changes for the BOM and CR/LF were done only in the merge3.c file.

Please let me know what is the best way to send these changes for someone willing to test them out and maybe consider adding them in the main source...

The rule I used to solve some of the questions raised, is that both problems (keeping BOM and adding CR/LF instead of LF) are only applied if both files being merged have BOM and CR/LF's already. This should prevent any changes for folks using the standard way (no BOM and LF only) or if only one side of the merge got changed unintentionally. So only weird people like me stuck with this Windows/Visual Studio nonsense will benefit in projects were checked-in files, disk files and merged files are all consistently driving on the wrong side of the road (sorry UK friends!).

Here are some print-outs:

Print part 1.
Print part 2.
Print part 3.
Print part 4.

Code part 1:

/*
** Text of boundary markers for merge conflicts.
*/
static const char *const mergeMarker[] = {
 /*123456789 123456789 123456789 123456789 123456789 123456789 123456789*/
  "<<<<<<< BEGIN MERGE CONFLICT: local copy shown first <<<<<<<<<<<<<<<",
  "||||||| COMMON ANCESTOR content follows ||||||||||||||||||||||||||||",
  "======= MERGED IN content follows ==================================",
  ">>>>>>> END MERGE CONFLICT >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>"
};

/*
** Return true if the input blob contains CR/LF pairs.
*/
int contains_crlf(Blob *p){
  int i;
  const char *z = blob_buffer(p);
  int n = blob_size(p)+1;
  for(i=1; i<n; ){
	if( z[i-1]=='\r' && z[i]=='\n' ) return 1;
	while( i<n && z[i]!='\n' ){ i++; }
  }
  return 0;
}

/*
** Add CR/LF if useCrLf is true otherwise use regular LF.
*/
void add_line_end(Blob *p, int useCrLf){
  if( useCrLf ){
	blob_append(p, "\r\n", 2);
  }else{
	blob_append_char(p, '\n');
  }
}

Code part 2:

  /* If both pV1 and pV2 start with a UTF-8 byte-order-mark (BOM),
  ** keep it in the output. This should be secure enough not to cause
  ** unintended changes to the merged file and consistent with what
  ** users are using in their source files.
  */
  if( starts_with_utf8_bom(pV1, 0) && starts_with_utf8_bom(pV2, 0) ){
	blob_append(pOut, get_utf8_bom(0), -1);
  }

  /* Check once to see if both pV1 and pV2 contains CR/LF endings.
  ** If true, CR/LF pair will be used later to append the
  ** boundary markers for merge conflicts.
  */
  int useCrLf = 0;
  if( contains_crlf(pV1) && contains_crlf(pV2) ){
	useCrLf = 1;
  }

Code part 3:

}else
{
  /* We have found a region where different edits to V1 and V2 overlap.
  ** This is a merge conflict.  Find the size of the conflict, then
  ** output both possible edits separated by distinctive marks.
  */
  int sz = 1;    /* Size of the conflict in lines */
  nConflict++;
  while( !ends_at_CPY(&aC1[i1], sz) || !ends_at_CPY(&aC2[i2], sz) ){
    sz++;
  }
  DEBUG( printf("CONFLICT %d\n", sz); )
  blob_append(pOut, mergeMarker[0], -1);
  add_line_end(pOut, useCrLf);
  i1 = output_one_side(pOut, pV1, aC1, i1, sz);
  blob_append(pOut, mergeMarker[1], -1);
  add_line_end(pOut, useCrLf);
  blob_copy_lines(pOut, pPivot, sz);
  blob_append(pOut, mergeMarker[2], -1);
  add_line_end(pOut, useCrLf);
  i2 = output_one_side(pOut, pV2, aC2, i2, sz);
  blob_append(pOut, mergeMarker[3], -1);
  add_line_end(pOut, useCrLf);

}

Code part 4 (Just a comment change):

/*
** Return true if the input string contains a merge marker at the
** beginning of a line.
*/

(31) By Stephan Beal (stephan) on 2020-08-19 04:42:12 in reply to 30 [link] [source]

Please let me know what is the best way to send these changes for someone willing to test them out and maybe consider adding them in the main source...

We unfortunately cannot accept non-trivial drive-by patches from random internet personages, but we do offer commit access to those who fill out this release form and mail a physical copy to Richard for safe keeping in his fire safe:

https://fossil-scm.org/fossil/doc/trunk/www/copyright-release.html

Post that off to him, then tell us your preferred repo user name, and once he confirms its safe arrival we can get you set up.

(32.1) By Marcio Gabe (marciogabe) on 2020-08-19 21:57:14 edited from 32.0 in reply to 31 [link] [source]

Thanks Stephan, I really appreciate all you guys do and would love to eventually be able to participate and have access to the repository so I could maybe contribute in a better way to this tool I came to love and respect. I'm pretty much a team of one working in this small to medium size C# project... I've always been a windows developer, mostly Delphi... so, different worlds here. Since I discovered fossil about a year and a half ago, it has saved me tons of time with all its features... so I cannot praise Fossil enough...

Here's my main project humble stats page:

Repository Size:	158,662,656 bytes
Number Of Artifacts:	8890 (3,095 fulltext and 5,795 deltas) Details
Uncompressed Artifact Size:	114,731 bytes average, 9,318,696 bytes max, 1,019,849,270 total
Compression Ratio:	6:1
Number Of Check-ins:	874
Number Of Files:	2,920
Number Of Wiki Pages:	14
Number Of Tickets:	51
Duration Of Project:	536 days or approximately 1.47 years.

Now, I understand the fact that I'm just a random internet personage, and I clearly understand all the legal steps that must be taken to keep Fossil safe from any harm. The thing here is that I'm in Brazil and currently I don't think I will be able to physically send a paper to D. Richard, as much as I would like to. I also didn't think that what I was asking was such an impossible and religiously problematic war between operating systems and file formats. The solution I tried and shared above was just plain fun to explore as pure C is not my daily programming language. So it was more of a proof of concept to myself and to the point I was trying to make, that it was a simple thing to fix in Fossil, and that it might help some users.

I want to specially thank aitap for showing genuine interest in understanding my problem and suggesting a path in the right direction to help me fix the problem, as well as the others, for coping with me since I'm not a native English speaker and because of that maybe I couldn't explain properly the problem I was having.

As for the code... again... I don't know if what I did is good or not... All I wanted was for some of you guys, well knowledgeable of Fossil's code and the language, to check it and maybe say it could be something useful, in hopes that this could be eventually integrated... not necessarily the way I wrote it, but in any way, to imho improve even if a tiny bit for some small group of people, the fossil merge command. I just pretty much called functions already available in blob.c and in lookslike.c... so it's nothing that anyone could claim intellectual property of sorts...

I know this is not as important as soo many other great features you guys are working on... and that maybe no one, except me "literally never until this thread." brought this up... but please don't just throw this away because I cannot legally contribute, unless clearly there's real harm to the code or a reason for this feature to be undesireble other than personal beliefs in what a source code file should or should not look like or contain.

Life could be so much simpler... we are the ones who make it complicate.

Cheers! Love to you all

PS. Here's a link to the whole modified merge3.c file.

(33) By Stéphane Aulery (LkpPo) on 2020-08-19 22:09:56 in reply to 32.1 [link] [source]

Hello Marcio,

Do not use the post office.

If you want commit rights and push the code yourself, you just have to:

Print the form
Fill it with a pen
Scan it to a pdf
Send it via email to drh

Not a big deal.

Otherwise ask someone else to do it for you.

Regards,

(34) By Stephan Beal (stephan) on 2020-08-20 02:49:31 in reply to 32.1 [link] [source]

The thing here is that I'm in Brazil and currently I don't think I will be able to physically send a paper to D. Richard, as much as I would like to.

He might be able to accept a fax for such a case. Brazil has a reputation as being a difficult place to reliably get mail into and out of. Get in touch with him and he can possibly provide an alternative.

but please don't just throw this away because I cannot legally contribute

This one is not up to me - i don't use Windows or BOMs and have neither an opinion nor an interest in any given solution ;), but i have no reason to believe that your solution is in any way "undesirable."

There have been cases where someone without a contributor agreement posted such a patch and a developer took the idea for use as a basis in implementing it, without using the code as-is (which would violate the project's policy about requiring contributor agreements). Perhaps that will happen here. i was hoping that Jan N. would drop by, as that's well within his area of expertise.

Life could be so much simpler... we are the ones who make it complicate.

Ain't that the truth!

(35.1) By aitap on 2020-08-21 09:03:01 edited from 35.0 in reply to 32.1 [link] [source]

The thing here is that I'm in Brazil and currently I don't think I will be able to physically send a paper to D. Richard, as much as I would like to.

Sometimes, when a patch is deemed important, but cannot be accepted due to missing contributor agreement, D. Richard Hipp may reimplement it instead.

I haven't had the time to test your merge3.c file, but one minor difference that I think you introduced is that your version may match lines starting with merge markers, while the original only matched lines exactly equal to merge markers. You might need to check for EOL after the successful match and perhaps adjust n accordingly to avoid running past the end of buffer.

Either way, I think that it's a good idea to contact D. Richard Hipp about this patch. (Patches, as produced by fossil diff with files edited in a checkout of the repo, are typically a better way of exchanging changes to software than whole files or manually extracted modified parts.)

(36) By Marcio Gabe (marciogabe) on 2020-08-22 02:18:47 in reply to 35.1 [link] [source]

Thanks aitap. I just sent D. Richard an e-mail with the patch using fossil diff as you mentioned, along with the scanned version of the contributor agreement. I also modified the contains_merge_marker as per your comment, so it properly detects any type of line ending and accounts for the end of the buffer.