Fossil Forum

Different criteria for "binary" files in merge and in diff?

(1) By Marcelo Huerta (richieadler) on 2023-01-03 20:59:35 [link] [source]

I'm facing a problem.

I'm handling changes in a repository where, among other things, I'm saving dependencies in TOML files. In spite of being clearly text, they are interpreted as binary due to the length of some of the lines, which contain a list of dependencies in a single line, exceeding 14000 characters.

When I try to merge the affected files, I get the message

***** Cannot merge binary file poetry.lock

but if I run a diff against the branch containing the changed file, it shows the differences perfectly.

Is there a way I can perform the desired merge? Why is the logic different?

(2) By Warren Young (wyoung) on 2023-01-03 21:51:33 in reply to 1 [link] [source]

a list of dependencies in a single line

Is this a format you designed yourself, or one generated by a tool you cannot change? TOML has arrays, and it's perfectly legal to put newlines between elements.

Maybe there's a TOML linter/beautifier out there that will do this for you, allowing you to preprocess your way to better diffs.
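Before reaching for a formatter, it can help to see which lines are the offenders. The following is a minimal sketch, not anything Fossil ships: the function name `long_lines` and the default limit of 8191 bytes are assumptions made for illustration, and the real threshold lives in Fossil's source.

```python
import tempfile
from pathlib import Path


def long_lines(path, limit=8191):
    """Return (line_number, length_in_bytes) for each line longer than limit.

    The default limit is a placeholder for illustration, not a value
    taken from Fossil.
    """
    offenders = []
    with open(path, "rb") as fh:
        for number, raw in enumerate(fh, start=1):
            # Measure the line without its trailing newline characters.
            length = len(raw.rstrip(b"\r\n"))
            if length > limit:
                offenders.append((number, length))
    return offenders


if __name__ == "__main__":
    # Build a small stand-in file; a real run would point at poetry.lock.
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "sample.lock"
        sample.write_bytes(b"short line\n" + b"x" * 14000 + b"\n")
        for number, length in long_lines(sample):
            print(f"line {number}: {length} bytes")
```

Run against the actual lock file, this would show whether a handful of array lines are responsible or whether the whole file is affected.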

Why is the logic different?

Because delta compression is speculative, and diff retrospective.

How much space should the delta compression algorithm reserve up front to store the diff chunks?

Contrast diff, where it's got the data in premasticated form and now must merely present it.

(3.3) Originally by Marcelo Huerta (richieadler) with edits by Warren Young (wyoung) on 2023-01-04 12:43:58 from 3.2 in reply to 2 [link] [source]

Is this a format you designed yourself, or one generated by a tool you cannot change?

The second one. It's the lock file generated by poetry, and it would be a bother to have to rearrange it manually (or with a custom tool) just to appease the SCM.

And in any case it doesn't solve my problem with the already committed one, as Fossil already considers it binary and I cannot merge the new changes.

(4) By Martin Gagnon (mgagnon) on 2023-01-04 12:43:07 in reply to 3.1 [link] [source]

I'm not familiar with poetry. Are you talking about this lock file?

(6) By Daniel Dumitriu (danield) on 2023-01-04 14:24:30 in reply to 4 [link] [source]

Yes, he is. I don't know if that qualifies as a misfeature or a bug, but please do fill out a ticket, ehmm issue, with the folks there - or else I might do it. It's just sickening to have text files with 14,000-character-long lines for no good reason (apart from saving a few bytes, let's say 140) - are there any, generally speaking?

(5) By Warren Young (wyoung) on 2023-01-04 12:44:08 in reply to 3.3 [link] [source]

Does this help?

If not, the motivation behind the question stands: you should try to find a way to regenerate this file from data elsewhere, not commit it.

(8.1) By Marcelo Huerta (richieadler) on 2023-01-05 01:54:42 edited from 8.0 in reply to 5 [link] [source]

The recommendation from the creators of the tool is exactly the opposite.

(10) By Warren Young (wyoung) on 2023-01-05 09:13:38 in reply to 8.1 [link] [source]

Then you have grounds for filing an issue about their file format not being VC-friendly.

(11.1) By Marcelo Huerta (richieadler) on 2023-01-05 22:10:13 edited from 11.0 in reply to 10 [link] [source]

I make a point of not exposing myself to situations where I'm certain that the result would be public ridicule.

Thank you for your comments, anyway.

(7.2) By Stephan Beal (stephan) on 2023-01-04 14:36:10 edited from 7.1 in reply to 1 [source]

Why is the logic different?

Both diff and merge are line-oriented, making them generally unhelpful for files where lines are thousands of bytes long. Similarly, wiki page diff views are unhelpful when each paragraph of text is stored as a single line (which is almost always the case when using a web-based text entry widget).

Edit: apologies, though the above is true, it does not answer the original question of why diff and merge have different criteria. No idea why that's the case.
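To illustrate the line-oriented point, here is a small sketch using Python's standard `difflib` rather than Fossil's own diff code, so it only shows the general behaviour of line-based diffs. The sample data (3000 numbers on one line) is made up for the example.

```python
import difflib

# One line holding 3000 space-separated numbers, standing in for a
# long single-line array in a lock file.
old = " ".join(str(n) for n in range(3000)) + "\n"

# Change a single token somewhere in the middle of that line.
new = old.replace(" 1500 ", " 1500-changed ", 1)

diff = list(difflib.unified_diff([old], [new], "old", "new"))

# Skip the "---"/"+++" file headers and keep only the changed lines.
removed = [l for l in diff if l.startswith("-") and not l.startswith("---")]
added = [l for l in diff if l.startswith("+") and not l.startswith("+++")]

# A one-token edit still shows up as the entire line removed and re-added.
print(len(removed), "line removed,", len(added), "line added")
print("size of the added line in characters:", len(added[0]))
```

The diff is technically correct, but a reader has to hunt through thousands of characters to find the one token that changed.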

(9) By mark on 2023-01-05 06:41:49 in reply to 1 [link] [source]

Assuming I've mocked the right simulation, I can't reproduce this. There may be a missing ingredient such as an accompanying merge conflict. The following demonstrates a merge involving a file exceeding the binary threshold (irrelevant output removed):

17:10 ✔ ~/tmp » fossil init long.fossil
17:10 ✔ ~/tmp » mkdir ll; cd ll
17:10 ✔ ~/tmp/ll » fossil open ../long.fossil
17:10 ✔ ~/tmp/ll » jot -s " " 8192 0 16384 > long
17:11 ✔ ~/tmp/ll » wc long
       1    8192   43597 long
17:11 ✔ ~/tmp/ll » fossil add long
17:11 ✔ ~/tmp/ll » fossil ci -m "long-lined file"
./long contains long lines. Use --no-warnings or the "binary-glob" setting to disable this warning.
Commit anyhow (a=all/y/N)? y
New_Version: 10ebcbf3dafe78e5c75d4c41c2f082cb71a9e32d422a79c7fbb9b7f241a902d6
17:11 ✔ ~/tmp/ll » sed -i 's/$/ extend an already ridiculously long line/' long
17:12 ✔ ~/tmp/ll » wc long
       1    8198   43638 long
17:12 ✔ ~/tmp/ll » fossil ci --branch tmp/long -m "make long line longer on branch"
./long contains long lines. Use --no-warnings or the "binary-glob" setting to disable this warning.
Commit anyhow (a=all/y/N)? y
New_Version: 3fea9e68993e0a22b456a92b3b3cebe65708f808d830f52142134da91e5766ae
17:13 ✔ ~/tmp/ll » fossil up trunk
UPDATE long
-------------------------------------------------------------------------------
updated-from: 3fea9e68993e0a22b456a92b3b3cebe65708f808 2023-01-05 06:13:19 UTC
updated-to:   10ebcbf3dafe78e5c75d4c41c2f082cb71a9e32d 2023-01-05 06:11:24 UTC
tags:         trunk
comment:      long-lined file (user: mark)
changes:      1 file modified.
 "fossil undo" is available to undo changes to the working checkout.
17:13 ✔ ~/tmp/ll » fossil merge tmp/long
UPDATE long
 "fossil undo" is available to undo changes to the working checkout.
17:13 ✔ ~/tmp/ll » fossil commit -m "merge tmp/long"
./long contains long lines. Use --no-warnings or the "binary-glob" setting to disable this warning.
Commit anyhow (a=all/y/N)? y
New_Version: b642d6cba556b2f9e109f8164e52cf13abad840c5bb909bbb0a2ad9a50a256f0
17:14 ✔ ~/tmp/ll » fossil timeline
=== 2023-01-05 ===
06:14:04 [b642d6cba5] *MERGE* *CURRENT*
         merge tmp/long (user: mark tags: trunk)
06:13:19 [3fea9e6899]
         make long line longer on branch (user: mark tags: tmp/long)
06:11:24 [10ebcbf3da] *BRANCH*
         long-lined file (user: mark tags: trunk)
06:10:44 [0098031801]
         initial empty check-in (user: mark tags: trunk)
+++ no more data (4) +++

If the diff is actually displayed (i.e., there is no "cannot compute difference between binary files" report), I'd expect the merge to work as the merge operation performs a 3-way merge--a wrapper around blob_merge(), which calls the diff routine on the branches being merged against a common ancestor. As such, the same binary heuristic the diff routine follows also applies to merge.

In any event, the below patch might help work around the fatal error so you can perform the desired merge:

Index: src/diff.c
=======================================================================
hash - 76a259aef3686702517f73515a202fe89fb469df435269a437625c6fd22ec618
hash + ed8580e59e6983b7f167aaeef9f0588572bf9b6fb4714db9d6d134e93f380332
--- src/diff.c
+++ src/diff.c
@@ -261,7 +261,7 @@ static DLine *break_into_lines(
     zNL = strchr(z,'\n');
     if( zNL==0 ) zNL = z+n;
     nn = (int)(zNL - z);
-    if( nn>LENGTH_MASK ){
+    if( memchr(z, '\0', nn) ){
       fossil_free(a);
       return 0;
     }
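As a rough illustration of what the patch changes, the sketch below contrasts the two heuristics in Python. It is not a translation of Fossil's break_into_lines(): the function names are hypothetical, and the limit of 8191 is a placeholder, since the actual LENGTH_MASK value is defined in Fossil's src/diff.c.

```python
def looks_binary_by_length(data: bytes, limit: int) -> bool:
    """Mimic the original check: any line longer than `limit` bytes counts as binary."""
    return any(len(line) > limit for line in data.split(b"\n"))


def looks_binary_by_nul(data: bytes) -> bool:
    """Mimic the patched check: a NUL byte anywhere in the content counts as binary."""
    return b"\0" in data


LIMIT = 8191  # placeholder for illustration; see LENGTH_MASK in Fossil's src/diff.c

# Plain text with one very long line, similar in spirit to the lock file.
text_with_long_line = b"name = 'x'\n" + b"a" * 14000 + b"\n"

# The first bytes of a PNG file: short lines, but NUL bytes are present.
real_binary = b"\x89PNG\r\n\x1a\n\0\0\0\rIHDR"

print(looks_binary_by_length(text_with_long_line, LIMIT))  # True
print(looks_binary_by_nul(text_with_long_line))            # False
print(looks_binary_by_length(real_binary, LIMIT))          # False
print(looks_binary_by_nul(real_binary))                    # True
```

Under the length-based check the long-lined text is rejected while the PNG header passes; the NUL-based check reverses both verdicts, which is the effect the patch is after.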