binary files that are not really binary

(1) By Alfred M. Szmidt (ams) on 2020-02-21 08:46:15 [link] [source]

Hi,

I'm having some friction with Fossil and how it treats "binary" files.  The files in question do contain NUL characters, but they are not "binary", they are plain source code.  Is there some way to get around this, i.e. to treat them as text?

(2) By Stephan Beal (stephan) on 2020-02-21 08:52:31 in reply to 1 [link] [source]

If they contain NULs then the routines which generate text/html output, e.g. the html diff view or diff console output, won't be able to render them properly.

Fossil's "this is binary" heuristics are baked into it, and are not configurable except to tell it not to warn about committing binary files which match user-defined globs.

(3) By Alfred M. Szmidt (ams) on 2020-02-21 09:30:37 in reply to 2 [link] [source]

Would it be possible to change those routines accordingly?

Since there is already some conversion done for the HTML diff view (i.e., quote escaping), it should be possible to apply something similar for NUL (and maybe other noise - I'm not entirely sure what the criteria are for something to be a binary file).

For the console case -- which I am guessing is what the HTML output is based on -- GNU diff already handles binary files well enough, and I guess similar logic could be applied?  The following works well enough:

  fossil diff --command "diff -au"

GNU diff has always done a very good job in producing sensible output in my case, since the files are "normal" files, with newlines etc.  Just that sometimes, there is a NUL character because the language allows that kind of a thing.

(4) By Stephan Beal (stephan) on 2020-02-21 09:39:22 in reply to 3 [link] [source]

i can't speculate how deep-seated the "NUL expectations" are - it's been a good 7 years since i dug around in those bits. The CLI and HTML diffs are definitely different implementations, as the former is line-based and the latter shows diffs with character-level precision.

The heuristics for "what is binary" also include files with exceptionally long lines, as fossil cannot generate readable diffs when the lines are thousands of bytes long.

(pardon my brevity - am on the phone in a waiting room.)
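
As a rough illustration of the two conditions mentioned above (a NUL byte anywhere, or an exceptionally long line), here is a minimal Python sketch. It is not Fossil's implementation: the function name `looks_binary` and the 8192-byte threshold are assumptions made up for the example.

```python
# Assumed cut-off for an "exceptionally long" line; illustrative only,
# not Fossil's real limit.
MAX_LINE_BYTES = 8192


def looks_binary(data: bytes, max_line_bytes: int = MAX_LINE_BYTES) -> bool:
    """Return True if data has a NUL byte anywhere or an over-long line."""
    # Condition 1: a NUL byte anywhere in the content.
    if b"\x00" in data:
        return True
    # Condition 2: any line (split on LF) longer than the threshold.
    return any(len(line) > max_line_bytes for line in data.split(b"\n"))
```

Under this sketch, a Lisp file with a single literal NUL inside a string is classified as binary even though every other byte is ordinary text.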

(5) By Scott Robison (sdr) on 2020-02-21 09:52:34 in reply to 1 [link] [source]

Can you describe the issue in a bit more detail? What type of source code is it? Is there some alternative representation that is supported?

For example, if the nul characters are embedded in strings that are C language compatible, they could easily be replaced by an equivalent octal, hexadecimal, or Unicode escape sequence which would not present a problem with fossil or arbitrary text editors or other tools.

That is often the type of issue when this question is asked. Using C-style escape sequences would make far more portable source files (if this is a C compatible string), even if the environment in question otherwise allows embedding a literal nul (or other traditionally non-printable character). Other languages might have comparable escape sequences or printable encoding in ASCII, UTF-8, or some other text encoding that would work with fossil.
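
To illustrate the idea in a language other than C (Python here, chosen only for the example), the sketch below writes the same NUL character three ways, mirroring the octal, hexadecimal, and Unicode escapes mentioned above. The constant and function names are hypothetical.

```python
# Each literal below is written using printable ASCII only, yet every
# one of them decodes to the three characters: "a", NUL, "b".
OCTAL = "a\0b"          # octal escape
HEXADECIMAL = "a\x00b"  # hexadecimal escape
UNICODE = "a\u0000b"    # Unicode escape


def has_nul(text: str) -> bool:
    """Return True if the decoded string contains a NUL character."""
    return "\x00" in text
```

The source file itself contains no NUL byte, so a tool that scans the raw bytes would see plain text.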

(6) By Alfred M. Szmidt (ams) on 2020-02-21 10:06:45 in reply to 5 [link] [source]

This is Lisp code, where it is perfectly valid and normal to do things like (since strings are handled literally, and not parsed by the reader in any special manner):

  (insert "<NUL>") ;Where <NUL> is a literal NUL byte.

There are no C-style escape sequences that one can use (other than implementing them as some kind of reader macro, and rewriting a substantial amount of non-trivial code).

(7) By Alfred M. Szmidt (ams) on 2020-02-21 10:13:58 in reply to 6 [link] [source]

To add to that, it is also a valid _escape_ character (<NUL> being a literal NUL, and <US> being the unit separator character -- 037):

  (PRESS-CHAR-SEQ #/<NUL> #/<US>)

And that code cannot be changed trivially, since it would involve changing the syntax of the language.

(8) By Scott Robison (sdr) on 2020-02-21 10:44:58 in reply to 7 [link] [source]

Fair enough. Every time I remember this coming up in the past, it has been based on a C compatible string encoding.

(15) By anonymous on 2020-02-25 21:38:19 in reply to 6 [link] [source]

What about representing #\<NUL> as (code-char 0) ?

(16) By Alfred M. Szmidt (ams) on 2020-02-27 09:59:11 in reply to 15 [link] [source]

They are not the same: #/ is a literal NUL handled at read time, while (code-char 0) is a function call.  You could do #.(code-char 0), but this gets ugly real quick, and why do that when you have a well-supported and easily understood syntax for character literals?

That is ignoring what other invasive changes would be required to get this to work.  I think trying to make the code fit the version control system is definitely the wrong way to handle things; that seems quite backwards.

(17) By anonymous on 2020-02-28 02:44:33 in reply to 16 [link] [source]

Something like this:

  ; define constants for some non-printable ASCII chars
  ; (in case #\ equivalents are not available)
  ; and some sequences of such chars
  (defconstant $/nul (code-char 0) "<NUL> like #\nul")
  (defconstant $/us (code-char 31) "<US> like #\us")
  (defconstant $/nul-us (list $/nul $/us) "<NUL><US> like #\nul #\us")

  ; example use of the defined constants
  ; instead of the explicit non-printable chars
  (print $/nul-us)

Tested with SBCL: https://ideone.com/9WInTP

Some Lisp implementations already have special non-printable ASCII chars defined beyond the standard ones like #\newline, #\tab etc. I guess this is for the exact same reason, so that your source code remains printable and indeed compatible with the multitude of text tools... It must take extra effort to even insert the non-printables into the text, let alone "see" them on-screen.

Well, anyway, I understand your frustration with this issue. I hope you find this helpful as a relatively straightforward work-around for the use-cases you've shown.

(18) By Alfred M. Szmidt (ams) on 2020-02-28 11:11:16 in reply to 17 [source]

Thank you for the suggestion, but you're making the assumption that this is Common Lisp :-) Inserting NUL or other such characters isn't that difficult in Emacs; C-q C-@ will give you a NUL.  Don't forget that NUL is valid as a literal in a string, and in other contexts, e.g.,

  (defun <NUL>some-internal-function ...)

Then you get the other issue that this is valid in other contexts, not just Lisp code!

I took a cursory look over the code base in question, and it might be doable (read: easier than I thought, and less work than getting Fossil to handle binary files as ASCII), so I might just do that instead.

(9) By Richard Hipp (drh) on 2020-02-21 11:45:04 in reply to 1 [link] [source]

What is Fossil doing to these files with embedded 0x00 bytes that is causing problems for you?

(10) By Stephan Beal (stephan) on 2020-02-21 12:27:11 in reply to 9 [link] [source]

It's what fossil's not doing: it won't diff binary files, and "contains NUL bytes" is (quite reasonably) one of Fossil's is-a-binary conditions.

(13) By Alfred M. Szmidt (ams) on 2020-02-23 21:26:17 in reply to 10 [link] [source]

Right, which makes `fossil ui' not so useful for those particular files, and means that one has to pass --command "diff -au" to `fossil diff'.

I know that some programs decide whether a file is binary by looking only at the initial N bytes of the file; if there is a NUL byte there, they treat it as binary.  That seems (to me at least) to be a better heuristic than "has NUL byte anywhere".

Not knowing the Fossil code base at all: would it be complicated to amend this slightly?  It could be protected by a user-settable variable.
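
A minimal Python sketch of the difference between the two heuristics, purely for illustration. The function names and the 8000-byte probe size are assumptions for the example, not Fossil's code or setting.

```python
# Assumed size of the prefix that gets inspected; illustrative only.
PROBE_BYTES = 8000


def looks_binary_prefix(data: bytes, probe_bytes: int = PROBE_BYTES) -> bool:
    """Treat data as binary only if a NUL occurs in the first probe_bytes."""
    return b"\x00" in data[:probe_bytes]


def looks_binary_anywhere(data: bytes) -> bool:
    """Treat data as binary if a NUL occurs anywhere in it."""
    return b"\x00" in data
```

With these definitions, a file whose only NUL sits past the probe window counts as text under the prefix rule and as binary under the anywhere rule. A file with a NUL inside the first `probe_bytes` bytes is still classified as binary by both.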

(14) By anonymous on 2020-02-25 19:08:22 in reply to 13 [link] [source]

By the same token, if the mentioned initial N bytes happen to include the "text" <NUL> that you use in your code, then we're back to square one...

Clearly, the issue here is with a "hybrid" file that is not easy to reliably classify as binary/text in an automated way.

Fossil allows globbing for binary files to avoid the warnings and has an external diff as a workaround.
Seems fair to me.

By the way, the handling code is looks_like_utf8() in src/lookslike.c, invoked either directly or via looks_like_binary(), as in src/diffcmd.c

(11) By Florian Balmer (florian.balmer) on 2020-02-21 13:44:34 in reply to 1 [link] [source]

I have several files that require control characters, or long lines.

My solution is to "generate" such files, either from template files by replacing placeholders with the control characters, or directly by dumping parts of the long lines without line endings from a script. Then only the template files and/or scripts are checked into the repository, giving readable diffs.

Maybe something similar may work for you?
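
A minimal sketch of the template approach described above, in Python. The placeholder tokens (`<NUL>`, `<US>`) and the function names are hypothetical choices for the example; the idea is only that the printable template is what gets checked in, and the file with real control characters is generated from it.

```python
# Hypothetical placeholder tokens mapped to the control characters they
# stand for. Any printable token that never occurs in real source works.
PLACEHOLDERS = {
    "<NUL>": "\x00",  # NUL, code 0
    "<US>": "\x1f",   # unit separator, octal 037
}


def render_template(template: str, placeholders=PLACEHOLDERS) -> str:
    """Replace each printable placeholder with its control character."""
    for token, char in placeholders.items():
        template = template.replace(token, char)
    return template


def to_template(generated: str, placeholders=PLACEHOLDERS) -> str:
    """Inverse step: replace control characters with their placeholders."""
    for token, char in placeholders.items():
        generated = generated.replace(char, token)
    return generated
```

For example, `render_template('(insert "<NUL>")')` yields the same text with a literal NUL inside the string, while the checked-in template stays diffable as plain text.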

(12) By Alfred M. Szmidt (ams) on 2020-02-23 18:22:11 in reply to 11 [link] [source]

That would be a major pain, so I don't think that is at all an acceptable solution here.