Fossil Forum

Automatic binary data reconstruction for minimum size delta compression
(1.1) By Warren Young (wyoung) on 2019-07-21 08:15:54 edited from 1.0 [source]

I've previously published an article giving ad hoc methods for getting maximum delta compression and minimum chance of merge conflicts in binary data format files. These methods are effective for many file types, but they don't work for all file formats, and they're annoying to use even when they're possible.

It would be better if Fossil had a built-in way to apply such transformations automatically.

Simply building file type reconstruction into Fossil isn't practical, even for the most popular file types. That would effectively require building libavformat, sox, MagickCore, Pandoc, and more into Fossil. One of Fossil's primary value propositions over its competition is low dependence on third-party libraries. Solving this problem isn't worth destroying that value.

Therefore, I suggest an alternative: locally-configured plugin-style file reconstruction:

Configuration

To the current *-glob settings, add a new one, reconstruct-glob:

  pandoc.sh *.doc, *.docx, *.pdf, *.epub
  ffmpeg.sh *.mov, *.mp4, *.avi
  sox.sh *.mp4

The format differs slightly from the other *-glob settings in that the first word on each line names a script that Fossil launches, with a defined argument format, to decompress the raw file into a temporary form that Fossil can then delta-compress.

These scripts must adhere to an API defined in the following sections.

Argument Format

These scripts most likely wrap other tools, possibly even separate decompression and recompression back-end programs. That's a local implementation detail.

The script figures out what to do based in part on the arguments passed to it:

  usage: fossil-reconstruct-pandoc (-c|-d) input_file temp_dir

  -c: compress input_file to temp_dir
  -d: decompress input_file to temp_dir

The temp dir is not merely a good idea, it's outright required for cases such as OpenXML documents, which are Zip files that contain multiple internal sub-document files. The tool decides how to unpack such files into a directory tree. Fossil just iterates over that tree, applies addremove-like logic to the files it finds, and delta-compresses the resulting file set as a single artifact.
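As a sketch of how such a wrapper might look, here's a hypothetical script for OpenXML-style documents, i.e. Zip containers. Everything in it is an assumption about the eventual API: the -c/-d flags and exit codes follow the proposal in this post, Zip handling via zip/unzip stands in for whatever back end a real script would use, and -c is read as "rebuild input_file from the tree previously unpacked into temp_dir".

```shell
#!/bin/sh
# Hypothetical reconstruction wrapper (sketch, not Fossil's actual API).
# Exit codes follow the proposal: 1 = refuse to process, 3 = missing
# external dependency, 4 = I/O error. Assumes absolute paths.

reconstruct() {
    mode=$1 input=$2 tmpdir=$3
    case "$mode" in
    -d) # unpack the Zip container into temp_dir for delta compression
        command -v unzip >/dev/null 2>&1          || return 3
        mkdir -p "$tmpdir"                        || return 4
        unzip -q -o "$input" -d "$tmpdir"         || return 4
        ;;
    -c) # rebuild input_file from the tree unpacked in temp_dir
        command -v zip >/dev/null 2>&1            || return 3
        (cd "$tmpdir" && zip -q -r -X packed.zip .) || return 4
        mv "$tmpdir/packed.zip" "$input"          || return 4
        ;;
    *)  # unknown mode: refuse to process
        return 1
        ;;
    esac
}
```

A real pandoc.sh would end with `reconstruct "$@"` so the exit status propagates back to Fossil.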

Status Codes

Reconstruction scripts need to communicate the resulting state through predefined exit status codes: 0 for success, 1 for "refuse to process," 2 for "missing reconstruction script," 3 for "missing external dependency," 4 for "I/O error," and so on.

These codes need to be stable over the long term, potentially decades. Obsoleted values thus can never be reused. That argues against clever schemes like HTTP's, where sub-ranges are defined for different groups of codes, since we only have 255 values to play with on some platforms. We should just extend the list as needs arise, even though that will likely create a messy scheme over time.

Code 1: Refuse to Process

This mechanism must have a way for the plugin to say "I can't usefully process this file" since the globs above might be overly-grabby.

The "*.mp4" case above is a good example. Most MP4 files (a.k.a. ISO media) contain H.264 or H.265-compressed video, which cannot usefully be reconstructed in this manner. But not all! There are lossless codecs suitable for use in MP4 containers such as ProRes which can be reconstructed in a way that will improve delta compression.

Rather than impose a naming scheme to allow a Fossil glob to uniquely match only ProRes MP4s, we want a method for the plugin to reject files based on content, giving the current behavior in Fossil: best-effort binary diffs.
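One way a plugin might implement that content-based rejection, sketched for the ProRes-in-MP4 case. The fourcc scan below is a deliberately crude stand-in (a production script would more likely query ffprobe); the function name, byte offsets, and fourcc list are my assumptions, not anything Fossil defines.

```shell
#!/bin/sh
# Crude content sniff (sketch): accept only MP4 files that appear to
# contain ProRes video. A nonzero result maps to exit code 1, telling
# Fossil to fall back to its normal best-effort binary delta.

is_prores_mp4() {
    f=$1
    # ISO media files carry the "ftyp" box type at byte offset 4
    [ "$(dd if="$f" bs=1 skip=4 count=4 2>/dev/null)" = "ftyp" ] || return 1
    # look for any ProRes codec fourcc anywhere in the file (crude but cheap)
    grep -a -q -e apch -e apcn -e apcs -e apco -e ap4h "$f" || return 1
}
```

A wrapper's -d branch could then start with `is_prores_mp4 "$input" || exit 1` before doing any real work.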

Code 2: Missing Reconstruction Script

To use a repository with artifacts produced using this feature, you need the set of scripts in order to open the repo. Where will Fossil look for them?

You might think that it would be a good idea to store these scripts in the repo itself, such as $CKOUT/.fossil-reconstruct. I don't think so, for several reasons:

  1. Historical scripts might not run on modern platforms. (Thus the occasional need to autoreconf an old Autotools-based tarball to get it to build on a modern system.)

  2. Projects tend to expand their set of platforms, so you need to improve existing scripts and add new ones to cover additional platforms. You might have coded your scripts to avoid problem #1 across 12 versions of Linux, but you probably won't be able to open such a repo on Windows short of requiring Cygwin or WSL.

  3. It creates a dependency ordering problem, where Fossil might need to seek forward in the artifact stream to find the reconstruction scripts before it can process earlier artifact data.

I believe we should store such scripts in one of the Fossil configuration tables, much as we handle skinning today. The current scripts are expected to run on all supported check-out and server platforms for that repository, and they're expected to process all historical artifacts.

Fossil should search $PATH for each tool before looking at its configuration tables to handle cases where the in-repo versions fail.

There's a simple fall-back chain for finding the scripts, largely paralleling the way Fossil handles settings today: PATH first, then global config, then per-user config, then per-repo config.

Code 3: Missing External Dependency

These reconstruction scripts will themselves have dependencies. The pandoc.sh script mentioned above doubtlessly calls pandoc, and it in turn can be built with a subset of all available file formats. Thus, my script needs a way to tell Fossil, "Launch successful, but I'm missing some dependency that allows me to process this file."
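A dependency check like that could be a one-liner at the top of each script. This helper is a sketch; its name and message format are made up, but the exit code follows the proposal above.

```shell
#!/bin/sh
# Sketch: report a missing back-end tool with the proposed exit code 3.
# Checks every named tool; stops at the first one that isn't on PATH.

require() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || {
            echo "missing dependency: $tool" >&2
            return 3
        }
    done
}
```

pandoc.sh would then begin with something like `require pandoc || exit $?` before touching the input file.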

Cross-Platform Delta Compatibility

There is a potential that cross-platform tools might not give equivalent deltas. An old version of Info-ZIP on Linux probably won't produce exactly the same data for a given file as whatever tool you use on the Windows side, and it might not even give the same result as a current version of Info-ZIP on macOS.

However, if we dig into the facts underlying this worry, we realize that if the basic file format is compatible across the platforms we care about, this problem should not matter. It doesn't matter that three different Zip tools produce slightly different results, it only matters that the corresponding Unzip tools can read each of the other tools' outputs.

If your use case cannot meet that guarantee, you've probably already got a compatibility problem today with these same files. You need to solve the compatibility problem before layering binary file data reconstruction atop it.

Manifest Format

Using this feature will require modifying the artifact manifest so that Fossil knows it can only reconstitute the output file properly by calling the script matched via reconstruct-glob, passing the -c flag.

Globs for Script Names as Well

The first word in a reconstruct-glob line should itself be a glob so that multiple versions of the script can be provided to cover many platforms.

For example, you could provide a PowerShell version for Windows, a pdksh version for the BSDs, a Python 2 script for macOS and older Linuxes, and a Python 3 version for newer Linuxes. All of these could be matched as pandoc.*, selecting the *.ps1, *.ksh, *.py2, and *.py3 versions, respectively.

Rebuilds

With all of the above sorted out, you can then allow fossil rebuild --compress to run on an existing repo storing files that previously weren't possible to efficiently delta-compress (e.g. PNGs), producing a much smaller repo. Initially, this requires external scripts in the PATH, but going forward, it can use versions shipped to new users during initial clone along with the config data.

Compatibility Format

With all of this done, we should add a flag to fossil rebuild that makes it ignore reconstruct-glob to produce a repo compatible with current and historical Fossil versions. You need to run that on a system with the decompression tools, but once done, Fossil returns to current behavior on systems where Fossil remains unable to run the reconstruct-glob tools.


The end goal of all of this is that my ad hoc methods article should become much smaller, covering only file types that cannot be processed usefully at all, such as lossily-compressed formats (MP3, JPEG, ...).

(2) By Warren Young (wyoung) on 2019-07-21 09:21:47 in reply to 1.1 [link] [source]

Addenda:

  1. Fossil should by default call the reconstruction script with both -c and -d on separate temp directories, passing the result of the first to the second as input, then check that the data haven't changed. If the reconstruction script can't survive a single round trip, Fossil is justified in rolling back the check-in or rebuild operation.

  2. For the sound file example, I copy-pasted *.mp4 then didn't fill in a proper example. Substitute something like FLAC or Apple Lossless.

  3. For the ProRes movie file example, I should clarify that the reconstruction script will probably have to produce a directory full of still frame files in an uncompressed binary format on decompression and then re-encode the ProRes file on recompression. This will be expensive in terms of disk space and CPU time, so it'll only be used when the smallest possible repository size is important, and when the movie file doesn't change much between versions, at a frame level.

    This feature will need to be purely optional, with decisions made about when and how to make use of it left to each repo's administrators. This is another good reason not to put the logic into Fossil itself.

  4. In making my point about Pandoc, MagickCore, libavformat, etc., I didn't make something explicit that others not familiar with these projects might not realize: each one is around the size of Fossil itself, yet each covers only a small subset of the file types we want this feature to cover! Fossil minus SQLite, its third-party dependencies, and its test suite is currently 107 kSLOC. Of my examples, Sox is the smallest at "only" 38 kSLOC. Pandoc is next up at 53 kSLOC. Then we have libavformat at 160 kSLOC and MagickCore at 172 kSLOC. And libavformat is useless without libavcodec and more, which brings the total up to about a million SLOC before even considering third-party dependencies!

    The point is, a fully-featured all-internal Fossil implementation of reconstruction covering the top 1% of file types would probably make Fossil many times larger just for all of the codecs required.
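The round-trip check from addendum 1 above can be sketched in a few lines. The helper name and argument order are assumptions based on the proposed -c/-d API, not anything Fossil actually does today.

```shell
#!/bin/sh
# Sketch of the round-trip check from addendum 1: run the script's -d
# and -c modes back to back on a copy and confirm the bytes come out
# unchanged. Returns 0 only if the reconstruction is byte-identical.

roundtrip_ok() {
    script=$1 orig=$2
    work=$(mktemp -d) || return 1
    cp "$orig" "$work/candidate"                    || return 1
    "$script" -d "$work/candidate" "$work/unpacked" || return 1
    "$script" -c "$work/candidate" "$work/unpacked" || return 1
    cmp -s "$orig" "$work/candidate"
}
```

On a nonzero status, Fossil would roll back the check-in or rebuild rather than store an artifact it cannot reproduce.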

(3) By Warren Young (wyoung) on 2019-07-21 09:36:19 in reply to 1.1 [link] [source]

This feature will interact strangely with Fossil's embedded document and /file features.

If I have a PNG in the repo and my reconstruction scripts store its pixel data as an uncompressed Windows BMP instead, then embed that PNG into a wiki document, it'll probably fail because my browser won't want to render a BMP file. The Fossil server will need to reconstruct the PNG before serving it. That means you'll probably want to turn on Fossil's caching features if you're using this reconstruction feature.

Much the same is true when using Fossil's file browsing feature. When I click on a file in the browser, I want Fossil to send my browser the same file content I'd get on checkout, not the uncompressed, deconstructed form.