
# Image Format vs Fossil Repo Size

## The Problem

Fossil has a delta compression feature which removes redundant information from a file relative to its parent on check-in.[^1] That delta is then zlib-compressed before being stored in the Fossil repository database file.

Storing pre-compressed data files in a Fossil repository defeats both of these space-saving measures:

  1. Binary data compression algorithms turn the file data into pseudorandom noise.[^2]

    Typical data compression algorithms are not hash functions, where the goal is that a change to any bit in the input has a statistically even chance of changing every bit in the output, but to the extent that they approach that pathological condition, pre-compressed data tends to defeat Fossil’s delta compression algorithm, since there is so little correlation between two different outputs of the binary data compression algorithm.

  2. An ideal lossless binary data compression algorithm cannot be applied more than once to make the data even smaller, since random noise is incompressible. The consequence for our purposes here is that pre-compressed data doesn’t benefit from Fossil’s zlib compression.

You might then ask: what does it matter whether the space savings come from the application file format (e.g. JPEG, DOCX, Zip, etc.) or from Fossil itself? As far as point 2 above goes, it really doesn’t, but point 1 causes the Fossil repository to balloon out of proportion to the size of the input data change on each checkin. This article will illustrate that problem, quantify it, and give a solution to it.
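Both points are easy to see in miniature. The following Python sketch (illustrative only, not part of the companion notebook) flips a single byte in a block of repetitive text, compares how much of the raw form versus the zlib-compressed form changes, and then tries compressing the compressed output a second time:

```python
import zlib

# A repetitive "document" and a copy of it with a single byte changed.
original = b"The quick brown fox jumps over the lazy dog. " * 400
modified = bytearray(original)
modified[len(modified) // 2] ^= 0xFF      # flip one byte in the middle
modified = bytes(modified)

def differing_bytes(a, b):
    """Crude stand-in for delta size: positions where the two streams differ."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

# Point 1: the raw versions differ by one byte, so a delta encoder has almost
# nothing to store; the compressed versions diverge from the changed byte on.
za, zb = zlib.compress(original, 9), zlib.compress(modified, 9)
print("raw bytes differing:       ", differing_bytes(original, modified))
print("compressed bytes differing:", differing_bytes(za, zb), "of", len(za))

# Point 2: compressing already-compressed data gains little or nothing.
print("compressed once: ", len(za))
print("compressed twice:", len(zlib.compress(za, 9)))
```

In the raw case a delta encoder has almost nothing to record, while in the compressed case most of the stream changes and the second compression pass gains nothing, which is exactly what inflates a Fossil repository holding pre-compressed files.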

## Affected File Formats

In this article’s core experiment, we use 2D image file formats, but this article’s advice also applies to many other file types. For just a few examples out of what must be thousands:

## Demonstration

The companion image-format-vs-repo-size.ipynb file (download, preview) is a JupyterLab notebook implementing the following experiment:

  1. Create a new minimum-size Fossil repository. Save this initial size.

  2. Use ImageMagick via Wand to generate a JPEG file of a particular size — currently 256×256 pixels — filled with Gaussian noise to make data compression more difficult than with a solid-color image.

  3. Check that image into the new Fossil repo, and remember that size.

  4. Change a random pixel in the image to a random RGB value, save that image, check it in, and remember the new Fossil repo size.

  5. Iterate on step 4 some number of times — currently 10 — and remember the Fossil repo size at each step.

  6. Repeat the above steps for BMP, PNG, and TIFF.[^3]

  7. Create a bar chart showing how the Fossil repository size changes with each checkin.

We chose to use JupyterLab for this because it makes it easy for you to modify the notebook to try different things. Want to see how the results change with a different image size? Easy, change the size value in the second cell of the notebook. Want to try more image formats? You can put anything ImageMagick can recognize into the formats list. Want to find the break-even point for images like those in your own repository? Easily done with a small amount of code.
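As a rough illustration of steps 2 and 4 above, here is what the image generation and single-pixel edit can look like using Wand; the file name, gray background, and noise parameters here are illustrative choices, not necessarily the ones used in the notebook:

```python
import random
from wand.color import Color
from wand.drawing import Drawing
from wand.image import Image

size = 256          # image is size x size pixels, as in step 2

# Step 2: a mid-gray canvas with Gaussian noise added, saved as a JPEG.
with Image(width=size, height=size, background=Color('gray50')) as img:
    img.noise('gaussian')
    img.format = 'jpeg'
    img.save(filename='noise.jpg')

# Step 4: re-open the image and set one random pixel to a random RGB value.
with Image(filename='noise.jpg') as img:
    x, y = random.randrange(size), random.randrange(size)
    r, g, b = (random.randrange(256) for _ in range(3))
    with Drawing() as draw:
        draw.fill_color = Color(f'rgb({r},{g},{b})')
        draw.point(x, y)
        draw(img)
    img.save(filename='noise.jpg')
```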

## Results

Running the notebook gives a bar chart something like[^4] this:

*(results bar chart)*

There are a few key things we want to draw your attention to in that chart:

## Automated Recompression

Since programs that produce and consume binary-compressed data files often make it either difficult or impossible to work with the uncompressed form, we want an automated method for producing the uncompressed form to make Fossil happy while still having the compressed form to keep our content creation applications happy. This Makefile should[^6] do that for BMP/PNG, SVG, XLSX, and PDF files:

```makefile
.SUFFIXES: .bmp .png .svg .svgz

# SVG: check in the uncompressed .svg, edit the gzip-compressed .svgz
.svgz.svg:
    gzip -dc < $< > $@

.svg.svgz:
    gzip -9c < $< > $@

# Bitmap images: check in the uncompressed .bmp, edit the .png
.bmp.png:
    convert -quality 95 $< $@

.png.bmp:
    convert $< $@

SS_FILES := $(wildcard spreadsheet/*)


all: $(SS_FILES) illus.svg image.bmp doc-big.pdf

reconstitute: illus.svgz image.png
    ( cd spreadsheet ; zip -9 ../spreadsheet.xlsx * )
    qpdf doc-big.pdf doc-small.pdf


$(SS_FILES): spreadsheet.xlsx
    unzip -o $< -d spreadsheet

doc-big.pdf: doc-small.pdf
    qpdf --stream-data=uncompress $< $@
```

This Makefile allows you to treat the compressed version as the process input, but to actually check in only the changes against the uncompressed version by typing “make” before “fossil ci”. This is not actually an extra step in practice, since if you’ve got a Makefile-based project, you should be building — and testing — it before checking each change in anyway!

Because this technique is based on dependency rules, only the necessary files are generated on each make command.

You only have to run “make reconstitute” once after opening a fresh Fossil checkout to produce those compressed sources. After that, you work with the compressed files in your content creation programs. Your build system might include some kind of bootstrapping or auto-configuration step that you could attach this to, so that it doesn’t need to be run by hand.

This Makefile illustrates two primary strategies:

### Input and Output File Formats Differ by Extension

In the case of SVG and the bitmap image formats, the file name extension differs between the cases, so we can use make suffix rules to get the behavior we want. The top half of the Makefile just tells make how to map from *.svg to *.svgz and vice versa, and the same for *.bmp to/from *.png.

### Input and Output Use the Same Extension

We don’t have that luxury for Excel and PDF files, each for a different reason:


[^1]: This problem is not Fossil-specific. Several other programs also do delta compression, so they’ll also be affected by this problem: rsync, Unison, Git, etc. You should take this article’s advice when using all such programs, not just Fossil. When using file copying and synchronization programs without delta compression, on the other hand, it’s best to use the most highly-compressed file format you can tolerate, since they copy the whole file any time any bit of it changes.

[^2]: In fact, a good way to gauge the effectiveness of a given compression scheme is to run its output through the same sort of tests we use to gauge how “random” a given PRNG is. Another way to look at it is that if there is a discernible pattern in the output of a compression scheme, that constitutes information (in the technical sense of that word) that could be further compressed.

[^3]: We’re using uncompressed TIFF here, not LZW- or Zip-compressed TIFF, either of which would give results similar to PNG, which is always zlib-compressed.

[^4]: The raw data changes somewhat from one run to the next due to the use of random noise in the image, which makes the zlib/PNG compression more difficult, and due to the random pixel changes. Those test design choices make this a Monte Carlo experiment. We’ve found that the overall character of the results doesn’t change from one run to the next.

[^5]: It’s not clear to me why there is a one-time jump in size for BMP and TIFF past the first commit. I suspect it is due to the SQLite indices being initialized for the first time. Page size inflation might have something to do with it as well, though we tried to control for that by rebuilding the initial DB with a minimal page size. If you re-run the program often enough, you will sometimes see the BMP or TIFF bar jump higher than the other, again likely due to one of the repos crossing a page boundary. Another curious artifact in the data is that the result for BMP is slightly larger than that for TIFF. This goes against expectation, since a low-tech format like BMP should have a small edge in this test: TIFF metadata includes the option for multiple timestamps, UUIDs, etc., which bloat the checkin size by creating many small deltas.

[^6]: The Makefile above is not battle-tested. Please report bugs and needed extensions on the forum.