Fossil Forum

How to handle regularly changing images/binary files?

(1) By Marv (prefection) on 2023-01-03 14:56:10

Hey,

I have recently started doing some visual regression testing. This essentially means taking screenshots of webpages in their desired state and comparing them with new ones after making changes to the design. As the name suggests, this makes it easier to spot regressions when editing the style of components reused throughout the webpage.

For this to work properly, you want to commit these screenshots with the current code, so that a pre-commit hook can verify that your latest changes did not break anything unexpectedly.
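Such a pre-commit check could be sketched roughly like this (a minimal stdlib-only sketch; the directory layout and function names are made up, and a real setup would likely use a perceptual diff rather than an exact byte comparison):

```python
import hashlib
from pathlib import Path

def screenshots_match(baseline: Path, candidate: Path) -> bool:
    # Byte-identical files hash identically; any pixel change shows up here.
    return (hashlib.sha256(baseline.read_bytes()).digest()
            == hashlib.sha256(candidate.read_bytes()).digest())

def changed_screenshots(baseline_dir: Path, candidate_dir: Path) -> list[str]:
    # Names of screenshots that are missing or differ: candidate regressions.
    changed = []
    for base in sorted(baseline_dir.glob("*.png")):
        cand = candidate_dir / base.name
        if not cand.exists() or not screenshots_match(base, cand):
            changed.append(base.name)
    return changed
```

A hook would run `changed_screenshots()` and refuse the commit (or at least warn) when the list is non-empty.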

After a few iterations of this workflow, I realized that my repository has grown from about 30 MB to a few hundred MB. Keeping this up will probably push it into the gigabytes.

I generally like the workflow of committing the screenshots, but I don't like that I bloat my repository more or less "unnecessarily". I have thought about making the screenshots unversioned, which would probably solve this bloat, but that would also make working on two branches at once impossible if I want the screenshots to be consistent with my current check-out.

Another alternative would of course be not to check them into fossil at all and just generate "expected" screenshots for my checkout to compare against later. This would work, but it would also require some manual effort that I may either forget or simply skip because of the hassle.

So to get back to my question: Is there a way for me to include these regularly changing images/binary files in the repository without bloating its size too much?

(2) By Chris (crustyoz) on 2023-01-03 15:23:46 in reply to 1

Source code management tools are essentially text-oriented, not image-oriented. Since an image can be 10x to 100x larger than the equivalent textual content, you might consider NOT including the images in the repository. External links to images stored elsewhere in the file system would keep the repository small and make its primary information-processing function clear.

You then have a library management problem for related images.

How are you detecting differences between images? What is a meaningful difference?

Most image formats already include compression, so there is not much room for improvement on a per-image basis.

(7) By Warren Young (wyoung) on 2023-01-03 20:20:50 in reply to 2

External links to the images

While I agree with the general thrust of your argument, I've got to push back on this point.

First off, committing data generated from files already in the repo is generally a bad idea. In principle, these screenshots can be regenerated from the code, as with Selenium.

Second, the point of committing the screenshot with the code is that they're a natural pair: if you roll back to a historical version of the code, you want the screenshot for that historical version. However, this just takes you back to the prior point: write a script to regenerate the screenshot.

If the OP does that, he won't have a repo size bloat problem, and he can ignore my advice to take the screenshots in BMP or uncompressed TIFF. :)

(8) By Chris (crustyoz) on 2023-01-03 20:32:35 in reply to 7

If the screenshots can be generated programmatically during the detection of unacceptable image content, then there is no point in storing the screenshots anywhere. The detection issue is NOT a version control problem, so why burden the SCM with it?

(12) By Warren Young (wyoung) on 2023-01-03 21:43:54 in reply to 8

Indeed; some people do use Selenium for that. If the image changes against expectation, it's a regression.

(9) By Marv (prefection) on 2023-01-03 20:42:59 in reply to 7

I am trying to avoid regenerating the screenshots on check-out, but I have to admit that, thinking about it now, it would not be too bad, because I am currently the only developer on the project and run tests entirely locally. If the check-out were sent to a CI system that automatically runs these tests, committing the screenshots would be more or less required, as the CI job/runner would not know whether the screenshots it compares against are the correct ones.

(10) By Chris (crustyoz) on 2023-01-03 20:58:15 in reply to 9

Sounds like a script that does the check-out, and thus knows the version identifier, would also need to run the tests after constructing the correct file names, including the version.
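As a rough sketch of that idea (the helper names are made up; it assumes `fossil info` output containing a `checkout:` line with the checked-out artifact hash, which in a real script would come from running `fossil info` in the checkout):

```python
def checkout_id(info_text: str) -> str:
    # Pull the artifact hash out of `fossil info` style output.
    for line in info_text.splitlines():
        if line.startswith("checkout:"):
            return line.split()[1]
    raise ValueError("no checkout: line found")

def screenshot_name(page: str, info_text: str) -> str:
    # Build a versioned screenshot file name from the checkout hash.
    return f"{page}-{checkout_id(info_text)[:10]}.png"
```

The test runner would then look up exactly the screenshots whose names embed the current version.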

(13) By Warren Young (wyoung) on 2023-01-03 21:54:34 in reply to 9

regenerating the screenshots on check-out

Why then? Me, I'd make it part of "make test": if the screenshot is missing from the checkout directory, create it on the fly, same as you'd do for a missing *.o file in a C project.
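In code, that "build it only if missing" rule might look like the following (a hypothetical sketch; `render` stands in for whatever actually drives the browser and saves the image):

```python
from pathlib import Path
from typing import Callable

def ensure_screenshot(path: Path, render: Callable[[Path], None]) -> Path:
    # Like make with a missing *.o file: regenerate only when absent.
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        render(path)
    return path
```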

(14) By Chris (crustyoz) on 2023-01-03 22:03:21 in reply to 13

I used the phrase "on check-out" because that is how Marv describes it being done.

Me, I'd include the differencing between images as part of the make activity, throw away the generated images for all but the most recent version, and not include them in the repo.

(3) By Stephan Beal (stephan) on 2023-01-03 16:39:28 in reply to 1

So to get back to my question: Is there a way for me to include these regularly changing images/binary files in the repository without bloating its size too much?

As Chris already touched on, SCM systems are not fantastic for binary content.

To answer the above-quoted question: it really depends on the binary format and how much they truly differ for ostensibly small changes.

Though fossil's delta algorithm is quite good, binary formats can vary in any number of ways for even the slightest change between snapshots, as your experience has shown with the bloat of your repository. If you're not already doing so, it might help to take the snapshots in PNG format, as those compress tightly for large areas of the same color (as opposed to photos, where JPG is better 99.99% of the time).

Another thing to try, but only testing will tell if it offers some improvement, is to store them in some uncompressed format, like uncompressed PNG or maybe even BMP (gasp!) or XPM (gasp!!). Perhaps (but only perhaps) fossil can create smaller diffs against those, in particular because you presumably don't expect the screenshots to differ much between checkins.
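The intuition can be illustrated with zlib (the DEFLATE family that PNG uses internally): a one-byte change in raw pixel data stays a one-byte change, which a delta encoder can store compactly, while the compressed encodings of the two versions no longer line up. (A toy illustration, not a measurement of fossil's actual delta sizes.)

```python
import zlib

raw_a = b"\x00" * 50_000          # a flat, uncompressed "image" region
raw_b = bytearray(raw_a)
raw_b[25_000] = 0xFF              # flip a single "pixel"
raw_b = bytes(raw_b)

# Uncompressed: exactly one byte differs between the two versions.
raw_diff = sum(x != y for x, y in zip(raw_a, raw_b))   # -> 1

# Compressed: both encodings are small, but they are different streams,
# so a byte-oriented delta has far less common material to work with.
comp_a = zlib.compress(raw_a)
comp_b = zlib.compress(raw_b)
```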

Warren did some experimentation with this in fossil years ago but i don't recall the management summary of his results off hand. Perhaps he will drop in and offer us some enlightenment.

(5.1) By Warren Young (wyoung) on 2023-01-03 19:52:31 edited from 5.0 in reply to 3

only testing will tell

Already done. Moreover, the core experiment is repeatable and extensible, allowing the OP to feed his existing screenshots in as test data with a minor bit of programming, yielding a chart to study in place of speculation.

(11) By Marv (prefection) on 2023-01-03 21:06:28 in reply to 5.1

This is super interesting; especially the makefile may solve the big hit to the repository size. The screenshots usually don't change much, so letting fossil do the compression may keep the size down.

(4) By Konstantin Khomutov (kostix) on 2023-01-03 17:50:13 in reply to 1

IIUC, the only known way to address

Is there a way for me to include these regularly changing images/binary files in the repository without bloating its size too much?

is to keep something like "pointers" in the repository instead of the original files, keep the files themselves elsewhere, and have a way to access the files via such "pointers" when needed. Otherwise you need to store the files in the repository, and no matter how hard you try to store them efficiently, there are obvious limits to today's compression methods (which are believed to be at their theoretical limits anyway).
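A minimal sketch of such a pointer scheme, assuming a local directory as the external store (names and layout are invented; Git LFS and friends do essentially this, with a server behind it):

```python
import hashlib
import shutil
from pathlib import Path

def stash(big_file: Path, store: Path, repo_dir: Path) -> Path:
    # Park the big file in a content-addressed store; commit only the pointer.
    digest = hashlib.sha256(big_file.read_bytes()).hexdigest()
    store.mkdir(parents=True, exist_ok=True)
    shutil.copy2(big_file, store / digest)
    pointer = repo_dir / (big_file.name + ".ptr")
    pointer.write_text(digest + "\n")
    return pointer

def fetch(pointer: Path, store: Path, dest: Path) -> Path:
    # Resolve a pointer file back to the real bytes when they are needed.
    digest = pointer.read_text().strip()
    shutil.copy2(store / digest, dest)
    return dest
```

The repository then versions only the tiny `.ptr` files, at the cost of no longer being self-contained.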

One point to note is that as soon as you start keeping some blobs outside of the repository, you lose the repository's property of being self-contained. But that is a well-known trade-off, and it was widely discussed back when distributed VCSes were gaining momentum: many opponents pointed out that a DVCS is usually a poor choice for certain workflows, such as those typical for designers or modelers working on the assets of a computer game, where a plain lock → download → modify → upload → unlock cycle works better.

You might consider using Git LFS and any suitable backend which mediates between the actual storage for big files and Git.

I know it's not about Fossil, but if managing huge files with some VCS is an issue to be solved, that tool has solved it.

(6.1) By Warren Young (wyoung) on 2023-01-03 20:15:57 edited from 6.0 in reply to 4

You might consider using Git LFS…

While that may help with commit times, the only way that will solve the repo bloat problem is if they've taught it how to do bitmapped image delta compression on compressed image formats. I've heard that there are VCSes that know that trick, but it's rare and it usually requires per-image-format extension plugins.

For this type of case, my experiment's results are directly relevant: use uncompressed image formats instead of compressed, and you'll avoid the bloat entirely.

(15) By skywalk on 2023-01-04 05:12:20 in reply to 4

You had me until LFS. Since the OP already mentioned he is the sole dev, he could hash the snapshot image and store the hash in a text file in the repo, then script or code a comparator for the image hashes, and/or discard them when deemed sufficient.
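That hash-manifest idea could be sketched like this (stdlib-only; the manifest format and function names are invented). Only the small text file would be committed, while the images themselves stay disposable:

```python
import hashlib
from pathlib import Path

def record_hashes(shots: list[Path], manifest: Path) -> None:
    # One "<sha256>  <name>" line per screenshot; commit only this file.
    lines = [f"{hashlib.sha256(p.read_bytes()).hexdigest()}  {p.name}"
             for p in sorted(shots)]
    manifest.write_text("\n".join(lines) + "\n")

def changed_since(shots: list[Path], manifest: Path) -> list[str]:
    # Names whose current hash no longer matches the committed manifest.
    recorded = {}
    for line in manifest.read_text().splitlines():
        digest, name = line.split(None, 1)
        recorded[name] = digest
    return [p.name for p in sorted(shots)
            if recorded.get(p.name)
            != hashlib.sha256(p.read_bytes()).hexdigest()]
```

The obvious limitation: the manifest can tell you *that* a screenshot changed, but not show you *how*, since the old pixels are gone.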

Interesting to learn that uncompressed images produce the least bloat. I will keep that in mind for newer repos.

(16.1) By Konstantin Khomutov (kostix) on 2023-01-04 17:07:46 edited from 16.0 in reply to 15

This approach will also require storing the blobs themselves in a way that gives them unique names, so they do not clash with each other. Naming them after the hashes of their contents should work.

Would require quite a lot of tinkering with scripting, of course.

(17) By Stephan Beal (stephan) on 2023-01-04 12:48:56 in reply to 16.0

This approach will also require storing the blobs themselves in a way that gives them unique names

Note that doing so would eliminate any benefit from fossil's delta algorithm. Though the internals can actually delta arbitrary blobs against each other (they don't have to be related to each other), there is no good algorithm for selecting when to do so, so it currently only generates deltas for blobs which are versions of the same file.

(18.2) By Konstantin Khomutov (kostix) on 2023-01-04 17:16:37 edited from 18.1 in reply to 17

doing so would eliminate any benefit from fossil's delta algorithm.

Well, my perception is that the main point raised by the OP was the potential bloating of the repository itself, which (though not stated explicitly, it could be inferred) might lead to slowdowns and other inconveniences: for instance, not all past versions of the files may be needed in the long run, the repository may become a bit unwieldy to handle, etc.

Hence the obvious suggestion is to store the bulk files elsewhere, say, on a dedicated file server or "in the cloud" (that is, on someone else's file server ;-)). Of course, if we store the stuff outside, Fossil is mostly out of the picture.

On the other hand, certain modern filesystems such as ZFS and Btrfs can do deduplication, which may help. Still, that deduplication IMO happens at the block level, which can make it far less helpful when it comes to compressed image formats.
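A small thought experiment (a hypothetical sketch, not tied to any particular filesystem) shows why: aligned fixed-size blocks dedup well for raw pixel data with a localized change, but not once the bytes have been through a compressor, where the two encodings no longer share whole blocks.

```python
import zlib

BLOCK = 4096  # a typical filesystem block size

def aligned_same(a: bytes, b: bytes) -> int:
    # Count aligned fixed-size blocks that are identical, which is
    # roughly what block-level dedup can share between two files.
    n = max(len(a), len(b))
    return sum(a[i:i + BLOCK] == b[i:i + BLOCK] for i in range(0, n, BLOCK))

raw_a = b"A" * 50_000                     # flat raw "pixel" data
raw_b = bytearray(raw_a)
raw_b[25_000] = ord("B")                  # one changed "pixel"
raw_b = bytes(raw_b)

shared_raw = aligned_same(raw_a, raw_b)   # 12 of the 13 blocks still match
shared_comp = aligned_same(zlib.compress(raw_a), zlib.compress(raw_b))
```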