Fossil Forum

Fossil Big File Support?

(1) By anonymous on 2018-09-18 22:41:33 [link]

Unless I'm wrong, Fossil currently only supports files smaller than 2GiB (or something close to it). I understand this limitation comes from SQLite's maximum BLOB size.

It'd be great if Fossil supported *binary* files larger than the current 2GiB limit by effectively storing them in multiple BLOBs where necessary.
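
For illustration, a minimal sketch of what that chunked storage could look like at the schema level (hypothetical table and column names, not Fossil's actual schema):

    -- Hypothetical sketch: one logical file split across ordered chunks,
    -- each chunk kept well below SQLite's maximum BLOB length.
    CREATE TABLE file_chunk (
        file_id INTEGER,        -- identifies the logical file
        seq     INTEGER,        -- chunk order within that file
        data    BLOB NOT NULL,  -- e.g. at most 1 GiB per chunk
        PRIMARY KEY (file_id, seq)
    );

Reassembly would then just concatenate the chunks in `seq` order.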

(2) By sean (jungleboogie) on 2018-09-18 23:32:59 in reply to 1 [link]

https://sqlite.org/limits.html

Check out #14.

(3) By sean (jungleboogie) on 2018-09-18 23:36:43 in reply to 1 [link]

While it may be possible to store files larger than 2GB, it might not be the best thing to do. Typically source control is for text files, although it's used and abused with art assets, video files, etc. And that's not just regarding fossil, but source control in general.

Check the mailing list archives for a similar question from others. If you do this, you'll likely want to disable the checksum, as that can take a considerable amount of time to complete.

(4) By Stephan Beal (stephan) on 2018-09-19 02:07:00 in reply to 1 [link]

One important reason not to store huge files is that fossil's memory requirements are largely a function of the blob sizes: e.g. when performing deltas, which it does when committing changes, it needs memory to store/work with the original copy, the changed copy, and the delta all at once. It's very possible to run out of memory when working with large blobs, especially on embedded devices like a Raspberry Pi (one fellow, a couple of years ago, was trying to use a blob which was something like 4x the size of his RAM and virtual memory, and wondered why fossil kept dying).

(5) By anonymous on 2018-09-19 21:01:14 in reply to 3 [link]

Yes, I did disable the checksum option. I agree that Fossil isn't designed to store art assets etc., but IMHO the simplicity of fossil makes it a great archiving tool even for binary files.

(6) By anonymous on 2018-09-20 00:29:46 in reply to 1 [link]

How practical is it to version such big binary files?

It may be easier and indeed faster to version _references_ to such assets. After a check out you may fetch the actual files using a script. No need to wait for Fossil to unpack the big files from the db.

Otherwise, keeping huge binaries in the repo kinda short-circuits Fossil's utility and makes it slower on status and commit. It also makes cloning such a repo a needlessly long wait.

(7) By Kevin (KevinYouren) on 2018-09-20 00:54:48 in reply to 5 [link]

I would suggest you also consider a "proof of concept" archiving tool, which is also written by Dr Hipp.



https://sqlite.org/sqlar/

I use it to store photos and pdfs.
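
For context, the sqlar format itself is essentially one table in an ordinary SQLite database, roughly the following (see the page above for the authoritative definition):

    CREATE TABLE IF NOT EXISTS sqlar(
        name  TEXT PRIMARY KEY,  -- name of the file
        mode  INT,               -- access permissions
        mtime INT,               -- last modification time
        sz    INT,               -- original file size
        data  BLOB               -- compressed content
    );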

regs, Kev

(8) By anonymous on 2018-09-20 02:58:11 in reply to 7 [link]

I did look into it prior to using Fossil as an archival tool. Unless I'm wrong, it has the same BLOB size limit. But even putting the size limit aside, I find Fossil a lot easier to use and more feature-rich: built-in syncing, multiple branches, wiki, de-dup, no accidental deletion, etc. Best of all, it's a single db file, which makes it effortless to backup/restore. More or less I use Fossil as a wrapper layer around sqlar with some goodies.

(9) By anonymous on 2018-09-20 02:59:01 in reply to 6 [link]

My intention is not necessarily to 'version' them like text files, but to archive binary files and prevent accidental deletion.

(10) By sean (jungleboogie) on 2018-09-20 03:15:33 in reply to 9 [link]

That sounds like a backup program. You may like borg, a python backup utility.

(11) By anonymous on 2018-09-20 04:46:07 in reply to 10 [link]

Sorta like a backup utility but just a bit different. See this post:
<https://fossil-scm.org/forum/forumpost/16d7c4c287>

In any case, I currently split them manually and fossil has no issues dealing with the pieces. The repo is now over 120GB and fossil works flawlessly.

(12) By sean (jungleboogie) on 2018-09-20 04:48:55 in reply to 11 [link]

Huh, pretty cool. Have any figures on how long it takes to open as a fossil file?

(13) By Kevin (KevinYouren) on 2018-09-20 10:25:15 in reply to 8

Sounds good.

I only have 11 distinct files greater than 1G, out of 1.55 million files (about 860,000 distinct files). I have 2 Ubuntu instances and an LFS (Linux From Scratch) instance on my laptop, plus another laptop that is a clone.

I did find the BLOB limit during testing of SQLAR, when I had a 3G GPG file in a sub-directory.

I used to split files into pieces when I backed up to diskette - I didn't have a tape drive. USBs solved that. I now have multiple removable drives.

regs, Kev

(14) By Warren Young (wyoung) on 2018-09-20 10:40:48 in reply to 13 [link]

It sounds like you're trying to use Fossil as an alternative to `rsync`: a method to keep two or more machines' filesystems in sync.

Isn't that awfully expensive in terms of disk space? At the very least, it doubles storage space, ignoring compression. Every time you update one of those OS disk images, you're likely to balloon the size of the Fossil repo.

On top of that, every time you check in changes to a file managed by Fossil, you temporarily need up to about 3x that BLOB's size to compute the diff: 1x for the checked-in version, 1x for the new version, and up to 1x for all of the delta data, with the worst case being that the new version can't be delta-compressed at all. Fossil could move to a rolling diff model, reducing the worst case to the BLOB size + `2 * sizeof(rolling_buffer)`, but it's still a lot of RAM for large BLOBs.

There are rafts of machine syncing and private cloud storage alternatives out there. I don't think Fossil should try to morph into yet another of these. To the extent that Fossil does overlap this area, it's in filling a rather specialized niche.

(15) By Stephan Beal (stephan) on 2018-09-20 13:17:37 in reply to 14 [link]

Warren wrote:

> There are rafts of machine syncing and private cloud storage alternatives out there.


For those who haven't heard of it yet, [Syncthing](https://syncthing.net/) is a cross-platform, open source solution for hosting one's own syncable files. It's kind of like having (and maintaining) your own private Dropbox service. I haven't used it but have heard good things about it.


Warren wrote:

> I don't think Fossil should try to morph into yet another of these.


Amen to that!

(16) By anonymous on 2018-09-20 15:32:28 in reply to 9 [link]

Looks like this mostly works for you. However, by the same token, an accidental deletion may equally happen to the Fossil repo... even more so if you keep all your fossils in the same 'basket', so to speak. If the directory is deleted by accident, all of the repos within are gone too.

Using Fossil as a kind of backup tool may not be optimal in the long run. I'm not sure whether Fossil preserves file attributes, like permissions and owner.

(17) By Stephan Beal (stephan) on 2018-09-20 15:43:48 in reply to 16 [link]

The only permission fossil records is the executable bit, and that was added relatively late in fossil's design. Fossil does not record any file owner info. git behaves similarly in this regard, ignoring all permissions except the executable bit.

(18) By Richard Hipp (drh) on 2018-09-20 16:21:03 in reply to 16 [link]

In a sense, Fossil was originally designed to do backup!

Remember, Fossil was designed to support SQLite development. Part of that
support includes providing automatic backups.  We accomplish this by having
peer server repositories in separate data centers hosted by independent
ISPs, and having those peer repositories automatically sync with one another.
When we make a change to SQLite, that change goes into our local repo and
(via autosync) is immediately also pushed to one of the peer servers.  Within
a few hours, the change is also replicated to the other servers.  In this way,
we can lose entire data centers and/or developer sites, with no actual loss
of content.

I have various private repositories in which I keep things like slides for all
talks I've ever presented, personal and corporate financial records, and
so forth.  These private repositories are also replicated (though on a
different set of private servers) to avoid any single point of failure.

When I go to set up a new laptop, I simply install Fossil then clone a
few repositories, and I suddenly have all these historical records available
for immediate access.  Prior to taking that laptop on a trip, I simply run
"fossil all sync" to make sure everything is up-to-date.

All that said, I'm not trying to back up multi-gigabyte files.  The biggest
artifacts I have are OpenOffice presentation files which can be a dozen
megabytes or so.  Fossil works great as a backup mechanism in that
context.  I'm not sure it would work as well as a backup mechanism for
gigabyte-sized videos and high-res photo and album collections.

(19) By jvdh (veedeehjay) on 2018-09-20 16:48:41 in reply to 18 [link]

Apart from the multi-gig issue, I would argue that something like this (e.g. syncing stuff between desktop and laptop) really is better done with a bidirectional file synchronizer capable of detecting conflicts (files changed on both sides/machines, etc.). For example, `unison` should appeal to fossil users: https://github.com/bcpierce00/unison (yes, they use the wrong DVCS ;)).

I understand that one could use your approach to some extent, but I really prefer a full sync of my home directory (including file-system based syncing of existing fossil repos...) between desktop and laptop.

(20) By Kevin (KevinYouren) on 2018-09-20 22:21:45 in reply to 15 [link]

I tend to agree with both Warren and Stephan. Fossil has core use cases.

I am retired and work alone. I am a fringe user, with specialized requirements. An outlier, no longer mainstream.

I use Fossil like a diary system.
I use SQLAR as an alternative to tar.gz and zip.

I do not use Fossil as a backup mechanism.

I use these because they have SQLite at the core. It's good.

As a user of IBM mainframe change management software from 1981 onwards, I have a different perspective. I always used software that stored source, compared it, compiled it, reported problems, and then distributed from unit test to system test to integration test to user-acceptance test to production. Emergency fixes in Prod were catered for.

regs, Kev

(21) By andygoth on 2018-10-24 21:19:26 in reply to 1 updated by 21.1 [link]

See my [post](/forumpost/b022e1be1f) about how I store arbitrarily large files in SQLite.  Maybe the concept can be adapted in a future version of Fossil, or sqlar for that matter.

(21.1) By andygoth on 2018-10-24 22:02:41 edited from 21.0 in reply to 1 [link]

In another project, I was able to store arbitrarily large files in SQLite.  Maybe the concept can be adapted in a future version of Fossil, or sqlar for that matter.  I did so by extending the sqlar concept in a few ways:

- Chunking large files to respect SQLite limits
- Multiple archives per database
- Efficiently sharing contents between identical files within or across archives
- Support for symlinks
- Support for empty directories

Here's the schema:

>
    -- File contents, divided into chunks which may be compressed.
    CREATE TABLE Chunks (
        cid         INTEGER REFERENCES Contents ON DELETE CASCADE,
        seq         INTEGER,                     -- Sequence number.
        chunk       BLOB NOT NULL,               -- Maximum $maxChunk bytes.
        PRIMARY KEY (cid, seq)
    ) WITHOUT ROWID;
    CREATE INDEX ChunksCid ON Chunks (cid);
>
    -- File contents, compressed with deflate if cmpSize is less than uncSize.
    CREATE TABLE Contents (
        cid         INTEGER PRIMARY KEY NOT NULL,-- Content ID.
        hash        BLOB NOT NULL UNIQUE,        -- SHA3-256 checksum.
        uncSize     INTEGER NOT NULL,            -- Uncompressed size.
        cmpSize     INTEGER NOT NULL             -- Compressed size.
    );
>
    -- Filenames with paths.
    CREATE TABLE Paths (
        pid         INTEGER PRIMARY KEY NOT NULL,-- Path ID.
        path        TEXT NOT NULL UNIQUE         -- Filename with full path.
    );
>
    -- Files in an archive.
    CREATE TABLE Files (
        aid         INTEGER REFERENCES Archives ON DELETE CASCADE,
        pid         INTEGER REFERENCES Paths,
        cid         INTEGER NOT NULL REFERENCES Contents,
        mode        INTEGER NOT NULL,            -- 0 normal, 1 exec, 2 symlink.
        mtime       INTEGER NOT NULL,            -- Last seen modification time.
        PRIMARY KEY (aid, pid)
    ) WITHOUT ROWID;
    CREATE INDEX FilesAid ON Files (aid);
    CREATE INDEX FilesPid ON Files (pid);
    CREATE INDEX FilesCid ON Files (cid);
>
    -- Collections of files.
    CREATE TABLE Archives (
        aid         INTEGER PRIMARY KEY NOT NULL,-- Archive ID.
        label       TEXT NOT NULL UNIQUE,        -- Name of archive.
        timestamp   INTEGER NOT NULL             -- Seconds since epoch.
    );
>
    -- Create reserved content ID 0 signifying empty directory.
    INSERT INTO Contents VALUES (0, '', 0, 0);
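
For illustration, reading one file back out of this schema is just an ordered walk of its chunks, something like the following (sketch; `:aid` and `:path` stand for bound parameters):

>
    -- Sketch: fetch the (possibly deflate-compressed) chunks of one file, in order.
    SELECT c.uncSize, c.cmpSize, ch.chunk
      FROM Files f
      JOIN Paths p ON p.pid = f.pid
      JOIN Contents c ON c.cid = f.cid
      JOIN Chunks ch ON ch.cid = c.cid
     WHERE f.aid = :aid AND p.path = :path
     ORDER BY ch.seq;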

[See also](/forumpost/b022e1be1f).

(22) By Yitzchok (ychakiris) on 2020-04-13 16:37:12 in reply to 21.1 [link]

Can you share the "glue" code that uses this schema to do the work?

(23) By ddevienne on 2020-04-14 13:39:08 in reply to 21.1 [link]

Hi Andy. Is this write/append-only?

Or have you added the logic somewhere in the code to get rid of dangling Paths and Contents entries?

With UPSERT to count references to Paths and Contents, and triggers on delete to *manually cascade* the removal of rows from those tables once they reach 0 *refcounts*, you'd support updates and deletes at the schema level, no?
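
Something along those lines might look like this (untested sketch against the schema above; it uses `NOT EXISTS` checks rather than stored refcounts, and assumes `PRAGMA foreign_keys=ON` so the Chunks rows follow through the existing `ON DELETE CASCADE`):

    -- Untested sketch: once a Files row is gone, drop its Contents and Paths
    -- rows if nothing else references them.  cid 0 (empty directory) is kept.
    CREATE TRIGGER FilesCleanup AFTER DELETE ON Files
    BEGIN
        DELETE FROM Contents
         WHERE cid = old.cid AND cid <> 0
           AND NOT EXISTS (SELECT 1 FROM Files WHERE cid = old.cid);
        DELETE FROM Paths
         WHERE pid = old.pid
           AND NOT EXISTS (SELECT 1 FROM Files WHERE pid = old.pid);
    END;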