Fossil Forum

Fossil Curious; Big Video Archive Project, can Fossil Do This? Or Two Fossils?

(1) By PhiTaylor (PhilienTaylor) on 2023-10-03 01:51:27 [link] [source]

Hi there, I'll introduce myself a bit later. Until then, hi! My name is Philien, feel free to call me Phi. A friend recommended Fossil SCM for a project I was working on, and as I researched Fossil, I decided I really liked this SCM. Problem is, that's a completely different project. This post describes the needs of the data I'm archiving, and why I am considering using Fossil.

What My Project Is

I'm kinda a data archivist; news files, art galleries, stories, music, etc. However, one dataset I've been archiving I have also been publicly curating. I'm a bit bashful to admit the contents because it involves market exploitation through commodity fetishism. I learned this term in the last two to three years, and it goes quite far to describe the literal iconographic intents of brands, their creative directors, and the teams hired to create this largely uncredited media. My goal is to be able to share this massive database of video files in a way that is, for all intents and purposes, indelible.

DCMA and History Erasure

While data on the internet goes missing quite frequently, the commercial art that is published to mass media doesn't deserve to be erased or forgotten. Companies spend millions, or even tens of millions, of dollars researching and developing an advertisement campaign before finally hiring international ad agencies to create this media. At the end of the day, this media goes largely uncredited, paying its creatives a one-time fee and providing no residuals for the many, many times their media gets repeated. So when a brand wants to pivot, why should we permit it?

I couldn't even finish a list of the companies who, once caught in international conflict and ethical failures, seek to rebrand, rename, and reenter the market as if nothing has happened. In other cases, a brand's trademark is brought into another market, and the image sought for that new market doesn't match the brand's history elsewhere. This is where DCMA comes in.

While many folk might have joined the community I started because we were caught up in this commodity fetishistic mess--cognitohazards, you might say--my goal is to end this by providing a growing archive to the public, while keeping its maintenance only permitted to a select group of authorized individuals.

Sound familiar?

Fossil-SCM, "Not a Blockchain" and Wiki/Tags

My goal is to make this as painless as possible for me, even if it seems a bit late for that. The content I have likely numbers somewhere in the first thousand files by now. Secondly, it is in my interest to curate the tags and tagging of the media with the ethics I have now that I understand why I began the project in the first place. Finally, the other solutions I've looked at just don't cut it compared to Fossil. I imagine that, if I wanted, I could torrent my fossil file alongside a filesystem mounted from that fossil file. This means that users who want to maintain a copy of the database itself could copy the fossil file, without being able to easily cryptographically crack the administration of this data. (... Correct? This is a question, here... 'u'; ) Users could also use a live hosted mirror of this Fossil-SCM to review which artifact they're looking for, and ideally fish it out of the mounted filesystem. ( Is there a method to keep filenames somehow? 'u';; Or even folder structures? I thought this was not possible with artifacts, but.. ) Moreover, I read this post: Tags for Wiki Pages, and it's led me to believe that I could create a Wiki page using the web interface for each campaign, and for each ad in the campaign with its edits and its multiple file publications, and tag them: the wiki pages, and if possible, the files.

It's incredibly ambitious for a file database that might measure in Gigabytes, but perhaps I could use two fossil files? In any case, here's why other ideas I've considered just don't cut it as well:

Failed Competitor Ideas

Hugo or Orchid or Jekyll

I once tried an ad database composed of server-side includes. This is as messy as you might imagine. From what I've seen of Orchid, Hugo, and Jekyll, none of these seems to work with an actual database system, even NoSQL or SQLite. I doubt I'm mistaken, though it's possible I am, given my limited research. This would be a case where I could create one massive website database file of content, share it, and if somebody wanted to run the template-generating engine upon it, they could. However, maintaining a file of this size by hand, or even using the UI of, say, Hugo, seems lacking for me. It also means that if I used IPFS, or a static web/website, I would need to generate the whole website's template files before publishing it, and then validate the entire publication. It has value, but it would need more stability and self-maintenance.

Git or GitHub

This option seemed to have potential, until I learned about LFS while researching the original project I was discussing. As far as I've seen, Git handles very large files as poorly as perhaps any competitor: who wants to host them? While I have considered GitTorrent as a medium for sharing this content, I could simply put the fossil-scm content into a GitTorrent commit, or just a plain torrent. Next, Git also doesn't include the relationships between these files--I would need a separate database/content file for that.

External Webhosts

Good luck getting random people to fund projects. If this were easy, core-js, Orchid, and a bunch of other open-source projects would be funded. I'm not in a position where I can afford a central web host and data host for this. Moreover, that provides a central point of failure. Free file-sharing hosts are perhaps good for individual files, but for a collection in the size of gigabytes, I can see this file eventually being blacklisted or DCMAed again.

My Big Asks of Fossil

Where this leaves me is that Fossil-SCM can handle almost everything I need. I could run it off an Android device in my home. I could add files, I could create wiki pages connecting to those artifacts, and then tag those pages. I could add sub-administrators, then sub-sub-administrators, editors, translators, taxonomy implementers, and those who would pass the privileges along to another generation one day. And I could allow people to clone the whole repo, or simply provide the fossil file so they may do so. ... Right? ... Yet:

  • How could I create thumbnails for pages in an automated fashion? (As I understand it, the Fossil templates can be edited to present data structures differently. Might this work for allowing an automated thumbnail script or program to run? How safe is this, haha?)

  • Is it possible, now, to tag wiki pages? Or to create Documentation pages for each File/Artifact, or a collection of File/Artifacts? Or to connect Wiki pages to these files/artifacts? If so, is this some more template programming wizardry? (This is a must-have, even if it means each time the server comes on it must calculate interrelatedness between hundreds of files...)

  • How can Fossil-SCM manage a database full of large, unchanging video files? (I am more than willing to include a second repository of the actual media being referred to, but doesn't this open a whole new can of worms? Or is this the only solution to making this issue manageable?)

  • Are the security principles I'm suggesting in distributing/torrenting the raw Fossil file plausible? (What competitor allows you to share an encrypted database of content, administrative privileges, and version history? I don't think any exists..)

  • I would settle for something that does this via SQLite and that I could put into Git, if I have to. (I can't just run the program from a local phone or server to get up and running, but the SQLite file might as well be the most raw form of the data I seek to share, even if it means releasing it all publicly..)

Worthiness

If you aren't aware, there are massive corporations now that international mega-brands use to research competitors across countries and decades. I should know; one of them had allowed public access for researching decades of advertisements. The only other archival history of television and advertising I know of at this scale was Marion Stokes'. Other websites, one by one, have died. Ad-rag.com.. The rest don't come to mind at this point.

About Me

Call me Philien or Phi Taylor. Those of you who know me know me by other names, and that's fine. My pronouns are tsey/tsem/tseirs, or she/her/hers, or any. The website I'm talking about is AKA WackyWildTVAds.com, which has attracted a variety of international research interest besides that of hobbyists. (I think I can count... one. One group. XD)

Thank you

I want to thank everybody here for reading my message and considering my use case. I want to thank everybody here for developing Fossil-SCM, which seems to be a kind of technology that could solve at least one, if not many, of my problems, if I am clever enough to implement the how. I'm at the point where I don't know if I have enough free time to start banging a wrench at the system until I know whether the features I need can be implemented.

Thank you again for an amazing technology, everybody here at Fossil-scm.org.

Phi Taylor

(2.2) By Warren Young (wyoung) on 2023-10-12 21:33:57 edited from 2.1 in reply to 1 [source]

This is where DCMA comes in.

I assume you mean "DMCA", but I can't see the connection between your project's requirements and this US law. The DMCA certainly doesn't help you, and Fossil doesn't magically shield you from its requirements.

IANAL, but as far as I can tell, laws like the DMCA cut both ways in cases like yours. First, you remain legally obligated to use Fossil's shunning feature to respond to take-down demands in covered jurisdictions. But second, while Fossil does let your users maintain their own clones while disregarding the shun list, preserving their local copies of artifacts shunned in the parent repo, laws like the DMCA allow the complainant to go after any of them that re-share their clones.

Witness the current pressure being put on Archive.org to get a preview for what you're letting yourself into with this project.

massive database of video files in a way that is, for all intents and purposes, indelible.

The semantic content may be indelible, but as one who has been working in the digital video sphere for decades now, I can tell you that there's no reason to believe in the existence of…

    One format to rule them all,  
    One format to find them.  
    One format to bring them all,  
    And in the darkness of the hot-swap enclosures bind them;
    In the Land of Fossil, where the `BLOB` table entries lie.

The file you first collected in 1998 in Real Video format had to be converted to Flash Video in 2000, then into MPEG-2 for rebroadcast, then into Ogg for its purported "archival" nature, then into the original MPEG-4 part 2 era formats to get onto the iPhone, then into FFV1 when it became clear that Ogg wasn't going anywhere, then into H.264 (MPEG-4 part 10) as the legacy devices requiring part 2 video aged out, then into H.265 when…

If you think we're stopping with H.265 henceforth, you have far more faith in the stability of video codecs than I do.

And we haven't even gotten into the actual container formats yet. We'll get to that shortly.

I could torrent my fossil file…

Only if it's locked against all modification. Potential data corruption issues aside, I recall enough about the BitTorrent protocols to know that they require that the underlying file not change after it begins seeding, else the whole torrent referring to that file has to be recomputed and re-seeded.

You might get away with quarterly archives or similar where you clone from the master repo and then seed that, but it's unreasonable to expect to spread the ever-changing archive over BitTorrent, continuously.
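As a sketch of what that periodic-snapshot workflow could look like from the shell (the URL, file names, tracker, and the choice of mktorrent are all placeholders/assumptions, not a recommendation):

  # freeze a point-in-time copy of the archive as its own, never-to-change file
  $ fossil clone https://ads.example.org/archive archive-2024Q1.fossil
  # wrap the now-immutable clone in a torrent; the tracker URL is a placeholder
  $ mktorrent -a udp://tracker.example.org:1337/announce -o archive-2024Q1.torrent archive-2024Q1.fossil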

A far better choice in that regard is Fossil's own sync format, but we will shortly get to why you're likely to be unsatisfied with that, too.

users who want to maintain a copy of the database itself could copy the fossil file, without being able to easily cryptographically crack the administration of this data.

I'm reasonably sure you're hoping for a condition that Fossil does not guarantee. Nothing I as a Fossil repo host can do will prevent you from modifying your local clone. The only thing Fossil's hash trees guarantee is that if you do that, you will never be able to complete a sync with my repos again. It doesn't prevent you from going off and re-hosting your hand-hacked copy separately.

Is there a method to keep filenames somehow? 'u';; Or even folder structures?

Fossil already does that. One of the ways to look up a given file in Fossil is by name, including paths. Every checkin's manifest includes a list of checkout-relative paths to the leaf file names present in that checkin.
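For example, a minimal sketch of path-based lookup from the command line (the path shown is hypothetical):

  $ fossil finfo ads/1998/acme-superbowl.mpg         # history of one file, by checkout-relative path
  $ fossil ls -r trunk                               # every path present at the tip of trunk
  $ fossil cat ads/1998/acme-superbowl.mpg -r trunk  # that path's content as of a given check-in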

That said, Fossil is not a general-purpose filesystem. The biggies are that it doesn't maintain file permissions, and its sole provision for declaring the existence of an empty directory is fairly weak.

If you want a globally sync'ed filesystem, use a tool made for the purpose such as SyncThing.

might measure in Gigabytes

Might? Might?

Single video files can easily crack the multigigabyte boundary.

You'd best be thinking in terms of terabytes for a small video archive.

How could I create thumbnails for pages in an automated fashion?

I cannot conceive of a world where that is in-scope for Fossil. Any change requesting such features is certain to be rejected, if only on the grounds that there are zillions of file formats, and where do you stop putting renderers for them all into Fossil?

This is much better handled outside Fossil, where you build an infrastructure to take arbitrary files and convert them into representative thumbnails somehow. Scripts for doing this with ffmpeg on video files are trivially web-searchable, for example.
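A one-liner along these lines is usually all it takes (a sketch; file and directory names are placeholders):

  # grab a single frame five seconds in and scale it to a 320-pixel-wide JPEG
  $ ffmpeg -ss 00:00:05 -i spot.mp4 -frames:v 1 -vf scale=320:-1 thumbs/spot.jpg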

And no, don't ask for a way to make Fossil call ffmpeg to do this on your behalf. We'd then have to add sox for audio formats, pandoc for text documents, some horrid lash-up of Pandas and .NET for spreadsheets, an even more horrid lash-up for 3D that'd end up dragging in WaveFront OBJ, FBX, COLLADA, Alembic, and USDZ as interchange formats… Bleah.

Fossil will never be a "render any file to a thumbnail" tool. Even broad-based specialist tools like ffmpeg don't render everything-to-everything.

How can Fossil-SCM manage a database full of large, unchanging video files?

Poorly, is how.

The very first thing you have to get around is SQLite Limitation #1, the 2 GiB BLOB size limit. If you have no videos over that size today, I don't see how that will continue given current trends.

Me, I've got several single videos an order of magnitude bigger than that in my iTunes library.

There's hope. There are multiple video processing schemes based on chunked content:

  • Apple chunked MPEG-4
  • HLS, based in part on the previous, with repackaging of the chunks in MPEG-TS format and M3U8 playlists (see the sketch after this list).
  • DASH, the MPEG consortium's attempt to do HLS differently
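As an illustration of the HLS item above, a minimal ffmpeg sketch (assuming an H.264/AAC input so the streams can be stream-copied rather than re-encoded; file names are placeholders):

  $ ffmpeg -i spot.mp4 -c copy -hls_time 6 -hls_list_size 0 \
        -hls_segment_filename 'spot_%05d.ts' spot.m3u8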

These schemes will break your huge video files into small (~2 MiB typical) chunks that Fossil can more easily digest, but now you have a new problem: Fossil's all-or-nothing sync implementation. The closest thing Fossil has to incremental sync is retrying the sync until it succeeds.

Let's say you've built your archive to, let us say, a hundred gigs. Someone comes along and wants to clone it. If they cannot get all one-hundred gigs down in a single pull, they've got to start over and try again.

Now, if someone manages to succeed in that, and a file then changes, Fossil's sync protocol does become incremental, but only insofar as it gets you to a complete sync again. Let's say you add 20 gigs of stuff to the archive and someone comes along with their 100 gig clone and does a "fossil pull" from your central server; until they pull all 20 gigs of the delta between their repo and yours, they must keep retrying from the start.
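In practice that retrying can be scripted; a minimal sketch, assuming fossil pull exits non-zero when the transfer fails:

  $ until fossil pull; do sleep 60; done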

an encrypted database of content

I believe you're confusing the cryptographically-strong hashes Fossil is based on with "encryption." Fossil doesn't provide repo encryption short of integrating it with SEE.

(3) By PhiTaylor (PhilienTaylor) on 2023-10-03 12:32:24 in reply to 2.1 [link] [source]

I assume you mean DMCA

I keep doing that. Next, I'll be saying DCMC. XD

Thank you, Warren, I hope to get to reply to this at length in the next 24 hours. A quick scan suggests there are many problems (i.e., the future of video advertising, perhaps even downloadable virtual worlds) I need to consider. The good news is that I do have answers for some of these issues, so I look forward to keeping my education going within our conversation. Thank you again! :)

(4) By Stephan Beal (stephan) on 2023-10-03 13:01:45 in reply to 3 [link] [source]

A quick scan suggests there are many problems

The most basic one (of several) is that fossil is very much not ideal for multimedia collections. You mention that some content never changes. Fossil is very much optimized for text files which do change. If your files never change, they might as well be archived as-is directly on disk or in ZIP files, with no SCM.

In short: though it may superficially appear to be, fossil is most certainly not the tool you want for the task you've taken on.

(9) By PhiTaylor (PhilienTaylor) on 2023-10-12 02:16:32 in reply to 2.1 [link] [source]

Dear Warren,

Your thorough and thoughtful reply casts all sorts of light and context onto my use case, which I have been considering for long enough that I believe I have responses to all of your queries, as well as a few new queries of my own that you've begotten. Let's get started.

Witness rights holders/publishers preventing Archive.org from keeping the remaining copies of their work available by e-loaning

(My words, not yours, Warren ^u^) The attack on information access is international and far reaching. -_-

I can't see the relations between your project and DMCA

It's quite possible that a rights holder from one country might dislike how the product and brand they purchased was marketed in another country. Erasing the reported connection of their commodity and branding to the marketing would be a way of erasing the cost of their ignorance regarding their investment's history: attack the proof.

As others stated, I don't need to keep the evidence that can be DMCAed; the footprint (metadata) and the fingerprint (hash of the media file) are all I need. I just hope such tech gets used for the good of all mankind.

"Life feeds on life. In your petty pursuit of family redemption, you consume those who rally to your cause, and in so doing, you strengthen the Thing - accelerating the end! This is as it should be...it is why you are here."

I replaced your Lord of the Rings parody with a direct Darkest Dungeon quote. I remember losing hope trying to find the right codec pack that would open some of these files. T_T The work that the VideoLAN team does to make VLC play these ancient Real Media files, or QuickTime files, or whatever, humbles me greatly. You have surely shooketh my bones as they recall pre-2010 like some media decoding torment... Lol.

Yeah. I don't know if I'm going to be as noble or as heroic as anybody I might glorify. There might be no best way to create this archive and I might fall short of anything enduring. If VLC Player breaks or goes away, for example, I might be really lost. B)

You might get away with Quarterly releases...

Yes!!! That's great!! That's how I do it? Close the file/lock down/stop the server, and I can torrent it? B) yesss. That was my plan. The Syncing thing sounds nice though..

"rehosting your hand-hacked copy separately.."

This sounds like what I was expecting. A user would need to know my password/admin structure to update the original repo, instead of just forking their local copy. Am I mistaken? The encrypted SQLite thing you mention later might be a bit pricey due to compiling AND getting the license..

"terabytes"

*audible gulp* It's one thing to have terabytes of space and not use it. It's another to have terabytes of files but only care about a few hundred megs.. One day VR ads might be something that needs to be in this database. Whew. :x

"Fossil will never be a 'render any file into a thumbnail' tool."

I guess the furthest stretch I might consider is some web browser based embedding of a reference to the file, and approving sending it to that service for a thumbnail back... Yet if I'm not storing files in Fossil in the first place, it seems beyond me. ffmpeg script found online it is!!

"fossil's all-or-nothing sync implementation"

This was really educational, and it sounds like I benefit from having a large network of synced peers at varying points of their sync. Yet, I also imagine this implementation has a purpose. I imagine the only other option might be to create sub hashes of each chunk of the fossilfile, and keep those handy somehow. I have no clue if this is an ideology thing or a technology thing, though...

"Fossil doesn't provide repo encryption"

As I said before, is this your way of saying somebody could steal the integrity of the initial commit from me? I'm not sure how else to phrase what I'm describing here. A shallow example might be that I create a user who can modify, they change the password for that user in their local repo, then submit a change from that user that tries to get merged back in. Anybody else with my original would need to approve the password change, right? Versus blindly accepting a modification that claims the password change never happened. So, I'm hoping your message is just a comment on how to create a way fossil files can require passwords to even open the codebase. :)

Thank you for everything, Warren. 🙇‍♀️

(11) By Warren Young (wyoung) on 2023-10-12 21:34:31 in reply to 9 [link] [source]

I believe the only detail left worth covering this far after all the other replies you've gotten is that you should look into Fossil's RBAC system to address your concerns over who can edit what. It has nothing to do with encryption or file hashes.
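A minimal sketch of what that looks like from the command line (the user name, password, and exact capability letters here are illustrative; check "fossil help user" and the Admin → Users page for the authoritative letters):

  $ fossil user new curator1 "Curator One <curator1@example.org>" s3cret
  # grant check-out, check-in, hyperlink, and wiki read/write capabilities
  $ fossil user capabilities curator1 oihjk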

(27) By Vadim Goncharov (nuclight) on 2023-11-19 20:06:03 in reply to 2.2 [link] [source]

Let's say you've built your archive to, let us say, a hundred gigs. Someone comes along and wants to clone it. If they cannot get all one-hundred gigs down in a single pull, they've got to start over and try again.

Now, if someone manages to succeed in that, and a file then changes, Fossil's sync protocol does become incremental, but only insofar as it gets you to a complete sync again. Let's say you add 20 gigs of stuff to the archive and someone comes along with their 100 gig clone and does a "fossil pull" from your central server; until they pull all 20 gigs of the delta between their repo and yours, they must keep retrying from the start.

What? Do you mean Fossil's sync protocol is so ineffective that it cannot work on a per-artifact basis, so a retry always goes from the start?..

If it's really so and not my misunderstanding, then the architecture must be rethought.

(28) By Stephan Beal (stephan) on 2023-11-19 20:15:45 in reply to 27 [link] [source]

Do you mean Fossil's sync protocol is so ineffective that it cannot work on a per-artifact basis,

Any given artifact can be made up of any number of individual deltas, which fossil will (IIRC) send across the wire as-is, so the reality is more complicated than syncing individual artifacts in atomic units.

If it's really so and not my misunderstanding...

Fossil's sync works in a transaction, and if the transaction fails, everything changed by that transaction is rolled back.

... then the architecture must be rethought.

"If it ain't broke, don't fix it."

Fossil is made for small- and mid-sized source trees, not 20GB beasts. The fact that such use cases cause it grief is not unexpected, but also not a problem because fossil is not the right tool for such jobs.

(31) By Andy Bradford (andybradford) on 2023-11-20 01:06:51 in reply to 28 [link] [source]

> Fossil's sync  works in a  transaction, and if the  transaction fails,
> everything changed by that transaction is rolled back.

Just a bit  of clarification to avoid misunderstanding.  While it's true
that if  there is failure during  a transaction it will  be rolled back,
Fossil's sync command  is made up of multiple  distinct transactions. If
there  is an  error during  the sync,  the sync  command fails,  but any
committed transactions are retained and  only the current transaction is
rolled back.

https://www.fossil-scm.org/home/info/16da1b6dff02f503

So if  there are 100  megabytes of artifacts  to transfer, and  after 10
megabytes the network drops (2 rounds  of 5 megabytes each), Fossil will
retain the 10 megabytes that were successfully committed and on the next
sync will bring down the missing 90 megabytes.


Andy

(29) By Andy Bradford (andybradford) on 2023-11-20 00:40:44 in reply to 27 [link] [source]

> Do you mean Fossil's sync protocol is so ineffective that it cannot
> work on a per-artifact basis, so a retry always goes from the start?..

I think  there is  some confusion  and misunderstanding,  let me  try to
clarify this a bit.

There is a difference between cloning (the "clone protocol" if you will)
and  the  "sync protocol".  Cloning  behaves  slightly differently  than
the  "sync  protocol" in  that  the  entire  repository must  be  cloned
100%  before  the  "sync  protocol"  is used for incremental changes.

Once  you have  completely cloned  the  repository, then  every sync  is
incremental. Furthermore, each  sync is comprised of  multiple rounds of
data synchronization, each of which is committed in its own transaction
based upon a default transfer size that Fossil uses; so the 20 "gigs" of
new material that  is not yet in your repository  will be synchronized 5
"megs" at a  time, committed, and then  the next round will  pull in the
next 5 "megs",  etc., until completely synchronized.  If something fails
during the sync, only the current transaction for the batch of artifacts
will be rolled back, not all. This means that if your sync progresses 10
"gigs", and  then drops the connection,  you will only need  to sync the
next 10 "gigs", not all 20. Only  if the sync has a single artifact that
is 20 "gigs" would you have to sync all or nothing.

At one point I started doing some work on making the "clone protocol" be
able to resume from failure:

https://www.fossil-scm.org/home/timeline?r=clone-resume

I probably got distracted and forgot  about the branch and then it looks
like it was  administratively closed (probably due to  inactivity). If I
get some time I may dig into it again.

Thanks,

Andy

(30.1) By Andy Bradford (andybradford) on 2023-11-20 00:52:16 edited from 30.0 in reply to 29 [link] [source]

> At one point I started doing  some work on making the "clone protocol"
> be able to resume from failure:
>
> https://www.fossil-scm.org/home/timeline?r=clone-resume

It looks like Richard also had an interesting idea for "fast clone" that
doesn't seem to have been developed fully:

https://www.fossil-scm.org/home/timeline?t=fast-clone

Andy

(32) By graham on 2023-11-20 02:03:55 in reply to 30.1 [link] [source]

Very much thinking aloud... I've not looked at either of the previous attempts you mention, nor know enough of the low-level workings of Fossil to know if the following approach is even feasible, but – instead of trying to make the "clone protocol" resumable – could one somehow create an "empty" repository and then use the (already resumable) "sync protocol" to pull in all changes?

(5) By anonymous on 2023-10-07 13:18:55 in reply to 1 [link] [source]

...While data on the internet goes missing quite frequently, the commercial art that is published to mass media doesn't deserve to be erased or forgotten.

...My goal is to be able to share this massive database of video files in a way that is, for all intents and purposes, indelible.

Can't say I can clearly see the extent of your project goals. However, about using Fossil for the purposes of archiving -- Fossil's strength is in tracking changes to what may be considered "related" artifacts, like the source files of a program, chapters of a book, etc.

In your use-case, the individual media files that you want to archive are not directly related, so there's no real need for tracking them together.

Also, those art pieces are in a way stale, or what could be thought of as single-version (unversioned). Again, there is no need to track change. Even for revisions of the same art, Fossil will not be able to show the change, only flag that the media is different.

You may consider using Fossil to track the metadata about the art you wanted to preserve. Indeed, the metadata as such forms some kind of catalog/taxonomy of the media. So basically, it's a library catalog, which is extendable, classifiable, editable, versionable. Fossil can do that!

As for the actual media -- it does not need to sit together with the classification. You may develop some reference scheme similar to doi: and a way to route the references to the actual media files (wherever you'd choose to host them). My understanding is that it's the actual works of art which fall under the DMCA, not their metadata.

So you, as an archivist, could distribute/host the catalog, not the actual media.
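A sketch of what a single catalog entry might look like in practice (the file names and layout are made up; sha256sum is the GNU coreutils tool):

  $ sha256sum media/1998-acme-superbowl-spot.mpg     # fingerprint of the media kept outside Fossil
  $ $EDITOR catalog/1998-acme-superbowl-spot.md      # record brand, year, the hash, and where copies live
  $ fossil add catalog/1998-acme-superbowl-spot.md
  $ fossil commit -m "catalog: 1998 ACME Super Bowl spot"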

(8) By PhiTaylor (PhilienTaylor) on 2023-10-12 00:54:22 in reply to 5 [link] [source]

Hi there, Anon,

Based on what you've discussed, and what I've read in Konstantin's reply, I might be able to link to the files using git-annex, if those files are git-annex archived into The Internet Archive, for example. You suggest exactly this kind of meta-referencing when you mention "doi".

I.e., I link to the hash of the file via whatever it might be referred to by, if specifically shared/torrented. This would either link to the file as git-annex archived on the net, or to a magnet link just to the hash.

It's a pretty big formatting headache that I would pray the Fossil system might let me wiggle towards. This is probably why I was talking about nesting Fossil repos, or referencing one repo with another: I could point to a specific set of files via a GUI or a wiki/doc template and have a programmatic outcome (link to location in repo, magnet link to hash). I'm aware git-annex has a web UI, so that could help.

Thank you for your help. 🙇‍♀️

(6) By Konstantin Khomutov (kostix) on 2023-10-08 15:35:38 in reply to 1 [link] [source]

To me, it sounds like a task for git-annex.

Having said that, I should note that I have not used that suite myself, so I cannot actually say whether it can fulfill all or most of your requirements.

(7) By PhiTaylor (PhilienTaylor) on 2023-10-12 00:30:09 in reply to 6 [link] [source]

Hi there, Konstantin,

I'll look more into git-annex. It looks interesting, and possibly useful.

(10) By patmaddox on 2023-10-12 16:55:17 in reply to 6 [link] [source]

I have used git-annex a bunch, and it’s well-suited for this.

The only caveat is that git slows down a lot once it gets to a few hundred thousand files.

(12) By Offray (offray) on 2023-10-31 15:46:46 in reply to 1 [link] [source]

I'm kinda a data archivist; news files, art galleries, stories, music, etc. However, one dataset I've been archiving I have also been publicly curating. I'm a bit bashful to admit the contents because it involves market exploitation through commodity fetishism. I learned this term in the last two to three years, and it goes quite far to describe the literal iconographic intents of brands, their creative directors, and the teams hired to create this largely uncredited media. My goal is to be able to share this massive database of video files in a way that is, for all intents and purposes, indelible.

I'm more of a data hacktivist. I also care about the longevity of data and the reproducibility of data-supported claims/artifacts, and I use Fossil for that, but usually in the context of data storytelling and open research.

Because of the design focus/limitation of Fossil regarding textual files that change, rather than media files that don't, as has been explained in this thread, I think that Fossil can be better used, in the context you describe, as a resilient index of media metadata using cypherlinks (hashes that point to content, like in torrent files) instead of hyperlinks (which point to server addresses and are less resilient). Other related artifacts, like site screenshots, and even the scripts that you use to install the software that creates such screenshots (let's say Puppeteer), could also be in the Fossil repository. In this way, anyone cloning the repository should be able to replicate/update the project with new media files, site screenshots, and media analysis scripts and/or data narratives, for example.

HTH,

Offray

(24) By PhiTaylor (PhilienTaylor) on 2023-11-03 02:41:50 in reply to 12 [link] [source]

Thank you very much, Offray. This does, more or less, help, since I haven't heard the term "Cypherlink" before. :)

Otherwise, I'm still holding myself back from installing Fossil on my local mobile. I'm intimidated by whatever scripts or 'puppeteering' system I'll need to auto-generate the thumbnails, cypherlinks, and so on. *shrug*

Thank you for being a data hacktivist! :3

(13) By tmesis on 2023-11-01 14:08:37 in reply to 1 [link] [source]

This is incredibly out of scope for the core functionality of fossil, but I have a similar project that I wanted to work on (curating a bunch of large binary files along with arbitrary metadata, versioning, etc) and fossil felt so close to perfect for the project. The big problem is that fossil is bad at repositories of large(ish) binary files.

My dream is that I could have fossil coupled with a zfs zpool that is stored on the filesystem as a set of large files. The zpool could be mounted and you could put files in there, and the fossil repo would manage rich metadata files that reference the zfs binary files (link to them or something). Maybe they would be structured YAML or JSON or something so that the versioned repository of file metadata could be somewhat automated.

But this would allow someone to see the state of the binary repo by browsing the metadata records in the fossil repo without having to pull down the entire archive. People could do archival work (adding annotations and new research notes to the metadata files) without pulling down the binary file repo. But the entire repo would still only be a few files, the fossil repo and the set of large zpool files. I think this is much nicer than a sprawling git repo or a tar archived one.

You still would not be able to sync individual file objects, the zpool would need to be mounted to access the binary file data. But in theory, the system could store all of the zfs config information in fossil, so that you could automate a zfs send/recv and data could be efficiently synced between storage locations.

Maybe some day... :)

(14.1) By Warren Young (wyoung) on 2023-11-02 09:07:04 edited from 14.0 in reply to 13 [link] [source]

a zfs zpool that is stored on the filesystem as a set of large files

I hate to be pedantic, but you're tossing yourself a word-salad, creating a lot of confusion for yourself here. It's not entirely your fault; ZFS terminology is confusing and doesn't map 1:1 to traditional Unix terms.

What you'd want here is a ZFS filesystem, which sits atop a ZFS pool. A ZFS filesystem looks like a traditional Unix FS in that it appears as a mounted directory, but it operates more like a Fossil repo check-out in that it can be versioned — which ZFS calls snapshotting instead of committing — and then synced via "zfs send/recv".
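For readers unfamiliar with those terms, a minimal sketch (pool, dataset, and host names are placeholders, and the commands need root or delegated permissions):

  $ zfs create tank/repo                       # a ZFS filesystem inside the pool "tank"
  $ zfs snapshot tank/repo@2023-11-02          # version it; ZFS calls this snapshotting
  $ zfs send tank/repo@2023-11-02 | ssh mirror zfs recv tank/repo   # replicate it elsewhere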

Alas, we're missing a critical feature in ZFS relative to Fossil, being support for a snapshot hierarchy to mirror the Fossil commit DAG. As far as I'm aware, ZFS won't let you "fork" a snapshot. Whereas saying "fossil up ABCD1234" reverts to that hex-named version of the history in the checkout directory without affecting any of its descendant check-ins, saying "zfs rollback tank/repo@ABCD1234" destroys the current filesystem state and all intervening child snapshots back to that ancestral version. Without the ability to revert to an ancestor check-in/snapshot and then fork it, you can't have an equivalent of branches, much less merges, and don't you dare even dream of cherry-picks and backouts.

Without a way to mirror Fossil's commit DAG in the ZFS snapshots, I don't see how you can maintain an efficient history of these big files.

You might be tempted to respond, "So I will have a trunk-only Fossil repository, then." Problem is, that ignores the CAP theorem's application to Fossil, which is that if two different users make a check-in atop their local ZFS filesystem and then try to sync them, that creates a fork, something Fossil can handle but which ZFS cannot. With ZFS, you'd have to pick one system's child snapshot at the fork point to keep and throw the other away.

(15) By tmesis on 2023-11-01 16:19:10 in reply to 14.0 [link] [source]

Thanks for the clarification; pedantry is encouraged, and it's very easy to get confused by the layers and terminology. I may not have been clear here, but I don't mean JUST "store the blobs as big files in a zfs filesystem". I mean using big files as the block storage for the filesystem. For example, taking four files (say data{1..4}.z), running zpool create blobstore ./data1.z ./data2.z ./data3.z ./data4.z, and then creating the zfs filesystem in that pool.

The goal of this mess is: have the blobs and the repo as a few large-ish files that are portable. The repo has all of the awesome fossil features (wiki, docs, rbac, etc), and the binary data is stored in the four data files which zfs is mounting as a contiguous filesystem. All together represent the repository.

I don't think it would be possible or sensible to totally mirror the fossil DAG into zfs. Instead, we would just be using zfs for storage of the blobs, and maybe making use of snapshotting or copy-on-write as an added layer of durability. And the efficient sync tools built into the software. I don't think the SCM paradigm maps very well onto a filesystem, even one as sophisticated as ZFS.

I was envisioning the fossil repo handling all of the DAG-related stuff, but each commit could point to one or more blobs in the zfs filesystem. The filesystem would probably not change very often, the files will be more or less static (and occasionally updated).

If two committers update a blob-metadata pair at the same time, the metadata file merge would be handled gracefully by fossil, but the blob resolution cannot be, and a human would have to decide if they should be merged or if one of the two replacement candidates is "correct", or if a special snapshot sequence should be made to handle both.

If we didn't need to update the files ever and we were just adding new ones as incremental versions of the older ones (suffixing them with integers or something), and we didn't really care about the zfs features like incremental sync, we could use something simpler like SquashFS or DwarFS for the blob storage file.

But I am not sure that would work as well? ZFS could take multiple files as a pool and mount them as a contiguous space thus allowing someone to easily chunk the blob storage into pieces, just adding another chunk as the space in the pool diminishes. I don't think you can do that natively with squashfs or dwarfs, you would have to make tar chunks.

(16) By Warren Young (wyoung) on 2023-11-01 16:53:52 in reply to 15 [link] [source]

zpool create blobstore ./data1.z ./data2.z ./data3.z ./data4.z

The only reason I can come up with to justify doing that is if you didn't have a ZFS pool already and didn't have any unpartitioned disk space to create one in. I presumed, since you led with the "ZFS" idea, that you had at least one pool already somewhere. Creating one ZFS filesystem per Fossil repo inside that preexisting ZFS pool seems sensible in that context.

each commit could point to one or more blobs in the zfs filesystem

If the structure is "/tank/repo/blobs/ABCD1234..." with each blob being the hash of the file — the same identifier stored in the Fossil manifest for that version of the file — then you've created a new problem, and a particularly bad one from this thread's perspective: a single-bit change to the input data creates a new hash on commit, requiring a complete copy of all the blocks in the blob.

One of the beauties of ZFS is that its CoW nature applies at the block level only, not at the whole file level. Changing one bit in a snapshotted file with a block size of 1M — a good choice for a repo holding files too large to host in Fossil — copies the block that contains that bit, but all other blocks making up that file remain unchanged. In Fossil terms, ZFS has its own variety of delta compression. Giving each file a new name in the backing ZFS data store on each committed change breaks that.

Instead consider the structure "/tank/repo/checkout" where the files are named normally inside that working checkout directory, and the snapshots are named after commit IDs computed the same way Fossil does today. That gets you ZFS-style delta compression, but you lose the ability to fork commits as I have said above.

Propose an alternate structure if you want a different result.

It might help if you read up on the Fossil file formats before answering.

(17) By tmesis on 2023-11-01 19:55:21 in reply to 16 [link] [source]

It was mostly wishful thinking, I hadn't thought it through deeply. I wanted somehow to marry the features of zfs for storing binary objects well and fossil's feature-rich single-file repository structure.

Thanks for the rich responses.

(26) By PhiTaylor (PhilienTaylor) on 2023-11-03 03:05:39 in reply to 17 [link] [source]

Thank you, Warren Young, and tmesis. This subthread has been very valuable to me, since the project which had introduced me to Fossil had me thinking of using ZFS... filesystems.... ?? for block-delta change tracking of video files. I.e., five files, each one changing to a new day's log, which is mostly the same output in a differently timed, striped order.

In this media archive case for storing files, I hadn't considered your project, tmesis. While I like that it is essentially a versioned file system, I might need to cherry-pick changes rather than worry about delta changes, as the files wouldn't change much. So.. I might review what you've written here, Warren, in case I can apply tmesis' dream code some day to this media archiving project, and my other project can use Fossil instead of Git and Subversion/just Git for delta-change-tracking compression across multitudes of versions. Oh yeah, I was thinking of SQLite for this instead, recently...

Thank you all. :3

(18) By patmaddox on 2023-11-01 20:37:23 in reply to 14.0 [link] [source]

As far as I'm aware, ZFS won't let you "fork" a snapshot.

A few things to bring to your awareness, if you don't already know them:

  • .zfs/snapshot
  • zfs clone
  • zfs send/receive

.zfs/snapshot is a read-only dir of the snapshots for a given dataset. If I create a snapshot named "my-precious-backup", there's a dir called .zfs/snapshot/my-precious-backup that is a read-only copy of the filesystem at the time of the snapshot.

zfs clone creates a separate dataset using the same underlying blocks as the original. As you make modifications, it writes the new blocks to the new dataset. If your parent is 100GB and you write 1GB of data to the clone, then the clone only uses 1GB on disk, because they share the first 100GB of blocks.

zfs send/receive creates a dataset using a copy of the blocks provided by a snapshot. If the source has 100GB of data, you send/receive to a new dataset and write 1GB, then the new dataset uses 101GB on disk.
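Putting the first two together, a minimal sketch (dataset names are placeholders):

  $ zfs snapshot tank/archive@my-precious-backup
  $ ls /tank/archive/.zfs/snapshot/my-precious-backup             # read-only view of that snapshot
  $ zfs clone tank/archive@my-precious-backup tank/archive-fork   # writable "fork" sharing unchanged blocks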

In a sense, it's like saying "you can't fork a dir." Sure you can - copy it, now you have a fork. You just don't have a cryptographic reference from the new dir to the original. But with zfs, you do.

As far as merge... don't think so. The closest would be to make a clone or send/receive, and then rsync from one side of the fork to the other. That would merge the content, and would maintain the reference to one parent (the snapshot of the destination dataset) but lose the other parent.

(19) By Warren Young (wyoung) on 2023-11-02 09:07:09 in reply to 18 [link] [source]

As one who has been running ZFS in daily production for years, the only concept I was missing was "zfs clone", which I had never had cause to use. Thank you for bringing it to my attention.

With that in mind, I'm beginning to think this is possible without disturbing the Fossil internals too badly. It won't be trivial, and it will create a small combinatorial testing explosion. We don't lean heavily on Fossil's test suite these days, but if anyone takes this project on, I would strongly encourage them to get the test suite into a zero-unexpected-errors state on their development systems before starting on the primary project so as to catch regressions, locate missing corner-case handling, and so forth.

As the file format doc says:

…the artifacts that make up a fossil repository are stored as delta- and zlib-compressed blobs in an SQLite database. This is an implementation detail and might change in a future release.

This ZFS mode would become Fossil's second backend implementation.

I don't see this removing SQLite from the picture. Nothing about this idea changes "~/.fossil", except that we'll likely need to define a few new settings, which may be set globally here. Ditto for "…/.fslckout". As for the repo DB, I believe that continues to exist, with the only thing changing being that the bulkiest of the blob table entries move out into the filesystem, leaving the manifest blobs undisturbed, where they are now.

Let's work through the day-to-day use cases to understand the implications:

fossil init

This not only creates the repo.fossil file, it needs to be able to create the ZFS filesystem backing it. I propose that if the ZFS pool name is preconfigured at the global level, the command looks the same as today. If instead you wish to init some repos with a ZFS backing store but have others use the traditional SQLite store, you give a new flag which I shall call "-z" here:

  $ fossil init -z tank/projects ~/museum/myproject.fossil

On a Linux box running OpenZFS, that should create an empty "/mnt/tank/projects/myproject" filesystem with one snapshot, being the initial check-in ID.

I am using nested filesystems here for more than the purposes of making this design doc clear. For this scheme to be practicable, it cannot rely on having superuser privileges for each commit. Instead, I suggest that Fossil call "sudo zfs allow" on the user's behalf when creating a sub-filesystem for you so that future operations don't need the permission. This step requires superuser privileges, but many Fossil users will be admin on their own systems, and for those who are not, we can add a setting to tell Fossil about the root FS, not only allowing them to skip the -z flag but also letting a local admin set the permissions they need. The admin says:

  $ sudo zfs create tank/projects
  $ sudo zfs allow -g developers clone,send,receive,snapshot tank/projects

…and then the users given only group "developers" say:

  $ fossil set -g zfs-pool tank/projects
  $ fossil init ~/museum/myproject.fossil

If the sub-filesystem /mnt/tank/projects/myproject already exists, reuse it. This allows multiple developers on a given system working on that project to share a single checkout for that repository, potentially a good thing for ZFS-mode repositories. The idea is to reduce the number of unnecessary file copies, right? If multiple developers share a machine, they will have to cooperate in the old-fashioned ways when it comes to managing these filesystems; email, Fossil chat, etc. Nothing about this design forces this sharing, but we should not merely allow it, but make it easy.

fossil open

In ZFS mode, this command cannot simply involve changing to the /mnt directory created in the prior step. If it worked that way, you couldn't have "extras" in the checkout, and the .fslckout DB would end up in the snapshot each time you commit. Instead, I propose that an "open" populates the check-out directory with symlinks to the ZFS filesystem backing it, one for each managed file.

fossil changes

The internal Fossil code backing this command needs to be changed to understand that in ZFS mode, symlinks to managed files are proxies for the actual managed content except when what was committed was a symlink itself. I expect this to be a tricky bit of coding, borderline brittle, liable to trip up future maintainers who don't use ZFS mode themselves and thus aren't trained to see all of the implications.

fossil commit

Likewise, this code has to be modified to build the manifest from dereferenced symlinks, not commit everything as symlinks, while at the same time allowing you to commit symlinks if the allow-symlinks setting is enabled. (Confused yet?)

That being done, inserting the new manifest into the repo.fossil DB's blob table remains the same as today. The only other thing that changes is that it needs to create the ZFS snapshot on the filesystem backing the repo.

For the cases where you commit with --branch or --allow-fork, or when a commit made on another machine creates an inadvertent fork, the commit code has to be modified to call zfs clone for you.

fossil merge

I believe this can be handled semi-intelligently, with the limitation that a merge can have only one merge parent in ZFS mode. Where the files are unchanged, copy nothing. Where there are changes, use Fossil's existing diff algorithms to attempt to make as few changes to the merged-in file content as possible, giving ZFS the best chance to do minimal copy-on-writes. For compressed file content, the backing ZFS filesystem is likely to balloon for well-known reasons; nothing about this mode saves you from being foolish with your choice of file formats.

The tags that back named-branch merging live in the repo DB file, same as today. ZFS-mode doesn't change that. The only change needed is to do the lookup from the branch name to a check-in ID, which gets you the ZFS filesystem snapshot to look in, thus the path to do the copying from. With automount snapshots enabled in your ZFS pool, it should simply be able to copy directly from /mnt/tank/myprojects/.zfs/snapshot/ABCD1234…/path/to/changed/file into the check-out directory on the same pool.

I continue to believe cherry-picks and backouts range from difficult to insane to implement in ZFS-mode. I don't see a way to do multiple-parent merges at all, and even for single-parent merges, I believe ZFS will end up copying all of the blocks in the merged-in files, removing all value of this backend mode for that commit.

fossil push

In ZFS mode, this needs to be able to set up the SSH tunnel to the parent repo and build the "zfs send/recv" command pair across that pipe. Not only will that be more efficient than the current Fossil sync algorithm for large files, it will allow resumable sends if both ends have new-enough versions of OpenZFS. (Late 2015, according to the Internets.)
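The pipe Fossil would construct might look roughly like this (a sketch with placeholder snapshot, dataset, and host names; -I sends all intermediate snapshots, and recv -s keeps resumable state on new-enough OpenZFS):

  $ zfs send -I tank/projects/myproject@OLDTIP tank/projects/myproject@NEWTIP \
        | ssh dev@central zfs recv -s tank/projects/myproject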

I don't see that ZFS-mode means the repo becomes incompatible with HTTP as a sync transport, but it does make it inefficient, because it will devolve to the current sync implementation. I expect ZFS-mode to work far better with ssh:// URLs.


I'm stopping here. Doubtless there is more to think about, but this should suffice to get the discussion rolling. The hard part now is finding someone who wants to implement it. It won't be me; I may be a longtime ZFS fanboy, but I don't have any repos that would benefit from this, and I see no reason to track the change history of any of my ZFS pools/filesystems with Fossil. My purpose in spending time on this design document is to make sure that whoever does begin work on this project has some guidance on how all of this works with the existing Fossil design philosophies. I don't believe it helps much if the resulting changes can never be merged back into Fossil, or worse, create a project-level fork. Fossil is too small a project to permit people to go off on wild tangents.

(20) By ddevienne on 2023-11-02 11:13:58 in reply to 13 [link] [source]

fossil felt so close to perfect for the project.
The big problem is that fossil is bad at repositories of large(ish) binary files.
My dream is that I could have fossil coupled with a zfs zpool that is stored on the filesystem as a set of large files

The problem with this approach is that Fossil is then no longer portable.

I wish instead for SQLite to better support large blobs with random access.
Because then everyone benefits, not just Fossil, which remains self-contained.

(21) By tmesis on 2023-11-02 14:54:50 in reply to 20 [link] [source]

Yeah, the portability is key; that's why I wanted to use files as block devices, so the overall "repo" is still portable: the SQLite DB file and the block device files. I am probably misunderstanding something, though, because it seems like those with more experience with zfs don't see this as a benefit.

Ultimately, I think that it would be nice if the blobs weren't stored in the sqlite database and were stored in a blob-optimized back-end archive. Something like dwarfs. You would have the repo files checked out into your filesystem, and then when a new or changed binary file is checked in, it rebuilds the binary archive file. This would be slow, but I think it would work. Merges might be rough... But the main thing is that the binary files are stored in a way that removes the size limit and has better binary-file delta compression.

I think for my project I am just going to give fossil a try and see when/if it breaks. Maybe it will work better than I expect. ;)

(22) By matt w. (maphew) on 2023-11-02 15:46:13 in reply to 13 [link] [source]

This is incredibly out of scope for the core functionality of fossil, but I have a similar project that I wanted to work on (curating a bunch of large binary files along with arbitrary metadata, versioning, etc) and fossil felt so close to perfect for the project. The big problem is that fossil is bad at repositories of large(ish) binary files.

Me too. It's the 2nd of two reasons I keep watching the Fossil forum even though I'm only an episodic user. I have several hundred hours of digital audio originally recorded on magnetic tape that I wish to clean up and republish, along with transcriptions, notes and scripts, while maintaining an audit trail of what was done so the inevitable mistakes can be seen and perhaps corrected.

(23) By matt w. (maphew) on 2023-11-02 16:00:49 in reply to 1 [link] [source]

Hi Phi. Your mission sounds the same as Archive.org's mission, applied to a specific cultural genre. You've intimated that watching the pressure being put on Archive, and suspecting they may succumb to it, is an impetus (one of several, I imagine) for you going it alone. They're bigger than you, so...

I suggest reaching out to Jason Scott with a version of your opening post here. He's philosophically aligned and has deep history and skills with various software and platforms of digital archiving.

Also, perhaps look into Safe Network. It's a "fully autonomous data and communications network" using peer-to-peer, with "Store data in perpetuity. All public/published data on the Network will be immutable and available on the Network indefinitely" as a central tenet, and it has been in development since 2006. Not an endorsement, I just learned of the endeavour yesterday, but their docs are approachable, clear, and refreshingly free of crypto-bro web3 vibes.

  • https://en.wikipedia.org/wiki/Jason_Scott
  • https://mastodon.archive.org/@textfiles
  • https://primer.safenetwork.org/

(25) By PhiTaylor (PhilienTaylor) on 2023-11-03 02:49:41 in reply to 23 [link] [source]

Wow, Maphew/Matt, this post is very informative.

They're bigger than you, so...

Yeah, they are. And they're not reaching out to ME! :( How am I going to get their attention down here?? On the other hand, any sort of media archive would be subject to the same issues: history being more valuable when it's erased. The service they provide is less shareable than 'volumes' or 'editions' of a library. I wish they sought mirrors, somehow :) Since I am still reading into how I might use and share git-annex, I've read that git-annex could upload to The Internet Archive for storage. I will likely be using them this way. Who knows. :X

Jason Scott

I don't know him, and I guess I'll tell him I read his Wikipedia page when I send him this fossil forum post. ^_^ I wonder who he is and what more he might help me with?

Safe Network

OH WOW. I have NEVER heard of this, and this sounds definitely up my alley for a whole variety of reasons. I don't know how they incentivize their fully autonomous network with data storage and communications, or how they keep it private, so this will be quite an interesting investigation for me! I'm used to this being pretty obfuscated or hard for me to imagine an API for (Veilid), the network being a bit too limited or personally hosted (Retroshare), or this being incentivized by others getting communal currency for aiding in data storage (web3 crypto vibes. Possibly hackable Contracts o_o; )

Thank you, Matt. *bow*