Fossil Forum

What's the best way to manually extract a file from a Fossil repo?
Login

What's the best way to manually extract a file from a Fossil repo?

What's the best way to manually extract a file from a Fossil repo?

(1.1) By Ryan S. Northrup ("RyNo") (YellowApple) on 2024-01-31 07:43:46 edited from 1.0 [link] [source]

Long story short: I'm trying to figure out how fossil cat $filename works (and eventually replicate it in programs of my own; I know I can just shell out to fossil cat $filename with just about any programming language, but that's less fun). As far as I can tell (from reading that command's definition in finfo.c and going from there), the general process (notwithstanding any caching) looks to be:

  1. Get the record ID for the desired check-in. Assuming trunk:

    SELECT rid FROM tagxref, tag
    WHERE tag.tagname='sym-trunk'
     AND tagxref.tagid=tag.tagid
     AND tagxref.tagtype>0
    ORDER BY mtime DESC limit 1;
    

  2. Get the manifest from the blobs corresponding to that record ID and all of its sources from the delta table (the code in content.c does the iteration manually by repeatedly running SELECT srcid FROM delta WHERE rid = $rid;, but a recursive query produces the same results AFAICT):

    WITH RECURSIVE ids(id) AS (
     SELECT $rid
     UNION ALL
     SELECT srcid FROM delta, ids WHERE delta.rid = ids.id
    ) SELECT id, content(uuid) FROM ids, blob WHERE blob.rid = ids.id
    

  3. "Somehow" apply those deltas to produce the full manifest (I haven't dug into this yet)

  4. Find the F card in that manifest with the filename I want (first field) and extract the UUID (second field)

  5. Lookup the file's record ID from that UUID:

    SELECT rid FROM blob WHERE uuid=$uuid;
    

  6. Get the blobs corresponding to that record ID and all of its sources from the delta table (same code/query as Step 2)

  7. "Somehow" apply those deltas to produce the full file contents (if this is even necessary; see below)

  8. Write the file contents to some output file/descriptor (e.g. STDOUT)

This leaves me with a few questions:

  • Do I really need to deal with deltas for this? When previewing the manifests for the latest check-ins in Fossil's repo (as of c. 2023-12-17 when I last cloned/synced it on this machine):

    SELECT rid, user, comment, uuid, substr(content(uuid),0,2000)
    FROM event, blob
    WHERE event.type = 'ci'
    AND event.objid = blob.rid
    ORDER BY mtime DESC
    LIMIT 10;
    
    There's exactly one check-in that appears to actually be a delta (rid=59344 / uuid=b86d4da5a23581f100c5d4bccf0399ae6dde05fb2c260e49ec63fd41618a7c80), as evidenced by the B card at the top of it; all the rest (including the current check-out, which is rid=59376 / uuid=70cae0a964299bb22c02e2da0d2be0ad604870a6cc90bcde4a8b6cc56abd2166) appear to be complete baselines, with F cards corresponding to all files (AFAICT) and no B card anywhere. Can I rely on check-ins to major branches (like trunk) always being baselines? And if they're indeed baselines, is there any reason why I'd want to query delta at all? Do baseline check-in manifests always contain the full list of files that exist for the given check-in?

    • Likewise, if I pull the contents of a file's blob (SELECT content(uuid) FROM blob WHERE uuid=$uuid;, where $uuid is what I yanked out of the F card for the relevant file in the given manifest), I seem to consistently get back the complete file contents. Is there ever a circumstance where a file will have deltas? What would that look like?
  • In spite of the lack of B cards for most check-ins, the delta table shows source IDs for those check-ins' record IDs. If I run a derivative of that recursive query:

    WITH RECURSIVE ids(id) AS (
    SELECT $rid
    UNION ALL
    SELECT srcid FROM delta, ids WHERE delta.rid = ids.id
    ) SELECT id, user, comment, uuid, substr(content(uuid),0,2000)
    FROM ids, event, blob
    WHERE blob.rid = ids.id
      AND event.objid = ids.id;
    
    Starting with my copy of Fossil's trunk, rid=59376 corresponds to srcid=59378, and rid=59378 corresponds to srcid=59389. All three of those are complete check-ins with complete manifests, lacking any B cards entirely. Is that expected?

    • Also, in the delta table, is srcid supposed to be the parent of rid or the other way around? Everything seems to point to rid being srcid's parent, given that 59389 references 59378's UUID in its P card (and likewise for 59378 referencing 59376), and given that the rids and mtimes (and D cards in the manifests) are in a reasonable-looking order that way, but the implementation and comments of content_get and delta_source_rid in content.c seem to point to srcid being rid's parent. I'm a novice when it comes to C, though, so it's highly likely I'm reading the source code wrong lol
  • I see that if I'm already in a checked-out / opened repo, I can query the localdb.vfile table and get file rids directly that way (e.g. SELECT content(uuid) FROM blob WHERE rid = (SELECT rid FROM vfile WHERE pathname = 'COPYRIGHT-BSD2.txt')). Same deal with localdb.vcache. What are the implications of using one or the other for this purpose (aside from this obviously not being possible if the repo hasn't been opened yet)?

  • I think I can almost condense the logic behind fossil cat $filename -r $branch (at least) down to a single query with

    SELECT content(blob.uuid)
    FROM tagxref, tag, mlink, filename, blob
    WHERE tagxref.tagid = tag.tagid
    AND tagxref.tagtype > 0
    AND tagxref.rid = mlink.mid
    AND filename.fnid = mlink.fnid
    AND blob.rid = mlink.fid
    AND tag.tagname = 'sym-$branch'
    AND filename.name = $filename
    ORDER BY tagxref.mtime DESC
    LIMIT 1;
    
    However, the one piece missing from that query is detecting whether files still exist (for example, that query will work for $filename=COPYRIGHT-GPL2.txt even though that file is long gone from the current version of trunk). Are there other tables I can join against to fix that? I get that Fossil ain't relational, but in this case it's right on the verge of being relational-friendly.

    • If there ain't an existing table to the above effect, would there be value in creating one (i.e. pre-extracting every check-in's F cards)? Maybe something like
      CREATE TABLE repository.manifestfile (
      id INTEGER PRIMARY KEY,
      mid INTEGER REFERENCES blob,
      fid INTEGER REFERENCES blob,
      fnid INTEGER REFERENCES filename
      );
      
      Which would then condense the fossil cat $filename -r $branch logic to something like
      SELECT content(uuid)
      FROM manifestfile, tag, filename, tagxref
      WHERE manifestfile.fnid = filename.fnid
      AND blob.rid = manifestfile.fid
      AND manifestfile.mid = (
      SELECT rid FROM tagxref, tag
      WHERE tag.tagname='sym-$branch'
        AND tagxref.tagid=tag.tagid
        AND tagxref.tagtype>0
      ORDER BY mtime DESC limit 1;
      )
      AND filename.name = $filename
      LIMIT 1;
      

Sorry for the numerous and rambling (and probably dumb) questions. Fossil's design is interesting and I'd love to be able to more fully wrap my head around it.

(2) By Stephan Beal (stephan) on 2024-01-31 04:25:18 in reply to 1.0 [source]

Do I really need to deal with deltas for this? When previewing the manifests for the latest check-in... All three of those are complete check-ins with complete manifests, lacking any B cards entirely. Is that expected?

Note that fossil has two distinctly different types of deltas. Delta compression is the primary form, which stores files in compact deltas to minimize space. That's what's in the delta table. As you note in point 2, the content() SQL function will apply such deltas for you when extracting a given blob's content.

What you're looking at with the B-card is called a delta manifest, which is a special-case compressed format for manifest which is independent of the general-purpose delta compression mentioned above. You may find:

src:/doc/trunk/www/delta-manifests.md

to be an informative introduction to those.

This repository's policy is to avoid delta manifests because they complicate client-side code, like yours, which is attempting to figure out exactly which files belong in a given check-in. In a non-delta manifest, it's trivial to verify the hashes of all included files. With a delta manifest, that becomes much more difficult.

As a shameless plug: you might also find libfossil interestin, as it includes a library interface to all of fossil's core SCM functionality except for syncing, including APIs which would allow your client C application (or, with a bit more work, a higher-level wrapper) to navigate manifest files in a way which accounts for (or not, as you choose) the vagaries of handling delta manifests.

However, the one piece missing from that query is detecting whether files still exist (for example, that query will work for $filename=COPYRIGHT-GPL2.txt even though that file is long gone from the current version of trunk). Are there other tables I can join against to fix that?

Not directly, IIRC (which might not be the case!), but there is a virtual table called files_of_checkin, colloquially known as foci, which can be queried to determine whether a given file is in a given checkin. That level of data is not stored directly in the db in a relational-friendly form because the number of records would explode: one per file per check-in. We don't have any friendly user-facing docs for foci but grepping src/*.c for files_of_checkin will reveal many examples of its usage.

(3) By Ryan S. Northrup ("RyNo") (YellowApple) on 2024-01-31 07:43:22 in reply to 2 [link] [source]

Thanks for the quick response! That's a lot faster than I expected lol

Note that fossil has two distinctly different types of deltas. Delta compression is the primary form, which stores files in compact deltas to minimize space. That's what's in the delta table. As you note in point 2, the content() SQL function will apply such deltas for you when extracting a given blob's content.

Aaaaaaaah, okay, that makes a lot more sense. I didn't catch that content() is decompressing and de-delta'ing. I guess that means to fully understand that piece (and to be able to fully recreate it in a program using plain SQLite with no Fossil-isms) I'll need to dive into delta.c a bit.

So for a pair of follow-up questions, with a variation on my recursive delta-grabbing query (to grab the raw blob contents and not the decompressed/de-delta'd version):

WITH RECURSIVE ids(id) AS (
 SELECT $rid
 UNION ALL
 SELECT srcid FROM delta, ids WHERE delta.rid = ids.id
) SELECT content FROM ids, blob WHERE blob.rid = ids.id;

If that returns exactly one row (because SELECT srcid FROM delta WHERE delta.rid = $rid; returns nothing), then once I decompress/reinflate the result I should have a complete and current copy of the artifact $rid represents (be it a manifest or file), right?

And if that returns multiple rows (because SELECT srcid FROM delta WHERE delta.rid = $rid; returns a row, and so on until it doesn't anymore), then once I recompress/reinflate each of the results I'll end up with the last result being a complete but out-of-date file, and all the rest will be deltas (in the documented format) that I'll need to re-apply on top of that base file one-by-one to produce the complete and up-to-date target file, right?

If so, then I think content_get() in content.c is starting to make a lot more sense now, at least for files.

What you're looking at with the B-card is called a delta manifest, which is a special-case compressed format for manifest which is independent of the general-purpose delta compression mentioned above. You may find:

src:/doc/trunk/www/delta-manifests.md

to be an informative introduction to those.

This repository's policy is to avoid delta manifests because they complicate client-side code, like yours, which is attempting to figure out exactly which files belong in a given check-in. In a non-delta manifest, it's trivial to verify the hashes of all included files. With a delta manifest, that becomes much more difficult.

Okay, cool. And that process (for finding the current version of a file as of a given check-in) would be to grab all the F cards, grab the manifest the B card points to if it exists, and repeat for each manifest until I've read everything from the baseline manifest1, right? That makes sense, even if it's less-than-ideal from a pure SQL perspective (but that's life lol).

I'm still a bit fuzzy with the ordering of rids when it comes to delta-compressed manifests; after running

WITH RECURSIVE ids(id) AS (
  SELECT $rid
  UNION ALL
  SELECT srcid FROM delta, ids
  WHERE delta.rid = ids.id
)
SELECT rid, size, uuid, mtime, user, comment
FROM ids, blob, event
WHERE blob.rid = ids.id
  AND event.objid = blob.rid;

with $rid set to 59524 (corresponding to check-in UUID 55a0a4d01a50b292d5e383e84b7ec02aeffca0afeaa95d9141fb83550186eb8e), it's apparent that 59524's blob is indeed delta-compressed, and recursing through the other blobs referenced in delta they seem to be in a reasonable order in terms of mtime (and that query matches trunk's timeline exactly in terms of ordering), but the rids are in a weird order, with some delta rows having srcids that are higher than their rids. I'm guessing that's normal?2

As a shameless plug: you might also find libfossil interestin, as it includes a library interface to all of fossil's core SCM functionality except for syncing, including APIs which would allow your client C application (or, with a bit more work, a higher-level wrapper) to navigate manifest files in a way which accounts for (or not, as you choose) the vagaries of handling delta manifests.

That's fantastic! Whether or not Fossil is embeddable as a library is something I intended to ask in a separate forum topic, but that answers my question before I even had a chance to ask it :)

I'll definitely give that a gander for any actual projects - but of course, it still helps to know how things work under the hood.

Not directly, IIRC (which might not be the case!), but there is a virtual table called files_of_checkin, colloquially known as foci, which can be queried to determine whether a given file is in a given checkin. That level of data is not stored directly in the db in a relational-friendly form because the number of records would explode: one per file per check-in. We don't have any friendly user-facing docs for foci but grepping src/*.c for files_of_checkin will reveal many examples of its usage.

And whaddya know, SELECT content(uuid) FROM files_of_checkin($branch) WHERE filename = $filename; is the exact perfect query I was looking for. Thanks!

Of course, that just kicks the can down to needing to know how content() and files_of_checkin() work under the hood to know how I'd go about implementing the full file extraction process myself, but at least now I've got a solid starting point.


  1. ^ The docs say a delta manifest should never point to another delta manifest / that a B card should always point to a baseline manifest, but who knows what weird manifest deformities my code might need to handle
  2. ^ Now that I think about it, if mtime represents when the check-in's creator created that check-in locally while rid represents the order the upstream server received those check-ins via pushes/syncs, then that would explain the weird rid ordering there, right?

(4) By Stephan Beal (stephan) on 2024-01-31 08:13:56 in reply to 3 [link] [source]

I guess that means to fully understand that piece (and to be able to fully recreate it in a program using plain SQLite with no Fossil-isms) I'll need to dive into delta.c a bit.

Not only delta.c (which defines the delta encoding), but also content.c:content_get() (which demonstrates how to recursively apply the deltas) and zlib (which applies additional compression). The whole constellation is rather involved, but it works smoothly.

If that returns exactly one row (because SELECT srcid FROM delta WHERE delta.rid = $rid; returns nothing), then once I decompress/reinflate the result I should have a complete and current copy of the artifact $rid represents (be it a manifest or file), right?

i have not had enough caffeine to yet jog my memory of whether that query is correct, but basically the steps to get all of the deltas are:

  • Find the RID you want (let's call that $rid). If there are any deltas to apply to that then...
  • delta.srcid will == $rid. If there are no matching records then your blob is complete as-is but might still be zlib-compressed. If there is a matching record, there will only be 1 and...
  • delta.rid refers to a different blob which contains the raw delta itself. Let's call this $did.

If you apply (blob.rid=$did) to (blob.rid=$rid) then you will end up with the un-delta'd content, BUT there may be a whole chain of deltas to apply in succession. It's been a good 8-ish years since i've noodled around in that algo, so don't recall off hand how it works, but the algorithm for that is tucked away in content.c:content_get().

Sidebar: though this is not guaranteed, fossil tends to keep the most recent version in its raw, undeltad form (possibly zlib-compressed), and creates deltas for the ancestor versions. That is the opposite of most historical SCMs, which keep some older version in raw form and apply deltas forward in time to create newer versions. Fossil does this because we normally want the latest version and that's faster to do if it's stored in raw form. That is: when you check in a new version of any given blob, fossil may (and very likely will) replace the blob.content of the previous version (N-1) with a delta. That delta represents what's necessary to re-create the (N-1) version from the new tip (N) version.

If so, then I think content_get() in content.c is starting to make a lot more sense now, at least for files.

content_get() is used for any and all de-deltification of content, so if you can grok that routine then you'll understand all there is to applying deltas. i recall struggling to understand that routine when porting it to libfossil, and thankfully haven't had to revisit it since then because it "just works" and has since fossil's earliest days.

Sidebar: delta.c is largely standalone. It's trivial to strip that of its fossil dependencies and use in other projects.

Okay, cool. And that process (for finding the current version of a file as of a given check-in) would be to grab all the F cards, grab the manifest the B card points to if it exists, and repeat for each manifest until I've read everything from the baseline manifest1, right?

That's essentially correct but you don't have to work recursively: delta manifests always derive from a non-delta (a.k.a. baseline) manifest. Abstractly, you essentially start with the baseline then replace any entries in it with name-matching entries from the delta, with a couple exceptions:

  • Deltas record deleted files by recording their name with no hash. Baselines record deletion by simply not having that file in the manifest at all. That is, if a delta removes a file, that entry will have no hash and you abstractly remove it from the baseline.

  • Renaming is another complication, but the intricacies of that currently elude me - the details are somewhere in manifest.c.

I'm still a bit fuzzy with the ordering of rids when it comes to delta-compressed manifests; after running

The RID ordering is irrelevant and unreliable. The logic for figuring out the exact chain of deltas, in proper order, is all encoded in content_get(). The schemas that uses are all defined in src/schema.c.

Whether or not Fossil is embeddable as a library is something I intended to ask in a separate forum topic, but that answers my question before I even had a chance to ask it :)

Feel free to contact me off-list about that if you have any questions: stephan wanderinghorse net. The library supports all of the core-most SCM features but not any of the UI-relevant features (like forum navigation and wiki parsing/rendering), nor a few of the nice-to-have SCM features (namely stash, undo, and bisect, but it has all of the infrastructure pieces necessary for implementing those). FWIW, the libf CLI tools are, with few exceptions, my "daily driver" tools for working with fossil, e.g. 99% of my checkins use its f-ci tool.

Of course, that just kicks the can down to needing to know how content() and files_of_checkin() work under the hood

content() is just a proxy for content_get() and files_of_checkin() is an sqlite virtual table defined in src/foci.c.