Split out a slice of one Fossil repo to create a new Fossil repo

(1) By Richard Hipp (drh) on 2020-12-04 13:38:03

Backstory

The "althttpd.c" web server is a program I wrote years ago and which currently runs the SQLite and Fossil websites. The source code to althttpd.c (all in a single file) and the althttpd.md documentation have been maintained in the SQLite documentation repository for many years. Recently, I decided to split out althttpd into its own separate repository.

But the new repository was seeded with just the most recent versions of althttpd.c and althttpd.md. All of the history for those files was omitted.

So now a user is asking for this history to be backported into the new repository. And, I have to admit, this is something that I would like to do. I'm also interested in splitting off the Lemon source code into a repository separate from the SQLite source repository. I haven't done that yet, as I am still trying to figure out how to preserve its long history for the new repo.

Questions

How to do this? Perhaps a new command like:

fossil split
fossil breakout
other suggestions

You give this command a list of files whose history you want to preserve and it creates a new repository that contains only the check-ins that touch the files you identify. But this leads to a whole host of additional questions:

Does the new repository preserve branch names, or does it try to put everything on trunk?
Does the breakout track the files across renames? If so, then it seems like when you identify a file on the command-line you would also need to identify a particular version of that file since it might be that two completely unrelated files happen to have the same name at different points in the history of the source project.
Do the exported check-in comments need to have some additional annotation to indicate that they originated in a separate repository? When the SQLite project was originally imported from CVS into Fossil, we added "(CVS nnnn)" tags to each check-in to indicate which CVS check-in it originated from. And when the Fossil repository is exported to GitHub, an extra "FossilOrigin-Name:" tag is appended to the end of each check-in comment indicating where the check-in originated.
Should there also be provisions to export wiki, tickets, and/or forum posts? How do we identify the particular wikis, tickets, and forum threads to export?
Specifying all the various options of what to export might become quite involved. Do we need some kind of "breakout specification file" that is edited separately and then provided to the "fossil breakout" command (or whatever it ends up being called) by filename, rather than trying to list all the conditions on the command-line?

What are your suggestions?

(2) By Daniel Dumitriu (danield) on 2020-12-04 14:50:28 in reply to 1

That user is me, as I was rather interested in the technical aspect, so I managed to write a pretty short bash script (although I am not a specialist) leveraging the likes of fossil timeline -p, fossil info and fossil cat, along with the other usual Unix suspects. This managed to get the 124 relevant commits into a new repository within a few seconds.

I will come back later with opinions on the other questions.

(3) By graham on 2020-12-04 15:27:30 in reply to 1

Perhaps a new command like:

I've no problems with those options, just throwing another idea out there... Some document archiving software I worked on used the term hive (as in "to hive off") to separate a chosen subset of documents from the main archive into a new archive (with options to leave the hived-off bits in the original archive or to delete them).

Does the new repository preserve branch names

I've not used Fossil enough (at least with a lot of branching) to be certain, but my gut feeling is that if there was complicated branching-and-merging in the files' histories, (a) this might be helpful to preserve, and (b) it conceivably could cause problems if everything is "squashed" onto trunk.

Does the breakout track the files across renames? If so, then it seems like when you identify a file on the command-line you would also need to identify a particular version

My feeling is that if history is wanted, it should be complete¹: truncating the history just because the name changed seems "odd".

However, my instinctive feeling is that one would only ever be hiving-/splitting-off the current versions of a set of files... therefore you would only need to specify their current names. (And leave the hiving process to track any changes to their names as it tracks backwards through history).

Unless you envisage a need to hive-off "the history of Lemon up to 2015", then I don't think you need to worry about specifying "the name it had at version [xxxxxxx]".

¹ I can conceive you might want to truncate the history at a certain point in the past ("the history of Lemon back to 2010"), which feels like it should not be a problem (just stop tracing backwards?), but I haven't thought of why you'd want a hive of the state of some files up to a time in the past.
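A minimal sketch of that backwards walk, assuming a purely linear history and a made-up data model (a newest-first list of check-ins, each carrying the renames it performed). The function name `names_over_time`, the check-in ids, and the file paths are hypothetical and are not part of Fossil:

```python
from typing import Dict, List, Tuple

# Each entry is (checkin_id, renames), newest first. `renames` maps
# old_name -> new_name for files renamed in that check-in.
History = List[Tuple[str, Dict[str, str]]]

def names_over_time(history: History, current_name: str) -> List[Tuple[str, str]]:
    """Walk backwards from the newest check-in and return (checkin_id, name)
    pairs giving the name the tracked file had as of each check-in."""
    name = current_name
    trail = []
    for checkin_id, renames in history:
        trail.append((checkin_id, name))
        # If this check-in renamed old -> name, older check-ins used `old`.
        for old, new in renames.items():
            if new == name:
                name = old
                break
    return trail

# Hypothetical history: the file was renamed in check-in "c2".
history = [
    ("c3", {}),
    ("c2", {"old/lemon.c": "tool/lemon.c"}),
    ("c1", {}),
]
print(names_over_time(history, "tool/lemon.c"))
```

Only the current name is supplied; the older name falls out of the walk. Branches and merges would need the same idea applied per parent link rather than over a flat list.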

Do the exported check-in comments need to have some additional annotation to indicate that they originated in a separate repository?

My (again) gut feeling is that while such an annotation might not get used often, if there is a need/desire to refer back to the "parent" repository, having them will make things easier. Therefore, so long as it's not difficult to do, I'd include them. My guess is such an annotation only needs to include the original repository's check-in hash, with the following thoughts:

You probably don't want to try and make it clickable. Not only can I see this leading to confusion about which repository you're looking at, once the two repositories diverge, there's no guarantee that "how to reach the parent" will remain valid. (Unless, perhaps, a "parent-repo-URL" was made a configuration option???)
You possibly don't want/need to go overboard in trying to "identify" the parent repository, at least within each annotation: perhaps something along the lines of "From parent-repo check-in [xxxxxxx]". Identification of what "parent-repo" means should probably happen once, in an "initial check-in comment" or an attached wiki page of some kind.
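As a rough illustration of such an annotation, here is a small sketch. The helper name `annotate_comment` and the 10-character prefix are assumptions; the hash is written without square brackets so that it would not be rendered as a clickable link:

```python
def annotate_comment(comment: str, origin_hash: str, prefix_len: int = 10) -> str:
    """Append a plain-text origin note to a check-in comment.

    The hash prefix is written without brackets so it stays non-clickable.
    """
    note = f"(From parent-repo check-in {origin_hash[:prefix_len]})"
    return f"{comment.rstrip()} {note}"

# Hypothetical comment and hash, for illustration only.
print(annotate_comment("Fix a typo in the docs.", "abcd1234ef567890"))
```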

(4) By Daniel Dumitriu (danield) on 2020-12-04 15:40:39 in reply to 1

Ad command name: alphabetically from dictionary:

breakout clip cutout decoct dislodge displace distillate doff excerpt expunge extract fragment hive rip select separate skim splinter split strip tearout withdraw

I like extract and clip. And hive, too.

(5) By Daniel Dumitriu (danield) on 2020-12-04 16:19:15 in reply to 1

Does the new repository preserve branch names, or does it try to put everything on trunk?

Probably preserve - I guess it should be possible to use a depth-first traversal restricted to the wanted files.

Does the breakout track the files across renames? If so, then it seems like when you identify a file on the command-line you would also need to identify a particular version of that file

Track renames. I'd say one would need a --to option (and, less probably, a --from), so I would take the filename to be the one present in that commit.

Do the exported check-in comments need to have some additional annotation to indicate that they originated in a separate repository?

Yes, as an option ("original checkin abcd1234", no links).

Should there also be provisions to export wiki, tickets, and/or forum posts?

Probably not, at least not in a first version.

Specifying all the various options of what to export might become quite involved. Do we need some kind of "breakout specification file"

This would be easier to decide when all the options are there, depending on how complex they end up. It would certainly make it easier for the user, not necessarily for the developer ;-)

By the way - changes in multiple wanted files in the same source checkin should of course make it in the same destination checkin, so some juggling is needed here.

(6) By anonymous on 2020-12-04 16:22:29 in reply to 1

Just to clarify the desired outcome:

new repo is created
the requested history (including user and config data) is copied from the source repo over to the new repo
source repo remains unchanged

Is that correct?

As such, this action looks like a variant of init, clone, or export, unless you'd allow subsequent additions of more slices into an already existing repo.

If my understanding is correct, this could be a --copy option to clone. Or --path followed by the list of paths to copy. Maybe allow the file list to be read from a file or stdin, so it could be piped.

extract also sounds right, but does it create a new repo or extract into an already init'ed repo?

(7) By Richard Hipp (drh) on 2020-12-04 17:10:32 in reply to 6

The desired outcome you describe is correct.

Additional points:

Not all check-ins are transferred into the new repo - only check-ins that touch the files that you intend to export.
Even after deleting the irrelevant check-ins, the DAG must be preserved.
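The two points above can be sketched in a few lines of Python. This is not Fossil code: the function name `prune_dag` and the dictionary-based inputs are assumptions made for illustration. It keeps only the check-ins that touch a wanted file and reattaches each one to its nearest kept ancestors, so the result is still a DAG:

```python
from typing import Dict, List, Set

def prune_dag(parents: Dict[str, List[str]],
              touched: Dict[str, Set[str]],
              wanted: Set[str]) -> Dict[str, List[str]]:
    """Return parent links for the kept check-ins only.

    parents: check-in id -> parent ids (every check-in must be a key)
    touched: check-in id -> file names changed by that check-in
    wanted:  file names whose history should be exported
    """
    keep = {c for c, files in touched.items() if files & wanted}
    cache: Dict[str, List[str]] = {}

    def kept_ancestors(cid: str) -> List[str]:
        # Nearest kept ancestors reachable through cid's parents.
        if cid in cache:
            return cache[cid]
        result: List[str] = []
        for p in parents.get(cid, []):
            found = [p] if p in keep else kept_ancestors(p)
            for a in found:
                if a not in result:
                    result.append(a)
        cache[cid] = result
        return result

    return {c: kept_ancestors(c) for c in parents if c in keep}

# Hypothetical history: c2 touches only an unrelated file and is dropped.
parents = {"c1": [], "c2": ["c1"], "c3": ["c2"], "c4": ["c2"], "c5": ["c3", "c4"]}
touched = {
    "c1": {"althttpd.c"},
    "c2": {"other.c"},
    "c3": {"althttpd.c"},
    "c4": {"althttpd.c"},
    "c5": {"althttpd.md"},
}
print(prune_dag(parents, touched, {"althttpd.c", "althttpd.md"}))
# {'c1': [], 'c3': ['c1'], 'c4': ['c1'], 'c5': ['c3', 'c4']}
```

The fork at c2 survives as a fork at c1, and the merge at c5 is preserved. A real implementation would probably also drop a parent that is already an ancestor of another parent, and would avoid recursion on very long histories.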

(8) By anonymous on 2020-12-04 17:27:40 in reply to 7

Another option in the same context could be --move, such that the extracted files are deleted in the source repo. This is indeed a way to offload the full history for a subset of objects into a separate repo.

Here's another word choice: subset.

fossil subset source dest --move files...

(9) By Richard Hipp (drh) on 2020-12-04 17:52:56 in reply to 8

You cannot remove files from a repository. That would change history, causing all of the historical check-in hashes to change.

(11) By anonymous on 2020-12-04 18:18:52 in reply to 9

You cannot remove files from a repository.

I meant "delete" in the Fossil sense: the moved objects are no longer managed in the source repo, but their prior history is left as it was. Meanwhile, the subset repository receives a copy of that history, yet it's open-ended, so it can continue going forward.

This way the development of the moved objects would go on using the subset-repo as their primary repo. It indeed gets off-loaded in a state consistent with the past history in the source repo.

(10) By Warren Young (wyoung) on 2020-12-04 17:55:53 in reply to 1

Does the new repository preserve branch names,

I've got repos where doing this proposed extraction would be far less useful if it did not preserve branches.

Does the breakout track the files across renames?

It should, yes.

two completely unrelated files happen to have the same name at different points in the history of the source project.

The feature should allow the user to specify what to extract by glob, so if I say foo/*, then I mean any file under the foo top-level directory. If it names two different files — not of the same clade — then it's probably what I wanted, right? I want both of them in the output repo.

If the user's glob(s) name multiple files not of the same clade, it should advise the user but proceed.

To allow a user to include only a single clade, extend glob syntax: foo/bar/qux.c@abcd1234. That would also allow restriction to a particular version's view of the glob: foo/* means all files under foo across all time, but foo/*@abcd1234 means only those files under foo as of commit abcd1234.
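A minimal sketch of how such a `GLOB@VERSION` spec might be parsed and applied. The function names and the `files_by_checkin` mapping are hypothetical, Python's `fnmatch` stands in for Fossil's own glob matcher, and clade tracking is not modelled here; only the "as of this commit" restriction is shown:

```python
from fnmatch import fnmatchcase
from typing import Dict, Optional, Set, Tuple

def parse_spec(spec: str) -> Tuple[str, Optional[str]]:
    """Split 'GLOB@VERSION' into (glob, version); version is None if absent."""
    glob, sep, version = spec.rpartition("@")
    return (glob, version) if sep else (spec, None)

def select_files(spec: str, files_by_checkin: Dict[str, Set[str]]) -> Set[str]:
    """Return the file names selected by one spec.

    Without '@', the glob is matched against every name seen in any check-in.
    With '@VERSION', only the files present in that check-in are considered.
    """
    glob, version = parse_spec(spec)
    if version is None:
        candidates = set().union(*files_by_checkin.values())
    else:
        candidates = files_by_checkin[version]
    return {f for f in candidates if fnmatchcase(f, glob)}

# Hypothetical repository contents at two check-ins.
files_by_checkin = {
    "abcd1234": {"foo/a.c", "bar/b.c"},
    "ef567890": {"foo/a.c", "foo/new.c"},
}
print(sorted(select_files("foo/*", files_by_checkin)))           # all time
print(sorted(select_files("foo/*@abcd1234", files_by_checkin)))  # one version
```

One consequence of this syntax is that a file name containing a literal `@` would need some form of escaping, which the sketch does not attempt.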

There should be an --invert option to this command for use by those who want to break an existing repo apart: you might extract files without --invert 3 times with different globs to create 3 sub-repos, then give all 3 sets of globs with --invert to get a repo with everything else.

For referential integrity, if there are commits in the everything-else repo that refer to a mix of files inside and outside the set named by the globs, it should stop and complain, because it would mean you're asking it to create commits with incomplete file content.
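The referential-integrity check described here could look roughly like the following. It is a sketch under assumed inputs: `touched` maps each check-in to the files it changes, the name `mixed_checkins` is made up, and `fnmatch` again stands in for Fossil's glob matching:

```python
from fnmatch import fnmatchcase
from typing import Dict, List, Set

def mixed_checkins(touched: Dict[str, Set[str]], globs: List[str]) -> List[str]:
    """Return check-ins that change files both inside and outside the globs.

    With --invert, such a check-in could not be copied without losing part
    of its content, so the command would stop and report it.
    """
    def excluded(path: str) -> bool:
        return any(fnmatchcase(path, g) for g in globs)

    bad = []
    for checkin_id, files in touched.items():
        flags = {excluded(f) for f in files}
        if flags == {True, False}:
            bad.append(checkin_id)
    return bad

# Hypothetical history: c2 changes a file on each side of the split.
touched = {
    "c1": {"foo/a.c"},
    "c2": {"foo/a.c", "src/main.c"},
    "c3": {"src/main.c"},
}
print(mixed_checkins(touched, ["foo/*"]))  # ['c2']
```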

I think that won't be a common case, though: someone wanting to create a repo that has everything except certain files can be expected to scope the globs to name only files that don't get changed along with files the user wants to keep in the new "everything else" repo. I do expect some pilot error, but I think any stoppages will be appreciated, rather than be perceived as annoying. "Yes, I knew that; thanks for catching that for me, Fossil!"

some additional annotation

I can see wanting it both ways, so it should be under control of a flag. --no-annotate?

export wiki, tickets, and/or forum posts?

Yes, by reference to commits.

If a user finds that desired tickets etc. didn't transfer, they can go back and edit them to refer to an included commit, then retry.

This is all part of referential integrity. The resulting repo must stand alone.

This means that giving --invert removes tickets and such if they refer to commits that are being excluded, since those artifacts apparently live somewhere else now.

breakout specification file

I'm not seeing the need. Shell scripting should suffice if the command gets too long to edit at the command line.

(12) By Warren Young (wyoung) on 2020-12-04 18:19:14 in reply to 10

Regarding commit message annotations, if included, they should use interwiki tags so the user can tie the two repos back together, if desired.

Regarding --invert, yes, I realize this will create a repo that won't sync or clone with the original repo any more. Giving it should therefore generate a new project code to stop that early.

(14) By MBL (RoboManni) on 2020-12-05 13:20:17 in reply to 10

By Warren Young (wyoung) on 2020-12-04 17:55:53

breakout specification file

I'm not seeing the need. Shell scripting should suffice if the command gets too long to edit at the command line.

If Fossil should still be called a platform-independent tool, then a breakout specification file would be better than scripting, which is platform-dependent.

(16) By Warren Young (wyoung) on 2020-12-05 21:32:26 in reply to 14

Implicit in that criticism is that you'd need to do this breakout multiple times on different platforms. Why? Isn't it a one-and-done sort of thing? If you do your breakout in PowerShell and I do mine in POSIX shell, what does it matter?

(13) By jamsek on 2020-12-05 10:54:00 in reply to 1

This would be a very nice feature, Richard! Recently I wanted to morph
a couple scripts that evolved into larger programs from constituents in
my scripts repo into standalone repositories. I did much the same
thing by extracting the most recent versions, but, as you mentioned,
this omitted the history, which would've been nice to transfer.

New command

Even though this isn't possible in Fossil, it might be best to use a
word that doesn't imply removing the check-ins from the original
repository. On the other hand, that restriction might eliminate most of
the obvious and, therefore, better terms.

fossil {cleave, derive, educe, emit, excavate, extract, fragment, isolate, piecewise, segment, separate, strip, subset, unbind}

New option

Rather than a new command, it might make more sense as a new option or
subcommand to an existing command. In effect, it's a partial export or
clone of an existing repository, so they would be the most apt
candidates.

fossil export {chunk, excerpt, fragment, packet, partial, portion, segment, section, subset}

Personally, I think I prefer the subcommand to export; however, given
the large number of options this feature might take, a new command might
be better.

Features

Does the new repository preserve branch names, or does it try to put everything on trunk?
- Toggling this on or off with a switch would be good.
Does the breakout track the files across renames?
- Yes; this would be best.
Do the exported check-in comments need to have some additional annotation to indicate that they originated in a separate repository?
- Toggling annotations on/off would also be good.
Should there also be provisions to export wiki, tickets, and/or forum posts?
- Yes; as referenced by the exported check-ins.
Do we need some kind of "breakout specification file"?
- My first thought was no; but then, rather than exclude it entirely,
  maybe it could be optional (e.g., fossil export subset -F spec_file).

(15) By Dan Shearer (danshearer) on 2020-12-05 16:51:35 in reply to 1

I have two questions about the design of this.

How does this relate to shallow clone?

A 'fossil split' et al command is not the same as the shallow clone discussion; however, in many circumstances it could be used as a heavyweight version of shallow clone. I am interested in something a bit like shallow clone because of not-forking, something invented specifically for splicing versions of SQLite with versions of other projects in order to see what the combined capabilities could be.

Will there be a 'fossil merge-back' ?

or, What about downstream re-importation, by accident or design?

We already had that issue (and solved it, thank you Richard) where we wanted to import a tree (as it happened, a shallow clone via git) into a fossil repo and in the distant past that git tree had originated in a fossil tree and still had Fossil metadata present. So we know that such situations can arise.

Leaving out the complication of git, what if I do a "fossil split" and then later on want to bring some part of that new DAG back in?

This is a subset of a question I have never yet asked myself, which is "how do you export from one Fossil repo and import into another?" since I have only done that with git. LumoSQL needs to handle the Fossil->Fossil case so I will be looking up the answer to that soon. Perhaps it has already been solved and I just don't know that yet.

Dan Shearer

(17) By Warren Young (wyoung) on 2020-12-05 21:38:12 in reply to 15

in many circumstances it could be used as a heavyweight version of shallow clone

The current design provides narrowed repos, not shallow ones.

Narrow clones include a subset of top-level directories, whereas shallow clones include only a limited amount of history across all files. This feature cannot provide anything like shallow cloning, as currently designed.
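The distinction can be shown with a toy model. This is only a sketch: the `Checkin` tuples, the function names, and the paths are invented for illustration and do not correspond to any Fossil command:

```python
from typing import List, Set, Tuple

# (check-in id, files it changes), newest first.
Checkin = Tuple[str, Set[str]]

def narrow(history: List[Checkin], top_dirs: Set[str]) -> List[Checkin]:
    """Keep all of history, but only files under the chosen top-level dirs."""
    out = []
    for cid, files in history:
        kept = {f for f in files if f.split("/", 1)[0] in top_dirs}
        if kept:
            out.append((cid, kept))
    return out

def shallow(history: List[Checkin], depth: int) -> List[Checkin]:
    """Keep every file, but only the newest `depth` check-ins."""
    return history[:depth]

history = [
    ("c3", {"src/a.c"}),
    ("c2", {"doc/x.md", "src/a.c"}),
    ("c1", {"doc/x.md"}),
]
print(narrow(history, {"doc"}))  # c2 and c1, restricted to doc/
print(shallow(history, 2))       # c3 and c2, all files
```

The narrow selection reaches back to the oldest check-in but drops paths, while the shallow one keeps all paths but drops old check-ins, which is why the two select along different axes.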

This feature might be extended to do shallow selection of commits, but I can't see that the two feature sets would share much code.

or, What about downstream re-importation, by accident or design?

Accidental merge will be prevented by the new repo getting a new project code. Fossil's sync algorithm will refuse to sync two projects with different project codes.

The rest of your questions are better taken up as replies to your other post on this topic.

(18) By Dan Shearer (danshearer) on 2020-12-06 11:45:45 in reply to 1

Richard Hipp (drh) wrote on 2020-12-04 13:38:03:

On further thought...

What are your suggestions?

1. Define the Maths Problem

It could be easier if we had language to discuss this from something like a Conference talk on Fossil maths. However, this may be related to series-parallel digraphs in DAG theory according to one mathematician I asked in passing. We might say that Fossil is looking to construct two DAGs that have a parallel composition, where a parallel composition is something usually calculated from two DAGs rather than the reverse, as would be the case here.

2. Consider Sacrificing Some Uniqueness

Would it work to create two trees with lower uniqueness guarantees but also having part of the hash in common? All Fossil operations would work as usual on both trees, but to those operations that understand there has been a tree split, commits could be distinguished on the basis of one half or the other of the SHA-256. For example if the top half of the hash was to be zeroed and replaced with the bottom half of a new hash. In the somewhat-related discussion of merging Fossils this might be a solution, and if so, a standard two-DAGs-in-one-hash design would be needed. There are many limitations to this approach, not least that it can only be done once. But first - would it be possible at all?

Dan Shearer

(19) By Warren Young (wyoung) on 2020-12-06 21:09:56 in reply to 18

Define the Maths Problem

I don't think we're proposing any new DAG math, because that would imply that the resulting split repo isn't backwards compatible anymore. The math on the new DAG has to work the same as the math on the old DAG, else we're talking about "Fossil 3": a breaking change.

While we could certainly break the Fossil sync protocol again, as we did for Fossil 2, that was plenty traumatic, even given that there was a smooth transition path. If I understand the sort of thing you're proposing, there could be no transition path other than a complete rebuild, which means every clone stops syncing until rebuilt under the new math.

Until then, a Fossil 2 repo would be as foreign to Fossil 3 as a Git repo: understandable only through a one-way translation layer. And after, the Fossil 3 repo would be just as foreign to Fossil 2 instances, except that there's no reason to expect we'd have a Fossil 3→2 translation method.

Now we have to ask, what is our compelling reason to put up with such things? As proposed, drh's feature idea requires no new math, no compatibility break.

lower uniqueness guarantees

I don't see how that helps, but I think it might help to add another card type — S, for "split" — that helps Fossil tie two repos back together. While extracting each commit manifest, Fossil could add an S card saying which commit in the parent repo this one came from.

Fossil could use the S cards for things like bisect between two repos, given the other repo's name:

   $ fossil bisect reset --parent-repo ~/museum/parent-repo.fossil

In the subset of commits where S cards in this repo refer to commits in the parent repo, it can safely check out versions of the shared files from both.

In principle, it could also allow the two repos to be merged back together, provided you're willing to rewrite all the commits from the child repo to include all of the "tip" F cards at the merge point.
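To make the S card idea concrete, here is a small sketch of reading such a card back out of a manifest. The S card does not exist in Fossil; it is only the proposal made above. The function name `origin_of` and the manifest text are made up for illustration, assuming only that each card is a line starting with a single uppercase letter and a space:

```python
from typing import Dict, Optional

def origin_of(manifest_text: str) -> Optional[str]:
    """Return the parent-repo check-in named by a hypothetical S card."""
    for line in manifest_text.splitlines():
        if line.startswith("S "):
            return line[2:].strip()
    return None

def origin_map(manifests: Dict[str, str]) -> Dict[str, str]:
    """Map child-repo check-in id -> parent-repo check-in id, where known."""
    result = {}
    for child_id, text in manifests.items():
        origin = origin_of(text)
        if origin is not None:
            result[child_id] = origin
    return result

# Made-up manifests: only the first carries the hypothetical S card.
manifests = {
    "child01": "C example\nD 2020-12-04T13:38:03.000\nS parent99\nU drh",
    "child02": "C later\nD 2020-12-06T21:09:56.000\nU drh",
}
print(origin_map(manifests))  # {'child01': 'parent99'}
```

A cross-repo bisect could then restrict itself to the check-ins that appear in this map, since only those have a known counterpart in the parent repo.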

In the somewhat-related discussion of merging Fossils

I see no reason to derail this thread. Let's take it up from there.

(20) By Marcelo Huerta (richieadler) on 2021-05-07 19:56:44 in reply to 1

I was enthusiastic about the discussion of this feature... Is this something still being considered?

My vote goes for these possible names:

fossil extract
fossil splinter
fossil subset

(21) By jshoyer on 2021-05-08 14:01:34 in reply to 20

Since you bumped this thread I'll chime in that I found this discussion interesting but I am glad that this feature idea has not been actively pursued. One of the best things about autosync mode is that it strongly encourages people to branch in branch-space rather than repository-space (which is closely linked to user-space). Built-in tools for excavating subclones would undermine that, in a way, encouraging users to make networks of repositories that share lots of common artifacts rather than making cleaner breaks.

If this feature existed I would probably try to use it in complicated ways and I am happy to be saved from that temptation. : )

I found the motivating example of Lemon more interesting than althttpd, because both SQLite and pikchr depend on Lemon, and because third-party projects like the ‘Lemon grove’ have sprouted.

(22) By sean (jungleboogie) on 2021-05-08 16:08:37 in reply to 21

Tools/features can always be used to do something the creator didn't have in mind.

Hammers and axes are nice, but you know...there's the other sinful things you can do with them.

(23) By schmitzu on 2024-01-09 15:10:29 in reply to 1

Sorry for stepping into this old post, but...

Now I also need such a command to split a repository. Is there an implementation of this in fossil in the meantime?

Regards
Uwe

(24) By Stephan Beal (stephan) on 2024-01-09 15:17:11 in reply to 23

Is there an implementation of this in fossil in the meantime?

Nope.