Fossil Forum

on sliced clones
(1) By anonymous on 2020-09-14 08:07:25 [link] [source]

Hello everyone, I have a question: would it be possible to have a sliced clone of a Fossil repo?

I'll explain this with an example: the svn FreeBSD repo is very big, as it contains the entire history of the project.

If I want to compile and install FreeBSD from source, I have to check out, for example, the releng/12.1 branch with something like:

svn checkout https://svn.freebsd.org/base/releng/12.1 /usr/src

and now I have a local copy of the 12.1 branch, following just that branch, without needing to clone the entire repo history (which would take ages).

If I then want to upgrade to FreeBSD 12.2, I would then need to switch the followed branch with something like:

svn switch ^/releng/12.2

that command will download the 12.2 branch without downloading the entire history for the project.

Is this already implemented in Fossil / would it be possible? Thanks in advance.
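For comparison, the closest workflow Fossil offers today is a full clone followed by branch-scoped checkouts; a sketch, where the repository URL and branch names are hypothetical placeholders (FreeBSD is not hosted in Fossil):

```shell
# Full clone: downloads the entire history once, into a single repo file.
fossil clone https://example.org/freebsd freebsd.fossil

# Open a checkout of one branch from that local repository.
fossil open freebsd.fossil releng-12.1

# Later, switching branches is a purely local operation,
# loosely analogous to "svn switch":
fossil update releng-12.2
```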

(2) By Stephan Beal (stephan) on 2020-09-14 08:12:35 in reply to 1 [link] [source]

Is this already implemented in Fossil / would it be possible?

No and not in any near-term future. This topic (colloquially known as shallow cloning) has been thrown around several times before, but i don't recall how far it got. Not far enough that any code to support it has been written, in any case. It would seem to be a very difficult problem to solve given fossil's architecture and underlying assumptions. (That's not to say that it's "impossible.")

(3) By anonymous on 2020-09-14 09:03:13 in reply to 2 [link] [source]

Correct me if I am wrong:

A shallow clone is the clone of a subset of a repo history. A sliced clone is the clone of a single branch of a repo.

Ergo, a shallow clone can be the clone of the latest month of the project (trunk and other branches included) instead of the entire 10 years of history (for example).

A sliced clone is the clone of a single branch (for example the 2.x branch of a project), allowing me to update only that branch without the need to get the other branches' histories.

A sliced clone, as far as I understand, could be seen as a new fossil repository with the branch I am trying to clone acting as the "trunk" of this clone. That might make a later "branch switch" (if I want to make one) internally difficult, because I would have to rebuild the entire repository around a new "trunk" (an alternative would be to create a new sliced clone and delete the previous one).

The first commit of the new trunk should be the parent of the first commit in the branch, and the next commits in the trunk should be the commits in the branch.

(4) By Stephan Beal (stephan) on 2020-09-14 09:24:10 in reply to 3 [link] [source]

A shallow clone is the clone of a subset of a repo history. A sliced clone is the clone of a single branch of a repo.

There is no fundamental difference between those two. A branch is still a subset of the history.

A sliced clone is the clone of a single branch (for example the 2.x branch of a project), allowing me to update only that branch without the need to get the other branches' histories.

The devil is in the details of the architecture: its design did not anticipate partial/shallow clones and it has very deep-seated assumptions about all state being available at all times. Undoing those assumptions would require an absolutely massive amount of code changes.

This proverbial horse has been beaten to death before by people familiar with fossil's architecture and code, and no straightforward solution to it has yet been found.

(5) By jshoyer on 2020-09-14 13:18:45 in reply to 3 [link] [source]

The Fossil Next Generation (NG) wiki page uses the term “selective sync” for “the ability to push/pull/sync single branches”.

People occasionally use the word “slice” to refer to narrow checkouts or narrow clones, also described on that page.

(6) By Richard Hipp (drh) on 2020-09-14 18:20:32 in reply to 1 [source]

For certain features, like running "fossil ui" and seeing a complete timeline of the project history, you need a lot of the repository history to be local - perhaps not everything, but enough such that the bandwidth advantage of running a narrow or slice clone is not that important.

Just to put some perspective on this, a complete clone of the 14,566 checkins to the Fossil self-hosting repository - activity spanning over 13 years - takes about 33,945,748 bytes of transfer when I just now tried it. In comparison, downloading just the most recent version of the code as a tarball takes 6,126,215 bytes. So, the entire development history of Fossil (including wiki and tickets in addition to check-ins) is less than 6 tarballs worth of bandwidth. If we were to do a shallow clone of (say) just the previous 30 days of activity, how much bandwidth do you think that would really save? If you are constantly having to pull down new files that are not available locally, you will quickly end up using more bandwidth than if you had just cloned in the first place.
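The arithmetic behind "less than 6 tarballs" can be checked directly from the two figures quoted above:

```python
# Bandwidth figures quoted above, in bytes.
full_clone = 33_945_748   # complete clone: 14,566 check-ins over 13+ years
tarball    = 6_126_215    # a single latest-version source tarball

ratio = full_clone / tarball
print(f"full clone costs about {ratio:.2f} tarballs of bandwidth")  # ~5.54
```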

Because of these two factors, it is hard to get motivated to do a shallow/narrow clone facility for Fossil.

But....

An alternative to shallow/narrow clones would be the ability to do a "fossil open" of a repository on a remote machine, without first cloning that remote repository. In other words, operate Fossil in client/server mode, kind of like CVS or SVN. Let's call this idea "client/server mode".

The advantages to client/server mode:

  • You can do a checkout (including a narrow or slice checkout) without having to transfer a lot of extraneous history or files outside the slice of interest. This might be especially advantageous if the repository is very wide and deep and you are only interested in a very small part of it.

  • You don't have a (potentially large) repository taking up space on your local machine.

The disadvantages:

  • You must remain on-network and your server must remain up in order for this to function. If you lose connectivity, work stops until you are back on-line. Just like with CVS/SVN.

  • You do not get the automated backups of the repository that you have when there are clones everywhere. This puts renewed importance on having good server backups.

  • Supporting both distributed and client/server modes leads to a lot of extra code to support in the implementation, which takes time away from other new features.

There are advantages to client/server mode, but there are also implementation hurdles. Some commands (example: "bisect", "annotate", and "ui") will be difficult to implement. Probably a whole new client/server protocol will need to be developed to support client/server mode. (I suggest that the client and the server exchange SQLite database files!)

So while client/server mode would be a nice-to-have, I do not personally have a need for it, nor do any of my paying customers, so it is difficult to get motivated to put in the significant effort needed to make it happen. That said, if you want to have a go at working on client/server mode yourself, I will be happy to function as your "safari guide" and/or "mission control".

(7.1) By Warren Young (wyoung) on 2020-09-14 20:42:10 edited from 7.0 in reply to 6 [link] [source]

Such a feature could maintain a local repo as a backing store for the artifacts it pulls, so that over time, it may pull less and less over the network until it has a nearly complete copy of the parts of the repo you actually use.

At that point, you could say “fossil sync --complete” to fill in the gaps.

This would all extend naturally from the new “fossil open URL” syntax.
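The "fossil open URL" syntax referred to above clones and opens a remote repository in one step; a sketch using Fossil's own self-hosting repository:

```shell
# Clone the remote repository and open a checkout in one command.
# The repository file is created in the current directory.
fossil open https://fossil-scm.org/home
```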

(9) By Warren Young (wyoung) on 2020-09-14 19:53:50 in reply to 7.0 [link] [source]

It occurs to me that Fossil’s current “phantom” mechanism could play helpfully into this: initially, everything is a phantom under a “thin” clone, and the holes fill in over time as the local user causes Fossil to pull more and more artifacts to support ongoing operations.

(10) By Richard Hipp (drh) on 2020-09-14 22:20:42 in reply to 9 [link] [source]

That was my original thought, or something like it. But now I tend more toward a CVS/SVN-style client/server approach, as it would be easier to implement (I think) and, I'm guessing, use fewer server round-trips per operation.

(8) By MBL (RoboManni) on 2020-09-14 19:44:50 in reply to 6 [link] [source]

TDS would be a nice protocol for client/server access to SQLite3. The ODBC client for PostgreSQL also uses it, if I am correctly informed, as do the clients for MS-SQL. But the license conditions would need to be checked.

(11) By Richard Hipp (drh) on 2020-09-14 22:28:07 in reply to 8 [link] [source]

I am not suggesting that the client access the remote repository using SQL. That would be much too low-level, and would be very difficult to secure from attack. It would also be very slow, requiring many round-trips between client and server to get anything done.

(12) By MBL (RoboManni) on 2020-09-15 09:55:10 in reply to 11 [link] [source]

Did you consider my other post about large repos? Slicing "by branch" is one way, the vertical one. Another would be to introduce "cut points", where an older part of the repository remains untouched while only the newer, "live" part needs to be synced to be continued and used; the horizontal one.

Imagine such a cut-off old part as a repository frozen forever, or as a read-only part. Nothing gets lost or changed (at least if access rights disallow modifications).

Since such a cut has the potential to cause pain ("the first cut is the deepest"), it should only be used with care when selecting the point in time after which changes to anything older become impossible.

The cut itself then just has to define what happens when you try to cross that point. Working with the newer part would be as always, but with higher performance. Think about it: how often is an old checkout point actually continued after a long time, whether by branching out or by continuation?

The cut points could even be kept to a minimum by closing open leaves older than the cut point before performing the cut-rebuild.

I like the way fossil works nowadays, but I am also thinking about long-term usage, and I see that other users are in a similar position already. Let's not talk too much about new synchronization methods, but rather about how to get back some of the initial performance a repository has while it is still somewhat small. What would be impacted by such a "hard cut" of the timeline, splitting one repository into two? Which of the interface requirements would break backward compatibility? What would be the drawbacks once a read-only-forever status for the older repository part is accepted?

(13) By Stephan Beal (stephan) on 2020-09-15 13:09:35 in reply to 12 [link] [source]

Slicing "by branch" is one way

Unlike in SVN/CVS, branches in fossil are not fixed in place. At any given moment, any developer can go rename a branch or cause any given commit to branch off from its previous branch (see the "make this commit the start of new branch" option in the checkin editor). They are as fluid as any other tags because fossil branches are implemented as tags. Sharding by branch does not inherently reduce the problem's complexity in fossil.

(14) By MBL (RoboManni) on 2020-09-15 17:31:20 in reply to 13 [link] [source]

At any given moment, any developer can go rename a branch or cause any given commit to branch off from its previous branch

I like these possibilities and have often used them already.

While I see that slicing vertically goes against fossil's principles, capabilities, and flexibility, I still envision, and hope, that cutting off very old history into its own (maybe read-only) locked repository is possible without too much effort. Only the transitions across the cut-off point in time would need to be clarified and handled. - The possibility for one instance to serve several repositories also already exists.

At the cutting point, the newer repository may require some kind of rebase (at least by my understanding of what a rebase is) to keep the "starting" hashes as if the artifacts making them up were still available.

Closing the older repository part should be quite easy, as it is mainly a matter of removing all write rights from users, plus maybe some hyperlinks to the newer live repository. Creating the cut-build of the newer repository could start from an empty repository, auto-creating all open branches up to their non-closed leaves. (Like a snapshot of what is valid at one single point on the timeline.)

Like shunning, such a cut would be a nuclear option - but for repositories older than 10 years this might be no big deal anymore. And because the older part still remains accessible, fossil's principle that nothing gets lost would remain valid.

Is something like such a horizontal slicing feasible?

(15) By Stephan Beal (stephan) on 2020-09-15 17:53:07 in reply to 14 [link] [source]

very old history into its own (maybe read-only) locked repository is possible with not too much effort

Locking and read-only are impossible to 100% reliably implement in a distributed system so long as at least one client has write access. As soon as a single client can edit one copy, nobody else can be sure that their ostensibly read-only copy reflects current reality.

Is something like such a horizontal slicing feasible?

Absolutely: do an export of a specific version, delete anything you don't want, and create a new repository with the rest of it, optionally adding a link back to the origin in the commit message.
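The recipe above can be sketched as a sequence of standard Fossil commands; the repository names, paths, and the version tag are placeholders, and the exact flags may vary by Fossil version:

```shell
# Check out the one version you want to keep from the old repository.
fossil open /path/to/old.fossil release-2.x --workdir /tmp/slice
cd /tmp/slice

# Detach the working files from the old repository.
fossil close

# Create a brand-new, history-free repository and open it over the files.
fossil init /path/to/new.fossil
fossil open /path/to/new.fossil --keep

# Schedule all files for addition and make the initial commit,
# with a pointer back to the origin repository in the message.
fossil addremove
fossil commit -m "Import of release-2.x; full history in old.fossil"
```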

Sure, that loses the history, but if you don't care about the old content, why keep it around?