Towards shallow cloning

(1) By Warren Young (wyoung) on 2019-07-21 06:13:43

There are two central problems in making arbitrarily-large Fossil repos practical:

1. The total amount of data. This is the easy problem. If the Internet can support Netflix, it can support large Fossil repo clones.
2. The post-clone rebuild time, in which Fossil re-creates all of the lookup tables that give it capabilities such as rapid forward tracing through the repo history and invulnerability to Git's dreaded detached head state.

We can easily see that the second problem dominates by comparing the time to scp a large Fossil repo to the time taken to fossil clone it over the same network.

I believe the solution to this problem is shallow clones. Once we can find someone with the time, will, and skill to implement the feature, the primary problem that developer will face is that reconstructing the current version of a given file can require pulling up every historical delta artifact.

That isn't a necessary consequence of Fossil's repo format. This need is simply due to Fossil's current implementation, which prizes the smallest overall repo size; hence fossil rebuild --compress.

To make shallow cloning practical, Fossil would need logic that limits how far back the delta chain for a particular file can go. If you set that limit at 1 year and change each file only once a year, the repo will end up with no delta compression at all. The upside is that you can clone the past year's worth of repo data for roughly the cost of one checkout's worth of data.

The shorter you make that window, the bigger the size of a full repo clone, since delta compression becomes increasingly less effective.

It would be possible to write this feature in a way that lets you set the normal delta window size limit to 30 days but increase that window size by some nonlinear function like 2ⁿ, so that the prior window is 60 days, doubling back to the project's founding date, where the oldest delta window would be roughly half the repo's total history.
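To make the doubling scheme concrete, here is a minimal Python sketch that computes the window boundaries. It is not anything Fossil implements; the function name `delta_windows` and the `base_days` parameter are made up for illustration, and the oldest window is simply clipped at the founding date.

```python
from datetime import date, timedelta

def delta_windows(founding, today, base_days=30):
    """Return (start, end) date pairs, newest window first.

    Hypothetical helper: each window is twice as long as the one after it
    in time (30 days, then 60, then 120, ...), and the oldest window is
    clipped so that it starts at the project's founding date.
    """
    windows = []
    end = today
    span = base_days
    while end > founding:
        start = max(founding, end - timedelta(days=span))
        windows.append((start, end))
        end = start
        span *= 2
    return windows

if __name__ == "__main__":
    founding = date(2019, 1, 1)
    today = founding + timedelta(days=200)
    for start, end in delta_windows(founding, today):
        print(start, "to", end, "=", (end - start).days, "days")
```

A shallow clone would then only need the newest window, and could pull older windows later.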

That would reduce the amount of repo bloat needed to support this feature, but it would require occasional rebuilds, else the current 30-day delta window just keeps rolling over indefinitely, creating a string of 30-day delta windows.

The end goal of all of this is that you should be able to tell Fossil to pull only the most recent versions of all open branches, going back no deeper than the current delta window. Fossil will then model the common use case: we mainly need the current versions immediately, maybe a few days old occasionally, and older versions with steadily decreasing probability. You might need all versions eventually, but not immediately after the initial clone. Historical delta windows can be pulled in over time as required.

(2) By Richard Hipp (drh) on 2019-07-21 17:07:00 in reply to 1

Normally it is the most recent check-in that is stored as full-text and older versions are deltas of the most recent. So when a new check-in is added to trunk (for example) the new check-in is stored full-text, and the previous tip is rewritten as a delta of the new check-in.
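As a toy illustration of that storage order, the following Python sketch keeps the newest version as full text and rewrites the previous tip as a delta against it. The `ToyStore` class and the copy/insert delta format are invented for this example; Fossil's real delta encoding and its BLOB and DELTA tables work differently in detail.

```python
import difflib

def make_delta(base, target):
    """Encode target relative to base as copy/insert ops (toy format)."""
    ops = []
    matcher = difflib.SequenceMatcher(a=base, b=target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))          # reuse base[i1:i2]
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2])) # literal new text
        # "delete" needs no op: the base text is simply not copied
    return ops

def apply_delta(base, ops):
    """Rebuild the target text from base plus the ops from make_delta."""
    parts = []
    for op in ops:
        if op[0] == "copy":
            parts.append(base[op[1]:op[2]])
        else:
            parts.append(op[1])
    return "".join(parts)

class ToyStore:
    """Newest version is full text; older ones are deltas of newer ones."""

    def __init__(self):
        self.blobs = {}  # rid -> ("full", text) or ("delta", base_rid, ops)
        self.tip = None

    def commit(self, rid, text):
        if self.tip is not None:
            old_text = self.content(self.tip)
            # Rewrite the previous tip as a delta against the new check-in.
            self.blobs[self.tip] = ("delta", rid, make_delta(text, old_text))
        self.blobs[rid] = ("full", text)
        self.tip = rid

    def content(self, rid):
        entry = self.blobs[rid]
        if entry[0] == "full":
            return entry[1]
        _, base_rid, ops = entry
        return apply_delta(self.content(base_rid), ops)
```

In this model, reading the tip never touches a delta chain; only older versions do.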

Of course, the BLOB and DELTA tables of the repository are designed in such a way that you can substitute a completely different delta design and everything will still continue to work without any changes to legacy. But the point is, I don't think deltas are a big factor in making shallow clones hard.

The core difficulties with shallow clone are these:

1. The sync algorithm does not know about check-ins, files, wiki, tickets, or anything else. It is just trying to ensure that both sides of the exchange have the same set of "artifacts". The sync algorithm pays no attention to the "age" of the artifacts and has no way of restricting the sync to recent artifacts. To change this would greatly increase the complexity of sync.
2. Very old propagating tags make a difference for recent check-ins. If you shallow-clone one year of content, there might be two-year-old tags that affect that content.

If the rebuild time is an issue, perhaps a better solution would be a new cloning mechanism that has the server compute a new, sanitized repository and then just ship down the complete repository. Sanitizing the repository would involve:

1. Removing sensitive information such as the USER table, and especially the password information in the USER table.
2. Removing the ACCESSLOG and ADMIN_LOG tables.
3. Removing all private branches and other content marked private, and clearing the PRIVATE table.
4. Removing sensitive and site-specific information from the CONFIG table. Probably there should be a white-list of allowed CONFIG table keys and anything not on the white-list gets removed.
5. Removing all notification and email-related tables, such as SUBSCRIBER, MODREQ, and PENDING_ALERT.
6. Removing the UNVERSIONED table of unversioned content, unless the clone client requests to also clone unversioned files.

All of this cleanup should be done using a white-list approach, not a black-list approach. In other words, delete everything that is not on the white-list. That way, as new features are added, we do not risk forgetting to add sensitive content to a black-list.
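A rough sketch of the white-list idea, using Python's sqlite3 module against a throwaway database. The `ALLOWED_TABLES` and `ALLOWED_CONFIG_KEYS` sets are placeholders chosen for illustration, not a vetted list, and the sketch assumes a CONFIG table with a `name` column; it is not how "fossil scrub" is implemented.

```python
import sqlite3

# Placeholder white-lists for illustration only. A real list would have to
# be chosen and maintained by the Fossil developers.
ALLOWED_TABLES = {"blob", "delta", "config"}
ALLOWED_CONFIG_KEYS = {"project-name", "project-code"}

def sanitize(conn):
    """Drop every table not on the white-list, then filter CONFIG keys.

    Returns the sorted names of the tables that were dropped.
    """
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'")]
    dropped = []
    for name in tables:
        if name.lower() not in ALLOWED_TABLES:
            conn.execute('DROP TABLE "{}"'.format(name))
            dropped.append(name)
    # Same rule for CONFIG rows: keep only white-listed keys.
    marks = ",".join("?" * len(ALLOWED_CONFIG_KEYS))
    conn.execute(
        "DELETE FROM config WHERE name NOT IN ({})".format(marks),
        sorted(ALLOWED_CONFIG_KEYS))
    conn.commit()
    return sorted(dropped)

def demo():
    """Build a tiny stand-in repository in memory and sanitize it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE blob(rid INTEGER PRIMARY KEY, content BLOB);
        CREATE TABLE delta(rid INTEGER PRIMARY KEY, srcid INTEGER);
        CREATE TABLE config(name TEXT PRIMARY KEY, value TEXT);
        CREATE TABLE user(login TEXT, pw TEXT);
        CREATE TABLE accesslog(uname TEXT);
        CREATE TABLE future_feature(secret TEXT);
        INSERT INTO config VALUES('project-name', 'demo');
        INSERT INTO config VALUES('site-specific-secret', 'x');
    """)
    dropped = sanitize(conn)
    kept_keys = [r[0] for r in conn.execute("SELECT name FROM config")]
    return dropped, kept_keys
```

Note that the table `future_feature`, which the white-list has never heard of, is removed without anyone having to remember it.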

There is already a "fossil scrub" command that does much of the above. But I have not looked at the implementation of that command in a long time. It might need to be updated to account for recent enhancements and changes. (Later: Confirmed. The "fossil scrub" command does need some attention.)

See the https://www.fossil-scm.org/fossil/repo-tabsize page for a quick summary of table sizes. Actually sending the tables that get recomputed by the rebuild step of a clone (basically every table other than BLOB and UNVERSIONED) does not really add all that much to the size of the download.

See also the /artifact_stats page (which you will have to view on your local clone, as this is an administrator-only page in the current implementation) for additional interesting statistics on artifact sizes. Note, for example, that the largest 1% of artifacts take up 44% of the total repository space in Fossil. (The ratio is even higher in SQLite.)

(3) By Richard Hipp (drh) on 2019-07-21 17:16:24 in reply to 2

Further information:

The delta algorithm actually works pretty well. If you take a tarball of the latest Fossil check-in, it is about 5.4 MB in size. But the whole repository with 12,634 check-ins is only 48.3 MB (if you omit the UNVERSIONED table) or less than 9 times larger.

So, from one point of view, if you simply encode the first 9 check-ins as full-text, you get the other 12,625 check-ins (about 99.93% of all check-ins) for free. :-)
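The arithmetic behind that remark can be checked in a few lines of Python. The figures are the ones quoted in this post; the variable names are just for illustration.

```python
import math

tarball_mb = 5.4    # tarball of the latest check-in, from the post
repo_mb = 48.3      # whole repository without UNVERSIONED, from the post
checkins = 12634    # total check-ins, from the post

# How many full-text check-ins would take up as much space as the repo?
ratio = repo_mb / tarball_mb
full_text_equivalent = math.ceil(ratio)

# Everything beyond that many check-ins is, in this view, "free".
free_checkins = checkins - full_text_equivalent
free_share = free_checkins / checkins

if __name__ == "__main__":
    print("size ratio: {:.2f}".format(ratio))
    print("full-text equivalent:", full_text_equivalent)
    print("free check-ins: {} ({:.2%})".format(free_checkins, free_share))
```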

(4) By anonymous on 2019-07-24 00:01:20 in reply to 2

If fossil scrub can get the attention it needs, an interim "big clone" strategy could be implemented.

I envision something like:

At main host:

1. clone the main repo
2. scrub the clone
3. set up a job to periodically update the clone from the main repo

At clone host:

1. make a local copy of the scrubbed clone
2. fossil sync https://example.com/main_repo

This would enable contributors to projects with large repositories to quickly make local clones.
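The steps above could be spelled out roughly as follows. This Python sketch only builds and prints the command lines; it does not run anything. The file names, the use of scp for the copy, and the choice of "fossil pull" for the periodic update are assumptions, and the scrub flags (--verily, --force) should be checked against "fossil help scrub" for your version.

```python
import shlex

def big_clone_plan(main_url, scrubbed_repo, remote_copy, local_repo):
    """Return the commands for each host as lists of argv lists.

    Hypothetical helper: nothing is executed, and every path is a
    placeholder supplied by the caller.
    """
    main_host = [
        # 1. clone the main repo
        ["fossil", "clone", main_url, scrubbed_repo],
        # 2. scrub the clone (flags as listed in "fossil help scrub")
        ["fossil", "scrub", "--verily", "--force", scrubbed_repo],
        # 3. command for the periodic job (e.g. run from cron)
        ["fossil", "pull", main_url, "-R", scrubbed_repo],
    ]
    clone_host = [
        # 1. fetch a copy of the scrubbed clone; scp is one possible way
        ["scp", remote_copy, local_repo],
        # 2. catch up with the main repo
        ["fossil", "sync", main_url, "-R", local_repo],
    ]
    return {"main_host": main_host, "clone_host": clone_host}

if __name__ == "__main__":
    plan = big_clone_plan(
        "https://example.com/main_repo",
        "scrubbed.fossil",
        "example.com:/srv/scrubbed.fossil",
        "local.fossil",
    )
    for host, commands in plan.items():
        print(host + ":")
        for argv in commands:
            print("  " + shlex.join(argv))
```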