Fossil and large repos?

(1) By js (Midar3) on 2020-09-08 12:18:24 [link] [source]

Hi,

I noticed that Fossil gets extremely disk space hungry and extremely slow when dealing with larger repositories.

In particular, cloning https://pkgsrc.fossil.netbsd.org takes an order of magnitude longer than cloning the same via Git or Mercurial. The resulting repo file is also several times bigger than with either Git or Mercurial.
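
Roughly, the comparison (assuming the GitHub mirror lives at NetBSD/pkgsrc; the Fossil URL is the real one above):

$ time fossil clone https://pkgsrc.fossil.netbsd.org pkgsrc.fossil
$ time git clone https://github.com/NetBSD/pkgsrc.git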

Any idea what's up with that? I could understand being a few percent bigger, but several times?

Also, any operation on this repo takes a very long time, to the point that CVS, which has to talk to the server, is almost faster.

(2) By Stephan Beal (stephan) on 2020-09-08 12:53:09 in reply to 1 [link] [source]

Any idea what's up with that? I could understand being a few percent bigger, but several times?

Though fossil can architecturally handle such repos, it's far from optimized for those scales. Fossil was designed for small- to mid-sized projects, not behemoths like that one.

One of the culprits is the number of files. Every checkin contains a list of all files which are part of that checkin, regardless of whether they were modified.

That particular repo uses delta manifests, which shrink their size but not the work needed to create and traverse them. Its current tip is:

https://pkgsrc.fossil.netbsd.org/artifact/18aed899a9074f9f

which is a delta manifest derived from:

https://pkgsrc.fossil.netbsd.org/artifact/ba7dfd1855ecbc

which is a ~10.2MB manifest with 103006 files in it. Every commit (or call to "status") has to traverse every one of those files to know for certain whether they've been modified. Even on an SSD, traversing those will take time.
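
For illustration, a delta manifest is quite short: its B card names the baseline manifest, and its F cards list only the files which changed relative to that baseline. Roughly like this (abbreviated output with invented file names, comments, and hashes; only the card layout reflects the real artifact format):

$ fossil artifact 18aed899a9074f9f
B ba7dfd1855ecbc...
C Update\ssome\spackage
D 2020-09-08T12:00:00.000
F pkgtools/some-package/Makefile 9c0d1e2f...
P 7a8b9c0d...
U committer-name
Z 3c4d5e6f...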

Another culprit for large repos is the repo-cksum setting, which enables an extra layer of hash defense calculated during manifest generation. That calculation's cost is linear in the number of files in the checkin and their sizes, and it's also a massive memory hog. Because of that cost, and because it sits behind 2 or 3 other levels of hash protection, it can be disabled.
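
For reference, repo-cksum is a stock setting which can be inspected and changed via the settings command (these are real commands; whether disabling the check is acceptable is a per-project decision):

$ fossil settings repo-cksum        # show the current value
$ fossil settings repo-cksum off    # skip that extra checksum layer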

There are probably other factors which come into play with massive repos, but those are the two which immediately come to mind.

(3) By sean (jungleboogie) on 2020-09-08 14:15:03 in reply to 1 [link] [source]

That's nearly a 10GB repo with 1,861,916 artifacts.

https://pkgsrc.fossil.netbsd.org/stat

This example repo has come up a few times on the mailing list before. You can search to find responses there, but Stephan's is pretty much what I remember others pointing out as well.

(4) By MBL (RoboManni) on 2020-09-08 15:12:01 in reply to 2 [link] [source]

See also this forum post for an answer.

(5.1) By js (Midar3) on 2020-09-08 22:12:54 edited from 5.0 in reply to 2 [link] [source]

Are there any plans to make repos like this work?

I'm mostly asking because I am a pkgsrc contributor and would like to use Fossil instead of CVS. But cloning already took several hours, and the resulting checkout is so gigantic that it wastes too much SSD space to be viable.

FWIW, https://src.fossil.netbsd.org/ has similar problems.

// Edit: Also, the current GitHub mirror, as well as the Mercurial mirror, is generated from Fossil, and updating them sometimes takes hours.

(6) By Stephan Beal (stephan) on 2020-09-08 22:30:34 in reply to 5.1 [link] [source]

Are there any plans to make repos like this work?

Very probably not until someone who is directly affected by such extreme cases decides to get involved and can upend the architecture to be performant at such scales.

In all honesty, i can't understand why those particular trees insist on using fossil, as it's never been a great match for their needs.

(7.1) By MBL (RoboManni) on 2020-09-09 07:03:34 edited from 7.0 in reply to 3 [source]

Having looked at that repository, I was thinking about how to handle such huge, long-running repository histories.

First of all, one can see that there is not much branch and merge activity; mainly there is just one living timeline, trunk.

But that is not the full truth... going back in the timeline, e.g. selecting BMake (108 days ago) from the list of branches and then focusing on the timestamp, you can see that there are two trunk branches existing side by side with the same name and the same content.

Going back in time you can find even more branches, not only trunk, which exist in parallel more than once or twice. Something strange was going on and was not corrected for a long time.

Breaking huge repositories into smaller portions

Coming back to the topic of large repos... what could be done:

  1. fossil rebuild could be instructed to split a long-running repository into two: "old historical" and "young alive"
  2. at the cut point, the older portion has to tag each cut branch as "closed but continued in younger repository xyz", with a link to the xyz repository
  3. the younger repository then has to start with a "continued from older repository" entry instead of an "initial" entry, with a link to the older repository
  4. commits into the old historical portion should be prohibited altogether, to avoid any branching happening outside the young alive repository. Looking at that huge repository, which goes back 23 years, this should not be a hindrance: which branch would ever have to start 20 years too late?
  5. the repository integrity check at the cut points would just have to add a lookup verifying that the old historical repository is in place with all the linked hash IDs
  6. how to deal with non-timeline content? My suggestion: duplicate it into both portions, the old AND the new one; problem of broken references solved. But this needs to be further discussed.

While the closing work on the old historical repository should be easy, perhaps with the help of a new tag type like "branch closed here but continues in young repository", the question remains how to build up the remaining young alive repository. One solution could be an automated checkout followed by an automated initial checkin, which gets tagged as "continued from older repository" instead of "initial". Multiple initial checkins are already supported by newer fossil versions, so why shouldn't multiple "continued from older repository" checkins also be possible?
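
To make the idea concrete, a purely hypothetical sketch (none of these options or tag names exist in today's fossil; they are all invented here):

$ fossil rebuild --split 2018-01-01 \
      --old pkgsrc-historic.fossil --young pkgsrc-alive.fossil
# in pkgsrc-historic.fossil, each cut branch would end with a tag like
#   "closed, continued in pkgsrc-alive.fossil"
# and pkgsrc-alive.fossil would begin with checkins tagged
#   "continued from pkgsrc-historic.fossil"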

My solution would be a hard cut which freezes the old historical portion, but the chaining of the different repositories is the basic idea that keeps alive fossil's promise to never forget anything. Easier work with the young alive repository, with much better performance, would be the win-win from these efforts.

(8) By Stephan Beal (stephan) on 2020-09-09 12:04:00 in reply to 7.1 [link] [source]

Easier work with the young alive repository, with much better performance, would be the win-win from these efforts.

If your repository has 1 million files, then splitting it into old and young won't matter much for performance. The age of the files has no real effect on the processing speed of manifest creation (provided it's always traversing the latest copy; older copies are more expensive to traverse), but processing 1 million files of any age is going to take a tremendous amount of time above and beyond any SCM-side work.

Try the following in your shell, which simply runs "stat" on a single file 1 million times:

$ date; i=0; while [[ i -lt 1000000 ]]; do stat wiki.c > /dev/null; i=$((i + 1)); done; date
Wed 09 Sep 2020 01:49:18 PM CEST
^C

i cancelled it after 7 minutes on this laptop (with an SSD drive - a spinning disk could be considerably slower).

Re-running that test with a count of 1000 takes 3 seconds and 10k takes 29 seconds. 100k will probably take almost 300 seconds and 1M probably 10 times that: best case, nearly an hour just to stat that many files.
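
For anyone wanting to reproduce those numbers, here is the same experiment as a small bash function (just a parameterized rewrite of the loop above; wiki.c is whatever file happens to be at hand):

bench_stat() {
    local n=$1 start=$SECONDS i
    for ((i = 0; i < n; i++)); do stat wiki.c > /dev/null; done
    echo "$n stat calls: $((SECONDS - start))s"
}
bench_stat 1000     # ~3 seconds on this laptop
bench_stat 10000    # ~29 seconds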

That's without any of fossil's own internal work going on, and fossil has to do a great deal more than stat files when creating a manifest.

(9) By Warren Young (wyoung) on 2020-09-09 13:27:09 in reply to 1 [link] [source]

cloning...takes an order of magnitude longer than cloning the same via Git or Mercurial.

It isn't an apples-to-apples comparison. Fossil's doing more. While there may be some micro-optimizations available to make it somewhat faster, the only way to make it give the same performance would be to throw away some of those advantages.

  1. The clone step itself probably isn't the biggest time cost; rather, it's rebuilding the metadata required by Fossil's superior data model (see the sketch after this list). I don't know about Mercurial, but Git doesn't need that step because its data model only lets it trace backward toward the root, not the other way. This has the bad effects laid out in that document and elsewhere.

    The only way I see to avoid this cost is for the server to somehow transmit its precomputed metadata down to the client, which can then slurp it in uncritically.

  2. Fossil has more checksums, cross-checks, and referential integrity measures than Git. Maybe Mercurial, too, though that's just speculation. You've been shown the repo-cksum option, but that's the only one you can turn off. The rest of the costs you don't get a choice on, currently. And if you did, would you turn them off?

  3. Later, you bring up CVS, but that's apples-to-kumquats. Fossil could approach similar performance by allowing shallow cloning (item 21), but that's been on the wishlist for years now.
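
To see that rebuild cost from item 1 in isolation, it can be re-run on an already-cloned repository (fossil rebuild is a stock command; the file name is just an example):

$ time fossil rebuild pkgsrc.fossil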