Successfully cloned massive NetBSD src fossil repository with resume
(1) By Andy Bradford (andybradford) on 2023-12-17 05:54:06 [link] [source]
I've been running and testing the code from the clone-resume branch for a couple of weeks now trying to clone the massive NetBSD src fossil repository[1]. I didn't think it would take as long as it did or I would have probably paid more attention to when I actually started the clone, however, I think I can get the picture from the /rcvfromlist page which has 41 entries, the first of which is:

1 2023-11-29 04:57:18 amb sha3 163.172.4.16

Given that initial entry, I've been trying to clone this repository for over 2 weeks. I believe that the 41 entries each represent an interrupted clone and resuming thereafter. Most of the interruptions were self-inflicted as I tested the ability to resume, however, there was one unexpected 30 minute network outage that interrupted the clone.

The Rebuild phase of the clone started sometime on Monday the 11th of December and just finished sometime today, the 16th of December. For some reason, the output stopped reporting on progress when it got to 48.9% and just sat there for 3-4 days. It may have continued reporting sometime during the night when I wasn't paying attention. The delta compression phase took much less time than the rest of the rebuild but still took all day today to complete.

Here is the output from the last resumed clone operation followed by the rebuild and delta compression. The reason why there are 2 lines representing the percentage complete is because I suspended the rebuild so I could do something else with the computer because it was taxing the I/O quite a bit.

Clone done, wire bytes sent: 28323 received: 1214181826 remote: src.fossil.netbsd.org (163.172.4.16)
Uncompressed payload sent: 10834 received: 1214152124
Rebuilding repository meta-data...
48.9% complete...
100.0% complete...
Extra delta compression... 757855 deltas save 11,729,081,754 bytes
Vacuuming the database...
project-id: 3109ce34a3d43fd786dcf0133c6a96a6de40c573
server-id: fbbbc5a67d6d465dcef5d86f1d05fe52d2352a1e
admin-user: amb (password is "BMhnetHYKr")

Here are some statistics:

$ time fossil dbstat -R netbsd-src.fossil
project-name: NetBSD src
repository-size: 50,270,945,280 bytes
artifact-count: 4,304,686 (stored as 421,870 full text and 3,882,816 deltas)
artifact-sizes: 60,499 average, 29,093,223 max, 260,430,148,484 total
compression-ratio: 5:1
check-ins: 1,954,758
files: 526,433 across all branches
wiki-pages: 0 (0 changes)
tickets: 1 (1 changes)
events: 0
tag-changes: 0
latest-change: 2023-12-11 21:57:41 - about 5 days ago
project-age: 11,479 days or approximately 31.43 years.
project-id: 3109ce34a3d43fd786dcf0133c6a96a6de40c573
schema-version: 2015-01-24
fossil-version: 2023-12-08 15:30:14 [29e9e84a1e] [2.24] (clang-13.0.0 )
sqlite-version: 2023-11-24 11:41:44 [ebead0e723] (3.44.2)
database-stats: 6,136,590 pages, 8192 bytes/pg, 0 free pages, UTF-8, delete mode
0m10.89s real 0m04.55s user 0m05.76s system

Here is my attempt to pull today after rebuild was complete:

$ fossil pull -R netbsd-src.fossil
Pull from https://src.fossil.netbsd.org/
Round-trips: 4 Artifacts sent: 0 received: 116
Pull done, wire bytes sent: 6366 received: 49913 remote: 163.172.4.16

Andy

[1] https://www.fossil-scm.org/forum/forumpost/8cbc83e5d08a86b4
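As a quick sanity check on the dbstat figures quoted in this post, the following sketch redoes the arithmetic. All numbers are copied from the output above; the variable names are made up for illustration, and the assumption that the reported 5:1 ratio is simply total artifact bytes divided by repository size is mine, not something the output states.

```python
# Figures copied verbatim from the `fossil dbstat` output in this post.
pages = 6_136_590
page_size = 8192                      # bytes per page
repo_size = 50_270_945_280            # repository-size, in bytes
artifacts = 4_304_686                 # artifact-count
full_text = 421_870
deltas = 3_882_816
total_uncompressed = 260_430_148_484  # artifact-sizes "total"

# The database file size should be pages * page size.
db_bytes = pages * page_size

# Average artifact size (integer division, as a reported whole number).
average = total_uncompressed // artifacts

# Assumption: the "5:1" ratio is uncompressed bytes over on-disk bytes.
ratio = total_uncompressed / repo_size

print(db_bytes == repo_size)            # page count accounts for the file size
print(full_text + deltas == artifacts)  # full-text + delta artifacts add up
print(average)
print(round(ratio, 2))
```

The page count times the page size matches the repository size exactly, the two storage classes add up to the artifact count, and the ratio comes out a little above 5, consistent with the rounded 5:1 that dbstat prints.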
(2) By Florian Balmer (florian.balmer) on 2023-12-17 11:39:13 in reply to 1 [link] [source]
My first reaction:
https://i.giphy.com/1M9fmo1WAFVK0.webp
My second reaction:
Is the finished clone really usable? Is it possible to open a new check-out in reasonable time? Is it fast to detect changed files? Is browsing the repository through the web UI still a pleasant experience? Can you load CLI diff and timeline views without having to wait? If all answers are "yes", I'm definitely impressed!
(3) By Stephan Beal (stephan) on 2023-12-17 12:38:38 in reply to 2 [link] [source]
@Andy: congratulations!!! That improvement is long overdue!
@Florian:
Is the finished clone really usable?
Whether that particular repo was ever really "usable" with fossil is a matter of debate ;).
(For those who don't know: the netbsd repository is the single-largest known use of fossil.)
Is it possible to open a new check-out in reasonable time? Is it fast to detect changed files? Is browsing the repository through the web UI still a pleasant experience?
Fossil's manifest format, which lists all files in each checkin, doesn't scale well to repos of that size, causing some pain in all of the cases you mention. The "delta manifest" format was, IIRC, added because of that repository, but that format still requires (for many (most?) purposes) loading both the small delta manifest and its full-length parent manifest, so the delta manifest offers a storage savings but does not improve runtimes for many cases (e.g. browsing /dir).
(8) By Vadim Goncharov (nuclight) on 2024-04-28 00:00:51 in reply to 3 [link] [source]
Fossil's manifest format, which lists all files in each checkin, doesn't scale well to repos of that size, causing some pain in all of the cases you mention. The "delta manifest" format was, IIRC, added because of that repository, but that format still requires (for many (most?) purposes) loading both the small delta manifest and its full-length parent manifest, so the delta manifest offers a storage savings but does not improve runtimes for many cases (e.g. browsing /dir).
Seems some new VCS is needed, FossilNG perhaps? :) Or how should this problem be fixed? Is it an inevitable architecture problem? Git is able to handle such repos, but I would not say git's format is fundamentally different from Fossil's...
(9.1) By Stephan Beal (stephan) on 2024-04-28 06:51:03 edited from 9.0 in reply to 8 [link] [source]
FossilNG perhaps?
See the like-named wiki page at src:/wiki?name=Fossil-NG
Git is able to handle such repos, but I would not say git's format is fundamentally different from Fossil's...
i know literally nothing about how git stores the list of files/hashes associated with each checkin, so can't comment on it.
However, i can, with some degree of authority, comment on...
Fossil has an optimization to reduce the number of files listed in a manifest, drastically decreasing the size of manifests for repos like the pkgsrc one. They're called delta manifests and, IIRC, they were added specifically because of the NetBSD pkgsrc repo. In practice, however, they don't save much space and they cost more RAM. See:
src:/doc/trunk/www/delta-manifests.md
Delta manifests are a trade-off, in any case. The main fossil repo and its sibling, the sqlite repo, are both explicitly configured to not generate delta manifests because their use makes it far more difficult for downstream clients to validate the contents of their downloads (see the above article for why).
(10) By Vadim Goncharov (nuclight) on 2024-06-14 20:31:50 in reply to 9.1 [link] [source]
See the like-named wiki page at src:/wiki?name=Fossil-NG
Seems that is just a list of ideas ("what we need"), not architecture/storage/format sketches. I'd add pull requests to that, though :)
i know literally nothing about how git stores the list of files/hashes associated with each checkin, so can't comment on it.
Oh, the format is really dumb. Git essentially has just 3 object types: the raw blob for file contents (delta compression in "packs" is viewed as an implementation detail), the "tree" BLOB type and the "commit" BLOB type, each identified by its SHA hash. The "commit" type is the most structured and the closest to Fossil's manifest with cards: it has a NUL-terminated header (type/size) followed by the usual LF-delimited lines giving author, parent hashes, description etc. The most important of them is the single line "tree 0c602ff1942<rest of hash>".
The "tree" object, after type/size header, is just sequence of records:
mode SP filename NUL binary-20-bytes-of-SHA-hash
e.g.
00000000 74 72 65 65 20 35 31 32 00 31 30 30 36 34 34 20 |tree 512.100644 |
00000010 2e 67 69 74 69 67 6e 6f 72 65 00 4a 66 1e cc 38 |.gitignore.Jf..8|
00000020 c0 e1 31 d7 17 02 d7 65 ff 80 1f 7f 93 3e 85 31 |..1....e.....>.1|
00000030 30 30 36 34 34 20 4c 49 43 45 4e 53 45 00 a6 12 |00644 LICENSE...|
00000040 ad 98 13 b0 06 ce 81 d1 ee 43 8d d7 84 da 99 a5 |.........C......|
00000050 40 07 31 30 30 36 34 34 20 52 45 41 44 4d 45 2e |@.100644 README.|
00000060 6d 64 00 86 c0 79 b1 83 20 64 ed b4 f2 b2 1c 3c |md...y.. d.....<|
00000070 9c 8f ba 17 cf 1b 24 34 30 30 30 30 20 64 6f 63 |......$40000 doc|
00000080 73 00 0c d7 41 cf 6a 0a 66 40 d2 cf be b9 ec 41 |s...A.j.f@.....A|
00000090 31 2f 1c c5 58 44 31 30 30 36 34 34 20 65 6d 6f |1/..XD100644 emo|
If the mode is 40000 (octal, written as ASCII digits as in the dump above), then the hash is a pointer to another tree object, that is, a directory; otherwise, it's a pointer to a raw file contents BLOB.
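The record layout described above can be sketched as a small parser. This is only an illustration, assuming the input is the already-decompressed object (header plus entries); the names TreeEntry and parse_tree are made up here. The two sample entries reuse the .gitignore and docs records from the hex dump, with the size in the header recomputed for just those two entries.

```python
from typing import Iterator, NamedTuple


class TreeEntry(NamedTuple):
    mode: str      # ASCII octal digits, e.g. "100644" or "40000"
    name: str
    sha1_hex: str  # the 20 binary hash bytes, rendered as hex
    is_tree: bool  # True when the entry points at another tree (a directory)


def parse_tree(raw: bytes) -> Iterator[TreeEntry]:
    """Parse b'tree <size>\\0' followed by 'mode SP name NUL 20-byte-hash' records."""
    header, _, body = raw.partition(b"\0")
    kind, _, size = header.partition(b" ")
    if kind != b"tree" or int(size) != len(body):
        raise ValueError("not a well-formed tree object")
    pos = 0
    while pos < len(body):
        sp = body.index(b" ", pos)    # end of the mode field
        nul = body.index(b"\0", sp)   # end of the file name
        digest = body[nul + 1:nul + 21]
        if len(digest) != 20:
            raise ValueError("truncated entry")
        mode = body[pos:sp].decode("ascii")
        name = body[sp + 1:nul].decode("utf-8", "replace")
        yield TreeEntry(mode, name, digest.hex(), mode == "40000")
        pos = nul + 21


# Two records taken from the hex dump above.
body = (
    b"100644 .gitignore\0" + bytes.fromhex("4a661ecc38c0e131d71702d765ff801f7f933e85")
    + b"40000 docs\0" + bytes.fromhex("0cd741cf6a0a6640d2cfbeb9ec41312f1cc55844")
)
raw = b"tree " + str(len(body)).encode("ascii") + b"\0" + body
entries = list(parse_tree(raw))
for entry in entries:
    print(entry.mode, entry.name, entry.sha1_hex, "tree" if entry.is_tree else "blob")
```

The parser searches for the space and the NUL only from the start of each record, so hash bytes that happen to contain 0x20 or 0x00 do not confuse it.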
So, if one has a large repo, then there will be one tree object for each directory per commit, but in fact, for those directories that did not change, it will be the same tree object! E.g. if a check-in has changed just src/subsys1/file1.c, then the commit BLOB will record a tree pointing to a new src, which will point to a new tree BLOB for subsys1, but the doc (and below), src/subsys2 trees and so on are unchanged and point to the same BLOBs (hashes) as in the previous check-in.
As you can see, this architecture has problems with tracking renames and with the blame command (it really looks similar to FAT16), but it is effective in saving space for large repos when only some files were changed. I don't know about CPU, but it probably helps there too, at least for some operations.
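To make the sharing of unchanged tree objects concrete, here is a toy content-addressed store, a sketch rather than git itself. It follows the "type size NUL payload" and "mode SP name NUL hash" layouts described above, but it sorts names naively and the helper names store and write_tree are made up, so the hashes are not guaranteed to match what real git would produce. The directory layout mirrors the src/subsys1/file1.c example.

```python
import hashlib


def store(objects: dict, kind: bytes, payload: bytes) -> bytes:
    """Store one object under the SHA-1 of 'kind size NUL payload'; return the digest."""
    raw = kind + b" " + str(len(payload)).encode("ascii") + b"\0" + payload
    digest = hashlib.sha1(raw).digest()
    objects[digest] = raw  # identical content lands on the same key
    return digest


def write_tree(objects: dict, node: dict) -> bytes:
    """node maps name -> bytes (file content) or dict (sub-directory)."""
    body = b""
    for name in sorted(node):  # simplified ordering, not git's exact rule
        child = node[name]
        if isinstance(child, dict):
            mode, digest = b"40000", write_tree(objects, child)
        else:
            mode, digest = b"100644", store(objects, b"blob", child)
        body += mode + b" " + name.encode("utf-8") + b"\0" + digest
    return store(objects, b"tree", body)


objects = {}
v1 = {
    "doc": {"readme.txt": b"docs\n"},
    "src": {
        "subsys1": {"file1.c": b"int a;\n"},
        "subsys2": {"file2.c": b"int b;\n"},
    },
}
root1 = write_tree(objects, v1)
before = set(objects)

# Second check-in: only src/subsys1/file1.c differs.
v2 = {
    "doc": {"readme.txt": b"docs\n"},
    "src": {
        "subsys1": {"file1.c": b"int a = 1;\n"},
        "subsys2": {"file2.c": b"int b;\n"},
    },
}
root2 = write_tree(objects, v2)
new_objects = set(objects) - before

print(len(before), "objects after the first check-in")
print(len(new_objects), "new objects for the second check-in")
```

The second check-in only adds the changed blob plus one new tree per directory on the path to it (subsys1, src and the root); the doc and src/subsys2 trees are reused as-is.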
Fossil has an optimization to reduce the number of files listed in a manifest, drastically decreasing the size of manifests for repos like the pkgsrc one. They're called delta manifests and, IIRC, they were added specifically because of the NetBSD pkgsrc repo. In practice, however, they don't save much space and they cost more RAM. See: src:/doc/trunk/www/delta-manifests.md
So, the problem with NetBSD's repo is really due to the manifest format listing all files, and even delta manifests don't come to the rescue? Is this really the root of Git's better performance? I don't know what Fossil does with manifests internally, e.g. what bottlenecks profiling of the NetBSD case would show...
(11) By Stephan Beal (stephan) on 2024-06-14 20:59:04 in reply to 10 [link] [source]
I'd add pull requests to that, though :)
Bundles are fossil's counterpart to PRs.
So, the problem with NetBSD's repo is really due to the manifest format listing all files, and even delta manifests don't come to the rescue?
Whether that's "the" problem or "a" problem i can't say with certainty.
Fossil's manifest format effectively scales linearly on the number of files, and the netbsd repo has a tremendous number of files. Delta-manifests ostensibly reduce the on-disk storage for the manifests (netbsd's are huge), but once fossil's own delta compression is applied to a manifest (based on its prior version), and zlib compression on top of that, the real space savings for deltas, on average, drop to a negligible amount. Delta manifests always require more RAM than non-deltas because navigating them requires having both the delta and its baseline manifest in memory.
In any case: because each manifest lists all files in the repository, even a single-file change produces a manifest of a comparable size to the previous one (many thousands of files, in netbsd's case). Parsing and post-processing those takes time, and that time goes up for each entry in the manifest. That time is negligible for a small-/mid-sized repo with a few hundred or a few thousand files, but netbsd is way beyond that scale.
(12.1) By Andy Bradford (andybradford) on 2024-06-14 23:08:04 edited from 12.0 in reply to 11 [link] [source]
> In any case: because each manifest lists all files in the repository,
> even a single-file change produces a manifest of a comparable size to
> the previous one (many thousands of files, in netbsd's case).

Here's an example manifest:

$ fossil artifact 49b80aeeb5 -R netbsd-src.fossil | wc
802 2385 78372

Seems odd that this manifest has so few files (796). I would have expected more especially given that the working checkout has so many:

$ find netbsd-src -type f | wc -l
208798

Aha, I see, mystery solved:

$ fossil artifact 49b80aeeb5 -R netbsd-src.fossil | grep ^B
B 4b3c3a99670a83d498bf6f475446afb10903ca892f72a1f83c181ca3edcc77ad

Now this looks like what I would expect:

$ fossil artifact 4b3c3a9967 -R netbsd-src.fossil | wc
210196 634147 25205471

So this repository is using delta-manifests. Pity that it hasn't been updated in months.

Andy
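For readers wondering how the small manifest and the baseline named by its B card combine, here is a rough sketch. It rests on my reading of Fossil's file format documentation, so treat these as assumptions: an F card in a delta manifest overrides the baseline's entry for that file, and an F card with a filename but no hash marks the file as removed. The manifests, filenames and hashes below are invented, filename escaping and every other card type are ignored, and f_cards and resolve are hypothetical helper names, not Fossil functions.

```python
def f_cards(manifest_text: str):
    """Yield (filename, remaining_fields) for every F card in a manifest."""
    for line in manifest_text.splitlines():
        if line.startswith("F "):
            fields = line.split(" ")
            yield fields[1], fields[2:]  # remaining_fields is [] when no hash is given


def resolve(baseline_text: str, delta_text: str) -> dict:
    """Combine a baseline manifest with a delta manifest into a full file list."""
    files = dict(f_cards(baseline_text))    # start from everything in the baseline
    for name, rest in f_cards(delta_text):
        if rest:
            files[name] = rest              # added or changed relative to the baseline
        else:
            files.pop(name, None)           # assumed meaning: file was removed
    return files


# Invented example data; real manifests carry full hashes and more cards.
baseline = "\n".join([
    "C baseline",
    "F a.c hash-a1",
    "F b.c hash-b1",
    "F c.c hash-c1",
])
delta = "\n".join([
    "B hash-of-baseline",
    "C delta",
    "F b.c hash-b2",
    "F c.c",
    "F d.c hash-d1",
])

full = resolve(baseline, delta)
for name in sorted(full):
    print(name, full[name][0])
```

Even in this toy form, producing the complete file list requires walking the entire baseline first, which matches the point made earlier in the thread that both the delta and its baseline have to be in memory.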
(4) By Warren Young (wyoung) on 2023-12-18 01:32:59 in reply to 2 [link] [source]
But why?
Presumably because it's a fine torture-test of the new resumable cloning feature.
(5.1) By Andy Bradford (andybradford) on 2023-12-19 15:30:42 edited from 5.0 in reply to 2 [link] [source]
> Is the finished clone really usable?

It seems to be. Certain operations take noticeably longer than others so there is definitely some patience required. Also, one contributing factor is that I'm running this on OpenBSD and its disk I/O isn't necessarily the fastest in some cases, so it's hard to say that all the slowness is due to Fossil and the size of the repository.

That being said, here are some numbers. "fossil open" has to produce over 200,000 files on my filesystem:

$ time fossil open netbsd-src.fossil --workdir netbsd-src > netbsd-src.out
17m58.50s real 1m31.82s user 2m24.87s system

$ find netbsd-src -type f | wc -l
208798

$ time fossil status
repository: /home/amb/Downloads/netbsd-src.fossil
local-root: /home/amb/Downloads/netbsd-src/
config-db: /home/amb/.fossil
checkout: a83039c6283ce6e2276dd568192d7a50c0474b90 2023-12-17 00:19:11 UTC
parent: f5a17444ddea101eabb78fbb2126955796c57718 2023-12-16 23:40:33 UTC
tags: trunk
comment: tests/make: add basic tests for the ':M' modifier (user: rillig)
WARNING: multiple open leaf check-ins on trunk:
(1) 2023-12-17 00:19:11 [a83039c628] (current)
[39 more open leaves for trunk hidden]
0m12.95s real 0m00.92s user 0m07.79s system

Diff with no files changed yet:

$ time fossil diff
0m07.97s real 0m00.69s user 0m06.85s system

Diff with a random subset of files changed:

$ time fossil status
repository: /home/amb/Downloads/netbsd-src.fossil
local-root: /home/amb/Downloads/netbsd-src/
config-db: /home/amb/.fossil
checkout: a83039c6283ce6e2276dd568192d7a50c0474b90 2023-12-17 00:19:11 UTC
parent: f5a17444ddea101eabb78fbb2126955796c57718 2023-12-16 23:40:33 UTC
tags: trunk
comment: tests/make: add basic tests for the ':M' modifier (user: rillig)
EDITED crypto/external/bsd/libsaslc/dist/src/Makefile.bsd
EDITED crypto/external/bsd/openssl.old/dist/test/pkits-test.pl
EDITED crypto/external/bsd/openssl/lib/libcrypto/arch/powerpc/aes.inc
EDITED external/apache2/llvm/dist/clang/include/clang/AST/DeclarationName.h
EDITED external/bsd/nvi/dist/TODO
EDITED external/cddl/dtracetoolkit/dist/Bin/bitesize.d
EDITED external/gpl2/gettext/bin/msgmerge/Makefile
EDITED external/gpl3/binutils/dist/gold/target-select.cc
EDITED external/gpl3/gcc.old/usr.bin/gcc/arch/mipsn64el/gthr-default.h
EDITED external/gpl3/gcc.old/usr.bin/gcc/arch/mipsn64el/insn-modes.h
EDITED external/gpl3/gcc/dist/libgcc/config/nds32/isr-library/save_fpu_regs_01.inc
EDITED external/gpl3/gdb.old/dist/gold/ChangeLog-2017
EDITED external/gpl3/gdb.old/dist/ld/testsuite/ld-mmix/loc2.d
EDITED external/gpl3/gdb/dist/gdb/testsuite/gdb.mi/mi-ns-stale-regcache.exp
EDITED external/gpl3/gdb/dist/sim/testsuite/arm/xscale/mra.cgs
EDITED external/mpl/bind/dist/bin/tests/system/checkconf/check-root-static-ds.conf
EDITED external/mpl/bind/dist/bin/tests/system/dyndb/driver/zone.h
EDITED lib/libarch/i386/i386_get_ldt.c
EDITED lib/libisns/isns_fileio.h
EDITED sys/external/gpl2/dts/dist/arch/arm/boot/dts/sama5d36.dtsi
EDITED usr.bin/renice/renice.c
WARNING: multiple open leaf check-ins on trunk:
(1) 2023-12-17 00:19:11 [a83039c628] (current)
[39 more open leaf on trunk messages suppressed]
0m08.15s real 0m00.82s user 0m07.30s system

$ time fossil diff | wc -l
8417
0m08.01s real 0m00.82s user 0m07.10s system

Opening the ui:

$ fossil ui
#opened page in browser
This page was generated in about 0.001s by Fossil 2.24 [29e9e84a1e] 2023-12-08 15:30:14

Browsing commits from the timeline seems fine. Clicking on a random commit seems fine, it renders the diff quickly enough in 0.021s.

Here are some other things I clicked on from the UI:

/timeline?r=trunk&c=2023-12-17+00%3A19%3A11 57.271s
/dir?ci=tip 0.981s
/finfo?name=rescue/list.crypto&m&ci=tip 84.731s
/reports 22.951s
/fileage?name=b1629468d8b9146d 74.721s
/hash-collisions 10.711s
/vdiff?from=3da478793e72b5fb&to=7e31584767cf23bd [out of memory]
/vdiff?from=6cf62525d8c7094e&to=a83039c6283ce6e2 1.531s

Time to commit a private set of changes:

$ time fossil ci --private -m "Test private commit"
New_Version: 6735ec7a8a729d4ffef6a2d5ff0816b987cf27505130a76ef26c67c88bb94e07
6m50.40s real 1m32.38s user 1m34.31s system

So overall it seems to work. Almost 7 minutes to commit is a long time, but then, disabling the R-card would probably change that. I tend to prefer leaving the R-card enabled.

Andy
(6) By Vadim Goncharov (nuclight) on 2024-02-17 19:40:26 in reply to 5.1 [link] [source]
Huh, when we tried to convert the FreeBSD src repo from git to Fossil (you can find this on this forum), the conversion took many days, but times after all were around 40 seconds - meaning it's somewhat usable...
(7) By Andy Bradford (andybradford) on 2024-02-17 21:31:05 in reply to 6 [source]
> but times after all were around 40 seconds - meaning it's somewhat usable

It's definitely usable if one has the patience to clone it. I haven't pulled from the repository since December 18, so I decided to "fossil pull" today to get more of the repository. It took 8 minutes to complete the pull (which is largely a function of network latency and bandwidth I imagine):

$ fossil pull -R netbsd-src.fossil
Pull from https://src.fossil.netbsd.org/
Round-trips: 15 Artifacts sent: 0 received: 5718
Pull done, wire bytes sent: 279596 received: 24921431 remote: 163.172.4.16

And then it took another minute or two before it output that there was a fork:

***** WARNING: a fork has occurred *****
use "fossil leaves -multiple" for more details.

Navigating the timeline is fine as far as I can tell. I tried a big diff from the UI (clicking two nodes) and I must have picked some pretty nasty ones:

Difference From 738b8b5e0c633827 To 3f7ad32ff303270b

It took Fossil a couple of minutes to generate a response and about 1.5GB of RAM. Firefox, on the other hand, took 9GB of RAM to render the page, but then suddenly Firefox stopped outputting anything... I saw this output from Fossil:

------------- 2024-02-17 21:23:54 UTC ------------
panic: Timeout after 600 seconds during web-page reply - user 208,420,000 µs, sys 26,610,000 µs
HTTP_HOST=localhost:8080
HTTP_REFERER=http://localhost:8080/timeline
HTTP_USER_AGENT=Mozilla/5.0 (X11; OpenBSD amd64; rv:121.0) Gecko/20100101 Firefox/121.0
PATH_INFO=/vdiff
QUERY_STRING=from=738b8b5e0c633827&to=3f7ad32ff303270b
REMOTE_ADDR=127.0.0.1
REQUEST_METHOD=GET
REQUEST_URI=/vdiff?from=738b8b5e0c633827&to=3f7ad32ff303270b

So it looks like Fossil gave up waiting for Firefox to consume the data. I wonder if I can tune the timeout so that I can actually see the page complete. But it seems that Fossil did just fine otherwise.

Andy