Fossil User Forum

Successfully cloned massive NetBSD src fossil repository with resume

(1) By Andy Bradford (andybradford) on 2023-12-17 05:54:06 [link] [source]

I've been running and testing the  code from the clone-resume branch for
a couple  of weeks  now trying  to clone the  massive NetBSD  src fossil
repository[1]. I didn't think it would take as long as it did or I would
have probably paid more attention to  when I actually started the clone,
however, I think I can get  the picture from the /rcvfromlist page which
has 41 entries, the first of which is:

1 	2023-11-29 04:57:18 	amb 	sha3 	163.172.4.16

Given that initial entry, I've been  trying to clone this repository for
over  2  weeks.  I  believe  that  the  41  entries  each  represent  an
interrupted  clone and  resuming thereafter.  Most of  the interruptions
were self-inflicted  as I tested  the ability to resume,  however, there
was one unexpected 30 minute network outage that interrupted the clone.

The Rebuild  phase of the clone  started sometime on Monday  the 11th of
December and  just finished  sometime today, the  16th of  December. For
some reason,  the output stopped  reporting on  progress when it  got to
48.9% and just sat there for  3--4 days. It may have continued reporting
sometime during the night when I wasn't paying attention.

The delta  compression phase took  much less time  than the rest  of the
rebuild but still took all day today to complete.

Here is the output from the last resumed clone operation followed by the
rebuild  and  delta  compression.  The  reason why  there  are  2  lines
representing the percentage complete is  because I suspended the rebuild
so I could do something else with the computer because it was taxing the
I/O quite a bit.

Clone done, wire bytes sent: 28323  received: 1214181826  remote: src.fossil.netbsd.org (163.172.4.16)
Uncompressed payload sent: 10834  received: 1214152124                      
Rebuilding repository meta-data...
  48.9% complete...
  100.0% complete...
Extra delta compression... 757855 deltas save 11,729,081,754 bytes
Vacuuming the database... 
project-id: 3109ce34a3d43fd786dcf0133c6a96a6de40c573
server-id:  fbbbc5a67d6d465dcef5d86f1d05fe52d2352a1e
admin-user: amb (password is "BMhnetHYKr")

Here are some statistics:

$ time fossil dbstat -R netbsd-src.fossil 
project-name:      NetBSD src
repository-size:   50,270,945,280 bytes
artifact-count:    4,304,686 (stored as 421,870 full text and 3,882,816 deltas)
artifact-sizes:    60,499 average, 29,093,223 max, 260,430,148,484 total
compression-ratio: 5:1
check-ins:         1,954,758
files:             526,433 across all branches
wiki-pages:        0 (0 changes)
tickets:           1 (1 changes)
events:            0
tag-changes:       0
latest-change:     2023-12-11 21:57:41 - about 5 days ago
project-age:       11,479 days or approximately 31.43 years.
project-id:        3109ce34a3d43fd786dcf0133c6a96a6de40c573
schema-version:    2015-01-24
fossil-version:    2023-12-08 15:30:14 [29e9e84a1e] [2.24] (clang-13.0.0 )
sqlite-version:    2023-11-24 11:41:44 [ebead0e723] (3.44.2)
database-stats:    6,136,590 pages, 8192 bytes/pg, 0 free pages, UTF-8, delete mode
    0m10.89s real     0m04.55s user     0m05.76s system
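
The dbstat figures above are internally consistent, which can be checked with a few lines of arithmetic. This is only a sketch: the numbers are copied from the output above, the variable names are made up for illustration, and the 365.2425-day year is an assumption, not necessarily what Fossil uses.

```python
# Figures copied from the "fossil dbstat" output above.
pages, page_size = 6_136_590, 8192
repo_size = 50_270_945_280
full_text, deltas = 421_870, 3_882_816
artifact_count = 4_304_686
total_artifact_bytes = 260_430_148_484
project_age_days = 11_479

# Repository size is exactly pages * bytes-per-page.
size_from_pages = pages * page_size

# Artifacts are stored either as full text or as deltas.
count_from_parts = full_text + deltas

# Average artifact size (integer division) and compression ratio.
average_size = total_artifact_bytes // artifact_count
ratio = total_artifact_bytes / repo_size

# Assumption: a mean Gregorian year of 365.2425 days.
age_years = project_age_days / 365.2425

print(size_from_pages == repo_size, count_from_parts == artifact_count)
print(average_size, round(ratio, 2), round(age_years, 2))
```

The ratio works out to a little over 5, which matches the "5:1" that dbstat reports.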

Here is my attempt to pull today after rebuild was complete:

$ fossil pull -R netbsd-src.fossil 
Pull from https://src.fossil.netbsd.org/
Round-trips: 4   Artifacts sent: 0  received: 116
Pull done, wire bytes sent: 6366  received: 49913  remote: 163.172.4.16

Andy

[1] https://www.fossil-scm.org/forum/forumpost/8cbc83e5d08a86b4

(2) By Florian Balmer (florian.balmer) on 2023-12-17 11:39:13 in reply to 1 [link] [source]

My first reaction:

https://i.giphy.com/1M9fmo1WAFVK0.webp

My second reaction:

Is the finished clone really usable? Is it possible to open a new check-out in reasonable time? Is it fast to detect changed files? Is browsing the repository through the web UI still a pleasant experience? Can you load CLI diff and timeline views without having to wait? If all answers are "yes", I'm definitely impressed!

(3) By Stephan Beal (stephan) on 2023-12-17 12:38:38 in reply to 2 [link] [source]

@Andy: congratulations!!! That improvement is long overdue!

@Florian:

Is the finished clone really usable?

Whether that particular repo was ever really "usable" with fossil is a matter of debate ;).

(For those who don't know: the netbsd repository is the single-largest known use of fossil.)

Is it possible to open a new check-out in reasonable time? Is it fast to detect changed files? Is browsing the repository through the web UI still a pleasant experience?

Fossil's manifest format, which lists all files in each checkin, doesn't scale well to repos of that size, causing some pain in all of the cases you mention. The "delta manifest" format was, IIRC, added because of that repository, but that format still requires (for many (most?) purposes) loading both the small delta manifest and its full-length parent manifest, so the delta manifest offers a storage savings but does not improve runtimes for many cases (e.g. browsing /dir).

(8) By Vadim Goncharov (nuclight) on 2024-04-28 00:00:51 in reply to 3 [link] [source]

Fossil's manifest format, which lists all files in each checkin, doesn't scale well to repos of that size, causing some pain in all of the cases you mention. The "delta manifest" format was, IIRC, added because of that repository, but that format still requires (for many (most?) purposes) loading both the small delta manifest and its full-length parent manifest, so the delta manifest offers a storage savings but does not improve runtimes for many cases (e.g. browsing /dir).

Seems some new VCS is needed, FossilNG perhaps? :) Or how should this problem be fixed? Is it an inevitable architecture problem? Git is able to handle such repos, but I would not say git's format is fundamentally different from Fossil's...

(9.1) By Stephan Beal (stephan) on 2024-04-28 06:51:03 edited from 9.0 in reply to 8 [link] [source]

FossilNG perhaps?

See the like-named wiki page at src:/wiki?name=Fossil-NG

Git is able to handle such repos, but I would not say git's format is fundamentally different from Fossil's...

i know literally nothing about how git stores the list of files/hashes associated with each checkin, so can't comment on it.

However, i can, with some degree of authority, comment on...

Fossil has an optimization to reduce the number of files listed in a manifest, drastically decreasing the size of manifests for repos like the pkgsrc one. They're called delta manifests and, IIRC, they were added specifically because of the NetBSD pkgsrc repo. In practice, however, they don't save much space and they cost more RAM. See:

src:/doc/trunk/www/delta-manifests.md

Delta manifests are a trade-off, in any case. The main fossil repo and its sibling, the sqlite repo, are both explicitly configured to not generate delta manifests because their use makes it far more difficult for downstream clients to validate the contents of their downloads (see the above article for why).

(10) By Vadim Goncharov (nuclight) on 2024-06-14 20:31:50 in reply to 9.1 [link] [source]

See the like-named wiki page at src:/wiki?name=Fossil-NG

Seems that is just a list of "what we need" ideas, not architecture/storage/format sketches. I'd add pull requests to that, though :)

i know literally nothing about how git stores the list of files/hashes associated with each checkin, so can't comment on it.

Oh, the format is really dumb. Git essentially has just 3 object types: the raw blob for file contents (delta compression in "packs" is viewed as an implementation detail), the "tree" type and the "commit" type, each identified by its SHA hash. The "commit" type is the most structured and the closest to Fossil's manifest with cards: it has a NUL-terminated header (type/size) followed by the usual LF-delimited lines giving author, parent hashes, description etc. The most important of them is the single line "tree 0c602ff1942<rest of hash>".

The "tree" object, after the type/size header, is just a sequence of records:

mode SP filename NUL binary-20-bytes-of-SHA-hash

e.g.

00000000  74 72 65 65 20 35 31 32  00 31 30 30 36 34 34 20  |tree 512.100644 |
00000010  2e 67 69 74 69 67 6e 6f  72 65 00 4a 66 1e cc 38  |.gitignore.Jf..8|
00000020  c0 e1 31 d7 17 02 d7 65  ff 80 1f 7f 93 3e 85 31  |..1....e.....>.1|
00000030  30 30 36 34 34 20 4c 49  43 45 4e 53 45 00 a6 12  |00644 LICENSE...|
00000040  ad 98 13 b0 06 ce 81 d1  ee 43 8d d7 84 da 99 a5  |.........C......|
00000050  40 07 31 30 30 36 34 34  20 52 45 41 44 4d 45 2e  |@.100644 README.|
00000060  6d 64 00 86 c0 79 b1 83  20 64 ed b4 f2 b2 1c 3c  |md...y.. d.....<|
00000070  9c 8f ba 17 cf 1b 24 34  30 30 30 30 20 64 6f 63  |......$40000 doc|
00000080  73 00 0c d7 41 cf 6a 0a  66 40 d2 cf be b9 ec 41  |s...A.j.f@.....A|
00000090  31 2f 1c c5 58 44 31 30  30 36 34 34 20 65 6d 6f  |1/..XD100644 emo|

If the mode is 40000 (octal), then the hash is a pointer to another tree object, that is, a directory; otherwise, it's a pointer to a raw file-contents blob.
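
The record layout above can be decoded in a few lines. This is a minimal sketch, not git's own code: `parse_tree_body` and `DUMP_BODY` are names made up here, it assumes 20-byte SHA-1 hashes, and it only covers the four entries that are completely visible in the hex dump (the header says the body is 512 bytes, but the dump is cut off in the middle of the fifth entry).

```python
# Bytes 0x09..0x95 of the hex dump above: everything after the
# "tree 512\0" header, up to the end of the "docs" entry.
DUMP_BODY = bytes.fromhex(
    "31 30 30 36 34 34 20 "
    "2e 67 69 74 69 67 6e 6f 72 65 00 4a 66 1e cc 38 "
    "c0 e1 31 d7 17 02 d7 65 ff 80 1f 7f 93 3e 85 31 "
    "30 30 36 34 34 20 4c 49 43 45 4e 53 45 00 a6 12 "
    "ad 98 13 b0 06 ce 81 d1 ee 43 8d d7 84 da 99 a5 "
    "40 07 31 30 30 36 34 34 20 52 45 41 44 4d 45 2e "
    "6d 64 00 86 c0 79 b1 83 20 64 ed b4 f2 b2 1c 3c "
    "9c 8f ba 17 cf 1b 24 34 30 30 30 30 20 64 6f 63 "
    "73 00 0c d7 41 cf 6a 0a 66 40 d2 cf be b9 ec 41 "
    "31 2f 1c c5 58 44"
)

HASH_LEN = 20  # assumption: SHA-1 object ids


def parse_tree_body(body):
    """Split a tree body into (mode, is_dir, name, hex_hash) tuples."""
    entries = []
    pos = 0
    while pos < len(body):
        sp = body.index(b" ", pos)        # end of the octal mode
        nul = body.index(b"\0", sp + 1)   # end of the file name
        mode = body[pos:sp].decode("ascii")
        name = body[sp + 1:nul].decode("utf-8")
        digest = body[nul + 1:nul + 1 + HASH_LEN]
        if len(digest) != HASH_LEN:
            raise ValueError("truncated entry for %r" % name)
        # Directory entries carry the file-type bits 040000 (octal).
        is_dir = (int(mode, 8) & 0o170000) == 0o040000
        entries.append((mode, is_dir, name, digest.hex()))
        pos = nul + 1 + HASH_LEN
    return entries


entries = parse_tree_body(DUMP_BODY)
for mode, is_dir, name, hex_hash in entries:
    print(mode, "tree" if is_dir else "blob", hex_hash, name)
```

The hash bytes are binary and may themselves contain 0x20 or 0x00, which is why the parser skips a fixed 20 bytes after each NUL instead of searching for the next delimiter.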

So, if one has a large repo, then there will be one tree object for each directory per commit - but in fact, for the directories that did not change, it will be the same tree object! E.g. if a check-in has changed just src/subsys1/file1.c, then the commit object will record a tree pointing to a new src tree, which will point to a new tree object for subsys1 - but the doc (and below), src/subsys2 trees and so on are unchanged and point to the same objects (hashes) as in the previous check-in.
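
The sharing described above can be reproduced with a toy content-addressed store. This is a sketch under simplifying assumptions, not git itself: `store_tree` is a hypothetical helper, every file gets mode 100644, and entries are sorted by plain name (git's real ordering treats directory names slightly differently).

```python
import copy
import hashlib


def store_tree(node, objects):
    """Store a nested dict (directory) or bytes (file); return its hex id."""
    if isinstance(node, bytes):
        data = b"blob %d\0" % len(node) + node
    else:
        body = b""
        for name in sorted(node):
            child_id = store_tree(node[name], objects)
            mode = b"40000" if isinstance(node[name], dict) else b"100644"
            body += mode + b" " + name.encode() + b"\0" + bytes.fromhex(child_id)
        data = b"tree %d\0" % len(body) + body
    object_id = hashlib.sha1(data).hexdigest()
    objects[object_id] = data  # identical content lands on the same key
    return object_id


v1 = {
    "doc": {"index.md": b"docs\n"},
    "src": {
        "subsys1": {"file1.c": b"int a;\n"},
        "subsys2": {"file2.c": b"int b;\n"},
    },
}
v2 = copy.deepcopy(v1)
v2["src"]["subsys1"]["file1.c"] = b"int a = 1;\n"  # the only change

objects = {}
root1 = store_tree(v1, objects)
before = set(objects)
root2 = store_tree(v2, objects)
new_objects = set(objects) - before

print(len(before), "objects for the first check-in")
print(len(new_objects), "new objects for the second check-in")
```

Only the changed blob and the trees on the path to it (subsys1, src and the root) are new; doc and src/subsys2 hash to the same ids as before and are stored once.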

As you can see, this architecture has problems with tracking renames and with the blame command (it really looks similar to FAT16), but it is effective in saving space for large repos when only some files were changed. I don't know about CPU, but probably for at least some operations, too.

Fossil has an optimization to reduce the number of files listed in a manifest, drastically decreasing the size of manifests for repos like the pkgsrc one. They're called delta manifests and, IIRC, they were added specifically because of the NetBSD pkgsrc repo. In practice, however, they don't save much space and they cost more RAM. See: src:/doc/trunk/www/delta-manifests.md

So, the problem with NetBSD's repo is really due to Manifest format listing all files, and even delta-manifests are not to rescue? Is the root of Git's better performance really here? I ask because I don't know what Fossil does with manifests internally, e.g. what bottlenecks profiling of that NetBSD case would show...

(11) By Stephan Beal (stephan) on 2024-06-14 20:59:04 in reply to 10 [link] [source]

I'd add pull requests to that, though :)

Bundles are fossil's counterpart to PRs.

So, the problem with NetBSD's repo is really due to Manifest format listing all files, and even delta-manifests are not to rescue?

Whether that's "the" problem or "a" problem i can't say with certainty.

Fossil's manifest format effectively scales linearly on the number of files, and the netbsd repo has a tremendous number of files. Delta-manifests ostensibly reduce the on-disk storage for the manifests (netbsd's are huge), but once fossil's own delta compression is applied to a manifest (based on its prior version), and zlib compression on top of that, the real space savings for deltas, on average, drop to a negligible amount. Delta manifests always require more RAM than non-deltas because navigating them requires having both the delta and its baseline manifest in memory.

In any case: because each manifest lists all files in the repository, even a single-file change produces a manifest of a comparable size to the previous one (many thousands of files, in netbsd's case). Parsing and post-processing those takes time, and that time goes up for each entry in the manifest. That time is negligible for a small-/mid-sized repo with a few hundred or a few thousand files, but netbsd is way beyond that scale.
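
To illustrate why a delta manifest cannot be used on its own, here is a rough sketch of resolving one against its baseline. It is not Fossil's implementation: the helper names and the shortened hashes are made up, and it ignores filename escaping, permissions and renames. It relies only on the documented card format, where an F-card is "F name ?hash? ...", and an F-card without a hash in a delta manifest marks a file deleted relative to the baseline.

```python
def f_cards(manifest_text):
    """Map file name -> hash (or None) for every F-card in a manifest."""
    files = {}
    for line in manifest_text.splitlines():
        parts = line.split(" ")
        if parts[0] == "F":
            files[parts[1]] = parts[2] if len(parts) > 2 else None
    return files


def effective_files(delta_text, baseline_text):
    """Apply a delta manifest's F-cards on top of its baseline's file list."""
    files = dict(f_cards(baseline_text))   # the full list must be loaded first
    for name, file_hash in f_cards(delta_text).items():
        if file_hash is None:
            files.pop(name, None)          # deleted relative to the baseline
        else:
            files[name] = file_hash        # added or changed
    return files


# Made-up miniature manifests; real hashes are 40 or 64 hex digits.
BASELINE = "\n".join([
    "C baseline",
    "F bin/ls/ls.c aaaa",
    "F bin/rm/rm.c bbbb",
    "F usr.bin/renice/renice.c cccc",
])
DELTA = "\n".join([
    "B 1234",
    "C small\\schange",
    "F bin/rm/rm.c",
    "F usr.bin/renice/renice.c dddd",
])

resolved = effective_files(DELTA, BASELINE)
print(resolved)
```

The cost is proportional to the baseline's file count no matter how small the delta is, which matches the point that delta manifests save storage but not parsing time.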

(12.1) By Andy Bradford (andybradford) on 2024-06-14 23:08:04 edited from 12.0 in reply to 11 [link] [source]

> In any case: because each manifest  lists all files in the repository,
> even a single-file change produces a  manifest of a comparable size to
> the previous one (many thousands of files, in netbsd's case).

Here's an example manifest:

$ fossil artifact 49b80aeeb5 -R netbsd-src.fossil | wc
     802    2385   78372

Seems  odd that  this manifest  has  so few  files (796).  I would  have
expected more especially given that the working checkout has so many:

$ find netbsd-src -type f | wc -l
  208798

Aha, I see, mystery solved:

$ fossil artifact 49b80aeeb5 -R netbsd-src.fossil | grep ^B
B 4b3c3a99670a83d498bf6f475446afb10903ca892f72a1f83c181ca3edcc77ad

Now this looks like what I would expect:

$ fossil artifact 4b3c3a9967 -R netbsd-src.fossil  | wc
  210196  634147 25205471

So this  repository is using  delta-manifests. Pity that it  hasn't been
updated in months.

Andy

(4) By Warren Young (wyoung) on 2023-12-18 01:32:59 in reply to 2 [link] [source]

But why?

Presumably because it's a fine torture-test of the new resumable cloning feature.

(5.1) By Andy Bradford (andybradford) on 2023-12-19 15:30:42 edited from 5.0 in reply to 2 [link] [source]

> Is the finished clone really usable?

It seems to be. Certain operations take noticeably longer than others so
there  is  definitely some  patience  required.  Also, one  contributing
factor  is that  I'm running  this on  OpenBSD and  its disk  I/O isn't
necessarily the fastest in  some cases so it's hard to  say that all the
slowness is  due to Fossil  and the size  of the repository.  That being
said, here are some numbers.

"fossil open" has to produce over 200,000 files on my filesystem:

$ time fossil open netbsd-src.fossil --workdir netbsd-src > netbsd-src.out
   17m58.50s real     1m31.82s user     2m24.87s system

$ find netbsd-src -type f | wc -l
  208798

$ time fossil status
repository:   /home/amb/Downloads/netbsd-src.fossil
local-root:   /home/amb/Downloads/netbsd-src/
config-db:    /home/amb/.fossil
checkout:     a83039c6283ce6e2276dd568192d7a50c0474b90 2023-12-17 00:19:11 UTC
parent:       f5a17444ddea101eabb78fbb2126955796c57718 2023-12-16 23:40:33 UTC
tags:         trunk
comment:      tests/make: add basic tests for the ':M' modifier (user: rillig)
WARNING: multiple open leaf check-ins on trunk:
  (1) 2023-12-17 00:19:11 [a83039c628] (current)
[39 more open leaves for trunk hidden]
    0m12.95s real     0m00.92s user     0m07.79s system

Diff with no files changed yet:

$ time fossil diff
    0m07.97s real     0m00.69s user     0m06.85s system

Diff with a random subset of files changed:

$ time fossil status
repository:   /home/amb/Downloads/netbsd-src.fossil
local-root:   /home/amb/Downloads/netbsd-src/
config-db:    /home/amb/.fossil
checkout:     a83039c6283ce6e2276dd568192d7a50c0474b90 2023-12-17 00:19:11 UTC
parent:       f5a17444ddea101eabb78fbb2126955796c57718 2023-12-16 23:40:33 UTC
tags:         trunk
comment:      tests/make: add basic tests for the ':M' modifier (user: rillig)
EDITED     crypto/external/bsd/libsaslc/dist/src/Makefile.bsd
EDITED     crypto/external/bsd/openssl.old/dist/test/pkits-test.pl
EDITED     crypto/external/bsd/openssl/lib/libcrypto/arch/powerpc/aes.inc
EDITED     external/apache2/llvm/dist/clang/include/clang/AST/DeclarationName.h
EDITED     external/bsd/nvi/dist/TODO
EDITED     external/cddl/dtracetoolkit/dist/Bin/bitesize.d
EDITED     external/gpl2/gettext/bin/msgmerge/Makefile
EDITED     external/gpl3/binutils/dist/gold/target-select.cc
EDITED     external/gpl3/gcc.old/usr.bin/gcc/arch/mipsn64el/gthr-default.h
EDITED     external/gpl3/gcc.old/usr.bin/gcc/arch/mipsn64el/insn-modes.h
EDITED     external/gpl3/gcc/dist/libgcc/config/nds32/isr-library/save_fpu_regs_01.inc
EDITED     external/gpl3/gdb.old/dist/gold/ChangeLog-2017
EDITED     external/gpl3/gdb.old/dist/ld/testsuite/ld-mmix/loc2.d
EDITED     external/gpl3/gdb/dist/gdb/testsuite/gdb.mi/mi-ns-stale-regcache.exp
EDITED     external/gpl3/gdb/dist/sim/testsuite/arm/xscale/mra.cgs
EDITED     external/mpl/bind/dist/bin/tests/system/checkconf/check-root-static-ds.conf
EDITED     external/mpl/bind/dist/bin/tests/system/dyndb/driver/zone.h
EDITED     lib/libarch/i386/i386_get_ldt.c
EDITED     lib/libisns/isns_fileio.h
EDITED     sys/external/gpl2/dts/dist/arch/arm/boot/dts/sama5d36.dtsi
EDITED     usr.bin/renice/renice.c
WARNING: multiple open leaf check-ins on trunk:
  (1) 2023-12-17 00:19:11 [a83039c628] (current)
[39 more open leaf on trunk messages suppressed]
    0m08.15s real     0m00.82s user     0m07.30s system

$ time fossil diff | wc -l    
    8417
    0m08.01s real     0m00.82s user     0m07.10s system

Opening the ui:

$ fossil ui #opened page in browser
This page was generated in about 0.001s by Fossil 2.24 [29e9e84a1e] 2023-12-08 15:30:14 

Browsing  commits from  the timeline  seems fine.  Clicking on  a random
commit seems fine, it renders the diff quickly enough in 0.021s.

Here are some other things I clicked on from the UI:

/timeline?r=trunk&c=2023-12-17+00%3A19%3A11 57.271s
/dir?ci=tip 0.981s
/finfo?name=rescue/list.crypto&m&ci=tip 84.731s
/reports 22.951s
/fileage?name=b1629468d8b9146d 74.721s
/hash-collisions 10.711s
/vdiff?from=3da478793e72b5fb&to=7e31584767cf23bd [out of memory]
/vdiff?from=6cf62525d8c7094e&to=a83039c6283ce6e2 1.531s

Time to commit a private set of changes:

$ time fossil ci --private -m "Test private commit"
New_Version: 6735ec7a8a729d4ffef6a2d5ff0816b987cf27505130a76ef26c67c88bb94e07
    6m50.40s real     1m32.38s user     1m34.31s system

So overall it seems to work. Almost  7 minutes to commit is a long time,
but then,  disabling the R-card  would probably  change that. I  tend to
prefer leaving the R-card enabled.
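
One way to see how much of the wall-clock time above is CPU and how much is waiting (mostly disk I/O, presumably) is to compare user+system against real. A small sketch, assuming the shell's "NmS.SSs real ... user ... system" layout shown above; the function names are made up for illustration.

```python
import re

# Matches e.g. "17m58.50s real" and captures minutes, seconds and the label.
_TIME_RE = re.compile(r"(?:(\d+)m)?([\d.]+)s\s+(real|user|system)")


def parse_time_line(line):
    """Return {'real': s, 'user': s, 'system': s} in seconds."""
    times = {}
    for minutes, seconds, label in _TIME_RE.findall(line):
        times[label] = int(minutes or 0) * 60 + float(seconds)
    return times


def cpu_fraction(line):
    """Fraction of wall-clock time spent on the CPU (user + system)."""
    t = parse_time_line(line)
    return (t["user"] + t["system"]) / t["real"]


# Lines copied from the timings above.
open_line = "   17m58.50s real     1m31.82s user     2m24.87s system"
commit_line = "    6m50.40s real     1m32.38s user     1m34.31s system"

for label, line in (("open", open_line), ("commit", commit_line)):
    print(label, "%.1f%% CPU" % (100 * cpu_fraction(line)))
```

By this measure "fossil open" spent well under a quarter of its time on the CPU and the commit a bit under half, so a good share of the waiting is not computation.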

Andy

(6) By Vadim Goncharov (nuclight) on 2024-02-17 19:40:26 in reply to 5.1 [link] [source]

Huh, when we tried to convert the FreeBSD src repo from git to Fossil (you can find this on this forum), the conversion took many days, but times after all were around 40 seconds - meaning it's somewhat usable...

(7) By Andy Bradford (andybradford) on 2024-02-17 21:31:05 in reply to 6 [source]

> but times after all were around 40 seconds - meaning it's somewhat usable

It's definitely useable  if one has the patience to  clone it. I haven't
pulled from  the repository since December  18, so I decided  to "fossil
pull" today to get more of the repository. It took 8 minutes to complete
the pull (which is largely a function of network latency and bandwidth I
imagine):

$ fossil pull -R netbsd-src.fossil                                            
Pull from https://src.fossil.netbsd.org/
Round-trips: 15   Artifacts sent: 0  received: 5718
Pull done, wire bytes sent: 279596  received: 24921431  remote: 163.172.4.16

And then it took another minute or two before it output that there was a
fork:

***** WARNING: a fork has occurred *****
use "fossil leaves -multiple" for more details.

Navigating the timeline is fine as far as I can tell.

I tried  a big diff  from the  UI (clicking two  nodes) and I  must have
picked some pretty nasty ones:


Difference From 738b8b5e0c633827 To 3f7ad32ff303270b

It took  Fossil a  couple of  minutes to generate  a response  and about
1.5GB of RAM.

Firefox, on the other hand, took 9GB of RAM to render the page, but then
suddenly Firefox stopped outputting anything...  I saw this  output from
Fossil:

------------- 2024-02-17 21:23:54 UTC ------------
panic: Timeout after 600 seconds during web-page reply - user 208,420,000 µs, sys 26,610,000 µs
HTTP_HOST=localhost:8080
HTTP_REFERER=http://localhost:8080/timeline
HTTP_USER_AGENT=Mozilla/5.0 (X11; OpenBSD amd64; rv:121.0) Gecko/20100101 Firefox/121.0
PATH_INFO=/vdiff
QUERY_STRING=from=738b8b5e0c633827&to=3f7ad32ff303270b
REMOTE_ADDR=127.0.0.1
REQUEST_METHOD=GET
REQUEST_URI=/vdiff?from=738b8b5e0c633827&to=3f7ad32ff303270b

So it looks like Fossil gave up waiting for Firefox to consume the data.
I wonder if I  can tune the timeout so that I can  actually see the page
complete.

But it seems that Fossil did just fine otherwise.

Andy