Fossil Forum

An unexpected received artifact during a checkin

(1) By Stephan Beal (stephan) on 2022-02-17 08:36:10

This bit of weirdness just cropped up and gave me pause: when checking in to a remote repo which nobody else checks into, i noticed 1 received artifact during a checkin. First, the output which caught my attention...

[stephan@nuc:~/fossil/cwal/whcl]$ f ci -m ....
Pull from https://fossil.wanderinghorse.net/r/cwal
Round-trips: 1   Artifacts sent: 0  received: 0
Pull done, wire bytes sent: 439  received: 2692  ip: xxxxxxx

# ^^^ note the 0 sent/received, then it continued with the checkin and sync:

New_Version: 267bbe153775256da47fe8653039909268db4d0d
Sync with https://fossil.wanderinghorse.net/r/cwal
Round-trips: 3   Artifacts sent: 4  received: 1
Sync done, wire bytes sent: 8047  received: 3273  ip: xxxx

Note the "received: 1" in the post-checkin sync. After ensuring that no hackery was going on i looked into the "received artifacts" log, which has an entry with my remote's IP which looks like:

<snipped: the 3 files which were checked in and their manifest>
6029272a2953b1e8f4c804b7a12bc1d76ca6790c cluster (size: 4464)

Totaling 5 artifacts, which matches the sync count of 4 sent, 1 received.

The interesting bit is that last line. i understand what clusters are but not under which conditions they're created nor which side of the connection creates them.

My (mis?)understanding is that the cluster artifact is the mysterious "received: 1" artifact and that the remote created it for me. Is that the case?

(2) By Richard Hipp (drh) on 2022-02-17 12:25:46 in reply to 1

Your understanding is correct. IIRC (it has been many years since I designed that algorithm) the server side generates a new cluster if there are more than 200 (or was it 100?) unclustered artifacts currently in the repository.

For completeness, the "pull" algorithm works like this:

  1. Client sends the server a list of all of the artifacts it has.

  2. Server sends the client all artifacts it has that are not on the list sent by the client in step 1.

For a large repo, the list in step 1 could be quite large. To keep the amount of network traffic down, the idea of a cluster was introduced. A cluster is just a list of other artifacts. If the client says "I have cluster ABCD", that tells the server that the client also has all the other artifacts named in that cluster. And the ABCD cluster might refer to other clusters as well.
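As a rough illustration of the two steps and the cluster shortcut described above, here is a minimal Python sketch. It is not Fossil's actual code: the names expand and pull, and the representation of a cluster as a dict entry mapping a cluster ID to the IDs it names, are assumptions made only for this example.

```python
def expand(announced, clusters):
    """Return every artifact ID implied by the announced IDs.

    `clusters` maps a cluster ID to the list of artifact IDs it names
    (a hypothetical in-memory stand-in for real cluster artifacts).
    Announcing a cluster implies having everything it names, including
    any nested clusters.
    """
    known = set()
    stack = list(announced)
    while stack:
        aid = stack.pop()
        if aid in known:
            continue
        known.add(aid)
        # If aid is a cluster, follow its members as well.
        stack.extend(clusters.get(aid, ()))
    return known


def pull(server_artifacts, announced, clusters):
    """Step 2: everything the server has that the client did not announce."""
    return set(server_artifacts) - expand(announced, clusters)


# Cluster ABCD names two plain artifacts and another cluster, EFGH.
clusters = {"ABCD": ["a1", "a2", "EFGH"], "EFGH": ["a3"]}
server = {"a1", "a2", "a3", "a4", "ABCD", "EFGH"}
missing = pull(server, ["ABCD"], clusters)
```

Here the client announces a single ID, yet the server can work out that only a4 still needs to be sent.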

An artifact is "unclustered" if it is not named in any cluster. A cluster is itself an artifact and can be either clustered or unclustered. So, in order for the client to tell the server everything it has, it suffices to send just the hashes of the unclustered artifacts. As new check-ins are created, the number of unclustered artifacts grows. Eventually a server someplace creates a new cluster so that the number of unclustered artifacts never gets to be more than 100 or so.
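The bookkeeping in that paragraph can be sketched as follows. This is an assumption-laden illustration, not Fossil's implementation: the function names, the default threshold of 100 (the thread itself is unsure whether it is 100 or 200), and the way the new cluster's ID is derived are all hypothetical. A real cluster is a structured artifact, not a bare hash of a joined list.

```python
import hashlib


def unclustered(artifacts, clusters):
    """Artifacts that are not named in any cluster."""
    clustered = {m for members in clusters.values() for m in members}
    return set(artifacts) - clustered


def maybe_create_cluster(artifacts, clusters, threshold=100):
    """Create one new cluster when too many artifacts are unclustered.

    `threshold` is an assumed value. Returns the new cluster ID, or
    None if no cluster was needed.
    """
    loose = unclustered(artifacts, clusters)
    if len(loose) <= threshold:
        return None
    members = sorted(loose)
    # Placeholder ID: a SHA-1 over the member list, for illustration only.
    cid = hashlib.sha1("\n".join(members).encode()).hexdigest()
    clusters[cid] = members
    artifacts.add(cid)  # the cluster is itself an artifact
    return cid


artifacts = {f"a{i}" for i in range(150)}
clusters = {}
cid = maybe_create_cluster(artifacts, clusters)
```

After the call, the only unclustered artifact is the new cluster itself, so that one hash is enough to announce all 150 artifacts.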

The --verily option to "fossil sync" simply disables the clustering mechanism, causing both sides to send all of the artifact hashes that they hold. This results in a lot more network traffic for a sync. The option exists in order to work around any bugs that might pop up in the cluster tracking logic.
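To make the traffic difference concrete, here is a small sketch of what each side would announce with and without clustering. The announce function and its verily parameter are hypothetical names for this example only. They model the behaviour described above and say nothing about how Fossil implements the option.

```python
def announce(artifacts, clusters, verily=False):
    """Return the list of artifact IDs one side would send.

    With verily=True the clustering shortcut is skipped and every ID is
    listed. Otherwise only unclustered IDs are listed.
    """
    if verily:
        return sorted(artifacts)
    clustered = {m for members in clusters.values() for m in members}
    return sorted(set(artifacts) - clustered)


artifacts = {"a1", "a2", "a3", "C1"}
clusters = {"C1": ["a1", "a2"]}  # cluster C1 names a1 and a2
normal = announce(artifacts, clusters)
everything = announce(artifacts, clusters, verily=True)
```

In this toy repository the normal announcement carries two IDs and the verily one carries four. The gap grows with repository size.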