Problem: server returned an error - clone aborted
(1) By anonymous on 2019-10-07 11:57:45 [link] [source]
There is an annoying behavior when using Fossil over unstable connections. The issue is obvious when cloning large repositories.

When the connection is lost, Fossil detects the loss and deletes all the downloaded data. It does not seem possible to RESUME the cloning; the full data has to be downloaded all over again.

It's a real problem in some countries with strong internet restrictions, where the cloning can fail at any time.
(2) By Richard Hipp (drh) on 2019-10-07 12:10:37 in reply to 1 [link] [source]
Fair enough. That is something we can work on.
(3) By Andy Bradford (andybradford) on 2019-10-09 14:56:26 in reply to 2 [link] [source]
A simple approach would be to commit each round-trip and not error out when a network error occurs, then just allow "fossil sync" to do the rest. However... while experimenting with the "fossil purge" command, I emptied my repository entirely to see how well "fossil sync" could then clean up the mess (as I mentioned here): https://www.fossil-scm.org/forum/forumpost/c1ff250092

I then ran "fossil sync" and here is what it did:

$ time fossil sync --httptrace
Sync with http://localhost:8080/
Round-trips: 2   Artifacts sent: 0  received: 0
Server says: pull only - not authorized to push
Round-trips: 4725   Artifacts sent: 0  received: 34202
Sync done, sent: 4177319  received: 54810455  ip: 127.0.0.1
***** WARNING: a fork has occurred *****
use "fossil leaves -multiple" for more details.
/***** Subprocess 31921 exit(0) *****/
   34m57.93s real     2m44.71s user     1m31.62s system

So it took 35 minutes to "clone" via the sync protocol (tested with and without --httptrace; the use of the option didn't introduce anything of significance). It would seem that the sync protocol ended up sending one "card" per round-trip for a considerable number of round-trips.

I also found that after the big synchronization, the clone didn't have all of the artifacts that existed in the parent repository. For example, "fossil dbstat" shows that in the parent there are 13,055 check-ins; however, in the clone that I purged and then synced, there are only 11,719 check-ins, and 38,760 artifacts vs 44,088 artifacts respectively.

I used "fossil deconstruct" to attempt to see if there is a pattern to the missing artifacts. One such missing artifact is 000575c77d, which is named in cluster artifact 5e1353f514, which *is* present in the clone, so it's odd that sync wouldn't have discovered it. Another example: artifact 0008f29c769 is missing from my clone, and it is again part of a different cluster artifact, 9b5084bbc2.
Perhaps there is a bug in processing cluster artifacts, because my clone clearly has the cluster, which should have produced a phantom in the phantom table that eventually resulted in a "gimme" to pull the content, but there are hardly any:

$ echo "SELECT count(*) FROM phantom;" | fossil sql
5

I also found that in the clone, the unclustered table is quite large:

$ echo "SELECT count(*) FROM unclustered;" | fossil sql
34207

Consequently, the next time I tried to sync, my client sent a list of "igot" cards that was almost as large:

$ grep -c '^igot' http-request-1.txt
34203

Perhaps this is because there are just a lot of missing artifacts in this repository. Finally, I decided to try --verily:

$ fossil sync --verily --httptrace
Sync with http://localhost:8080/
Round-trips: 2   Artifacts sent: 0  received: 0
Server says: pull only - not authorized to push
Round-trips: 8   Artifacts sent: 0  received: 5328
Sync done, sent: 2257296  received: 13046887  ip: 127.0.0.1
/***** Subprocess 58152 exit(0) *****/

It ran in much less time (I forgot to time it, but it was probably about 60--120 seconds at the most) and appears to have brought in all of the artifacts, though there is still one fewer check-in, which is odd:

artifact-count: 44,088 (stored as 17,033 full text and 27,055 deltas)
check-ins: 13,054

So, given this... perhaps we could track a failed clone in the SQLite database in the Fossil repository, and the next time that a sync operation causes a pull, the client could send the "pragma send-catalog" to resume the clone operation? Something like: https://www.fossil-scm.org/home/info/ec26471439ec5294

Let's see how this works...

$ fossil clone http://localhost:8080/ clone.fossil
Round-trips: 3   Artifacts sent: 0  received: 25474
response truncated: got 1964529 bytes of 5001092
Clone done, sent: 1009  received: 17379269  ip: 127.0.0.1
server returned an error - clone incomplete
Rebuilding repository meta-data...
  100.0% complete...
Extra delta compression...
Vacuuming the database...
project-id: CE59BB9F186226D80E49D1FA2DB29F935CCA0333
server-id:  45821d85f56221cd472fd4e08ddc7fc7babccf5f
admin-user: amb (password is "RuXG64Bs9C")
clone operation had errors. Run "fossil pull" to complete clone.

So far so good, right?

$ fossil pull -R clone.fossil
Pull from http://localhost:8080/
Round-trips: 1   Artifacts sent: 0  received: 200
infinite loop in DELTA table
Abort trap (core dumped)

Looks like this approach has some bugs. I'm not yet sure where the bug is---is it in the rebuilding of the partially cloned repository? Is the cache somehow corrupted? Is it in the calculation of this number: https://www.fossil-scm.org/home/artifact?udc=1&ln=273-275&name=d49bd600d03d5de7

$ echo "SELECT max(rid) FROM blob;" | fossil sql -R clone.fossil
25702

Guess I'll have to dig into this a bit more.

Andy
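The cluster-to-phantom bookkeeping described above can be sketched in a few lines. This is a simplified model for illustration only, not Fossil's actual C implementation: parsing a cluster artifact should record every member we don't have locally as a phantom, and each phantom should later become a "gimme" card on the next round-trip. The function names (`process_cluster`, `make_gimme_cards`) are hypothetical.

```python
def process_cluster(cluster_members, local_blobs, phantoms):
    """Record every cluster member missing from local storage as a phantom."""
    for artifact_hash in cluster_members:
        if artifact_hash not in local_blobs:
            phantoms.add(artifact_hash)

def make_gimme_cards(phantoms):
    """Turn each phantom into a 'gimme' card requesting its content."""
    return ["gimme %s" % h for h in sorted(phantoms)]

# The clone holds cluster 5e1353f514 but lacks member 000575c77d,
# so a correct sync should end up requesting that member:
local_blobs = {"5e1353f514"}
phantoms = set()
process_cluster(["000575c77d", "5e1353f514"], local_blobs, phantoms)
print(make_gimme_cards(phantoms))   # ['gimme 000575c77d']
```

If the clone really does contain the cluster, the near-empty phantom table observed above suggests this step is not happening as expected.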
(4) By Andy Bradford (andybradford) on 2019-10-09 15:16:12 in reply to 3 [link] [source]
> So, given this... perhaps we could track a failed clone in the
> sqlite database in the Fossil repository and the next time that a
> sync operation causes a pull the client could send the "pragma
> send-catalog" to resume the clone operation?

Or maybe this would be a better fit as an appendage to the "fossil clone" command; perhaps a --resume option could be added.

Andy
(5) By Warren Young (wyoung) on 2019-10-09 15:21:56 in reply to 4 [link] [source]
It'd be better if it just detected the condition and auto-resumed. Ideally, a local partial clone should be treated no differently than one that's simply been offline for a year or so and therefore needs a lot of updates on sync.
(6) By Andy Bradford (andybradford) on 2019-10-09 19:43:04 in reply to 5 [link] [source]
I agree that ideally a partial clone should be treated no differently than one that's just been offline; however, there are actually some useful optimizations that clone does which the normal sync protocol does not use except upon request/demand of the user (because it isn't expected to need tens of thousands of artifacts). Also, if you read my analysis of the "need a lot of updates" scenario, wherein I completely emptied my repository so sync had to pull *everything* again, I appear to have exposed a bug in the sync protocol that will need to be figured out. This is why I opted to make the "resume" behave more like a clone than a sync.

I suspect that the current failure to "resume" in the clone-resume branch may be related to how Fossil rebuilds and constructs the delta table at the end of the clone. Maybe rebuild should be deferred until the clone is complete?

Thanks for your feedback regarding "auto-resume". Are you suggesting that perhaps one simply needs to call "fossil clone" a second time with the same arguments and it will pull in the rest of the content? Currently, Fossil complains if you try to clone on top of an already existing file, and I think I would prefer to preserve that warning, which means the user would have to pass in another argument. I suggested "--resume" because the intent is clear. Or did you have something else in mind for "auto-resume"?

Maybe instead it could just try harder? For example, with autosync, it is possible to configure autosync-retries, which will cause the sync to try harder when synchronizing content. Maybe you mean that clone should also do something similar rather than simply bailing out on the first network failure?

Thanks,

Andy
(7) By Warren Young (wyoung) on 2019-10-10 00:53:47 in reply to 6 [link] [source]
> ...regarding "auto-resume"... Are you suggesting that perhaps one simply needs to call "fossil clone" a second time...
Not primarily, but that should be one possible use case.
The one that I'm hoping ends up far more common is that the clone aborts at a point where you can still "open" the repo, because you have enough pulled down that Fossil can still give you something useful. This is, implicitly, another aspect of the long wished-for shallow cloning.
Having achieved a useful checkout, on next [auto]sync, you get as many more artifacts as the sync protocol can retrieve before your terrible Internet connection shuts the sync down again. Eventually, you get everything, and syncs get small enough that the chance of interruption becomes rare enough that it isn't a serious problem any more.
If it turns out that the clone is interrupted before you get enough pulled down to open anything useful, then you should be able to get to that point one of two ways:
Your proposal: "fossil clone" with the same arguments. Fossil realizes it's a partial clone and auto-resumes.

A manual sync: "fossil sync -R /path/to/repo.fossil". You should be able to run that with further partial syncs until "fossil open" is able to succeed, per above.
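The repeat-until-complete workflow described above can be sketched as a simple retry loop. This is a hypothetical simulation, not Fossil code: `flaky_pull` stands in for "fossil sync -R repo" over a terrible connection, and the key property is that each failed attempt keeps whatever it already received.

```python
import random

random.seed(42)   # deterministic for the example

def flaky_pull(have, want, batch=100):
    """Pull up to `batch` missing artifacts, sometimes dropping mid-transfer.

    Crucially, artifacts received before the 'connection drop' are kept.
    Returns True once nothing is missing.
    """
    missing = [a for a in sorted(want) if a not in have]
    got = missing[:batch]
    if random.random() < 0.5:                    # simulate a dropped connection
        got = got[:random.randrange(len(got) + 1)]
    have.update(got)
    return want <= have

want = set(range(1000))   # artifacts the server has
have = set()              # artifacts cloned so far
pulls = 1
while not flaky_pull(have, want):
    pulls += 1            # each retry resumes from the progress already made
print("complete after", pulls, "pulls; artifacts:", len(have))
```

Because progress is never thrown away, the syncs shrink over time and eventually one of them completes, which is exactly the behavior being asked for.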
(8) By Warren Young (wyoung) on 2019-10-10 01:01:11 in reply to 7 [link] [source]
> the long wished-for shallow cloning.
...which in turn means it works best if Fossil's cloning algorithm sends the newest artifacts first, so that the chances are highest that you can open any branch tip even on a partial clone.
Conversely, Fossil's initial clone algorithm (or resumed sync) should send the oldest artifacts last, since those are the least likely to be useful.
Another thing I long had on my wish list is Git-style cloning: clone and open the repo in a single step. (The Fossil repo DB would end up named something automatic, like ".fslrepo", at the top of the checkout directory.) This in turn would let the client communicate the branch it intends to open, so the remote repo can send the tip of that branch first, those being the most important artifacts to the client.
(9) By anonymous on 2019-10-10 02:36:13 in reply to 8 [link] [source]
> Another thing I long had on my wish list is Git-style cloning
Maybe add an --open option to clone.
But also add a warning if the user already has a clone of the remote repo. Maybe like:
You already have a clone of _____.
Do you really want another clone or just another working copy?
(I know git now has an equivalent to open, but many/most git users will be in the habit of using clone to make more working copies.)
(10) By Stephan Beal (stephan) on 2019-10-10 02:53:17 in reply to 7 [link] [source]
> If it turns out that the clone is interrupted before you get enough pulled down to open anything useful, then you should be able to get to that point one of two ways:
>
> Your proposal: fossil clone with the same arguments. Fossil realizes it's a partial clone and auto-resumes.
Hypothetically, a flag is not necessary: if clone sees that the given repo file already exists, it could assume this mode of operation, rather than complaining like it currently does. It would possibly have to start syncing and compare the remote project code with the local file's project code, to ensure that it doesn't inject the sync content into an unrelated repo clone, and then error out with "local repo already exists and has a different project code."
i don't think your ideal of "having enough to open" is attainable - AFAIK the sync pulls artifacts in a random order, and i don't see how it could do otherwise without the server parsing manifests/control artifacts in timestamp order, which it only knows from the timeline, not the blob table. Making the sync depend on "supplementary" tables like the timeline sounds undesirable to me. (That said: maybe it already does use the timeline - i'm not the slightest bit acquainted with the sync code.)
(11) By Andy Bradford (andybradford) on 2019-10-10 14:36:57 in reply to 7 [source]
> The one that I'm hoping ends up far more common is that the clone
> aborts at a point where you can still "open" the repo, because you
> have enough pulled down that Fossil can still give you something
> useful.

I've also had a similar thought. For example, if the clone aborts before it is able to receive the project-id, I don't think we have a useful clone from which a resume can hope to be successful. Consequently, I was thinking of retaining the original behavior if the project-id is not defined after the clone completes (whether due to a full sync or an interrupted sync).

> A manual sync: fossil sync -R /path/to/repo.fossil. You should be able
> to run that with further partial syncs until fossil open is able to
> succeed, per above.

That's how it currently works in the branch (well, at least how it is intended to work); however, there appears to be something wrong and I haven't had time to investigate.

Thanks,

Andy
(12) By anonymous on 2023-04-02 07:29:55 in reply to 11 [link] [source]
This is still an issue. I had a "fossil clone URL" command fail after pulling a few gigabytes of data.
$ fossil clone https://src.fossil.netbsd.org/
Round-trips: 569 Artifacts sent: 0 received: 2473980
response truncated: got 1425046 bytes of 10010812
$ ls -la
total 14804552
drwxr-xr-x 4 user group 128 Apr 1 16:34 .
drwxr-xr-x 17 user group 544 Apr 1 16:34 ..
-rw-r--r-- 1 user group 7572877312 Apr 1 18:56 src.fossil
-rw-r--r-- 1 user group 58880 Apr 1 16:34 src.fossil-journal
So I tried to resume:
$ fossil clone https://src.fossil.netbsd.org/ ./src.fossil
file already exists: ./src.fossil
$ fossil clone https://src.fossil.netbsd.org/
file already exists: ./src.fossil
$ fossil sync
use --repository or -R to specify the repository database
$ fossil sync -R ./src.fossil
SQLITE_NOTICE(539): recovered 14 pages from
/dirname/src.fossil-journal
incorrect repository schema version: current repository schema version is "" but need versions between "2011-04-25 19:50" and "2015-01-24".
run "fossil rebuild" to fix this problem
OK, I ran "fossil rebuild" as suggested:
$ fossil rebuild -R src.fossil
100.0% complete...
Oops, it's thrown away most of the data:
$ ls -l src.fossil
-rw-r--r-- 1 user group 229376 Apr 2 10:03 src.fossil
At this point I gave up trying to resume, and just deleted everything and re-started the sync.
I'd suggest, for the future, that the first thing "fossil clone" should do is record enough information for "fossil sync" to be able to resume. I don't know what that entails, but I imagine it would include at least the repository schema, the project-id and the remote URL. Also, if fossil clone fails, it could print a hint about "use fossil sync to resume".
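The "record enough information to resume" suggestion could be as simple as committing a tiny metadata transaction before any artifact transfer begins. The sketch below is an assumption about how that might look, not Fossil's actual code path; it uses a `config(name, value)` table modeled on the repository schema, with the values taken from this thread.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE config(name TEXT PRIMARY KEY, value)")

# Commit the resume metadata as the very first transaction of the clone,
# so it survives even if the artifact transfer is interrupted later.
with db:
    db.executemany("INSERT INTO config VALUES(?,?)", [
        ("project-code", "CE59BB9F186226D80E49D1FA2DB29F935CCA0333"),
        ("last-sync-url", "https://src.fossil.netbsd.org/"),
    ])

# A later "fossil sync" could then read the URL back and resume:
url = db.execute(
    "SELECT value FROM config WHERE name='last-sync-url'").fetchone()[0]
print(url)
```

With the remote URL and project code durably recorded up front, a bare "fossil sync -R src.fossil" would at least know where to pull from and which project it belongs to.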
(13) By anonymous on 2023-04-02 12:59:12 in reply to 12 [link] [source]
It looks as though the entire clone operation is wrapped in a single transaction, which will probably lead to most of the transferred data being rolled back after an error. I think I would prefer a new transaction at least for every "Round-trip" reported by the status display, which corresponds to each cycle of the while(go) loop in client_sync() in xfer.c.

There is a db_{begin,end}_transaction for each cycle of the while(go) loop in client_sync(), which makes sense, but the call to client_sync() from clone_cmd() in clone.c wraps it in an outer transaction. With nested transactions, the inner transactions don't really get committed until the outer transaction is also committed.
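The nesting problem described above is easy to demonstrate directly against SQLite, where nested transactions are implemented with savepoints: releasing an inner savepoint makes nothing durable while the outer transaction is still open, so an outer rollback discards the "committed" inner work. (The savepoint name below is illustrative, not Fossil's.)

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.isolation_level = None            # manage transactions manually
db.execute("CREATE TABLE blob(rid INTEGER PRIMARY KEY)")

db.execute("BEGIN")                  # outer transaction, as in clone_cmd()
db.execute("SAVEPOINT round_trip")   # inner "transaction", as in client_sync()
db.execute("INSERT INTO blob VALUES(1)")   # an artifact arrives
db.execute("RELEASE round_trip")     # inner commit: nothing durable yet
db.execute("ROLLBACK")               # network error aborts the outer clone

count = db.execute("SELECT count(*) FROM blob").fetchone()[0]
print(count)   # prints 0: the released round-trip was rolled back anyway
```

So committing per round-trip only helps if those commits are genuine top-level transactions, which is exactly what the outer transaction in clone_cmd() prevents.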
(14) By Stephan Beal (stephan) on 2023-04-02 14:58:21 in reply to 13 [link] [source]
> I think I would prefer a new transaction at least for every "Round-trip" reported by the status display ... but the call to client_sync() from clone_cmd() in clone.c wraps it in an outer transaction
The problem with a new top-level transaction per round-trip is that it could leave the repository in a useless state because the artifacts are transferred in what is essentially a random order. When committing each round-trip, a single failure could easily leave the DAG in a useless mess.
Working around that would(?) require(?) extending the sync mechanism to place received objects in a holding area until the sync is complete (committing them there on each round-trip), and then moving them all out of the holding area under a top-level transaction once all round-trips are finished. Patches to that effect (or functionally equivalent) would be gleefully considered.
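The holding-area idea above could be sketched like this. The schema and function names are hypothetical, not Fossil's: each round-trip commits its received artifacts into a staging table as a real top-level transaction (so an interruption keeps that progress without exposing a half-built DAG), and only when the transfer finishes are they moved into the real blob table atomically.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE staging(uuid TEXT PRIMARY KEY, content BLOB)")
db.execute("CREATE TABLE blob(uuid TEXT PRIMARY KEY, content BLOB)")

def receive_round_trip(artifacts):
    # Top-level commit per round-trip: survives a later network failure,
    # but the partially transferred DAG never appears in the blob table.
    with db:
        db.executemany("INSERT INTO staging VALUES(?,?)", artifacts)

def finish_clone():
    # One atomic move once all round-trips are complete.
    with db:
        db.execute("INSERT INTO blob SELECT * FROM staging")
        db.execute("DELETE FROM staging")

receive_round_trip([("aaa", b"x"), ("bbb", b"y")])
receive_round_trip([("ccc", b"z")])
finish_clone()
print(db.execute("SELECT count(*) FROM blob").fetchone()[0])   # prints 3
```

A resumed clone would then only need to ask the server for artifacts absent from both tables, rather than starting from zero.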