Problem: server returned an error - clone aborted
(1) By anonymous on 2019-10-07 11:57:45 [link] [source]
There is an annoying behavior when using Fossil over unstable connections. The issue is obvious when cloning large repositories.

When the connection is lost, Fossil detects the loss and deletes all the downloaded data. It does not seem possible to RESUME the cloning; the full data has to be downloaded all over again.

It's a real problem in some countries with strong internet restrictions, where the cloning can fail at any time.
(2) By Richard Hipp (drh) on 2019-10-07 12:10:37 in reply to 1 [link] [source]
Fair enough. That is something we can work on.
(3) By Andy Bradford (andybradford) on 2019-10-09 14:56:26 in reply to 2 [link] [source]
A simple approach would be to commit each round-trip and not error out when a network error occurs, then just allow "fossil sync" to do the rest. However... while experimenting with the "fossil purge" command, I emptied my repository entirely to see how well "fossil sync" could then clean up the mess (as I mentioned here): https://www.fossil-scm.org/forum/forumpost/c1ff250092

I then ran "fossil sync" and here is what it did:

$ time fossil sync --httptrace
Sync with http://localhost:8080/
Round-trips: 2   Artifacts sent: 0  received: 0
Server says: pull only - not authorized to push
Round-trips: 4725   Artifacts sent: 0  received: 34202
Sync done, sent: 4177319  received: 54810455  ip: 127.0.0.1
***** WARNING: a fork has occurred *****
use "fossil leaves -multiple" for more details.
/***** Subprocess 31921 exit(0) *****/
   34m57.93s real     2m44.71s user     1m31.62s system

So it took 35 minutes to "clone" via the sync protocol (tested with and without --httptrace; the use of the option didn't introduce anything of significance). It would seem that the sync protocol ended up sending one "card" per round-trip for a considerable number of round-trips.

I also found that after the big synchronization, the clone didn't have all of the artifacts that existed in the parent repository. For example, "fossil dbstat" shows that in the parent there are 13,055 check-ins; however, in the clone that I purged and then synced, there are only 11,719 check-ins, and 38,760 artifacts vs 44,088 artifacts respectively.

I used "fossil deconstruct" to attempt to see if there is a pattern to the missing artifacts. One such missing artifact is 000575c77d, which is named in cluster artifact 5e1353f514, which *is* present in the clone, so it's odd that sync wouldn't have discovered it. Another example: artifact 0008f29c769 is missing from my clone, and it is again part of a different cluster artifact, 9b5084bbc2.
Perhaps there is a bug in processing cluster artifacts, because my clone clearly has the cluster, which should have produced a phantom in the phantom table that eventually resulted in a "gimme" to pull the content, but there are hardly any:

$ echo "SELECT count(*) FROM phantom;" | fossil sql
5

I also found that in the clone, the unclustered table is quite large:

$ echo "SELECT count(*) FROM unclustered;" | fossil sql
34207

Consequently, the next time I tried to sync, my client sent a list of "igot" cards that was almost as large:

$ grep -c '^igot' http-request-1.txt
34203

Perhaps this is because there are just a lot of missing artifacts in this repository. Finally, I decided to try --verily:

$ fossil sync --verily --httptrace
Sync with http://localhost:8080/
Round-trips: 2   Artifacts sent: 0  received: 0
Server says: pull only - not authorized to push
Round-trips: 8   Artifacts sent: 0  received: 5328
Sync done, sent: 2257296  received: 13046887  ip: 127.0.0.1
/***** Subprocess 58152 exit(0) *****/

It ran in much less time (I forgot to time it, but it was probably about 60--120 seconds at the most) and appears to have brought in all of the artifacts, though there is still one fewer check-in, which is odd:

artifact-count: 44,088 (stored as 17,033 full text and 27,055 deltas)
check-ins: 13,054

So, given this... perhaps we could track a failed clone in the SQLite database in the Fossil repository, and the next time that a sync operation causes a pull, the client could send the "pragma send-catalog" to resume the clone operation? Something like: https://www.fossil-scm.org/home/info/ec26471439ec5294

Let's see how this works...

$ fossil clone http://localhost:8080/ clone.fossil
Round-trips: 3   Artifacts sent: 0  received: 25474
response truncated: got 1964529 bytes of 5001092
Clone done, sent: 1009  received: 17379269  ip: 127.0.0.1
server returned an error - clone incomplete
Rebuilding repository meta-data...
  100.0% complete...
Extra delta compression...
Vacuuming the database...
project-id: CE59BB9F186226D80E49D1FA2DB29F935CCA0333
server-id:  45821d85f56221cd472fd4e08ddc7fc7babccf5f
admin-user: amb (password is "RuXG64Bs9C")
clone operation had errors. Run "fossil pull" to complete clone.

So far so good, right?

$ fossil pull -R clone.fossil
Pull from http://localhost:8080/
Round-trips: 1   Artifacts sent: 0  received: 200
infinite loop in DELTA table
Abort trap (core dumped)

Looks like this approach has some bugs. I'm not yet sure where the bug is---is it in the rebuilding of the partially cloned repository? Is the cache somehow corrupted? Is it in the calculation of this number: https://www.fossil-scm.org/home/artifact?udc=1&ln=273-275&name=d49bd600d03d5de7

$ echo "SELECT max(rid) FROM blob;" | fossil sql -R clone.fossil
25702

Guess I'll have to dig into this a bit more.

Andy
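The cluster-to-phantom bookkeeping described above can be sketched in a few lines. This is a simplified model for illustration only, not Fossil's actual C implementation: parsing a cluster artifact should record every member we don't have locally as a phantom, and each phantom should later become a "gimme" card on the next round-trip. The function names (`process_cluster`, `make_gimme_cards`) are hypothetical.

```python
def process_cluster(cluster_members, local_blobs, phantoms):
    """Record every cluster member missing from local storage as a phantom."""
    for artifact_hash in cluster_members:
        if artifact_hash not in local_blobs:
            phantoms.add(artifact_hash)

def make_gimme_cards(phantoms):
    """Turn each phantom into a 'gimme' card requesting its content."""
    return ["gimme %s" % h for h in sorted(phantoms)]

# The clone holds cluster 5e1353f514 but lacks member 000575c77d,
# so a correct sync should end up requesting that member:
local_blobs = {"5e1353f514"}
phantoms = set()
process_cluster(["000575c77d", "5e1353f514"], local_blobs, phantoms)
print(make_gimme_cards(phantoms))   # ['gimme 000575c77d']
```

If the clone really does contain the cluster, the near-empty phantom table observed above suggests this step is not happening as expected.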
(4) By Andy Bradford (andybradford) on 2019-10-09 15:16:12 in reply to 3 [link] [source]
> So, given this... perhaps we could track a failed clone in the
> sqlite database in the Fossil repository and the next time that a
> sync operation causes a pull the client could send the "pragma
> send-catalog" to resume the clone operation?

Or maybe this would be a better fit as an appendage to the "fossil clone" command; perhaps a --resume option could be added.

Andy
(5) By Warren Young (wyoung) on 2019-10-09 15:21:56 in reply to 4 [link] [source]
It'd be better if it just detected the condition and auto-resumed. Ideally, a local partial clone should be treated no differently than one that's simply been offline for a year or so and therefore needs a lot of updates on sync.
(6) By Andy Bradford (andybradford) on 2019-10-09 19:43:04 in reply to 5 [link] [source]
I agree that ideally a partial clone should be treated no differently than one that's just been offline; however, there are actually some useful optimizations that clone does which the normal sync protocol does not use except upon request/demand of the user (because it isn't expected to need tens of thousands of artifacts). Also, if you read my analysis of the "need a lot of updates" scenario, wherein I completely emptied my repository so sync had to pull *everything* again, I appear to have exposed a bug in the sync protocol that will need to be figured out. This is why I opted to make the "resume" behave more like a clone than a sync.

I suspect that the current failure to "resume" in the clone-resume branch may be related to how Fossil rebuilds and constructs the delta table at the end of the clone. Maybe rebuild should be deferred until the clone is complete?

Thanks for your feedback regarding "auto-resume". Are you suggesting that perhaps one simply needs to call "fossil clone" a second time with the same arguments and it will pull in the rest of the content? Currently, Fossil complains if you try to clone on top of an already existing file, and I think I would prefer to preserve that warning, which means the user would have to pass in another argument. I suggested "--resume" because the intent is clear. Or did you have something else in mind for "auto-resume"?

Maybe instead it could just try harder? For example, with autosync, it is possible to configure autosync-retries, which will cause the sync to try harder when synchronizing content. Maybe you mean that clone should also do something similar rather than simply bailing out on the first network failure?

Thanks,

Andy
(7) By Warren Young (wyoung) on 2019-10-10 00:53:47 in reply to 6 [link] [source]
> ...regarding "auto-resume"... Are you suggesting that perhaps one simply needs to call "fossil clone" a second time...
Not primarily, but that should be one possible use case.
The one that I'm hoping ends up far more common is that the clone aborts at a point where you can still "open" the repo, because you have enough pulled down that Fossil can still give you something useful. This is, implicitly, another aspect of the long wished-for shallow cloning.
Having achieved a useful checkout, on next [auto]sync, you get as many more artifacts as the sync protocol can retrieve before your terrible Internet connection shuts the sync down again. Eventually, you get everything, and syncs get small enough that the chance of interruption becomes rare enough that it isn't a serious problem any more.
If it turns out that the clone is interrupted before you get enough pulled down to open anything useful, then you should be able to get to that point one of two ways:
Your proposal: "fossil clone" with the same arguments. Fossil realizes it's a partial clone and auto-resumes.

A manual sync: "fossil sync -R /path/to/repo.fossil". You should be able to run that with further partial syncs until "fossil open" is able to succeed, per above.
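The repeat-until-complete workflow described above can be sketched as a simple retry loop. This is a hypothetical simulation, not Fossil code: `flaky_pull` stands in for "fossil sync -R repo" over a terrible connection, and the key property is that each failed attempt keeps whatever it already received.

```python
import random

random.seed(42)   # deterministic for the example

def flaky_pull(have, want, batch=100):
    """Pull up to `batch` missing artifacts, sometimes dropping mid-transfer.

    Crucially, artifacts received before the 'connection drop' are kept.
    Returns True once nothing is missing.
    """
    missing = [a for a in sorted(want) if a not in have]
    got = missing[:batch]
    if random.random() < 0.5:                    # simulate a dropped connection
        got = got[:random.randrange(len(got) + 1)]
    have.update(got)
    return want <= have

want = set(range(1000))   # artifacts the server has
have = set()              # artifacts cloned so far
pulls = 1
while not flaky_pull(have, want):
    pulls += 1            # each retry resumes from the progress already made
print("complete after", pulls, "pulls; artifacts:", len(have))
```

Because progress is never thrown away, the syncs shrink over time and eventually one of them completes, which is exactly the behavior being asked for.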
(8) By Warren Young (wyoung) on 2019-10-10 01:01:11 in reply to 7 [link] [source]
> the long wished-for shallow cloning.
...which in turn means it works best if Fossil's cloning algorithm sends the newest artifacts first, so that the chances are highest that you can open any branch tip even on a partial clone.
Conversely, Fossil's initial clone algorithm (or resumed sync) should send the oldest artifacts last, since those are the least likely to be useful.
Another thing I long had on my wish list is Git-style cloning: clone and open the repo in a single step. (The Fossil repo DB would end up named something automatic, like ".fslrepo", at the top of the checkout directory.) This in turn would let the client communicate the branch it intends to open, so the remote repo can send the tip of that branch first, those being the most important artifacts to the client.
(9) By anonymous on 2019-10-10 02:36:13 in reply to 8 [link] [source]
> Another thing I long had on my wish list is Git-style cloning
Maybe add an --open option to clone.
But also add a warning if the user already has a clone of the remote repo. Maybe like:
You already have a clone of _____.
Do you really want another clone or just another working copy?
(I know git now has an equivalent to open, but many/most git users will be in the habit of using clone to make more working copies.)
(10) By Stephan Beal (stephan) on 2019-10-10 02:53:17 in reply to 7 [link] [source]
> If it turns out that the clone is interrupted before you get enough pulled down to open anything useful, then you should be able to get to that point one of two ways:
>
> Your proposal: fossil clone with the same arguments. Fossil realizes it's a partial clone and auto-resumes.
Hypothetically, a flag is not necessary: if clone sees that the given repo file already exists, it could assume this mode of operation, rather than complaining like it currently does. It would possibly have to start syncing and compare the remote project code with the local file's project code, to ensure that it doesn't inject the sync content into an unrelated repo clone, and then error out with "local repo already exists and has a different project code."
i don't think your ideal of "having enough to open" is attainable - AFAIK the sync pulls artifacts in a random order, and i don't see how it could do otherwise without the server parsing manifests/control artifacts in timestamp order, which it only knows from the timeline, not the blob table. Making the sync depend on "supplementary" tables like the timeline sounds undesirable to me. (That said: maybe it already does use the timeline - i'm not the slightest bit acquainted with the sync code.)
(11) By Andy Bradford (andybradford) on 2019-10-10 14:36:57 in reply to 7 [source]
> The one that I'm hoping ends up far more common is that the clone
> aborts at a point where you can still "open" the repo, because you
> have enough pulled down that Fossil can still give you something
> useful.

I've also had a similar thought. For example, if the clone aborts before it is able to receive the project-id, I don't think we have a useful clone from which a resume can hope to be successful. Consequently, I was thinking of retaining the original behavior if the project-id is not defined after the clone completes (whether due to a full sync or an interrupted sync).

> A manual sync: fossil sync -R /path/to/repo.fossil. You should be able
> to run that with further partial syncs until fossil open is able to
> succeed, per above.

That's how it currently works in the branch (well, at least how it is intended to work); however, there appears to be something wrong and I haven't had time to investigate.

Thanks,

Andy
(12) By anonymous on 2023-04-02 07:29:55 in reply to 11 [link] [source]
This is still an issue. I had a "fossil clone URL" command fail after pulling a few gigabytes of data.
$ fossil clone https://src.fossil.netbsd.org/
Round-trips: 569 Artifacts sent: 0 received: 2473980
response truncated: got 1425046 bytes of 10010812
$ ls -la
total 14804552
drwxr-xr-x 4 user group 128 Apr 1 16:34 .
drwxr-xr-x 17 user group 544 Apr 1 16:34 ..
-rw-r--r-- 1 user group 7572877312 Apr 1 18:56 src.fossil
-rw-r--r-- 1 user group 58880 Apr 1 16:34 src.fossil-journal
So I tried to resume:
$ fossil clone https://src.fossil.netbsd.org/ ./src.fossil
file already exists: ./src.fossil
$ fossil clone https://src.fossil.netbsd.org/
file already exists: ./src.fossil
$ fossil sync
use --repository or -R to specify the repository database
$ fossil sync -R ./src.fossil
SQLITE_NOTICE(539): recovered 14 pages from
/dirname/src.fossil-journal
incorrect repository schema version: current repository schema version is "" but need versions between "2011-04-25 19:50" and "2015-01-24".
run "fossil rebuild" to fix this problem
OK, I ran "fossil rebuild" as suggested:
$ fossil rebuild -R src.fossil
100.0% complete...
Oops, it's thrown away most of the data:
$ ls -l src.fossil
-rw-r--r-- 1 user group 229376 Apr 2 10:03 src.fossil
At this point I gave up trying to resume, and just deleted everything and re-started the sync.
I'd suggest, for the future, that the first thing "fossil clone" should do is record enough information for "fossil sync" to be able to resume. I don't know what that entails, but I imagine it would include at least the repository schema, the project-id and the remote URL. Also, if fossil clone fails, it could print a hint about "use fossil sync to resume".
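The "record enough information to resume" suggestion could be as simple as committing a tiny metadata transaction before any artifact transfer begins. The sketch below is an assumption about how that might look, not Fossil's actual code path; it uses a `config(name, value)` table modeled on the repository schema, with the values taken from this thread.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE config(name TEXT PRIMARY KEY, value)")

# Commit the resume metadata as the very first transaction of the clone,
# so it survives even if the artifact transfer is interrupted later.
with db:
    db.executemany("INSERT INTO config VALUES(?,?)", [
        ("project-code", "CE59BB9F186226D80E49D1FA2DB29F935CCA0333"),
        ("last-sync-url", "https://src.fossil.netbsd.org/"),
    ])

# A later "fossil sync" could then read the URL back and resume:
url = db.execute(
    "SELECT value FROM config WHERE name='last-sync-url'").fetchone()[0]
print(url)
```

With the remote URL and project code durably recorded up front, a bare "fossil sync -R src.fossil" would at least know where to pull from and which project it belongs to.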
(13) By anonymous on 2023-04-02 12:59:12 in reply to 12 [link] [source]
It looks as though the entire clone operation is wrapped in a single transaction, which will probably lead to most of the transferred data being rolled back after an error. I think I would prefer a new transaction at least for every "Round-trip" reported by the status display, which corresponds to each cycle of the while(go) loop in client_sync() in xfer.c.

There is a db_{begin,end}_transaction for each cycle of the while(go) loop in client_sync(), which makes sense, but the call to client_sync() from clone_cmd() in clone.c wraps it in an outer transaction. With nested transactions, the inner transactions don't really get committed until the outer transaction is also committed.
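The nesting problem described above is easy to demonstrate directly against SQLite, where nested transactions are implemented with savepoints: releasing an inner savepoint makes nothing durable while the outer transaction is still open, so an outer rollback discards the "committed" inner work. (The savepoint name below is illustrative, not Fossil's.)

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.isolation_level = None            # manage transactions manually
db.execute("CREATE TABLE blob(rid INTEGER PRIMARY KEY)")

db.execute("BEGIN")                  # outer transaction, as in clone_cmd()
db.execute("SAVEPOINT round_trip")   # inner "transaction", as in client_sync()
db.execute("INSERT INTO blob VALUES(1)")   # an artifact arrives
db.execute("RELEASE round_trip")     # inner commit: nothing durable yet
db.execute("ROLLBACK")               # network error aborts the outer clone

count = db.execute("SELECT count(*) FROM blob").fetchone()[0]
print(count)   # prints 0: the released round-trip was rolled back anyway
```

So committing per round-trip only helps if those commits are genuine top-level transactions, which is exactly what the outer transaction in clone_cmd() prevents.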
(14) By Stephan Beal (stephan) on 2023-04-02 14:58:21 in reply to 13 [link] [source]
> I think I would prefer a new transaction at least for every "Round-trip" reported by the status display ... but the call to client_sync() from clone_cmd() in clone.c wraps it in an outer transaction
The problem with a new top-level transaction per round-trip is that it could leave the repository in a useless state because the artifacts are transferred in what is essentially a random order. When committing each round-trip, a single failure could easily leave the DAG in a useless mess.
Working around that would(?) require(?) extending the sync mechanism to place received objects in a holding area until the sync is complete (committing them there on each round-trip), and then moving them all out of the holding area under a top-level transaction once all round-trips are finished. Patches to that effect (or functionally equivalent) would be gleefully considered.
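The holding-area idea above could be sketched like this. The schema and function names are hypothetical, not Fossil's: each round-trip commits its received artifacts into a staging table as a real top-level transaction (so an interruption keeps that progress without exposing a half-built DAG), and only when the transfer finishes are they moved into the real blob table atomically.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE staging(uuid TEXT PRIMARY KEY, content BLOB)")
db.execute("CREATE TABLE blob(uuid TEXT PRIMARY KEY, content BLOB)")

def receive_round_trip(artifacts):
    # Top-level commit per round-trip: survives a later network failure,
    # but the partially transferred DAG never appears in the blob table.
    with db:
        db.executemany("INSERT INTO staging VALUES(?,?)", artifacts)

def finish_clone():
    # One atomic move once all round-trips are complete.
    with db:
        db.execute("INSERT INTO blob SELECT * FROM staging")
        db.execute("DELETE FROM staging")

receive_round_trip([("aaa", b"x"), ("bbb", b"y")])
receive_round_trip([("ccc", b"z")])
finish_clone()
print(db.execute("SELECT count(*) FROM blob").fetchone()[0])   # prints 3
```

A resumed clone would then only need to ask the server for artifacts absent from both tables, rather than starting from zero.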