Ability to resume a failed clone

(1.1) By Andy Bradford (andybradford) on 2023-11-24 22:33:24 edited from 1.0 [source]

Hello,

Given the recent  discussion about the ability to resume  a failed clone
operation, I decided to resurrect the idea that fell by the wayside many
years  ago.  This  time  I  decided  to  simply  record  the  last  good
clone_seqno  that happened to be  committed to  the repository  database
prior to failure. The presence of a non-zero clone_seqno in the database
means  that  the  clone  has  failed and  no  sync/open  operations  are
permitted on  such a repository  until "fossil clone --resume"  has been
used.
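
A minimal sketch of the marker idea described above, written in Python rather than Fossil's C so it can be run standalone. The simplified `config` table, the `clone-seqno` key, and all function names are assumptions made for illustration, not the names used in the actual commit.

```python
import sqlite3

# Hypothetical setting name; the key Fossil really uses may differ.
SEQNO_KEY = "clone-seqno"

def open_repo(path=":memory:"):
    """Open a stand-in repository database with a simplified config table."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS config(name TEXT PRIMARY KEY, value)")
    return db

def record_seqno(db, seqno):
    """Remember the last good sequence number; 0 clears the marker."""
    with db:  # commits on success, rolls back on error
        if seqno:
            db.execute(
                "REPLACE INTO config(name, value) VALUES(?, ?)",
                (SEQNO_KEY, seqno),
            )
        else:
            db.execute("DELETE FROM config WHERE name = ?", (SEQNO_KEY,))

def pending_seqno(db):
    """Return the stored sequence number, or 0 if the clone is complete."""
    row = db.execute(
        "SELECT value FROM config WHERE name = ?", (SEQNO_KEY,)
    ).fetchone()
    return int(row[0]) if row else 0

def require_complete(db):
    """Refuse sync/open style operations on an interrupted clone."""
    if pending_seqno(db) != 0:
        raise RuntimeError("This repository appears to be an incomplete clone.")
```

Any command that must not run against a partial clone would call `require_complete()` first; a successful resume ends with `record_seqno(db, 0)`.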

I decided  not to bother  with resuming failed  clones of files  since I
don't really see much of a point in doing so.

Here is the recent work:

https://www.fossil-scm.org/home/info/61e0ced9bfbc4a51

Here is a demonstration in which cloning the canonical Fossil repository
was interrupted by the network dropping partway through and was then
resumed:

$ fossil clone https://www.fossil-scm.org/home clone.fossil
Round-trips: 7   Artifacts sent: 0  received: 57863
SSL: cannot connect to host www.fossil-scm.org:443 (Operation timed out)
Clone done, wire bytes sent: 1885  received: 35293967  remote: 45.33.6.223
server returned an error - clone incomplete
there are unresolved deltas - the clone is probably incomplete and unusable.
It may be possible to continue clone with --resume.
Rebuilding repository meta-data...
  100.1% complete...
Extra delta compression... none found
Vacuuming the database... 
project-id: CE59BB9F186226D80E49D1FA2DB29F935CCA0333
server-id:  3035f168e8664215796a05985bf0ab9ff752d458
admin-user: amb (password is "HEciZbspeF")

$ fossil clone --resume https://www.fossil-scm.org/home clone.fossil
Round-trips: 3   Artifacts sent: 0  received: 1605
Clone done, wire bytes sent: 809  received: 5841372  remote: 45.33.6.223
Rebuilding repository meta-data...
  100.0% complete...
Extra delta compression... 21 deltas save 101,372 bytes
Vacuuming the database... 
project-id: CE59BB9F186226D80E49D1FA2DB29F935CCA0333
server-id:  1e2084692e518787a2d20383fe758c46f7475a7c
admin-user: amb (password is "c1cc3c8223aa0c970be2df87897ac5b87a787bf7")


After the resumed clone completed, I ran "fossil test-integrity --parse"
on the repository and there appeared to be no more warnings than for a
clone that had completed without errors.

It's probably  not bug-free and  could use some additional  feedback and
testing,  and in  fact,  I just  noticed  that  there is  a  bug in  the
generation of the admin-user password (now fixed).

Any thoughts on this approach?

Thanks,

Andy

(2) By Warren Young (wyoung) on 2023-11-24 23:02:44 in reply to 1.1 [link] [source]

Doesn't the --resume flag duplicate the DB flag clone_seqno != 0? Why not allow resume on repeating the command?

If the last sequence number is zero, the command is a no-op.

(3.2) By Andy Bradford (andybradford) on 2023-11-24 23:28:33 edited from 3.1 in reply to 2 [link] [source]

> Doesn't the --resume flag duplicate the  DB flag clone_seqno != 0? Why
> not allow resume on repeating the command?

Good question.

I figured that  the resuming of a failed clone  should be an intentional
decision by  the person rather  than an automatically  assumed operation
(e.g. I accidentally issue a clone  command against an existing clone so
I get  an error). I could  be wrong in  this line of reasoning  and it's
entirely possible that  by reissuing the clone command on  the same file
the  user has  expressed  the preference  to  resume. Otherwise,  you're
correct that --resume isn't really necessary given that the presence of
the last clone_seqno in the DB allows Fossil to decide on its own
that it should resume.

I guess  I took  the more  cautious approach,  but there's  no technical
reason why we couldn't eliminate the --resume. In thinking it over, it's
possible that we could eliminate --resume  as long as we properly detect
that the clone is not in a state that it should be resumed.

Also, some thoughts that I've had while implementing this were:

What happens if the remote repository is "rebuilt" prior to the resume?

What happens if the user tries to resume the clone from a different
repository (perhaps a clone that is hosted in a different datacenter)?

What happens if the repository is actually balanced between multiple web
servers and  the resume  clone happened  to get  relayed to  a different
backend server?

Should the user  even be allowed to change the  URL between the original
clone command and the command to resume?

Obviously the rid numbers on the remote repository in those scenarios are
very unlikely to match what the client has stored. Does it matter? Is it
likely that a user  will allow such a long passage  of time between when
the clone failed and when it is resumed? Or that they will decide to try
to clone from a different remote  repository? I'm not sure how to detect
that the source  for the clone has a different  "rid alignment" than was
originally used.

Andy

(4) By Andy Bradford (andybradford) on 2023-11-24 23:40:53 in reply to 3.2 [link] [source]

> I'm  not sure  how to  detect  that the  source  for the  clone has  a
> different "rid alignment" than was originally used.

Perhaps the client,  upon deciding that it needs to  resume a clone, can
issue a  "pragma" card requesting  that rid X has  hash Y (or  perhaps a
sample of a  few). If the server responds in  the affirmative, then it's
safe to proceed with the clone?
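
One way the "rid X has hash Y" check could look, as a rough Python sketch. It assumes the client has somehow recorded the server-side rid for a few artifacts it already holds, which nothing in the current code is stated to do. The function names are hypothetical, and SHA3-256 merely stands in for whichever hash the repository uses.

```python
import hashlib

def sample_rids(known, count=3):
    """Pick up to `count` (rid, hash) pairs from what the client already holds.

    `known` maps a server-side rid to the artifact hash received for it.
    """
    rids = sorted(known)
    if not rids:
        return []
    step = max(1, len(rids) // count)
    return [(rid, known[rid]) for rid in rids[::step][:count]]

def server_confirms(server_blobs, sample):
    """Server side: does every sampled rid still hold content with that hash?"""
    for rid, expected in sample:
        content = server_blobs.get(rid)
        if content is None:
            return False
        if hashlib.sha3_256(content).hexdigest() != expected:
            return False
    return True
```

If `server_confirms()` returns False, for example after a rebuild on the remote or a switch to a different server, the client would have to fall back to a fresh clone rather than resume.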

It's possible it's not worth the effort to detect these scenarios (yet).

Andy

(12) By Stephan Beal (stephan) on 2023-11-25 12:35:32 in reply to 4 [link] [source]

> Perhaps the client, upon deciding that it needs to resume a clone, can issue a "pragma" card requesting that rid X has hash Y (or perhaps a sample of a few).

Going from memory, and am 3 days away from my computer (so can't readily confirm), but the RIDs never cross between the original and clone, do they? My (mis?)understanding is that RIDs are strictly for use within their own copy of the repository.

(13.1) By Andy Bradford (andybradford) on 2023-11-25 15:01:48 edited from 13.0 in reply to 12 [link] [source]

> but the RIDs never cross between the original and clone, do they?

The client may not store them in the DB in the same order as the rid
order on the original. However, the clone_seqno is just a proxy for rid
because the Fossil server code (in xfer.c) initializes seqno to 1 (aka
rid 1), gets the maximum rid from the blob table, and then loops,
incrementing seqno until it reaches max.

https://www.fossil-scm.org/home/file?udc=1&ln=1466+1470+1474&ci=trunk&name=src%2Fxfer.c

Each iteration simply copies the content from the blob table by looking
up the rid, starting with rid 1. It ends when clone_seqno == 0.
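
The loop described above can be modelled in a few lines of Python. This is a simplified sketch, not a transcription of xfer.c: the `blobs` dictionary stands in for the blob table, and the per-round byte budget `max_bytes` and both function names are assumptions for illustration.

```python
def send_clone_batch(blobs, seqno, max_bytes):
    """Return (cards, next_seqno) for one round-trip.

    blobs maps rid -> content bytes; rids may have gaps.
    next_seqno is 0 once every rid up to the maximum has been visited.
    """
    max_rid = max(blobs, default=0)
    cards = []
    sent = 0
    while seqno <= max_rid and sent < max_bytes:
        content = blobs.get(seqno)
        if content is not None:
            cards.append((seqno, content))
            sent += len(content)
        seqno += 1
    next_seqno = seqno if seqno <= max_rid else 0
    return cards, next_seqno

def clone_all(blobs, max_bytes, start=1):
    """Client side: keep asking from the last seqno until the server says 0."""
    received = {}
    seqno = start
    rounds = 0
    while seqno != 0:
        cards, seqno = send_clone_batch(blobs, seqno, max_bytes)
        received.update(cards)
        rounds += 1
    return received, rounds
```

Resuming is then just `clone_all(blobs, max_bytes, start=saved_seqno)`, which is why the saved sequence number only makes sense against a server whose rids line up with the one that handed it out.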

So, in theory, the client "knows" which rids it has received, but rather
than counting anything, it could, in theory, just record the hash of the
first cfile that it receives after issuing clone_seqno. So a list of
these hashes  could be used as  a heuristic of sorts  for determining if
the server can still fulfill a resumed clone.

Again, this is all theoretical, and I'm not sure it's worth it. It might
be easier to just prevent by disallowing certain uses (e.g. changing the
URL).

Andy

(14) By Preben Guldberg (preben) on 2023-11-25 16:04:56 in reply to 13.1 [link] [source]

> It might be easier to just prevent by disallowing certain uses (e.g. changing the URL).

If we can detect a URL change, I kind of like using --resume as a deliberate action:

If you have a URL (and no --resume) it's a new clone operation or an error.
If you use --resume, do not accept a URL and only do the work to resume a clone.
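
The two rules above could be sketched as an argument check like the following. This models the proposal only, not what the branch does; the function name, parameters, and error messages are hypothetical.

```python
def plan_clone(url=None, resume=False, repo_exists=False):
    """Return ("clone", url) or ("resume", None), or raise ValueError."""
    if resume:
        # --resume is a deliberate action and takes the URL from the repository.
        if url is not None:
            raise ValueError("--resume does not accept a URL")
        if not repo_exists:
            raise ValueError("nothing to resume")
        return ("resume", None)
    # Without --resume, a URL means a brand-new clone or an error.
    if url is None:
        raise ValueError("a URL is required for a new clone")
    if repo_exists:
        raise ValueError("file already exists")
    return ("clone", url)
```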

(15) By Andy Bradford (andybradford) on 2023-11-25 16:15:56 in reply to 14 [link] [source]

> If we can detect a URL change

I suppose I really meant "ignore"  any new URL. So rather than accepting
the URL  information from  the command  line, just  extract it  from the
repository as it has already been recorded there (I think).

At the same time,  I don't mind the flexibility of  allowing the user to
provide a different URL. Perhaps the administrator copied the repository
to a  new server  and it has  a new address.  Why should  Fossil prevent
this? I'm thinking  of leaving it as-is until we  actually hear of users
trying to  resume a clone  from a  different server than  was originally
used and  running into  errors. On that  note, let me  test to  see what
happens in such a situation.

Andy

(5) By Andy Bradford (andybradford) on 2023-11-24 23:53:35 in reply to 2 [link] [source]

> If the last sequence number is zero, the command is a no-op.

The original behavior when trying to clone on top of a file that already
exists is an error. I opted to retain this behavior:

https://www.fossil-scm.org/home/info/b0a60d8f1d3c3023

Do you see any  problems opening the repository in this  way to read the
value of the setting and then closing it? Or should it just remain open?


Andy

(6) By Warren Young (wyoung) on 2023-11-25 00:01:17 in reply to 5 [link] [source]

Do you mean the Fossil command “open” or the SQLite DB open? I don’t think you should be allowed to do the former until the clone completes successfully, but the latter is of course necessary to determine success.

Fun side-track: what happens in the two flavors of clone-and-open?

(7) By Andy Bradford (andybradford) on 2023-11-25 00:07:47 in reply to 6 [link] [source]

> Do you mean the Fossil command “open” or the SQLite DB open?

SQLite DB open, not "fossil open".

Because  it isn't  possible to  know  the value  of the  setting in  the
repository without first opening the repository, I mean this:

https://www.fossil-scm.org/home/artifact?udc=1&ln=204-206&name=4f7b97763b93cfad


> what happens in the two flavors of clone-and-open?

One of  my least favorite  things about Fossil...  ok, I'll test  it and
find out. Thanks for the reminder.

Andy

(8) By Andy Bradford (andybradford) on 2023-11-25 02:29:35 in reply to 6 [link] [source]

> Fun side-track: what happens in the two flavors of clone-and-open?

Seems to work as expected as I already put in a check to fail an open if
the clone failed:

$ fossil clone http://localhost:8080/                      
Round-trips: 8   Artifacts sent: 0  received: 54902
cannot connect to host localhost:8080
Clone done, wire bytes sent: 2099  received: 40100940  remote: 127.0.0.1
server returned an error - clone incomplete
there are unresolved deltas - the clone is probably incomplete and unusable.
It may be possible to resume the clone by running the same command.
Rebuilding repository meta-data...
  100.1% complete...
Extra delta compression... 1 delta saves 656 bytes
Vacuuming the database... 
project-id: CE59BB9F186226D80E49D1FA2DB29F935CCA0333
server-id:  2d91555085082384bdeb4d05dbd86620e1156c92
admin-user: amb (password is "gfsUMVECdD")
opening the new ./localhost.fossil repository in directory ./localhost...
This repository appears to be an incomplete clone.


It called my repository  localhost.fossil---how nice---and tried to open
into a  directory called localhost---how  nice. Of course it  failed, so
let me resume:

$ fossil clone http://localhost:8080/
Round-trips: 4   Artifacts sent: 0  received: 4547
Clone done, wire bytes sent: 1047  received: 14147800  remote: 127.0.0.1
Rebuilding repository meta-data...
  100.0% complete...
Extra delta compression... 46 deltas save 4,899,627 bytes
Vacuuming the database... 
project-id: CE59BB9F186226D80E49D1FA2DB29F935CCA0333
server-id:  4397c632fc857a406f47f8335bffe8b813dc5f48
admin-user: amb (password is "APNF26RLJG")
opening the new ./localhost.fossil repository in directory ./localhost...
.editorconfig
...
project-name: Fossil
repository:   /tmp/trial/localhost.fossil
local-root:   /tmp/trial/localhost/
config-db:    /home/amb/.fossil
project-code: CE59BB9F186226D80E49D1FA2DB29F935CCA0333
checkout:     3f97785608f1470be49526c6c775fa86286e413b 2023-11-24 12:59:47 UTC
parent:       2edeeee4a8ae3028070590fc9a95e2c225153c20 2023-11-22 20:22:38 UTC
tags:         trunk
comment:      Update the built-in SQLite to version 3.44.2. (user: drh)
check-ins:    18021


Seems to  be happy.  Now, what's  the other  open after  clone option...
--workdir? Ok, let's try:

$ fossil clone --workdir open http://localhost:8080/ clone.fossil
Round-trips: 7   Artifacts sent: 0  received: 51465
cannot connect to host localhost:8080
Clone done, wire bytes sent: 1834  received: 35099994  remote: 127.0.0.1
server returned an error - clone incomplete
there are unresolved deltas - the clone is probably incomplete and unusable.
It may be possible to resume the clone by running the same command.
Rebuilding repository meta-data...
  100.1% complete...
Extra delta compression... 1 delta saves 656 bytes
Vacuuming the database... 
project-id: CE59BB9F186226D80E49D1FA2DB29F935CCA0333
server-id:  b546e947c313d3026ef9a4fd08d404df20bbc9ad
admin-user: amb (password is "XDdJB4NULq")
opening the new clone.fossil repository in directory open...
This repository appears to be an incomplete clone.


Looks promising, now let's resume:

$ fossil clone --workdir open http://localhost:8080/ clone.fossil
Round-trips: 5   Artifacts sent: 0  received: 7984
Clone done, wire bytes sent: 1305  received: 19148746  remote: 127.0.0.1
Rebuilding repository meta-data...
  100.0% complete...
Extra delta compression... 46 deltas save 4,899,627 bytes
Vacuuming the database... 
project-id: CE59BB9F186226D80E49D1FA2DB29F935CCA0333
server-id:  f363e981c0b849969d22439efaf6056741cee3ce
admin-user: amb (password is "jZqW8HAo6p")
opening the new clone.fossil repository in directory open...
.editorconfig
...
project-name: Fossil
repository:   /tmp/trial/clone.fossil
local-root:   /tmp/trial/open/
config-db:    /home/amb/.fossil
project-code: CE59BB9F186226D80E49D1FA2DB29F935CCA0333
checkout:     3f97785608f1470be49526c6c775fa86286e413b 2023-11-24 12:59:47 UTC
parent:       2edeeee4a8ae3028070590fc9a95e2c225153c20 2023-11-22 20:22:38 UTC
tags:         trunk
comment:      Update the built-in SQLite to version 3.44.2. (user: drh)
check-ins:    18021


Looks like it's working.

Andy

(9) By Warren Young (wyoung) on 2023-11-25 03:09:37 in reply to 8 [link] [source]

> Now, what's the other open after clone option...

fossil open URL

(10) By Andy Bradford (andybradford) on 2023-11-25 03:19:40 in reply to 9 [link] [source]

> fossil open URL

Indeed, I saw  some code for that  while I was making  these changes and
wondered, "now  what does that do"  but never got around  to checking it
out.

Let's see:

$ fossil open http://localhost:8080/
/[path]/fossil clone http://localhost:8080/ /tmp/trial/localhost.fossil
Round-trips: 6   Artifacts sent: 0  received: 44180
cannot connect to host localhost:8080
Clone done, wire bytes sent: 1570  received: 30098578  remote: 127.0.0.1
server returned an error - clone incomplete
there are unresolved deltas - the clone is probably incomplete and unusable.
It may be possible to resume the clone by running the same command.
Rebuilding repository meta-data...
  100.0% complete...
Extra delta compression... none found
Vacuuming the database... 
project-id: CE59BB9F186226D80E49D1FA2DB29F935CCA0333
server-id:  1500971faf3cb0c0675931f3db837bd47705c8f3
admin-user: amb (password is "ShLAVkT6tW")
This repository appears to be an incomplete clone.


Ok, now let's try again:

$ fossil open http://localhost:8080/
directory /tmp/trial is not empty
use the -f (--force) option to override
or the -k (--keep) option to keep local files unchanged


Hmm, that's odd, of course it's not empty, Fossil automatically placed a
repository in  that directory (does  anyone even use "fossil  open URL"?
Seems an  odd duck). Well,  let's try the -f  (even though I  don't know
that I should have  to do that and it seems like  the wrong option here,
maybe -k would be better):


$ fossil open -f http://localhost:8080/
/[path]/fossil clone http://localhost:8080/ /tmp/trial/localhost.fossil
Round-trips: 6   Artifacts sent: 0  received: 15269
Clone done, wire bytes sent: 1566  received: 24150161  remote: 127.0.0.1
Rebuilding repository meta-data...
  100.1% complete...
Extra delta compression... 47 deltas save 4,900,283 bytes
Vacuuming the database... 
project-id: CE59BB9F186226D80E49D1FA2DB29F935CCA0333
server-id:  6f49c50f6b44c08dfb433bdf96c9f8b1cd43486d
admin-user: amb (password is "xJGkTdvhAe")
.editorconfig
...
project-name: Fossil
repository:   /tmp/trial/localhost.fossil
local-root:   /tmp/trial/
config-db:    /home/amb/.fossil
project-code: CE59BB9F186226D80E49D1FA2DB29F935CCA0333
checkout:     3f97785608f1470be49526c6c775fa86286e413b 2023-11-24 12:59:47 UTC
parent:       2edeeee4a8ae3028070590fc9a95e2c225153c20 2023-11-22 20:22:38 UTC
tags:         trunk
comment:      Update the built-in SQLite to version 3.44.2. (user: drh)
check-ins:    18021


Seems to work just fine.

Andy

(11) By Andy Bradford (andybradford) on 2023-11-25 03:23:46 in reply to 2 [link] [source]

> Why not allow resume on repeating the command?

The clone-resume code now does this.  I wonder, should it perhaps output
something letting the person know that a "resume" is taking place?

Andy

(16) By Andy Bradford (andybradford) on 2023-11-27 06:22:21 in reply to 2 [link] [source]

> Why not allow resume on repeating the command?

For that matter, why not just try to automatically resume?

https://www.fossil-scm.org/home/info/bc0a4c60c00ddf67

Thoughts?  Is 3 too many?  Not necessary?
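
For readers who have not opened the commit, here is a rough Python sketch of what a capped automatic resume could look like, assuming the "3" refers to a limit on attempts. `CloneInterrupted`, `transfer`, and `clone_with_auto_resume` are hypothetical names, not taken from the Fossil source.

```python
class CloneInterrupted(Exception):
    """Raised by a transfer that stopped early; carries the last good seqno."""

    def __init__(self, seqno):
        super().__init__(f"clone interrupted at seqno {seqno}")
        self.seqno = seqno

def clone_with_auto_resume(transfer, max_attempts=3):
    """Run transfer(seqno) and resume automatically after an interruption.

    transfer returns normally when the clone is complete. Returns the number
    of attempts used; re-raises the last interruption when attempts run out.
    """
    seqno = 1
    for attempt in range(1, max_attempts + 1):
        try:
            transfer(seqno)
            return attempt
        except CloneInterrupted as err:
            seqno = err.seqno  # pick up where the failed attempt left off
            if attempt == max_attempts:
                raise
```

Once the cap is hit, the saved sequence number is still in the repository, so a manual rerun of the clone command can continue from the same point.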

(17) By Warren Young (wyoung) on 2023-11-28 04:53:51 in reply to 16 [link] [source]

> Thoughts?

How about, "This is awesome, and I cannot wait until it merges down to trunk?" 🤓

As the one who was telling people recently that Fossil cannot resume a clone, I assume this feature is in reaction to my warnings. I'm happy to be proven wrong post facto.

(18) By Daniel Dumitriu (danield) on 2023-11-28 09:49:28 in reply to 17 [link] [source]

> How about, "This is awesome, and I cannot wait until it merges down to trunk?" 🤓

Same opinion here.

> proven wrong post facto.

Rather post factum. It is only ex post facto, which does not apply here.

(19) By Warren Young (wyoung) on 2023-11-28 11:01:04 in reply to 18 [link] [source]

My Latin isn't as bad as my Mandarin…but it's a close call. 😛

(20) By Andy Bradford (andybradford) on 2023-11-30 04:08:28 in reply to 17 [link] [source]

> I assume this feature is in reaction to my warnings.

In addition to your warnings, it's actually something we discussed quite
a few years ago; I started working on it but lost track of it and never
completed it.

My recent commit to handle SIGINT during cloning:

https://www.fossil-scm.org/home/info/ad2e148541fe0716

Actually came about because I've been testing the code against this huge
repository:

https://www.fossil-scm.org/forum/forumpost/8cbc83e5d08a86b4

Andy

(21) By Andy Bradford (andybradford) on 2023-12-01 18:16:23 in reply to 1.1 [link] [source]

I've made some additional changes. In testing I decided that it should
skip rebuilding the database until the clone is 100% done. However, I
also wonder whether the additional steps (VACUUM, outputting project code,
etc.) should also be skipped until the clone is done.

https://www.fossil-scm.org/home/file?udc=1&ln=355-383&ci=17f3408f6b557e27&name=src%2Fclone.c

If the clone isn't complete, why do any of these things? The absence of
this output at the end may signal to the user that something has gone
wrong, while its presence may lead them to believe the clone was
successful (if they don't look carefully through the rest of the
output).
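
The gating being proposed here could be sketched as follows. It is in Python for brevity, the `steps` list and `finish_clone` name are hypothetical, and it does not reflect the actual layout of clone.c.

```python
def finish_clone(seqno, steps, log=print):
    """Run the finishing steps only when the clone is complete (seqno == 0).

    steps is a list of (label, callable) pairs such as rebuild, vacuum,
    and printing the project code.
    """
    if seqno != 0:
        log("clone incomplete; skipping rebuild and clean-up")
        return False
    for label, step in steps:
        log(label)
        step()
    return True
```

With this shape, an interrupted clone prints a single line instead of the usual rebuild, vacuum, and project-id output, so its output cannot be mistaken for that of a successful run.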

Andy