Ability to resume a failed clone
(1.1) By Andy Bradford (andybradford) on 2023-11-24 22:33:24 edited from 1.0 [source]
Hello,

Given the recent discussion about the ability to resume a failed clone operation, I decided to resurrect the idea that fell by the wayside many years ago. This time I decided to simply record the last good clone_seqno that happened to be committed to the repository database prior to failure. The presence of a non-zero clone_seqno in the database means that the clone has failed and no sync/open operations are permitted on such a repository until "fossil clone --resume" has been used. I decided not to bother with resuming failed clones of files since I don't really see much of a point in doing so.

Here is the recent work:

https://www.fossil-scm.org/home/info/61e0ced9bfbc4a51

Here is a demonstration where cloning the canonical Fossil repository was interrupted by having the network drop in the middle and was able to be resumed:

$ fossil clone https://www.fossil-scm.org/home clone.fossil
Round-trips: 7 Artifacts sent: 0 received: 57863
SSL: cannot connect to host www.fossil-scm.org:443 (Operation timed out)
Clone done, wire bytes sent: 1885 received: 35293967 remote: 45.33.6.223
server returned an error - clone incomplete
there are unresolved deltas - the clone is probably incomplete and unusable.
It may be possible to continue clone with --resume.
Rebuilding repository meta-data...
100.1% complete...
Extra delta compression... none found
Vacuuming the database...
project-id: CE59BB9F186226D80E49D1FA2DB29F935CCA0333
server-id: 3035f168e8664215796a05985bf0ab9ff752d458
admin-user: amb (password is "HEciZbspeF")

$ fossil clone --resume https://www.fossil-scm.org/home clone.fossil
Round-trips: 3 Artifacts sent: 0 received: 1605
Clone done, wire bytes sent: 809 received: 5841372 remote: 45.33.6.223
Rebuilding repository meta-data...
100.0% complete...
Extra delta compression... 21 deltas save 101,372 bytes
Vacuuming the database...
project-id: CE59BB9F186226D80E49D1FA2DB29F935CCA0333
server-id: 1e2084692e518787a2d20383fe758c46f7475a7c
admin-user: amb (password is "c1cc3c8223aa0c970be2df87897ac5b87a787bf7")

After the resume clone completed, I ran "fossil test-integrity --parse" on the repository and there appeared to be no more warnings than for a clone that had completed without errors.

It's probably not bug-free and could use some additional feedback and testing, and in fact, I just noticed that there is a bug in the generation of the admin-user password (now fixed).

Any thoughts on this approach?

Thanks,

Andy
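The gating rule described in this post can be sketched roughly as follows. This is a minimal Python illustration, not Fossil's actual C code: it assumes the sequence number is stored as a row in a `config(name, value)` table, and the setting name `clone-seqno` and both helper functions are hypothetical.

```python
import sqlite3

SEQNO_KEY = "clone-seqno"  # hypothetical setting name, for illustration only


def pending_clone_seqno(db: sqlite3.Connection) -> int:
    """Return the recorded sequence number, or 0 when no clone is pending."""
    row = db.execute(
        "SELECT value FROM config WHERE name = ?", (SEQNO_KEY,)
    ).fetchone()
    return int(row[0]) if row else 0


def check_usable(db: sqlite3.Connection) -> None:
    """Refuse sync/open-style operations while a clone is incomplete."""
    if pending_clone_seqno(db) != 0:
        raise RuntimeError("This repository appears to be an incomplete clone.")


def demo() -> list:
    """Walk through the three states: fresh, interrupted, resumed."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE config(name TEXT PRIMARY KEY, value TEXT)")
    results = [pending_clone_seqno(db)]        # fresh database, nothing pending
    db.execute("INSERT INTO config VALUES(?, ?)", (SEQNO_KEY, "1234"))
    results.append(pending_clone_seqno(db))    # interrupted clone was recorded
    db.execute("DELETE FROM config WHERE name = ?", (SEQNO_KEY,))
    results.append(pending_clone_seqno(db))    # resume finished, marker cleared
    return results
```

The point of the sketch is only that a single non-zero marker is enough to both block normal use and tell a later clone where to pick up.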
(2) By Warren Young (wyoung) on 2023-11-24 23:02:44 in reply to 1.1 [link] [source]
Doesn't the --resume flag duplicate the DB flag clone_seqno != 0? Why not allow resume on repeating the command?
If the last sequence number is zero, the command is a no-op.
(3.2) By Andy Bradford (andybradford) on 2023-11-24 23:28:33 edited from 3.1 in reply to 2 [link] [source]
> Doesn't the --resume flag duplicate the DB flag clone_seqno != 0? Why
> not allow resume on repeating the command?

Good question. I figured that resuming a failed clone should be an intentional decision by the person rather than an automatically assumed operation (e.g. I accidentally issue a clone command against an existing clone so I get an error). I could be wrong in this line of reasoning, and it's entirely possible that by reissuing the clone command on the same file the user has expressed the preference to resume.

Otherwise, you're correct that --resume isn't really necessary given that the presence of the last clone_seqno in the DB allows Fossil to decide on its own that it should resume. I guess I took the more cautious approach, but there's no technical reason why we couldn't eliminate the --resume.

In thinking it over, it's possible that we could eliminate --resume as long as we properly detect when the clone is not in a state in which it should be resumed.

Also, some thoughts that I've had while implementing this were:

What happens if the remote repository is "rebuilt" prior to the resume? What happens if the user tries to resume the clone from a different repository (perhaps a clone that is hosted in a different datacenter)? What happens if the repository is actually balanced between multiple web servers and the resume clone happened to get relayed to a different backend server? Should the user even be allowed to change the URL between the original clone command and the command to resume?

Obviously the rid numbers on the remote repository in those scenarios are very unlikely to match what the client has stored. Does it matter? Is it likely that a user will allow such a long passage of time between when the clone failed and when it is resumed? Or that they will decide to try to clone from a different remote repository?

I'm not sure how to detect that the source for the clone has a different "rid alignment" than was originally used.

Andy
(4) By Andy Bradford (andybradford) on 2023-11-24 23:40:53 in reply to 3.2 [link] [source]
> I'm not sure how to detect that the source for the clone has a
> different "rid alignment" than was originally used.

Perhaps the client, upon deciding that it needs to resume a clone, can issue a "pragma" card requesting that rid X has hash Y (or perhaps a sample of a few). If the server responds in the affirmative, then it's safe to proceed with the clone?

It's possible it's not worth the effort to detect these scenarios (yet).

Andy
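A rough sketch of that sampling idea, in Python rather than Fossil's C, and with no claim about what such a "pragma" card would look like on the wire. It assumes the client has kept a mapping from server rid to artifact hash for what it already received; `artifact_hash`, `pick_samples`, and `server_confirms` are hypothetical names.

```python
import hashlib
import random
from typing import Dict, List, Tuple


def artifact_hash(content: bytes) -> str:
    """Any stable content hash works for this sketch; SHA3-256 is used here."""
    return hashlib.sha3_256(content).hexdigest()


def pick_samples(
    received: Dict[int, str], count: int, seed: int = 0
) -> List[Tuple[int, str]]:
    """Client side: choose a few (rid, hash) pairs it has already received."""
    rng = random.Random(seed)
    rids = sorted(received)
    chosen = rng.sample(rids, min(count, len(rids)))
    return [(rid, received[rid]) for rid in sorted(chosen)]


def server_confirms(
    server_blobs: Dict[int, bytes], samples: List[Tuple[int, str]]
) -> bool:
    """Server side: affirm only if every sampled rid still has the claimed hash."""
    return all(
        rid in server_blobs and artifact_hash(server_blobs[rid]) == claimed
        for rid, claimed in samples
    )
```

If the server answers in the negative, the safe fallback in this sketch would be to refuse the resume rather than continue from a sequence number that no longer lines up.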
(5) By Andy Bradford (andybradford) on 2023-11-24 23:53:35 in reply to 2 [link] [source]
> If the last sequence number is zero, the command is a no-op. The original behavior when trying to clone on top of a file that already exists is an error. I opted to retain this behavior: https://www.fossil-scm.org/home/info/b0a60d8f1d3c3023 Do you see any problems opening the repository in this way to read the value of the setting and then closing it? Or should it just remain open? Andy
(6) By Warren Young (wyoung) on 2023-11-25 00:01:17 in reply to 5 [link] [source]
Do you mean the Fossil command “open” or the SQLite DB open? I don’t think you should be allowed to do the former until the clone completes successfully, but the latter is of course necessary to determine success.
Fun side-track: what happens in the two flavors of clone-and-open?
(7) By Andy Bradford (andybradford) on 2023-11-25 00:07:47 in reply to 6 [link] [source]
> Do you mean the Fossil command “open” or the SQLite DB open? SQLite DB open, not "fossil open". Because it isn't possible to know the value of the setting in the repository without first opening the repository, I mean this: https://www.fossil-scm.org/home/artifact?udc=1&ln=204-206&name=4f7b97763b93cfad > what happens in the two flavors of clone-and-open? One of my least favorite things about Fossil... ok, I'll test it and find out. Thanks for the reminder. Andy
(8) By Andy Bradford (andybradford) on 2023-11-25 02:29:35 in reply to 6 [link] [source]
> Fun side-track: what happens in the two flavors of clone-and-open?

Seems to work as expected as I already put in a check to fail an open if the clone failed:

$ fossil clone http://localhost:8080/
Round-trips: 8 Artifacts sent: 0 received: 54902
cannot connect to host localhost:8080
Clone done, wire bytes sent: 2099 received: 40100940 remote: 127.0.0.1
server returned an error - clone incomplete
there are unresolved deltas - the clone is probably incomplete and unusable.
It may be possible to resume the clone by running the same command.
Rebuilding repository meta-data...
100.1% complete...
Extra delta compression... 1 delta saves 656 bytes
Vacuuming the database...
project-id: CE59BB9F186226D80E49D1FA2DB29F935CCA0333
server-id: 2d91555085082384bdeb4d05dbd86620e1156c92
admin-user: amb (password is "gfsUMVECdD")
opening the new ./localhost.fossil repository in directory ./localhost...
This repository appears to be an incomplete clone.

It called my repository localhost.fossil---how nice---and tried to open into a directory called localhost---how nice. Of course it failed, so let me resume:

$ fossil clone http://localhost:8080/
Round-trips: 4 Artifacts sent: 0 received: 4547
Clone done, wire bytes sent: 1047 received: 14147800 remote: 127.0.0.1
Rebuilding repository meta-data...
100.0% complete...
Extra delta compression... 46 deltas save 4,899,627 bytes
Vacuuming the database...
project-id: CE59BB9F186226D80E49D1FA2DB29F935CCA0333
server-id: 4397c632fc857a406f47f8335bffe8b813dc5f48
admin-user: amb (password is "APNF26RLJG")
opening the new ./localhost.fossil repository in directory ./localhost...
.editorconfig
...
project-name: Fossil
repository: /tmp/trial/localhost.fossil
local-root: /tmp/trial/localhost/
config-db: /home/amb/.fossil
project-code: CE59BB9F186226D80E49D1FA2DB29F935CCA0333
checkout: 3f97785608f1470be49526c6c775fa86286e413b 2023-11-24 12:59:47 UTC
parent: 2edeeee4a8ae3028070590fc9a95e2c225153c20 2023-11-22 20:22:38 UTC
tags: trunk
comment: Update the built-in SQLite to version 3.44.2. (user: drh)
check-ins: 18021

Seems to be happy. Now, what's the other open after clone option... --workdir? Ok, let's try:

$ fossil clone --workdir open http://localhost:8080/ clone.fossil
Round-trips: 7 Artifacts sent: 0 received: 51465
cannot connect to host localhost:8080
Clone done, wire bytes sent: 1834 received: 35099994 remote: 127.0.0.1
server returned an error - clone incomplete
there are unresolved deltas - the clone is probably incomplete and unusable.
It may be possible to resume the clone by running the same command.
Rebuilding repository meta-data...
100.1% complete...
Extra delta compression... 1 delta saves 656 bytes
Vacuuming the database...
project-id: CE59BB9F186226D80E49D1FA2DB29F935CCA0333
server-id: b546e947c313d3026ef9a4fd08d404df20bbc9ad
admin-user: amb (password is "XDdJB4NULq")
opening the new clone.fossil repository in directory open...
This repository appears to be an incomplete clone.

Looks promising, now let's resume:

$ fossil clone --workdir open http://localhost:8080/ clone.fossil
Round-trips: 5 Artifacts sent: 0 received: 7984
Clone done, wire bytes sent: 1305 received: 19148746 remote: 127.0.0.1
Rebuilding repository meta-data...
100.0% complete...
Extra delta compression... 46 deltas save 4,899,627 bytes
Vacuuming the database...
project-id: CE59BB9F186226D80E49D1FA2DB29F935CCA0333
server-id: f363e981c0b849969d22439efaf6056741cee3ce
admin-user: amb (password is "jZqW8HAo6p")
opening the new clone.fossil repository in directory open...
.editorconfig
...
project-name: Fossil
repository: /tmp/trial/clone.fossil
local-root: /tmp/trial/open/
config-db: /home/amb/.fossil
project-code: CE59BB9F186226D80E49D1FA2DB29F935CCA0333
checkout: 3f97785608f1470be49526c6c775fa86286e413b 2023-11-24 12:59:47 UTC
parent: 2edeeee4a8ae3028070590fc9a95e2c225153c20 2023-11-22 20:22:38 UTC
tags: trunk
comment: Update the built-in SQLite to version 3.44.2. (user: drh)
check-ins: 18021

Looks like it's working.

Andy
(9) By Warren Young (wyoung) on 2023-11-25 03:09:37 in reply to 8 [link] [source]
> Now, what's the other open after clone option...
fossil open URL
(10) By Andy Bradford (andybradford) on 2023-11-25 03:19:40 in reply to 9 [link] [source]
> fossil open URL

Indeed, I saw some code for that while I was making these changes and wondered, "now what does that do" but never got around to checking it out. Let's see:

$ fossil open http://localhost:8080/
/[path]/fossil clone http://localhost:8080/ /tmp/trial/localhost.fossil
Round-trips: 6 Artifacts sent: 0 received: 44180
cannot connect to host localhost:8080
Clone done, wire bytes sent: 1570 received: 30098578 remote: 127.0.0.1
server returned an error - clone incomplete
there are unresolved deltas - the clone is probably incomplete and unusable.
It may be possible to resume the clone by running the same command.
Rebuilding repository meta-data...
100.0% complete...
Extra delta compression... none found
Vacuuming the database...
project-id: CE59BB9F186226D80E49D1FA2DB29F935CCA0333
server-id: 1500971faf3cb0c0675931f3db837bd47705c8f3
admin-user: amb (password is "ShLAVkT6tW")
This repository appears to be an incomplete clone.

Ok, now let's try again:

$ fossil open http://localhost:8080/
directory /tmp/trial is not empty
use the -f (--force) option to override
or the -k (--keep) option to keep local files unchanged

Hmm, that's odd, of course it's not empty, Fossil automatically placed a repository in that directory (does anyone even use "fossil open URL"? Seems an odd duck). Well, let's try the -f (even though I don't know that I should have to do that and it seems like the wrong option here, maybe -k would be better):

$ fossil open -f http://localhost:8080/
/[path]/fossil clone http://localhost:8080/ /tmp/trial/localhost.fossil
Round-trips: 6 Artifacts sent: 0 received: 15269
Clone done, wire bytes sent: 1566 received: 24150161 remote: 127.0.0.1
Rebuilding repository meta-data...
100.1% complete...
Extra delta compression... 47 deltas save 4,900,283 bytes
Vacuuming the database...
project-id: CE59BB9F186226D80E49D1FA2DB29F935CCA0333
server-id: 6f49c50f6b44c08dfb433bdf96c9f8b1cd43486d
admin-user: amb (password is "xJGkTdvhAe")
.editorconfig
...
project-name: Fossil
repository: /tmp/trial/localhost.fossil
local-root: /tmp/trial/
config-db: /home/amb/.fossil
project-code: CE59BB9F186226D80E49D1FA2DB29F935CCA0333
checkout: 3f97785608f1470be49526c6c775fa86286e413b 2023-11-24 12:59:47 UTC
parent: 2edeeee4a8ae3028070590fc9a95e2c225153c20 2023-11-22 20:22:38 UTC
tags: trunk
comment: Update the built-in SQLite to version 3.44.2. (user: drh)
check-ins: 18021

Seems to work just fine.

Andy
(11) By Andy Bradford (andybradford) on 2023-11-25 03:23:46 in reply to 2 [link] [source]
> Why not allow resume on repeating the command? The clone-resume code now does this. I wonder, should it perhaps output something letting the person know that a "resume" is taking place? Andy
(12) By Stephan Beal (stephan) on 2023-11-25 12:35:32 in reply to 4 [link] [source]
> Perhaps the client, upon deciding that it needs to resume a clone, can issue a "pragma" card requesting that rid X has hash Y (or perhaps a sample of a few).
Going from memory, and am 3 days away from my computer (so can't readily confirm), but the RIDs never cross between the original and clone, do they? My (mis?)understanding is that RIDs are strictly for use within their own copy of the repository.
(13.1) By Andy Bradford (andybradford) on 2023-11-25 15:01:48 edited from 13.0 in reply to 12 [link] [source]
> but the RIDs never cross between the original and clone, do they?

The client may not store them in the DB in the same order as the rid order on the original, however, the clone_seqno is just a proxy for rid because the Fossil server code (in xfer.c) initializes seqno to 1 (aka rid 1), gets the maximum rid from the blob table and then starts looping until after incrementing the seqno it reaches max.

https://www.fossil-scm.org/home/file?udc=1&ln=1466+1470+1474&ci=trunk&name=src%2Fxfer.c

Each iteration simply copies the content from the blob table by looking up the rid starting with rid 1. It ends when client_seqno == 0.

So, in theory, the client "knows" which rids it has received, but rather than counting anything, it could, in theory, just record the hash of the first cfile that it receives after issuing client_seqno. So a list of these hashes could be used as a heuristic of sorts for determining if the server can still fulfill a resumed clone.

Again, this is all theoretical, and I'm not sure it's worth it. It might be easier to just prevent by disallowing certain uses (e.g. changing the URL).

Andy
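To make the seqno-as-rid-proxy idea concrete, here is a simplified Python model of the loop described above. It is a sketch under stated assumptions, not a transcription of xfer.c: the per-round byte `budget`, the `fail_after_rounds` switch that simulates a dropped connection, and both function names are invented for illustration.

```python
from typing import Dict, List, Optional, Tuple


def serve_round(
    blobs: Dict[int, bytes], start_seqno: int, budget: int
) -> Tuple[List[Tuple[int, bytes]], int]:
    """Send blobs by rid starting at start_seqno; a next seqno of 0 means done."""
    max_rid = max(blobs) if blobs else 0
    sent: List[Tuple[int, bytes]] = []
    seqno, used = start_seqno, 0
    while seqno <= max_rid and used < budget:
        content = blobs.get(seqno)       # rids may have gaps
        if content is not None:
            sent.append((seqno, content))
            used += len(content)
        seqno += 1
    return sent, (0 if seqno > max_rid else seqno)


def clone(
    blobs: Dict[int, bytes],
    budget: int,
    fail_after_rounds: Optional[int] = None,
    resume_from: int = 1,
) -> Tuple[Dict[int, bytes], int]:
    """Client loop: returns what was received plus the seqno to resume from."""
    received: Dict[int, bytes] = {}
    seqno, rounds = resume_from, 0
    while seqno != 0:
        if fail_after_rounds is not None and rounds == fail_after_rounds:
            return received, seqno       # simulated network drop
        sent, seqno = serve_round(blobs, seqno, budget)
        received.update(sent)
        rounds += 1
    return received, 0
```

In this model a resumed clone only works if the server's rid numbering is unchanged, which is exactly the "rid alignment" concern raised earlier in the thread.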
(14) By Preben Guldberg (preben) on 2023-11-25 16:04:56 in reply to 13.1 [link] [source]
> It might be easier to just prevent by disallowing certain uses (e.g. changing the URL).
If we can detect a URL change, I kind of like using --resume as a deliberate action:
- If you have a URL (and no --resume) it's a new clone operation or an error.
- If you use --resume, do not accept a URL and only do the work to resume a clone.
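The two rules above could be expressed as a small decision function. This is only a Python sketch of the proposal, not how the clone command parses its arguments; the function name, its parameters, and the "nothing to resume" case are assumptions added for completeness.

```python
from typing import Optional


def plan_clone(
    url: Optional[str], resume: bool, repo_exists: bool, pending_seqno: int
) -> str:
    """Decide what a clone invocation should do under the proposed rules."""
    if resume:
        if url is not None:
            return "error: --resume does not accept a URL"
        if not repo_exists or pending_seqno == 0:
            return "error: nothing to resume"   # assumed behavior
        return "resume"
    if url is None:
        return "error: URL required"
    if repo_exists:
        return "error: file already exists"     # the original clone behavior
    return "new clone"
```

Under this scheme the URL for a resume would have to come from the repository itself, since the command line is not allowed to supply one.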
(15) By Andy Bradford (andybradford) on 2023-11-25 16:15:56 in reply to 14 [link] [source]
> If we can detect a URL change I suppose I really meant "ignore" any new URL. So rather than accepting the URL information from the command line, just extract it from the repository as it has already been recorded there (I think). At the same time, I don't mind the flexibility of allowing the user to provide a different URL. Perhaps the administrator copied the repository to a new server and it has a new address. Why should Fossil prevent this? I'm thinking of leaving it as-is until we actually hear of users trying to resume a clone from a different server than was originally used and running into errors. On that note, let me test to see what happens in such a situation. Andy
(16) By Andy Bradford (andybradford) on 2023-11-27 06:22:21 in reply to 2 [link] [source]
> Why not allow resume on repeating the command? For that matter, why not just try to automatically resume? https://www.fossil-scm.org/home/info/bc0a4c60c00ddf67 Thoughts? Is 3 too many? Not necessary?
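For discussion, a bounded automatic resume might look something like the Python sketch below. It assumes the "3" refers to the number of automatic resume attempts, which the post does not spell out; `transfer` stands in for one run of round-trips that returns the sequence number to continue from (0 when complete), and `flaky_transfer` is a made-up test double.

```python
from typing import Callable, Tuple


def clone_with_auto_resume(
    transfer: Callable[[int], int], max_resumes: int = 3
) -> Tuple[int, int]:
    """Run a clone, resuming automatically up to max_resumes times.

    Returns (final seqno, resumes used); a final seqno of 0 means complete.
    """
    seqno = transfer(1)                  # initial attempt starts at seqno 1
    resumes = 0
    while seqno != 0 and resumes < max_resumes:
        resumes += 1
        seqno = transfer(seqno)          # continue from the last good seqno
    return seqno, resumes


def flaky_transfer(total: int, per_attempt: int) -> Callable[[int], int]:
    """Fake transport that delivers per_attempt items, then drops the link."""
    def transfer(start: int) -> int:
        nxt = start + per_attempt
        return 0 if nxt > total else nxt
    return transfer
```

A non-zero result after the last attempt would leave the repository marked incomplete, so a later manual rerun could still pick up from there.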
(17) By Warren Young (wyoung) on 2023-11-28 04:53:51 in reply to 16 [link] [source]
> Thoughts?
How about, "This is awesome, and I cannot wait until it merges down to trunk?" 🤓
As the one who was telling people recently that Fossil cannot resume a clone, I assume this feature is in reaction to my warnings. I'm happy to be proven wrong post facto.
(18) By Daniel Dumitriu (danield) on 2023-11-28 09:49:28 in reply to 17 [link] [source]
> How about, "This is awesome, and I cannot wait until it merges down to trunk?" 🤓
Same opinion here.
> proven wrong post facto.
Rather post factum. It is only ex post facto, which does not apply here.
(19) By Warren Young (wyoung) on 2023-11-28 11:01:04 in reply to 18 [link] [source]
My Latin isn't as bad as my Mandarin…but it's a close call. 😛
(20) By Andy Bradford (andybradford) on 2023-11-30 04:08:28 in reply to 17 [link] [source]
> I assume this feature is in reaction to my warnings.

In addition to your warnings, it's actually something we discussed quite a few years ago and I started working on but lost track of, so it was never completed. My recent commit to handle SIGINT during cloning:

https://www.fossil-scm.org/home/info/ad2e148541fe0716

actually came about because I've been testing the code against this huge repository:

https://www.fossil-scm.org/forum/forumpost/8cbc83e5d08a86b4

Andy
(21) By Andy Bradford (andybradford) on 2023-12-01 18:16:23 in reply to 1.1 [link] [source]
I've made some additional changes. In testing I decided that it should skip rebuilding the database until the clone is 100% done. However, I also wonder whether the additional steps (VACUUM, outputting the project code, etc.) should also be skipped until the clone is done:

https://www.fossil-scm.org/home/file?udc=1&ln=355-383&ci=17f3408f6b557e27&name=src%2Fclone.c

If the clone isn't complete, why do any of these things? The absence of this output at the end may signal to the user that something has gone wrong, while its presence may lead them to believe the clone was successful (if they don't look carefully through the rest of the output).

Andy