Fossil Forum

fossil timeline takes a very long time

(1.2) By jvdh (veedeehjay) on 2022-10-17 13:06:51 edited from 1.1 [source]

we have recently migrated our (ubuntu) file server to a VM. previously, to the best of my recollection

fossil timeline -n 0 >/dev/null

when executed on the repo holding the fossil source code itself ("the fossil repository" :)) took a few secs. now it takes 30-50 secs and the colleague in charge is hunting for the cause of this slowdown.

my questions:

  1. what are typical times for the above timeline command (when looking at "the" fossil repo) for other users, when the repo resides on an NFS disk (locally, on my laptop with SSD, I of course get very short times < 1sec)? am I right that it should take a couple of secs at most?

  2. strace shows lots (~ 23k) of queries for nonexistent *-wal and *-journal files not only in the checkout but also in my home dir (where the global settings .fossil file resides): is this to be expected for the timeline command? (my colleague's suspicion is that these ~ 50k queries via NFS might explain, or contribute to, the slowdown.)

update: the issue could now be narrowed down to the location of and multiple accesses to the $HOME/.fossil config file: this location, too, is on NFS and this is the central culprit for the slowdown. moving .fossil to an in-memory disk (tmpfs) and soft-linking to that one under the original name $HOME/.fossil speeds up the timeline query from 30s to 3s. still: is this as it should be??
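
for reference, the workaround amounts to roughly this (paths illustrative; /dev/shm is just one common tmpfs location, and tmpfs content is lost on reboot, so keep a copy of .fossil elsewhere):

mv $HOME/.fossil /dev/shm/.fossil
ln -s /dev/shm/.fossil $HOME/.fossil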

thx/joerg

(2) By Stephan Beal (stephan) on 2022-10-17 13:46:08 in reply to 1.2 [link] [source]

when the repo resides on an NFS disk

NFS has notoriously slow locking. Remove the NFS and your problem will go away.

update: ... this location, too, is on NFS and this is the central culprit for the slowdown.

Ah, good, you discovered that yourself :).

strace shows lots (~ 23k) of queries for nonexistent *-wal and *-journal files not only in the checkout but also in my home dir (where the global settings .fossil file resides): is this to be expected for the timeline command?

Whether such a high count is expected is unfortunately above my pay grade, so i'll leave that one for someone who knows better.

soft-linking to that one under the original name $HOME/.fossil

Though this may have been resolved in the meantime, symlinking of db files has historically been frowned upon. See what is currently section 2.5 of:

https://sqlite.org/howtocorrupt.html

is this as it should be??

Locking on NFS is slow and sqlite requires lots of locking unless it's told not to lock at all. You can try this to see if it will speed things up, but it's also a recipe for eventual corruption except in guaranteed-single-client-only cases:

$ fossil -vfs unix-none timeline -n 0 >/dev/null

That will use a VFS which performs no locking, but... that doesn't work on my local setup (and that's probably a bug):

$ f -vfs unix-none timeline -n 0 > /dev/null
SQLITE_CANTOPEN(14): unable to open database file in "SELECT value FROM global_config WHERE name=$n"
Database error: unable to open database file
SELECT value FROM global_config WHERE name=$n

Removing the -vfs flag resolves it. Weird, but far enough removed from current priorities that i'm not going to jump down that rabbit hole just now! :)

(3) By Warren Young (wyoung) on 2022-10-17 14:20:03 in reply to 2 [link] [source]

It isn't just NFS, either. Try pinging the NAS/SAN host. On a fast, quiet network, you might get 0.12 ms round-trip. A good SSD has roughly a quarter of that latency, and even then, in the network-filesystem case you haven't committed the data to durable storage yet, merely transferred it. A 10x total latency difference isn't unexpected.

If you have rotating rust on the other end of that network link, it's even worse. You can expect transaction rates to drop from the 10k range to the 50-100 range simply because each commit takes two rotations of the storage medium.
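
For a rough sense of scale: at 7,200 RPM one rotation takes about 8.3 ms, so two rotations per commit come to roughly 17 ms, which works out to on the order of 60 commits per second.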

Except insofar as you might try to optimize the transaction count some, none of this is Fossil-specific. SQLite over NFS is terribly slow regardless.

There's no getting around Amdahl's Law.

(4) By jvdh (veedeehjay) on 2022-10-17 17:10:12 in reply to 3 [link] [source]

stephan & warren: thanks, and all understood in principle regarding "sqlite over nfs known to be very slow". what "very slow" specifically means, though, and what runtimes are therefore to be expected for certain queries, I do not know.

and if I may: the specific problem I'd encountered was/is the extremely long time it takes to get the timeline listing of a repo if (and only if) the config file .fossil is on an NFS drive. the repo in question, too, resides on NFS. but just moving .fossil off NFS resolves the problem, mostly: cutting run time for getting the timeline from the repo (that's on NFS all the time) down from 30-50 to 3 secs.

so the question remains: why does fossil have such heavy interaction with the config file .fossil, when asked to provide the timeline of a repo, that it totally dominates the overall runtime of the timeline query?

and I am also curious if ~23k attempts to open $HOME/.fossil-wal and 23k attempts to open $HOME/.fossil-journal are to be expected when asking for some repo's timeline?

so maybe (despite the sqlite/NFS "impedance mismatch") there still is some fossil-specific issue at work here?

(10) By Warren Young (wyoung) on 2022-10-17 18:38:29 in reply to 4 [link] [source]

why does fossil have such heavy interaction with the config file

Because it assumes the DB is local. You've violated that, and the computer is now exacting its revenge.

(12) By jvdh (veedeehjay) on 2022-10-17 19:57:45 in reply to 10 [link] [source]

not an answer to what I asked (or meant): even if the config file were local (not on NFS), the number of reads from the config file would be the same, i.e. huge (but possibly not causing user-noticeable delays, agreed).

the question here was: why is querying some repo for timeline info causing such heavy interaction with the config file at all (or at least more than once for checking some settings)? I don't get why this is happening at all.

also, my empirical finding now is that moving the config file off NFS restores sort-of sane performance even for the repo proper being on NFS. I naively think this might be relevant to improve fossil's performance for the probably not-so-rare use cases where repos are on NFS, no?

(14) By Warren Young (wyoung) on 2022-10-17 20:14:52 in reply to 12 [link] [source]

why is querying some repo for timeline info causing such heavy interaction with the config file at all

I don't know. My only point was, it's so fast it isn't causing a problem when it's local, so no one may have even noticed it before, much less been motivated to improve it.

If you're going to insist on doing Fossil over NFS, then you're in a position to scratch that itch. We await your patch improving things.

(15) By jvdh (veedeehjay) on 2022-10-17 20:37:42 in reply to 14 [link] [source]

no one is using fossil over NFS you mean? well...

and the rest of your reply, with all due respect, is not helpful:

  1. I do not "insist" on using fossil over NFS. it just happens that we have a server and NFS disks served by it. so no alternative, really (and sure not an unusual setup I'd say). it's not my stubborn insistence on doing that...

  2. as you are surely aware, not every fossil user can competently provide patches to the fossil source base. count me among those. I do not "demand" that anyone do anything about the situation or do me a favour. I just use the forum to ask my questions and provide some info which might or might not be of interest to other people (including the devs). the way too often uttered "we await your patch" is just a tiny bit annoying, frankly. definitely not helpful.

but anyway. thanks for bothering to reply, nevertheless.

(18) By Warren Young (wyoung) on 2022-10-17 22:06:04 in reply to 15 [link] [source]

no one is using fossil over NFS you mean?

Given the known problems of SQLite over network filesystems, I dare say you've got a very rare use case indeed.

My claim is not that NFS is rare; it's that the use of SQLite over NFS is rare. Fossil is a relatively rare use of SQLite, so you've got rare times rare, which equals nearly unheard-of.

If you're expecting this presumably nonzero slice of the user base to fix things for you, I think you'll be waiting a while.

the way too often uttered "we await your patch" is just a tiny bit annoying, frankly. definitely not helpful.

Free software advances by people spending their time, skills, and other resources to improve things. Given the premise that almost no one is running Fossil over NFS, then there is little chance of anyone else having any incentive to fix things for you.

The point of the benchmark tests is that they show the current wasteful implementation doesn't matter in the main case. On my iMac, your case runs in a quarter of a second. I suspect I know what's happening here, but if I act on that guess and spend a bunch of time to make it run 10x faster, what possible ROI will this have for me? Sure, I can point to my 25 ms runtime, but who the bleep cares?

Contrast use cases that affect many people. Then it takes only a tiny percentage of the user base to choose to contribute for advances to happen quickly.

(22) By jvdh (veedeehjay) on 2022-10-17 22:36:23 in reply to 18 [link] [source]

If you're expecting this presumably nonzero slice of the user base to fix things for you, I think you'll be waiting a while.

I do think this exchange goes in a wrong/bad direction. I already stressed that I do not expect anything from anybody here. please take note of this...

I also do think that a) your replies are frequently helpful content-wise (appreciated) and b) even more frequently somewhat off-putting tone-wise (not appreciated). so please don't lecture me on what advancing free software is about, thank you very much.

a couple years ago I think the mood on the fossil mailing list was decidedly more friendly+relaxed. maybe one can steer back to that approach. I do not want to fight fights here...

whether or not the problem I am seeing is "esoteric" I cannot really decide. but maybe neither can you? or do you have statistics for that?? and even if it were: so what? are bugs (whether it can be called that in the present case is a different question) that surface only in corner-case usage not bugs? irrelevant? of no concern whatsoever? that would be a peculiar attitude.

my view (and motivation to write more than the initial question here) is:

maybe the reported issue/problem is of potential relevance to the people (or the single person as far as fossil is concerned) in charge of the project since it hints at room for improvement or bugs that might deserve fixing (and this, ultimately, is only loosely correlated with "how many people are affected by the issue", in my view anyway).

if such reports regularly trigger the "don't complain, don't expect someone to fix it since it is irrelevant/your problem, fix it yourself and submit a patch" reflex, this is ultimately putting users off, I'd say.

but let's stop it at that. it's not leading anywhere.

I think the "config file on NFS causes strange slow downs with fossil timeline command" issue is now described clearly enough. whether someone in the know decides that it is worth a second look and possible fix is for the "someone" to decide.

thanks again. EOT

(5) By Richard Hipp (drh) on 2022-10-17 17:24:57 in reply to 1.2 [link] [source]

"fossil timeline -n 0 >/dev/null" ... took a few seconds - now it taks 30-50 secs....

When I run:

    time fossil timeline -n 0 | wc

On my 8-year-old Ubuntu desktop, I get:

     41346  478964 3697150

    real     0m0.298s
    user     0m0.324s
    sys      0m0.016s

So, 300 milliseconds. And this on obsolete hardware. I'm not sure what you are doing to make it take so much time.

(6) By jvdh (veedeehjay) on 2022-10-17 17:34:27 in reply to 5 [link] [source]

as said: NFS is involved. I am doing nothing peculiar otherwise ...

and executing your

time fossil timeline -n 0 | wc

on my 2y old macbook yields

2.2s

on first run and

0.4s

on a second try -- both of which are still notably longer than what you see, for whatever reason...

but the main issue is/was the time it takes on our server (~30-50s) when the config file resides on NFS as explained in previous post.

(7) By Martin Gagnon (mgagnon) on 2022-10-17 17:42:40 in reply to 6 [link] [source]

It seems that you use WAL mode; perhaps WAL mode doesn't perform well on NFS?

(8) By Warren Young (wyoung) on 2022-10-17 18:36:51 in reply to 7 [link] [source]

(9) By Richard Hipp (drh) on 2022-10-17 18:37:11 in reply to 7 [link] [source]

WAL mode makes use of shared memory. But if you have two or more processes on different hosts sharing the same database over NFS, then they obviously cannot share memory. So WAL mode in that scenario fails utterly. You will corrupt the database file if you try to write.

This is why WAL mode is not the default in SQLite.

(11) By jvdh (veedeehjay) on 2022-10-17 19:27:56 in reply to 9 [link] [source]

so I am getting lost somehow, I'm afraid...:

I do not even really know what WAL mode is (I am not using sqlite heavily outside of fossil, so I never needed to find out), let alone have I wittingly activated (or deactivated) it. also, there have been numerous active fossil repos on that server for some 8-10 years w/o ever having seen db corruption (the repos are talked to only via the file system or ssh from the outside).

dumb question: how can I check/change WAL mode status? on a quick search I don't find anything in fossil CLI help except a --wal flag of rebuild. I don't see it in fossil set output either. does it happen at compile time? I only do standard fossil compile (make clean; make).

what I have seen+reported were lots (lots!) of failed attempts to open *-wal and *-journal files in the strace output, like:

newfstatat(AT_FDCWD, "path-to/fossil-fsl/.fslckout-wal", 0x7ffc834fb390, 0) = -1 ENOENT (No such file or directory)

which is in the folder holding the fossil source code repo (yes, on an NFS drive)

and

newfstatat(AT_FDCWD, "$HOME/.fossil-wal", 0x7ffc834fa300, 0) = -1 ENOENT (No such file or directory)

which is in my home dir (on NFS, too). and the same again with '*-journal' file names.

so what does this tell me? that WAL mode is active but it should not be? how do I switch it off, then (a RTFM + link to the respective documentation would suffice :))?

(13) By Warren Young (wyoung) on 2022-10-17 20:13:19 in reply to 11 [link] [source]

how can I check/change WAL mode status?

This will either print 'delete' or 'wal':

$ fossil sql 'pragma journal_mode'

how do I switch it off

Make sure all DB connections are down, then say:

$ fossil sql 'pragma journal_mode=delete'

…if and only if you got 'wal' from the first command.

(16) By jvdh (veedeehjay) on 2022-10-17 20:44:06 in reply to 13 [link] [source]

thanks, this was helpful :).

I did that. the answer is "delete". so summary of situation:

  • WAL mode is off

  • tens of thousands of attempts to read .fossil-wal and .fossil-journal files that do not exist when I issue fossil timeline -n 0 on the fossil source code repo.

  • very slow response to that fossil timeline query if the global config db (.fossil in my case) resides on NFS. might or might not be related to those ~50k attempts to read .fossil-wal and .fossil-journal.

  • performance increases approx 10x if config db is moved off NFS (while the repo itself still resides on NFS).

I continue to find it very astonishing that the location of the config db (NFS or local disk) makes such a huge difference (for the timeline command in my case, but probably that will apply to other fossil commands as well I guess).

maybe DRH can shed some light on this (including some best practice howto when NFS is involved)?

(17) By jvdh (veedeehjay) on 2022-10-17 21:49:42 in reply to 13 [link] [source]

…if and only if you got 'wal' from the first command.

just curious: why the "and only if"? i.e. what harm is done if the answer was "delete" in the first place and one then issues journal_mode=delete?

(19) By Warren Young (wyoung) on 2022-10-17 22:09:48 in reply to 17 [link] [source]

"…the journal_mode cannot be changed while a transaction is active."

I don't know that there is a problem with setting it from its current setting back to that same setting, but I also don't know that there isn't one. If it pleases you to roll the dice, go on ahead.

(20) By jvdh (veedeehjay) on 2022-10-17 22:12:32 in reply to 19 [link] [source]

If it pleases you to roll the dice, go on ahead.

it does not. why should it?

in any case, not that relevant. I won't touch that setting any time soon again.

thanks again.

(28) By Stephan Beal (stephan) on 2022-10-18 13:39:56 in reply to 20 [link] [source]

I won't touch that setting any time soon again.

Another thing you might want to try is journal mode "truncate". In recent experimentation i found that it is significantly faster on one particular obscure filesystem and it might (might, just speculating) also be faster on NFS.
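
That would be along the lines of the pragma commands shown earlier in this thread, i.e. something like:

$ fossil sql 'pragma journal_mode=truncate'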

(30) By jvdh (veedeehjay) on 2022-10-18 15:43:23 in reply to 28 [link] [source]

mmh. I have now issued

fossil sql 'pragma journal_mode=truncate'

within the checkout (I trust it is a per-repo action, or is this recorded in the config file as the mode of operation for all repos?).

however, prior to and also after issuing the command, dbstat reports that "delete mode" is selected...

what am I doing wrong?

(21) By Martin Gagnon (mgagnon) on 2022-10-17 22:36:23 in reply to 11 [link] [source]

You can check if a repo is in WAL mode or not (without using the sql command) by looking at the last line of the dbstat command's output.

example:

$ fossil dbstat
project-name:      Fossil
repository-size:   210,319,360 bytes
artifact-count:    56,637 (stored as 9,014 full text and 47,623 deltas)
artifact-sizes:    109,587 average, 8,560,094 max, 6,206,618,621 total
compression-ratio: 29:1
check-ins:         17,289
files:             1,374 across all branches
wiki-pages:        73 (496 changes)
tickets:           1,113 (3,574 changes)
events:            8
tag-changes:       1,634
latest-change:     2022-09-29 11:30:00 - about 18 days ago
project-age:       5,568 days or approximately 15.24 years.
project-id:        CE59BB9F186226D80E49D1FA2DB29F935CCA0333
schema-version:    2015-01-24
fossil-version:    2022-09-21 16:52:12 [1356151980] [2.20] (gcc-7.5.0)
sqlite-version:    2022-09-02 21:19:24 [da7af29096] (3.40.0)
database-stats:    205,390 pages, 1024 bytes/pg, 82,875 free pages, UTF-8, wal mode
#                                                                         ~~~~~~~~~~

It can be switched to "wal" mode using the "rebuild -wal" command, as you may have noticed, but there doesn't seem to be a way to go back to delete mode without using the "sql" command.
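
That is, presumably something like (untested here):

$ fossil rebuild --wal                      # switch the repository db to WAL mode
$ fossil sql 'pragma journal_mode=delete'   # apparently the only way back to delete mode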

(23) By jvdh (veedeehjay) on 2022-10-17 22:44:17 in reply to 21 [link] [source]

thanks a lot for this tip. that ("dbstat") I can memorise more easily than the raw sql pragma thingy... I actually nearly had forgotten about the 'dbstat' command altogether. so double bonus ;).

and as said in a previous post: in fact, WAL mode is off for the repo in question (which makes me wonder even more why fossil tries to access those (non-existent) *-wal and *-journal files 50k times, presuming those only come into play if WAL mode is on).

(24) By brickviking on 2022-10-18 09:57:58 in reply to 5 [link] [source]

Here are results for my systems. I haven't updated my fossil for about three weeks (Fossil 2.20 [6362a98fa6] 2022-09-29 09:25:09, compiled on my system).

For what it's worth, my 14 year old spinning rust system starts off at 8.5 seconds, then seems to settle down to a speedier 0.567 seconds on subsequent invocations, provided I use the command in short order.

Testing the same scenario on a FreeBSD 13.1 VM run from that same machine netted me similar results, which are about what I'd expect for my system (8.9 seconds initially, settling down to about 0.89 seconds).

An Ubuntu system from 2014 runs the same command initially in about 1.1 seconds, settling down to about 0.29 seconds. None of these systems involve NFS.

One point of note with strace though, when I run it like this:

strace -fo fossilout.strace fossil timeline -n0 -R fossil.fossil

I get four results for ENOENT specifically on /home/viking/.fossil-wal. I get 25 results for fossil.fossil-wal, and the same results seem to be true for the -journal files. I also never turned on WAL mode, and 'pragma journal_mode' also gives me 'delete'.


Cheers, brickviking (Post 5)

(25.2) By jvdh (veedeehjay) on 2022-10-18 10:52:26 edited from 25.1 in reply to 24 [link] [source]

For what it's worth, my 14 year old spinning rust system starts off at 8.5 seconds, then seems to settle down to a speedier 0.567 seconds on subsequent invocations, provided I use the command in short order.

yes, that's surely file caching kicking in.

otherwise, the numbers seem to be in agreement with drh's (on ubuntu, too), while I see notably inferior performance on my 2y old macbook with SSD (about 0.8s or so), which is not relevant (since still fast), but remarkable (to me) insofar as the mac should be much faster (definitely not slower) spec-wise.

and thanks for confirming those (I guess) spurious queries of .fossil-wal and *-journal.

here, I find it remarkable that you seemingly get 25 of those while I get 2x23000 in the previously described setup (repo as well as .fossil on NFS drive(s)).

I continue to wonder why fossil talks so much to its global config file at all, irrespective of the question whether this causes a relevant performance penalty (which it does in the NFS scenario, and notably it is caused only by the location of .fossil, not by the location of the actually queried repo on NFS). it's hardly imaginable to me that this is unavoidable or intended.

(26) By brickviking on 2022-10-18 12:43:36 in reply to 25.2 [link] [source]

... queries of .fossil-wal and *-journal.

here, I find it remarkable that you seemingly get 25 of those ...

To correct things, I get 4 each of /home/viking/.fossil-wal and -journal. I get 25 each for fossil.fossil-wal and -journal. Different file.


Cheers, brickviking (Post 6)

(27) By jvdh (veedeehjay) on 2022-10-18 12:54:06 in reply to 26 [link] [source]

yes, I did get that :). fossil tries to access altogether 2 non-existent *-wal and 2 *-journal files: one pair concerning the considered repo, one pair concerning the config db. and the mysterious heavy interaction with the config file causes that (NFS-only, seemingly) performance breakdown while the repo itself residing on NFS does not cause massive slow down (for the timeline command, anyway).

what I meant by "remarkable" is the difference in frequency of attempted accesses by 3 orders of magnitude between your test and mine.

(29) By Stephan Beal (stephan) on 2022-10-18 13:46:06 in reply to 27 [link] [source]

what I meant by "remarkable" is the difference in frequency of attempted accesses by 3 orders of magnitude between your test and mine.

A hypothetical explanation: when fossil reads settings it often goes through a hierarchy of them:

checkout ==> repo ==> global

Your particularly slow case may indicate that it's always falling back to the global settings for you. Perhaps you can experiment with explicitly setting lots of the config options in the repo. Which settings, i cannot even begin to speculate, but if you can discover (via strace) which ones it's continually looking up, you might be able to "relocate" those to the repo-side config.
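
For example, something along these lines (illustrative only; which setting actually matters is exactly the open question):

$ fossil settings                      # settings marked "(global)" come from ~/.fossil
$ fossil settings SETTING-NAME VALUE   # store a repo-local value for SETTING-NAME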

(31) By jvdh (veedeehjay) on 2022-10-18 15:57:59 in reply to 29 [link] [source]

thanks for this idea. I will play with that but...

there are only 10 interactions with ".fossil" in the strace output. however, there are 23033 interactions with (or rather, failed attempts to open the nonexistent) ".fossil-wal" and another 23033 with ".fossil-journal".

WAL is definitely off so I do not understand that part (20k+ attempts to open .fossil-wal, despite WAL not involved at all).

and .fossil-journal queries I do not understand since fossil timeline seems not to require/provide some sort of rollback of transactions (and for the config file, at that), no?

I am just mystified (NFS or not) why this is happening at all. I of course understand, that the triggered machinery might be agnostic of what fossil command actually is currently executed, but still it seems that rather massive redundant overhead is happening here and I would hope this could be avoided/bypassed.

and the fact remains, that just moving .fossil off the NFS (while keeping the repo on NFS as before) does improve the performance from quite unusable (30-50s) to "slow but tolerable" (3s). this really is strange I think.

(33) By Stephan Beal (stephan) on 2022-10-18 16:42:43 in reply to 31 [link] [source]

and .fossil-journal queries I do not understand since fossil timeline seems not to require/provide some sort of rollback of transactions (and for the config file, at that), no?

i can't answer your question about the wal files, but every time a db is opened, sqlite will look for a corresponding stale journal file (which a WAL file is). Those are typically left behind by app crashes. If it finds one, it will replay that journal if it can (if the db isn't read-only). That happens at a level below fossil, though - it's not actually fossil looking for those files, but sqlite3.

I am just mystified (NFS or not) why this is happening at all. I of course understand, that the triggered machinery might be agnostic of what fossil command actually is currently executed, but still it seems that rather massive redundant overhead is happening here and I would hope this could be avoided/bypassed.

No interesting hypotheses spring to mind and you've ruled out the "make the config local" workaround.

and the fact remains, that just moving .fossil off the NFS (while keeping the repo on NFS as before) does improve the performance from quite unusable (30-50s) to "slow but tolerable" (3s). this really is strange I think.

Personal anecdote: about 15 years ago i was working on an I/O-heavy C library and found that its POSIX locking code ran anywhere from 10x to 50x slower over NFS than it did locally. Locking over NFS is amazingly slow, but proper locking is needed in order to support db file consistency.

i.e. it does not surprise me in the slightest bit that hitting any sqlite3 db over NFS is painfully slow.

Why the global config db is hit so often is a mystery, though - fossil tends to cache results of oft-used settings and reuse them. For example, the setting which determines whether the timeline uses UTC or server-local time is only queried once and that result is cached (that one query might otherwise explain many hits on the timeline).

Ah, there is a way for you to find out what settings it's searching for:

[stephan@nuc:~/f/fossil]$ f timeline --sqltrace -n 0 1>/dev/null 2> /tmp/x

that output file will contain ALL SQL statements that fossil uses during that run. Grepping that for "global_config":

[stephan@nuc:~/f/fossil]$ grep -c global_config /tmp/x
23041

wow. Obviously something's not right there.

"comment-format" is the offending setting. Presumably it's called once per timeline row. i'll look into getting that result cached.

(34) By Stephan Beal (stephan) on 2022-10-18 16:47:39 in reply to 33 [link] [source]

"comment-format" is the offending setting. Presumably it's called once per timeline row. i'll look into getting that result cached.

A fix is underway, but you can try the --comment-format 1 flag as a temporary workaround:

[stephan@nuc:~/f/fossil]$ f timeline --sqltrace -n 0 --comment-format 1 1>/dev/null 2> /tmp/x
[stephan@nuc:~/f/fossil]$ grep -c global_config /tmp/x
1

(36) By jvdh (veedeehjay) on 2022-10-18 17:02:31 in reply to 34 [link] [source]

WOW, I think you've found it indeed:

time fossil timeline -n 0 --comment-format 1|wc
  38826  478997 3674726

real    0m02.68s
user    0m00.60s
sys     0m00.18s

whereas (repo and config on NFS in both cases)

time fossil timeline -n 0 |wc                  
  38826  478997 3674726

real    0m30.11s
user    0m01.05s
sys     0m02.72s

this is really great! very much appreciated that you've looked into this and found the cause (which, as I understand it, indeed qualifies as a bug, has nothing to do with NFS per se, and can/should be fixed :)).

remaining question on my side: where is that --comment-format 1 bit to be found in the documentation?

(37) By Stephan Beal (stephan) on 2022-10-18 17:12:53 in reply to 36 [link] [source]

where is that --comment-format 1 bit to be found in the documentation?

$ f help -o
Command-line options common to all commands:

...
  --comfmtflags VALUE     Set comment formatting flags to VALUE
  --comment-format VALUE  Alias for --comfmtflags
...

And the help for that setting can be found here.

But the way i found it was to use grep to count the queries, which clearly named "comment-format" as the offending setting, and then to grep for it in the code. It was already set up for caching but missed this particular fallback case, which caused it to keep trying instead of falling back to the default.
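
A rough sketch of that counting step over the --sqltrace output from above (the exact invocation is illustrative):

$ grep global_config /tmp/x | sort | uniq -c | sort -rn | head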

(39) By jvdh (veedeehjay) on 2022-10-18 18:01:34 in reply to 37 [link] [source]

ah there it is :). thanks!

(41) By brickviking on 2022-10-19 05:23:00 in reply to 33 [link] [source]

On 18 October, Stephan wrote:

[stephan@nuc:~/f/fossil]$ f timeline --sqltrace -n 0 1>/dev/null 2> /tmp/x
[stephan@nuc:~/f/fossil]$ grep -c global_config /tmp/x
23041

In my comparative test, I get 2, for a fossil that was compiled Sept 29th. Looking in my settings, comment_format is already set to 1, though I can't tell whether that's a result of me doing a normal "fossil pull ..." or not, because I don't know which few settings get cloned, if any.

Anyhow, I hope that provides a wee bit of context.


Cheers, brickviking (Post7)

(32) By jvdh (veedeehjay) on 2022-10-18 16:03:09 in reply to 29 [link] [source]

I have just done that: setting everything locally. meaning, I explicitly set locally all options that were previously reported by fossil set as "(global)": unfortunately, this makes no runtime difference at all (still hovering around 30s). bummer...

(35) By Stephan Beal (stephan) on 2022-10-18 16:58:17 in reply to 1.2 [link] [source]

the issue could now be narrowed down to the location of and multiple accesses to the $HOME/.fossil config file: this location, too, is on NFS

Please try the latest trunk and see if that resolves the issue. That change caches the "comment-format" result and reduces hits on the global config db, for the timeline command, from one per timeline row to one. The result was supposed to have been cached before, but the caching missed one particular case, causing it to effectively be a no-op.
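
Since you build from a source checkout as you described, that would be roughly:

$ fossil update trunk
$ make clean; make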

(38) By jvdh (veedeehjay) on 2022-10-18 17:59:08 in reply to 35 [link] [source]

yep, you did the trick:

time fossil timeline -n 0 |wc
  38875  479612 3679407

real    0m02.66s
user    0m00.54s
sys     0m00.22s

i.e. about 12x faster than before :). the rest (still being distinctly slower than on local disk) might very well then really be an NFS+sqlite effect. but this does not cause real pain I'd say: 2.5s for the full listing is still sort of ok :).

for the record: I now find 5 attempts in the strace output to open .fossil-wal (and 5 for .fossil-journal) which previously happened 23k times. "why .fossil-wal" at all still remains unclear, as I understand you, but anyway...

I also wonder: here https://fossil-scm.org/forum/forumpost/4df2af46b76e5af9 "brickviking" reported his statistics, seeing just 4 access attempts to *-journal and *-wal, but before your fix: is the whole thing a regression that has crept in only recently?

in any case: really thanks a lot for bothering to look into (and solving!) this issue!

(40) By Stephan Beal (stephan) on 2022-10-18 20:32:47 in reply to 38 [link] [source]

... seeing just 4 access attempts... is the whole thing a regression that has crept in only recently?

If it is, it wasn't in/via the code which was touched today - the last change there was in January 2019. It's conceivable that an sqlite3 change caused it but unlikely that such a change would go unnoticed by its many tens of thousands of users: i monitor that forum closely and we haven't received any recent regression reports similar to this one.

It's also conceivable (but seems unlikely) that the clone you were working from already had that setting in it, in which case you would not have seen the slowdown unless, perhaps, the repo clone itself was on NFS. Perhaps (just speculating) re-cloning the repo lost that setting, causing it to fall back to the global config.

No idea, really.