Fossil Forum

Assertion failed

(1) By anonymous on 2021-05-25 16:09:06 [source]


Given fossil version 2.15.1 [2f901f98b3], I'm getting the following assertion failure:

Assertion failed: ((pBlob)->xRealloc==blobReallocMalloc || (pBlob)->xRealloc==blobReallocStatic), function blob_reset, file ./src/blob.c, line 213.
fossil blob_reset.cold.1 + 35
fossil blob_reset + 42
fossil db_finalize + 72
fossil db_sql_trace + 81
fossil sqlite3Close + 76
fossil db_close + 816
fossil fossil_atexit + 147
fossil open_db + 882
fossil sqlite3_shell + 1890
fossil cmd_sqlite3 + 206
fossil fossil_main + 2032
fossil main + 9

(2) By Warren Young (wyoung) on 2021-05-25 16:17:51 in reply to 1 [link] [source]

You’re apparently doing this under “fossil sql”, but is it just firing up the shell that does this, or only for certain queries? What platform? For all repos or just one?


(3) By Richard Hipp (drh) on 2021-05-25 16:22:16 in reply to 2 [link] [source]

I'm with Warren on this. I could make some guesses about what is going wrong. But unless and until I can reproduce the problem, I have no way of knowing whether or not it has been fixed. Please, therefore, explain to us how you generate this error.

(4) By anonymous on 2021-05-25 16:24:54 in reply to 2 [link] [source]

Here are the details:

# uname -a
Darwin 20.4.0 Darwin Kernel Version 20.4.0: Thu Apr 22 21:46:47 PDT 2021; root:xnu-7195.101.2~1/RELEASE_X86_64 x86_64
# fossil remote-url
# fossil sql <<EOF 2>/dev/null
.bail on
.headers off
.mode tabs

select  config.value as value
from    config

where   config.name = 'short-project-name'

limit   1;

.exit
EOF

(5) By Richard Hipp (drh) on 2021-05-25 16:52:07 in reply to 4 [link] [source]

Does not reproduce for me. Anybody else able to generate this error?

(6) By sean (jungleboogie) on 2021-05-25 17:36:48 in reply to 5 [link] [source]

Not for me on trunk or the commit the OP is using.

$ uname -a
Linux 4.19.0-16-amd64 #1 SMP Debian 4.19.181-1 (2021-03-19) x86_64 GNU/Linux

(8) By anonymous on 2021-05-25 19:38:33 in reply to 5 [link] [source]

Further details...

(1) The `fossil sql` command is part of a larger script that processes a given artifact using various fossil CLI commands.

(2) Several of these scripts run concurrently (roughly a dozen) over distinct artifacts.

The issue arises intermittently when multiple scripts run in parallel.

What else would help to narrow this down?

(10) By Warren Young (wyoung) on 2021-05-25 19:49:49 in reply to 8 [link] [source]

> What else would help to narrow this down?

An actual reproducing test case.

A reproducible bug is a dead bug walking.

An intermittent bug relying on underspecified behavior rarely encountered in the wild is likely to live a very long time.

(7.2) By Warren Young (wyoung) on 2021-05-25 19:22:20 edited from 7.1 in reply to 4 [link] [source]

A bunch of us have been trying that over in chat, and none of us can reproduce your symptom. We've tried with:

  • Raspberry Pi
  • Linux on x86_64
  • Catalina 10.15.7 on x86
  • Big Sur 11.3.1 on both M1 and x86_64
  • Fossil release 2.15.1 (src:2f901f98b3)
  • Fossil tip-of-trunk

Furthermore, your test seems to be overly complicated for what you're trying to achieve, which makes me wonder if your actual symptom comes from some quite different purpose, which you're hiding. This gives the same result without nearly as much code:

   $ fossil sql -R ~/path/to/fossil-forum.fossil \
     ".mode tabs" \
     "select value from config where name='short-project-name'"

That is, you don't need…

  1. …the .bail on bit at all, as far as I can tell. This query returns a string or nothing. Why do you need bail-on-error behavior?

  2. …the .headers off bit: that's default behavior.

  3. …the LIMIT 1, since there is only ever zero or one name-to-value mapping in this table.

  4. …the over-specified config table names. There is no ambiguity requiring that here.

  5. …the redundant .exit at the end: it's going to exit regardless.

Maybe explaining what it is you're really trying to do would help us reproduce the symptom?

(9) By anonymous on 2021-05-25 19:49:24 in reply to 7.2 [link] [source]

Thank you for the follow-up.

Yes, this is part of a larger script, which uses various fossil CLI commands.

That said, this is the only use of fossil sql, as the config data is not exposed through the regular CLI as far as I know.

When the script is run serially, processing one artifact at a time, there is no apparent issue.

When several instances of the script are run concurrently, the assertion fails every now and then.

Stylistic issues aside, what else would help narrow down the issue?

Thanks for the help!

(11) By Warren Young (wyoung) on 2021-05-25 19:53:14 in reply to 9 [link] [source]

Does the symptom recur if you rewrite the code as I've shown, or does it require each-and-every one of the oddities I've flagged to show up?

For instance, if removing the redundant .exit fixes it, then:

  1. You have an immediate workaround while we're working on the actual problem; and

  2. We'd then know there's something about the double-exit that triggers the problem.

If the symptom entirely goes away when rewriting the query as I've shown, then the next step would be to bisect the difference between the two: reintroduce the flagged oddities one at a time until the symptom comes back, at which point you have the likely culprit.
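That bisection could be sketched as a small shell harness. Everything here is hypothetical: the repository path, the `FOSSIL` override, and the particular variants tried; it just reports which variants run cleanly.

```shell
#!/bin/sh
# Hypothetical bisection harness: feed each query variant to "fossil sql"
# and report which ones run cleanly. REPO is a placeholder path.
FOSSIL=${FOSSIL:-fossil}
REPO=${REPO:-/path/to/repo.fossil}

try_variant() {
    label=$1; shift
    if printf '%s\n' "$@" | "$FOSSIL" sql -R "$REPO" >/dev/null 2>&1; then
        echo "ok: $label"
    else
        echo "fail: $label"
    fi
}

if command -v "$FOSSIL" >/dev/null 2>&1; then
    try_variant "minimal"    "select value from config where name='short-project-name';"
    try_variant "+ .bail on" ".bail on" "select value from config where name='short-project-name';"
    try_variant "+ limit 1"  "select value from config where name='short-project-name' limit 1;"
else
    echo "fossil not found; nothing to try"
fi
```

Each run adds one oddity back; the first variant that starts failing intermittently is the one worth reporting.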

(12) By MBL (RoboManni) on 2021-05-25 20:20:42 in reply to 9 [link] [source]

How is the concurrency done?

(13) By anonymous on 2021-05-25 20:24:34 in reply to 12 [link] [source]

Background process (i.e. `script &`).
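A minimal sketch of that pattern, with a hypothetical process_artifact function standing in for the OP's larger per-artifact script:

```shell
#!/bin/sh
# One background instance per artifact, then wait for all of them.
# "process_artifact" is a stand-in for the OP's larger script.
process_artifact() {
    # ... the various fossil CLI commands, including "fossil sql", go here ...
    echo "processed $1"
}

for artifact in a1b2c3 d4e5f6 0f9e8d; do
    process_artifact "$artifact" &
done
wait    # block until every background job has exited
```

Each backgrounded instance opens the repository database independently, which is what makes the concurrent-access question relevant here.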

(14) By Richard Hipp (drh) on 2021-05-25 21:00:14 in reply to 1 [link] [source]

Please rebuild using check-in eddfa8dfbe830c27 or later and let us know whether or not it clears your problem. Thanks.

(15) By Warren Young (wyoung) on 2021-05-25 21:38:44 in reply to 14 [link] [source]

There's a secondary symptom besides the debug assertion: doing too many "fossil sql" calls in parallel can result in a "database is locked" error. This is because "fossil sql" doesn't retry the DB conn if it's busy or locked, doubtless for good reasons. Until there's a way to make it do so, you may want to patch your local Fossil like this:

Index: src/shell.c
--- src/shell.c
+++ src/shell.c
@@ -14757,12 +14757,20 @@
             SQLITE_OPEN_READONLY|p->openFlags, 0);
       case SHELL_OPEN_UNSPEC:
       case SHELL_OPEN_NORMAL: {
-        sqlite3_open_v2(p->zDbFilename, &p->db,
-           SQLITE_OPEN_READWRITE|SQLITE_OPEN_CREATE|p->openFlags, 0);
+        int i;
+        for (i = 0; i < 15; ++i) {
+          int rc = sqlite3_open_v2(p->zDbFilename, &p->db,
+             SQLITE_OPEN_READWRITE|SQLITE_OPEN_CREATE|p->openFlags, 0);
+          if( rc==SQLITE_BUSY || rc==SQLITE_LOCKED ) {
+            sqlite3_sleep(1000);
+          }else{
+            break;
+          }
+        }
     globalDb = p->db;
     if( p->db==0 || SQLITE_OK!=sqlite3_errcode(p->db) ){

That allows $CORE_COUNT instances of back-to-back "fossil sql" calls in parallel on a 4-core hyperthreaded box to run without ever getting locked up. It's possible to exhaust this hard-coded 15-tries-once-a-second retry loop by doing even more parallel runs per second, but if you've got that much overlapping concurrency going on, you'd probably do better to throttle calls to a reasonable number rather than increase this timeout further. Running 1000 parallel queries on an 8-core box won't make it go faster than running 8 to 16 or so.
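One way to do that throttling from the shell is xargs -P (not specified by POSIX, but supported by GNU and BSD xargs): cap concurrency at the core count instead of launching one job per artifact. In this sketch, echo stands in for the real "fossil sql" invocation.

```shell
#!/bin/sh
# Cap concurrency at the core count instead of one job per artifact.
# echo stands in for the real per-artifact "fossil sql" invocation.
NPROC=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 4)

printf '%s\n' a1b2c3 d4e5f6 0f9e8d |
    xargs -P "$NPROC" -n 1 sh -c '
        # real work: fossil sql -R /path/to/repo.fossil "...query..."
        echo "queried $1"
    ' sh
```

At most $NPROC queries run at once, so the retry loop in the patch above is far less likely to be exhausted.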

(16) By Warren Young (wyoung) on 2021-05-26 01:21:49 in reply to 15 [link] [source]

This problem is fixed a different way on trunk now.

(17) By anonymous on 2021-05-26 08:44:05 in reply to 16 [link] [source]

> This is because "fossil sql" doesn't retry the DB conn if it's busy or locked

Oh. Good point. Thanks for pointing this out.

Is fossil sql the only CLI command exhibiting this behavior at the moment?

Either way, thank you very much for the patches; will give them a spin and keep on monitoring the system behavior.

Thanks again for all the help.

(18.1) By Warren Young (wyoung) on 2021-05-26 09:48:17 edited from 18.0 in reply to 17 [link] [source]

The commit linked from the prior post changes that. Now it retries for up to 10 seconds before giving up.
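For scripts stuck on an older build, a shell-side wrapper can approximate that retry-for-up-to-10-seconds behavior. This is only a sketch; in particular, matching on the "database is locked" message text is an assumption about what the failing command prints.

```shell
#!/bin/sh
# Retry a command for up to ~10 seconds when it fails with a lock error,
# roughly mirroring what the fixed binary now does internally.
retry_locked() {
    tries=0
    while [ "$tries" -lt 10 ]; do
        if out=$("$@" 2>&1); then
            printf '%s\n' "$out"
            return 0
        fi
        case $out in
            *"database is locked"*) sleep 1; tries=$((tries + 1)) ;;
            *) printf '%s\n' "$out" >&2; return 1 ;;
        esac
    done
    printf '%s\n' "$out" >&2
    return 1
}

# Usage would be: retry_locked fossil sql -R /path/to/repo.fossil "...query..."
retry_locked echo "demo"
```

Non-lock failures are passed through immediately so real errors still surface on the first attempt.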