Segfault when creating a wiki page or using the sandbox
(1) By anonymous on 2020-12-02 23:02:22 [source]
Hi,
I have built fossil on macOS 10.11.6 from trunk today (64 bit executable, no custom compile settings). I have a new empty repository running as a server now. When trying to use the wiki sandbox or when creating a new wiki page, I get this ‘bad request’ as content on the wiki web page:
Segfault
(0) 0 fossil 0x000000010557f2b8 sigsegv_handler + 40
(1) 1 libsystem_platform.dylib 0x00007fff98ba452a _sigtramp + 26
(2) 2 ??? 0x00007fd1d9500000 0x0 + 140539270791168
(3) 3 fossil 0x000000010551f1cf builtin_emit_fossil_js_once + 95
(4) 4 fossil 0x000000010551f3cd builtin_fossil_js_bundle_or + 221
(5) 5 fossil 0x000000010563daf2 wikiedit_page + 1378
(6) 6 fossil 0x000000010558065d process_one_web_page + 1565
(7) 7 fossil 0x0000000105581734 cmd_webserver + 1444
(8) 8 fossil 0x000000010557dff2 fossil_main + 1842
(9) 9 fossil 0x000000010557d8b9 main + 9
(10) 10 libdyld.dylib 0x00007fff965545ad start + 1
(11) 11 ??? 0x0000000000000005
Any idea what to do about it?
Torsten
(2) By Warren Young (wyoung) on 2020-12-02 23:15:18 in reply to 1 [link] [source]
Doesn't happen on macOS 10.15.7.
What happens if you repeat your attempt under:
$ fossil clean
$ ./configure --with-sanitizer=address
$ make -j11
$ ./fossil ui
(3) By Richard Hipp (drh) on 2020-12-02 23:25:07 in reply to 1 [link] [source]
Which of the 11 different check-ins from today did you use? Is the problem reproducible? Have you rebuilt and retried it using the very latest check-in?
(4) By Torsten Berg (torstenberg) on 2020-12-03 07:38:43 in reply to 2 [link] [source]
I assume, you meant to write make clean in the first line? (‘fossil clean’ does not do anything, since the repository is still empty)
When doing make -j11 I get:
clang: error: unsupported argument 'address' to option 'fsanitize='
clang: error: unsupported argument 'address' to option 'fsanitize='
clang: error: unsupported argument 'address' to option 'fsanitize='
make: *** [bld/shell.o] Error 1
make: *** Waiting for unfinished jobs....
make: *** [bld/linenoise.o] Error 1
make: *** [bld/sqlite3.o] Error 1
clang: error: unsupported argument 'address' to option 'fsanitize='
clang: error: unsupported argument 'address' to option 'fsanitize='
make: *** [bld/th_lang.o] Error 1
make: *** [bld/th.o] Error 1
So, I cannot test what the outcome would be, sorry.
(5) By Torsten Berg (torstenberg) on 2020-12-03 07:52:22 in reply to 3 [link] [source]
I used check-in bbfd6123506e10e1.
I have now rebuilt fossil with the latest leaf 0457c40ae79d94c0 and still get the same segfault when trying to use e.g. the wiki sandbox:
Segfault
(0) 0 fossil 0x000000010aaf4ce8 sigsegv_handler + 40
(1) 1 libsystem_platform.dylib 0x00007fff98ba452a _sigtramp + 26
(2) 2 ??? 0x00007f9d38c00000 0x0 + 140313238700032
(3) 3 fossil 0x000000010aa94c0f builtin_emit_fossil_js_once + 95
(4) 4 fossil 0x000000010aa94e0d builtin_fossil_js_bundle_or + 221
(5) 5 fossil 0x000000010abb3c42 wikiedit_page + 1378
(6) 6 fossil 0x000000010aaf608d process_one_web_page + 1565
(7) 7 fossil 0x000000010aaf7164 cmd_webserver + 1444
(8) 8 fossil 0x000000010aaf3a22 fossil_main + 1842
(9) 9 fossil 0x000000010aaf32e9 main + 9
(10) 10 libdyld.dylib 0x00007fff965545ad start + 1
(6) By Stephan Beal (stephan) on 2020-12-03 08:01:00 in reply to 1 [link] [source]
When trying to use the wiki sandbox or when creating a new wiki page
FWIW, i'm also unable to reproduce it on Linux and going through the code listed in stack trace to look for a likely culprit hasn't turned up anything suspicious.
(7) By Torsten Berg (torstenberg) on 2020-12-03 08:03:43 in reply to 6 [link] [source]
OK, I just tried with the version tagged version-2.13 and this one does the job. So, I assume something must have happened in between that version and the recent leaf.
(8) By Warren Young (wyoung) on 2020-12-03 08:07:07 in reply to 7 [link] [source]
So you can bisect between them to pinpoint the problem commit, then. Please do.
(9) By Warren Young (wyoung) on 2020-12-03 08:24:24 in reply to 4 [link] [source]
I assume, you meant to write make clean in the first line?
I meant what I wrote. fossil clean will return the repo to the just-checked-out state, whereas make clean might do that, depending on how the clean target is written in the Makefile. In principle, both could do the same thing, but given how people are, I'd expect fossil clean to be more reliably starting from fresh than make clean on any given Fossil-based project.
Just for one practical difference, fossil clean will force you to re-configure, whereas make clean will not. (On purpose, I'm sure.)
‘fossil clean’ does not do anything
Sure about that?
the repository is still empty
Sorry, but that's literally nonsense. The repository contains Fossil's historical source code.
If you meant to say that the working directory is "empty," that's also not true. It should contain a built version of Fossil and the source code for same, else how are you getting a crash?
fossil clean will remove all products of that build and force a restart. That's important to my proposed test, because you can't mix *.o with different -fsanitize options. You must start with a clean working tree when changing sanitizer options.
unsupported argument 'address' to option 'fsanitize...
That doesn't make a lot of sense: 10.11 came with Xcode 7, which was the first to include ASAN.
I'll see if I can get a VM running 10.11, but in the meantime, work with the assumption that it should work. You may find a path past whatever obstacle you're running into.
(10) By Torsten Berg (torstenberg) on 2020-12-03 08:47:13 in reply to 9 [link] [source]
OK, I am beginning to understand where the misunderstanding is on my side. Are you assuming that I am building fossil from a cloned repository? Actually, I only downloaded a tarball and built from that. That’s why I thought fossil clean would not make sense. Because, when I do that for my freshly created local repository (a completely different project) using the newly built fossil, then it will not do anything as I haven’t added files to that repository yet :-)
I will now continue to find the commit where the problem occurs in the first place (see posts below).
(11) By Torsten Berg (torstenberg) on 2020-12-03 10:11:29 in reply to 8 [link] [source]
OK, finished.
The last commit that does work is 3ad620df0058144d.
The next commit only 5 hours later is f044cf2a91b5906f. This one shows the segfault. The commit comment is: “More aggressive reuse of prepared statements for improved performance.”
(12) By Warren Young (wyoung) on 2020-12-03 11:19:44 in reply to 11 [link] [source]
Does the symptom go away if you build tip-of-trunk with that single commit backed out?
$ fossil merge --backout f044cf2a91b5906f
(13) By Richard Hipp (drh) on 2020-12-03 12:14:40 in reply to 11 [link] [source]
Create a file (here named "r1.txt") with the following content:
GET /wikiedit?name=Sandbox
And then run:
fossil test-http <r1.txt
Does that also segfault? If so, please run the command in the debugger and give us a stack trace from there.
If not, please also try running:
fossil ui --sqltrace
And post the output up to the point where the segfault occurs.
(14) By Richard Hipp (drh) on 2020-12-03 13:14:12 in reply to 6 [link] [source]
I am likewise unable to repro the problem. Furthermore, I note that no segfaults are showing up in the server logs on any of my Fossil websites.
I'm really curious to figure out what the cause of this is. I hope that Torsten replies with more information.
(15) By Richard Hipp (drh) on 2020-12-03 13:36:15 in reply to 14 [link] [source]
Admin Tip:
If you are running a Fossil website, you should enable the error log. (The security audit page will fuss at you if you do not.) Enable error logging by adding an "errorlog: FILENAME" line to the CGI script, or by using the "--errorlog FILENAME" command-line option for the "fossil http" or "fossil server" commands.
Having done so, any segfaults will be caught by the sigsegv_handler() routine in main.c, which should then add an appropriate message to your error log. If you log in as the Setup user, you can test this on the "/test-warning" page. There is an option on that page (option 5) that lets you provoke a deliberate segfault. This then should show up in your log file.
I tried this just now on the Fossil Forum, and the log file now contains the following entry:
------------- 2020-12-03 13:24:58 UTC ------------
panic: Segfault
(0) [0x45cbae]
(1) [0x46029f]
(2) [0x45edd6]
(3) [0x45fa8d]
(4) [0x45dbd2]
(5) [0x405b8c]
(6) [0x72fe76]
(7) [0x730465]
(8) [0x405e59]
HTTP_HOST=www.fossil-scm.org
HTTP_REFERER=https://www.fossil-scm.org/forum/test-warning
HTTP_USER_AGENT=Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:83.0) Gecko/20100101 Firefox/83.0
PATH_INFO=/test-warning
QUERY_STRING=case=5
REMOTE_ADDR=2605:a601:a1ab:1100:9c31:6145:79d5:d081
REQUEST_METHOD=GET
REQUEST_URI=/forum/test-warning
SCRIPT_NAME=/forum
So we know that this mechanism is working. And yet, there are no reports of any (non-deliberate) segfaults in the logs.
(16) By Warren Young (wyoung) on 2020-12-03 19:36:05 in reply to 13 [link] [source]
fossil test-http <r1.txt
I built a macOS 10.11 (El Capitan) VM, and the symptom replicates here with that command within the Fossil repo, atop an anonymous clone, using tip-of-trunk.
I was able to get it to build with ASAN, which reports:
==4935==ERROR: AddressSanitizer: SEGV on unknown address 0x7fff00000000 (pc 0x000107a8cc00 bp 0x7fff59220490 sp 0x7fff5921fc10 T0)
#0 0x107a8cbff in wrap_strcmp (libclang_rt.asan_osx_dynamic.dylib+0xdbff)
#1 0x106a0fff9 in builtin_emit_fossil_js_once builtin.c:721
#2 0x106a1053a in builtin_fossil_js_bundle_or builtin.c:870
#3 0x106c9d6e0 in wikiedit_page wiki.c:1302
#4 0x106ad8a11 in process_one_web_page main.c:1971
#5 0x106ad3157 in fossil_main main.c:940
#6 0x106ad1afa in main main.c:648
#7 0x7fff86e385ac in start (libdyld.dylib+0x35ac)
Backing out the change blamed by the OP doesn't fix the symptom, so I started re-bisecting it and ended up backing up to 2.12, which changes the output thus:
=================================================================
==6432==ERROR: AddressSanitizer: SEGV on unknown address 0x000100000000 (pc 0x00010409c244 bp 0x7fff5bc893d0 sp 0x7fff5bc88f20 T0)
#0 0x10409c243 in vxprintf printf.c:218
#1 0x1040a0896 in mprintf printf.c:891
#2 0x1041c146d in style_emit_fossil_js_apis style.c:1549
#3 0x104205aa0 in wikiedit_page wiki.c:1271
The remaining 4 steps are essentially the same as above.
Since that's also blaming the JS emit code, I assume it's the same problem within the older formulation of the same code — before JS bundling — so I kept digging.
I ended up tracing it to commit [19f2753522], the merge of /wikiedit to trunk.
Therefore, I started bisecting again on the feature branch and narrowed it to commit [2ec332a0c2], but that's another merge commit, so I checked branch refactor-js-handling, but both ends of it pass this test.
Therefore, it's the merge of branch refactor-js-handling into branch ajax-wiki-editor that caused the problem. The resulting ASAN report is:
==14407==ERROR: AddressSanitizer: SEGV on unknown address 0x000100000000 (pc 0x000102f03af4 bp 0x7fff5ce203d0 sp 0x7fff5ce1ff20 T0)
#0 0x102f03af3 in vxprintf printf.c:218
#1 0x102f08146 in mprintf printf.c:891
#2 0x103028c4d in style_emit_fossil_js_apis style.c:1535
#3 0x10306d46e in wikiedit_page_v2 wiki.c:1250
#4 0x102ed1d51 in process_one_web_page main.c:1961
#5 0x102ecc5cb in fossil_main main.c:938
#6 0x102ecaf8a in main main.c:648
#7 0x7fff86e385ac in start (libdyld.dylib+0x35ac)
(17.1) By Warren Young (wyoung) on 2020-12-03 19:49:48 edited from 17.0 in reply to 16 [link] [source]
I double-checked tip-of-trunk under ASAN on the VM host, macOS 10.15.7, and it does not crash. That suggests this is a compiler/library bug specific to this 5-year-old platform.
I'll note that getting this VM set up was a colossal PITA due to problems with TLS certificates and such. Unless my ASAN stack traces posted above cause a bright flash of enlightenment for someone, maybe the best path is simply to retire use of this old platform.
(18) By Torsten Berg (torstenberg) on 2020-12-03 21:49:09 in reply to 16 [link] [source]
Hi,
Wow, thanks for the effort! I was just getting home and wanted to start digging into this myself ... but I guess this is not necessary any more. If it is this older macOS causing the trouble, it is time to get that macOS server updated. At least, I could now build a running version of fossil that will do for now ... which is great, so thanks again for your support and help!
If you still would like me to test something, please let me know!
Torsten
(19) By Warren Young (wyoung) on 2020-12-03 22:09:36 in reply to 18 [link] [source]
Or, just roll back to 2.11 on that box. That's still a few years advanced past the EOL date for that OS, so it's better than running a contemporaneous version of Fossil.
(20) By BPK (bpk000) on 2020-12-03 22:21:55 in reply to 1 [link] [source]
I'm also running OSX 10.11.6 (15G22010) and can reproduce the segfault issue. Here's a screenshot showing the segfault after trying to create a new wiki page.
I checked out bbfd6123506e10e1 and then:
$ ./configure && make
$ fossil init test.fossil
$ fossil open -f test.fossil
$ fossil ui
Next:
- In the browser, I navigated to http://localhost:8080/wikinew
- Entered name of new wiki page: test1; Markup style: Fossil Wiki; clicked Create button.
- The resulting page upon submission shows the segfault -- the screenshot linked above.
Here's a cut-and-paste of the browser output:
Segfault
(0) 0 fossil 0x00000001081e2058 sigsegv_handler + 40
(1) 1 libsystem_platform.dylib 0x00007fff9f7b952a _sigtramp + 26
(2) 2 ??? 0x0000000000000000 0x0 + 0
(3) 3 fossil 0x0000000108180b6a builtin_emit_fossil_js_once + 90
(4) 4 fossil 0x0000000108180ebd builtin_fossil_js_bundle_or + 221
(5) 5 fossil 0x00000001082a2781 wikiedit_page + 1473
(6) 6 fossil 0x00000001081e34c7 process_one_web_page + 2055
(7) 7 fossil 0x00000001081e444f cmd_webserver + 1471
(8) 8 fossil 0x00000001081e0924 fossil_main + 1892
(9) 9 fossil 0x00000001081e01b9 main + 9
(10) 10 libdyld.dylib 0x00007fff9b63c5ad start + 1
The terminal output is the following:
$ fossil ui
Listening for HTTP requests on TCP port 8080
------------- 2020-12-03 21:46:35 UTC ------------
panic: Segfault
(0) 0 fossil 0x00000001081e2058 sigsegv_handler + 40
(1) 1 libsystem_platform.dylib 0x00007fff9f7b952a _sigtramp + 26
(2) 2 ??? 0x0000000000000000 0x0 + 0
(3) 3 fossil 0x0000000108180b6a builtin_emit_fossil_js_once + 90
(4) 4 fossil 0x0000000108180ebd builtin_fossil_js_bundle_or + 221
(5) 5 fossil 0x00000001082a2781 wikiedit_page + 1473
(6) 6 fossil 0x00000001081e34c7 process_one_web_page + 2055
(7) 7 fossil 0x00000001081e444f cmd_webserver + 1471
(8) 8 fossil 0x00000001081e0924 fossil_main + 1892
(9) 9 fossil 0x00000001081e01b9 main + 9
(10) 10 libdyld.dylib 0x00007fff9b63c5ad start + 1
HTTP_HOST=localhost:8080
HTTP_REFERER=http://localhost:8080/wikinew
HTTP_USER_AGENT=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.1.2 Safari/605.1.15
PATH_INFO=/wikiedit
QUERY_STRING=name=test1&mimetype=text/x-fossil-wiki
REMOTE_ADDR=127.0.0.1
REQUEST_METHOD=GET
REQUEST_URI=/wikiedit?name=test1&mimetype=text/x-fossil-wiki
The fossil server continues to run after the error. I can navigate to other pages (/timeline, /home, etc).
If there is something else I can do to help diagnose the issue, please let me know.
(21) By Stephan Beal (stephan) on 2020-12-04 01:30:27 in reply to 17.1 [link] [source]
I double-checked tip-of-trunk under ASAN on the VM host, macOS 10.15.7, and it does not crash. That suggests this is a compiler/library bug specific to this 5-year-old platform.
Thank you for the detailed archaeology! i'm on my way out the door for the rest of the day but will take another look through the stacktraced code "just in case" as soon as i can. My instinct is also that it's a platform bug, but it can't hurt to go over it again.
(22) By Warren Young (wyoung) on 2020-12-04 18:42:07 in reply to 21 [link] [source]
it's a platform bug
Only insofar as correct operation depends on undefined behavior in the C standard.
The issue turned out to be passing 0 thru a variadic function as the sentinel for a list of pointers. In many contexts, 0 and NULL are equivalent, but not in this case. For example, 0 could be passed as a 32-bit int while the callee reads a 64-bit pointer, which would result in a 64-bit pointer with half its bits being garbage.
I didn't dig down to the assembly to find out if that's literally the case here, but it seems plausible.
Anyway, changing the sentinel to NULL fixed the symptom. I've applied a GCC/Clang style function attribute; squishing all of the resulting warnings fixes the reported symptom, and it should prevent recurrences.
Other variadic functions in Fossil should make use of this new NULL_SENTINEL attribute.
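To make the failure mode concrete, here is a minimal sketch of a pointer-list variadic function guarded by the GCC/Clang sentinel attribute. The macro name NULL_SENTINEL comes from the post above, but its definition here is an assumption, and count_names() is a hypothetical helper, not a Fossil function.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>

/* Assumed definition: the post names the macro but does not show it. */
#if defined(__GNUC__) || defined(__clang__)
# define NULL_SENTINEL __attribute__((sentinel))
#else
# define NULL_SENTINEL
#endif

/* Hypothetical helper: count the strings in a NULL-terminated list.
** With the attribute, GCC and Clang warn at any call site whose last
** argument is not a null pointer (a bare 0 included). */
int count_names(const char *zFirst, ...) NULL_SENTINEL;

int count_names(const char *zFirst, ...){
  va_list ap;
  int n = 0;
  const char *z = zFirst;
  assert( zFirst!=NULL );   /* at least one real name is required */
  va_start(ap, zFirst);
  while( z!=NULL ){
    n++;
    /* Reads a full pointer-width value; a bare int 0 passed by the
    ** caller is not guaranteed to fill all of it. */
    z = va_arg(ap, const char*);
  }
  va_end(ap);
  return n;
}
```

A call such as count_names("fetch", "dom", (const char *)NULL) is well defined everywhere, whereas count_names("fetch", "dom", 0) compiles but should draw a "missing sentinel" warning when the attribute is active.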
(23) By Richard Hipp (drh) on 2020-12-04 19:00:08 in reply to 22 [link] [source]
Nice catch, Warren. Tnx for the patch!
(24) By Richard Hipp (drh) on 2020-12-04 19:26:33 in reply to 22 [link] [source]
"Preview" builds that include this fix have been uploaded to the Download page.
(25) By Stephan Beal (stephan) on 2020-12-05 03:59:48 in reply to 22 [link] [source]
In many contexts, 0 and NULL are equivalent, but not in this case.
!!!
Holy cow! That means i have numerous patches to seek out and apply in other trees as well, as i've often used literal 0 as a sentinel for that case. i had no idea NULL is considered "not necessarily 0" for that context, but in hindsight that makes sense.
(26) By Warren Young (wyoung) on 2020-12-05 21:01:39 in reply to 25 [link] [source]
On further reflection, I believe the core issue here is that El Capitan was one of the transitional 32→64 bit releases. Earlier platforms would have both int and char* as 32-bit, and later ones have them as both 64-bit. This is why ASAN didn't trigger on modern platforms: they're either fully 64-bit modern OSes or are pure 32-bit legacy ones, suitable for small VMs and such.
I've done a quick audit of the other variadic functions in Fossil, and I don't see any that are susceptible to this. Almost all of them are printf-style, wherein the format string tells the function what size each parameter is. This particular function was susceptible because the contract was implicit (a list of pointers) and thus couldn't be checked.
Another example of such a function is execl(3). Note the warning in the third paragraph of its man page. This exact issue is why that warning is there.
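For comparison, a call to execl(3) that follows the man page's advice might look like the sketch below. The helper name run_shell() is hypothetical, and the sketch assumes a POSIX system with /bin/sh present.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run "sh -c zCmd" and return its exit status,
** or -1 if the child could not be started or did not exit normally. */
int run_shell(const char *zCmd){
  int status = 0;
  pid_t pid;
  assert( zCmd!=NULL );
  pid = fork();
  if( pid<0 ) return -1;
  if( pid==0 ){
    /* The argument list ends with a null pointer cast to char*,
    ** not a bare 0, so the terminator has pointer width. */
    execl("/bin/sh", "sh", "-c", zCmd, (char *)NULL);
    _exit(127);  /* reached only if execl() itself failed */
  }
  if( waitpid(pid, &status, 0)<0 ) return -1;
  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The cast costs a few characters at the call site and removes any dependence on how the platform defines NULL.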
(27) By anonymous on 2020-12-07 16:32:10 in reply to 22 [link] [source]
Re: 815b4fc
Generally, using NULL vs 0 is safe as long as the compiler can deduce the correct type and promote the constant to that type. Tons of C code is written with mixed NULL and 0.
However, in the va_arg context the expected type information for the sentinel is not immediately visible at the call site. Therefore the sentinel (be it NULL or 0) needs to be explicitly cast into the expected type, (const char *) in this case.
It does look kinda verbose, but it's unambiguous. Another way is to use an empty string "" for the sentinel. But in that case, the va_arg end-of-list test also has to be changed accordingly in the processing function.
Bottom line, for future safety, add (const char *) in call sites like this:
builtin_fossil_js_bundle_or("pikchr", (const char *)NULL);
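The empty-string alternative mentioned above could be sketched as follows. The function count_until_empty() is a hypothetical illustration, not part of Fossil; the point is only that the end-of-list test changes from a null check to a check on the first character.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>

/* Hypothetical helper: count the names that precede an empty-string
** terminator. Every argument, the terminator included, is a real
** const char*, so no null pointer has to pass through the "..." part. */
int count_until_empty(const char *zFirst, ...){
  va_list ap;
  int n = 0;
  const char *z = zFirst;
  assert( zFirst!=NULL );
  va_start(ap, zFirst);
  while( z[0]!='\0' ){      /* end-of-list test: empty string, not NULL */
    n++;
    z = va_arg(ap, const char*);
  }
  va_end(ap);
  return n;
}
```

The trade-off is that an empty string can no longer appear as a legitimate list element, and a forgotten terminator is not caught by the sentinel attribute.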
(28) By Warren Young (wyoung) on 2020-12-07 22:30:14 in reply to 27 [link] [source]
(const char *)NULL
C99 §7.17.3: "NULL …expands to an implementation-defined null pointer constant".
Moreover, the GCC "sentinel" attribute checks for this particular case.
On my macOS system (Clang-based), NULL is defined as ((void *)0).
On a nearby GCC-based Linux box, it's defined the same for C but as __null for C++ with g++ extensions, a magic compiler internal with much the same meaning.