ext CGI feature segfaults

(1) By anonymous on 2020-03-18 18:42:12 [source]

> fossil ui --extroot `pwd`/cgi --nojail

where the extroot contents are:

  ~/Code/fossil-scm/cgi:

  drwxr-xr-x   4 russki  staff   128B Mar 18 18:18 ./
  drwxr-xr-x  35 russki  staff   1.1K Mar 18 18:19 ../
  -rwxr-xr-x   1 russki  staff   125B Mar 18 18:18 test*
  -rw-r--r--   1 russki  staff   434B Mar 11 11:40 index.html

localhost:8080/ext/index.html renders static content fine

localhost:8080/ext/test segfaults

panic: Segfault
(0) 0   fossil                              0x0000000102986d85 sigsegv_handler + 40
(1) 1   libsystem_platform.dylib            0x00007fff6b091f5a _sigtramp + 26
(2) 2   libdyld.dylib                       0x00007fff6ad83149 _Z21dyldGlobalLockReleasev + 0
(3) 3   libsystem_c.dylib                   0x00007fff6ae320f7 putenv + 121
(4) 4   fossil                              0x000000010295c102 ext_page + 802
(5) 5   fossil                              0x00000001029882dd process_one_web_page + 2636
(6) 6   fossil                              0x0000000102988f47 cmd_webserver + 1416
(7) 7   fossil                              0x0000000102985bb0 fossil_main + 1869
(8) 8   fossil                              0x0000000102985463 fossil_main + 0
(9) 9   libdyld.dylib                       0x00007fff6ad83015 start + 1
(10) 10  ???                                 0x0000000000000005 0x0 + 5

Here's what's in the test script:

#!/usr/local/opt/tcl-tk/bin/tclsh

puts {Status: 200 Ok}
puts {Content-Type: text/html}
puts ""
puts {<span>Hey there</span>}

>  fossil version
This is fossil version 2.10 [9d9ef82234] 2019-10-04 21:41:13 UTC

> sw_vers
ProductName:	Mac OS X
ProductVersion:	10.13.6

(2) By anonymous on 2020-03-18 20:43:37 in reply to 1 [link] [source]

#!/usr/local/opt/tcl-tk/bin/tclsh

Does your test script run properly by itself on command-line?

I just tested this case on Ubuntu (from a new repo) and fossil (2.10) renders the ext/test result properly.

Here's a test script (./test-ext.sh):

#!/bin/bash 


trap on_ctrl_c INT
on_ctrl_c() {
  cleanup
}

setup() {
  mkdir test-ext && cd test-ext
  fossil init ../test-ext.fossil
  fossil open ../test-ext.fossil

  mkdir cgi
  cat <<EOF>>cgi/test
#!/usr/bin/tclsh

puts {Status: 200 Ok}
puts {Content-Type: text/html}
puts ""
puts {<span>Hey there</span>}
EOF
}

cleanup(){
  rm cgi/test
  rmdir cgi
  fossil close
  rm ../test-ext.fossil
  cd ..
  rmdir test-ext
}

##-------------

if [ "$1" == "clean" ]; then
  cd test-ext || exit 0
  cleanup
  echo "$0:DONE $1"
  exit 0
fi

setup

set -v
#TEST
chmod +x cgi/test
./cgi/test

## CTRL-C to exit
fossil ui --extroot "$(pwd)/cgi" --nojail ## Open http://localhost:8080/ext/test

set +v
echo "$0:DONE"

(3) By anonymous on 2020-03-19 09:39:02 in reply to 2 [link] [source]

Does your test script run properly by itself on command-line?

It runs perfectly fine when called as a script on the command line. I also thought the problem might be that it was being run in a jail, hence the --nojail flag there.

(4) By vlad on 2020-03-19 12:51:27 in reply to 2 [link] [source]

Original author of the thread here.

My attempt at running fossil under a debugger proved futile, sadly. Mind you, I've never programmed in C, so it could be just me.

Steps I took:

> FOSSIL_BREAK=1 fossil ui --extroot /Users/russki/Code/fossil-scm/cgi --nojail

Attaching to the fossil process with lldb works fine.

br set -n ext_page, since that is the function that appeared in the backtrace of the segfault

continue

It looks like it starts listening and loads the timeline as expected.

Two problems I encountered here:

When run under the debugger like this, I can't even get static content.
The breakpoint above never gets hit.

I also compiled "out of source". The UI opens, but ext doesn't seem to work at all with that build: I can't even get static content with otherwise exactly the same command-line arguments. If I replace the custom build with the one brew installed system-wide, at least static content is served via ext/.

Another observation is that segfault sometimes takes a minute to manifest itself. As in request ext/foo.html and you get it. A minute later I see a segfault in terminal.

I expect I'll be working with Fossil going forward, so naturally I'd like to figure this out, but also to understand how I can debug things like this. Adding a printf at the start of fossil_main at least prints. Doing the same at the beginning of ext_page doesn't print. Maybe my assumption that it ought to be hit is wrong, but then why does it appear in the backtrace?

(5) By Richard Hipp (drh) on 2020-03-19 14:44:46 in reply to 4 [link] [source]

The "fossil ui" command forks for each incoming HTTP request, which messes up debuggers.
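
To illustrate why the fork gets in the way, here is a minimal C sketch of a fork-per-request pattern. It is not Fossil's code: run_in_child and ran_in_server_process are hypothetical names made up for this example. The handler runs in a child process with a different PID, so a breakpoint set from a debugger that is attached only to the original process would not be expected to trigger there.

```c
#define _XOPEN_SOURCE 700   /* ask for POSIX declarations under strict -std modes */
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical handler type: receives the PID of the process that accepted
** the request and returns non-zero if it is running in that same process. */
typedef int (*request_handler)(pid_t server_pid);

/* Example handler (illustration only): compares its own PID with the
** PID of the process that called fork(). */
int ran_in_server_process(pid_t server_pid){
  return getpid() == server_pid;
}

/* Fork once and run the handler in the child.
** Returns 0 if the handler ran in a separate process, 1 if it ran in the
** original process, and -1 if fork() or waitpid() failed. */
int run_in_child(request_handler handler){
  pid_t server_pid = getpid();
  pid_t child;
  int status = 0;
  assert(handler != NULL);
  child = fork();
  if( child < 0 ) return -1;
  if( child == 0 ){
    /* Child process: report the handler's answer through the exit status. */
    _exit(handler(server_pid) ? 1 : 0);
  }
  /* Original process: only waits; it never executes the handler itself. */
  if( waitpid(child, &status, 0) != child ) return -1;
  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling run_in_child(ran_in_server_process) should return 0, meaning the handler executed outside the process the debugger was attached to. That is the motivation for using test-http, which handles a single request without forking a server loop.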

The way to do this is to put the raw text of the HTTP request that causes the segfault into a file. Name the file anything you want, but here we will call it "x1.txt". The x1.txt file probably should look something like this:

GET /ext/test HTTP/1.0

Note that there must be a blank line (an extra \n) at the end. I just don't know how to show that blank line using Markdown. Verify that this causes the segfault by running:

fossil test-http --extroot /Users/russki/Code/fossil-scm/cgi <x1.txt

You might need to adjust the content of the x1.txt file to get it to fail. But once you do get it failing, then run the command in a debugger. If you can give us a detailed stack trace, that will be useful.
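
Since the trailing blank line is hard to show in a post, here is a small C sketch that writes such a request file with the blank line made explicit. write_request_file is a hypothetical helper invented for this example and is not part of Fossil; it assumes bare \n line endings, as described above.

```c
#include <assert.h>
#include <stdio.h>

/* Write a minimal HTTP/1.0 GET request for url_path into file_path.
** The request line is followed by an empty line, so the file ends with
** two newline characters. Returns 0 on success, -1 on failure. */
int write_request_file(const char *file_path, const char *url_path){
  FILE *out;
  int ok;
  assert(file_path != NULL && url_path != NULL);
  out = fopen(file_path, "wb");
  if( out == NULL ) return -1;
  /* First \n ends the request line, second \n is the required blank line. */
  ok = fprintf(out, "GET %s HTTP/1.0\n\n", url_path) > 0;
  if( fclose(out) != 0 ) ok = 0;
  return ok ? 0 : -1;
}
```

For the case in this thread, write_request_file("x1.txt", "/ext/test") would produce a file suitable for feeding to fossil test-http on stdin.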

(7) By Warren Young (wyoung) on 2020-03-19 15:25:58 in reply to 5 [link] [source]

I had to use a variation on that to get a useful result:

   $ lldb -f ./fossil -- test-http --extroot ~/tmp

Given the OP's test script as ~/tmp/x.tcl and saying "run" at the LLDB prompt, I had to paste the following in, since there doesn't seem to be a way to tell LLDB to attach a file to the subprocess's stdin:

  GET /ext/x.tcl HTTP/1.0

That done, I get the following from the debugger:

Process 83626 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x00007fff79b123e6 libsystem_c.dylib`__setenv_locked + 306
libsystem_c.dylib`__setenv_locked:
->  0x7fff79b123e6 <+306>: cmpq   $0x0, (%rsi)
    0x7fff79b123ea <+310>: movq   %r15, %r13
    0x7fff79b123ed <+313>: je     0x7fff79b12404            ; <+336>
    0x7fff79b123ef <+315>: xorl   %r15d, %r15d
Target 0: (fossil) stopped.

This seems to confirm my guess that the environment is somehow "locked" in this state, but my search-fu turns up nothing useful about what's going on.

Maybe try setting the environment up before the fork() call?

(10) By vlad on 2020-03-19 15:42:11 in reply to 5 [link] [source]

Here's the best I could do in the debugger:

> lldb fossil

(lldb) target create "fossil"
Current executable set to 'fossil' (x86_64).

(lldb) process launch -i x1.txt --stop-at-entry -- test-http ../fossil-scm.fossil --extroot /Users/russki/Code/fossil-scm/cgi

Process 90140 stopped
* thread #1, stop reason = signal SIGSTOP
    frame #0: 0x000000010036a19c dyld`_dyld_start
dyld`_dyld_start:
->  0x10036a19c <+0>: popq   %rdi
    0x10036a19d <+1>: pushq  $0x0
    0x10036a19f <+3>: movq   %rsp, %rbp
    0x10036a1a2 <+6>: andq   $-0x10, %rsp
Target 0: (fossil) stopped.
Process 90140 launched: '/usr/local/bin/fossil' (x86_64)

(lldb) c

Process 90140 resuming
Process 90140 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x00007fff6ae35104 libsystem_c.dylib`__setenv_locked + 313
libsystem_c.dylib`__setenv_locked:
->  0x7fff6ae35104 <+313>: cmpq   $0x0, (%rsi)
    0x7fff6ae35108 <+317>: je     0x7fff6ae3511e            ; <+339>
    0x7fff6ae3510a <+319>: xorl   %ebx, %ebx
    0x7fff6ae3510c <+321>: movq   -0x38(%rbp), %rdi
Target 0: (fossil) stopped.

(lldb) bt

* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
  * frame #0: 0x00007fff6ae35104 libsystem_c.dylib`__setenv_locked + 313
    frame #1: 0x00007fff6ae320f7 libsystem_c.dylib`putenv + 121
    frame #2: 0x000000010003c102 fossil`ext_page + 802
    frame #3: 0x00000001000682dd fossil`process_one_web_page + 2636
    frame #4: 0x0000000100065bb0 fossil`fossil_main + 1869
    frame #5: 0x0000000100065463 fossil`main + 9
    frame #6: 0x00007fff6ad83015 libdyld.dylib`start + 1
    frame #7: 0x00007fff6ad83015 libdyld.dylib`start + 1

I guess EXC_BAD_ACCESS doesn't sound too good.

If I run it in a terminal, all I get is:

> fossil test-http ../fossil-scm.fossil --extroot /Users/russki/Code/fossil-scm/cgi --nojail <x1.txt
Illegal instruction: 4

If I change x1.txt to GET /ext/static.html ... etc it works fine, so static pages work.

I'll try all of this under OpenBSD next, but others reported ext to work e.g. under Linux, so I don't expect the same error.

(6) By Warren Young (wyoung) on 2020-03-19 15:15:16 in reply to 1 [link] [source]

This reproduces on macOS 10.14.6.

I can say that the thread of execution enters the putenv() call underlying fossil_setenv() in src/file.c, and that the segfault occurs in that call.

What's most curious is that printing the parameters to the function doesn't segfault, so it's not a simple case of dereferencing a bad string, else my fprintf(stderr, ...) debug calls would also fail.

Is the environment somehow "locked" in this state?

I don't see how to proceed short of tracing into a debug build of libc at this point.

Incidentally, changing the implementation of this Fossil wrapper function to use setenv(3) doesn't help. Thus my latest commit being a branch.
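
For readers unfamiliar with the two calls, here is a hedged C sketch of the two ways such a wrapper could set a variable. These are not the actual Fossil functions: env_set_with_putenv and env_set_with_setenv are hypothetical names, and the real fossil_setenv() may differ in detail. Both variants end up walking the same environ array inside libc, which would be consistent with the observation that switching from one to the other changes nothing.

```c
#define _XOPEN_SOURCE 700   /* ask for putenv()/setenv() under strict -std modes */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Variant 1: build "NAME=VALUE" on the heap and hand it to putenv().
** putenv() stores the pointer rather than copying the string, so the
** buffer is deliberately not freed on success. Returns 0 on success. */
int env_set_with_putenv(const char *name, const char *value){
  size_t len;
  char *entry;
  assert(name != NULL && value != NULL);
  len = strlen(name) + strlen(value) + 2;   /* '=' plus terminating NUL */
  entry = malloc(len);
  if( entry == NULL ) return -1;
  snprintf(entry, len, "%s=%s", name, value);
  if( putenv(entry) != 0 ){
    free(entry);
    return -1;
  }
  return 0;
}

/* Variant 2: let setenv() make its own copies of name and value.
** The third argument (1) asks it to overwrite an existing entry. */
int env_set_with_setenv(const char *name, const char *value){
  assert(name != NULL && value != NULL);
  return setenv(name, value, 1);
}
```

Either call with "DOCUMENT_ROOT" and a directory path should leave getenv("DOCUMENT_ROOT") returning that path, provided the environment array itself is in a state libc can work with.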

(8) By graham on 2020-03-19 15:31:51 in reply to 6 [link] [source]

Thinking aloud... does macOS impose any (unusual) limits on the lengths of names/values/total-length of the environment? Is it the very first call to putenv() that fails, or do some get through? (On Windows, I see about 20 values being set, with HTTP_COOKIE being the longest at about 400 characters).

(9) By Warren Young (wyoung) on 2020-03-19 15:38:00 in reply to 8 [link] [source]

does macOS impose any (unusual) limits on the lengths of names/values/total-length of the environment?

I doubt it.

This macOS libc code appears to come straight from FreeBSD, which someone could confirm by attempting to reproduce the symptom there.

Is it the very first call to putenv() that fails

Yes, the one for DOCUMENT_ROOT due to the use of the --extroot flag.

(11) By Warren Young (wyoung) on 2020-03-19 16:05:59 in reply to 9 [link] [source]

Well, this is fun: the symptom does not reproduce on FreeBSD 11.3-p6, updated just now. It gives the expected result, the script's output.

I may attempt to upgrade to 12.1 later.

(12) By graham on 2020-03-19 16:28:55 in reply to 9 [link] [source]

To me, that suggests memory corruption: either what mprintf() returns is invalid (did you try fprintfing zString?), or earlier corruption is triggering a segfault in putenv() (e.g. if it needs to resize the environment).

For a rather "hacky" exploration attempt, you could try replacing the call to mprintf() with something like:

static char zString[2000];
snprintf(zString, sizeof(zString), "%s=%s", zName, zValue);

and seeing if (at least the first call to) putenv() returns without segfaulting. If it does, you "just" have to find where memory got corrupted :-)

(13) By Warren Young (wyoung) on 2020-03-19 16:34:42 in reply to 12 [link] [source]

did you try fprintfing zString?

Yes, per this post up-thread.

you could try replacing the call to mprintf()

I effectively did that already with my setenv-alternative branch, which doesn't use the mprintf'd string.

I probably also should have mentioned that I previously rebuilt Fossil with:

  $ ./configure --with-sanitizer=address,enum,null,undefined

I got no complaints from any of the sanitizers.

And since my last post, I've updated my FreeBSD test VM to 12.1-p2, and it's still not failing the same as on macOS 10.14. That makes me wonder if Apple made some changes to this mechanism, since it appears to work properly on FreeBSD.

(14) By anonymous on 2020-03-23 01:28:44 in reply to 1 [link] [source]

Please check the update in [6e7211a26]. This should address the SEGV you caught. I used the test-ext.sh above to test the fix.

In brief, OSX seems to be peculiar about assignments of NULL to double pointers. Previously, we were setting environ to NULL to "empty" the environment, but that makes environ itself a NULL pointer rather than a pointer to an array whose first element is NULL, so the *environ dereference caused the SEGV.

So environ[0] is what we meant to set to NULL... sorry, OSX :) Meanwhile, Linux is fine with this either way.
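
The difference can be sketched in C as follows. This is an illustration of the idea described above, not the code from the check-in: env_clear_in_place and env_reset_to are hypothetical names. The sketch keeps environ pointing at a valid array and only terminates that array at slot 0, so a later putenv() has something to walk.

```c
#define _XOPEN_SOURCE 700   /* ask for putenv() under strict -std modes */
#include <assert.h>
#include <stdlib.h>

extern char **environ;

/* Empty the environment by terminating the existing array at slot 0.
** By contrast, "environ = NULL" would leave libc with no array at all,
** which is the situation the thread reports as crashing on macOS.
** If environ is already NULL there is nothing to terminate, and this
** sketch leaves it alone. */
void env_clear_in_place(void){
  if( environ != NULL ){
    environ[0] = NULL;
  }
}

/* Clear the environment, then install a single "NAME=VALUE" entry.
** entry must stay valid for as long as the environment is in use,
** because putenv() keeps the pointer. Returns 0 on success. */
int env_reset_to(char *entry){
  assert(entry != NULL);
  env_clear_in_place();
  return putenv(entry);
}
```

After env_reset_to() is called with a static "DOCUMENT_ROOT=..." buffer, previously set variables should no longer be visible through getenv(), while DOCUMENT_ROOT is.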

By the way, looks like I also found a clang optimizer bug (or maybe just a "usage issue") in the process of diagnosing this issue (not directly related).

(15) By vlad on 2020-03-23 09:32:36 in reply to 14 [link] [source]

That fixed it on OSX. Thank you very much for the fix!