fossil server|ui: SSL read error
(1) By Florian Balmer (florian.balmer) on 2022-01-24 06:16:17 [source]
While playing around with SSL during my tests to fix the Windows XP
builds, I noticed that the ui
and server
commands no longer work
with SSL on Windows:
> fossil ui test.fossil --cert "unsafe-builtin" --port 443
...
Launch webbrowser: start "" "https://localhost:443/" &
Type Ctrl-C to stop the HTTP server
SSL read error.
What I have so far is that the culprit(s) are check-ins
[ba95498d36] and [8d456a5b77],
i.e. if they are both reverted, then ui
and server
work with SSL,
again.
I'm not sure yet whether the problem also occurs with other systems.
Side note: Isn't there also a minor typo in the 1st check-in, i.e. should it say "Fossil version" (capitalized) instead of "fossil version"? I think there was some "branding change" to settle for the former variant across the project, quite some time back?
(2) By Stephan Beal (stephan) on 2022-01-24 06:22:17 in reply to 1 [link] [source]
What I have so far is that the culprit(s) are check-ins [ba95498d36] and [8d456a5b77], i.e. if they are both reverted, then ui and server work with SSL, again.
ba95 may be problematic but if you back out 8d45 then POST will fail for any data larger than 16kb (try it with /wikiedit or /fileedit or an attachment in /chat). ba95 is not strictly necessary but "really should" be retained because if it's removed then a failed read over TLS is silently ignored. i'll take a closer look at these when i'm back on the computer.
(3) By Stephan Beal (stephan) on 2022-01-24 06:25:46 in reply to 1 [link] [source]
Side note: Isn't there also a minor typo in the 1st check-in, i.e. should it say "Fossil version" (capitalized) instead of "fossil version"? I think there was some "branding change" to settle for the former variant across the project, quite some time back?
That version string was taken directly from "fossil version" output. If it's changed in the SERVER_SOFTWARE string then it should also be changed in the version command.
(8) By Florian Balmer (florian.balmer) on 2022-01-24 07:15:03 in reply to 3 [link] [source]
Ah, I see. Such output should probably remain stable because of scripts possibly depending on it ...
(4) By Stephan Beal (stephan) on 2022-01-24 06:42:47 in reply to 1 [link] [source]
While playing around with SSL during my tests to fix the Windows XP builds, I noticed that the ui and server commands no longer work with SSL on Windows:
The "SSL read error" error line comes from fossil ba95498d36 but i'm unable to reproduce triggering that, which suggests that this issue is OS-dependent. It would be interesting to know if it fails on a Windows other than XP.
Chromium on Linux is repeating this message often:
[3579590:3579594:0124/072955.888517:ERROR:ssl_client_socket_impl.cc(983)] handshake failed; returned -1, SSL error code 1, net_error -202
but i am unable to see any actual breakage. The ui seems to work just fine, despite this frequent message.
Firefox on Linux is not showing any similar messages and also seems to work fine.
Both browsers were tested with POSTs of data larger than 16k and both behaved as expected.
(5) By Stephan Beal (stephan) on 2022-01-24 06:57:28 in reply to 4 [link] [source]
The "SSL read error" error line comes from fossil ba95498d36 but i'm unable to reproduce triggering that, which suggests that this issue is OS-dependent. It would be interesting to know if it fails on a Windows other than XP.
i don't think that fossil:06e300e5bd325792 will resolve it for you but it's worth a try (and it's an internal improvement, in any case). The behavior on Linux is unchanged with this patch.
(6) By Florian Balmer (florian.balmer) on 2022-01-24 07:02:28 in reply to 5 [link] [source]
Thanks for looking into this.
I was (passively) waiting for a build to complete, and just pulled in [06e300e5bd] -- without success, unfortunately.
So far I can confirm the problem also appears on Windows 10, with both OpenSSL 1.1.1m and 3.0.1, and is "fixed" (apart from the 16K POST limit) with the mentioned reverts.
But I can't reproduce it on my Ubuntu VM, either.
(7) By Florian Balmer (florian.balmer) on 2022-01-24 07:14:18 in reply to 6 [link] [source]
No wait, if I back-out [06e300e5bd], [ba95498d36] and [8d456a5b77], I get a working version on Windows, and without the 16K POST limit (succeeded with 200K).
I'm not sure I can get any further, for today ...
(9) By Stephan Beal (stephan) on 2022-01-24 07:42:02 in reply to 7 [link] [source]
No wait, if I back-out [06e300e5bd], [ba95498d36] and [8d456a5b77], I get a working version on Windows, and without the 16K POST limit (succeeded with 200K).
That's not only contrary to the Linux behavior but it violates the SSL_read() docs, which specify that it only reads in (at most) 16kb chunks. Our old impl cannot catch chunks beyond the first one. It's conceivable that SSL_read() does not chunk/buffer the same on Windows but the SSL docs make no mention of that.
See https://sqlite.org/althttpd/forumpost/179db4856c for the discussion which uncovered that bug and the relevant libssl doc snippet.
My current instinct is to wrap the old behavior in an #ifdef _WIN32
block and the new behavior everywhere else, but i won't be able to test such a change so won't do that myself. We know the new behavior generally works because we've been using it for more than a week now in althttpd on my own server and all three sqlite.org web servers (and very possibly this server, too), but all of those are Linux. The althttpd TLS support is a copy/paste port of fossil's, so no differences are expected, but althttpd doesn't build on Windows so we can't simply compile and compare them.
@BSD/Mac users: can any of you verify whether:
$ fossil ui --cert "unsafe-builtin" --port 8443 --localauth
from the fossil checkout works? A basic sanity check is:
- Go to the hamburg menu...
- "List of wiki pages"
- Sort that list by number of versions. The Cookbook will be on top and is huge. Open it.
- Tap the Edit button in the submenu, then switch to the Preview tab and scroll to the bottom. It should end with
fossil ui $3
. If it doesn't, it was truncated.
(10) By Stephan Beal (stephan) on 2022-01-24 07:53:14 in reply to 9 [link] [source]
My current instinct is to wrap the old behavior in an #ifdef _WIN32 block and the new behavior everywhere else, but i won't be able to test such a change so won't do that myself.
More simply, perhaps:
Index: src/http_ssl.c
==================================================================
--- src/http_ssl.c
+++ src/http_ssl.c
@@ -813,26 +813,34 @@
/*
** Read cleartext bytes that have been received from the client and
** decrypted by the SSL server codec.
*/
size_t ssl_read_server(void *pServerArg, char *zBuf, size_t nBuf){
- int n, err = 0;
+ int n;
+#ifndef _WIN32
+ int err = 0;
+#endif
size_t rc = 0;
SslServerConn *pServer = (SslServerConn*)pServerArg;
if( nBuf>0x7fffffff ){ fossil_fatal("SSL read too big"); }
else if( BIO_eof(pServer->bio) ) return 0;
while( 0==err && nBuf!=rc ){
n = SSL_read(pServer->ssl, zBuf + rc, (int)(nBuf - rc));
+#ifdef _WIN32
+ rc += n;
+ break;
+#else
if( n==0 ){
break;
}
err = SSL_get_error(pServer->ssl, n);
if(0==err){
rc += n;
}else{
fossil_fatal("SSL read error.");
}
+#endif
}
return rc;
}
/*
@Florian can you try that patch and see if it resolves the issue on Windows? If so, my current suggestion would be to go with that solution until/unless someone cares to figure out/resolve why it behaves differently on Windows.
(11) By Florian Balmer (florian.balmer) on 2022-01-24 08:04:21 in reply to 10 [link] [source]
@BSD/Mac users: can any of you verify whether:
$ fossil ui --cert "unsafe-builtin" --port 8443 --localauth
...
Tap the Edit button in the submenu, then switch to the Preview tab and scroll to the bottom. It should end with fossil ui $3. If it doesn't, it was truncated.
I was just about to confirm again that this works on Windows with the 3 mentioned backouts ...
@Florian can you try that patch and see if it resolves the issue on Windows?
Nice! The patch works, if the 0==err
check is removed from the while
statement (the err
variable will be missing for _WIN32
).
I'll have a more detailed look, and perform more detailed tests by tomorrow, hopefully.
Thanks!
(12) By Stephan Beal (stephan) on 2022-01-24 08:06:59 in reply to 11 [link] [source]
The patch works, if the 0==err check is removed from the while statement (the err variable will be missing for _WIN32).
The top part of the patch removed that var but i'll simplify it and move err's declaration into the non-Windows block, then check that in as a preliminary workaround.
(13) By Stephan Beal (stephan) on 2022-01-24 08:30:11 in reply to 11 [link] [source]
Nice! The patch works, if the 0==err check is removed from the while statement (the err variable will be missing for _WIN32).
A slight simplification of that patch is at fossil:b890451cfbf892e1, which behaves like it did before but fails to catch read errors.
An improvement (and further simplification) of that patch is one commit later in fossil:b70557f690594.
If that second patch works for you then great. If not, we can revert to the previous one but will have the bug that SSL read failures will go unreported and may lead to data corruption. That's actually how this "looping bug" was caught in the first place: attachments posted in /chat were getting silently truncated.
(14) By Florian Balmer (florian.balmer) on 2022-01-25 07:49:46 in reply to 13 [link] [source]
The current trunk doesn't seem to work, unfortunately. But I'm running out of time to do more tests, at the moment.
(15) By Florian Balmer (florian.balmer) on 2022-01-25 07:54:38 in reply to 14 [link] [source]
If not, we can revert to the previous one but will have the bug that SSL read failures will go unreported and may lead to data corruption.
Because ssl_read_server()
is called from an outer loop on Windows, this
doesn't seem to happen. But this is all quite a mystery to me, at the moment.
(16) By Stephan Beal (stephan) on 2022-01-25 10:14:59 in reply to 14 [link] [source]
The current trunk doesn't seem to work, unfortunately.
Can you verify whether fossil:b890451cfbf892e1 "works" (noting that it cannot detect read errors and will eventually corrupt something)?
If we cannot resolve the Windows discrepancy by the next release then disabling the built-in TLS support on Windows for that release is the only sensible option we'll have. Publishing a release which is known to silently truncate inbound data if a POST fails part of the way through would be a poor choice.
(17) By Florian Balmer (florian.balmer) on 2022-01-26 05:42:48 in reply to 16 [link] [source]
Ok, now I've got it. Part (B) is mostly due to my ignorance, I'm sorry about that.
(A) Breaking the SSL_read()
loop
- This seems necessary on Windows, but it's unclear to me why. However, Fossil
always calls
ssl_read_server()
from outer loops to retry as long as their buffers have room, so data truncation to 16 KB is not a problem (on Windows). - So this problem is now "fixed" (for obscure reasons, though, which is slightly uncomfortable).
(B) SSL_read()
errors
fossil server|ui --cert "unsafe-builtin"
worked on all tested platforms.- But suddenly started to fail on Windows XP, when I copied the Fossil
repository to the VM to test the large w</ikiedit preview case. The Fossil
repository had an invalid
web-server
setting, so I deleted it. Not sure why, but from then I switched tofossil ui
instead offossil server
for testing, and didn't bother to fix theweb-browser
setting, or use the--nobrowser
option. Don't know why. - And now, at the beginning of each test, before I was able to bring Chromium to
the top and F5 reload, the default IE browser popped up, trying to access the
URL -- and causing the read error, because it doesn't support the required
protocol version! This was all quite random and confusing, not reproducible
with intermittent
fossil server
trials, and neither on Windows 10. - Sorry this took me so long to figure out.
- But I think it's a bug to
fossil_fatal()
exit on SSL read errors. They should just abort the current request, but keep the server running -- otherwise anybody with an old browser could bring down the Fossil servers. I think this was addressed in the meantime by commit [acffc8f785], at least for non-Windows platforms. Or maybe on non-Windows platforms, each request was already a separatefork()
, and didn't shutdown the whole server.
Ok, so now I'm going to re-build and re-test with the latest trunk, and
follow-up with comments and/or commits if necessary. But right now I see no
problems apart from the server being shutdown when accessed with old IE --
because I don't think anybody is going to use fossil server
over the Internet
from a Windows machine. ;-)
Thanks for your help!
(18) By Florian Balmer (florian.balmer) on 2022-01-26 07:48:19 in reply to 17 [link] [source]
I may have a solution that works. Can you try (and merge) it if it also works for you, or tweak it if necessary? (Not urgent.)
Thanks!
(19) By Stephan Beal (stephan) on 2022-01-26 09:55:04 in reply to 18 [link] [source]
But I think it's a bug to fossil_fatal() exit on SSL read errors. They should just abort the current request, but keep the server running -- otherwise anybody with an old browser could bring down the Fossil servers. I think this was addressed in the meantime by commit [acffc8f785], at least for non-Windows platforms. Or maybe on non-Windows platforms, each request was already a separate fork(), and didn't shutdown the whole server.
We removed the fatal() call because Richard found server log entries which appeared to be caused by clients which would open an SSL connection and then kill it uncleanly before actually using it. i had not considered that the Windows impl does not/cannot fork, so was not aware of the side effect of killing the whole server. That was entirely my fault.
I may have a solution that works. Can you try (and merge) it if it also works for you, or tweak it if necessary?
(TL/DR: skip to the bottom paragraph for my current conclusions/recommendation.)
In /chat, while working on the fatal()-related fix, we talked about the 0 vs -1 return semantics and came to the conclusion that 0 was much cleaner for our particular cases. As you noted, the outer calls to ssl_read_server() already know that they are expecting data, so any read result of 0 means an error happened. Changing that result to a -1 requires adding a cast-to-minus-1 check which we don't otherwise need, so it's more invasive in more places than returning zero.
My recommendation is to keep it returning 0 instead of -1. We can add a switch(), like the one you introduced, if that's necessary, but is it really necessary? i'm almost certain that half of those error codes cannot happen via SSL_read(), such as SSL_ERROR_WANT_ACCEPT, SSL_ERROR_WANT_CONNECT, and SSL_ERROR_WANT_WRITE.
This part won't work reliably as-is:
/* SSL_read() returns at most 16 KB of data, so retry in this case. */
if( n!=16384 ) break;
As demonstrated in this althttpd post, SSL_read() does not always return that exact value. The relevant snippet is this debug output:
tls_read_server(nBuf=111141), n=15613, rc=15613
tls_read_server(nBuf=111141), n=16384, rc=31997
tls_read_server(nBuf=111141), n=16384, rc=48381
...
The nBuf=... part is how much we expect to read, n=... is the result value from SSL_read(), and rc=... is the total amount we have read so far in the loop. Notice how the first call to SSL_read() returns some arbitrary value lower than 16kb.
On non-Windows that SSL_get_error() check seems unnecessary: we can reliably break from the loop when SSL_read() returns <=0. On Windows, for mysterious reasons, looping is unnecessary so we can unconditionally break out of the loop. For our particular uses, any SSL_read() result of <=0 indicates some form of error (we don't really care which type) and the higher-level callers have been corrected to treated it as such without calling fossil_fatal().
If the current trunk version works for you on Windows then my recommendation is that we keep that implementation, primarily because the required changes affect fewer functions, so it's simpler and less invasive. If it doesn't work on Windows then we'll need some variant of your patch, but the -1 return semantics seem unnecessary to me and requires additional return checks in code which doesn't need to differentiate between a read of 0 bytes and a read error.
(20) By Florian Balmer (florian.balmer) on 2022-01-26 11:43:05 in reply to 19 [link] [source]
Thanks for your detailed and interesting answer. I'm sorry I'm not able to follow on chat more closely -- that would have saved you some time.
Indeed the (size_t)-1
seemed strange compared to existing Fossil code.
Returning 0
is just fine, and winhttp.c:win32_http_request()
aborts the
request on this return value.
... was not aware of the side effect of killing the whole server. That was entirely my fault.
No worries. But I was tempted to load some SQLite website in my old IE to check whether Althttp(s)d works the same ;-)
i'm almost certain that half of those error codes cannot happen via SSL_read(), such as ... SSL_ERROR_WANT_WRITE.
Yeah, that seems strange, but from my reading, there may be continuous SSL-related send/receive traffic going on, in addition to the user (HTTP) payload, and a new "handshake" may be required before a read can proceed, requiring a write, first. But maybe that SSL_MODE_AUTO_RETRY alread takes care of this.
Notice how the first call to SSL_read() returns some arbitrary value lower than 16kb.
I missed that, and haven't seen that on Windows, i.e. I always got exactly 16 KB
-- except for the first one, but win32_http_request()
is reading that one with
a smaller buffer!
So things work fine from my part, albeit somewhat mysterious.
Thanks for all your efforts!
(24) By Stephan Beal (stephan) on 2022-01-26 12:29:00 in reply to 20 [link] [source]
But I was tempted to load some SQLite website in my old IE to check whether Althttp(s)d works the same ;-)
The althttpd TLS impl is almost copy/paste from fossil's, but the requirement for looping was discovered while using althttpd to serve fossil, and then back-ported to fossil. althttpd's read() impl currently uses its own counterpart of fossil_fatal(), but it only targets Unix systems so there are no Windows compatibility requirements.
The sqlite www servers have been using that althttpd impl for a week or two now.
But maybe that SSL_MODE_AUTO_RETRY alread takes care of this.
That seems to be the case:
https://www.openssl.org/docs/man1.1.1/man3/SSL_CTX_set_mode.html
Says: "In a blocking environment, applications are not always prepared to deal with the functions returning intermediate reports such as retry requests, and setting the SSL_MODE_AUTO_RETRY flag will cause the functions to only return after successfully processing an application data record or a failure."
So things work fine from my part,
Just so i'm clear: that means that the trunk implementation works for you on Win10 and XP?
albeit somewhat mysterious.
Agreed entirely.
(25) By Florian Balmer (florian.balmer) on 2022-01-26 13:04:45 in reply to 24 [link] [source]
Yes, current trunk works on Windows 10 and XP, with both OpenSSL 1.1.1m and 3.0.1.
(26) By Florian Balmer (florian.balmer) on 2022-01-27 07:17:40 in reply to 20 [link] [source]
I missed that, and haven't seen that on Windows, i.e. I always got exactly 16 KB -- except for the first one, but
win32_http_request()
is reading that one with a smaller buffer!
So my observation that the first read was always smaller than 4000 bytes also proven to be wrong ... Good catch!
(27) By Florian Balmer (florian.balmer) on 2022-01-28 05:54:59 in reply to 26 [link] [source]
Now loading my Ubuntu VM, and will report back if it also works there.
(28) By Florian Balmer (florian.balmer) on 2022-01-28 06:20:46 in reply to 27 [link] [source]
Well, yeah, there were no code changes for Ubuntu ...
But I can browse, clone, sync, large wikiedit and fetch diff context on Windows XP and Windows 10. So this may work, at least until @mgagnon gets to try it ;-)
(30) By Martin Gagnon (mgagnon) on 2022-01-28 12:36:21 in reply to 28 [link] [source]
Good point actually..
Originally, when I got hit by the win32_http_request
off by 1 bug (which not only in SSL), I was testing your SSL_read() patch using MinGW compiler to see if it was behaving the same way.
It's time to resume this test.
(31) By Martin Gagnon (mgagnon) on 2022-01-28 12:56:53 in reply to 30 [link] [source]
Just tried your branch with MinGW (64bit) and it works just fine.
(compiled against OpenSSL 1.1.1k)
Does it worth to try with OpenSSL 3 ?
(33) By Florian Balmer (florian.balmer) on 2022-01-28 13:09:19 in reply to 31 [link] [source]
Thanks!
I'm confident it works with OpenSSL 3 (I've tested it) -- but I've already missed problems, here...
(34) By Martin Gagnon (mgagnon) on 2022-01-28 14:02:31 in reply to 33 [link] [source]
Tested it anyway with OpenSSL3, it works no problem.
(29) By Stephan Beal (stephan) on 2022-01-28 12:10:52 in reply to 27 [link] [source]
Maybe this time!
Funnily enough: while looking at this problem a few days ago, both Richard and myself independently came upon the idea of a very similar patch. We ended up not trying it, but if it improves the TLS situation on Windows then i'm all for it.
It works for me me on Mint Linux (Ubuntu derivative).
My only recommendation would be to swap the semantics of the new flag. Right now the meanings of "yes" and "no" are effectively swapped: "yes, don't loop" or "no, do loop." It would arguably be less confusing if the noLoop flag was renamed to doLoop so that 1 means "yes, loop" and 0 means "no, don't loop."
(32) By Florian Balmer (florian.balmer) on 2022-01-28 13:03:22 in reply to 29 [link] [source]
The patch is not better, just a bit more explicit than the #fdef
and the "we don't know why" comment...
It's o.k. to swap the meaning of the flag, I set the special case to 1, but that doesn't matter.
I can get at it in 1 or 2 days, if no bug reports come in ... please take over the branch anybody if it's all fine and you're able to proceed more quickly, if you like!
Thanks for the help & testing!
(35) By Stephan Beal (stephan) on 2022-01-28 14:53:16 in reply to 32 [link] [source]
please take over the branch anybody if it's all fine and you're able to proceed more quickly, if you like!
i swapped the semantics of the new parameter and touched up some docs, including a link back to this discussion. It "works for me," but needs a final test from one or more Windows users. i'll merge it once we have confirmation from you and/or Martin that it still works for you.
(36) By Martin Gagnon (mgagnon) on 2022-01-28 17:16:26 in reply to 35 [link] [source]
Tested, works just like previous version.
(37) By Florian Balmer (florian.balmer) on 2022-01-29 09:30:49 in reply to 36 [link] [source]
Thanks @stephan and @mgagnon for tweaking and testing. Restoring the link to the forum thread was also on my mental TODO list, I agree it's better to keep the information accessible.
BTW, for quick development tests with another version of OpenSSL on Windows, it's possible to build just the static libraries and bypass anything else (i.e. also CLI apps and lengthy tests that can't be disabled anymore by configuration options) by typing:
[n]make build_generated build_libs_nodep
This drops OpenSSL 3 build time from 13 min to 3 min on my machine. Almost as fast as the SQLIte3 amdalagamdamlamation.
(21) By Florian Balmer (florian.balmer) on 2022-01-26 11:55:56 in reply to 19 [link] [source]
i had not considered that the Windows impl does not/cannot fork ...
Off-topic:
The NT OS supports forking, but the Microsoft people consider it ineffective, so Cygwin was probably the only system that used it.
https://www.microsoft.com/en-us/research/uploads/prod/2019/04/fork-hotos19.pdf
(22) By Warren Young (wyoung) on 2022-01-26 12:03:19 in reply to 21 [link] [source]
Cygwin’s fork
isn’t implemented how you think it is. (Scroll down to Process Creation.”)
(23) By Florian Balmer (florian.balmer) on 2022-01-26 12:13:47 in reply to 22 [link] [source]
Ah, yes, indeed, and the linked paper even points this out.
The book "Windows NT/2000 Native API Reference" by Gary Nebbett has a fork()
example, so this is the only one I've seen, and probably mostly theoretical ...