"server did not reply", but success?

(1) By Jan Danielsson (jan) on 2024-04-12 19:44:47 [source]

Since somewhat recently (we noticed it today), fossil sync is outputting server did not reply on the clients (on both Windows and macOS). However, the sync appears to be successful (I made a dummy edit, committed, and sync'd -- and the commit appeared on the server).

Are there any recent changes that could explain this?

Because the message started to appear on both macOS and Windows clients at the same time, I assume it's a server issue -- but the server hasn't been upgraded in quite a bit (that is to say, system updates are applied, but fossil is running from fossil-stable that is symlinked to fossil-2.22, which has been untouched for months).

(2) By Richard Hipp (drh) on 2024-04-12 19:50:15 in reply to 1 [link] [source]

I'm not aware of any such changes. But that does not mean they didn't happen.

Your server version is 2.22, but what is your client version? Has it been upgraded? What happens when you try to sync with Fossil's self-hosting server? Does that work for you?

(3) By Jan Danielsson (jan) on 2024-04-15 17:42:45 in reply to 2 [link] [source]

So this has turned into a far bigger head-scratcher than I could ever have imagined.

In our setup we have an apache server that is acting as a proxy for our fossil repositories. We have two ways to access the repos; one domain is publicly accessible (Let's Encrypt) and the other requires a client certificate (self-signed CA, and CN mapped to REMOTE_USER).

In my original post, I wrote that while we get the message, everything seems to work fine. Well, that was slightly premature: I do seem to be able to push changes to the server, but when I run fossil sync in order to get the latest changes I do not seem to get them.

Most of our systems use the latest fossil version (either via package manager, or they have scripts that pull down the latest sources and build and install them), but I tried downgrading the client and got the same error. I tried downgrading the server and got the same result.

Some random observations:

Running against a local fossil server instance, hosting on 127.0.0.1 does not seem to yield the error.

We're unable to clone -- it yields:

server did not reply
Clone done, wire bytes sent: 264  received: 1000  remote: <removed>
server returned an error - clone aborted

For already checked out clones, it seems to be possible to push - but not pull - changes.

The same problem happens for both the public (using username & password) and self-signed (using client certificate) server.

There is no problem accessing the webui for any of the repos (including the one where client certificates are required).

The repos are hosted on an Ubuntu server, and the only thing that has changed on this server is that the system update script was run, which runs apt update && apt upgrade and rebuilds the latest fossil. (Though note that apache uses /usr/local/bin/fossil-stable, which is symlinked to /usr/local/bin/fossil-2.23, or 2.22 at the time of the original post.)

I don't understand why, but it looks like a system update (apt update && apt upgrade) did something to our apache installation (and indeed, /usr/sbin/apache2 is dated "Apr 10", and I doubt it was restarted at that point) which is causing it to interfere with fossil -- but in a way which seemingly does not affect the webui.

I'm truly stumped at this point. I basically set all of this up many years ago, and it has just been trudging along over the years -- I've never had to troubleshoot this kind of stuff (well, except for when a CRL update failed once..), so I'm not sure what tools are available to me to inspect what is going on.

Anyone hosting fossil repos behind an apache2 server that knows what I could poke at in order to troubleshoot this?

The server did not reply message seems to originate from http.c, in a block that has the condition:

  if( iLength<0 ){
    /* We got nothing back from the server.

.. which is a little surprising.

(4) By Jan Danielsson (jan) on 2024-04-15 18:02:30 in reply to 3 [link] [source]

Anyone well versed in the finer points of HTTP/CGI that can see if these changes may affect the interaction between apache and fossil?

apache2 package changelog

Actual diff

(5) By Andy Bradford (andybradford) on 2024-04-15 18:47:55 in reply to 3 [link] [source]

> We're unable to clone -- it yields:

Can you try cloning with the  --httptrace option added to your arguments
and then share  the resulting files somewhere? They  should produce some
files that begin with http-request-n.txt and http-reply-n.txt. Feel free
to scrub them for any sensitive information if necessary.

Andy

(6) By Jan Danielsson (jan) on 2024-04-15 21:51:11 in reply to 5 [link] [source]

Sure!

http-request-1.txt contains:

POST /some/path HTTP/1.0
Host: some.host
User-Agent: Fossil/2.24 (2024-04-12 15:24:15 [6a571f88cc])
Content-Type: application/x-fossil-debug
Content-Length: 97

pragma client-version 22400 20240412 152415
clone 3 1
# A91442A0A9F39D304D9AC2CA40DFBF560B24E449

And http-reply-1.txt contains:

HTTP/1.1 200 OK
Date: Mon, 15 Apr 2024 19:44:21 GMT
Server: Apache/2.4.52 (Ubuntu)
Cache-control: no-cache
X-Frame-Options: SAMEORIGIN
Connection: close
Content-Type: application/x-fossil-uncompressed

pragma server-version 22300 20231101 185647
<list of cfile entries and their binary content omitted>

To my untrained eye, this looks pretty reasonable.

So I have a really dumb question. The reply contains no Content-Length and the patch mentioned in another post contains:

++        /* xCGI has its own body framing mechanism which we don't
++         * match against any provided Content-Length, so let the
++         * core determine C-L vs T-E based on what's actually sent.
++         */
++        if (!apr_table_get(r->subprocess_env, AP_TRUST_CGILIKE_CL_ENVVAR))
++            apr_table_unset(r->headers_out, "Content-Length");

So if I'm reading that right, several modules (among them mod_cgi) started removing Content-Length, because it's allegedly not needed. But fossil's http.c module does try to extract it. In fact, that header is what's supposed to set iLength -- and because the header is missing, iLength is never set and fossil prints server did not reply.

Does this patch look like it could be the source of the problem?

I have yet to figure out what an AP_TRUST_CGILIKE_CL_ENVVAR is..
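
For anyone following along, here is a minimal sketch of the failure mode described above. It is not fossil's actual http.c code; the function name parse_content_length, the sample replies, and the length value 42 are made up for illustration. It only shows how a client that starts with a length of -1 and relies on the Content-Length header ends up with "nothing" when the header has been stripped:

```python
def parse_content_length(raw_headers: str) -> int:
    """Return the Content-Length value, or -1 if the header is absent.

    Hypothetical helper: mirrors the idea of a length variable that stays
    negative unless a Content-Length header is seen.
    """
    length = -1
    for line in raw_headers.splitlines()[1:]:  # skip the status line
        if not line:
            break  # a blank line ends the header block
        name, _, value = line.partition(":")
        if name.strip().lower() == "content-length":
            length = int(value.strip())
    return length


# Headers shaped like the http-reply-1.txt trace above (no Content-Length).
REPLY_BEHIND_APACHE = (
    "HTTP/1.1 200 OK\r\n"
    "Server: Apache/2.4.52 (Ubuntu)\r\n"
    "Connection: close\r\n"
    "Content-Type: application/x-fossil-uncompressed\r\n"
    "\r\n"
)

# The same reply with a made-up Content-Length added back in.
REPLY_WITH_LENGTH = REPLY_BEHIND_APACHE.replace(
    "Connection: close\r\n",
    "Connection: close\r\nContent-Length: 42\r\n",
)
```

With these definitions, parse_content_length(REPLY_BEHIND_APACHE) stays at -1, which corresponds to the iLength<0 branch quoted earlier, while the second reply yields a usable length.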

(7.1) By Jan Danielsson (jan) on 2024-04-15 23:06:46 edited from 7.0 in reply to 6 [link] [source]

When I do a fossil clone against a local repo served using fossil server, then the reply does contain Content-Length.

Is anyone else here hosting fossil repos behind an apache (with mod_cgi) instance? Could you check if your clone reply contains the Content-Length field?

Also, don't upgrade your apache if it is currently working. 😬

(8) By Stephan Beal (stephan) on 2024-04-15 22:18:58 in reply to 6 [link] [source]

> So if I'm reading that right, several modules (among them mod_cgi) started removing Content-Length,...

If indeed they're doing that, it seems to be in violation of the CGI RFC, which states:

   The server MUST set this meta-variable if and only if the request is
   accompanied by a message-body entity.  The CONTENT_LENGTH value must
   reflect the length of the message-body after the server has removed
   any transfer-codings or content-codings.

A cursory glance at the mod_cgi docs reveals no way to force it to re-add that.

(10) By Jan Danielsson (jan) on 2024-04-15 22:45:48 in reply to 8 [link] [source]

Yeah, but the spec also says:

The server is not required to create meta-variables for all the header fields that it receives. In particular, it SHOULD remove any header fields carrying authentication information, such as 'Authorization'; or that are available to the script in other variables, such as 'Content-Length' and 'Content-Type'. The server MAY remove header fields that relate solely to client-side communication issues, such as 'Connection'.

I'm not ashamed to say I know too little about HTTP/CGI to properly parse this. On a superficial level it looks to me like they are saying (in the code comment and the CGI RFC) that Content-Length can be scrubbed by the server because the length is conveyed by other means?

You know what, I shouldn't even try to speculate about this, all of this is wooooshing me pretty spectacularly. :)

My question is: Assuming the apache update is correct -- what do we need to do with fossil to make it play nice with this New And Exciting™️ behavior?

(11.1) By Stephan Beal (stephan) on 2024-04-16 20:46:40 edited from 11.0 in reply to 10 [link] [source]

> My question is: Assuming the apache update is correct

For a given value of "correct." That change will break many CGIs.

> what do we need to do with fossil to make it play nice with this New And Exciting™️ behavior?

The only way to get the content length, aside from the CONTENT_LENGTH, is to read the whole input, re-allocating the input buffer to fit as more input is received.

No, CONTENT_LENGTH is not strictly necessary, in that apps can just keep reading until there's nothing left to read, but they have to be coded to do so, whereas with CONTENT_LENGTH they can allocate their input buffer a single time and read exactly that amount of stuff into it.

Edit: Not true. Richard's experiments (see below) show that this approach hangs indefinitely.

(12) By Jan Danielsson (jan) on 2024-04-15 23:30:07 in reply to 11.0 [link] [source]

> For a given value of "correct." That change will break many CGIs.

Yeah, in one of the links I posted someone mentioned that some code that has been working for 10 years suddenly stopped working with the update. And I figure there are quite a few who have applied the update, but have not restarted apache yet, who'll have an unpleasant surprise when they do.

Though it should be noted that the updates are security updates (with snazzy CVEs attached). I haven't read the reports, so I don't know what the actual issues are.

> The only way to get the content length, aside from the CONTENT_LENGTH, is to read the whole input, re-allocating the input buffer to fit as more input is received.

Ah, ok. When the CGI RFC says:

[..] or that are available to the script in other variables, such as 'Content-Length'

I thought that meant that the explicit content length is passed through some other variable (not that it needs to be deduced by waiting for eof).

Reading more about this (not the RFC, but random blogs/posts), it looks like people strongly recommend passing Content-Length when using CGI, so I really do not understand what the RFC was trying to accomplish with that paragraph.

Anywho -- perhaps we should make a note about ap_trust_cgilike_cl somewhere in the wiki. IIRC there's a page about setting up fossil behind apache?

(13) By Stephan Beal (stephan) on 2024-04-15 23:54:38 in reply to 12 [link] [source]

> perhaps we should make a note about ap_trust_cgilike_cl somewhere in the wiki. IIRC there's a page about setting up fossil behind apache?

i did not find an apache-specific doc but did add a note to the generic CGI setup docs: src:/doc/trunk/www/server/any/cgi.md (scroll to the bottom). Please let us know if that needs any adjustment based on your experience.

(18) By Richard Hipp (drh) on 2024-04-16 18:07:32 in reply to 11.0 [link] [source]

> CONTENT_LENGTH is not strictly necessary, in that apps can just keep reading until there's nothing left to read...

I tried that. But I can't get it to work. Fossil uses the cgi_fread() function to read the content off of the wire. But if you try to read more bytes than are available, that function simply blocks, waiting on more. But even if we made that function non-blocking, how would Fossil know that it has reached end-of-input? Maybe the next chunk of content has been delayed on the internet and Fossil just needs to wait a little longer? How can it ever know?
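
To illustrate the dilemma, here is a small sketch using a local socket pair as a stand-in for the pipe between a web server and a CGI program. This is not fossil's cgi_fread(); the names read_exact and read_until_eof are hypothetical, and the timeout exists only so the demonstration returns instead of hanging forever:

```python
import socket


def read_exact(sock, n):
    """Read exactly n bytes: possible only when the length is known up front."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise EOFError("peer closed before sending the promised bytes")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def read_until_eof(sock, timeout):
    """Read until the peer closes its end, giving up after `timeout` seconds.

    Returns (data, saw_eof). Without a known length the reader cannot tell
    "no more data will ever come" apart from "more data is merely delayed";
    saw_eof is False when we gave up waiting.
    """
    sock.settimeout(timeout)
    chunks = []
    try:
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                return b"".join(chunks), True
            chunks.append(chunk)
    except socket.timeout:
        return b"".join(chunks), False
```

If the sender writes its bytes but keeps its end of the connection open, read_exact returns as soon as the announced number of bytes has arrived, whereas read_until_eof can only wait until its timeout expires and report that it never saw end-of-input.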

The only thing I can think of to do about this is that if the CGI interpreter ever sees a REQUEST_METHOD value of "POST" and no CONTENT_LENGTH, it should reply immediately with an error message that clearly says "CONTENT_LENGTH is a required CGI parameter for POST requests" - and include the value of SERVER_SOFTWARE in the error message somehow so that the unwitting victim knows that Apache is to blame.
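
A rough sketch of what such a check could look like, written against a plain dictionary of CGI variables. This is an illustration only and not the code that went into Fossil; the function name missing_content_length_error and the exact message layout are assumptions:

```python
def missing_content_length_error(environ):
    """Return an error message if a POST arrives without CONTENT_LENGTH.

    `environ` is a mapping of CGI meta-variables (for a real CGI program
    this would be os.environ). Returns None when nothing is wrong.
    """
    method = environ.get("REQUEST_METHOD", "")
    if method.upper() == "POST" and "CONTENT_LENGTH" not in environ:
        # Name the web server so the reader knows where to look.
        server = environ.get("SERVER_SOFTWARE", "unknown server")
        return (
            "CONTENT_LENGTH is a required CGI parameter for POST requests "
            f"(SERVER_SOFTWARE: {server})"
        )
    return None
```

Failing fast like this trades a silent hang for an explicit message, which is the point of the proposal above.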

(20) By Stephan Beal (stephan) on 2024-04-16 19:39:41 in reply to 18 [link] [source]

> I tried that. But ...

Well, that's unfortunate. No other ideas for working around a missing CONTENT_LENGTH come to mind beyond the nuclear option:

> The only thing I can think of to do about this is that if the CGI interpreter ever sees a REQUEST_METHOD value of "POST" and no CONTENT_LENGTH, it should reply immediately with an error message that clearly says "CONTENT_LENGTH is a required CGI parameter for POST requests"

That's the best option yet, IMO.

(14) By graham on 2024-04-16 00:28:18 in reply to 10 [link] [source]

Personally, I suspect:

In particular, it SHOULD remove any header fields carrying authentication information, such as 'Authorization'; or that are available to the script in other variables, such as 'Content-Length' and 'Content-Type'.

is saying that header fields that duplicate information in variables such as content-length/-type should be removed, not that content-length/-type themselves should be removed. So, for example, X-Redundant-Content-Length should go. But I think it's very poorly worded.

(9) By Jan Danielsson (jan) on 2024-04-15 22:37:34 in reply to 1 [link] [source]

After some trail-following I found that AP_TRUST_CGILIKE_CL_ENVVAR is defined to ap_trust_cgilike_cl. This wasn't super helpful, because I had no idea what an ap_trust_cgilike_cl is.

On this mailing list someone asks whether ap_trust_cgilike_cl should be documented. Following the lead from the link in that post, it looks like the ap_trust_cgilike_cl being checked for in the patch can be set using SetEnv in the apache configuration.

I tried adding SetEnv ap_trust_cgilike_cl "yes" to the repos' Directory section and restarted apache -- and now the message is gone, I can clone again and everything is fine.

"Always stick random environment variables with the word 'trust' in them in your cgi environment. Especially do this if you have no idea what the variable affects."

-- Mark Train

(15) By Brian Follas (62BRAINS) on 2024-04-16 17:30:53 in reply to 9 [link] [source]

On April 9th (Happy Birthday drh!) all of our fossil clients started receiving the same "server did not reply ... clone aborted" message when initiating a clone.

We have several repositories running with Apache using mod_sec and I was able to observe mod_sec log entries citing a Comodo security violation on rule 210710 every time a clone was attempted. This rule has a reputation for false positives, and disabling the rule stopped the mod_sec log entries. Unfortunately, fossil clients continued to abort cloning with the same message.

There were no intentional updates to the environment at the account level (that I was aware of), though this coincided with a cPanel update on that server.

I am hoping to understand what the ap_trust_cgilike_cl "yes" change actually does before implementing it, though I'm not even sure where to make that change. This particular server is using Easy Apache 4 whose modules were -- I suspect -- updated with the cPanel update.

It sounds like the change was made in an .htaccess file in the repos directory. Is that the case, or is this going to be something found in the httpd.conf?

Given the abrupt nature of this issue in my experience, I suspect there will be many others looking for a solution. Thank you for raising the issue here and for everyone's input on this thread.

(16) By Stephan Beal (stephan) on 2024-04-16 17:39:34 in reply to 15 [link] [source]

> It sounds like the change was made in an .htaccess file in the repos directory. Is that the case, or is this going to be something found in the httpd.conf?

It can hypothetically be made in either, depending on what's easier for the installation in question. For a shared hoster, it would probably need to go into the .htaccess of any affected accounts. For a global installation it could go into the main config.

> Given the abrupt nature of this issue in my experience, I suspect there will be many others looking for a solution.

It is my sincere hope that it affects countless CGI apps and that the Apache folks will revert this change, making it an opt-in instead of automatic breakage. The change is well-intended but is going to have significant fallout across the web (only a tiny fraction of it from fossil users).

(17) By Brian Follas (62BRAINS) on 2024-04-16 18:03:20 in reply to 16 [link] [source]

Thank you, Stephan, for clarifying! I'm looking through the docs and info from linked resources here that I overlooked. This will certainly be something interesting to follow. The only clue I had was the cPanel update on the server which was approximately related. The server restart that came some time after it, and for an unrelated reason, was the trigger. 7 days of mind boggle and the answer shows up. Love this forum!

(19) By Jan Danielsson (jan) on 2024-04-16 18:41:53 in reply to 15 [link] [source]

> I am hoping to understand what the ap_trust_cgilike_cl "yes" change actually does before implementing it, [...]

(Minor point, but the code doesn't actually check the value of ap_trust_cgilike_cl, it only checks if it is present. You can set it to "absolutely_not" to confuse future you or anyone else looking over the configuration).

I grepped the latest apache tarball for AP_TRUST_CGILIKE_CL_ENVVAR, and it is defined only once -- to "ap_trust_cgilike_cl". And the only places AP_TRUST_CGILIKE_CL_ENVVAR is used is to determine whether to filter out the Content-Length.

$ rg AP_TRUST_CGILIKE_CL_ENVVAR
include/util_script.h
228:#define AP_TRUST_CGILIKE_CL_ENVVAR "ap_trust_cgilike_cl"

modules/proxy/mod_proxy_scgi.c
397:    if (!apr_table_get(r->subprocess_env, AP_TRUST_CGILIKE_CL_ENVVAR))

modules/proxy/ajp_header.c
678:    if (!apr_table_get(r->subprocess_env, AP_TRUST_CGILIKE_CL_ENVVAR))

modules/proxy/mod_proxy_fcgi.c
787:                            if (!apr_table_get(r->subprocess_env, AP_TRUST_CGILIKE_CL_ENVVAR))

modules/generators/mod_cgid.c
1626:        if (!apr_table_get(r->subprocess_env, AP_TRUST_CGILIKE_CL_ENVVAR))

modules/generators/mod_cgi.c
977:        if (!apr_table_get(r->subprocess_env, AP_TRUST_CGILIKE_CL_ENVVAR))

modules/aaa/mod_authnz_fcgi.c
578:                        if (!apr_table_get(r->subprocess_env, AP_TRUST_CGILIKE_CL_ENVVAR))
$ gvim `find . -type f -exec grep -l AP_TRUST_CGILIKE_CL_ENVVAR {} \;`

So the current state of affairs is that ap_trust_cgilike_cl doesn't have a bunch of other weird side-effects. However, there's no documentation/specification stating this will hold true going forward. We should be clear with users who ask about this that the variable is undocumented and we do not know what it will do in the future. But from our perspective, with the latest apache tarball, ap_trust_cgilike_cl simply means "make fossil work again".

We should also be open about the fact that these changes were introduced in order to fix an HTTP Response Splitting vulnerability. (If I understand the attack vector correctly, it needs other, malicious, modules to be installed?).

(21.1) By Stephan Beal (stephan) on 2024-04-17 03:04:29 edited from 21.0 in reply to 19 [link] [source]

> We should also be open about the fact that these changes were introduced in order to fix an HTTP Response Splitting vulnerability.

Follow-up, primarily as a reminder to self: https://httpd.apache.org/security/vulnerabilities_24.html lists that vulnerability as being fixed in 2.4.59. Unfortunately, my Linux Mint systems have 2.4.52, which does not repro the problem.

We're looking for either a way to work around it or at least to explicitly make users aware of the problem when it crops up.

Stay tuned...

(22) By Stephan Beal (stephan) on 2024-04-17 03:35:27 in reply to 21.1 [link] [source]

> Unfortunately, my Linux Mint systems have 2.4.52, which does not repro the problem.

Bewilderingly, neither does 2.4.59, so i'm at a loss for how to reproduce it.

This leaves me wondering whether there's more to it than simply the Apache version, e.g. ownership of the CGI scripts.

2.4.59, hosting a fossil CGI, reports this from fossil's /test_env page:

CONTENT_LENGTH = 36
CONTENT_TYPE = application/x-www-form-urlencoded
...
PATH_INFO = /test_env
...
SERVER_SOFTWARE = Apache/2.4.59 (Unix)
post-test-button = POST Test
showall = 0

Any hints on how to repro the lack of CONTENT_LENGTH are welcomed!

(23) By Jan Danielsson (jan) on 2024-04-17 12:12:13 in reply to 22 [link] [source]

I reverted the SetEnv ap_trust_cgilike_cl "yes" fix, updated to trunk fossil (This is fossil version 2.24 [9c40ddbcd1] 2024-04-16 13:50:03 UTC), tried to run fossil sync against the repo and got the server did not reply message (i.e. at this point the problem is reactivated).

When I go to the repo's /test_env page and press the POST Test button the output does contain CONTENT_LENGTH = 36.

I'm not sure what this is supposed to demonstrate though, because what apache is filtering out is the Content-Length HTTP header field, and it is this field that fossil's http.c is looking for.

The problem isn't the absence of CONTENT_LENGTH (CGI) on the server, the problem is the absence of Content-Length (HTTP) on the client.

I will reiterate that I should not speak about these things, because I do not know enough about them, but with that disclaimer: What I suspect is happening is that the CONTENT_LENGTH is always being added correctly, apache's cgi module is translating it to a Content-Length field, then the new code in apache is removing the Content-Length field, which makes the fossil client croak because it expects an explicit content length. Setting the ap_trust_cgilike_cl variable simply makes apache's cgi module no longer remove the Content-Length HTTP header field.
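
As a toy model of that suspicion (and of the patch excerpt quoted earlier), the sketch below uses plain dictionaries in place of Apache's r->headers_out and r->subprocess_env tables. It is an assumption-laden simplification, not Apache code: the function name filter_cgi_response_headers is made up, and real APR tables match keys case-insensitively, which this model does not attempt to reproduce:

```python
TRUST_ENVVAR = "ap_trust_cgilike_cl"  # value of AP_TRUST_CGILIKE_CL_ENVVAR


def filter_cgi_response_headers(headers_out, subprocess_env):
    """Model of the patched behavior: drop Content-Length unless trusted.

    Only the presence of the variable matters, not its value, matching the
    apr_table_get() check quoted from the patch.
    """
    filtered = dict(headers_out)  # leave the caller's dict untouched
    if TRUST_ENVVAR not in subprocess_env:
        filtered.pop("Content-Length", None)
    return filtered
```

In this model, a CGI reply carrying Content-Length loses it by default, and keeps it as soon as ap_trust_cgilike_cl is present in the environment, whatever value it has been given.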

(24) By Richard Hipp (drh) on 2024-04-17 12:21:08 in reply to 23 [link] [source]

Please update your Fossil client and server to the latest on the content-length-errors branch. Retry your experiment. Let us know if that clears the problem.

If the problem persists, please run "fossil clone" with the new --xverbose option and post the output.

Thanks.

(25.1) By Jan Danielsson (jan) on 2024-04-17 12:53:29 edited from 25.0 in reply to 24 [link] [source]

I cheated and only updated the client, but this branch does indeed solve the problem: It is possible to disable the server-side ap_trust_cgilike_cl hack and cloning/sync'ing still works fine.

Edit: Also updated server and successfully redid the test.

(26) By Richard Hipp (drh) on 2024-04-17 13:04:02 in reply to 25.1 [link] [source]

The fix has now landed on trunk. check-in a8e33fb161f45b65.

(27) By bohwaz on 2024-04-17 20:59:57 in reply to 26 [link] [source]

Thank you all, I stumbled on this issue after an Apache security update on Debian tonight, was beginning to look at Fossil source code and coming to this forum to post about the issue just to see that it's already resolved :)

Thank you!

(29) By Torsten Berg (torstenberg) on 2024-04-19 12:50:39 in reply to 26 [link] [source]

That one also hit me today, obviously my hoster IONOS has updated Apache. Thanks a bunch for the quick reaction and fix, this really saves my day!

And how wonderful that we always have the trunk as a compiled version for download. This makes resolving the issue quick and easy for the users working with the repository!

(28) By Brian Follas (62BRAINS) on 2024-04-19 12:06:26 in reply to 25.1 [link] [source]

Anecdotally, I updated the server using the latest snapshot -- completely forgetting the client -- and still got the "server did not reply" error. I saw your message when I got to work and updated the client and can also confirm that it worked great!

Thank you to everyone for working to resolve this.

(30) By schelte on 2024-04-22 16:06:55 in reply to 9 [link] [source]

Completely oblivious to this problem, I decided today to try using fossil as a CGI script.

I saw the instruction on https://fossil-scm.org/home/doc/trunk/www/server/any/cgi.md to add 'SetEnv ap_trust_cgilike_cl "yes"' for Apache 2.4.59, but since my Apache version was reported to be 2.4.41, I didn't think I needed it.

However, I ran into the exact same issues that are described here. Once I added the SetEnv line to my .htaccess file, cloning started working.

(31) By Richard Hipp (drh) on 2024-04-22 16:14:51 in reply to 30 [link] [source]

Please run this test:

Undo the "SetEnv ap_trust_cgilike_cl yes" line in the Apache setup.
Download and use the latest pre-release snapshot of Fossil and use that as your client.
Confirm for us that the new version of Fossil that I am planning to release within the next 48 hours works around the issue with Apache.

Your independent confirmation that the problem has been resolved will be helpful.

(32) By Jan Danielsson (jan) on 2024-04-24 06:25:32 in reply to 30 [link] [source]

Yeah, this is a side-effect of distros that guarantee stable versions of packages but also back-port patches for vulnerabilities.

The wiki comment applies to unpatched versions of apache.

As @drh suggested, you should use the updated fossil instead of enabling ap_trust_cgilike_cl.