/WX causes errors when building under windows with other code pages

(1) By ZJUGKC on 2022-03-22 05:08:35 [link] [source]

I am using Windows Server 2019 with code page 936 (Simplified Chinese, China, Singapore) and MSVC (buildmsvc.bat, nmake) to build fossil. The compile options /W2 /WX in Makefile.msc cause some errors, e.g., .\report_.c contains characters that cannot be displayed in code page 936, you should save file as UNICODE format (e.g., UTF-8 format with BOM).

When I change /W2 /WX to /W2 /WX /utf-8 in Makefile.msc, the build raises errors at lines 20759 and 20772 of ..\extsrc\shell.c (sqlite), which contain characters that cannot be displayed in the current code page (0).

I think /WX should be removed and /utf-8 should be added. But /execution-charset:utf8, /source-charset:utf8, and /utf-8 have only been supported since Visual Studio 2015 Update 2/3 (MSVC_VERSION=1900).

What is the better solution to fix this problem?

(2) By Stephan Beal (stephan) on 2022-03-22 09:51:23 in reply to 1 [link] [source]

you should save file as UNICODE format (e.g., UTF-8 format with BOM).

FWIW, the Unicode Standard recommends against using a BOM for UTF-8. The only tools i'm aware of which violate that recommendation are Microsoft's. A BOM makes little sense in UTF-8 because UTF-8 has a fixed byte order. All source files in fossil are ASCII or UTF-8, with the possible exception of some non-source files added specifically for testing encoding.

What is the better solution to fix this problem?

Adding the /utf-8 flag to tell those tools to explicitly treat the files as UTF-8. i can't personally test that change (no Windows) so won't change it myself, but someone who uses Windows will be along shortly and do so (or can explain why it's not as good of an idea as i think it is).

(3) By Larry Brasfield (larrybr) on 2022-03-22 12:05:53 in reply to 2 [link] [source]

I can (and do) confirm that adding the /utf-8 option to CFLAGS in Makefile.msc (and its progenitor) does not break the build and produces what seems to be a normally working fossil.exe. (I did not run a test suite.) Accordingly, I just committed this change.

I think this is a good idea because it makes explicit something (the source encoding and what is to go into "string" literals) which is otherwise left to defaulting mechanisms. It is gcc's default also, and gcc does not guess about encoding by using the user's current "code page", so this change serves to enforce more consistency among platforms. (I could argue that gcc builds should also be more explicit about this, but that is not today's problem.)
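To make "what is to go into string literals" concrete, here is a minimal sketch (not part of the Fossil sources) that checks at run time which bytes the compiler stored for a literal containing U+00E9. The function name `exec_charset_is_utf8` is hypothetical. It uses a universal character name, so it exercises only the execution character set, not the source encoding; the assumption is that with gcc's defaults, or with MSVC and /utf-8, the literal is stored as the UTF-8 bytes 0xC3 0xA9.

```c
#include <string.h>

/* Hypothetical helper: returns 1 if the compiler stored the literal
** "\u00e9" (LATIN SMALL LETTER E WITH ACUTE) as the two UTF-8 bytes
** 0xC3 0xA9, and 0 otherwise. */
int exec_charset_is_utf8(void){
  const char *z = "\u00e9";
  return strlen(z)==2
      && (unsigned char)z[0]==0xC3
      && (unsigned char)z[1]==0xA9;
}
```

A build that relies on a code-page default instead may store different bytes for the same literal, which is the kind of silent variation the explicit option is meant to rule out.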

On a side-issue: A UTF-8 BOM serves to notify consumers that the file is encoded in UTF-8. Granted, the acronym for "byte order mark" is not quite properly used, but the mark is good for more than indicating endianness.

(5) By Scott Robison (sdr) on 2022-03-22 17:56:23 in reply to 2 [link] [source]

FWIW, the Unicode Standard recommends against using a BOM for UTF-8. The only tools i'm aware of which violate that recommendation are Microsoft's. A BOM makes little sense in UTF-8 because UTF-8 has a fixed byte order. All source files in fossil are ASCII or UTF-8, with the possible exception of some non-source files added specifically for testing encoding.

No recommendation is not the same as a negative recommendation, according to my reading of the linked information. As an example, I might recommend against certain courses of action, recommend for others, and have no recommendation in yet other cases.

I understand the idea that UTF-8 doesn't have a byte order, yet it does: it has a variable byte length per code point stored in big endian order.
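As an illustration of that ordering, here is a minimal sketch of a UTF-8 encoder in C (the name `utf8_encode` is hypothetical, not from any library discussed here). The most significant bits of the code point always go into the first byte, and the least significant six bits into the last one, regardless of the host CPU's endianness.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode code point cp as UTF-8 into out[], which must have room for
** 4 bytes. Returns the number of bytes written, or 0 if cp is a
** surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char *out){
  if( cp<=0x7F ){
    out[0] = (unsigned char)cp;
    return 1;
  }
  if( cp<=0x7FF ){
    out[0] = (unsigned char)(0xC0 | (cp>>6));          /* top 5 bits */
    out[1] = (unsigned char)(0x80 | (cp & 0x3F));      /* low 6 bits */
    return 2;
  }
  if( cp<=0xFFFF ){
    if( cp>=0xD800 && cp<=0xDFFF ) return 0;           /* surrogates */
    out[0] = (unsigned char)(0xE0 | (cp>>12));         /* top 4 bits */
    out[1] = (unsigned char)(0x80 | ((cp>>6) & 0x3F)); /* middle 6 bits */
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));      /* low 6 bits */
    return 3;
  }
  if( cp<=0x10FFFF ){
    out[0] = (unsigned char)(0xF0 | (cp>>18));         /* top 3 bits */
    out[1] = (unsigned char)(0x80 | ((cp>>12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp>>6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));      /* low 6 bits */
    return 4;
  }
  return 0;
}
```

Encoding U+FEFF with this function yields the three bytes 0xEF 0xBB 0xBF, which is the sequence usually called the UTF-8 BOM.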

Now, I'm not advocating that everyone should modify their source code to include a BOM, but it is far from a useless feature. I don't blame tools for failing to detect that use case when they are not explicitly unicode aware, per se (ie, they just transmit a stream of bytes and trust some other tool to do the right thing). But if a tool already checks for Unicode formatted text that might include a BOM, it would be foolish not to allow for the UTF-8 encoded BOM case.

(6) By Warren Young (wyoung) on 2022-03-22 18:25:05 in reply to 5 [link] [source]

No recommendation is not the same as a negative recommendation

Point 3, bottom of page.

(7) By Scott Robison (sdr) on 2022-03-23 00:15:07 in reply to 6 [link] [source]

And that does not negate my assessment.

"Some byte oriented protocols expect ASCII characters at the beginning of a file."

Their negative recommendation is explicitly tied to "byte oriented protocols that expect ASCII characters". A program that does not expect to handle BOM should obviously not be given a BOM ... that is part of the "protocol of processing data" even if it doesn't match what we typically think of as a protocol (like HTTP).

My only point is that the UTF-8 BOM recommendation isn't as strong as some believe, and that it does have utility. I am not tilting at a windmill trying to get the world to start prepending BOM to all UTF-8 data streams any more than I think they should be mandated for UTF-16 or UTF-32 data streams. Just stating the negative recommendation is not universal.

Really, what they are saying is "do not feed BOM byte sequences in UTF-8 format to programs that do not expect to process UTF-8 BOM byte sequences." That's a pretty fair recommendation, but it doesn't negate all usage. Or so I think.

I also think it is a good idea as much as possible for software that processes UTF-8 encoded text (not an otherwise unidentified stream of bytes) to skip the BOM. Whether people like the fact that some UTF-8 files have BOMs or not, the fact is they do, and that paragraph doesn't explicitly forbid it in the general case.
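For a reader that already knows its input is UTF-8 text, skipping the BOM can be as small as the following sketch. The function name `utf8_bom_length` is hypothetical; this is one possible approach, not code taken from Fossil or SQLite.

```c
#include <stddef.h>
#include <string.h>

/* Return 3 if buf[0..n-1] starts with the UTF-8 encoded BOM
** (0xEF 0xBB 0xBF), else 0. The caller advances past that many
** bytes before processing the text. */
size_t utf8_bom_length(const unsigned char *buf, size_t n){
  static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };
  return ( n>=3 && memcmp(buf, bom, 3)==0 ) ? 3 : 0;
}
```

A caller would then process `buf + utf8_bom_length(buf, n)` instead of `buf`, so input with and without a BOM is handled the same way.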

(13) By ravbc on 2022-03-23 09:33:02 in reply to 7 [link] [source]

UTF-8 exists in only one byte-ordering. So adding BOM does not add any new information. But adding BOM to UTF-8 encoded files creates problems, at least in some situations. So what's the value of BOM in this case?

(16) By Scott Robison (sdr) on 2022-03-23 17:38:16 in reply to 13 [link] [source]

Because UTF-8 is but one possible encoding among many: UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE. A signature is useful in some workflows to help identify the data unambiguously.
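A rough sketch of how such a signature check might look in C follows. The `Encoding` enum and the `detect_bom` function are hypothetical names introduced for this example. The UTF-32 checks come first because the UTF-32LE signature begins with the same two bytes as the UTF-16LE one.

```c
#include <stddef.h>

typedef enum {
  ENC_UNKNOWN, ENC_UTF8, ENC_UTF16LE, ENC_UTF16BE, ENC_UTF32LE, ENC_UTF32BE
} Encoding;

/* Guess the encoding of b[0..n-1] from a leading BOM, if any.
** Note: FF FE 00 00 is reported as UTF-32LE, although it could also be
** a UTF-16LE BOM followed by U+0000. A signature is a hint, not proof. */
Encoding detect_bom(const unsigned char *b, size_t n){
  if( n>=4 && b[0]==0xFF && b[1]==0xFE && b[2]==0x00 && b[3]==0x00 ){
    return ENC_UTF32LE;
  }
  if( n>=4 && b[0]==0x00 && b[1]==0x00 && b[2]==0xFE && b[3]==0xFF ){
    return ENC_UTF32BE;
  }
  if( n>=3 && b[0]==0xEF && b[1]==0xBB && b[2]==0xBF ) return ENC_UTF8;
  if( n>=2 && b[0]==0xFF && b[1]==0xFE ) return ENC_UTF16LE;
  if( n>=2 && b[0]==0xFE && b[1]==0xFF ) return ENC_UTF16BE;
  return ENC_UNKNOWN;   /* no signature: caller must decide some other way */
}
```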

Again ... I'm not advocating for all files to include a BOM. I'm merely stating that it does sometimes have utility, that of course it should not be used where it is not expected (such as in an ASCII protocol), and that the commonly cited reason that it is not recommended isn't as broad as some interpret the recommendation.

Sorry, I really didn't think this would be so controversial!

(4) By Larry Brasfield (larrybr) on 2022-03-22 12:12:39 in reply to 1 [link] [source]

I think /WX should be removed

I think not. And I'm fairly certain that some people who spent hours debugging a subtle fossil problem which /WX would have entirely avoided would agree that /W2 level warnings are much better fixed than allowed to affect what is built.

The problem with just relying on people to evaluate warnings is that, with the Fossil build, so much build noise is emitted that warnings fly by (on any semi-modern computer), and are unlikely to be seen. That /WX option was added specifically to ensure that warnings which should not be ignored cannot be ignored.

(8) By ZJUGKC on 2022-03-23 03:48:06 in reply to 4 [link] [source]

Thank you for adding /utf-8 to Makefile.msc.

Now I can remove /WX in my building script dynamically to avoid errors raised by MSVC compiler in extsrc\shell.c.

(9) By Larry Brasfield (larrybr) on 2022-03-23 04:26:28 in reply to 8 [link] [source]

What warning(s) are you seeing when compiling extsrc\shell.c with MSVC tools? Are you using non-default build options? And if so, what are they?

I ask because SQLite developers prefer shell.c to compile cleanly with the more commonly used compilers. The /W2 warning level is lax enough that its complaints should be taken seriously, and are generally addressed by code adjustments.

Thanks.

(10) By ZJUGKC on 2022-03-23 04:56:04 in reply to 9 [source]

The source code at lines 20759 and 20772 in extsrc\shell.c is:

20759  if( bBOM ) fprintf(p->out,"\357\273\277");

20772  if( bBOM ) fprintf(p->out,"\357\273\277");

These three characters in a string cause warnings saying that these characters cannot be displayed in the current code page (0). /WX converts these warnings to errors. So I removed /WX in my build script and everything works OK.
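For reference, the octal escapes \357 \273 \277 are the bytes 0xEF 0xBB 0xBF, i.e. the UTF-8 BOM. One way to emit them without putting non-ASCII bytes into a string literal at all is sketched below. This is only an illustration under my own assumptions: `write_utf8_bom` is a hypothetical name, and it is not a claim about how shell.c is or will be written.

```c
#include <stdio.h>

/* Write the 3-byte UTF-8 BOM to out as raw bytes, so that no string
** literal has to pass through the compiler's character-set handling.
** Returns 1 on success, 0 on a short write. */
int write_utf8_bom(FILE *out){
  static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };
  return fwrite(bom, 1, sizeof(bom), out)==sizeof(bom);
}
```

A call site equivalent to the lines above would then read `if( bBOM ) write_utf8_bom(p->out);`.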

(11) By Larry Brasfield (larrybr) on 2022-03-23 05:19:11 in reply to 10 [link] [source]

What happens if you add /utf-8 to the CL invocation flags instead of removing /WX?

Are you building with the source from this checkin or later? Or are you still writing about what happens with some earlier version of the source?

I am unable to replicate your reported warnings with CL.exe v19 when using code page 936, even with /Wall, so I cannot conveniently tell whether that checkin cures the problem you reported. What version of the MSVC tools are you using?

Thanks for any insights.

(12.1) By ZJUGKC on 2022-03-23 05:58:08 edited from 12.0 in reply to 11 [link] [source]

Now I use the latest checkin [1e70f826] (2022-3-22 15:53). /utf-8 has been added in this checkin. I use Visual Studio 17 2022 build tools. These warnings and errors are raised with /W2 /WX /utf-8. If I remove /WX, these errors disappear.

These warnings are raised on each older checkin without /utf-8, but they have become errors since checkin 57f16ce8.

(14) By Richard Hipp (drh) on 2022-03-23 10:10:00 in reply to 10 [link] [source]

Please try again with check-in 92fd091703a28c07 and let us know if that clears your problem.

(15) By ZJUGKC on 2022-03-23 17:06:31 in reply to 14 [link] [source]

Good job. Now no errors are raised when using MSVC with the compile options /W2 /WX /utf-8 under my code page. Some warnings are still raised in the source files of fossil (not sqlite), such as data type conversions from XX to XX that would lose precision, but that is not a serious problem.

Thank you very much, all problems in this thread are solved.