Markdown formatter bug.
(1.1) By Richard Hipp (drh) on 2021-08-06 12:29:22 edited from 1.0 [link] [source]
This bug was discovered over on the Pikchr forum. If you include
an explicit <table> in the Markdown and put Pikchr diagrams in cells of
the table, and if you include additional text after the </table>, then
the Pikchr is not rendered. If you omit the trailing text, the Pikchr is
rendered.
This post contains the <table> with two pikchr diagrams in cells without
the trailing text:
arrow box "1st" "Pikchr" "diagram" arrow→ /pikchrshow |
arrow oval "2nd" "Pikchr" "diagram" arrow→ /pikchrshow |
(2) By Richard Hipp (drh) on 2021-08-06 12:27:36 in reply to 1.0 [source]
Here is the same <table> with text after the </table>:
arrow box "1st" "Pikchr" "diagram" arrow→ /pikchrshow |
arrow oval "2nd" "Pikchr" "diagram" arrow→ /pikchrshow |
Here is the additional text.
(3.1) By Stephan Beal (stephan) on 2021-08-06 13:31:19 edited from 3.0 in reply to 1.1 [link] [source]
Testing the same with fossil-wiki.
arrow box "1st" "Pikchr" "diagram" arrow→ /pikchrshow |
arrow oval "2nd" "Pikchr" "diagram" arrow→ /pikchrshow |
extra text
(4.2) By Stephan Beal (stephan) on 2021-08-06 14:44:35 edited from 4.1 in reply to 1.1 [link] [source]
Reduced repro in md format.
arrow box "1st" "Pikchr" "diagram" arrow→ /pikchrshow |
arrow oval "2nd" "Pikchr" "diagram" arrow→ /pikchrshow |
If the line with the table closing tag contains any non-whitespace after that tag, the problem goes away, regardless of whether or how much whitespace surrounds that non-whitespace.
(5) By Stephan Beal (stephan) on 2021-08-06 16:30:04 in reply to 2 [link] [source]
Here is the additional text.
Though i've been unable to fix it, here's a curious detail: adding any non-whitespace anywhere on the same line after the closing table tag stops the bug. Whether that non-whitespace is surrounded by whitespace is irrelevant. Whether that's a red herring or not is currently anyone's guess, but it's curious. Based solely on that, it would seem that markdown.c:htmlblock_end(), at least in some contexts, is related or adjacent to this bug.
(6) By Richard Hipp (drh) on 2021-08-06 16:45:52 in reply to 5 [link] [source]
To prove this point, here is a copy of the penultimate post
with the added non-whitespace after </table>.
arrow box "1st" "Pikchr" "diagram" arrow→ /pikchrshow |
arrow oval "2nd" "Pikchr" "diagram" arrow→ /pikchrshow |
Here is the additional text.
(7) By Richard Hipp (drh) on 2021-08-06 23:29:59 in reply to 1.1 [link] [source]
Problem Analysis
The way the markdown interpreter works is that if an HTML block tag span begins after a blank line and is followed immediately by a blank line, then that entire HTML block tag span is copied into the output verbatim. None of the content is subject to Markdown formatting. The list of HTML block tags that followed this rule used to be large, including:
- <p>
- <table>
- <ins>
- <del>
- ... and many others
The fact that <table> was on this list was the source of the problem.
If the </table> was followed by a blank line, then the interior text was
not markdown-interpreted. If extra text appeared after the </table>, then
the interior content was markdown-interpreted.
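The rule described above can be sketched as a toy model. This is not Fossil's markdown.c; the function name `copied_verbatim`, the abbreviated tag set, and the handling of start and end of input are all assumptions made for illustration.

```python
import re

# Assumed, abbreviated stand-in for the old list of "verbatim" block tags.
OLD_VERBATIM_TAGS = {"p", "table", "ins", "del"}


def copied_verbatim(lines, start, verbatim_tags):
    """Toy model: True if the HTML block opening at lines[start] would be
    copied to the output without any Markdown processing."""
    m = re.match(r"<([A-Za-z][A-Za-z0-9]*)\b", lines[start])
    if m is None or m.group(1).lower() not in verbatim_tags:
        return False
    # The block must begin after a blank line.
    # Assumption: the top of the input also qualifies.
    if start > 0 and lines[start - 1].strip() != "":
        return False
    closing = "</" + m.group(1).lower() + ">"
    for i in range(start, len(lines)):
        pos = lines[i].lower().find(closing)
        if pos < 0:
            continue
        # Non-whitespace after the closing tag on the same line means the
        # span is not treated as a stand-alone HTML block.
        if lines[i][pos + len(closing):].strip() != "":
            return False
        # The closing line must be followed immediately by a blank line.
        # Assumption: end of input does not count as a blank line.
        return i + 1 < len(lines) and lines[i + 1].strip() == ""
    return False


table_then_blank = [
    "",
    "<table>",
    "<tr><td>(pikchr diagram source)</td></tr>",
    "</table>",
    "",
    "Here is the additional text.",
]
table_then_text = list(table_then_blank)
table_then_text[3] = "</table> x"

print(copied_verbatim(table_then_blank, 1, OLD_VERBATIM_TAGS))  # True
print(copied_verbatim(table_then_text, 1, OLD_VERBATIM_TAGS))   # False
```

In this model a `True` result means the table interior is never seen by the Markdown formatter, which is why a diagram inside it would go unrendered; removing "table" from the tag set makes the function return `False` for both inputs.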
The fix
The change (check-in cdbf0bf179989a2d) was to dramatically reduce the number of HTML tags that behave this way. The list is now just these two:
<pre>
<script>
(8) By Stephan Beal (stephan) on 2021-08-08 12:14:55 in reply to 7 [link] [source]
The way the markdown interpreter works is that if an HTML block tag span begins after a blank line and is followed immediately by a blank line, then that entire HTML block tag span is copied into the output verbatim.
i can't shake the nagging feeling that that might, on occasion, be useful, so long as the user knows about it and how to work around it when it's not useful. If extra non-whitespace is necessary after a closing tag in order to force markdown parsing, a
could be used to make it invisible.
(9) By Richard Hipp (drh) on 2021-08-08 17:09:57 in reply to 8 [link] [source]
What if we add <html> to the set of tags that work this way? That way, if
you have a chunk of text that you want to be pure HTML without any Markdown
interpretation, you simply enclose it inside <html>...</html> preceded and
followed by a blank line?
(10) By Stephan Beal (stephan) on 2021-08-08 18:14:36 in reply to 9 [link] [source]
What if we add...
That would certainly be much more intuitive.
... preceded and followed by a blank line?
Would the blank lines be necessary? We'd be recycling a tag which otherwise "cannot," outside of verbatim blocks, appear in markdown docs, so accepting it inlined at any point in the text seems(?) okay.
(11) By Brian Tiffin (btiffin) on 2021-08-08 19:59:41 in reply to 9 [link] [source]
Personal opinion.
When I first read this, I went to the Fossil wiki format page to look into the verbatim and nowiki tags. I'd prefer a different tag than html.
Isn't there a rule somewhere about starting CGI output with html avoiding (or including?) header and footer processing? Asking, because I don't remember, but spider sense is tingling.
I'd prefer something like noparse, or asis, or an even more magic word that covers the various semantics of nowiki, verbatim and don't touch any of this, even the angle brackets, until you find a close tag of the magic word.
Have good.
(12) By Stephan Beal (stephan) on 2021-08-08 20:08:18 in reply to 11 [link] [source]
Isn't there a rule somewhere about starting CGI output with html avoiding (or including?) header and footer processing? Asking, because I don't remember, but spider sense is tingling.
That doesn't apply to markdown files, though. Yes, we'd have a bit of mixed semantics with a tag called "html" but i've been unable to come up with a case where that would bite us. But... that's not to say it's impossible. A completely custom tag would probably cause less confusion.
(13) By Richard Hipp (drh) on 2021-08-08 22:38:28 in reply to 12 [link] [source]
Ideas:
<purehtml>
<nomarkdown>
<htmlonly>
Or, just omit the <html> and its corresponding </html> but keep all
the (non-markdown formatted) text in between.
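A minimal sketch of that last idea, assuming the marker may appear anywhere in the text (one of the options discussed in this thread, not a settled behavior). The names `HTML_SPAN` and `split_segments` are hypothetical, and this is not how Fossil implements it.

```python
import re

# Hypothetical marker pattern: <html>...</html>, case-insensitive,
# allowed to span multiple lines.
HTML_SPAN = re.compile(r"<html>(.*?)</html>", re.IGNORECASE | re.DOTALL)


def split_segments(text):
    """Split text into ("markdown", chunk) and ("raw", chunk) pieces.

    The <html> and </html> markers themselves are dropped; the text
    between them is kept untouched as a "raw" segment.
    """
    segments = []
    pos = 0
    for m in HTML_SPAN.finditer(text):
        if m.start() > pos:
            segments.append(("markdown", text[pos:m.start()]))
        segments.append(("raw", m.group(1)))
        pos = m.end()
    if pos < len(text):
        segments.append(("markdown", text[pos:]))
    return segments


doc = (
    "Intro *text*\n\n"
    "<html>\n<table><tr><td>*not emphasis*</td></tr></table>\n</html>\n\n"
    "Outro\n"
)
for kind, chunk in split_segments(doc):
    print(kind, repr(chunk))
```

Only the "markdown" segments would then be handed to the Markdown formatter, while "raw" segments go to the output as-is.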
(14) By Larry Brasfield (larrybr) on 2021-08-08 23:47:56 in reply to 13 [link] [source]
(Piping in reminds me of the alleged one-to-one mapping between GI tracts and opinions.)
As a signal to the markdown processor, <html> seems perfect. Better even than <html_for_sure>, <nothing_but_ordinary_html>, or other elaborations. If the input to the markdown processor was purported to be ready for interpretation as a response to a web page request, maybe some adherence to standards for that would be smart. But it is not such input. Simple is easier to remember.
(15) By Stephan Beal (stephan) on 2021-08-09 00:01:18 in reply to 14 [link] [source]
Simple is easier to remember
FWIW, Larry has me convinced there. Assuming we don't have any semantic collisions (i don't believe we do), "html" sounds perfect.
(16) By Richard Hipp (drh) on 2021-08-09 01:16:17 in reply to 15 [link] [source]
Changes are on trunk. Try it out. Report any problems.
(17) By Brian Tiffin (btiffin) on 2021-08-09 01:56:52 in reply to 15 [link] [source]
Agree. Thanks, Stephan: the hint about separating the special-ness of the <html> tag in Markdown from its special-ness in CGI allowed Larry's wisdom to sink in faster.
Allowing raw to the browser with <html> blocks is a sane, easy thing to remember.
Allowing exception control of header/footer with <html> triggers in CGI is a sane, easy thing to remember, even though I'd still have to go look it up for reminder-ing on specifics.
And the internet gets to cheer as someone makes a good point, shifts other opinions and the design to the better. For the win. An internet win.
And new fossil powers. Thanks, Richard.