Fossil Forum

RFC: adding auto-generating ID attributes in markdown-to-html heading elements

(1) By Stephan Beal (stephan) on 2019-12-13 13:20:06 [source]

i've just added a new feature to the markdown parser which is a bit debatable, and therefore not on trunk:

Each heading line now adds an automatically-generated ID attribute to the <H#> element, e.g.:

# My Header
==>
<h1 id="myheader">My Header</h1>

The reason for this is to facilitate on-the-fly creation of intra-document links and the creation of tables of contents which link to their relevant sections [misref]. The algorithm for creating the ID is simply to use the lower-case form of all ASCII alphanumeric characters in the header. The algorithm needs to be simple enough for humans to do "in their head" while typing, so that they can create links without having to first process the doc and look at the HTML to see what ID was generated. Also, the algorithm should be resistant to minor changes like spacing and changing of case in the input text. That is, "My Header" and "My - header" should(?) produce the same ID. (Maybe the dash and underscore should be added to the list of allowed characters?)
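
A minimal sketch of the rule as described above, written in Python rather than Fossil's C, and not taken from the branch itself: keep only ASCII letters and digits, lowercased. The name `simple_heading_id` is made up for illustration.

```python
import string

# Characters kept by the rule described in the post: ASCII letters and digits.
_KEEP = set(string.ascii_letters + string.digits)

def simple_heading_id(heading: str) -> str:
    """Lowercase the ASCII alphanumerics of a heading and drop everything else."""
    return "".join(ch.lower() for ch in heading if ch in _KEEP)

print(simple_heading_id("My Header"))    # myheader
print(simple_heading_id("My - header"))  # myheader
```

Under this rule spacing, punctuation and case changes do not affect the result, which is the "resistant to minor changes" property asked for above.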

The trivial branch is here:

https://fossil-scm.org/fossil/timeline?r=markdown-header-ids

If there are objections or suggestions for a more suitable ID-generation algorithm, i'm all ears.

Potential maybe-kinda-possibly-issues:

  1. This applies to all markdown-to-HTML, so also tickets and forum posts.
  2. HTML does not strictly allow multiple elements to have the same ID value, but internally keeping track of those, and finding a human-reproducible algorithm to produce collision-free IDs for two sufficiently-similar headings, would require a... ah... disproportional amount of effort. The browser won't break on duplicate IDs, but links to them will of course only point to one or the other of the duplicates.

misref = Tables of Contents. i don't yet know if the internal markdown structure would easily support this, but i'd eventually like to add a #pragma toc feature which injects an auto-generated table of contents from all headers.



(2) By Warren Young (wyoung) on 2019-12-13 13:26:57 in reply to 1 [link] [source]

Another great-yet-small feature!

I haven't tried this feature yet, but I have been wanting such a feature for a long time. It'll save me from having to write all those manual <a name="foo"></a> tags in my Markdown source files, which is not only a pain, it uglifies the Markdown source for those reading it as plain text.

I especially like the fact that you took the time to make it use fragment identifiers on the <hX> tag, not add <a name> tags immediately before or after the header. I use the latter hack only because I don't have access to the former in the Markdown source.

Thank you, Stephan!

On your point #2, you can cheat the multiple ID problem by adding name="bar" for the second link target. Fragment identifiers in URLs will find those, too, and unlike id, they're not expected to be unique in HTML. I often do this in manually-written HTML when I need an alias for the "real" ID.

(3) By Stephan Beal (stephan) on 2019-12-13 13:34:32 in reply to 2 [link] [source]

On your point #2, you can cheat the multiple ID problem by adding name="bar" for the second link target. Fragment identifiers in URLs will find those, too

Strangely enough...

HTML5 has obsoleted the <A NAME=...> pattern for some reason, suggesting using IDs instead. Curiously, name is only obsoleted for <A> tags, but not others. According to Mozilla, the NAME attribute is not legal on <A> tags or heading tags. They claim it's only intended for:

<button>, <form>, <fieldset>, <iframe>, <input>,
<keygen>, <object>, <output>, <select>, <textarea>,
<map>, <meta>, <param>

(4) By Warren Young (wyoung) on 2019-12-13 14:14:50 in reply to 3 [link] [source]

For aliases, the non-kosher use of name in h1 etc. doesn't bother me greatly. Link creators should prefer the id value, but as long as browsers support jumping to name, I say feel free to make use of it.

(5) By anonymous on 2019-12-14 01:02:13 in reply to 1 [link] [source]

i don't yet know if the internal markdown structure would easily support this, but i'd eventually like to add a #pragma toc feature

I think it would just be a matter of adding the needed extra array and other variables to collect header info during the block structure pass then insert the TOC during the in-line pass. (Though, if Fossil's Markdown parser is single pass, some provision for resolving forward references, like TOC, would be needed.)
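
A rough sketch of that two-pass idea, in Python and independent of Fossil's actual parser: the first pass collects heading info into a list, the second renders a table of contents from it. The function names are hypothetical, and the regex assumes that a heading needs at least one space after the `#` characters, which is not necessarily how Fossil's parser behaves.

```python
import re

# Assumed heading syntax: 1-6 '#' characters, at least one space, then the text.
HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")

def collect_headings(lines):
    """First pass: record (level, text) for every heading line."""
    found = []
    for line in lines:
        m = HEADING.match(line)
        if m:
            found.append((len(m.group(1)), m.group(2)))
    return found

def render_toc(headings):
    """Second pass: emit one indented list entry per collected heading."""
    return ["  " * (level - 1) + "- " + text for level, text in headings]

doc = ["# Intro", "some text", "## Setup", "#pragma toc"]
print(render_toc(collect_headings(doc)))
```

With the space requirement in place, a line such as `#pragma toc` is not mistaken for a heading.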

Of course, #pragma collides with Fossil's header processing and your requested #hashtags feature. (Though I'd expect those collisions to only occur when discussing the #pragma feature, so maybe disallowing #pragma as #hashtag won't cause more than a few small problems.)

(6) By Stephan Beal (stephan) on 2019-12-14 01:37:27 in reply to 5 [link] [source]

then insert the TOC during the in-line pass.

A cursory glance reveals comments which say:

  /* first pass: looking for references, copying everything else */
 ...
 /* second pass: actual rendering */

Other than that i can't say much about it. There's no high-level DOM-ish structure involved, just tons of Blob objects (byte arrays/buffers).

Of course, #pragma collides with Fossil's header processing and your requested #hashtags feature.

It's my hope that we can all eventually agree that header lines require a space after the # but, to be honest, i'm in no particular hurry to dive in to the markdown processor code :/. Since markdown allows some measure of HTML-like markup, the TOC could potentially be injected using a tag like the one the old GoogleCode format used: <wiki:toc max_depth="number"></wiki:toc>

(7) By Stephan Beal (stephan) on 2019-12-15 14:38:08 in reply to 1 [link] [source]

Last call:

Each heading line now adds an automatically-generated ID attribute to the <H#> element

If there are no objections to this addition i'll merge it in the next day or so.

(8) By Florian Balmer (florian.balmer) on 2019-12-15 14:38:39 in reply to 7 [link] [source]

My comments, as requested ;-)

  • I don't think this feature is as small and trivial to implement as it may seem at first sight, since any future changes may potentially break a lot of hyperlinks.

  • The auto-generated IDs should really be unique, since duplicate IDs are invalid in HTML, and misleading and confusing to users. Duplicate IDs could be dropped, as a hint to the markdown author to come up with a more distinct heading, or a sequential number could be appended.

  • I think it's not necessary to drop non-ASCII characters from the auto-generated IDs, and this seems to work even without URL-encoding the IDs in the "id" attribute, or in the "href" attribute of links to the heading. (Think of Russian or Chinese users, for example.) Also, after removing leading and trailing spaces, (one or more) embedded spaces (or tabs) could be replaced with a single dash to improve readability (i.e. #this-is-a-hash instead of #thisisahash).

  • The auto-generated IDs should have their own "namespace" (their own common prefix) to avoid collisions with Fossil's internal IDs, as markdown documents from wiki pages linked to branches or check-ins may be displayed inside timeline views, for example. Since the feature is somewhat "obtrusive", I'd vote for it to be optional, if the auto-generated IDs are not isolated in their own namespace. (The best solution seems to be some markdown meta-statement to explicitly trigger the ID auto-generation, but this may be impossible to implement while preserving markdown compatibility).

  • As a markdown author, I'd like to "preview" the auto-generated IDs, so I can copy them to clipboard, and don't need to guess them, each time. For example they could be displayed (small and colored) next to the headings on the Preview page when saving wiki documents or forum posts.

  • As an alternative to the previous point, the auto-generated IDs could be made visible on the final document, for example through small clickable # signs displayed before or after the heading, or similar, so readers can also see them and use them for their own bookmarks.

(9) By Stephan Beal (stephan) on 2019-12-15 15:20:19 in reply to 8 [link] [source]

Good evening, Florian!

I don't think this feature is as small and trivial to implement as it may seem at first sight, since any future changes may potentially break a lot of hyperlinks.

That's the user's problem. The feature is intended to simplify the most basic cases, not to out-guess the user or completely take over management of anchors. Those who need specific, immutable anchors have to add them with <A> tags, like they've been doing so far.

The auto-generated IDs should really be unique, since duplicate IDs are invalid in HTML, and misleading and confusing to users. Duplicate IDs could be dropped, as a hint to the markdown author to come up with a more distinct heading, or a sequential number could be appended.

Three points:

  1. It seems unlikely that many users have the exact same header text in more than one header on the same page. Not at all impossible, but unlikely.

  2. It's effectively impossible to create an algorithm which can avoid duplicates and be performed by the average user on the fly in their head. We can't expect users to go through a document and count the number of collisions each time they want to type #ananchorname, so a sequential number isn't, IMO, a real usability improvement.

  3. Catching duplicates requires keeping track of every header encountered so far, making the patch far less trivial. We also don't know if there are dupes until we encounter one, so the first anchor would be called #foo, not #foo1, but the second one would be #foo2, which seems inconsistent to me.

But this does point out a potential bug in my patch: the generated IDs should have a prefix which makes them unique even if the user adds their own anchor tags with similar IDs. i'll change the algo to prefix each one with "header-" or "md-" or something similar (no, i won't make the prefix user-configurable ;)).

I think it's not necessary to drop non-ASCII characters from the auto-generated IDs

The reasons it skips non-ASCII are:

  1. We read the header byte-by-byte, instead of character-by-character, and it would be bad to output partial characters to the ID.

  2. We have no code in fossil to determine if arbitrary non-ASCII characters are alphanumeric, and it would probably be a bad idea to allow strange Unicode punctuation in the ID. e.g., Unicode defines several different types of spaces.

As a markdown author, I'd like to "preview" the auto-generated IDs, so I can copy them to clipboard, and don't need to guess them, each time. For example they could be displayed (small and colored) next to the headings on the Preview page when saving wiki documents or forum posts.

To do that here would require either implementing the IDs in JavaScript (which is way out of scope for this patch) or that the user submit the page so that the server can generate the IDs (which is something i specifically want to avoid).

As an alternative to the previous point, the auto-generated IDs could be made visible on the final document, for example through small clickable # signs displayed before or after the heading, or similar, so readers can also see them and use them for their own bookmarks.

Avoiding JavaScript, and similar complexities, is one of the goals of this patch. It would, of course, be feasible to extend this support to include such a feature, but that's far beyond the scope of what i'm implementing because it's far beyond the scope of what feature i want right now in my own docs (where i'm porting nearly 200 pages of gdocs into markdown). If anyone wants to extend its functionality, they have my blessing to do so :).

(24) By anonymous on 2019-12-16 22:52:17 in reply to 9 [link] [source]

Three points: 1 It seems unlikely that many users have the exact same header text in more than one header on the same page. Not at all impossible, but unlikely.

What about:

 # Option 1
   ## Benefits
   ## Drawbacks
   ## Discussion

 # Option 2
   ## Benefits
   ## Drawbacks
   ## Discussion

 # Option 3
   ## Benefits
   ## Drawbacks
   ## Discussion

So three main headers with the same set of subheaders. I use this form often in documentation.

Now I can do:

  # Option 3
      intro text describing option 3
    ## Option 3 - Benefits
       text describing benefits of option 3
etc. But that is kind of a pain if the main sections aren't as short as "Option X". Also it makes for a repetitive table of contents.

(25) By Stephan Beal (stephan) on 2019-12-16 23:18:10 in reply to 24 [link] [source]

So three main headers with the same set of subheaders. I use this form often in documentation.

i admittedly often feel tempted to as well, but i always (pedantically) use the formulation you show at the bottom for the very reason that it's unambiguous, when i see the header, exactly which "Benefits" section i'm looking at. That's obviously a matter of personal pedantic taste, though - i suspect that more people prefer to use the first formulation you show.

In any case, i've dropped work on this feature for the time being based on the feedback about the pandoc implementation, as such an implementation would be much more useful/solid (but that isn't a code rabbit hole i'm currently willing to crawl down myself).

(26) By Warren Young (wyetr) on 2019-12-16 23:47:51 in reply to 24 [link] [source]

I've got a doc I'm working on right now that uses the first style for its headers. It's a series of product evals, where each top level header is one product, but under each one is the same set of sub-heads: Characteristics, Good Points, Bad Points, etc.

You could recall the nesting level into the IDs, but that adds complexity to the implementation: id="option-3-benefits"
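
A small Python sketch of what folding the parent headings into the ID could look like. This is only an illustration of the idea, not anything from the Fossil branch; `slug` and `nested_ids` are made-up names, and the slug rule (runs of non-alphanumerics become one hyphen) is an assumption.

```python
import re

def slug(text: str) -> str:
    """Lowercase, and turn runs of non-alphanumerics into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

def nested_ids(headings):
    """headings: list of (level, text) pairs in document order.
    Returns one ID per heading, prefixed with the slugs of its ancestors."""
    stack = []  # slugs of the currently open headings, outermost first
    ids = []
    for level, text in headings:
        del stack[level - 1:]  # close headings at this level or deeper
        stack.append(slug(text))
        ids.append("-".join(stack))
    return ids

print(nested_ids([(1, "Option 3"), (2, "Benefits")]))
```

The extra state here is the stack of open headings, which is the added implementation complexity mentioned above.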

(27) By Stephan Beal (stephan) on 2019-12-17 00:03:01 in reply to 26 [link] [source]

You could recall the nesting level into the IDs, but that adds complexity to the implementation: id="option-3-benefits"

It also breaks links if the doc is restructured, which is exactly the broken feature of GDocs which compelled me to finally migrate my docs to markdown :/.

(10) By Warren Young (wyoung) on 2019-12-16 04:10:20 in reply to 8 [link] [source]

duplicate IDs are invalid in HTML

Sure, but do they actually break browsers? I'd expect a reference to a duplicate ID to always refer to the first or last such ID on the page, depending on how the browser parses it.

And when that happens, the fix is easy: remove or change the conflicting ID.

One of the charms of this feature is that it didn't require long IDs to create a disambiguation. Consequently, I'm not very happy with this "header-" prefix. I liked it better without.

Overall, I want this feature to let me strip all of my explicit named anchors out of my Markdown documents. Then there will be no ambiguous IDs, because as Stephan says, there's basically zero chance I'll reuse the same header within one document.

On that point, keep in mind that all of this applies only within a single document. IDs don't have to be unique across the whole repo.

I'm not even certain IDs have to be entirely unique within a doc, because CSS selectors let you scope the ID search. Surely h1#foo is different from h2#foo?

own "namespace"

Aside from the user's own namespace, which I've argued away above, what other auto-generated IDs are there? What is this feature competing with?

(11) By Stephan Beal (stephan) on 2019-12-16 04:17:43 in reply to 10 [link] [source]

One of the charms of this feature is that it didn't require long IDs to create a disambiguation. Consequently, I'm not very happy with this "header-" prefix. I liked it better without.

i admittedly have mixed feelings about it, but am, on the whole, ambivalent. Convince me to remove it and i will (or remove it yourself and i won't bother to replace it ;)).

(12) By Warren Young (wyoung) on 2019-12-16 04:20:10 in reply to 11 [link] [source]

If your mind cannot be eased on the ID uniqueness question, how about a 3-way repo-level setting: off, unadorned, and prefixed?

(13) By Stephan Beal (stephan) on 2019-12-16 04:38:58 in reply to 12 [link] [source]

i'd rather it be removed than add a config option for it - that's far more complexity than i care to invoke. i'm not at all averse to "header-" being removed. It was bugging me a bit 8 hours ago, but now i'm completely fine with either keeping or dropping the prefix.

Related/sidebar: this might be an interesting case for introducing a #pragma markdown function, but adding such a thing is far more complexity than i care to invoke for the header auto-IDs:

#pragma header-ids header-
#pragma header-ids on
#pragma header-ids off

Maybe something to consider for later.

(14) By Florian Balmer (florian.balmer) on 2019-12-16 07:48:43 in reply to 10 [link] [source]

I don't think duplicate IDs would break anything -- unless they collide with Fossil's internal IDs for documents embedded in timeline views, for example (hence the requirement for a separate "namespace").

Given that Markdown is such a widely-used and general-purpose format, I think the current non-uniqueness- and non-Unicode-aware solution is a bit too simple and too specific to stephan's current use case.

Consider a very simple example with two headings:

# Sample ①
# Sample ②

This adds unnecessary and useless garbage to the final HTML document (i.e. the duplicate ID "heading-sample").

(15) By Stephan Beal (stephan) on 2019-12-16 13:29:36 in reply to 14 [link] [source]

Given that Markdown is such a widely-used and general-purpose format, I think the current non-uniqueness- and non-Unicode-aware solution is a bit too simple and too specific to stephan's current use case.

Consider a very simple example with two headings:

# Sample ①
# Sample ②

i agree completely that it "would be nice" to support non-ASCII, but we lack the UTF-level infrastructure to do that properly. Rather than add UTF-related infrastructure for determining whether ② is a legal/sensible character for an anchor, the only "correct" solution right now is to always answer that question with "no". Unicode is a messy topic, and this tiny feature is truly not worth the trouble of involving full Unicode handling.

This feature is a convenience for those who want to make use of it, not a requirement for anyone writing Markdown in fossil. There's nothing stopping a user from manually adding an anchor:

<a id="sample②"></a>

And manually linking to it:

See [Sample ②](#sample②)

That's the approach which "hard core" authors/editors are taking right now, so nothing changes for them.

This adds unnecessary and useless garbage to the final HTML document (i.e. the duplicate ID "heading-sample").

Probably 90-95% of the CSS loaded for any given page is "useless garbage," in the sense that it's very probably not used on that page (check the CSS for this forum post - it includes diff-view CSS, timeline-specific CSS, etc.). Adding 1 more ID per heading element, only on pages which could use them, doesn't seem like a high cost to me.

As Warren said, a duplicate ID is harmless - the browser will jump to one or the other anchor (but which one is undefined), which is an indication to the author that they have a duplicate, and they have multiple ways of resolving that (change a header or manually add an anchor with a name of their choosing).

(16) By Stephan Beal (stephan) on 2019-12-16 13:38:38 in reply to 14 [link] [source]

HTML5 allows nearly any character in an element ID:

https://html.spec.whatwg.org/multipage/dom.html#the-id-attribute

Even this garbage is legal:

<div id="div>p+p:last-child"></div>

(!!!)

However, creating links to such anchors requires escaping certain characters. If we allow non-alphanum/underscore/dash characters in the auto-generated IDs, it would be up to the user to learn/remember which characters need to be escaped in their links to those IDs, which would make the usability of the feature a disaster. Few users would bother to remember which characters they need to escape, but any user can remember, when creating links, that "only alpha-numeric ASCII characters are legal" [for purposes of linking to the auto-generated anchor IDs].
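
To illustrate the escaping burden, here is a hedged Python sketch using the standard library's `urllib.parse.quote`. It is deliberately conservative: it percent-encodes everything except ASCII letters, digits and `_`, `.`, `-`, `~`, which is more than browsers may strictly require in a fragment. The helper name `fragment_for` is made up.

```python
from urllib.parse import quote

def fragment_for(element_id: str) -> str:
    """Build a '#...' link target, percent-encoding everything except
    ASCII letters, digits and the characters '_', '.', '-', '~'."""
    return "#" + quote(element_id, safe="")

print(fragment_for("myheader"))            # #myheader
print(fragment_for("div>p+p:last-child"))  # #div%3Ep%2Bp%3Alast-child
```

An ID made only of ASCII alphanumerics passes through unchanged, so the author never has to think about escaping.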

(18) By Warren Young (wyoung) on 2019-12-16 17:46:56 in reply to 14 [link] [source]

I assume Fossil has some kind of in-memory map/dict/hash data structure within it already. What if, while generating IDs, you just remember which ones you've already generated, and if there's already one like the one you propose to insert into the output HTML, you add a serial number and bump the count in the map to match?

Perl pseudocode:

my %ids;    # map of header ID strings to count of uses, starting at 1

while (not done) {
     if (is_header($element)) {
         my $id = normalize_id($element->header->name);
         if (exists $ids{$id}) {
             $id = $id . '-' . ++$ids{$id};
         }
         else {
             $ids{$id} = 1;
         }
         insert_header_into_doc($element->header, $id);
     }
     elsif ...
}

Thus Florian's example would produce:

<h1 id="sample">...
<h1 id="sample-2">...

Not as elegant as somehow magically doing the "right thing" and producing "sample-1" for the first, but there's got to be some limit here.
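
For comparison, the same bookkeeping as a small runnable Python sketch (an illustration only, not Fossil code; `unique_ids` is a made-up name). It takes already-normalized IDs and numbers the repeats, leaving the first occurrence bare, as in the output above.

```python
def unique_ids(base_ids):
    """Append -2, -3, ... to repeated IDs; the first occurrence stays bare."""
    seen = {}  # base ID -> number of times it has been used so far
    result = []
    for base in base_ids:
        count = seen.get(base, 0) + 1
        seen[base] = count
        result.append(base if count == 1 else f"{base}-{count}")
    return result

print(unique_ids(["sample", "sample", "intro", "sample"]))
# ['sample', 'sample-2', 'intro', 'sample-3']
```

This sketch does not guard against a generated "sample-2" colliding with a heading whose own ID is literally "sample-2".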

(19) By Stephan Beal (stephan) on 2019-12-16 18:30:07 in reply to 18 [link] [source]

That data structure is called sqlite ;). There is a Bag class for tracking integers (RIDs) but not, AFAIK, strings/void pointers. Using the db for that would be an extreme level of overkill.

(20) By Warren Young (wyetr) on 2019-12-16 19:00:32 in reply to 19 [link] [source]

There's Th_HashFind() and friends...

(17) By Stephan Beal (stephan) on 2019-12-16 13:52:10 in reply to 10 [link] [source]

Consequently, I'm not very happy with this "header-" prefix. I liked it better without.

After having slept on it, i agree, and have backed out that change (my first-ever time using merge --backout - what a handy feature!). The prefix is over-engineering. KISS.

(21) By Joel Dueck (joeld) on 2019-12-16 19:52:01 in reply to 1 [link] [source]

I like how Pandoc implements this feature. I believe Github's Markdown follows the same algorithm. It would be worth matching this as closely as possible.

In particular the only thing I really have an opinion on is the step where spaces and newlines are converted to hyphens. I think this is a good idea.

(22) By Warren Young (wyetr) on 2019-12-16 21:19:30 in reply to 21 [link] [source]

It seems sensible. Their points with my commentary:

  1. Remove all formatting, links, etc. — This can wait for someone to run into it.

  2. Remove all footnotes. — Ditto.

  3. Remove all non-alphanumeric characters, except underscores, hyphens, and periods. — Sensible. To the extent that the current algorithm differs, we should standardize on this.

  4. Replace all spaces and newlines with hyphens. — Out of order: this needs to be the prior step, else it's a no-op. Otherwise, it's sensible.

  5. Convert all alphabetic characters to lowercase. — Sensible. It might be better done as part of step 3 where it already checks for alphabetic characters, but that's an implementation detail.

  6. Remove everything up to the first letter — Excellent addition. Fossil's own docs have things like "2.8 Test Before Commit". This should end up as id="test-before-commit"

  7. If nothing is left after this, use the identifier section. — Mmm...I'd want a serial number after that. section-1, section-2, etc.
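
A rough Python sketch of steps 3 through 7 in the order suggested above (spaces to hyphens before the character filter). It is not Pandoc's implementation: steps 1 and 2 are skipped, only ASCII is handled, runs of whitespace are collapsed to a single hyphen, and the name `pandoc_style_id` is made up.

```python
import re

def pandoc_style_id(heading: str) -> str:
    """Approximate the listed steps for a plain-text, ASCII-only heading."""
    text = re.sub(r"\s+", "-", heading.strip())   # step 4, done first
    text = re.sub(r"[^A-Za-z0-9_.-]", "", text)   # step 3
    text = text.lower()                           # step 5
    text = re.sub(r"^[^a-z]+", "", text)          # step 6
    return text or "section"                      # step 7

print(pandoc_style_id("2.8 Test Before Commit"))  # test-before-commit
```

A serial number for the "section" fallback, as suggested in point 7, would need the same kind of duplicate tracking as any other repeated ID.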

(23) By Stephan Beal (stephan) on 2019-12-16 21:53:57 in reply to 22 [link] [source]

It seems sensible. Their points with my commentary:

i agree that all of that is sensible, but my itch for this feature isn't nearly itchy enough to inspire me to implement it :/. i'll put this branch on hold and wait until either that itch reaches critical mass or someone with a greater itch than mine implements it. It wouldn't make sense to use the current implementation now and then swap it with a different algo later, breaking any links people have based on it.

Remove all formatting, links, etc. — This can wait for someone to run into it.

That is a point i hadn't considered and would have soon tripped over - i commonly use backticks on symbol names in headers.

(28) By Florian Balmer (florian.balmer) on 2019-12-17 14:27:12 in reply to 21 [link] [source]

Some more reasoning, structured with headings, for the sake of the topic :-)

Unique IDs

The Pandoc-way of appending -1, -2, etc. to IDs to keep them unique seems reasonable. However, in the Fossil Forum, multiple independently-parsed Markdown fragments are usually displayed on the same page, and even if the unique ID numbering were performed across all fragments, the order of fragments might change (new replies to higher-level posts, chronological view), resulting in unstable IDs.

Therefore, and due to potential conflicts of IDs with Fossil's internal IDs in timeline views for branch-linked wiki pages, I think such a feature should only be enabled for "true" wiki pages and embedded /doc pages, and not for branch-linked wiki pages, nor for forum posts.

Length Limit

A missing point from the Pandoc algorithm may be to set some length limit (256 characters?), in case headings are "misused" to print whole paragraphs enlarged and bold.

UTF-8 Processing

My first thought to use SELECT lower(%Q) and similar to process UTF-8 strings does not seem to work unless SQLite is built with the ICU extension.

The Fossil source code file src/unicode.c has the following handy functions, probably mostly used to deal with Unicode in regular expressions:

  • unicode_isalnum()
  • unicode_remove_diacritic()
  • unicode_is_diacritic()
  • unicode_fold()

Javascript

With a client-side Javascript solution, it may be much easier to get the "plain text" of a heading, with all embedded Markdown and HTML markup stripped. However, I'm not sure if it's possible to hook-in early enough in the DOM loading process to make this work (i.e. have the browser really jump to the dynamically generated anchors specified in URLs), but probably yes.

Javascript comes with Unicode support, and the same single solution could be used with both Markdown and Fossil Wiki documents.

A disadvantage with a Javascript solution may be that anchors won't work in "no-script" environments (as is currently the case with the Fossil Forum, i.e. the current post is not scrolled into view if Javascript is disabled/unsupported).

(29) By Stephan Beal (stephan) on 2019-12-17 17:57:16 in reply to 28 [link] [source]

However, in the Fossil Forum, multiple independently-parsed Markdown fragments are usually displayed on the same page,

i would like to see a parsing option which allows us to disable such generation in certain contexts. Specifically, adding such links in forum posts seems like a bad idea to me, and not useful for the general case. In pseudocode, something like:

fossil_markdown_blob( inputBlob, outputBlob, /*flags*/ MD_NO_HEADING_IDS );

I'm not sure if it's possible to hook-in early enough in the DOM loading process to make this work (i.e. have the browser really jump to the dynamically generated anchors specified in URLs)

Element.scrollIntoView() can be used to scroll at any time. So the JS would just need to be able to formulate a CSS selector so that it could find the right element, then call scrollIntoView() on it.

(30) By Florian Balmer (florian.balmer) on 2020-01-09 15:28:26 in reply to 29 [link] [source]

Here's a refreshingly simple plain-Javascript ToC script, with a handy helper function to find all heading elements (properly sorted, and optionally only inside a specified container element), that looks like a good inspiration:

In Fossil, a custom Javascript module containing the functionality could be loaded on demand (i.e. only for "true" wiki pages, but not for forum posts, nor for branch-linked wiki pages). The linked script relies on existing ID attributes, but once there's a sorted (for continuous numbering) array of heading elements, auto-generating the missing-only IDs (for compatibility with existing "Fossil markup" pages, where heading elements may already have IDs manually added by the author) seems not too complicated (i.e. following the Pandoc algorithm with the tags-stripped element.innerText available on IE6+, and Javascript's Unicode-aware regex engine). Also, the part to display the ToC with clickable links could be omitted, or made optional, if all that's needed for now is the auto-generated IDs.

(33) By anonymous on 2020-01-16 17:50:01 in reply to 29 [link] [source]

However, in the Fossil Forum, multiple independently-parsed Markdown fragments are usually displayed on the same page,

i would like to see a parsing option which allows us to disable such generation in certain contexts. Specifically, adding such links in forum posts seems like a bad idea to me

This is also being discussed on CommonMark:

Another issue with automatic IDs is that they may clash with other IDs on the page. Imagine two posts in a forum topic having the same header text. Now, suppose the first post is deleted. The order of the IDs would change and any links to the second post would break. As a solution, the parser could accept an optional namespace parameter that would be added to the start of the ID. The ID of the header “The Philosophy of CommonMark” would become #discourse-topic-115-post-40-the-philosophy-of-commonmark, for example.

Assuming replies to Fossil forum post don't get renumbered, perhaps generate IDs like "#r28-unique_ids-1".

This "reply prefix" could also be used with manually specified IDs, too.
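
A minimal Python sketch of the namespace-parameter idea from the CommonMark discussion quoted above. The function name and the slug rule (runs of non-alphanumerics become one hyphen) are assumptions for illustration, so the result uses hyphens where the "#r28-unique_ids-1" suggestion uses an underscore.

```python
import re

def namespaced_id(namespace: str, heading: str) -> str:
    """Prefix a heading's slug with a caller-supplied namespace."""
    slug = re.sub(r"[^a-z0-9]+", "-", heading.lower()).strip("-")
    return f"{namespace}-{slug}"

print(namespaced_id("discourse-topic-115-post-40", "The Philosophy of CommonMark"))
# discourse-topic-115-post-40-the-philosophy-of-commonmark
```

The ID is only as stable as the namespace passed in, which is why the question of whether reply numbers are stable matters.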

(35) By anonymous on 2020-01-22 18:58:27 in reply to 33 [link] [source]

Assuming replies to Fossil forum post don't get renumbered,

Unfortunately, editing a post does renumber it. In:

https://fossil-scm.org/forum/forumpost/3492ee83cb

wyetr's post was originally #2, but after editing, became #6.

I think this behavior is confusing.

While I understand that calculating the sequence ID on-the-fly is more efficient than storing it, I did notice that 2 was skipped when the thread was displayed, and that the reply to the original #2 still referred to it as 2. This suggests that the way Fossil numbers the replies might result in stable sequence IDs.

In some ways, a forum post should be treated like a ticket: Certain meta data, like the original post ID, post date and sequence ID, should "override" the corresponding values in "edit artifacts". (The date of the edit artifact could be displayed as "Edited date".)

Unfortunately, I can't, right now, think of a way to do this without adding a card to the forum post artifact. Though, I think the K card could be re-used because "ticket ID" and "post ID" are similar in purpose.

(36) By Stephan Beal (stephan) on 2020-01-22 19:02:56 in reply to 35 [link] [source]

Keep in mind that this forum software uses a distributed data model - the same one fossil uses for its SCM features. This means that stable sequential IDs are an impossibility (that's why DVCSs use hash codes, instead of version numbers, to refer to versions).

(39) By anonymous on 2020-01-22 20:23:08 in reply to 36 [link] [source]

Keep in mind that this forum software uses a distributed data model

I know. (see my reply to wyetr).

Unfortunately, being pioneer on a new frontier is often like opening a can of worms. Hopefully, it is not a can of wyrms.

(37) By Warren Young (wyetr) on 2020-01-22 19:05:15 in reply to 35 [link] [source]

wyetr's post was originally #2, but after editing, became #6.

Post #2 is still there, it's just being hidden by post #6.

Fossil forum posts use the same indelible artifact blockchain as most other content Fossil stores. My post #2 didn't go away any more than does fossil amend destroy data.

How that affects your actual point here, I'm not certain. Maybe it's just irrelevant pedantry. Still, it's best to think clearly about these things.

(38) By anonymous on 2020-01-22 20:19:22 in reply to 37 [link] [source]

Post #2 is still there, it's just being hidden by post #6.

I know that.

I'm just not sure why it was still counted. One possibility is that it was counted because another post was in reply to it.

I haven't had a compelling reason to set up a forum. Though, curiosity might nudge me enough to set one up to experiment with.

Anyway, I was following up on the viability of the sequence numbers in generating anchors for links.

Obviously, the hash ID would be the "safest", though even that has problems. One being that using the hash is less user friendly than the sequence number.

Absent a "post ID" (like a ticket ID), an anchor generated from the hash ID would have to be generated when the post is submitted. But that would alter the hash. Otherwise, the hash part of the anchor would change with every edit. (Of course, there's risk the non-hash part could change, but that's an established risk that's easier to live with.[1])

I'm pretty sure Fossil's forum is one of the first - if not the first - distributed forum. It does bring new issues to light.

While some might argue that any generated anchors should be restricted to a thread's original post, I've seen many cases where an individual reply was (legitimately) epical-ly long.


[1] While manually specified anchor IDs avoid most of the problems with generated IDs, there are reasons that make prepending a prefix (related to the article/post) to the manually specified ID a good practice - especially in a forum.

(31) By anonymous on 2020-01-16 00:51:14 in reply to 28 [link] [source]

The Pandoc-way of appending -1, -2, etc. to IDs to keep them unique seems reasonable.

I agree.

However, in the Fossil Forum, multiple independently-parsed Markdown fragments are usually displayed on the same page ... resulting in unstable IDs.

Note that Pandoc supports a notation for specifying IDs:

    # header {#ID}

This notation also supports specifying classes and other attributes.

Probably should limit Fossil's handler to IDs and maybe classes.
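
To make the ID-only restriction concrete, here is a rough sketch of what recognizing that notation could look like. The function name parseHeading and the accepted ID character set are assumptions made for illustration; this is neither Pandoc's nor Fossil's actual parser, and it deliberately ignores classes and other attributes.

```javascript
// Hypothetical sketch: split an ATX heading line into level, text and an
// optional explicit ID written as a trailing {#ID}. Only IDs are handled;
// classes and key=value attributes are intentionally not supported.
function parseHeading(line) {
    // Assumed ID syntax: a letter followed by letters, digits, '_' or '-'.
    var m = /^(#{1,6})\s+(.*?)\s*(?:\{#([A-Za-z][\w-]*)\})?\s*$/.exec(line);
    if (!m) {
        return null; // not a heading line
    }
    return { level: m[1].length, text: m[2], id: m[3] || null };
}

console.log(parseHeading('# header {#ID}'));
// { level: 1, text: 'header', id: 'ID' }
console.log(parseHeading('## Plain heading'));
// { level: 2, text: 'Plain heading', id: null }
```

A heading without an explicit ID yields id: null, which leaves the caller free to fall back to a generated ID or to emit none at all.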

(And yes, I agree that auto ID generation should be suppressed for forum posts and branch-linked pages.)

(32) By anonymous on 2020-01-16 01:34:01 in reply to 31 [link] [source]

Forgot to mention that CommonMark is considering adopting this as an extension. (Currently, they are still working on completing v1.0 of the core specification. Meanwhile, discussions on various extensions are taking place in their forums.)

(34) By anonymous on 2020-01-16 19:18:49 in reply to 31 [link] [source]

Also, Markdown Extra supports this notation for specifying IDs, so it seems that CommonMark will eventually adopt this, too.

(40) By Stephan Beal (stephan) on 2020-02-17 02:06:52 in reply to 28 [link] [source]

Back to the topic of generating anchors to headers in markdown...

Since Offray posted about Markdeep last night, i've been toying with it and they take a novel approach to generating the anchor names which works even when the section names are repeated.

For example:

# API: Fetching JSON Content
## Data Type 1
## Data Type 2
# API: Saving JSON Content
## Data Type 1
## Data Type 2
...

The anchor names it generates are a normalized form of each header, basically as suggested in the top post, but they go one step further and build a path of names, where each level of heading introduces a new path element, just like a directory tree. So the anchors for the above might look like:

<a id="apifetchingjsoncontent"></a>
<a id="apifetchingjsoncontent/datatype1"></a>
<a id="apifetchingjsoncontent/datatype2"></a>
<a id="apisavingjsoncontent"></a>
<a id="apisavingjsoncontent/datatype1"></a>
<a id="apisavingjsoncontent/datatype2"></a>

So, even though both sections have "Data Type N" sub-sections, the generated anchor names are distinct for each one.

Heading levels 3+ add additional path elements to the anchors.
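
A minimal sketch of that path-building idea, assuming the headings have already been extracted as {level, text} pairs. The names normalize and pathAnchors are made up for this example, and this is not Markdeep's actual code; the normalization is simply the "lower-case ASCII alphanumerics" rule from the top post.

```javascript
// Hypothetical helper: keep only lower-cased ASCII letters and digits.
function normalize(text) {
    return text.toLowerCase().replace(/[^a-z0-9]/g, '');
}

// headings: array of {level: 1..6, text: string}, in document order.
// Returns one anchor name per heading, built like a directory path.
function pathAnchors(headings) {
    var path = [];
    return headings.map(function (h) {
        // Drop path elements at this level or deeper, then add this heading.
        // Assumption: a skipped level simply contributes no path element.
        path = path.slice(0, h.level - 1);
        path.push(normalize(h.text));
        return path.join('/');
    });
}

console.log(pathAnchors([
    { level: 1, text: 'API: Fetching JSON Content' },
    { level: 2, text: 'Data Type 1' },
    { level: 1, text: 'API: Saving JSON Content' },
    { level: 2, text: 'Data Type 1' }
]));
// [ 'apifetchingjsoncontent', 'apifetchingjsoncontent/datatype1',
//   'apisavingjsoncontent', 'apisavingjsoncontent/datatype1' ]
```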

(41) By D. Bohdan (dbohdan) on 2020-09-26 10:15:04 in reply to 40 [link] [source]

I think this is a good approach. In my client-side ToC script for Fossil I independently arrived at something similar. I'll describe my thinking in some detail in case it helps.

I started looking for a better way to generate fragment ids when I realized that just avoiding id collision the way GitHub did was not enough. If you have a Markdown document structured like

# Top
## Red
### Foo
## Green
### Foo

GitHub, for example, will give the two "Foos" the ids foo and foo-1, respectively. While this id generation scheme does prevent collisions, it is brittle in an important way. If someone links to your-document#foo-1 with "Top" 🡒 "Green" 🡒 "Foo" in mind, and you swap the sections "Red" and "Green", their link will instead lead to "Top" 🡒 "Red" 🡒 "Foo". One can reasonably want to reorder sections in a document and also to have sections with subsections carrying the same title. A good id scheme should not break links when these two things occur together.
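
The brittleness is easy to reproduce with a simplified counter-suffix scheme. The sketch below is an approximation written for illustration (slug and suffixIds are made-up names), not GitHub's real implementation, which handles more edge cases.

```javascript
// Hypothetical slug: lower-case, runs of non-alphanumerics become '-'.
function slug(text) {
    return text.toLowerCase().replace(/[^a-z0-9]+/g, '-').replace(/^-+|-+$/g, '');
}

// Assign ids in document order; repeated slugs get -1, -2, ... appended.
function suffixIds(titles) {
    var seen = new Map();
    return titles.map(function (t) {
        var base = slug(t);
        var n = seen.get(base) || 0;
        seen.set(base, n + 1);
        return n === 0 ? base : base + '-' + n;
    });
}

console.log(suffixIds(['Top', 'Red', 'Foo', 'Green', 'Foo']));
// [ 'top', 'red', 'foo', 'green', 'foo-1' ]   (foo-1 is Green's Foo)
console.log(suffixIds(['Top', 'Green', 'Foo', 'Red', 'Foo']));
// [ 'top', 'green', 'foo', 'red', 'foo-1' ]   (foo-1 is now Red's Foo)
```

The id foo-1 exists in both orderings, so an old link still resolves, just to the wrong section.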

In my script I address this problem by thinking of headings as a tree. Each heading gets an id based on its "path": the headings with a lower <h[1-6]> number that, in a tree, you would traverse to get to it. To generate the id for a heading, I "slugify" (see below) its text and the text of its parents in the tree, deriving strings that can't contain --, then join these strings with --, parents first. If any level of heading is missing (for example, you have an <h3> followed directly by an <h6>), I treat the missing headings as present but containing an empty string. I use one <h1> per page, so I have an option to not include the top-level heading in ids. You can find the full algorithm in the function tocGen.createToC.

The term "to slugify" comes from "URL slug". It seems the most common term for this particular type of string normalization. The following is one way to generate a URL slug from arbitrary text in JavaScript.

var slugify = function(s) {
    // match() returns null when s contains no word characters, so fall back to [].
    return (s.toLowerCase().match(/\w+/g) || []).join('-');
};

And here is another in Tcl.

proc slugify text {
    string trim [regsub -all {[^[:alnum:]]+} [string tolower $text] -] -
}

To me api-fetching-json-content--data-type-1 is more readable than apifetchingjsoncontent/datatype1 while still being human-writable, but that's pretty bikesheddy.
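
For reference, a condensed sketch of the tree scheme described above. It is a rewrite from the prose description, not the real tocGen.createToC; the name treeIds, the {level, text} input shape and the skipTopLevel flag are assumptions for this example, and the slugify here is a self-contained variant whose output can never contain --.

```javascript
// Runs of non-alphanumerics collapse to a single '-', so no slug contains '--'.
function slugify(s) {
    return s.toLowerCase().replace(/[^a-z0-9]+/g, '-').replace(/^-|-$/g, '');
}

// headings: array of {level: 1..6, text: string}, in document order.
// skipTopLevel (hypothetical option): leave the <h1> slug out of every id.
function treeIds(headings, skipTopLevel) {
    var path = [];
    return headings.map(function (h) {
        path = path.slice(0, h.level - 1);
        // Missing intermediate levels count as present but empty.
        while (path.length < h.level - 1) {
            path.push('');
        }
        path.push(slugify(h.text));
        return path.slice(skipTopLevel ? 1 : 0).join('--');
    });
}

console.log(treeIds([
    { level: 1, text: 'Top' },
    { level: 2, text: 'Red' },
    { level: 3, text: 'Foo' },
    { level: 2, text: 'Green' },
    { level: 3, text: 'Foo' }
], false));
// [ 'top', 'top--red', 'top--red--foo', 'top--green', 'top--green--foo' ]
```

Swapping "Red" and "Green" changes the order of the results but not the ids themselves, which is the property the suffix-based scheme lacks.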