Fossil Forum to show a thread's title as page title in browser

(1) By anonymous on 2020-01-11 17:56:25 [link]

Currently, opening any Forum thread here displays the same page title in the browser: "Fossil Forum: Forum".

This makes it somewhat difficult to make sense of the history of visited pages, short of looking over the actual URLs.

Is this configurable, or is it a current limitation of the Forum feature?

(2) By Stephan Beal (stephan) on 2020-01-11 18:51:30 in reply to 1 [link]

It's not currently configurable. A cursory glance reveals that the page titles are effectively hard-coded at the moment. The good news is that it seems that changing the title to include the post subject won't be a problem for contexts like editing and replying, as we have the post object loaded at that point. Those aren't the really useful cases, though, as they don't get bookmarked or browsed back to.

Unfortunately, though, changing that for the `/forumpost` (a.k.a. `/forumthread`) page, which is the page from which posts are read, would require more intensive changes because the page header (which includes the title) gets processed long before any posts are loaded and the header must be rendered before any posts are output. The problem is that the post we're interested in may be the 27th post loaded/output on that page, and we don't have the name of the post until that point, but the header has to be output before the first post on the page is rendered.

i.e. this can certainly be improved with a bit of refactoring but there's not a quick/trivial fix for it.

(3) By Martijn Coppoolse (vor0nwe) on 2020-01-15 20:57:32 in reply to 2 [link]

A fairly trivial workaround could be to insert some javascript in the Forum's page footer:

```html
<script nonce="$nonce">
document.title = document.querySelector('.content h1').textContent + ' - ' + document.title;
</script>
```

It wouldn't help for SEO purposes, and it wouldn't work for browsers with Javascript disabled, but it could be helpful in the browser history, title bar and task bar button.

Of course, when writing a post, or composing a reply, it would just put "Replying To: - Fossil Forum: Reply - Personal", unless the above script is limited to the `/forumpost` page. But I doubt that will be seen as a real problem.

(4) By Stephan Beal (stephan) on 2020-01-15 21:06:41 in reply to 3 [link]

That's exactly what i was thinking, as that could be done without any refactoring. We could also simply inject the JS at the point where we know which thread we're loading, something like this pseudocode:

```
cgi_print( "<script nonce...>document.title=\"%h\"</script>", thread->title );
```

(Or whatever the HTML-encode modifier is, if it's not `%h`.)

i do like the selector approach, though.

(5) By Martijn Coppoolse (vor0nwe) on 2020-01-17 17:21:55 in reply to 4 [link]

Is that HTML-encoded text guaranteed to be safe for use in Javascript, too? Or is there a separate JS-encoder?

(6.2) By Stephan Beal (stephan) on 2020-01-17 18:00:17 edited from 6.1 in reply to 5 [link]

A better/cleaner way to handle that would be to pass the title through sqlite3's JSON conversion, which will encode it in a JSON-compatible string and handle all necessary escaping:

```
# f sqlite3
...
sqlite> select json_quote('This is the page title');
'"This is the page title"'
sqlite> select json_quote('This is the <script>evil stuff</script>');
'"This is the <script>evil stuff</script>"'
```

Note that `<script>evil stuff</script>` is not escaped but doesn't have to be because it's enclosed in a JSON/JS string and will never get evaluated as a script tag.

The outer single-quotes there are a side-effect of the sqlite shell, and would not be present in the C-side results, where we could do something like (untested):

```
Stmt q;
db_prepare(&q, "select json_quote($title)");
db_bind_text(&q, "$title", thread->zTitle);
db_step(&q);
cgi_print(
  "<script nonce=\"%s\">document.title=%s</script>",
  theNonce/*safe-for-%s*/,
  db_column_text(&q, 0)/*safe-for-%s*/
);
db_finalize(&q);
```

(Sidebar: injecting a script wouldn't work with the default CSP unless the attacker could magically guess the nonce. Even then, it would work only once (when the nonce matches).)

The `<script>...</script>` part of that could optionally be in the `prepare()` call:

```
db_prepare(&q,
  "select '<script nonce=' || json_quote($nonce) || '>document.title=' || json_quote($title) || '</script>'"
);
```


That looks like it would be a trivial change to make and is, IMO, harmless. The question is really one of style and precedent: do we want to inject JS to handle features like this? i'm not at all against it, but suspect that Warren and/or Richard may be more reserved on this point.

Edit: missing `json_quote()` part on the final `db_prepare()` example.

Edit again (apologies): corrected structure of the `$nonce` part of that example.
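
One caveat on the note above, offered as a hedge rather than a correction of the approach: an HTML parser ends a `<script>` element at the first literal `</script>` it encounters, even when that text sits inside a JS string, so JSON quoting alone may not be enough for inline scripts. A minimal sketch of a quoting routine that additionally escapes `<` as `\u003c` follows. The helper name `js_quote_for_script()` is hypothetical and is not an existing Fossil or SQLite function; bytes at or above 0x80 are passed through unchanged.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
** Hypothetical helper (not part of Fossil): write zIn into zOut as a
** double-quoted JS string literal that is safe inside an inline <script>.
** Returns the number of bytes written (excluding the NUL terminator),
** or -1 if zOut (nOut bytes) is too small.
*/
int js_quote_for_script(const char *zIn, char *zOut, size_t nOut){
  size_t o = 0;
  assert( zIn!=0 && zOut!=0 );
  if( nOut<3 ) return -1;                  /* need room for "" plus NUL */
  zOut[o++] = '"';
  for( ; *zIn; zIn++ ){
    unsigned char c = (unsigned char)*zIn;
    char zEsc[8];
    int n;
    if( c=='"' || c=='\\' ){
      n = snprintf(zEsc, sizeof(zEsc), "\\%c", c);     /* \" or \\ */
    }else if( c<0x20 || c=='<' ){
      n = snprintf(zEsc, sizeof(zEsc), "\\u%04x", c);  /* e.g. \u003c */
    }else{
      zEsc[0] = (char)c; zEsc[1] = 0; n = 1;           /* copied as-is */
    }
    if( o+(size_t)n+2 > nOut ) return -1;  /* keep room for quote + NUL */
    memcpy(zOut+o, zEsc, (size_t)n);
    o += (size_t)n;
  }
  zOut[o++] = '"';
  zOut[o] = 0;
  return (int)o;
}
```

With this sketch, the title `a</script>` would be emitted as `"a\u003c/script>"`, which JS decodes back to the original text while the HTML parser never sees a closing tag.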

(7) By Warren Young (wyetr) on 2020-01-17 18:30:55 in reply to 6.2 [link]

> do we want to inject JS to handle features like this? i'm not at all against it, but suspect that Warren and/or Richard may be more reserved on this point.

You've guessed right.

MHO is that the JS solutions are fine as local workarounds until Fossil gets this ability itself, but work on that should do it entirely at the SQL + template layers within Fossil, server-side. This is not an area where we benefit in any useful way from client-side JS other than expedience.

(8) By Stephan Beal (stephan) on 2020-01-17 18:44:15 in reply to 7 [link]

The problem here is that this cannot be solved using the current C code structure. To solve it in C we would have to load the thread before rendering the header, which requires refactoring of the thread-rendering sections (and their callers) and just seems "off" to me (but only marginally so).

The downside to expediting this in JS, though, is that it seems unlikely we'd ever find an incentive to go back and refactor those C parts solely for the sake of updating the page title for bookmarkability.

(9) By anonymous on 2020-01-17 20:19:43 in reply to 8 [link]

> The downside to expediting this in JS, though, is that it seems unlikely we'd ever find an incentive to go back and refactor those C parts

On the third hand, Fossil already uses JS for the graph in the timeline and for web-crawler defense, so there is precedent. Given how small a piece of JS this is (a single assignment statement), maybe it is a reasonable solution.

(Of course, DRH will decide that.)

(10) By Warren Young (wyetr) on 2020-01-17 20:26:39 in reply to 9 [link]

> Fossil already uses JS for the graph...

...and [many other things][1], but all of those current uses of JS in Fossil are either:

1. things that can only be done client-side (e.g. SxS diff scrolling, wiki editor); or

2. things that would cause an expensive HTTP round-trip if done server-side (e.g. file tree view [un]folding)

Neither applies here. The server has all the info it needs to provide useful `<title>` content. Someone's just got to dig in and make it happen.

[1]: https://fossil-scm.org/fossil/doc/trunk/www/javascript.md

(11) By anonymous on 2020-02-28 22:09:48 in reply to 1 [link]

Added support for including a forum thread's title in its page title; see branch [forumthread-title](https://fossil-scm.org/fossil/timeline?r=forumthread-title).

Please try it to see if this covers the expected thread views.

(12.1) By Florian Balmer (florian.balmer) on 2020-02-29 10:29:29 edited from 12.0 in reply to 11 [link]

Nitpick: truncating a string to a number of `char`s may produce invalid (incomplete) code point sequences in UTF-8.

Edit: Why not just let the web browser handle the truncation?

(13) By anonymous on 2020-02-29 19:53:18 in reply to 12.1 [link]

Thanks for taking a look at the code. Added the UTF-8 truncation logic.

> Edit: Why not just let the web browser handle the truncation?

Not sure browsers care about the length of the text in the `<title>` tag. But, say, if one were to create a bookmark (which is based on the title's text), an excessively long text may not be very useful, especially when converted into a url-shortcut file.

Also, cosmetically, the title is printed next to the `"<ProjectName> /"` heading in some of the default skins, so truncating a long title just seems less cluttered.

That said, we could just as well use the thread title text verbatim; the goal is just to have semantic bookmarks.

Any consensus on this?

(14) By anonymous on 2020-03-01 00:33:47 in reply to 12.1 [link]

> Nitpick: truncating a string to a number of char's may produce invalid (incomplete) code point

As best I can determine, most browsers handle this reasonably. The other anon poster makes good points about long titles.

Alternately, a purist might choose to convert UTF-8 to "wide characters", truncate the string, then convert back to UTF-8.

Most C compilers have proper support for 32 bit wchar strings, though some still only support 16 bit wide characters.

In compilers with a proper wchar type, the stdlib function mbstowcs() will convert a multibyte string in the current locale's encoding (UTF-8, given a UTF-8 locale) to a wchar string:

```
    /* needs <stdlib.h>; setlocale() needs <locale.h> */
    const char utf8_in[] = "some string with UTF-8 characters";
    char utf8_out[204] = { 0 };
    wchar_t wcs[51] = { 0 };

    /* Note: LC_CTYPE must name a UTF-8 locale, e.g. via setlocale(LC_CTYPE, "") */
    mbstowcs(wcs, utf8_in, 50);                    /* convert at most 50 characters */
    wcstombs(utf8_out, wcs, sizeof(utf8_out) - 1); /* convert back to UTF-8; limit is in bytes */
```

For better portability, routines to decode/encode UTF-8 are not hard to write.

(15) By Florian Balmer (florian.balmer) on 2020-03-01 14:22:46 in reply to 13 [link]

> Added the UTF-8 truncation logic.

The forum thread title can be considered to be valid UTF-8, because the input element to enter the title is on a web page encoded in UTF-8, and web browsers can be expected to submit valid UTF-8.

So checking whether the first truncated byte, or any of the last 3 (non-truncated) bytes, is a lead byte would be enough to find the last complete UTF-8 sequence.

> Not sure browsers care for the length of the text in the `<title>` tag.

Besides taking care to keep multi-byte code sequences intact (as your code does now), web browsers will also take care not to split (non-spacing) diacritics from their base characters, and take into account the effective display width, depending on the screen resolution, font size, etc. when shortening the `<title>` text to fit window or tab titles, or favorites lists. This is way superior to the rather short and random truncation after 50 chars (or, bytes).

I keep a list of bookmarks to forum posts, with the forum thread titles manually set as the link text. Like this, none of the links seems to have an excessive length.

My vote is to avoid truncating the forum thread title used in the page title. If a length limit is necessary, the `<input>` element for the forum thread title could have a `maxlength` attribute to enforce it in a way that users can anticipate it and adapt their wording, for example.

(16) By Florian Balmer (florian.balmer) on 2020-03-01 14:23:59 in reply to 14 [link]

Your solution has the same problem.

In UTF-16, code points from the Basic Multilingual Plane (BMP) are encoded as one 16-bit code unit, but code points from the Supplementary Planes are encoded as two 16-bit code units, called a "surrogate pair".

So the same as UTF-8, UTF-16 is a variable-length encoding, and truncating after a fixed number of code units may likewise produce incomplete sequences.

Also note that the effort of iterating over the whole string, performing the calculations to convert from UTF-8 to UTF-16, copying the result into another buffer, and then doing the same again the other way round, is not necessary to find code point boundaries.

UTF-8 is "self-synchronizing", which means in a UTF-8 stream, any particular code unit (byte) is either a lead byte or a trail byte, regardless of the context, so looking at ±3 bytes from a random index is enough context to find the next or previous code point boundary.

For other encodings, such as the DBCS family, this is different: many bytes can represent a lead byte or a trail byte of a two-byte code unit at the same time. That's why iterating and searching is context-dependent, and must start at the beginning of a string, or from a known code point boundary. Simply looking back at the previous byte may not be sufficient, as the previous byte may again be eligible to be used as both a lead byte and a trail byte.

For example, in code page 936 (or Windows-936, the legacy Windows character encoding for simplified Chinese), searching for U+FF03 FULLWIDTH NUMBER SIGN ＃ is not trivial. Windows-936 represents this code point by the two-byte code unit `0xA3A3`, that is byte `0xA3` is valid as either a lead byte or a trail byte. So the (mid-stream) matching two-byte code unit `0xA3A3` can be interpreted as lead byte `0xA3` followed by trail byte `0xA3`, making a true search hit. Or, it's possible that the first `0xA3` byte is the trail byte of a preceding 2-byte sequence, while the second `0xA3` byte is the lead byte of a subsequent 2-byte sequence, making a false positive hit.
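
The false-positive case can be sketched in C as follows. This is only an illustration under a simplifying assumption: every byte in the range 0x81 to 0xFE is treated as the lead byte of a two-byte sequence, which glosses over the finer details of code page 936. The function names are made up for this example.

```c
#include <assert.h>

/* Simplifying assumption: bytes 0x81..0xFE start a two-byte sequence. */
static int is_dbcs_lead(unsigned char c){ return c>=0x81 && c<=0xFE; }

/* Naive byte search: offset of the first 0xA3 0xA3 byte pair, or -1. */
int naive_find(const unsigned char *z, int n){
  int i;
  assert( z!=0 );
  for(i=0; i+1<n; i++){
    if( z[i]==0xA3 && z[i+1]==0xA3 ) return i;
  }
  return -1;
}

/* Context-aware search: walks from the start of the string, one whole
** character at a time, so a trail byte is never mistaken for a lead byte. */
int dbcs_find(const unsigned char *z, int n){
  int i = 0;
  assert( z!=0 );
  while( i<n ){
    if( is_dbcs_lead(z[i]) && i+1<n ){
      if( z[i]==0xA3 && z[i+1]==0xA3 ) return i;
      i += 2;   /* skip lead byte and trail byte together */
    }else{
      i++;      /* single-byte character (or dangling lead byte) */
    }
  }
  return -1;
}
```

For the four bytes `B0 A3 A3 B0` (two characters, `0xB0A3` followed by `0xA3B0`), `naive_find()` reports a hit at offset 1, straddling the two characters, while `dbcs_find()` correctly reports no match.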

(17) By anonymous on 2020-03-01 15:29:46 in reply to 16 [link]

Since the intended input is the page title, decoding will start at the beginning of the character sequence.

The mbstowcs() routine will "consume" the input string in whole characters until either the end of the string, invalid input or the specified number of decoded characters have been copied to the output buffer.

(18) By Florian Balmer (florian.balmer) on 2020-03-01 16:01:09 in reply to 17 [link]

Ok, I see. Still, not the most economical solution, and due to the involved buffer copies likely even slower than repeatedly calling `invalid_utf8()` for text that was already known to be valid UTF-8 before manipulation ...

I've submitted related changes to modify the comment formatter to avoid output of incomplete UTF-8 sequences, and to avoid line breaks inside UTF-8 sequences:

* <https://www.fossil-scm.org/index.html/timeline?r=comment-formatter-utf8>

Similar logic could be used to truncate a (valid UTF-8) string without producing incomplete sequences. (But my preferred approach in this case would be not to truncate the title at all, see post above.)

(19) By anonymous on 2020-03-02 00:23:25 in reply to 15 [link]

I looked through the current Fossil codebase in case we already have a utility function to handle UTF-8 truncation properly/efficiently. The closest I could find was `invalid_utf8()`, so I adapted it for the task. I understand from your reply that proper UTF-8 truncation involves more than what the validation function does.

Would you suggest a robust logic (if that is possible at all in the UTF-8 case) to handle the truncation, so that it could be put into a utility function and reused in other parts of the Fossil code? That's assuming there's any utility in truncating the titles in question.

As for the thread title length, the most populous Fossil-hosted forum that I know of is this one. Running a query shows that [the longest title so far is 120 chars](https://fossil-scm.org/forum/forumpost/982772ccd):

```
select max(length(substr(event.comment,instr(event.comment,':')+2))) from forumpost join event on event.objid=forumpost.fpid;

```

Coincidentally, it's on the subject of multi-byte chars :) Most of the post titles are shorter than 80 chars. Over 200 (out of 836) are longer than 50 chars.

(20) By Warren Young (wyoung) on 2020-03-02 17:46:49 in reply to 19 [link]

I made [a moving average of the frequency data](https://imgur.com/a/0nNRoEK) to give a better sense of title lengths in aggregate.

The anonymously-sourced query above gives the count of forum posts with the given title length, but if you use it to produce a graph like mine, it produces a bias. The post with the 120-character title had 17 replies, so it counts as 18 if you just throw a `count(*)` and `group by` into the query above.

Therefore, I have modified it thus to produce the data backing this graph:

```sql
select n, count(*)
from (
    select
        length(substr(event.comment, instr(event.comment, ':') + 2)) as n
    from
        forumpost
        join event on event.objid = forumpost.fpid
    group by n, froot
    order by n
) group by n;
```

(21) By Florian Balmer (florian.balmer) on 2020-03-02 18:55:33 in reply to 19 [link]

I'm happy to make a suggestion, taking advantage of the "self-synchronizing" nature mentioned in an earlier post. My next "coding window" is Tuesday or Thursday.

(22) By Joel Dueck (joeld) on 2020-03-02 19:00:28 in reply to 18 [link]

I also do not see the point in truncating the content of the `<title>` tag. Every browser I have used will truncate it for you when creating/displaying a bookmark of the page.

(23) By Florian Balmer (florian.balmer) on 2020-03-03 15:24:40 in reply to 21 [link]

Something like the following can be used to find the length to pass to `blob_truncate()` without producing incomplete UTF-8 sequences:

```
/*
** For a given index in a UTF-8 string, return the nearest index that is the
** start of a new code point. The returned index is equal to or lower than the
** given index. The end of the string (the null-terminator) is considered a
** valid start index. The given index is returned unchanged if the string
** contains invalid UTF-8 (i.e. overlong runs of trail bytes).
** This function is useful to find code point boundaries for truncation, for
** example, so that no incomplete UTF-8 sequences are left at the end of the
** truncated string.
** This function does not attempt to keep logical and/or visual constructs
** spanning across multiple code points intact, that is, no attempts are made
** to keep combining characters together with their base characters, or to
** keep more complex grapheme clusters intact.
*/
#define IsUTF8TrailByte(c) ( ((c)&0xc0)==0x80 )
int utf8_nearest_codepoint(const char *zString, int maxByteIndex){
  int i,n;
  for( n=0, i=maxByteIndex; n<4 && i>=0; n++, i-- ){
    if( !IsUTF8TrailByte(zString[i]) ) return i;
  }
  return maxByteIndex;
}
```

For an alternative (and slightly more adequate) approach to truncate after N code points, instead of N bytes, something similar to the [`strlen_utf8()` function][0] could be used.

Truncating strings without visual context may be suitable for CLI programs. For web page output, CSS techniques usually create much better results, as the web browser can take the effective visual appearance, screen resolution, font size, etc. into account, for example:

```
<p style="white-space: nowrap; overflow: hidden; text-overflow: ellipsis;"> ...
```

[0]: https://www.fossil-scm.org/index.html/artifact/e9fe5da3fa?ln=149-179

(24) By anonymous on 2020-03-03 17:23:59 in reply to 23 [link]

Thank you for the functional code and hints on how to apply it to UTF8 truncation. We should use this instead of the current way of truncating the title.

As I see it, there are three places sensitive to title length:

1. Browser's window title
2. Browser's Bookmark list view
3. File system ability to accommodate a URL-link file

Since the current implementation is now folded into trunk, we can observe the effect of the truncation or non-truncation on this very page itself (just resize the window width). In Fossil's original theme, the page title appears in the page heading, and when it is long it wraps the heading line in a somewhat ugly way. I understand from your post that this could be handled in CSS for the given theme.

Browsers indeed can take care of long titles in the window title and bookmark list view. I did some limited testing with Firefox under Ubuntu; a lengthy page title takes up as much space in the browser's window title as is available, and the rest is visually truncated without losing the accents. Same with bookmark views.

However, creating URL-link files from the browser's address line fails silently once the length of the page title exceeds the file system's limit on file name length. Under Ubuntu the limit is 255 bytes; past that you'd get a "filename too long" error on the command line (`touch`), while Firefox just fails without any feedback to the user.

That said, I also find it reasonable not to truncate in the Fossil code, but to have some limit on the Forum's input forms for the title, and to control the length in CSS if desired for a given skin.

(25) By Florian Balmer (florian.balmer) on 2020-03-03 18:41:52 in reply to 24 [link]

I see the dilemma with URL-link files. But besides checking the length limit of the target filesystem, the browser may also need to replace disallowed characters (slashes) when deriving filenames from web page titles, and Fossil can't cover all these cases ...

(26) By Stephan Beal (stephan) on 2020-03-03 18:58:43 in reply to 25 [link]

Fossil has, somewhere, a list of characters it does not allow in filenames. Presumably that list would be sufficient.

(That said, i don't understand why truncating names or titles should be fossil's responsibility. Automated truncation will, at some point, do something like truncate the word "assistance" to its first three letters, and that's bound to upset someone. i have indeed seen that happen with news article title truncation.)

(27) By anonymous on 2020-03-03 19:37:37 in reply to 15 [link]

> If a length limit is necessary, the `<input>` element for the forum thread title could have a `maxlength` attribute to enforce it in a way that users can anticipate

While this is a good idea, it overlooks something very important: Programs should always validate inputs from outside sources. The web browser is an outside source.

Just because an `<input>` element in an HTML form has a maxlength attribute, doesn't mean we can safely skip length limiting[1] when accepting input from the web browser.

Also, input checks and limits should be independent of the input source/mechanism. Whether a given input comes from an HTML form, a JSON API or the command line interface, it should be put through the same input validation.

By "same validation", I don't mean duplicating code in each input mechanism. I mean that each mechanism should call the same validation functions. Ideally, each validation function also delivers its output (the validated, length-limited, and otherwise transformed[2] input) to Fossil's "business logic", returning only Pass/Fail to the input mechanism that called it.

[1] Note that length limiting is not just the length in characters, but also in total bytes.

[2] Obviously, input mechanism specific transformations should be done in the input mechanism. Any transform done in the validation routine should be mechanism independent.
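
As a rough illustration of the "one shared validation function" idea, here is a minimal C sketch. Everything in it is an assumption made for the example: the function name `title_is_acceptable()`, the two limits, and the choice to count code points by counting non-trail bytes. It checks both the byte length and the character count, in line with footnote [1], but it does not check that the input is well-formed UTF-8.

```c
#include <assert.h>
#include <string.h>

#define TITLE_MAX_BYTES 500   /* hypothetical limit, in bytes */
#define TITLE_MAX_CHARS 125   /* hypothetical limit, in code points */

/*
** Hypothetical shared validator: every input mechanism (HTML form,
** JSON API, command line) would call this same function.
** Returns 1 if the title is acceptable, 0 otherwise.
*/
int title_is_acceptable(const char *zTitle){
  size_t nByte, i, nChar = 0;
  assert( zTitle!=0 );
  nByte = strlen(zTitle);
  if( nByte==0 || nByte>TITLE_MAX_BYTES ) return 0;
  for(i=0; i<nByte; i++){
    /* Every byte that is not a UTF-8 trail byte starts a code point. */
    if( ((unsigned char)zTitle[i] & 0xc0)!=0x80 ) nChar++;
  }
  return nChar<=TITLE_MAX_CHARS;
}
```

The web form's `maxlength` attribute would then be a convenience for users, while this kind of server-side check would remain the actual gate.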

(28) By anonymous on 2020-03-03 19:41:17 in reply to 26 [link]

> i don't understand why truncating names or titles should be fossil's responsibility

See [reply 30](https://fossil-scm.org/forum/forumpost/2697e28917)

(29) By Florian Balmer (florian.balmer) on 2020-03-04 17:21:03 in reply to 27 [link]

I guess you could forge a HTTP request to create a forum post containing invalid UTF-8, but all you'd get is a web page with broken text, and no SQL admin superpowers.

Likewise, users could take an input length limit as a guidance to rephrase lengthy titles, while Fossil could easily handle longer titles. (In fact this would be required for compatibility with existing forums.)

The fixed 50-bytes truncation is unnecessary: it hides information that would easily fit, and leaves a lot of unused space on my monitor. A dynamic `text-overflow` truncation would show more information.

(30) By anonymous on 2020-03-04 21:37:37 in reply to 29 [link]

> I guess you could forge a HTTP request

Even if there is no possibility of privilege escalation, letting unvalidated, unlimited inputs into Fossil can still degrade Fossil's usability and cause problems for other tools[1].

For example, you are overlooking the very real possibility of someone having a Continuous Integration, "CI", server directly post to the forum. Already, some Fossil users have interfaced Fossil to a CI server, such as [Jenkins](https://en.wikipedia.org/wiki/Jenkins_(software)).

Of course, such integrations would be better done using an API like Fossil's JSON API, however Fossil's JSON API doesn't include the forum.

In any case, even when using a theoretical JSON API for forum posts, Fossil would still need to length limit and otherwise validate input fields.

> The fixed 50-bytes truncation

I assume you mean 50 characters?

A longer limit in Fossil would be reasonable. As you pointed out, the web browser can deal with purely display issues.

[1] Making Fossil friendlier to integration with other tools is a valid consideration and benefits Fossil and its community. Part of that is validating and limiting Fossil's inputs.

(31) By Florian Balmer (florian.balmer) on 2020-03-05 14:02:53 in reply to 30 [link]

> I assume you mean 50 characters?

That depends on the definition of "character".

According to a [Unicode-aware definition][0], a "character" in UTF-8 [takes up 1 to 4 bytes][1], corresponding to 1 to 4 elements in a `char[]` array.

So when truncating a `char[]` array at a random index N, the equivalent of a random byte position N, the resulting string may contain anything from `N>>2` to `N` "characters".

### Example 1:

String with 26 Latin-alphabet (1-byte) "characters":
```
  "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
```

Truncation after 20 bytes results in a string with 20 Latin-alphabet "characters":
```
  "ABCDEFGHIJKLMNOPQRST"
```

### Example 2:

String with 26 U+1F4BE FLOPPY DISK (4-byte) "characters":
```
  "💾💾💾💾💾💾💾💾💾💾💾💾💾💾💾💾💾💾💾💾💾💾💾💾💾💾"
```
Truncation after 20 bytes results in a string with only 5 U+1F4BE FLOPPY DISK "characters":
```
  "💾💾💾💾💾"
```

This is what your algorithm to truncate the title currently does, and that's why I referred to it as "fixed 50-bytes truncation".

[0]: http://www.unicode.org/glossary/#character
[1]: http://unicode.org/faq/utf_bom.html#gen6

(32) By anonymous on 2020-03-05 14:38:22 in reply to 31 [link]

> This is what your algorithm to truncate the title currently does, and that's why I referred to it as "fixed 50-bytes truncation".

No. mbstowcs() decodes the UTF-8 sequences first, then counts decoded characters as it copies them to the destination buffer, which is an array of uint32_t. So 50 UCS-4 characters in 200 bytes.

See the description at http://man7.org/linux/man-pages/man3/mbstowcs.3.html

(33) By Florian Balmer (florian.balmer) on 2020-03-05 15:51:47 in reply to 32

But `mbstowcs()` is not currently used in the algorithm to truncate the forum title:

<https://www.fossil-scm.org/index.html/info/59f126d90b>

And still, the (two) conversion steps (plus two buffer copies) are not required to find the boundary of code point N. For example, you could just walk the UTF-8 `char[]` array and count N lead bytes, to find the position of code point N.
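
A minimal sketch of that lead-byte counting idea, assuming the input is already valid UTF-8 (the function name `utf8_offset_of_codepoint()` is made up for illustration and is not part of Fossil):

```c
#include <assert.h>
#include <stddef.h>

/*
** Return the byte offset at which code point number n (0-based) starts
** in the valid UTF-8 string z. If z has n or fewer code points, return
** the offset of the NUL terminator. Truncating z at the returned offset
** keeps exactly the first n code points intact.
*/
size_t utf8_offset_of_codepoint(const char *z, size_t n){
  size_t i = 0;
  assert( z!=0 );
  while( z[i]!=0 ){
    if( ((unsigned char)z[i] & 0xc0)!=0x80 ){  /* not a trail byte */
      if( n==0 ) return i;
      n--;
    }
    i++;
  }
  return i;
}
```

For the four-byte string "aéb" (`61 C3 A9 62`), asking for code point 2 gives offset 3, so cutting the string there keeps "aé" and never splits the two-byte sequence.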

Or, something like the following minimal variation of the `strlen_utf8()` function could be used to find the byte index corresponding to a given code point index:

```
/*
** Find the byte index corresponding to the given code point index in a UTF-8
** string. If the string contains fewer than the given number of code points,
** the index of the end of the string (the null-terminator) is returned.
** Incomplete, ill-formed and overlong sequences are counted as one sequence.
** The invalid lead bytes 0xC0 to 0xC1 and 0xF5 to 0xF7 are allowed to initiate
** (ill-formed) 2- and 4-byte sequences, respectively, the other invalid lead
** bytes 0xF8 to 0xFF are treated as invalid 1-byte sequences (as lone trail
** bytes).
*/
int utf8_codepoint_index(const char *zString, int iCodePoint){
  int i;       /* Counted bytes. */
  int lenUTF8; /* Counted UTF-8 sequences. */
  if( zString==0 ) return 0;
  for(i=0, lenUTF8=0; zString[i]!=0 && lenUTF8<iCodePoint; i++, lenUTF8++){
    char c = zString[i];
    int cchUTF8=1; /* Code units consumed. */
    int maxUTF8=1; /* Expected sequence length. */
    if( (c&0xe0)==0xc0 )maxUTF8=2;          /* UTF-8 lead byte 110vvvvv */
    else if( (c&0xf0)==0xe0 )maxUTF8=3;     /* UTF-8 lead byte 1110vvvv */
    else if( (c&0xf8)==0xf0 )maxUTF8=4;     /* UTF-8 lead byte 11110vvv */
    while( cchUTF8<maxUTF8 &&
            (zString[i+1]&0xc0)==0x80 ){    /* UTF-8 trail byte 10vvvvvv */
      cchUTF8++;
      i++;
    }
  }
  return i;
}
```
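As a usage illustration, a byte index like the one returned above can be used to cut a title at a code point boundary before appending an ellipsis. The sketch below is self-contained, so it carries its own simplified boundary finder. `utf8_prefix_bytes()` and `title_truncate()` are hypothetical names, the three-dot ASCII ellipsis is an assumption, and this is not the code from any Fossil check-in:

```c
#include <stddef.h>
#include <string.h>

/*
** Number of bytes occupied by the first nMax code points of z (or by all of
** z if it is shorter). Simplified: one byte is consumed as the lead, then
** every directly following trail byte 10vvvvvv is consumed with it.
*/
static size_t utf8_prefix_bytes(const char *z, size_t nMax){
  size_t i = 0, n = 0;
  while( z[i]!=0 && n<nMax ){
    i++;                                              /* Lead byte. */
    while( ((unsigned char)z[i] & 0xc0)==0x80 ) i++;  /* Trail bytes. */
    n++;
  }
  return i;
}

/*
** Copy at most nMax code points of zTitle into zOut (capacity nOut bytes),
** appending "..." if anything was cut off. Returns 1 if the title was
** truncated, 0 if it fit, and -1 if zOut is too small (zOut is then left
** untouched).
*/
int title_truncate(const char *zTitle, size_t nMax, char *zOut, size_t nOut){
  size_t nByte = utf8_prefix_bytes(zTitle, nMax);
  int truncated = zTitle[nByte]!=0;
  size_t nTail = truncated ? 3 : 0;
  if( nOut<nByte+nTail+1 ) return -1;
  memcpy(zOut, zTitle, nByte);
  if( truncated ) memcpy(zOut+nByte, "...", 3);
  zOut[nByte+nTail] = 0;
  return truncated;
}
```

Because the cut happens at the byte index of a code point boundary, a multi-byte sequence is never split in the middle.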

(34) By anonymous on 2020-03-06 17:55:46 in reply to 33 [link]

Thank you for the improved function to handle the UTF-8 truncation. I could see that it adds a few more UTF-8 chars in the resulting string, which previously would be prematurely chopped off. I updated the truncation handling in [d076853d10a2f2f7](https://fossil-scm.org/fossil/info/d076853d10a2f2f7).

Additionally, I limited the max length of the new thread's title to 125 chars. This should accommodate the older threads; the longest one so far was at 120 chars.

At this point, based on this thread's discussion I'm also in favor of keeping the title __un-truncated__. However, one part remains to be tweaked -- the visual appearance of a long title in the Forum/Fossil [default theme header](https://fossil-scm.org/fossil/artifact?udc=1&ln=2&name=951f84d12b2d92e6).

My own CSS efforts did not yield a nicely ellipsed sub-title in the default skin's header. For the fix to work as expected, there must be an explicit `max-width` style attribute set on it (`80ch` should suffice), but `ch` units may not be universally supported by browsers. Not sure if this could be set in other units or otherwise.

(35) By Joel Dueck (joeld) on 2020-03-06 18:22:42 in reply to 34 [link]

> `ch` units may not be universally supported by browsers.

[According to “Can I Use…”][1], `ch` units have been supported by all browsers, even Internet Explorer, since 2013. (But I didn't even know about it until today!)

True to form, IE calculates the width of 1ch slightly differently than the other browsers. But I think that difference wouldn’t matter very much for this use case.

[1]: https://caniuse.com/#feat=ch-unit

(36) By Florian Balmer (florian.balmer) on 2020-03-07 11:19:09 in reply to 34 [link]

Isn't truncating after "max-width = 80 ch units" the same as truncating after 80 code points, i.e. a 120-char title would still be truncated even if the screen were wide enough to display all of it?

(37) By anonymous on 2020-03-07 18:55:43 in reply to 34 [link]

Here's a mock-up of Fossil's default theme that I used in my CSS attempts to tweak the elision of the text in the header section of the page (that's where the title text is displayed). NOTE: you may need to resize the browser window to trigger the text overflow.

```html
<!DOCTYPE html>
<html>
<head>
<title>
The title text
</title>
<style>
.title {
    float:left;
    white-space: nowrap;
    overflow: hidden;
    text-overflow: ellipsis;
}
.title h1 {
    display:inline;
}
.title h1:after {
    content: " / ";
}
.status {
    float:right;
}
.mainmenu {
    clear:both;
    overflow-x: auto;
    overflow-y: hidden;
    white-space: nowrap;
}
</style>
</head>
<body>
<div class="header">
  <div class="title"><h1>ProjectName</h1>
The very long long long text to elide
  </div>
  <div class="status">Login</div>
</div>
<div class="mainmenu"></div>
<div class="content">
<h2>This is main content.</h2>
</div>
</body>
</html>
```

As one can see, in this form the "long" text is not ellipsed despite the CSS directives for the `.title` class.

The reason for that is the `'float: left;'`. If we disable it, then the ellipsis does work, yet the "Login" text gets shifted to the next line.

(38) By anonymous on 2020-03-07 22:22:50 in reply to 37 [link]

I think I got this to align reasonably by applying a `max-width` property to both the `.title` (80%) and `.status` (20%) classes.

The relevant CSS changes are:

```css
.title {
    max-width:80%;
....
}

.status {
    max-width:20%;
....
}
```

The result is that the title text gets ellipsed so that it fits on the same visual line as the "Login" text. A side effect is that the ProjectName gets ellipsed too if the viewport shrinks too much.

Is this an acceptable solution?

(39) By Florian Balmer (florian.balmer) on 2020-03-08 13:32:28 in reply to 38 [link]

I'm really sorry for the time you spent after I mentioned the CSS `text-overflow` property. I'm so careful to avoid CSS that I forgot its nature: in theory there's a solution that sounds good and looks simple for anything, but in practice it won't work.

I don't know whether setting the widths for `.title` and `.status` to 80% and 20% works well with any browser on any device, especially with small mobile screens.

I don't know how to proceed. On narrow screens, the ±50 "chars" title wraps smoothly and looks okay. On wide screens, some space that could be used to show more of the title seems wasted, but maybe just let that happen? Or, we could hire a CSS wizard (probably this is already a subspecialization of web design?).

(40) By anonymous on 2020-03-09 17:47:10 in reply to 37 [link]

Here's an updated variant of the mock-up header styling (for the default Fossil skin), which implements the following objectives with minimal side effects:

1. Display the page's title in the page header
2. On wide viewports, display as much of a long title as fits unellipsed
3. Ellipse a long title when it does not fit
4. On narrow viewports, display as much of the ellipsed title as fits
5. Keep the ProjectName unellipsed
6. In transitions, preserve much of the original visual flow


```html
<!DOCTYPE html>
<html>
<head>
<title>
Title
</title>
<style>
.title {
    float: left;
    max-width: 75%;
}
.title h1 {
    display:inline;
    width: 20%;
}
.title h1:after {
    content: " / ";
}
.titleText {
    display: inline-block;
    max-width: 100%;
    white-space: nowrap;
    overflow: hidden;
    text-overflow: ellipsis;  
}
.status {
    display:inline-block;
    max-width: 20%;
    float:right;
}
.mainmenu {
    clear:both;
    overflow-x: auto;
    overflow-y: hidden;
    white-space: nowrap;
}
</style>
</head>
<body>
<div class="header">
  <div class="title"><h1>ProjectName</h1>
<span class="titleText">Very long long long long long long long long long long long long text to elide</span>
  </div>
  <div class="status">username - Logout</div>
</div>
<div class="mainmenu"></div>
<div class="content">
<h2>This is main content.</h2>
</div>
</body>
</html>

```

Mostly this builds upon the `max-width:%` approach, so it may need more compatibility testing across browsers (how can one not love CSS...). An additional `span` was added for the title. When the title does not fit, the title block stacks underneath the ProjectName block and is ellipsed if needed.

I tested this on Firefox. In mobile view (narrow), the thread title ends up stacked and ellipsed. It's compact, but the title is not of much use. In desktop view, the title reads more usefully.

This prompts an obvious and indeed simplest alternative ... to just go with an **unellipsed title** altogether. On mobile viewports, the side effect will be the long title wrapping and filling most of the first screen. If we can accept this, it would make sense to simply go on with no changes to the skin.

(41) By Florian Balmer (florian.balmer) on 2020-03-10 15:12:40 in reply to 40 [link]

The idea to use an inline-block to have lengthy `$title`'s go below the inline-h1 looks interesting. I've also made a few more experiments, but haven't found a way to keep the h1 and the following `$title` on the same line, with truncation (ellipsis) as needed.

Not sure if everybody will like the break inside the status line for narrow screens:

```
username -
Logout
```

I don't feel competent (CSS-, Fossil skin-, and other-wise) to make a decision, here. If more changes than just to the "global" CSS (identical for all skins) are required (i.e. the new `<span>` around `$title`), this looks like a "major compatibility break"? Then maybe each skin needs to be updated and tested individually, and users may need to integrate the changes into their existing skin customizations to get the new layout, including the Fossil website itself.

(42) By anonymous on 2020-03-11 00:15:38 in reply to 41 [link]

> If more changes than just to the "global" CSS (identical for all skins) are required (i.e. the new `<span>` around `$title`), this looks like a "major compatibility break"?

In general, long titles in the header have always been there in the Wiki part of Fossil. That is, the skin presentation choices made to accommodate the Wiki should be equally applicable to the Forum thread view (which is more recent). BTW, a Wiki page title is limited to 100 chars.

As for the specific changes to Fossil's default skin: the affected files (CSS.txt and header.txt) are not directly shared with other skins in this case. So if they are changed in the default skin, the other included skins should not be affected.

> Not sure if everybody will like the break inside the status line for narrow screens

Good point, this may be a matter of preference. The unellipsed alternative would push the whole status block down unwrapped, trailing the end of the [long] title, which would wrap over as many lines as needed on a narrow screen. With the ellipsed solution, the status block would at least maintain its visual placement.

My choice at this point would be to go ahead and disable the active truncation logic for the thread title altogether. This will bring the Forum thread view to the same visual style as the Wiki. Then, if it's still desired, we can alter the default skin to ellipse the overflow using either of the approaches. Makes sense?
