Fossil Forum

Diacritics and advanced search

(1) By mixcocam on 2023-08-18 10:45:37

Explicitly dealing with diacritics

It seems that the search function currently handles diacritics unpredictably. My suggestion, which is also what most open search functions do, is to fold accented characters to their unaccented counterparts (e.g. é to e).
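The folding suggested here can be sketched in a few lines. This is only an illustration of the general idea, not Fossil's or SQLite's code; the function name `fold_diacritics` is made up for the example, and the sketch assumes that decomposing to NFD and dropping combining marks is an acceptable definition of "non-accented".

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Return text with combining marks removed (hypothetical helper)."""
    # NFD splits a precomposed letter such as U+00E9 into a base
    # letter plus a combining accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every combining mark, keeping only the base characters.
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Recompose whatever is left into the usual NFC form.
    return unicodedata.normalize("NFC", stripped)


print(fold_diacritics("caf\u00e9 \u00e0 la cr\u00e8me"))  # cafe a la creme
```

Letters that have no canonical decomposition (for example ø or ß) pass through this sketch unchanged, so a real search backend may need extra rules for them.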

Advanced search

I might be missing something, but there doesn't seem to be a way to conduct advanced search (wildcards, exclusions, OR, exact matches, etc.). Is this done on purpose? Is there a way to conduct these types of searches otherwise?

(2) By Stephan Beal (stephan) on 2023-08-18 10:58:00 in reply to 1

Is there a way to conduct these types of searches otherwise?

The search features are limited to what sqlite's FTS5 can do. Its list of search options is summarized in section 3 ("Full-text Query Syntax") of the FTS5 documentation.
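For readers who have not used FTS5 directly, here is a rough sketch of the kinds of queries it accepts on its own: OR, NOT, quoted phrases, and prefix matches. It talks to SQLite through Python's `sqlite3` module and assumes the linked SQLite library was built with FTS5. The table name `docs`, the sample rows, and the `search` helper are invented for the example, and none of this shows how Fossil itself passes patterns to FTS5.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# A throwaway FTS5 table with a single indexed column.
db.execute("CREATE VIRTUAL TABLE docs USING fts5(body)")
db.executemany(
    "INSERT INTO docs(body) VALUES (?)",
    [
        ("fossil search index",),
        ("sqlite full text search",),
        ("fossil forum post",),
    ],
)


def search(pattern):
    """Run an FTS5 MATCH query and return matching rows in insert order."""
    rows = db.execute(
        "SELECT body FROM docs WHERE docs MATCH ? ORDER BY rowid",
        (pattern,),
    )
    return [row[0] for row in rows]


print(search("fossil OR sqlite"))   # either term
print(search("fossil NOT forum"))   # exclusion
print(search('"full text"'))        # exact phrase
print(search("sear*"))              # prefix wildcard
```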

As to the diacritics i'm not well-versed enough in FTS5 to know whether it's capable, without additional code, of treating them as anything other than themselves.

(3) By Warren Young (wyoung) on 2023-08-18 11:38:48 in reply to 2

It looks like the OP wants the option of selecting the “unicode61” tokenizer.

(4) By Stephan Beal (stephan) on 2023-08-18 12:03:11 in reply to 3

It looks like the OP wants the option of selecting the “unicode61” tokenizer.

Indeed, from the fts5 docs:

By default, diacritics are removed from all Latin script characters. This means, for example, that "A", "a", "À", "à", "Â" and "â" are all considered to be equivalent.

While starting to add that option to the search config i came across this snippet in the docs:

By default, the porter tokenizer operates as a wrapper around the default tokenizer (unicode61).

i.e. selecting the porter option "should" already provide that feature.
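The two quoted passages can be checked outside Fossil with a small experiment. The sketch below creates one-row FTS5 tables through Python's `sqlite3` module (assuming the linked SQLite has FTS5) and asks whether an unaccented query finds an accented document. The `matches` helper and the sample text are made up for the illustration.

```python
import sqlite3


def matches(tokenizer, document, query):
    """Index one document with the given FTS5 tokenizer and test a query."""
    db = sqlite3.connect(":memory:")
    db.execute(
        f"CREATE VIRTUAL TABLE t USING fts5(body, tokenize = '{tokenizer}')"
    )
    db.execute("INSERT INTO t(body) VALUES (?)", (document,))
    hits = db.execute(
        "SELECT count(*) FROM t WHERE t MATCH ?", (query,)
    ).fetchone()[0]
    db.close()
    return hits > 0


doc = "caf\u00e9 cr\u00e8me"  # precomposed accented letters

# unicode61 folds diacritics by default.
print(matches("unicode61", doc, "cafe"))                       # True
# porter wraps unicode61, so it inherits the folding.
print(matches("porter", doc, "cafe"))                          # True
# With folding switched off, the accented token no longer matches.
print(matches("unicode61 remove_diacritics 0", doc, "cafe"))   # False
```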

@mixcocam: please try enabling the porter stemmer option and see if that resolves it for you:

$ fossil fts-config tokenizer porter

or, from the UI, see the drop-down list at the bottom of the /srchsetup page.

(5) By Stephan Beal (stephan) on 2023-08-18 12:19:52 in reply to 4

@mixcocam: please try enabling the porter stemmer option and see if that resolves it for you:

After Warren pointed out to me that the porter option performs word stemming, which is not valid for non-English languages, an option was added for the unicode61 tokenizer without the word-stemmer. If you can, please try the "unicode61" tokenizer in the latest trunk and tell us if that resolves the problem for you:

$ fossil fts-config tokenizer unicode61

Likewise, the /srchsetup page has a new option for this.
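The difference between the two tokenizers can be seen in a short experiment outside Fossil. The sketch below uses Python's `sqlite3` module (assuming FTS5 is available in the linked SQLite) and a made-up `matches` helper: under porter, the English stemmer reduces "searching" to "search", so the two words match each other; under plain unicode61 they remain distinct tokens.

```python
import sqlite3


def matches(tokenizer, document, query):
    """Index one document with the given FTS5 tokenizer and test a query."""
    db = sqlite3.connect(":memory:")
    db.execute(
        f"CREATE VIRTUAL TABLE t USING fts5(body, tokenize = '{tokenizer}')"
    )
    db.execute("INSERT INTO t(body) VALUES (?)", (document,))
    hits = db.execute(
        "SELECT count(*) FROM t WHERE t MATCH ?", (query,)
    ).fetchone()[0]
    db.close()
    return hits > 0


doc = "searching the archive"

# porter applies English stemming on top of unicode61.
print(matches("porter", doc, "search"))      # True
# unicode61 alone compares whole (case- and diacritic-folded) words.
print(matches("unicode61", doc, "search"))   # False
```

The same English suffix rules get applied to every word regardless of its language, which is why a stemmer-free tokenizer is the safer choice for non-English content.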

(7) By mixcocam on 2023-09-01 14:27:34 in reply to 4

This worked like a charm. Thank you @stephan.

(6) By Warren Young (wyoung) on 2023-08-18 13:27:47 in reply to 1

there doesn't seem to be a way to conduct advanced search

Some months back, drh added a filter on the search pattern that strips out everything but spaces and alphanumerics. The relevant checkin comment doesn't explain why this is a good idea.

Until that is backed out or modified, the current search_match() implementation will never see the * character, the only wildcard character it appears to support. [1]

[1] FTS5 can use LIKE or GLOB, but it appears to be using MATCH instead, with this custom UDF match() implementation.
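To make the effect of such a filter concrete, here is a hypothetical sketch in Python. It is not Fossil's code: the function name is invented, and it assumes "alphanumerics" means whatever Python's `str.isalnum` accepts, which may differ from what the real filter keeps.

```python
def strip_to_alnum_and_spaces(pattern: str) -> str:
    """Keep only alphanumerics and spaces (hypothetical stand-in filter)."""
    return "".join(ch for ch in pattern if ch.isalnum() or ch == " ")


# The wildcard and the phrase quotes never reach the search backend.
print(strip_to_alnum_and_spaces('foss* NOT "forum post"'))
# foss NOT forum post
```

With the `*` and the quotes removed before the pattern is handed on, a prefix or exact-phrase query cannot be expressed, whatever the backend supports.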