Fossil Forum

About Chinese search
Login

About Chinese search

(1) By anonymous on 2022-06-04 13:44:21 [link] [source]

The current search function does not support Chinese.


When I tried to use search to find Chinese characters, I failed.

So, I set up the index of FTS, and tried to search for Chinese characters again, but still failed.

Then, I went to the document and looked up the reason with Google: fossil's search is based on sqlite's FTS, while sqlite's FTS does not support Chinese search …

Then, if I want to add a Chinese search function to fossil, I need to change fossil's source code …

Too much trouble, I had to give up fossil's search.

(2) By anonymous on 2022-06-07 05:22:39 in reply to 1 [link] [source]

Guys, guess what I found out?

I found out that I can use sqlite3's "trigram" splitter to support Chinese search!


But the "trigram" splitter only supports fts5, not fts4.

So I tried to change the search.c fts4 to fts5. 1.

  1. change fts4 to fts5. 1.
  2. fts4 has docid, while fts5 has no docid, need to use rowid instead of docid. this is easy.
  3. change porter splitter to trigram splitter, this is mainly because I don't know how to add a new parameter yet, change old parameter first to simplify the process.
  4. fts4 has matchinfo function, while fts5 does not have matchinfo function, I can not solve this problem. So I temporarily replaced the matchinfo function where it was used.

After I did the above steps, I rebuilt the index in the admin page, and then searched with Chinese characters in the search page. Wow, I got Chinese results! Great!


In the source code of search.c, lines 22 to 24 say that you can switch fts4 and fts5 in the admin page, but I didn't find this option. Is it possible that these comments are obsolete?

If it is out of date, I think it will mislead people who want to turn on fts5.


I finally see hope for fossil to implement Chinese character search!

Translated with www.DeepL.com/Translator (free version)

(3) By Stephan Beal (stephan) on 2022-06-07 06:20:51 in reply to 2 [link] [source]

fts4 has matchinfo function, while fts5 does not have matchinfo function, I can not solve this problem. So I temporarily replaced the matchinfo function where it was used.

What did you replace it with? If the transition is smooth, perhaps we should update fossil to fts5.

In the source code of search.c, lines 22 to 24 say that you can switch fts4 and fts5 in the admin page, but I didn't find this option. Is it possible that these comments are obsolete?

That part seems to be incorrect. The db scheme later in that same file is hard-coded to fts4 and that comment is the only mention of fts5. That will be patched momentarily.

(4) By anonymous on 2022-06-07 07:19:12 in reply to 3 [link] [source]

What did you replace it with?

--- " rank(matchinfo(ftsidx,'pcsx')),"
+++ " ftsidx.rowid,"

"matchinfo" is mainly used for sorting here. I'll leave it unsorted for the time being.

I know this is not a good implementation.

I'm looking at the documentation of "sqlite" to find a perfect alternative to "matchinfo".

(5) By anonymous on 2022-06-08 05:58:00 in reply to 1 [link] [source]

I found a way to migrate fossil from fts4 to fts5.


There are the following steps to migrate fossil from fts4 to fts5:

  1. Change the name of fts4 to fts5.
  2. fts4 has docid, while fts5 has no docid. Rowid is needed to replace docid in fts5.
  3. fts4 has matchinfo function, while fts5 has no matchinfo function. The matchinfo function needs to be implemented in fts5.
  4. fts4 has a full-featured snippet function, but the function of the snippet function in fts5 is weakened.

It can be seen from the above steps that the difficulties of fossil's migration from fts4 to fts5 mainly focus on the matchinfo function and snippet function. Next, I want to explain my solution.

matchinfo function

To realize the matchinfo function by yourself, you need to look at the matchinfo source code of fts4, and then realize it in fts5. It's too much trouble. I don't recommend it.

I searched the timeline of sqlite for matchinfo, and found that dan had already implemented the matchinfo function in the ext/fts5/fts5_test_mi.c file of sqlite.

Only need to move the matchinfo function in fts5_test_mi.c to fossil to realize the matchinfo function.

snippet function

Fortunately, fossil only uses the basic functions of snippet.

Therefore, fossil is relatively simple in the migration of snippet function.

I have read the documents of fts4 and fts5, and the parameters of both sides are basically the same, but the order is different.

Put the fifth parameter of the snippet function of fts4 in the second parameter position of the snippet function of fts5.


That's all.

(6) By Stephan Beal (stephan) on 2022-06-08 06:38:36 in reply to 5 [link] [source]

I found a way to migrate fossil from fts4 to fts5

Thank you for following up on this. Yesterday i asked Richard about such a migration and there are no apparent technical hurdles to migrating. The current bottleneck is the limited number of hours per day to work on our many TODOs (across multiple projects, not just fossil). i'm bookmarking your response and hope to be able to attempt this migration in the foreseeable future.

Would you mind posting a patch of your work? Though we can't use anonymous patches as-is, it would be a great starting point and would help us get an idea of the amount of effort required. If the patch is huge, you can email it to me if you prefer. Contact info is at https://wanderinghorse.net/home/stephan.

(7) By anonymous on 2022-06-08 07:25:32 in reply to 6 [link] [source]

Index: Fossil/Fossil_src_2022-06-07/src/search.c
==================================================================
--- Fossil/Fossil_src_2022-06-07/src/search.c
+++ Fossil/Fossil_src_2022-06-07/src/search.c
@@ -937,14 +937,14 @@
     " SELECT ftsdocs.label,"
     "        ftsdocs.url,"
     "        rank(matchinfo(ftsidx,'pcsx')),"
     "        ftsdocs.type || ftsdocs.rid,"
     "        datetime(ftsdocs.mtime),"
-    "        snippet(ftsidx,'<mark>','</mark>',' ... ',-1,35)"
+    "        snippet(ftsidx,-1,'<mark>','</mark>',' ... ',35)"
     "   FROM ftsidx CROSS JOIN ftsdocs"
     "  WHERE ftsidx MATCH %Q"
-    "    AND ftsdocs.rowid=ftsidx.docid",
+    "    AND ftsdocs.rowid=ftsidx.rowid",
     zPat
   );
   fossil_free(zPat);
   if( srchFlags!=SRCH_ALL ){
     const char *zSep = " AND (";
@@ -1510,11 +1510,11 @@
 /* The schema for the full-text index
 */
 static const char zFtsSchema[] =
 @ -- One entry for each possible search result
 @ CREATE TABLE IF NOT EXISTS repository.ftsdocs(
-@   rowid INTEGER PRIMARY KEY, -- Maps to the ftsidx.docid
+@   rowid INTEGER PRIMARY KEY, -- Maps to the ftsidx.rowid
 @   type CHAR(1),              -- Type of document
 @   rid INTEGER,               -- BLOB.RID or TAG.TAGID for the document
 @   name TEXT,                 -- Additional document description
 @   idxed BOOLEAN,             -- True if currently in the index
 @   label TEXT,                -- Label to print on search results
@@ -1528,11 +1528,11 @@
 @ CREATE VIEW IF NOT EXISTS repository.ftscontent AS
 @   SELECT rowid, type, rid, name, idxed, label, url, mtime,
 @          title(type,rid,name) AS 'title', body(type,rid,name) AS 'body'
 @     FROM ftsdocs;
 @ CREATE VIRTUAL TABLE IF NOT EXISTS repository.ftsidx
-@   USING fts4(content="ftscontent", title, body%s);
+@   USING fts5(content="ftscontent", title, body%s);
 ;
 static const char zFtsDrop[] =
 @ DROP TABLE IF EXISTS repository.ftsidx;
 @ DROP VIEW IF EXISTS repository.ftscontent;
 @ DROP TABLE IF EXISTS repository.ftsdocs;
@@ -1608,11 +1608,11 @@
     char zType[2];
     zType[0] = cType;
     zType[1] = 0;
     search_sql_setup(g.db);
     db_multi_exec(
-       "DELETE FROM ftsidx WHERE docid IN"
+       "DELETE FROM ftsidx WHERE rowid IN"
        "    (SELECT rowid FROM ftsdocs WHERE type=%Q AND rid=%d AND idxed)",
        zType, rid
     );
     db_multi_exec(
        "REPLACE INTO ftsdocs(type,rid,name,idxed)"
@@ -1619,11 +1619,11 @@
        " VALUES(%Q,%d,%Q,0)",
        zType, rid, zName
     );
     if( cType=='w' || cType=='e' ){
       db_multi_exec(
-        "DELETE FROM ftsidx WHERE docid IN"
+        "DELETE FROM ftsidx WHERE rowid IN"
         "    (SELECT rowid FROM ftsdocs WHERE type='%c' AND name=%Q AND idxed)",
         cType, zName
       );
       db_multi_exec(
         "DELETE FROM ftsdocs WHERE type='%c' AND name=%Q AND rid!=%d",
@@ -1659,11 +1659,11 @@
     "   WHERE foci.checkinID=%d AND blob.uuid=foci.uuid"
     "     AND %z",
     ckid, glob_expr("foci.filename", db_get("doc-glob",""))
   );
   db_multi_exec(
-    "DELETE FROM ftsidx WHERE docid IN"
+    "DELETE FROM ftsidx WHERE rowid IN"
     "  (SELECT rowid FROM ftsdocs WHERE type='d'"
     "      AND rid NOT IN (SELECT rid FROM current_docs))"
   );
   db_multi_exec(
     "DELETE FROM ftsdocs WHERE type='d'"
@@ -1678,11 +1678,11 @@
     "         %.17g"
     " FROM current_docs",
     zDocBr, rTime
   );
   db_multi_exec(
-    "INSERT INTO ftsidx(docid,title,body)"
+    "INSERT INTO ftsidx(rowid,title,body)"
     "  SELECT rowid, label, bx FROM ftsdocs WHERE type='d' AND NOT idxed"
   );
   db_multi_exec(
     "UPDATE ftsdocs SET"
     "  idxed=1,"
@@ -1695,11 +1695,11 @@
 /*
 ** Deal with all of the unindexed 'c' terms in FTSDOCS
 */
 static void search_update_checkin_index(void){
   db_multi_exec(
-    "INSERT INTO ftsidx(docid,title,body)"
+    "INSERT INTO ftsidx(rowid,title,body)"
     " SELECT rowid, '', body('c',rid,NULL) FROM ftsdocs"
     "  WHERE type='c' AND NOT idxed;"
   );
   db_multi_exec(
     "UPDATE ftsdocs SET idxed=1, name=NULL,"
@@ -1718,11 +1718,11 @@
 /*
 ** Deal with all of the unindexed 't' terms in FTSDOCS
 */
 static void search_update_ticket_index(void){
   db_multi_exec(
-    "INSERT INTO ftsidx(docid,title,body)"
+    "INSERT INTO ftsidx(rowid,title,body)"
     " SELECT rowid, title('t',rid,NULL), body('t',rid,NULL) FROM ftsdocs"
     "  WHERE type='t' AND NOT idxed;"
   );
   if( db_changes()==0 ) return;
   db_multi_exec(
@@ -1741,11 +1741,11 @@
 /*
 ** Deal with all of the unindexed 'w' terms in FTSDOCS
 */
 static void search_update_wiki_index(void){
   db_multi_exec(
-    "INSERT INTO ftsidx(docid,title,body)"
+    "INSERT INTO ftsidx(rowid,title,body)"
     " SELECT rowid, title('w',rid,NULL),body('w',rid,NULL) FROM ftsdocs"
     "  WHERE type='w' AND NOT idxed;"
   );
   if( db_changes()==0 ) return;
   db_multi_exec(
@@ -1763,11 +1763,11 @@
 /*
 ** Deal with all of the unindexed 'f' terms in FTSDOCS
 */
 static void search_update_forum_index(void){
   db_multi_exec(
-    "INSERT INTO ftsidx(docid,title,body)"
+    "INSERT INTO ftsidx(rowid,title,body)"
     " SELECT rowid, title('f',rid,NULL),body('f',rid,NULL) FROM ftsdocs"
     "  WHERE type='f' AND NOT idxed;"
   );
   if( db_changes()==0 ) return;
   db_multi_exec(
@@ -1786,11 +1786,11 @@
 /*
 ** Deal with all of the unindexed 'e' terms in FTSDOCS
 */
 static void search_update_technote_index(void){
   db_multi_exec(
-    "INSERT INTO ftsidx(docid,title,body)"
+    "INSERT INTO ftsidx(rowid,title,body)"
     " SELECT rowid, title('e',rid,NULL),body('e',rid,NULL) FROM ftsdocs"
     "  WHERE type='e' AND NOT idxed;"
   );
   if( db_changes()==0 ) return;
   db_multi_exec(
@@ -2002,24 +2002,24 @@
       const char *zUrl = db_column_text(&q,4);
       const char *zDocId = db_column_text(&q,0);
       char *zName;
       char *z;
       @ <table border=0>
-      @ <tr><td align='right'>docid:<td>&nbsp;&nbsp;<td>%d(id)
+      @ <tr><td align='right'>rowid:<td>&nbsp;&nbsp;<td>%d(id)
       @ <tr><td align='right'>id:<td><td>%s(zDocId)
       @ <tr><td align='right'>name:<td><td>%h(db_column_text(&q,1))
       @ <tr><td align='right'>idxed:<td><td>%d(db_column_int(&q,2))
       @ <tr><td align='right'>label:<td><td>%h(db_column_text(&q,3))
       @ <tr><td align='right'>url:<td><td>
       @ <a href='%R%s(zUrl)'>%h(zUrl)</a>
       @ <tr><td align='right'>mtime:<td><td>%s(db_column_text(&q,5))
-      z = db_text(0, "SELECT title FROM ftsidx WHERE docid=%d",id);
+      z = db_text(0, "SELECT title FROM ftsidx WHERE rowid=%d",id);
       if( z && z[0] ){
         @ <tr><td align="right">title:<td><td>%h(z)
         fossil_free(z);
       }
-      z = db_text(0, "SELECT body FROM ftsidx WHERE docid=%d",id);
+      z = db_text(0, "SELECT body FROM ftsidx WHERE rowid=%d",id);
       if( z && z[0] ){
         @ <tr><td align="right" valign="top">body:<td><td>%h(z)
         fossil_free(z);
       }
       @ </table>


Index: Fossil/Fossil_src_2022-06-07/extsrc/sqlite3.c
==================================================================
--- Fossil/Fossil_src_2022-06-07/extsrc/sqlite3.c
+++ Fossil/Fossil_src_2022-06-07/extsrc/sqlite3.c
@@ -236463,10 +236463,387 @@
   for(i=0; i<sizeof(azName)/sizeof(azName[0]); i++){
     if( sqlite3_stricmp(zName, azName[i])==0 ) return 1;
   }
   return 0;
 }
+
+#ifdef SQLITE_ENABLE_FTS5
+
+typedef struct Fts5MatchinfoCtx Fts5MatchinfoCtx;
+
+#ifndef SQLITE_AMALGAMATION
+typedef unsigned int u32;
+#endif
+
+struct Fts5MatchinfoCtx {
+  int nCol;                       /* Number of cols in FTS5 table */
+  int nPhrase;                    /* Number of phrases in FTS5 query */
+  char *zArg;                     /* nul-term'd copy of 2nd arg */
+  int nRet;                       /* Number of elements in aRet[] */
+  u32 *aRet;                      /* Array of 32-bit unsigned ints to return */
+};
+
+
+
+/*
+** Return a pointer to the fts5_api pointer for database connection db.
+** If an error occurs, return NULL and leave an error in the database 
+** handle (accessible using sqlite3_errcode()/errmsg()).
+*/
+static int fts5_api_from_db(sqlite3 *db, fts5_api **ppApi){
+  sqlite3_stmt *pStmt = 0;
+  int rc;
+
+  *ppApi = 0;
+  rc = sqlite3_prepare(db, "SELECT fts5(?1)", -1, &pStmt, 0);
+  if( rc==SQLITE_OK ){
+    sqlite3_bind_pointer(pStmt, 1, (void*)ppApi, "fts5_api_ptr", 0);
+    (void)sqlite3_step(pStmt);
+    rc = sqlite3_finalize(pStmt);
+  }
+
+  return rc;
+}
+
+
+/*
+** Argument f should be a flag accepted by matchinfo() (a valid character
+** in the string passed as the second argument). If it is not, -1 is 
+** returned. Otherwise, if f is a valid matchinfo flag, the value returned
+** is the number of 32-bit integers added to the output array if the
+** table has nCol columns and the query nPhrase phrases.
+*/
+static int fts5MatchinfoFlagsize(int nCol, int nPhrase, char f){
+  int ret = -1;
+  switch( f ){
+    case 'p': ret = 1; break;
+    case 'c': ret = 1; break;
+    case 'x': ret = 3 * nCol * nPhrase; break;
+    case 'y': ret = nCol * nPhrase; break;
+    case 'b': ret = ((nCol + 31) / 32) * nPhrase; break;
+    case 'n': ret = 1; break;
+    case 'a': ret = nCol; break;
+    case 'l': ret = nCol; break;
+    case 's': ret = nCol; break;
+  }
+  return ret;
+}
+
+static int fts5MatchinfoIter(
+  const Fts5ExtensionApi *pApi,   /* API offered by current FTS version */
+  Fts5Context *pFts,              /* First arg to pass to pApi functions */
+  Fts5MatchinfoCtx *p,
+  int(*x)(const Fts5ExtensionApi*,Fts5Context*,Fts5MatchinfoCtx*,char,u32*)
+){
+  int i;
+  int n = 0;
+  int rc = SQLITE_OK;
+  char f;
+  for(i=0; (f = p->zArg[i]); i++){
+    rc = x(pApi, pFts, p, f, &p->aRet[n]);
+    if( rc!=SQLITE_OK ) break;
+    n += fts5MatchinfoFlagsize(p->nCol, p->nPhrase, f);
+  }
+  return rc;
+}
+
+static int fts5MatchinfoXCb(
+  const Fts5ExtensionApi *pApi,
+  Fts5Context *pFts,
+  void *pUserData
+){
+  Fts5PhraseIter iter;
+  int iCol, iOff;
+  u32 *aOut = (u32*)pUserData;
+  int iPrev = -1;
+
+  for(pApi->xPhraseFirst(pFts, 0, &iter, &iCol, &iOff); 
+      iCol>=0; 
+      pApi->xPhraseNext(pFts, &iter, &iCol, &iOff)
+  ){
+    aOut[iCol*3+1]++;
+    if( iCol!=iPrev ) aOut[iCol*3 + 2]++;
+    iPrev = iCol;
+  }
+
+  return SQLITE_OK;
+}
+
+static int fts5MatchinfoGlobalCb(
+  const Fts5ExtensionApi *pApi,
+  Fts5Context *pFts,
+  Fts5MatchinfoCtx *p,
+  char f,
+  u32 *aOut
+){
+  int rc = SQLITE_OK;
+  switch( f ){
+    case 'p':
+      aOut[0] = p->nPhrase; 
+      break;
+
+    case 'c':
+      aOut[0] = p->nCol; 
+      break;
+
+    case 'x': {
+      int i;
+      for(i=0; i<p->nPhrase && rc==SQLITE_OK; i++){
+        void *pPtr = (void*)&aOut[i * p->nCol * 3];
+        rc = pApi->xQueryPhrase(pFts, i, pPtr, fts5MatchinfoXCb);
+      }
+      break;
+    }
+
+    case 'n': {
+      sqlite3_int64 nRow;
+      rc = pApi->xRowCount(pFts, &nRow);
+      aOut[0] = (u32)nRow;
+      break;
+    }
+
+    case 'a': {
+      sqlite3_int64 nRow = 0;
+      rc = pApi->xRowCount(pFts, &nRow);
+      if( nRow==0 ){
+        memset(aOut, 0, sizeof(u32) * p->nCol);
+      }else{
+        int i;
+        for(i=0; rc==SQLITE_OK && i<p->nCol; i++){
+          sqlite3_int64 nToken;
+          rc = pApi->xColumnTotalSize(pFts, i, &nToken);
+          if( rc==SQLITE_OK){
+            aOut[i] = (u32)((2*nToken + nRow) / (2*nRow));
+          }
+        }
+      }
+      break;
+    }
+
+  }
+  return rc;
+}
+
+static int fts5MatchinfoLocalCb(
+  const Fts5ExtensionApi *pApi,
+  Fts5Context *pFts,
+  Fts5MatchinfoCtx *p,
+  char f,
+  u32 *aOut
+){
+  int i;
+  int rc = SQLITE_OK;
+
+  switch( f ){
+    case 'b': {
+      int iPhrase;
+      int nInt = ((p->nCol + 31) / 32) * p->nPhrase;
+      for(i=0; i<nInt; i++) aOut[i] = 0;
+
+      for(iPhrase=0; iPhrase<p->nPhrase; iPhrase++){
+        Fts5PhraseIter iter;
+        int iCol;
+        for(pApi->xPhraseFirstColumn(pFts, iPhrase, &iter, &iCol);
+            iCol>=0; 
+            pApi->xPhraseNextColumn(pFts, &iter, &iCol)
+        ){
+          aOut[iPhrase * ((p->nCol+31)/32) + iCol/32] |= ((u32)1 << iCol%32);
+        }
+      }
+
+      break;
+    }
+
+    case 'x':
+    case 'y': {
+      int nMul = (f=='x' ? 3 : 1);
+      int iPhrase;
+
+      for(i=0; i<(p->nCol*p->nPhrase); i++) aOut[i*nMul] = 0;
+
+      for(iPhrase=0; iPhrase<p->nPhrase; iPhrase++){
+        Fts5PhraseIter iter;
+        int iOff, iCol;
+        for(pApi->xPhraseFirst(pFts, iPhrase, &iter, &iCol, &iOff); 
+            iOff>=0; 
+            pApi->xPhraseNext(pFts, &iter, &iCol, &iOff)
+        ){
+          aOut[nMul * (iCol + iPhrase * p->nCol)]++;
+        }
+      }
+
+      break;
+    }
+
+    case 'l': {
+      for(i=0; rc==SQLITE_OK && i<p->nCol; i++){
+        int nToken;
+        rc = pApi->xColumnSize(pFts, i, &nToken);
+        aOut[i] = (u32)nToken;
+      }
+      break;
+    }
+
+    case 's': {
+      int nInst;
+
+      memset(aOut, 0, sizeof(u32) * p->nCol);
+
+      rc = pApi->xInstCount(pFts, &nInst);
+      for(i=0; rc==SQLITE_OK && i<nInst; i++){
+        int iPhrase, iOff, iCol = 0;
+        int iNextPhrase;
+        int iNextOff;
+        u32 nSeq = 1;
+        int j;
+
+        rc = pApi->xInst(pFts, i, &iPhrase, &iCol, &iOff);
+        iNextPhrase = iPhrase+1;
+        iNextOff = iOff+pApi->xPhraseSize(pFts, 0);
+        for(j=i+1; rc==SQLITE_OK && j<nInst; j++){
+          int ip, ic, io;
+          rc = pApi->xInst(pFts, j, &ip, &ic, &io);
+          if( ic!=iCol || io>iNextOff ) break;
+          if( ip==iNextPhrase && io==iNextOff ){
+            nSeq++;
+            iNextPhrase = ip+1;
+            iNextOff = io + pApi->xPhraseSize(pFts, ip);
+          }
+        }
+
+        if( nSeq>aOut[iCol] ) aOut[iCol] = nSeq;
+      }
+
+      break;
+    }
+  }
+  return rc;
+}
+ 
+static Fts5MatchinfoCtx *fts5MatchinfoNew(
+  const Fts5ExtensionApi *pApi,   /* API offered by current FTS version */
+  Fts5Context *pFts,              /* First arg to pass to pApi functions */
+  sqlite3_context *pCtx,          /* Context for returning error message */
+  const char *zArg                /* Matchinfo flag string */
+){
+  Fts5MatchinfoCtx *p;
+  int nCol;
+  int nPhrase;
+  int i;
+  int nInt;
+  sqlite3_int64 nByte;
+  int rc;
+
+  nCol = pApi->xColumnCount(pFts);
+  nPhrase = pApi->xPhraseCount(pFts);
+
+  nInt = 0;
+  for(i=0; zArg[i]; i++){
+    int n = fts5MatchinfoFlagsize(nCol, nPhrase, zArg[i]);
+    if( n<0 ){
+      char *zErr = sqlite3_mprintf("unrecognized matchinfo flag: %c", zArg[i]);
+      sqlite3_result_error(pCtx, zErr, -1);
+      sqlite3_free(zErr);
+      return 0;
+    }
+    nInt += n;
+  }
+
+  nByte = sizeof(Fts5MatchinfoCtx)          /* The struct itself */
+         + sizeof(u32) * nInt               /* The p->aRet[] array */
+         + (i+1);                           /* The p->zArg string */
+  p = (Fts5MatchinfoCtx*)sqlite3_malloc64(nByte);
+  if( p==0 ){
+    sqlite3_result_error_nomem(pCtx);
+    return 0;
+  }
+  memset(p, 0, nByte);
+
+  p->nCol = nCol;
+  p->nPhrase = nPhrase;
+  p->aRet = (u32*)&p[1];
+  p->nRet = nInt;
+  p->zArg = (char*)&p->aRet[nInt];
+  memcpy(p->zArg, zArg, i);
+
+  rc = fts5MatchinfoIter(pApi, pFts, p, fts5MatchinfoGlobalCb);
+  if( rc!=SQLITE_OK ){
+    sqlite3_result_error_code(pCtx, rc);
+    sqlite3_free(p);
+    p = 0;
+  }
+
+  return p;
+}
+
+static void fts5MatchinfoFunc(
+  const Fts5ExtensionApi *pApi,   /* API offered by current FTS version */
+  Fts5Context *pFts,              /* First arg to pass to pApi functions */
+  sqlite3_context *pCtx,          /* Context for returning result/error */
+  int nVal,                       /* Number of values in apVal[] array */
+  sqlite3_value **apVal           /* Array of trailing arguments */
+){
+  const char *zArg;
+  Fts5MatchinfoCtx *p;
+  int rc = SQLITE_OK;
+
+  if( nVal>0 ){
+    zArg = (const char*)sqlite3_value_text(apVal[0]);
+  }else{
+    zArg = "pcx";
+  }
+
+  p = (Fts5MatchinfoCtx*)pApi->xGetAuxdata(pFts, 0);
+  if( p==0 || sqlite3_stricmp(zArg, p->zArg) ){
+    p = fts5MatchinfoNew(pApi, pFts, pCtx, zArg);
+    if( p==0 ){
+      rc = SQLITE_NOMEM;
+    }else{
+      rc = pApi->xSetAuxdata(pFts, p, sqlite3_free);
+    }
+  }
+
+  if( rc==SQLITE_OK ){
+    rc = fts5MatchinfoIter(pApi, pFts, p, fts5MatchinfoLocalCb);
+  }
+  if( rc!=SQLITE_OK ){
+    sqlite3_result_error_code(pCtx, rc);
+  }else{
+    /* No errors has occured, so return a copy of the array of integers. */
+    int nByte = p->nRet * sizeof(u32);
+    sqlite3_result_blob(pCtx, (void*)p->aRet, nByte, SQLITE_TRANSIENT);
+  }
+}
+
+int sqlite3Fts5TestRegisterMatchinfo(sqlite3 *db){
+  int rc;                         /* Return code */
+  fts5_api *pApi;                 /* FTS5 API functions */
+
+  /* Extract the FTS5 API pointer from the database handle. The 
+  ** fts5_api_from_db() function above is copied verbatim from the 
+  ** FTS5 documentation. Refer there for details. */
+  rc = fts5_api_from_db(db, &pApi);
+  if( rc!=SQLITE_OK ) return rc;
+
+  /* If fts5_api_from_db() returns NULL, then either FTS5 is not registered
+  ** with this database handle, or an error (OOM perhaps?) has occurred.
+  **
+  ** Also check that the fts5_api object is version 2 or newer.  
+  */ 
+  if( pApi==0 || pApi->iVersion<2 ){
+    return SQLITE_ERROR;
+  }
+
+  /* Register the implementation of matchinfo() */
+  rc = pApi->xCreateFunction(pApi, "matchinfo", 0, fts5MatchinfoFunc, 0);
+
+  return rc;
+}
+
+#endif /* SQLITE_ENABLE_FTS5 */
+
+
 
 static int fts5Init(sqlite3 *db){
   static const sqlite3_module fts5Mod = {
     /* iVersion      */ 3,
     /* xCreate       */ fts5CreateMethod,
@@ -236527,16 +236904,16 @@
   }
 
   /* If SQLITE_FTS5_ENABLE_TEST_MI is defined, assume that the file
   ** fts5_test_mi.c is compiled and linked into the executable. And call
   ** its entry point to enable the matchinfo() demo.  */
-#ifdef SQLITE_FTS5_ENABLE_TEST_MI
+
   if( rc==SQLITE_OK ){
-    extern int sqlite3Fts5TestRegisterMatchinfo(sqlite3*);
+
     rc = sqlite3Fts5TestRegisterMatchinfo(db);
   }
-#endif
+
 
   return rc;
 }
 
 /*

I want to emphasize here that the implementation of the matchinfo function is carried from the ext/fts5/fts5_test_mi.c file of sqlite.

(8) By Stephan Beal (stephan) on 2022-06-08 08:08:15 in reply to 7 [link] [source]

Index: Fossil/Fossil_src_2022-06-07/src/search.c ... I want to emphasize here that the implementation of the matchinfo function is carried from the ext/fts5/fts5_test_mi.c file of sqlite.

Thank you very much for that! Fossil's FTS bits are still black-box to me, so your patch will provide a great guide to what needs to be updated. i can't guaranty when that will happen but your patch makes it far, far more likely that it will happen "sometime soon."

(9) By Stephan Beal (stephan) on 2023-01-24 02:59:04 in reply to 7 [link] [source]

Index: Fossil/Fossil_src_2022-06-07/src/search.c

This is long overdue, but an initial version of the FTS5 upgrade is at:

src:/timeline?r=search-fts5

it's not yet heavily tested but seems to work.

(10) By Richard Hipp (drh) on 2023-01-24 13:48:22 in reply to 9 [link] [source]

The only issue I have with the search-fts5 branch is that when you go to update your Fossil binary on a server, you have to be sure to click the "Rebuild The Full-Text Index" button, or else nothing will work right.

I wonder is there anything we can do to make that index rebuild automatic?

(11) By Richard Hipp (drh) on 2023-01-24 13:55:17 in reply to 10 [link] [source]

And, if you set up the FTS5 index, then revert your Fossil binary to one that uses FTS4, you get an SQL error on any search request...

(12) By Richard Hipp (drh) on 2023-01-24 13:58:22 in reply to 10 [link] [source]

Is the solution here that we just tell people when upgrading to a newer Fossil (or going back to an older version) that you must run

fossil all fts-config reindex

to get search working again?

(13) By Richard Hipp (drh) on 2023-01-24 14:08:20 in reply to 10 [link] [source]

I think I have a better solution for this. Anybody watching can talk to me about it on chat. I'll update this post once I get a check-in ready.

(14) By Stephan Beal (stephan) on 2023-01-24 21:29:36 in reply to 13 [link] [source]

I think I have a better solution for this.

Update: Richard patched the FTS changes to be able to gracefully handle the transition from FTS4 and FTS5 without requiring explicit actions from users. That version is now running on this site and has been merged into trunk.

(15) By anonymous on 2023-04-17 07:46:43 in reply to 14 [source]

I'm sorry that I was sick earlier, so I'm just now seeing your responses.

It's great to see that you guys have finally completed the migration of Fossil from FTS4 to FTS5. This is so cool!

However you guys seem to have missed one thing.

The reason I brought up Fossil's FTS5 was because I couldn't do a Chinese search in Fossil.

Now that Fossil has migrated to FTS5, there is only Porter Stemmer and no Trigram Splitter.

When Fossil has Trigram Splitter, it will be possible to do Chinese search.

Luckily, SQLite has Trigram Splitter built-in, and Fossil only needs to add a button for this purpose, so that Chinese search can be easily implemented. This is a relatively easy thing to do.

Could you please add this button? Thank you!

Translated with www.DeepL.com/Translator (free version)

(16) By Stephan Beal (stephan) on 2023-04-17 18:36:26 in reply to 15 [link] [source]

Luckily, SQLite has Trigram Splitter built-in, and Fossil only needs to add a button for this purpose, so that Chinese search can be easily implemented. This is a relatively easy thing to do.

There's an initial version of this support in the fts5-trigram branch. It is currently missing the option to set this tokenizer in the web UI but it can be configured from the command-line:

$ fossil fts-config tokenizer trigram -R REPOSITORY_FILE

if you are able to, please try that and see if searching in Chinese then works for you. It's unlikely that any of us will be able to adequately determine whether or not Chinese searching really works.

The trigram config option for the web UI is pending and should be done sometime in the next couple of days.

(17) By anonymous on 2023-04-18 02:15:05 in reply to 16 [link] [source]

This is so cool!

I downloaded the source code for the fts5-trigram branch and compiled it.

Then I ran the command "fossil fts-config tokenizer trigram -R REPOSITORY_FILE", and something magical happened—Fossil created an FTS5 index with trigram as the tokenizer.

After that, I tried Chinese search! It's amazing! The Chinese search succeeded! Thank you for your hard work!

(18) By Stephan Beal (stephan) on 2023-04-18 02:20:38 in reply to 17 [link] [source]

After that, I tried Chinese search! It's amazing! The Chinese search succeeded!

Fantastic, thank you for the confirmation. The missing UI piece will be added in the next day or two.

(19) By Stephan Beal (stephan) on 2023-04-18 17:50:52 in reply to 16 [link] [source]

The trigram config option for the web UI is pending and should be done sometime in the next couple of days.

The /srchsetup page now has a selection list of 3 different tokenizer options: none, Porter stemmer, and trigram. Those changes have been merged to trunk. Please report any problems with it here.

(20) By anonymous on 2023-06-04 04:04:24 in reply to 19 [link] [source]

Hi, stephan

Thanks for this, I tested w64-2.22 with "trigram" and met following error when searching Chinese character/word

Database error: fts5: syntax error near "": {INSERT INTO x(label,url,score,id,date,snip) SELECT ftsdocs.label, ftsdocs.url, rank(matchinfo(ftsidx,'pcsx')), ftsdocs.type || ftsdocs.rid, datetime(ftsdocs.mtime), snippet(ftsidx,-1,'<mark>','</mark>',' ... ',35) FROM ftsidx CROSS JOIN ftsdocs WHERE ftsidx MATCH ' ' AND ftsdocs.rowid=ftsidx.rowid AND (ftsdocs.type='w')}

(21) By Stephan Beal (stephan) on 2023-06-04 09:27:16 in reply to 20 [link] [source]

I tested w64-2.22 with "trigram" and met following error when searching Chinese character/word

Every time you search, regardless of the input, or for a specific input? Nothing in the search/indexing code has changed since the last update, so i'm at a loss to explain this.

Could you post a link to your repository and some Chinese text which should be found but isn't?

(22) By anonymous on 2023-06-04 14:50:49 in reply to 21 [link] [source]

you can download it from https://file.io/DB6fJC8qw1RE

It's very simple, just init then enable search in admin, add two wiki pages

  • test hello world:test hello worldtest hello worldtest hello world

  • test :中文测试

Searched under wiki by English "hello" successfully, but failed by Chinese "中文" with following error

Database error: fts5: syntax error near "": {INSERT INTO x(label,url,score,id,date,snip) SELECT ftsdocs.label, ftsdocs.url, rank(matchinfo(ftsidx,'pcsx')), ftsdocs.type || ftsdocs.rid, datetime(ftsdocs.mtime), snippet(ftsidx,-1,'<mark>','</mark>',' ... ',35) FROM ftsidx CROSS JOIN ftsdocs WHERE ftsidx MATCH ' ' AND ftsdocs.rowid=ftsidx.rowid AND (ftsdocs.type='w')}

(23) By Stephan Beal (stephan) on 2023-06-04 20:12:16 in reply to 22 [link] [source]

It's very simple, just init then enable search in admin, add two wiki pages

Thank you for the reproduce case. i will look into this the next time i have access to AC electricity (sometime in the next couple of days).

(24.2) By Stephan Beal (stephan) on 2023-06-05 10:33:25 edited from 24.1 in reply to 22 [link] [source]

Searched under wiki by English "hello" successfully, but failed by Chinese "中文" with following error

Follow-up:

The error message is an inadvertent side effect of src:196dfedf7fc467cd, which indirectly makes the FTS query invalid. i will need to discuss with Richard the implications of backing out that change.

However... when i locally back out that change then Chinese search "works" again, but not as i would intuitively expect it to.

Searching for an individual glyph does not return any results, but searching for all of them at once does. For example, with this wiki page content:

中文测试

Searching for any subset of those glyphs returns no results, but searching for the whole text works.

That is not a fossil-specific quirk, as the sqlite3 CLI app behaves the same way:

sqlite> create virtual table f using fts5 (txt,tokenize=trigram);
sqlite> insert into f(txt) values('中文测试');
sqlite> select * from f where txt match '中';
sqlite> select * from f where txt match '中文测试';
中文测试

The results are the same when using fossil's built-in copy of sqlite.

In other words, it does not look like searching for Chinese will be useful as-is. We'll need to explore other options for adding proper search support for Chinese. Unfortunately, my FTS5 skills are too low to suggest how we might go about that.

Edit: FTS tables which use the trigram tokenizer can (i just learned) use GLOB and LIKE patterns, and they then produce results where the above test finds none:

select * from f where txt glob '*中*';

but (A) whether that's an appropriate change for other tokenizers is not clear and (B) how to accurately detect whether or not the trigram tokenizer is actually in effect (i.e. by asking FTS instead of relying on the config-level setting (though i guess we can safely rely on that setting for this)), so that we can switch to this syntax for that tokenizer, is not clear.

(25) By anonymous on 2023-06-06 04:30:24 in reply to 24.2 [link] [source]

Thanks for the follow up!

Sorry for my wrong test case of 2 characters, seems trigram requires at least 3 unicode characters.

So if this 2.22 broken error which made FTS query(trigram + Chinese) invalid gets fixed in next release, we can search "中文测" and get result. although it's a limit but works.

I'm also not sure "like/blob" is appropriate here, further more to support for better Chinese search, maybe need better tokenizer such like simple, so we can select it from admin of FTS Tokenizer. I will make a try when get some time later.