Fossil Forum

About Chinese search
Login

About Chinese search

About Chinese search

(1) By anonymous on 2022-06-04 13:44:21 [link] [source]

The current search function does not support Chinese.


When I tried to use search to find Chinese characters, I failed.

So, I set up the index of FTS, and tried to search for Chinese characters again, but still failed.

Then, I went to the document and looked up the reason with Google: fossil's search is based on sqlite's FTS, while sqlite's FTS does not support Chinese search …

Then, if I want to add a Chinese search function to fossil, I need to change fossil's source code …

Too much trouble, I had to give up fossil's search.

(2) By anonymous on 2022-06-07 05:22:39 in reply to 1 [link] [source]

Guys, guess what I found out?

I found out that I can use sqlite3's "trigram" splitter to support Chinese search!


But the "trigram" splitter only supports fts5, not fts4.

So I tried to change the search.c fts4 to fts5. 1.

  1. change fts4 to fts5. 1.
  2. fts4 has docid, while fts5 has no docid, need to use rowid instead of docid. this is easy.
  3. change porter splitter to trigram splitter, this is mainly because I don't know how to add a new parameter yet, change old parameter first to simplify the process.
  4. fts4 has matchinfo function, while fts5 does not have matchinfo function, I can not solve this problem. So I temporarily replaced the matchinfo function where it was used.

After I did the above steps, I rebuilt the index in the admin page, and then searched with Chinese characters in the search page. Wow, I got Chinese results! Great!


In the source code of search.c, lines 22 to 24 say that you can switch fts4 and fts5 in the admin page, but I didn't find this option. Is it possible that these comments are obsolete?

If it is out of date, I think it will mislead people who want to turn on fts5.


I finally see hope for fossil to implement Chinese character search!

Translated with www.DeepL.com/Translator (free version)

(3) By Stephan Beal (stephan) on 2022-06-07 06:20:51 in reply to 2 [link] [source]

fts4 has matchinfo function, while fts5 does not have matchinfo function, I can not solve this problem. So I temporarily replaced the matchinfo function where it was used.

What did you replace it with? If the transition is smooth, perhaps we should update fossil to fts5.

In the source code of search.c, lines 22 to 24 say that you can switch fts4 and fts5 in the admin page, but I didn't find this option. Is it possible that these comments are obsolete?

That part seems to be incorrect. The db scheme later in that same file is hard-coded to fts4 and that comment is the only mention of fts5. That will be patched momentarily.

(4) By anonymous on 2022-06-07 07:19:12 in reply to 3 [link] [source]

What did you replace it with?

--- " rank(matchinfo(ftsidx,'pcsx')),"
+++ " ftsidx.rowid,"

"matchinfo" is mainly used for sorting here. I'll leave it unsorted for the time being.

I know this is not a good implementation.

I'm looking at the documentation of "sqlite" to find a perfect alternative to "matchinfo".

(5) By anonymous on 2022-06-08 05:58:00 in reply to 1 [link] [source]

I found a way to migrate fossil from fts4 to fts5.


There are the following steps to migrate fossil from fts4 to fts5:

  1. Change the name of fts4 to fts5.
  2. fts4 has docid, while fts5 has no docid. Rowid is needed to replace docid in fts5.
  3. fts4 has matchinfo function, while fts5 has no matchinfo function. The matchinfo function needs to be implemented in fts5.
  4. fts4 has a full-featured snippet function, but the function of the snippet function in fts5 is weakened.

It can be seen from the above steps that the difficulties of fossil's migration from fts4 to fts5 mainly focus on the matchinfo function and snippet function. Next, I want to explain my solution.

matchinfo function

To realize the matchinfo function by yourself, you need to look at the matchinfo source code of fts4, and then realize it in fts5. It's too much trouble. I don't recommend it.

I searched the timeline of sqlite for matchinfo, and found that dan had already implemented the matchinfo function in the ext/fts5/fts5_test_mi.c file of sqlite.

Only need to move the matchinfo function in fts5_test_mi.c to fossil to realize the matchinfo function.

snippet function

Fortunately, fossil only uses the basic functions of snippet.

Therefore, fossil is relatively simple in the migration of snippet function.

I have read the documents of fts4 and fts5, and the parameters of both sides are basically the same, but the order is different.

Put the fifth parameter of the snippet function of fts4 in the second parameter position of the snippet function of fts5.


That's all.

(6) By Stephan Beal (stephan) on 2022-06-08 06:38:36 in reply to 5 [link] [source]

I found a way to migrate fossil from fts4 to fts5

Thank you for following up on this. Yesterday i asked Richard about such a migration and there are no apparent technical hurdles to migrating. The current bottleneck is the limited number of hours per day to work on our many TODOs (across multiple projects, not just fossil). i'm bookmarking your response and hope to be able to attempt this migration in the foreseeable future.

Would you mind posting a patch of your work? Though we can't use anonymous patches as-is, it would be a great starting point and would help us get an idea of the amount of effort required. If the patch is huge, you can email it to me if you prefer. Contact info is at https://wanderinghorse.net/home/stephan.

(7) By anonymous on 2022-06-08 07:25:32 in reply to 6 [link] [source]

Index: Fossil/Fossil_src_2022-06-07/src/search.c
==================================================================
--- Fossil/Fossil_src_2022-06-07/src/search.c
+++ Fossil/Fossil_src_2022-06-07/src/search.c
@@ -937,14 +937,14 @@
     " SELECT ftsdocs.label,"
     "        ftsdocs.url,"
     "        rank(matchinfo(ftsidx,'pcsx')),"
     "        ftsdocs.type || ftsdocs.rid,"
     "        datetime(ftsdocs.mtime),"
-    "        snippet(ftsidx,'<mark>','</mark>',' ... ',-1,35)"
+    "        snippet(ftsidx,-1,'<mark>','</mark>',' ... ',35)"
     "   FROM ftsidx CROSS JOIN ftsdocs"
     "  WHERE ftsidx MATCH %Q"
-    "    AND ftsdocs.rowid=ftsidx.docid",
+    "    AND ftsdocs.rowid=ftsidx.rowid",
     zPat
   );
   fossil_free(zPat);
   if( srchFlags!=SRCH_ALL ){
     const char *zSep = " AND (";
@@ -1510,11 +1510,11 @@
 /* The schema for the full-text index
 */
 static const char zFtsSchema[] =
 @ -- One entry for each possible search result
 @ CREATE TABLE IF NOT EXISTS repository.ftsdocs(
-@   rowid INTEGER PRIMARY KEY, -- Maps to the ftsidx.docid
+@   rowid INTEGER PRIMARY KEY, -- Maps to the ftsidx.rowid
 @   type CHAR(1),              -- Type of document
 @   rid INTEGER,               -- BLOB.RID or TAG.TAGID for the document
 @   name TEXT,                 -- Additional document description
 @   idxed BOOLEAN,             -- True if currently in the index
 @   label TEXT,                -- Label to print on search results
@@ -1528,11 +1528,11 @@
 @ CREATE VIEW IF NOT EXISTS repository.ftscontent AS
 @   SELECT rowid, type, rid, name, idxed, label, url, mtime,
 @          title(type,rid,name) AS 'title', body(type,rid,name) AS 'body'
 @     FROM ftsdocs;
 @ CREATE VIRTUAL TABLE IF NOT EXISTS repository.ftsidx
-@   USING fts4(content="ftscontent", title, body%s);
+@   USING fts5(content="ftscontent", title, body%s);
 ;
 static const char zFtsDrop[] =
 @ DROP TABLE IF EXISTS repository.ftsidx;
 @ DROP VIEW IF EXISTS repository.ftscontent;
 @ DROP TABLE IF EXISTS repository.ftsdocs;
@@ -1608,11 +1608,11 @@
     char zType[2];
     zType[0] = cType;
     zType[1] = 0;
     search_sql_setup(g.db);
     db_multi_exec(
-       "DELETE FROM ftsidx WHERE docid IN"
+       "DELETE FROM ftsidx WHERE rowid IN"
        "    (SELECT rowid FROM ftsdocs WHERE type=%Q AND rid=%d AND idxed)",
        zType, rid
     );
     db_multi_exec(
        "REPLACE INTO ftsdocs(type,rid,name,idxed)"
@@ -1619,11 +1619,11 @@
        " VALUES(%Q,%d,%Q,0)",
        zType, rid, zName
     );
     if( cType=='w' || cType=='e' ){
       db_multi_exec(
-        "DELETE FROM ftsidx WHERE docid IN"
+        "DELETE FROM ftsidx WHERE rowid IN"
         "    (SELECT rowid FROM ftsdocs WHERE type='%c' AND name=%Q AND idxed)",
         cType, zName
       );
       db_multi_exec(
         "DELETE FROM ftsdocs WHERE type='%c' AND name=%Q AND rid!=%d",
@@ -1659,11 +1659,11 @@
     "   WHERE foci.checkinID=%d AND blob.uuid=foci.uuid"
     "     AND %z",
     ckid, glob_expr("foci.filename", db_get("doc-glob",""))
   );
   db_multi_exec(
-    "DELETE FROM ftsidx WHERE docid IN"
+    "DELETE FROM ftsidx WHERE rowid IN"
     "  (SELECT rowid FROM ftsdocs WHERE type='d'"
     "      AND rid NOT IN (SELECT rid FROM current_docs))"
   );
   db_multi_exec(
     "DELETE FROM ftsdocs WHERE type='d'"
@@ -1678,11 +1678,11 @@
     "         %.17g"
     " FROM current_docs",
     zDocBr, rTime
   );
   db_multi_exec(
-    "INSERT INTO ftsidx(docid,title,body)"
+    "INSERT INTO ftsidx(rowid,title,body)"
     "  SELECT rowid, label, bx FROM ftsdocs WHERE type='d' AND NOT idxed"
   );
   db_multi_exec(
     "UPDATE ftsdocs SET"
     "  idxed=1,"
@@ -1695,11 +1695,11 @@
 /*
 ** Deal with all of the unindexed 'c' terms in FTSDOCS
 */
 static void search_update_checkin_index(void){
   db_multi_exec(
-    "INSERT INTO ftsidx(docid,title,body)"
+    "INSERT INTO ftsidx(rowid,title,body)"
     " SELECT rowid, '', body('c',rid,NULL) FROM ftsdocs"
     "  WHERE type='c' AND NOT idxed;"
   );
   db_multi_exec(
     "UPDATE ftsdocs SET idxed=1, name=NULL,"
@@ -1718,11 +1718,11 @@
 /*
 ** Deal with all of the unindexed 't' terms in FTSDOCS
 */
 static void search_update_ticket_index(void){
   db_multi_exec(
-    "INSERT INTO ftsidx(docid,title,body)"
+    "INSERT INTO ftsidx(rowid,title,body)"
     " SELECT rowid, title('t',rid,NULL), body('t',rid,NULL) FROM ftsdocs"
     "  WHERE type='t' AND NOT idxed;"
   );
   if( db_changes()==0 ) return;
   db_multi_exec(
@@ -1741,11 +1741,11 @@
 /*
 ** Deal with all of the unindexed 'w' terms in FTSDOCS
 */
 static void search_update_wiki_index(void){
   db_multi_exec(
-    "INSERT INTO ftsidx(docid,title,body)"
+    "INSERT INTO ftsidx(rowid,title,body)"
     " SELECT rowid, title('w',rid,NULL),body('w',rid,NULL) FROM ftsdocs"
     "  WHERE type='w' AND NOT idxed;"
   );
   if( db_changes()==0 ) return;
   db_multi_exec(
@@ -1763,11 +1763,11 @@
 /*
 ** Deal with all of the unindexed 'f' terms in FTSDOCS
 */
 static void search_update_forum_index(void){
   db_multi_exec(
-    "INSERT INTO ftsidx(docid,title,body)"
+    "INSERT INTO ftsidx(rowid,title,body)"
     " SELECT rowid, title('f',rid,NULL),body('f',rid,NULL) FROM ftsdocs"
     "  WHERE type='f' AND NOT idxed;"
   );
   if( db_changes()==0 ) return;
   db_multi_exec(
@@ -1786,11 +1786,11 @@
 /*
 ** Deal with all of the unindexed 'e' terms in FTSDOCS
 */
 static void search_update_technote_index(void){
   db_multi_exec(
-    "INSERT INTO ftsidx(docid,title,body)"
+    "INSERT INTO ftsidx(rowid,title,body)"
     " SELECT rowid, title('e',rid,NULL),body('e',rid,NULL) FROM ftsdocs"
     "  WHERE type='e' AND NOT idxed;"
   );
   if( db_changes()==0 ) return;
   db_multi_exec(
@@ -2002,24 +2002,24 @@
       const char *zUrl = db_column_text(&q,4);
       const char *zDocId = db_column_text(&q,0);
       char *zName;
       char *z;
       @ <table border=0>
-      @ <tr><td align='right'>docid:<td>&nbsp;&nbsp;<td>%d(id)
+      @ <tr><td align='right'>rowid:<td>&nbsp;&nbsp;<td>%d(id)
       @ <tr><td align='right'>id:<td><td>%s(zDocId)
       @ <tr><td align='right'>name:<td><td>%h(db_column_text(&q,1))
       @ <tr><td align='right'>idxed:<td><td>%d(db_column_int(&q,2))
       @ <tr><td align='right'>label:<td><td>%h(db_column_text(&q,3))
       @ <tr><td align='right'>url:<td><td>
       @ <a href='%R%s(zUrl)'>%h(zUrl)</a>
       @ <tr><td align='right'>mtime:<td><td>%s(db_column_text(&q,5))
-      z = db_text(0, "SELECT title FROM ftsidx WHERE docid=%d",id);
+      z = db_text(0, "SELECT title FROM ftsidx WHERE rowid=%d",id);
       if( z && z[0] ){
         @ <tr><td align="right">title:<td><td>%h(z)
         fossil_free(z);
       }
-      z = db_text(0, "SELECT body FROM ftsidx WHERE docid=%d",id);
+      z = db_text(0, "SELECT body FROM ftsidx WHERE rowid=%d",id);
       if( z && z[0] ){
         @ <tr><td align="right" valign="top">body:<td><td>%h(z)
         fossil_free(z);
       }
       @ </table>


Index: Fossil/Fossil_src_2022-06-07/extsrc/sqlite3.c
==================================================================
--- Fossil/Fossil_src_2022-06-07/extsrc/sqlite3.c
+++ Fossil/Fossil_src_2022-06-07/extsrc/sqlite3.c
@@ -236463,10 +236463,387 @@
   for(i=0; i<sizeof(azName)/sizeof(azName[0]); i++){
     if( sqlite3_stricmp(zName, azName[i])==0 ) return 1;
   }
   return 0;
 }
+
+#ifdef SQLITE_ENABLE_FTS5
+
+typedef struct Fts5MatchinfoCtx Fts5MatchinfoCtx;
+
+#ifndef SQLITE_AMALGAMATION
+typedef unsigned int u32;
+#endif
+
+struct Fts5MatchinfoCtx {
+  int nCol;                       /* Number of cols in FTS5 table */
+  int nPhrase;                    /* Number of phrases in FTS5 query */
+  char *zArg;                     /* nul-term'd copy of 2nd arg */
+  int nRet;                       /* Number of elements in aRet[] */
+  u32 *aRet;                      /* Array of 32-bit unsigned ints to return */
+};
+
+
+
+/*
+** Return a pointer to the fts5_api pointer for database connection db.
+** If an error occurs, return NULL and leave an error in the database 
+** handle (accessible using sqlite3_errcode()/errmsg()).
+*/
+static int fts5_api_from_db(sqlite3 *db, fts5_api **ppApi){
+  sqlite3_stmt *pStmt = 0;
+  int rc;
+
+  *ppApi = 0;
+  rc = sqlite3_prepare(db, "SELECT fts5(?1)", -1, &pStmt, 0);
+  if( rc==SQLITE_OK ){
+    sqlite3_bind_pointer(pStmt, 1, (void*)ppApi, "fts5_api_ptr", 0);
+    (void)sqlite3_step(pStmt);
+    rc = sqlite3_finalize(pStmt);
+  }
+
+  return rc;
+}
+
+
+/*
+** Argument f should be a flag accepted by matchinfo() (a valid character
+** in the string passed as the second argument). If it is not, -1 is 
+** returned. Otherwise, if f is a valid matchinfo flag, the value returned
+** is the number of 32-bit integers added to the output array if the
+** table has nCol columns and the query nPhrase phrases.
+*/
+static int fts5MatchinfoFlagsize(int nCol, int nPhrase, char f){
+  int ret = -1;
+  switch( f ){
+    case 'p': ret = 1; break;
+    case 'c': ret = 1; break;
+    case 'x': ret = 3 * nCol * nPhrase; break;
+    case 'y': ret = nCol * nPhrase; break;
+    case 'b': ret = ((nCol + 31) / 32) * nPhrase; break;
+    case 'n': ret = 1; break;
+    case 'a': ret = nCol; break;
+    case 'l': ret = nCol; break;
+    case 's': ret = nCol; break;
+  }
+  return ret;
+}
+
+static int fts5MatchinfoIter(
+  const Fts5ExtensionApi *pApi,   /* API offered by current FTS version */
+  Fts5Context *pFts,              /* First arg to pass to pApi functions */
+  Fts5MatchinfoCtx *p,
+  int(*x)(const Fts5ExtensionApi*,Fts5Context*,Fts5MatchinfoCtx*,char,u32*)
+){
+  int i;
+  int n = 0;
+  int rc = SQLITE_OK;
+  char f;
+  for(i=0; (f = p->zArg[i]); i++){
+    rc = x(pApi, pFts, p, f, &p->aRet[n]);
+    if( rc!=SQLITE_OK ) break;
+    n += fts5MatchinfoFlagsize(p->nCol, p->nPhrase, f);
+  }
+  return rc;
+}
+
+static int fts5MatchinfoXCb(
+  const Fts5ExtensionApi *pApi,
+  Fts5Context *pFts,
+  void *pUserData
+){
+  Fts5PhraseIter iter;
+  int iCol, iOff;
+  u32 *aOut = (u32*)pUserData;
+  int iPrev = -1;
+
+  for(pApi->xPhraseFirst(pFts, 0, &iter, &iCol, &iOff); 
+      iCol>=0; 
+      pApi->xPhraseNext(pFts, &iter, &iCol, &iOff)
+  ){
+    aOut[iCol*3+1]++;
+    if( iCol!=iPrev ) aOut[iCol*3 + 2]++;
+    iPrev = iCol;
+  }
+
+  return SQLITE_OK;
+}
+
+static int fts5MatchinfoGlobalCb(
+  const Fts5ExtensionApi *pApi,
+  Fts5Context *pFts,
+  Fts5MatchinfoCtx *p,
+  char f,
+  u32 *aOut
+){
+  int rc = SQLITE_OK;
+  switch( f ){
+    case 'p':
+      aOut[0] = p->nPhrase; 
+      break;
+
+    case 'c':
+      aOut[0] = p->nCol; 
+      break;
+
+    case 'x': {
+      int i;
+      for(i=0; i<p->nPhrase && rc==SQLITE_OK; i++){
+        void *pPtr = (void*)&aOut[i * p->nCol * 3];
+        rc = pApi->xQueryPhrase(pFts, i, pPtr, fts5MatchinfoXCb);
+      }
+      break;
+    }
+
+    case 'n': {
+      sqlite3_int64 nRow;
+      rc = pApi->xRowCount(pFts, &nRow);
+      aOut[0] = (u32)nRow;
+      break;
+    }
+
+    case 'a': {
+      sqlite3_int64 nRow = 0;
+      rc = pApi->xRowCount(pFts, &nRow);
+      if( nRow==0 ){
+        memset(aOut, 0, sizeof(u32) * p->nCol);
+      }else{
+        int i;
+        for(i=0; rc==SQLITE_OK && i<p->nCol; i++){
+          sqlite3_int64 nToken;
+          rc = pApi->xColumnTotalSize(pFts, i, &nToken);
+          if( rc==SQLITE_OK){
+            aOut[i] = (u32)((2*nToken + nRow) / (2*nRow));
+          }
+        }
+      }
+      break;
+    }
+
+  }
+  return rc;
+}
+
+static int fts5MatchinfoLocalCb(
+  const Fts5ExtensionApi *pApi,
+  Fts5Context *pFts,
+  Fts5MatchinfoCtx *p,
+  char f,
+  u32 *aOut
+){
+  int i;
+  int rc = SQLITE_OK;
+
+  switch( f ){
+    case 'b': {
+      int iPhrase;
+      int nInt = ((p->nCol + 31) / 32) * p->nPhrase;
+      for(i=0; i<nInt; i++) aOut[i] = 0;
+
+      for(iPhrase=0; iPhrase<p->nPhrase; iPhrase++){
+        Fts5PhraseIter iter;
+        int iCol;
+        for(pApi->xPhraseFirstColumn(pFts, iPhrase, &iter, &iCol);
+            iCol>=0; 
+            pApi->xPhraseNextColumn(pFts, &iter, &iCol)
+        ){
+          aOut[iPhrase * ((p->nCol+31)/32) + iCol/32] |= ((u32)1 << iCol%32);
+        }
+      }
+
+      break;
+    }
+
+    case 'x':
+    case 'y': {
+      int nMul = (f=='x' ? 3 : 1);
+      int iPhrase;
+
+      for(i=0; i<(p->nCol*p->nPhrase); i++) aOut[i*nMul] = 0;
+
+      for(iPhrase=0; iPhrase<p->nPhrase; iPhrase++){
+        Fts5PhraseIter iter;
+        int iOff, iCol;
+        for(pApi->xPhraseFirst(pFts, iPhrase, &iter, &iCol, &iOff); 
+            iOff>=0; 
+            pApi->xPhraseNext(pFts, &iter, &iCol, &iOff)
+        ){
+          aOut[nMul * (iCol + iPhrase * p->nCol)]++;
+        }
+      }
+
+      break;
+    }
+
+    case 'l': {
+      for(i=0; rc==SQLITE_OK && i<p->nCol; i++){
+        int nToken;
+        rc = pApi->xColumnSize(pFts, i, &nToken);
+        aOut[i] = (u32)nToken;
+      }
+      break;
+    }
+
+    case 's': {
+      int nInst;
+
+      memset(aOut, 0, sizeof(u32) * p->nCol);
+
+      rc = pApi->xInstCount(pFts, &nInst);
+      for(i=0; rc==SQLITE_OK && i<nInst; i++){
+        int iPhrase, iOff, iCol = 0;
+        int iNextPhrase;
+        int iNextOff;
+        u32 nSeq = 1;
+        int j;
+
+        rc = pApi->xInst(pFts, i, &iPhrase, &iCol, &iOff);
+        iNextPhrase = iPhrase+1;
+        iNextOff = iOff+pApi->xPhraseSize(pFts, 0);
+        for(j=i+1; rc==SQLITE_OK && j<nInst; j++){
+          int ip, ic, io;
+          rc = pApi->xInst(pFts, j, &ip, &ic, &io);
+          if( ic!=iCol || io>iNextOff ) break;
+          if( ip==iNextPhrase && io==iNextOff ){
+            nSeq++;
+            iNextPhrase = ip+1;
+            iNextOff = io + pApi->xPhraseSize(pFts, ip);
+          }
+        }
+
+        if( nSeq>aOut[iCol] ) aOut[iCol] = nSeq;
+      }
+
+      break;
+    }
+  }
+  return rc;
+}
+ 
+static Fts5MatchinfoCtx *fts5MatchinfoNew(
+  const Fts5ExtensionApi *pApi,   /* API offered by current FTS version */
+  Fts5Context *pFts,              /* First arg to pass to pApi functions */
+  sqlite3_context *pCtx,          /* Context for returning error message */
+  const char *zArg                /* Matchinfo flag string */
+){
+  Fts5MatchinfoCtx *p;
+  int nCol;
+  int nPhrase;
+  int i;
+  int nInt;
+  sqlite3_int64 nByte;
+  int rc;
+
+  nCol = pApi->xColumnCount(pFts);
+  nPhrase = pApi->xPhraseCount(pFts);
+
+  nInt = 0;
+  for(i=0; zArg[i]; i++){
+    int n = fts5MatchinfoFlagsize(nCol, nPhrase, zArg[i]);
+    if( n<0 ){
+      char *zErr = sqlite3_mprintf("unrecognized matchinfo flag: %c", zArg[i]);
+      sqlite3_result_error(pCtx, zErr, -1);
+      sqlite3_free(zErr);
+      return 0;
+    }
+    nInt += n;
+  }
+
+  nByte = sizeof(Fts5MatchinfoCtx)          /* The struct itself */
+         + sizeof(u32) * nInt               /* The p->aRet[] array */
+         + (i+1);                           /* The p->zArg string */
+  p = (Fts5MatchinfoCtx*)sqlite3_malloc64(nByte);
+  if( p==0 ){
+    sqlite3_result_error_nomem(pCtx);
+    return 0;
+  }
+  memset(p, 0, nByte);
+
+  p->nCol = nCol;
+  p->nPhrase = nPhrase;
+  p->aRet = (u32*)&p[1];
+  p->nRet = nInt;
+  p->zArg = (char*)&p->aRet[nInt];
+  memcpy(p->zArg, zArg, i);
+
+  rc = fts5MatchinfoIter(pApi, pFts, p, fts5MatchinfoGlobalCb);
+  if( rc!=SQLITE_OK ){
+    sqlite3_result_error_code(pCtx, rc);
+    sqlite3_free(p);
+    p = 0;
+  }
+
+  return p;
+}
+
+static void fts5MatchinfoFunc(
+  const Fts5ExtensionApi *pApi,   /* API offered by current FTS version */
+  Fts5Context *pFts,              /* First arg to pass to pApi functions */
+  sqlite3_context *pCtx,          /* Context for returning result/error */
+  int nVal,                       /* Number of values in apVal[] array */
+  sqlite3_value **apVal           /* Array of trailing arguments */
+){
+  const char *zArg;
+  Fts5MatchinfoCtx *p;
+  int rc = SQLITE_OK;
+
+  if( nVal>0 ){
+    zArg = (const char*)sqlite3_value_text(apVal[0]);
+  }else{
+    zArg = "pcx";
+  }
+
+  p = (Fts5MatchinfoCtx*)pApi->xGetAuxdata(pFts, 0);
+  if( p==0 || sqlite3_stricmp(zArg, p->zArg) ){
+    p = fts5MatchinfoNew(pApi, pFts, pCtx, zArg);
+    if( p==0 ){
+      rc = SQLITE_NOMEM;
+    }else{
+      rc = pApi->xSetAuxdata(pFts, p, sqlite3_free);
+    }
+  }
+
+  if( rc==SQLITE_OK ){
+    rc = fts5MatchinfoIter(pApi, pFts, p, fts5MatchinfoLocalCb);
+  }
+  if( rc!=SQLITE_OK ){
+    sqlite3_result_error_code(pCtx, rc);
+  }else{
+    /* No errors has occured, so return a copy of the array of integers. */
+    int nByte = p->nRet * sizeof(u32);
+    sqlite3_result_blob(pCtx, (void*)p->aRet, nByte, SQLITE_TRANSIENT);
+  }
+}
+
+int sqlite3Fts5TestRegisterMatchinfo(sqlite3 *db){
+  int rc;                         /* Return code */
+  fts5_api *pApi;                 /* FTS5 API functions */
+
+  /* Extract the FTS5 API pointer from the database handle. The 
+  ** fts5_api_from_db() function above is copied verbatim from the 
+  ** FTS5 documentation. Refer there for details. */
+  rc = fts5_api_from_db(db, &pApi);
+  if( rc!=SQLITE_OK ) return rc;
+
+  /* If fts5_api_from_db() returns NULL, then either FTS5 is not registered
+  ** with this database handle, or an error (OOM perhaps?) has occurred.
+  **
+  ** Also check that the fts5_api object is version 2 or newer.  
+  */ 
+  if( pApi==0 || pApi->iVersion<2 ){
+    return SQLITE_ERROR;
+  }
+
+  /* Register the implementation of matchinfo() */
+  rc = pApi->xCreateFunction(pApi, "matchinfo", 0, fts5MatchinfoFunc, 0);
+
+  return rc;
+}
+
+#endif /* SQLITE_ENABLE_FTS5 */
+
+
 
 static int fts5Init(sqlite3 *db){
   static const sqlite3_module fts5Mod = {
     /* iVersion      */ 3,
     /* xCreate       */ fts5CreateMethod,
@@ -236527,16 +236904,16 @@
   }
 
   /* If SQLITE_FTS5_ENABLE_TEST_MI is defined, assume that the file
   ** fts5_test_mi.c is compiled and linked into the executable. And call
   ** its entry point to enable the matchinfo() demo.  */
-#ifdef SQLITE_FTS5_ENABLE_TEST_MI
+
   if( rc==SQLITE_OK ){
-    extern int sqlite3Fts5TestRegisterMatchinfo(sqlite3*);
+
     rc = sqlite3Fts5TestRegisterMatchinfo(db);
   }
-#endif
+
 
   return rc;
 }
 
 /*

I want to emphasize here that the implementation of the matchinfo function is carried from the ext/fts5/fts5_test_mi.c file of sqlite.

(8) By Stephan Beal (stephan) on 2022-06-08 08:08:15 in reply to 7 [link] [source]

Index: Fossil/Fossil_src_2022-06-07/src/search.c ... I want to emphasize here that the implementation of the matchinfo function is carried from the ext/fts5/fts5_test_mi.c file of sqlite.

Thank you very much for that! Fossil's FTS bits are still black-box to me, so your patch will provide a great guide to what needs to be updated. i can't guaranty when that will happen but your patch makes it far, far more likely that it will happen "sometime soon."

(9) By Stephan Beal (stephan) on 2023-01-24 02:59:04 in reply to 7 [link] [source]

Index: Fossil/Fossil_src_2022-06-07/src/search.c

This is long overdue, but an initial version of the FTS5 upgrade is at:

src:/timeline?r=search-fts5

it's not yet heavily tested but seems to work.

(10) By Richard Hipp (drh) on 2023-01-24 13:48:22 in reply to 9 [link] [source]

The only issue I have with the search-fts5 branch is that when you go to update your Fossil binary on a server, you have to be sure to click the "Rebuild The Full-Text Index" button, or else nothing will work right.

I wonder is there anything we can do to make that index rebuild automatic?

(11) By Richard Hipp (drh) on 2023-01-24 13:55:17 in reply to 10 [link] [source]

And, if you set up the FTS5 index, then revert your Fossil binary to one that uses FTS4, you get an SQL error on any search request...

(12) By Richard Hipp (drh) on 2023-01-24 13:58:22 in reply to 10 [link] [source]

Is the solution here that we just tell people when upgrading to a newer Fossil (or going back to an older version) that you must run

fossil all fts-config reindex

to get search working again?

(13) By Richard Hipp (drh) on 2023-01-24 14:08:20 in reply to 10 [link] [source]

I think I have a better solution for this. Anybody watching can talk to me about it on chat. I'll update this post once I get a check-in ready.

(14) By Stephan Beal (stephan) on 2023-01-24 21:29:36 in reply to 13 [link] [source]

I think I have a better solution for this.

Update: Richard patched the FTS changes to be able to gracefully handle the transition from FTS4 and FTS5 without requiring explicit actions from users. That version is now running on this site and has been merged into trunk.

(15) By anonymous on 2023-04-17 07:46:43 in reply to 14 [source]

I'm sorry that I was sick earlier, so I'm just now seeing your responses.

It's great to see that you guys have finally completed the migration of Fossil from FTS4 to FTS5. This is so cool!

However you guys seem to have missed one thing.

The reason I brought up Fossil's FTS5 was because I couldn't do a Chinese search in Fossil.

Now that Fossil has migrated to FTS5, there is only Porter Stemmer and no Trigram Splitter.

When Fossil has Trigram Splitter, it will be possible to do Chinese search.

Luckily, SQLite has Trigram Splitter built-in, and Fossil only needs to add a button for this purpose, so that Chinese search can be easily implemented. This is a relatively easy thing to do.

Could you please add this button? Thank you!

Translated with www.DeepL.com/Translator (free version)

(16) By Stephan Beal (stephan) on 2023-04-17 18:36:26 in reply to 15 [link] [source]

Luckily, SQLite has Trigram Splitter built-in, and Fossil only needs to add a button for this purpose, so that Chinese search can be easily implemented. This is a relatively easy thing to do.

There's an initial version of this support in the fts5-trigram branch. It is currently missing the option to set this tokenizer in the web UI but it can be configured from the command-line:

$ fossil fts-config tokenizer trigram -R REPOSITORY_FILE

if you are able to, please try that and see if searching in Chinese then works for you. It's unlikely that any of us will be able to adequately determine whether or not Chinese searching really works.

The trigram config option for the web UI is pending and should be done sometime in the next couple of days.

(17) By anonymous on 2023-04-18 02:15:05 in reply to 16 [link] [source]

This is so cool!

I downloaded the source code for the fts5-trigram branch and compiled it.

Then I ran the command "fossil fts-config tokenizer trigram -R REPOSITORY_FILE", and something magical happened—Fossil created an FTS5 index with trigram as the tokenizer.

After that, I tried Chinese search! It's amazing! The Chinese search succeeded! Thank you for your hard work!

(18) By Stephan Beal (stephan) on 2023-04-18 02:20:38 in reply to 17 [link] [source]

After that, I tried Chinese search! It's amazing! The Chinese search succeeded!

Fantastic, thank you for the confirmation. The missing UI piece will be added in the next day or two.

(19) By Stephan Beal (stephan) on 2023-04-18 17:50:52 in reply to 16 [link] [source]

The trigram config option for the web UI is pending and should be done sometime in the next couple of days.

The /srchsetup page now has a selection list of 3 different tokenizer options: none, Porter stemmer, and trigram. Those changes have been merged to trunk. Please report any problems with it here.

(20) By anonymous on 2023-06-04 04:04:24 in reply to 19 [link] [source]

Hi, stephan

Thanks for this, I tested w64-2.22 with "trigram" and met following error when searching Chinese character/word

Database error: fts5: syntax error near "": {INSERT INTO x(label,url,score,id,date,snip) SELECT ftsdocs.label, ftsdocs.url, rank(matchinfo(ftsidx,'pcsx')), ftsdocs.type || ftsdocs.rid, datetime(ftsdocs.mtime), snippet(ftsidx,-1,'<mark>','</mark>',' ... ',35) FROM ftsidx CROSS JOIN ftsdocs WHERE ftsidx MATCH ' ' AND ftsdocs.rowid=ftsidx.rowid AND (ftsdocs.type='w')}

(21) By Stephan Beal (stephan) on 2023-06-04 09:27:16 in reply to 20 [link] [source]

I tested w64-2.22 with "trigram" and met following error when searching Chinese character/word

Every time you search, regardless of the input, or for a specific input? Nothing in the search/indexing code has changed since the last update, so i'm at a loss to explain this.

Could you post a link to your repository and some Chinese text which should be found but isn't?

(22) By anonymous on 2023-06-04 14:50:49 in reply to 21 [link] [source]

you can download it from https://file.io/DB6fJC8qw1RE

It's very simple, just init then enable search in admin, add two wiki pages

  • test hello world:test hello worldtest hello worldtest hello world

  • test :中文测试

Searched under wiki by English "hello" successfully, but failed by Chinese "中文" with following error

Database error: fts5: syntax error near "": {INSERT INTO x(label,url,score,id,date,snip) SELECT ftsdocs.label, ftsdocs.url, rank(matchinfo(ftsidx,'pcsx')), ftsdocs.type || ftsdocs.rid, datetime(ftsdocs.mtime), snippet(ftsidx,-1,'<mark>','</mark>',' ... ',35) FROM ftsidx CROSS JOIN ftsdocs WHERE ftsidx MATCH ' ' AND ftsdocs.rowid=ftsidx.rowid AND (ftsdocs.type='w')}

(23) By Stephan Beal (stephan) on 2023-06-04 20:12:16 in reply to 22 [link] [source]

It's very simple, just init then enable search in admin, add two wiki pages

Thank you for the reproduce case. i will look into this the next time i have access to AC electricity (sometime in the next couple of days).

(24.2) By Stephan Beal (stephan) on 2023-06-05 10:33:25 edited from 24.1 in reply to 22 [link] [source]

Searched under wiki by English "hello" successfully, but failed by Chinese "中文" with following error

Follow-up:

The error message is an inadvertent side effect of src:196dfedf7fc467cd, which indirectly makes the FTS query invalid. i will need to discuss with Richard the implications of backing out that change.

However... when i locally back out that change then Chinese search "works" again, but not as i would intuitively expect it to.

Searching for an individual glyph does not return any results, but searching for all of them at once does. For example, with this wiki page content:

中文测试

Searching for any subset of those glyphs returns no results, but searching for the whole text works.

That is not a fossil-specific quirk, as the sqlite3 CLI app behaves the same way:

sqlite> create virtual table f using fts5 (txt,tokenize=trigram);
sqlite> insert into f(txt) values('中文测试');
sqlite> select * from f where txt match '中';
sqlite> select * from f where txt match '中文测试';
中文测试

The results are the same when using fossil's built-in copy of sqlite.

In other words, it does not look like searching for Chinese will be useful as-is. We'll need to explore other options for adding proper search support for Chinese. Unfortunately, my FTS5 skills are too low to suggest how we might go about that.

Edit: FTS tables which use the trigram tokenizer can (i just learned) use GLOB and LIKE patterns, and they then produce results where the above test finds none:

select * from f where txt glob '*中*';

but (A) whether that's an appropriate change for other tokenizers is not clear and (B) how to accurately detect whether or not the trigram tokenizer is actually in effect (i.e. by asking FTS instead of relying on the config-level setting (though i guess we can safely rely on that setting for this)), so that we can switch to this syntax for that tokenizer, is not clear.

(25) By anonymous on 2023-06-06 04:30:24 in reply to 24.2 [link] [source]

Thanks for the follow up!

Sorry for my wrong test case of 2 characters, seems trigram requires at least 3 unicode characters.

So if this 2.22 broken error which made FTS query(trigram + Chinese) invalid gets fixed in next release, we can search "中文测" and get result. although it's a limit but works.

I'm also not sure "like/blob" is appropriate here, further more to support for better Chinese search, maybe need better tokenizer such like simple, so we can select it from admin of FTS Tokenizer. I will make a try when get some time later.

(26) By anonymous on 2023-06-14 00:56:24 in reply to 25 [link] [source]

你是中国人吧?那我直接用中文回复了。

人家是三元分词,至少需要三个字符,你搜索的字符串比三个字符还少,怎么可能会成功?

还有,在三元分词里,用三个以上的阿斯克码字符去搜索也是不成功的。

这不是 bug,只是你不清楚的三元分词的特性。


Are you Chinese? Then I'll reply directly in Chinese.

It is a three-part word, which requires at least three characters. Your search string has fewer than three characters, so how could it possibly be successful?

Also, in three-part word segmentation, searching with more than three ASCII characters will also not be successful.

This is not a bug, but a feature of three-part word segmentation that you are not aware of.

(27) By anonymous on 2023-06-14 01:22:59 in reply to 25 [link] [source]

之所以使用 trigram 分词器,是因为 sqlite 自带有这个分词器,可以利用这个分词器轻易的添加有关于中文搜索的支持。

虽然它有一些小小限制,但总体来说是可用的。

而你所说的 simple 分词器,在 fossil 里需要进行额外的工作才能正式引入,这样子代价就太大了。

我更加倾向于在现有的 trigram 分词器上进行小修小补,而不是大张旗鼓的引入 sqlite 不自带的分词器。

因为 fossil 的使用用户主要是英文用户,所以 fossil 的开发者没有必要在支持中文搜索上花费大量的时间。

我仅仅使用 fossil 来记录我的日记, trigram 工作得非常好。我非常满意。


The reason for using the trigram tokenizer is that SQLite comes with this tokenizer built-in, making it easy to add support for Chinese search.

Although it has some minor limitations, overall it is usable.

As for the simple tokenizer you mentioned, it requires extra work to be officially introduced into Fossil, which is too costly.

I am more inclined to make small fixes to the existing trigram tokenizer rather than introducing a non-default tokenizer in SQLite.

Since the main user base of Fossil is English-speaking, the developers do not need to spend a lot of time supporting Chinese search.

Personally, I only use Fossil to record my diary, and trigram works very well. I am very satisfied.

(28) By Eugene (eugene) on 2023-07-09 23:44:35 in reply to 24.2 [link] [source]

Stephan,

Any chance this issue going to be fixed in the next release? This affects not only Chinese, but also Russian (and I assume any language which uses non-english characters). Or is the best approach would be to check out fossil source, back out that particular change and recompile?

(29) By Warren Young (wyoung) on 2023-07-10 05:54:50 in reply to 28 [link] [source]

The best approach is for someone who writes regularly one of these languages to apply their skills to the problem and tell us what is needed to make it work. It's not surprising to run into inadvertent errors when the expectation to make it work is laid on one who speaks and writes only in languages that use the Roman alphabet.1

Why rely on the chance of him nailing these features without knowing the languages involved?


  1. ^ If Stephan speaks and writes more than English and German, I don't know about it.

(30) By Stephan Beal (stephan) on 2023-07-10 06:02:26 in reply to 29 [link] [source]

The best approach is for someone who writes regularly one of these languages to apply their skills to the problem and tell us what is needed to make it work.

Very well put :). At this point my solution attempts are just trial and error, as i don't fully understand the problem or potential solutions, and my development priorities are currently elsewhere. We need someone who understands both of those better to look into it.

If Stephan speaks and writes more than English and German, I don't know about it.

Nor do i!

(31) By Eugene (eugene) on 2023-07-10 13:38:19 in reply to 29 [link] [source]

Warren,

Fair enough. However, in my case, I lack expertise in any FTS-related stuff, fossil internals and C programming in general, I only know that before 2.22 the FTS search was working fine on my repo and now it's broken.

I'll add one more patch (reverting changes Stephan mentioned earlier) to my locally checked out fossil source repo as I recompile latest releases anyway and use custom-built binaries.

(32) By Larry Brasfield (larrybr) on 2023-07-10 14:25:12 in reply to 31 [link] [source]

Fair enough. However, in my case, I lack expertise in any FTS-related stuff, fossil internals and C programming in general ...

As I read Warren's comment, (with a pretty good idea of what he meant to say), he is not suggesting that FTS expertise, Fossil internals familiarity, or C programming skills are needed. (Stephan is already adept in those areas.)

What is needed, which you are better poised to provide, is knowledge of how the FTS subsystem should work from a user's perspective. We can see that you have some expectations that were recently disappointed. What is harder to see is what, exactly, the correct behavior should be. Even if that happens to be how a previous version behaved, it would be helpful if the oddities of Chinese and Russian character set or character conglomeration conventions, insofar as they interact with search, were spelled out. It is a lack of such a clear requirement that most likely led to the recent change which you deem "broken".

(33) By Stephan Beal (stephan) on 2023-07-10 14:33:55 in reply to 32 [link] [source]

(Stephan is already adept in those areas.)

Except for FTS, which is still largely voodoo to me.

It would be useful to know whether Russian search was broken by the change from FTS4 to FTS5 or by the changes made for adding the trigram tokenizer. My suspicion is the former, as the latter adds features without removing any.

(34.1) By Eugene (eugene) on 2023-07-12 22:58:26 edited from 34.0 in reply to 32 [link] [source]

Guys,

I apologize for not making it clear in my first post. Here's the full story:

  • I have a few years old fossil repo which I use for my personal blog/doc/wiki stuff.
  • There's a lot of English/Russian stuff in the repo, I enabled FTS a long time ago as I frequently use AND OR NEAR NOT when searching.
  • Over those years I upgraded fossil several times, never had any issues with FTS.
  • I've recently upgraded to 2.22 and immediately noticed that the search is "broken" (see below).

Here's a script to reproduce the issue (it's a Windows batch stuff, but nothing that can't be easily converted to bash or whatever):


rem Set Fossil version
set FV=21

curl -LO https://fossil-scm.org/home/uv/fossil-w64-2.%FV%.zip
unzip fossil-w64-2.%FV%.zip
.\fossil init test.fossil
.\fossil open test.fossil --workdir test

cd test
..\fossil fts-config enable w
if v%FV%==v22 (..\fossil fts-config tokenizer porter) else (..\fossil fts-config stemmer on)
..\fossil fts-config index on

curl "https://en.wikipedia.org/wiki/Fossil_(software)?action=raw" | ..\fossil wiki create fossil-en
curl "https://ru.wikipedia.org/wiki/Fossil?action=raw" | ..\fossil wiki create fossil-ru

..\fossil ui
rem Now do a search from UI

With 2.21 everything works as expected with Russian words (for example "файл"):

fossil-2-21-fts

With 2.22 the search barfs:

fossil-2-22-fts

English search works fine in both cases. As far as I can tell, both 2.21 and 2.22 use FTS5.

(35) By anonymous on 2023-07-12 18:39:25 in reply to 34.0 [link] [source]

Oooh. That's a much better bug report. Thank you.

(36) By Stephan Beal (stephan) on 2023-07-13 00:33:45 in reply to 34.1 [link] [source]

With 2.22 the search barfs:

That's actually a side-effect of an unrelated change which i meant to ping Richard about a month or so ago and lost track of:

src:196dfedf7fc467cd removes non-ASCII from the search string, replacing it with spaces, resulting in the error your screenshot shows.

Can you locally remove that change (just use the left side of the above diff) and see if that restores your Russian search?

In the mean time i'll see whether we can back out that change without introducing any new issues.

(37) By John Rouillard (rouilj) on 2023-07-13 00:57:23 in reply to 36 [link] [source]

Wouldn't that patch break any foreign language that has its own alphabet? Japanese, Chinese, Cyrillic, Turkish etc...

(38.1) By Stephan Beal (stephan) on 2023-07-13 15:58:41 edited from 38.0 in reply to 37 [link] [source]

Wouldn't that patch break any foreign language that has its own alphabet?

Indeed, it does. i've pinged Richard to ask about the ramifications of reverting it. He never checks in unconsidered code, and his reason for making that change might trump the search breakage. That's still To Be Determined and i'll post any updates on that to this thread (edit: unless Daniel beats me to it ;) - see responses below).

(39) By Daniel Dumitriu (danield) on 2023-07-13 04:42:23 in reply to 37 [link] [source]

Turkish uses the Latin alphabet since 1928 ;)

(40) By Daniel Dumitriu (danield) on 2023-07-13 12:15:07 in reply to 36 [link] [source]

Please try again with the latest trunk and report if the problem is solved and if the FTS search in general works as before/as wanted.

(41) By Eugene (eugene) on 2023-07-13 15:29:16 in reply to 40 [link] [source]

Tried the latest trunk today on Windows - search works perfectly fine with Cyrillic characters, thanks!