Fossil

Check-in [16fde3ff]

Overview
Comment: Add output buffering to the (non-legacy) comment printing algorithm, to reduce calls to fossil_print(). The resulting performance improvement can be up to factor 10, with a perceptible difference even for short comments (measured and tested on Windows with MSVC builds, and on Ubuntu with GCC builds). (For comparison: for the legacy comment printing algorithm, the extra UTF-8 checks added by this branch impair performance by 0.12-1.8%, depending on whether the input contains predominantly multi-byte vs. ASCII-only sequences.)
SHA1: 16fde3ff666cf0733102f7a061756c718597a299
User & Date: florian 2018-11-15 12:43:00.000
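
The buffering idea in the check-in comment can be illustrated with a minimal sketch. It is not the code from this check-in: `sink_print()` is a hypothetical stand-in for fossil_print() that only counts calls and collects output, and `OutBuf`, `outbuf_putc()`, `outbuf_flush()`, and `print_buffered()` are names invented for this illustration. The buffer is deliberately tiny (16 bytes) so that flushes are easy to observe.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for fossil_print(): collects output, counts calls. */
static int g_printCalls = 0;
static char g_sink[1024];
static size_t g_sinkLen = 0;

static void sink_print(const char *z){
  size_t n = strlen(z);
  assert( g_sinkLen+n < sizeof(g_sink) );
  memcpy(g_sink+g_sinkLen, z, n+1);   /* Copy including the NUL. */
  g_sinkLen += n;
  g_printCalls++;
}

/* Illustrative output buffer; 16 bytes hold 15 characters plus a NUL. */
typedef struct OutBuf {
  char z[16];
  size_t n;      /* Number of bytes currently buffered. */
} OutBuf;

/* Terminate the buffered bytes and hand them to the sink in one call. */
static void outbuf_flush(OutBuf *p){
  if( p->n>0 ){
    p->z[p->n] = 0;
    p->n = 0;
    sink_print(p->z);
  }
}

/* Append one byte, flushing first if only the NUL slot is left. */
static void outbuf_putc(OutBuf *p, char c){
  if( p->n+1>=sizeof(p->z) ) outbuf_flush(p);
  p->z[p->n++] = c;
}

/* Print a string through the buffer instead of one call per character. */
static void print_buffered(const char *zText){
  OutBuf b;
  b.n = 0;
  while( *zText ) outbuf_putc(&b, *zText++);
  outbuf_flush(&b);
}
```

With a 36-character input, this sketch reaches the sink in three calls rather than 36. In a real formatter the flush threshold has to leave room for the largest single write, any trailing newline, and the terminating NUL.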
Context
2018-11-15
15:16
Fix a problem with initial indent introduced by the previous check-in, so that all regression tests from test/comment.test now succeed. Also eliminate three more calls to fossil_print(). Regarding performance, the legacy comment printing algorithm is outnumbered by factor 2-3, with these changes. ... (check-in: b029ed22 user: florian tags: comment-formatter-utf8)
12:43
Add output buffering to the (non-legacy) comment printing algorithm, to reduce calls to fossil_print(). The resulting performance improvement can be up to factor 10, with a perceptible difference even for short comments (measured and tested on Windows with MSVC builds, and on Ubuntu with GCC builds). (For comparison: for the legacy comment printing algorithm, the extra UTF-8 checks added by this branch impair performance by 0.12-1.8%, depending on whether the input contains predominantly multi-byte vs. ASCII-only sequences.) ... (check-in: 16fde3ff user: florian tags: comment-formatter-utf8)
2018-10-17
14:16
Modify the comment formatter to avoid output of incomplete UTF-8 sequences, and to avoid line breaks inside UTF-8 sequences. See https://fossil-scm.org/forum/forumpost/1247e4a3c4 for detailed information and tests. ... (check-in: 1bbca2c3 user: florian tags: comment-formatter-utf8)
Changes
Changes to src/comformat.c.
Old version (lines 178-224):
  int trimSpace,         /* [in] Non-zero to trim leading/trailing spaces. */
  int wordBreak,         /* [in] Non-zero to try breaking on word boundaries. */
  int origBreak,         /* [in] Non-zero to break before original comment. */
  int *pLineCnt,         /* [in/out] Pointer to the total line count. */
  const char **pzLine    /* [out] Pointer to the end of the logical line. */
){
  int index = 0, charCnt = 0, lineCnt = 0, maxChars;

  if( !zLine ) return;
  if( lineChars<=0 ) return;
  comment_print_indent(zLine, indent, trimCrLf, trimSpace, &index);
  maxChars = lineChars;
  for(;;){
    int useChars = 1;
    char c = zLine[index];
    if( c==0 ){
      break;
    }else{
      if( origBreak && index>0 ){
        const char *zCurrent = &zLine[index];
        if( comment_check_orig(zOrigText, zCurrent, &charCnt, &lineCnt) ){
          comment_print_indent(zCurrent, origIndent, trimCrLf, trimSpace,
                               &index);
          maxChars = lineChars;
        }
      }
      index++;
    }
    if( c=='\n' ){
      lineCnt++;
      charCnt = 0;
      useChars = 0;
    }else if( c=='\t' ){
      int nextIndex = comment_next_space(zLine, index);
      if( nextIndex<=0 || (nextIndex-index)>maxChars ){
        break;
      }
      charCnt++;
      useChars = COMMENT_TAB_WIDTH;
      if( maxChars<useChars ){
        fossil_print(" ");
        break;
      }
    }else if( wordBreak && fossil_isspace(c) ){
      int nextIndex = comment_next_space(zLine, index);
      if( nextIndex<=0 || (nextIndex-index)>maxChars ){
        break;
      }
New version (lines 178-238):
  int trimSpace,         /* [in] Non-zero to trim leading/trailing spaces. */
  int wordBreak,         /* [in] Non-zero to try breaking on word boundaries. */
  int origBreak,         /* [in] Non-zero to break before original comment. */
  int *pLineCnt,         /* [in/out] Pointer to the total line count. */
  const char **pzLine    /* [out] Pointer to the end of the logical line. */
){
  int index = 0, charCnt = 0, lineCnt = 0, maxChars;
  char zBuf[400]; int iBuf=0; /* Output buffer and counter. */
  if( !zLine ) return;
  if( lineChars<=0 ) return;
  comment_print_indent(zLine, indent, trimCrLf, trimSpace, &index);
  maxChars = lineChars;
  for(;;){
    int useChars = 1;
    char c = zLine[index];
    /* Flush the output buffer if there's no space left for at least one more
    ** (potentially 4-byte) UTF-8 sequence and a terminating NULL. */
    if ( iBuf>sizeof(zBuf)-5 ){
      zBuf[iBuf]=0;
      iBuf=0;
      fossil_print("%s", zBuf);
    }
    if( c==0 ){
      break;
    }else{
      if( origBreak && index>0 ){
        const char *zCurrent = &zLine[index];
        if( comment_check_orig(zOrigText, zCurrent, &charCnt, &lineCnt) ){
          /* Flush the output buffer before printing the indentation. */
          if ( iBuf>0 ){
            zBuf[iBuf]=0;
            iBuf=0;
            fossil_print("%s", zBuf);
          }
          comment_print_indent(zCurrent, origIndent, trimCrLf, trimSpace,
                               &index);
          maxChars = lineChars;
        }
      }
      index++;
    }
    if( c=='\n' ){
      lineCnt++;
      charCnt = 0;
      useChars = 0;
    }else if( c=='\t' ){
      int nextIndex = comment_next_space(zLine, index);
      if( nextIndex<=0 || (nextIndex-index)>maxChars ){
        break;
      }
      charCnt++;
      useChars = COMMENT_TAB_WIDTH;
      if( maxChars<useChars ){
        zBuf[iBuf++] = ' ';
        break;
      }
    }else if( wordBreak && fossil_isspace(c) ){
      int nextIndex = comment_next_space(zLine, index);
      if( nextIndex<=0 || (nextIndex-index)>maxChars ){
        break;
      }
Old version (lines 232-268):
    ** inside UTF-8 sequences. Incomplete, ill-formed and overlong sequences are
    ** kept together. The invalid lead bytes 0xC0 to 0xC1 and 0xF5 to 0xF7 are
    ** allowed to initiate (ill-formed) 2- and 4-byte sequences, respectively,
    ** the other invalid lead bytes 0xF8 to 0xFF are treated as invalid 1-byte
    ** sequences (as lone trail bytes).
    */
    if( (c&0xc0)==0xc0 && zLine[index]!=0 ){  /* Any UTF-8 lead byte 11xxxxxx */
      char zUTF8[5]; /* Buffer to hold a UTF-8 sequence. */
      int cchUTF8=1; /* Code units consumed. */
      int maxUTF8=1; /* Expected sequence length. */
      zUTF8[0]=c;
      if( (c&0xe0)==0xc0 )maxUTF8=2;          /* UTF-8 lead byte 110vvvvv */
      else if( (c&0xf0)==0xe0 )maxUTF8=3;     /* UTF-8 lead byte 1110vvvv */
      else if( (c&0xf8)==0xf0 )maxUTF8=4;     /* UTF-8 lead byte 11110vvv */
      while( cchUTF8<maxUTF8 &&
              (zLine[index]&0xc0)==0x80 ){    /* UTF-8 trail byte 10vvvvvv */
        zUTF8[cchUTF8++] = zLine[index++];
      }
      zUTF8[cchUTF8]=0;
      fossil_print("%s", zUTF8);
    }
    else
      fossil_print("%c", c);
    if( (c&0x80)==0 || (zLine[index+1]&0xc0)!=0xc0 ) maxChars -= useChars;
    if( maxChars<=0 ) break;
    if( c=='\n' ) break;
  }
  if( charCnt>0 ){
    fossil_print("\n");
    lineCnt++;
  }
  if( pLineCnt ){
    *pLineCnt += lineCnt;
  }
  if( pzLine ){
    *pzLine = zLine + index;
  }
New version (lines 246-286):
    ** inside UTF-8 sequences. Incomplete, ill-formed and overlong sequences are
    ** kept together. The invalid lead bytes 0xC0 to 0xC1 and 0xF5 to 0xF7 are
    ** allowed to initiate (ill-formed) 2- and 4-byte sequences, respectively,
    ** the other invalid lead bytes 0xF8 to 0xFF are treated as invalid 1-byte
    ** sequences (as lone trail bytes).
    */
    if( (c&0xc0)==0xc0 && zLine[index]!=0 ){  /* Any UTF-8 lead byte 11xxxxxx */
      int cchUTF8=1; /* Code units consumed. */
      int maxUTF8=1; /* Expected sequence length. */
      zBuf[iBuf++]=c;
      if( (c&0xe0)==0xc0 )maxUTF8=2;          /* UTF-8 lead byte 110vvvvv */
      else if( (c&0xf0)==0xe0 )maxUTF8=3;     /* UTF-8 lead byte 1110vvvv */
      else if( (c&0xf8)==0xf0 )maxUTF8=4;     /* UTF-8 lead byte 11110vvv */
      while( cchUTF8<maxUTF8 &&
              (zLine[index]&0xc0)==0x80 ){    /* UTF-8 trail byte 10vvvvvv */
        cchUTF8++;
        zBuf[iBuf++] = zLine[index++];
      }
    }
    else
      zBuf[iBuf++] = c;
    if( (c&0x80)==0 || (zLine[index+1]&0xc0)!=0xc0 ) maxChars -= useChars;
    if( maxChars<=0 ) break;
    if( c=='\n' ) break;
  }
  if( charCnt>0 ){
    zBuf[iBuf++] = '\n';
    lineCnt++;
  }
  /* Flush the remaining output buffer. */
  if ( iBuf>0 ) {
    zBuf[iBuf]=0;
    iBuf=0;
    fossil_print("%s", zBuf);
  }
  if( pLineCnt ){
    *pLineCnt += lineCnt;
  }
  if( pzLine ){
    *pzLine = zLine + index;
  }
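
The grouping rule described in the code comment (lead bytes 110xxxxx, 1110xxxx, and 11110xxx start 2-, 3-, and 4-byte sequences, trail bytes have the form 10xxxxxx, and incomplete sequences are kept together) can be sketched as a standalone helper. This is an illustration under simplifying assumptions, not the check-in's code: `utf8_expected_len()` and `utf8_unit_len()` are hypothetical names, and the sketch only reports how many bytes belong together instead of copying them to an output buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Expected sequence length, judged from the first byte alone. Bytes that
** match no multi-byte lead pattern (ASCII, lone trail bytes, 0xF8 to 0xFF)
** count as 1-byte units. */
static int utf8_expected_len(unsigned char c){
  if( (c&0xe0)==0xc0 ) return 2;   /* Lead byte 110vvvvv */
  if( (c&0xf0)==0xe0 ) return 3;   /* Lead byte 1110vvvv */
  if( (c&0xf8)==0xf0 ) return 4;   /* Lead byte 11110vvv */
  return 1;
}

/* Number of bytes starting at z that should not be split by a line break.
** Trail bytes are consumed only up to the expected length, so a truncated
** sequence yields a shorter unit. The terminating NUL is never consumed,
** because it does not match the trail byte pattern 10vvvvvv. */
static int utf8_unit_len(const char *z){
  unsigned char c = (unsigned char)z[0];
  int max, n = 1;
  assert( c!=0 );                  /* Caller must not pass an empty string. */
  max = utf8_expected_len(c);
  while( n<max && (((unsigned char)z[n])&0xc0)==0x80 ) n++;
  return n;
}
```

A line-wrapping loop could advance by `utf8_unit_len()` bytes at a time, so that a break is only ever considered between units. Like the rule in the comment, the sketch does not validate the sequences; overlong forms such as 0xC0 0x80 are still treated as one 2-byte unit.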