Fossil

Check-in [2e1775e2]
Login

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
Comment:Moved the documentation for Fossil's grep implementation out of src/regexp.c into a new document with greatly-expanded content, www/grep.md, which is now referenced both from the source code and in the output for "fossil help grep".
Downloads: Tarball | ZIP archive
Timelines: family | ancestors | descendants | both | trunk
Files: files | file ages | folders
SHA3-256: 2e1775e23af45c4b108ef52b8ef7ed31d6d351dea33e54d73238ed44cf38fdbf
User & Date: wyoung 2018-10-03 20:09:59.591
Context
2018-10-03
20:11
URL fix ... (check-in: 186748ff user: wyoung tags: trunk)
20:09
Moved the documentation for Fossil's grep implementation out of src/regexp.c into a new document with greatly-expanded content, www/grep.md, which is now referenced both from the source code and in the output for "fossil help grep". ... (check-in: 2e1775e2 user: wyoung tags: trunk)
20:09
Added an entry for `fossil grep` to the www/changes.wiki file in the 2.7 section. Somehow this addition wasn't noted prior to the release. ... (check-in: 9e1faa02 user: wyoung tags: trunk)
Changes
Unified Diff Ignore Whitespace Patch
Changes to src/regexp.c.
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
**   http://www.hwaci.com/drh/
**
*******************************************************************************
**
** This file was adapted from the test_regexp.c file in SQLite3.  That
** file is in the public domain.
**
** The code in this file implements a compact but reasonably
** efficient regular-expression matcher for posix extended regular
** expressions against UTF8 text.  The following syntax is supported:
**
**     X*      zero or more occurrences of X
**     X+      one or more occurrences of X
**     X?      zero or one occurrences of X
**     X{p,q}  between p and q occurrences of X
**     (X)     match X
**     X|Y     X or Y
**     ^X      X occurring at the beginning of the string
**     X$      X occurring at the end of the string
**     .       Match any single character
**     \c      Character c where c is one of \{}()[]|*+?.
**     \c      C-language escapes for c in afnrtv.  ex: \t or \n
**     \uXXXX  Where XXXX is exactly 4 hex digits, unicode value XXXX
**     \xXX    Where XX is exactly 2 hex digits, unicode value XX
**     [abc]   Any single character from the set abc
**     [^abc]  Any single character not in the set abc
**     [a-z]   Any single character in the range a-z
**     [^a-z]  Any single character not in the range a-z
**     \b      Word boundary
**     \w      Word character.  [A-Za-z0-9_]
**     \W      Non-word character
**     \d      Digit
**     \D      Non-digit
**     \s      Whitespace character
**     \S      Non-whitespace character
**
** A nondeterministic finite automaton (NFA) is used for matching, so the
** performance is bounded by O(N*M) where N is the size of the regular
** expression and M is the size of the input string.  The matcher never
** exhibits exponential behavior.  Note that the X{p,q} operator expands
** to p copies of X following by q-p copies of X? and that the size of the
** regular expression in the O(N*M) performance bound is computed after
** this expansion.
*/
#include "config.h"
#include "regexp.h"

/* The end-of-input character */
#define RE_EOF            0    /* End of input */








|
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<







14
15
16
17
18
19
20
21



































22
23
24
25
26
27
28
**   http://www.hwaci.com/drh/
**
*******************************************************************************
**
** This file was adapted from the test_regexp.c file in SQLite3.  That
** file is in the public domain.
**
** See ../www/grep.md for details of the algorithm and RE dialect.



































*/
#include "config.h"
#include "regexp.h"

/* The end-of-input character */
#define RE_EOF            0    /* End of input */

825
826
827
828
829
830
831

832

833
834
835
836
837
838
839
840
841
842
843
844
}

/*
** COMMAND: grep
**
** Usage: %fossil grep [OPTIONS] PATTERN FILENAME
**

** Run grep over all historic version of FILENAME

**
** Options:
**
**     -i|--ignore-case         Ignore case
**     -l|--files-with-matches  Print only filenames that match
**     -v|--verbose             Show each file as it is analyzed
*/
void re_grep_cmd(void){
  u32 flags = 0;
  int bVerbose = 0;
  ReCompiled *pRe;
  const char *zErr;







>
|
>




|







790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
}

/*
** COMMAND: grep
**
** Usage: %fossil grep [OPTIONS] PATTERN FILENAME
**
** Attempt to match the given POSIX extended regular expression PATTERN
** over all historic versions of FILENAME.  For details of the supported
** RE dialect, see https://fossil-scm.org/doc/trunk/www/grep.md.
**
** Options:
**
**     -i|--ignore-case         Ignore case
**     -l|--files-with-matches  List only checkin ID for versions that match
**     -v|--verbose             Show each file as it is analyzed
*/
void re_grep_cmd(void){
  u32 flags = 0;
  int bVerbose = 0;
  ReCompiled *pRe;
  const char *zErr;
Added www/grep.md.






































































































































































































































>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
# Fossil's Internal 'grep' Command

As of Fossil 2.7, there is a `grep` command which acts something like
POSIX's `grep -E` over all historical versions of a single file name.
This document explains the commonalities and divergences between POSIX
`grep` and Fossil `grep`.


## Options

Fossil `grep` supports only a small subset of the options specified for
POSIX `grep`:

| Option | Meaning
|--------|-------------------------------------------------------------
| `-i`   | ignore case in matches
| `-l`   | list a checkin ID prefix for matching historical versions of the file
| `-v`   | print each checkin ID considered, regardless of whether it matches

No equivalent of other POSIX `grep` options currently exist. Patches to
remove those limitations will be thoughtfully considered.


## Regular Expression Dialect

Fossil contains a built-in regular expression engine implementing a
subset of the [POSIX extended regular expression][ere] dialect:

[ere]: https://en.wikipedia.org/wiki/Regular_expression#POSIX_extended

| Atom    | Meaning
|---------|-------------------------------------------------------------
| `X*`    | zero or more occurrences of X
| `X+`    | one or more occurrences of X
| `X?`    | zero or one occurrences of X
| `X{p,q}`| between p and q occurrences of X, inclusive
| `(X)`   | match X
| <tt>X\|Y</tt>| X or Y
| `^X`    | X occurring at the beginning of a line
| `X$`    | X occurring at the end of a line
| `.`     | Match any single character
| `\c`    | Character `c` where `c` is one of <tt>{}()[]\|\*+?.</tt>
| `\c`    | C-language escapes for `c` in `afnrtv`.  ex: `\t` or `\n`
| `\uXXXX`| Where XXXX is exactly 4 hex digits, Unicode value XXXX
| `\xXX`  | Where XX is exactly 2 hex digits, Unicode value XX
| `[abc]` | Any single character from the set `abc`
| `[^abc]`| Any single character not in the set `abc`
| `[a-z]` | Any single character in the range `a-z`
| `[^a-z]`| Any single character not in the range `a-z`
| `\b`    | Word boundary
| `\w`    | Word character: `[A-Za-z0-9_]`
| `\W`    | Non-word character
| `\d`    | Digit
| `\D`    | Non-digit
| `\s`    | Whitespace character
| `\S`    | Non-whitespace character

There are several restrictions in Fossil `grep` relative to a fully
POSIX compatible regular expression engine. Among them are:

*   There is currently no support for POSIX character classes such as
    `[:lower:]`.

*   Fossil does not currently attempt to take your operating system's
    locale settings into account when doing this match.  Fossil also
    currently has no way to mark a given file as having a particular
    encoding.

    Instead, Fossil `grep` assumes the input files are in UTF-8 format,
    so it will not work correctly if the files in your repository are in
    an encoding that is not backwards-compatible with ASCII, such as
    UTF-16.  Partially compatible encodings such as ISO 8859 should work
    with Fossil `grep` as long as you stick to their ASCII-compatible
    subset.

The Fossil `grep` language is not a strict subset of POSIX extended
regular expressions.  Some of the features documented above are
well-understood extensions to it, such as the "word" features `\b`, `\w`
and `\W`.

Fossil `grep` is based on the Unicode engine from [SQLite's FTS5
feature][fts5].  This means it does do things like Unicode-aware case
folding. Therefore, it is usually a user error to attempt to substitute
`[a-z]` for a lack of the POSIX character class `[:lower:]` if you are
grepping over pretty much any human written language other than English.
Use `fossil grep -i` instead, which does Unicode case folding.

[fts5]: https://www.sqlite.org/fts5.html


## Algorithm Details

Fossil `grep` uses a [nondeterministic finite automaton][nfa] for
matching, so the performance is bounded by ***O(nm)*** where ***n*** is
the size of the regular expression and ***m*** is the size of the input
string.

[nfa]: https://en.wikipedia.org/wiki/Nondeterministic_finite_automaton

In order to avoid [ReDoS attacks][redos], the Fossil regular expression
matcher was purposely written to avoid [implementation choices][rei]
which have the potential to require exponential evaluation time. This
constrains the possible feature set we can support in the Fossil `grep`
dialect. For instance, we are unlikely to ever add support for
[backtracking][bt].

[redos]: https://en.wikipedia.org/wiki/ReDoS
[rei]:   https://en.wikipedia.org/wiki/Regular_expression#Implementations_and_running_times
[bt]:    https://en.wikipedia.org/wiki/Backtracking

The `X{p,q}` operator expands to `p` copies of `X` followed by `q-p`
copies of `X?` before RE evaluation. The ***O(nm)*** performance bound
above remains true for this case, but realize that it applies to the RE
*after* this expansion, not to the form as given by the user.  In other
words, as `q-p` increases, so does the RE evaluation time.