Many hyperlinks are disabled.
Use anonymous login
to enable hyperlinks.
Overview
Comment: | Moved the documentation for Fossil's grep implementation out of src/regexp.c into a new document with greatly-expanded content, www/grep.md, which is now referenced both from the source code and in the output for "fossil help grep". |
---|---|
Downloads: | Tarball | ZIP archive |
Timelines: | family | ancestors | descendants | both | trunk |
Files: | files | file ages | folders |
SHA3-256: |
2e1775e23af45c4b108ef52b8ef7ed31 |
User & Date: | wyoung 2018-10-03 20:09:59.591 |
Context
2018-10-03
| ||
20:11 | URL fix ... (check-in: 186748ff user: wyoung tags: trunk) | |
20:09 | Moved the documentation for Fossil's grep implementation out of src/regexp.c into a new document with greatly-expanded content, www/grep.md, which is now referenced both from the source code and in the output for "fossil help grep". ... (check-in: 2e1775e2 user: wyoung tags: trunk) | |
20:09 | Added an entry for `fossil grep` to the www/changes.wiki file in the 2.7 section. Somehow this addition wasn't noted prior to the release. ... (check-in: 9e1faa02 user: wyoung tags: trunk) | |
Changes
Changes to src/regexp.c.
︙ | ︙ | |||
14 15 16 17 18 19 20 | ** http://www.hwaci.com/drh/ ** ******************************************************************************* ** ** This file was adapted from the test_regexp.c file in SQLite3. That ** file is in the public domain. ** | | < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < < | 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 | ** http://www.hwaci.com/drh/ ** ******************************************************************************* ** ** This file was adapted from the test_regexp.c file in SQLite3. That ** file is in the public domain. ** ** See ../www/grep.md for details of the algorithm and RE dialect. */ #include "config.h" #include "regexp.h" /* The end-of-input character */ #define RE_EOF 0 /* End of input */ |
︙ | ︙ | |||
825 826 827 828 829 830 831 | } /* ** COMMAND: grep ** ** Usage: %fossil grep [OPTIONS] PATTERN FILENAME ** | > | > | | 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 | } /* ** COMMAND: grep ** ** Usage: %fossil grep [OPTIONS] PATTERN FILENAME ** ** Attempt to match the given POSIX extended regular expression PATTERN ** over all historic versions of FILENAME. For details of the supported ** RE dialect, see https://fossil-scm.org/doc/trunk/www/grep.md. ** ** Options: ** ** -i|--ignore-case Ignore case ** -l|--files-with-matches List only checkin ID for versions that match ** -v|--verbose Show each file as it is analyzed */ void re_grep_cmd(void){ u32 flags = 0; int bVerbose = 0; ReCompiled *pRe; const char *zErr; |
︙ | ︙ |
Added www/grep.md.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 | # Fossil's Internal 'grep' Command As of Fossil 2.7, there is a `grep` command which acts something like POSIX's `grep -E` over all historical versions of a single file name. This document explains the commonalities and divergences between POSIX `grep` and Fossil `grep`. ## Options Fossil `grep` supports only a small subset of the options specified for POSIX `grep`: | Option | Meaning |--------|------------------------------------------------------------- | `-i` | ignore case in matches | `-l` | list a checkin ID prefix for matching historical versions of the file | `-v` | print each checkin ID considered, regardless of whether it matches No equivalent of other POSIX `grep` options currently exist. Patches to remove those limitations will be thoughtfully considered. ## Regular Expression Dialect Fossil contains a built-in regular expression engine implementing a subset of the [POSIX extended regular expression][ere] dialect: [ere]: https://en.wikipedia.org/wiki/Regular_expression#POSIX_extended | Atom | Meaning |---------|------------------------------------------------------------- | `X*` | zero or more occurrences of X | `X+` | one or more occurrences of X | `X?` | zero or one occurrences of X | `X{p,q}`| between p and q occurrences of X, inclusive | `(X)` | match X | <tt>X\|Y</tt>| X or Y | `^X` | X occurring at the beginning of a line | `X$` | X occurring at the end of a line | `.` | Match any single character | `\c` | Character `c` where `c` is one of <tt>{}()[]\|\*+?.</tt> | `\c` | C-language escapes for `c` in `afnrtv`. ex: `\t` or `\n` | `\uXXXX`| Where XXXX is exactly 4 hex digits, Unicode value XXXX | `\xXX` | Where XX is exactly 2 hex digits, Unicode value XX | `[abc]` | Any single character from the set `abc` | `[^abc]`| Any single character not in the set `abc` | `[a-z]` | Any single character in the range `a-z` | `[^a-z]`| Any single character not in the range `a-z` | `\b` | Word boundary | `\w` | Word character: `[A-Za-z0-9_]` | `\W` | Non-word character | `\d` | Digit | `\D` | Non-digit | `\s` | Whitespace character | `\S` | Non-whitespace character There are several restrictions in Fossil `grep` relative to a fully POSIX compatible regular expression engine. Among them are: * There is currently no support for POSIX character classes such as `[:lower:]`. * Fossil does not currently attempt to take your operating system's locale settings into account when doing this match. Fossil also currently has no way to mark a given file as having a particular encoding. Instead, Fossil `grep` assumes the input files are in UTF-8 format, so it will not work correctly if the files in your repository are in an encoding that is not backwards-compatible with ASCII, such as UTF-16. Partially compatible encodings such as ISO 8859 should work with Fossil `grep` as long as you stick to their ASCII-compatible subset. The Fossil `grep` language is not a strict subset of POSIX extended regular expressions. Some of the features documented above are well-understood extensions to it, such as the "word" features `\b`, `\w` and `\W`. Fossil `grep` is based on the Unicode engine from [SQLite's FTS5 feature][fts5]. This means it does do things like Unicode-aware case folding. Therefore, it is usually a user error to attempt to substitute `[a-z]` for a lack of the POSIX character class `[:lower:]` if you are grepping over pretty much any human written language other than English. Use `fossil grep -i` instead, which does Unicode case folding. [fts5]: https://www.sqlite.org/fts5.html ## Algorithm Details Fossil `grep` uses a [nondeterministic finite automaton][nfa] for matching, so the performance is bounded by ***O(nm)*** where ***n*** is the size of the regular expression and ***m*** is the size of the input string. [nfa]: https://en.wikipedia.org/wiki/Nondeterministic_finite_automaton In order to avoid [ReDoS attacks][redos], the Fossil regular expression matcher was purposely written to avoid [implementation choices][rei] which have the potential to require exponential evaluation time. This constrains the possible feature set we can support in the Fossil `grep` dialect. For instance, we are unlikely to ever add support for [backtracking][bt]. [redos]: https://en.wikipedia.org/wiki/ReDoS [rei]: https://en.wikipedia.org/wiki/Regular_expression#Implementations_and_running_times [bt]: https://en.wikipedia.org/wiki/Backtracking The `X{p,q}` operator expands to `p` copies of `X` followed by `q-p` copies of `X?` before RE evaluation. The ***O(nm)*** performance bound above remains true for this case, but realize that it applies to the RE *after* this expansion, not to the form as given by the user. In other words, as `q-p` increases, so does the RE evaluation time. |