Check-in [c9082b29]

Overview
Comment: Update the defense-against-robots documentation to align with current behavior.
Downloads: Tarball | ZIP archive | SQL archive
Timelines: family | ancestors | descendants | both | trunk
Files: files | file ages | folders
SHA3-256: c9082b29711617c0ca50634dd8acc7bfb58255d26b88b3a1d62e3fee5c8e7305
User & Date: drh 2022-02-12 20:53:11
Context
2022-02-13
19:16
Back out check-in [5bb921dd0893a548] which was wrong - the REQUEST_URI CGI parameter should include the query string. Improve the CGI variable documentation in comments. Improve robustness to malformed CGI variables. ... (check-in: e514eeea user: drh tags: trunk)
00:26
Back out [5bb921dd0893a548]. It turns out that REQUEST_URI should have the query string appended. Make other changes to cgi.c to bring it into "compliance". "Compliance" is in quotes because rfc3875 does not define REQUEST_URI. That variable really exists only by convention. But Apache and Nginx both append the query string, so we should too. ... (check-in: fd1c9b09 user: drh tags: cgi-compliance)
2022-02-12
20:53
Update the defense-against-robots documentation to align with current behavior. ... (check-in: c9082b29 user: drh tags: trunk)
20:30
Enhancement to robot defense. The auto-hyperlink setting can now be 2 (UserAgent only) in which case the UserAgent string is consulted and hyperlinks are generated if and only if the UserAgent looks human. Javascript does not come into play. When auto-hyperlink is 1, the traditional Javascript changes to href= in anchor tags are still used. ... (check-in: df337eb6 user: drh tags: trunk)
Changes

Changes to www/antibot.wiki.

Old version:

<title>Defense Against Spiders</title>

The website presented by a Fossil server has many hyperlinks.
Even a modest project can have millions of pages in its
tree, and many of those pages (for example diffs and annotations
and ZIP archives of older check-ins) can be expensive to compute.
If a spider or bot tries to walk a website implemented by
Fossil, it can present a crippling bandwidth and CPU load.

The website presented by a Fossil server is intended to be used
interactively by humans, not walked by spiders.  This article
describes the techniques used by Fossil to try to welcome human
users while keeping out spiders.

<h2>The Hyperlink User Capability</h2>

Every Fossil web session has a "user".  For random passers-by on the internet
(and for spiders) that user is "nobody".  The "anonymous" user is also
available for humans who do not wish to identify themselves.  The difference
is that "anonymous" requires a login (using a password supplied via
a CAPTCHA) whereas "nobody" does not require a login.
The site administrator can also create logins with
passwords for specific individuals.

Users without the <b>[./caps/ref.html#h | Hyperlink]</b> capability
do not see most Fossil-generated hyperlinks. This is
a simple defense against spiders, since [./caps/#ucat | the "nobody"
user category] does not have this capability by default.
Users must log in (perhaps as
"anonymous") before they can see any of the hyperlinks.  A spider
that cannot log into your Fossil repository will be unable to walk
its historical check-ins, create diffs between versions, pull zip
archives, etc. by visiting links, because they aren't there.

A text message appears at the top of each page in this situation to
invite humans to log in as anonymous in order to activate hyperlinks.

Because this required login step is annoying to some,
Fossil provides other techniques for blocking spiders which
are less cumbersome to humans.

<h2>Automatic Hyperlinks Based on UserAgent</h2>

Fossil has the ability to selectively enable hyperlinks for users
that lack the <b>Hyperlink</b> capability based on their UserAgent string in the
HTTP request header and on the browser's ability to run Javascript.

The UserAgent string is a text identifier that is included in the header
of most HTTP requests that identifies the specific maker and version of
the browser (or spider) that generated the request.  Typical UserAgent
strings look like this:

<ul>
<li> Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/19.0
<li> Mozilla/4.0 (compatible; MSIE 8.0; Windows_NT 5.1; Trident/4.0)
<li> Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
<li> Wget/1.12 (openbsd4.9)
</ul>

The first two UserAgent strings above identify Firefox 19 and
Internet Explorer 8.0, both running on Windows NT.  The third
example is the spider used by Google to index the internet.
The fourth example is the "wget" utility running on OpenBSD.
Thus the first two UserAgent strings above identify the requester
as human whereas the second two identify the requester as a spider.
Note that the UserAgent string is completely under the control
of the requester and so a malicious spider can forge a UserAgent
string that makes it look like a human.  But most spiders truly
seem to desire to "play nicely" on the internet and are quite open
about the fact that they are a spider.  And so the UserAgent string
provides a good first-guess about whether or not a request originates
from a human or a spider.

In Fossil, under the Admin/Access menu, there is a setting entitled
"<b>Enable hyperlinks for "nobody" based on User-Agent and Javascript</b>".

If this setting is enabled, and if the UserAgent string looks like a
human and not a spider, then Fossil will enable hyperlinks even if
the <b>Hyperlink</b> capability is omitted from the user permissions.  This setting
gives humans easy access to the hyperlinks while preventing spiders
from walking the millions of pages on a typical Fossil site.

But the hyperlinks are not enabled directly with the setting above.
Instead, the HTML code that is generated contains anchor tags ("&lt;a&gt;")
without "href=" attributes.  Then, JavaScript code is added to the
end of the page that goes back and fills in the "href=" attributes of
the anchor tags with the hyperlink targets, thus enabling the hyperlinks.
This extra step of using JavaScript to enable the hyperlink targets
is a security measure against spiders that forge a human-looking
UserAgent string.  Most spiders do not bother to run JavaScript and
so to the spider the empty anchor tag will be useless.  But all modern
web browsers implement JavaScript, so hyperlinks will show up
normally for human users.
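
As a rough sketch of this pattern (not Fossil's actual script; the
"data-href" attribute below is a hypothetical stand-in for however the
generated page carries the real targets), the late activation step could
look like this:

<verbatim>
// Illustrative sketch only -- not Fossil's actual script.
// Assumption: anchors are emitted without "href=" and carry their real
// target in a hypothetical "data-href" attribute.
function enableHyperlinks(): void {
  document.querySelectorAll<HTMLAnchorElement>("a[data-href]").forEach((a) => {
    const target = a.getAttribute("data-href");
    if (target) a.setAttribute("href", target);  // the hyperlink becomes live
  });
}

// Run from a script added at the end of the page.
enableHyperlinks();
</verbatim>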

<h2>Further Defenses</h2>

Recently (as of this writing, in the spring of 2013) the Fossil server
on the SQLite website ([http://www.sqlite.org/src/]) has been hit repeatedly
by Chinese spiders that use forged UserAgent strings to make them look
like normal web browsers and which interpret JavaScript.  We do not
believe these attacks to be nefarious since SQLite is public domain
and the attackers could obtain all information they ever wanted to
know about SQLite simply by cloning the repository.  Instead, we
believe these "attacks" are coming from "script kiddies".  But regardless
of whether or not malice is involved, these attacks do present
an unnecessary load on the server which reduces the responsiveness of
the SQLite website for well-behaved and socially responsible users.
For this reason, additional defenses against
spiders have been put in place.

On the Admin/Access page of Fossil, just below the
"<b>Enable hyperlinks for "nobody" based on User-Agent and Javascript</b>"
setting, there are now two additional sub-settings that can be optionally
enabled to control hyperlinks.

The first sub-setting waits to run the
JavaScript that sets the "href=" attributes on anchor tags until after
at least one "mouseover" event has been detected on the &lt;body&gt;
element of the page.  The thinking here is that spiders will not be
simulating mouse motion and so no mouseover events will ever occur and
hence the hyperlinks will never become enabled for spiders.

The second new sub-setting is a delay (in milliseconds) before setting
the "href=" attributes on anchor tags.  The default value for this
delay is 10 milliseconds.  The idea here is that a spider will try to
render the page immediately, and will not wait for delayed scripts
to be run, thus will never enable the hyperlinks.

These two sub-settings can be used separately or together.  If used together,
then the delay timer does not start until after the first mouse movement
is detected.

See also [./loadmgmt.md|Managing Server Load] for a description
of how expensive pages can be disabled when the server is under heavy
load.

<h2>The Ongoing Struggle</h2>

Fossil currently does a very good job of providing easy access to humans
while keeping out troublesome robots and spiders.  However, spiders and
bots continue to grow more sophisticated, requiring ever more advanced
defenses.  This "arms race" is unlikely to ever end.  The developers of
Fossil will continue to try to improve the spider defenses of Fossil, so
check back from time to time for the latest releases and updates.

Readers of this page who have suggestions on how to improve the spider
defenses in Fossil are invited to submit their ideas to the Fossil Users
forum:
[https://fossil-scm.org/forum].

New version:

<title>Defense Against Robots</title>


A typical Fossil website can have billions of pages, and many of
those pages (for example diffs and annotations and tarballs) can
be expensive to compute.
If a robot walks a Fossil-generated website,
it can present a crippling bandwidth and CPU load.

A Fossil website is intended to be used
interactively by humans, not walked by robots.  This article
describes the techniques used by Fossil to try to welcome human
users while keeping out robots.

<h2>The Hyperlink User Capability</h2>

Every Fossil web session has a "user".  For random passers-by on the internet
(and for robots) that user is "nobody".  The "anonymous" user is also
available for humans who do not wish to identify themselves.  The difference
is that "anonymous" requires a login (using a password supplied via
a CAPTCHA) whereas "nobody" does not require a login.
The site administrator can also create logins with
passwords for specific individuals.

Users without the <b>[./caps/ref.html#h | Hyperlink]</b> capability
do not see most Fossil-generated hyperlinks. This is
a simple defense against robots, since [./caps/#ucat | the "nobody"
user category] does not have this capability by default.
Users must log in (perhaps as
"anonymous") before they can see any of the hyperlinks.  A robot
that cannot log into your Fossil repository will be unable to walk
its historical check-ins, create diffs between versions, pull zip
archives, etc. by visiting links, because there are no links.

A text message appears at the top of each page in this situation to
invite humans to log in as anonymous in order to activate hyperlinks.

But requiring a login, even an anonymous login, can be annoying.
Fossil provides other techniques for blocking robots which
are less cumbersome to humans.

<h2>Automatic Hyperlinks Based on UserAgent</h2>

Fossil has the ability to selectively enable hyperlinks for users
that lack the <b>Hyperlink</b> capability based on their UserAgent string in the
HTTP request header and on the browser's ability to run Javascript.

The UserAgent string is a text identifier that is included in the header
of most HTTP requests that identifies the specific maker and version of
the browser (or robot) that generated the request.  Typical UserAgent
strings look like this:

<ul>
<li> Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/19.0
<li> Mozilla/4.0 (compatible; MSIE 8.0; Windows_NT 5.1; Trident/4.0)
<li> Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
<li> Wget/1.12 (openbsd4.9)
</ul>

The first two UserAgent strings above identify Firefox 19 and
Internet Explorer 8.0, both running on Windows NT.  The third
example is the robot used by Google to index the internet.
The fourth example is the "wget" utility running on OpenBSD.
Thus the first two UserAgent strings above identify the requester
as human whereas the second two identify the requester as a robot.
Note that the UserAgent string is completely under the control
of the requester and so a malicious robot can forge a UserAgent
string that makes it look like a human.  But most robots want
to "play nicely" on the internet and are quite open
about the fact that they are a robot.  And so the UserAgent string
provides a good first-guess about whether or not a request originates
from a human or a robot.
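
As a rough illustration (this is not Fossil's real heuristic, and the marker
list below is invented for the example), a first-guess classifier of this kind
can be as simple as scanning the UserAgent string for a few well-known robot
markers:

<verbatim>
// Illustrative sketch only -- not Fossil's real heuristic.
// The marker list is invented for the example.
function looksLikeRobot(userAgent: string): boolean {
  const robotMarkers = ["bot", "spider", "crawl", "wget", "curl"];
  const ua = userAgent.toLowerCase();
  return robotMarkers.some((marker) => ua.includes(marker));
}

// The Googlebot and Wget strings above would be classified as robots,
// the Firefox and MSIE strings as humans.
looksLikeRobot("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"); // true
looksLikeRobot("Mozilla/5.0 (Windows NT 6.1; rv:19.0) Gecko/20100101 Firefox/19.0");        // false
</verbatim>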

In Fossil, under the Admin/Robot-Defense menu, there is a setting entitled
"<b>Enable hyperlinks based on User-Agent and/or Javascript</b>".
If this setting is set to "UserAgent only" or "UserAgent and Javascript",
and if the UserAgent string looks like a human and not a robot, then
Fossil will enable hyperlinks even if the <b>Hyperlink</b> capability
is omitted from the user permissions.  This setting gives humans easy
access to the hyperlinks while preventing robots
from walking the billions of pages on a typical Fossil site.

If the setting is "UserAgent only", then the hyperlinks are simply
enabled and that is all.  But if the setting is "UserAgent and Javascript",
then the hyperlinks are not enabled directly.
Instead, the HTML code that is generated contains anchor tags ("&lt;a&gt;")
with "href=" attributes that point to [/honeypot] rather than the correct
link.  JavaScript code is added to the end of the page that goes back and
fills in the correct "href=" attributes of
the anchor tags with the true hyperlink targets, thus enabling the hyperlinks.
This extra step of using JavaScript to enable the hyperlink targets
is a security measure against robots that forge a human-looking
UserAgent string.  Most robots do not bother to run JavaScript and
so to the robot every link leads only to the honeypot.  But all modern
web browsers implement JavaScript, so hyperlinks will show up
normally for human users.
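
To make the mechanism concrete, here is a minimal sketch of the general
pattern (not Fossil's actual markup or script; the "data-href" attribute is a
hypothetical stand-in for however the real target is carried):

<verbatim>
// Illustrative sketch only -- not Fossil's actual markup or script.
// Assumption: each anchor initially points at the honeypot and carries
// its real target in a hypothetical "data-href" attribute, e.g.
//   <a href="/honeypot" data-href="/timeline?r=trunk">trunk</a>
function swapHoneypotLinks(): void {
  const anchors =
    document.querySelectorAll<HTMLAnchorElement>('a[href="/honeypot"]');
  anchors.forEach((a) => {
    const real = a.getAttribute("data-href");
    if (real) a.setAttribute("href", real);  // link now goes to its true target
  });
}

// Run from a script placed at the end of the page.
swapHoneypotLinks();
</verbatim>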

<h2>Further Defenses</h2>

Recently (as of this writing, in the spring of 2013) the Fossil server
on the SQLite website ([http://www.sqlite.org/src/]) has been hit repeatedly
by Chinese robots that use forged UserAgent strings to make them look
like normal web browsers and which interpret JavaScript.  We do not
believe these attacks to be nefarious since SQLite is public domain
and the attackers could obtain all information they ever wanted to
know about SQLite simply by cloning the repository.  Instead, we
believe these "attacks" are coming from "script kiddies".  But regardless
of whether or not malice is involved, these attacks do present
an unnecessary load on the server which reduces the responsiveness of
the SQLite website for well-behaved and socially responsible users.
For this reason, additional defenses against
robots have been put in place.

On the Admin/Robot-Defense page of Fossil, just below the
"<b>Enable hyperlinks using User-Agent and/or Javascript</b>"
setting, there are now two additional sub-settings that can be optionally
enabled to control hyperlinks.

The first new sub-setting is a delay (in milliseconds) before setting
the "href=" attributes on anchor tags.  The default value for this
delay is 10 milliseconds.  The idea here is that a robot will try to
interpret the links on the page immediately, and will not wait for delayed
scripts to be run, and thus will never enable the true links.

The second sub-setting waits to run the
JavaScript that sets the "href=" attributes on anchor tags until after
at least one "mousedown" or "mousemove" event has been detected on the
&lt;body&gt; element of the page.  The thinking here is that robots will not be
simulating mouse motion and so no mouse events will ever occur and
hence the hyperlinks will never become enabled for robots.
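
The sketch below shows one way the two sub-settings might work together on
the client side (again, not Fossil's actual script; the "data-href" attribute
and the helper names are invented for the illustration):

<verbatim>
// Illustrative sketch only -- not Fossil's actual script.
// First wait for a real mouse event, then wait the configured delay
// (10 ms by default, per the text above) before revealing the links.
const HREF_DELAY_MS = 10;

function setRealHrefs(): void {
  document.querySelectorAll<HTMLAnchorElement>("a[data-href]").forEach((a) => {
    const real = a.getAttribute("data-href");  // hypothetical attribute
    if (real) a.setAttribute("href", real);
  });
}

function armLinkActivation(): void {
  const arm = () => setTimeout(setRealHrefs, HREF_DELAY_MS);
  // Robots rarely synthesize mouse activity, so gate on the first event.
  document.body.addEventListener("mousedown", arm, { once: true });
  document.body.addEventListener("mousemove", arm, { once: true });
}

armLinkActivation();
</verbatim>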

See also [./loadmgmt.md|Managing Server Load] for a description
of how expensive pages can be disabled when the server is under heavy
load.

<h2>The Ongoing Struggle</h2>

Fossil currently does a very good job of providing easy access to humans
while keeping out troublesome robots.  However, robots
continue to grow more sophisticated, requiring ever more advanced
defenses.  This "arms race" is unlikely to ever end.  The developers of
Fossil will continue to try to improve the robot defenses of Fossil, so
check back from time to time for the latest releases and updates.

Readers of this page who have suggestions on how to improve the robot
defenses in Fossil are invited to submit their ideas to the Fossil Users
forum:
[https://fossil-scm.org/forum].

Changes to www/mkindex.tcl.

Old version:

set doclist {
  aboutcgi.wiki {How CGI Works In Fossil}
  aboutdownload.wiki {How The Download Page Works}
  adding_code.wiki {Adding New Features To Fossil}
  adding_code.wiki {Hacking Fossil}
  alerts.md {Email Alerts And Notifications}
  antibot.wiki {Defense against Spiders and Bots}
  backoffice.md {The "Backoffice" mechanism of Fossil}
  backup.md {Backing Up a Remote Fossil Repository}
  blame.wiki {The Annotate/Blame Algorithm Of Fossil}
  blockchain.md {Is Fossil A Blockchain?}
  branching.wiki {Branching, Forking, Merging, and Tagging}
  bugtheory.wiki {Bug Tracking In Fossil}
  build.wiki {Compiling and Installing Fossil}

New version:

set doclist {
  aboutcgi.wiki {How CGI Works In Fossil}
  aboutdownload.wiki {How The Download Page Works}
  adding_code.wiki {Adding New Features To Fossil}
  adding_code.wiki {Hacking Fossil}
  alerts.md {Email Alerts And Notifications}
  antibot.wiki {Defense against Spiders and Robots}
  backoffice.md {The "Backoffice" mechanism of Fossil}
  backup.md {Backing Up a Remote Fossil Repository}
  blame.wiki {The Annotate/Blame Algorithm Of Fossil}
  blockchain.md {Is Fossil A Blockchain?}
  branching.wiki {Branching, Forking, Merging, and Tagging}
  bugtheory.wiki {Bug Tracking In Fossil}
  build.wiki {Compiling and Installing Fossil}