MediaWiki extension: SpamBlacklist
----------------------------------

SpamBlacklist is a simple edit filter extension. When someone tries to save a
page, it checks the text against a potentially very large list of "bad"
hostnames. If there is a match, it displays an error message to the user and
refuses to save the page.

To enable it, first download a copy of the SpamBlacklist directory and put it
into your extensions directory. Then put the following at the end of your
LocalSettings.php:

require_once( "$IP/extensions/SpamBlacklist/SpamBlacklist.php" );

The list of bad URLs can be drawn from multiple sources. These sources are
configured with the $wgSpamBlacklistFiles global variable. This global variable
can be set in LocalSettings.php, AFTER including SpamBlacklist.php.

$wgSpamBlacklistFiles is an array, each value containing either a URL, a
filename or a database location. Specifying a database location allows you to
draw the blacklist from a page on your wiki. The format of the database
location specifier is "DB: <db name> <title>".

Example:

require_once( "$IP/extensions/SpamBlacklist/SpamBlacklist.php" );
$wgSpamBlacklistFiles = array(
   "$IP/extensions/SpamBlacklist/wikimedia_blacklist", // Wikimedia's list

   // database title
   "DB: wikidb My_spam_blacklist",
);
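
A URL source is simply fetched over the web. As one plausible form of such an
entry (an illustration, not the only valid form), Wikimedia's list could be
pulled straight from Meta using MediaWiki's standard action=raw interface:

$wgSpamBlacklistFiles = array(
   // Fetch Wikimedia's blacklist as the raw wikitext of the page on Meta
   "http://meta.wikimedia.org/w/index.php?title=Spam_blacklist&action=raw",
);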

The local pages [[MediaWiki:Spam-blacklist]] and [[MediaWiki:Spam-whitelist]]
will always be used, whatever additional files are listed.

Compatibility
-------------

This extension is primarily maintained to run on the latest release version
of MediaWiki (1.22.x as of this writing) and on development versions; however,
the current version should also work on versions as old as 1.21.

If you are using an older version of MediaWiki, you can check out an older
release branch; for example, MediaWiki 1.20 would use the REL1_20 branch.

For even older versions, you may be able to dig a working revision out of the
Git repository, but if you use Wikimedia's blacklist file you will likely run
into failures, since old versions of the code do not cope with the large size
of that blacklist.


File format
-----------

In simple terms:
 * Everything from a "#" character to the end of the line is a comment
 * Every non-blank line is a regex fragment which will only match inside URLs

Internally, a regex is formed which looks like this:

 !http://[a-z0-9\-.]*(line 1|line 2|line 3|....)!Si

A few notes about this format. It's not necessary to add "www." to the start
of hostnames, since the regex is designed to match any subdomain. Don't add
patterns to your file which may run off the end of the URL, e.g. anything
containing ".*". Unlike in some similar systems, the line-end metacharacter
"$" will not assert the end of the hostname; it will assert the end of the
page.
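
For example, a short blacklist file might look like this (the domains are
purely illustrative):

 # anything from "#" to the end of the line is a comment
 cheap-pills\.example\.com          # blocks this host and all its subdomains
 online-casino-[0-9]+\.example\.net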

Performance
-----------

This extension uses a small "loader" file to avoid loading all of the code on
every page view. This means that page view performance will not be affected
even if you are not running a PHP bytecode cache such as Turck MMCache. Note
that a bytecode cache is strongly recommended for any MediaWiki installation.

The regex match itself generally adds an insignificant overhead to page saves,
on the order of 100ms in our experience. However, loading the spam file from
disk or the database, and constructing the regex, may take a significant amount
of time depending on your hardware. If you find that enabling this extension
slows down saves excessively, try installing memcached or another supported
data caching solution. The SpamBlacklist extension will cache the constructed
regex if such a system is present.
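
As a minimal sketch, memcached can be enabled as MediaWiki's main object cache
in LocalSettings.php like this (assuming a memcached daemon is listening on
the standard local port):

$wgMainCacheType = CACHE_MEMCACHED;
$wgMemCachedServers = array( '127.0.0.1:11211' );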

Caching behavior
----------------

Blacklist files loaded from remote web sites are cached locally, in the cache
subsystem used for MediaWiki's localization. (On a default install, this
usually means the objectcache table.)

By default, the list is cached for 15 minutes (if successfully fetched) or
10 minutes (if the network fetch failed), after which it will be fetched again
when next requested. This should be a decent balance between avoiding overly
frequent fetches on a busy site and staying up to date.

Fully-processed blacklist data may be cached in memcached or another shared
memory cache if one has been configured in MediaWiki.

Stability
---------

This extension has not been widely tested outside Wikimedia. Although it has
been in production on Wikimedia websites since December 2004, it should be
considered experimental. Its design is simple, with little input validation, so
unexpected behavior due to incorrect regular expression input or non-standard
configuration is entirely possible.

Obtaining or making blacklists
------------------------------

The primary source for a MediaWiki-compatible blacklist file is the Wikimedia
spam blacklist on Meta:

 http://meta.wikimedia.org/wiki/Spam_blacklist

In the default configuration, the extension loads this list from our site
once every 10-15 minutes.

The Wikimedia spam blacklist can only be edited by trusted administrators.
Wikimedia hosts large, diverse wikis with many thousands of external links;
hence, the Wikimedia blacklist is comparatively conservative in the links it
blocks. You may want to add your own keyword blocks or even ccTLD blocks.
You may suggest modifications to the Wikimedia blacklist at:

 http://meta.wikimedia.org/wiki/Talk:Spam_blacklist

To make maintenance of local lists easier, you may wish to add a DB: source to
$wgSpamBlacklistFiles and maintain a blacklist on a page of your own wiki. If
you do this, it is strongly recommended that you protect the page from general
editing. Besides the obvious danger that someone may add a regex that matches
everything, note that an attacker with the ability to input arbitrary regular
expressions may be able to generate segfaults in the PCRE library.
Whitelisting
------------

You may sometimes find that a site listed in a centrally-maintained blacklist
contains something you nonetheless want to link to.

A local whitelist can be maintained by creating a [[MediaWiki:Spam-whitelist]]
page and listing hostnames in it, using the same format as the blacklists.
URLs matching the whitelist will be ignored locally.
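
For example, if a shared blacklist blocks all of example.com but you trust one
of its hosts, [[MediaWiki:Spam-whitelist]] could contain (a hypothetical entry):

 # same syntax as the blacklist; matching URLs are allowed again
 docs\.example\.com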

Logging
-------

To aid in tracking which domains are being spammed, this extension has
multiple logging features. By default, hits are recorded in the standard
debug log (controlled by $wgDebugLogFile); you can grep for 'SpamBlacklistHit',
which includes the IP address of the user and the URL they tried to submit.
This file is only available to people with server access and includes private
information.

You can also enable logging to [[Special:Log]] by setting $wgLogSpamBlacklistHits
to true. This will record the account which tripped the blacklist, the title of
the page the edit was attempted on, and the specific URL. By default this log is
viewable only by wiki administrators; you can grant other groups access by
giving them the "spamblacklistlog" permission.

Copyright
---------
This extension and this documentation were written by Tim Starling (with later
contributions by others) and are available under GPLv2 or any later version.