After returning from vacation, I found my blog overrun by spammers who had managed to bypass the CAPTCHA on the site and post more than 400 comments. That is a notable amount, considering there are only 186 legitimate comments on this site at the time of writing. I was quite frustrated by this, so I decided to study the matter a little more, and since it took me quite a while, I figured I could document the results of my research and progress here.

First, prerequisites

Just to make things clear - I intend for this blog to allow anonymous users to post content without having to register. Providing an email address is nice, but not necessary. This is related to my strong views on freedom of speech: while I reserve the right to censor speech on this site, I have never done so, and if I ever do, I will have a damn good reason.

Blocking people outright should be considered as censorship and avoided at all costs.

I am also hesitant to use online services that may track my users, but more on that below.

The bottom line: my current policy

Right now, I am using the Riddler CAPTCHA, the honeypot module and the Spam module, but I am still tweaking this, and I am also considering the Bad behaviour and/or the Block Anonymous Links modules. We'll see!

See below for details of those modules.

The traditional approach: CAPTCHA

For those of you not familiar with what a CAPTCHA is, Wikipedia has a pretty good description. Drupal supports this through the CAPTCHA contrib module, which has existed for more than 7 years now and is one of the most popular tools on Drupal.org for that kind of thing. The problem is: it can be broken, and proving themselves is hard and annoying for users. It's a very fine balance between a challenge that is too difficult and one that is too easy. Basically, you're punishing your faithful users, which discourages participation.

Basic configuration

Still, the CAPTCHA module is a good approach. With a little configuration and tweaking, you can start keeping most spam out and still allow anonymous comments. You will at least want to enable the CAPTCHA on user registration and comment forms. You can then give registered users permission to bypass the CAPTCHA forms (since they passed it to register). If illegitimate users succeed in creating accounts anyway, you can always delete their accounts with a module like Ban & Unpublish, which I haven't tried (since most of my spam is from anonymous users).

As for the actual CAPTCHA, you will probably want to start with an image CAPTCHA, since this is the one bundled with the module, and it requires more than trivial thinking from the bots (hint: computers are pretty good at solving arithmetic problems, so the Math CAPTCHA is easy to break).
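To illustrate why arithmetic challenges offer so little protection, here is a minimal Python sketch of how a bot could answer one. The question format ("3 + 4 =") and the function name `solve_math_challenge` are assumptions made for illustration; this is not the actual markup produced by the CAPTCHA module.

```python
import re

# Assumed question format: "<int> <op> <int>", e.g. "3 + 4 =".
_QUESTION = re.compile(r"(\d+)\s*([+\-*])\s*(\d+)")


def solve_math_challenge(question: str) -> int:
    """Return the answer to a simple two-operand arithmetic question."""
    match = _QUESTION.search(question)
    if match is None:
        raise ValueError(f"unrecognized challenge: {question!r}")
    left, op, right = int(match.group(1)), match.group(2), int(match.group(3))
    if op == "+":
        return left + right
    if op == "-":
        return left - right
    return left * right


print(solve_math_challenge("3 + 4 ="))
```

A dozen lines and one regular expression are enough, which is why I would not rely on this kind of challenge alone.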

Also, the CAPTCHA after module can be useful to present a CAPTCHA to the user only after X submissions, since spammers often send massive loads of comments in one shot, which is uncommon for regular users. This may alleviate the burden of CAPTCHA for regular users. I haven't tried this system yet however, as it means some spam can still get through before the CAPTCHA kicks in.
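The idea of challenging only after X submissions can be sketched roughly as follows. This is not how the CAPTCHA after module is implemented; the class name, the threshold and the notion of a `client_id` (an IP address or session key, say) are all assumptions.

```python
class SubmissionCounter:
    """Count submissions per client and decide when to show a CAPTCHA."""

    def __init__(self, free_submissions: int = 3) -> None:
        # Number of submissions a client may make before being challenged.
        self.free_submissions = free_submissions
        self._counts: dict[str, int] = {}

    def record(self, client_id: str) -> None:
        """Register one more submission for this client."""
        self._counts[client_id] = self._counts.get(client_id, 0) + 1

    def needs_captcha(self, client_id: str) -> bool:
        """True once the client has used up its free submissions."""
        return self._counts.get(client_id, 0) >= self.free_submissions


counter = SubmissionCounter(free_submissions=2)
counter.record("203.0.113.7")
counter.record("203.0.113.7")
print(counter.needs_captcha("203.0.113.7"))
```

Note that the first few submissions from each client are never challenged, which is exactly the weakness mentioned above.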

But maybe the above will not work for you and the image CAPTCHA will be broken. You can then try to make the challenge harder by adding noise and so on, but this will only make the life of your users more miserable. There are alternatives! There's a whole myriad of CAPTCHA contrib modules (sorted by number of installs), of which I will outline a few approaches here.

More CAPTCHAs!

A common module people install is the CAPTCHA pack module, which adds a few weapons to your arsenal:

more configurable math challenges
text-based captchas (pick the wrong word, spelling checks)
CSS captchas
ASCII art
randomizer (picks any of the configured CAPTCHA types randomly)

In my experience, most of those are broken by the bots that visit my site, but they may be useful to you.

Trick question challenges

The most interesting ones I could find were the ones like the riddler, which allows you to define questions and answers that are specific to your site, and for which automated spam bots will probably not have the answer. Finding those golden Q&As is the key here, and may prove difficult. The riddler module depends on the bigger CAPTCHA module, so you may want to look at the trick question module, which is independent.
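Here is a minimal sketch of what checking such a site-specific question could look like. The question, the accepted answers and the function names are made up for illustration and do not reflect how the riddler module stores or compares answers.

```python
import unicodedata

# Hypothetical site-specific riddle with several accepted answers,
# stored in normalized (lowercase, single-spaced) form.
RIDDLES = {
    "What colour is the sky on a clear day?": {"blue", "bleu"},
}


def normalize(answer: str) -> str:
    """Fold case, unify Unicode forms and collapse whitespace."""
    text = unicodedata.normalize("NFKC", answer)
    return " ".join(text.casefold().split())


def check_answer(question: str, answer: str) -> bool:
    """True if the answer matches one of the accepted answers."""
    accepted = RIDDLES.get(question, set())
    return normalize(answer) in accepted


print(check_answer("What colour is the sky on a clear day?", "  Blue "))
```

Being lenient about case and whitespace, and accepting answers in more than one language, keeps the challenge from punishing legitimate users over formatting.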

There is also the Egglue.com online service, which provides Q&As that (mostly) make sense to humans, but are hard for computers to answer. Unfortunately, that service is in English only (in case you haven't been paying attention, this is a bilingual space) and it makes this site dependent on a centralized service, which I dislike. I ran it for a few days, and it seemed effective at blocking spammers, but I find it is not intuitive enough for non-native speakers, so I have since switched to Riddler. I have opened a feature request regarding the translation bit, but I am worried this will never be fixed, as the projects on drupal.org and upstream haven't seen any update in two years.

Online services

There is a whole slew of online services that help you protect your site. The most popular for Drupal are Mollom and reCAPTCHA. The former relies on Mollom.com, a startup from Dries, the founder and leader of the Drupal project, and actually works very well. The latter relies on Google's reCAPTCHA service (yes, they own it now), which I suspect helps them finish their book scanning project. Both of those services suffer from the same problem: users are forced to connect to a third party server, which can then track them. This is especially a problem with Google, considering how much their business model is that YOU are the product.

Nevertheless, both products work fairly well, and Mollom has the added advantage that it presents the users with a CAPTCHA only if Mollom determines (through various heuristics) that your message may be spam. In other words, most users often don't get a CAPTCHA. This comes at a price though - your user content may be sent to Mollom.com for analysis!

There are other online services, including a very popular one (Akismet), which is now implemented by the Antispam module. I've already mentioned egglue (which doesn't send your data to the central server though) and I'll also pimp Spambot, which sports an online database of known bots, and doesn't require sending your users' content to the central server, although your users can probably be fingerprinted there. I have also found the following services in the list, but haven't tried any: textcaptcha.com, keycaptcha.com (monetizing captcha, who would have thought...), confident CAPTCHA (images selection) and Vidoop CAPTCHA.

Other non-CAPTCHA approaches

So all of the above ends up showing the users a CAPTCHA. As mentioned, there are problems with this: accessibility (blind users can't see image CAPTCHAs, for example), user annoyance (don't punish your users), and the fact that CAPTCHAs can be broken anyway. So I have looked at other solutions here.

The good old spam module

One very important module is the Spam module. About as old as the CAPTCHA module (but less popular), it takes a different approach and allows for plugins to parse the content and evaluate its "spamicity". It features a Bayesian filter (which you can train so that it learns from its mistakes), a URL filter (both limiting the number of URLs and allowing you to block certain URLs outright), the SURBL online service (basically a blacklist of URLs), a node age filter (that marks comments on older posts as more spammy) and a duplicate filter (that marks duplicate content as spam).
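For readers who have never seen one, a trainable Bayesian filter can be sketched in a few lines of Python. This is a textbook naive Bayes classifier with add-one smoothing, not the Spam module's code; the class name, the tokenizer and the training sentences are all assumptions.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"[a-z0-9']+")


def tokenize(text: str) -> list[str]:
    return TOKEN.findall(text.lower())


class NaiveBayesFilter:
    """Tiny naive Bayes spam filter with add-one smoothing."""

    def __init__(self) -> None:
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text: str, label: str) -> None:
        """Learn from a message known to be 'spam' or 'ham'."""
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def spam_probability(self, text: str) -> float:
        """Estimate the probability that the text is spam."""
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in ("spam", "ham"):
            # Smoothed class prior, then smoothed per-token likelihoods.
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            total_words = sum(self.word_counts[label].values())
            denominator = total_words + len(vocabulary) + 1
            for token in tokenize(text):
                score += math.log((self.word_counts[label][token] + 1) / denominator)
            log_scores[label] = score
        # Clamp the difference so math.exp cannot overflow on long texts.
        diff = max(min(log_scores["ham"] - log_scores["spam"], 700.0), -700.0)
        return 1.0 / (1.0 + math.exp(diff))


bayes = NaiveBayesFilter()
bayes.train("cheap pills buy now", "spam")
bayes.train("thanks for the great article", "ham")
print(bayes.spam_probability("buy cheap pills") > 0.5)
```

Training on a misclassified comment, with the correct label, is how such a filter "learns from its mistakes".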

Of those, the Duplicate and Node age filters are the most important, and will remove a significant amount of spam from your site, while keeping user disruption to a minimum. The other filters are also important and I enabled them on this site (including the URL blacklist filter). You will probably want to read the spam module handbook to better understand how to configure the module, especially the filter gain, which is important for the Bayesian filter.
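To give an idea of how cheap the duplicate, node age and URL-count checks are, here is a rough sketch that combines them into one score. The weights, thresholds and function names are arbitrary assumptions and do not correspond to the Spam module's actual scoring or to its filter gain setting.

```python
import hashlib
import re
from datetime import datetime, timedelta

URL_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)


def fingerprint(body: str) -> str:
    """Hash a comment body after collapsing case and whitespace."""
    canonical = " ".join(body.lower().split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def spam_score(body: str, node_created: datetime, now: datetime, seen: set,
               max_urls: int = 2, old_after: timedelta = timedelta(days=90)) -> int:
    """Return a score from 0 to 100; higher means more likely spam."""
    score = 0
    digest = fingerprint(body)
    if digest in seen:                      # duplicate filter
        score += 50
    seen.add(digest)
    if now - node_created > old_after:      # node age filter
        score += 25
    if len(URL_PATTERN.findall(body)) > max_urls:   # URL count limit
        score += 25
    return score


seen_comments: set = set()
now = datetime(2011, 10, 1)
print(spam_score("Nice post!", datetime(2011, 9, 25), now, seen_comments))
```

A comment scoring above some threshold could then be held for review rather than rejected outright.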

In general, the spam module will help you get rid of a significant amount of spam on your site, and may be the only thing you need. The really good thing about the module is that it allows spam to be queued for review instead of simply being denied, which keeps your site accessible to everyone.

Hidden fields

Another trick used in antispam modules is to add hidden fields to the form, which bots will happily fill in, while regular users will skip them. This allows your site to easily detect bots and block them right there. The downside of that approach is that bots can learn about those fields and skip them. Often this also forces your users to use javascript or cookies to submit content, which you may want to avoid. Here are the modules I have found (but not tried yet):

Spamicide and Hidden CAPTCHA: very similar, both well maintained and popular, but the latter depends on the CAPTCHA module while the former is standalone
honeypot: also enforces a delay before posting, but has an all-or-nothing approach (can't target only comment forms - see this silly issue)
un.captcha.lous: also enforces javascript
BOTCHA: mostly hidden fields, but quite customizable
GOTCHA: old module (5.x), unsupported
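The hidden field trick, combined with a minimum delay before posting, can be sketched as follows. The field name, the five second delay and the function name are assumptions for illustration, not what any of the modules above actually use.

```python
import time
from typing import Mapping, Optional

HONEYPOT_FIELD = "homepage_url"  # hypothetical field, hidden from humans via CSS
MIN_SECONDS = 5.0                # assumed minimum time a human needs to write a comment


def looks_like_bot(form: Mapping[str, str], rendered_at: float,
                   submitted_at: Optional[float] = None) -> bool:
    """Flag a submission that filled the hidden field or arrived too fast."""
    if submitted_at is None:
        submitted_at = time.time()
    # Humans never see the honeypot field, so any value in it is suspicious.
    if form.get(HONEYPOT_FIELD, "").strip():
        return True
    # A form submitted almost instantly was probably not typed by a person.
    return (submitted_at - rendered_at) < MIN_SECONDS


print(looks_like_bot({"comment": "hi", "homepage_url": "http://spam.example"}, 0.0, 60.0))
```

In a real deployment the render timestamp would have to be signed or kept server-side, otherwise a bot could simply forge it.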

Other approaches

Let's finish by throwing in a couple of other good approaches. The Wikipedia article on spam in blogs details a good list of potential solutions, which we could summarize as follows:

rate-limiting
keyword blocking - use the spam module for this
rel=nofollow - controversial, may not keep spam from being posted, no easy solution in Drupal for comments
CAPTCHA
block links altogether - use Block anonymous links module for this
redirect page for links - "you will be redirected to blablabla", may not keep spam from being posted, no easy solution in Drupal for comments
distributed approaches - see "Online services" above
RSS feed monitoring - monitor your site's activity through RSS to detect spam runs that require moderation, see the Admin RSS module for this
Response tokens - built into drupal
AJAX - force comments to be submitted or validated through Javascript, see JS Validate forms or un.captcha.lous
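Of the items in this list, rate-limiting is the one for which I did not name a module, so here is a minimal sliding-window sketch of the idea. The class name, the limits and the `client_id` key (an IP address, for instance) are assumptions.

```python
from collections import defaultdict, deque


class RateLimiter:
    """Allow at most max_posts per client within a sliding time window."""

    def __init__(self, max_posts: int = 3, window_seconds: float = 60.0) -> None:
        self.max_posts = max_posts
        self.window_seconds = window_seconds
        self._history = defaultdict(deque)  # client_id -> timestamps of accepted posts

    def allow(self, client_id: str, now: float) -> bool:
        """Return True and record the post if the client is under the limit."""
        history = self._history[client_id]
        # Forget posts that have fallen out of the window.
        while history and now - history[0] >= self.window_seconds:
            history.popleft()
        if len(history) >= self.max_posts:
            return False
        history.append(now)
        return True


limiter = RateLimiter(max_posts=2, window_seconds=10.0)
print([limiter.allow("203.0.113.7", t) for t in (0.0, 1.0, 2.0)])
```

Rejected posts are not recorded, so a client that backs off regains access as its older posts age out of the window.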

Of these, I want to highlight the block anonymous links module, which can be very useful if you are mostly interested in discussions and feedback, not URLs, which spammers live for.
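Refusing links from anonymous users boils down to something like the sketch below. The pattern and the function names are assumptions; the Block anonymous links module may well use different matching rules.

```python
import re

# Rough link detector: explicit schemes, bare "www." hosts and HTML anchors.
LINK_PATTERN = re.compile(r"(https?://|www\.|<a\s)", re.IGNORECASE)


def contains_link(text: str) -> bool:
    return LINK_PATTERN.search(text) is not None


def accept_comment(text: str, is_anonymous: bool) -> bool:
    """Reject comments with links, but only from anonymous users."""
    return not (is_anonymous and contains_link(text))


print(accept_comment("Great post, see http://spam.example", is_anonymous=True))
```

Registered users keep the ability to post links, so the restriction only touches the population most of my spam comes from.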

I will also mention the hashcash module, which forces spammers to solve a cryptographic problem before being allowed to post. This adds a small delay before comments are posted but may keep all bots at bay (not tested).
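The principle behind this kind of proof of work can be sketched as follows. This is a simplified variant using SHA-256 and a made-up `challenge:nonce` format, not the real Hashcash stamp format and not the module's code; the challenge string and the difficulty are assumptions.

```python
import hashlib
import itertools


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Cheap server-side check: one hash."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode("utf-8")).digest()
    return leading_zero_bits(digest) >= difficulty


def solve(challenge: str, difficulty: int) -> int:
    """Expensive client-side search: about 2**difficulty hashes on average."""
    for nonce in itertools.count():
        if verify(challenge, nonce, difficulty):
            return nonce


nonce = solve("comment-form-42", 12)
print(verify("comment-form-42", nonce, 12))
```

The asymmetry is the point: one comment costs a visitor a fraction of a second, while a spam run of thousands of comments becomes noticeably expensive.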

Finally, there is the Not so fast module, which allows validation of anonymous users' emails, but unfortunately will let comments be posted even if the anonymous user is not validated. This is mostly useful if you keep your site under constant surveillance.

Oh, and the Bad behaviour module also looks very interesting. It depends on the Bad behaviour library and detects spam bots using a variety of fingerprinting heuristics. No CAPTCHA, no content filtering - just checking if the user is a bot. Not tested.

General tips on implementing policies

Try one thing at a time. Let things sit for a while and see if the new measure is broken through, and maybe try to see why (harder). Then try another thing and wait again.

If you pull all the guns and it works, you'll have too many heavy measures in place and you won't know what's really necessary, and you'll bother your users. Do things one at a time.

mollom

i like that mollom uses data from all of its customers to identify spambots when they visit your site; this is clearly an easier task than flagging spammers based solely on data from any one site. you can also configure it to only show a captcha for suspicious IPs, which is minimally intrusive (and means less data is sent to their servers). also, if you later flag a comment as spam its heuristics will be updated.

that said, even if you trust dries buytaert and his company, there's something to be said about not relying on the cloud for everything: http://www.theatlantic.com/technology/archive/2011/09/the-clouds-my-mom-cleaned-my-room-problem/245648/

Comment by mvc — 2011-09-29 09:34

400 spam comments is not a notable amount per 186 legitimate comments; I get that many per 2 legitimate comments.

Just my point of view: do not use antispam protection based on centralized external blacklisting. It is too easy to manipulate for blackhat professionals and spammers armed with bots.

Comment by Conspywrightor — 2011-10-01 09:37

Seems effective! So far I haven't received new spam, although one comment was blocked as it triggered the Bayesian spam filter (or some piece of the spam module). I have since reduced the weight of the filter and it is quite successful at catching spam that gets through the honeypot and riddler modules.

Comment by anarcat — 2011-10-19 22:50

Cross reference

How about something that cross-references comments across sites to weed out the spam, as the spam will most likely be similar wherever it is posted? If sites x, y and z all receive a comment pointing towards the same place, it might be spam. Then some parsing of pages could find keywords from pre-analyzed spam.

Comment by Anonymous — 2011-11-02 18:14