This sounds like it's coming from someone who hasn't had any real experience with large scale spam problems.
We operate a forum with 250k members and ~800k posts per month, with a new registration every minute. We get plenty of spam bots even with a captcha (via Mechanical Turk etc.), and without one it's unworkable. Captcha is a necessary evil, but it does help.
This seems to be coming from someone dealing with a site where spam wouldn't be that much of a problem, who would sign up to animoto to spam? Very silly post.
They are proposing alternative methods that don't put the burden on the user in the form of explicit action. By using honeypot fields that would only be filled out by a robot, and timestamp analysis which effectively detects automatic form submission, they can weed out the bots without asking their users to do anything.
Honeypot fields and their ilk are easy to bypass with a focused attack. For smaller sites, that's fine - who's going to make the effort to target you? Keep out the opportunistic bots rattling your contact form, and life's good.
For juicier targets, something more sophisticated is necessary. Captchas are one answer.
Nothing, they're a good solution for this company. My point was the conclusions were based on this company, not everyone who suffers spam, the article has since been updated with:
> For some reason this article has hit the front page of Hacker News and is getting quite a lot of traffic. I should mention that yes, I acknowledge CAPTCHAs are of course sometimes unavoidable. That doesn’t mean, however, that we should ever feel good about using them, nor should we fool ourselves that users don’t mind them.
Which was my point. When spam is a serious issue then captchas are unavoidable 99% of the time.
If your site is running on popular forum software, robot-only inputs and timestamp analysis would eliminate most of your problems.
Spammers are probably not targeting your website in particular, rather the software your forum is run on. If you add atypical anti-spam measures you'll separate yourself from others using the same platform, defeating the typical phpbb or vbulletin bot which probably accounts for most of your spam.
The thing is, though, for anything small scale a simple "Type Human" box is 100% effective against random spam bots. For anything like what you are experiencing, you are being targeted, so even with the best CAPTCHA around, spam is still going to be a problem.
I honestly believe that CAPTCHAs are one of the most evil things on the internet and that there are many valid and better ways to avoid spam.
BTW, I'm not just some random guy with an axe to grind over this. I wrote http://www.wausita.com/captcha/ as an example of how trivial 90% of the CAPTCHAs on the web are to decode.
A lot of targeted attacks use humans to decipher captchas. Hell, a lot of programs used by internet marketing will display a captcha to decipher every 2 seconds in order to post in a forum/website. Invisible fields are in my opinion a much better solution, but who cares what I think, did you try other solutions before calling them silly?
You have a valid point. However, this is pointing out the ignorance of developers (myself included): most of the time a captcha is unnecessary and hurts the UX.
So yes, who would spam Animoto? But you know what, now that their spam filter is good enough, their user registrations have increased. Better, more "foolproof" techniques will always be needed, and directed attacks are hard to prevent, but getting to a good-enough point is great as well.
And the problem you face may be a smaller percentage of sites than animoto, who do need captcha especially if you are the target of a directed attack.
Since several commenters have been asking for an explanation of honeypots and timestamps, here's a link[1] I happened to run across just recently and a quick explanation.
- Honeypots: Add a field to your form that is styled to be invisible to normal human users, such as being located off the screen, sized to 1 pixel, or placed behind/under images on the page. Bots examine a page through HTML rather than through eyesight and will not distinguish these fields. Reject submissions which have entered text in the honeypot fields.
- Timestamps: Some spambots operate by 'playback' - a human fills the form out correctly once, then copy-and-pastes the form output into a script that replaces the comment text/etc. with desired spam links. Place a hidden field in your form that contains a timestamp (possibly hashed or combined with other form output). Reject submissions which contain a timestamp far in the past, indicating a bot which is 'playing back' an old submission.
The idea with defeating spam is not to be 100% accurate with unbeatable security, since no matter your system, a bot tailored to your site can defeat it. However, putting several simple techniques together can defeat general-purpose bots that shotgun spam across many sites. This reduces spam to levels that are manageable by hand.
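For the curious, the two checks above can be sketched in a few lines of server-side Python. The field names here are made up for illustration; the honeypot would be whatever hidden field your own form uses:

```python
import time

# Hypothetical field names -- adjust to match your form.
HONEYPOT_FIELD = "state"      # hidden from humans via CSS
TIMESTAMP_FIELD = "form_ts"   # hidden input set when the form is rendered
MAX_AGE_SECONDS = 24 * 3600   # reject 'played back' forms older than a day

def looks_like_bot(form):
    """Return True if a submission trips the honeypot or timestamp check."""
    # Honeypot: a human never sees this field, so any value means a bot.
    if form.get(HONEYPOT_FIELD, "").strip():
        return True
    # Timestamp: reject submissions replaying a form rendered long ago.
    try:
        rendered_at = float(form.get(TIMESTAMP_FIELD, "0"))
    except ValueError:
        return True  # missing or tampered timestamp
    return time.time() - rendered_at > MAX_AGE_SECONDS
```

In practice you would also sign the timestamp (see the hashing suggestion elsewhere in this thread) so a bot can't just send a fresh one.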
That kind of text will still confuse the majority of Web users. It's the kind of thing my mom or wife would show to me and say, "So am I supposed to put my state in? Why do they have it there if I'm not supposed to fill it in?"
roel_v makes a valid point - the majority of web users will not see this text. The whole point is that it's hidden and will only be 'seen' by people using screen readers. The text could easily be changed to "To reduce spam we have included this extra field. If you are a human, please leave it blank."
It was a valid question - the GP was talking about blind users, and then some guy who didn't even read the discussion properly comes waltzing in with a non sequitur about his mom.
Another form of timestamp analysis is to detect submissions that happen too quickly. A spammer's signup script is likely to fill out the form and submit it nearly instantly. Of course a spammer could beat this by waiting a small randomized amount of time, but that makes spam signups more expensive and might also deter them.
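A minimal sketch of that "too fast" check, with an assumed threshold you'd tune for your own forms:

```python
import time

MIN_SECONDS = 3.0  # assumed threshold; humans rarely submit faster

def submitted_too_fast(rendered_at, submitted_at=None):
    """Flag submissions completed faster than a human plausibly could."""
    if submitted_at is None:
        submitted_at = time.time()
    return (submitted_at - rendered_at) < MIN_SECONDS
```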
Many automated form fillers for normal people, such as LastPass or even Firefox's form filler, will fill out and submit forms quickly as well. Perhaps not as quickly as an automated script, but worth looking out for.
A very rudimentary defense at best - any spammer or scraper worth their salt is randomizing their timing. Better yet, have timings derived from real users. For a dedicated attack (or even a category-specific attack like forum signups) timing would solve little.
This is a good method... especially if you look at it over the course of more than one page. If you have a multi-page signup funnel, you can watch the time it takes someone to get from the first form to the last.
I have traditionally used honeypot fields in this manner. Recently, however, I have noticed some false positives because of autofill features in browsers (especially Chrome). To work around this, I would add to the above that it may be useful to remove the field in the form's submit event handler and then test for its presence on the backend. Alternatively, just use the timestamp approach.
Much appreciated, especially for the link provided. Quite helpful. Thinking about it, for user registration (I don't have comments) this seems pretty reasonable. Combined with a few fun tricks on the input fields, this should be sufficient, especially given the other mechanisms on my site. One less captcha on the web can only be better :)
This is why I'm terrible at statistics. To me, it looks almost like wave function collapse.
The results are good only if you do a set number of observations (say 500) instead of waiting for a significant result (say it happens at 623). But what if you had decided to run 623 tests at the beginning?
No problem with that. But compare these two experiments:
    # Experiment 1: collect all 623 results, then test once at the end
    for i in range(623):
        data.add_result()
    s = calculate_significance(data)
    if s > 0.95:
        publish()

    # Experiment 2: test after every result and stop at the first success
    for i in range(623):
        data.add_result()
        s = calculate_significance(data)
        if s > 0.95:
            publish()
            break
The second one gives you many more chances to succeed, which must result in your confidence in the answer going down.
I run a Travel Blog host, and get several hundred spam attempts per day, accounting for more than 90% of the posts on the site. Still, I refuse to put CAPTCHAs in between my users and what they want to accomplish. It's just a terrible user experience.
Instead, I use a combination of human detection scripts, bayesian filtering, and moderation. Combined, this keeps the site pretty much 100% spam free from the perspective of our end users, and more importantly, Googlebot.
I have been using pseudo-timestamps and honeypot fields for a while now and it has worked pretty well for me. I get a bit of spam every now and then but it is usually someone manually copy-pasting. I could safely block those too but it is infrequent enough that I don't need to bother.
Wait until your site gets popular. It'll get fun sooner or later.
The latest thing I'm seeing on my site is a robot that automates real web browsers, jumps between ip addresses, scrapes real user content off the site, then posts it back using some form of Markov generator to make the content look unique. It'll do that on new accounts for weeks before trying to insert any links.
It's amazing the lengths spammers will go to to get their content onto your site. In this case, the crawler is clearly written specifically for my site, even though it's only PR4 and nofollows all its links. It's no wonder 99% of the content on big sites like Blogger is spam.
What I've done recently is give a text field a worthless class, then use javascript to change its class to one with display:none. Sure, bots may execute the javascript, but doing so makes them run much slower, so I'd think the majority of them don't.
There's no single way to block spam. I just showed two of the most basic methods I used. There are lots more that can be setup if spam starts to become a problem without resorting to captchas.
You just gave me an interesting (but somewhat unrelated) idea.
You can "sign" a timestamp by appending a hash of that timestamp with some secret value. This way, whenever the user submits your form, you can reliably determine when it was requested without storing anything on the server.
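A stateless sketch of that idea using an HMAC (the secret and the one-hour expiry here are placeholders):

```python
import hashlib
import hmac
import time

SECRET = b"change-me"  # server-side secret, never sent to the client

def issue_token(now=None):
    """Embed in the form: a timestamp plus an HMAC so it can't be forged."""
    ts = str(int(now if now is not None else time.time()))
    sig = hmac.new(SECRET, ts.encode(), hashlib.sha256).hexdigest()
    return f"{ts}:{sig}"

def verify_token(token, max_age=3600, now=None):
    """Return True if the token is authentic and not too old."""
    try:
        ts, sig = token.split(":", 1)
    except ValueError:
        return False
    expected = hmac.new(SECRET, ts.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False
    current = now if now is not None else time.time()
    return 0 <= current - int(ts) <= max_age
```

Because the signature covers the timestamp, a bot can't fabricate a fresh-looking token without the server secret, and the server stores nothing per-request.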
The second half of the article reveals that (in a specific case) removing the CAPTCHA improved conversion from 48% to 64%. I didn't much like the rest of the article, but this is interesting.
What they failed to mention is what percent of that boost came from autofillers/spammers.
They say they successfully used timestamp/honeypots to keep out spammers; if so, how many spammers did they keep out? If it was tons, then say so, that's useful information. If it wasn't very many, then they didn't need the CAPTCHA in the first place.
I'd be interested in knowing whether the application itself is designed to be immune to autofilled accounts. Assuming people use it to create slideshows they can then share with their family/friends and not socially/crowdsourced a la flickr, a bunch of bots with garbage accounts no one has to look at wouldn't actually harm anyone else's experience of the site.
I've got to say that from a developer's perspective it's worth trying, for the benefit of your customers, not to put a CAPTCHA in a form if at all possible. No one enjoys filling out a CAPTCHA. I'd suggest trying honeypot fields, timestamps, hashed value matching, etc., which are all invisible to the end user.
I think not being a lazy developer, so that your customers don't have to make as much effort, is a good thing. Only when other methods don't work should you employ a CAPTCHA.
This just encourages spambots to upgrade their technology. You could upgrade spambots quite easily by just running them inside a headless browser with full javascript support, like phantomjs.
There is very little distinction between writing a phantomjs unit test and writing a spambot.
Spammers make money off of volume. The more expensive it is to deploy each individual spam, the less money spammers make. Processing and running javascript, retrieving and inlining external CSS, and rendering HTML all take time. Writing custom bots also takes time.
I think the goal is to make spammers lose the arms race simply because the payoff has become too small. If we can do that without CAPTCHAs, so much the better.
You can apply the same to CAPTCHAs as well; it's not hard to automate CAPTCHA input either. But the reality of the situation is that the vast majority of spam bots are simple, and every additional check you put into your form increases its effectiveness by an order of magnitude.
@mayank is right: when you gain popularity you are EXPOSED, and any developer can spam your system without much effort.
Captchas are effective even if they are EVIL; anyhow, I think that one day we'll find a viable solution against spam.
You can't do the same with captcha because that would require a degree of brute forcing. And my point was that existing spambots could trivially be upgraded to handle hidden form values and keystroke timers and other automated javascript validation.
1. you run your existing spambot software through phantomjs.
2. your unmodified bot fills in all the visible fields without changing a single line of code, and the webkit backend transparently computes your hashes and other automated javascript "human" tests.
3. again, your existing "stupid" spambot code submits your form, and your site is now overrun by spam.
With Captcha, you get an image and a unique ID that is validated at the server. Sure, you could run it through mechanical turk, but I'm guessing that a few CPU cycles to load a webkit backend is still vastly cheaper than farming work out to MechTurk.
My point is that you wouldn't even have to change your spambot software to defeat these "new" validations, and they can be trivially overcome, as opposed to MechTurk+reCaptcha. Add to that the benefits of targeting sites that are relatively spam-free, and you have a real incentive for spammers to simply plug-in phantomjs instead of using WWW::Mechanize or what have you.
The point is that all these measures are 'trivial' to break, and so are captchas. Except with captchas you impose a burden on your user, whereas with the other techniques you can offload that burden to the developer. I'm not sure what the 'existing' part in 'existing spambot' has to do with it - the time it would take to add farmed captcha solving is marginal (you don't even have to Mechanical Turk it - most captchas are broken with OCR software readily available on the underground market anyway).
captcha = sign of a clueless or lazy developer, or both. I don't put up with it anymore - I have yet to encounter a single registration that I actually need that uses a captcha. I'm not the only one, either.
Sometimes I think we have it backwards. Instead of trying to determine if someone IS a spammer, why not try to figure out if they're definitely NOT a spammer?
So start with the pessimistic view that they are, and that they need to be shown a CAPTCHA. Then do some analysis to try to figure out if they're legit, e.g. time spent on the page, mouse/keyboard interaction, geo-location, referrer etc.
If they all check out, don't show them the CAPTCHA (perhaps just rely on honeypot inputs); otherwise show them a CAPTCHA as a next step after posting content (and apologise in case it's a false positive).
I'd just like to point out that the way they did their A/B testing might be flawed, you can't run the test until you get a certain confidence, you have to decide beforehand how long you'll run it. They seem to have run it until they got 99% confidence, which is probably the wrong way to go about it.
Here's an idea: Force registrants to submit a computationally expensive token along with their registration form. Perhaps it's computed with javascript. Users usually spend more than 15 seconds on the form anyways, and spammers will hate to peg their hardware like that.
Add 100ms-of-2011-avg-CPU computation and tie it to the submit button (avoiding any complications from interleaving with user activity). That deals with first-order dumb bots and makes life a little harder for the JavaScript-executing (but still volume-based) folks. Marry it to a Bayesian system to handle the third-order Mechanical Turk-style miscreants.
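A hashcash-style proof-of-work is one way to implement this. The sketch below is in Python for clarity (in practice the `solve` half would run in the browser's JavaScript); the difficulty value is an assumption you'd tune to hit your target solve time:

```python
import hashlib

DIFFICULTY = 12  # leading zero bits required; tune so solving takes ~100ms

def _valid(challenge, nonce):
    """True if sha256(challenge:nonce) has DIFFICULTY leading zero bits."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - DIFFICULTY) == 0

def solve(challenge):
    """Client side: brute-force a nonce (the expensive part)."""
    nonce = 0
    while not _valid(challenge, nonce):
        nonce += 1
    return nonce

def verify(challenge, nonce):
    """Server side: a single cheap hash checks the work was done."""
    return _valid(challenge, nonce)
```

The asymmetry is the point: the client burns thousands of hashes per submission while the server spends one, so bulk spam gets expensive without the user typing anything.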
The article states that Animoto use "honeypot fields and timestamp analysis" instead of CAPTCHAs, which they claim has been effective to date. What do you think of this?
I use honeypot fields myself and they stop a ton of spam submissions. I'm sure timestamp analysis can be very effective too. I'm totally a fan. But are there bots smart enough to defeat it? You bet!
Some of my forms also have a CAPTCHA. I think it's got to be case-by-case. Do you have something desirable to bad guys (like the signup for a new Yahoo account, or a high-ranking blog about pharmaceuticals)? Do you have tools in place to deal with spam submissions effectively when they do occur? Will a bunch of bots signing up for accounts degrade service for legitimate visitors?
For example, the Contact Our Sales Team form definitely does not have a CAPTCHA. The sales team will gladly sort through a pile of junk if it means one more inbound lead. But the Post a Comment form would be an absolute disaster without a strong CAPTCHA. A surprising amount of junk gets through anyway, in fact. (As far as I can tell, it's actual humans in developing countries copy/pasting into comments by hand. Blocking referrers from Google that contain the phrase "post a comment below" made a dent.)
Think he probably means spammers are searching for the phrase 'post a comment below' on Google looking for forms they can spam. You'll see this search term in the HTTP referrer header.
Edit: obviously you could just avoid using this phrase on your site instead.
If timestamp analysis is effective now, it won't be forever. It would be trivially easy to program an autofiller to leave pseudo-random pauses between filling individual fields. If this becomes a much more common technique, the spammers will adapt.
We were getting Spam bots on our forum which uses the same registration info as our game. We used Captcha for a bit, but also noticed a big decrease in conversion rate, so then we tweaked the forum software a bit to require that you have gained at least 1 level in the game before you can post to the forum and now no captcha and no Spam.
I was surprised to find that when I pressed control-f and typed "duh" that zero results were found in the comments.
However flawed the experiment might've been, it's obvious that if you add barriers (e.g., CAPTCHAs) before some end goal and detract from user experience then you decrease your conversion rate.
Try mollom (http://mollom.com/). It uses text analysis for the most part and only uses CAPTCHA if its not sure. Even though I don't have a huge site it blocks a lot for me.
CloudFlare (http://cloudflare.com) also works great, since it does a quick Project Honeypot check on any suspicious visitors (along with a bunch of other good stuff).
CAPTCHAs were designed to tell computers and humans apart. Initially, they were simple tests which required users to identify certain words. However, computer vision is growing by leaps and bounds, so these tests have become so complicated that even humans find them difficult to comprehend. CAPTCHAs have gone from simple tests to extremely complicated ones over the last 10 years, but the design has never changed. We need an overhaul of CAPTCHA design. They need to be both usable and secure.
P.S. I'm working on the project to make CAPTCHAs more usable. We will have some updates soon. :)
The study he did isn't broadly valid because he only tested using a captcha system that is quite abysmal, and for which the results were not surprising.
If he wants to increase conversion rates, he should get rid of the irrelevant fields such as date of birth, zip code, country, gender, and check-to-agree to legal contract.
Ha, checking the actual site, "sign up" leads to "pricing" and not a sign up page. So much for their grave concern about losing sign ups at each stage.
On the other hand, his link to an article about including Honeypot fields is good advice and valuable. Timestamp analysis is not so great since it requires javascript and cookies. The more stuff you require the more users drop off. The problem with captchas is bad captchas that are impossible for humans to decode. Sometimes the reason these are used is because simpler captchas are implemented in a faulty manner that allows spammers to decode them without even having to do OCR. So the site developer upgrades to more complex captchas rather than fix the underlying problem that is breaking the captcha security.
I think it depends on the kind of CAPTCHA, how many people will give up. Some captchas are literally easier to read for a machine than for a human. For example, some use simple rotated text in unreadable grey on grey. Humans can hardly read it, but an algorithm doesn't care about the contrast at all. Very stupid. A captcha should be as easy to read by humans as possible.
Before arguing that "CAPTCHAs are a necessary evil", it pays to know the lifetime value of a user/customer for your site. It's likely that the cost of dealing with the spam would be lower than the amount of revenue lost from your CAPTCHA-impaired conversion rate.
If it's so hard to tell the true humans from the machines (CAPTCHA) shouldn't it be a lot easier to tell true machines from humans and human/machine combinations? (Human/machine combination, like a person in a debugger with some reverse engineering tools.)
Couldn't this be used to increase the security of computer systems? What if one could extend this to be able to tell particular machines from humans, human/machine combos, and counterfeit machines. I suspect one can do this. I have been working on this problem for the past 3 months, and I'm about to implement it and publish it on the App Store.
I suspect one technique would be to add extra fields to the HTML form that are hidden when the page is viewed in an actual browser. Any submissions with values specified for these fields would likely come from a bot, since a normal user would not have been able to enter anything.
I'm guessing the timestamp, in its simplest form would just submit the time when the page was loaded as a hidden variable in the form, and compare it to the time the form was submitted.
If it's less than something reasonable for a person (say, 20 seconds or something), then it was clearly auto-filled.
With a little help from javascript, you could even expand this to the individual fields.
As someone mentioned in the comments above, it would be pretty trivial for spammers to adapt to this if they thought it was common, with a few random pauses. Perhaps they already have...
Bots tend to fill in every input field they encounter. So you could add an empty hidden input field to your form and check if the field has been populated.
Another way is to look how long it took to open the page which contains the form and the form got submitted by injecting a timestamp. Bots are way faster than humans.
The Project Honeypot website can help you with setting up a honeypot as well as blocking spammers other users have already detected: http://www.projecthoneypot.org/
I could be mistaken, but I think Project Honeypot is trying to address a different problem - harvested email addresses.
I believe the Honeypot concept that has been discussed on here is referring to creation of a honeypot field on a web form, tempting the bot to fill it in. Many bots will blindly try to submit something into each field, just to make sure that they get all the required fields on their form submission.
By adding a honeypot field, and adding text that instructs humans to leave it blank, a very high percentage of bot submissions will be detected, with few false positives.
Furthermore, you can hide the field from humans, with CSS tricks, as others mentioned. Make it 1 pixel. Make it hidden. etc.
> "In addition to including specially tagged spam trap addresses, some honey pots also include special HTML forms. Comment spammers are identified by watching what information is posted to these forms."
You're absolutely right that fake fields like that are a good way to catch bots, though, and that making your site unique is a great way to avoid being targeted by mass attacks that go after, say, all MediaWiki sites. Of course that doesn't help when you're big enough to be worth attacking specifically, but it makes things a little harder for the spammers.
I'd love to see Google tackle this by identifying this spam and immediately penalizing the links they are spamming.
I assume the spam is there in the first place to increase search engine rankings; so why not update the Google ranking algorithms (for example) to identify this spam and immediately give the targeted site (but not the site with the spam on it!) a terribly low rating?
Then, hopefully, the incentive to spam in the first place is gone.
The bottom line is that people don't like CAPTCHAs, and irritating potential customers/users within the first five minutes of a visit cannot leave a good impression. Most people don't really understand what they're used for, and they get frustrated when they cannot read them and/or get rejected. I have definitely been taking steps to limit my use of them or dispense with them altogether.
2. Identify bad blocks of IPs. If it's a datacenter, someone is probably running spamming software on a dedicated server or VPS. Maybe get your hands on some of those open proxy lists that are floating around.
3. Use your data to prune bad accounts, throttle or block creation of new ones, etc.
A word of warning: between forged IPs, compromised systems, and formerly hostile IP space given to new owners, an IP blacklist will eventually hit legitimate customers. I speak from experience on this since I had the same bright idea.
You are right about the blacklist, however, it's very unlikely you will have legitimate users coming from datacenter IPs. I've used this trick to prune hundreds/thousands of bad accounts in a couple of forums. You need to be careful with it, but I think it's a worthwhile method.
We've found that required email confirmations can drop conversion rates by 60%. Captchas I wouldn't worry about unless you have serious spam problems. It seems better to detect unhuman-like engagement and only then push a captcha.
When the client loads the page, the server sends a hash of the timestamp and asks for the client to store it. When the client submits the form, it also sends the stored hash.
This exploits the fact that bots don't usually run javascript or load all resources on a page.
I understand the honeypot technique, which is quite cool. However, what is this timestamp analysis stuff? Does anyone have a link to a decent explanation, or care to explain it in a few words?
They then removed the CAPTCHA, and it boosted the conversion rate up to 64%. In conversion rate lingo, that’s an uplift of 33.3%!
Pretty sure that 33% was bots, lol.
And they do train the bots to avoid honeypot fields and timestamp analysis - all they have to do is look for type=hidden or display:none/visibility:hidden in the CSS.
I use simple math instead of word captchas, seems easier on people.
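A math captcha of this sort fits in a few lines. This is a hypothetical sketch (question format and digit range are arbitrary), and note it only stops bots that don't bother parsing the question:

```python
import random

def make_math_captcha(rng=None):
    """Return a simple arithmetic question and its expected answer."""
    rng = rng or random.Random()
    a, b = rng.randint(1, 9), rng.randint(1, 9)
    return f"What is {a} + {b}?", a + b

def check_answer(submitted, expected):
    """Tolerate surrounding whitespace; reject anything non-numeric."""
    try:
        return int(str(submitted).strip()) == expected
    except ValueError:
        return False
```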
Pro tip: You can usually get away with entering invalid similar characters on recaptcha when the word is really blurry. Substitute 'ri' for 'n', for example.
I like to do this as a game, to see what I can get away with, adds some fun to the drudgery of typing in a captcha.
The really blurry word is the word they're trying to OCR; generally it doesn't matter what you type as it'll accept it provided the other word is entered correctly.
Of course your 'game' is hurting reCaptcha's goal of digitizing books.
Compare: "security labels in clothing are a way of announcing to the world that you've got a theft problem, that you don't know how to deal with it, and that you've decided to offload the frustration of the problem onto your user-base. Security labels suck, because you can't properly try some pieces of clothing on with those labels in them, which means sales go down."
Such complaining doesn't accomplish a thing, unless you tell them about an effective alternative. If you don't change anything about the trade-off they have knowingly made, nothing will change. To have any chance of convincing anyone, you at least need to explain the alternatives. Everyone that reads this post just shrugs their shoulders and ignores you, because their captchas effectively solve a problem they and their clients would suffer from without those captchas.
In this case, if you open with
Using a CAPTCHA is a way of announcing to the world that
you’ve got a spam problem, that you don’t know how to deal
with it, and that you’ve decided to offload the
frustration of the problem onto your user-base.
then I think it is very dissatisfying[1] to follow up later with
They replaced the CAPTCHA with honeypot fields and
timestamp analysis, which has apparently proven to be very
effective at preventing spam while being completely
invisible to the end user.
which indicates that you have no idea about alternatives for fighting spam, apart from some measures that have 'apparently' helped in one particular case. It's not better than someone in a bar complaining about stupid government rules, without any idea or suggestion for how to improve things.
[1] it said 'hypocritical' here. That is not the correct word for it.
That he offered up the word "apparently", even with strong supporting evidence, shows that he's being an objective reporter and a good scientist. I'm disheartened that this would earn somebody ridicule here.
The plural of 'anecdote' is not data. Simply reporting an anecdote makes you neither a scientist nor a journalist, no matter how strongly the anecdote supports your feelings on some matter. In the end, this is about his feelings on captchas. He hasn't made the case that a better trade-off between fighting spam and a higher conversion is possible; he has only suggested something based on an anecdote. As others immediately questioned: what happened to spam levels? 'Apparently' is not good enough when dealing with that serious problem.
You're right that it's a gaping hole in the article, but it's not hypocritical. All web projects involve people from different disciplines working together. My background, for example, is in Psychology and User Research. People in my role can relay user needs, goals and expectations to you, but we can't tell you what development approach to use to solve it.
Your analogy doesn't hold: security labels in real-world stores don't cause a percentage of customers to give up their purchases in frustration.
Having said that, in a forum like HN, most readers would expect both a statement of the problem and some proposed solutions. Frankly, when I posted this article, I didn't expect it to get onto the No. 1 spot on the front page. It must be a slow news day.
You're right, 'hypocritical' is not the correct word here, as you are not displaying behavior for which you criticise others.
As for the security labels: a few days ago I wanted to fit a belt with an awkward security label that prevented a proper fit. It was an additional bump that wasn't overcome and may have been the only thing preventing me from buying it.