As a website owner, you probably love the idea of thousands of people coming to your site every day. But what if most of those requests were automated, coming from bots and scrapers that wanted to harvest and collect your content and information for their own purposes?
Clearly, not all website traffic is desirable. These types of requests can add a lot of extra load to your website’s servers.
As a general rule of thumb, content that’s presented publicly on a website can also be accessed by web scrapers.
A determined web scraper can copy any human browsing behavior to make their requests “blend in” with legitimate traffic to your website.
So if you’re putting valuable content online, web scraping could pose a very real threat to your business. Having written a book on web scraping and spent a lot of time scraping sites myself, I’ve seen most of the common anti-scraping techniques in the wild, and I know what works and what doesn’t.
Ultimately, most of these tactics can be thwarted by a determined enough scraper, but these are some simple things you can do to frustrate the scrapers and make your site far less of a target.
Require a login for access
The de facto protocol of the web — HTTP — is designed to be stateless. A client sends an anonymous request and the server simply returns a response and then forgets everything about that connection. Cookies were invented to help bring “state” to the web, meaning that a server can recognize that a series of requests all came from the same client. One of the most common ways cookies are used is to implement logins. On a public website, a scraper doesn’t need to identify itself or send the same cookies on each request.
But if that page is protected by a login, then the web scraper has to send some identifying information along with each request (the session cookie) in order to view the login-protected content. This makes it easy to tell when a lot of requests are all coming from the same client, and allows you to tie the scraping behavior to a specific user in your system. You can then revoke their credentials or block their account to shut off their access.
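The session-tracking idea above can be sketched in a few lines. This is a minimal, hypothetical example (the names, thresholds, and in-memory storage are all assumptions, not a production design): count recent requests per session cookie and revoke any session that blows past a rate limit.

```python
import time
from collections import defaultdict, deque

# Hypothetical sketch: track request timestamps per session cookie and
# revoke any session that exceeds a simple rate threshold.
WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 120

_request_log = defaultdict(deque)   # session_id -> recent request timestamps
_revoked = set()                    # sessions we've shut off

def allow_request(session_id, now=None):
    """Return False once a session makes too many requests per window."""
    now = time.time() if now is None else now
    if session_id in _revoked:
        return False
    log = _request_log[session_id]
    log.append(now)
    # Drop timestamps that have fallen outside the rolling window.
    while log and now - log[0] > WINDOW_SECONDS:
        log.popleft()
    if len(log) > MAX_REQUESTS_PER_WINDOW:
        # Because the scraper had to log in, this maps directly to an
        # account you can block or whose credentials you can revoke.
        _revoked.add(session_id)
        return False
    return True
```

In a real app this check would live in middleware and the state would live in something like Redis rather than process memory, but the principle is the same: the login requirement gives every request a stable identity you can throttle.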
Change your site’s markup regularly
Scrapers work by finding patterns in your site’s markup, and using those patterns to help their scripts find the right data in the HTML page. For example, when you view a Google search results page, markup like this (simplified) appears for each result:

```html
<li class="g">
  <h3 class="r">
    <a href="…">I Don't Need No Stinking API: Web Scraping For Fun and Profit</a>
  </h3>
  <div>Sometimes you need to pull data from a service that doesn't have an API. Not to fear! Here's how (and why) you should consider web scraping.</div>
</li>
```

Every result is in an `<li>` element with the class `g`, the title of the page is in an `<h3>` element with the class `r`, and the link to the page is in an `<a>` element inside the `<h3>`. These sorts of patterns are easy to discover and provide a great way for a web scraping script to find the exact information it needs.
Every time your site's markup changes, there's a good chance that it will break many of the web scrapers that have been built for your site. If your site's markup changes frequently or is thoroughly inconsistent (making it hard to find patterns in the first place), then you might be able to frustrate scrapers enough that they give up.
This doesn't mean you need a full-blown website redesign to break a scraper; simply changing the classes and `id`s in your HTML (and the corresponding CSS files) should be enough to break a lot of web scrapers that rely on those CSS hooks. Note that you might end up driving your web designers insane as well, since they also rely on the elements' structure and attributes to apply styles to web pages.
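One way to get this churn for free is to generate class names at build time. Here's a hypothetical sketch (the function names and the `c-` prefix are my own, not an established convention): seed a random generator with something that changes per deploy, and have both the template and the stylesheet read from the same map, so styles keep working while the markup a scraper hard-coded against keeps changing.

```python
import random
import string

# Hypothetical sketch: generate fresh CSS class names on each deploy so
# scrapers can't hard-code selectors like "li.g" or "h3.r".
def make_class_map(names, seed):
    """Map logical names ("result", "title") to randomized class tokens."""
    rng = random.Random(seed)  # seed with e.g. the build number
    def token():
        return "c-" + "".join(rng.choices(string.ascii_lowercase, k=8))
    return {name: token() for name in names}

def render_result(css, title):
    # The HTML template and the generated stylesheet both read from the
    # same map, so styling survives while the class names churn.
    return '<li class="{}"><h3 class="{}">{}</h3></li>'.format(
        css["result"], css["title"], title)
```

A scraper that matched on a literal class name last week finds nothing this week, while your own code never references the random tokens directly.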
Embed information in images, instead of plain text
Most scrapers expect to be simply pulling text out of an HTML file. If the content on your website is inside an image, movie, PDF, etc., then you’ve just added another huge step for a scraper — parsing text out of a complex media object. Facebook did this for a while with users’ email addresses.
When you went to someone’s profile and looked at their basic information, everything was presented as plain text on the page — except for the email address, which was presented as an image. This was specifically designed to make it much harder for someone to scrape a bunch of email addresses from Facebook profiles, although some determined scrapers used OCR to parse the email address out of the image.
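The Facebook trick above can be sketched with Pillow, a third-party Python imaging library (this assumes Pillow is installed; the function name and sizing are my own rough choices, not how Facebook actually did it):

```python
from PIL import Image, ImageDraw

def email_to_png(email, path):
    """Render an email address as a PNG instead of serving it as text."""
    # Rough sizing: the default bitmap font is about 6px per character.
    img = Image.new("RGB", (6 * len(email) + 8, 20), "white")
    ImageDraw.Draw(img).text((4, 4), email, fill="black")
    img.save(path)  # format inferred from the .png extension
    return path
```

The page then serves `<img src="...">` pointing at the generated file, so a scraper pulling text out of the HTML gets nothing, and only OCR recovers the address.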
Embedding content in an image or other media object will probably make your site a bit slower for the average user, not to mention all the extra time it takes whenever you need to update the content on your website.
Additionally, by putting text content inside an image or other media object, you make your site far less accessible to blind or otherwise disabled users. This is a bit of an extreme measure, and it does have real drawbacks.
Use captchas
Captchas are those annoying blocks of garbled text you have to squint at whenever you sign up for things. The whole idea of a captcha is to be a test that’s very easy for humans to pass, but very difficult for computers. Reading garbled text out of an image is one such test, and there are many other types of captchas as well.
While humans tend to find the problems easy, they also tend to find them extremely annoying.
They’ve been shown to lower conversion rates and generally piss off users, so you want to be careful with how you use them. Maybe only show a captcha after a particular client has made several suspicious requests, the way Twitter does if you guess the wrong password too many times.
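That "only challenge suspicious clients" idea is easy to sketch. This is a minimal, hypothetical example (the threshold and function names are assumptions): count suspicious events per client, and only gate requests behind a captcha once a client crosses the line, so ordinary visitors never see one.

```python
from collections import Counter

# Hypothetical sketch: only show a captcha after repeated suspicious
# behavior, instead of challenging every visitor.
SUSPICION_THRESHOLD = 3

_suspicion = Counter()  # client_id -> count of suspicious events

def record_suspicious(client_id):
    """Call this on e.g. a failed login or an unusually fast page fetch."""
    _suspicion[client_id] += 1

def needs_captcha(client_id):
    """True once a client has racked up enough suspicious events."""
    return _suspicion[client_id] >= SUSPICION_THRESHOLD
```

On a successful captcha solve you'd reset the counter; legitimate users who never trip the threshold never pay the conversion-rate cost.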
Use honey pots
Honey pots are pages that a human visitor would never visit, but a robot that’s clicking every link on a page might accidentally stumble across. These are pretty good for catching web spiders that pull every link from a page and then visit every single one of those pages. They’re not necessarily scraping your content, but a poorly behaved crawler can cripple your site by visiting many pages at once and overwhelming it with traffic.
If the goal is to block undesirable access, you might consider adding a honey pot to your site to sniff out the spiders. Maybe include a link in the footer of your site but use some CSS to hide it or blend it in with the background. Once a particular client stumbles across a honey pot page, you can be relatively sure they’re not a human, and start throttling or blocking all requests from that client.
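A minimal honey pot only needs two pieces: a link humans never see, and a check that blocks whoever follows it. Here's a hypothetical sketch (the path, the inline `display:none` styling, and the in-memory flag set are all illustrative choices):

```python
# Hypothetical sketch: a hidden honeypot link plus a request check that
# flags any client that follows it.
HONEYPOT_PATH = "/do-not-follow"

# Hidden via CSS, so a human never clicks it; a naive crawler that
# follows every <a> on the page will. rel="nofollow" discourages
# well-behaved bots like search engines from taking the bait.
HONEYPOT_LINK = (
    '<a href="{}" style="display:none" rel="nofollow">ignore</a>'
    .format(HONEYPOT_PATH)
)

_flagged = set()  # clients we've caught

def handle_request(client_id, path):
    """Return False (block/throttle) once a client touches the honeypot."""
    if path == HONEYPOT_PATH:
        _flagged.add(client_id)
    return client_id not in _flagged
```

You'd embed `HONEYPOT_LINK` in the site footer and run `handle_request` on every hit; once a client is flagged, every subsequent request from it can be throttled or refused.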
Don’t post the information on your website
This might seem obvious, but it’s definitely an option if you’re really worried about scrapers stealing your information.
Remember, web scraping is just automated access of the information on your site. If you’re fine sharing content with anyone who visits your site, then maybe you don’t really need to worry about web scrapers. Keep in mind, Google is arguably the largest scraper in the world — crawling billions of websites and pulling them into its giant index.
People don’t seem to mind when Google scrapes their content, putting it in a public index that’s searched through by millions of people daily.
If you’re worried about information “falling into the wrong hands,” then maybe it shouldn’t be up there in the first place.
Since web scrapers can often take steps to disguise their traffic as regular site visitors, it can be a tricky game of cat and mouse to keep them away from your content.
Any steps that you take to limit web scrapers will probably also harm the experience of the average web viewer on your site. It’s a delicate balance to get your content in front of huge human audiences, but keep it away from the hordes of web scrapers and bots that crawl the internet every day.
Hopefully, with some of these tactics in your toolbelt, you’ll be able to strike the right balance for your organization.
I'm a 20-something, full-stack web developer. Author of Marketing for Hackers and The Ultimate Guide to Web Scraping.