Making Your Website Scrape-Proof: How to Thwart Content Thieves
Posted in: Getting Started   -   September 5, 2013

As a website owner, you probably love the idea of thousands of people coming to your site every day. But what if most of those requests were automated, coming from bots and scrapers that want to harvest your content and information for their own purposes?

Clearly, not all website traffic is desirable. These types of requests can add a lot of extra load to your website’s servers.

According to research published by Distil Networks, almost a third of all traffic to some major content sites is bot traffic.

As a general rule of thumb, content that’s presented publicly on a website can also be accessed by web scrapers.

A determined web scraper can mimic almost any human browsing behavior to make its requests “blend in” with legitimate traffic to your website.

So if you’re putting valuable content online, web scraping could pose a very real threat to your business. Having written a book on web scraping and spent a lot of time scraping sites myself, I’ve seen most of the common anti-scraping techniques in action, and I’ve seen what works and what doesn’t.

Ultimately, most of these tactics can be thwarted by a determined enough scraper, but these are some simple things you can do to frustrate the scrapers and make your site far less of a target.

 

Require a login for access

The de facto protocol of the web — HTTP — is designed to be stateless. A client sends an anonymous request and the server simply returns a response and then forgets everything about that connection. Cookies were invented to help bring “state” to the web, meaning that a server can recognize that a series of requests all came from the same client. One of the most common ways cookies are used is to implement logins. On a public website, a scraper doesn’t need to identify itself or send the same cookies on each request.

But if that page is protected by a login, then the web scraper has to send some identifying information along with each request (the session cookie) in order to view the login-protected content. This makes it easy to tell when a lot of requests are all coming from the same client, and allows you to tie the scraping behavior to a specific user in your system. You can then revoke their credentials or block their account to shut off their access.
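To make this concrete, here is a minimal sketch of per-account throttling behind a login. It assumes a Flask app, an in-memory store, and illustrative limits, none of which come from the article itself:

```python
# Hypothetical sketch: per-account rate tracking behind a login (Flask assumed).
import time
from collections import defaultdict, deque

from flask import Flask, abort, session

app = Flask(__name__)
app.secret_key = "change-me"  # required for session cookies

WINDOW_SECONDS = 60            # illustrative values, tune for your own site
MAX_REQUESTS_PER_WINDOW = 120

# In-memory store of recent request timestamps per user id.
# A real deployment would use shared storage such as Redis.
recent_requests = defaultdict(deque)

@app.before_request
def throttle_logged_in_users():
    user_id = session.get("user_id")
    if user_id is None:
        return  # let your normal login flow handle anonymous clients

    now = time.time()
    hits = recent_requests[user_id]
    hits.append(now)
    # Drop timestamps that have fallen outside the window.
    while hits and hits[0] < now - WINDOW_SECONDS:
        hits.popleft()

    if len(hits) > MAX_REQUESTS_PER_WINDOW:
        # This account is making far more requests than a human plausibly would.
        abort(429)  # or revoke the session / flag the account for review
```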

 

Change your site’s markup regularly

Scrapers work by finding patterns in your site’s markup, and using those patterns to help their scripts find the right data in the HTML page. For example, at the time this was written, a Google search results page used markup roughly like the following for each result (a simplified reconstruction):
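```html
<li class="g">
  <h3 class="r">
    <a href="http://www.example.com/">Example Page Title</a>
  </h3>
  <!-- snippet text, display URL, etc. -->
</li>
```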

Every result is in an <li> element with the class g, the title of the page is in an <h3> element with the class r, and the link to the page is in an <a> element inside that <h3>. These sorts of patterns are easy to discover and give a web scraping script a reliable way to find the exact information it needs.

Every time your site's markup changes, there's a good chance it will break any web scrapers that have been built for your site. If your site's markup changes frequently or is thoroughly inconsistent (making it hard to find patterns in the first place), you might be able to frustrate scrapers enough that they give up.

This doesn't mean you need a full-blown website redesign to break a scraper; simply changing the class and id attributes in your HTML (and the corresponding CSS files) should be enough to break a lot of web scrapers that rely on those CSS hooks. Note that you might also drive your web designers insane, since they rely on the same element structure and attributes to apply styles to your pages.
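One low-tech way to do this, sketched below under the assumption that your HTML and CSS are generated server-side, is to derive the emitted class names from a value that changes on every deploy, so any selector a scraper has hard-coded stops matching:

```python
# Hypothetical sketch: derive per-deploy class names so scrapers cannot
# hard-code CSS selectors. DEPLOY_SALT is an assumed environment variable.
import hashlib
import os

DEPLOY_SALT = os.environ.get("DEPLOY_SALT", "2013-09-05")

def obfuscated_class(logical_name: str) -> str:
    """Map a stable logical name (e.g. "result-title") to a deploy-specific class."""
    digest = hashlib.sha1((DEPLOY_SALT + logical_name).encode()).hexdigest()
    return "c" + digest[:8]

# Both the HTML templates and the generated CSS would call the same function,
# so your styles keep working while the literal class names change every deploy.
print(obfuscated_class("result-title"))  # prints something like "c3f9a1b27" (varies per salt)
```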

Embed information in images instead of plain text

Most scrapers expect to simply pull text out of an HTML file. If the content on your website is inside an image, a video, a PDF, etc., then you’ve just added a significant extra step for a scraper: parsing text out of a complex media object. Facebook did this for a while with users’ email addresses.

When you went to someone’s profile and looked at their basic information, everything was presented as plain text on the page except for the email address, which was presented as an image. This was specifically designed to make it much harder for someone to scrape a bunch of email addresses from Facebook profiles, although determined scrapers used OCR to parse the email address out of the image.
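If you wanted to do something similar yourself, here is a minimal sketch using the Pillow imaging library (my choice for illustration, not what Facebook actually used) to render an email address as a PNG:

```python
# Minimal sketch: render a piece of text (e.g. an email address) as a PNG
# so it never appears as plain text in the HTML. Uses Pillow (pip install Pillow).
from PIL import Image, ImageDraw, ImageFont

def text_to_image(text: str, path: str) -> None:
    font = ImageFont.load_default()
    # Rough sizing; a real implementation would measure the rendered text properly.
    width, height = 8 * len(text) + 20, 30
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    draw.text((10, 8), text, fill="black", font=font)
    img.save(path)

text_to_image("someone@example.com", "email.png")
# Then serve it with something like: <img src="email.png" alt="">
```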

Embedding content in an image or other media object will probably make your site a bit slower for the average user, not to mention the extra time it takes whenever you need to update the content on your website.

Additionally, by putting text content inside an image or other media object, you make your site far less accessible for blind or otherwise disabled users. This is a bit of an extreme measure, and it does have real drawbacks.

Present captchas

Captchas are those annoying blocks of garbled text you have to squint at whenever you sign up for things. The whole idea of a captcha is to be a test that’s very easy for humans to pass but very difficult for computers. Reading garbled text out of an image is one such test, and there are many other types of captchas as well.

While humans tend to find the problems easy, they also tend to find them extremely annoying.

They’ve been shown to lower conversion rates and generally piss off users, so you want to be careful with how you use them. Maybe only show a captcha when a particular client has made several suspicious requests, the way Twitter does when you guess the wrong password too many times.
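As a rough sketch of that idea, you could keep a count of suspicious events per client and only challenge clients that cross some threshold. The threshold and what counts as suspicious are assumptions you would tune for your own site:

```python
# Hypothetical sketch: only show a captcha once a client has racked up
# several suspicious events (failed logins, odd request patterns, etc.).
from collections import defaultdict

SUSPICION_THRESHOLD = 3  # illustrative value

suspicious_events = defaultdict(int)  # keyed by client id (IP address, account, ...)

def record_suspicious_event(client_id: str) -> None:
    suspicious_events[client_id] += 1

def needs_captcha(client_id: str) -> bool:
    return suspicious_events[client_id] >= SUSPICION_THRESHOLD

def clear_suspicion(client_id: str) -> None:
    # Call this once the client successfully solves the captcha.
    suspicious_events.pop(client_id, None)
```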

Use honey pots

Honey pots are pages that a human visitor would never visit, but that a robot clicking every link on a page might stumble across. They’re pretty good for catching web spiders that pull every link from a page and then visit every single one of those pages. These spiders aren’t necessarily scraping your content, but a poorly behaved crawler can cripple your site by visiting many pages at once and overwhelming it with traffic.

If the goal is to block undesirable access, you might consider adding a honey pot to your site to sniff out the spiders. Maybe include a link in the footer of your site but use some CSS to hide it or blend it in with the background. Once a particular client stumbles across a honey pot page, you can be relatively sure they’re not a human, and start throttling or blocking all requests from that client.
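Here is a minimal sketch of how that might look, again assuming a Flask app; the hidden link, the route path, and the in-memory blocklist are all illustrative:

```python
# Hypothetical sketch: a honey pot route that no human should ever request.
from flask import Flask, abort, request

app = Flask(__name__)

blocked_ips = set()  # a real site would persist this and expire old entries

@app.before_request
def reject_blocked_clients():
    if request.remote_addr in blocked_ips:
        abort(403)

@app.route("/totally-normal-page")  # linked only from a CSS-hidden footer link
def honey_pot():
    blocked_ips.add(request.remote_addr)
    return "Nothing to see here.", 200

# In the site footer template (hidden from humans, visible to naive crawlers):
# <a href="/totally-normal-page" style="display:none">secret deals</a>
```

One caveat: well-behaved crawlers like Googlebot respect robots.txt, so you would typically disallow the honey pot URL there to avoid accidentally blocking crawlers you actually want visiting your site.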

Don’t post the information on your website

This might seem obvious, but it’s definitely an option if you’re really worried about scrapers stealing your information.
Remember, web scraping is just automated access to the information on your site. If you’re fine sharing content with anyone who visits your site, then maybe you don’t really need to worry about web scrapers. Keep in mind that Google is arguably the largest scraper in the world, crawling billions of websites and pulling them into its giant index.
People don’t seem to mind when Google scrapes their content and puts it in a public index that millions of people search every day.

If you’re worried about information “falling into the wrong hands,” then maybe it shouldn’t be up there in the first place.

Since web scrapers can often take steps to disguise their traffic as regular site visitors, keeping them away from your content can be a tricky game of cat and mouse.

Any steps you take to limit web scrapers will probably also harm the experience of the average human visitor. It’s a delicate balance: you want to get your content in front of large human audiences while keeping it away from the hordes of web scrapers and bots that crawl the internet every day.

Hopefully, with some of these tactics in your toolbelt, you’ll be able to strike the right balance for your organization.


I'm a 20-something, full-stack web developer. Author of Marketing for Hackers and The Ultimate Guide to Web Scraping.

