Robots invasion

May 6, 2011

Again, and again, and again.. “how can I test this?” Same old question.

This week my team had to “instruct” spiders not to follow certain links, so that those pages would not be processed by the search engines’ indexing process. Given that robots.txt is the de-facto standard for this, we tried to test that behaviour. We initially found only syntactic check tools, useful while developing, but not enough as a safety net against regressions. So we decided to go a bit further: testing that the actual URLs were not processed by spiders.

Having a robots.txt file published under the site root with these rules..

User-Agent: *
Disallow: /shop/cart
Disallow: /shop/one-page-checkout

.. our goal could be easily documented by the following test:

test "should skip disallowed paths" do
  lets_crowl "/shop/cart"
  assert_nothing_was_surfed

  lets_crowl "/shop/one-page-checkout"
  lets_crowl "/shop/contact-us"
  assert_surfed_only '/shop/contact-us'
end
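
As an aside, before reaching for a real spider, a naive hand-rolled check against the published file can already answer “is this path disallowed?”. Here’s a minimal sketch (mine, not the team’s code, and it only understands the simple prefix rules our robots.txt uses):

require 'net/http'
require 'uri'

# Naive check: fetch robots.txt and treat every Disallow line as a global
# prefix rule (good enough for a file with a single User-Agent: * group).
def disallowed?(site_root, path)
  body = Net::HTTP.get(URI("#{site_root}/robots.txt"))
  body.scan(/^Disallow:\s*(\S+)/).flatten.any? { |prefix| path.start_with?(prefix) }
end

disallowed?("http://localhost:3000", "/shop/cart")       # => true
disallowed?("http://localhost:3000", "/shop/contact-us") # => false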

The next step was running this test against a local development server (http://localhost:3000 for a Rails webapp). Spiking around, I looked for existing spider engines written in Ruby, so that I could easily integrate one into our test codebase. It took me an hour to evaluate the tools: the winner was ruby-spider, forked from Spider.
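
Pulling it in should be a one-liner in the Gemfile; a minimal sketch, assuming the fork is published under the ruby-spider name (the upstream library is simply called spider on RubyGems):

# Gemfile -- the gem name here is an assumption about the fork;
# the upstream library ships as "spider" and exposes the Spider class.
gem 'ruby-spider'

Then, assuming the fork keeps the upstream file layout, a require 'spider' in the test helper makes the Spider class available.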

We actually start a new Spider instance, run it against a given URL, and check whether it was allowed to surf it or not. Then we “stop it”: this is done by instructing the Spider to only process URLs matching the exact target URL; otherwise, the Spider would keep processing every URL found in the HTML response, recursively, and so on.. Extra points would be setting a timeout on the test, or a max depth (a quick timeout sketch follows the fixture code). Anyway, here’s the fixture code:

def setup
  @surfed_urls = []
end

# Crawl a single path: the url_check restricts the Spider to the exact
# target URL, so it does not recursively follow every link it finds.
def lets_crawl(target_path)
  Spider.start_at(local_url(target_path)) do |site|
    site.add_url_check do |a_url|
      a_url.match("^#{Regexp.escape(local_url(target_path))}")
    end

    # Record every URL the Spider was actually allowed to surf.
    site.on :every do |a_url, response, prior_url|
      @surfed_urls << a_url
    end
  end
end

def assert_surfed_only(path)
  assert_equal [ local_url(path) ], @surfed_urls
end

def assert_nothing_was_surfed
  assert @surfed_urls.empty?
end

def local_url(path)
  "http://localhost:3000#{path}"
end

Last thing done: running this test against a test server. Shame on me, I could not manage to set up a Webrat integration test (the server is in-process). So, after verifying that our staging server would have really slowed the suite down, I went for a Selenium test: actually, I only changed one line of code, switching local_url to http://localhost:3001. Really, really nice!
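
For the record, that one-line change is nothing more than this (a sketch of the switch; the Selenium setup is assumed to boot the app on port 3001):

def local_url(path)
  "http://localhost:3001#{path}"  # standalone test server driven by Selenium
end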
