Screen scraping using a headless selenium

July 31, 2011 at 12:18 pm

Summary: I think that using selenium one can produce simple batch-process scrapers more quickly than with other strategies. Using Xvfb, one can run this headless, removing one of the largest irritations of selenium. However, this approach breaks down for a number of use cases.

https://github.com/argandgahandapandpa/Headless-selenium-screen-scraping-example

This is an example of generating a screen scraper using the selenium IDE.

The idea is to carry out a prototypical navigation to all of the pages that you are interested in scraping. You can then export this to a python unit test.

This unittest can then be customized by hand to perform the scraping you actually want and to add parameterization. Unfortunately, this customization will almost always involve modifying the generated source code, since one wishes to intersperse scraping within navigation. This is a bit of a shame, because it means that you can’t regenerate your code.

One then adds code to create a headless selenium session using Xvfb (see http://www.alittlemadness.com/2008/03/05/running-selenium-headless), which can be used to perform the scraping.
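A minimal sketch of the Xvfb step (the display number and screen geometry are arbitrary choices of mine, and Xvfb must be installed on the machine):

```python
import os
import subprocess
import time


def display_string(number):
    """Format an X display number the way the DISPLAY variable expects it."""
    return ":%d" % number


def start_xvfb(display_number=99):
    """Spawn a virtual framebuffer X server and point DISPLAY at it."""
    display = display_string(display_number)
    proc = subprocess.Popen(
        ["Xvfb", display, "-screen", "0", "1024x768x24"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    time.sleep(1)  # crude wait for the X server to come up
    # any firefox spawned after this point renders into the framebuffer
    os.environ["DISPLAY"] = display
    return proc
```

The returned process handle should be terminated when scraping finishes, otherwise stray Xvfb processes accumulate.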

Caveats to this approach:
* Although the machine doing the scraping doesn’t need to be running X, it requires firefox, the X libraries and the full gtk library. That is quite a lot of disk space, quite a lot of memory and quite a lot of installation. This is fine for running scripts on your server, but if you want to, say, distribute your scrapers over a number of EC2 servers, it becomes less fun.

* This is slow. The start-up is slow, and more pages are fetched than necessary. If you hand-roll a scraper you can often cut out a lot of the data (all the images, all the js files, a bunch of intermediate pages, any flash components). You can probably get a similar effect by using firefox profiles and switching off images and flash.
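One way to switch those off, sketched with the WebDriver bindings (the preference keys are my best understanding of Firefox’s about:config names, and the flash key in particular varies across Firefox versions):

```python
SPEEDUP_PREFS = {
    # 2 means "block all images" for this Firefox preference
    "permissions.default.image": 2,
    # 0 disables the plugin; the exact key depends on the Firefox version
    "plugin.state.flash": 0,
}


def make_light_profile():
    """Build a FirefoxProfile with images and flash switched off."""
    # assumes the Selenium WebDriver python bindings are installed
    from selenium import webdriver
    profile = webdriver.FirefoxProfile()
    for key, value in SPEEDUP_PREFS.items():
        profile.set_preference(key, value)
    return profile
```

The profile is then passed when constructing the Firefox driver, so every page fetch skips image and flash downloads.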

* This still doesn’t work too well with flash (though there is a flash extension)

* I’m still concerned about nasty bugs where the browser slips out of sync with the remote control, even though I haven’t seen this happen.

Advantages of this approach:

* No boring re-writing of http queries which you then get slightly wrong, no having to match headers.

* No having to pick apart javascript to work out what it is doing.

I feel that this approach works fairly well for quickly hacking together scrapers for batch processes (scrapers rather than robots), where they just need to run rather than run well.

——
A library for this.

I’ve created a small library for spawning headless seleniums cleanly. It deals with things like:

* Avoiding port collisions
* Avoiding display collisions
* Ensuring that resources are cleaned up
* Logging
* Waiting for processes to initialize.
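Two of those concerns, port collisions and resource cleanup, can be sketched like this (the helper names here are mine, not necessarily the library’s actual API):

```python
import socket
import subprocess
import sys
from contextlib import contextmanager


def free_port():
    """Ask the OS for an unused TCP port, avoiding hard-coded port collisions."""
    sock = socket.socket()
    sock.bind(("", 0))  # port 0 means "pick any free port"
    port = sock.getsockname()[1]
    sock.close()
    return port


@contextmanager
def reaped(proc):
    """Guarantee a child process (Xvfb, the selenium server) is cleaned up."""
    try:
        yield proc
    finally:
        proc.terminate()
        proc.wait()


# usage sketch: the child is killed even if the scrape raises
child = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(60)"])
with reaped(child):
    pass  # ... run the scrape against the spawned services ...
```

Binding to port 0 and reading back the assigned port is a standard way to dodge collisions when several scrapers share a machine.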

This is available here:

https://github.com/argandgahandapandpa/selscrape
