Screen scraping using a headless selenium
Summary: I think that using Selenium one can produce simple batch-process scrapers more quickly than with other strategies. Running it under Xvfb lets it run headless, which removes one of Selenium's largest irritations. The approach does break down for a number of use cases, however.
This is an example of generating a screen scraper using the Selenium IDE.
The idea is to carry out a prototypical navigation to all of the pages that you are interested in scraping. You can then export this to a python unit test.
This unit test can then be customized by hand to perform the scraping you want and to add parameterization. Unfortunately, this almost always means editing the generated source by hand, since the scraping has to be interspersed with the navigation. This is a bit of a shame, because it means you can no longer regenerate your code from the recording.
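As a rough illustration of that customization step, here is a minimal sketch of what an exported navigation might look like once scraping and a parameter have been added by hand. The method names (open, type, click, wait_for_page_to_load, get_text, is_element_present) follow the Selenium RC style client that the IDE export targets; the site, the locators and the function name are hypothetical and only there for illustration.

```python
def scrape_search_results(browser, query, max_pages=3):
    """Walk a hypothetical search site and collect one title per results page.

    `browser` is any object exposing the Selenium RC style methods used below.
    """
    titles = []
    browser.open("/search")                  # recorded navigation step
    browser.type("id=q", query)              # parameterized: a literal in the recording
    browser.click("id=submit")               # recorded navigation step
    browser.wait_for_page_to_load("30000")
    for _ in range(max_pages):
        # Scraping added by hand, in between the recorded navigation steps.
        titles.append(browser.get_text("css=h1.result-title"))
        if not browser.is_element_present("link=Next"):
            break
        browser.click("link=Next")           # recorded navigation step
        browser.wait_for_page_to_load("30000")
    return titles
```

Keeping the customized steps in a function that takes the browser as an argument also means it can be exercised with a stub object, without starting a real browser.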
One then adds code to create a headless selenium using Xvfb (see http://www.alittlemadness.com/2008/03/05/running-selenium-headless) which can be used to perform the scraping.
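The Xvfb step can be sketched roughly as follows. This is a minimal sketch, not the code from the linked article: it assumes Xvfb is installed and on the PATH, that display :99 is free, and that a fixed one-second pause is long enough for the server to start. The function names are made up for this example.

```python
import os
import subprocess
import time
from contextlib import contextmanager


def xvfb_command(display_number, width=1024, height=768, depth=24):
    """Build the Xvfb command line for one virtual screen on the given display."""
    return ["Xvfb", ":%d" % display_number,
            "-screen", "0", "%dx%dx%d" % (width, height, depth)]


def display_env(display_number, base_env=None):
    """Copy the environment and point DISPLAY at the virtual display."""
    env = dict(os.environ if base_env is None else base_env)
    env["DISPLAY"] = ":%d" % display_number
    return env


@contextmanager
def headless_display(display_number=99, startup_wait=1.0):
    """Start Xvfb, yield an environment for X clients, then shut Xvfb down."""
    server = subprocess.Popen(xvfb_command(display_number))
    try:
        time.sleep(startup_wait)  # crude wait; assumes Xvfb is up by then
        if server.poll() is not None:
            raise RuntimeError("Xvfb exited early; is the display already in use?")
        yield display_env(display_number)
    finally:
        if server.poll() is None:
            server.terminate()
            server.wait()


def run_headless(command, display_number=99):
    """Run a command (for example the exported test script) against the virtual display."""
    with headless_display(display_number) as env:
        return subprocess.call(command, env=env)
```

Something like run_headless(["python", "exported_test.py"]) would then run a (hypothetical) exported test with Firefox drawing to the virtual display instead of a real screen.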
Caveats to this approach:
* Although the machine doing the scraping doesn’t need to be running X, it requires Firefox, the X libraries and the full GTK library. This is quite a lot of disk space, quite a lot of memory and quite a lot of installation. This is fine for running scripts on your server, but if you, say, want to distribute your scrapers over a number of EC2 servers this becomes less fun.
* This is slow. The start up is slow. More pages are fetched. If you hand roll a scraper you can often cut out a lot of the data (all the images, all the js files, a bunch of intermediate pages, any flash components). You can probably get a similar effect by using firefox profiles and switching off images and flash.
* This still doesn’t work too well with Flash (though there is a Flash extension).
* I’m still concerned about nasty bugs where the browser slips out of sync with the remote control, even though I haven’t seen this happen.
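The Firefox profile idea from the speed caveat above could be tried along these lines. This is only a sketch: it writes a user.js file into a fresh profile directory, which Firefox reads at start-up. The permissions.default.image preference (2 means block) is a long-standing one, but the Flash preference name is an assumption that varies between Firefox versions, so check it against your browser before relying on it.

```python
import json
import os
import tempfile

# Preferences for a lightweight scraping profile.
LIGHTWEIGHT_PREFS = {
    "permissions.default.image": 2,  # 2 = do not load images
    "plugin.state.flash": 0,         # assumption: 0 = never activate; version dependent
}


def format_pref(name, value):
    """Render one preference as a user.js line (JSON literals match JS literals here)."""
    return "user_pref(%s, %s);" % (json.dumps(name), json.dumps(value))


def write_profile(prefs, profile_dir=None):
    """Write prefs to user.js in profile_dir (a new temp dir by default) and return the dir."""
    if profile_dir is None:
        profile_dir = tempfile.mkdtemp(prefix="scraper-profile-")
    with open(os.path.join(profile_dir, "user.js"), "w") as handle:
        for name in sorted(prefs):
            handle.write(format_pref(name, prefs[name]) + "\n")
    return profile_dir
```

The returned directory can then be handed to Selenium through whatever mechanism your setup offers for a custom Firefox profile.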
Advantages of this approach:
* No boring re-writing of http queries which you then get slightly wrong, no having to match headers. Of course it would probably
I feel that this approach works fairly well for hacking together scrapers quickly for batch processes (scrapers rather than robots) where they just need to run rather than run well.
Library for this.
I’ve created a small library for spawning headless seleniums cleanly. It deals with things like:
* Avoiding port collisions
* Avoiding display collisions
* Ensuring that resources are cleaned up
* Waiting for processes to initialize.
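To give a feel for what those items involve, here is a minimal sketch of three such helpers. It is not the library's actual code; the function names, the /tmp lock-file convention used to spot busy displays, and the default timeouts are all assumptions made for illustration.

```python
import os
import socket
import time


def find_free_port():
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


def find_free_display(start=99, limit=200, lock_dir="/tmp"):
    """Return the first display number with no X lock file (.X<n>-lock) in lock_dir."""
    for number in range(start, limit):
        if not os.path.exists(os.path.join(lock_dir, ".X%d-lock" % number)):
            return number
    raise RuntimeError("no free X display number found")


def wait_for_port(port, host="127.0.0.1", timeout=30.0, interval=0.2):
    """Poll until something accepts TCP connections on host:port; False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)  # not listening yet; try again shortly
    return False
```

Both "find free" helpers are racy: another process can grab the port or display between the check and its use, so a robust version would retry when the server fails to start.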
This is available here: