Archive for July, 2011

Screen scraping using a headless selenium

Summary: I think that using selenium one can produce simple batch process scrapers more quickly than using other strategies. Using xvfb one can have this run in a headless mode – removing one of the largest irritations of selenium. This approach breaks down for a number of use cases however.

https://github.com/argandgahandapandpa/Headless-selenium-screen-scraping-example

This is an example of generating a screen scraper using the selenium IDE.

The idea is to carry out a prototypical navigation to all of the pages that you are interested in scraping. You can then export this to a python unit test.

This unit test can then be customized by hand to perform the scraping that you want and to add parameterization. Unfortunately, this almost always means editing the generated source code, since the scraping has to be interspersed with the navigation. This is a bit of a shame, since it means that you can’t regenerate your code without losing those edits.

One then adds code to create a headless selenium using Xvfb (see http://www.alittlemadness.com/2008/03/05/running-selenium-headless) which can be used to perform the scraping.
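A rough sketch of what "creating a headless selenium" can look like, assuming Xvfb is installed and on the PATH. The display number 99, the one second startup wait and the script name in the usage comment are arbitrary choices for illustration, not something taken from the example repository.

```python
import os
import subprocess
import time
from contextlib import contextmanager


def xvfb_command(display, width=1024, height=768, depth=24):
    """Build the argv for an Xvfb server on the given display number."""
    return ["Xvfb", ":%d" % display, "-screen", "0",
            "%dx%dx%d" % (width, height, depth)]


@contextmanager
def headless_display(display=99, startup_wait=1.0):
    """Run Xvfb for the duration of the with-block.

    Yields a copy of the environment with DISPLAY pointing at the
    virtual server, suitable for passing to subprocess calls.
    """
    server = subprocess.Popen(xvfb_command(display))
    try:
        time.sleep(startup_wait)  # crude: give the server time to start
        if server.poll() is not None:
            raise RuntimeError("Xvfb exited early; is display :%d in use?" % display)
        yield dict(os.environ, DISPLAY=":%d" % display)
    finally:
        # Always stop the server, even if the scraper raised.
        if server.poll() is None:
            server.terminate()
            server.wait()


# Usage (my_scraper_test.py is a hypothetical exported selenium test):
#
# with headless_display() as env:
#     subprocess.check_call(["python", "my_scraper_test.py"], env=env)
```

Anything started with that environment, including the browser that selenium launches, draws to the virtual display rather than a real screen.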

Caveats to this approach:
* Although the machine doing the scraping doesn’t need to be running X, it requires firefox, the X libraries and the full gtk library. This is quite a lot of disk space, quite a lot of memory and quite a lot of installation. This is fine for running scripts on your own server, but if you want to, say, distribute your scrapers over a number of EC2 servers this becomes less fun.

* This is slow. The start up is slow. More pages are fetched. If you hand roll a scraper you can often cut out a lot of the data (all the images, all the js files, a bunch of intermediate pages, any flash components). You can probably get a similar effect by using firefox profiles and switching off images and flash.

* This still doesn’t work too well with flash (though there is a flash extension).

* I’m still concerned about nasty bugs where the browser slips out of sync with the remote control, even though I haven’t seen this happen.

Advantages of this approach:

* No boring re-writing of http queries which you then get slightly wrong, no having to match headers. Of course it would probably

* No having to pick apart javascript to work out what it is doing.

I feel that this approach works fairly well for hacking together scrapers quickly for batch processes (scrapers rather than robots) where they just need to run rather than run well.

——
Library for this.

I’ve created a small library for spawning headless seleniums cleanly. It deals with things like:

* Avoiding port collisions
* Avoiding display collisions
* Ensuring that resources are cleaned up
* Logging
* Waiting for processes to initialize.

This is available here:

https://github.com/argandgahandapandpa/selscrape
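To give a flavour of the first two items, here is one common way to pick a free port and a free X display number. This is a sketch under my own assumptions (X servers leaving lock files named /tmp/.X<n>-lock, the function names below), not a description of how selscrape does it.

```python
import os
import socket


def free_tcp_port():
    """Ask the OS for an unused TCP port by binding to port 0."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]
    finally:
        sock.close()


def free_display(start=1, stop=1000, lock_dir="/tmp"):
    """Return the first display number that has no X lock file."""
    for number in range(start, stop):
        lock_file = os.path.join(lock_dir, ".X%d-lock" % number)
        if not os.path.exists(lock_file):
            return number
    raise RuntimeError("no free display number found")
```

Both checks are racy: another process can grab the port or display between the check and its use, so a careful launcher would still need to retry on failure.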

July 31, 2011 at 12:18 pm

Postgres listing all tables ordered by approximate size

This was run against postgres 9. Probably works with different versions.

select nspname, relname, pg_relation_size(pg_class.oid) as size
from pg_class
join pg_namespace on pg_class.relnamespace = pg_namespace.oid
where nspname not ilike 'pg%'
  and nspname <> 'information_schema'
  and relkind = 'r'
order by size desc;

July 26, 2011 at 5:18 pm

Tracking down where symbols come from in gdb

There should be a better way of doing this – perhaps someone on the internet will tell me what it is.

I wanted to answer the following question today.

What dynamically loaded library does this function come from?

To do this I used the following approach:

* Start the process
* Attach to it in gdb
* Use the x command to find the memory address of the symbol associated with the function (e.g. x function_name)
* Look this memory address up in /proc/<pid>/maps
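The last step is easy to get wrong by eye, so here is a small sketch of doing the lookup programmatically. The sample maps text, its addresses and its library paths are made up for illustration; for a real process you would read the file under /proc for that pid instead.

```python
def parse_maps(text):
    """Parse the contents of a maps file into (start, end, path) tuples."""
    regions = []
    for line in text.splitlines():
        # Fields: address range, perms, offset, device, inode, [pathname]
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue
        start, end = (int(part, 16) for part in fields[0].split("-"))
        path = fields[5].strip() if len(fields) > 5 else ""
        regions.append((start, end, path))
    return regions


def mapping_for(address, regions):
    """Return the path of the mapping that contains address, or None."""
    for start, end, path in regions:
        if start <= address < end:
            return path
    return None


# Invented sample in the format of a maps file.
SAMPLE_MAPS = """\
00400000-0040b000 r-xp 00000000 08:01 131 /bin/cat
7f3c1a000000-7f3c1a1b5000 r-xp 00000000 08:01 262 /lib/libc-2.13.so
7fffb2c0d000-7fffb2c2e000 rw-p 00000000 00:00 0 [stack]
"""

regions = parse_maps(SAMPLE_MAPS)
library = mapping_for(0x7f3c1a0a1234, regions)
```

With the address that gdb reported in place of the invented one, `library` is the file that the function was loaded from.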

There is probably a slightly shorter way of doing this. The following looked hopeful, together with the interpreter command in gdb:

http://davis.lbl.gov/Manuals/GDB/gdb_24.html#SEC469

but didn’t seem to be present on my gdb (probably an old version).

July 25, 2011 at 12:14 am

Send an attachment from the command line

cat file | uuencode attach_name | mailx -s 'Test' address@domain

This doesn’t include a mail body, only the attachment, but this is good enough for some use cases.

July 22, 2011 at 2:22 pm

Reading mail using python

There may be a better way to go about this… however this works without requiring too much code.

* Use imapclient to fetch messages.
* Use fetch to get the full message using ‘BODY[]’
* Parse this with email.parser.Parser
* Use walk to pull out the message part (emails contain multiparts)

Victory.
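The steps above can be sketched roughly as follows. This assumes Python 3 and the third-party imapclient package; it uses email.message_from_bytes in place of email.parser.Parser because fetch returns bytes there. The function names, the folder default and the choice to keep only text/plain parts are my own assumptions.

```python
import email


def text_parts(raw_message):
    """Yield the decoded text/plain parts of a raw message (bytes)."""
    message = email.message_from_bytes(raw_message)
    # walk() visits the container and every nested part of a multipart mail.
    for part in message.walk():
        if part.get_content_type() != "text/plain":
            continue
        payload = part.get_payload(decode=True)
        charset = part.get_content_charset() or "utf-8"
        yield payload.decode(charset, errors="replace")


def fetch_bodies(host, username, password, folder="INBOX"):
    """Fetch every message in folder; map each uid to its text parts."""
    from imapclient import IMAPClient  # third-party: pip install imapclient

    client = IMAPClient(host, ssl=True)
    client.login(username, password)
    try:
        # Read-only, so fetching BODY[] does not mark messages as seen.
        client.select_folder(folder, readonly=True)
        uids = client.search(["ALL"])
        response = client.fetch(uids, ["BODY[]"])
        return {uid: list(text_parts(data[b"BODY[]"]))
                for uid, data in response.items()}
    finally:
        client.logout()
```

Calling `fetch_bodies` needs a real server and credentials; `text_parts` can be tried on any raw message on its own.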

July 20, 2011 at 3:36 pm

Source of the name backus-naur-form

As a learning aid, let’s explain to the internet some trivia that I feel obligated to know.

Backus-Naur Form was originally named after Backus, an employee at IBM who designed a language for describing the syntax of programming languages.

Naur took this language out of IBM and used it for different purposes, referring to it as Backus Normal Form.

Knuth decided that the use of the word normal was technically incorrect, so re-dubbed this Backus-Naur Form.

July 16, 2011 at 7:33 pm

Key exchange overview

Warning: This is me using the internet as an entity to explain things to. Any of this could be a lie.
Go read a text book.

=== Scenario ===

Alice and Bob are trying to communicate in a world where all their messages can be intercepted. What can they do?

=== The key exchange problem ===

Idea: If Alice and Bob have a symmetric encryption algorithm, all they need to do is share one secret. Is it possible for Alice and Bob to create a shared secret, which no one else will know, by exchanging functions of their private secrets?

=== Formalisation ===

The extremely optimistic formalisation looks like this.

A: has a secret a
B: has a secret b

A publishes f(a)
B publishes f(b)

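A uses f(b) to calculate g(f(b), a) privately.
B uses f(a) to calculate g(f(a), b) privately.

For this to work we need g(f(b), a) = g(f(a), b), and this shared value must be hard to compute from f(a) and f(b) alone.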

[ Here we are optimistic because:
A and B use the same functions. Only one message is sent from each party. The messages don’t depend upon one another. ]

=== Stupid approach ===

Have f(a) = ca, g(x, y) = xy. Hurrah, everything works by associativity and commutativity! The only problem is that c is common knowledge, so since we can do division moderately easily, we lose: an eavesdropper who sees f(a) = ca recovers a = f(a)/c.
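To make the failure concrete, here is the multiplicative scheme with small made-up numbers (c, a and b are arbitrary values chosen for illustration). The eavesdropper only uses values that were sent in the open.

```python
c = 7            # public constant, known to everyone
a, b = 123, 456  # private secrets of A and B


def f(secret):
    """What each party publishes."""
    return c * secret


def g(published, secret):
    """How each party combines the other's message with its own secret."""
    return published * secret


shared_a = g(f(b), a)  # computed privately by A
shared_b = g(f(a), b)  # computed privately by B

# The eavesdropper sees c, f(a) and f(b), and simply divides.
recovered_a = f(a) // c
eavesdropped = g(f(b), recovered_a)
```

Here `shared_a`, `shared_b` and `eavesdropped` all come out as the same number, so the "secret" is not secret at all.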

=== Fixing this ===

Exponentiation is a nice operation:

(a^b)^c = a^(bc) = (a^c)^b

Since taking logarithms modulo a suitable large prime p is believed to be hard, we can use this: have f(a) = c^a mod p and g(x, y) = x^y mod p, giving us key exchange.
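A toy version of the exponentiation scheme, in the shape of the formalisation above: f(a) = G^a mod P and g(x, y) = x^y mod P. The modulus and base here are arbitrary picks for illustration (P is the Mersenne prime 2^127 - 1); they are not vetted parameters and this is not something to use for real secrets.

```python
import secrets

P = 2**127 - 1  # a prime modulus, public (toy choice)
G = 3           # public base (arbitrary choice, assumed for this sketch)


def make_keypair():
    """Pick a private secret and compute the value to publish."""
    private = secrets.randbelow(P - 3) + 2  # uniform in [2, P - 2]
    public = pow(G, private, P)             # f(secret) = G^secret mod P
    return private, public


def shared_secret(their_public, my_private):
    """g(x, y) = x^y mod P, computed privately by each party."""
    return pow(their_public, my_private, P)


a_private, a_public = make_keypair()  # A publishes a_public
b_private, b_public = make_keypair()  # B publishes b_public

key_for_a = shared_secret(b_public, a_private)
key_for_b = shared_secret(a_public, b_private)
```

Both sides end up with G^(ab) mod P, while an eavesdropper who only sees the two published values would have to take a logarithm modulo P to get at either secret.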

July 15, 2011 at 7:46 pm
