Archive for July, 2011

Screen scraping using a headless selenium

Summary: I think that using selenium one can produce simple batch process scrapers more quickly than using other strategies. Using xvfb one can have this run in a headless mode – removing one of the largest irritations of selenium. This approach breaks down for a number of use cases however.

https://github.com/argandgahandapandpa/Headless-selenium-screen-scraping-example

This is an example of generating a screen scraper using the selenium IDE.

The idea is to carry out a prototypical navigation to all of the pages that you are interested in scraping. You can then export this to a python unit test.

This unittest can then be customized by hand to also perform the scraping that you want to do and add parameterization. Unfortunately, this customization process will almost always involve modification of this generated source code by hand since one wishes to intersperse scraping within navigation. This is a bit of a shame since it means that you can’t regenerate your code.

One then adds code to create a headless selenium using Xvfb (see http://www.alittlemadness.com/2008/03/05/running-selenium-headless) which can be used to perform the scraping.
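A minimal sketch of the Xvfb step, assuming Xvfb is installed and on the PATH. The function names, the default display number 99 and the fixed one-second start-up wait are my own choices for illustration, not taken from the linked example.

```python
import contextlib
import os
import subprocess
import time


def xvfb_command(display_number, screen="1024x768x24"):
    """Build the argument list for starting Xvfb on the given display."""
    return ["Xvfb", ":%d" % display_number, "-screen", "0", screen]


def env_for_display(display_number, base_env=None):
    """Return a copy of the environment with DISPLAY set to the virtual display."""
    env = dict(os.environ if base_env is None else base_env)
    env["DISPLAY"] = ":%d" % display_number
    return env


@contextlib.contextmanager
def headless_display(display_number=99, startup_wait=1.0):
    """Run Xvfb for the duration of the with-block, then shut it down."""
    xvfb = subprocess.Popen(xvfb_command(display_number))
    try:
        # Crude fixed wait (an assumption); polling for the X socket would be more robust.
        time.sleep(startup_wait)
        yield env_for_display(display_number)
    finally:
        xvfb.terminate()
        xvfb.wait()
```

Anything started inside the with-block with the yielded environment (the selenium server, firefox, or the scraper itself) should then draw to the virtual display rather than a real one.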

Caveats to this approach:
* Although the machine doing the scraping doesn’t need to be running X, it requires firefox, the X libraries and the full gtk library. This is quite a lot of disk space, quite a lot of memory and quite a lot of installation. This is fine for running scripts on your own server, but if you, say, want to distribute your scrapers over a number of EC2 servers this becomes less fun.

* This is slow. The start up is slow. More pages are fetched. If you hand roll a scraper you can often cut out a lot of the data (all the images, all the js files, a bunch of intermediate pages, any flash components). You can probably get a similar effect by using firefox profiles and switching off images and flash.

* This still doesn’t work too well with flash (though there is a flash extension).

* I’m still concerned about nasty bugs where the browser slips out of sync with the remote control, even though I haven’t seen this happen.

Advantages of this approach:

* No boring re-writing of http queries which you then get slightly wrong, no having to match headers. Of course it would probably

* No having to pick apart javascript to work out what it is doing.

I feel that this approach works fairly well for hacking together scrapers quickly for batch processes (scrapers rather than robots) where they just need to run rather than run well.

——
Library for this.

I’ve created a small library for spawning headless seleniums cleanly. It deals with things like:

* Avoiding port collisions
* Avoiding display collisions
* Ensuring that resources are cleaned up
* Logging
* Waiting for processes to initialize.

This is available here:

https://github.com/argandgahandapandpa/selscrape
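To give a flavour of the first two items, here is a rough sketch of one way to avoid port and display collisions. This is not the code from selscrape; the function names and the display range are assumptions of mine. It relies on X servers leaving a lock file named .X<n>-lock in /tmp. Both checks are racy, since something else can grab the port or display between the check and its use.

```python
import os
import socket


def find_free_port():
    """Ask the OS for an unused TCP port by binding to port 0."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]
    finally:
        sock.close()


def find_free_display(lock_dir="/tmp", start=10, stop=100):
    """Return the first display number with no X lock file in lock_dir."""
    for number in range(start, stop):
        lock_file = os.path.join(lock_dir, ".X%d-lock" % number)
        if not os.path.exists(lock_file):
            return number
    raise RuntimeError("no free display number between %d and %d" % (start, stop))
```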

July 31, 2011 at 12:18 pm

Postgres listing all tables ordered by approximate size

This was run against postgres 9. Probably works with different versions.

select nspname, relname, pg_relation_size(pg_class.oid) as size
from pg_class
join pg_namespace on pg_class.relnamespace = pg_namespace.oid
where nspname not ilike 'pg%'
  and nspname <> 'information_schema'
  and relkind = 'r'
order by size desc;

July 26, 2011 at 5:18 pm

Tracking down where symbols come from in gdb

There should be a better way of doing this – perhaps someone on the internet will tell me what it is.

I wanted to answer the following question today.

What dynamically loaded library does this function come from?

To do this I used the following approach:

* Start the process
* Attach to it in gdb
* Use the x command to find the memory address of the symbol associated with the function (e.g. x function_name)
* Look this memory address up in /proc/pid/maps
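The last step can be scripted. Below is a small sketch that takes the text of a maps file and an address and reports which mapping contains it. The function names are mine, and SAMPLE_MAPS is made-up data in the usual maps layout (address range, permissions, offset, device, inode, path), not output from a real process.

```python
def parse_maps(maps_text):
    """Parse the text of a /proc/<pid>/maps file into (start, end, path) tuples."""
    mappings = []
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        start, end = (int(part, 16) for part in fields[0].split("-"))
        # Anonymous mappings have no path column.
        path = fields[5].strip() if len(fields) > 5 else ""
        mappings.append((start, end, path))
    return mappings


def mapping_for_address(maps_text, address):
    """Return the path of the mapping containing address, or None."""
    for start, end, path in parse_maps(maps_text):
        if start <= address < end:
            return path
    return None


# Made-up sample data for illustration only.
SAMPLE_MAPS = """\
00400000-0040b000 r-xp 00000000 08:01 131  /bin/cat
7f3c5a000000-7f3c5a1b5000 r-xp 00000000 08:01 262  /lib/libc-2.13.so
7fff2c3ff000-7fff2c400000 r-xp 00000000 00:00 0  [vdso]
"""
```

For a live process one would pass in open("/proc/%d/maps" % pid).read() together with the address that gdb printed.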

There is probably a slightly shorter way of doing this. The following looked hopeful, together with the interpreter command in gdb:

http://davis.lbl.gov/Manuals/GDB/gdb_24.html#SEC469

but it didn’t seem to be present on my gdb (probably an old version).

July 25, 2011 at 12:14 am

Send an attachment from the command line

cat file | uuencode attach_name | mailx -s 'Test' address@domain

This doesn’t include a mail body, only the attachment, but this is good enough for some use cases.
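If a body is needed as well, one alternative is to build the message with Python's standard email package instead of uuencode. This is a sketch of a different technique from the one-liner above, not a replacement for it. The function names are mine, and send_message assumes an SMTP server listening on localhost.

```python
from email.message import EmailMessage


def build_message(sender, recipient, subject, body, attachment_name, attachment_bytes):
    """Build a message with a plain-text body and one binary attachment."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = subject
    message.set_content(body)
    message.add_attachment(
        attachment_bytes,
        maintype="application",
        subtype="octet-stream",
        filename=attachment_name,
    )
    return message


def send_message(message, host="localhost"):
    """Hand the message to an SMTP server (assumed to be running on host)."""
    import smtplib

    with smtplib.SMTP(host) as smtp:
        smtp.send_message(message)
```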

July 22, 2011 at 2:22 pm

Reading mail using python

There may be a better way to go about this… however this works without requiring too much code.

* Use imapclient to fetch messages.
* Use fetch to get the full message using 'BODY[]'
* Parse this with email.parser.Parser
* Use walk to pull out the message part (emails can contain multiple parts)

Victory.
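A sketch of the steps above. Under Python 3 fetch hands back bytes, so this uses BytesParser, the bytes counterpart of email.parser.Parser. The function names are mine. fetch_raw_messages needs the third-party imapclient package and follows its API as I understand it, so treat that half as an assumption; text_parts only needs the standard library.

```python
from email.parser import BytesParser


def text_parts(raw_message):
    """Return the decoded text/plain parts of a raw message given as bytes."""
    message = BytesParser().parsebytes(raw_message)
    parts = []
    for part in message.walk():
        # walk() also yields the multipart containers; only keep plain text leaves.
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            parts.append(payload.decode(charset, errors="replace"))
    return parts


def fetch_raw_messages(host, username, password, folder="INBOX"):
    """Fetch every message in folder as raw bytes using imapclient."""
    from imapclient import IMAPClient  # imported here so text_parts works without it

    client = IMAPClient(host, ssl=True)
    try:
        client.login(username, password)
        client.select_folder(folder, readonly=True)
        message_ids = client.search(["ALL"])
        response = client.fetch(message_ids, ["BODY[]"])
        return [data[b"BODY[]"] for data in response.values()]
    finally:
        client.logout()
```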

July 20, 2011 at 3:36 pm

Source of the name Backus-Naur Form

Let’s explain some trivia that I feel obligated to know to the internet, as a learning aid.

Backus-Naur Form was originally named after Backus, an employee at IBM who designed a notation for describing the syntax of programming languages.

Naur took this notation out of IBM and used it for different purposes, referring to it as Backus normal form.

Knuth decided that the use of the word normal was technically incorrect, so re-dubbed this Backus-Naur form.

July 16, 2011 at 7:33 pm

Key exchange overview

Warning: This is me using the internet as an entity to explain things to. Any of this could be a lie.
Go read a text book.

=== Scenario ===

Alice and Bob are trying to communicate in a world where all their messages can be intercepted. What can they do?

=== The key exchange problem ===

Idea: If Alice and Bob have a symmetric encryption algorithm, all they need to do is share one secret. Is it possible for Alice and Bob to create a shared secret, which no one else will know, by exchanging functions of private secrets?

=== Formalisation ===

The extremely optimistic formalisation looks like this.

A: has a secret a
B: has a secret b

A publishes f(a)
B publishes f(b)

A uses f(b) to calculate g(f(b), a) privately.
B uses f(a) to calculate g(f(a), b) privately.

For this to give them a shared secret we need g(f(b), a) = g(f(a), b).

[ Here we are optimistic because:
A and B use the same functions. Only one message is sent from each party. The messages don’t depend upon one another.]

=== Stupid approach ===

Have f(a) = ca, g(x, y) = xy. Hurrah, everything works by associativity and commutativity! The only problem is that c is common knowledge, so since we can do division moderately easily we lose.
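To make the failure concrete, here is a tiny sketch with made-up numbers. The values of c, a and b and the eavesdropper called Eve are all just for illustration.

```python
c = 7          # public constant, known to everyone
a, b = 12, 30  # Alice's and Bob's private secrets


def f(secret):
    """The published value: f(a) = ca."""
    return c * secret


def g(published, secret):
    """The private combination step: g(x, y) = xy."""
    return published * secret


alice_key = g(f(b), a)
bob_key = g(f(a), b)

# Eve sees only c, f(a) and f(b), but division undoes f and recovers a.
eve_a = f(a) // c
eve_key = g(f(b), eve_a)
```

All three keys come out equal, so the eavesdropper learns the shared secret.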

=== Fixing this ===

Exponentiation is a nice operation

(a^b)^c = a^(bc) = (a^c)^b

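Since taking logarithms modulo a suitably chosen large prime is hard (the discrete logarithm problem), we can use f(a) = c^a and g(x, y) = x^y, both computed modulo that prime, giving us key exchange.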
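A toy sketch of the resulting exchange. The modulus P (the Mersenne prime 2^127 - 1) and the base G = 3 are picked purely for illustration and have not been vetted, so this is far too weak for real use. The function names are mine.

```python
import secrets

P = 2**127 - 1  # a prime modulus, toy-sized (assumption for illustration)
G = 3           # public base (assumption, not a vetted generator)


def make_secret():
    """Pick a private exponent in the range [2, P - 2]."""
    return secrets.randbelow(P - 3) + 2


def publish(secret):
    """The value sent over the wire: G^secret mod P."""
    return pow(G, secret, P)


def shared_key(their_public, my_secret):
    """Combine the other party's published value with our own secret."""
    return pow(their_public, my_secret, P)
```

If Alice and Bob each call make_secret() and exchange publish(secret), then shared_key(bob_public, alice_secret) and shared_key(alice_public, bob_secret) agree, because (G^b)^a = (G^a)^b modulo P.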

July 15, 2011 at 7:46 pm

Becoming a certificate authority (CA) in one file

I found this blog post very useful when trying to set up a CA : Becoming a certificate authority.

However extended howtos with cut-and-paste code samples, though useful, kind of suck for some use cases. I’ve converted this into a single file bash script which you should be able to download and run to create a sample CA, and sign a sample certificate.

Bear in mind that you probably want to tweak a few things, but this should give you something that works.

#!/bin/bash
# Start from an empty output directory
rm -rf cert_dir
mkdir cert_dir

# First we need keys to prove that we have signed things
openssl genrsa 2048 > cert_dir/private.pem # private key
openssl rsa -in cert_dir/private.pem -pubout -out cert_dir/public.pem

# Then we need a certificate to tell other people that we can 
# issue certificates

#    Write down what we want to appear in this certificate

cat > cert_dir/ca_config <<EOF
[ req ]
#default_bits           = 1024
#default_keyfile        = privkey.pem
distinguished_name     = req_distinguished_name
#attributes             = req_attributes
x509_extensions        = v3_ca
prompt = no

[ req_distinguished_name ]
countryName                    = GB
localityName                   = London 
organizationalUnitName         = Certs 
commonName                     = www.certificates4all.com 
#emailAddress                   = test@test 

[ v3_ca ]
subjectKeyIdentifier=hash
authorityKeyIdentifier=keyid:always,issuer:always
basicConstraints = CA:true

[ ca ]
default_ca = CA_Default


[ CA_Default ]
email_in_dn             = no
dir                     = .
new_certs_dir           = ./cert_dir
database                = ./cert_dir/issue
certificate             = ./cert_dir/ca_cert
serial                  = ./cert_dir/serial
private_key             = ./cert_dir/private.pem
name_opt                = ca_default
cert_opt                = ca_default
default_crl_days        = 30
default_days            = 365
default_md              = sha1
preserve                = no
policy                  = policy_match

[ policy_match ]
countryName             = optional
stateOrProvinceName     = optional
organizationName        = optional
organizationalUnitName  = optional
commonName              = supplied
emailAddress            = optional
EOF

#     Turn this configuration into a certificate
echo creating ca cert
openssl req -config cert_dir/ca_config -key cert_dir/private.pem -new -x509 -extensions v3_ca > cert_dir/ca_cert 

# Some configuration files to remember what we have signed
echo 0001 > cert_dir/serial
touch cert_dir/issue # database
touch cert_dir/issue.attr


# We now are a working certificate authority - yay!

# Now to do some sample signing...

echo signing sample cert

# Reuse our CA key as our server key -
# in real life this would be different

# A site creates request for something to be signed, they
# must sign this so that only they can claim to be this person

#    Writing down details of certification request
cat > cert_dir/cert_config << EOF
[ req ]
#default_bits           = 1024
#default_keyfile        = privkey.pem
distinguished_name     = req_distinguished_name
#attributes             = req_attributes
prompt = no

[ req_distinguished_name ]
countryName                    = MN 
localityName                   = GoogleVile 
organizationalUnitName         = google 
commonName                     = *.google.com 
#emailAddress                   = test@test 
EOF

#    Turn this configuration into a binary request
openssl req -new -config cert_dir/cert_config -key cert_dir/private.pem > cert_dir/sample_site.req

# We then sign this certificate to say that we believe they are who they say they are
openssl ca -batch -config cert_dir/ca_config -in cert_dir/sample_site.req -out cert_dir/sample_site.cert

July 15, 2011 at 2:06 pm

Minimal working implementation of a socks 5 proxy

SOCKS 5 proxy. Written in python.

Works fine as a firefox proxy.

https://github.com/argandgahandapandpa/minimal_python_socks

July 7, 2011 at 12:46 am

