Archive for July, 2011
Screen scraping using a headless selenium
Summary: I think that using selenium one can produce simple batch process scrapers more quickly than using other strategies. Using xvfb one can have this run in a headless mode – removing one of the largest irritations of selenium. This approach breaks down for a number of use cases however.
—
https://github.com/argandgahandapandpa/Headless-selenium-screen-scraping-example
This is an example of generating a screen scraper using the selenium IDE.
The idea is to carry out a prototypical navigation to all of the pages that you are interested in scraping. You can then export this to a python unit test.
This unittest can then be customized by hand to also perform the scraping that you want to do and add parameterization. Unfortunately, this customization process will almost always involve modification of this generated source code by hand since one wishes to intersperse scraping within navigation. This is a bit of a shame since it means that you can’t regenerate your code.
One then adds code to create a headless selenium using Xvfb (see http://www.alittlemadness.com/2008/03/05/running-selenium-headless) which can be used to perform the scraping.
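A minimal sketch of the headless step, assuming `Xvfb` is installed and on the PATH. The function names, the display number 99, the screen size and the fixed start-up sleep are all my own illustrative choices, not taken from the linked example.

```python
import os
import subprocess
import time


def xvfb_command(display_number):
    """Build the Xvfb command line for a given display number."""
    # -screen 0 WxHxDEPTH sets the size of the virtual framebuffer.
    return ["Xvfb", ":%d" % display_number, "-screen", "0", "1024x768x24"]


def headless_env(display_number, base_env=None):
    """Copy an environment and point DISPLAY at the virtual display."""
    env = dict(os.environ if base_env is None else base_env)
    env["DISPLAY"] = ":%d" % display_number
    return env


def run_headless(command, display_number=99, startup_wait=2.0):
    """Run command (a list of strings) against a private Xvfb display."""
    xvfb = subprocess.Popen(xvfb_command(display_number))
    try:
        # Crude assumption: a short sleep is enough for Xvfb to come up.
        time.sleep(startup_wait)
        return subprocess.call(command, env=headless_env(display_number))
    finally:
        # Always tear the virtual display down, even if the command fails.
        xvfb.terminate()
        xvfb.wait()
```

Something like `run_headless(["python", "my_scraper_test.py"])` (a hypothetical script name) would then run the exported unit test without a visible browser window.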
Caveats to this approach:
* Although the machine doing the scraping doesn’t need to be running X, it requires firefox, the X libraries and the full gtk library. This is quite a lot of disk space, quite a lot of memory and quite a lot of installation. This is fine for running scripts on your own server, but if you want to, say, distribute your scrapers over a number of EC2 servers this becomes less fun.
* This is slow. The start up is slow. More pages are fetched. If you hand roll a scraper you can often cut out a lot of the data (all the images, all the js files, a bunch of intermediate pages, any flash components). You can probably get a similar effect by using firefox profiles and switching off images and flash.
* This still doesn’t work too well with flash (though there is a flash extension)
* I’m still concerned about nasty bugs where the browser slips out of sync with the remote control, even though I haven’t seen this happen.
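The "switch off images" idea from the caveats can be sketched by writing a `user.js` file into a throwaway firefox profile directory. This assumes that `permissions.default.image` set to 2 is the firefox preference value that blocks image loading. The helper names are hypothetical, and flash is deliberately left out because I am not sure of the right preference for it.

```python
import os
import tempfile

# Assumption: 2 means "never load images" for this firefox preference.
PREFS = {"permissions.default.image": 2}


def format_pref(name, value):
    """Render one preference as a user.js line."""
    if isinstance(value, bool):
        literal = "true" if value else "false"
    elif isinstance(value, int):
        literal = str(value)
    else:
        literal = '"%s"' % value
    return 'user_pref("%s", %s);' % (name, literal)


def write_user_js(profile_dir, prefs=PREFS):
    """Write the preferences to user.js inside profile_dir."""
    path = os.path.join(profile_dir, "user.js")
    with open(path, "w") as handle:
        for name in sorted(prefs):
            handle.write(format_pref(name, prefs[name]) + "\n")
    return path


def make_profile(prefs=PREFS):
    """Create a temporary profile directory containing only user.js."""
    profile_dir = tempfile.mkdtemp(prefix="scraper-profile-")
    write_user_js(profile_dir, prefs)
    return profile_dir
```

The resulting directory can be handed to firefox with `firefox -profile <dir>`. How to pass it through selenium depends on the selenium version in use.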
Advantages of this approach:
* No boring re-writing of http queries which you then get slightly wrong, no having to match headers. Of course it would probably
* No having to pick apart javascript to work out what they are doing.
I feel that this approach works fairly well for hacking together scrapers quickly for batch processes (scrapers rather than robots) where they just need to run rather than run well.
——
Library for this.
I’ve created a small library for spawning headless seleniums cleanly. It deals with things like:
* Avoiding port collisions
* Avoiding display collisions
* Ensuring that resources are cleaned up
* Logging
* Waiting for processes to initialize.
This is available here:
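The following is not that library's code, just a rough sketch of how the collision-avoidance and waiting items in the list above might be approached. It assumes that X servers leave lock files named `/tmp/.X<n>-lock`, and all function names are made up for illustration.

```python
import os
import socket
import time


def find_free_port():
    """Ask the kernel for an unused TCP port by binding to port 0."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]
    finally:
        sock.close()


def find_free_display(start=10, stop=200, lock_dir="/tmp"):
    """Return the first display number that has no X lock file."""
    for number in range(start, stop):
        lock_file = os.path.join(lock_dir, ".X%d-lock" % number)
        if not os.path.exists(lock_file):
            return number
    raise RuntimeError("no free display number found")


def wait_for_port(port, host="127.0.0.1", timeout=30.0):
    """Poll until something accepts connections on host:port, or give up."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            socket.create_connection((host, port), timeout=1.0).close()
            return True
        except OSError:
            time.sleep(0.2)
    return False
```

Both finders are racy: another process can grab the port or display between the check and the moment you use it, so a robust version would retry when start-up fails.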
Postgres listing all tables ordered by approximate size
This was run against postgres 9. Probably works with different versions.
select nspname, relname, pg_relation_size(nspname || '.' || relname) as size
from pg_class
join pg_namespace on pg_class.relnamespace = pg_namespace.oid
where nspname not ilike 'pg%'
  and nspname <> 'information_schema'
  and relkind = 'r'
order by size desc;
Tracking down where symbols come from in gdb
There should be a better way of doing this – perhaps someone on the internet will tell me what it is.
I wanted to answer the following question today.
What dynamically loaded library does this function come from?
To do this I used the following approach:
* Start the process
* Attach to it in gdb
* Use the x command to find the memory address of the symbol associated with the function (e.g. x function_name)
* Look this memory address up in /proc/<pid>/maps
There is probably a slightly shorter way of doing this. The following looked hopeful, together with the interpreter command in gdb:
http://davis.lbl.gov/Manuals/GDB/gdb_24.html#SEC469
but didn’t seem to be present on my gdb (probably an old version).
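The last step of the list above (looking the address up in the maps file) is easy to script. Below is a small sketch, assuming the usual Linux maps layout of `start-end perms offset dev inode path`. The sample text and the function names are invented for illustration.

```python
# Hypothetical excerpt of a /proc/<pid>/maps file.
SAMPLE_MAPS = """\
00400000-0040c000 r-xp 00000000 08:01 1234       /bin/cat
7f3c5a000000-7f3c5a1b0000 r-xp 00000000 08:01 5678       /lib/libc-2.13.so
7f3c5a1b0000-7f3c5a1b4000 rw-p 00000000 00:00 0
7fff5e000000-7fff5e021000 rw-p 00000000 00:00 0          [stack]
"""


def parse_maps(maps_text):
    """Yield (start, end, path) for each mapping line."""
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # not a mapping line
        start, end = (int(part, 16) for part in fields[0].split("-"))
        # Anonymous mappings have no path column.
        path = fields[5].strip() if len(fields) == 6 else ""
        yield start, end, path


def mapping_for_address(maps_text, address):
    """Return the path of the mapping containing address, or None."""
    for start, end, path in parse_maps(maps_text):
        if start <= address < end:
            return path
    return None


def library_for_address(pid, address):
    """Same lookup against a live process (Linux only)."""
    with open("/proc/%d/maps" % pid) as handle:
        return mapping_for_address(handle.read(), address)
```

Depending on your gdb version, `info symbol <address>` and `info sharedlibrary` may also answer the question without leaving gdb.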
Send an attachment from the command line
cat file | uuencode attach_name | mailx -s 'Test' address@domain
This doesn’t include a mail body, only the attachment, but this is good enough for some use cases.
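If you do want a body as well, one alternative is to build the message in Python. Note that this swaps uuencode for a MIME attachment, so it is a different technique from the one-liner above. It is a sketch that assumes an SMTP server is listening on localhost, and the function names are my own.

```python
import smtplib
from email.message import EmailMessage


def build_message(sender, recipient, subject, body, attach_name, attach_bytes):
    """Build a mail with a text body and a single binary attachment."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = subject
    message.set_content(body)
    # Adding an attachment turns the message into multipart/mixed.
    message.add_attachment(
        attach_bytes,
        maintype="application",
        subtype="octet-stream",
        filename=attach_name,
    )
    return message


def send(message, host="localhost"):
    """Hand the message to an SMTP server (assumed to be on host, port 25)."""
    with smtplib.SMTP(host) as server:
        server.send_message(message)
```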
Reading mail using python
There may be a better way to go about this… however this works without requiring too much code.
* Use imapclient to fetch messages.
* Use fetch to get the full message using ‘BODY[]’
* Parse this with email.parser.Parser
* Use walk to pull out the message part (emails contain multiparts)
Victory.
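Here is roughly what those steps look like, as a sketch rather than tested production code. I use `email.message_from_bytes` instead of `email.parser.Parser` because the fetched body arrives as bytes under Python 3. The `fetch_bodies` helper, its parameters, and the assumption that the response key is `b"BODY[]"` are mine, so check them against the imapclient version you have installed.

```python
import email


def plain_text_body(raw_message):
    """Parse raw message bytes and return the first text/plain part."""
    message = email.message_from_bytes(raw_message)
    # walk() visits the message and every nested part of a multipart mail.
    for part in message.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            return payload.decode(charset, errors="replace")
    return None


def fetch_bodies(host, username, password, folder="INBOX"):
    """Fetch every message in folder and return their plain-text bodies."""
    from imapclient import IMAPClient  # third party: pip install imapclient

    client = IMAPClient(host, ssl=True)
    try:
        client.login(username, password)
        client.select_folder(folder, readonly=True)
        message_ids = client.search()
        fetched = client.fetch(message_ids, ["BODY[]"])
        # Assumption: each response is a dict keyed by b"BODY[]".
        return [plain_text_body(data[b"BODY[]"]) for data in fetched.values()]
    finally:
        client.logout()
```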
Source of the name backus-naur-form
Let’s explain some trivia that I feel obligated to know to the internet, as a learning aid.
Backus-Naur Form was originally named after Backus, an employee at IBM who designed a language for describing grammars.
Naur took this language out of IBM and used it for different purposes – referring to it as Backus normal form.
Knuth decided that the use of the word normal was technically incorrect, so re-dubbed this Backus-Naur form.
Key exchange overview
Warning: This is me using the internet as an entity to explain things to. Any of this could be a lie.
Go read a text book.
=== Scenario ===
Alice and Bob are trying to communicate in a world where all their messages can be intercepted. What can they do?
=== The key exchange problem ===
Idea: If Alice and Bob have a symmetric encryption algorithm, all they need to do is share one secret. Is it possible for Alice and Bob to create a shared secret, which no one else will know, by exchanging functions of private secrets?
=== Formalisation ===
The extremely optimistic formalisation looks like this.
A: has a secret a
B: has a secret b
A publishes f(a)
B publishes f(b)
A uses f(b) to calculate g(f(b), a) privately.
B uses f(a) to calculate g(f(a), b) privately.
[ Here we are optimistic because:
A and B use the same functions. Only one message is sent from each party. The messages don’t depend upon one another]
=== Stupid approach ===
Have f(a) = ca, g(x, y) = xy. Hurrah, everything works by associativity and commutativity! The only problem is that c is common knowledge, so since an eavesdropper can do division moderately easily (a = f(a) / c), we lose.
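To make the failure concrete, here is a toy run of the multiplicative scheme with small made-up numbers. The values of `c`, `a` and `b` are arbitrary illustrations.

```python
c = 7919  # public constant, known to everyone including the eavesdropper


def f(secret):
    """What each party publishes."""
    return c * secret


def g(published, secret):
    """How each party combines the other's message with its own secret."""
    return published * secret


a, b = 1234567, 7654321  # Alice's and Bob's private secrets

shared_alice = g(f(b), a)  # Alice computes (c * b) * a
shared_bob = g(f(a), b)    # Bob computes (c * a) * b

# The eavesdropper sees c, f(a) and f(b), and simply divides.
recovered_a = f(a) // c
eavesdropped = g(f(b), recovered_a)

print(shared_alice == shared_bob == eavesdropped)
```

Alice and Bob do agree on a value, but so does anyone who was listening.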
=== Fixing this ===
Exponentiation is a nice operation
(a^b)^c = a^(bc) = (a^c)^b
Since taking logarithms modulo a large prime p is believed to be hard, we can use this: take f(a) = c^a mod p and g(x, y) = x^y mod p for a public base c, giving us key exchange.
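A toy sketch of the exponentiation idea (this is Diffie-Hellman key exchange). In the notation of the formalisation, `publish` plays the role of f and `shared_secret` the role of g(x, y). The modulus `P` and base `BASE` are illustrative choices of mine and not vetted parameters, so treat this as a demonstration of the algebra only.

```python
import secrets

P = 2**127 - 1  # a Mersenne prime; illustrative only, too small for real use
BASE = 3        # public base; not checked to be a good choice for this group


def make_private(p=P):
    """Pick a random private exponent in [2, p - 2]."""
    return secrets.randbelow(p - 3) + 2


def publish(private, p=P, base=BASE):
    """f(a) = base^a mod p, safe to send in the clear."""
    return pow(base, private, p)


def shared_secret(their_public, my_private, p=P):
    """g(x, y) = x^y mod p, computed privately by each side."""
    return pow(their_public, my_private, p)


a, b = make_private(), make_private()
alice_view = shared_secret(publish(b), a)  # (base^b)^a mod p
bob_view = shared_secret(publish(a), b)    # (base^a)^b mod p
print(alice_view == bob_view)

# Small worked example: p = 23, base = 5, a = 4, b = 3.
small = shared_secret(publish(3, 23, 5), 4, 23)
```

In the small example the published values are 5^4 mod 23 = 4 and 5^3 mod 23 = 10, and both sides arrive at 18.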
Becoming a certificate authority (CA) in one file
I found this blog post very useful when trying to set up a CA : Becoming a certificate authority.
However extended howtos with cut-and-paste code samples, though useful, kind of suck for some use cases. I’ve converted this into a single file bash script which you should be able to download and run to create a sample CA, and sign a sample certificate.
Bear in mind that you probably want to tweak a few things, but this should give you something that works
#!/bin/bash
# Make a key
rm -rf cert_dir
mkdir cert_dir

# First we need keys to prove that we have signed things
openssl genrsa 1025 > cert_dir/private.pem # private key
openssl rsa -in cert_dir/private.pem -pubout -out cert_dir/public.pem

# Then we need a certificate to tell other people that we can
# issue certificates

# Write down what we want to appear in this certificate
cat > cert_dir/ca_config <<EOF
[ req ]
#default_bits = 1024
#default_keyfile = privkey.pem
distinguished_name = req_distinguished_name
#attributes = req_attributes
x509_extensions = v3_ca
prompt = no

[ req_distinguished_name ]
countryName = UK
localityName = London
organizationalUnitName = Certs
commonName = www.certificates4all.com
#emailAddress = test@test

[ v3_ca ]
subjectKeyIdentifier=hash
authorityKeyIdentifier=keyid:always,issuer:always
basicConstraints = CA:true

[ ca ]
default_ca = CA_Default

[ CA_Default ]
email_in_dn = no
dir = .
new_certs_dir = ./cert_dir
database = ./cert_dir/issue
certificate = ./cert_dir/ca_cert
serial = ./cert_dir/serial
private_key = ./cert_dir/private.pem
name_opt = ca_default
cert_opt = ca_default
default_crl_days = 30
default_days = 365
default_md = sha1
preserve = no
policy = policy_match

[ policy_match ]
countryName = optional
stateOrProvinceName = optional
organizationName = optional
organizationalUnitName = optional
commonName = supplied
emailAddress = optional
EOF

# Turn this configuration into a certificate
echo creating ca cert
openssl req -config cert_dir/ca_config -key cert_dir/private.pem -new -x509 -extensions v3_ca > cert_dir/ca_cert

# Some configuration files to remember what we have signed
echo 0001 > cert_dir/serial
touch cert_dir/issue # database
touch cert_dir/issue.attr

# We now are a working certificate authority - yay!

# Now to do some sample signing...
echo signing sample cert

# Reuse our CA key as our server key -
# in real life this would be different

# A site creates request for something to be signed, they
# must sign this so that only they can claim to be this person

# Writing down details of certification request
cat > cert_dir/cert_config << EOF
[ req ]
#default_bits = 1024
#default_keyfile = privkey.pem
distinguished_name = req_distinguished_name
#attributes = req_attributes
prompt = no

[ req_distinguished_name ]
countryName = MN
localityName = GoogleVile
organizationalUnitName = google
commonName = *.google.com
#emailAddress = test@test
EOF

# Turn this configuration into a binary request
openssl req -new -config cert_dir/cert_config -key cert_dir/private.pem > cert_dir/sample_site.req

# We then sign this certificate to say that we believe they are who they say they are
openssl ca -batch -config cert_dir/ca_config -in cert_dir/sample_site.req -out cert_dir/sample_site.cert
Minimal working implementation of a socks 5 proxy
SOCKS 5 proxy. Written in python.
Works fine as a firefox proxy.
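The proxy's own source isn't reproduced here, so what follows is not that implementation. It is only a sketch of the message parsing any SOCKS 5 proxy has to do, following the layout in RFC 1928 with no authentication. All function and constant names are my own.

```python
import socket
import struct

SOCKS_VERSION = 5
NO_AUTH = 0x00
CMD_CONNECT = 0x01
ATYP_IPV4, ATYP_DOMAIN, ATYP_IPV6 = 0x01, 0x03, 0x04


def parse_greeting(data):
    """Client greeting: VER, NMETHODS, METHODS... Returns offered methods."""
    version, method_count = data[0], data[1]
    if version != SOCKS_VERSION:
        raise ValueError("not a SOCKS 5 greeting")
    return list(data[2:2 + method_count])


def parse_request(data):
    """Request: VER, CMD, RSV, ATYP, DST.ADDR, DST.PORT. Returns (cmd, host, port)."""
    version, command, _reserved, address_type = data[:4]
    if version != SOCKS_VERSION:
        raise ValueError("not a SOCKS 5 request")
    if address_type == ATYP_IPV4:
        host, rest = socket.inet_ntoa(data[4:8]), data[8:]
    elif address_type == ATYP_DOMAIN:
        length = data[4]  # domain names are length-prefixed
        host, rest = data[5:5 + length].decode("ascii"), data[5 + length:]
    elif address_type == ATYP_IPV6:
        host, rest = socket.inet_ntop(socket.AF_INET6, data[4:20]), data[20:]
    else:
        raise ValueError("unknown address type")
    (port,) = struct.unpack("!H", rest[:2])  # port is big-endian
    return command, host, port


def method_reply(method=NO_AUTH):
    """Server's answer to the greeting: VER, chosen METHOD."""
    return bytes([SOCKS_VERSION, method])


def success_reply():
    """VER, REP=0 (succeeded), RSV, ATYP=IPv4, BND.ADDR=0.0.0.0, BND.PORT=0."""
    return struct.pack("!BBBB4sH", SOCKS_VERSION, 0, 0, ATYP_IPV4, bytes(4), 0)
```

After sending `success_reply()`, a proxy connects to the parsed host and port and shuttles bytes between the two sockets until one side closes.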