Archive for May, 2011

R cheat sheet

People who write cheat sheets don’t seem to understand the value of plain text. If you like pdfs you might prefer this:

Warning! These are notes to myself, made when not trying particularly hard to understand the syntax of R; they may not be optimal.

apropos, help, etc

Add a column to a data frame taking a constant value (think cross join, since the two frames share no columns):
merge(frame, data.frame(b=c(1)))

Stack two data frames
rbind(frame1, frame2)


Run R in batch mode
R --no-save < fit.r

Read csv or similar

Linear model fitting:
lm(a ~ b + c)
lm(column1 ~ column2, frame)

names(frame) – List column names



R versus python.

R is higher level (more domain specific) but an uglier programming language. Its kitchen sink is more immediately accessible but less well packaged.
Classic DSL versus general-purpose language. Key point: R has a vast kitchen sink.

— Sequences

1:10 = 1,2,3,…10
seq(0, 1, 0.2) = [0, 0.2, … 1.0]

— Postgres —


conn = dbConnect(PostgreSQL(), user="user", dbname="db", host="localhost", port="port")
results <- dbGetQuery(conn, query)
# beautiful results

— Functional programming


— Plotting

R plots into a device. A device can be a file or a window.
dev.new() – Create a new window
dev.list() – List windows
dev.set(number) – Set the window
plot() – Plot some stuff (clear previous plot)
points – add points to an existing plot.

The plot function in R is a magic do-what-I-mean function. It will:
* Plot all 2-d projections if given a data frame
* Plot objects that you build (like clusters)

— Package management


— Old values

.Last.value – last value (like _ in python)

— Matrix

t – Transpose

May 30, 2011 at 8:05 pm

Rule-based automation of gnucash transaction source detection

I’ve decided to start using gnucash to track my accounts (as an alternative to nothing).

An issue that arises when using any piece of accounting software is the input of data. This is somewhat simplified by companies providing you with electronic records, but one still has the problem of assigning each transaction to a category.

Gnucash has some inbuilt approaches for this. For ofx files it has a bayesian filter that can guess which account a transaction is likely to involve (e.g. income, interest, auto expenses) based on your previous assignments. However, I’m afraid that such an approach will be wrong quite a lot. Therefore I’ve hacked up an alternative approach that allows one to define a set of rules and match based on them.

Here is some code:

# This program is distributed under a GPL3 license.
# The author should be "Arg and gah and ap and pa"

"""Open a gnucash file, find transactions which come from or go to
an unknown source, and apply a set of user-defined rules to guess it.

 * This approach is intrinsically prey to tweaks in gnucash's xml format.
   This is probably unlikely to happen, and the program is simple enough that
   it would be easy to repair. A better approach would be to implement this
   in gnucash, but this has too large a fixed cost for me to consider at the
   moment.
 * There are already inbuilt features of gnucash to do this but
   these don't do quite what I want:
    * A bayesian mapper. I don't feel like using a bayesian mapper here
      because it is liable to mismap, requiring me to check results. I'm
      sure for some people this works fine.
    * Qif support for account mapping:
       My understanding is that in qif you map each source once and
       it then remembers these sources. However my bank won't send me qif.
"""

def match_account(transaction):
    """Matching function to be tweaked. 

    This takes the details of an unmapped transaction and decides 
    which account the transaction came from or should go to.
    Returns None when no rule matches."""

    if transaction.description.startswith('INT EARNED'):
        return 'Interest Income'
    elif transaction.description.startswith('SAVE THE CHILDREN'):
        return 'Salary'
    else:
        return None

from contextlib import closing
import datetime
import logging
import gzip
import shutil
import os
import lxml.etree as E
from lxml.etree import tostring

from logging import debug, info

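# Assumption: the original values of GNUCASH_FILE and run_time were lost.
# Set GNUCASH_FILE to the path of your gnucash file before this point.
run_time = datetime.datetime.now().strftime('%Y%m%d-%H%M%S')

logging.basicConfig(
    filename='%s-unbalanced-remap.log%s' % (GNUCASH_FILE, run_time),
    level=logging.DEBUG)

# The namespace URIs are missing here; copy them from the xmlns
# declarations on the root element of your gnucash file.
NAMESPACES = {
     'gnc': "",
     'act': "",
     'slot': "",
     'split': "",
     'trn': ""
}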

def xpath(el, xpath):
    return el.xpath(xpath, namespaces=NAMESPACES)

def fetch_account_ids(tree):
    accounts = {}
    for xml_acc in xpath(tree, '//gnc:account'):
        name, = xpath(xml_acc, 'act:name/text()')
        id, = xpath(xml_acc, 'act:id/text()') 
        if name in accounts:
            raise Exception('%r is duplicated' % (name,))
        accounts[name] = id
    return accounts

shutil.copy(GNUCASH_FILE, '%s-backup-%s' % (GNUCASH_FILE, run_time))
with closing(gzip.open(GNUCASH_FILE)) as stream:
    tree = E.XML(stream.read())

account_ids = fetch_account_ids(tree)
error_id = account_ids['Imbalance-GBP']

class XmlTransaction(object):
    def dump_xml(f):
        """Decorator: print the transaction's xml if the wrapped method fails."""
        def patched(self, *args, **kwargs):
            try:
                return f(self, *args, **kwargs)
            except Exception:
                print tostring(self.xml, pretty_print=True)
                raise
        return patched
    def __init__(self, xml):
        self.xml = xml

    def is_imbalance(self):
        if len(xpath(self.xml, './/trn:split')) != 2:
            return False
        else:
            return len(xpath(self.xml, './/trn:split[split:account/text()'
                ' = "%s"]/split:value/text()' % error_id)) == 1

    def get_amount(self):
        amount, = xpath(self.xml, './/trn:split[split:account/text() = "%s"]/split:value/text()' % error_id)
        return amount

    def get_notes(self):
        return xpath(self.xml, './/slot[slot:key/text()="notes"]/text()')

    def set_account_id(self, id):
        account_xml, = xpath(self.xml, './/trn:split[split:account/text() = "%s"]/split:account' % error_id)
        account_xml.text = id

    def get_description(self):
        description, = list(xpath(self.xml, 'trn:description/text()')) or ['']
        return description

class TransactionDetails(object):
    """Parse those details of an account that are relevant to us"""
    def __init__(self, xml):
        self.xml = xml
        info = XmlTransaction(xml)
        self.description = info.get_description()
        self.is_imbalance = info.is_imbalance()
        self.notes = info.get_notes()
        if self.is_imbalance:
            self.amount = info.get_amount()
        else:
            self.amount = None

    def __repr__(self):
        # Assumption: the original format string was lost; this is a stand-in
        return ('<TransactionDetails description=%r amount=%r notes=%r>' %
            (self.description, self.amount, self.notes))

# Main loop
for xml_transaction in xpath(tree, '//gnc:transaction'):
    details = TransactionDetails(xml_transaction)
    if not details.is_imbalance:
        debug('Ignoring %s. Not imbalance.' % details)
    else:
        account_name = match_account(details)
        if account_name is None:
            info("Could not match: %r" % details)
        else:
            try:
                account_id = account_ids[account_name]
            except KeyError:
                print 'Valid account names: %r ' % (sorted(account_ids.keys()),)
                raise
            debug('Remapping transaction %s to %s(%s)' % (details, account_name, account_id))
            # Point the imbalance split at the matched account
            XmlTransaction(xml_transaction).set_account_id(account_id)

# Write the modified tree back over the original (a backup was taken above)
with closing(gzip.open(GNUCASH_FILE, 'w')) as f:
    f.write(tostring(tree, pretty_print=True))
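
The if/elif chain in match_account gets awkward as rules accumulate. One alternative is a table of (pattern, account) pairs checked in order. The following is only a sketch, written for Python 3 and not part of the script above; `RULES` and `match_description` are hypothetical names, and the two rules simply mirror the ones already shown.

```python
import re

# Hypothetical rule table: (regular expression, account name), checked in order.
RULES = [
    (re.compile(r'^INT EARNED'), 'Interest Income'),
    (re.compile(r'^SAVE THE CHILDREN'), 'Salary'),
]


def match_description(description, rules=RULES):
    """Return the account of the first rule matching description, else None."""
    for pattern, account in rules:
        if pattern.search(description):
            return account
    return None
```

Adding a rule is then a one-line change to the table, and the order of the list decides which rule wins when several match.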

May 29, 2011 at 8:57 pm

Improved python dir

The results of dir are too long and writing comprehensions for filtering is slightly too painful. However there is a solution: replace it!

The following function is an improved version of dir that

* Pretty prints values
* Has an additional ‘filter’ argument that allows you to only search for names containing a particular string or matching a particular regular expression (this is case insensitive by default)

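import pprint
import re
import sys

class PrettyList(list):
    def __repr__(self):
        return pprint.pformat(list(self))

UNSET = object()
def magic_dir(object=UNSET, filter='', flags=re.I):
    # With no argument, list the caller's local names (as dir() does)
    if object is UNSET:
        results = sys._getframe(1).f_locals.keys()
    else:
        results = dir(object)

    # filter may be a plain string or an already compiled regular expression
    if isinstance(filter, str):
        regexp = re.compile(filter, flags)
    else:
        regexp = filter

    return PrettyList([x for x in results if regexp.search(x)])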

I then smuggle this dir into my context, replacing the builtin dir, by setting PYTHONSTARTUP and doing the patching there. There are some slight issues with getting this into pdb, but it otherwise works wonderfully.

May 29, 2011 at 7:48 pm

Mounting Android Devices at sane mount points

My android phone (or my camera for that matter) doesn’t have a very nice name for its volume. However, I am too cowardly to try to change the volume name for an android device with gparted (at least not until I can find some notes about it on the internet). As far as I can tell there is no way to tell gnome where to mount devices, and though you can do everything you want at the udev level, this takes gnome features away from you (shortcuts etc – not that I really use these…)

Thus I’ve been forced to use workarounds.

The following script will sit quietly, wait for a given device to be mounted, and create a sanely named symlink to it.


# Silly script to create nicely named symlinks 
# for mounts in addition to the less nicely named volume links.

# In general you probably just want to change your volume label.
# However my volume label comes from a phone and I am a coward!

# This script is written to crash out on failure. 
# A supervisor should ensure that it is kept alive (like upstart)

import os
import pyinotify
import select
import sys
import syslog
import traceback

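def syslog_exceptions():
    hook = sys.excepthook
    def new_hook(type, value, tb):
        hook(type, value, tb)
        output = traceback.format_exception(type, value, tb)
        # Send the formatted traceback to syslog, one entry per line
        for line in ''.join(output).splitlines():
            syslog.syslog(syslog.LOG_ERR, line)
    new_hook.previous_hook = hook
    sys.excepthook = new_hook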

def log(msg):
    print msg

mount, link = sys.argv[1:]

class EventProcessor(pyinotify.ProcessEvent):
    def process_IN_CREATE(self, event):
        path = os.path.join(event.path, event.name)
        if path == mount:
            os.symlink(mount, link)
            log("%s created. Linked to it from %s." % (path, link))

    def process_IN_DELETE(self, event):
        path = os.path.join(event.path, event.name)
        if path == mount:
            os.unlink(link)
            log("%s removed. Unlinked %s." % (path, link))

def wait_for_exit(thread, signal):
    """Wait for thread to exit and then signal"""
    # Hack to leave main thread free to handle signals

wm = pyinotify.WatchManager()
notifier = pyinotify.AsyncNotifier(wm, EventProcessor())
notifier.daemon = True
flags = pyinotify.EventsCodes.ALL_FLAGS
mount_dir = os.path.dirname(mount)
wm.add_watch(mount_dir, flags['IN_CREATE'] | flags['IN_DELETE'])

log("Waiting for %s to be mounted..." % mount)

Together with the following upstart file this will magically create symlinks for you.

description     "Create sane mounts locations for my phone"

start on local-filesystems
stop on runlevel [016]


exec /root/autolinkmount /media/ /media/phone

My goodness this was a waste of time.

PS: For almost all devices other than a phone you probably want to achieve this effect by changing the volume label.

May 21, 2011 at 1:37 am

Code exceptions to syslog in addition to standard out

I’m not sure where the limits of syslog use lie, but for some use cases writing to syslog is very useful. In particular, syslog is used as “the place to go if there are any problems with hardware, or the low level running of my system”, and having such a common dumping ground is quite useful.

The following function call will start writing exceptions from a python script to syslog in addition to standard error (or wherever you were logging to before):

# Code log exceptions to syslog in addition to standard out

import sys
import syslog
import traceback

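def syslog_exceptions():
    hook = sys.excepthook
    def new_hook(type, value, tb):
        # tb, not traceback: the argument must not shadow the traceback module
        hook(type, value, tb)
        output = traceback.format_exception(type, value, tb)
        for line in ''.join(output).splitlines():
            syslog.syslog(syslog.LOG_ERR, line)
    new_hook.previous_hook = hook
    sys.excepthook = new_hook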
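
To try the hook without writing to the real system log, the logging call can be passed in as a parameter. This is a minimal sketch for Python 3; the `sink` argument and the name `log_exceptions_to` are assumptions for illustration, and `sink` falls back to `syslog.syslog` when omitted.

```python
import sys
import traceback


def log_exceptions_to(sink=None):
    """Chain a hook onto sys.excepthook that also sends the traceback to sink."""
    if sink is None:
        import syslog  # Unix only; imported lazily so the sketch loads elsewhere
        sink = syslog.syslog
    previous_hook = sys.excepthook

    def new_hook(exc_type, value, tb):
        previous_hook(exc_type, value, tb)  # keep the normal stderr output
        for line in traceback.format_exception(exc_type, value, tb):
            sink(line.rstrip('\n'))

    new_hook.previous_hook = previous_hook
    sys.excepthook = new_hook
    return new_hook
```

Calling `log_exceptions_to()` with no argument near the top of a script sends uncaught exceptions to syslog; passing `some_list.append` instead collects the lines for inspection.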

P.S. There was a patch about a year ago to add this to the python standard library… it looks like nothing happened, alas.

May 21, 2011 at 12:44 am

Cloning table schemas


May 20, 2011 at 6:19 pm

Python tracing decorator with optional depth argument

The following code sample will print the lines in a function and the lines in the functions it calls up to a certain depth as they are executed. This is quite useful for debugging, since it prevents one from having to do ‘print bisecting’ or similar. The downside is that this can become terribly verbose, particularly if you are tracing functions with loops.

This code is untested with multiple threads.

import linecache
import sys

class Trace(object):
	def __init__(self, max_depth):
		self.max_depth = max_depth
		self.depth = 0

	def __call__(self, frame, reason, arg):
		if reason == 'call':
			self.depth += 1
		# max_depth of None means no limit on the call depth traced
		elif reason == 'line' and (self.max_depth is None or self.depth <= self.max_depth):
			filename = frame.f_code.co_filename
			line_no = frame.f_lineno
			line = linecache.getline(filename, line_no)
			print '%s:%s:%s' % (filename, line_no, line),
		elif reason == 'return':
			self.depth -= 1
		return self

def trace(max_depth=None):
	_trace = Trace(max_depth)
	def decorator(f):
		def _f(*args, **kwds):
			# Install the tracer for the duration of the call only
			sys.settrace(_trace)
			try:
				result = f(*args, **kwds)
			finally:
				sys.settrace(None)
			return result
		return _f
	return decorator
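
The depth counting is easier to see when the traced lines are collected instead of printed. The following is a sketch for Python 3 along the same lines, not the decorator above; `traced_lines`, `inner` and `outer` are hypothetical names used only for illustration.

```python
import sys


def traced_lines(func, *args, max_depth=None):
    """Run func(*args); return (function name, line number) for each traced line."""
    depth = 0
    lines = []

    def tracer(frame, event, arg):
        nonlocal depth
        if event == 'call':
            depth += 1
        elif event == 'line' and (max_depth is None or depth <= max_depth):
            lines.append((frame.f_code.co_name, frame.f_lineno))
        elif event == 'return':
            depth -= 1
        return tracer  # keep tracing inside this frame

    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(None)  # always switch tracing off again
    return lines


def inner():
    return 1


def outer():
    x = inner()
    return x
```

With `max_depth=1` only the two lines of `outer` are recorded; with `max_depth=2` the line inside `inner` appears between them.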

May 11, 2011 at 6:00 pm
