May | 2011 | Arg and gah and ap and pa

Archive for May, 2011

R cheat sheet

People who write cheat sheets don’t seem to understand the value of plain text. If you like pdfs you might prefer this:

Warning! These are notes to myself, made when not trying particularly hard to understand the syntax of R. They may not be optimal.

apropos, help, etc

Add a column taking a constant value to a data frame (think cross join):
merge(frame, data.frame(b=c(1)))

Stack two data frames
rbind(frame1, frame2)

Import:
source('x')

Run R in batch mode
R --no-save < fit.r

Read csv or similar
read.table(file='name', header=TRUE, sep=',')

Linear model fitting:
lm(a ~ b + c)
lm(column1 ~ column2, frame)

names(frame) – List column names

TRUE, FALSE

R versus python.

Higher level (more domain specific), but an uglier programming language. The kitchen sink is more immediately accessible but less well packed.
Classic DSL versus general purpose language. Key point: R has a vast kitchen sink.

— Sequences

1:10 = 1,2,3,…10
seq(0, 1, 0.2) = [0, 0.2, … 1.0]

— Postgres —

RPostgreSQL.

library(RPostgreSQL)
conn = dbConnect(PostgreSQL(), user="user", dbname="db", host="localhost", port="port")
results <- dbGetQuery(conn, query)
# beautiful results

— Functional programming

Map
Filter
Reduce

— Plotting

R plots into a device. A device can be a file or a window

dev.new() – Create a new window
dev.list() – List windows
dev.set(number) – Set the window
plot() – Plot some stuff (clear previous plot)
points – add points to an existing plot.

The plot function in R is a magic do what I mean function. It will
* Plot all 2-d projections if given a data frame
* Plot objects that you build (like clusters)

— Package management

install.packages('Hmisc')

— Old values

.Last.value – last value (like _ in python)

— Matrix

t – Transpose

May 30, 2011 at 8:05 pm

Rule-based automation of gnucash transaction source detection

I’ve decided to start using gnucash to track my accounts (as an alternative to nothing).

An issue that arises when using any piece of accounting software is the input of data. This is somewhat simplified by companies providing you with electronic records, but one still has the problem of assigning each transaction to a category.

Gnucash has some inbuilt approaches for this. For ofx files it has a bayesian filter that can guess which account a transaction is likely to involve (e.g. income, interest, auto expenses) based on your previous assignments. However, I’m afraid that such an approach will be wrong quite a lot. Therefore I’ve hacked up an alternative approach that allows one to define a set of rules and match based on them.

Here is some code:


# This program is distributed under a GPL3 license.
# The author should be "Arg and gah and ap and pa"

"""Open a gnu cash file then find transactions which come from or go to
an unknown source and apply a set of user defined rules to guess this
source.

Caveats:
 * This approach is intrinsically prey to tweaks in gnucash's xml format.
   Such tweaks are probably unlikely, and the program is simple enough that
   it would be easy to repair. A better approach would be to implement this
   in gnucash, but this has too large a fixed cost for me to consider at the
   moment.
 
 * There are already inbuilt features of gnucash to do this but 
   these don't do quite what I want
    * A bayesian mapper. I don't feel like using a bayesian mapper here 
      because it is liable to mismap requiring me to check results. I'm 
      sure for some people this works fine..
 
    * Qif support for account mapping:
       My understanding is that in qif you map each source once and 
       it then remembers these sources. However my bank won't send me qif.
"""

# CONFIGURE BY HAND 
GNUCASH_FILE = 'play'
def match_account(transaction):
    """Matching function to be tweaked. 

    This takes the details of an unmapped transaction and decides 
    which account the transaction came from or should go to."""

    if transaction.description.startswith('INT EARNED'):
        return 'Interest Income'
    elif transaction.description.startswith('SAVE THE CHILDREN'):
        return 'Salary'
    else:
        return None
# END CONFIGURE BY HAND 

from contextlib import closing
import datetime
import logging
import gzip
import shutil
import os
import lxml.etree as E
from lxml.etree import tostring

from logging import debug, info

run_time = datetime.datetime.now().isoformat()

logging.basicConfig(
    filename='%s-unbalanced-remap.log%s' % (GNUCASH_FILE, run_time),
    level=logging.DEBUG)

NAMESPACES = {
     'gnc': "http://www.gnucash.org/XML/gnc",
     'act': "http://www.gnucash.org/XML/act",
     'slot': "http://www.gnucash.org/XML/slot",
     'split': "http://www.gnucash.org/XML/split",
     'trn': "http://www.gnucash.org/XML/trn"
}

def xpath(el, xpath):
    return el.xpath(xpath, namespaces=NAMESPACES)

def fetch_account_ids(tree):
    accounts = {}
    for xml_acc in xpath(tree, '//gnc:account'):
        name, = xpath(xml_acc, 'act:name/text()')
        id, = xpath(xml_acc, 'act:id/text()') 
        if name in accounts:
            raise Exception('%r is duplicated' % (name,))
        accounts[name] = id
    return accounts

shutil.copy(GNUCASH_FILE, '%s-backup-%s' % (GNUCASH_FILE, run_time))
with closing(gzip.open(GNUCASH_FILE)) as stream:
    tree = E.XML(stream.read())

account_ids = fetch_account_ids(tree)
error_id = account_ids['Imbalance-GBP']

class XmlTransaction(object):
    def dump_xml(f):
        def patched(self, *args, **kwargs):
            try:
                return f(self, *args, **kwargs)
            except Exception:
                print(tostring(self.xml, pretty_print=True))
                raise
        return patched
    
    def __init__(self, xml):
        self.xml = xml

    @dump_xml
    def is_imbalance(self):
        if len(xpath(self.xml, './/trn:split')) != 2:
            return False
        else:
            return len(xpath(self.xml, './/trn:split[split:account/text()'
                ' = "%s"]/split:value/text()' % error_id)) == 1

    @dump_xml
    def get_amount(self):
        amount, = xpath(self.xml, './/trn:split[split:account/text() = "%s"]/split:value/text()' % error_id)
        return amount

    @dump_xml
    def get_notes(self):
        return xpath(self.xml, './/slot[slot:key/text()="notes"]/slot:value/text()')

    @dump_xml
    def set_account_id(self, id):
        account_xml, = xpath(self.xml, './/trn:split[split:account/text() = "%s"]/split:account' % error_id)
        account_xml.text = id

    @dump_xml
    def get_description(self):
        description, = list(xpath(self.xml, 'trn:description/text()')) or ['']
        return description

class TransactionDetails(object):
    """Parse those details of a transaction that are relevant to us"""
    def __init__(self, xml):
           
        self.xml = xml
        info = XmlTransaction(xml)
        self.description = info.get_description()
        self.is_imbalance = info.is_imbalance()
        self.notes = info.get_notes()
        if self.is_imbalance:
            self.amount = info.get_amount()
        else:
            self.amount = None

    def __repr__(self):
        return ('<TransactionDetails description=%r amount=%r notes=%r>' %
            (self.description, self.amount, self.notes))
        
# Main loop
for xml_transaction in xpath(tree, '//gnc:transaction'):
    details = TransactionDetails(xml_transaction)
    if not details.is_imbalance:
        debug('Ignoring %s. Not imbalance.' % details)
        continue
    else:
        account_name = match_account(details)
        if account_name is None:
            info("Could not match: %r" % details)
        else:
            try:
                account_id = account_ids[account_name]
            except KeyError:
                print('Valid account names: %r ' % (sorted(account_ids.keys()),))
                raise
            XmlTransaction(xml_transaction).set_account_id(account_id)
            debug('Remapping transaction %s to %s(%s)' % (details, account_name, account_id))
        
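with closing(gzip.open(GNUCASH_FILE, 'wb')) as f:
    # tostring() omits the XML declaration, so write it by hand
    f.write(b'<?xml version="1.0" encoding="utf-8" ?>\n')
    f.write(tostring(tree, pretty_print=True))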
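
The hand-edited match_account function above could also be driven by a table of rules, which keeps the configuration in one place. This is only a minimal sketch: the RULES list and the Transaction stand-in type are hypothetical names that do not appear in the script, and prefix matching is just one possible kind of rule.

```python
from collections import namedtuple

# Hypothetical stand-in for the TransactionDetails objects built by the script.
Transaction = namedtuple('Transaction', ['description', 'amount', 'notes'])

# Hypothetical rule table: (description prefix, account name).
# The first matching rule wins, so put more specific prefixes first.
RULES = [
    ('INT EARNED', 'Interest Income'),
    ('SAVE THE CHILDREN', 'Salary'),
]


def match_account(transaction, rules=RULES):
    """Return the account name of the first matching rule, or None."""
    for prefix, account_name in rules:
        if transaction.description.startswith(prefix):
            return account_name
    return None
```

Because the function still returns None when nothing matches, unmatched transactions would continue to be logged rather than remapped.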

May 29, 2011 at 8:57 pm

Improved python dir

The results of dir are too long and writing comprehensions for filtering is slightly too painful. However there is a solution: replace it!

The following function is an improved version of dir that

* Pretty prints values
* Has an additional ‘filter’ argument that allows you to only search for names containing a particular string or matching a particular regular expression (this is case insensitive by default)


import pprint
import re
import sys

class PrettyList(list):
    def __repr__(self):
        return pprint.pformat(list(self))

UNSET = object()
def magic_dir(object=UNSET, filter='', flags=re.I):
    if object is UNSET:
        results = sys._getframe(1).f_locals.keys()
    else:
        results = dir(object)

    if isinstance(filter, str):
        regexp = re.compile(filter, flags)
    else:
        regexp = filter

    return PrettyList([x for x in results
        if regexp.search(x)])

I then smuggle this dir into my context, replacing the builtin dir, by setting PYTHONSTARTUP and doing the patching there. There are some slight issues with getting this into pdb, but it otherwise works wonderfully.

May 29, 2011 at 7:48 pm

Mounting Android Devices at sane mount points

My android phone (or my camera for that matter) doesn’t have a very nice name for its volume. However, I am too cowardly to try to change the volume name of an android device with gparted (at least not until I can find some notes about it on the internet). As far as I can tell there is no way to tell gnome where to mount devices, and though you can do everything you want at the udev level, this takes gnome features away from you (shortcuts etc – not that I really use these…).

Thus I’ve been forced to use workarounds.

The following script will sit quietly, wait for a given device to be mounted, and create a sanely named symlink to it.

#!/usr/bin/python

# Silly script to create nicely named symlinks 
# for mounts in addition to the less nicely named volume links.

# In general you probably just want to change your volume label.
# However my volume label comes from a phone and I am a coward!

# This script is written to crash out on failure. 
# A supervisor should ensure that it is kept alive (like upstart)

import os
import pyinotify
import select
import sys
import syslog
import traceback

def syslog_exceptions():
    hook = sys.excepthook
    def new_hook(type, value, tb):
        hook(type, value, tb)
        output = traceback.format_exception(type, value, tb)
        syslog.syslog(repr(output))
    new_hook.previous_hook = hook
    sys.excepthook = new_hook
syslog.openlog('autolinkmount')
syslog_exceptions()


def log(msg):
    syslog.syslog(msg)
    print(msg)

mount, link = sys.argv[1:]

class EventProcessor(pyinotify.ProcessEvent):
    def process_IN_CREATE(self, event):
        path = os.path.join(event.path, event.name)
        if path == mount:
            os.symlink(mount, link)
            log("%s created. Linked to it from %s." % (path, link))

    def process_IN_DELETE(self, event):
        path = os.path.join(event.path, event.name)
        if path == mount:
            os.unlink(link)
            log("%s removed. Unlinked %s." % (path, link))
            
def wait_for_exit(thread, signal):
    """Wait for thread to exit and then signal"""
    # Hack to leave main thread free to handle signals
    thread.join()
    signal.set()

wm = pyinotify.WatchManager()
# Notifier provides the blocking loop() used below
notifier = pyinotify.Notifier(wm, EventProcessor())
flags = pyinotify.EventsCodes.ALL_FLAGS
mount_dir = os.path.dirname(mount)
wm.add_watch(mount_dir, flags['IN_CREATE'] | flags['IN_DELETE'])

log("Waiting for %s to be mounted..." % mount)
notifier.loop()

Together with the following upstart file this will magically create symlinks for you.

description     "Create sane mounts locations for my phone"

start on local-filesystems
stop on runlevel [016]

respawn

exec /root/autolinkmount /media/ /media/phone

My goodness this was a waste of time.

PS: For almost all devices other than a phone you probably want to achieve this effect by changing the volume label.

May 21, 2011 at 1:37 am

Code exceptions to syslog in addition to standard out

I’m not sure where the limits of syslog use should lie, but for some use cases writing to syslog is very useful. In particular, syslog is used as “the place to go if there are any problems with hardware, or the low level running of my system”, and having such a common dumping ground is quite useful.

The following function call will start writing exceptions from a python script to syslog in addition to standard error (or wherever you were logging to before)

# Code log exceptions to syslog in addition to standard out

import sys
import syslog
import traceback

def syslog_exceptions():
    hook = sys.excepthook
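    def new_hook(type, value, tb):
        # The argument is named tb so that it does not shadow the traceback module
        hook(type, value, tb)
        output = traceback.format_exception(type, value, tb)
        # format_exception returns a list of strings; syslog wants one string
        syslog.syslog(''.join(output))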
    new_hook.previous_hook = hook
    sys.excepthook = new_hook
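
The same hook-chaining idea can be sketched with a pluggable destination, which makes it easy to try out without touching the system log. This is an illustration under assumptions rather than the function above: tee_exceptions, restore_exceptions and the sink argument are hypothetical names, and passing syslog.syslog as the sink on a Unix system should behave much like syslog_exceptions.

```python
import sys
import traceback


def tee_exceptions(sink):
    """Chain sys.excepthook so uncaught exceptions also reach sink(text)."""
    previous_hook = sys.excepthook

    def new_hook(exc_type, value, tb):
        previous_hook(exc_type, value, tb)  # keep the normal report
        sink(''.join(traceback.format_exception(exc_type, value, tb)))

    new_hook.previous_hook = previous_hook
    sys.excepthook = new_hook
    return new_hook


def restore_exceptions():
    """Undo the most recent tee_exceptions call, if there was one."""
    sys.excepthook = getattr(sys.excepthook, 'previous_hook', sys.excepthook)
```

Storing the previous hook on the new one, as the original function does, is what makes the undo step possible.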

P.S. There was a patch about a year ago to add this to the python standard library… looks like nothing happened, alas.

May 21, 2011 at 12:44 am

Cloning table schemas

create table x2 ( like x1 INCLUDING DEFAULTS INCLUDING CONSTRAINTS INCLUDING INDEXES );

May 20, 2011 at 6:19 pm

Python tracing decorator with optional depth argument

The following code sample will print the lines in a function and the lines in the functions it calls up to a certain depth as they are executed. This is quite useful for debugging, since it prevents one from having to do ‘print bisecting’ or similar. The downside is that this can become terribly verbose, particularly if you are tracing functions with loops.

This code is untested with multiple threads.

import functools
import linecache
import sys

class Trace(object):
	def __init__(self, max_depth):
		self.max_depth = max_depth
		self.depth = 0

	def __call__(self, frame, reason, arg):
		if reason == 'call':
			self.depth += 1
		elif reason == 'line' and (self.max_depth is None or self.depth <= self.max_depth):
			# Print each executed line unless we are deeper than max_depth
			filename = frame.f_code.co_filename
			line_no = frame.f_lineno
			line = linecache.getline(filename, line_no)
			sys.stdout.write('%s:%s:%s' % (filename, line_no, line))
		elif reason == 'return':
			self.depth -= 1
		return self

def trace(max_depth=None):
	_trace = Trace(max_depth)
	def decorator(f):
		@functools.wraps(f)
		def _f(*args, **kwds):
			sys.settrace(_trace)
			try:
				return f(*args, **kwds)
			finally:
				# Always switch tracing off, even if f raises
				sys.settrace(None)
		return _f
	return decorator
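
To see the effect of the depth argument without reading a wall of output, here is a small variant that records events instead of printing them. It is a sketch under assumptions, not the decorator above: trace_lines, helper, outer and the events attribute are all hypothetical names, and it assumes that max_depth=None means "no limit".

```python
import functools
import sys


def trace_lines(max_depth=None):
    """Decorator factory: record (function name, line number) per executed line."""
    def decorator(f):
        @functools.wraps(f)
        def wrapper(*args, **kwds):
            depth = [0]
            wrapper.events = []

            def tracer(frame, reason, arg):
                if reason == 'call':
                    depth[0] += 1
                elif reason == 'line':
                    if max_depth is None or depth[0] <= max_depth:
                        wrapper.events.append(
                            (frame.f_code.co_name, frame.f_lineno))
                elif reason == 'return':
                    depth[0] -= 1
                return tracer

            previous = sys.gettrace()
            sys.settrace(tracer)
            try:
                return f(*args, **kwds)
            finally:
                # Restore whatever tracer (if any) was active before
                sys.settrace(previous)
        return wrapper
    return decorator


def helper(x):
    return x + 1


@trace_lines(max_depth=1)
def outer(x):
    y = helper(x)
    return y * 2
```

With max_depth=1 only the lines of outer itself should be recorded; raising the limit to 2 should add the line inside helper as well.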

May 11, 2011 at 6:00 pm

Wifi tethering out of the box on gingerbread without root

I spent quite a while being annoyed at needing root for wireless tethering on my android phone, until I discovered that google had quietly hidden this in the settings menu, in the tethering and portable hot spots section of the wireless and network settings. Once I found this, everything seemed to work fine. I am, however, not sure whether this option will be blocked on slightly more evil mobile services than mine.

May 8, 2011 at 8:28 pm

Using spareroom efficiently

I find searching sites like gumtree, craigslist or spareroom irritating. The main problem is that I end up wasting time clicking through results that I’ve seen before and generally browsing around. This is partly due to my ineffectualness, but also to do with the layout of the site. Things become particularly annoying if you visit the site frequently.

What I really want is a means to consider each listing once and only once – some sites might be quite good at doing this – by giving you e-mail updates and such like – but I couldn’t easily see a way of making this work.

Dirty hacks to the rescue:

I wrote the following scraper, which performs a search for me and prints out a list of urls:

# Write-once code - do not pass judgment
import urllib
from lxml.etree import HTML
URL = "http://www.spareroom.co.uk/flatshare/search.pl?searchtype=advanced&flatshare_type=offered&location_type=area&search=&miles_from_max=0&showme_rooms=Y&showme_buddyup_properties=Y&min_rent=&max_rent=&per=pw&rooms_for=&no_of_rooms=&available_search=N&day_avail=&mon_avail=&year_avail=&min_age_req=&max_age_req=&min_beds=&max_beds=&keyword=&nmsq_mode=%21nmsq_mode%21&action=search&templateoveride=&x=149&y=13"

def mangle_urls(urls):
    new_matches = list(set([url.split('&', 1)[0] for url in urls]))
    return new_matches

stream = urllib.urlopen(URL)
new_url = stream.url
page = stream.read()
tree = HTML(page)
# contains(haystack, needle): select links whose text contains "More info"
matches = mangle_urls(tree.xpath('//a[contains(text(), "More info")]/@href'))
matches_so_far = set(matches)
for offset in range(10, 300, 10):
    url = new_url + 'offset=%d' % offset
    page = urllib.urlopen(url).read()
    tree = HTML(page)
    new_matches = mangle_urls(tree.xpath('//a[contains(text(), "More info")]/@href'))
    new_matches = set(new_matches) - matches_so_far
    matches_so_far |= set(new_matches)
    matches += list(new_matches)
matches = ['http://www.spareroom.co.uk' + url for url in matches if not url.startswith('http')]
print('\n'.join(matches))

I then took the output and saved it to ‘new_rooms’. With vim and some macro magic, I went through the links one at a time, considering them, possibly sending replies, and then moving the links to an ‘already_considered’ file. Using command line magic I was then able to ensure that the next time I got a list of links I could remove the links I’d already considered.
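
The filtering step was done with command line tools, but it could equally be sketched in Python. The function name unseen_links is hypothetical and this is not the command line approach described above, just one way to get the same effect while keeping the links in their original order.

```python
def unseen_links(new_links, considered_links):
    """Return new_links minus considered_links, keeping first-seen order."""
    seen = set(link.strip() for link in considered_links)
    result = []
    for link in new_links:
        link = link.strip()
        if link and link not in seen:
            seen.add(link)  # also drops duplicates within new_links
            result.append(link)
    return result
```

Since open files iterate line by line, calling it as unseen_links(open('new_rooms'), open('already_considered')) should give the links that still need looking at.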

I’m not sure this quite justified the time invested, but it’s coming pretty close. If this script still works when you read this post, it could be worth your time to use it (since you don’t have to write it).

Kind regards,
Anon

May 5, 2011 at 1:24 am

Templating in make: parameterised targets

I just discovered the eval and define keywords in make. These can be used to create parameterised targets. These are useful for a few purposes – one being building for different architectures.

Example use case:

define arch_target
build-$(arch): squid
	# make for architecture...
endef

$(foreach arch,i386 x86_64,$(eval $(arch_target)))

May 3, 2011 at 6:27 pm
