Archive for May, 2011
R cheat sheet
People who write cheat sheets don’t seem to understand the value of plain text. If you like pdfs you might prefer this:
Click to access Short-refcard.pdf
Warning! These are notes to myself, made when not trying particularly hard to understand the syntax of R. They may not be optimal.
apropos, help, etc
Add a column taking a constant value to a data frame (think cross join: merging on no shared columns gives the Cartesian product):
merge(frame, data.frame(b=c(1)))
Stack two data frames
rbind(frame1, frame2)
Import:
source('x')
Run R in batch mode
R < fit.r --no-save
Read csv or similar
read.table(file='name')
Linear model fitting:
lm(a ~ b + c)
lm(column1 ~ column2, frame)
names(frame) – List column names
TRUE, FALSE
??
R versus python.
R is higher level (more domain specific) but an uglier programming language. Its kitchen sink is more immediately accessible but less well packaged.
Classic DSL versus general purpose language. Key point: R has a vast kitchen sink.
— Sequences
1:10 = 1,2,3,…10
seq(0, 1, 0.2) = [0, 0.2, … 1.0]
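For the Python side of the comparison, here is a rough sketch of the same two sequences using only the standard library (numpy is deliberately not assumed). The variable names are illustrative. Note that R's 1:10 includes both endpoints, whereas Python's range excludes the upper one.

```python
# R: 1:10 is inclusive at both ends
one_to_ten = list(range(1, 11))

# R: seq(0, 1, 0.2); build from integers to avoid accumulating float error
step = 0.2
count = round(1 / step)  # number of steps between 0 and 1
zero_to_one = [round(i * step, 10) for i in range(count + 1)]

print(one_to_ten)
print(zero_to_one)
```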
— Postgres —
RPostgreSQL.
library(RPostgreSQL)
conn = dbConnect(PostgreSQL(), user="user", dbname="db", host="localhost", port="port")
results <- dbGetQuery(conn, query)
# beautiful results
— Functional programming
Map
Filter
Reduce
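As a point of comparison, the same three operations look roughly like this in Python. This is only a sketch; the R calls in the comments are the approximate equivalents, and the data is made up.

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5]

# R: Map(function(x) x * x, numbers)
squares = list(map(lambda x: x * x, numbers))

# R: Filter(function(x) x %% 2 == 0, numbers)
evens = list(filter(lambda x: x % 2 == 0, numbers))

# R: Reduce(`+`, numbers)
total = reduce(lambda a, b: a + b, numbers)

print(squares, evens, total)
```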
— Plotting
R plots into a device. A device can be a file or a window
dev.new() – Create a new window
dev.list() – List windows
dev.set(number) – Set the window
plot() – Plot some stuff (clear previous plot)
points – add points to an existing plot.
The plot function in R is a magic do what I mean function. It will
* Plot all 2-d projections if given a data frame
* Plot objects that you build (like clusters)
— Package management
install.packages('Hmisc')
— Old values
.Last.value – the last value (like _ in python)
— Matrix
t – Transpose
Rule-based automation of gnucash transaction source detection
I’ve decided to start using gnucash to track my accounts (as an alternative to nothing).
An issue that arises when using any piece of accounting software is the input of data. This is somewhat simplified by companies providing you with electronic records, but one still has the problem of assigning each transaction to a category.
Gnucash has some inbuilt approaches for this. For ofx files it has a bayesian filter that can guess which account a transaction is likely to involve (e.g. income, interest, auto expenses) based on your previous assignments. However, I’m afraid that such an approach will be wrong quite a lot. Therefore I’ve hacked up an alternative approach that allows one to define a set of rules and match based on them.
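The core idea of rule-based matching fits in a few lines. This is a minimal sketch, not part of the real script: an ordered table of description prefixes where the first match wins and anything unmatched is left alone. RULES and first_match are hypothetical names, and the prefixes and account names are example values.

```python
# Hypothetical rule table: (description prefix, account name); first match wins.
RULES = [
    ('INT EARNED', 'Interest Income'),
    ('SAVE THE CHILDREN', 'Salary'),
]


def first_match(description, rules=RULES):
    """Return the account for the first matching prefix, or None."""
    for prefix, account in rules:
        if description.startswith(prefix):
            return account
    return None  # leave the transaction unmapped rather than guess


print(first_match('INT EARNED 05/2011'))
print(first_match('CASH MACHINE'))
```

Returning None for unmatched transactions is the point of the exercise: unlike a statistical filter, a rule table never silently guesses.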
Here is some code:
# This program is distributed under a GPL3 license.
# The author should be "Arg and gah and ap and pa"
"""Open a gnu cash file then find transactions which come from or go to
an unknown source and apply a set of user defined rules to guess this
source.

Caveats:

* This approach is intrinsically prey to tweaks in gnucash's xml format.
  This is probably unlikely to happen, and the program is simple enough
  that it would be easy to repair. A better approach would be to
  implement this in gnucash, but this has too large a fixed cost for me
  to consider at the moment.

* There are already inbuilt features of gnucash to do this but these
  don't do quite what I want:

  * A bayesian mapper. I don't feel like using a bayesian mapper here
    because it is liable to mismap, requiring me to check results. I'm
    sure for some people this works fine.

  * Qif support for account mapping: My understanding is that in qif
    you map each source once and it then remembers these sources.
    However my bank won't send me qif.
"""

# CONFIGURE BY HAND
GNUCASH_FILE = 'play'


def match_account(transaction):
    """Matching function to be tweaked.

    This takes the details of an unmapped transaction and decides which
    account the transaction came from or should go to."""
    if transaction.description.startswith('INT EARNED'):
        return 'Interest Income'
    elif transaction.description.startswith('SAVE THE CHILDREN'):
        return 'Salary'
    else:
        return None
# END CONFIGURE BY HAND

from contextlib import closing
import datetime
import logging
import gzip
import shutil
import os

import lxml.etree as E
from lxml.etree import tostring
from logging import debug, info

run_time = datetime.datetime.now().isoformat()
logging.basicConfig(
    filename='%s-unbalanced-remap.log%s' % (GNUCASH_FILE, run_time),
    level=logging.DEBUG)

NAMESPACES = {
    'gnc': "http://www.gnucash.org/XML/gnc",
    'act': "http://www.gnucash.org/XML/act",
    'slot': "http://www.gnucash.org/XML/slot",
    'split': "http://www.gnucash.org/XML/split",
    'trn': "http://www.gnucash.org/XML/trn"
}


def xpath(el, xpath):
    return el.xpath(xpath, namespaces=NAMESPACES)


def fetch_account_ids(tree):
    """Map account names to gnucash account ids."""
    accounts = {}
    for xml_acc in xpath(tree, '//gnc:account'):
        name, = xpath(xml_acc, 'act:name/text()')
        id, = xpath(xml_acc, 'act:id/text()')
        if name in accounts:
            raise Exception('%r is duplicated' % (name,))
        accounts[name] = id
    return accounts


# Keep a timestamped backup before touching anything
shutil.copy(GNUCASH_FILE, '%s-backup-%s' % (GNUCASH_FILE, run_time))

with closing(gzip.open(GNUCASH_FILE)) as stream:
    tree = E.XML(stream.read())

account_ids = fetch_account_ids(tree)
error_id = account_ids['Imbalance-GBP']


class XmlTransaction(object):
    def dump_xml(f):
        # Decorator: print the offending transaction's xml if f fails
        def patched(self, *args, **kwargs):
            try:
                return f(self, *args, **kwargs)
            except Exception:
                print(tostring(self.xml, pretty_print=True,
                               encoding='unicode'))
                raise
        return patched

    def __init__(self, xml):
        self.xml = xml

    @dump_xml
    def is_imbalance(self):
        if len(xpath(self.xml, './/trn:split')) != 2:
            return False
        else:
            return len(xpath(
                self.xml,
                './/trn:split[split:account/text()'
                ' = "%s"]/split:value/text()' % error_id)) == 1

    @dump_xml
    def get_amount(self):
        amount, = xpath(
            self.xml,
            './/trn:split[split:account/text() = '
            '"%s"]/split:value/text()' % error_id)
        return amount

    @dump_xml
    def get_notes(self):
        return xpath(self.xml, './/slot[slot:key/text()="notes"]/text()')

    @dump_xml
    def set_account_id(self, id):
        account_xml, = xpath(
            self.xml,
            './/trn:split[split:account/text() = "%s"]/split:account'
            % error_id)
        account_xml.text = id

    @dump_xml
    def get_description(self):
        description, = list(xpath(self.xml, 'trn:description/text()')) or ['']
        return description


class TransactionDetails(object):
    """Parse those details of a transaction that are relevant to us"""
    def __init__(self, xml):
        self.xml = xml
        info = XmlTransaction(xml)
        self.description = info.get_description()
        self.is_imbalance = info.is_imbalance()
        self.notes = info.get_notes()
        if self.is_imbalance:
            self.amount = info.get_amount()
        else:
            self.amount = None

    def __repr__(self):
        return ('<TransactionDetails %r %r %r>' %
                (self.description, self.amount, self.notes))


# Main loop
for xml_transaction in xpath(tree, '//gnc:transaction'):
    details = TransactionDetails(xml_transaction)
    if not details.is_imbalance:
        debug('Ignoring %s. Not imbalance.' % details)
        continue
    else:
        account_name = match_account(details)
        if account_name is None:
            info("Could not match: %r" % details)
        else:
            try:
                account_id = account_ids[account_name]
            except KeyError:
                print('Valid account names: %r ' %
                      (sorted(account_ids.keys()),))
                raise
            XmlTransaction(xml_transaction).set_account_id(account_id)
            debug('Remapping transaction %s to %s(%s)' %
                  (details, account_name, account_id))

with closing(gzip.open(GNUCASH_FILE, 'w')) as f:
    f.write(b'\n')
    f.write(tostring(tree, pretty_print=True))
Improved python dir
The results of dir are too long and writing comprehensions for filtering is slightly too painful. However there is a solution: replace it!
The following function is an improved version of dir that
* Pretty prints values
* Has an additional ‘filter’ argument that allows you to only search for names containing a particular string or matching a particular regular expression (this is case insensitive by default)
import pprint
import re
import sys


class PrettyList(list):
    def __repr__(self):
        return pprint.pformat(list(self))


UNSET = object()


def magic_dir(object=UNSET, filter='', flags=re.I):
    if object is UNSET:
        # No argument: list the caller's local names, as dir() does
        results = sys._getframe(1).f_locals.keys()
    else:
        results = dir(object)
    if isinstance(filter, str):
        regexp = re.compile(filter, flags)
    else:
        # Anything else is assumed to be a compiled regular expression
        regexp = filter
    return PrettyList([x for x in results if regexp.search(x)])
I then smuggle this dir into my context, replacing the builtin dir, by setting PYTHONSTARTUP and doing the patching in that startup file. There are some slight issues with getting this into pdb, but it otherwise works wonderfully.
Mounting Android Devices at sane mount points
My android phone (or my camera for that matter) doesn’t have a very nice name for its volume. However, I am too cowardly to try to change the volume name for an android device with gparted (at least not until I can find some notes about it on the internet). As far as I can tell there is no way to tell gnome where to mount devices, and though you can do everything you want at the udev level, this takes gnome features away from you (shortcuts etc. – not that I really use these…).
Thus I’ve been forced to use workarounds.
The following script will sit quietly, wait for a given device to be mounted, and create a sanely named symlink to it.
#!/usr/bin/python
# Silly script to create nicely named symlinks
# for mounts in addition to the less nicely named volume links.
# In general you probably just want to change your volume label.
# However my volume label comes from a phone and I am a coward!

# This script is written to crash out on failure.
# A supervisor should ensure that it is kept alive (like upstart)

import os
import pyinotify
import select
import sys
import syslog
import traceback


def syslog_exceptions():
    hook = sys.excepthook

    def new_hook(type, value, tb):
        hook(type, value, tb)
        output = traceback.format_exception(type, value, tb)
        syslog.syslog(repr(output))
    new_hook.previous_hook = hook
    sys.excepthook = new_hook


syslog.openlog('autolinkmount')
syslog_exceptions()


def log(msg):
    syslog.syslog(msg)
    print(msg)


mount, link = sys.argv[1:]


class EventProcessor(pyinotify.ProcessEvent):
    def process_IN_CREATE(self, event):
        path = os.path.join(event.path, event.name)
        if path == mount:
            os.symlink(mount, link)
            log("%s created. Linked to it from %s." % (path, link))

    def process_IN_DELETE(self, event):
        path = os.path.join(event.path, event.name)
        if path == mount:
            os.unlink(link)
            log("%s removed. Unlinking %s." % (path, link))


def wait_for_exit(thread, signal):
    """Wait for thread to exit and then signal"""
    # Hack to leave main thread free to handle signals
    thread.join()
    signal.set()


# Watch the directory containing the mount point for create/delete events
wm = pyinotify.WatchManager()
notifier = pyinotify.AsyncNotifier(wm, EventProcessor())
notifier.daemon = True
flags = pyinotify.EventsCodes.ALL_FLAGS
mount_dir = os.path.dirname(mount)
wm.add_watch(mount_dir, flags['IN_CREATE'] | flags['IN_DELETE'])
log("Waiting for %s to be mounted..." % mount)
notifier.loop()
Together with the following upstart file this will magically create symlinks for you.
description "Create sane mounts locations for my phone"

start on local-filesystems
stop on runlevel [016]

respawn
exec /root/autolinkmount /media/ /media/phone
My goodness this was a waste of time.
PS: For almost all devices other than a phone you probably want to achieve this effect by changing the volume label.
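The link bookkeeping itself is independent of inotify. As a rough illustration, here is a polling-style sketch of the same create/remove logic, using only the standard library and a scratch directory instead of /media. The function sync_link and the paths are hypothetical, and the demonstration assumes a platform where unprivileged symlinks work.

```python
import os
import tempfile


def sync_link(mount, link):
    """Make `link` exist exactly when `mount` exists; return the action taken."""
    mounted = os.path.isdir(mount)
    linked = os.path.islink(link)  # true even if the link is dangling
    if mounted and not linked:
        os.symlink(mount, link)
        return 'linked'
    if linked and not mounted:
        os.unlink(link)
        return 'unlinked'
    return 'unchanged'


# Demonstration in a scratch directory rather than under /media
base = tempfile.mkdtemp()
mount = os.path.join(base, 'ugly-volume-name')  # hypothetical mount point
link = os.path.join(base, 'phone')

os.mkdir(mount)                # simulate the device being mounted
print(sync_link(mount, link))  # linked
os.rmdir(mount)                # simulate the unmount
print(sync_link(mount, link))  # unlinked
```

Calling this in a loop with a sleep would work, but the inotify version reacts immediately and does no work while nothing changes.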
Code exceptions to syslog in addition to standard out
I’m not sure where the limits of syslog’s proper use lie, but for some use cases writing to syslog is very useful. In particular, syslog serves as “the place to go if there are any problems with hardware, or the low level running of my system”, and having such a common dumping ground is quite useful.
The following function call will start writing exceptions from a python script to syslog in addition to standard error (or wherever you were logging to before)
# Log exceptions to syslog in addition to standard error
import sys
import syslog
import traceback


def syslog_exceptions():
    hook = sys.excepthook

    # The third argument must not be called "traceback", or it would
    # shadow the traceback module used below.
    def new_hook(type, value, tb):
        hook(type, value, tb)
        output = traceback.format_exception(type, value, tb)
        # syslog.syslog needs a string; format_exception returns a list
        syslog.syslog(repr(output))
    new_hook.previous_hook = hook
    sys.excepthook = new_hook
P.S. There was a patch about a year ago to add this to the python standard library… looks like nothing happened, alas.
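The same hook-chaining pattern works for any destination, not just syslog. The sketch below generalises it to an arbitrary callable so it can be tried without a syslog daemon. Here chain_excepthook and the captured list are illustrative names, and the hook is invoked by hand only to show the effect.

```python
import sys
import traceback


def chain_excepthook(sink):
    """Call the existing hook, then pass the formatted traceback to `sink`."""
    previous = sys.excepthook

    def new_hook(exc_type, value, tb):
        previous(exc_type, value, tb)
        sink(''.join(traceback.format_exception(exc_type, value, tb)))

    new_hook.previous_hook = previous
    sys.excepthook = new_hook
    return new_hook


captured = []
hook = chain_excepthook(captured.append)  # a list stands in for syslog.syslog

try:
    1 / 0
except ZeroDivisionError:
    # Call the hook by hand; normally the interpreter does this for
    # uncaught exceptions just before exiting.
    hook(*sys.exc_info())

sys.excepthook = hook.previous_hook  # undo the patch
```

Keeping previous_hook on the new hook makes the patch reversible, which is handy in tests.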
Cloning table schemas
create table x2 ( like x1 INCLUDING DEFAULTS INCLUDING CONSTRAINTS INCLUDING INDEXES );
Python tracing decorator with optional depth argument
The following code sample will print the lines in a function and the lines in the functions it calls up to a certain depth as they are executed. This is quite useful for debugging, since it prevents one from having to do ‘print bisecting’ or similar. The downside is that this can become terribly verbose, particularly if you are tracing functions with loops.
This code is untested for multiple threads
import linecache
import sys


class Trace(object):
    def __init__(self, max_depth):
        self.max_depth = max_depth
        self.depth = 0

    def __call__(self, frame, reason, arg):
        if reason == 'call':
            self.depth += 1
        # max_depth=None means no depth limit
        elif reason == 'line' and (
                self.max_depth is None or self.depth <= self.max_depth):
            filename = frame.f_code.co_filename
            line_no = frame.f_lineno
            line = linecache.getline(filename, line_no)
            print('%s:%s:%s' % (filename, line_no, line), end='')
        elif reason == 'return':
            self.depth -= 1
        return self


def trace(max_depth=None):
    _trace = Trace(max_depth)

    def decorator(f):
        def _f(*args, **kwds):
            sys.settrace(_trace)
            try:
                return f(*args, **kwds)
            finally:
                # Stop tracing even if f raises
                sys.settrace(None)
        return _f
    return decorator
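To see what the depth argument does without reading a screenful of output, here is a self-contained variant that records which functions had lines executed instead of printing them. It is a sketch of the same sys.settrace technique, not the decorator above. The names record_lines, inner and outer are made up for the example.

```python
import sys


def record_lines(max_depth=None):
    """Decorator: record (function name, line number) for executed lines."""
    def decorator(f):
        def wrapper(*args, **kwds):
            state = {'depth': 0}
            wrapper.seen = []

            def tracer(frame, reason, arg):
                if reason == 'call':
                    state['depth'] += 1
                elif reason == 'return':
                    state['depth'] -= 1
                elif reason == 'line' and (
                        max_depth is None or state['depth'] <= max_depth):
                    wrapper.seen.append(
                        (frame.f_code.co_name, frame.f_lineno))
                return tracer

            previous = sys.gettrace()
            sys.settrace(tracer)
            try:
                return f(*args, **kwds)
            finally:
                sys.settrace(previous)  # restore whatever was there before
        wrapper.seen = []
        return wrapper
    return decorator


def inner(x):
    return x + 1


@record_lines(max_depth=1)
def outer(x):
    y = inner(x)
    return y * 2


result = outer(1)
functions_seen = {name for name, _ in outer.seen}
print(result, functions_seen)
```

With max_depth=1 only lines in outer are recorded; raising it to 2 would include inner as well.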
Wifi tethering out of the box on gingerbread without root
I spent quite a while being annoyed at needing root for wireless tethering on my android phone, until I discovered that google had quietly hidden this in the settings menu, under the tethering and portable hot spots section of the wireless and network settings. Once I found this, everything seemed to work fine. I am, however, not sure whether this option will be blocked on slightly more evil mobile services than mine.
Using spareroom efficiently
I find searching sites like gumtree, craigslist or spareroom irritating. The main problem is that I end up wasting time clicking through results that I’ve seen before and generally browsing around. This is partly due to my ineffectualness, but also to do with the layout of the site. Things become particularly annoying if you visit the site frequently.
What I really want is a means to consider each listing once and only once – some sites might be quite good at doing this – by giving you e-mail updates and such like – but I couldn’t easily see a way of making this work.
Dirty hacks to the rescue:
I wrote the following scraper which will perform a search for me and print out a list of urls
# Write-once code - do not pass judgment
import urllib.request
from lxml.etree import HTML

URL = "http://www.spareroom.co.uk/flatshare/search.pl?searchtype=advanced&flatshare_type=offered&location_type=area&search=&miles_from_max=0&showme_rooms=Y&showme_buddyup_properties=Y&min_rent=&max_rent=&per=pw&rooms_for=&no_of_rooms=&available_search=N&day_avail=&mon_avail=&year_avail=&min_age_req=&max_age_req=&min_beds=&max_beds=&keyword=&nmsq_mode=%21nmsq_mode%21&action=search&templateoveride=&x=149&y=13"


def mangle_urls(urls):
    # Strip everything after the first parameter and drop duplicates
    new_matches = list(set([url.split('&', 1)[0] for url in urls]))
    return new_matches


stream = urllib.request.urlopen(URL)
new_url = stream.url  # the search redirects; page offsets hang off this url
page = stream.read()
tree = HTML(page)
matches = mangle_urls(tree.xpath('//a[contains("More info", text())]/@href'))
matches_so_far = set(matches)

# Walk through the remaining result pages, ten listings at a time
for offset in range(10, 300, 10):
    url = new_url + 'offset=%d' % offset
    page = urllib.request.urlopen(url).read()
    tree = HTML(page)
    new_matches = mangle_urls(tree.xpath('//a[contains("More info", text())]/@href'))
    new_matches = set(new_matches) - matches_so_far
    matches_so_far |= set(new_matches)
    matches += list(new_matches)

matches = ['http://www.spareroom.co.uk' + url for url in matches if not url.startswith('http')]
print('\n'.join(matches))
I then took the output and saved it to ‘new_rooms’. With vim and some macro magic, I went through the links one at a time, considering them, possibly sending replies, and then moving the links to an ‘already_considered’ file. Using command line magic I was then able to ensure that the next time I got a list of links I could remove the links I’d already considered.
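The "remove what I've already considered" step amounts to an order-preserving set difference. The post does not say which commands were used, so the following is only a Python sketch of the idea. The function unseen_links and the example URLs are made up, with plain lists standing in for the lines of the two files.

```python
def unseen_links(new_links, considered):
    """Return links from new_links not in considered, keeping first-seen order."""
    seen = set(link.strip() for link in considered)
    result = []
    for link in new_links:
        link = link.strip()
        if link and link not in seen:
            seen.add(link)  # also drops duplicates within new_links
            result.append(link)
    return result


# Hypothetical file contents, one URL per line
new_rooms = [
    'http://example.com/room/1\n',
    'http://example.com/room/2\n',
    'http://example.com/room/3\n',
]
already_considered = ['http://example.com/room/2\n']

print('\n'.join(unseen_links(new_rooms, already_considered)))
```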
Not quite sure if this quite justified the time invested, but it’s coming pretty close. If this script still works when you read this post it could just be worth your time using the script (since you don’t have to write it.)
Kind regards,
Anon
Templating in make: parameterised targets
I just discovered the eval and define keywords in make. These can be used to create parameterised targets. These are useful for a few purposes – one being building for different architectures.
Example use case,
define arch_target
build-$(arch): squid
	# make for architecture...
endef

$(foreach arch,i386 x86_64,$(eval $(arch_target)))