Human string comparison in Python

March 18, 2009 at 12:35 am

NOTE: An alternative way to do this

I’ve found a possible alternative to this. If you only want to sort strings according to your current locale, you can use locale.strcoll, a cmp-style function that compares two strings using the locale’s collation rules.

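import locale
locale.setlocale(locale.LC_COLLATE, "")  # pick up the locale from the environment
words = ["one", "two", "three"]
words.sort(locale.strcoll)  # Python 2: the first argument to sort is a cmp function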

On my machine this still isn’t a very human friendly sort but depending on your LC_COLLATE environment variable it may be. This is very much less of a reinvent-the-wheel-because-buying-one-is-too-hard approach – though I suspect it might take one 4 times as long to get working…
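In Python 3, list.sort no longer accepts a cmp function, so the same idea needs either functools.cmp_to_key(locale.strcoll) or locale.strxfrm as a key. What follows is a minimal sketch rather than a tested recipe. The helper name locale_sorted is made up for illustration, and it assumes the requested locale is installed on the machine.

```python
import locale

def locale_sorted(words, locale_name=""):
    """Return words sorted by the collation rules of locale_name.

    locale_sorted is a hypothetical helper, not part of the standard library.
    An empty locale_name means "use the environment" (LC_ALL, LC_COLLATE, LANG).
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        # Raises locale.Error if the locale is not available on this system.
        locale.setlocale(locale.LC_COLLATE, locale_name)
        # strxfrm turns each string into a key whose ordinary comparison
        # agrees with what strcoll would report for the original strings.
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)  # restore the old setting

# The "C" locale always exists; it gives plain byte-wise ordering.
print(locale_sorted(["two", "one", "three"], "C"))
```

As far as I know, Python stays in the "C" locale for collation until setlocale is called, which would explain a locale sort that looks no friendlier than a plain one. Passing a real locale name (or "" to use the environment) is what changes the order.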

Of course, at times what you want your command line to do isn’t what you want all other programs to do…

Python’s default string comparison does not order lists as one would expect:

>>> l = ["1", "Hello", "abc", "?", "}"]
>>> l.sort()
>>> l
['1', '?', 'Hello', 'abc', '}']

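Python sorts strings as sequences of bytes: it compares character codes one by one, so in ASCII the digits come before the uppercase letters, which come before the lowercase letters, with punctuation scattered in between.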
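To see why the default order comes out that way, it helps to look at the character codes directly. This small sketch is written for Python 3, where str compares by code point rather than by byte, though for ASCII text the result is the same. The variable names are only for illustration.

```python
words = ["1", "Hello", "abc", "?", "}"]

# The code of each word's first character decides most of the ordering here.
first_codes = {w: ord(w[0]) for w in words}

# Sorting on explicit lists of character codes matches the default sort.
by_default = sorted(words)
by_code = sorted(words, key=lambda w: [ord(c) for c in w])

print(first_codes)
print(by_default == by_code)
```

The codes are 49 for "1", 63 for "?", 72 for "H", 97 for "a" and 125 for "}", which is exactly the order the plain sort produces.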

I couldn’t find any easy way to sort strings in a more friendly fashion (though surely one must exist?!), so here is a quick implementation of one (use this code as you wish):

from itertools import chain

letters = list(chain(*zip(range(0x41, 0x41 + 26), 
               range(0x61, 0x61 + 26)))) # interspersed upper and lower

numbers = range(0x30, 0x30 + 10)

symbols = sorted(list(
        set(range(0x80)) - set(letters) - set(numbers)))

weird_order = list(chain(letters, numbers, symbols))

assert set(weird_order) == set(range(0x80)), "Didn't get every character"

def weird_strcmp(a, b):
    """Compare strings for lists for English humans who hate 
         unicode and aren't using python 3k any time soon."""

    a = map(ord, a)
    b = map(ord, b)

    for x in chain(a, b):
        assert x < 0x80, "%s is not in the ascii character range" % x
    a = map(weird_order.index, a)
    b = map(weird_order.index, b)
    return cmp(a, b)

>>> l.sort(weird_strcmp)
>>> l
['abc', 'Hello', '1', '?', '}']
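The code above relies on Python 2 features (cmp, list-returning map and a cmp argument to sort). For anyone on Python 3, the same ordering can be sketched with a key function instead. This is my assumption of a reasonable port, not a drop-in replacement; the names RANK and human_key are hypothetical.

```python
import string

# Same ordering idea as weird_order: interleaved letters (A, a, B, b, ...),
# then digits, then every remaining ASCII character in code order.
letters = [c for pair in zip(string.ascii_uppercase, string.ascii_lowercase)
           for c in pair]
digits = list(string.digits)
symbols = sorted(set(map(chr, range(0x80))) - set(letters) - set(digits))

# RANK (a hypothetical name) maps each ASCII character to its sort position.
# A dict lookup avoids the repeated linear list.index search.
RANK = {ch: i for i, ch in enumerate(letters + digits + symbols)}

def human_key(s):
    """Sort key for ASCII-only strings; raises ValueError on anything else."""
    try:
        return [RANK[ch] for ch in s]
    except KeyError as exc:
        raise ValueError("%r is not in the ASCII range" % exc.args[0]) from None

words = ["1", "Hello", "abc", "?", "}"]
print(sorted(words, key=human_key))
```

A key function is called once per item rather than once per comparison, so this should also do less work than the cmp version on long lists.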

Be aware that this isn’t really tested and may not order symbols how you think they should be ordered. It does, however, have the advantage of not taking you 15-30 minutes to write and debug. Let me know if you can think of any foibles and I’ll fix them.

