Today I had to run a number of Spearman correlations for work.

Yesterday I watched Wes McKinney's talk on pandas at PyCon 2012 (a great talk, if you haven't seen it!), so I thought it would be cool to have a look at python-pandas, since a) I like Python, b) I don't know pandas and c) pandas can do Spearman correlation.

So I started to write some code in Python using pandas:

#!/usr/bin/python
#-*- coding: UTF-8 -*-


import pandas


def main(inputfile='binary_matrix.csv'):
    data = pandas.read_csv(inputfile, index_col=0)
    data = data.transpose()
    cnt_core = 0
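    # Start at 1 so that column cnt is always compared with its predecessor;
    # starting at 0 would pair the first column with the last one (index -1).
    for cnt in range(1, len(data.columns)):
        correlation = data[data.columns[cnt - 1]].corr(
            data[data.columns[cnt]], method='spearman')
        cnt_core = cnt_core + 1
        # A constant column has no defined correlation and yields NaN.
        if not pandas.isnull(correlation):
            print("%s correlates with %s with coefficient %f" % (
                data.columns[cnt - 1], data.columns[cnt], correlation))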
    
    print("%i correlations performed" % cnt_core)

if __name__ == '__main__':
    main('binary_matrix.csv')

Nothing fancy, just a simple main function looping over the columns to correlate each with the previous one.
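For readers who have not met it before: the Spearman coefficient is the Pearson correlation computed on the ranks of the values, with tied values sharing their average rank. The sketch below spells that out in plain Python. It is only an illustration of the definition, not how pandas or R implement it, and the helper names (`average_ranks`, `pearson`, `spearman`) are made up for this example.

```python
def average_ranks(values):
    """Rank values from 1 to n; tied values get the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run for as long as the next value ties with this one.
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # a constant column has no defined correlation
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson on the rank-transformed values."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    print(spearman([0, 1, 0, 1], [0, 1, 1, 1]))
```

Note the `nan` returned for a constant column: with binary data that case comes up easily, which is why the script above filters its results with `pandas.isnull`.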

Then I wanted to check the results, and since I know R, I wrote similar code in R:

data <- read.table('binary_matrix.csv',
    row.names=1, header=TRUE, sep=",", quote = "\"'")

data <- as.data.frame(t(data))

cnt_core <- 0
for (cnt in 2:ncol(data)){
    correlation <- cor(data[cnt - 1], data[cnt],
        method='spearman', use='pairw')
    cnt_core <- cnt_core + 1
    if (! is.na(correlation)){
        print(sprintf("%s correlates with %s with coefficient %f",
                colnames(data)[cnt -1], colnames(data)[cnt],
                correlation))
    }
}

print(sprintf("%i correlations performed", cnt_core))

For the record, the input is a matrix of 36 columns by 35483 rows (which is transposed just after reading).

And of course I timed both scripts:

$ time python spearman_correlation.py
[...]
35482 correlations performed

real	0m20.379s
user	0m20.112s
sys	0m0.178s

and

$ time Rscript spearman_correlation.R 
[...]
[1] "35482 correlations performed"

real	0m32.907s
user	0m32.549s
sys	0m0.182s

Note: although I do not show the results here, trust me, they were equal.
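How the two outputs were compared is not shown here. One way to do it, as a rough sketch, is to parse the printed lines from both scripts and compare the coefficients pair by pair. The function names and the tolerance are assumptions made for this example, and it assumes that column names contain no whitespace or quotes and print identically in Python and R.

```python
import math
import re

# Matches the message printed by both scripts. R wraps it as `[1] "..."`;
# excluding quotes from the names lets the search skip that wrapper.
LINE = re.compile(
    r'([^\s"]+) correlates with ([^\s"]+) with coefficient (-?\d+\.\d+)')


def parse(lines):
    """Map (left, right) column-name pairs to the printed coefficient."""
    results = {}
    for line in lines:
        match = LINE.search(line)
        if match:
            results[(match.group(1), match.group(2))] = float(match.group(3))
    return results


def same_results(python_lines, r_lines, tol=1e-6):
    """True if both outputs list the same pairs with matching coefficients."""
    py, r = parse(python_lines), parse(r_lines)
    return py.keys() == r.keys() and all(
        math.isclose(py[key], r[key], abs_tol=tol) for key in py)
```

Each script's output would first be redirected to a file, and the two files read line by line into `same_results`.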

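So, over 35,482 correlations, the Python script took about 38% less wall-clock time than the R one (20.4 s versus 32.9 s), which makes it roughly 1.6 times as fast. I find that quite impressive.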
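The relative difference can be recomputed from the `real` times reported above. A quick sketch with the two figures hard-coded; the exact percentage depends on which run time is taken as the baseline, and here it is R's:

```python
python_s = 20.379  # "real" time reported for the pandas script, in seconds
r_s = 32.907       # "real" time reported for the R script, in seconds

time_saved = (r_s - python_s) / r_s  # fraction of R's run time saved
speedup = r_s / python_s             # how many times as fast Python was

print("time saved: %.1f%%, speedup: %.2fx" % (time_saved * 100, speedup))
# prints: time saved: 38.1%, speedup: 1.61x
```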