python-pandas vs R
By Pierre-Yves on Tuesday, December 4 2012, 11:52 - General
Speed comparison of python-pandas and R for some Spearman correlations
Today I had to compute a number of Spearman correlations for work.
Yesterday I watched Wes McKinney's talk on pandas at PyCon 2012 (great talk if you haven't seen it!), so I thought it would be cool to have a look at python-pandas, since a) I like Python, b) I don't know pandas, and c) pandas can do Spearman correlation.
So I started to write some code in Python using pandas:
#!/usr/bin/python
#-*- coding: UTF-8 -*-

import pandas


def main(inputfile='binary_matrix.csv'):
    # Read the matrix, then transpose it so the original rows become columns.
    data = pandas.read_csv(inputfile, index_col=0)
    data = data.transpose()
    cnt_core = 0
    # Start at 1 so that each column is paired with the one before it;
    # starting at 0 would pair the first column with the last (index -1).
    for cnt in range(1, len(data.columns)):
        correlation = data[data.columns[cnt - 1]].corr(
            data[data.columns[cnt]], method='spearman')
        cnt_core = cnt_core + 1
        # Skip pairs for which no coefficient could be computed.
        if not pandas.isnull(correlation):
            print("%s correlates with %s with coefficient %f" % (
                data.columns[cnt - 1], data.columns[cnt], correlation))
    print("%i correlations performed" % cnt_core)


if __name__ == '__main__':
    main('binary_matrix.csv')
Nothing fancy, just a simple main function looping over the columns to correlate each with the previous one.
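For readers who have not met it before: the Spearman coefficient is the Pearson correlation of the ranks of the two variables, with tied values sharing their average rank. The sketch below spells that out in plain Python, independent of pandas. The function names (`average_ranks`, `pearson`, `spearman`) are made up for this illustration and are not part of pandas or R, and it assumes there are no missing values.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value ties with the current one.
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        # Mean of the 1-based positions start+1 .. end+1.
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; NaN when either input has zero variance."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return math.nan
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


print(spearman([0, 1, 0, 1], [0, 1, 1, 1]))
```

A constant vector has no variance, so its coefficient is undefined. That is one situation in which a null check like the one in the script matters, and on a binary matrix constant columns are plausible.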
Then I wanted to check the results, and since I know R, I wrote similar code in R:
data <- read.table('binary_matrix.csv',
    row.names=1, header=TRUE, sep=",", quote = "\"'")
data <- as.data.frame(t(data))
cnt_core <- 0
for (cnt in 2:ncol(data)){
    correlation <- cor(data[cnt - 1], data[cnt],
        method='spearman', use='pairw')
    cnt_core <- cnt_core + 1
    if (! is.na(correlation)){
        print(sprintf("%s correlates with %s with coefficient %f",
            colnames(data)[cnt -1], colnames(data)[cnt],
            correlation))
    }
}
print(sprintf("%i correlations performed", cnt_core))
For the record, the input is a matrix of 36 columns by 35483 rows (which is transposed just after reading).
And of course I timed both runs:
$ time python spearman_correlation.py
[...]
35482 correlations performed

real 0m20.379s
user 0m20.112s
sys 0m0.178s
and
$ time Rscript spearman_correlation.R
[...]
[1] "35482 correlations performed"

real 0m32.907s
user 0m32.549s
sys 0m0.182s
Note: although I do not show the results here, trust me, they were equal.
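The comparison itself is not shown in this post. One way such a check could be automated is to parse the lines printed by both scripts and compare the coefficients pair by pair. This is only a sketch: the helper names (`parse_output`, `same_results`) are hypothetical, it assumes the column names contain no whitespace or double quotes, and it uses a tolerance just above 1e-6 because both scripts print six decimals.

```python
import math
import re

# Matches the message printed by both scripts. R wraps it as [1] "...",
# so names are not allowed to contain whitespace or double quotes.
LINE = re.compile(
    r'([^\s"]+) correlates with ([^\s"]+) with coefficient (-?\d+\.\d+)')


def parse_output(text):
    """Map (first name, second name) to the coefficient printed for that pair."""
    results = {}
    for match in LINE.finditer(text):
        first, second, coef = match.groups()
        results[(first, second)] = float(coef)
    return results


def same_results(python_text, r_text, tol=1.5e-6):
    """True if both outputs report the same pairs with matching coefficients."""
    py = parse_output(python_text)
    r = parse_output(r_text)
    return py.keys() == r.keys() and all(
        math.isclose(py[key], r[key], abs_tol=tol) for key in py)
```

With the output of each script saved to a file, `same_results(open('python.out').read(), open('r.out').read())` would return `True` when every reported pair agrees; the two file names are placeholders.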
So, over 35,482 correlations, Python needed about 38% less wall-clock time than R (20.4 s versus 32.9 s). I find that quite impressive.
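How large the gap looks depends on which run is taken as the baseline. A small sketch of the arithmetic, using the two "real" times reported above (the helper `to_seconds` is just for this illustration):

```python
import re


def to_seconds(stamp):
    """Convert a value printed by `time`, such as '0m20.379s', to seconds."""
    minutes, seconds = re.fullmatch(r'(\d+)m([\d.]+)s', stamp).groups()
    return int(minutes) * 60 + float(seconds)


python_real = to_seconds('0m20.379s')
r_real = to_seconds('0m32.907s')

# Baseline R: how much less time did Python need?
time_saved = (r_real - python_real) / r_real * 100
# Baseline Python: how much more time did R need?
extra_time = (r_real - python_real) / python_real * 100

print("Python needed %.1f%% less time than R" % time_saved)
print("R needed %.1f%% more time than Python" % extra_time)
```

Both figures describe the same single pair of runs, so they say nothing about run-to-run variation.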
Comments
http://www.youtube.com/watch?v=e08k...
Might be interesting for you :)