An Eager Avocado

Eager Avocado

I give myself very good advice, but I very seldom follow it.

Non-trivial operation on data.table columns

,

This note explores the use of data.table package to calculate pairwise correlation between columns, with iris data set as example.

library(data.table)
DT = data.table(iris)

The iris data is now data.table-ized

##      Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
##   1:          5.1         3.5          1.4         0.2    setosa
##   2:          4.9         3.0          1.4         0.2    setosa
##   3:          4.7         3.2          1.3         0.2    setosa
##   4:          4.6         3.1          1.5         0.2    setosa
##   5:          5.0         3.6          1.4         0.2    setosa
##  ---                                                            
## 146:          6.7         3.0          5.2         2.3 virginica
## 147:          6.3         2.5          5.0         1.9 virginica
## 148:          6.5         3.0          5.2         2.0 virginica
## 149:          6.2         3.4          5.4         2.3 virginica
## 150:          5.9         3.0          5.1         1.8 virginica

To calculate the correlation between each pair of variables across the whole data set, a regular cor call is sufficient

everything = cor(DT[, !c('Species'),with=FALSE],method='pearson')
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
## Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
## Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
## Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000

If we want to calculate the correlation between each pair of variables within each Species, we can choose to partition by Species first, and repeat the call to cor(). If there are $k$ different species, we will need to investigate $k$ correlation matrices, each of which is of dimension $n\times n$.

species = unique(DT[,Species])
for (sp in species) {
    print(paste("Correlation within Species", sp))
    print(cor(DT[Species==sp,!c('Species'),with=FALSE], method='pearson'))
}
## [1] "Correlation within Species setosa"
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    1.0000000   0.7425467    0.2671758   0.2780984
## Sepal.Width     0.7425467   1.0000000    0.1777000   0.2327520
## Petal.Length    0.2671758   0.1777000    1.0000000   0.3316300
## Petal.Width     0.2780984   0.2327520    0.3316300   1.0000000
## [1] "Correlation within Species versicolor"
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    1.0000000   0.5259107    0.7540490   0.5464611
## Sepal.Width     0.5259107   1.0000000    0.5605221   0.6639987
## Petal.Length    0.7540490   0.5605221    1.0000000   0.7866681
## Petal.Width     0.5464611   0.6639987    0.7866681   1.0000000
## [1] "Correlation within Species virginica"
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    1.0000000   0.4572278    0.8642247   0.2811077
## Sepal.Width     0.4572278   1.0000000    0.4010446   0.5377280
## Petal.Length    0.8642247   0.4010446    1.0000000   0.3221082
## Petal.Width     0.2811077   0.5377280    0.3221082   1.0000000

Many times we are only interested in the correlation of one variable with the remaining. The desired output corresponds to one row (or column) of the correlation matrix. In most cases, calculating the whole matrix can be computationally expensive, and unnecessary. To calculate only the values of interest and get a correlation matrix by Species, there are several ways

  1. Loop over the partitions of data set, calculate the correlation, and bind all the result together
target = "Sepal.Length"
allCols = setdiff(names(DT), "Species")
cm = list()
for (sp in species) {
    cm[[sp]] =  sapply(allCols, function(x) cor(DT[Species==sp, .SD, .SDcols=c(target,x)])[1,2])
}
cm.mat = sapply(cm,c)
print(cm.mat)
##                 setosa versicolor virginica
## Sepal.Length 1.0000000  1.0000000 1.0000000
## Sepal.Width  0.7425467  0.5259107 0.4572278
## Petal.Length 0.2671758  0.7540490 0.8642247
## Petal.Width  0.2780984  0.5464611 0.2811077

The final binding can be done either with sapply, do.call(rbind), or do.call(cbind)

sapply(cm,c)
##                 setosa versicolor virginica
## Sepal.Length 1.0000000  1.0000000 1.0000000
## Sepal.Width  0.7425467  0.5259107 0.4572278
## Petal.Length 0.2671758  0.7540490 0.8642247
## Petal.Width  0.2780984  0.5464611 0.2811077
do.call(rbind,cm)
##            Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa                1   0.7425467    0.2671758   0.2780984
## versicolor            1   0.5259107    0.7540490   0.5464611
## virginica             1   0.4572278    0.8642247   0.2811077
do.call(cbind,cm)
##                 setosa versicolor virginica
## Sepal.Length 1.0000000  1.0000000 1.0000000
## Sepal.Width  0.7425467  0.5259107 0.4572278
## Petal.Length 0.2671758  0.7540490 0.8642247
## Petal.Width  0.2780984  0.5464611 0.2811077
  1. It is tempting to exploit the power of data.table group-by. However, I haven’t found a way to make it work in a reproducible manner.

For example, the code below attempts to apply a correlation function on (x,y), with x being the variables in the data set, and y is the target variable, but it could not subset the target variable correctl

corr.with.me = function(x,group) {
    y = DT[Species==group, Sepal.Length]
    print("y:")
    print(y)
    return(cor(x,y,use="complete.obs"))
    # return(cor(x, DT[Species==group, .SD, .SDcols=c("Sepal.Length")], method='pearson',use = "complete.obs"))
}

DT[,lapply(.SD,corr.with.me, .BY),by=Species]
## [1] "y:"
## numeric(0)
## Error in cor(x, y, use = "complete.obs"): incompatible dimensions