I give myself very good advice, but I very seldom follow it.
Non-trivial operation on data.table columns
This note explores using the data.table package to calculate pairwise correlations between columns, with the iris data set as an example.
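A minimal sketch of the setup, assuming the data.table package is installed. The name `dt` is chosen for illustration and is not from the original note.

```r
library(data.table)

# Copy the built-in iris data frame into a data.table.
# `dt` is a name chosen for this sketch.
dt <- as.data.table(iris)

# Species is a factor; the other four columns are numeric.
str(dt)
```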
The iris data is now data.table-ized
To calculate the correlation between each pair of variables across the whole data set, a regular call to cor() is sufficient.
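As a rough illustration, the call might look like the following. The names `dt`, `num_cols`, and `cor_all` are assumptions made for this sketch. The Species column is dropped first because cor() requires numeric input.

```r
library(data.table)
dt <- as.data.table(iris)

# cor() needs numeric input, so exclude the Species factor.
num_cols <- setdiff(names(dt), "Species")

# Select the numeric columns by name and compute the full correlation matrix.
cor_all <- cor(dt[, num_cols, with = FALSE])
cor_all
```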
If we want to calculate the correlation between each pair of variables within each Species, we can partition the data by Species first and call cor() on each partition. If there are $k$ species and $n$ numeric variables, we will need to investigate $k$ correlation matrices, each of dimension $n\times n$.
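One possible way to sketch the partition-then-repeat approach uses split() with its data.table `by` argument. The names `parts` and `cor_by_species` are hypothetical. For iris this gives $k = 3$ matrices of dimension $4 \times 4$.

```r
library(data.table)
dt <- as.data.table(iris)
num_cols <- setdiff(names(dt), "Species")

# split() on a data.table returns a named list with one data.table per species.
parts <- split(dt, by = "Species")

# One full correlation matrix per partition.
cor_by_species <- lapply(parts, function(d) cor(d[, num_cols, with = FALSE]))

# Inspect the matrix for a single species.
cor_by_species[["setosa"]]
```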
Often we are only interested in the correlation of one variable with the remaining ones. The desired output corresponds to one row (or column) of the correlation matrix. Calculating the whole matrix then does unnecessary work, which can become expensive when there are many variables. There are several ways to calculate only the values of interest and collect them into one correlation matrix by Species:
Loop over the partitions of the data set, calculate the correlation within each, and bind all the results together.
The final binding can be done with sapply, do.call(rbind), or do.call(cbind).
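The loop-and-bind approach could be sketched as follows. The target variable (Sepal.Length here) and the names `target`, `others`, `cor_with_target`, `res_cols`, and `res_rows` are assumptions for illustration, not taken from the original note.

```r
library(data.table)
dt <- as.data.table(iris)

target <- "Sepal.Length"                            # hypothetical choice of target
others <- setdiff(names(dt), c("Species", target))  # remaining numeric columns

# Correlation of each remaining column with the target, within one partition.
cor_with_target <- function(d) {
  sapply(others, function(v) cor(d[[v]], d[[target]]))
}

parts <- split(dt, by = "Species")

# sapply() binds the per-species vectors as columns: variables x species.
res_cols <- sapply(parts, cor_with_target)

# do.call(rbind, ...) binds them as rows: species x variables.
res_rows <- do.call(rbind, lapply(parts, cor_with_target))
res_rows
```

The two results hold the same numbers in transposed layouts, so the choice between them only depends on whether species should index the rows or the columns.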
It is tempting to exploit the power of data.table group-by. However, I haven’t found a way to make it work in a reproducible manner.
For example, the code below attempts to apply a correlation function to (x, y), with x being the variables in the data set and y the target variable, but it could not subset the target variable correctly.