An Eager Avocado

Eager Avocado

I give myself very good advice, but I very seldom follow it.

data.table subsetting

,

The data.table package supports a powerful syntax to select rows and columns.

Selecting a single column

library(data.table)
data("iris")
iris = iris[sample.int(nrow(iris),size=10,replace = FALSE),]
DT = data.table(iris)
DT
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
##  1:          6.0         2.7          5.1         1.6 versicolor
##  2:          6.1         2.8          4.7         1.2 versicolor
##  3:          5.8         2.7          4.1         1.0 versicolor
##  4:          4.4         3.2          1.3         0.2     setosa
##  5:          6.4         3.2          4.5         1.5 versicolor
##  6:          4.6         3.1          1.5         0.2     setosa
##  7:          6.9         3.1          4.9         1.5 versicolor
##  8:          5.7         4.4          1.5         0.4     setosa
##  9:          5.7         2.8          4.1         1.3 versicolor
## 10:          6.2         3.4          5.4         2.3  virginica

The syntax below returns a one-column data.table

DT[,.(Species)]
##        Species
##  1: versicolor
##  2: versicolor
##  3: versicolor
##  4:     setosa
##  5: versicolor
##  6:     setosa
##  7: versicolor
##  8:     setosa
##  9: versicolor
## 10:  virginica

This one returns a vector

DT[,Species]
##  [1] versicolor versicolor versicolor setosa     versicolor setosa    
##  [7] versicolor setosa     versicolor virginica 
## Levels: setosa versicolor virginica

However, when with=FALSE

DT[,"Species",with=FALSE]

it will always return a data.table

##        Species
##  1: versicolor
##  2: versicolor
##  3: versicolor
##  4:     setosa
##  5: versicolor
##  6:     setosa
##  7: versicolor
##  8:     setosa
##  9: versicolor
## 10:  virginica

This is because the input for column subset is always a vector, even with length 1. To get a single column as a vector, we can use list subsetting syntax, since data.table is also a data.frame

DT[["Species"]]
##  [1] versicolor versicolor versicolor setosa     versicolor setosa    
##  [7] versicolor setosa     versicolor virginica 
## Levels: setosa versicolor virginica

What if we also want the row-filtering power of data.table? It gets tricky because the syntax for row filtering only works with a column, not a column name.

This one works

DT[Species=="setosa"]
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1:          4.4         3.2          1.3         0.2  setosa
## 2:          4.6         3.1          1.5         0.2  setosa
## 3:          5.7         4.4          1.5         0.4  setosa

But this one doesn’t

DT["Species" == "setosa"]
## Empty data.table (0 rows) of 5 cols: Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species

To use a dynamic column name in row subsetting, we need to rely on [[

DT[DT[["Species"]] == 'setosa']
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1:          4.4         3.2          1.3         0.2  setosa
## 2:          4.6         3.1          1.5         0.2  setosa
## 3:          5.7         4.4          1.5         0.4  setosa

Does this compromise performance?

myCol = "Species"
bm = data.table(microbenchmark(times=100,
               DT[Species=='setosa'],
               DT[DT[[myCol]] == 'setosa'],
               iris[iris[[myCol]] == 'setosa',]))
knitr::kable(bm[,.("Mean (microsec)" =mean(time)/1000, "Median (microsec)" =median(time)/1000), by=expr])
expr Mean (microsec) Median (microsec)
DT[Species == “setosa”] 1418.0504 1126.9530
DT[DT[[myCol]] == “setosa”] 518.5885 408.5935
iris[iris[[myCol]] == “setosa”, ] 240.9830 191.2095

The bulky syntax turned out to outperform the neater ones, and to my surprise, operation on data.frame is more efficient than on data.table.

Assigment Operator :=

Assignment (or sub-assignment) is done in place. So we should expect this to change the original data.table

DT[Species=='versicolor',Species:='versi']

By chaining the assignment to a previous selection, we’re only modifying the copy.

DT[Species=='virginica'][,Species:='virgi']

The original data.table remains unchanged.

##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
##  1:          6.0         2.7          5.1         1.6     versi
##  2:          6.1         2.8          4.7         1.2     versi
##  3:          5.8         2.7          4.1         1.0     versi
##  4:          4.4         3.2          1.3         0.2    setosa
##  5:          6.4         3.2          4.5         1.5     versi
##  6:          4.6         3.1          1.5         0.2    setosa
##  7:          6.9         3.1          4.9         1.5     versi
##  8:          5.7         4.4          1.5         0.4    setosa
##  9:          5.7         2.8          4.1         1.3     versi
## 10:          6.2         3.4          5.4         2.3 virginica