Scaling data.table using index

Written on 2015-11-23

R can handle fairly big data working on a single machine, 2B (2E9) rows and couple of columns require about 100 GB of memory.
This is already well enough to care about performance.
With this post I'm going discuss scalability of filter queries.


The index has been introduced to data.table in 1.9.4. It is also known as secondary keys. Unlike with key, a single data.table can have multiple indexes.
It basically store additional vector of rows order as data.table attribute.
Sounds really simple, it is even better because user does not have use them in any special way - use of index is automatically handled in data.table.
And the performance gains are big enough to write a post on that.


What you should know about data.table index (as of 2015-11-23):

  • index will be used when subsetting dataset with == or %in% on a single variable
  • by default if index for a variable is not present on filtering, it is automatically created and used
  • indexes are lost if you change the order of data
  • you can check if you are using index with options(datatable.verbose=TRUE)

Above features are likely to be improved in future.

  • also important to mention, there is an open FR to automatically utilize index when doing unkeyed join (new feature in 1.9.6) - using new on argument. So in future version user will be able to leverage mighty performance of indexes for joining datasets.

Brief look at the structure:

library(data.table)
op = options(datatable.verbose=TRUE,
             datatable.auto.index=TRUE)
dt = data.table(a=letters[c(3L,1L,2L)])
set2keyv(dt, "a")
## forder took 0 sec
attr(dt, "index")
## integer(0)
## attr(,"__a")
## [1] 2 3 1
dt[a=="b"]
## Using existing index 'a'
## Starting bmerge ...done in 0 secs
##    a
## 1: b
dt[a %in% c("b","c")]
## Using existing index 'a'
## Starting bmerge ...done in 0 secs
##    a
## 1: c
## 2: b
options(op)

So how it looks in practice. I will compare base R data.frame, data.table and indexed data.table. You can try other tool, I doubt if you will get better performance in any other tool, not just other R package.
The volumes tested are 1e7, 5e7 and 1e8 rows. Should works fine on 8GB memory.

Some helper function.

# easy control usage of index and verbose
with_index = function(x, auto.index=TRUE, verbose=TRUE){
    op=options("datatable.auto.index"=auto.index, "datatable.verbose"=verbose)
    on.exit(op)
    x
}

1e7

set.seed(123)
n = 1e7
dt = data.table(high = sample(n*0.9, n, TRUE), normal = sample(n*0.1, n, TRUE), low = sample(10, n, TRUE), value = rnorm(n))
df = as.data.frame(dt)
set2keyv(dt, "high")
high.filter = sample(dt$high, 1L)
df.r = df[df$high==high.filter,]
dt.r = with_index(dt[high==high.filter])
## Using existing index 'high'
## Starting bmerge ...done in 0 secs
dti.r = with_index(dt[high==high.filter], auto.index = FALSE)
all.equal(as.data.table(df.r), dt.r) && all.equal(dt.r, dti.r)
## [1] TRUE
library(microbenchmark)
mb = list()
mb[["1e7"]] = microbenchmark(times = 10L,
    data.frame = df[df$high==high.filter,],
    data.table = with_index(dt[high==high.filter], auto.index = FALSE, verbose = FALSE),
    data.table.index = with_index(dt[high==high.filter], auto.index = TRUE, verbose = FALSE)
)
print(mb[["1e7"]])
## Unit: microseconds
##              expr       min         lq        mean     median         uq
##        data.frame 130884.40 131268.799 136008.4565 132336.511 135063.196
##        data.table  25237.72  25273.030  25986.2384  25683.307  26758.476
##  data.table.index    529.61    536.632    625.7158    619.203    682.958
##         max neval
##  166294.227    10
##   26987.085    10
##     760.766    10

5e7

## Unit: microseconds
##              expr       min         lq        mean     median        uq
##        data.frame 640864.68 643263.534 647379.2399 644586.934 645459.86
##        data.table 122280.32 122311.531 123419.3050 122419.632 124901.38
##  data.table.index    569.55    671.838    705.8739    703.111    805.41
##         max neval
##  678550.008    10
##  125660.152    10
##     812.037    10

1e8

## Unit: microseconds
##              expr        min          lq         mean      median
##        data.frame 1274080.15 1276696.438 1282344.6489 1278789.568
##        data.table  243092.33  243347.848  245446.5897  243854.973
##  data.table.index     533.84     575.047     681.7643     681.873
##          uq         max neval
##  1281497.67 1314164.479    10
##   248166.62  249478.923    10
##      752.13     817.687    10

Timing summary

How fast is data.table index and how it scales?

## mean seconds
##                expr    1e7    5e7    1e8
## 1:       data.frame 0.1360 0.6474 1.2823
## 2:       data.table 0.0260 0.1234 0.2454
## 3: data.table.index 0.0006 0.0007 0.0007
## relative
##                expr       1e7      5e7       1e8
## 1:       data.frame 226.66667 924.8571 1831.8571
## 2:       data.table  43.33333 176.2857  350.5714
## 3: data.table.index   1.00000   1.0000    1.0000

plot of chunk plot_timing

On the 1e8 rows the indexed data.table solution is ~1831.86 times faster than data.frame and ~350.57 times faster than non-index data.table.


Scaling data.table index even further for big data?

If you don't have a single machine good enough to handle a data.table in memory you can stil preserve the data.table's index performance.
You need to split your data into separate instances of R, index each of them. Then just rbind results queried from each instance.
That is pretty easy with Rserve, but since this is a topic for separate post I will leave you with basic working example.

library(Rserve)
library(RSclient)
port = 6311:6312
# start nodes
sapply(port, function(port) Rserve(debug = FALSE, port = port, args = c("--no-save")))
# connect nodes
rscl = lapply(setNames(port, port), function(port) RS.connect(port=port))
# populate data, 5M rows in each node
qcall = quote({
    stopifnot(suppressPackageStartupMessages(require("data.table", character.only = TRUE, quietly = TRUE)))
    set.seed(123)
    n = 5e6
    x <- data.table(high = sample(n*0.9, n, TRUE), normal = sample(n*0.1, n, TRUE), low = sample(10, n, TRUE), value = rnorm(n))
    high.filter <- sample(x$high, 1L)
    set2keyv(x, "high")
    TRUE
})
sapply(rscl, RS.eval, qcall, lazy=FALSE)
## 6311 6312 
## TRUE TRUE
# query using index, capture data.table verbose messages
qcall = quote({
    op = options(datatable.auto.index=TRUE, datatable.verbose=TRUE)
    prnt = capture.output(r <- x[high==high.filter])
    options(op)
    list(verbose = prnt, results = r)
})
l = lapply(rscl, RS.eval, qcall, lazy=FALSE)
# datatable.verbose from each node
invisible(lapply(lapply(l, `[[`, "verbose"), cat, sep="\n"))
## Using existing index 'high'
## Starting bmerge ...done in 0 secs
## Using existing index 'high'
## Starting bmerge ...done in 0 secs
# results from each node
lapply(l, `[[`, "results")
## $`6311`
##       high normal low      value
## 1: 3188512  27799   6 -0.6886669
## 2: 3188512 471094   5 -1.8107128
## 3: 3188512 221944   7  0.3489058
## 
## $`6312`
##       high normal low      value
## 1: 3188512  27799   6 -0.6886669
## 2: 3188512 471094   5 -1.8107128
## 3: 3188512 221944   7  0.3489058

Reproducibility

You can find script of blog post in Rmarkdown format in the blog github repo.
If you have any comments feel free to put them into github issue.