Scaling data.table using index
R can handle fairly big data working on a single machine, 2B (2E9) rows and couple of columns require about 100 GB of memory.
This is already well enough to care about performance.
With this post I'm going discuss scalability of filter queries.
The index has been introduced to data.table in 1.9.4. It is also known as secondary keys. Unlike with key, a single data.table can have multiple indexes.
It basically store additional vector of rows order as data.table attribute.
Sounds really simple, it is even better because user does not have use them in any special way - use of index is automatically handled in data.table.
And the performance gains are big enough to write a post on that.
What you should know about data.table index (as of 2015-11-23):
- index will be used when subsetting dataset with
==
or%in%
on a single variable - by default if index for a variable is not present on filtering, it is automatically created and used
- indexes are lost if you change the order of data
- you can check if you are using index with
options(datatable.verbose=TRUE)
Above features are likely to be improved in future.
- also important to mention, there is an open FR to automatically utilize index when doing unkeyed join (new feature in 1.9.6) - using new on argument. So in future version user will be able to leverage mighty performance of indexes for joining datasets.
Brief look at the structure:
library(data.table)
op = options(datatable.verbose=TRUE,
datatable.auto.index=TRUE)
dt = data.table(a=letters[c(3L,1L,2L)])
set2keyv(dt, "a")
## forder took 0 sec
attr(dt, "index")
## integer(0)
## attr(,"__a")
## [1] 2 3 1
dt[a=="b"]
## Using existing index 'a'
## Starting bmerge ...done in 0 secs
## a
## 1: b
dt[a %in% c("b","c")]
## Using existing index 'a'
## Starting bmerge ...done in 0 secs
## a
## 1: c
## 2: b
options(op)
So how it looks in practice. I will compare base R data.frame, data.table and indexed data.table. You can try other tool, I doubt if you will get better performance in any other tool, not just other R package.
The volumes tested are 1e7, 5e7 and 1e8 rows. Should works fine on 8GB memory.
Some helper function.
# easy control usage of index and verbose
with_index = function(x, auto.index=TRUE, verbose=TRUE){
op=options("datatable.auto.index"=auto.index, "datatable.verbose"=verbose)
on.exit(op)
x
}
1e7
set.seed(123)
n = 1e7
dt = data.table(high = sample(n*0.9, n, TRUE), normal = sample(n*0.1, n, TRUE), low = sample(10, n, TRUE), value = rnorm(n))
df = as.data.frame(dt)
set2keyv(dt, "high")
high.filter = sample(dt$high, 1L)
df.r = df[df$high==high.filter,]
dt.r = with_index(dt[high==high.filter])
## Using existing index 'high'
## Starting bmerge ...done in 0 secs
dti.r = with_index(dt[high==high.filter], auto.index = FALSE)
all.equal(as.data.table(df.r), dt.r) && all.equal(dt.r, dti.r)
## [1] TRUE
library(microbenchmark)
mb = list()
mb[["1e7"]] = microbenchmark(times = 10L,
data.frame = df[df$high==high.filter,],
data.table = with_index(dt[high==high.filter], auto.index = FALSE, verbose = FALSE),
data.table.index = with_index(dt[high==high.filter], auto.index = TRUE, verbose = FALSE)
)
print(mb[["1e7"]])
## Unit: microseconds
## expr min lq mean median uq
## data.frame 130884.40 131268.799 136008.4565 132336.511 135063.196
## data.table 25237.72 25273.030 25986.2384 25683.307 26758.476
## data.table.index 529.61 536.632 625.7158 619.203 682.958
## max neval
## 166294.227 10
## 26987.085 10
## 760.766 10
5e7
## Unit: microseconds
## expr min lq mean median uq
## data.frame 640864.68 643263.534 647379.2399 644586.934 645459.86
## data.table 122280.32 122311.531 123419.3050 122419.632 124901.38
## data.table.index 569.55 671.838 705.8739 703.111 805.41
## max neval
## 678550.008 10
## 125660.152 10
## 812.037 10
1e8
## Unit: microseconds
## expr min lq mean median
## data.frame 1274080.15 1276696.438 1282344.6489 1278789.568
## data.table 243092.33 243347.848 245446.5897 243854.973
## data.table.index 533.84 575.047 681.7643 681.873
## uq max neval
## 1281497.67 1314164.479 10
## 248166.62 249478.923 10
## 752.13 817.687 10
Timing summary
How fast is data.table index and how it scales?
## mean seconds
## expr 1e7 5e7 1e8
## 1: data.frame 0.1360 0.6474 1.2823
## 2: data.table 0.0260 0.1234 0.2454
## 3: data.table.index 0.0006 0.0007 0.0007
## relative
## expr 1e7 5e7 1e8
## 1: data.frame 226.66667 924.8571 1831.8571
## 2: data.table 43.33333 176.2857 350.5714
## 3: data.table.index 1.00000 1.0000 1.0000
On the 1e8 rows the indexed data.table solution is ~1831.86 times faster than data.frame and ~350.57 times faster than non-index data.table.
Scaling data.table index even further for big data?
If you don't have a single machine good enough to handle a data.table in memory you can stil preserve the data.table's index performance.
You need to split your data into separate instances of R, index each of them. Then just rbind results queried from each instance.
That is pretty easy with Rserve, but since this is a topic for separate post I will leave you with basic working example.
library(Rserve)
library(RSclient)
port = 6311:6312
# start nodes
sapply(port, function(port) Rserve(debug = FALSE, port = port, args = c("--no-save")))
# connect nodes
rscl = lapply(setNames(port, port), function(port) RS.connect(port=port))
# populate data, 5M rows in each node
qcall = quote({
stopifnot(suppressPackageStartupMessages(require("data.table", character.only = TRUE, quietly = TRUE)))
set.seed(123)
n = 5e6
x <- data.table(high = sample(n*0.9, n, TRUE), normal = sample(n*0.1, n, TRUE), low = sample(10, n, TRUE), value = rnorm(n))
high.filter <- sample(x$high, 1L)
set2keyv(x, "high")
TRUE
})
sapply(rscl, RS.eval, qcall, lazy=FALSE)
## 6311 6312
## TRUE TRUE
# query using index, capture data.table verbose messages
qcall = quote({
op = options(datatable.auto.index=TRUE, datatable.verbose=TRUE)
prnt = capture.output(r <- x[high==high.filter])
options(op)
list(verbose = prnt, results = r)
})
l = lapply(rscl, RS.eval, qcall, lazy=FALSE)
# datatable.verbose from each node
invisible(lapply(lapply(l, `[[`, "verbose"), cat, sep="\n"))
## Using existing index 'high'
## Starting bmerge ...done in 0 secs
## Using existing index 'high'
## Starting bmerge ...done in 0 secs
# results from each node
lapply(l, `[[`, "results")
## $`6311`
## high normal low value
## 1: 3188512 27799 6 -0.6886669
## 2: 3188512 471094 5 -1.8107128
## 3: 3188512 221944 7 0.3489058
##
## $`6312`
## high normal low value
## 1: 3188512 27799 6 -0.6886669
## 2: 3188512 471094 5 -1.8107128
## 3: 3188512 221944 7 0.3489058
Reproducibility
You can find script of blog post in Rmarkdown format in the blog github repo.
If you have any comments feel free to put them into github issue.