Solve common R problems efficiently with data.table
I was recently browsing stackoverflow.com (often called SO) for the most voted questions under R tag.
To my surprise, many questions on the first page were quite well addressed with the data.table package. I found a few other questions that could benefit from a data.table answer, therefore went ahead and answered them.
In this post, I’d like to summarise them along with benchmarks (where possible) and my comments if any.
Many answers under highly voted questions seem to have been posted a while back. data.table is quite actively developed and has had tons of improvements (in terms of speed and memory usage) over the recent years. It might therefore be entirely possible that some of those answers will have even better performance by now.
50 highest voted questions under R tag
Here’s the list of top 50 questions. I’ve marked those for which a data.table answer is available (which is usually quite performant).
|I||Number of votes||Question title||Use data.table solution|
|1||1153||How to make a great R reproducible example?|
|2||621||How to sort a dataframe by column(s)?||TRUE|
|3||496||R Grouping functions: sapply vs. lapply vs. apply. vs. tappl||TRUE|
|4||429||How can we make xkcd style graphs?|
|5||396||How to join (merge) data frames (inner, outer, left, right)?||TRUE|
|6||330||What statistics should a programmer (or computer scientist)|
|7||314||Drop columns in R data frame||TRUE|
|8||290||Tricks to manage the available memory in an R session|
|9||280||Remove rows with NAs in data.frame||TRUE|
|10||279||Quickly reading very large tables as dataframes in R||TRUE|
|11||263||How to properly document S4 class slots using Roxygen2?|
|12||250||Assignment operators in R: '=' and '<-'|
|13||236||Drop factor levels in a subsetted data frame||TRUE|
|14||234||Plot two graphs in same plot in R|
|15||225||What is the difference between require() and library()?|
|16||221||data.table vs dplyr: can one do something well the other can|
|17||216||In R, why is
|18||212||R function for testing if a vector contains a given element|
|19||201||Expert R users, what's in your .Rprofile?|
|20||197||R list to data frame||TRUE|
|21||197||Rotating and spacing axis labels in ggplot2|
|22||197||How to Correctly Use Lists in R?|
|23||192||How to convert a factor to an integer\numeric without a loss|
|24||184||How can I read command line parameters from an R script?|
|25||184||How to unload a package without restarting R?|
|26||182||Tools for making latex tables in R|
|27||181||In R, what is the difference between the  and [] notatio|
|28||180||How can I view the source code for a function?|
|29||171||Cluster analysis in R: determine the optimal number of clust|
|30||170||How do I install an R package from source?|
|31||162||How do I replace NA values with zeros in R?|
|32||152||Counting the number of elements with the values of x in a ve|
|33||152||Write lines of text to a file in R|
|34||151||Standard library function in R for finding the mode?|
|35||150||How to trim leading and trailing whitespace in R?|
|36||143||How to save a plot as image on the disk?|
|37||139||Most underused data visualization|
|38||137||Convert data.frame columns from factors to characters||TRUE|
|39||136||How to find the length of a string in R?|
|40||134||Workflow for statistical analysis and report writing|
|41||132||Create an empty data.frame|
|42||130||adding leading zeros using R|
|43||129||Check existence of directory and create if doesn't exist|
|44||127||Run R script from command line|
|45||125||Changing column names of a data frame in R||TRUE|
|46||120||How to set limits for axes in ggplot2 R plots?|
|47||114||How to find out which package version is loaded in R?|
|48||112||How to plot two histograms together in R?|
|49||112||How can 2 strings be concatenated in R|
|50||112||How to organize large R programs?|
Below are the chosen answers where data.table can be applied. Each one supplied with the usage and timing copied from the linked answer. Click on the question title to view SO question or follow the answer link for a reproducible example and benchmark details.
dat by variables
b. Use descending order for
z and ascending for
setorder(dat, -z, b)
Timing and memory consumption:
# R-session memory usage (BEFORE) = ~2GB (size of 'dat') # ------------------------------------------------------------ # Package function Time (s) Peak memory Memory used # ------------------------------------------------------------ # doBy orderBy 409.7 6.7 GB 4.7 GB # taRifx sort 400.8 6.7 GB 4.7 GB # plyr arrange 318.8 5.6 GB 3.6 GB # base R order 299.0 5.6 GB 3.6 GB # dplyr arrange 62.7 4.2 GB 2.2 GB # ------------------------------------------------------------ # data.table order 6.2 4.2 GB 2.2 GB # data.table setorder 4.5 2.4 GB 0.4 GB # ------------------------------------------------------------
dt with variables
grp calculate sum of
x and length of
x by groups specified by
dt[, .(sum(x), .N), grp]
Timing in seconds:
# fun elapsed #1: aggregate 109.139 #2: by 25.738 #3: dplyr 18.978 #4: tapply 17.006 #5: lapply 11.524 #6: sapply 11.326 #7: data.table 2.686
dt2 having common join column named CustomerId.
# right outer join keyed data.tables dt1[dt2] # right outer join unkeyed data.tables - use `on` argument dt1[dt2, on = "CustomerId"] # left outer join - swap dt1 with dt2 dt2[dt1, on = "CustomerId"] # inner join - use `nomatch` argument dt1[dt2, nomatch=0L, on = "CustomerId"] # anti join - use `!` operator dt1[!dt2, on = "CustomerId"] # inner join merge(dt1, dt2, by = "CustomerId") # full outer join merge(dt1, dt2, by = "CustomerId", all = TRUE) # see ?merge.data.table arguments for other cases
# inner join #Unit: milliseconds # expr min lq mean median uq max neval # base 15546.0097 16083.4915 16687.117 16539.0148 17388.290 18513.216 10 # sqldf 44392.6685 44709.7128 45096.401 45067.7461 45504.376 45563.472 10 # dplyr 4124.0068 4248.7758 4281.122 4272.3619 4342.829 4411.388 10 # data.table 937.2461 946.0227 1053.411 973.0805 1214.300 1281.958 10 # left outer join #Unit: milliseconds # expr min lq mean median uq max neval # base 16140.791 17107.7366 17441.9538 17414.6263 17821.9035 19453.034 10 # sqldf 43656.633 44141.9186 44777.1872 44498.7191 45288.7406 47108.900 10 # dplyr 4062.153 4352.8021 4780.3221 4409.1186 4450.9301 8385.050 10 # data.table 823.218 823.5557 901.0383 837.9206 883.3292 1277.239 10 # right outer join #Unit: milliseconds # expr min lq mean median uq max neval # base 15821.3351 15954.9927 16347.3093 16044.3500 16621.887 17604.794 10 # sqldf 43635.5308 43761.3532 43984.3682 43969.0081 44044.461 44499.891 10 # dplyr 3936.0329 4028.1239 4102.4167 4045.0854 4219.958 4307.350 10 # data.table 820.8535 835.9101 918.5243 887.0207 1005.721 1068.919 10 # full outer join #Unit: seconds # expr min lq mean median uq max neval # base 16.176423 16.908908 17.485457 17.364857 18.271790 18.626762 10 # dplyr 7.610498 7.666426 7.745850 7.710638 7.832125 7.951426 10 # data.table 2.052590 2.130317 2.352626 2.208913 2.470721 2.951948 10
b from dataset
DT, or drop columns by names stored in a variable.
set function can be also used, it will work on data.frames too.
DT[, c('a','b') := NULL] # or del <- c('a','b') DT[, (del) := NULL] # or set(DT, j = 'b', value = NULL)
No timing here as that process is almost instant, you can expect less memory consumption dropping columns with
:= operator or
set function as it is made by reference. In R version earlier than 3.1 regular data.frame methods would copy your data in memory.
Reading 1 million rows dataset from csv file.
Timing in seconds:
## user system elapsed Method ## 24.71 0.15 25.42 read.csv (first time) ## 17.85 0.07 17.98 read.csv (second time) ## 10.20 0.03 10.32 Optimized read.table ## 3.12 0.01 3.22 fread ## 12.49 0.09 12.69 sqldf ## 10.21 0.47 10.73 sqldf on SO ## 10.85 0.10 10.99 ffdf
na.omit data.table method which can be handly for that using
You should not expect performance impovement over data.frame methods other than faster detection of rows to delete (
NAs in this example).
Additional memory efficiency is expected in future thanks to Delete rows by reference - data.table#635.
Drop levels for all factor columns in a dataset after making subset on it.
upd.cols = sapply(subdt, is.factor) subdt[, names(subdt)[upd.cols] := lapply(.SD, factor), .SDcols = upd.cols]
No timing, don't expect speed up as the bottleneck is the
factor function used to recreate the factor columns, this is also true for
Having a list of data.frames called
ll, perform row binding on all elements of the list into single dataset.
system.time(ans1 <- rbindlist(ll)) # user system elapsed # 3.419 0.278 3.718 system.time(ans2 <- rbindlist(ll, use.names=TRUE)) # user system elapsed # 5.311 0.471 5.914 system.time(ans3 <- do.call("rbind", ll)) # user system elapsed # 1097.895 1209.823 2438.452
This is a problem related to the fact that
data.frame() by default converts character columns to factor. This is not a problem for
data.table() which keeps character column class. So a question simply is already solved for data.table.
dt = data.table(col1 = c("a","b","c"), col2 = 1:3) sapply(dt, class)
If you have a factor columns in your dataset already and you want to convert them to character you can do the following.
upd.cols = sapply(dt, is.factor) dt[, names(dt)[upd.cols] := lapply(.SD, as.character), .SDcols = upd.cols]
No timings, speed up is unlikely as these are quite primitive operation. If you are creating factor columns in data.frame (default) then you can get speed up with data.table as you avoid
factor function which depending on factor levels can be costly. Still those operations shouldn't be a bottleneck in your workflow.
Rename columns in
dt dataset to good and better.
setnames(dt, c("good", "better"))
No timing here, you should not expect speed up but you can expect less memory consumption due to fact that
setnames update names by reference.
Didn't find a problem you are looking for?
It is very likely as I just listed the questions from the first page of the most voted - so among 50 questions only.
Before you ask new question on SO it may be wise to search for existing one because there is quite a lot already answered.
As of 11 Oct 2015, data.table was the 2nd largest tag about an R package
You can effectively search SO using
[r] [data.table] my problem here in the SO search bar.
If you decide to ask a question remember about Minimal Reproducible Example (MRE). If you need help on that see the highest voted question on R tag: How to make a great R reproducible example?.
Also if you are new user to data.table you should take a look at Getting started wiki.
Final note on performance
Seeing the above timings you can get a valid impression that data.table can dramatically reduce the time (and resources) required in your data processing workflow.
Yet the above questions doesn't include few other use cases where data.table makes in fact the most stunning impression, those are:
- reshaping data - melt, dcast
- overlapping joins - overlapping range joins
- rolling joins - R - Data.Table Rolling Joins
- binary search - Access data quickly and easily: data.table package
- data.table index - Scaling data.table using index
It is good to be aware of those too.
Any comments are welcome in the blog github repo as issues.