Anoop Shah
The data.table package allows R to handle large datasets efficiently, and perform data management in a similar way to Stata (with by and egen-style functionality), SAS DATA steps or SQL. It is different from the flat file package (ff) because the dataset is held in RAM at all times. However, in common with data.frames and unlike SQL database tables, data.tables are not designed to have rows added incrementally. (In R if you want to collect data in an incremental fashion, you first need to create an empty object (e.g. data.table or data.frame) which is big enough, and then start to fill it.)
If a data.table is sent as an argument to a function in a package which does not know about data.table, it behaves as a data.frame. For example, you can use data.tables instead of data.frames when fitting statistical models.
However when you create and manipulate data.tables there are some important differences:
The assignment operator -> assigns a new reference (pointer, alias) but does not make a copy of the object. Several named objects can point to the same data.table. If a data.table is modified by code in a function, it is modified in place.
If you want to be sure of making a separate copy of a data.table or part of a data.table (e.g. the vector of column names) you should do so explicity using the copy() function. The c() vector concatenation function also makes a copy.
This is demonstrated by the code below. Bcopy is a copy of b but B is just another pointer. I will explain the special updating syntax := later, which is specific to data.tables.
library(data.table)
b <- data.table(me = 2)
Bcopy <- copy(b)
Bcopy[, me:=3]
## me
## 1: 3
b
## me
## 1: 2
B <- b
B[, me:=3]
## me
## 1: 3
b
## me
## 1: 3
## Extracting a vector of names
Bnames <- names(b) # Using the <- assignment operator
BnamesC <- c(names(b)) # Using the vector concatenation function
BnamesCopy <- copy(names(b)) # Using the copy function
## Change the column name of the data.table b
setnames(b, 'me', 'you')
Bnames
## you
BnamesC
## me
BnamesCopy
## me
Data table indexes are sorted lists which can be used for fast binary searches. Indexes are called keys and can be set using the setkey command, with the unquoted key column name(s) as arguments. setkeyv is a version of this function which takes quoted column names. Keys do not have to be unique.
A <- data.table(a = 1:5, b = c(1, 1, 2, 3, 3))
setkeyv(A, "a")
setkey(A, a)
This is an example where more than one variable is used for the key (events within patient):
A <- data.table(events = 1:5, patients = c(1, 3, 2, 1, 3), value = 5:1)
setkey(A, patients, events)
A
## events patients value
## 1: 1 1 5
## 2: 4 1 2
## 3: 3 2 3
## 4: 2 3 4
## 5: 5 3 1
This data.table is now sorted by patients and then by events.
Keys are useful for aggregating results and are required when merging (joining) data.tables.
Virtually all the manipulation of data.tables that you may want to do (updating, subsetting, selecting, joining) is handled by the square brackets. I have categorised the features by their approximate SQL equivalents.
Let A be a data.table. You can perform operations on A using the syntax A[i, j, …] where … are optional arguments. The i argument relates to rows and j to columns. At least one of i or j must be provided; if you are providing j and not i you must put a comma before j.
The square brackets are like a function, but with notation that is more concise than the traditional myfunction(A, i, j, …) because you don't need to type a word such as 'myfunction'.
All arguments within [] are evaluated within the scope of the data.table, as if using with(). This means that column names are available as vector objects. To subset a data.frame using one of its columns you need to fully specify the column (e.g. mydataframe[mydataframe$col1==1,]), but for data.tables you can do this more concisely (e.g. mydatatable[col1==1,]).
You can select columns or functions in the j argument.
Using the default (with=TRUE), you can specify the columns as objects. If you want to select more than one column you must combine them with list(). You can use the name= syntax in list to name the columns, similar to SQL As.
A[, value]
## [1] 5 2 3 4 1
A[, list(events, value)]
## events value
## 1: 1 5
## 2: 4 2
## 3: 3 3
## 4: 2 4
## 5: 5 1
A[, list(myevents = events, myvalue = value)]
## myevents myvalue
## 1: 1 5
## 2: 4 2
## 3: 3 3
## 4: 2 4
## 5: 5 1
Using the option with=FALSE, you can specify a vector of column names.
A[, "value", with = FALSE]
## value
## 1: 5
## 2: 2
## 3: 3
## 4: 4
## 5: 1
A[, c("events", "value"), with = FALSE]
## events value
## 1: 1 5
## 2: 4 2
## 3: 3 3
## 4: 2 4
## 5: 5 1
If with=TRUE, you can use functions within j.
A[, exp(value)]
## [1] 148.413 7.389 20.086 54.598 2.718
A[, list(exp = exp(value), square = value^2)]
## exp square
## 1: 148.413 25
## 2: 7.389 4
## 3: 20.086 9
## 4: 54.598 16
## 5: 2.718 1
You can also use functions that give a scalar results (i.e. a single number), in this case the resultant data.table will have a single row or short vectors will be recycled to make them up to the length of the longest vector.
A[, min(value)]
## [1] 1
A[, list(exp = exp(value), min = min(value))]
## exp min
## 1: 148.413 1
## 2: 7.389 1
## 3: 20.086 1
## 4: 54.598 1
## 5: 2.718 1
If you don't know the column names in advance but want to use a complicated function, it may be easiest to create the expression from text using parse(text= ) and then evaluate it within [] using eval().
colname <- "value"
myexpr <- paste("list(exp=exp(", colname, "), square=", colname, "^2)", sep = "")
A[, eval(parse(text = myexpr))]
## exp square
## 1: 148.413 25
## 2: 7.389 4
## 3: 20.086 9
## 4: 54.598 16
## 5: 2.718 1
You can select rows by specifying the i argument in one of three ways:
You can also use an expression which produces column numbers.
A[1]
## events patients value
## 1: 1 1 5
A[1:2 * 2]
## events patients value
## 1: 4 1 2
## 2: 2 3 4
(Note that if you are selecting all columns and do not have a j argument, the comma after the i argument is optional, unlike with data.frames.)
You may have to manually coerce the vector to logical.
A[TRUE] # TRUE is recycled, all events are selected
## events patients value
## 1: 1 1 5
## 2: 4 1 2
## 3: 3 2 3
## 4: 2 3 4
## 5: 5 3 1
A[events == 3]
## events patients value
## 1: 3 2 3
# A[events] does not work, you need to use A[events==TRUE] or
# A[as.logical(events)]
The data.tables are linked using the key columns like an SQL table join, except that it is ordered. The mult option can be used to specify whether to include the first, last or all rows in the join.
setkey(A, patients)
B <- data.table(patients = 2:4, age = c(23, 34, 45))
setkey(B, patients)
B[A] # data.table B indexed by the rows of A.
## patients age events value
## 1: 1 NA 1 5
## 2: 1 NA 4 2
## 3: 2 23 3 3
## 4: 3 34 2 4
## 5: 3 34 5 1
# patient 4, who occurs in B but not A, is not included.
A[B] # data.table A indexed by the rows of B.
## patients events value age
## 1: 2 3 3 23
## 2: 3 2 4 34
## 3: 3 5 1 34
## 4: 4 NA NA 45
# patient 1, who is in A but not B, is not included.
A[B, mult = "first"] # if the key is the same, include only the first row.
## patients events value age
## 1: 2 3 3 23
## 2: 3 2 4 34
## 3: 4 NA NA 45
A[B, mult = "last"] # if the key is the same, include only the last row.
## patients events value age
## 1: 2 3 3 23
## 2: 3 5 1 34
## 3: 4 NA NA 45
A self-join may be useful to show e.g. only the top row per patient. If a data.table has only one column, it is implicitly taken to be the first key column. In this example we sort by value within patient.
setkey(A, patients, value)
A[data.table(unique(patients)), mult = "first"]
## patients events value
## 1: 1 4 2
## 2: 2 3 3
## 3: 3 5 1
You can't update and select in the same function. The format of the j argument is different for updating, and uses the := operator.
This modifies the actual data.table in memory, and returns the modified table.
On the left hand side of :=, you can have a single unquoted column name or a vector of quoted column names if with=FALSE. On the right hand side, you can have a single vector or a list of vectors. You can use list(NULL) to delete columns. The column names can be new or existing. If it is an existing column, you must supply the same data type (e.g. you cannot replace an integer with a double) because the new values replace the old ones in the same memory location.
Some examples:
A[, new:=1]
## events patients value new
## 1: 4 1 2 1
## 2: 1 1 5 1
## 3: 3 2 3 1
## 4: 5 3 1 1
## 5: 2 3 4 1
A[events == 3, new:=2]
## events patients value new
## 1: 4 1 2 1
## 2: 1 1 5 1
## 3: 3 2 3 2
## 4: 5 3 1 1
## 5: 2 3 4 1
A[, c("this", "that"):=list(2, 3), with = FALSE]
## events patients value new this that
## 1: 4 1 2 1 2 3
## 2: 1 1 5 1 2 3
## 3: 3 2 3 2 2 3
## 4: 5 3 1 1 2 3
## 5: 2 3 4 1 2 3
A[, c("this", "that"):=list(NULL), with = FALSE]
## events patients value new
## 1: 4 1 2 1
## 2: 1 1 5 1
## 3: 3 2 3 2
## 4: 5 3 1 1
## 5: 2 3 4 1
A[, new:=NULL]
## events patients value
## 1: 4 1 2
## 2: 1 1 5
## 3: 3 2 3
## 4: 5 3 1
## 5: 2 3 4
To transfer a column from another data.table (in this example, age from data.table B):
A[, age:=B[A]$age]
## events patients value age
## 1: 4 1 2 NA
## 2: 1 1 5 NA
## 3: 3 2 3 23
## 4: 5 3 1 34
## 5: 2 3 4 34
# other ways of ensuring you are getting a vector
A[, age:=B[A, age][, age]] # quickest with large datasets
## events patients value age
## 1: 4 1 2 NA
## 2: 1 1 5 NA
## 3: 3 2 3 23
## 4: 5 3 1 34
## 5: 2 3 4 34
A[, age:=B[A][, age]]
## events patients value age
## 1: 4 1 2 NA
## 2: 1 1 5 NA
## 3: 3 2 3 23
## 4: 5 3 1 34
## 5: 2 3 4 34
# DO NOT DO: A[, age:=B[A, age]] because B[A, age] returns a data.table
# with two columns: the key column and age, but you need a vector
If you don't want to print the updated data.table each time, use the invisible() function (e.g. invisible(A[, age:=B[A]$age])).
Data.frame style notation using $ also works, but is not advised with large data.table objects because it may require the object to be copied and will therefore be slower.
A$age <- NULL
The by function is very useful, because it evaluates the j expression (select or update) for every group specified in the by option (e.g. per patient, per practice, per date). It is similar to by: egen in Stata, but you can specify expressions instead of column names.
If you are using by over more than one column, they have to be supplied as a list. The only requirement is
# Select
A[, min(value), by = patients]
## patients V1
## 1: 1 2
## 2: 2 3
## 3: 3 1
# Update (with vector recycling)
A[, minvalue := min(value), by = patients]
## events patients value minvalue
## 1: 4 1 2 2
## 2: 1 1 5 2
## 3: 3 2 3 3
## 4: 5 3 1 1
## 5: 2 3 4 1
A[, minvalue := min(value), by = list(patients, events)]
## events patients value minvalue
## 1: 4 1 2 2
## 2: 1 1 5 5
## 3: 3 2 3 3
## 4: 5 3 1 1
## 5: 2 3 4 4
# within groups: patients==1 vs. patients!=1
A[, minvalue := min(value), by = list(patients == 1)]
## events patients value minvalue
## 1: 4 1 2 2
## 2: 1 1 5 2
## 3: 3 2 3 1
## 4: 5 3 1 1
## 5: 2 3 4 1
The output of A[i, j, …] is a data.table. If you put square brackets after it, you can do another query on the result, and so on. For example, you can do a select query with where and then select a subset of the results to display.
A[patients == 3, list(patients, sq = value^2)]
## patients sq
## 1: 3 1
## 2: 3 16
A[patients == 3, list(patients, sq = value^2)][sq > 1]
## patients sq
## 1: 3 16
You can use the order() function in square brackets in the i argument to order the result of a query.
A[order(-patients, events)]
## events patients value minvalue
## 1: 2 3 4 1
## 2: 5 3 1 1
## 3: 3 2 3 1
## 4: 1 1 5 2
## 5: 4 1 2 2
A[patients == 3, list(patients, sq = value^2)][order(-sq)]
## patients sq
## 1: 3 16
## 2: 3 1
A convenient way to rename columns. Column names should be quoted.
setnames(A, c("patients", "value"), c("patid", "value2"))
A
## events patid value2 minvalue
## 1: 4 1 2 2
## 2: 1 1 5 2
## 3: 3 2 3 1
## 4: 5 3 1 1
## 5: 2 3 4 1
The IDate object type provided by the data.table package by is an integer date class, which stores dates using less memory than the conventional date class. IDate objects can be created using the as.IDate function, and behave similarly to Date objects, e.g.:
as.IDate("2000-01-02")
## [1] "2000-01-02"
as.IDate("2000-01-02") - as.Date("2000-01-01")
## Time difference of 1 days
The tables() function lists all the data.tables in your current workspace, along with their size in megabytes.
The on-screen print() function for data.tables (invoked by typing the name of the data.table at the R prompt, or using print(mydatatable)) by default prints only the top 5 and bottom 5 rows of large data.tables rather than the whole data.table. This is convenient to quickly view a data.table, and doesn't fill up your screen.
merge(X, Y, by = varname), where varname is the name of the the common column, merges data.tables X and Y in a similar way to data.frames.
A data.table is a special type of data.frame, and a data.frame is a special type of list containing vectors (or lists) of identical length. Any data.frame can be made into a data.table. Columns of a data.table can be referred to by $, as with lists. List-based functions such as lapply also work on data.tables, and they can be coerced to a matrix for using the apply function (although this is ill-advised for large data.tables).
However, dont use syntax such as mydatatable$column <- something for assigment, because this might be inefficient. It is better to use the data.table := operator for in-place assignment.
As mentioned above, data.tables can be used in place of data.frames when supplying data to a model fitting function.
Be careful when trying to select a single column in order to get a vector (e.g. to use in assignment). If you get it wrong you will end up with a data.table instead of a vector. This may occur when you are indexing by another data.table.
A way of ensuring you have a vector is using the $ list subsetting operator in order to ensure you select a single column, e.g.:
A <- data.table(events = 1:5, patients = c(1, 3, 2, 1, 3), value = 5:1)
A[, list(value)] # a data.table with one column
## value
## 1: 5
## 2: 4
## 3: 3
## 4: 2
## 5: 1
A[, value, by = events] # a data.table with two columns
## events value
## 1: 1 5
## 2: 2 4
## 3: 3 3
## 4: 4 2
## 5: 5 1
A[, value] # a vector
## [1] 5 4 3 2 1
A[, list(value)]$value # always a vector, same as A[, list(value)][, value]
## [1] 5 4 3 2 1