Databases of electronic health records such as CALIBER have large numbers of patients and can create challenges for data management. The CALIBER suite of R packages has tools to assist in the analysis of these data. These packages are particularly designed for CALIBER data, but may be of use for other electronic health record datasets.
Standard data.frame objects are not suitable for handling very large datasets: they are unsorted, can be slow to search, and may be copied in full whenever they are modified. We recommend the use of the following:
Data.table objects are like data.frames but are updated by reference, and can be sorted to enable fast binary searches. In particular, the <- assignment operator creates a new name which refers to the same object rather than creating a new object. We recommend using the copy() function when you explicitly want to make a copy of an object.
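The difference between assigning a new name and taking a copy can be seen in a small example. This is a minimal sketch using only the data.table package; the table and column names (DT, SAME, INDEP, flag, extra) are made up for illustration.

```r
library(data.table)

DT <- data.table(id = 1:3, x = c(10, 20, 30))

# Plain assignment only creates a second name for the same object
SAME <- DT
SAME[, flag := TRUE]                      # := modifies the object in place
shared <- "flag" %in% names(DT)           # TRUE: the change is visible through DT

# copy() creates an independent object
INDEP <- copy(DT)
INDEP[, extra := x * 2]
leaked <- "extra" %in% names(DT)          # FALSE: DT is unaffected

print(c(shared = shared, leaked = leaked))
```

If a function adds or changes columns with := on a data.table passed to it, the caller's object changes too, so take a copy() first whenever the original must be preserved.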
Data.table allows SQL-type operations, but using its own concise syntax. Data.table operations are typically faster than the equivalents using aggregate, table, doBy, plyr, etc.
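As an illustration of the SQL-type syntax, the sketch below builds a toy event table (EVENTS and its columns are invented for this example), sets a key so that lookups use binary search, and then selects and aggregates rows. The SQL statements in the comments are only rough equivalents.

```r
library(data.table)

EVENTS <- data.table(
  patid   = c(3L, 1L, 2L, 1L, 3L, 3L),
  medcode = c(1430L, 6336L, 1430L, 1414L, 12804L, 1430L),
  value   = c(1.5, 2.0, 0.5, 3.0, 2.5, 1.0)
)

# Sort by patid and mark it as the key, which enables binary search
setkey(EVENTS, patid)

# Roughly: SELECT * FROM EVENTS WHERE patid = 3
pat3 <- EVENTS[.(3L)]

# Roughly: SELECT patid, COUNT(*), AVG(value) FROM EVENTS GROUP BY patid
per_patient <- EVENTS[, .(n = .N, mean_value = mean(value)), by = patid]

print(pat3)
print(per_patient)
```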
Flat file data frames (ffdf objects) reside on the hard drive (usually in the temporary directory, but they can be saved elsewhere for storage). The methods for using them are different from those for data.table or data.frame, but some of the CALIBERdatamanage functions behave exactly the same regardless of the data format.
We recommend the use of ffdf objects for raw datasets which are too large to fit in memory, e.g. raw CPRD clinical, test or therapy data.
The CALIBERdatamanage package contains functions for:
First we load the required packages and specify the path to the data.
library(CALIBERcodelists)
library(CALIBERdatamanage)
library(CALIBERlookups) # optional, it will otherwise be
# loaded automatically when required
kRawPath <- "~/RAWDATA/"
We are using raw data for the stable angina cohort. We can load raw files directly from zip or gz files (if from zip they are automatically unzipped into the temporary directory and loaded).
It is assumed that the first row contains the column headings. The delimiter is detected automatically (tab or comma). Internally, importDT uses fread instead of read.delim, as it is faster and uses less memory. Dates are converted to IDate.
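The behaviour described above can be approximated with fread alone. This is a minimal sketch, not the importDT implementation: it writes two small temporary files (the file contents are invented), lets fread detect the delimiter, and converts the date column to IDate explicitly, which works whether or not the installed data.table version already parsed it as a date.

```r
library(data.table)

# Two small example files: one comma-delimited, one tab-delimited
csv_file <- tempfile(fileext = ".csv")
tab_file <- tempfile(fileext = ".txt")
writeLines(c("anonpatid,crd,gender",
             "1,2001-03-15,1",
             "2,1998-11-02,2"), csv_file)
writeLines(c("anonpatid\tcrd\tgender",
             "1\t2001-03-15\t1",
             "2\t1998-11-02\t2"), tab_file)

# fread() detects the delimiter; the first row holds the column headings
A <- fread(csv_file, header = TRUE)
B <- fread(tab_file, header = TRUE)

# Explicit conversion of the date column to IDate
A[, crd := as.IDate(crd)]
B[, crd := as.IDate(crd)]

print(sapply(A, function(x) class(x)[1]))
```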
# Load patient data to data.table
PATS <- importDT(kRawPath %&% "patients.csv.zip")
## Imported to data.table with 115305 rows and 17 columns
## Column classes after attempted date conversion:
## anonpatid pracid pracregion pracuts praclcd
## "integer" "integer" "integer" "IDate" "IDate"
## frd crd tod deathdate toreason
## "IDate" "IDate" "IDate" "IDate" "integer"
## gender year_of_birth in_hes_source hes_start hes_end
## "integer" "integer" "integer" "IDate" "IDate"
## hes_ethnicity in_hes
## "character" "integer"
## Converting in_hes_source to logical.
## Converting in_hes to logical.
We designate the patient file as a 'cohort', which means that it contains one row per patient, each variable has an optional description and all the variables are sorted alphabetically. A cohort object is also a data.table, which means it is updated by reference. Converting a data.table to a cohort therefore changes the original data.table.
# Convert to cohort
cohort(PATS, idcolname = "anonpatid")
# This is the same as PATS <- as.cohort(PATS, idcolname = 'anonpatid')
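To make the properties of a cohort concrete, the sketch below imitates them with plain data.table calls. The function make_cohort_sketch is hypothetical and is not how the CALIBER cohort class is implemented; in particular, storing descriptions in an attribute named "description" is an assumption made only for this example.

```r
library(data.table)

# Hypothetical helper, for illustration only
make_cohort_sketch <- function(DT, idcolname) {
  # A cohort has exactly one row per patient
  stopifnot(!anyDuplicated(DT[[idcolname]]))
  # ID column first, remaining columns in alphabetical order (by reference)
  setcolorder(DT, c(idcolname, sort(setdiff(names(DT), idcolname))))
  # Empty descriptions, one per column, attached by reference
  setattr(DT, "description", setNames(rep("", ncol(DT)), names(DT)))
  invisible(DT)
}

P <- data.table(
  tod       = as.IDate(c("2011-02-01", "2010-01-01")),
  anonpatid = 1:2,
  gender    = c(1L, 2L)
)

# No assignment is needed: P itself is modified
make_cohort_sketch(P, "anonpatid")
print(names(P))
```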
The importFFDF function can load and append multiple files if supplied with a vector of file names. Unlike importDT, character strings are loaded as factors and dates as Date types.
# Load clinical data to ffdf
CLIN <- importFFDF(kRawPath %&% "clinical.part." %&% 0:3 %&% ".zip")
## Importing /tmp/RtmpSuhHlK/clinical.part.0
## Importing /tmp/RtmpSuhHlK/clinical.part.1
## Importing /tmp/RtmpSuhHlK/clinical.part.2
## Importing /tmp/RtmpSuhHlK/clinical.part.3
## Imported to ffdf with 29739915 rows and 18 columns
## Column classes after attempted date conversion:
## anonpatid eventdate sysdate constype consid medcode staffid
## "integer" "Date" "Date" "integer" "integer" "integer" "integer"
## textid episode enttype adid data1 data2 data3
## "integer" "integer" "integer" "integer" "numeric" "integer" "numeric"
## data4 data5 data6 data7
## "integer" "integer" "integer" "logical"
FFDF objects are saved on the hard drive, and use up very little RAM. This can be checked using the function object.size:
print(object.size(PATS), units = "auto")
## 9.7 Mb
print(object.size(CLIN), units = "auto")
## 60.7 Kb
To save the dataset for future use, we can use the function pack.ffdf. Loading from a 'packed' file is quicker than loading from a text file.
# Save clinical table as packed FFDF on file
pack.ffdf(kRawPath %&% "mypackedCLIN.zip", CLIN)
# To unpack (CLIN is restored to the global environment)
unpack.ffdf(kRawPath %&% "mypackedCLIN.zip")
We are trying to identify diagnoses of stable angina. To do this, we load the stable angina codelist.
sa_codelist <- codelist("sa_diagnosis_gprd.codelist.1.csv")
sa_codelist
## Codelist based on read dictionary with 26 terms.
##
## Name: sa_diagnosis_gprd
## Version: 1
## Source: GPRD
## Author: Julie George, Emily Herrett, Liam Smeeth, Harry Hemingway
## Date: 12 Apr 2011
## Timestamp: 15.04 23-Apr-13
## Categories:
## 1. History of stable angina
## 2. Vasospastic angina
## 3. Cardiac syndrome X
## 4. Stable angina
##
## TERMS (sorted by category and code):
## category code term medcode
## 1: 1 14A5.00 H/O: angina pectoris 6336
## 2: 1 14AJ.00 H/O: Angina in last year 57062
## 3: 2 G331.00 Prinzmetal's angina 12986
## 4: 2 G331.11 Variant angina pectoris 11048
## 5: 2 G332.00 Coronary artery spasm 36854
## 6: 3 G37..00 Cardiac syndrome X 8568
## 7: 4 662K.00 Angina control 13185
## 8: 4 662K000 Angina control - good 19542
## 9: 4 662K100 Angina control - poor 15373
## 10: 4 662K200 Angina control - improving 14782
## 11: 4 662Kz00 Angina control NOS 15349
## 12: 4 8B27.00 Antianginal therapy 45960
## 13: 4 G33..00 Angina pectoris 1430
## 14: 4 G330.00 Angina decubitus 20095
## 15: 4 G330000 Nocturnal angina 18125
## 16: 4 G330z00 Angina decubitus NOS 29902
## 17: 4 G33z.00 Angina pectoris NOS 25842
## 18: 4 G33z100 Stenocardia 54535
## 19: 4 G33z200 Syncope anginosa 7696
## 20: 4 G33z300 Angina on effort 1414
## 21: 4 G33z500 Post infarct angina 9555
## 22: 4 G33z600 New onset angina 26863
## 23: 4 G33z700 Stable angina 12804
## 24: 4 G33zz00 Angina pectoris NOS 28554
## 25: 4 G34y000 Chronic coronary insufficiency 24540
## 26: 4 Gyu3000 [X]Other forms of angina pectoris 39546
## category code term medcode
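A codelist like the one above can be thought of as a lookup table from medcode to category. The following sketch shows the general idea of selecting matching events with a join in plain data.table; it is not the extractCodes implementation. The toy tables CODES and CLIN_TOY are invented, the four medcodes are taken from the listing above, and 99999 is an arbitrary code that is not in the codelist.

```r
library(data.table)

# Toy codelist: one medcode from each category in the listing above
CODES <- data.table(
  medcode  = c(6336L, 12986L, 8568L, 1430L),
  category = c(1L, 2L, 3L, 4L)
)

# Toy event table; 99999 is an arbitrary medcode absent from the codelist
CLIN_TOY <- data.table(
  anonpatid = c(1L, 1L, 2L, 3L, 3L),
  eventdate = as.IDate(c("2001-05-01", "2003-02-10", "1999-07-21",
                         "2005-01-05", "2004-12-31")),
  medcode   = c(1430L, 99999L, 6336L, 8568L, 1430L)
)

# An inner join keeps only the events whose medcode is in the codelist
MATCHED <- merge(CLIN_TOY, CODES, by = "medcode")

# Label the categories with a factor, using the category names listed above
MATCHED[, category := factor(category, levels = 1:4,
  labels = c("History of stable angina", "Vasospastic angina",
             "Cardiac syndrome X", "Stable angina"))]

print(MATCHED[, .N, by = category])
```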
We want to find the date of the earliest angina record per patient, and add it to the patient cohort as the index date.
ANGINA <- as.data.table(extractCodes(CLIN, sa_codelist))
# The new variable sa_diagnosis_gprd is a factor
ANGINA[, .N, by = list(sa_diagnosis_gprd, as.integer(sa_diagnosis_gprd))]
## sa_diagnosis_gprd as.integer N
## 1: Stable angina 4 152274
## 2: Vasospastic angina 2 297
## 3: History of stable angina 1 4184
## 4: Cardiac syndrome X 3 465
# Now we create an index date for the cohort
ANGINA <- ANGINA[(!is.na(eventdate)) & sa_diagnosis_gprd %in% c("Stable angina",
"History of stable angina")]
ANGINA[, indexdate := min(eventdate), by = anonpatid]
# Convert indexdate back to date in case it is a number
ANGINA[, indexdate := as.IDate(indexdate, origin = "1970-01-01")]
# Transfer the index date to the PATS cohort
transferVariables(ANGINA, PATS, "indexdate", description = "date of initial diagnosis")
## [1] "anonpatid" "crd" "deathdate" "frd"
## [5] "gender" "hes_end" "hes_ethnicity" "hes_start"
## [9] "in_hes" "in_hes_source" "pracid" "praclcd"
## [13] "pracregion" "pracuts" "tod" "toreason"
## [17] "year_of_birth" "indexdate"
# Generate a variable to include patients only if the indexdate is after
# their current registration date
PATS[, include := istrue(indexdate >= crd)]
## Cohort with 115305 patients; ID column = anonpatid
##
## COLUMN DESCRIPTIONS
## crd (IDate):
## deathdate (IDate):
## frd (IDate):
## gender (integer):
## hes_end (IDate):
## hes_ethnicity (character):
## hes_start (IDate):
## in_hes (logical):
## in_hes_source (logical):
## indexdate (IDate): date of initial diagnosis
## pracid (integer):
## praclcd (IDate):
## pracregion (integer):
## pracuts (IDate):
## tod (IDate):
## toreason (integer):
## year_of_birth (integer):
modifyDescription(PATS, "include", "Whether to include in main analysis")
# Count the number of patients included
PATS[, .N, by = include]
## include N
## 1: FALSE 77242
## 2: TRUE 38063
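The index date logic can be reproduced on a toy dataset with data.table alone. This is a sketch under stated assumptions rather than the CALIBER implementation: EVENTS and COHORT are invented tables, the update join stands in for transferVariables, and the expression `!is.na(x) & x` stands in for istrue, which is assumed here to treat missing values as FALSE.

```r
library(data.table)

EVENTS <- data.table(
  anonpatid = c(1L, 1L, 2L, 3L),
  eventdate = as.IDate(c("2003-02-10", "2001-05-01", "1999-07-21", NA))
)
COHORT <- data.table(
  anonpatid = 1:4,
  crd = as.IDate(c("2000-01-01", "2002-06-30", "1995-03-12", "2001-01-01"))
)

# Earliest non-missing event date per patient
INDEX <- EVENTS[!is.na(eventdate), .(indexdate = min(eventdate)), by = anonpatid]

# Update join: add indexdate to COHORT by reference;
# patients without a dated event keep a missing index date
COHORT[INDEX, indexdate := i.indexdate, on = "anonpatid"]

# Include only if the index date is on or after current registration;
# a missing index date counts as FALSE
COHORT[, include := !is.na(indexdate) & indexdate >= crd]

print(COHORT)
```

Patient 2 is excluded because the earliest record precedes registration, and patients 3 and 4 are excluded because they have no dated record.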
Additional variables can be added for medical conditions prior to the index date, such as whether a patient has diabetes.
dm_codelist <- codelist("dm_gprd.codelist.2.csv")
dm_codelist
## Codelist based on read dictionary with 513 terms.
##
## Name: dm_gprd
## Version: 2
## Source: GPRD
## Author: Julie George, Emily Herrett, Anoop Shah, Liam Smeeth, Harry Hemingway
## Date: 06 Jan 2012
## Timestamp: 15.04 23-Apr-13
## Categories:
## 1. H/O diabetes
## 2. Possible diabetes
## 3. T1DM diagnosed
## 4. T2DM diabetes diagnosed
## 5. Secondary diabetes
## 6. Diabetes, not otherwise specified
## 7. Diabetes excluded
## 8. Diabetes resolved
##
## TERMS (sorted by category and code):
## category code term medcode
## 1: 1 1434.00 H/O: diabetes mellitus 6813
## 2: 1 14F4.00 H/O: Admission in last year for diabetes ... 7045
## 3: 1 14P3.00 H/O: insulin therapy 17236
## 4: 1 2126300 Diabetes resolved 28622
## 5: 1 212H.00 Diabetes resolved 18766
## ---
## 509: 6 TJ23z00 Adverse reaction to insulins and antidiab... 61210
## 510: 6 U602311 [X] Adverse reaction to insulins and anti... 65684
## 511: 6 ZC2C800 Dietary advice for diabetes mellitus 10642
## 512: 6 ZV65312 [V]Dietary counselling in diabetes mellitus 16881
## 513: 7 1I0..00 Diabetes mellitus excluded 19203
# Use categories 3, 4 and 6
addCodelistToCohort(PATS, "diabetes", CLIN, dm_codelist, categories = c(3, 4,
6), binary = TRUE, limit_years = c(-Inf, 0), description = "Diabetes prior to index date")
## Called from: addToCohort(x, varname, USE, old_varname = "value", value_choice = function(x) any(istrue(x)),
## limit_years = limit_years, overwrite = overwrite, idcolname = idcolname,
## datecolname = datecolname, description = description)
## diabetes N
## 1: FALSE 109777
## 2: TRUE 5528
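The idea behind a binary 'prior to index date' variable can be sketched with a join and a date comparison. This toy example is not the addCodelistToCohort implementation: the tables are invented, and it assumes that limit_years = c(-Inf, 0) means any event on or before the index date, and that patients with a missing index date are coded FALSE.

```r
library(data.table)

COHORT <- data.table(
  anonpatid = 1:3,
  indexdate = as.IDate(c("2005-01-01", "2005-01-01", NA))
)
DM_EVENTS <- data.table(
  anonpatid = c(1L, 2L, 3L),
  eventdate = as.IDate(c("2003-04-01", "2006-09-15", "2001-01-01"))
)

# Attach the index date to each event, then keep events on or before it
PRIOR <- merge(DM_EVENTS, COHORT, by = "anonpatid")
PRIOR <- PRIOR[!is.na(indexdate) & eventdate <= indexdate]

# TRUE if the patient has at least one qualifying event
COHORT[, diabetes := anonpatid %in% PRIOR$anonpatid]

print(COHORT[, .N, by = diabetes])
```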
Next we add the most recent systolic blood pressure recorded within 2 years before the index date, taking the mean if there is more than one measurement on the same day.
# First extract the blood pressure readings
BP <- extractEntity(CLIN, 1)
# Now use the addToCohort function to add the mean BP
addToCohort(PATS, "sbp", BP, "Systolic", value_choice = mean, date_priority = "last",
limit_years = c(-2, 0), date_varname = "sbp_date", description = "Most recent SBP within 2y prior to index date")
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 130 140 143 156 1130 82651
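The selection rule used here (restrict to a time window, average readings taken on the same day, then keep the most recent day) can be written out step by step on toy data. This sketch is not the addToCohort implementation: the tables are invented, and the 2 year window is assumed to be 2 * 365.25 days, inclusive of the index date.

```r
library(data.table)

COHORT <- data.table(
  anonpatid = 1:2,
  indexdate = as.IDate(c("2005-06-01", "2005-06-01"))
)
BP_TOY <- data.table(
  anonpatid = c(1L, 1L, 1L, 1L, 2L),
  eventdate = as.IDate(c("2002-01-10", "2004-03-01", "2005-05-20",
                         "2005-05-20", "2006-01-01")),
  systolic  = c(150, 160, 140, 130, 155)
)

# Attach the index date and keep readings in the assumed 2 year window
X <- merge(BP_TOY, COHORT, by = "anonpatid")
X[, days_before := as.integer(indexdate) - as.integer(eventdate)]
X <- X[days_before >= 0 & days_before <= 2 * 365.25]

# Mean of the readings taken on the same day
DAILY <- X[, .(sbp = mean(systolic)), by = .(anonpatid, eventdate)]

# Most recent day per patient
setorder(DAILY, anonpatid, eventdate)
LAST <- DAILY[, .SD[.N], by = anonpatid]
setnames(LAST, "eventdate", "sbp_date")

# Add value and date to the cohort by reference
COHORT[LAST, c("sbp", "sbp_date") := .(i.sbp, i.sbp_date), on = "anonpatid"]

print(COHORT)
```

An implausible maximum such as the 1130 in the summary above is a reminder to range-check the extracted values before analysis.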