Data management for electronic health record data using the CALIBERdatamanage package

Databases of electronic health records such as CALIBER have large numbers of patients and can create challenges for data management. The CALIBER suite of R packages has tools to assist in the analysis of these data. These packages are particularly designed for CALIBER data, but may be of use for other electronic health record datasets.

The R packages comprise:

Handling large datasets in R

Standard data.frame objects are poorly suited to very large datasets: they are unsorted, can be slow to search, and may be copied in full whenever they are modified. We recommend using the following instead:

data.table

Data.table objects are like data.frames but are updated by reference, and can be sorted (keyed) to enable fast binary searches. In particular, assigning a data.table to a new name with <- creates a second name for the same object rather than a new object, so changes made by reference through one name are visible through the other. Use the copy() function when you explicitly want an independent copy of an object.
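The following minimal sketch illustrates the difference between a second name and a copy. The table and its values are made up for illustration; only data.table itself is assumed to be installed.

```r
library(data.table)

# Toy table with hypothetical values
DT <- data.table(anonpatid = 1:3, gender = c(1L, 2L, 1L))

ALIAS <- DT           # a second name for the same data.table
SNAPSHOT <- copy(DT)  # an independent copy

# := updates the column in place, so the change is seen through ALIAS too
DT[anonpatid == 2L, gender := 9L]

ALIAS$gender     # reflects the update
SNAPSHOT$gender  # unchanged
```

Taking a copy() before a destructive update is the simplest way to keep the original values available.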

Data.table allows SQL-type operations (selecting, filtering, grouping and joining), but using its own concise syntax. On large datasets, data.table operations are generally faster than the equivalents in aggregate, table, doBy, plyr, etc.
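As a rough illustration of the SQL-like syntax, the sketch below counts records per patient and then looks up one patient using a key. The EVENTS table is a made-up example, not CALIBER data.

```r
library(data.table)

# Toy event table with hypothetical values
EVENTS <- data.table(
  anonpatid = c(1L, 1L, 2L, 3L, 3L, 3L),
  medcode   = c(1430L, 1414L, 1430L, 6336L, 1430L, 1414L)
)

# Roughly: SELECT anonpatid, COUNT(*) AS N FROM EVENTS GROUP BY anonpatid
COUNTS <- EVENTS[, .N, by = anonpatid]

# Sort by anonpatid and mark it as the key, which enables binary search
setkey(EVENTS, anonpatid)

# Roughly: SELECT * FROM EVENTS WHERE anonpatid = 3 (keyed lookup)
PAT3 <- EVENTS[.(3L)]
```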

ffdf

Flat file data frames reside on the hard drive (usually in the temporary directory, but can be saved elsewhere for storage). The methods for using them are different from data.table or data.frame, but some of the CALIBERdatamanage functions behave exactly the same regardless of the data format.

We recommend the use of ffdf objects for raw datasets which are too large to fit in memory, e.g. raw CPRD clinical, test or therapy data.

Data conversion and data management

The CALIBERdatamanage package contains functions for:

Example analysis

First we load the required packages and specify the path to the data.

library(CALIBERcodelists)
library(CALIBERdatamanage)
library(CALIBERlookups)  # optional; otherwise loaded automatically when required

kRawPath <- "~/RAWDATA/"

We are using raw data for the stable angina cohort. We can load raw files directly from zip or gz files (if from zip they are automatically unzipped into the temporary directory and loaded).

It is assumed that the first row contains the column headings. The delimiter is detected automatically (tab or comma). Internally, importDT uses fread instead of read.delim, as it is faster and uses less memory. Dates are converted to IDate.

# Load patient data to data.table
PATS <- importDT(kRawPath %&% "patients.csv.zip")
## Imported to data.table with 115305 rows and 17 columns

## Column classes after attempted date conversion:

## anonpatid pracid pracregion pracuts praclcd 
## "integer" "integer" "integer" "IDate" "IDate" 

## frd crd tod deathdate toreason
## "IDate" "IDate" "IDate" "IDate" "integer"

## gender year_of_birth in_hes_source hes_start hes_end
## "integer" "integer" "integer" "IDate" "IDate"

## hes_ethnicity in_hes
## "character" "integer"
## Converting in_hes_source to logical.

## Converting in_hes to logical.
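For readers without access to the CALIBER packages, the sketch below shows the general pattern that the description of importDT suggests, using data.table directly: read delimited text with fread, then convert a date column to IDate. It is not the implementation of importDT, and the column values are invented.

```r
library(data.table)

# Toy file contents (hypothetical values), passed as lines of text
csv_lines <- c("anonpatid,gender,crd",
               "1,1,2001-03-15",
               "2,2,1999-11-02")

# The separator is detected automatically; crd is read as character
# so that the date conversion below is explicit
DT <- fread(text = csv_lines, colClasses = list(character = "crd"))

# Convert the date column to data.table's integer-backed IDate class
DT[, crd := as.IDate(crd)]

# First class of each column
sapply(DT, function(x) class(x)[1])
```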

We designate the patient file as a 'cohort', which means that it contains one row per patient, each variable has an optional description and all the variables are sorted alphabetically. A cohort object is also a data.table, which means it is updated by reference. Converting a data.table to a cohort therefore changes the original data.table.

# Convert to cohort
cohort(PATS, idcolname = "anonpatid")
# This is the same as PATS <- as.cohort(PATS, idcolname = 'anonpatid')

The importFFDF function can load and append multiple files if supplied with a vector of file names. Unlike importDT, it loads character strings as factors and dates as Date types.

# Load clinical data to ffdf
CLIN <- importFFDF(kRawPath %&% "clinical.part." %&% 0:3 %&% ".zip")
## Importing /tmp/RtmpSuhHlK/clinical.part.0
## Importing /tmp/RtmpSuhHlK/clinical.part.1
## Importing /tmp/RtmpSuhHlK/clinical.part.2
## Importing /tmp/RtmpSuhHlK/clinical.part.3

## Imported to ffdf with 29739915 rows and 18 columns

## Column classes after attempted date conversion:

## anonpatid eventdate sysdate constype consid medcode staffid
## "integer" "Date" "Date" "integer" "integer" "integer" "integer" 

## textid episode enttype adid data1 data2 data3
## "integer" "integer" "integer" "integer" "numeric" "integer" "numeric" 

## data4 data5 data6 data7
## "integer" "integer" "integer" "logical"
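The vector of file names in the call above comes from vectorised string concatenation. The sketch below uses a hypothetical stand-in for the %&% operator, assumed here to behave like paste0(); the operator supplied by the CALIBER packages may differ in detail.

```r
# Hypothetical stand-in for %&%, assumed to behave like paste0()
`%&%` <- function(a, b) paste0(a, b)

kRawPath <- "~/RAWDATA/"

# 0:3 is evaluated first, so the concatenation is recycled over four values
files <- kRawPath %&% "clinical.part." %&% 0:3 %&% ".zip"
print(files)
```

Because the sequence operator binds more tightly than user-defined %...% operators, no parentheses are needed around 0:3.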

FFDF objects are saved on the hard drive, and use up very little RAM. This can be checked using the function object.size:

print(object.size(PATS), units = "auto")
## 9.7 Mb
print(object.size(CLIN), units = "auto")
## 60.7 Kb

To save the dataset for future use, we can use the function pack.ffdf. Loading from a 'packed' file is quicker than loading from a text file.

# Save clinical table as packed FFDF on file
pack.ffdf(kRawPath %&% "mypackedCLIN.zip", CLIN)

# To unpack (CLIN is restored to the global environment)
unpack.ffdf(kRawPath %&% "mypackedCLIN.zip")

We are trying to identify diagnoses of stable angina. To do this, we load the stable angina codelist.

sa_codelist <- codelist("sa_diagnosis_gprd.codelist.1.csv")
sa_codelist
## Codelist based on read dictionary with 26 terms.
## 
## Name: sa_diagnosis_gprd
## Version: 1
## Source: GPRD
## Author: Julie George, Emily Herrett, Liam Smeeth, Harry Hemingway
## Date: 12 Apr 2011
## Timestamp: 15.04 23-Apr-13
## Categories:
## 1. History of stable angina
## 2. Vasospastic angina
## 3. Cardiac syndrome X
## 4. Stable angina
## 
## TERMS (sorted by category and code):
##     category    code                              term medcode
##  1:        1 14A5.00              H/O: angina pectoris    6336
##  2:        1 14AJ.00          H/O: Angina in last year   57062
##  3:        2 G331.00               Prinzmetal's angina   12986
##  4:        2 G331.11           Variant angina pectoris   11048
##  5:        2 G332.00             Coronary artery spasm   36854
##  6:        3 G37..00                Cardiac syndrome X    8568
##  7:        4 662K.00                    Angina control   13185
##  8:        4 662K000             Angina control - good   19542
##  9:        4 662K100             Angina control - poor   15373
## 10:        4 662K200        Angina control - improving   14782
## 11:        4 662Kz00                Angina control NOS   15349
## 12:        4 8B27.00               Antianginal therapy   45960
## 13:        4 G33..00                   Angina pectoris    1430
## 14:        4 G330.00                  Angina decubitus   20095
## 15:        4 G330000                  Nocturnal angina   18125
## 16:        4 G330z00              Angina decubitus NOS   29902
## 17:        4 G33z.00               Angina pectoris NOS   25842
## 18:        4 G33z100                       Stenocardia   54535
## 19:        4 G33z200                  Syncope anginosa    7696
## 20:        4 G33z300                  Angina on effort    1414
## 21:        4 G33z500               Post infarct angina    9555
## 22:        4 G33z600                  New onset angina   26863
## 23:        4 G33z700                     Stable angina   12804
## 24:        4 G33zz00               Angina pectoris NOS   28554
## 25:        4 G34y000    Chronic coronary insufficiency   24540
## 26:        4 Gyu3000 [X]Other forms of angina pectoris   39546
##     category    code                              term medcode

We want to find the date of the earliest angina record per patient, and add it to the patient cohort as the index date.

ANGINA <- as.data.table(extractCodes(CLIN, sa_codelist))

# The new variable sa_diagnosis_gprd is a factor
ANGINA[, .N, by = list(sa_diagnosis_gprd, as.integer(sa_diagnosis_gprd))]
##           sa_diagnosis_gprd as.integer      N
## 1:            Stable angina          4 152274
## 2:       Vasospastic angina          2    297
## 3: History of stable angina          1   4184
## 4:       Cardiac syndrome X          3    465

# Now we create an index date for the cohort
ANGINA <- ANGINA[(!is.na(eventdate)) & sa_diagnosis_gprd %in% c("Stable angina", 
    "History of stable angina")]
ANGINA[, indexdate := min(eventdate), by = anonpatid]

# Convert indexdate back to date in case it is a number
ANGINA[, indexdate := as.IDate(indexdate, origin = "1970-01-01")]

# Transfer the index date to the PATS cohort
transferVariables(ANGINA, PATS, "indexdate", description = "date of initial diagnosis")
##  [1] "anonpatid"     "crd"           "deathdate"     "frd"          
##  [5] "gender"        "hes_end"       "hes_ethnicity" "hes_start"    
##  [9] "in_hes"        "in_hes_source" "pracid"        "praclcd"      
## [13] "pracregion"    "pracuts"       "tod"           "toreason"     
## [17] "year_of_birth" "indexdate"

# Generate a variable to include patients only if the indexdate is on or
# after their current registration date
PATS[, include := istrue(indexdate >= crd)]
## Cohort with 115305 patients; ID column = anonpatid 
## 
## COLUMN DESCRIPTIONS
## crd (IDate): 
## deathdate (IDate): 
## frd (IDate): 
## gender (integer): 
## hes_end (IDate): 
## hes_ethnicity (character): 
## hes_start (IDate): 
## in_hes (logical): 
## in_hes_source (logical): 
## indexdate (IDate): date of initial diagnosis
## pracid (integer): 
## praclcd (IDate): 
## pracregion (integer): 
## pracuts (IDate): 
## tod (IDate): 
## toreason (integer): 
## year_of_birth (integer): 
modifyDescription(PATS, "include", "Whether to include in main analysis")

# Count the number of patients included
PATS[, .N, by = include]
##    include     N
## 1:   FALSE 77242
## 2:    TRUE 38063
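The same index date logic can be sketched with data.table alone, which may help when adapting it to other datasets. The tables below are toy data, the update join stands in for transferVariables, and the explicit NA check is only assumed to mirror what istrue does.

```r
library(data.table)

# Toy event table (hypothetical values), including one missing date
EVENTS <- data.table(
  anonpatid = c(1L, 1L, 2L, 2L, 3L),
  eventdate = as.IDate(c("2004-05-01", "2001-02-03", "2010-07-09",
                         NA, "1999-12-31"))
)

# Drop records with a missing date, then take the earliest date per patient
EVENTS <- EVENTS[!is.na(eventdate)]
FIRST <- EVENTS[, .(indexdate = min(eventdate)), by = anonpatid]

# Toy cohort: one row per patient, including a patient with no events
PATS <- data.table(
  anonpatid = 1:4,
  crd = as.IDate(c("2000-01-01", "2011-01-01", "1995-06-01", "2002-02-02"))
)

# Update join: copy indexdate into PATS by reference, matching on anonpatid
PATS[FIRST, indexdate := i.indexdate, on = "anonpatid"]

# Treat a missing index date as FALSE rather than NA
PATS[, include := !is.na(indexdate) & indexdate >= crd]
```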

Additional variables can be added for medical conditions prior to the index date, such as whether a patient has diabetes.

dm_codelist <- codelist("dm_gprd.codelist.2.csv")
dm_codelist
## Codelist based on read dictionary with 513 terms.
## 
## Name: dm_gprd
## Version: 2
## Source: GPRD
## Author: Julie George, Emily Herrett, Anoop Shah, Liam Smeeth, Harry Hemingway
## Date: 06 Jan 2012
## Timestamp: 15.04 23-Apr-13
## Categories:
## 1. H/O diabetes
## 2. Possible diabetes
## 3. T1DM diagnosed
## 4. T2DM diabetes diagnosed
## 5. Secondary diabetes
## 6. Diabetes, not otherwise specified
## 7. Diabetes excluded
## 8. Diabetes resolved
## 
## TERMS (sorted by category and code):
##      category    code                                         term medcode
##   1:        1 1434.00                       H/O: diabetes mellitus    6813
##   2:        1 14F4.00 H/O: Admission in last year for diabetes ...    7045
##   3:        1 14P3.00                         H/O: insulin therapy   17236
##   4:        1 2126300                            Diabetes resolved   28622
##   5:        1 212H.00                            Diabetes resolved   18766
##  ---                                                                      
## 509:        6 TJ23z00 Adverse reaction to insulins and antidiab...   61210
## 510:        6 U602311 [X] Adverse reaction to insulins and anti...   65684
## 511:        6 ZC2C800         Dietary advice for diabetes mellitus   10642
## 512:        6 ZV65312  [V]Dietary counselling in diabetes mellitus   16881
## 513:        7 1I0..00                   Diabetes mellitus excluded   19203

# Use categories 3, 4 and 6
addCodelistToCohort(PATS, "diabetes", CLIN, dm_codelist, categories = c(3, 4, 
    6), binary = TRUE, limit_years = c(-Inf, 0), description = "Diabetes prior to index date")
## Called from: addToCohort(x, varname, USE, old_varname = "value", value_choice = function(x) any(istrue(x)), 
##     limit_years = limit_years, overwrite = overwrite, idcolname = idcolname, 
##     datecolname = datecolname, description = description)
##    diabetes      N
## 1:    FALSE 109777
## 2:     TRUE   5528
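Conceptually, a binary 'prior to index date' variable is TRUE when a patient has at least one matching record in the time window. The sketch below shows this with toy data and plain data.table; it is not the implementation of addCodelistToCohort, and treating a record on the index date itself as 'prior' is an assumption.

```r
library(data.table)

# Toy cohort and toy diabetes records (hypothetical values)
PATS <- data.table(
  anonpatid = 1:3,
  indexdate = as.IDate(c("2005-01-01", "2005-01-01", "2005-01-01"))
)
DM <- data.table(
  anonpatid = c(1L, 2L),
  eventdate = as.IDate(c("2003-04-05", "2006-07-08"))
)

# Attach each patient's index date to their diabetes records
DM[PATS, indexdate := i.indexdate, on = "anonpatid"]

# Patients with a record on or before the index date
# (assumed to correspond to limit_years = c(-Inf, 0))
PRIOR <- DM[eventdate <= indexdate, unique(anonpatid)]

# TRUE if any qualifying record exists, FALSE otherwise
PATS[, diabetes := anonpatid %in% PRIOR]
```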

Next we add the most recent systolic blood pressure recorded within 2 years before the index date, taking the mean if there is more than one measurement on the same day.

# First extract the blood pressure readings
BP <- extractEntity(CLIN, 1)

# Now use the addToCohort function to add the mean BP
addToCohort(PATS, "sbp", BP, "Systolic", value_choice = mean, date_priority = "last", 
    limit_years = c(-2, 0), date_varname = "sbp_date", description = "Most recent SBP within 2y prior to index date")
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0     130     140     143     156    1130   82651
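The selection rule used here (restrict to a time window, average within a day, then keep the latest day) can be sketched with plain data.table as follows. The data are invented, 730 days is used as an approximation to 2 years, and the sketch is not the implementation of addToCohort.

```r
library(data.table)

# Toy cohort and toy blood pressure readings (hypothetical values)
PATS <- data.table(
  anonpatid = 1:2,
  indexdate = as.IDate(c("2010-06-01", "2010-06-01"))
)
BP <- data.table(
  anonpatid = c(1L, 1L, 1L, 1L, 2L),
  eventdate = as.IDate(c("2009-01-10", "2010-03-01", "2010-03-01",
                         "2010-09-01", "2005-01-01")),
  Systolic  = c(150, 130, 140, 120, 160)
)

# Attach the index date to each reading
BP[PATS, indexdate := i.indexdate, on = "anonpatid"]

# Keep readings from 730 days before the index date up to the index date
WIN <- BP[eventdate <= indexdate & eventdate >= indexdate - 730L]

# Mean per patient per day, then keep the latest day per patient
DAILY <- WIN[, .(sbp = mean(Systolic)), by = .(anonpatid, sbp_date = eventdate)]
setorder(DAILY, anonpatid, sbp_date)
LAST <- DAILY[, .SD[.N], by = anonpatid]

# Transfer the value and its date to the cohort by reference
PATS[LAST, `:=`(sbp = i.sbp, sbp_date = i.sbp_date), on = "anonpatid"]
```

Patients with no reading in the window are left with missing values, which is consistent with the large number of NA's in the summary above.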