The goal of heRmes is to standardise the heart failure phenotyping of collections of electronic health records.
Installation
You can install the latest version of heRmes like so:
# install.packages("devtools")
devtools::install_github("nicksunderland/heRmes")
Phenotypes
The code lists underpinning the various phenotypes are stored in text files within the package structure at: inst/extdata/ukhdr_phenotypes
. The format of the file matches that used by the UKHDR Phenotype Library, but the important columns are: code
, description
, coding_system.name
, phenotype_id
and phenotype_name
. Below is an example of how to view the available phenotypes and obtain the codes.
Avaiable phenotypes
For example, view the first 5 phenotypes.
get_phenotypes()[1:5]
#> CCU002_02 Cardiomyopathy
#> "PH1002"
#> Acute Myocardial Infarction (AMI)
#> "PH1024"
#> Heart Failure (fatal/non-fatal)
#> "PH1028"
#> Congestive heart failure - Charlson primary care
#> "PH1055"
#> Myocardial infarction - Charlson primary care
#> "PH1062"
Codes
View the codes for PH1645
corresponding to the HERMES Heart Failure phenotype.
# top 5 codes
get_codes(pheno_id = "PH1645")[1:5, c("phenotype_id", "phenotype_name", "coding_system.name", "code")]
#> phenotype_id phenotype_name coding_system.name code
#> <char> <char> <char> <char>
#> 1: PH1645 Heart failure ICD9 codes 40201
#> 2: PH1645 Heart failure ICD9 codes 42832
#> 3: PH1645 Heart failure ICD9 codes 42821
#> 4: PH1645 Heart failure ICD9 codes 42823
#> 5: PH1645 Heart failure ICD9 codes 42820
Phenotyping a dataset
Create sample data. This can be in long (only one column containing diagnosis codes) or wide format (multiple columns containing diagnosis codes). n.b. prioritising coding based on diagnosis code position (e.g. primary vs. secondary vs. tertiary positions) is not currently supported.
set.seed(2020)
n <- 10
dat <- data.frame(ids = paste0("ID_", c(1:(n/2), 1:(n/2))),
codes = sample(c("I420", "foo", "bar", "baz"), n, replace = TRUE),
codes1 = sample(c("I420", "foo", "bar", "baz"), n, replace = TRUE))
dat
#> ids codes codes1
#> 1 ID_1 baz I420
#> 2 ID_2 baz I420
#> 3 ID_3 bar baz
#> 4 ID_4 foo baz
#> 5 ID_5 baz baz
#> 6 ID_1 I420 foo
#> 7 ID_2 I420 foo
#> 8 ID_3 baz baz
#> 9 ID_4 foo foo
#> 10 ID_5 foo I420
Phenotype the individuals with phenotype PH1643
(heart failure syndrome) or PH1646
(cardiomyopathy), excluding phenotypes PH1637
(congenital heart disease) and PH1636
(myocardial infarction). There can be multiple included or excluded phenotypes given in a list.
result <- phenotype(x = dat,
id_col = "ids",
code_cols = list("ICD10 codes" = c("codes", "codes1")),
include = list("PH1645"),
exclude = list("PH1637"))
#> Phenotyping...
#> [i] processing 10 records
#> [i] pivoting data longer
#> [i] getting inclusion phenotype codes from PhenoID(s) PH1645
#> [i] getting exclusion phenotype codes from PhenoID(s) PH1637
#> [i] assessing phenotype PH1645
#> [i] assessing phenotype PH1637
#> [i] summarising phenotyping of participants
#> [i] finished
result[]
#> ids PH1645 PH1637 none include exclude overall
#> <char> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl>
#> 1: ID_1 FALSE FALSE TRUE FALSE FALSE FALSE
#> 2: ID_2 FALSE FALSE TRUE FALSE FALSE FALSE
#> 3: ID_3 FALSE FALSE TRUE FALSE FALSE FALSE
#> 4: ID_4 FALSE FALSE TRUE FALSE FALSE FALSE
#> 5: ID_5 FALSE FALSE TRUE FALSE FALSE FALSE
Code formatting issues
Many of the coding systems have slight formating differences - for example an ICD-10 code may appear as I509
or I50.9
in a dataset. The phenotype()
provides a way to clean these codes through use of the gsub
argument. This takes a 3 element list: [[1]] is a string representing the regular expression pattern, [[2]] is the replacement string, and [[3]] is a character or character vector which can be one or more of: x
(apply to codes in x
), pheno
(apply to all codes in phenotypes), both
(apply to everything), or a valid phenotype ID found in include
or exclude
(apply to specific phenotype datasets). Other arguments can be passed to gsub
through ...
.
It is important to inspect your dataset (x
) and phenotype coding (use get_codes()
) prior to running the phenotyping to avoid join issues related to formatting differences.
Output formatting can be changed by altering the inputs. If the phenotype IDs are named, these names are used as column names in the result. The overall result is given in the column overall
, although this can be renamed by giving the name
parameter.
# change format
dat[10, "codes1"] <- "I42.0"
dat[]
#> ids codes codes1
#> 1 ID_1 baz I420
#> 2 ID_2 baz I420
#> 3 ID_3 bar baz
#> 4 ID_4 foo baz
#> 5 ID_5 baz baz
#> 6 ID_1 I420 foo
#> 7 ID_2 I420 foo
#> 8 ID_3 baz baz
#> 9 ID_4 foo foo
#> 10 ID_5 foo I42.0
# without dealing with the error ID_5 is incorrectly classified as no HF.
wrong <- phenotype(x = dat,
id_col = "ids",
code_cols = list("ICD10 codes" = c("codes", "codes1")),
include = list(HF = "PH1645"),
exclude = list(congHD = "PH1637"),
name = "Heart Failure")
#> Phenotyping...
#> [i] processing 10 records
#> [i] pivoting data longer
#> [i] getting inclusion phenotype codes from PhenoID(s) PH1645
#> [i] getting exclusion phenotype codes from PhenoID(s) PH1637
#> [i] assessing phenotype PH1645
#> [i] assessing phenotype PH1637
#> [i] summarising phenotyping of participants
#> [i] finished
wrong[]
#> ids HF congHD none include exclude Heart Failure
#> <char> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl>
#> 1: ID_1 FALSE FALSE TRUE FALSE FALSE FALSE
#> 2: ID_2 FALSE FALSE TRUE FALSE FALSE FALSE
#> 3: ID_3 FALSE FALSE TRUE FALSE FALSE FALSE
#> 4: ID_4 FALSE FALSE TRUE FALSE FALSE FALSE
#> 5: ID_5 FALSE FALSE TRUE FALSE FALSE FALSE
# deal with formatting issue using gsub
pheno <- phenotype(x = dat,
id_col = "ids",
code_cols = list("ICD10 codes" = c("codes", "codes1")),
include = list(HF = "PH1645"),
exclude = list(congHD = "PH1637"),
gsub = list("\\.", "", c("x")),
name = "Heart Failure")
#> Phenotyping...
#> [i] processing 10 records
#> [i] pivoting data longer
#> [i] cleaning input codes with regex [ \. ], replacement [ ]
#> [i] getting inclusion phenotype codes from PhenoID(s) PH1645
#> [i] getting exclusion phenotype codes from PhenoID(s) PH1637
#> [i] assessing phenotype PH1645
#> [i] assessing phenotype PH1637
#> [i] summarising phenotyping of participants
#> [i] finished
pheno[]
#> ids HF congHD none include exclude Heart Failure
#> <char> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl>
#> 1: ID_1 FALSE FALSE TRUE FALSE FALSE FALSE
#> 2: ID_2 FALSE FALSE TRUE FALSE FALSE FALSE
#> 3: ID_3 FALSE FALSE TRUE FALSE FALSE FALSE
#> 4: ID_4 FALSE FALSE TRUE FALSE FALSE FALSE
#> 5: ID_5 FALSE FALSE TRUE FALSE FALSE FALSE
Update library from UKHDR
This package’s phenotype library can be updated from the UKHDR Phenotype Library API using the below function. This queries the library for phenotypes matching enteries in the search_terms
argument.
update_library(search_terms = c("heart failure", "cardiomyopathy", "myocardial infarction"))
#> [i] reading phenotype id: PH25 - skipping, already exists
#> [i] reading phenotype id: PH182 - skipping, already exists
#> [i] reading phenotype id: PH530 - skipping, already exists
#> [i] reading phenotype id: PH531 - skipping, already exists
#> [i] reading phenotype id: PH631 - skipping, already exists
#> [i] reading phenotype id: PH687 - skipping, already exists
#> [i] reading phenotype id: PH968 - skipping, already exists
#> [i] reading phenotype id: PH993 - skipping, already exists
#> [i] reading phenotype id: PH1028 - skipping, already exists
#> [i] reading phenotype id: PH1055 - skipping, already exists
#> [i] reading phenotype id: PH1074 - skipping, already exists
#> [i] reading phenotype id: PH1603 - skipping, already exists
#> [i] reading phenotype id: PH129 - skipping, already exists
#> [i] reading phenotype id: PH145 - skipping, already exists
#> [i] reading phenotype id: PH185 - skipping, already exists
#> [i] reading phenotype id: PH961 - skipping, already exists
#> [i] reading phenotype id: PH1002 - skipping, already exists
#> [i] reading phenotype id: PH215 - skipping, already exists
#> [i] reading phenotype id: PH356 - skipping, already exists
#> [i] reading phenotype id: PH481 - skipping, already exists
#> [i] reading phenotype id: PH530 - skipping, already exists
#> [i] reading phenotype id: PH611 - skipping, already exists
#> [i] reading phenotype id: PH612 - skipping, already exists
#> [i] reading phenotype id: PH613 - skipping, already exists
#> [i] reading phenotype id: PH741 - skipping, already exists
#> [i] reading phenotype id: PH886 - skipping, already exists
#> [i] reading phenotype id: PH942 - skipping, already exists
#> [i] reading phenotype id: PH949 - skipping, already exists
#> [i] reading phenotype id: PH988 - skipping, already exists
#> [i] reading phenotype id: PH1024 - skipping, already exists
#> [i] reading phenotype id: PH1062 - skipping, already exists
Update library from UKHDR (unpublished)
This package’s phenotype library can be updated with unpublished/development phenotypes from the UKHDR Phenotype Library API using the below function. However, since unpublished phenotypes are not searchable by name, we need to pass the exact ID and also login details for the website (stored in a local .Renviron
file in this example.)
# development phenotypes, ids named for readability only
hermes_phenos <- c(`Congenital heart disease` = "PH1637",
`Heart failure` = "PH1645")
# update
update_library(search_terms = c(),
ids = hermes_phenos,
UKHDR_UN = Sys.getenv("UKHDR_UN"),
UKHDR_PW = Sys.getenv("UKHDR_PW"))
Plotting phenotype
To see the intersection of the codes in two or more phenotype files use the plot_code_overlap()
function.
plot_code_overlap(pheno_ids = c("PH1645", "PH1028", "PH1055", "PH1074", "PH182", "PH25", "PH530", "PH531", "PH631", "PH687", "PH968", "PH993"),
types = c("ICD10 codes", "ICD9 codes", "OPCS4 codes", "Read codes v2"))