Collider bias
collider_bias.Rmd
An example of using the collider bias functions to investigate the degree of index event (collider) bias in your GWAS progression data. We start with independent variants from the incident GWAS of interest - this will will require pre-processing with appropriate clumping parameters. All variants from the progression GWAS are included in the initial step.
library(genepi.utils)
# read in the data
gwas_clumped_incidence <- data.table::fread(clumped_incidence_path)[index==TRUE, ] # take only the index SNPs
gwas_progression <- data.table::fread(progression_path)
Slope-hunter
Background
Run the SlopeHunter expectation-maximization method to estimate the bias adjustment factor. The algorithm uses model-based clustering and proposes that the distributions of the incidence(GI) and progression (GP) BETAs.
- Cluster 1: SNPs that cause incidence but not progression (red)
- Cluster 2: SNPs that cause incidence and progression (blue)
- Cluster 3: SNPs that cause progression but not incidence
(gray)
- Cluster 4: SNPs that cause neither incidence or progression (gray)
The first thing to note is that we can filter out SNPs from cluster 3
and 4 by only including SNPs with a significant association (P-value)
with incidence - this is the ip
parameter. The problem is
then reduced to finding two clusters that best fit distributions 1 and 2
of the equation above. The SlopeHunter EM algorithm iteratively
determines which SNPs belong to each distribution and once complete the
adjustment factor (slope gradient) can be determined from group 1,
i.e. only those SNPs thought to solely cause incidence.
This function’s code is adapted from the SlopeHunter R package and if using this method the SlopeHunter package should be referenced as the original source.
Run
Running the Slope-hunter algorithm requires initialising parameters.
-
ip
is the initial p-value by which to filter the incident GWA variants. Variants with a p-value greater thatip
will be removed from the analysis (i.e. variants in clusters 3 & 4). -
pi0
is an initial guess for the proportion of variants that only associate with incidence (i.e. belong to cluster 1).
-
sxy1
is an initial guess for the covariance between incidence and progression betas in cluster 2.
The standard error of the slope is estimated with bootstrapping, 100 samples in the case below.
result <- slopehunter(gwas_i = gwas_clumped_incidence,
gwas_p = gwas_progression,
merge = c("CHR"="CHR", "BP"="BP"),
ip = 0.001,
pi0 = 0.6,
sxy1 = 1e-5,
bootstraps = 0)
str(result)
Visualise
We can visualise the iterative EM algorithm like so. The variants are allocated to either cluster 1 (incidence only) or cluster 2 (incidence and progression) based on how well they conform to defined bivariate distributions for each cluster. The key point is that the incidence-only distribution has a linear relationship between incidence and progression betas, with no constant term - i.e. we search for a straight line passing through the origin. At each iteration the distribution parameters are recalculated (variance / s.dev / cov) and variant reassigned to either cluster 1 or 2, until the model converges.
The b-slope
In the expectation (E) part of the Slope-hunter algorithm we
initially set the covariance matrix of the first component. The
off-diagonal components are set to sxy0 = sx0*sy0*dir0*0.95
where sx0
and sy0
are the standard deviations
of the incidence and progression betas, respectively, and the
dir0
term is the sign()
of the covariance -
which is positive in this case:
In the maximisation (M) step of the algorithm, we update
dir0
as the sign()
of
sum(pt*xbeta*ybeta)/sum(pt)
. pt
is a positive
array and so the sign depends on the relative proportions of negatives
and positive of the product of xbeta
and
ybeta
. This is positive in our case, as visualised in this
histogram:
The effect, I believe, is that a positive slope is maintained at each iteration of the algorithm.
Overall assessment - multiple methods
The algorithms can be sensitive to the incidence p-value threshold
ip
. Too high and null variants without any effect on
incidence will be included. Too low and variants will small, but
potentially significant, effects on incidence will be removed. It is a
good idea to investigate the stability of the adjustment factor estimate
to difference in ip
threshold.
We can run many combinations of the above analyses by defining sets/vectors of parameters. Note: be careful with large numbers of parameters as this will quickly lead to large numbers of combinations. e.g.
- methods = c(“slopehunter”)
- ip = c(0.05,0.001,0.00001)
- pi0 = c(0.6, 0.65, 0.7).
- sxy1 = c(1e-5, 1e-4, 1e-3)
…will lead to 27 separate analyses.
We can use the analyse_collider_bias()
function to run
all combinations of methods and parameters. Here is the stability of the
b-slope with different p-values (ip
).
results <- analyse_collider_bias(gwas_i = gwas_clumped_incidence,
gwas_p = gwas_progression,
merge = c("CHR"="CHR", "BP"="BP"),
methods= c("slopehunter","mr_collider_bias","dudbridge"),
tsmr_method = c("mr_ivw","mr_egger_regression","mr_simple_median","mr_simple_mode"),
ip = c(0.01,0.001,0.0001,0.00001,0.000001,0.0000001),
pi0 = c(0.6),
sxy1 = c(1e-05),
bootstraps = 100)
p <- plot_slope(results)
p
Plotted another way with
plot_correction_stability()
.
p <- plot_correction_stability(results)
p
Applying the correction factor
Once the parameters have been set and b
estimate of the
correction calculated we next need to apply the correction to the
(biased) progression GWAS data. As the correction uses the incidence
data we can only apply this to variants that appear in both incidence
and progression datasets (the harmonised dataset).
# apply correction
corrected_gwas <- apply_collider_correction(gwas_i = gwas_incidence,
gwas_p = gwas_progression,
merge = c("CHR"="CHR", "BP"="BP"),
b_correction_factor = 0.6473,
b_std_err = 0.2162)
# calculate difference in raw and adjusted betas; and effect on genome wide significance
corrected_gwas[, beta_diff := BETA_progression - adjusted_beta]
sig_change_levels <- c("Dropped hit", "New hit", "No change")
corrected_gwas[, sig_change:= data.table::fcase(P_progression < 5e-8 & adjusted_p >= 5e-8, factor("Dropped hit", levels=sig_change_levels),
P_progression >=5e-8 & adjusted_p < 5e-8, factor("New hit", levels=sig_change_levels),
default = factor("No change", levels=sig_change_levels))]
# plot difference
p <- ggplot2::ggplot(data = corrected_gwas,
mapping = ggplot2::aes(x=factor(""), y = beta_diff)) +
ggplot2::geom_point(mapping = ggplot2::aes(shape = sig_change,
fill = adjusted_beta,
color = sig_change), alpha=0.95, position = ggplot2::position_jitter(seed = 1, width = 0.1)) +
viridis::scale_fill_viridis() +
ggplot2::scale_shape_manual(name = "Change in genome wide significance",
values = c("Dropped hit"=25, "New hit"=24, "No change"=21), drop = FALSE) +
ggplot2::scale_color_manual(name = "Change in genome wide significance",
values = c("Dropped hit"="red", "New hit"="green", "No change"="#44015410"), drop = FALSE) +
ggplot2::geom_violin(alpha = 0.3, fill="purple", color="#44015410") +
ggplot2::theme_bw(base_size = 14) +
ggplot2::labs(title = 'Slope-hunter adjustment',
x = expression("\u03B2"[progression-SNPs]),
y = expression("\u03B2"[adjustment]),
color = "Change in genome wide significance",
fill = "BETA magnitude")