This blog post is an addendum to the commentary "Towards more reliable non-linear mendelian randomization investigations" to provide practical advice on the implementation of non-linear Mendelian randomization.
Non-linear Mendelian randomization is an extension of standard Mendelian randomization that aims to understand the shape of the causal relationship between an exposure and an outcome, using genetic variants as instrumental variables. As this approach aims to draw stronger conclusions than standard Mendelian randomization, it also makes stronger assumptions.
We have proposed two methods for non-linear Mendelian randomization: the residual method and the doubly-ranked method. Both methods operate by dividing the population into strata and obtaining linear estimates within each stratum. Hence, stratified Mendelian randomization is perhaps a more accurate name, but so far the name non-linear Mendelian randomization has stuck. There are other approaches to performing non-linear instrumental variable analysis, but stratification works well in the context of Mendelian randomization, as the genetic instrument typically explains a small fraction of the variance in the exposure.
The residual method has been shown to give biased estimates both in simulations and in practice - it assumes that the effect of the genetic instrument on the exposure is constant in the population, which is often not true in practice. Hence, we would now recommend against the use of the residual method in practice. Indeed, we have retracted and republished or revised two of our previous publications using this approach.
The doubly-ranked method performs well in simulations - it makes a weaker assumption (the "rank-preserving assumption"), which is more plausible.
However, others have shown that strata formed by both methods can have non-null genetic associations with age and sex (generally worse for the residual method, but also present for the doubly-ranked method), even if genetic associations with age and sex are null in the population as a whole. This is a threat to the validity of findings, as such associations reflect violations of the instrumental variable assumptions that may bias estimates. Such associations appear to be stronger in UK Biobank than in other datasets, potentially due to greater selection bias in UK Biobank. Adjustment for age and sex will reduce bias, although whether it mitigates bias completely is unclear.
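Checking these negative controls is straightforward once stratum assignments are available: regress age and sex on the genetic score within each stratum and inspect the p-values. The sketch below is generic and illustrative (the variable names grs, age, sexbin, and stratum are assumptions, and the data here are simulated; with real data, replace the simulated vectors with your own):

```r
set.seed(496)
n = 5000
grs     = rnorm(n)                      # genetic risk score (simulated)
age     = rnorm(n, mean=57, sd=8)       # age at recruitment (simulated)
sexbin  = rbinom(n, 1, 0.5)             # 0/1 sex indicator (simulated)
stratum = sample(1:5, n, replace=TRUE)  # stratum assignments from your NLMR method

# within each stratum, test whether the genetic score is associated with
#  age or sex - non-null associations flag potential IV violations
pvals = sapply(sort(unique(stratum)), function(j) {
  sub = stratum == j
  c(age = summary(lm(age[sub] ~ grs[sub]))$coef[2, 4],
    sex = summary(lm(sexbin[sub] ~ grs[sub]))$coef[2, 4])
})
round(pvals, 3)
```

With truly null data, as simulated here, roughly 5% of strata would show p < 0.05 by chance, so consistent or extreme associations are more concerning than a single borderline p-value.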
Additionally, estimates from the doubly-ranked method can vary between implementations of the method. Even removing one or two individuals from the analysis can change estimates drastically. The estimates are still valid, but they behave as if they came from a completely different dataset, which isn't ideal. When implementing the doubly-ranked method in practice, we have started using bootstrap aggregating: we repeat the analysis a large number of times, removing a small number of individuals (usually 12) at random from the dataset each time, and average over results using Rubin's rules (as in multiple imputation for missing data). Why 12 individuals? No real reason, but if it was good enough for Jesus, then it's good enough for me.
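The pooling step can be written as a small helper. This is a minimal sketch (the function name pool_rubin is illustrative, not from the SUMnlmr package): the pooled point estimate is the mean of the repetition estimates, and the pooled variance is the mean within-repetition variance plus (m+1)/m times the between-repetition variance of the estimates.

```r
# Rubin's rules pooling across bootstrap repetitions (sketch)
pool_rubin = function(est, se) {
  m = length(est)
  within  = mean(se^2)                  # average within-repetition variance
  between = var(est)                    # between-repetition variance
  total   = within + (m + 1)/m * between
  c(estimate = mean(est), se = sqrt(total))
}

res = pool_rubin(est = c(0.10, 0.12, 0.08), se = c(0.05, 0.05, 0.05))
res
```

The between-repetition term inflates the standard error when the repetitions disagree, so the instability between implementations is reflected honestly in the pooled uncertainty.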
Final point - the original version of the doubly-ranked method suggested creating pre-strata containing Q individuals each, where Q is the number of strata in the dataset. For a dataset of the size of UK Biobank, this is quite small. Hence, we suggest increasing the size of the pre-strata. In the code below (which uses a recent update to the SUMnlmr package), we set prestrat = 10, which means the pre-strata contain 10*Q individuals. The pre-strata shouldn't be too large, as the method works by assuming that the instrument values are equal (or close to equal) within pre-strata. But increasing the pre-stratum size means that the eventual strata should be more distinct - the means of the exposure in each stratum should be more spread out.
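To make the mechanics concrete, here is a conceptual sketch of doubly-ranked stratification (simplified and illustrative only - the actual SUMnlmr implementation handles ties and unequal block sizes more carefully): rank individuals by the instrument, cut them into consecutive pre-strata of prestrat*Q individuals, then within each pre-stratum rank by the exposure and send equal shares of each pre-stratum to each of the Q strata.

```r
# Conceptual sketch of doubly-ranked stratification (not the SUMnlmr code)
doubly_ranked_strata = function(g, x, q, prestrat = 1) {
  n    = length(g)
  size = q * prestrat                    # individuals per pre-stratum
  ord  = order(g)                        # rank individuals by the instrument
  pre  = ceiling(seq_along(ord) / size)  # consecutive blocks form pre-strata
  stratum = integer(n)
  for (p in unique(pre)) {
    idx = ord[pre == p]                  # members of this pre-stratum
    # within the pre-stratum, rank by the exposure; the prestrat lowest
    #  go to stratum 1, the next prestrat to stratum 2, and so on
    stratum[idx[order(x[idx])]] = ceiling(seq_along(idx) / prestrat)
  }
  stratum
}

set.seed(496)
g = rnorm(1000); x = 0.3*g + rnorm(1000)
s = doubly_ranked_strata(g, x, q = 5, prestrat = 10)
table(s)            # equal-sized strata
tapply(x, s, mean)  # exposure means increase across strata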
Here is code to implement non-linear Mendelian randomization using the doubly-ranked method, incorporating bootstrap aggregation and a variable pre-stratum size, using the SUMnlmr package:
library(SUMnlmr)

# input data
out = cbind(outcome1, outcome2, etc) # matrix of outcomes
# we recommend including age/sex as outcomes, as they are important negative controls
# however, remember not to adjust for sex/age when sex/age is the outcome!
grs      = genetic_risk_score
exposure = exposure

# parameter values - please edit to fit your application
strata       = 5     # number of strata - 5 is a nice number, but please change
                     #  to what is most sensible in your application
outcomes     = 3     # number of outcomes
participants = 30000 # number of study participants

# covariates
sexbin  = ifelse(sex=="Male", 0, 1)
age2    = age^2
agesex  = age*sexbin
age2sex = age^2*sexbin
# given the observed associations with age and sex, we suggest adjusting for
#  age^2, and age*sex and age^2*sex interactions to minimize the impact of
#  any associations
PCs    = principal_components # genomic principal components
centre = recruitment_centre   # any technical covariates to account for

# analysis code
bx    = array(NA, dim=c(strata, outcomes))
bxse  = array(NA, dim=c(strata, outcomes))
by    = array(NA, dim=c(strata, outcomes))
byse  = array(NA, dim=c(strata, outcomes))
xmean = array(NA, dim=c(strata, outcomes))
xmin  = array(NA, dim=c(strata, outcomes))
xmax  = array(NA, dim=c(strata, outcomes))

for (k in 1:outcomes) {
  set.seed(496+k) # seed is set to ensure reproducibility
    # 496 is one of my favourite numbers, please replace with your favourite number
  times = 100 # 100 repetitions for the bootstrap aggregation - I wouldn't go lower than this
  bxi    = array(NA, dim=c(times, strata))
  bxsei  = array(NA, dim=c(times, strata))
  byi    = array(NA, dim=c(times, strata))
  bysei  = array(NA, dim=c(times, strata))
  xmeani = array(NA, dim=c(times, strata))
  xmini  = array(NA, dim=c(times, strata))
  xmaxi  = array(NA, dim=c(times, strata))
  for (i in 1:times) {
    skip = sample(participants, 12, replace=FALSE)
    summ_data <- create_nlmr_summary(y = out[-skip,k],
                   x = exposure[-skip],
                   g = grs[-skip],
                   covar = cbind(sexbin, age, age2, agesex, age2sex, PCs, centre)[-skip,],
                   family = "gaussian", # or "binomial" if outcome is binary
                   strata_method = "ranked",
                   controlsonly = FALSE,
                   prestrat = 10, # pre-strata of size 10 times the number of strata
                     # if the sample size is large, this should be okay
                   q = strata)
    bxi[i,]    = summ_data$summary$bx
    bxsei[i,]  = summ_data$summary$bxse
    byi[i,]    = summ_data$summary$by
    bysei[i,]  = summ_data$summary$byse
    xmeani[i,] = summ_data$summary$xmean
    xmini[i,]  = summ_data$summary$xmin
    xmaxi[i,]  = summ_data$summary$xmax
    cat(i)
  }
  # application of Rubin's rules
  bx[,k]    = apply(bxi, 2, mean)
  bxse[,k]  = sqrt(apply(bxsei^2, 2, mean) + (times+1)/times*apply(bxi, 2, var))
  by[,k]    = apply(byi, 2, mean)
  byse[,k]  = sqrt(apply(bysei^2, 2, mean) + (times+1)/times*apply(byi, 2, var))
  xmean[,k] = apply(xmeani, 2, mean)
  xmin[,k]  = apply(xmini, 2, mean)
  xmax[,k]  = apply(xmaxi, 2, mean)
  cat(k, "\n")
}
This gives the genetic associations with the exposure within the strata (bx = beta-coefficients, bxse = standard errors), the genetic associations with the outcome within the strata (by = beta-coefficients, byse = standard errors), and the mean level of the exposure within each stratum (xmean), plus the minimum and maximum of the exposure within each stratum (xmin, xmax), which are used by the plotting commands in the SUMnlmr package.
These can then be plotted, for example as a forest plot:
par(mar=par("mar")+c(0,2,0,0))
labs = c("Lowest stratum ", paste0("Stratum ", 2:(strata-1), " "), "Highest stratum ")
plot(by[,1]/bx[,1], strata:1-0.5, bty="n",
     xlim=c(min((by[,1]-1.96*byse[,1])/bx[,1]), max((by[,1]+1.96*byse[,1])/bx[,1])),
     ylim=c(0,strata), yaxt="n", main="Outcome 1",
     xlab="MR estimate (95% confidence interval)", ylab="")
for (j in 1:strata) {
  lines(c((by[j,1]-1.96*byse[j,1])/bx[j,1], (by[j,1]+1.96*byse[j,1])/bx[j,1]),
        c(strata+.5-j, strata+.5-j))
  text(min((by[,1]-1.96*byse[,1])/bx[,1]), strata+.5-j, labs[j], adj=1, xpd=NA)
}
Or used as inputs for the frac_poly_summ_mr command in the SUMnlmr package.
Please let us know if this code is useful - we are still working out how to implement non-linear Mendelian randomization in the most reliable way, and so this advice may evolve over time. But this represents current best practice from my perspective.