If you are planning a study, or analysing a study with missing data, these guidelines (pdf, 25Kb) are for you.
From time to time people have concerns about computational issues with multiple imputation; this multiple imputation computational issues document (pdf) may help.
Missing data is very common in observational and experimental research. It can arise for many reasons, such as faulty machinery in lab experiments, patients dropping out of clinical trials, or non-response to sensitive items in surveys. Handling missing data is a complex and active research area in statistics.
Ignoring the problem of missing data can lead to loss of statistical power and can also introduce bias. Analysis of data with missing observations involves, firstly, constructing a suitable set of assumptions about why the data is missing in the study. Given these assumptions, there are several methods for carrying out the analysis of data, including the EM algorithm, inverse probability weighting, a full Bayesian analysis, direct application of maximum likelihood and multiple imputation. We usually cannot assess the validity of the assumptions regarding the missingness mechanism, so it is often recommended to examine how robust the inference is to the choice of assumptions in a sensitivity analysis.
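As a rough illustration of how ignoring missing data can introduce bias, the following minimal sketch simulates an outcome whose chance of being missing depends on a fully observed covariate (a MAR mechanism) and compares the mean of the full data with the complete-case mean. The data-generating model, the coefficients and the function name are assumptions chosen for illustration only.

```python
import math
import random
import statistics

def simulate_complete_case_bias(n=20000, seed=1):
    """Compare the full-data mean of y with the mean among observed cases."""
    rng = random.Random(seed)
    full, observed = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)            # fully observed covariate
        y = x + rng.gauss(0.0, 1.0)        # outcome, true mean 0
        # Assumed MAR mechanism: larger x makes y more likely to be missing.
        p_missing = 1.0 / (1.0 + math.exp(-2.0 * x))
        full.append(y)
        if rng.random() >= p_missing:
            observed.append(y)
    return statistics.fmean(full), statistics.fmean(observed)

full_mean, cc_mean = simulate_complete_case_bias()
print(f"full-data mean: {full_mean:.3f}, complete-case mean: {cc_mean:.3f}")
```

In this set-up the complete-case mean is pulled well below zero, because cases with large x (and hence large y) are preferentially lost.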
- Introduction to Multiple Imputation: Slides
These slides aim to introduce you to the concepts and ideas related to analysing datasets with missing observations. They have been extracted from James Carpenter and Mike Kenward's introductory course on missing data (2005).
- Substantive model compatible imputation of missing covariates
Missing covariate data commonly occur in epidemiological and clinical research, and are often dealt with using multiple imputation (MI). The imputation of partially observed covariates is complicated if the model of interest is non-linear (e.g. Cox proportional hazards model), or contains non-linear (e.g. squared) or interaction terms. Standard software implementations of MI may impute covariates from models that are incompatible (uncongenial) with such models of interest, which may result in biased estimates. We have recently proposed a modified version of the popular fully conditional specification (FCS) (or chained equations) approach to multiple imputation, which ensures that each partially observed covariate is imputed from a model which is compatible with the specified model for the outcome.
A paper describing the method has been published in Statistical Methods in Medical Research:
Jonathan W. Bartlett, Shaun R. Seaman, Ian R. White, James R. Carpenter. Multiple imputation of covariates by fully conditional specification: accommodating the substantive model, Statistical Methods in Medical Research, 2015; 24:462-487
smcfcs in R
An R package implementing the approach is now available, and can be installed from within R from CRAN:
install.packages("smcfcs")
The latest development version can be installed into R from GitHub.
smcfcs in Stata
A Stata program implementing the approach for linear, logistic and Cox proportional hazards outcome models is available for free download. Imputation is now supported for continuous (under the normal linear regression model), binary (under the logistic model), count (using either Poisson or negative binomial regression models), and categorical (using ordered logistic or multinomial logistic regression) covariates.
To install, load Stata, and at the command window type:
ssc install smcfcs
The latest development version can be installed into Stata from GitHub using:
net install smcfcs, from (https://raw.githubusercontent.com/jwb133/Stata-smcfcs/master/) replace
More details about the Stata package can be found in an accompanying Stata journal paper:
Bartlett JW, Morris TP. 2015. Multiple imputation of covariates by substantive-model compatible fully conditional specification. The Stata Journal; 15(2): 437-456
- Stata export and import to REALCOM
A Stata program is available for import and export to the Realcom Impute software.
The Stata program can be installed by typing the following (one line) in Stata’s command prompt:
net install realcomImpute, from (https://raw.githubusercontent.com/jwb133/StataRealcomImpute/master/) replace
Please note that we do not maintain the Realcom Impute software itself. It is developed and maintained by researchers at the University of Bristol.
Please note that additional software provided by the DIA working group is available on the DIA working group pages.
The following sections contain materials available from the Drug Information Association (DIA) working group.
Please note we have had some difficulties with some of the links - do email James.Carpenter@lshtm.ac.uk if they do not work.
- SAS code for describing and plotting withdrawal rates
These SAS macros provide basic information to characterize withdrawal/dropout in the data set, which can be a preliminary step in missing data analyses. The descriptive summary outputs provide visitwise percentages of patients with at least one post-baseline observation who have data at subsequent visits, as well as visitwise response means by treatment for patients who drop out versus those who continue. The plot outputs include a Kaplan-Meier plot of time to dropout by treatment and a plot of visitwise mean changes from baseline in the response by treatment and dropout status (dropout or continue).
Direct likelihood / Bayesian approaches
- Direct likelihood with influence and residual diagnostics
These SAS macros focus on the direct likelihood analysis approach under the MAR assumption, with influence and residual diagnostics. In many clinical studies, such as the highly controlled setting of longitudinal confirmatory trials, it is plausible to start with the MAR assumption, and missing data may be mostly MAR. Such approaches with restrictive models are often a reasonable choice for the primary analysis, since the models are simple, with few independent variables, and often include only the design factors of the experiment.
The primary analysis macro (DL_Primary1) uses SAS PROC MIXED with a REPEATED statement as the standard MMRM analysis. The within-subject covariance structure can be specified in the REPEATED statement. Visitwise treatment main effects as well as treatment differences are provided through the LSMEANS statement. A separate macro (DL_Cov1) can repeat the primary analysis using a list of user-specified within-subject covariance structures, and also provides AIC and likelihood values as a reference for model selection.
The general idea of quantifying the influence of one or more observations relies on computing parameter estimates based on all data points, removing the cases in question from the data, refitting the model, and comparing the full-data and reduced-data estimates. Three further macros implement this idea.
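The case-deletion idea can be sketched in a few lines. The example below is a simplified stand-in, not the macros' method: it uses a plain sample mean as the "model" instead of refitting a mixed model, and the function name and data are hypothetical.

```python
import statistics

def case_deletion_influence(values):
    """Return the full-data estimate and the change in the estimate
    when each case in turn is removed and the estimate recomputed."""
    full = statistics.fmean(values)
    changes = []
    for i in range(len(values)):
        reduced = values[:i] + values[i + 1:]   # data with case i removed
        changes.append(statistics.fmean(reduced) - full)
    return full, changes

responses = [1.0, 2.0, 3.0, 4.0, 20.0]          # hypothetical data
full, changes = case_deletion_influence(responses)
# Index of the case whose removal moves the estimate the most.
most_influential = max(range(len(responses)), key=lambda i: abs(changes[i]))
print(full, changes, most_influential)
```

Cases whose removal changes the estimate by more than a chosen cut-off would be flagged as influential.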
Macro DL_residual1 conducts influence diagnostics for observations with aberrant residuals. The primary direct likelihood analysis model is used to obtain studentized residuals for each observation, and users can specify a cut-off value so that observations whose residuals lie beyond it are flagged as aberrant. PROC MIXED then reruns the direct likelihood analysis with aberrant data deleted in the placebo arm only, the study drug arm only, and all arms, so that the influence of these aberrant observations on the primary analysis can be evaluated.
The last two macros, DL_Influence_Patient1 and DL_Influence_Site1, conduct influence diagnostics for clusters of observations, where the cluster is defined by patient or by investigative site. The influential patients or influential investigative sites are identified using Cook’s D provided by PROC MIXED together with cut-off values specified by users. Results from the primary analysis (including all patients/sites) and from datasets with influential patients/sites removed are printed for comparison and evaluation of influence.
- Gaussian Repeated Measures with conjugate priors fitted using proc MCMC in SAS
The basic idea is to use conjugate priors to reduce autocorrelation in the MCMC chain when using the classic MMRM model. The macro is fully compatible with the proc MCMC missing data facilities, so handles both intermittent and monotone missing data (withdrawals) correctly under MAR.
The core code uses the UDS facility in proc MCMC to implement a fully conjugate updating algorithm for the classic Gaussian Repeated Measures (RM) model often known as MMRM. The current version only supports a single shared unstructured covariance matrix. Missing data is handled by the data augmentation facility built into proc MCMC.
The main advantage of using this macro is that there is very little correlation from one MC step to the next. This makes it run faster than implementations such as the GSK5 macros that require extensive thinning of the chain.
There are two main macros. The first %build creates a parallel data set from vertical data and codes up classification variables ready for use in the main macro %mymcmc, adding missing data flags. This makes it easy to handle the macro’s input data conventions for classification variables (factors).
Facilities are built into the %mymcmc macro to:
a) directly generate imputed data based on a series of possible reference-based marginal models using the data augmentation in proc MCMC.
b) extend the model, for example to the Diggle-Kenward selection model (example below).
The macros were written by James Roger. They are freely available for general usage. Comments, questions and error reports should be addressed to email@example.com.
Currently only supports a single shared variance-covariance matrix.
List of files/folders that can be downloaded from mymcmc29190912.
1. MyMCMC_explained01.pdf. A set of slides that describe the main macro and the possible uses in more detail.
2. MyMCMC58.sas The SAS code for the macro. The header includes a detailed description. This macro version includes automatic switching for different versions of SAS.
3. Demo_Start2.sas and Demo_Start2-results.html. Example code shows how a very simple model would be fitted directly using proc MCMC and then fits the same model using the macro.
4. Demo_Priors2.sas and Demo_Priors2-results.html. This example shows how to implement informative priors for either linear predictor or covariance parameters.
5. Demo_Impute3.sas and Demo_Impute3-results.html. This example demonstrates the additional facilities for directly imputing data under J2R and CIR.
6. Demo_DK5.sas and Demo_DK5-results.html. Demonstration of extending the model to include the missing data mechanism using the Diggle Kenward model.
This was written by James Roger (firstname.lastname@example.org).
- Selection models
The selection model is one of the best-known classical statistical methods for analysing missing data under the MNAR assumption (Diggle and Kenward 1994). It is based on a factorization of the joint likelihood of the measurement process and the missingness process. A marginal density of the measurement process describes the complete data generation, while the density of the missingness process conditional on the outcomes describes the missing data “selection” based on the complete data. Therefore, similar to the shared parameter model, this is a joint modelling approach: two processes linked through the response variable. Please note that in selection models it is the response values that directly enter the model for the missingness process and/or dropout probability, in contrast to latent random effects as in shared parameter models. Users need to make their own judgement and choice according to their project details.
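The two factorizations contrasted here can be written out as follows. The notation is an assumption introduced for illustration: y_i is the complete response vector for subject i, r_i the missingness (dropout) indicators, b_i a latent random effect, and θ and ψ the parameters of the measurement and missingness models.

```latex
% Selection model: missingness depends directly on the responses
f(y_i, r_i \mid \theta, \psi) = f(y_i \mid \theta)\, f(r_i \mid y_i, \psi)

% Shared parameter model: the two processes are linked only through b_i
f(y_i, r_i \mid \theta, \psi) = \int f(y_i \mid b_i, \theta)\, f(r_i \mid b_i, \psi)\, f(b_i)\, \mathrm{d}b_i
```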
The classic selection model assumes Non-Future Dependence (NFD). That is, the dropout probability depends only on the last observed response and the current, missing response. This is a reasonable assumption and can significantly simplify the analysis. It is also common for different treatment arms to show different dropout behaviour, which can be specified in the selection model approach.
The current macro models the response using a standard repeated measure model and models the dropout using a logistic regression.
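A logistic dropout model of this kind might look like the sketch below. The parameter names psi0, psi1 and psi2 are hypothetical and do not come from the macro; the point is only that the coefficient on the current (possibly unobserved) response is what makes the mechanism MNAR.

```python
import math

def dropout_probability(prev_y, curr_y, psi0, psi1, psi2):
    """Probability of dropping out at the current visit, on the logit scale:
    logit(p) = psi0 + psi1 * previous response + psi2 * current response.
    With psi2 == 0 the dropout depends only on observed data (MAR)."""
    linear = psi0 + psi1 * prev_y + psi2 * curr_y
    return 1.0 / (1.0 + math.exp(-linear))

# Hypothetical values: dropout becomes more likely as the current response worsens.
p_mar = dropout_probability(prev_y=10.0, curr_y=14.0, psi0=-3.0, psi1=0.1, psi2=0.0)
p_mnar = dropout_probability(prev_y=10.0, curr_y=14.0, psi0=-3.0, psi1=0.1, psi2=0.1)
print(round(p_mar, 3), round(p_mnar, 3))
```

Because the current response is missing for subjects who drop out, fitting such a model requires integrating over its unobserved value, which is why the computation can be slow.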
Selection model fitting in the current macro involves integration and the code uses SAS built-in nonlinear optimization function which can be very slow. Convergence may not be achieved for extreme cases. Please always check log files.
Alternatively, selection models can be implemented via proc mcmc. The model specifications and computations may be more straightforward.
Macros can be downloaded from Selection Model_20120726.
- Shared parameter models
In contrast to most of the other macros this fits a random coefficient regression model.
- The longitudinal measurement process model follows a standard random-coefficient mixed effect model
- The dropout mechanism model uses a complementary log-log link or logit link.
These 2 models are linked by latent (unobservable) subject random effect vector Ui which is selected from one of 3 options:
- only random intercept is shared;
- both random intercept and random linear time slope are shared;
- random intercept, random linear time slope and random quadratic time slope are shared.
The underlying model is a random coefficient regression model.
Alternatively, a wide range of unstructured repeated measures regression models with shared parameters can be fitted using the MyMCMC macro (Gaussian Repeated Measures with conjugate priors fitted using proc MCMC in SAS) in the Direct likelihood / Bayesian approaches section. However, these require user coding of the dropout mechanism model.
The macros can be downloaded from Shared Parameter Model_20120726.
For explanation see the set of Powerpoint slides.
Imputation based approaches
- Imputation for Gaussian Repeated Measures with time changing covariates
A Gaussian repeated measures model with one or several unstructured covariance matrices is fitted using proc MCMC sampling directly based on conjugate priors. Any missing values for subject visits with no response are imputed and directly available in the imputed data set.
The main restriction is that every subject uses the same covariance matrix throughout their series of visits.
Data is input in vertical form just like proc MIXED.
The main application is the modelling of off-treatment data, and other situations where the actual treatment changes across visits. Subsequent analysis will usually be based on multiple imputation techniques. The tools can also be used to fit many of the models usually fitted using the GSK 5 macros.
The implementation is fast (about ten times faster than GSK 5 macros) and leads to chains with very little auto-correlation.
The following files are contained in a zip file downloaded as RMConj_19180827.
1. RMConj_Explained1.pdf. A description of the tool and the methods used.
2. RMConj32.sas. The SAS code for the macro. The header includes a detailed description and development history.
3. MIAnalze04.sas. Macro used in the examples for combining results across multiple imputations.
4. Demo1.sas is an example program file that does MAR, J2R and CIR analyses to the standard DIA example data set chapter15_example.sas7bdat. Results are in file Demo1-results.pdf.
5. RMC_FollowOn1.sas is an example program outlining the analysis including follow-up observations after treatment withdrawal. Data set fudata1.sas7bdat is an expanded version of the DIA data set based on a J2R model. Results are in the file RMC_FollowOn1-results.pdf.
- Imputation of Recurrent event data for partially observed off-treatment data
Latest update 13 February 2019
In the past, many trials have stopped collection of data following discontinuation of randomised treatment. However, more recently data collection continues after randomised treatment discontinuation, since the occurrence of this event is irrelevant to the calculation of a treatment policy estimand. When all such data are fully collected, analysis simply ignores the treatment adherence and categorizes patients by their randomisation allocation, grouping together those who complete their randomised treatment and those who do not.
Often patients who discontinue randomised treatment will leave the trial before completion. This leads to missing data all of which is in the off-treatment period. This suggests that it should be imputed using experience off treatment.
For continuous outcomes, methods are described in the section 'Stepwise imputation for marginal model based on previous residuals', where multiple imputation (MI) is used to complete the missing data under models which borrow information from experience in the off-treatment period rather than either the on-treatment period or a combination of both on and off.
The macros described here extend the MI approaches for recurrent event data as described in 'Reference based MI for negative binomial discrete data' to impute using information borrowed from the off-treatment period only. Patients potentially go through three periods: on randomized treatment, off randomized treatment and finally missing. The data are assumed to follow a log-linear model with a Negative Multinomial distribution (equivalent to a Gamma-Poisson model). Details are available in the program headers and examples supplied. A paper has been submitted for publication.
A full description of the methodology is now available in Pharmaceutical Statistics.
The Feb 2019 update does not change anything but adds a check that the model includes a CLASS variable. It also allows the creation of a file for the posterior sample. This is useful if modelling rather than imputation is your purpose.
The following files can be downloaded from NegMult20190212.
NM_Reg21.sas Main macro using Negative Multinomial computational approach
NM_Rand12.sas Main macro using Gamma-Poisson computational approach (slower)
NB_Analze4.sas Macro to fit Negative Binomial log-linear model to multiple imputed data sets and summarise using Rubin’s formula
NM_Simdata2.sas Program to simulate data based on characteristics of a real trial.
NM_Simdemo5.sas Program which uses these macros to fit a series of possible imputation models including ones that allow piecewise constant, seasonal variation and Delta.
NM_Simdemo5-results.pdf Results from the NM_Simdemo5 program.
- Multiple imputation for informatively censored time to event data – the Informative Censoring R package
The R package InformativeCensoring, available on CRAN, can be used to perform multiple imputation for a time to event outcome when it is believed censoring may be informative.
Two methods are implemented. The first, based on Jackson et al 2014, first fits a Cox model to the observed data under the usual (conditional on covariates) non-informative censoring assumption. Multiple imputed datasets can then be generated in which it is assumed that the hazard for failure following censoring changes by a user specified multiplier compared to the hazard implied by the non-informative censoring assumption.
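The effect of the hazard multiplier can be illustrated with a deliberately simplified sketch. It assumes a constant (exponential) hazard rather than the fitted Cox model the package uses, and the function name, parameter names and values are hypothetical; it is not the InformativeCensoring API.

```python
import random

def impute_event_time(censor_time, hazard, gamma, rng):
    """Draw an event time after censoring, assuming a constant hazard that is
    multiplied by gamma once the subject has been censored.
    gamma == 1 corresponds to non-informative censoring."""
    return censor_time + rng.expovariate(gamma * hazard)

rng = random.Random(2024)
# Subject censored at time 5 with an assumed hazard of 0.1 per unit time.
baseline = [impute_event_time(5.0, 0.1, 1.0, rng) for _ in range(20000)]
doubled = [impute_event_time(5.0, 0.1, 2.0, rng) for _ in range(20000)]
mean_baseline = sum(baseline) / len(baseline)
mean_doubled = sum(doubled) / len(doubled)
print(round(mean_baseline, 2), round(mean_doubled, 2))
```

Doubling the post-censoring hazard halves the expected residual time to failure, so imputed failures occur sooner after censoring.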
The second, based on Hsu and Taylor, performs Kaplan-Meier type imputation of censored time, in which the Kaplan-Meier estimate is calculated for an individual based on data from those individuals who are closest in terms of predicted hazard of failure and predicted hazard of censoring. This matching can also be performed using time-dependent covariates. The approach thus assumes that censoring is non-informative conditional on the covariates used to predict the hazard of failure/censoring.
Corresponding vignettes are provided describing how the two methods are implemented and can be used.
- Multiple imputation for time to event data under Kaplan-Meier, Cox or piecewise-exponential frameworks – SAS macros
Multiple imputation (MI) and analysis of imputed time-to-event data is implemented in a collection of SAS macros based on the methodology described in the following publications:
- Lipkovich I, Ratitch B, O’Kelly M (2016) Sensitivity to censored-at-random assumption in the analysis of time-to-event endpoints. Pharmaceutical Statistics 15(3):216-229
- Moscovici JL, Ratitch B (2017) Combining Survival Analysis Results after Multiple Imputation of Censored Event Times. PharmaSUG-2017 (available on-line https://www.pharmasug.org/proceedings/2017/SP/PharmaSUG-2017-SP05.pdf)
Briefly, the methods estimate multiple imputations via draws from the Bayesian posterior distribution of parameters of a model (piecewise exponential); or via bootstrapped versions of the input data with a standard inverse method translating estimated probability into time to event (Cox and Kaplan-Meier). Hazards can be subjected to increase/decrease via user-specified amount delta; reference-based imputed survival can be implemented by estimating the imputation model based on a user-specified subset of observed subjects (piecewise exponential and Kaplan-Meier); or by user specification of the treatment group parameter to be used when calculating the imputed time to event (Cox).
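The inverse method with a delta adjustment can be sketched for the piecewise exponential case. This is an illustrative sketch only, not the macros' code: it takes the interval rates as known rather than drawing them from a posterior, assumes all rates are positive, and applies delta as a multiplier on the hazard after censoring. All names are hypothetical.

```python
import math
import random

def cumulative_hazard(t, cuts, rates):
    """Cumulative hazard at t; rates[k] applies from cuts[k] to cuts[k + 1]
    (the last rate applies indefinitely), with cuts[0] == 0."""
    total = 0.0
    for k, rate in enumerate(rates):
        start = cuts[k]
        end = cuts[k + 1] if k + 1 < len(cuts) else math.inf
        if t <= start:
            break
        total += rate * (min(t, end) - start)
    return total

def invert_cumulative_hazard(target, cuts, rates):
    """Return the time at which the cumulative hazard reaches target."""
    total = 0.0
    for k, rate in enumerate(rates):
        start = cuts[k]
        end = cuts[k + 1] if k + 1 < len(cuts) else math.inf
        piece = rate * (end - start)
        if total + piece >= target:
            return start + (target - total) / rate
        total += piece
    raise ValueError("target cumulative hazard not reached")

def impute_time(censor_time, cuts, rates, delta, rng):
    """Inverse-transform draw of an event time beyond censor_time, with the
    post-censoring hazard multiplied by delta."""
    u = 1.0 - rng.random()                      # uniform on (0, 1]
    target = cumulative_hazard(censor_time, cuts, rates) - math.log(u) / delta
    return invert_cumulative_hazard(target, cuts, rates)

cuts = [0.0, 1.0, 3.0]                          # hypothetical interval starts
rates = [0.5, 0.2, 0.1]                         # hypothetical hazards
rng = random.Random(7)
imputed = [impute_time(2.0, cuts, rates, delta=1.5, rng=rng) for _ in range(5)]
print(imputed)
```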
The macros can be downloaded here: Package_Release_V3 final.
- Pattern Mixture
The macro imputes under a series of different methods, analyses data using Repeated measures (not ANOVA at a single visit) and provides summary estimates for least-squares means.
The macro uses either the complete case missing value (CCMV) restriction or the neighbouring case missing value (NCMV) restriction [subset cases],
and either non-future dependence (NFMV) or all future data (ALL) when estimating parameters for each imputation step [subset visits].
See Thijs, Molenberghs et al (2002), Strategies to fit pattern-mixture models, Biostatistics 3(2), 234-265.
Similar models can be fitted using the MNAR statement in proc MI (SAS version 9.4 or later). Neither includes ACMV.
The files can be downloaded at Pattern Mixture Model_20120726.
Includes a useful set of Powerpoint slides.
- Plot and compare up to 2 control-based approaches
This macro calls other macros downloadable from the Imputation-based approaches section of this web site using a common interface, and plots the results of up to two analyses.
The package can be downloaded from Macro interface and plotting package 20160216.
Several example output files are available in png format.
- Reference-based MI for Negative Binomial discrete data – R package dejaVu
The R package dejaVu, now available on CRAN, implements controlled multiple imputation for count data, as proposed by Keene, Oliver N., et al. “Missing data sensitivity analysis for recurrent event data using controlled imputation.” Pharmaceutical Statistics 13:4 (2014): 258-264.
When used to analyse an existing partially observed dataset, the package first fits a negative binomial regression model to the observed data, assuming MAR. Multiple imputations of the counts in the periods after subjects drop out are then generated, under a user-chosen assumption. Options include MAR, and the jump to reference and copy reference MNAR assumptions. Users can also write and use their own imputation mechanisms with the package.
The package was developed by the Scientific Computing and Statistical Innovation groups of the Advanced Analytics Centre at AstraZeneca.
A SAS implementation of the same methods, developed by James Roger, is available in the 'Reference-based MI for Negative Binomial discrete data – SAS macros' section.
- Reference-based MI for Negative Binomial discrete data – SAS macros
Updated 22 January 2020 and update corrected 6 February 2020
Statistical analyses of recurrent event data have typically been based on the missing at random assumption (MAR) along with constant event rate. These treat the number of events as having a Negative Binomial distribution with an offset term which is the log of the length of time observed. One implication of this is that, if data are collected only when patients are on their randomized treatment, the resulting de jure estimator of treatment effect corresponds to the situation in which the patients adhere to this regime throughout the study. For confirmatory analysis of clinical trials, analyses are required that investigate alternative de facto estimands that depart from this assumption.
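The standard analysis described here can be written as a log-linear model with an offset. The notation is an assumption for illustration: Y_i is the event count for patient i, t_i the length of time observed, x_i the covariate vector (including treatment), and k the Negative Binomial dispersion parameter.

```latex
\log \mathrm{E}[Y_i] = \log t_i + x_i^{\top}\beta,
\qquad
\mathrm{Var}(Y_i) = \mu_i + k\,\mu_i^{2},
\quad \mu_i = \mathrm{E}[Y_i]
```

Because the offset log t_i has its coefficient fixed at 1, exp(x_i'β) is the event rate per unit time, which is assumed constant over the follow-up period.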
The macro described in this section parallels those available elsewhere in this section for continuous data but is based on the assumption of a Gamma-Poisson process underlying the classic Negative Binomial analysis. A detailed description of the methodology is presented in Keene, O.N., Roger, J.H., Hartley, B.F., and Kenward, M.G. (2014). Missing data sensitivity analysis for recurrent event data using controlled imputation. Pharmaceutical Statistics, 13, 4, 258-264. This macro implements the methods described there.
As implemented here the approach assumes that the event rate is constant across time. What it does do is allow for frailty in the event rate across patients. To move away from the assumption of constant event rate one might:
a) Extend this approach by breaking up the time period into a series of periods with separate constant rates.
b) Follow a more general time-to-event approach as described by Akacha et al (2015).
List of files/folders that can be downloaded from NegBI_PMI_20200206.
1. NegBin_PMI29.SAS: The file containing the macro itself. See the header of this file for details of usage.
2. Demo_NegBinMI_1.SAS: A SAS program file running an example as described by Roger & Akacha at the PSI 2014 conference.
3. Demo_NegBinMI_1.MHT: The output from running this demonstration file.
4. Demo_NegBinMI_1.LOG: The log file from running this demonstration file.
5. PSI2014_6A_Roger_Akacha.pdf: The slides from the PSI meeting which describe the example and outline the two approaches and their connection.
This page was written by James Roger (email@example.com).
- Reference-based MI via Multivariate Normal RM (the “five macros” and MIWithD)
The “five macros” fit a Bayesian Normal RM model and then impute post withdrawal data under a series of possible post-withdrawal profiles including J2R, CIR and CR as described by Carpenter et al [Carpenter, J. R., Roger, J., and Kenward, M.G. Analysis of longitudinal trials with missing data: A framework for relevant, accessible assumptions, and inference via multiple imputation. J Biopharm Stat (2012).]. It then analyses the data using a univariate ANOVA at each visit and summarizes across imputations using Rubin’s rules.
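For readers unfamiliar with Rubin's rules, the combination step can be sketched as follows. This is a generic illustration of the standard formulae, not code from the macros; the function name and the example numbers are hypothetical.

```python
import math
import statistics

def rubins_rules(estimates, variances):
    """Combine m per-imputation estimates and their squared standard errors.
    Returns the pooled estimate, its standard error and the degrees of freedom."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)          # pooled point estimate
    within = statistics.fmean(variances)         # average within-imputation variance
    between = statistics.variance(estimates)     # between-imputation variance
    total = within + (1.0 + 1.0 / m) * between
    if between > 0:
        df = (m - 1) * (1.0 + within / ((1.0 + 1.0 / m) * between)) ** 2
    else:
        df = math.inf                            # no between-imputation variation
    return q_bar, math.sqrt(total), df

# Hypothetical treatment differences from three imputed data sets.
pooled, se, df = rubins_rules([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(pooled, round(se, 4), round(df, 4))
```

The between-imputation term is what carries the extra uncertainty due to the missing data into the final standard error.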
An earlier macro implementing a similar marginal approach, called MIWITHD (current version is 34), is also available here. It uses the MCMC statement in proc MI rather than proc MCMC, which limits its functionality compared to the five macros. It is available here for historical reasons, and the “5 macros” will usually be preferred.
The Part1A macro is rather slow as it is compatible with early versions of proc MCMC and does not take advantage of more recent facilities. However, the sample from the posterior is automatically stored and can be used repeatedly in Part2A onward.
The MyMCMC macro (Gaussian Repeated Measures with conjugate priors fitted using proc MCMC in SAS) in the Direct likelihood / Bayesian approaches section imputes using MAR, J2R, CIR, ALMCF or AFCMCF much more efficiently and faster, but is restricted to a single variance-covariance matrix.
The current zip file containing the “five macros” package as tested on SAS 9.4 can be downloaded from Five_Macros20171010.
This version no longer requires SAS/GRAPH for the Plotter macro. The previous version (tested on an earlier version of SAS) can be downloaded from Five_Macros20161111.
The folder TheFive macros contains the macros Part1A, Part1B, Part2A, Part2B and Part3 along with a plotting macro.
The folders DIA_Run and PI_Demo_GSK contain examples based on two publicly available data sets (supplied).
The method based on causal inference by (White et al. to be published) has been integrated into Part2A and is now available automatically.
Also included are a poster from PSI 2012 annual conference that describes the method and an extensive documentation file.
The zip file containing the MIWithD macro can be downloaded from deliver_mar2014.
The program file contains the only documentation. Note that methods have slightly different names here so that J2R was called J2C, CIR was called CDC while CR was called CC.
A final draft version of the original Carpenter et al paper is also included in the package.
- SAS macro for imputation under generalized linear mixed model
Multiple Imputation requires the modelling of incomplete data under formal assumptions about the combined model for observed and unobserved data (the imputation model).
Generalized Linear Mixed Models provide a natural framework for modelling repeated observations, especially for non-Gaussian outcomes. The new BGLIMM procedure in SAS/Stat 15.1 fits a wide range of such models, allowing for missing data under the MAR assumption. The posterior output data set includes the Bayesian sampled values for the missing values. These are exactly what is needed to complete the imputed data sets. The SAS macro BGI automatically merges the input data set with the multiple posterior sampled values to generate a single MI data set indexed by the variable _IMPUTATION_. This can then be analyzed once per imputation by the user and a summary built using Rubin’s rules.
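The merging step can be pictured with a small sketch. It only illustrates the idea of stacking completed copies of the data; it does not reflect how the BGI macro reads the BGLIMM output, and apart from _IMPUTATION_ all field names are hypothetical.

```python
def build_mi_dataset(records, posterior_draws):
    """Stack one completed copy of the data per posterior draw.
    records: list of dicts whose 'y' is None when missing.
    posterior_draws: one list per imputation, holding sampled values for
    the missing entries in record order."""
    stacked = []
    for m, draws in enumerate(posterior_draws, start=1):
        remaining = iter(draws)
        for rec in records:
            row = dict(rec)                      # copy so the input is untouched
            if row["y"] is None:
                row["y"] = next(remaining)       # fill in the sampled value
            row["_IMPUTATION_"] = m
            stacked.append(row)
    return stacked

records = [{"id": 1, "y": 2.0}, {"id": 2, "y": None}]
stacked = build_mi_dataset(records, [[3.0], [4.0]])
print(stacked)
```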
The design vector required for each imputed value is specified in the call to proc BGLIMM just as if it were observed. This allows complex models such as treatment switching.
The macro requires access to SAS/Stat 15.1 or later.
The macro can be downloaded from BGI20190226.
- Sequential imputation with tipping point and delta adjustment
Control-based Imputation: CBI_PMM imputes data at each visit in a separate call to the MI procedure in SAS. Initially the data set has only reference (placebo) subjects. When it reaches a visit where one or more subjects have withdrawn those subjects are added to the data set. Active subjects are imputed as if they were on placebo when they were in fact on active up to withdrawal. Their response data is ignored apart from appearing as covariates in the sequential regressions after withdrawal. As such treatment never appears in the imputation model.
Delta adjusted MAR: Delta_PMM imputes with delta adjustment on top of an MAR model. It is carried out using a conditional delta algorithm, where delta is embedded within a stepwise regression approach where regression is on previous absolute values (observed or imputed).
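A conditional delta algorithm of this general kind can be sketched as below. It is a simplification, not the Delta_PMM code: one set of regression coefficients is reused for every visit and no parameter uncertainty is drawn, whereas in practice each visit has its own regression fitted through proc MI. All names and numbers are hypothetical.

```python
import random

def impute_with_delta(observed, n_visits, intercept, slope, resid_sd, delta, rng):
    """Impute visits after dropout one at a time by regressing on the previous
    absolute value (observed or imputed). Delta is added to each imputed value
    before that value is used as the predictor for the next visit."""
    values = list(observed)
    while len(values) < n_visits:
        mar_draw = intercept + slope * values[-1] + rng.gauss(0.0, resid_sd)
        values.append(mar_draw + delta)
    return values

rng = random.Random(0)
# With resid_sd = 0 the path is deterministic, which makes the carry-over visible.
path = impute_with_delta([10.0], 3, intercept=1.0, slope=0.5,
                         resid_sd=0.0, delta=-2.0, rng=rng)
print(path)
```

Because each adjusted value feeds the next regression, the shift accumulates across visits rather than being a fixed offset from the MAR prediction.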
Tipping Point analysis: Delta_and_Tip carries out a series of delta adjusted analyses as above to provide a tipping point analysis.
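The search itself is simple once each delta-adjusted analysis returns a p-value. In the minimal sketch below, p_value_for_delta is a hypothetical stand-in for running the full impute-analyse-combine cycle at a given delta; the table of p-values is invented for illustration.

```python
def find_tipping_point(deltas, p_value_for_delta, alpha=0.05):
    """Return the first delta (in the order given) at which the treatment
    effect is no longer significant at level alpha, or None if none is."""
    for delta in deltas:
        if p_value_for_delta(delta) > alpha:
            return delta
    return None

# Hypothetical p-values from a series of delta-adjusted analyses.
p_table = {0: 0.01, 1: 0.03, 2: 0.049, 3: 0.07}
tipping = find_tipping_point([0, 1, 2, 3], lambda d: p_table[d])
print(tipping)
```

Whether the tipping delta is clinically plausible is then a matter of judgement.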
Analysis of each imputed data set either uses RM or univariate ANCOVA and the macros summarize these using Rubin’s rules.
The following macros can be downloaded from PMM Delta Tipping Point and CBI_20150602:
- cbi_pmm2.sas [Control-based imputation]
- delta_and_tip_2.sas [Delta and tipping point]
- delta_pmm2.sas [Delta for MAR]
The zip file also includes documentation files (docx) for each of the first three and demonstration SAS program files for the first two. It also includes the PharmaSUG2011 paper Implementation of Pattern-Mixture Models Using Standard SAS/STAT Procedures by Ratitch & O’Kelly which explains the methods in detail.
- Stepwise imputation for marginal model based on previous residuals
Updated 16 June 2018
There is a mistake in the handling of seeds in version 11 of the MISTEP macro when using the BY= facility. This is not a problem in version 9 and earlier versions. When treatment is used as the BY= variable this can lead to an SED for the treatment differences that is too small, and as a result to inflated significance. This erroneous version was in the file MISTEP20180327.
The corrected version MISTEP12.SAS is available in the download below.
The macro MIStep duplicates many of the facilities in MONOTONE REG statement in proc MI, but adds the facility to regress on previous residuals rather than previous absolute values. This allows it to fit marginal methods such as J2R and CIR, and Causal (White et al) using an efficient stepwise algorithm.
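The difference between regressing on previous absolute values and on previous residuals can be seen in a two-visit sketch. This is an illustration of the idea only, not the MIStep algorithm (which handles several visits and draws parameters from their posterior); the function names and numbers are hypothetical.

```python
def mar_conditional_mean(y_prev, own_mean_prev, own_mean_next, slope):
    """MAR: the next-visit mean follows the subject's own arm."""
    return own_mean_next + slope * (y_prev - own_mean_prev)

def j2r_conditional_mean(y_prev, own_mean_prev, ref_mean_next, slope):
    """Jump to Reference: the residual is taken from the subject's own arm
    mean at the observed visit, but the mean it is added to is that of the
    reference arm at the next visit."""
    residual = y_prev - own_mean_prev
    return ref_mean_next + slope * residual

# Active-arm subject observed at 12 when the active mean at that visit is 10.
mar = mar_conditional_mean(12.0, own_mean_prev=10.0, own_mean_next=8.0, slope=0.5)
j2r = j2r_conditional_mean(12.0, own_mean_prev=10.0, ref_mean_next=5.0, slope=0.5)
print(mar, j2r)
```

Working with residuals keeps the subject's deviation from their own arm while letting the marginal mean switch to the reference profile, which a regression on absolute values cannot express directly.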
The macro also allows imputation under models which include treatment switching, so it can be used for imputation based on post-withdrawal experience.
There is a wrapper macro MIStepWrap which allows one to make repeated calls to the MISTEP macro from a single macro call, which makes routine use much simpler.
Includes an MIAnalyze macro that fits a univariate ANOVA model and summarises least-squares means and their differences using Rubin’s rules. Multiple calls are appended to form a single dataset for comparing methods.
Requires parallel data format, as provided by the BUILD macro supplied with the MyMCMC macro.
Requires any intermediate missing data to be fixed beforehand.
The following files are contained in a zip file downloaded as MISTEP20180614.
1. MIStep_explained02.pdf. An introduction to the theory behind using previous residuals, rather than previous absolute values, to generate marginal reference-based imputations such as Jump to Reference (J2R). This also explains possible uses for the macro and results of the example program below.
2. MIStep11.sas. The SAS code for the macro. The header includes a detailed description and development history. Main update is BY= now allows different numbers of factor levels in different BY groups. This is useful for follow-on models where data can be sparse.
3. MIAnalyze04.sas. A SAS macro that is used in the examples to run an ANCOVA analysis and summarise across imputed data sets using Rubin’s rules.
4. MIStep_demo10.sas and mistep_demo10-results.html. Example of using the macros on the DIA working group example data set. This includes standard MAR, J2R, CIR, OLCMCF, Causal with K0=0.5 and also with K1=0.5 and separate correlation (shared variance) as examples.
5. GSKTest5.sas and GSKTest5_1-results.html. Program code and output for the same examples using the GSK 5 macros showing similar values based on 1000 imputations in both cases.
6. MISTEPWrap01.sas. The SAS code for wrapper macro. Header includes details.
7. MIStepWrap_demo10.sas and MIStepWrap_demo10-results.pdf. The same examples as MIStep_demo9 which can be run more easily using the wrapper. Results vary due to different seeds.
8. PSI_Wrap2.sas and PSI_Wrap2-results. Example of using the wrapper to run imputation models based on post treatment withdrawal experience.
This page was written by James Roger (firstname.lastname@example.org).
- Vansteelandt et al’s 2012 doubly robust method
The Doubly Robust zip file contains SAS macros implementing the doubly robust approach described in:
Vansteelandt S, Carpenter J, Kenward M (2012), Analysis of incomplete data using inverse probability weighting and doubly robust estimators, Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 6, 37-48.
Doubly Robust estimation implements the missing-at-random assumption by using inverse probability-of-being-observed weighting, but augmented with a model-expected value for the missing outcome. The ingenious weighted combination of these two elements has the property that the estimate of, say, treatment effect will be consistent even if one of the models – a) the model for the probability of missingness or b) the model for the value of the missing outcome – is wrong. However, the consistency property does not hold if both a) and b) are wrong. The analysis model that estimates e.g., the treatment effect, is assumed to be correct.
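To make the combination of the two elements concrete, here is a minimal Python sketch of one common form of the augmented inverse probability weighted estimator, applied to a simple mean rather than a treatment effect. It assumes the fitted probabilities of being observed and the model-expected outcomes have already been obtained from models a) and b); the function and argument names are hypothetical, and this is not necessarily the exact estimator implemented in the SAS macros.

```python
def aipw_mean(y, observed, prob_observed, predicted):
    """Augmented inverse-probability-weighted estimate of the outcome mean.

    y              outcomes (any placeholder, e.g. None, where missing)
    observed       1 if the outcome was observed, otherwise 0
    prob_observed  fitted probability of being observed, from model a)
    predicted      model-expected outcome for every subject, from model b)
    """
    total = 0.0
    for y_i, r_i, p_i, m_i in zip(y, observed, prob_observed, predicted):
        # Start from the outcome-model prediction ...
        term = m_i
        # ... and, where y is observed, add the inverse-probability-weighted
        # residual to correct for errors in the outcome model.
        if r_i:
            term += (y_i - m_i) / p_i
        total += term
    return total / len(y)


if __name__ == "__main__":
    # Made-up numbers: two observed and two missing outcomes.
    print(aipw_mean([2.0, 4.0, None, None], [1, 1, 0, 0],
                    [0.5, 0.5, 0.5, 0.5], [2.5, 3.5, 6.0, 8.0]))
```

If the predictions are correct on average, the weighted residuals average to zero whatever weights are used; if instead the weights are correct, the weighted residuals remove the error in the predictions. This is the sense in which only one of the two models needs to be right.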
There is also an accompanying users guidebook explaining the macros and their use.
The macros were written by Belinda Hernández of Quintiles, with review by Michael O’Kelly.
Presentations, Manuscripts and Training Materials
- A Practical Guide to Preventing and Treating Missing Data (2013)
Material presented by Craig Mallinckrodt and Russ Wolfinger in 2013 at the FDA workshop on missing data.
The pdf file is available in zip format at training_2013FDA_Workshop.
- Example Language for Statistical Analysis Plans and Protocols Describing Some Frequently-Used Methods for Handling Missing Data
- When and how to use reference based imputation for missing data (2013)
Slides presented by Michael O’Kelly & James Roger
The slides can be downloaded at training_ref_imputation.
- Implementing Estimands in Trials: Detailed Clinical Objectives – James Bell, 3rd June 2019
Powerpoint slides from presentation by James Bell ‘Implementing Estimands in Trials: Detailed Clinical Objectives’, 3rd June 2019, PSI conference.
Example data sets
- Example data set from an antidepressant clinical trial
This data set is one of only a few publicly available data sets that can be used to demonstrate methods for handling missing data where a continuous outcome is measured repeatedly.
Original data are from an antidepressant clinical trial with four treatments; two doses of an experimental medication, a positive control, and placebo.
Hamilton 17-item rating scale for depression (HAMD17) was observed at baseline and weeks 1, 2, 4, 6, and 8.
To mask the real data, Week-8 observations were removed.
Two arms were created; the original placebo arm and a “drug arm” created by randomly selecting patients from the three non-placebo arms.
[Original data: Goldstein, Lu, Detke, Wiltse, Mallinckrodt, Demitrack. Duloxetine in the treatment of depression: a double-blind placebo-controlled comparison with paroxetine. J Clin Psychopharmacol 2004;24:389-399.]
The data set can be downloaded from MBSW2011_example.
Name of the SAS data set is “Chapter15 example.sas7bdat”.
- Example datasets with low and high dropout
These two data sets are made publicly available so that they can be used to demonstrate methods for handling missing data where a continuous outcome is measured repeatedly. The purpose is to contrast similar data with a low dropout rate and with a high dropout rate. For full details and further explanation see Mallinckrodt et al (1), written by the DIA working party on Missing data.
The data sets are somewhat contrived to avoid implications for marketed drugs. Nevertheless, the key features of the original data were preserved. The original data were from 2 nearly identically designed antidepressant clinical trials that were originally reported by Goldstein et al (2) and Detke et al (3). Each trial had 4 treatment arms with approximately 90 patients each that included 2 doses of an experimental medication (subsequently granted marketing authorizations in most major jurisdictions), an approved medication, and placebo. Assessments on the Hamilton 17-item rating scale for depression (HAMD17) were taken at baseline and weeks 1, 2, 4, 6, and 8 in each trial. All patients from the original placebo arm were included along with a contrived drug arm that was created by randomly selecting patients from the nonplacebo arms. Random selection continued until 100 drug-treated patients were selected. In addition to all the original placebo-treated patients, additional placebo-treated patients were randomly re-selected so that there were also 100 patients in the contrived placebo arms. For these re-selected placebo-treated patients, a new patient identification number was assigned and outcomes were adjusted to create new observations by adding a randomly generated value to each patient’s observations.
These trials are referred to as the low and high dropout data sets. In the high dropout data set, completion rates were 70% for drug and 60% for placebo (Table 2). In the low dropout data set, completion rates were 92% in both the drug and placebo arms. The dropout rates in the contrived data sets closely mirrored those in the corresponding original studies. The design difference that may explain the difference in dropout rates between these two otherwise similar trials was that the low dropout data set came from a study conducted in Eastern Europe that included a 6-month extension treatment period after the 8-week acute treatment phase and used titration dosing. The high dropout data set came from a study conducted in the US that did not have the extension treatment period and used fixed dosing.
1. Mallinckrodt C, Roger J, Chuang-Stein C, Molenberghs G, O’Kelly M, Ratitch B, Janssens M, Bunouf P. Recent Developments in the Prevention and Treatment of Missing Data. Therapeutic Innovation & Regulatory Science 2014; 48: 68.
2. Goldstein DJ, Lu Y, Detke MJ, Wiltse C, Mallinckrodt C, Demitrack MA. Duloxetine in the treatment of depression: a double-blind placebo-controlled comparison with paroxetine. J Clin Psychopharmacol. 2004;24:389-399.
3. Detke MJ, Wiltse CG, Mallinckrodt CH, McNamara RK, Demitrack MA, Bitter I. Duloxetine in the acute and long-term treatment of major depressive disorder: a placebo- and paroxetine-controlled trial. Eur Neuropsychopharmacol. 2004;14(6):457-470.
The datasets can be downloaded from high_low_datasets.
This page was written by James Roger (email@example.com).
In the presence of a multilevel structure in a dataset, for example children (level 1) clustered in schools (level 2), it is important to use sound methods to handle this structure. If our dataset is partially observed, we need to use multilevel methods not only for our primary analysis but also to deal with the missing data. This is because we need to ensure compatibility of the imputation and analysis models. Imagine we wanted to perform Multiple Imputation (MI) on a multilevel dataset. We answer the following questions:
- What are the possible MI methods?
First, there are three simple ad-hoc methods which, as we will see, all suffer from important issues.
- Single-level imputation: impute as single level data, ignoring clustering
- Fixed-effect imputation: Impute with cluster indicator as fixed effect
- Stratified imputation: Impute separately within each cluster.
Any specific imputation strategy might be used with each of these methods, for example either of the two most commonly used parametric imputation methods, i.e. Fully Conditional Specification (FCS) or Joint Modelling (JM).
There are then methods that specifically use multilevel imputation models:
- Multivariate normal JM imputation: the imputation model is a joint multivariate normal model for all partially observed variables, with random effects for all outcomes of the imputation model. A full MCMC sampler, e.g. Gibbs, is used to generate the imputations;
- Latent normal JM imputation: as above, but binary/categorical variables are handled through latent normal variables. Again, this makes use of MCMC sampling, possibly with Metropolis-Hastings steps;
- Heteroscedastic latent normal JM imputation: as above, but allowing for heteroscedasticity, to better reflect heterogeneity across clusters.
- One-stage FCS imputation: this uses FCS, so it defines multiple univariate models for partially observed variables given all other variables rather than a single joint model for all variables. It mimics one-stage meta-analyses, in that it fits each univariate model on the whole dataset;
- Two-stage FCS imputation: same as before, but mimicking two-stage meta-analyses, so that models are fitted first within clusters, and then meta-analysed. This allows for heteroscedasticity;
- Substantive-Model-Compatible MI: this method imputes compatibly with a specific analysis model, which needs to be known before imputation is performed. It allows for imputation compatible with interactions and non-linearities. There are different SMC-MI methods depending on the way the covariates are imputed, based on FCS, JM or sequential imputation. This can also be used within a fully Bayesian framework.
- How do we choose between them?
The answer to this question depends on the target multilevel analysis model we aim to use:
Case #1: Random intercept analysis model.
It is a bad idea to use simple single level MI, ignoring clustering.
Imputing separately by cluster, or with cluster as a fixed effect, is generally fine. A disadvantage of stratified imputation is that it loses efficiency. Neither method can be used with level 1 systematically missing data (missing for a whole cluster) or with level 2 missing data (data related to the clustering level, e.g. school).
Any multilevel MI method is OK in these settings.
Case #2: Random intercept and slope model with fully observed random slope variable.
If there is a random slope in the analysis model, it needs to be included in the imputation model as well. We can do this without any issue if the random slope variable is fully observed.
Case #3: Random intercept and slope model with missing data in random slope variable.
If the random slope variable is partially observed, we need to include it as an outcome in the imputation model. A simple homoscedastic model is not sufficient. Heteroscedastic imputation models are better. These have been implemented within both the FCS and the Joint Modelling frameworks.
However, in theory only substantive model compatible imputation (SMC-MI) can handle missing data compatibly with the analysis model in this situation, and hence it is the only method expected to lead to no bias. The only disadvantage is that we need to know the exact formulation of the analysis model before we impute missing data.
Case #4: Random intercept and slope model with interactions/non-linearities.
Again, SMC-MI is the only method that is expected to lead to unbiased estimates, as long as we know the exact formulation of the substantive model in advance of the imputation.
- What software can we use?
Several software packages are available for multilevel MI. A (possibly not exhaustive) list includes:
- Pan: a package for imputation of clustered data using the multivariate normal model. Does not allow for the imputation of categorical data and only allows for the inclusion of fully observed level 2 variables;
- Jomo: we created and actively maintain this R package for multilevel joint modelling imputation, based on the latent normal algorithm. The package allows the imputation of a mix of continuous and categorical variables at two levels.
It is possible to allow for heteroscedasticity in the imputation model, and also to perform substantive-model-compatible imputation.
This tutorial paper in the R Journal explains how to use the package. A paper on the use of the SMC-MI functions will follow.
- Mitml: This package provides different interfaces to both pan and jomo, and lots of tools to analyse imputed data.
- Mice: This is perhaps the most popular MI package. It uses the FCS method and includes a few functions for multilevel imputation, using the one-stage approach;
- Micemd: This package implements the two-stage FCS approach to imputation and uses a heteroscedastic imputation model. It allows for the imputation of level 1 data only, either binary or continuous;
- Mdmb: This package allows for imputation under the SMC-MI framework, in particular using sequential imputation.
- jointAI: though technically not an MI package, as it is intended to be used to perform fully Bayesian analyses, this package fits virtually the same MCMC that could be fitted to impute missing data in an SMC-MI algorithm.
- REALCOM: This standalone package performs MI with the same latent normal algorithm as jomo, although it does not allow for heteroscedastic imputation models nor substantive model compatible imputation;
- BLIMP: This is a very flexible package that can be used to perform either standard FCS multilevel imputation or SMC-MI. It also allows for fully Bayesian analyses.
- What work have we done in our group?
In our research group, we have done a lot of work in the development of multilevel MI methods. As mentioned, first we released the jomo R package, which has now been extensively used in a number of research papers and has been downloaded more than a million times since 2015. It implements Joint Modelling imputation via the latent normal model, and it allows for the imputation of a mix of binary/categorical and continuous data at two levels of hierarchy. It also allows for imputation via a heteroscedastic model and for substantive model compatible imputation. At the moment, models supported for SMC-MI are lm, lmer, glm, glmer, coxph, clmm and polr. We recently published a tutorial paper in the R Journal, where we explain how to use the standard imputation functions. A second paper on the use of SMC-MI functions will follow.
The original motivation for developing our software came from research on how to best handle missing data in Individual Patient Data Meta-Analyses (IPD-MA). In (Quartagno and Carpenter, 2016), we showed how our multilevel imputation strategy works as well as stratified imputation for a range of scenarios, and better in others (e.g. when there are systematically missing variables). We found that, when the assumption about the distribution of the covariance matrix of the imputation model was not met, this led to a small loss of efficiency, but no clear bias was identified.
In (Quartagno and Carpenter, 2019), we showed how using the latent normal model can lead to good results when some of the partially observed variables are binary/categorical. Furthermore, in (Quartagno, Carpenter and Goldstein, 2019) we found that multilevel MI can be used to handle missing data in weighted survey analyses as well.
We have participated in two international comparisons of methods for handling missing multilevel data. In particular, in (Audigier et al, 2018) our heteroscedastic approach was compared to a number of competitors, giving good results, particularly with binary variables and large number of clusters. More recently, (Huque et al 2020) compared the SMC-MI functions in jomo to other methods, again finding good results, even for longitudinal data analyses, i.e. with small cluster sizes.
Finally, in (Quartagno and Carpenter, 2018) we introduced the SMC-MI method for Joint Modelling imputation, and we plan to submit a wider discussion of this method to an international journal soon.
Any method of statistical analysis will make untestable assumptions about the distribution of missing data. If the wrong assumptions are made, then misleading conclusions may be drawn. Unfortunately, this is a critical problem that cannot be circumvented when missing data occur.
When there are missing data, it is therefore important that primary analysis be conducted under the most plausible assumptions for the missing data. Sensitivity analysis, which addresses the same question as the primary analysis, but under a range of different credible assumptions for the missing data should then be undertaken. This will reveal how robust the results are to different missing data assumptions.
What methods can be used for sensitivity analysis?
There are three broad classes of missing data assumptions introduced by Rubin (1976): Missing-completely-at-random (MCAR), Missing-at-random (MAR) and Missing-not-at-random (MNAR). Methods for analysis under MCAR and MAR are well developed. For example, any complete case analysis will provide valid inference under MCAR, and any likelihood-based method or multiple imputation analysis will provide valid inference under MAR (see Molenberghs et al, 2014 and Carpenter and Kenward, 2007 for further guidance on statistical analysis under MCAR/MAR). Procedures for sensitivity analysis therefore typically focus on approaches to MNAR analysis, where the response data and missingness mechanism must be jointly modelled.
There are two principal ways this can be done. First, one can specify a model for the missingness status given the response data, together with a marginal model for the response data. For example, a logistic regression model could be used to model the probability of the response being missing, with a parameter that governs how this depends on the unobserved outcome, fitted alongside a model for the response data. This is referred to as the selection model.
Alternatively, one can specify the conditional distributions of the response data given the observed data for each missing data pattern, together with a marginal model for the missingness process. For example, one could specify a multivariate normal model for the unobserved data whose mean is higher by a certain proportion than that of the observed data. This factorization is the pattern-mixture model.
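In symbols, the two approaches are two factorizations of the same joint distribution. The notation here is an assumption made for illustration: y denotes the response data (observed and unobserved) and r the missingness indicators, with any covariates suppressed.

```latex
f(y, r)
  = \underbrace{f(r \mid y)\, f(y)}_{\text{selection model}}
  = \underbrace{f(y \mid r)\, f(r)}_{\text{pattern-mixture model}}
```

The selection model places the MNAR assumption in f(r | y), the dependence of missingness on the unobserved part of y. The pattern-mixture model places it in f(y | r), the way the response distribution differs between missing data patterns.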
Selection or pattern-mixture models can be fitted using maximum likelihood, or within a Bayesian framework. There will be numerous ways in which a pattern mixture model or selection model can be fully specified; however, many specifications will be practically implausible. So where should you start?
A commonly advocated, principled way to perform MNAR sensitivity analysis is to explore departures from the joint data distribution implied by MAR. For example, in the pattern-mixture framework, starting with specification of the conditional data distribution implied by MAR, one can readily perform sensitivity analysis exploring departures from MAR by shifting the parameters of the distribution, for example by specifying a higher or lower expected outcome value for unobserved data. After specifying separate response models for each pattern, inference can be obtained using maximum likelihood, or within a Bayesian framework.
Alternatively, Multiple Imputation provides an accessible solution for conducting sensitivity analysis within the pattern-mixture framework, termed Controlled Multiple Imputation.
What is controlled multiple imputation?
Controlled Multiple Imputation (MI) procedures combine pattern‐mixture modelling with MI and provide a practical, accessible platform for sensitivity analysis. The standard MI procedure imputes missing data using the conditional distributions of partially observed response data given the observed response data under the assumption of MAR. Within the pattern mixture framework, the conditional distributions implied by MAR for each missing data pattern can be modified as appropriate. The modified conditional distributions can then be used within the MI algorithm, in place of the MAR distribution to impute under MNAR. Multiple imputed data sets are obtained. The imputed data sets are each analysed using the primary substantive analysis model, which would have been used in the absence of any missing data (as done for standard MAR MI). Results across imputed data sets can then be combined using Rubin’s rules for inference.
When MI is performed in such a manner, this is termed ‘Controlled Multiple Imputation’, as the analyst has direct control over the imputation distribution.
Controlled MI procedures include δ‐based methods, which enable one to explore the impact of a worse or better response than that predicted based on the observed data distribution. Starting with the data distribution implied by MAR, a numerical parameter (δ) is specified which shifts the proposed distribution of the unobserved data away from MAR. For example, for a continuous outcome, data can be imputed assuming a mean response which is lower/higher than that predicted based on the observed data. For a binary or time-to-event outcome δ can respectively represent the difference in the (log) odds, or hazard, of response between the observed and unobserved cases.
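In practice δ is often varied over a grid, and the value at which the conclusion changes is reported as a tipping point. The Python sketch below shows only that outer loop. The callback `analyse` is hypothetical and stands for a complete δ-adjusted MI analysis returning a pooled estimate and its standard error; the normal critical value of 1.96 is a simplifying assumption, and the numbers in `toy_analysis` are made up.

```python
def find_tipping_point(deltas, analyse, critical_value=1.96):
    """Return the first delta at which the result stops being significant.

    deltas   delta values ordered from no adjustment to the most extreme
    analyse  hypothetical callback: delta -> (pooled estimate, standard error)
    """
    for delta in deltas:
        estimate, std_error = analyse(delta)
        if abs(estimate / std_error) < critical_value:
            return delta
    return None  # the conclusion never changed over the grid examined


def toy_analysis(delta):
    """Stand-in for a real analysis: the treatment effect shrinks with delta."""
    return -3.0 + delta, 1.0


if __name__ == "__main__":
    print(find_tipping_point([0.0, 0.5, 1.0, 1.5, 2.0], toy_analysis))
```

Whether the tipping value of δ is clinically plausible is then a matter of judgement rather than calculation.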
An alternative example of controlled MI is reference‐based MI, which enables one to explore the impact of individuals with missing data behaving like a specified reference group in the observed data. The difference between the MAR and MNAR distribution is described entirely using information within the data set, by reference to other groups in the data. The parameters of the observed data distribution, estimated assuming MAR, are recombined across groups to form contextually relevant MNAR distributions for the unobserved data. For example, in a two-group placebo-controlled trial, data for individuals with missing data in the active arm can be imputed following the behaviour in the placebo group (Carpenter et al, 2013).
We have written a practical tutorial on sensitivity analysis for clinical trials with missing continuous outcome data using controlled multiple imputation with worked examples, which is available here. Delta- and reference-based multiple imputation methods can also be used with binary (Leacy et al, 2017), ordinal (Tang 2018), count (Keene et al, 2014) and survival data (Jackson et al, 2014 and Atkinson et al, 2019). Links to software for implementation are provided below.
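As an illustration of how parameters can be combined across groups, the Python sketch below computes a "jump to reference" style imputation distribution in the simplest possible setting: one visit observed before withdrawal and one visit missing afterwards. It assumes the arm means and the reference-arm variances and covariance are already estimated and treats them as fixed, whereas proper MI would draw them afresh for each imputation. Building the conditional distribution from the reference arm's regression of the later visit on the earlier one is an assumption of this sketch, and all names are hypothetical.

```python
def j2r_conditional(y_pre, mean_active_pre, mean_ref_post,
                    var_ref_pre, cov_ref, var_ref_post):
    """Mean and SD of the imputation distribution for the missing visit.

    y_pre            value observed before withdrawal (active-arm subject)
    mean_active_pre  active-arm mean at the observed visit
    mean_ref_post    reference-arm mean at the missing visit
    var_ref_pre, cov_ref, var_ref_post
                     reference-arm variances and covariance of the two visits
    """
    # Regression of the later visit on the earlier one in the reference arm.
    slope = cov_ref / var_ref_pre
    # The subject keeps their deviation from their own arm's mean, but the
    # expected value "jumps" to the reference-arm mean after withdrawal.
    cond_mean = mean_ref_post + slope * (y_pre - mean_active_pre)
    cond_var = var_ref_post - slope * cov_ref
    return cond_mean, cond_var ** 0.5


if __name__ == "__main__":
    # Made-up parameter values.
    print(j2r_conditional(14.0, 10.0, 12.0, 4.0, 2.0, 5.0))
```

An imputed value would then be drawn from a normal distribution with this mean and standard deviation.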
What is Information Anchored sensitivity analysis?
When a data set includes missing data, there will naturally be a loss of information in the analysis, relative to when all data could be observed. This means the precision of estimates will be reduced (i.e. larger standard deviations and standard errors, and wider confidence intervals), reflecting the greater uncertainty that comes with a reduced data set. This will be the case in any primary analysis. It is important to be aware that a sensitivity analysis can change the statistical information about an estimate.
Information-anchored sensitivity analysis is defined as a sensitivity analysis that varies the assumption for the missing data while holding the proportion of information lost due to missing data constant across the primary and sensitivity analyses (Cro et al, 2019).
We regard information anchored inference as desirable for sensitivity analysis. It ensures there is no loss or gain of information due to missing data in the sensitivity analysis relative to the primary analysis.
This is a particularly desirable property for sensitivity analysis within the context of clinical trial analysis. Regulators can be reassured the sensitivity analysis is not artificially injecting information, while trialists can be reassured that the sensitivity analysis is not discarding any of the valuable obtained data.
In clinical trials it can often be most appropriate to conduct the primary analysis under MAR. Sensitivity analysis exploring departures from MAR will then be required. Controlled imputation (described above) provides a practical, accessible route for doing so. We have shown elsewhere (Cro et al, 2019) that Rubin’s MI combining rules provide information anchored inference, and hence provide an appropriate estimate of variance for the treatment effect when used following controlled multiple imputation (including both delta- and reference-based MI). Rubin’s rules preserve the loss of information seen under MAR in controlled MI sensitivity analysis. This is why we recommend the use of Rubin’s variance estimator within delta‐ and reference‐based sensitivity analyses.
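For reference, Rubin's combining rules for a scalar estimate can be sketched in a few lines of Python. The function name is hypothetical; the inputs are the point estimates and their squared standard errors from each imputed data set.

```python
from statistics import mean, variance


def rubins_rules(estimates, variances):
    """Pool per-imputation estimates and variances using Rubin's rules.

    Returns the pooled estimate, its total variance and the degrees of
    freedom of the reference t distribution.
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)          # average within-imputation variance
    between = variance(estimates)     # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    if between == 0:
        df = float("inf")             # no between-imputation variability
    else:
        df = (m - 1) * (1 + within / ((1 + 1 / m) * between)) ** 2
    return pooled, total, df


if __name__ == "__main__":
    # Made-up results from three imputed data sets.
    print(rubins_rules([1.0, 2.0, 3.0], [0.5, 0.5, 0.5]))
```

The between-imputation term is what carries the uncertainty due to the missing data into the final variance.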
What software can be used?
There are several software packages available for conducting controlled MI that enable accessible sensitivity analysis. Options (not exhaustive) are summarised below. For δ-based imputation with a continuous outcome, standard multiple imputation commands (e.g. mi impute in Stata or proc mi in SAS) can be used to create imputed data sets, and the imputed values can then be shifted by adding the required δ (Cro et al, 2020).
- The ‘five macros’ and miwithd: Performs reference-based MI for multivariate normal data following the general algorithm of Carpenter, Roger and Kenward (2013). Available on the DIA working group pages.
- NegBin_PMI: Performs reference-based MI for negative binomial discrete data following the methodology of (Keene et al, 2014). Available on the DIA working group pages.
- mimix: Performs reference based multiple imputation for multivariate normal data following the general algorithm of Carpenter, Roger and Kenward (https://pubmed.ncbi.nlm.nih.gov/24138436/). Available at https://ideas.repec.org/c/boc/bocode/s457983.html
- mlmi: implements a maximum likelihood MI version of reference based imputation for repeatedly measured continuous endpoints. Available at https://github.com/jwb133/mlmi
- dejaVu: Performs reference-based MI for negative binomial discrete data following the methodology of Keene et al. Available on the DIA working group pages.
- InformativeCensoring: Performs multiple imputation for a time to event outcome under informative censoring using (i) a Cox model fitted to the observed data under the non-informative censoring assumption and a user specified multiplier compared to the hazard implied by the non-informative censoring assumption or (ii) a Kaplan-Meier type imputation of censored time. Available on the DIA working group pages.