Read our latest review article in the Biometrical Journal: Missing data: A statistical framework for practice.
If you are planning a study, or analysing a study with missing data, these guidelines (pdf, 25Kb) are for you.
Missing data is very common in observational and experimental research. It can arise due to all sorts of reasons, such as faulty machinery in lab experiments, patients dropping out of clinical trials, or non-response to sensitive items in surveys. Handling missing data is a complex and active research area in statistics.
Ignoring the problem of missing data can lead to loss of statistical power and can also introduce bias. Analysis of data with missing observations involves, firstly, constructing a suitable set of assumptions about why the data is missing in the study. Given these assumptions, there are several methods for carrying out the analysis of data, including the EM algorithm, inverse probability weighting, a full Bayesian analysis, direct application of maximum likelihood and multiple imputation. We usually cannot assess the validity of the assumptions regarding the missingness mechanism, so it is often recommended to examine how robust the inference is to the choice of assumptions in a sensitivity analysis.
- Introduction to Multiple Imputation: Slides
These slides aim to introduce you to the concepts and ideas related to analysing datasets with missing observations. They have been extracted from James Carpenter and Mike Kenward's introductory course on missing data (2005).
- Substantive model compatible imputation of missing covariates
Missing covariate data commonly occur in epidemiological and clinical research, and are often dealt with using multiple imputation (MI). The imputation of partially observed covariates is complicated if the model of interest is non-linear (e.g. Cox proportional hazards model), or contains non-linear (e.g. squared) or interaction terms. Standard software implementations of MI may impute covariates from models that are incompatible (uncongenial) with such models of interest, which may result in biased estimates. We have recently proposed a modified version of the popular fully conditional specification (FCS) (or chained equations) approach to multiple imputation, which ensures that each partially observed covariate is imputed from a model which is compatible with the specified model for the outcome.
A paper describing the method has been published in Statistical Methods in Medical Research:
Jonathan W. Bartlett, Shaun R. Seaman, Ian R. White, James R. Carpenter. Multiple imputation of covariates by fully conditional specification: accommodating the substantive model, Statistical Methods in Medical Research, 2015; 24:462-487
smcfcs in R
An R package implementing the approach, smcfcs, is now available on CRAN and can be installed from within R.
The latest development version can be installed into R from GitHub.
smcfcs in Stata
A Stata program implementing the approach for linear, logistic and Cox proportional hazards outcome models is available for free download. Imputation is now supported for continuous (under the normal linear regression model), binary (under the logistic model), count (using either Poisson or negative binomial regression models), and categorical (using ordered logistic or multinomial logistic regression) covariates.
To install, load Stata, and at the command window type:
ssc install smcfcs
The latest development version can be installed into Stata from GitHub using:
net install smcfcs, from (https://raw.githubusercontent.com/jwb133/Stata-smcfcs/master/) replace
More details about the Stata package can be found in an accompanying Stata journal paper:
Bartlett JW, Morris TP. 2015. Multiple imputation of covariates by substantive-model compatible fully conditional specification. The Stata Journal; 15(2): 437-456
- Stata export and import to REALCOM
A Stata program is available for import and export to the Realcom Impute software.
The Stata program can be installed by typing the following (one line) in Stata’s command prompt:
net install realcomImpute, from (https://raw.githubusercontent.com/jwb133/StataRealcomImpute/master/) replace
Please note that we do not maintain the Realcom Impute software itself. It is developed and maintained by researchers at the University of Bristol.
Please note that additional software provided by the DIA working group is available on the DIA working group pages; our software for multilevel multiple imputation is available on the Multilevel MI pages, and our sensitivity analysis software is available on the Sensitivity analysis pages.
Drug Information Association Scientific Working Group on Estimands and Missing Data
The following contain materials available from the Drug Information Association (DIA) working group.
Accessing the downloads will open Sharepoint. Then press the "Download" button in the top left corner, rather than drilling down within the folder.
- SAS code for describing and plotting withdrawal rates
Latest update 26 July 2012
Quick summary and Download
These SAS macros provide basic information to characterize withdrawal/dropout in the data set, which can be a preliminary step in missing data analyses. The descriptive summary outputs provide visitwise percentages of patients with at least one post-baseline observation that have data at subsequent visits, as well as visitwise response means by treatment for patients who drop out versus those who continue. The plot outputs include a Kaplan-Meier plot of time to dropout by treatment and a plot of visitwise mean changes from baseline in response by treatment and dropout status (dropout or continue).
Estimating Principal Strata
- SAS code to estimate principal strata
Latest update 02 September 2021
The package includes SAS code to identify a principal stratum based either on the time of an event or on membership of a category. The macro implements the idea of Rubin (1998, More powerful randomization-based p-values in double-blind trials with non-compliance. Stats in Medicine, vol 17, 371-385). In Rubin’s framework, membership of the principal stratum in a clinical trial is defined as status with regard to an event or a category, irrespective of randomised treatment group. Status under treatment groups other than that actually randomized is multiply imputed, using a probability model estimated from baseline characteristics of subjects randomised to those other treatment groups. The core of the implementation was devised by Bohdana Ratitch. The code written by Bohdana Ratitch has been adapted by Michael O’Kelly and reviewed by Pavol Kral, with some of the code QCd via independent programming. For the imputation of status with regard to an event, a time-to-event imputation macro devised by Ilya Lipkovich is used – a version of this time-to-event imputation macro is already available on the missingdata.org.uk webpage.
Version submitted 02Sep2021 with correction of an error in the algorithm when “membership of a category” was used to define the principal stratum – see modification history in macro Princip_Strata_TTE_YN_MIv2.sas for details.
The principal-stratum package consists of
-Brief user documentation in Word in the topmost folder;
-In the \macros\ subfolder, the principal stratum main macro Princip_Strata_TTE_YN_MIv2.sas, its utility macros and the time-to-event imputation program and its utility macros;
-Example code that uses the macro, in the topmost folder:
The demonstration code sim_for_psv4.sas simulates an outcome censored by an event, and then demonstrates the use of the principal stratum macro to identify the principal stratum of those to whom the event of interest would never happen (“always survivors”); .log and .lst outputs are included;
Other demonstration code takes as input data sets available from the National Institute of Health (NIH) and the Aids Clinical Trial Group (ACTG). Use of these latter data sets must be applied for to the respective owners (NIH and ACTG); .log and .lst are not included for these programs.
Download the files here.
Direct likelihood approaches
- Direct likelihood with influence and residual diagnostics
Latest update 26 July 2012
These SAS macros focus on the direct likelihood analysis approach under the MAR assumption, with influence and residual diagnostics. In many clinical studies, such as the highly controlled setting of longitudinal confirmatory trials, it is plausible to start with the MAR assumption, and missing data may be mostly MAR. Such approaches with restrictive models are often a reasonable choice for the primary analysis, since the models are simple, have few independent variables and often include only the design factors of the experiment.
The primary analysis macro (DL_Primary1) uses SAS PROC MIXED with a REPEATED statement as the standard MMRM analysis. The within-subject covariance structure can be specified in the REPEATED statement. Visitwise treatment main effects as well as treatment differences are provided through the LSMEANS statement. A separate macro (DL_Cov1) can repeat the primary analysis using a list of user-specified within-subject covariance structures and also provides AIC and likelihood values as a model selection reference.
The general idea of quantifying the influence of one or more observations relies on computing parameter estimates based on all data points, removing the cases in question from the data, refitting the model, and comparing the full-data and reduced-data estimates. The remaining three macros implement this idea.
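As a language-neutral illustration of this case-deletion idea (not a translation of the SAS macros), the Python sketch below refits a model after removing each cluster in turn and records the change in the estimate. The function names, the toy data and the use of an ordinary least-squares slope in place of an MMRM fit are all assumptions made for illustration.

```python
from statistics import mean

def ols_slope(points):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    xbar = mean(x for x, _ in points)
    ybar = mean(y for _, y in points)
    sxy = sum((x - xbar) * (y - ybar) for x, y in points)
    sxx = sum((x - xbar) ** 2 for x, _ in points)
    return sxy / sxx

def deletion_influence(clusters):
    """Change in the slope when each cluster (e.g. patient or site) is removed."""
    full = ols_slope([p for pts in clusters.values() for p in pts])
    influence = {}
    for name in clusters:
        reduced = [p for other, pts in clusters.items() if other != name for p in pts]
        influence[name] = full - ols_slope(reduced)
    return full, influence

# Hypothetical toy data: cluster C contains one aberrant observation.
toy = {
    "A": [(0, 0.0), (1, 1.0), (2, 2.0)],
    "B": [(0, 0.5), (1, 1.5), (2, 2.5)],
    "C": [(0, 0.0), (1, 1.0), (2, 8.0)],
}
full_slope, influence = deletion_influence(toy)
most_influential = max(influence, key=lambda k: abs(influence[k]))
print(full_slope, influence, most_influential)
```

In practice the comparison would be made on the treatment effect from the primary analysis model, and a cut-off on a measure such as Cook's D would decide which clusters count as influential.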
Macro DL_residual1 conducts influence diagnostics for observations with aberrant residuals. The primary direct likelihood analysis model is used to obtain studentized residuals for each observation, and users can specify the cut-off value beyond which a residual marks an observation as aberrant. PROC MIXED then reruns the direct likelihood analysis with aberrant data deleted in the placebo arm only, the study drug arm only, and all arms, so that the influence of these aberrant observations on the primary analysis can be evaluated.
The last two macros DL_Influence_Patient1 and DL_Influence_Site1 conduct influence diagnostics for clusters of observations, where the cluster is defined by patients or by investigative site. The influential patients or influential investigative sites will be identified using Cook’s D provided by PROC MIXED as well as cut off values specified by users. Results from the primary analysis (includes all patients/sites) and from datasets with influential patients/sites removed are printed for comparison and evaluation of influence.
- Selection models
Latest update 07 March 2012
The selection model is one of the best-known classical statistical methods for handling missing data under the MNAR assumption (Diggle and Kenward 1994). It is based on a factorization of the joint likelihood of the measurement process and the missingness process. A marginal density of the measurement process describes the complete data generation, while the density of the missingness process conditional on the outcomes describes the missing data “selection” based on the complete data. Therefore, similar to the shared parameter model, this is a joint modelling approach: two processes linked through the response variable. Please note that in selection models it is the response values that directly enter the model for the missingness process and/or dropout probability, in contrast to the latent random effects used in shared parameter models. Users need to make their own judgement and choice according to their project details.
The classic selection model assumes Non-Future Dependence (NFD). That is, the dropout probability depends only on the last previously observed response and the current, missing response. This is a reasonable assumption and can significantly simplify the analysis. It is also very common for different treatment arms to show different dropout behaviour, which can be specified in the selection model approach.
The current macro models the response using a standard repeated measure model and models the dropout using a logistic regression.
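To make the dropout part concrete, here is a minimal Python sketch of a logistic dropout model of the Diggle and Kenward type, in which the logit of the dropout probability is linear in the previous and the current response. The function name and the parameter names psi0, psi1 and psi2 are hypothetical and are not taken from the macro; the numerical values are arbitrary.

```python
import math

def dropout_probability(y_prev, y_curr, psi0, psi1, psi2):
    """Logistic dropout model depending on the previous and the current response."""
    eta = psi0 + psi1 * y_prev + psi2 * y_curr
    return 1.0 / (1.0 + math.exp(-eta))

# psi2 = 0: the current (possibly unobserved) response plays no role, i.e. MAR.
p_mar_low = dropout_probability(y_prev=5.0, y_curr=2.0, psi0=-2.0, psi1=0.1, psi2=0.0)
p_mar_high = dropout_probability(y_prev=5.0, y_curr=9.0, psi0=-2.0, psi1=0.1, psi2=0.0)

# psi2 != 0: dropout depends on the current response itself, i.e. MNAR.
p_mnar_low = dropout_probability(y_prev=5.0, y_curr=2.0, psi0=-2.0, psi1=0.1, psi2=0.3)
p_mnar_high = dropout_probability(y_prev=5.0, y_curr=9.0, psi0=-2.0, psi1=0.1, psi2=0.3)

print(p_mar_low, p_mar_high, p_mnar_low, p_mnar_high)
```

Because the current response is missing for subjects who drop out, fitting such a model requires integrating over it, which is why the coefficient on the current response is typically treated as a sensitivity parameter.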
Selection model fitting in the current macro involves integration, and the code uses a SAS built-in nonlinear optimization function, which can be very slow. Convergence may not be achieved in extreme cases. Please always check the log files.
Alternatively, selection models can be implemented via proc mcmc. The model specifications and computations may be more straightforward.
Macros can be downloaded from Selection Model_20120726.
- Shared parameter models
Latest update 07 March 2012
In contrast to most of the other macros this fits a random coefficient regression model.
- The longitudinal measurement process model follows a standard random-coefficient mixed effect model
- The dropout mechanism model uses a complementary log-log link or logit link.
These 2 models are linked by latent (unobservable) subject random effect vector Ui which is selected from one of 3 options:
- only random intercept is shared;
- both random intercept and random linear time slope are shared;
- random intercept, random linear time slope and random quadratic time slope are shared.
The underlying model is a random coefficient regression model.
Alternatively, a wide range of unstructured repeated measures regression models with shared parameters can be fitted using the Mymcmc macro (Gaussian Repeated Measures with conjugate priors fitted using proc MCMC in SAS) in the Direct likelihood / Bayesian approaches section. However, these require user coding of the dropout mechanism model.
The macros can be downloaded from Shared Parameter Model_20120726.
For explanation see the set of Powerpoint slides.
Imputation based approaches
- Gaussian Repeated Measures with time changing covariates including reference-based imputation (RBI)
- Latest update 05 May 2023
New facility to impute with separate means model from that used for estimation. Applications include RBI method CR and Carpenter et al (2013) separate Variance matrix proposal.
This is an extended version of the RMConj macro that simplifies its use for reference-based (RBI) imputation, while maintaining its other features as a fast sampler from a Bayesian Multivariate model with one or more unstructured covariance matrices.
Many of its features can now be done directly using the BGLIMM procedure, excluding the new (two stage) facilities which still require this macro.
A Gaussian repeated measures model with one or several unstructured covariance matrices is fitted using proc MCMC sampling directly based on conjugate priors. Any missing values for subject visits with no response are imputed and directly available in the imputed data set.
Main restriction is that every subject uses the same covariance matrix throughout their series of visits.
Data is input in vertical form just like proc MIXED.
The original motivation was the modelling of off-treatment data, and other situations where the actual treatment changes across visits. Subsequent analysis will usually be based on multiple imputation techniques. The tools can also be used to fit many of the models usually fitted using the GSK 5 macros.
A new macro RefBTbv is supplied which simplifies the declaration of reference-based (RBI) as described by Carpenter et al (2013).
The RMConjPlus macro allows two stages. The first is an estimation stage, with imputed values created in line as part of the MCMC process. When requested, the second stage imputes as a separate action from the estimation, allowing the covariate values for each subject-visit combination to differ from those used in stage 1.
This allows the use of the RBI method CR and also the Carpenter et al (2013) rules for covariances when separate covariance matrices are used for each treatment arm.
The implementation is fast (about ten times faster than GSK 5 macros) and leads to chains with very little auto-correlation.
The following files are contained in a zip file downloaded as RMConjPlus20230426.zip
1. RMConjPlusdoc20230427.pdf. A description of the tools and the methods used.
2. RMConjPlus20.sas. The SAS code for the macro RMConjplus. The header includes a detailed description and development history.
3. RefBTbV04.sas. The SAS code for the RefBasedX macro. The header includes a detailed description and development history.
4. MIAnalze10.sas. Macro used in examples for combining results from multiple imputations.
5. DemoRMconj9.sas is an example program file that does MAR, J2R, CIR, Causal and CR analyses to the standard DIA example data set chapter15_example.sas7bdat. Results are in file Demo_RMConj9-results.html
6. Example data set chapter15_example.sas7bdat
- Imputation of Recurrent event data for partial observed off-treatment data
Latest update 16 March 2021
These macros fit Bayesian models using Negative Multinomial distribution for Count data using MCMC.
In the past, many trials have stopped collection of data following discontinuation of randomised treatment. However, more recently data collection continues after randomised treatment discontinuation, since the occurrence of this event is irrelevant to the calculation of a treatment policy estimand. When all such data are fully collected, analysis simply ignores the treatment adherence and categorizes patients by their randomisation allocation, grouping together those who complete their randomised treatment and those who do not.
Often patients who discontinue randomised treatment will leave the trial before completion. This leads to missing data all of which is in the off-treatment period. This suggests that it should be imputed using experience off treatment.
For continuous outcomes, methods are described in the section 'Stepwise imputation for marginal model based on previous residuals', where multiple imputation (MI) is used to complete the missing data under models which borrow information from experience in the off-treatment period rather than either the on-treatment period or a combination of both on and off. Similarly, the methods described in the section 'Imputation for Gaussian Repeated Measures with time changing covariates' can be used. Each has different limitations on the possible models.
The macros described here extend the MI approaches for recurrent event data as described in 'Reference based MI for negative binomial discrete data' to impute using information borrowed from the off-treatment period only. All the examples there can be handled using this more flexible macro. The additional application allows patients to potentially go through three periods: on randomized treatment, off randomized treatment and finally missing. The data are assumed to follow a log-linear model with a Negative Multinomial distribution (equivalent to a Gamma-Poisson model). A further application is to build models with the timescale split into multiple intervals. Details are available in the program headers and examples supplied.
A full description of the methodology is available in Pharmaceutical Statistics. Two different but equivalent computational methods are used, one in each of two SAS macros.
The Feb 2019 update does not change anything but adds a check that the model includes a CLASS variable. It also allows the creation of a file for the posterior sample. This is useful if modelling rather than imputation is your purpose. The March 2021 update adds a check that the parameters necessary for imputation are in fact estimable. That is, it errors if there is extrinsic aliasing of parameters required for the imputation.
The following files can be downloaded from NegMult20210313.
NM_Reg22.sas Main macro using Negative Multinomial computational approach
NM_Rand13.sas Main macro using Gamma-Poisson computational approach (slower)
NB_Analze4.sas Macro to fit Negative Binomial log-linear model to multiple imputed data sets and summarise using Rubin’s formula
NM_Simdata2.sas Program to simulate data based on characteristics of a real trial.
NM_Simdemo5.sas Program which uses these macros to fit a series of possible imputation models including ones that allow piecewise constant, seasonal variation and Delta.
NM_Simdemo5-results.pdf Results from the SimDemo3 program.
- Multiple imputation for informatively censored time to event data – the Informative Censoring R package
Latest update 24 July 2020
Quick Summary and Download
The R package InformativeCensoring, available on CRAN, can be used to perform multiple imputation for a time to event outcome when it is believed censoring may be informative.
Two methods are implemented. The first, based on Jackson et al 2014, first fits a Cox model to the observed data under the usual (conditional on covariates) non-informative censoring assumption. Multiple imputed datasets can then be generated in which it is assumed that the hazard for failure following censoring changes by a user specified multiplier compared to the hazard implied by the non-informative censoring assumption.
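The effect of the hazard multiplier can be illustrated with a deliberately simplified Python sketch. It assumes a constant (exponential) hazard rather than the Cox model fitted by the package, and the function and argument names are hypothetical; they are not part of the InformativeCensoring API.

```python
import math
import random

def impute_event_time(censor_time, hazard, multiplier, rng):
    """Draw an event time after censoring when the post-censoring hazard is
    `multiplier` times the hazard implied by non-informative censoring.
    Assumes a constant hazard, so the residual time is exponential."""
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    return censor_time + (-math.log(u)) / (multiplier * hazard)

# Same random stream for both scenarios, so only the multiplier differs.
rng = random.Random(2020)
baseline = [impute_event_time(2.0, 0.1, 1.0, rng) for _ in range(20000)]
rng = random.Random(2020)
doubled = [impute_event_time(2.0, 0.1, 2.0, rng) for _ in range(20000)]

mean_extra_baseline = sum(baseline) / len(baseline) - 2.0  # close to 1 / 0.1
mean_extra_doubled = sum(doubled) / len(doubled) - 2.0     # close to 1 / 0.2
print(mean_extra_baseline, mean_extra_doubled)
```

A multiplier of 1 reproduces imputation under non-informative censoring; values above 1 shorten the imputed residual times, and values below 1 lengthen them.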
The second, based on Hsu and Taylor, performs Kaplan-Meier type imputation of censored time, in which the Kaplan-Meier estimate is calculated for an individual based on data from those individuals who are closest in terms of predicted hazard of failure and predicted hazard of censoring. This matching can also be performed using time-dependent covariates. The approach thus assumes that censoring is non-informative conditional on the covariates used to predict the hazard of failure/censoring.
Corresponding vignettes are provided describing how the two methods are implemented and can be used.
- Multiple imputation for time to event data under Kaplan-Meier, Cox or piecewise-exponential frameworks – SAS macros
Latest update 19 October 2020
Multiple imputation (MI) and analysis of imputed time-to-event data is implemented in a collection of SAS macros based on the methodology described in the following publications:
- Lipkovich I, Ratitch B, O’Kelly M (2016) Sensitivity to censored-at-random assumption in the analysis of time-to-event endpoints. Pharmaceutical Statistics 15(3):216-229
- Moscovici JL, Ratitch B (2017) Combining Survival Analysis Results after Multiple Imputation of Censored Event Times. PharmaSUG-2017 (available on-line https://www.pharmasug.org/proceedings/2017/SP/PharmaSUG-2017-SP05.pdf)
Briefly, the methods estimate multiple imputations via draws from the Bayesian posterior distribution of parameters of a model (piecewise exponential); or via bootstrapped versions of the input data with a standard inverse method translating estimated probability into time to event (Cox and Kaplan-Meier). Hazards can be subjected to increase/decrease via user-specified amount delta; reference-based imputed survival can be implemented by estimating the imputation model based on a user-specified subset of observed subjects (piecewise exponential and Kaplan-Meier); or by user specification of the treatment group parameter to be used when calculating the imputed time to event (Cox).
The macros can be downloaded here: Package_Release_V3_1_final.zip
- Pattern Mixture
Latest update 26 July 2012
The macro imputes under a series of different methods, analyses data using Repeated measures (not ANOVA at a single visit) and provides summary estimates for least-squares means.
Either complete case missing value (CCMV) restriction or nearest case missing value (NCMV) restriction. [subset cases]
. . . and either non-future dependence (NFMV) or use all future data (ALL), when estimating parameters for each imputation step. [subset visits]
See Thijs & Molenberghs et al (2002), Strategies to fit Pattern Mixture Models, Biostatistics, 3(2), 234-265.
Similar models can be fitted using the MNAR statement in proc MI (SAS version 9.4 or later). Neither includes ACMV.
The files can be downloaded at Pattern Mixture Model_20120726.
Includes a useful set of Powerpoint slides.
- Plot and compare up to 2 control-based approaches
Latest update 16 Feb 2016
This macro calls other macros downloadable from the Imputation-based approaches section of this web site using a common interface, and plots the results of up to two analyses.
The package can be downloaded from Macro interface and plotting package 20160216.
Several example output files are available in png format.
- Reference-based MI for Negative Binomial discrete data – R package dejaVu
Latest update 13 January 2021
Quick Summary and Download
The R package dejaVu, now available on CRAN, implements control-based multiple imputation for count data, as proposed by Keene, Oliver N., et al. “Missing data sensitivity analysis for recurrent event data using controlled imputation.” Pharmaceutical Statistics 13:4 (2014): 258-264.
When used to analyse an existing partially observed dataset, the package first fits a negative binomial regression model to the observed data, assuming MAR. Multiple imputations of the counts in the periods after subjects drop out are then generated, under a user-chosen assumption. Options include MAR, and the jump to reference and copy reference MNAR assumptions. Users can also write and use their own imputation mechanisms with the package.
The package was developed by the Scientific Computing and Statistical Innovation groups of the Advanced Analytics Centre at AstraZeneca.
A SAS implementation of the same methods, developed by James Roger, is available in the 'Reference-based MI for Negative Binomial discrete data – SAS macros' section.
- Reference-based MI for Negative Binomial discrete data – SAS macros
Updated 22 January 2020 and update corrected 6 February 2020. Updated 12 August 2021 to work on SAS 9.4.
Statistical analyses of recurrent event data have typically been based on the missing at random assumption (MAR) along with constant event rate. These treat the number of events as having a Negative Binomial distribution with an offset term which is the log of the length of time observed. One implication of this is that, if data are collected only when patients are on their randomized treatment, the resulting de jure estimator of treatment effect corresponds to the situation in which the patients adhere to this regime throughout the study. For confirmatory analysis of clinical trials, analyses are required that investigate alternative de facto estimands that depart from this assumption.
The macro described in this section parallels those available elsewhere in this section for continuous data but is based on the assumption of a Gamma-Poisson process underlying the classic Negative Binomial analysis. A detailed description of the methodology is presented in Keene, O.N., Roger, J.H., Hartley, B.F., and Kenward, M.G. (2014). Missing data sensitivity analysis for recurrent event data using controlled imputation. Pharmaceutical Statistics, 13, 4, 258-264. This macro implements the methods described there.
As implemented here the approach assumes that the event rate is constant across time. What it does do is allow for frailty in the event rate across patients. To move away from the assumption of constant event rate one might:
a) Extend this approach by breaking up the time period into a series of periods with separate constant rates.
b) Follow a more general time-to-event approach as described by Akacha et al (2015).
List of files/folders that can be downloaded from NegBI_PMI_20210810.
1. NegBin_PMI30.SAS: The file containing the macro itself. See the header of this file for details of usage. Version to run on SAS 9.4.
2. Demo_NegBinMI_1.SAS: A SAS program file running an example as described by Roger & Akacha at the PSI 2014 conference.
3. Demo_NegBinMI_1.sas.MHT: The output from running this demonstration file.
4. Demo_NegBinMI_1.sas.LOG: The log file from running this demonstration file.
5. PSI2014_6A_Roger_Akacha.pdf: The slides from the PSI meeting which describe the example and outline the two approaches and their connection.
6. Bladder_complete2.sas7bdat The example data set.
7. NegBIn_PMI30.SAS. Previous version before update to run on SAS 9.4.
This page was written by James Roger (email@example.com).
- Reference-based MI via Multivariate Normal RM (the “five macros” and MIWithD)
Latest update 13 October 2022
The “five macros” fit a Bayesian Normal RM model and then impute post withdrawal data under a series of possible post-withdrawal profiles including J2R, CIR and CR as described by Carpenter et al [Carpenter, J. R., Roger, J., and Kenward, M.G. Analysis of longitudinal trials with missing data: A framework for relevant, accessible assumptions, and inference via multiple imputation. J Biopharm Stat (2012).]. It then analyses the data using a univariate ANOVA at each visit and summarizes across imputations using Rubin’s rules.
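The post-withdrawal profiles differ in the marginal mean that is assumed for a withdrawn subject. The Python sketch below builds these mean profiles for an active-arm subject following the definitions in Carpenter et al; it is an illustration only, not the macros' code. The function name, arguments and numerical values are hypothetical, and the covariance handling and the conditional draw of the missing values given the observed ones are omitted.

```python
def imputation_mean_profile(mu_active, mu_ref, last_visit, method):
    """Marginal mean profile used to impute an active-arm subject whose last
    observed visit is `last_visit` (1-based)."""
    k = last_visit
    if len(mu_active) != len(mu_ref) or not 1 <= k <= len(mu_ref):
        raise ValueError("inconsistent mean vectors or last_visit out of range")
    if method == "MAR":   # stays on the active-arm means
        return list(mu_active)
    if method == "CR":    # copy reference: reference means at every visit
        return list(mu_ref)
    if method == "J2R":   # jump to reference after withdrawal
        return list(mu_active[:k]) + list(mu_ref[k:])
    if method == "CIR":   # copy increments in reference after withdrawal
        post = [mu_active[k - 1] + (mu_ref[j] - mu_ref[k - 1])
                for j in range(k, len(mu_ref))]
        return list(mu_active[:k]) + post
    raise ValueError(f"unknown method: {method}")

# Hypothetical visit means for four visits; withdrawal after visit 2.
mu_active = [10.0, 8.0, 6.0, 4.0]
mu_ref = [10.0, 9.0, 8.0, 7.0]
profiles = {m: imputation_mean_profile(mu_active, mu_ref, 2, m)
            for m in ("MAR", "J2R", "CIR", "CR")}
print(profiles)
```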
An earlier macro to implement a similar marginal approach is also available here called MIWITHD (current version is 34). It uses the MCMC statement in proc MI rather than proc MCMC which limits its functionality compared to the five macros. The MIWithD macros are available here for historical reasons, and the “5 macros” will usually be preferred.
The Part1B macro is rather slow as it is compatible with early versions of proc MCMC, and does not take advantage of more recent facilities. But the sample from the posterior is automatically stored and can be repeatedly used in Part2A onward.
The MyMCMC macro (Gaussian Repeated Measures with conjugate priors fitted using proc MCMC in SAS) in the Direct likelihood / Bayesian approaches section imputes using MAR, J2R, CIR, ALMCF or AFCMCF much more efficiently and faster, but is restricted to a single variance-covariance matrix. The RM_Conj macro in the Imputation based approaches / Imputation for Gaussian Repeated Measures with time varying covariates, takes vertical data allowing time varying covariates and also allows for several different variance-covariance matrices while retaining the speed advantage of conjugate priors.
The current zip file containing the “five macros” package as tested on SAS 9.4 can be downloaded from Five_Macros20221011.zip. The subfolder TheFivemacros contains the macros Part1A, Part1B, Part2A, Part2B and Part3 along with a plotting macro.
The subfolders DIA_Run and PI_Demo_GSK contain examples based on two publicly available data sets (supplied).
The method based on causal inference by (White IR, Joseph R, Best N. A causal modelling framework for reference-based imputation and tipping point analysis in clinical trials with quantitative outcome. J Biopharm Stat. 2020;30:334–50. https://www.tandfonline.com/doi/full/10.1080/10543406.2019.1684308) has been integrated into Part2A and is now available automatically.
Also included are a poster from PSI 2012 annual conference that describes the method and an extensive documentation file.
The zip file containing the MIWithD macro can be downloaded from deliver_mar2014. Note that methods have slightly different names here so that J2R was called J2C, CIR was called CDC while CR was called CC.
Also a final draft version of the original Carpenter et al paper is included in the MIWithD package.
- SAS macro for imputation under generalized linear mixed model
Latest update 26 February 2019
Multiple Imputation requires the modelling of incomplete data under formal assumptions about the combined model for observed and unobserved data (the imputation model).
Generalized Linear Mixed Models provide a natural framework for modelling repeated observations, especially for non-Gaussian outcomes. The new BGLIMM procedure in SAS/Stat 15.1 fits a wide range of such models, allowing for missing data under the MAR assumption. The posterior output data set includes the Bayesian sampled values for the missing values. These are exactly what is needed to complete the imputed data sets. The SAS macro BGI automatically merges the input data set with the multiple posterior draws of the missing values to generate a single MI data set indexed by the variable _IMPUTATION_. This can then be multiply analyzed by the user and a summary built using Rubin’s rules.
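For readers unfamiliar with the final combination step, the following Python sketch applies Rubin's rules to a set of per-imputation estimates and their variances. The function name and the example numbers are hypothetical; in SAS this step would normally be done with a procedure such as PROC MIANALYZE rather than hand-written code.

```python
import math
from statistics import mean, variance

def rubins_rules(estimates, variances):
    """Combine per-imputation estimates and their squared standard errors.
    Returns the pooled estimate, its standard error and the degrees of freedom."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")
    q_bar = mean(estimates)   # pooled point estimate
    w = mean(variances)       # within-imputation variance
    b = variance(estimates)   # between-imputation variance (divisor m - 1)
    total = w + (1.0 + 1.0 / m) * b
    if b > 0:
        df = (m - 1) * (1.0 + w / ((1.0 + 1.0 / m) * b)) ** 2
    else:
        df = math.inf
    return q_bar, math.sqrt(total), df

# Hypothetical results from three imputed data sets.
pooled, se, df = rubins_rules([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(pooled, se, df)
```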
The design vector required for each imputed value is specified in the call to proc BGLIMM just as if it were observed. This allows complex models such as treatment switching.
The macro requires access to SAS/Stat 15.1 or later.
The macro can be downloaded from BGI20190226.
- Sequential imputation with tipping point and delta adjustment
Latest update 21 March 2015
Control-based Imputation: CBI_PMM imputes data at each visit in a separate call to the MI procedure in SAS. Initially the data set has only reference (placebo) subjects. When it reaches a visit where one or more subjects have withdrawn, those subjects are added to the data set. Active subjects are imputed as if they were on placebo, when they were in fact on active up to withdrawal. Their response data is ignored, apart from appearing as covariates in the sequential regressions after withdrawal. As such, treatment never appears in the imputation model.
Delta adjusted MAR: Delta_PMM imputes with delta adjustment on top of an MAR model. It is carried out using a conditional delta algorithm, where delta is embedded within a stepwise regression approach where regression is on previous absolute values (observed or imputed).
Tipping Point analysis: Delta_and_Tip carries out a series of delta adjusted analyses as above to provide a tipping point analysis.
Analysis of each imputed data set either uses RM or univariate ANCOVA and the macros summarize these using Rubin’s rules.
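The tipping point idea can be sketched in a few lines of Python. This is an illustration only, not the logic of the Delta_and_Tip macro: the function names, the use of a simple difference in means instead of ANCOVA, the normal (rather than t) reference distribution, and the assumption that the placebo arm is complete are all simplifications introduced here.

```python
import math
from statistics import NormalDist, mean, variance


def diff_in_means(active, placebo):
    """Difference in arm means and its large-sample variance."""
    est = mean(active) - mean(placebo)
    var = variance(active) / len(active) + variance(placebo) / len(placebo)
    return est, var


def pool(estimates, variances):
    """Combine per-imputation results with Rubin's rules (estimate and SE only)."""
    m = len(estimates)
    between = variance(estimates) if m > 1 else 0.0
    total = mean(variances) + (1 + 1 / m) * between
    return mean(estimates), math.sqrt(total)


def tipping_point(imputed_sets, placebo, deltas, alpha=0.05):
    """Return the first delta at which the pooled result is no longer significant.

    imputed_sets: one (values, was_imputed) pair per imputation for the active
    arm, where was_imputed flags the values filled in under MAR.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    for delta in deltas:
        estimates, variances = [], []
        for values, was_imputed in imputed_sets:
            # Shift only the imputed values; observed values are left untouched.
            shifted = [v + delta if imp else v for v, imp in zip(values, was_imputed)]
            est, var = diff_in_means(shifted, placebo)
            estimates.append(est)
            variances.append(var)
        est, se = pool(estimates, variances)
        if abs(est) / se < z_crit:
            return delta
    return None  # no tipping point within the deltas examined
```

The deltas are scanned in the order supplied, so they would normally be given from zero (pure MAR) towards increasingly unfavourable shifts.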
The following macros can be downloaded from PMM Delta Tipping Point and CBI_20150602:
- cbi_pmm2.sas [Control-based imputation]
- delta_and_tip_2.sas [Delta and tipping point]
- delta_pmm2.sas [Delta for MAR]
The zip file also includes documentation files (docx) for each of the first three and demonstration SAS program files for the first two. It also includes the PharmaSUG2011 paper Implementation of Pattern-Mixture Models Using Standard SAS/STAT Procedures by Ratitch & O’Kelly which explains the methods in detail.
- Stepwise imputation for marginal model based on previous residuals
Updated 04 February 2022
This update includes new parameters that make reference-based imputation method CIR and White et al Causal method easier to do.
The macro MIStep duplicates many of the facilities in MONOTONE REG statement in proc MI, but adds the facility to regress on previous residuals rather than previous absolute values. This allows it to fit marginal methods such as J2R and CIR, and Causal (White et al) using an efficient stepwise algorithm.
The macro also allows imputation under models which include treatment switching. It can therefore be used for imputation based on Retrieved-Dropout (RDO) data.
There is a wrapper macro MIStepWrap which allows one to do repeated calls to the MISTEP macro from a single macro call, which makes routine use much simpler.
Includes an MIAnalyze macro that fits a univariate ANOVA model and summarises least-squares means and their differences using Rubin’s rules. Multiple calls are appended to form a single dataset for comparing methods.
Requires parallel data format, as provided by the BUILD macro supplied with the MyMCMC macro.
Requires previous fixing of any intermediate missing data.
The following files are contained in a zip file downloaded as MISTEP20220203.zip.
1. MIStep16.sas. The SAS code for the macro. The header includes a detailed description and development history. Main update is BY= now allows different numbers of factor levels in different BY groups. This is useful for follow-on models where data can be sparse.
2. MIAnalyze07.sas. A SAS macro that is used in the examples to run an ANCOVA analysis and summarise across imputed data sets using Rubin’s rules.
4. MIStep_demo13.sas and mistep_demo13-results.pdf. Example of using the macros on the DIA working group example data set. This includes standard MAR, J2R, CIR, Causal with K0=0.5 and also with K1=0.5 as examples.
5. GSKTest5.sas and GSKTest5_1-results.html. Program code and output for the same examples using the GSK 5 macros showing similar values based on 1000 imputations in both cases.
6. MISTEPWrap04.sas. The SAS code for wrapper macro. Header includes details.
7. MIStepWrap_demo13.sas and MIStepWrap_demo13-results.pdf. The same examples as MIStep_demo13 which can be run more easily using the wrapper. Results vary due to different seeds.
This page was written by James Roger (firstname.lastname@example.org).
- Template code treatment policy estimand using SAS PROC MI and the MISTEP macro
Latest Update 5 May 2023
Added code using the BGLIMM procedure to do the same MI directly from within SAS.
A treatment policy estimand (ICH E9R1) requires that "the value for the variable of interest is used regardless of whether or not the intercurrent event occurs. For example, when specifying how to address use of additional medication as an intercurrent event, the values of the variable of interest are used whether or not the patient takes additional medication." When data are missing due to the patient withdrawing completely from the study it is necessary to impute the missed values prior to analysis.
The template code in these two packages implements four possible approaches:
Approach 1: assumes outcomes are missing at random (MAR), conditional on baseline, randomized treatment group and prior observed values, ignoring whether subject is ON-/OFF-treatment.
Approach 2: assumes outcomes are MAR, conditional on the same factors as Approach 1, but including also on-/off-treatment status at each visit, as suggested by Guizzaro et al. (2021).
Approach 2a: Same as 2 but with on-treatment period not stratified by outcome pattern.
Approach 3: assumes outcomes are MAR conditioning on baseline and randomized treatment group, but conditioning on the residuals from imputations of prior visits, imputing each visit k taking direct account of on/off treatment status only at visit k; and having as an assumption that the correlation between outcome at visit k and residuals for any given prior visit does not differ by treatment group or by on/off treatment status at prior visits; this approach uses ideas described by Roger under “Stepwise imputation for marginal model based on previous residuals” at https://www.lshtm.ac.uk/research/centres-projects-groups/missing-data#d…;
Approach 4: Imputes missing values using only the off-treatment retrieved dropouts; assumes change from time of discontinuation of study treatment to end of scheduled follow-up is MAR in subjects who discontinue study treatment, conditioning only on baseline, last on-treatment outcome, and time on study; with time on study considered continuous and the relationship between time on study and outcome considered linear.
Approaches 2-4 decrease in the degrees of freedom required to model the missing values, but increase in the assumptions made about the missing outcomes; since retrieved-dropout data may be sparse, it may be desirable or necessary to identify one of Approaches 2-4 as feasible for an analysis, bearing in mind the assumptions made. The template code in this package is designed to facilitate the implementation of all the above Approaches and to be adaptable for use in a range of trials.
Template code for Approach 1 is included as representing a base case – the assumptions of Approach 1 in the context of a treatment policy estimand have been questioned (Guizzaro, 2021, p. 122).
To illustrate the issues of sparsity, two simulated datasets are included, one with, and one without, sparsity problems for Approaches 2-3.
Two independent sets of template code are provided for Approaches 2 and 3, one using the SAS MISTEP macro, and one using open SAS code.
The following files are contained in a zip file downloaded as Trt_policy_template.zip
1. Document describing the package (i.e., this document): Template code implementing treatment policy using SAS PROC MI and MISTEP macro.docx.
2. Treatment policy implementation using SAS PROC MI.sas and its .log and .lst: template code for Approaches 1-4 using SAS PROC MI.
3. Treatment policy implementation using MISTEP.sas and its .log and .lst: template code for Approaches 2-3 using the MISTEP macro.
4. *1000 imps.lst: outputs from 2) and 3) using 10000 imputations, to allow comparison of the results of the SAS PROC MI template code with those of the MISTEP SAS macro.
5. Mistep12.sas: a copy of the MISTEP SAS macro, also available as a download from the link above.
6. example_nonconverging.sas7bdat: a SAS dataset with outcomes available for selected dropouts, for which Approaches 2-3 fail due to sparsity.
7. example_converging.sas7bdat: a SAS dataset with outcomes available for selected dropouts, with sufficient data to implement all Approaches 1-4.
The following files are contained in a zip file downloaded as BGLIMM_Templates20230505.zip
- SAS code using BGLIMM in file Treatment policy implementation using SAS BGLIMM_3.sas.
- Results in file Treatment policy implementation using SAS BGLIMM_3-results.html.
- Copy of macro to build imputation file BGI2.sas (look for a potential later version elsewhere in this DIA section).
- The two source data sets.
- Vansteelandt et al’s 2012 doubly robust method
Latest update 12 February 2013
Quick summary and Downloads
The Doubly Robust zip file contains SAS macros implementing the doubly robust approach described in:
Vansteelandt S, Carpenter J, Kenward M (2012), Analysis of incomplete data using inverse probability weighting and doubly robust estimators, Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 6, 37-48.
Doubly Robust estimation implements the missing-at-random assumption by using inverse probability-of-being-observed weighting, but augmented with a model-expected value for the missing outcome. The ingenious weighted combination of these two elements has the property that the estimate of, say, treatment effect will be consistent even if one of the models – a) the model for the probability of missingness or b) the model for the value of the missing outcome – is wrong. However, the consistency property does not hold if both a) and b) are wrong. The analysis model that estimates e.g., the treatment effect, is assumed to be correct.
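The combination of the two elements can be illustrated with the textbook augmented inverse-probability-weighted estimator of a mean outcome. This Python sketch is not taken from the SAS macros: the function name and argument layout are hypothetical, and the probabilities of being observed and the model-expected outcomes are assumed to have been estimated beforehand from models a) and b).

```python
def aipw_mean(outcomes, prob_observed, predicted):
    """Augmented inverse-probability-weighted estimate of a mean outcome.

    outcomes:      observed values, with None where the outcome is missing
    prob_observed: fitted probability that each outcome is observed (model a)
    predicted:     model-expected outcome for each subject (model b)
    """
    n = len(outcomes)
    if n == 0 or not (n == len(prob_observed) == len(predicted)):
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for y, pi, m in zip(outcomes, prob_observed, predicted):
        if not 0.0 < pi <= 1.0:
            raise ValueError("probabilities of being observed must lie in (0, 1]")
        r = 0 if y is None else 1
        weighted = y / pi if r else 0.0       # inverse probability weighting term
        augmentation = (r - pi) / pi * m      # correction using the outcome model
        total += weighted - augmentation
    return total / n
```

If the outcome model is right, the augmentation term removes the error from wrong weights on average; if the weights are right, the augmentation term has mean zero whatever the outcome model. That is the double robustness property described above.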
There is also an accompanying users guidebook explaining the macros and their use.
The macros were written by Belinda Hernández of Quintiles, with review by Michael O’Kelly.
Presentations, Manuscripts and Training Materials
- A Practical Guide to Preventing and Treating Missing Data (2013)
Material presented by Craig Mallinckrodt and Russ Wolfinger in 2013 at the FDA workshop on missing data.
The pdf file is available in zip format at training_2013FDA_Workshop.
- Example Language for Statistical Analysis Plans and Protocols Describing Some Frequently-Used Methods for Handling Missing Data
- When and how to use reference based imputation for missing data (2013)
Slides presented by Michael O’Kelly & James Roger in March 2013
The slides can be downloaded at training_ref_imputation.
- Implementing Estimands in Trials: Detailed Clinical Objectives – James Bell, 3rd June 2019
Powerpoint slides from presentation by James Bell ‘Implementing Estimands in Trials: Detailed Clinical Objectives’, 3rd June 2019, PSI conference.
- Three papers written by members of this DIA working group on the practical usage of Estimands
Choosing Estimands in Clinical Trials: Putting the ICH E9(R1) Into Practice
Defining Efficacy Estimands in Clinical Trials: Examples Illustrating ICH E9(R1) Guidelines
Aligning Estimators With Estimands in Clinical Trials: Putting the ICH E9(R1) Guidelines Into Practice
- Approaches for Treatment Policy
This section holds material related to the estimation of treatment policy estimand using off-treatment data. Also see SAS code in the section "Template code treatment policy estimand using SAS PROC MI and the MISTEP macro" in "Imputation Based Approaches".
- Slides by Michael O'Kelly and Sylvia Li (2021) Approaches for implementing the treatment policy estimand when subjects discontinue the study early: Assessing sparsity, bias and variability. Download as Trt Pol implementation DIA SWG 2023-04-13 SL MOK.pptx
- PSI conference poster by James Roger (2017) Joint modelling of On-treatment and Off-treatment data. Download as JamesRogerPSI2017Poster.pdf
Example data sets
- Example data set from an antidepressant clinical trial
This data set is one of only a few publicly available data sets that can be used to demonstrate methods for handling missing data where a continuous outcome is measured repeatedly.
Original data are from an antidepressant clinical trial with four treatments; two doses of an experimental medication, a positive control, and placebo.
Hamilton 17-item rating scale for depression (HAMD17) was observed at baseline and weeks 1, 2, 4, 6, and 8.
To mask the real data Week-8 observations were removed.
Two arms were created; the original placebo arm and a “drug arm” created by randomly selecting patients from the three non-placebo arms.
[Original data: Goldstein, Lu, Detke, Wiltse, Mallinckrodt, Demitrack. Duloxetine in the treatment of depression: a double-blind placebo-controlled comparison with paroxetine. J Clin Psychopharmacol 2004;24: 389-399.]
The data set can be downloaded from MBSW2011_example.
Name of the SAS data set is “Chapter15 example.sas7bdat”.
- Example datasets with low and high dropout
These two data sets are made publicly available so that they can be used to demonstrate methods for handling missing data where a continuous outcome is measured repeatedly. The purpose is to contrast similar data with a low dropout rate and with a high dropout rate. For full details and further explanation see Mallinckrodt et al (1) written by the DIA working party on Missing data.
The data sets are somewhat contrived to avoid implications for marketed drugs. Nevertheless, the key features of the original data were preserved. The original data were from 2 nearly identically designed antidepressant clinical trials that were originally reported by Goldstein et al (2) and Detke et al. (3). Each trial had 4 treatment arms with approximately 90 patients each that included 2 doses of an experimental medication (subsequently granted marketing authorizations in most major jurisdictions), an approved medication, and placebo. Assessments on the Hamilton 17-item rating scale for depression (HAMD17) were taken at baseline and weeks 1, 2, 4, 6, and 8 in each trial. All patients from the original placebo arm were included along with a contrived drug arm that was created by randomly selecting patients from the nonplacebo arms. Random selection continued until 100 drug-treated patients were selected. In addition to all the original placebo-treated patients, additional placebo-treated patients were randomly re-selected so that there were also 100 patients in the contrived placebo arms. For these re-selected placebo-treated patients, a new patient identification number was assigned and outcomes were adjusted to create new observations by adding a randomly generated value to each patient’s observations.
These trials are referred to as the low and high dropout data sets. In the high dropout data set, completion rates were 70% for drug and 60% for placebo (Table 2). In the low dropout data set, completion rates were 92% in both the drug and placebo arms. The dropout rates in the contrived data sets closely mirrored those in the corresponding original studies. The design difference that may explain the difference in dropout rates between these two otherwise similar trials was that the low dropout data set came from a study conducted in Eastern Europe that included a 6-month extension treatment period after the 8-week acute treatment phase and used titration dosing. The high dropout data set came from a study conducted in the US that did not have the extension treatment period and used fixed dosing.
1. Mallinckrodt C, Roger J, Chuang-Stein C, Molenberghs G, O’Kelly M, Ratitch B, Janssens M, Bunouf P. Recent Developments in the Prevention and Treatment of Missing Data. Therapeutic Innovation & Regulatory Science 2014; 48: 68.
2. Goldstein DJ, Lu Y, Detke MJ, Wiltse C, Mallinckrodt C, Demitrack MA. Duloxetine in the treatment of depression: a double-blind placebo-controlled comparison with paroxetine. J Clin Psychopharmacol. 2004;24:389-399.
3. Detke MJ, Wiltse CG, Mallinckrodt CH, McNamara RK, Demitrack MA, Bitter I. Duloxetine in the acute and long-term treatment of major depressive disorder: a placebo- and paroxetine-controlled trial. Eur Neuropsychopharmacol. 2004;14(6):457-470.
The datasets can be downloaded from high_low_datasets.
This page was written by James Roger (email@example.com).
In the presence of a multilevel structure in a dataset, for example children (level 1) clustered in schools (level 2), it is important to use sound methods to handle this structure. If our dataset is partially observed, we need to use multilevel methods not only for our primary analysis but also to deal with the missing data. This is because we need to ensure compatibility of the imputation and analysis models. Imagine we wanted to perform Multiple Imputation (MI) on a multilevel dataset. We answer the following questions:
- What are the possible MI methods?
First, there are three simple ad-hoc methods which, as we will see, all suffer from important issues.
- Single-level imputation: impute as single level data, ignoring clustering
- Fixed-effect imputation: Impute with cluster indicator as fixed effect
- Stratified imputation: Impute separately within each cluster.
Any specific imputation strategy might be used with each of these methods, for example either of the two most commonly used parametric imputation methods, i.e. Fully Conditional Specification (FCS) or Joint Modelling (JM).
There are then methods that specifically use multilevel imputation models:
- Multivariate normal JM imputation: the imputation model is a joint multivariate normal model for all partially observed variables, with random effects for all outcomes of the imputation model. A full MCMC sampler, e.g. Gibbs, is used to generate the imputations;
- Latent normal JM imputation: as above, but binary/categorical variables are handled through latent normal variables. Again, this makes use of MCMC sampling, possibly with Metropolis-Hastings steps;
- Heteroscedastic latent normal JM imputation: as above, but allowing for heteroscedasticity, to better reflect heterogeneity across clusters.
- One-stage FCS imputation: this uses FCS, so it defines multiple univariate models for partially observed variables given all other variables rather than a single joint model for all variables. It mimics one-stage meta-analyses, in that it fits each univariate model on the whole dataset;
- Two-stage FCS imputation: same as before, but mimicking two-stage meta-analyses, so that models are fitted first within clusters, and then meta-analysed. This allows for heteroscedasticity;
- Substantive-Model-Compatible MI: this method imputes compatibly with a specific analysis model, which needs to be known before imputation is performed. It allows for imputation compatible with interactions and non-linearities. There are different SMC-MI methods depending on the way the covariates are imputed, based on FCS, JM or sequential imputation. This can also be used within a fully Bayesian framework.
- How do we choose between them?
The answer to this question depends on the target multilevel analysis model we aim to use:
Case #1: Random intercept analysis model.
It is a bad idea to use simple single level MI, ignoring clustering.
Imputing separately by cluster, or with cluster as a fixed effect, is generally fine. A disadvantage of stratified imputation is that it loses efficiency. Neither method can be used with level 1 systematically missing data (missing for a whole cluster) or with level 2 missing data (data related to the clustering level, e.g. school).
Any multilevel MI method is OK in these settings.
Case #2: Random intercept and slope model with fully observed random slope variable.
If there is a random slope in the analysis model, it needs to be included in the imputation model as well. We can do this without any issue if the random slope variable is fully observed.
Case #3: Random intercept and slope model with missing data in random slope variable.
If the random slope variable is partially observed, we need to include it as an outcome in the imputation model. A simple homoscedastic model is not sufficient; heteroscedastic imputation models are better. These have been implemented within both the FCS and the Joint Modelling frameworks.
However, in theory only substantive model compatible imputation (SMC-MI) can handle missing data compatibly with the analysis model in this situation, and hence it is the only method expected to lead to no bias. The only disadvantage is that we need to know the exact formulation of the analysis model before we impute missing data.
Case #4: Random intercept and slope model with interactions/non-linearities.
Again, SMC-MI is the only method that is expected to lead to unbiased estimates, as long as we know the exact formulation of the substantive model in advance of the imputation.
- What software can we use?
Several software packages are available for multilevel MI. A (possibly not exhaustive) list includes:
- Pan: a package for imputation of clustered data using the multivariate normal model. Does not allow for the imputation of categorical data and only allows for the inclusion of fully observed level 2 variables;
- Jomo: we created and actively maintain this R package for multilevel joint modelling imputation, based on the latent normal algorithm. The package allows the imputation of a mix of continuous and categorical variables at two levels.
It is possible to allow for heteroscedasticity in the imputation model, and also to perform substantive-model-compatible imputation.
This tutorial paper in the R Journal explains how to use the package. A paper on the use of the SMC-MI functions will follow.
- Mitml: This package provides different interfaces to both pan and jomo, and lots of tools to analyse imputed data.
- Mice: This is perhaps the most popular MI package. It uses the FCS method and includes a few functions for multilevel imputation, using the one-stage approach;
- Micemd: This package implements the two-stage FCS approach to imputation and uses a heteroscedastic imputation model. It allows for the imputation of level 1 data only, either binary or continuous;
- Mdmb: This package allows for imputation under the SMC-MI framework, in particular using sequential imputation.
- jointAI: though technically not a MI package, as it is intended to be used to perform fully Bayesian analyses, this package fits virtually the same MCMC that could be fitted to impute missing data in a SMC-MI algorithm.
- REALCOM: This standalone package performs MI with the same latent normal algorithm as jomo, although it does not allow for heteroscedastic imputation models nor substantive model compatible imputation;
- BLIMP: This is a very flexible package that can be used to perform either standard FCS multilevel imputation or SMC-MI. It also allows for fully Bayesian analyses.
- What work have we done in our group?
In our research group, we have done a lot of work in the development of multilevel MI methods. As mentioned, first we released the jomo R package, which has now been extensively used in a number of research papers and has been downloaded more than a million times since 2015. It implements Joint Modelling imputation via the latent normal model, and it allows for the imputation of a mix of binary/categorical and continuous data at two levels of hierarchy. It also allows for imputation via a heteroscedastic model and for substantive model compatible imputation. At the moment, models supported for SMC-MI are lm, lmer, glm, glmer, coxph, clmm and polr. We recently published a Tutorial paper in the R Journal, where we explain how to use the standard imputation functions. A second paper on the use of SMC-MI functions will follow.
The original motivation for developing our software came from research on how to best handle missing data in Individual Patient Data Meta-Analyses (IPD-MA). In (Quartagno and Carpenter, 2016), we showed how our multilevel imputation strategy works as well as stratified imputation for a range of scenarios, and better in others (e.g. when there are systematically missing variables). We found that, when the assumptions on the distribution followed by the covariance matrix of the imputation model were not met, this led to a small loss of efficiency, but no clear bias was identified.
In (Quartagno and Carpenter, 2019), we showed how using the latent normal model can lead to good results when some of the partially observed variables are binary/categorical. Furthermore, in (Quartagno, Carpenter and Goldstein, 2019) we found that multilevel MI can be used to handle missing data in weighted survey analyses as well.
We have participated in two international comparisons of methods for handling missing multilevel data. In particular, in (Audigier et al, 2018) our heteroscedastic approach was compared to a number of competitors, giving good results, particularly with binary variables and large number of clusters. More recently, (Huque et al 2020) compared the SMC-MI functions in jomo to other methods, again finding good results, even for longitudinal data analyses, i.e. with small cluster sizes.
Finally, in (Quartagno and Carpenter, 2018) we introduced the SMC-MI method for Joint Modelling imputation, and we plan to submit a wider discussion of this method to an international journal soon.
Any method of statistical analysis will make untestable assumptions about the distribution of missing data. If the wrong assumptions are made, then misleading conclusions may be drawn. Unfortunately, this is a critical problem whenever missing data occur, and it cannot be circumvented.
When there are missing data, it is therefore important that the primary analysis be conducted under the most plausible assumptions for the missing data. Sensitivity analysis, which addresses the same question as the primary analysis but under a range of different credible assumptions for the missing data, should then be undertaken. This will reveal how robust the results are to different missing data assumptions.
What methods can be used for sensitivity analysis?
There are three broad classes of missing data assumptions introduced by Rubin (1976): Missing-completely-at-random (MCAR), Missing-at-random (MAR) and Missing-not-at-random (MNAR). Methods for analysis under MCAR and MAR are well developed. For example, a complete case analysis will provide valid inference under MCAR, and any likelihood-based method or multiple imputation analysis will provide valid inference under MAR (see Molenberghs et al, 2014 and Carpenter and Kenward, 2007 for further guidance on statistical analysis under MCAR/MAR). Procedures for sensitivity analysis therefore typically focus on approaches to MNAR analysis, where the response data and missingness mechanism must be jointly modelled.
There are two principal ways this can be done. First, a model for the missingness status given the response data can be specified, together with a marginal model for the response data. For example, a logistic regression model could be used to model the probability of the response being missing, with a parameter that governs how this depends on the unobserved outcome, fitted alongside a model for the response data. This is referred to as the selection model.
Alternatively, the conditional distributions of the response data given the observed data for each missing data pattern can be specified, together with a marginal model for the missingness process. For example, a multivariate normal model could be specified for the unobserved data, with a mean that is higher by a certain proportion than that of the observed data. This factorization is the pattern-mixture model.
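In symbols, the two approaches are the two ways of factorizing the same joint distribution. The notation is introduced here for illustration only: Y denotes the response data and R the missingness indicators, with covariates and parameters suppressed.

```latex
\begin{align*}
f(Y, R) &= f(R \mid Y)\, f(Y) && \text{(selection model)} \\
        &= f(Y \mid R)\, f(R) && \text{(pattern-mixture model)}
\end{align*}
```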
Selection or pattern-mixture models can be fitted using maximum likelihood, or within a Bayesian framework. There will be numerous ways in which a pattern mixture model or selection model can be fully specified; however, many specifications will be practically implausible. So where should you start?
A commonly advocated, principled way to perform MNAR sensitivity analysis is to explore departures from the joint data distribution implied by MAR. For example, in the pattern-mixture framework, starting with specification of the conditional data distribution implied by MAR, one can readily perform sensitivity analysis exploring departures from MAR by shifting the parameters of the distribution, for example by specifying a higher or lower expected outcome value for unobserved data. After specifying separate response models for each pattern, inference can be obtained using maximum likelihood, or within a Bayesian framework.
Alternatively, Multiple Imputation provides an accessible solution for conducting sensitivity analysis within the pattern-mixture framework, termed Controlled Multiple Imputation.
What is controlled multiple imputation?
Controlled Multiple Imputation (MI) procedures combine pattern‐mixture modelling with MI and provide a practical, accessible platform for sensitivity analysis. The standard MI procedure imputes missing data using the conditional distributions of partially observed response data given the observed response data under the assumption of MAR. Within the pattern mixture framework, the conditional distributions implied by MAR for each missing data pattern can be modified as appropriate. The modified conditional distributions can then be used within the MI algorithm, in place of the MAR distribution to impute under MNAR. Multiple imputed data sets are obtained. The imputed data sets are each analysed using the primary substantive analysis model, which would have been used in the absence of any missing data (as done for standard MAR MI). Results across imputed data sets can then be combined using Rubin’s rules for inference.
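The final combining step can be sketched in Python as follows. This is a minimal illustration of Rubin's rules for a single scalar parameter; the function name and the returned dictionary keys are choices made here, and the degrees of freedom are the classical large-sample version without small-sample adjustment.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool per-imputation estimates and squared standard errors with Rubin's rules."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")
    q_bar = mean(estimates)            # pooled point estimate
    within = mean(variances)           # average within-imputation variance
    between = variance(estimates)      # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    if between == 0:
        df = math.inf
    else:
        df = (m - 1) * (1 + within / ((1 + 1 / m) * between)) ** 2
    return {"estimate": q_bar, "se": math.sqrt(total), "df": df}


# Hypothetical results from three imputed data sets.
pooled = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The same combining step applies whether the imputations were drawn under MAR or under a modified, controlled distribution; only the imputation step differs.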
When MI is performed in such a manner, this is termed ‘Controlled Multiple Imputation’, as the analyst has direct control over the imputation distribution.
Controlled MI procedures include δ‐based methods, which enable one to explore the impact of a worse or better response than that predicted based on the observed data distribution. Starting with the data distribution implied by MAR, a numerical parameter (δ) is specified which shifts the proposed distribution of the unobserved data away from MAR. For example, for a continuous outcome, data can be imputed assuming a mean response which is lower/higher than that predicted based on the observed data. For a binary or time-to-event outcome δ can respectively represent the difference in the (log) odds, or hazard, of response between the observed and unobserved cases.
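For a binary outcome, the δ shift on the log-odds scale can be sketched as below. This is an illustration under simplifying assumptions: the function names are hypothetical, and the MAR-based response probabilities for the missing cases are assumed to be available already, whereas a proper MI procedure would also draw the imputation model parameters afresh for each imputation.

```python
import math
import random


def shift_probability(p_mar, delta):
    """Shift a MAR-based response probability by delta on the log-odds scale."""
    if not 0.0 < p_mar < 1.0:
        raise ValueError("p_mar must lie strictly between 0 and 1")
    log_odds = math.log(p_mar / (1.0 - p_mar)) + delta
    return 1.0 / (1.0 + math.exp(-log_odds))


def impute_binary(p_mar_values, delta, rng):
    """Draw one imputed 0/1 outcome per missing case from the shifted probabilities."""
    return [1 if rng.random() < shift_probability(p, delta) else 0
            for p in p_mar_values]


# Hypothetical MAR probabilities for three missing cases; delta < 0 means a
# lower chance of response than MAR would predict.
rng = random.Random(2024)
imputed = impute_binary([0.2, 0.5, 0.8], delta=-0.5, rng=rng)
```

Setting delta to zero recovers imputation under MAR, which makes the MAR analysis a natural starting point for a range of δ values.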
An alternative example of controlled MI is reference‐based MI, which enables one to explore the impact of individuals with missing data behaving like a specified reference group in the observed data. The difference between the MAR and MNAR distribution is described entirely using information within the data set, by reference to other groups in the data. The parameters of the observed data distribution, estimated assuming MAR, are recombined across groups to form contextually relevant MNAR distributions for the unobserved data. For example, in a two-group placebo-controlled trial, data for individuals with missing data in the active arm can be imputed following the behaviour in the placebo group (Carpenter et al, 2013).
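As an illustration of how arm-level parameters are recombined, the following Python sketch builds the mean profile used for a withdrawn active-arm patient under two commonly described reference-based options, jump to reference (J2R) and copy increments in reference (CIR). The function names are hypothetical and only the mean vector is shown; a full implementation also constructs a matching covariance matrix and draws the missing values conditionally on the observed ones.

```python
def _check(active_means, reference_means, last_visit):
    """Validate inputs; last_visit is the number of visits observed before withdrawal."""
    if len(active_means) != len(reference_means):
        raise ValueError("mean profiles must cover the same visits")
    if not 1 <= last_visit <= len(active_means):
        raise ValueError("last_visit must be between 1 and the number of visits")


def j2r_mean(active_means, reference_means, last_visit):
    """Jump to reference: active-arm means up to withdrawal, reference means after."""
    _check(active_means, reference_means, last_visit)
    return list(active_means[:last_visit]) + list(reference_means[last_visit:])


def cir_mean(active_means, reference_means, last_visit):
    """Copy increments in reference: track the reference arm's changes after withdrawal."""
    _check(active_means, reference_means, last_visit)
    anchor = active_means[last_visit - 1]
    ref_anchor = reference_means[last_visit - 1]
    later = [anchor + (r - ref_anchor) for r in reference_means[last_visit:]]
    return list(active_means[:last_visit]) + later
```

With active means 1, 2, 3, 4 and reference means 0, 1, 1, 2 over four visits, a patient withdrawing after visit 2 gets the profile 1, 2, 1, 2 under J2R and 1, 2, 2, 3 under CIR.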
We have written a practical tutorial on sensitivity analysis for clinical trials with missing continuous outcome data using controlled multiple imputation with worked examples, which is available here. Delta- and reference-based multiple imputation methods can also be used with binary (Leacy et al, 2017), ordinal (Tang 2018), count (Keene et al, 2014) and survival data (Jackson et al, 2014 and Atkinson et al, 2019). Links to software for implementation are provided below.
What is Information Anchored sensitivity analysis?
When a data set includes missing data, information is naturally lost in the analysis relative to when all data are observed. This means that the precision of estimates is reduced (i.e. larger standard deviations and standard errors, and wider confidence intervals), reflecting the greater uncertainty with a reduced data set. This will be the case in any primary analysis. It is important to be aware that a sensitivity analysis can change the statistical information about an estimate.
An information-anchored sensitivity analysis is defined as a sensitivity analysis which varies the assumption about the missing data while holding the proportion of information lost due to missing data constant across the primary and sensitivity analyses (Cro et al, 2019).
We regard information anchored inference as desirable for sensitivity analysis. It ensures there is no loss or gain of information due to missing data in the sensitivity analysis relative to the primary analysis.
This is a particularly desirable property for sensitivity analysis within the context of clinical trial analysis. Regulators can be reassured the sensitivity analysis is not artificially injecting information, while trialists can be reassured that the sensitivity analysis is not discarding any of the valuable obtained data.
In clinical trials it is often most appropriate to conduct the primary analysis under MAR. Sensitivity analysis exploring departures from MAR will then be required. Controlled imputation (described above) provides a practical, accessible route for doing so. We have shown elsewhere (Cro et al, 2019) that Rubin's MI combining rules provide information-anchored inference, and hence provide an appropriate estimate of variance for the treatment effect when used following controlled multiple imputation (including both delta- and reference-based MI). Rubin’s rules preserve the loss of information seen under MAR in controlled MI sensitivity analysis. This is why we recommend the use of Rubin's variance estimator within delta‐ and reference‐based sensitivity analyses.
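For readers unfamiliar with Rubin's combining rules, the following Python sketch pools a treatment effect estimated in each of m imputed data sets. The total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The function name `rubin_pool` and the numbers are illustrative only.

```python
import numpy as np

def rubin_pool(estimates, variances):
    """Combine per-imputation estimates and variances using Rubin's rules."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size
    pooled_estimate = q.mean()       # average of the m point estimates
    within = u.mean()                # average within-imputation variance
    between = q.var(ddof=1)          # between-imputation variance
    total_variance = within + (1 + 1 / m) * between
    return pooled_estimate, total_variance

# Made-up treatment effect estimates and squared standard errors from m = 3 imputations.
pooled_estimate, total_variance = rubin_pool([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The same function applies unchanged whether the imputed data sets were created under MAR, with a δ shift, or by reference-based imputation.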
What software can be used?
Several software packages are available for conducting controlled MI, making sensitivity analysis accessible. Options (not exhaustive) are summarised below. For δ-based imputation with a continuous outcome, standard multiple imputation commands (e.g. mi impute in Stata or proc mi in SAS) can be used to create imputed data sets, and the imputed values are then shifted by adding the required δ (Cro et al, 2020).
- The ‘five macros’ and miwithd: Performs reference-based MI for multivariate normal data following the general algorithm of Carpenter, Roger and Kenward (2013). Available on the DIA working group pages.
- NegBin_PMI: Performs reference-based MI for negative binomial discrete data following the methodology of Keene et al (2014). Available on the DIA working group pages.
- mimix: Performs reference-based multiple imputation for multivariate normal data following the general algorithm of Carpenter, Roger and Kenward (https://pubmed.ncbi.nlm.nih.gov/24138436/). Available at https://ideas.repec.org/c/boc/bocode/s457983.html
- mlmi: Implements a maximum likelihood MI version of reference-based imputation for repeatedly measured continuous endpoints. Available at https://github.com/jwb133/mlmi
- dejaVu: Performs reference-based MI for negative binomial discrete data following the methodology of Keene et al. Available on the DIA working group pages.
- InformativeCensoring: Performs multiple imputation for a time-to-event outcome under informative censoring using (i) a Cox model fitted to the observed data under the non-informative censoring assumption and a user-specified multiplier compared to the hazard implied by the non-informative censoring assumption or (ii) a Kaplan-Meier type imputation of censored time. Available on the DIA working group pages.
The Missing Data Topic Group is one of the 9 topic groups within the STRengthening Analytical Thinking for Observational Studies (STRATOS) Initiative. The objective of STRATOS is to provide accessible and accurate guidance in the design and analysis of observational studies. The guidance is intended for applied statisticians and other data analysts with varying levels of statistical education, experience and interests.
- Aims of the Missing Data Topic Group
Missing data are ubiquitous in observational studies, and the simple solution of restricting the analyses to the subset with complete records will often result in bias and loss of power. The seriousness of these issues for resulting inferences depends on both the mechanism causing the missing data and the form of the substantive question and associated model. The methodological literature on methods for the analysis of partially observed data has grown substantially over the last twenty years such that it may be hard for analysts to identify appropriate (but not unduly complex) methods for their setting. The aim of the missing data TG is to draw on the existing advice and the expertise of its members, to provide practical guidance which will lead to appropriate analysis in standard observational settings, while giving principles which can inform analysis plans for less common substantive models.
To achieve this aim, the topic group will describe a set of principles for the analysis of partially observed observational data, and illustrate their application in settings ranging from simple summaries of single variables, through regression models and models for hierarchical and longitudinal data, to models that adjust for time-varying confounding.
The specific aims of this TG are to:
- assist analysts in understanding the nature of the additional assumptions inherent in the analysis of partially observed data;
- describe, in a range of settings, the implications of these assumptions for analyses that restrict to the subset of complete records;
- detail the range of methods available for improving on a complete records analysis, including the EM and related algorithms, multiple imputation, inverse probability and doubly robust methods, and
- provide guidance on the utility and pitfalls of each approach, bearing in mind the importance of software availability for most applied researchers.
TG1 represents a collaboration of 10 experts in missing data:
- James Carpenter (co-chair), London School of Hygiene and Tropical Medicine, London, UK
- Katherine Lee (co-chair), Murdoch Children’s Research Institute and University of Melbourne, Melbourne, Australia
- Melanie Bell, University of Arizona, US
- Els Goetghebeur, Ghent University, Belgium
- Joseph Hogan, Brown University, US
- Rod Little, University of Michigan, US
- Andrea Rotnitzky, Harvard University, US
- Kate Tilling, University of Bristol, Bristol, UK
- Rosie Cornish, University of Bristol, Bristol, UK
- Rheanna Mainzer, Murdoch Children’s Research Institute, Melbourne, Australia
- Lee KJ, Tilling K, Cornish RP, Little RJA, Bell ML, Goetghebeur E, Hogan JW, and Carpenter JR on behalf of the STRATOS initiative. Framework for the Treatment And Reporting of Missing data in Observational Studies: The TARMOS framework. In press at Journal of Clinical Epidemiology
- Carpenter J and Lee KJ on behalf of STRATOS TG1. STRengthening the Analysis of Observational Studies (STRATOS): Introducing the Missing Data topic group (TG1). Biometric Bulletin, 2017. 34 (4); 11-13.
TG1 affiliated publications
- Hughes RA, Heron J, Sterne JAC, and Tilling K. Accounting for missing data in statistical analyses: multiple imputation is not always the answer. Int J Epidemiol. 2019 Aug 1;48(4):1294-1304. doi: 10.1093/ije/dyz032.
Papers currently under review
- Little RJ, Carpenter JR and Lee KJ. A comparison of three popular methods for handling missing data: complete case analysis, weighting and multiple imputation. Submitted to Sociological Methods and Research, 16 Dec 2020.
- “Framework for the Treatment And Reporting of Missing data in Observational Studies: The TARMOS framework” - Invited presentation by James Carpenter and Katherine Lee at the International Society for Clinical Biostatistics conference in August 2020.
- “Framework for the treatment and reporting of missing data in observational studies” - Poster presentation at the International Society for Clinical Biostatistics conference in July 2019.
If you wish to contact the topic group, please email Katherine Lee.
Multiple Imputation and its Application, 2nd edition 2023
James R Carpenter, Jonathan W Bartlett, Tim P Morris, Angela M Wood, Matteo Quartagno and Michael G. Kenward
Published by Wiley, ISBN 978-1119756088
- Book description & overview
An updated practical guide to datasets with missing data using multiple imputation
Collecting, analysing and drawing inferences from data is central to research in the medical and social sciences. Unfortunately, it is rarely possible to collect all the intended data. The literature on inference from the resulting incomplete data is now huge, and continues to grow both as methods are developed for large and complex data structures, and as increasing computer power and suitable software enable researchers to apply these methods.
This book focuses on a particular statistical method for analysing and drawing inferences from incomplete data, called Multiple Imputation (MI). MI is attractive because it is both practical and widely applicable. The authors' aim is to clarify the issues raised by missing data, describing the rationale for MI, the relationship between the various imputation models and associated algorithms, and its application to increasingly complex data structures.
The second edition includes:
- new chapters on
  - prognostic models
  - measurement error and misclassification
  - causal inference
  - using MI in practice
- exercises exploring both theoretical and practical aspects of MI, with solutions using both R and Stata (see below)
- updated chapter on MI with non-linear relationships, interactions and other derived variables
- expanded chapter on MI with survival data, including imputing missing covariates in Cox models and MI for case-cohort and nested case-control studies
- Exercise solutions
The exercise solutions are being finalised and will be made available here soon.