Study designs for impact evaluation

The impact evaluation theme uses a range of methods to evaluate the effects of complex policies and interventions on health and on the determinants of health.

Randomised controlled trials (RCTs) provide the strongest level of evidence, because random allocation to the intervention helps to ensure that the groups being compared are as similar as possible at the start of the study, reducing the risk of bias. In quasi-experimental studies, allocation of the intervention is not random, and it is harder to avoid biases. Observational studies, also known as non-experimental studies, observe naturally occurring phenomena without manipulation.

Randomised controlled trials (RCTs)

Randomised controlled trials (RCTs) are the gold standard for studying the impact of an intervention. There are many different types of trials, but all involve some way of randomly allocating people or units to receive either the novel intervention that we are evaluating, or the control/comparison condition. Randomisation aims to make the people or units allocated to each condition as similar as possible, so we can be more confident that any difference in outcomes is due to the intervention.

The Clinical Trials Unit (CTU) and the International Statistics and Epidemiology Group (ISEG) are distinguished groups at the London School of Hygiene & Tropical Medicine that specialise in the design, conduct, analysis and reporting of clinical trials.

Trials can have different types of designs, which are discussed below:

Individually randomised trials

Individually randomised trials are studies in which individuals are randomly assigned either to receive an intervention or to be in the control/comparison arm. Interventions can be vaccines, drugs or disease prevention tools (e.g. bednets), but can also be social (e.g. food baskets) or educational. In individually randomised designs, researchers usually assume that individuals in the trial arms are comparable, on average, in terms of both measured and unmeasured characteristics.

These studies also rely on other conditions and assumptions, such as blinding of participants, researchers or both; compliance with the assigned intervention; and the absence of interference, meaning that one participant’s intervention does not influence the outcomes of other participants.

Cluster randomised trials

Many public health evaluations assess interventions that are delivered to units larger than individuals, and use a cluster randomised trial design. These ‘clusters’ might be health system catchment areas, schools, geographic regions or even countries. In some cases, it is appropriate and possible to randomly allocate the intervention of interest at the level of these larger units, and so to use randomisation, the most efficient and reliable approach to reducing confounding.

The book Cluster Randomised Trials by Richard Hayes and Lawrence Moulton has become the leading methodological text in this area.

Stepped-wedge cluster-randomised controlled trials

Stepped-wedge cluster-randomised controlled trials (SWTs) are used in a wide range of areas of public health, as well as other areas of public policy such as education and international development. SWTs can be thought of as a modified crossover design as the clusters are in both trial arms at different times.

All clusters start in the control arm. The intervention is then introduced at regular intervals, either to one cluster at a time or to small groups of clusters, in a randomly allocated order, until all clusters are receiving the intervention.
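The rollout described above can be sketched in a few lines of Python. This is a minimal illustration rather than a trial-ready randomisation procedure: the function name stepped_wedge_schedule, the cluster labels and the choice of two clusters per step are all assumptions made for the example.

```python
import random

def stepped_wedge_schedule(clusters, clusters_per_step, seed=None):
    """Return {cluster: [0 or 1 for each period]} for a stepped-wedge rollout.

    Period 0 is an all-control baseline. At each later period the next
    group of clusters, taken in random order, switches to the intervention.
    """
    rng = random.Random(seed)
    order = list(clusters)
    rng.shuffle(order)  # randomly allocate the order of crossover
    groups = [order[i:i + clusters_per_step]
              for i in range(0, len(order), clusters_per_step)]
    n_periods = len(groups) + 1  # baseline plus one period per step
    schedule = {}
    for step, group in enumerate(groups, start=1):
        for cluster in group:
            # 0 = control, 1 = intervention; the switch happens at period `step`
            schedule[cluster] = [int(period >= step) for period in range(n_periods)]
    return schedule

# Hypothetical example: six clusters, two crossing over at each step.
schedule = stepped_wedge_schedule(["A", "B", "C", "D", "E", "F"],
                                  clusters_per_step=2, seed=1)
for cluster in sorted(schedule):
    print(cluster, schedule[cluster])
```

Each row of the printed schedule starts at 0 and ends at 1, which reflects the one-directional crossover that distinguishes this design from a conventional crossover trial.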

This paper from Copas et al (2015) provides a useful framework for designing SWTs.

Realist trials

Realist trials aim to inform mid-level programme theory by evaluating not only if the intervention works but how it works, for whom, and under what conditions.

These trials should explore intervention components separately and together. For example, trials can use multi-arm or factorial designs, be conducted across different populations and contexts, and use both quantitative and qualitative data to explore the pathways and mechanisms of change in each.

The following study by Chris Bonell et al (2012) conceptualises the idea of a realist trial.

Quasi-experiments (or natural experiments) and other non-randomised designs

For many health promotion and public health interventions, randomisation may not be possible due to feasibility or ethical constraints, and could even be counter-productive.

Quasi-experimental designs provide alternative methods to evaluate interventions when randomisation is not possible or when data has already been collected, and the research relies on observational data. These methods include:

Difference-in-differences

This method is used to evaluate the impact of interventions that are non-randomly allocated to a subset of potentially eligible units (e.g. individuals, families, or places). The change in the outcomes in the unit(s) that got the intervention (the ‘difference’) is compared to the change in the outcomes in the unit(s) that did not get the intervention: hence the difference-in-the-differences.

This approach requires data from before and after the delivery of the intervention, in units that do and do not get the intervention. The effect is often estimated as the interaction between the change over time and the allocation group (i.e. whether or not a unit got the intervention) in a regression model.
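In the simplest case of two groups and two time periods, the interaction coefficient from the regression model equals the difference between the two before-and-after changes in group means. The sketch below illustrates that arithmetic. The outcome values are made up for illustration, and the function name difference_in_differences is an assumption of this example.

```python
from statistics import mean

# Hypothetical outcome data (made-up numbers for illustration only).
outcomes = {
    ("intervention", "before"): [10.0, 12.0, 11.0],
    ("intervention", "after"): [15.0, 17.0, 16.0],
    ("comparison", "before"): [8.0, 9.0, 10.0],
    ("comparison", "after"): [10.0, 11.0, 12.0],
}

def difference_in_differences(data):
    """Change in the intervention group minus change in the comparison group."""
    change_intervention = (mean(data[("intervention", "after")])
                           - mean(data[("intervention", "before")]))
    change_comparison = (mean(data[("comparison", "after")])
                         - mean(data[("comparison", "before")]))
    return change_intervention - change_comparison

did_estimate = difference_in_differences(outcomes)
print(did_estimate)
```

Here the intervention group changes by 5 and the comparison group by 2, so the estimate is 3, even though the two groups started at different levels.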

It is possible that the places that receive the intervention differ at baseline from the places that do not, in terms of the outcome of interest, and this method accounts for this possibility. However, the method assumes that, in the absence of the intervention, the outcome of interest would have changed over time at the same rate in the intervention and comparison places. This is often referred to as the ‘parallel trends assumption’.

Therefore, while the method can account for differences at baseline, it cannot account for a varying rate of change over time that is not due to the intervention. This assumption cannot be directly tested since it is an assumption about the counterfactual state: i.e. what would have happened without the intervention, which was not observed. Researchers can look at trends in other related outcomes, or trends in the outcome of interest before the intervention started, to try to find evidence that supports the assumption about the trends that they cannot actually see.

In the following paper, Krug et al (2024) investigated the effect of an intervention to increase the uptake of modern contraceptives among adolescent girls in three African countries using difference-in-differences, with several diagnostics used to assess the assumptions underlying the method.

Regression discontinuity design

Regression discontinuity is used to evaluate the impact of interventions when allocation is determined by a cut-off value on a numerical scale. For example, if counties with a population of over one million are allocated to receive an intervention, while those with a lower population are not, then regression discontinuity could be used.

Regression discontinuity compares outcomes in places that fall within a narrow range on either side of the cut-off value. For example, the comparison could include any place whose population is within, say, 50,000 of one million. The method assumes that places on either side of the cut-off value are very similar, and therefore that allocating an intervention based solely on an arbitrary cut-off value may be as good as random allocation. The method requires few additional assumptions and has been shown to be valid.

It is important to bear in mind that the effect is estimated only for places that fall within a range around the cut-off value, and therefore cannot be generalised to places that are markedly different, such as those with much smaller or much larger populations.
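The comparison within a narrow range around the cut-off can be sketched as follows. The county data are made up, and the function name local_difference is an assumption of this example. It uses a plain difference in means inside the window, which is the simplest version of the idea; in practice analysts commonly fit a regression on each side of the cut-off instead.

```python
from statistics import mean

CUTOFF = 1_000_000   # population threshold from the example in the text
BANDWIDTH = 50_000   # half-width of the comparison window

# Hypothetical (population, outcome) pairs; made-up numbers.
counties = [
    (940_000, 20.0),    # outside the window, ignored
    (960_000, 21.0),
    (985_000, 22.0),
    (1_010_000, 26.0),
    (1_040_000, 27.0),
    (1_300_000, 35.0),  # outside the window, ignored
]

def local_difference(units, cutoff, bandwidth):
    """Mean outcome just above the cut-off minus mean outcome just below it."""
    above = [y for x, y in units if cutoff < x <= cutoff + bandwidth]
    below = [y for x, y in units if cutoff - bandwidth <= x <= cutoff]
    return mean(above) - mean(below)

rd_estimate = local_difference(counties, CUTOFF, BANDWIDTH)
print(rd_estimate)
```

The two counties far from the cut-off play no part in the estimate, which is why the result applies only to places close to the threshold.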

Arcand et al (2010) investigated the effect of an HIV education intervention in Cameroon that was allocated according to the number of schools in the town.

Interrupted Time Series

The interrupted time series method is used to estimate the effect of interventions by examining the change in the trend of an outcome after an intervention is introduced. It can be used in a situation when comparison units are not available, as all eligible places receive the intervention.

This method requires data from many time points, both before and after the intervention is introduced, to allow modelling of what the trend in the outcome would have been had the intervention not been introduced. The model is compared to what actually occurs. Any change in the level of the outcome or in the rate of change over time, compared to the model, can be interpreted as the effect of the intervention.
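A minimal sketch of this logic is given below, assuming a simple linear trend and made-up monthly data. The helper fit_line and the variable names are assumptions of this example, and real analyses usually also need to handle seasonality and autocorrelation, which are ignored here.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope

# Hypothetical monthly outcome; the intervention starts at month 6.
START = 6
months = list(range(12))
observed = [50.0, 51.0, 52.0, 53.0, 54.0, 55.0,   # before the intervention
            50.0, 50.5, 51.0, 51.5, 52.0, 52.5]   # after the intervention

# Model the pre-intervention trend and project it forward as the counterfactual.
intercept, pre_slope = fit_line(months[:START], observed[:START])
counterfactual = [intercept + pre_slope * m for m in months[START:]]

# Compare what actually occurred with the projected trend.
gaps = [obs - cf for obs, cf in zip(observed[START:], counterfactual)]
level_change = gaps[0]  # immediate shift when the intervention starts
_, post_slope = fit_line(months[START:], observed[START:])
slope_change = post_slope - pre_slope  # change in the rate of change over time

print(level_change, slope_change)
```

In this made-up series the outcome drops by 6 units when the intervention starts and then rises 0.5 units per month more slowly than the projected trend.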

It is possible that changes in the trend in the outcome are due to factors other than the intervention. This can be addressed qualitatively, by investigating events or policy changes that took place at the same time. Alternatively, in a similar way to the approach used in the difference-in-differences method to assess the counterfactual rate of change over time, researchers may investigate ‘control trends’ in outcomes. This is done by examining other related outcomes that would be affected by most of the possible alternative explanations for the observed change in trend, but not by the intervention itself.

In the following paper from Rogers et al (2023), the authors investigate the effect of a soft drinks industry levy (SDIL) on household soft drink purchases in the UK. They considered that background trends in household purchases might have also affected the purchase of soft drinks, so they investigated the trends in other related outcomes: toiletries (shampoo, hair conditioner, and liquid soap).

The assumptions made were that the majority of the possible alternative explanations, such as policy changes or changes to purchasing, would have affected toiletries purchases to the same extent as soft drink purchases and that toiletries purchases would not be affected by the SDIL. Using this approach, they were able to show more convincingly that the SDIL brought about the change in the trend.

Synthetic controls

‘Synthetic controls’ is a method for evaluating the impact of an intervention using data from places that did not get the intervention, collected over time.

The method works by first looking at the trends in the outcome of interest before the intervention was introduced. The data from the various places that do not ultimately get the intervention are each given a weight, so that the weighted average of their data looks as much as possible like the trend in the place that will get the intervention. This weighted average is the ‘synthetic control’. The same weights, unchanged, are then applied to the data from the places without the intervention for the period after the intervention has been introduced, and this weighted average is compared to the actual trend in the place with the intervention. This comparison can be used to estimate the impact.
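The weighting step can be sketched as below. The data are made up, and the function names synthetic_control_weights and weighted_average are assumptions of this example. The weights are restricted to be non-negative and to sum to one, and are found here by a crude grid search, which is only a stand-in for the numerical optimisation used in real applications.

```python
from itertools import product

def synthetic_control_weights(treated_pre, donors_pre, n_steps=20):
    """Grid-search weights (non-negative, summing to 1) that minimise the
    squared gap between the treated unit and the weighted donors
    in the pre-intervention period."""
    n_donors = len(donors_pre)
    best_weights, best_loss = None, float("inf")
    for combo in product(range(n_steps + 1), repeat=n_donors - 1):
        last = n_steps - sum(combo)
        if last < 0:
            continue  # weights would sum to more than 1
        weights = [k / n_steps for k in (*combo, last)]
        loss = sum(
            (target - sum(w * donor[t] for w, donor in zip(weights, donors_pre))) ** 2
            for t, target in enumerate(treated_pre)
        )
        if loss < best_loss:
            best_weights, best_loss = weights, loss
    return best_weights

def weighted_average(weights, donors):
    """Weighted average of the donor series at each time point."""
    return [sum(w * donor[t] for w, donor in zip(weights, donors))
            for t in range(len(donors[0]))]

# Hypothetical data: one treated place and three untreated donor places.
treated_pre = [15.0, 16.0, 18.5]
donors_pre = [[10.0, 11.0, 12.0], [20.0, 21.0, 25.0], [5.0, 5.0, 5.0]]
treated_post = [17.0, 17.5]
donors_post = [[13.0, 14.0], [26.0, 27.0], [5.0, 5.0]]

weights = synthetic_control_weights(treated_pre, donors_pre)
synthetic_post = weighted_average(weights, donors_post)  # same weights, unchanged
effects = [obs - syn for obs, syn in zip(treated_post, synthetic_post)]
print(weights, effects)
```

The gap between the treated place and its synthetic control in the post-intervention period is the estimated impact at each time point.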

Similar to the other methods discussed earlier, researchers must assume that there is no other intervention or policy change happening in the places receiving the intervention at the same time. The method requires a lot of data, both from many places and over multiple time points. It does not use or require parameterised models, so inferential statistics are calculated using permutations rather than more traditional methods.

In the following paper, Abadie et al (2010) introduced the method and applied it to investigate the impact of a tobacco control policy change on cigarette consumption in California, by comparing the trend in California with a weighted average of the trends in other states in the USA.

Propensity score-based methods

Conducting evaluations of interventions using observational data requires careful control for confounding. In many cases, we can estimate the conditional probability that a unit receives the exposure or the intervention given its observed characteristics (known as the propensity score) and use these probabilities to balance the groups being compared on those characteristics.

Although using propensity scores is not a study design in itself, propensity score matching, weighting and regression adjustment are often used within different study designs to identify comparable units that did and did not receive the exposure/treatment.

For more details of its use and implementation, read this publication from Rosenbaum et al (2023).

For an example of how to implement and interpret propensity scores, you can read this paper from Pescarini et al (2022).
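As a rough illustration of propensity score weighting, the sketch below uses made-up data with a single binary characteristic. The function names and the records are assumptions of this example. The propensity score is estimated as the share of treated units at each value of the characteristic; with several characteristics it would more commonly be estimated with a model such as logistic regression.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (characteristic x, treated flag, outcome). Made-up data.
records = [
    (0, 1, 6.0), (0, 0, 4.0), (0, 0, 4.0), (0, 0, 4.0),
    (1, 1, 12.0), (1, 1, 12.0), (1, 1, 12.0), (1, 0, 10.0),
]

def estimate_propensity(rows):
    """Propensity score: share of treated units at each value of x."""
    counts = defaultdict(lambda: [0, 0])  # x -> [number treated, number in total]
    for x, treated, _ in rows:
        counts[x][0] += treated
        counts[x][1] += 1
    return {x: n_treated / n_total for x, (n_treated, n_total) in counts.items()}

def naive_effect(rows):
    """Unadjusted difference in mean outcomes, treated minus untreated."""
    return (mean(y for _, t, y in rows if t == 1)
            - mean(y for _, t, y in rows if t == 0))

def ipw_effect(rows, propensity):
    """Inverse-probability-weighted difference in mean outcomes."""
    n = len(rows)
    treated_mean = sum(t * y / propensity[x] for x, t, y in rows) / n
    control_mean = sum((1 - t) * y / (1 - propensity[x]) for x, t, y in rows) / n
    return treated_mean - control_mean

propensity = estimate_propensity(records)
naive = naive_effect(records)
weighted = ipw_effect(records, propensity)
print(propensity, naive, weighted)
```

In these made-up data the treated units are concentrated where outcomes are higher anyway, so the unadjusted difference (5) overstates the effect, while the weighted estimate recovers the difference of 2 that holds at each value of the characteristic.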