Psychologists and neuroscientists often plot error bars to show the variability in their obtained results. Very often, the error bar shows the standard error of the mean, or SEM. A quick rule of thumb when looking at graphs is to check whether the error bars of two conditions overlap - if they do, you can conclude that the means are NOT significantly different at p < 0.05. Note that the converse is not true, that is, non-overlapping error bars ≠ significant difference. For a significant difference, the means have to be approximately 3 SEM apart (assuming the two means have similar SEMs).
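To see where the "approximately 3 SEM" figure comes from, here is a minimal sketch. It assumes two independent groups with equal SEMs and uses a large-sample normal approximation instead of the t distribution (that simplification is my assumption, and the function name `min_gap_in_sem` is made up for illustration).

```python
from math import sqrt
from statistics import NormalDist


def min_gap_in_sem(alpha=0.05):
    """Smallest gap between two independent means, in SEM units, that reaches
    two-sided significance (equal SEMs, large-sample normal approximation)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    # With equal SEMs, the standard error of the difference is sqrt(2) * SEM.
    return z_crit * sqrt(2)


gap = min_gap_in_sem()
print(round(gap, 2))  # 2.77
```

Error bars that just touch put the means only 2 SEM apart, which is short of this threshold. With small samples the critical value from the t distribution is larger than 1.96, which pushes the required gap closer to 3 SEM.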
This rule, however, only applies to between-subject designs. What about within-subject (i.e. repeated measures) designs? Error bars computed the usual way would be too large, as they take into account both between-subject variance and within-subject variance (variability in the paired difference between the different conditions in the same subject). In my opinion, plotting between-subject SEM for within-subject tests (e.g. paired t-tests) is meaningless, because conflating between-subject and within-subject variance makes the error bars uninterpretable.
A solution was first proposed by Loftus and Masson (1994). Briefly, it involves running a repeated-measures ANOVA and taking the mean squared error (MSE) from it (i.e. the denominator of the appropriate F-test). SEMwithin can then be calculated as the square root of (MSE / N), where N is the number of subjects. This procedure works because a repeated-measures ANOVA removes between-subject variance first, so the MSE captures only within-subject variability.
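Here is a rough sketch of that calculation for a one-way repeated-measures design, computing the error term by hand instead of calling an ANOVA routine. The function name `loftus_masson_sem` and the scores are hypothetical, invented purely for illustration.

```python
from math import sqrt


def loftus_masson_sem(data):
    """data: one row per subject, one score per condition (one-way design)."""
    n = len(data)      # number of subjects
    m = len(data[0])   # number of conditions
    grand = sum(sum(row) for row in data) / (n * m)
    subj_means = [sum(row) / m for row in data]
    cond_means = [sum(row[j] for row in data) / n for j in range(m)]
    # Residual after removing subject and condition effects
    # (the subject x condition interaction, i.e. the error term of the F-test).
    ss_error = sum(
        (data[i][j] - subj_means[i] - cond_means[j] + grand) ** 2
        for i in range(n)
        for j in range(m)
    )
    mse = ss_error / ((n - 1) * (m - 1))
    return sqrt(mse / n)


# Made-up scores: 4 subjects x 3 conditions, with very different baselines.
scores = [
    [10, 12, 14],
    [20, 23, 23],
    [30, 31, 35],
    [40, 42, 44],
]
sem_within = loftus_masson_sem(scores)
print(round(sem_within, 3))  # 0.408
```

For comparison, the conventional SEM of the first condition in this made-up data set is about 6.5, because it is dominated by the baseline differences between subjects.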
If all you want to know is how to plot within-subject error bars, you can stop reading here. If you would like a more nuanced discussion, read on.
**************************
The Loftus and Masson method, however, yields a single SEM value for all conditions. The implicit assumption is that the variance of the pairwise differences between conditions is constant (i.e. the sphericity assumption). Of course, this assumption can be tested and corrected for (e.g., Greenhouse-Geisser or Huynh-Feldt). But wouldn't it be nice if we could just tell from the error bars (i.e. the same way we can tell if the homogeneity of variance is violated from the error bars of a between-subject design)? This prompted Cousineau (2005) to propose a different solution, which involves "normalizing" data from all subjects such that the between-subject variance is removed. This is done by subtracting, for each condition and subject (i.e. each "cell"), the subject average across conditions, and then adding the grand average of all cells. Hence all subjects will have the same average, but the within-subject effects are preserved. The SEM can then be calculated as per normal, and each condition will have its own error bar. Cousineau, unfortunately, did not take into account the fact that normalization (in particular, adding the grand average of all cells) induces positive correlations between the cells, so the error bars are a little too small compared to those calculated by the Loftus and Masson method. This discrepancy was identified by Morey (2005), who proposed a simple solution of multiplying the Cousineau variance (note: NOT SEM) by M/(M-1), where M = number of conditions, before calculating error bars**.
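The normalization and the correction can be sketched in a few lines. This is a minimal illustration under my own assumptions (a one-way design with one score per cell), and the function name `cousineau_morey_sem` and the scores are hypothetical.

```python
from math import sqrt
from statistics import variance


def cousineau_morey_sem(data):
    """data: one row per subject, one score per condition.
    Returns one SEM per condition."""
    n = len(data)      # number of subjects
    m = len(data[0])   # number of conditions (M in the text)
    grand = sum(sum(row) for row in data) / (n * m)
    # Cousineau: subtract each subject's own mean, then add the grand mean.
    normalized = [[x - sum(row) / m + grand for x in row] for row in data]
    sems = []
    for j in range(m):
        column = [row[j] for row in normalized]
        # Morey: scale the variance (not the SEM) by M / (M - 1).
        corrected_var = variance(column) * m / (m - 1)
        sems.append(sqrt(corrected_var / n))
    return sems


# Made-up scores: 4 subjects x 3 conditions, with very different baselines.
scores = [
    [10, 12, 14],
    [20, 23, 23],
    [30, 31, 35],
    [40, 42, 44],
]
sems = cousineau_morey_sem(scores)
print([round(s, 3) for s in sems])  # [0.0, 0.5, 0.5]
```

Note that each condition gets its own error bar, and that the huge baseline differences between the made-up subjects no longer inflate any of them.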
Ahh... but the story is not over yet. I just found out that Franz and Loftus (2012) published a recent study challenging the normalization method. They argue that the normalization method does not actually allow for the checking of the sphericity assumption, and that even though the method produces different error bars for each condition, any difference in the error bars will not be interpretable. Specifically, sphericity requires inspecting all pairwise differences between conditions, and the only way to do it visually is to plot all the pairwise differences and see if the variances of those pairwise differences are similar. Franz and Loftus argue that these pairwise differences should be plotted next to the plot of means for visual inspection.
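As a rough sketch of what "inspecting all pairwise differences" means in practice, the snippet below computes the variance of the difference scores for every pair of conditions. The function name `pairwise_difference_variances` and the scores are hypothetical, and this only produces the numbers one would then plot or compare, not a formal sphericity test.

```python
from itertools import combinations
from statistics import variance


def pairwise_difference_variances(data):
    """data: one row per subject, one score per condition.
    Returns {(a, b): variance of (condition a - condition b) across subjects}."""
    m = len(data[0])
    result = {}
    for a, b in combinations(range(m), 2):
        diffs = [row[a] - row[b] for row in data]
        result[(a, b)] = variance(diffs)
    return result


# Made-up scores: 4 subjects x 3 conditions.
scores = [
    [10, 12, 14],
    [20, 23, 23],
    [30, 31, 35],
    [40, 42, 44],
]
diff_vars = pairwise_difference_variances(scores)
for pair, v in diff_vars.items():
    print(pair, round(v, 2))
# (0, 1) 0.67
# (0, 2) 0.67
# (1, 2) 2.67
```

Under sphericity these variances should be roughly equal. In this made-up example, the difference between the second and third conditions is noticeably more variable than the other two.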
My thoughts? I have to say, I still recommend Loftus and Masson + appropriate test for sphericity. It isn't THAT difficult to compute - an actual charge by detractors of the method. I mean, it's not computationally intractable or anything - just run the ANOVA >.<. That said, I wouldn't write off a paper which uses Morey (2005). For the purpose of "eye-balling" significance... those error bars do the trick as well. And the condition-specific variance reflected in those error bars is still information (though I need to think a little harder about what it might be good for). At the end of the day, I see error bars as visual aids*. Authors ought to be clear and honest about how the error bars were obtained. Readers, however, should always defer to the results of the appropriate statistical test to evaluate findings.
*On that note, I don't really think plotting the variability of the pairwise differences is all that useful. If there are, say, 6 conditions... you'll need to plot 15 pairwise differences... and that's not really all that helpful for visualization, which defeats the purpose of graphing in the first place.
** A previous version of this post stated that the correction was M/(M+1). Thanks to Eric Garr for pointing out the mistake :)
What do you mean exactly when you say that the within-subjects error bars would be too large?
Hey Eric! Nice to hear from you! Hope you are doing well :)!
Normally when you plot SEM error bars, you can conclude that the means of two conditions are not significantly different if the error bars overlap. But if you are running a within-subjects test (e.g., paired t-test), and you compute SEM the “normal way” (i.e. standard deviation/sqrt(sample size)), the resulting error bars would be too large in the sense that even if they overlap, you cannot conclude that there is no significant difference in means (take a look at figure 1 of Morey, 2005 for visualization). In fact, I would argue that those error bars are completely uninformative to your effect of interest!
Think about it this way - if you are testing if a particular drug increases heart-rate, you only care about (heart-rate after drug) minus (heart-rate before drug) for each participant. The fact that some people have faster or slower baseline heart-rates is inconsequential! But if you calculate the SEM the “normal way”, this SEM would also include the between-subject variance. So if the people in your sample have vastly different baseline heart-rates, you’ll have really big error bars, even if you have a consistent increase of 5 beats per minute in heart-rate after taking the drug.
Does that make sense? Btw, just curious, why the interest in within-subject error bars?
Ah, yes thanks! I wouldn't say I have a real interest in within-subject error bars as much as I want to make sure that I interpret data correctly. Hope you're well.
So, Yuan Chang, I was recently thinking about within-subject error bars and I remembered this blog post. I took a look at the Morey paper, and the correction is actually M/(M-1), not M/(M+1). Just thought you should know :)
Ahh... you are right! Must have typed it in wrongly when I wrote this - thank goodness it's correct in my code :)! Will correct the post now. Thanks for letting me know!
I just checked out your website - I really like the monthly reading section. I've been thinking of doing something similar, but I just haven't had the discipline to get it going! Hope all is well!
Ah man, it's hard to keep up with those posts. I should change the name to "Once in a blue moon readings". Also, I noticed that you are affiliated with Hyo Gweon's lab. Say hello to her for me--I used to be a research assistant under her when she was a grad student at MIT.