I am going to ramble about effect sizes. Like anything I write about, I am sure the ideas are not new and have been written about by people much smarter than me. But, I like to organize my thoughts and this blog is my way of doing that.
I may be being overly pessimistic, but I’m going to complain that effect sizes are a relatively useless statistic that editors are causing us to calculate.
When an experiment is run, the scientist typically ends up with two or more groups of data. They then summarize those groups in some way, such as a mean or a median, and look to see the extent to which the groups are different. Part of that process is to determine the probability of obtaining a difference at least that large by chance alone, if there were really no difference between the groups. That is the process of “significance” testing.
Much has been made recently, and rightly so, about the process of significance testing in that it can be (and is) easily abused. The researcher has many degrees of freedom that he or she can use in conducting the test that can improve their odds of obtaining a “statistically significant” result.
Along with significance testing, there is now a push to report effect sizes (and even confidence intervals on effect sizes…). Part of this push is likely in response to the abuses within significance testing. But effect sizes can also be affected by the same researcher degrees of freedom.
An effect size is a measure of how “big” the presumed effect of some variable is. I like to think of it the way Cohen describes it (or at least that is where I believe I encountered the idea): the effect size has to do with your ability to see a difference between the groups without computing any summary statistics.
Suppose you gathered people from Spain into a room. There are people from the north and the south, and people from the north are a little taller than those from the south. But just by looking into the room and observing heights, could you tell there are two different groups of people there? Probably not. That would be a small effect of geography. You’d have to measure each person and calculate some summary statistics to begin to see that there are two groups.
Now, suppose you put people from south Spain and people from the Netherlands into a room. Just by looking into the room you could probably tell that there were two different kinds of people in the room based on height (folks are tall in the Netherlands). In this case, there would be a large effect of geography.
Effect sizes are generally measured as a ratio: the extent to which the central tendencies of the conditions differ, over the variability within the conditions. The difference between the group means divided by the pooled within-group standard deviation gives us Cohen’s d. The sum of squares between the groups over the sum of squares between the groups plus the sum of squares within the groups ( SS between / (SS between + SS within) ) gives us eta squared (or partial eta squared, depending on the number of independent variables). These latter statistics generally represent the proportion of total variance that is attributable to the effect.
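To make those two ratios concrete, here is a minimal sketch in Python. The heights are numbers I invented for illustration, not real data, and the function names are my own. It assumes two independent groups and a one-way design, where eta squared and partial eta squared are the same thing.

```python
from statistics import mean, variance


def cohens_d(a, b):
    """Difference in means divided by the pooled within-group standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5


def eta_squared(*groups):
    """SS between / (SS between + SS within) for a one-way design."""
    grand_mean = mean([x for g in groups for x in g])
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return ss_between / (ss_between + ss_within)


# Hypothetical heights in cm, made up purely for illustration.
north = [172.0, 175.0, 178.0, 181.0, 174.0]
south = [170.0, 173.0, 176.0, 179.0, 172.0]

d = cohens_d(north, south)
eta2 = eta_squared(north, south)
print(f"Cohen's d = {d:.3f}, eta squared = {eta2:.3f}")
```

Both numbers have the spread within the groups in the denominator, which is the part that matters for the argument that follows.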
Statistical significance is calculated in a very similar way.
The major difference is that statistical significance also depends upon the size of the sample (the number of different people in the room). Nevertheless, the two are closely related. For instance, for a one-sample or paired test, the statistic “t” used in significance testing gives “d”, a measure of effect size, when divided by the square root of N ( t / sqrt( N ) = d ). For two independent groups of sizes n1 and n2, the relation becomes d = t * sqrt( 1/n1 + 1/n2 ). Partial eta squared is equal to F * DF1 / (F * DF1 + DF2) using the F obtained in analysis of variance; with a single independent variable this is the same as eta squared.
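Those conversions are short enough to write out. This is only a sketch: the function names are mine, the t value of 2.0 is invented, and the two-group formula assumes independent groups with a pooled standard deviation.

```python
from math import sqrt


def d_from_t_one_sample(t, n):
    """One-sample or paired design: d = t / sqrt(N)."""
    return t / sqrt(n)


def d_from_t_two_groups(t, n1, n2):
    """Two independent groups with a pooled SD: d = t * sqrt(1/n1 + 1/n2)."""
    return t * sqrt(1 / n1 + 1 / n2)


def t_from_d_two_groups(d, n1, n2):
    """The inverse relation: for a fixed d, t grows with the sample size."""
    return d / sqrt(1 / n1 + 1 / n2)


def partial_eta_squared_from_f(f, df1, df2):
    """Partial eta squared = F * DF1 / (F * DF1 + DF2)."""
    return f * df1 / (f * df1 + df2)


# Hypothetical result: t = 2.0 from two groups of 20.
t, n1, n2 = 2.0, 20, 20
d = d_from_t_two_groups(t, n1, n2)

# With two groups, F is t squared, DF1 = 1 and DF2 = n1 + n2 - 2.
eta2 = partial_eta_squared_from_f(t ** 2, 1, n1 + n2 - 2)

# The same d in a sample ten times larger gives a much larger t.
t_big = t_from_d_two_groups(d, 200, 200)
print(f"d = {d:.3f}, partial eta squared = {eta2:.3f}, t with n = 200 per group = {t_big:.2f}")
```

The last line is the sample-size point in miniature: d stays put while t climbs as more people are added to the room.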
As both the test statistic used to determine significance and the effect size are a function of the variability in the samples, anything that reduces that variability should improve both the size of the effect and the statistical significance. I would suggest that property is what makes the concept of an effect size somewhat meaningless.
I’ll do like Staddon does in his book Adaptive Dynamics and adopt a pendulum for an example (though this is not his usage of the example). I have two groups of 20 pendulums. In one group the length is 10cm, and in the other the length is 11cm. The ones that are 11cm in length should swing more slowly than the ones that are 10cm in length.
If you look at the two groups, you can probably pretty clearly determine that one group is swinging faster than the other. Even though there may be some variability between individual pendulums (the researcher set the lengths to 10 or 11cm +/- 1mm, some pendulums have slightly tighter pivots, etc.), the variables responsible for that variability do not have enough impact to prevent you from seeing the overall difference that pendulum length makes.
Suppose I then told you that the color should matter. Black pendulums should swing slower than white ones in a well-lit room. My imaginary idea here is that the black ones will retain heat better, expand slightly and thus swing more slowly. After several hours, you probably could not see a difference. Even after going in and measuring the individual tick-speeds you probably could not see a difference.
So, determined to support your hypothesis, what do you do? You go in and start over. You precisely measure and set the starting lengths to be exactly 10cm, with no variation in length. You mix a concoction of WD-40 and precisely lubricate each pivot. You place micro-switches on each pendulum to record their swing-time down to the microsecond. You put them in glass, precisely vacuum-sealed boxes to eliminate any effect of air, and so forth. You control every tiny variable that you can imagine might make a difference. Now, when you start your pendulums swinging, after an hour you still can’t notice that the black ones are swinging slower than the white ones. It again looks like you have no effect. But lo, when you look at your computer-recorded data you see that the black ones are swinging .04 milliseconds slower than the white ones. Within the black or white ones, there is only a .001 ms difference. That is, between any two black ones or any two white ones, there is only a .001 ms timing difference, but between any black and any white one, there is a .04 ms difference. You have a difference between the pendulums of different colors that is 40 times bigger than the difference between pendulums of the same color. You will calculate a HUGE effect size.
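The same story can be sketched numerically. Everything specific here is an assumption of mine: the ideal-pendulum formula T = 2 * pi * sqrt(L / g), a made-up 10 micrometer expansion for the black pendulums, and length errors spread evenly across the tolerance rather than drawn at random, so the result is reproducible. Nothing is measured.

```python
from math import pi, sqrt
from statistics import mean, variance

G = 9.81  # m/s^2, standard gravity


def period(length_m):
    """Period of an ideal pendulum in seconds: T = 2 * pi * sqrt(L / g)."""
    return 2 * pi * sqrt(length_m / G)


def cohens_d(a, b):
    """Difference in means divided by the pooled within-group standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5


def run(tolerance_m, n=20, base_m=0.10, expansion_m=1e-5):
    """Return (mean period difference, Cohen's d) for black versus white pendulums.

    tolerance_m: how far the set length may stray from nominal (hypothetical).
    expansion_m: invented extra length of the black pendulums from heating.
    """
    # Length errors spread evenly over [-tolerance, +tolerance], no randomness.
    offsets = [tolerance_m * (2 * i / (n - 1) - 1) for i in range(n)]
    white = [period(base_m + o) for o in offsets]
    black = [period(base_m + expansion_m + o) for o in offsets]
    return mean(black) - mean(white), cohens_d(black, white)


diff_sloppy, d_sloppy = run(tolerance_m=1e-3)  # lengths set to +/- 1 mm
diff_tight, d_tight = run(tolerance_m=1e-7)    # lengths set to +/- 0.1 micrometer

print(f"sloppy control: difference = {diff_sloppy * 1000:.4f} ms, d = {d_sloppy:.3f}")
print(f"tight control:  difference = {diff_tight * 1000:.4f} ms, d = {d_tight:.1f}")
```

The difference in swing time between the colors is essentially the same in both runs. Only the control over the lengths changed, and that alone moves d from negligible to enormous.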
But does it matter? Will the color variable make a difference? Well, yes. Will it make any practical difference in the operation of the world? Absolutely not. Effect sizes, we are taught, are supposed to help us with that latter question, but the truth is, they are relatively meaningless outside the context of the way in which the research was conducted. We must consider the total number of variables controlled, and the quality of that control. With greater control, effect sizes become less meaningful (imo). See also: visualizing effect sizes.