A quiz
For fun, imagine the following scenarios involving two psychologists studying different phenomena. Then please take the short (2-question) quiz that follows.

Psychologist A conducts 10 studies, each with 2 experimental conditions and 30 participants per condition. She submits only her significant results as a manuscript to the Journal of Personality and Social Psychology, which gleefully publishes her manuscript as a paper. Here are her results.
Study 2: t(58) = -2.88, p = .005
Study 3: t(58) = -3.54, p < .001
Study 4: t(58) = -2.38, p = .021
Study 6: t(58) = -2.67, p = .010
Study 7: t(58) = -3.03, p = .004
Study 8: t(58) = -2.57, p = .013
Study 9: t(58) = -3.43, p = .001
Study 10: t(58) = -2.10, p = .040
Psychologist B conducts 10 studies, each with 2 experimental conditions and 150 participants per condition. She submits only her significant results as a manuscript to the Journal of Personality and Social Psychology, which gleefully publishes her manuscript as a paper. Here are her results.
Study 2: t(298) = -2.02, p = .044
Study 3: t(298) = -2.62, p = .009
Study 6: t(298) = -2.7, p = .007
Study 7: t(298) = -5.05, p < .001
Study 8: t(298) = -2.59, p = .010
Study 9: t(298) = -3.68, p < .001
Study 10: t(298) = -2.72, p = .007
Hint for the quiz: the average t-value is nearly the same between both researchers' studies.
The quiz was meant to get you thinking about what information you use when you evaluate research findings.
It might seem like you need more information to answer the quiz questions, but some quick heuristics could get you to the right answers (C and A, respectively).
(1) The one obvious difference between the two sets of results is sample size.
(2) If you eyeball the t-values and take a mental average, they're about the same (A = -2.64 and B = -2.40).
Couple this knowledge with the fact that both researchers found 70-80% significant results, and you can conclude their studies are probably equally powered.
Takeaway: N = 60 doesn't always equate to low power. Power depends on the size of your effect.
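If you want to check that claim yourself, power.t.test() makes it concrete. The deltas below are approximate, but they show how much larger an effect needs to be for n = 30 per condition to be as well powered as n = 150 per condition.

# what effect size (delta) gives 80% power at these per-condition sample sizes?
power.t.test(n = 30, power = .80, sig.level = .05)$delta   # roughly d = 0.74
power.t.test(n = 150, power = .80, sig.level = .05)$delta  # roughly d = 0.32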
Now to the real point of this blog: p-values. In fact, you could have hazarded a guess about the power of the studies above from the p-values alone.
I want to convince you that p-values (especially lots of small ones) still matter in science, despite all the recent shade thrown at them.
The code* for generating the faux results above is more involved than usual, so I'm going to introduce you to the necessary functions in steps.
First, some basics
Here's a basic for() loop. I like to read it like this: for every ith piece in 1 to 10 (1:10 = 1, 2, 3... 8, 9, 10), run the code inside the brackets {}.

for(i in 1:10){
  print("I <3 statistics")
}
## [1] "I <3 statistics" ## [1] "I <3 statistics" ## [1] "I <3 statistics" ## [1] "I <3 statistics" ## [1] "I <3 statistics" ## [1] "I <3 statistics" ## [1] "I <3 statistics" ## [1] "I <3 statistics" ## [1] "I <3 statistics" ## [1] "I <3 statistics"
Here's a slightly more complicated for() loop. First I create a character vector.
Think of a character vector like a line made up of only text-things (they can be numbers, but they're treated like text).
I read the for() loop like this: for every ith piece in 1 through the length of msg (it's 3 pieces long), print the ith piece of msg (1 = "I", 2 = "<3", and 3 = "statistics").
msg <- c("I", "<3", "statistics")

# print it
msg
## [1] "I" "<3" "statistics"
# how many elements (how long) is msg?
length(msg)
## [1] 3
# print each element using for() loop
for(i in 1:length(msg)){
  print(msg[[i]])
}
## [1] "I" ## [1] "<3" ## [1] "statistics"
I'll eventually show you how to simulate lots and lots of t-tests and save their results, so really quickly I'm going to show you how the t-test function works.
Below I sample 100 values from a normal population with a mean of 0 and a standard deviation of 1, then test the mean of those values against the null hypothesis: mean = 0.
t.test(x = rnorm(n = 100, mean = 0, sd = 1), mu = 0)
## ## One Sample t-test ## ## data: rnorm(n = 100, mean = 0, sd = 1) ## t = 1.0304, df = 99, p-value = 0.3053 ## alternative hypothesis: true mean is not equal to 0 ## 95 percent confidence interval: ## -0.1055099 0.3334730 ## sample estimates: ## mean of x ## 0.1139815
The mean is not significantly different from 0 (p > .05), as we'd expect to be the case (95% of the time, if we sampled forever).
Below I do the same thing but I change the null value: mean = 1.
t.test(x = rnorm(n = 100, mean = 0, sd = 1), mu = 1)
## ## One Sample t-test ## ## data: rnorm(n = 100, mean = 0, sd = 1) ## t = -9.8806, df = 99, p-value < 2.2e-16 ## alternative hypothesis: true mean is not equal to 1 ## 95 percent confidence interval: ## -0.1020263 0.2665699 ## sample estimates: ## mean of x ## 0.08227179
The mean is significantly different from 1. Of course it is: we know that the true mean of the distribution is 0 and that the null value is a full standard deviation from 0. If we conducted a power analysis, we'd find that N = 100 gave us more than 99% power to find a significant effect. I'll show you later how more power leads to greater frequencies of small p-values too, regardless of the effect size.
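If you want to verify that power claim, power.t.test() can run the one-sample power analysis; the value below is approximate.

# power for a one-sample t-test with n = 100 when the true mean is 1 SD from the null
power.t.test(n = 100, delta = 1, sd = 1, sig.level = .05, type = "one.sample")$power  # essentially 1 (> .999)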
Here's an obvious advantage of R over SPSS: R is 'object-oriented'. That means I can save output and ask for it later. It's like the memory function on your calculator but way more powerful and intuitive.
I can save the t-test output, which is a list of all the values you see printed when you run the function. For example, I can extract the p-value with the $ operator.
t.test(x = rnorm(n = 100, mean = 0, sd = 1), mu = 1)$p.value
## [1] 1.900054e-16
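You can also save the whole test object and look at everything it stores (the exact element names can vary slightly across R versions).

# save the output as an object, then inspect its named pieces
test <- t.test(x = rnorm(n = 100, mean = 0, sd = 1), mu = 1)
names(test)      # "statistic", "parameter", "p.value", "conf.int", "estimate", ...
test$statistic   # the t-value
test$conf.int    # the 95% confidence interval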
Now I'll do this 10,000 times below and save the results in a list called 'pvalues'. Every iteration of the for() loop saves the p-value in the empty list. By the end of the for() loop, the list is anything but empty: it has 10,000 p-values in it.
# save the results in an empty list called pvalues
pvalues <- list()

for(i in 1:10000){
  pvalues[[i]] = t.test(x = rnorm(n = 100, mean = 0, sd = 1), mu = 0)$p.value
}
Now I can use the unlist() function and perform a logical operation: check whether each value in the vector is less than .05 (TRUE or FALSE?).
A neat feature of logical vectors in R (i.e., strings of TRUEs and FALSEs) is that they're treated like numbers.
FALSE = 0 and TRUE = 1.
So if I take the mean of a string of 0s and 1s, where 1 = p < .05 and 0 = p > .05, I get the false-positive (Type I error) rate across our 10,000 t-tests.
mean(unlist(pvalues < .05))
## [1] 0.0486
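If the TRUE/FALSE arithmetic seems like magic, here's the trick in isolation (the values are made up purely for illustration).

# TRUE counts as 1 and FALSE counts as 0, so the mean is the proportion of TRUEs
mean(c(TRUE, FALSE, TRUE, TRUE))  # 0.75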
Know what else is cool? We can test whether this value is significantly different from .05.
To do so, I need to give the prop.test() function 3 numbers:
- How many successes (p < .05)?
- How many total outcomes (all the p-values)?
- What proportion to test against (.05)?
# how many p-values are < .05?
x <- length(unlist(pvalues)[unlist(pvalues) < .05])

# how many p-values?
n <- length(unlist(pvalues))

# proportion test, where x = successes, n = all outcomes, and p = test value (the probability of success)
prop.test(x = x, n = n, p = .05)
## ## 1-sample proportions test with continuity correction ## ## data: x out of n, null probability 0.05 ## X-squared = 0.38368, df = 1, p-value = 0.5356 ## alternative hypothesis: true p is not equal to 0.05 ## 95 percent confidence interval: ## 0.04450814 0.05304264 ## sample estimates: ## p ## 0.0486
Of course it's not significantly different from .05. We knew the true mean was zero, so each t-test has a 5% Type I error rate by design (that's what alpha = .05 means), and across 10,000 simulated tests the observed proportion should land very close to 5%. This is the power of statistics and simulations.
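As an aside, you can ask the same question with an exact binomial test instead of the chi-squared approximation; this sketch just reuses the x and n defined above.

# exact binomial test of the observed proportion of significant results against .05
binom.test(x = x, n = n, p = .05)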
P-curve
Now I want to introduce you to the concept of p-curves.
Here's the code for the function I'll use in the next few sections.
p.curves <- function(sims = 10000, delta = .4, power = .80, n.more = 10, max.hack.n.mult = 2){
  # run power analysis to determine n for given power and delta
  pwr.analysis = power.t.test(delta = delta, power = power, sig.level = .05)
  n = pwr.analysis$n

  # empty lists for null and true difference results
  null <- list()
  real <- list()

  for(i in 1:sims){
    # baseline group 1
    group1 = rnorm(n = n, mean = 0, sd = 1)
    # add true delta
    group2 = rnorm(n = n, mean = 0 + delta, sd = 1)
    # comparison for null effects (same population as group 1)
    group3 = rnorm(n = n, mean = 0, sd = 1)

    # t-tests
    null[[i]] = t.test(group1, group3)$p.value
    real[[i]] = t.test(group1, group2)$p.value
  }

  # histogram of null relationship p-values
  null.hist = hist(x = unlist(null)[unlist(null) < .05], plot = FALSE)
  null.hist$counts <- round(null.hist$counts/sum(null.hist$counts), 3)
  plot(null.hist, freq = TRUE,
       main = paste(sims, "independent t-tests (n =", round(n), ", delta =", 0, ")"),
       xlab = paste(length(unlist(null)[unlist(null) < .05]), "significant p-values (p < .05)"),
       ylab = "Relative frequency of p-values", labels = TRUE)

  # histogram of true difference delta p-values
  real.hist = hist(x = unlist(real)[unlist(real) < .05], plot = FALSE)
  real.hist$counts <- round(real.hist$counts/sum(real.hist$counts), 3)
  plot(real.hist, freq = TRUE,
       main = paste(sims, "independent t-tests (n =", round(n), ", delta =", delta, ", power =", power, ")"),
       xlab = paste(length(unlist(real)[unlist(real) < .05]), "significant p-values (p < .05)"),
       ylab = "Relative frequency of p-values", labels = TRUE)

  # empty list for storing p-hacked results
  phacked <- list()

  for(i in 1:sims){
    # baseline group 1
    group1 = rnorm(n = n, mean = 0, sd = 1)
    # same population as group 1
    group2 = rnorm(n = n, mean = 0, sd = 1)

    # while the test result is non-significant and group 1 is smaller than the max group
    # size allowed, append n.more random values from the same populations as above
    while((t.test(group1, group2)$p.value >= .05 & length(group1) < n * max.hack.n.mult) == TRUE){
      group1 = append(group1, rnorm(n = n.more, mean = 0, sd = 1))
      group2 = append(group2, rnorm(n = n.more, mean = 0, sd = 1))
    }

    # now store the t-test once the condition above is FALSE
    phacked[[i]] = t.test(group1, group2)$p.value
  }

  # histogram of p-hacked comparisons
  phacked.hist = hist(x = unlist(phacked)[unlist(phacked) < .05], plot = FALSE)
  phacked.hist$counts = round(phacked.hist$counts/sum(phacked.hist$counts), 3)
  plot(phacked.hist,
       main = paste(sims, "independent, p-hacked t-tests (n =", round(n), ", delta =", 0, ")"),
       xlab = paste(length(unlist(phacked)[unlist(phacked) < .05]), "significant p-values (p < .05)"),
       ylab = "Relative frequency of p-values", freq = TRUE, labels = TRUE)
}
Next I'm going to generate 27 plots (each based on 10,000 simulated t-tests). The plots cross three power levels (33%, 50%, and 80%) with three effect sizes (d = 0.1, 0.3, and 0.5), and for each combination the function draws three p-curves: a null effect (d = 0), the true effect, and a p-hacked null effect (adding 10 more participants to each condition until the result is significant, but stopping once the sample size has doubled).
Pay attention to these features of the p-curves:
- When the effect is zero and there was no p-hacking, the p-curve is relatively flat (under the null, p-values follow a uniform distribution).
- When the effect is non-zero and there was no p-hacking, then the p-curve is right-skewed (the tail of the distribution is on the right).
- When p-hacking occurred, the p-curve is left-skewed (the tail of the distribution is on the left).
- As power increases, so does the proportion of smaller p-values.
- Holding power constant, the shape of the p-curve doesn't depend on the effect size (you see approximately the same proportion of p-values in every bin).
lapply(c(1/3, 1/2, 4/5), function(x) {
  deltas <- c(.1, .3, .5)
  for(i in 1:length(deltas)){
    p.curves(power = x, delta = deltas[[i]])
  }
})
These patterns are the gist of a new-ish technique developed by Uri Simonsohn, Leif Nelson, and Joe Simmons: p-curve analysis.
Here's the takeaway: true relationships produce significant p-value distributions that are right-skewed.
In plainer language, significant p-values pile up at the small end (e.g., p = .001) rather than near the cutoff (e.g., p = .04).
In a world of publication bias--academic journals tend to publish way more significant effects than null effects--p-curve is a powerful tool for evaluating scientific evidence.
If you're interested in how p-curve can be applied to effect sizes and how you can compute p-curves in your own research, you should visit http://www.p-curve.com/
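To give a crude flavor of the logic (this is not the actual p-curve procedure, just an illustrative check), you can simulate significant p-values from a true effect and ask whether more than half of them fall below .025; under a flat, null p-curve they'd split roughly 50/50. The sample size and effect size below are arbitrary choices that give roughly 80% power.

# simulate 5,000 two-sample t-tests with a true effect of d = 0.5 and n = 64 per group
set.seed(1)
ps <- replicate(5000, t.test(rnorm(64), rnorm(64, mean = 0.5))$p.value)
sig <- ps[ps < .05]

# right-skew check: is the proportion of significant p-values below .025 greater than .5?
binom.test(x = sum(sig < .025), n = length(sig), p = .5, alternative = "greater")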
Statistical power
In this last section I'll help you visualize the effect of power on the proportion of p-values. Below you'll see the code I use to simulate many tests with varying degrees of power.

power.sim <- sapply(c(1/3, 1/2, 4/5), function(x){
  delta = .4
  pwr.analysis = power.t.test(delta = delta, power = x, sig.level = .05)
  n = pwr.analysis$n

  p.values <- list()
  for(i in 1:10000){
    p.values[[i]] <- t.test(rnorm(n = n, mean = 0, sd = 1),
                            rnorm(n = n, mean = 0 + delta, sd = 1))$p.value
  }
  return(p.values)
}, simplify = FALSE, USE.NAMES = TRUE)
Below is a graph of all 30,000 p-values (10,000 per power level).
if(any((c("ggplot2", "dplyr") %in% row.names(installed.packages()))) == FALSE) install.packages(c("ggplot2", "dplyr")[c("ggplot2", "dplyr") %in% row.names(installed.packages()) == FALSE]) library(ggplot2) library(dplyr) data.frame(power = gl(n = 3, k = 10000, length = 30000, labels = c("33%", "50%", "80%")), p.value = unlist(power.sim)) %>% ggplot(., aes(x = p.value, fill = power)) + geom_histogram(bins = 80) + scale_fill_brewer(palette = "Reds") + labs(x = "p-values", y = "Frequency", title = "10000 simulations (delta = 0.4)")
Now only the significant p-values (p < .05):
data.frame(power = gl(n = 3, k = 10000, length = 30000, labels = c("33%", "50%", "80%")),
           p.value = unlist(power.sim)) %>%
  filter(p.value < .05) %>%
  ggplot(., aes(x = p.value, fill = power)) +
  geom_histogram(bins = 80) +
  scale_fill_brewer(palette = "Reds") +
  labs(x = "p-values < .05", y = "Frequency", title = "10000 simulations (delta = 0.4)")
Here's one obvious takeaway: you'll find significance more often if you power your studies more.
I think this point is underappreciated.
If you have 80% power to detect a given effect size, your long-run chance of finding a significant effect is...
80%.
My guess is that the typical effect size for a two-condition experiment in social psychology is in the ballpark of d = 0.3 to d = 0.4 (see the references below).
That means 80% power for a run-of-the-mill, publishable, reasonably-easy-to-replicate effect size requires between 100 and 175 participants per condition.
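You can check those per-condition numbers yourself with power.t.test(); the values are approximate.

# per-condition n needed for 80% power at d = .3 and d = .4
power.t.test(delta = .3, power = .80, sig.level = .05)$n  # about 175 per condition
power.t.test(delta = .4, power = .80, sig.level = .05)$n  # about 100 per condition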
That probably seems like a lot, but the reality is that the fewer subjects you run (whether it's a pilot study or the 'real thing'), the more likely you are to miss effects that are really there. On the other hand, if you do power your studies to 80% (or more), then when you find null effects, you learn something useful:
(1) The effect isn't there, so stop testing it. Or
(2) The effect is small and therefore likely requires more resources from you and future researchers to find it.**
End rant. Happy R.
Nick
** Thanks to Will Gervais for this insight about statistical power. While I'm at it, this blog (and associated paper) was pretty insightful too.
References
- Gignac, G. E., & Szodorai, E. T. (2016). Effect size guidelines for individual differences researchers. Personality and Individual Differences, 102, 74-78.
- Richard, F. D., Bond Jr, C. F., & Stokes-Zoota, J. J. (2003). One hundred years of social psychology quantitatively described. Review of General Psychology, 7(4), 331-363.
* Simulated results from the quiz
# list of p-values for n = 150
ps.150 <- list()
# list of p-values for n = 30
ps.30 <- list()
# list of results for small effect
result.sml <- list()
# list of results for large effect
result.lrg <- list()

for(i in 1:10){
  # delta for n = 150
  d.sml <- power.t.test(n = 150, power = .8)$delta
  # delta for n = 30
  d.lrg <- power.t.test(n = 30, power = .8)$delta

  # t-test for small effect
  test.sml <- t.test(rnorm(150, 0, 1), rnorm(150, d.sml, 1), var.equal = T)
  # t-test for large effect
  test.lrg <- t.test(rnorm(30, 0, 1), rnorm(30, d.lrg, 1), var.equal = T)

  # store each p-value in loop
  ps.150[[i]] <- test.sml$p.value
  ps.30[[i]] <- test.lrg$p.value

  # create results statements (Study i: t() = , p = )
  result.sml[[i]] <- paste0("Study ", i, ": t(", test.sml$parameter, ") = ",
                            round(test.sml$statistic, 2), ", p = ", test.sml$p.value)
  result.lrg[[i]] <- paste0("Study ", i, ": t(", test.lrg$parameter, ") = ",
                            round(test.lrg$statistic, 2), ", p = ", test.lrg$p.value)
}