Investigating p-values in hypothesis tests
Introduction
Today I thought of writing a post covering some topic from statistics. I decided to discuss some properties of p-value of hypothesis tests.
The post is written under Julia 1.8.5, Distributions.jl 0.25.80, HypothesisTests.jl 0.10.11, UnicodePlots.jl 3.3.4.
Some theory
Let us assume that we have some hypothesis, call it H0, that we assume to be true. Now consider that we have some statistic T, whose distribution we can compute assuming that H0 is true. For a given sample of data we computed the value of the statistic and denote this value as t.
We are ready to define p-value, which is the probability of obtaining the value of the T statistic at least as extreme as the result t actually observed, under the assumption that H0 is correct.
Let me give a simple example. Assume that we have some data, for which we
assume that it comes from normal distribution with an unknown mean and known
variance equal to v=1
. We want to test H0 hypothesis that the mean of the
distribution is equal to 0, against assumption that the mean is not equal
to 0. We check it by taking a n=9
element sample of the data and computing
its mean. Assume that this mean is m=1
. In this case the value of the
statistic t is computed using the formula t=m/sqrt(v/n)=1/sqrt(1/9)=3
. From
mathematical statistics we know that the distribution of the statistic T in
this case is normal with mean 0 and variance 1. Therefore we can compute that
the probability of seeing the value at least as extreme (in absolute value) as
the observed is 2*(1-F(|t|)) = 2*(1-F(3)) ≈ 0.0027
. This number is our
p-value. In this case, since this probability is low we would most likely
conclude that the data does not support our hypothesis that the mean is equal
to 0.
Note that in the formula F
is cumulative distribution function of standard
normal distribution, and we take absolute value of t
and multiply the weight
of the tail of the distribution by two, as we assume that we equally treat
positive and negative deviations from 0 as extreme (this is called a two-sided
p-value).
We could have obtained this result conveniently in Julia using the
OneSampleZTest
function from the HypothesisTests.jl package as follows:
julia> using HypothesisTests
julia> m = 1.0
1.0
julia> v = 1.0
1.0
julia> n = 9
9
julia> OneSampleZTest(m, sqrt(v), n)
One sample z-test
-----------------
Population details:
parameter of interest: Mean
value under h_0: 0
point estimate: 1.0
95% confidence interval: (0.3467, 1.653)
Test summary:
outcome with 95% confidence: reject h_0
two-sided p-value: 0.0027
Details:
number of observations: 9
z-statistic: 3.0
population standard error: 0.3333333333333333
Now, a crucial observation about p-value, given the definition we have
presented, is that assuming H0 is true the distribution of p-value should
be uniform on [0, 1]
interval.
For example, in the case of our normally distributed variable we claim that
PV = 2*(1-F(|T|))
has uniform distribution on [0, 1]
interval.
Let us make a quick computation to verify it is indeed the case. Assume
x
is from [0, 1]
interval:
Pr(PV ≤ x) = Pr(2*(1-F(|T|)) ≤ x) = Pr(F⁻¹(1-x/2) ≤ |T|)
Now we note that this can be rewritten as:
Pr(F⁻¹(1-x/2) ≤ T) + Pr(T ≤ -F⁻¹(1-x/2))
Note that these two probabilities are equal as normal distribution is symmetric around 0 so we can simplify it to:
2 * (1 - Pr(T ≤ F⁻¹(1-x/2))) = 2 * (1 - F(F⁻¹(1-x/2))) = 2 * x/2 = x
So in summary we get Pr(PV ≤ x) = x
for x
from [0, 1]
interval, so
in this case indeed p-value has uniform distribution.
Checking the theory using simulation
Above we have claimed that the distribution of p-value for OneSampleZTest
is uniform on [0, 1]
interval. Let us check it using simulation. For this
we will generate 9-element sample from standard normal distribution multiple
times, and check if the distribution of p-value.
julia> using Random
julia> using Statistics
julia> using UnicodePlots
julia> pvalues = [pvalue(OneSampleZTest(mean(randn(9)), 1.0, 9))
for _ in 1:10^6]
1000000-element Vector{Float64}:
0.1327790631059239
0.37168172133050215
0.9877013474897027
0.24068935009108738
0.24685639345198623
0.19113260388855702
0.1806265110576506
0.3482022041517382
⋮
0.33776890104210877
0.141776876179674
0.38354782746729116
0.9866784609861702
0.9219832724792354
0.9768708734987599
0.44377261216567804
julia> histogram(pvalues)
┌ ┐
[0.0 , 0.05) ┤███████████████████████████████▋ 49 921
[0.05, 0.1 ) ┤███████████████████████████████▊ 50 228
[0.1 , 0.15) ┤███████████████████████████████▋ 49 908
[0.15, 0.2 ) ┤████████████████████████████████ 50 451
[0.2 , 0.25) ┤███████████████████████████████▌ 49 655
[0.25, 0.3 ) ┤███████████████████████████████▋ 49 907
[0.3 , 0.35) ┤███████████████████████████████▋ 49 867
[0.35, 0.4 ) ┤███████████████████████████████▊ 50 303
[0.4 , 0.45) ┤███████████████████████████████▋ 49 827
[0.45, 0.5 ) ┤███████████████████████████████▋ 49 980
[0.5 , 0.55) ┤███████████████████████████████▋ 49 855
[0.55, 0.6 ) ┤███████████████████████████████▌ 49 569
[0.6 , 0.65) ┤███████████████████████████████▋ 49 798
[0.65, 0.7 ) ┤███████████████████████████████▊ 50 143
[0.7 , 0.75) ┤███████████████████████████████▉ 50 406
[0.75, 0.8 ) ┤███████████████████████████████▋ 49 923
[0.8 , 0.85) ┤███████████████████████████████▋ 50 048
[0.85, 0.9 ) ┤███████████████████████████████▋ 49 835
[0.9 , 0.95) ┤███████████████████████████████▊ 50 249
[0.95, 1.0 ) ┤███████████████████████████████▊ 50 127
└ ┘
Frequency
Let us additionally check what happens if our data does not meet the assumptions of our H0:
julia> pvalues = [pvalue(OneSampleZTest(mean(randn(9)) .+ 1, 1.0, 9))
for _ in 1:10^6]
1000000-element Vector{Float64}:
0.0002078548349486422
7.323288709478623e-5
0.4909823219717196
0.12532336793818022
0.05461673383130956
0.001132766345973069
0.02138767010993988
0.4385291137658384
⋮
0.06136424417006158
0.003270513361517393
0.0007748069139344607
0.06833756938048707
8.815311836419556e-5
4.057178851779267e-5
0.000801919865655273
julia> histogram(pvalues)
┌ ┐
[0.0 , 0.05) ┤███████████████████████████████ 850 737
[0.05, 0.1 ) ┤██▎ 61 462
[0.1 , 0.15) ┤█▏ 28 540
[0.15, 0.2 ) ┤▋ 16 534
[0.2 , 0.25) ┤▍ 10 542
[0.25, 0.3 ) ┤▍ 7 367
[0.3 , 0.35) ┤▎ 5 366
[0.35, 0.4 ) ┤▎ 4 022
[0.4 , 0.45) ┤▎ 3 058
[0.45, 0.5 ) ┤▎ 2 329
[0.5 , 0.55) ┤▏ 1 908
[0.55, 0.6 ) ┤▏ 1 551
[0.6 , 0.65) ┤▏ 1 319
[0.65, 0.7 ) ┤▏ 1 091
[0.7 , 0.75) ┤▏ 940
[0.75, 0.8 ) ┤▏ 753
[0.8 , 0.85) ┤▏ 715
[0.85, 0.9 ) ┤▏ 619
[0.9 , 0.95) ┤▏ 601
[0.95, 1.0 ) ┤▏ 546
└ ┘
Frequency
Indeed, as expected, we see that in this case we are getting low p-value most of the time.
The fact that p-value has uniform distribution under H0 is a crucial property. Unfortunately, not all hypothesis tests can ensure this.
Distribution p-value for discrete tests
A typical issue with p-value distribution is encountered when we work with discrete distributions. Let me give you an example.
Now assume we sample data from a binomial distribution with sample size
n=16
and success probability p=0.5
. Let us repeat the simulation that we
performed above using this distribution:
julia> using Distributions
julia> pvalues = [pvalue(BinomialTest(rand(Binomial(16, 0.5)), 16, 0.5))
for _ in 1:10^6]
1000000-element Vector{Float64}:
0.8036193847656249
0.8036193847656249
0.8036193847656249
0.8036193847656249
0.8036193847656249
0.45449829101562506
0.8036193847656249
0.8036193847656249
⋮
0.8036193847656249
0.21011352539062514
0.8036193847656249
0.21011352539062514
0.21011352539062514
0.45449829101562506
0.8036193847656249
julia> histogram(pvalues)
┌ ┐
[0.0 , 0.05) ┤█▊ 21 491
[0.05, 0.1 ) ┤████▉ 55 699
[0.1 , 0.15) ┤ 0
[0.15, 0.2 ) ┤ 0
[0.2 , 0.25) ┤███████████▊ 133 700
[0.25, 0.3 ) ┤ 0
[0.3 , 0.35) ┤ 0
[0.35, 0.4 ) ┤ 0
[0.4 , 0.45) ┤ 0
[0.45, 0.5 ) ┤█████████████████████▋ 244 309
[0.5 , 0.55) ┤ 0
[0.55, 0.6 ) ┤ 0
[0.6 , 0.65) ┤ 0
[0.65, 0.7 ) ┤ 0
[0.7 , 0.75) ┤ 0
[0.75, 0.8 ) ┤ 0
[0.8 , 0.85) ┤███████████████████████████████ 348 485
[0.85, 0.9 ) ┤ 0
[0.9 , 0.95) ┤ 0
[0.95, 1.0 ) ┤ 0
[1.0 , 1.05) ┤█████████████████▌ 196 316
└ ┘
Frequency
Now we can see that clearly the distribution is not uniform. For example, under
the standard 0.05 cut-off threshold for p-value we would expect to have
50,000 observations in [0.0, 0.05)
bin, but we got only 21,491
. Is this a
problem? Well, it is. Assume that we generate the data with p=0.55
, but assume
the H0 probability is 0.5
. We get:
julia> pvalues = [pvalue(BinomialTest(rand(Binomial(16, 0.55)), 16, 0.5))
for _ in 1:10^6]
1000000-element Vector{Float64}:
0.21011352539062514
0.07681274414062504
0.8036193847656249
0.8036193847656249
1.0
0.8036193847656249
0.8036193847656249
0.45449829101562506
⋮
0.8036193847656249
0.45449829101562506
0.8036193847656249
0.8036193847656249
0.21011352539062514
1.0
1.0
julia> histogram(pvalues)
┌ ┐
[0.0 , 0.05) ┤██▉ 31 600
[0.05, 0.1 ) ┤██████▌ 68 901
[0.1 , 0.15) ┤ 0
[0.15, 0.2 ) ┤ 0
[0.2 , 0.25) ┤█████████████▊ 145 964
[0.25, 0.3 ) ┤ 0
[0.3 , 0.35) ┤ 0
[0.35, 0.4 ) ┤ 0
[0.4 , 0.45) ┤ 0
[0.45, 0.5 ) ┤███████████████████████▏ 243 914
[0.5 , 0.55) ┤ 0
[0.55, 0.6 ) ┤ 0
[0.6 , 0.65) ┤ 0
[0.65, 0.7 ) ┤ 0
[0.7 , 0.75) ┤ 0
[0.75, 0.8 ) ┤ 0
[0.8 , 0.85) ┤███████████████████████████████ 328 306
[0.85, 0.9 ) ┤ 0
[0.9 , 0.95) ┤ 0
[0.95, 1.0 ) ┤ 0
[1.0 , 1.05) ┤█████████████████▎ 181 315
└ ┘
Frequency
So we can see that even if I generate the data from a distribution different
than assumed under H0 you would reject this hypothesis with 0.05
threshold
less frequently than even what you would expect under H0.
Conclusions
The bias in distribution of p-value can have significant consequences when making statistical inference in practice. For example, in a paper I co-authored some time ago, we show the property of p-value for various statistical tests used in VaR backtesting (a very common test in financial applications). If you would like to learn the details you can find it here.