Introduction

Today I thought of writing a post covering some topic from statistics. I decided to discuss some properties of p-value of hypothesis tests.

The post is written under Julia 1.8.5, Distributions.jl 0.25.80, HypothesisTests.jl 0.10.11, UnicodePlots.jl 3.3.4.

Some theory

Let us assume that we have some hypothesis, call it H0, that we assume to be true. Now consider that we have some statistic T, whose distribution we can compute assuming that H0 is true. For a given sample of data we computed the value of the statistic and denote this value as t.

We are ready to define p-value, which is the probability of obtaining the value of the T statistic at least as extreme as the result t actually observed, under the assumption that H0 is correct.

Let me give a simple example. Assume that we have some data, for which we assume that it comes from normal distribution with an unknown mean and known variance equal to v=1. We want to test H0 hypothesis that the mean of the distribution is equal to 0, against assumption that the mean is not equal to 0. We check it by taking a n=9 element sample of the data and computing its mean. Assume that this mean is m=1. In this case the value of the statistic t is computed using the formula t=m/sqrt(v/n)=1/sqrt(1/9)=3. From mathematical statistics we know that the distribution of the statistic T in this case is normal with mean 0 and variance 1. Therefore we can compute that the probability of seeing the value at least as extreme (in absolute value) as the observed is 2*(1-F(|t|)) = 2*(1-F(3)) ≈ 0.0027. This number is our p-value. In this case, since this probability is low we would most likely conclude that the data does not support our hypothesis that the mean is equal to 0.

Note that in the formula F is cumulative distribution function of standard normal distribution, and we take absolute value of t and multiply the weight of the tail of the distribution by two, as we assume that we equally treat positive and negative deviations from 0 as extreme (this is called a two-sided p-value).

We could have obtained this result conveniently in Julia using the OneSampleZTest function from the HypothesisTests.jl package as follows:

julia> using HypothesisTests

julia> m = 1.0
1.0

julia> v = 1.0
1.0

julia> n = 9
9

julia> OneSampleZTest(m, sqrt(v), n)
One sample z-test
-----------------
Population details:
    parameter of interest:   Mean
    value under h_0:         0
    point estimate:          1.0
    95% confidence interval: (0.3467, 1.653)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           0.0027

Details:
    number of observations:   9
    z-statistic:              3.0
    population standard error: 0.3333333333333333

Now, a crucial observation about p-value, given the definition we have presented, is that assuming H0 is true the distribution of p-value should be uniform on [0, 1] interval.

For example, in the case of our normally distributed variable we claim that PV = 2*(1-F(|T|)) has uniform distribution on [0, 1] interval.

Let us make a quick computation to verify it is indeed the case. Assume x is from [0, 1] interval:

Pr(PV ≤ x) = Pr(2*(1-F(|T|)) ≤ x) = Pr(F⁻¹(1-x/2) ≤ |T|)

Now we note that this can be rewritten as:

Pr(F⁻¹(1-x/2) ≤ T) + Pr(T ≤ -F⁻¹(1-x/2))

Note that these two probabilities are equal as normal distribution is symmetric around 0 so we can simplify it to:

2 * (1 - Pr(T ≤ F⁻¹(1-x/2))) = 2 * (1 - F(F⁻¹(1-x/2))) = 2 * x/2 = x

So in summary we get Pr(PV ≤ x) = x for x from [0, 1] interval, so in this case indeed p-value has uniform distribution.

Checking the theory using simulation

Above we have claimed that the distribution of p-value for OneSampleZTest is uniform on [0, 1] interval. Let us check it using simulation. For this we will generate 9-element sample from standard normal distribution multiple times, and check if the distribution of p-value.

julia> using Random

julia> using Statistics

julia> using UnicodePlots

julia> pvalues = [pvalue(OneSampleZTest(mean(randn(9)), 1.0, 9))
                  for _ in 1:10^6]
1000000-element Vector{Float64}:
 0.1327790631059239
 0.37168172133050215
 0.9877013474897027
 0.24068935009108738
 0.24685639345198623
 0.19113260388855702
 0.1806265110576506
 0.3482022041517382
 ⋮
 0.33776890104210877
 0.141776876179674
 0.38354782746729116
 0.9866784609861702
 0.9219832724792354
 0.9768708734987599
 0.44377261216567804

julia> histogram(pvalues)
                ┌                                        ┐
   [0.0 , 0.05) ┤███████████████████████████████▋ 49 921
   [0.05, 0.1 ) ┤███████████████████████████████▊ 50 228
   [0.1 , 0.15) ┤███████████████████████████████▋ 49 908
   [0.15, 0.2 ) ┤████████████████████████████████  50 451
   [0.2 , 0.25) ┤███████████████████████████████▌ 49 655
   [0.25, 0.3 ) ┤███████████████████████████████▋ 49 907
   [0.3 , 0.35) ┤███████████████████████████████▋ 49 867
   [0.35, 0.4 ) ┤███████████████████████████████▊ 50 303
   [0.4 , 0.45) ┤███████████████████████████████▋ 49 827
   [0.45, 0.5 ) ┤███████████████████████████████▋ 49 980
   [0.5 , 0.55) ┤███████████████████████████████▋ 49 855
   [0.55, 0.6 ) ┤███████████████████████████████▌ 49 569
   [0.6 , 0.65) ┤███████████████████████████████▋ 49 798
   [0.65, 0.7 ) ┤███████████████████████████████▊ 50 143
   [0.7 , 0.75) ┤███████████████████████████████▉ 50 406
   [0.75, 0.8 ) ┤███████████████████████████████▋ 49 923
   [0.8 , 0.85) ┤███████████████████████████████▋ 50 048
   [0.85, 0.9 ) ┤███████████████████████████████▋ 49 835
   [0.9 , 0.95) ┤███████████████████████████████▊ 50 249
   [0.95, 1.0 ) ┤███████████████████████████████▊ 50 127
                └                                        ┘
                                 Frequency

Let us additionally check what happens if our data does not meet the assumptions of our H0:

julia> pvalues = [pvalue(OneSampleZTest(mean(randn(9)) .+ 1, 1.0, 9))
                  for _ in 1:10^6]
1000000-element Vector{Float64}:
 0.0002078548349486422
 7.323288709478623e-5
 0.4909823219717196
 0.12532336793818022
 0.05461673383130956
 0.001132766345973069
 0.02138767010993988
 0.4385291137658384
 ⋮
 0.06136424417006158
 0.003270513361517393
 0.0007748069139344607
 0.06833756938048707
 8.815311836419556e-5
 4.057178851779267e-5
 0.000801919865655273

julia> histogram(pvalues)
                ┌                                        ┐
   [0.0 , 0.05) ┤███████████████████████████████  850 737
   [0.05, 0.1 ) ┤██▎ 61 462
   [0.1 , 0.15) ┤█▏ 28 540
   [0.15, 0.2 ) ┤▋ 16 534
   [0.2 , 0.25) ┤▍ 10 542
   [0.25, 0.3 ) ┤▍ 7 367
   [0.3 , 0.35) ┤▎ 5 366
   [0.35, 0.4 ) ┤▎ 4 022
   [0.4 , 0.45) ┤▎ 3 058
   [0.45, 0.5 ) ┤▎ 2 329
   [0.5 , 0.55) ┤▏ 1 908
   [0.55, 0.6 ) ┤▏ 1 551
   [0.6 , 0.65) ┤▏ 1 319
   [0.65, 0.7 ) ┤▏ 1 091
   [0.7 , 0.75) ┤▏ 940
   [0.75, 0.8 ) ┤▏ 753
   [0.8 , 0.85) ┤▏ 715
   [0.85, 0.9 ) ┤▏ 619
   [0.9 , 0.95) ┤▏ 601
   [0.95, 1.0 ) ┤▏ 546
                └                                        ┘
                                 Frequency

Indeed, as expected, we see that in this case we are getting low p-value most of the time.

The fact that p-value has uniform distribution under H0 is a crucial property. Unfortunately, not all hypothesis tests can ensure this.

Distribution p-value for discrete tests

A typical issue with p-value distribution is encountered when we work with discrete distributions. Let me give you an example.

Now assume we sample data from a binomial distribution with sample size n=16 and success probability p=0.5. Let us repeat the simulation that we performed above using this distribution:

julia> using Distributions

julia> pvalues = [pvalue(BinomialTest(rand(Binomial(16, 0.5)), 16, 0.5))
                  for _ in 1:10^6]
1000000-element Vector{Float64}:
 0.8036193847656249
 0.8036193847656249
 0.8036193847656249
 0.8036193847656249
 0.8036193847656249
 0.45449829101562506
 0.8036193847656249
 0.8036193847656249
 ⋮
 0.8036193847656249
 0.21011352539062514
 0.8036193847656249
 0.21011352539062514
 0.21011352539062514
 0.45449829101562506
 0.8036193847656249

julia> histogram(pvalues)
                ┌                                        ┐
   [0.0 , 0.05) ┤█▊ 21 491
   [0.05, 0.1 ) ┤████▉ 55 699
   [0.1 , 0.15) ┤  0
   [0.15, 0.2 ) ┤  0
   [0.2 , 0.25) ┤███████████▊ 133 700
   [0.25, 0.3 ) ┤  0
   [0.3 , 0.35) ┤  0
   [0.35, 0.4 ) ┤  0
   [0.4 , 0.45) ┤  0
   [0.45, 0.5 ) ┤█████████████████████▋ 244 309
   [0.5 , 0.55) ┤  0
   [0.55, 0.6 ) ┤  0
   [0.6 , 0.65) ┤  0
   [0.65, 0.7 ) ┤  0
   [0.7 , 0.75) ┤  0
   [0.75, 0.8 ) ┤  0
   [0.8 , 0.85) ┤███████████████████████████████  348 485
   [0.85, 0.9 ) ┤  0
   [0.9 , 0.95) ┤  0
   [0.95, 1.0 ) ┤  0
   [1.0 , 1.05) ┤█████████████████▌ 196 316
                └                                        ┘
                                 Frequency

Now we can see that clearly the distribution is not uniform. For example, under the standard 0.05 cut-off threshold for p-value we would expect to have 50,000 observations in [0.0, 0.05) bin, but we got only 21,491. Is this a problem? Well, it is. Assume that we generate the data with p=0.55, but assume the H0 probability is 0.5. We get:

julia> pvalues = [pvalue(BinomialTest(rand(Binomial(16, 0.55)), 16, 0.5))
                  for _ in 1:10^6]
1000000-element Vector{Float64}:
 0.21011352539062514
 0.07681274414062504
 0.8036193847656249
 0.8036193847656249
 1.0
 0.8036193847656249
 0.8036193847656249
 0.45449829101562506
 ⋮
 0.8036193847656249
 0.45449829101562506
 0.8036193847656249
 0.8036193847656249
 0.21011352539062514
 1.0
 1.0

julia> histogram(pvalues)
                ┌                                        ┐
   [0.0 , 0.05) ┤██▉ 31 600
   [0.05, 0.1 ) ┤██████▌ 68 901
   [0.1 , 0.15) ┤  0
   [0.15, 0.2 ) ┤  0
   [0.2 , 0.25) ┤█████████████▊ 145 964
   [0.25, 0.3 ) ┤  0
   [0.3 , 0.35) ┤  0
   [0.35, 0.4 ) ┤  0
   [0.4 , 0.45) ┤  0
   [0.45, 0.5 ) ┤███████████████████████▏ 243 914
   [0.5 , 0.55) ┤  0
   [0.55, 0.6 ) ┤  0
   [0.6 , 0.65) ┤  0
   [0.65, 0.7 ) ┤  0
   [0.7 , 0.75) ┤  0
   [0.75, 0.8 ) ┤  0
   [0.8 , 0.85) ┤███████████████████████████████  328 306
   [0.85, 0.9 ) ┤  0
   [0.9 , 0.95) ┤  0
   [0.95, 1.0 ) ┤  0
   [1.0 , 1.05) ┤█████████████████▎ 181 315
                └                                        ┘
                                 Frequency

So we can see that even if I generate the data from a distribution different than assumed under H0 you would reject this hypothesis with 0.05 threshold less frequently than even what you would expect under H0.

Conclusions

The bias in distribution of p-value can have significant consequences when making statistical inference in practice. For example, in a paper I co-authored some time ago, we show the property of p-value for various statistical tests used in VaR backtesting (a very common test in financial applications). If you would like to learn the details you can find it here.