Author Archives: Blog by Bogumił Kamiński

Grand Julia containers test. Can you get a perfect score?

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/02/03/iterable.html

Introduction

Containers are objects grouping multiple values together.
In Julia, examples of containers are vectors or dictionaries.

To ensure that it is easy to work with containers Julia introduces four
interfaces:

  • iteration;
  • indexing;
  • array;
  • broadcasting.

They are described in the Interfaces section of the Julia Manual.
However, this description is a bit technical and it mostly aimed at developers
creating new container types. Therefore, I thought to write a post
aimed to explain these interfaces by example from user’s perspective.

The material presented here is organized in two parts.

Part one is basic knowledge you should have. I tried to collect in this post
most relevant information.

Part two is organized as a quiz to check your understanding of the discussed
interfaces. There are ten questions in the quiz (plus one bonus question).
Do you know an answer to all of them?

The post is tested under Julia 1.8.5 and DataFrames.jl 1.4.4.

Container interfaces explained

Iteration

Iteration is the least demanding interface. It ensures that you can get elements
from a container sequentially.

Most of containers are iterable, including: arrays, dictionaries, sets,
I/O buffers, and strings.

If some container supports iteration then it can be used in for loops,
comprehensions, and higher other functions relying only on iteration of values
like foreach.

Here is a basic example of iterating a tuple:

julia> t = (1, 2, 3)
(1, 2, 3)

julia> for v in t
           println(v)
       end
1
2
3

julia> [v for v in t]
3-element Vector{Int64}:
 1
 2
 3

An important feature of iteration interface is that it also works for
collections that get mutated when iterating them. A basic example is reading
data from IOBuffer:

julia> buf = IOBuffer(join('a':'d', '\n'))
IOBuffer(data=UInt8[...], readable=true, writable=false, seekable=true, append=false, size=7, maxsize=Inf, ptr=1, mark=-1)

julia> foreach(println, eachline(buf))
a
b
c
d

julia> foreach(println, eachline(buf))

julia>

Note that by iterating buf line by line we were mutating it. Therefore, the
second time we iterate it we do not get any output.

When you use iteration interface it is always essential to have a mindset that
you are guaranteed to be able to read an element of a collection once.

Indexing

Indexing allows you do access elements from a container using their identifier
typically called index. This, implicitly, means that you can read the same
element multiple times without a problem (as opposed to iteration).

The key feature of this interface is that it supports a convenient x[i] syntax
for getting elements from a container. Again, let us give a simple example:

julia> d = Dict(v => '0'+v for v in s)
Dict{Int64, Char} with 4 entries:
  4 => '4'
  2 => '2'
  3 => '3'
  1 => '1'

julia> d[3]
'3': ASCII/Unicode U+0033 (category Nd: Number, decimal digit)

In practice in Julia this interface is implemented in several flavors.
The most important of them are the following.

First, indexing can support only reading or reading and writing of data
to the container. For example entries of a Vector can be mutated, but ranges
are read-only:

julia> v1 = [1, 2, 3]
3-element Vector{Int64}:
 1
 2
 3

julia> v1[2] = 12
12

julia> v2 = 1:3
1:3

julia> v2[2] = 12
ERROR: CanonicalIndexError: setindex! not defined for UnitRange{Int64}

Lesson to remember: do not assume that indexable objects are always writeable.

The second flavor is that typically indexing interface assumes that it is
possible to order valid indices and define first and last index. This can be
conveniently exploited by using begin and end keywords when indexing:

julia> v1[begin]
1

julia> v1[end]
3

However, this support is not guaranteed by all objects that support indexing.
A basic example is a dictionary:

julia> d[end]
ERROR: MethodError: no method matching lastindex(::Dict{Int64, Char})

So the second thing to remember is that you cannot assume that all indexable
objects will support passing begin and end keywords when indexing.

Array

Arrays move the indexing syntax further by allowing passing multiple values in
x[i, j, ...] syntax and also introducing a notion of CartesianIndex
indexing. Typically types that support array interface opt-in to be a subtype of
AbstractArray. An important feature of arrays is that they specify their
dimensions and using the size function one can get them. Similarly using the
axes functions one can get valid indices along the dimensions.

Here is an example:

julia> mat = [1 2 3; 4 5 6]
2×3 Matrix{Int64}:
 1  2  3
 4  5  6

julia> size(mat)
(2, 3)

julia> axes(mat)
(Base.OneTo(2), Base.OneTo(3))

julia> mat[1, 2]
2

julia> mat[CartesianIndex(1, 2)]
2

An important feature that is often needed is iteration over all elements of
an array. You can get an iterator of such indices using the eachindex
function.

It is important to remember that this function only guarantees to return
a value that is a valid index to an array. The type of this index is determined
based on an assessment of indexing efficiency. Here is an example:

julia> for idx in eachindex(mat)
           print(idx, " ")
       end
1 2 3 4 5 6
julia> for idx in eachindex(@view mat[1:2, 2:3])
           print(idx, " ")
       end
CartesianIndex(1, 1) CartesianIndex(2, 1) CartesianIndex(1, 2) CartesianIndex(2, 2)

One notable feature here is that although mat is a matrix, and supports
passing indices for all dimensions when indexing, like e.g. mat[1, 2] or
mat[CartesianIndex(1, 2)] we see that eachindex(mat) returns integer indices
(because they are faster). Such indices are called linear indices and are
supported by all arrays in Julia. Therefore you can write:

julia> mat[3]
2

This is an important feature to remember so let me stress it again: any array in
Julia can be indexed using a single integer index, called linear index.

The second, somewhat surprising, feature to remember, is that you can append as
many trailing 1 when indexing any array as you like, without affecting the
result (such extra 1s are ignored):

julia> mat[2, 2]
5

julia> mat[2, 2, 1, 1, 1]
5

Here you can find more explanations about this topic.

Broadcasting

Broadcasting is a convenient way of performing operations on arrays. In
particular it allows for combining arrays of different sizes in a single
operation. The rule is the following: when you operate on two arrays that do not
have matching dimensions then expand dimensions of length 1 to the required
length by repeating the values as needed. Here is a quick example:

julia> ["a" "b"] .^ [1, 2, 3]
3×2 Matrix{String}:
 "a"    "b"
 "aa"   "bb"
 "aaa"  "bbb"

We take a matrix ["a" "b"] having one row and two columns and a vector
having one column and three rows. As you can see single row of a matrix
gets repeated three times. Similarly single column of a vector gets repeated
two times. In this way the dimensions of both objects match and we can perform
the computation. Notably, our [1, 2, 3] vector has only one dimension, so the
column dimension was added to it by broadcasting and set to 1 (as you likely
remember when we discussed array indexing you were allowed to put trailing 1s
without affecting the result).

Broadcasting in Julia is performed by using a ., you can read more about it
here. This is a very convenient syntax. For this reason people started
to use it in many contexts. Because of this popularity one issue immediately
surfaced that broadcasting is useful also in cases when they are not working
with arrays. Here is an example:

julia> ["a"] .^ [1, 2, 3]
3-element Vector{String}:
 "a"
 "aa"
 "aaa"

While this works writing ["a"] is not very convenient. Therefore, broadcasting
was extended to allow for non-arrays to take part in operations. You can thus
write:

julia> "a" .^ [1, 2, 3]
3-element Vector{String}:
 "a"
 "aa"
 "aaa"

Where "a" pretends to be an array having length 1 in all dimensions that
are defined by the other container. This pretending is implemented in
cases where it is useful to think of a given value as as an array. A special,
and most common, case is when the value pretends to be a scalar. The easiest
way to think about a scalar is that it has 0 dimensions (so effectively in all
dimensions it can be treated as having length 1 and be appropriately
expanded).

Sometimes you want a value to be treated as a scalar even if it is not a scalar.
In such case you can wrap it with Ref. Ref creates a 0-dimensional object
that behaves like a scalar. Here is an example:

julia> x = [1, 2, 3]
3-element Vector{Int64}:
 1
 2
 3

julia> r = Ref(x)
Base.RefValue{Vector{Int64}}([1, 2, 3])

julia> size(r)
()

julia> r[]
3-element Vector{Int64}:
 1
 2
 3

Note that we could extract the value storing in r by indexing without passing
any index (because it is 0-dimensional). Now let us have a look when this is
useful:

julia> [1, 2, 3] .∈ [1, 3, 4]
3-element BitVector:
 1
 0
 0

julia> [1, 2, 3] .∈ Ref([1, 3, 4])
3-element BitVector:
 1
 0
 1

The first operation was not very useful. It was equivalent to writing
[1 ∈ 1, 2 ∈ 3, 3 in ∈ 4], probably not what we wanted. In the second operation
we protected the second vector with Ref so that it was used as a whole
for every element of [1, 2, 3] vector.

Using Ref is also needed when some object does not have a default behavior
that it pretends to be an array implemented. A common example are standard
dictionaries:

julia> haskey.(d, [1, 2, 5])
ERROR: ArgumentError: broadcasting over dictionaries and `NamedTuple`s is reserved

julia> haskey.(Ref(d), [1, 2, 5])
3-element BitVector:
 1
 1
 0

Now you should have an understanding of iteration, indexing (including array
indexing), and broadcasting.

Quiz: problems

  1. What is the order of iteration of arrays?
  2. Are tuples broadcastable?
  3. Are numbers iterable and indexable?
  4. Are atomic numbers iterable, indexable, or broadcastable?
  5. Is Set iterable, indexable, or broadcastable?
  6. Is Dict iterable, indexable, or broadcastable?
  7. Is named tuple iterable, indexable, or broadcastable?
  8. What is special about string iteration, indexing, and broadcasting?
  9. Is Pair iterable, indexable, or broadcastable?
  10. Is pairs(["a", "b", "c"]) iterable, indexable, or broadcastable?
  11. Bonus question: is DataFrame iterable, indexable, or broadcastable? (it is
    the only type not from Base Julia in the list, but I could not resist the
    temptation to include it)

Quiz: answers

1. What is the order of iteration of arrays?

With arrays, you need to remember that they are iterated in column major order
(just as in linear indexing we discussed):

julia> x = [1 2; 3 4]
2×2 Matrix{Int64}:
 1  2
 3  4

julia> for v in x
           print(v, " ")
       end
1 3 2 4

2. Are tuples broadcastable?

Yes, you can use tuples in broadcasting. They are handled in the same way
as vectors. The only difference is that if you only broadcast a tuple you
get a tupe as a result, while, if you mix tuples with arrays of dimension
at least 1 you get an array:

julia> 1 .+ (1, 2, 3)
(2, 3, 4)

julia> fill(1) .+ (1, 2, 3) # fill(1) produces 0-dimensional array
(2, 3, 4)

julia> [1] .+ (1, 2, 3) # here [1] is 1-dimensional array
3-element Vector{Int64}:
 2
 3
 4

3. Are numbers iterable and indexable?

Yes they are. They are treated as 0-dimensional containers.

julia> x = 1
1

julia> for v in x
           println(v)
       end
1

julia> x[]
1

julia> x[1, 1, 1]
1

julia> size(x)
()

julia> axes(x)
()

4. Are atomic numbers iterable, indexable, or broadcastable?

No. They only support indexing with no arguments passed:

julia> x = Threads.Atomic{Int}(1)
Base.Threads.Atomic{Int64}(1)

julia> for v in x
           println(v)
       end
ERROR: MethodError: no method matching iterate(::Base.Threads.Atomic{Int64})

julia> x[]
1

julia> x[1]
ERROR: MethodError: no method matching getindex(::Base.Threads.Atomic{Int64}, ::Int64)

julia> x .+ 1
ERROR: MethodError: no method matching length(::Base.Threads.Atomic{Int64})

5. Is Set iterable, indexable, or broadcastable?

It is iterable and broadcastable, but not indexable:

julia> s = Set([1, 2, 3])
Set{Int64} with 3 elements:
  2
  3
  1

julia> for v in s
           println(v)
       end
2
3
1

julia> s .+ 1
3-element Vector{Int64}:
 3
 4
 2

Note that when you iterate or broadcast Set the order of returned values
is undefined. It is a common pitfall when doing broadcasting:

julia> s
Set{Int64} with 3 elements:
  2
  3
  1

julia> [1, 2, 3] .∈ s # note the iteration order of s
3-element BitVector:
 0
 0
 0

julia> [1, 2, 3] .∈ Ref(s) # this is what you really wanted
3-element BitVector:
 1
 1
 1

6. Is Dict iterable, indexable, or broadcastable?

It is iterable, and partially indexable, but not broadcastable:

julia> d = Dict(v => '0'+v for v in s)
Dict{Int64, Char} with 4 entries:
  4 => '4'
  2 => '2'
  3 => '3'
  1 => '1'

julia> foreach(println, d)
4 => '4'
2 => '2'
3 => '3'
1 => '1'

julia> foreach(show, keys(d))
4231

julia> foreach(show, values(d))
'4''2''3''1'

For dictionaries, a basic thing to remember is that iterating them returns
key-value Pairs. If you want only keys, use keys to create an appropriate
iterator. Similarly to get values use the values function. Also the order of
iteration of a standard Dict is undefined.

julia> d[1]
'1': ASCII/Unicode U+0031 (category Nd: Number, decimal digit)

They are partially indexable, since they do not support firstindex and
lastindex methods.

Finally they are not broadcastable:

julia> d .+ 1
ERROR: ArgumentError: broadcasting over dictionaries and `NamedTuple`s is reserved

7. Is named tuple iterable, indexable, or broadcastable?

They are iterable, indexable, but not broadcastable:

julia> nt = (a=1, b=2, c=3)
(a = 1, b = 2, c = 3)

julia> foreach(println, nt)
1
2
3

julia> foreach(println, pairs(nt))
:a => 1
:b => 2
:c => 3

julia> nt[end]
3

julia> nt .+ 1
ERROR: ArgumentError: broadcasting over dictionaries and `NamedTuple`s is reserved

Note that if you want to get iterate key-value in a named tuple you need to
use the pairs function.

8. What is special about string iteration, indexing, and broadcasting?

When you iterate strings you get characters. String indexing is supported, but
does not have to be continuous. Instead they support byte-indexing, as I have
explained in detail in this post. In broadcasting they behave like a
scalar:

julia> str = "1₂3"
"1₂3"

julia> foreach(display, str)
'1': ASCII/Unicode U+0031 (category Nd: Number, decimal digit)
'₂': Unicode U+2082 (category No: Number, other)
'3': ASCII/Unicode U+0033 (category Nd: Number, decimal digit)

julia> collect(eachindex(str))
3-element Vector{Int64}:
 1
 2
 5

julia> str[2]
'₂': Unicode U+2082 (category No: Number, other)

julia> str[5]
'3': ASCII/Unicode U+0033 (category Nd: Number, decimal digit)

julia> string.(["a", "b"], str)
2-element Vector{String}:
 "a1₂3"
 "b1₂3"

Notice, that character '₂' has index 2, but the next valid index is 5,
because '₂' occupies three bytes.

9. Is Pair iterable, indexable, or broadcastable?

It is iterable and indexable. In broadcasting it is treated as a scalar:

julia> p = :a => sin
:a => sin

julia> foreach(display, p)
:a
sin (generic function with 14 methods)

julia> p[1]
:a

julia> p[end]
sin (generic function with 14 methods)

julia> tuple.(p, (1, 2))
((:a => sin, 1), (:a => sin, 2))

10. Is pairs(["a", "b", "c"]) iterable, indexable, or broadcastable?

It is iterable and partially indexable, but not broadcastable, although it
has length, size, and axes:

julia> pv = pairs(["a", "b", "c"])
pairs(::Vector{String})(...):
  1 => "a"
  2 => "b"
  3 => "c"

julia> foreach(println, pv)
1 => "a"
2 => "b"
3 => "c"

julia> pv[1]
"a"

julia> pv[3]
"c"

julia> length(pv)
3

julia> size(pv)
(3,)

julia> axes(pv)
(Base.OneTo(3),)

julia> identity.(pv)
ERROR: ArgumentError: broadcasting over dictionaries and `NamedTuple`s is reserved

11. Is DataFrame iterable, indexable, or broadcastable?

Data frame is indexable (but requires to always pass two dimensional index)
and broadcastable, but not iterable. To iterate it
you need to choose if you want to iterate rows or columns:

julia> using DataFrames

julia> df = DataFrame(a=1:3, b=11:13)
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1     11
   2 │     2     12
   3 │     3     13

julia> df[2, 1]
2

julia> df[end, end]
13

julia> df[1]
ERROR: ArgumentError: syntax df[column] is not supported use df[!, column] instead

julia> df .^ 2
3×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1    121
   2 │     4    144
   3 │     9    169

julia> foreach(println, df)
ERROR: AbstractDataFrame is not iterable. Use eachrow(df) to get a row iterator or eachcol(df) to get a column iterator

julia> foreach(println, eachrow(df))
DataFrameRow
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1     11
DataFrameRow
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   2 │     2     12
DataFrameRow
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   3 │     3     13

julia> foreach(println, eachcol(df))
[1, 2, 3]
[11, 12, 13]

Conclusions

I hope you found the examples I included in the post useful and that after
reading it your grasp of working with containers in Julia improved!

Some lesser known properties of the coefficient of determination

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/01/27/r2.html

Introduction

Last week I have posted about p-value in hypothesis testing.
I decided to continue discussion of basic statistics today.

My focus will be on computation of Pearson correlation coefficient
and the coefficient of determination. I want to concentrate on the fact standard
estimators of these quantities are biased.

The post is written under Julia 1.8.5, Distributions 0.25.80,
HypergeometricFunctions 0.3.11, HypothesisTests.jl 0.10.11, and Plots.jl 1.38.2.

Testing for bias

In the post we will assume that we have an n-element sample x and y of
two random variables that are normally distributed.

The standard formula for Pearson correlation coefficient between x and y
is (using Julia as pseudocode):

sum(((xi, yi),) -> (xi-mean(x))*(yi-mean(y)), zip(x, y)) /
sqrt(sum(xi -> (xi-mean(x))^2, x) * sum(yi -> (yi-mean(y))^2, y))

Now assume that we want to build a simple linear regression model
where we explain y by x. In this case the coefficient of determination of
this regression is known to be a square of Pearson correlation coefficient
between y and x.

An interesting feature is that both Pearson correlation coefficient and the
coefficient of determination defined above are biased estimators. Let us check
this using a simple experiment. We will generate data for n=10 and true
Pearson correlation between x and y set to 0.5 (note that then the
true coefficient of determination is equal to 0.25).

julia> using Distributions

julia> using Statistics

julia> using Random

julia> function sim(n, ρ)
           dist = MvNormal([1.0 ρ; ρ 1.0])
           x, y = eachrow(rand(dist, n))
           return cor(x, y)
       end
sim (generic function with 1 method)

julia> Random.seed!(1);

julia> cor_sim = [sim(10, 0.5) for _ in 1:10^6];

julia> r2_sim = cor_sim .^ 2;

Now check that the obtained estimators are biased indeed:

julia> using HypothesisTests

julia> OneSampleTTest(cor_sim, 0.5)
One sample t-test
-----------------
Population details:
    parameter of interest:   Mean
    value under h_0:         0.5
    point estimate:          0.478612
    95% confidence interval: (0.4781, 0.4791)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           <1e-99

Details:
    number of observations:   1000000
    t-statistic:              -80.08122936481526
    degrees of freedom:       999999
    empirical standard error: 0.0002670739739448121


julia> OneSampleTTest(r2_sim, 0.25)
One sample t-test
-----------------
Population details:
    parameter of interest:   Mean
    value under h_0:         0.25
    point estimate:          0.300398
    95% confidence interval: (0.3, 0.3008)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           <1e-99

Details:
    number of observations:   1000000
    t-statistic:              231.0430912829669
    degrees of freedom:       999999
    empirical standard error: 0.00021813356867576183

We see a noticeable bias for both coefficients. An interesting feature
you might notice is that Pearson correlation coefficient is biased down, while
the coefficient of determination is biased up.

Debiasing estimators

Unbiased estimators for our case have been derived over 60 years ago by
Olkin and Pratt. If we denote by r the computed Pearson coefficient
of determination then the unbiased estimates are given using the following
functions:

julia> using HypergeometricFunctions

julia> cor_unbiased(r) = r * _₂F₁(0.5, 0.5, (n-2)/2, 1-r^2)
cor_unbiased (generic function with 1 method)

julia> r2_unbiased(r) = 1 - (1 - r^2) * _₂F₁(1, 1, n/2, 1-r^2) * (n-3)/(n-2)
r2_unbiased (generic function with 1 method)

Let us check them on our data:

julia> OneSampleTTest(cor_unbiased.(cor_sim), 0.5)
One sample t-test
-----------------
Population details:
    parameter of interest:   Mean
    value under h_0:         0.5
    point estimate:          0.499949
    95% confidence interval: (0.4994, 0.5005)

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.8529

Details:
    number of observations:   1000000
    t-statistic:              -0.18540252488698106
    degrees of freedom:       999999
    empirical standard error: 0.0002740923081266803


julia> OneSampleTTest(r2_unbiased.(cor_sim), 0.25)
One sample t-test
-----------------
Population details:
    parameter of interest:   Mean
    value under h_0:         0.25
    point estimate:          0.249932
    95% confidence interval: (0.2494, 0.2505)

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.8054

Details:
    number of observations:   1000000
    t-statistic:              -0.24635861593453004
    degrees of freedom:       999999
    empirical standard error: 0.0002750802967082336

Indeed, debiasing worked.

Difference between estimators

Now check the direction of the difference between estimators as a function of
the observed Pearson correlation coefficient (keeping n=10 fixed as above):

julia> using Plots

julia> plot(r -> r - cor_unbiased(r), xlim=[-1, 1], label="r difference",
            xlab="observed r")

julia> plot!(r -> r^2 - r2_unbiased(r), label="r² difference")

You should get the following plot:

Biases

As you can see, Pearson correlation coefficient is always bumped up for
positive correlations and down for negative correlations in our case.

Interestingly the coefficient of determination is bumped down if the
absolute value of true correlation is less than approximately 0.6565, and if
this absolute value is larger then it will be changed up.

This means that we can expect that for large true Pearson coefficient of
correlation the standard formula for computing the coefficient of determination
will lead to underestimation of the true value on the average.

To check this let us run our simulation again with true r=0.9 and still
keeping n=10:

julia> r2_sim_2 = [sim(10, 0.9)^2 for _ in 1:10^6];

julia> OneSampleTTest(r2_sim_2, 0.81)
One sample t-test
-----------------
Population details:
    parameter of interest:   Mean
    value under h_0:         0.81
    point estimate:          0.796659
    95% confidence interval: (0.7964, 0.7969)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           <1e-99

Details:
    number of observations:   1000000
    t-statistic:              -100.77325436105275
    degrees of freedom:       999999
    empirical standard error: 0.00013238294244740913

julia> OneSampleTTest(r2_unbiased.(sqrt.(r2_sim_2)), 0.81)
One sample t-test
-----------------
Population details:
    parameter of interest:   Mean
    value under h_0:         0.81
    point estimate:          0.810047
    95% confidence interval: (0.8098, 0.8103)

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.7251

Details:
    number of observations:   1000000
    t-statistic:              0.351680959027248
    degrees of freedom:       999999
    empirical standard error: 0.0001342944339289557

Indeed the coefficient of determination is negatively biased in this case
and using Olkin and Pratt method worked again.

So what are the downsides of Olkin and Pratt estimator of the coefficient of
determination? The problem is that it gives negative values for the coefficient
for observed r close to 0. Why is this unavoidable? To see this assume for
a moment that true r=0. In random a sample we will see some positive observed
r^2, just by pure chance. Therefore, to keep the estimator unbiased (note that
its expected value should be 0) we have to allow for its negative values.

Conclusions

I hope you found the presented properties of estimators useful. I think it is
quite interesting that even basic statistical methods have quite complex
properties, that are not obvious initially.

Investigating p-values in hypothesis tests

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/01/20/pvalue.html

Introduction

Today I thought of writing a post covering some topic from statistics.
I decided to discuss some properties of p-value of hypothesis tests.

The post is written under Julia 1.8.5, Distributions.jl 0.25.80,
HypothesisTests.jl 0.10.11, UnicodePlots.jl 3.3.4.

Some theory

Let us assume that we have some hypothesis, call it H0,
that we assume to be true. Now consider that we have some statistic T,
whose distribution we can compute assuming that H0 is true.
For a given sample of data we computed the value of the statistic and denote
this value as t.

We are ready to define p-value, which is the probability of obtaining the
value of the T statistic at least as extreme as the result t actually
observed, under the assumption that H0 is correct.

Let me give a simple example. Assume that we have some data, for which we
assume that it comes from normal distribution with an unknown mean and known
variance equal to v=1. We want to test H0 hypothesis that the mean of the
distribution is equal to 0, against assumption that the mean is not equal
to 0. We check it by taking a n=9 element sample of the data and computing
its mean. Assume that this mean is m=1. In this case the value of the
statistic t is computed using the formula t=m/sqrt(v/n)=1/sqrt(1/9)=3. From
mathematical statistics we know that the distribution of the statistic T in
this case is normal with mean 0 and variance 1. Therefore we can compute that
the probability of seeing the value at least as extreme (in absolute value) as
the observed is 2*(1-F(|t|)) = 2*(1-F(3)) ≈ 0.0027. This number is our
p-value. In this case, since this probability is low we would most likely
conclude that the data does not support our hypothesis that the mean is equal
to 0.

Note that in the formula F is cumulative distribution function of standard
normal distribution, and we take absolute value of t and multiply the weight
of the tail of the distribution by two, as we assume that we equally treat
positive and negative deviations from 0 as extreme (this is called a two-sided
p-value).

We could have obtained this result conveniently in Julia using the
OneSampleZTest function from the HypothesisTests.jl package as follows:

julia> using HypothesisTests

julia> m = 1.0
1.0

julia> v = 1.0
1.0

julia> n = 9
9

julia> OneSampleZTest(m, sqrt(v), n)
One sample z-test
-----------------
Population details:
    parameter of interest:   Mean
    value under h_0:         0
    point estimate:          1.0
    95% confidence interval: (0.3467, 1.653)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           0.0027

Details:
    number of observations:   9
    z-statistic:              3.0
    population standard error: 0.3333333333333333

Now, a crucial observation about p-value, given the definition we have
presented, is that assuming H0 is true the distribution of p-value should
be uniform on [0, 1] interval.

For example, in the case of our normally distributed variable we claim that
PV = 2*(1-F(|T|)) has uniform distribution on [0, 1] interval.

Let us make a quick computation to verify it is indeed the case. Assume
x is from [0, 1] interval:

Pr(PV ≤ x) = Pr(2*(1-F(|T|)) ≤ x) = Pr(F⁻¹(1-x/2) ≤ |T|)

Now we note that this can be rewritten as:

Pr(F⁻¹(1-x/2) ≤ T) + Pr(T ≤ -F⁻¹(1-x/2))

Note that these two probabilities are equal as normal distribution is symmetric
around 0 so we can simplify it to:

2 * (1 - Pr(T ≤ F⁻¹(1-x/2))) = 2 * (1 - F(F⁻¹(1-x/2))) = 2 * x/2 = x

So in summary we get Pr(PV ≤ x) = x for x from [0, 1] interval, so
in this case indeed p-value has uniform distribution.

Checking the theory using simulation

Above we have claimed that the distribution of p-value for OneSampleZTest
is uniform on [0, 1] interval. Let us check it using simulation. For this
we will generate 9-element sample from standard normal distribution multiple
times, and check if the distribution of p-value.

julia> using Random

julia> using Statistics

julia> using UnicodePlots

julia> pvalues = [pvalue(OneSampleZTest(mean(randn(9)), 1.0, 9))
                  for _ in 1:10^6]
1000000-element Vector{Float64}:
 0.1327790631059239
 0.37168172133050215
 0.9877013474897027
 0.24068935009108738
 0.24685639345198623
 0.19113260388855702
 0.1806265110576506
 0.3482022041517382
 ⋮
 0.33776890104210877
 0.141776876179674
 0.38354782746729116
 0.9866784609861702
 0.9219832724792354
 0.9768708734987599
 0.44377261216567804

julia> histogram(pvalues)
                ┌                                        ┐
   [0.0 , 0.05) ┤███████████████████████████████▋ 49 921
   [0.05, 0.1 ) ┤███████████████████████████████▊ 50 228
   [0.1 , 0.15) ┤███████████████████████████████▋ 49 908
   [0.15, 0.2 ) ┤████████████████████████████████  50 451
   [0.2 , 0.25) ┤███████████████████████████████▌ 49 655
   [0.25, 0.3 ) ┤███████████████████████████████▋ 49 907
   [0.3 , 0.35) ┤███████████████████████████████▋ 49 867
   [0.35, 0.4 ) ┤███████████████████████████████▊ 50 303
   [0.4 , 0.45) ┤███████████████████████████████▋ 49 827
   [0.45, 0.5 ) ┤███████████████████████████████▋ 49 980
   [0.5 , 0.55) ┤███████████████████████████████▋ 49 855
   [0.55, 0.6 ) ┤███████████████████████████████▌ 49 569
   [0.6 , 0.65) ┤███████████████████████████████▋ 49 798
   [0.65, 0.7 ) ┤███████████████████████████████▊ 50 143
   [0.7 , 0.75) ┤███████████████████████████████▉ 50 406
   [0.75, 0.8 ) ┤███████████████████████████████▋ 49 923
   [0.8 , 0.85) ┤███████████████████████████████▋ 50 048
   [0.85, 0.9 ) ┤███████████████████████████████▋ 49 835
   [0.9 , 0.95) ┤███████████████████████████████▊ 50 249
   [0.95, 1.0 ) ┤███████████████████████████████▊ 50 127
                └                                        ┘
                                 Frequency

Let us additionally check what happens if our data does not meet the assumptions
of our H0:

julia> pvalues = [pvalue(OneSampleZTest(mean(randn(9)) .+ 1, 1.0, 9))
                  for _ in 1:10^6]
1000000-element Vector{Float64}:
 0.0002078548349486422
 7.323288709478623e-5
 0.4909823219717196
 0.12532336793818022
 0.05461673383130956
 0.001132766345973069
 0.02138767010993988
 0.4385291137658384
 ⋮
 0.06136424417006158
 0.003270513361517393
 0.0007748069139344607
 0.06833756938048707
 8.815311836419556e-5
 4.057178851779267e-5
 0.000801919865655273

julia> histogram(pvalues)
                ┌                                        ┐
   [0.0 , 0.05) ┤███████████████████████████████  850 737
   [0.05, 0.1 ) ┤██▎ 61 462
   [0.1 , 0.15) ┤█▏ 28 540
   [0.15, 0.2 ) ┤▋ 16 534
   [0.2 , 0.25) ┤▍ 10 542
   [0.25, 0.3 ) ┤▍ 7 367
   [0.3 , 0.35) ┤▎ 5 366
   [0.35, 0.4 ) ┤▎ 4 022
   [0.4 , 0.45) ┤▎ 3 058
   [0.45, 0.5 ) ┤▎ 2 329
   [0.5 , 0.55) ┤▏ 1 908
   [0.55, 0.6 ) ┤▏ 1 551
   [0.6 , 0.65) ┤▏ 1 319
   [0.65, 0.7 ) ┤▏ 1 091
   [0.7 , 0.75) ┤▏ 940
   [0.75, 0.8 ) ┤▏ 753
   [0.8 , 0.85) ┤▏ 715
   [0.85, 0.9 ) ┤▏ 619
   [0.9 , 0.95) ┤▏ 601
   [0.95, 1.0 ) ┤▏ 546
                └                                        ┘
                                 Frequency

Indeed, as expected, we see that in this case we are getting low p-value most
of the time.

The fact that p-value has uniform distribution under H0 is a crucial
property. Unfortunately, not all hypothesis tests can ensure this.

Distribution p-value for discrete tests

A typical issue with p-value distribution is encountered when we work with
discrete distributions. Let me give you an example.

Now assume we sample data from a binomial distribution with sample size
n=16 and success probability p=0.5. Let us repeat the simulation that we
performed above using this distribution:

julia> using Distributions

julia> pvalues = [pvalue(BinomialTest(rand(Binomial(16, 0.5)), 16, 0.5))
                  for _ in 1:10^6]
1000000-element Vector{Float64}:
 0.8036193847656249
 0.8036193847656249
 0.8036193847656249
 0.8036193847656249
 0.8036193847656249
 0.45449829101562506
 0.8036193847656249
 0.8036193847656249
 ⋮
 0.8036193847656249
 0.21011352539062514
 0.8036193847656249
 0.21011352539062514
 0.21011352539062514
 0.45449829101562506
 0.8036193847656249

julia> histogram(pvalues)
                ┌                                        ┐
   [0.0 , 0.05) ┤█▊ 21 491
   [0.05, 0.1 ) ┤████▉ 55 699
   [0.1 , 0.15) ┤  0
   [0.15, 0.2 ) ┤  0
   [0.2 , 0.25) ┤███████████▊ 133 700
   [0.25, 0.3 ) ┤  0
   [0.3 , 0.35) ┤  0
   [0.35, 0.4 ) ┤  0
   [0.4 , 0.45) ┤  0
   [0.45, 0.5 ) ┤█████████████████████▋ 244 309
   [0.5 , 0.55) ┤  0
   [0.55, 0.6 ) ┤  0
   [0.6 , 0.65) ┤  0
   [0.65, 0.7 ) ┤  0
   [0.7 , 0.75) ┤  0
   [0.75, 0.8 ) ┤  0
   [0.8 , 0.85) ┤███████████████████████████████  348 485
   [0.85, 0.9 ) ┤  0
   [0.9 , 0.95) ┤  0
   [0.95, 1.0 ) ┤  0
   [1.0 , 1.05) ┤█████████████████▌ 196 316
                └                                        ┘
                                 Frequency

Now we can see that clearly the distribution is not uniform. For example, under
the standard 0.05 cut-off threshold for p-value we would expect to have
50,000 observations in [0.0, 0.05) bin, but we got only 21,491. Is this a
problem? Well, it is. Assume that we generate the data with p=0.55, but assume
the H0 probability is 0.5. We get:

julia> pvalues = [pvalue(BinomialTest(rand(Binomial(16, 0.55)), 16, 0.5))
                  for _ in 1:10^6]
1000000-element Vector{Float64}:
 0.21011352539062514
 0.07681274414062504
 0.8036193847656249
 0.8036193847656249
 1.0
 0.8036193847656249
 0.8036193847656249
 0.45449829101562506
 ⋮
 0.8036193847656249
 0.45449829101562506
 0.8036193847656249
 0.8036193847656249
 0.21011352539062514
 1.0
 1.0

julia> histogram(pvalues)
                ┌                                        ┐
   [0.0 , 0.05) ┤██▉ 31 600
   [0.05, 0.1 ) ┤██████▌ 68 901
   [0.1 , 0.15) ┤  0
   [0.15, 0.2 ) ┤  0
   [0.2 , 0.25) ┤█████████████▊ 145 964
   [0.25, 0.3 ) ┤  0
   [0.3 , 0.35) ┤  0
   [0.35, 0.4 ) ┤  0
   [0.4 , 0.45) ┤  0
   [0.45, 0.5 ) ┤███████████████████████▏ 243 914
   [0.5 , 0.55) ┤  0
   [0.55, 0.6 ) ┤  0
   [0.6 , 0.65) ┤  0
   [0.65, 0.7 ) ┤  0
   [0.7 , 0.75) ┤  0
   [0.75, 0.8 ) ┤  0
   [0.8 , 0.85) ┤███████████████████████████████  328 306
   [0.85, 0.9 ) ┤  0
   [0.9 , 0.95) ┤  0
   [0.95, 1.0 ) ┤  0
   [1.0 , 1.05) ┤█████████████████▎ 181 315
                └                                        ┘
                                 Frequency

So we can see that even if I generate the data from a distribution different
than assumed under H0 you would reject this hypothesis with 0.05 threshold
less frequently than even what you would expect under H0.

Conclusions

The bias in distribution of p-value can have significant consequences when
making statistical inference in practice. For example, in a paper I co-authored
some time ago, we show the property of p-value for various statistical tests
used in VaR backtesting (a very common test in financial applications). If you
would like to learn the details you can find it here.