Author Archives: Andrew Collier

#MonthOfJulia Day 28: Hypothesis Tests

It’s all very well generating myriad statistics characterising your data, but how do you know whether those statistics are telling you something interesting? That’s where hypothesis tests come in. To that end, we’ll be looking at the HypothesisTests package today.

The first (small) hurdle is loading the package.

julia> using HypothesisTests

That wasn’t too bad. Next we’ll assemble some synthetic data.

julia> using Distributions
julia> srand(357)
julia> x1 = rand(Normal(), 1000);
julia> x2 = rand(Normal(0.5, 1), 1000);
julia> x3 = rand(Binomial(100, 0.25), 1000);   # 25% success rate on samples of size 100
julia> x4 = rand(Binomial(50, 0.50), 1000);    # 50% success rate on samples of size 50
julia> x5 = rand(Bernoulli(0.25), 100) .== 1;

We’ll apply a one sample t-test to x1 and x2. The output below indicates that x2 has a mean which differs significantly from zero while x1 does not. This is consistent with our expectations based on the way that these data were generated. I’m impressed by the level of detail in the output from OneSampleTTest(): different aspects of the test are neatly broken down into sections (population, test summary and details) and there is automated high level interpretation of the test results.

julia> t1 = OneSampleTTest(x1)
One sample t-test
-----------------
Population details:
    parameter of interest:   Mean
    value under h_0:         0
    point estimate:          -0.013027816861268473
    95% confidence interval: (-0.07587776077157478,0.04982212704903784)

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.6842692696393744 (not significant)

Details:
    number of observations:   1000
    t-statistic:              -0.40676289562651996
    degrees of freedom:       999
    empirical standard error: 0.03202803648352013
julia> t2 = OneSampleTTest(x2)
One sample t-test
-----------------
Population details:
    parameter of interest:   Mean
    value under h_0:         0
    point estimate:          0.5078522467069418
    95% confidence interval: (0.44682036100064954,0.5688841324132342)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           2.6256160116367554e-53 (extremely significant)

Details:
    number of observations:   1000
    t-statistic:              16.328833826939398
    degrees of freedom:       999
    empirical standard error: 0.031101562554276502

Using pvalue() we can further interrogate the p-values generated by these tests. The values reported in the output above are for the two-sided test, but we can look specifically at values associated with either the left- or right tails of the distribution. This makes the outcome of the test a lot more specific.

julia> pvalue(t1)
0.6842692696393744
julia> pvalue(t2)
2.6256160116367554e-53
julia> pvalue(t2, tail = :left)            # Not significant.
1.0
julia> pvalue(t2, tail = :right)           # Very significant indeed!
1.3128080058183777e-53
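
These tail values hang together in a simple way: for a symmetric test statistic the two-sided p-value is just twice the smaller of the two tails (notice that the right-tail value above is exactly half the two-sided one). A quick sketch with made-up data:

```julia
using HypothesisTests

x = [2.1, 1.8, 2.5, 2.2, 1.9, 2.4, 2.0, 2.3]   # hypothetical sample
t = OneSampleTTest(x)

p_two   = pvalue(t)
p_left  = pvalue(t, tail = :left)
p_right = pvalue(t, tail = :right)

# For this symmetric test the two-sided p-value is twice the smaller tail.
println(isapprox(p_two, 2 * min(p_left, p_right)))   # true
```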

The associated confidence intervals are also readily accessible. We can choose between two-sided or left/right one-sided intervals as well as change the significance level.

julia> ci(t2, tail = :both)                # Two-sided 95% confidence interval by default
(0.44682036100064954,0.5688841324132342)
julia> ci(t2, tail = :left)                # One-sided 95% confidence interval (left)
(-Inf,0.5590572480083876)
julia> ci(t2, 0.01, tail = :right)         # One-sided 99% confidence interval (right)
(0.43538291818831604,Inf)

As a second (and final) example we’ll look at BinomialTest(). There are various ways to call this function. First, without looking at any particular data, we’ll check whether 25 successes from 100 samples is inconsistent with a 25% success rate (obviously not and, as a result, we fail to reject this hypothesis).

julia> BinomialTest(25, 100, 0.25)
Binomial test
-------------
Population details:
    parameter of interest:   Probability of success
    value under h_0:         0.25
    point estimate:          0.25
    95% confidence interval: (0.16877973809934183,0.3465524957588082)

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           1.0 (not significant)

Details:
    number of observations: 100
    number of successes:    25

Next we’ll see whether the Bernoulli samples in x5 provide evidence against an assumed 50% success rate (given the way that x5 was generated, we are not surprised to find an infinitesimal p-value, and the hypothesis is soundly rejected).

julia> BinomialTest(x5, 0.5)
Binomial test
-------------
Population details:
    parameter of interest:   Probability of success
    value under h_0:         0.5
    point estimate:          0.18
    95% confidence interval: (0.11031122915326055,0.26947708596681197)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           6.147806615048005e-11 (extremely significant)

Details:
    number of observations: 100
    number of successes:    18

There are a number of other tests available in this package, including a range of non-parametric tests which I have not even mentioned above. Certainly HypothesisTests should cover most of the bases for statistical inference. For more information, read the extensive documentation. Check out the sample code on github for further examples.
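
To give a small taste of the non-parametric side, here’s a hedged sketch using MannWhitneyUTest(), which compares two samples without assuming normality. The data below are invented purely for illustration.

```julia
using HypothesisTests

a = [1.2, 1.5, 1.7, 1.8, 2.1, 2.3, 2.4, 2.6]   # hypothetical samples
b = [2.7, 2.8, 2.9, 3.1, 3.2, 3.5, 3.6, 3.8]

m = MannWhitneyUTest(a, b)
# The samples are completely separated, so the p-value is tiny.
println(pvalue(m))
```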

The post #MonthOfJulia Day 28: Hypothesis Tests appeared first on Exegetic Analytics.

#MonthOfJulia Day 27: Distributions

Today I’m looking at the Distributions package. Let’s get things rolling by loading it up.

julia> using Distributions

There’s some overlap between the functionality in Distributions and what we saw yesterday in the StatsFuns package. So, instead of looking at functions to evaluate various aspects of PDFs and CDFs, we’ll focus on sampling from distributions and calculating summary statistics.

Julia has native support for sampling from a uniform distribution. We’ve seen this before, but here’s a reminder.

julia> srand(359)                          # Set random number seed.
julia> rand()                              # Random number on [0, 1)
0.4770241944535658

What if you need to generate samples from a more exotic distribution? The Normal distribution, although not particularly exotic, seems like a natural place to start. The Distributions package exposes a type for each supported distribution. For the Normal distribution the type is appropriately named Normal. It’s derived from Distribution with characteristics Univariate and Continuous.

julia> super(Normal)
Distribution{Univariate,Continuous}
julia> names(Normal)
2-element Array{Symbol,1}:
 :μ
 :σ

The constructor accepts two parameters: mean (μ) and standard deviation (σ). We’ll instantiate a Normal object with mean 1 and standard deviation 3.

julia> d1 = Normal(1.0, 3.0)
Normal(μ=1.0, σ=3.0)
julia> params(d1)
(1.0,3.0)
julia> d1.μ
1.0
julia> d1.σ
3.0
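
Before sampling, it’s worth noting that summary statistics and density functions can be evaluated directly on the distribution object itself, no samples required. A quick sketch (approximate values in the comments):

```julia
using Distributions

d = Normal(1.0, 3.0)

mean(d), std(d), var(d)      # (1.0, 3.0, 9.0)
quantile(d, 0.975)           # ≈ 6.88, i.e. mean + 1.96σ
pdf(d, 1.0)                  # density at the mean, ≈ 0.1330
cdf(d, 1.0)                  # 0.5: half the mass lies below the mean
cdf(d, quantile(d, 0.25))    # ≈ 0.25, since cdf and quantile are inverses
```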

Thanks to the wonders of multiple dispatch we are then able to generate samples from this object with the rand() method.

julia> x = rand(d1, 1000);

We’ll use Gadfly to generate a histogram to validate that the samples are reasonable. They look pretty good.
[Figure: histogram of the 1000 samples drawn from Normal(1, 3)]

There are functions like pdf(), cdf(), logpdf() and logcdf() which allow the density function of our distribution object to be evaluated at particular points. Check those out. We’re moving on to truncating a portion of the distribution, leaving a Truncated distribution object.

julia> d2 = Truncated(d1, -4.0, 6.0);

Again we can use Gadfly to get an idea of what this looks like. This time we’ll plot the actual PDF rather than a histogram of samples.
[Figure: PDF of Normal(1, 3) truncated to the interval [-4, 6]]

The Distributions package implements an extensive selection of other continuous distributions, like Exponential, Poisson, Gamma and Weibull. The basic interface for each of these is consistent with what we’ve seen for Normal above, although there are some methods which are specific to some distributions.

Let’s look at a discrete distribution, using a Bernoulli distribution with success rate of 25% as an example.

julia> d4 = Bernoulli(0.25)
Bernoulli(p=0.25)
julia> rand(d4, 10)
10-element Array{Int64,1}:
 0
 1
 0
 1
 1
 0
 0
 0
 1
 0

What about a Binomial distribution? Suppose that we have a success rate of 25% per trial and want to sample the number of successes in a batch of 100 trials.

julia> d5 = Binomial(100, 0.25)
Binomial(n=100, p=0.25)
julia> rand(d5, 10)
10-element Array{Int64,1}:
 22
 21
 30
 23
 28
 25
 26
 26
 28
 21

Finally let’s look at an example of fitting a distribution to a collection of samples using Maximum Likelihood.

julia> x = rand(d1, 10000);
julia> fit(Normal, x)
Normal(μ=1.0015796782177036, σ=3.033914550184868)

Yup, those values are in pretty good agreement with the mean and standard deviation we specified for our Normal object originally.
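
The same fit() interface extends to discrete distributions. As a sketch, estimating a Bernoulli success rate from simulated draws:

```julia
using Distributions

samples = rand(Bernoulli(0.25), 10000)
b = fit(Bernoulli, samples)    # p̂ should land close to 0.25
```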

That’s it for today. There’s more to the Distributions package though. Check out my github repository to see other examples which didn’t make it into today’s post.

The post #MonthOfJulia Day 27: Distributions appeared first on Exegetic Analytics.

#MonthOfJulia Day 26: Statistics

JuliaStats is a meta-project which consolidates various packages related to statistics and machine learning in Julia. Well worth taking a look if you plan on working in this domain.

julia> x = rand(10);
julia> mean(x)
0.5287191472784906
julia> std(x)
0.2885446536178459

Julia already has some builtin support for statistical operations (as seen above), so additional packages are not strictly necessary. However, they do increase the scope and ease of possible operations, as we’ll see below. Let’s kick off by loading all the packages that we’ll be looking at today.

julia> using StatsBase, StatsFuns, StreamStats

StatsBase

The documentation for StatsBase can be found here. As the package name implies, it provides support for basic statistical operations in Julia.

High level summary statistics are generated by summarystats().

julia> summarystats(x)
Summary Stats:
Mean:         0.528719
Minimum:      0.064803
1st Quartile: 0.317819
Median:       0.529662
3rd Quartile: 0.649787
Maximum:      0.974760

Weighted versions of the mean, variance and standard deviation are implemented. There are also geometric and harmonic means.

julia> w = WeightVec(rand(1:10, 10));      # A weight vector.
julia> mean(x, w)                          # Weighted mean.
0.48819933297961043
julia> var(x, w)                           # Weighted variance.
0.08303843715334995
julia> std(x, w)                           # Weighted standard deviation.
0.2881639067498738
julia> skewness(x, w)
0.11688162715805048
julia> kurtosis(x, w)
-0.9210456851144664
julia> mean_and_std(x, w)
(0.48819933297961043,0.2881639067498738)

There’s a weighted median as well as functions for calculating quantiles.

julia> median(x)                           # Median.
0.5296622773635412
julia> median(x, w)                        # Weighted median.
0.5729104703595038
julia> quantile(x)
5-element Array{Float64,1}:
 0.0648032
 0.317819 
 0.529662 
 0.649787 
 0.97476  
julia> nquantile(x, 8)
9-element Array{Float64,1}:
 0.0648032
 0.256172 
 0.317819 
 0.465001 
 0.529662 
 0.60472  
 0.649787 
 0.893513 
 0.97476  
julia> iqr(x)                              # Inter-quartile range.
0.3319677541313941
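
For the record, iqr() is simply the gap between the first and third quartiles, which is easy to verify:

```julia
using StatsBase, Statistics

x = rand(100)
# The inter-quartile range equals Q3 - Q1.
println(isapprox(iqr(x), quantile(x, 0.75) - quantile(x, 0.25)))   # true
```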

Sampling from a population is also catered for, with a range of algorithms which can be applied to the sampling procedure.

julia> sample(['a':'z'], 5)                # Sampling (with replacement).
5-element Array{Char,1}:
 'w'
 'x'
 'e'
 'e'
 'o'
julia> wsample(['T', 'F'], [5, 1], 10)     # Weighted sampling (with replacement).
10-element Array{Char,1}:
 'F'
 'T'
 'T'
 'T'
 'F'
 'T'
 'T'
 'T'
 'T'
 'T'

There’s also functionality for empirical estimation of distributions from histograms and a range of other interesting and useful goodies.
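
For instance, ecdf() converts a sample into a callable empirical CDF. A small sketch using standard normal draws, where the value at zero should land near 0.5:

```julia
using StatsBase

x = randn(1000)        # standard normal samples
f = ecdf(x)            # f is a callable empirical CDF
f(0.0)                 # ≈ 0.5: about half the samples lie below zero
```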

StatsFuns

The StatsFuns package provides constants and functions for statistical computing. The constants are by no means essential but certainly very handy. Take, for example, twoπ and sqrt2.

There are some mildly exotic mathematical functions available like logistic, logit and softmax.

julia> logistic(-5)
0.0066928509242848554
julia> logistic(5)
0.9933071490757153
julia> logit(0.25)
-1.0986122886681098
julia> logit(0.75)
1.0986122886681096
julia> softmax([1, 3, 2, 5, 3])
5-element Array{Float64,1}:
 0.0136809
 0.101089 
 0.0371886
 0.746952 
 0.101089 
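
Two quick sanity checks on these functions: softmax() returns a probability vector (non-negative entries summing to one), and logit() inverts logistic():

```julia
using StatsFuns

p = softmax([1.0, 3.0, 2.0, 5.0, 3.0])
println(isapprox(sum(p), 1.0))                  # softmax output sums to one
println(isapprox(logit(logistic(0.7)), 0.7))    # logit undoes logistic
```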

Finally there is a suite of functions relating to various statistical distributions. The functions for the Normal distribution are illustrated below, but there are also functions for the Beta and Binomial distributions, the Gamma and Hypergeometric distributions and many others. The function naming convention is consistent across all distributions.

julia> normpdf(0);                         # PDF
julia> normlogpdf(0);                      # log PDF
julia> normcdf(0);                         # CDF
julia> normccdf(0);                        # Complementary CDF
julia> normlogcdf(0);                      # log CDF
julia> normlogccdf(0);                     # log Complementary CDF
julia> norminvcdf(0.5);                    # inverse-CDF
julia> norminvccdf(0.99);                  # inverse-Complementary CDF
julia> norminvlogcdf(-0.693147180559945);  # inverse-log CDF
julia> norminvlogccdf(-0.693147180559945); # inverse-log Complementary CDF

StreamStats

Finally, the StreamStats package supports calculating online statistics for a stream of data which is being continuously updated.

julia> average = StreamStats.Mean()
Online Mean
 * Mean: 0.000000
 * N:    0
julia> variance = StreamStats.Var()
Online Variance
 * Variance: NaN
 * N:        0
julia> for x in rand(10)
       	update!(average, x)
       	update!(variance, x)
       	@printf("x = %f: mean = %.3f | variance = %.3f\n", x, state(average),
                                                               state(variance))
       end
x = 0.928564: mean = 0.929 | variance = NaN
x = 0.087779: mean = 0.508 | variance = 0.353
x = 0.253300: mean = 0.423 | variance = 0.198
x = 0.778306: mean = 0.512 | variance = 0.164
x = 0.566764: mean = 0.523 | variance = 0.123
x = 0.812629: mean = 0.571 | variance = 0.113
x = 0.760074: mean = 0.598 | variance = 0.099
x = 0.328495: mean = 0.564 | variance = 0.094
x = 0.303542: mean = 0.535 | variance = 0.090
x = 0.492716: mean = 0.531 | variance = 0.080

In addition to the mean and variance illustrated above, the package also supports online versions of min() and max(), and can be used to generate incremental confidence intervals for Bernoulli and Poisson processes.
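
Under the hood, an online mean and variance reduce to a constant-memory update rule. Below is a hand-rolled sketch of the idea using Welford’s algorithm; StreamStats’ actual implementation may differ. (Note that `mutable struct` is newer Julia syntax; in the Julia of this post it would have been `type`.)

```julia
# A minimal running mean/variance, updated one observation at a time.
mutable struct RunningStats
    n::Int
    mean::Float64
    m2::Float64   # sum of squared deviations from the current mean
end
RunningStats() = RunningStats(0, 0.0, 0.0)

function add!(s::RunningStats, x::Real)
    s.n += 1
    delta = x - s.mean
    s.mean += delta / s.n
    s.m2 += delta * (x - s.mean)
    return s
end

variance(s::RunningStats) = s.n > 1 ? s.m2 / (s.n - 1) : NaN

s = RunningStats()
for x in [1.0, 2.0, 3.0, 4.0]
    add!(s, x)
end
s.mean        # 2.5
variance(s)   # ≈ 1.667, the sample variance of [1, 2, 3, 4]
```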

That’s it for today. Check out the full code on github and watch the video below.

The post #MonthOfJulia Day 26: Statistics appeared first on Exegetic Analytics.