Author Archives: Bradley Setzler

Parallel Processing in Julia: Bootstrapping the MLE

By: Bradley Setzler

Re-posted from: https://juliaeconomics.com/2014/06/18/parallel-processing-in-julia-bootstrapping-the-mle/

* Scripts that reproduce the results of this tutorial in Julia are located here and here.

To review our extended example so far: we simulated the OLS data-generating process, solved for the MLE of the OLS parameters, bootstrapped the MLE, and computed various non-parametric p-values.

The only part of that work likely to slow down your computer was forming the 10,000 bootstrap samples (the matrix samples). In this post, we will see how to speed up the bootstrapping by collecting the bootstrap samples on more than one processor.


The Estimator that We Wish to Bootstrap

We are interested in the following MLE estimator, adapted from our previous work. First, we import the needed packages and generate the data:

# Prepare
srand(2)
using Distributions
using Optim
using DataFrames
N=1000
K=3

# Generate variables
genX = MvNormal(eye(K))
X = rand(genX,N)
X = X'
X_noconstant = X
constant = ones(N)
X = [constant X]
genEpsilon = Normal(0, 1)
epsilon = rand(genEpsilon,N)
trueParams = [0.01,0.05,0.05,0.07]
Y = X*trueParams + epsilon

# Export data
data = DataFrame(hcat(Y,X))
names!(data,[:Y,:one,:X1,:X2,:X3])
writetable("data.csv",data)

Next, we define the log-likelihood of OLS (returned with a negative sign, since the optimizer minimizes):

function loglike(rho,y,x)
    beta = rho[1:4]
    sigma2 = exp(rho[5])+eps(Float64)   # exponentiate so that sigma2 > 0
    residual = y-x*beta
    dist = Normal(0, sqrt(sigma2))
    contributions = logpdf(dist,residual)
    loglikelihood = sum(contributions)
    return -loglikelihood               # negated because the optimizer minimizes
end
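
The Results section below refers to the MLE point estimates. For reference, here is a minimal sketch of how they can be obtained, along the lines of the earlier tutorial; the starting values in params0 are an arbitrary guess (not taken from the original post), and params0 is also used inside the bootstrap function below:

# Sketch: compute the MLE on the full sample (era Optim.jl syntax, matching
# the bootstrap function below). params0 is an arbitrary starting guess.
params0 = [0.1,0.2,0.3,0.4,0.5]
wrapLoglikeFull(rho) = loglike(rho,Y,X)
MLE = optimize(wrapLoglikeFull,params0,method=:cg).minimum
MLE[5] = exp(MLE[5])    # report sigma^2 rather than log(sigma^2)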

Finally, we define the bootstrap function, which includes within it a wrapper for the likelihood:

function bootstrapSamples(B)
    println("hi")
    samples = zeros(B,5)
    for b=1:B
        theIndex = sample(1:N,N)        # resample N observations with replacement
        x = X[theIndex,:]
        y = Y[theIndex,:]
        function wrapLoglike(rho)
            return loglike(rho,y,x)
        end
        samples[b,:] = optimize(wrapLoglike,params0,method=:cg).minimum
    end
    samples[:,5] = exp(samples[:,5])    # transform log(sigma^2) back to sigma^2
    println("bye")
    return samples
end

The reason for the cute println statements will become apparent later. Save all of this in a file called “bootstrapFunction.jl”. Finally, still within that file, we import the data created above so that it is available to every processor:

data = readtable("data.csv")
N = size(data,1)
Y = array(data[:Y])
X = array(data[[:one,:X1,:X2,:X3]])

Parallel Bootstrapping

We now have the function bootstrapSamples, saved in the file “bootstrapFunction.jl”, that we wish to run in parallel. The idea is that, instead of running the MLE B times on one processor, we run it B/4 times on each of four processors and then collect the results back on the first processor. Unfortunately, Julia makes it very difficult to parallelize functions within the same file in which they are defined; the user must explicitly tell Julia about every parameter and function that is needed on each processor, which is a frustrating process of copy-pasting the @everywhere macro all around your code.
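
To see what that would involve, here is a toy sketch (not from the original post) of the single-file approach, in which every definition the workers need must be prefixed with @everywhere:

# Toy sketch of the single-file alternative: every using statement and every
# definition must be repeated on each worker with @everywhere.
addprocs(4)                      # launch four worker processes
@everywhere using Distributions
@everywhere using Optim
@everywhere N = 1000
@everywhere f(x) = x.^2          # stand-in for loglike, bootstrapSamples, etc.
pmap(f,[1,2,3,4])                # now f can be mapped across the workers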

Instead, we separate our work into two files, the first of which you have already created. Create a second file in the same location on your computer as “bootstrapFunction.jl”. In the second file, write the following commands:

addprocs(4)
require("bootstrapFunction.jl")    # load the functions and data on every processor
B=10000
b=2500
samples_pmap = pmap(bootstrapSamples,[b,b,b,b])    # 2,500 bootstrap samples per processor
samples = vcat(samples_pmap[1],samples_pmap[2],samples_pmap[3],samples_pmap[4])

The command addprocs(4) adds four worker processes to your Julia session. The command require is the key to getting this right: it ensures that all of the functions and parameters contained in “bootstrapFunction.jl” are available to each processor. pmap is the function that actually applies bootstrapSamples to each value in the list [b,b,b,b]. I set the list to [b,b,b,b], where b=2500, so that bootstrapSamples collects 2,500 bootstrap samples on each processor. The resulting matrices from all four processors are stored in the length-4 vector called samples_pmap. Finally, we vertically stack (vcat) the four matrices into one matrix, samples, which is exactly the matrix of bootstrap samples we have seen previously.
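
From here, samples can be used exactly as in the earlier posts; for example, a quick sketch of computing bootstrap standard errors from it:

# Sketch: column-wise standard deviations of the bootstrap samples give
# bootstrap standard errors for the five parameters.
bootstrapSE = std(samples,1)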

Tip: For parallel processing in Julia, separate your code into two files, one file containing the functions and parameters that need to be run in parallel, and the other file managing the processing and collecting results. Use the require command in the managing file to import the functions and parameters to all processors.


Results

Once you have run the second (managing) file, you can run speed tests to compare the two ways of getting 10,000 bootstrap samples. We use the @elapsed macro to time each approach. Here are the results:

@elapsed pmap(bootstrapSamples,[b,b,b,b])
	From worker 2:	hi
	From worker 3:	hi
	From worker 4:	hi
	From worker 5:	hi
	From worker 2:	bye
	From worker 5:	bye
	From worker 4:	bye
	From worker 3:	bye
66.528345161

Each of the processors tells us “hi” and “bye” when it begins and finishes its job, respectively. The total time was a bit more than a minute (66.5 seconds). By comparison, the following code will compute 10,000 bootstrap samples on only one processor, as in our previous work:

@elapsed bootstrapSamples(10000)
hi
bye
145.265103729

So, in the absence of parallel processing, the same result would require just over 145 seconds, or almost two and a half minutes. Thus, parallel processing on a four-core personal computer allowed us to reduce the time required to bootstrap our estimator by a factor of around 2.2. As a general rule, parallel processing of the bootstrap estimator saves you more time when the function being bootstrapped is more complicated.


Bradley J. Setzler

Julia Attracts an International Audience

By: Bradley Setzler

Re-posted from: https://juliaeconomics.com/2014/06/18/julia-attracts-an-international-audience/

I created this site 2 days ago to help economists learn Julia. I am pleasantly surprised that so many share this interest. Below, see the number of views of this site over these 2 days by country. Hopefully, all of this momentum in favor of Julia will speed its accumulation of useful packages and documentation.

Please, do not hesitate to comment or email me with ideas to improve the codes I post or ideas of worthwhile codes to work on next. Thank you for visiting this site.


[Two charts: JuliaEconomics.com views by country over the first two days.]


Bradley J. Setzler


Stepdown p-values for Multiple Hypothesis Testing in Julia

By: Bradley Setzler

Re-posted from: https://juliaeconomics.com/2014/06/17/stepdown-p-values-for-multiple-hypothesis-testing-in-julia/

* The script to reproduce the results of this tutorial in Julia is located here.

To finish yesterday’s tutorial on hypothesis testing with non-parametric p-values in Julia, I show how to perform the bootstrap stepdown correction to p-values for multiple hypothesis testing, which many economists (including me) find intimidating but which is actually not so difficult.

First, I will show the simple case of Holm’s (1979) stepdown procedure. Then, I will build on the bootstrap example to apply one of the most exciting advances in econometric theory of the past decade, the bootstrap stepdown procedure, developed by Romano and Wolf (2004) and extended by Chicago professor Azeem Shaikh. These methods allow one to bound the family-wise error rate, which is the probability of making at least one false discovery (i.e., rejecting a null hypothesis when that null hypothesis is true).


Motivation

The standard example used to motivate this discussion supposes that you compute p-values for, say, 10 parameters with independent estimators. If each of the 10 has a p-value equal to 0.05 for its own null hypothesis, then we reject any one of those null hypotheses individually with 95% confidence. However, if all 10 null hypotheses are true, the probability of falsely rejecting at least one of them is 1-\left(1-0.05\right)^{10} \approx 0.4 (by independence, the joint probability is the product of the marginal probabilities), so we have only about 60% confidence that no rejection in the family is a false discovery. If we wish to test all of the hypotheses jointly, we need some way to adjust the p-values to give us greater confidence in rejecting them jointly. Holm’s solution to this problem is to iteratively inflate the individual p-values, so that the smallest p-value is inflated the most and the largest p-value is not inflated at all (unless one of the smaller p-values is inflated above the largest p-value, in which case the largest p-value is also inflated to preserve the ordering of the p-values). The specifics of the procedure are below.
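
The arithmetic behind that 0.4 can be checked with a one-liner in Julia:

# Probability of at least one false rejection across 10 independent tests,
# each of size 0.05, when all 10 nulls are true:
1 - (1 - 0.05)^10    # ≈ 0.401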

The Romano-Wolf-Shaikh approach is to account for the dependence among the estimators, and penalize estimators more as they become more independent. Here’s how I motivate their approach: First, consider the extreme case of two estimators that are identical. Then, if we reject the null hypothesis for the first, there is no reason to penalize the second; it would be very strange to reject the null hypothesis for the first but not the second when they are identical. Second, suppose they are almost perfectly dependent. Then, rejecting the null for one strongly suggests that we should also reject the null for the second, so we do not want to penalize one very strongly for the rejection of the other. But as they become increasingly independent, the rejection of one provides less and less affirmative information about the other, and we approach the case of perfect independence shown above, which receives the greatest penalty. The specifics of the procedure are below.


Holm’s Correction to the p-values

Suppose you have a vector of p-values, p, of length K, and a desired maximum family-wise error rate, \alpha (the most common choice is \alpha=0.05). These are all of the needed ingredients. Holm’s stepdown procedure is as follows:

  1. If the smallest element of p is greater than \frac{\alpha}{K}, reject no null hypotheses and exit (do not continue to step 2). Else, reject the null hypothesis corresponding to the smallest element of p (continue to step 2).
  2. If the second-smallest element of p is greater than \frac{\alpha}{K-1}, reject no remaining null hypotheses and exit (do not continue to step 3). Else, reject the null hypothesis corresponding to the second-smallest element of p (continue to step 3).
  3. If the third-smallest element of p is greater than \frac{\alpha}{K-2}, reject no remaining null hypotheses and exit (do not continue to step 4). Else, reject the null hypothesis corresponding to the third-smallest element of p (continue to step 4).
  4. And so on.
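
For concreteness, here is a minimal sketch of this loop-based procedure (not from the original post); it returns a Boolean vector indicating which hypotheses are rejected:

# Sketch of Holm's stepdown rejection rule: compare the k-th smallest p-value
# to alpha/(K+1-k), stopping at the first failure.
function holmReject(p,alpha)
    K = length(p)
    order = sortperm(p)          # indices of p from smallest to largest
    reject = falses(K)
    for k=1:K
        if p[order[k]] > alpha/(K+1-k)
            break                # stop: this and all larger p-values are not rejected
        end
        reject[order[k]] = true
    end
    return reject
end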

The sketch above programs this directly as a loop-based function that takes p and \alpha as parameters. A simpler approach, however, is to compute the p-values equivalent to this procedure for any \alpha, which are,

\tilde{p}_{(k)} =\min\left\{\max_{j:p_{(j)}\leq p_{(k)}} \left\{ (K+1-j)p_{(j)}\right\},1\right\},

where (k) means the k^{\mathit{th}} smallest element. This expression is equivalent to the algorithm above in the sense that, for any family-wise error rate \alpha\in (0,1), \tilde{p}_{(k)}\leq\alpha if and only if the corresponding hypothesis is rejected by the algorithm above.
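For example, with the four single p-values reported in the Results section below, sorted as (0.009, 0.06, 0.383, 0.486) and K=4, the running maxima of (K+1-j)p_{(j)} are 4(0.009)=0.036, \max(0.036, 3(0.06))=0.18, \max(0.18, 2(0.383))=0.766, and \max(0.766, 1(0.486))=0.766, which match the Holm-corrected p-values reported there.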

The following code computes these p-values. Julia (apparently) lacks a command that would tell me the index of the rank of the p-values, so my loop below does this, including the handling of ties (when some of the p-values are the same):

function Holm(p)
    K = length(p)
    sort_index = -ones(K)                       # will hold the rank of each element of p
    sorted_p = sort(p)
    sorted_p_adj = sorted_p.*[K+1-i for i=1:K]  # (K+1-j)*p_(j)
    for j=1:K
        num_ties = length(sort_index[(p.==sorted_p[j]) & (sort_index.<0)])
        sort_index[(p.==sorted_p[j]) & (sort_index.<0)] = j:(j-1+num_ties)
    end
    sorted_holm_p = [minimum([maximum(sorted_p_adj[1:k]),1]) for k=1:K]  # running max, capped at 1
    holm_p = [sorted_holm_p[sort_index[k]] for k=1:K]                    # map back to original order
    return holm_p
end

This is straightforward except for sort_index, which I constructed such that, e.g., if the first element of sort_index is 3, then the first element of p is the third smallest. Unfortunately, it arbitrarily breaks ties in favor of the parameter that appears first in the MLE array, so the second entry of sort_index is 1 and the third entry is 2, even though the two corresponding p-values are equal.

Please email me or leave a comment if you know of a better way to handle ties.
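
One possibility, offered here only as a sketch and not part of the original post: Base's sortperm returns the permutation that sorts p, and inverting it with invperm gives each element's rank, which is exactly what sort_index stores (ties are still broken arbitrarily):

# Sketch: ranks via sortperm. sortperm(p) gives the indices of p in ascending
# order; invperm turns that permutation into the rank of each element.
p = [0.486,0.383,0.06,0.009]
sort_index = invperm(sortperm(p))    # [4,3,2,1]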


Bootstrap Stepdown p-values

Given our work yesterday, it is relatively easy to replace the bootstrap marginal p-values with bootstrap stepdown p-values. We begin with samples, as created yesterday. The following code creates tNullDistribution, which is the same as nullDistribution from yesterday except expressed as t-statistics (i.e., divided by the bootstrap standard error).

function stepdown(MLE,bootSamples)
    K = length(MLE)
    tMLE = copy(MLE)                       # copy so the inputs are not overwritten
    bootstrapSE = std(bootSamples,1)
    tNullDistribution = copy(bootSamples)
    p = -ones(K)
    for i=1:K
        tMLE[i] = abs(tMLE[i]/bootstrapSE[i])
        tNullDistribution[:,i] = abs((tNullDistribution[:,i]-mean(tNullDistribution[:,i]))/bootstrapSE[i])
        p[i] = mean(tMLE[i].<tNullDistribution[:,i])   # single (marginal) p-values
    end
    sort_index = -ones(K)                  # rank of each element of p, ties broken by position
    stepdown_p = -ones(K)
    sorted_p = sort(p)
    for j=1:K
        num_ties = length(sort_index[(p.==sorted_p[j]) & (sort_index.<0)])
        sort_index[(p.==sorted_p[j]) & (sort_index.<0)] = j:(j-1+num_ties)
    end
    for k=1:K
        # compare each parameter's t-statistic to the maximum t-statistic over the
        # parameters whose p-values rank at least as high as its own
        stepdown_p[k] = mean(maximum(tNullDistribution[:,sort_index.>=sort_index[k]],2).>tMLE[k])
    end
    return ["single_pvalues"=>p,"stepdown_pvalues"=>stepdown_p,"Holm_pvalues"=>Holm(p)]
end

The only difference between the single p-values and the stepdown p-values is the use of the maximum t-statistic in the comparison to the null distribution, and the maximum is taken over only the parameter estimates whose p-values have not yet been computed. Notice that I used a dictionary in the return so that I could output single, stepdown, and Holm p-values from the stepdown function.


Results

To test out the above corrected p-values, we return to MLE and samples from yesterday. Suppose we want to perform the two-sided tests simultaneously for the null hypotheses \beta_0=0,\beta_1=0,\beta_2=0,\beta_3=0. The results from the above code are,

julia> stepdown(MLE[1:4],samples[:,1:4])
Dict{ASCIIString,Any} with 3 entries:
  "Holm_pvalues"     => {0.766,0.766,0.18,0.036}
  "stepdown_pvalues" => [0.486,0.649,0.169,0.034]
  "single_pvalues"   => [0.486,0.383,0.06,0.009]

We see that the p-value corresponding to the null hypothesis \beta_2=0 is inflated by both the bootstrap stepdown and Holm procedures, and is no longer significant at the 0.10 level.


Bradley J. Setzler