Author Archives: Andrew Collier

#MonthOfJulia Day 31: Regression

Julia-Logo-Regression

Today we’ll be looking at two packages for regression analyses in Julia: GLM and GLMNet. Let’s get both of those loaded so that we can begin.

julia> using GLM, GLMNet

Next we’ll create a synthetic data set which we’ll use for illustration purposes.

julia> using Distributions, DataFrames
julia> points = DataFrame();
julia> points[:x] = rand(Uniform(0.0, 10.0), 500);
julia> points[:y] = 2 + 3 * points[:x] + rand(Normal(1.0, 3.0) , 500);
julia> points[:z] = rand(Uniform(0.0, 10.0), 500);
julia> points[:valid] = 2 * points[:y] + points[:z] + rand(Normal(0.0, 3.0), 500) .> 35;
julia> head(points)
6x4 DataFrame
| Row | x        | y       | z       | valid |
|-----|----------|---------|---------|-------|
| 1   | 0.867859 | 3.08688 | 6.03142 | false |
| 2   | 9.92178  | 33.4759 | 2.14742 | true  |
| 3   | 8.54372  | 32.2662 | 8.86289 | true  |
| 4   | 9.69646  | 35.5689 | 8.83644 | true  |
| 5   | 4.02686  | 12.4154 | 2.75854 | false |
| 6   | 6.89605  | 27.1884 | 6.10983 | true  |

By design there is a linear relationship between the x and y fields. We can extract that relationship from the data using glm().

julia> model = glm(y ~ x, points, Normal(), IdentityLink())
DataFrameRegressionModel{GeneralizedLinearModel{GlmResp{Array{Float64,1},Normal,IdentityLink},
                         DensePredChol{Float64,Cholesky{Float64}}},Float64}:

Coefficients:
             Estimate Std.Error z value Pr(>|z|)
(Intercept)   2.69863  0.265687 10.1572   <1e-23
x             2.99845 0.0474285 63.2204   <1e-99

The third and forth arguments to glm() stipulate that we are applying simple linear regression where we expect the residuals to have a Normal distribution. The parameter estimates are close to what was expected, taking into account the additive noise introduced into the data. The call to glm() seems rather verbose for something as simple as linear regression and, consequently, there is a shortcut lm() which gets the same result with less fuss.

Using the result of glm() we can directly access the estimated coefficients along with their standard errors and the associated covariance matrix.

julia> coef(model)
2-element Array{Float64,1}:
 2.69863
 2.99845
julia> stderr(model)
2-element Array{Float64,1}:
 0.265687 
 0.0474285
julia> vcov(model)
2x2 Array{Float64,2}:
  0.0705897  -0.0107768 
 -0.0107768   0.00224947

The data along with the linear regression fit are shown below.
regression-synthetic-data

Moving on to the GLMNet package, which implements linear models with penalised maximum likelihood estimators. We’ll use the Boston housing data from R’s MASS package for illustration.

julia> using RDatasets
julia> boston = dataset("MASS", "Boston");
julia> X = array(boston[:,1:13]);
julia> y = array(boston[:,14]);			# Median value of houses in units of $1000

Running glmnet() which will fit models for various values of the regularisation parameter, λ.

julia> path = glmnet(X, y);

The result is a set of 76 different models. We’ll have a look at the intercepts and coefficients for the first ten models (which correspond to the largest values of λ). The coefficients are held in the betas field, which is an array with a column for each model and a row for each coefficient. Since the first few models are strongly penalised, each has only a few non-zero coefficients.

julia> path.a0[1:10]
10-element Array{Float64,1}:
 22.5328
 23.6007
 23.6726
 21.4465
 19.4206
 17.5746
 15.8927
 14.3602
 12.9638
 12.5562
julia> path.betas[:,1:10]
13x10 Array{Float64,2}:
 0.0  0.0        0.0       0.0       0.0       0.0       0.0       0.0       0.0      0.0      
 0.0  0.0        0.0       0.0       0.0       0.0       0.0       0.0       0.0      0.0      
 0.0  0.0        0.0       0.0       0.0       0.0       0.0       0.0       0.0      0.0      
 0.0  0.0        0.0       0.0       0.0       0.0       0.0       0.0       0.0      0.0      
 0.0  0.0        0.0       0.0       0.0       0.0       0.0       0.0       0.0      0.0      
 0.0  0.0        0.127841  0.569442  0.971462  1.33777   1.67153   1.97564   2.25274  2.47954  
 0.0  0.0        0.0       0.0       0.0       0.0       0.0       0.0       0.0      0.0      
 0.0  0.0        0.0       0.0       0.0       0.0       0.0       0.0       0.0      0.0      
 0.0  0.0        0.0       0.0       0.0       0.0       0.0       0.0       0.0      0.0      
 0.0  0.0        0.0       0.0       0.0       0.0       0.0       0.0       0.0      0.0      
 0.0  0.0        0.0       0.0       0.0       0.0       0.0       0.0       0.0     -0.040168
 0.0  0.0        0.0       0.0       0.0       0.0       0.0       0.0       0.0      0.0      
 0.0 -0.0843998 -0.153581 -0.196981 -0.236547 -0.272599 -0.305447 -0.335377 -0.36264 -0.384493 

Now that we’ve got a bundle of models, how do we choose among them? Cross-validation, of course!

julia> path = glmnetcv(X, y)
Least Squares GLMNet Cross Validation
76 models for 13 predictors in 10 folds
Best λ 0.028 (mean loss 24.161, std 3.019)

We find that the best results (on the basis of loss) were achieved when λ had a value of 0.028, which is relatively weak regularisation. We’ll put the parameters of the corresponding model neatly in a data frame.

julia> DataFrame(variable = names(boston)[1:13],
                 beta = path.path.betas[:,indmin(path.meanloss)])
13x2 DataFrame
| Row | variable | beta       |
|-----|----------|------------|
| 1   | Crim     | -0.0983463 |
| 2   | Zn       | 0.0414416  |
| 3   | Indus    | 0.0        |
| 4   | Chas     | 2.68519    |
| 5   | NOx      | -16.3066   |
| 6   | Rm       | 3.86694    |
| 7   | Age      | 0.0        |
| 8   | Dis      | -1.39602   |
| 9   | Rad      | 0.252687   |
| 10  | Tax      | -0.0098268 |
| 11  | PTRatio  | -0.929989  |
| 12  | Black    | 0.00902588 |
| 13  | LStat    | -0.5225    |

From the fit coefficients we can conclude, for example, that average house value increases with the number of rooms in the house (Rm) but decreases with nitrogen oxides concentration (NOx), which is a proxy for traffic intensity.

Whew! That was exhilarating but exhausting. As a footnote, please have a look at the thesis “RegTools: A Julia Package for Assisting Regression Analysis” by Muzhou Liang. The RegTools package is available here. As always, the full code for today is available on github. Next time we’ll look at classification models. Below are a couple of pertinent videos to keep you busy in the meantime.


The post #MonthOfJulia Day 31: Regression appeared first on Exegetic Analytics.

#MonthOfJulia Day 30: Clustering

Julia-Logo-Clustering

Today we’re going to look at the Clustering package, the documentation for which can be found here. As usual, the first step is loading the package.

julia> using Clustering

We’ll use the RDatasets package to select the xclara data and rename the columns in the resulting data frame.

julia> using RDatasets
julia> xclara = dataset("cluster", "xclara");
julia> names!(xclara, [symbol(i) for i in ["x", "y"]]);

Using Gadfly to generate a plot we can clearly see that there are three well defined clusters in the data.
xclara-clusters

Next we need to transform the data into an Array and then transpose it so that each point lies in a separate column (remember that this is key to calculating distances!).

julia> xclara = convert(Array, xclara);
julia> xclara = xclara';

Before we can run the clustering algorithm we need to identify seed points which act as the starting locations for clusters. There are a number of options for doing this. We’re simply going to choose three points in the data at random. How did we arrive at three starting points (as opposed to, say, six)? Well, in this case it was simply visual inspection: there appear to be three clear clusters in the data. When the data are more complicated (or have higher dimensionality) then choosing the number of clusters becomes a little more tricky.

julia> initseeds(:rand, xclara, 3)
3-element Array{Int64,1}:
 2858
  980
 2800

Now we’re ready to run the clustering algorithm. We’ll start with k-means clustering.

julia> xclara_kmeans = kmeans(xclara, 3);

A quick plot will confirm that it has recognised the three clusters that we intuitively identified in the data.
xclara-clusters-colour
We can have a look at the cluster centers, the number of points assigned to each cluster and (a subset of) the cluster assignments.

julia> xclara_kmeans.centers
2x3 Array{Float64,2}:
  9.47805   69.9242  40.6836
 10.6861   -10.1196  59.7159
julia> xclara_kmeans.counts
3-element Array{Int64,1}:
  899
  952
 1149
julia> xclara_kmeans.assignments[1:10]
10-element Array{Int64,1}:
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2

The k-means algorithm is limited to using the Euclidean metric to calculate the distance between points. An alternative, k-medoids clustering, is also supported in the Clustering package. The kmedoids() function accepts a distance matrix (from an arbitrary metric) as it’s first argument, allowing for a far greater degree of flexibility.

The final algorithm implemented by Clustering is DBSCAN, which is a density based clustering algorithm. In addition to a distance matrix, dbscan() also requires neighbourhood radius and the minimum number of points per cluster.

julia> using Distances
julia> dclara = pairwise(SqEuclidean(), xclara);
julia> xclara_dbscan = dbscan(dclara, 10, 40);

As is apparent from the plot below, DBSCAN results in a dramatically different set of clusters. The “loosely” packed points on the periphery of each of the three clusters we identified before have now been lumped together into a single cluster. Only the high density cores of these clusters are now separately identified.
xclara-clusters-dbscan

That’s it for the moment about clusters. The full code for today can be found on github. Tomorrow we’ll take a look at regression. In the meantime, take a few minutes to watch the video below about using Julia’s clustering capabilities for climate classification.

The post #MonthOfJulia Day 30: Clustering appeared first on Exegetic Analytics.

#MonthOfJulia Day 29: Distances

Julia-Logo-Distances

Today we’ll be looking at the Distances package, which implements a range of distance metrics. This might seem a rather obscure topic, but distance calculation is at the core of all clustering techniques (which are next on the agenda), so it’s prudent to know a little about how they work.

Note that there is a Distance package as well (singular!), which was deprecated in favour of the Distances package. So please install and load the latter.

julia> using Distances

We’ll start by finding the distance between a pair of vectors.

julia> x = [1., 2., 3.];
julia> y = [-1., 3., 5.];

A simple application of Pythagora’s Theorem will tell you that the Euclidean distance between the tips of those vectors is 3. We can confirm our maths with Julia though. The general form of a distance calculation uses evaluate(), where the first argument is a distance type. Common distance metrics (like Euclidean distance) also come with convenience functions.

julia> evaluate(Euclidean(), x, y)
3.0
julia> euclidean(x, y)
3.0

We can just as easily calculate other metrics like the city block (or Manhattan), cosine or Chebyshev distances.

julia> evaluate(Cityblock(), x, y)
5.0
julia> cityblock(x, y)
5.0
julia> evaluate(CosineDist(), x, y)
0.09649209709474871
julia> evaluate(Chebyshev(), x, y)
2.0

Moving on to distances between the columns of matrices. Again we’ll define a pair of matrices for illustration.

julia> X = [0 1; 0 2; 0 3];
julia> Y = [1 -1; 1 3; 1 5];

With colwise() distances are calculated between corresponding columns in the two matrices. If one of the matrices has only a single column (see the example with Chebyshev() below) then the distance is calculated between that column and all columns in the other matrix.

julia> colwise(Euclidean(), X, Y)
2-element Array{Float64,1}:
 1.73205
 3.0    
julia> colwise(Hamming(), X, Y)
2-element Array{Int64,1}:
 3
 3
julia> colwise(Chebyshev(), X[:,1], Y)
2-element Array{Float64,1}:
 1.0
 5.0

We also have the option of using pairwise() which gives the distances between all pairs of columns from the two matrices. This is precisely the distance matrix that we would use for a cluster analysis.

julia> pairwise(Euclidean(), X, Y)
2x2 Array{Float64,2}:
 1.73205  5.91608
 2.23607  3.0    
julia> pairwise(Euclidean(), X)
2x2 Array{Float64,2}:
 0.0      3.74166
 3.74166  0.0    
julia> pairwise(Mahalanobis(eye(3)), X, Y)     # Effectively just the Euclidean metric
2x2 Array{Float64,2}:
 1.73205  5.91608
 2.23607  3.0    
julia> pairwise(WeightedEuclidean([1.0, 2.0, 3.0]), X, Y)
2x2 Array{Float64,2}:
 2.44949  9.69536
 3.74166  4.24264

As you might have observed from the last example above, it’s also possible to calculate weighted versions of some of the metrics.

Finally a less contrived example. We’ll look at the distances between observations in the iris data set. We first need to extract only the numeric component of each record and then transpose the resulting matrix so that observations become columns (rather than rows).

julia> using RDatasets
julia> iris = dataset("datasets", "iris");
julia> iris = convert(Array, iris[:,1:4]);
julia> iris = transpose(iris);
julia> dist_iris = pairwise(Euclidean(), iris);
julia> dist_iris[1:5,1:5]
5x5 Array{Float64,2}:
 0.0       0.538516  0.509902  0.648074  0.141421
 0.538516  0.0       0.3       0.331662  0.608276
 0.509902  0.3       0.0       0.244949  0.509902
 0.648074  0.331662  0.244949  0.0       0.648074
 0.141421  0.608276  0.509902  0.648074  0.0   

The full distance matrix is illustrated below as a heatmap using Plotly. Note how the clearly define blocks for each of the iris species setosa, versicolor, and virginica.

Distance Matrix for Iris Data

Tomorrow we’ll be back to look at clustering in Julia.

The post #MonthOfJulia Day 29: Distances appeared first on Exegetic Analytics.