Author Archives: Andrew Collier

#MonthOfJulia Day 14: Data

Julia-Logo-DataFrame

DataFrames

The DataFrame type in Julia is not dissimilar to the analogous types in R and Python/pandas. It provides a way of grouping data which is convenient for analysis and reminiscent of a database table.

I’m assuming that you’ve already installed the DataFrames package. If not, take a look at yesterday’s post. The first step is then to load it up:

julia> using DataFrames

Next we can start assembling our data. A DataFrame can be built up one field at a time (as is done in the example below) or by passing all of the data at once to the constructor.

julia> people = DataFrame();
julia> people[:name]   = ["Andrew", "Claire", "Bob", "Alice"];
julia> people[:gender] = [0, 1, 0, 1];
julia> people[:age]    = [43, 35, 27, 32];
julia> people
4x3 DataFrame
| Row | name     | gender | age |
|-----|----------|--------|-----|
| 1   | "Andrew" | 0      | 43  |
| 2   | "Claire" | 1      | 35  |
| 3   | "Bob"    | 0      | 27  |
| 4   | "Alice"  | 1      | 32  |

names() and eltypes() provide a high level overview of the data, giving the names and data types respectively for each column.

julia> names(people)
3-element Array{Symbol,1}:
 :name  
 :gender
 :age   
julia> eltypes(people)
3-element Array{Type{T<:Top},1}:
 ASCIIString
 Int64      
 Int64 

You can dig deeper with describe(), which gives a simple statistical summary of each column. It does essentially the same thing as summary() in R.

Indexing operations allow you to access the data in various ways. There’s also head() and tail(), which return the first and last few records in the data.

julia> people[:age]
4-element DataArray{Int64,1}:
 43
 35
 27
 32
julia> people[2]
4-element DataArray{Int64,1}:
 0
 1
 0
 1
julia> people[:,2]
4-element DataArray{Int64,1}:
 0
 1
 0
 1
julia> people[1,:]
1x3 DataFrame
| Row | name     | gender | age |
|-----|----------|--------|-----|
| 1   | "Andrew" | 0      | 43  |

You can apply a range of operations to columns. Note, however, that there is a subtle difference in syntax: while == is the normal equality operator, .== is the element-wise equality operator which must be applied to columns in order to make element-by-element comparisons. A similar syntax pertains to other operators like .<= and .>.

julia> people[:gender] = ifelse(people[:gender] .== 1, 'F', 'M');
julia> people
4x3 DataFrame
| Row | name     | gender | age |
|-----|----------|--------|-----|
| 1   | "Andrew" | 'M'    | 43  |
| 2   | "Claire" | 'F'    | 35  |
| 3   | "Bob"    | 'M'    | 27  |
| 4   | "Alice"  | 'F'    | 32  |
julia> people[:gender] .== 'M'
4-element DataArray{Bool,1}:
  true
 false
  true
 false
julia> people[:age] .<= 40
4-element DataArray{Bool,1}:
 false
  true
  true
  true

Of course you're not likely to construct any serious collection of data manually. It's more likely to come from a database or file. There are various ways to accomplish this. The simplest is reading from a delimited file.

julia> passwd = readtable("/etc/passwd", separator = ':', header = false);
julia> names!(passwd, [symbol(i) for i in ["username", "passwd", "UID", "GID",
                                           "comment", "home", "shell"]]);
julia> passwd[1:5,:]
5x7 DataFrame
| Row | username | passwd | UID | GID   | comment  | home        | shell               |
|-----|----------|--------|-----|-------|----------|-------------|---------------------|
| 1   | "root"   | "x"    | 0   | 0     | "root"   | "/root"     | "/bin/bash"         |
| 2   | "daemon" | "x"    | 1   | 1     | "daemon" | "/usr/sbin" | "/usr/sbin/nologin" |
| 3   | "bin"    | "x"    | 2   | 2     | "bin"    | "/bin"      | "/usr/sbin/nologin" |
| 4   | "sys"    | "x"    | 3   | 3     | "sys"    | "/dev"      | "/usr/sbin/nologin" |
| 5   | "sync"   | "x"    | 4   | 65534 | "sync"   | "/bin"      | "/bin/sync"         |

Note how names!() was used to alter the column names. There are other ways of loading data from a delimited text file that will handle column names more elegantly. We'll get to those in a few days time.

Watch the video below and then read further to find out about the DataArrays package.

DataArrays

Data are seldom perfect and missing values are not uncommon. Now, you might use some a particular numerical value (like -9999, for example) to indicate a missing datum. However, this is a bit of a kludge, difficult to maintain and open to ambiguity. The DataArrays package introduces the singleton NA type which can be used to unambiguously indicate missing data.

A vector with missing data is created using the @data macro.

julia> using DataArrays
julia> x = @data([1, 2, 3, 4, NA, 6])
6-element DataArray{Int64,1}:
 1  
 2  
 3  
 4  
  NA
 6  

Functions anyna() and allna() can be used to test whether any or all of the elements of a vector are missing.

Two ways of dealing with NAs are to either drop them or replace them with another value.

julia> dropna(x)
5-element Array{Int64,1}:
 1
 2
 3
 4
 6
julia> convert(Array, x, -1)
6-element Array{Int64,1}:
  1
  2
  3
  4
 -1
  6

Data frames have support for NAs already baked in.

julia> people[:age][2] = NA;
julia> people
4x3 DataFrame
| Row | name     | gender | age |
|-----|----------|--------|-----|
| 1   | "Andrew" | 'M'    | 43  |
| 2   | "Claire" | 'F'    | NA  |
| 3   | "Bob"    | 'M'    | 27  |
| 4   | "Alice"  | 'F'    | 32  |
julia> mean(people[:age])
NA
julia> mean(dropna(people[:age]))
34.0

Note how dropna() was used to calculate the mean of the non-missing data.

Metaprogramming with a DataFrame

The DataFramesMeta package provides a handful of macros for applying metaprogramming techniques to data frames. For example:

julia> using DataFramesMeta
julia> @with(passwd, maximum(:UID))
65534
julia> @select(people, :gender)
4x1 DataFrame
| Row | gender |
|-----|--------|
| 1   | 'M'    |
| 2   | 'F'    |
| 3   | 'M'    |
| 4   | 'F'    |

Further examples can be found on the github page for MonthOfJulia.

The post #MonthOfJulia Day 14: Data appeared first on Exegetic Analytics.

#MonthOfJulia Day 13: Packages

Julia-Logo-Packages

A lot of Julia’s functionality is implemented as add on packages (or “modules”). An extensive (though possibly not exhaustive) list of available packages can be found at http://pkg.julialang.org/. If you browse through that list I can guarantee that you will find a number of packages that pique your curiosity. How to install them? Read on.

Package management is handled via Pkg. Pkg.dir() will tell you where the installed packages are stored on your file system. Before installing any new packages, always call Pkg.update() to update your local metadata and repository (it will update any installed packages to the their most recent version).

julia-package-management

Adding a Package

Installing a new package is done with Pkg.add(). Any dependencies are handled automatically during the install process.

julia> Pkg.add("VennEuler")
INFO: Cloning cache of VennEuler from git://github.com/HarlanH/VennEuler.jl.git
INFO: Installing VennEuler v0.0.1
INFO: Building NLopt
INFO: Building Cairo
INFO: Package database updated

Pkg.available() generates a complete list of all available packages while Pkg.installed() or Pkg.status() can be used to find the versions of installed packages.

julia> Pkg.installed()["VennEuler"]
v"0.0.1"
julia> Pkg.installed("VennEuler")
v"0.0.1"

Pkg.pin() will fix a package at a specific version (no updates will be applied). Pkg.free() releases the effects of Pkg.pin().

Package Contents

The using directive loads the functions exported by a package into the global namespace. You can get a view of the capabilities of a package by typing its name followed by a period and then hitting the Tab key. Alternatively, names() will give a list of symbols exported by a package.

julia> using VennEuler
julia> names(VennEuler)
9-element Array{Symbol,1}:
 :optimize            
 :render              
 :optimize_iteratively
 :VennEuler           
 :EulerObject         
 :EulerState          
 :make_euler_object   
 :EulerSpec           
 :random_state 

The package manager provides a host of other functionality which you can read about here. Check out the videos below to find out more about Julia’s package ecosystem. From tomorrow I’ll start looking at specific packages. To get yourself prepared for that, why not go ahead and install the following packages: Cpp, PyCall, DataArrays, DataFrames and RCall.


The post #MonthOfJulia Day 13: Packages appeared first on Exegetic Analytics.

#MonthOfJulia Day 12: Parallel Processing

Julia Parallel

As opposed to many other languages, where parallel computing is bolted on as an afterthought, Julia was designed from the start with parallel computing in mind. It has a number of native features which lend themselves to efficient implementation of parallel algorithms. It also has packages which facilitate cluster computing (using MPI, for example). We won’t be looking at those, but focusing instead on coroutines, generic parallel processing and parallel loops.

Coroutines

Coroutines are not strictly parallel processing (in the sense of “many tasks running at the same time”) but they provide a lightweight mechanism for having multiple tasks defined (if not active) at once. According to Donald Knuth, coroutines are generalised subroutines (with which we are probably all familiar).

Under these conditions each module may be made into a coroutine; that is, it may be coded as an autonomous program which communicates with adjacent modules as if they were input or output subroutines. Thus, coroutines are subroutines all at the same level, each acting as if it were the master program when in fact there is no master program. There is no bound placed by this definition on the number of inputs and outputs a coroutine may have.
Conway, Design of a Separable Transition-Diagram Compiler, 1963.

Coroutines are implemented using produce() and consume(). In a moment you’ll see why those names are appropriate. To illustrate we’ll define a function which generates elements from the Lucas sequence. For reference, the first few terms in the sequence are 2, 1, 3, 4, 7, … If you know about Python’s generators then you’ll find the code below rather familiar.

julia> function lucas_producer(n)
       	a, b = (2, 1)
       	for i = 1:n
       		produce(a)
       		a, b = (b, a + b)
       	end
       end
lucas_producer (generic function with 1 method)

This function is then wrapped in a Task, which has state :runnable.

julia> lucas_task = Task(() -> lucas_producer(10))
Task (runnable) @0x0000000005b5ee60
julia> lucas_task.state
:runnable

Now we’re ready to start consuming data from the Task. Data elements can be retrieved individually or via a loop (in which case the Task acts like an iterable object and no consume() is required).

julia> consume(lucas_task)
2
julia> consume(lucas_task)
1
julia> consume(lucas_task)
3
julia> for n in lucas_task
       	println(n)
       end
4
7
11
18
29
47
76

Between invocations the Task is effectively asleep. The task temporarily springs to life every time data is requested, before becoming dormant once more.

It’s possible to simultaneously set up an arbitrary number of coroutine tasks.

Parallel Processing

Coroutines don’t really feel like “parallel” processing because they are not working simultaneously. However it’s rather straightforward to get Julia to metaphorically juggle many balls at once. The first thing that you’ll need to do is launch the interpreter with multiple worker processes.

$ julia -p 4

There’s always one more process than specified on the command line (we specified the number of worker processes; add one for the master process).

julia> nprocs()
5
julia> workers()                           # Identifiers for the worker processes.
4-element Array{Int64,1}:
 2
 3
 4
 5

We can launch a job on one of the workers using remotecall().

julia> W1 = workers()[1];
julia> P1 = remotecall(W1, x -> factorial(x), 20)
RemoteRef(2,1,6)
julia> fetch(P1)
2432902008176640000

@spawn and @spawnat are macros which launch jobs on individual workers. The @everywhere macro executes code across all processes (including the master).

julia> @everywhere p = 5
julia> @everywhere println(@sprintf("ID %d: %f %d", myid(), rand(), p))
ID 1: 0.686332 5
	From worker 4:	ID 4: 0.107924 5
	From worker 5:	ID 5: 0.136019 5
	From worker 2:	ID 2: 0.145561 5
	From worker 3:	ID 3: 0.670885 5

Parallel Loop and Map

To illustrate how easy it is to set up parallel loops, let’s first consider a simple serial implementation of a Monte Carlo technique to estimate π.

julia> function findpi(n)
       	inside = 0
       	for i = 1:n
       		x, y = rand(2)
       		if (x^2 + y^2 <= 1)
       			inside +=1
       		end
       	end
       	4 * inside / n
       end
findpi (generic function with 1 method)

The quality of the result as well as the execution time (and memory consumption!) depend directly on the number of samples.

julia> @time findpi(10000)
elapsed time: 0.051982841 seconds (1690648 bytes allocated, 81.54% gc time)
3.14
julia> @time findpi(100000000)
elapsed time: 9.533291187 seconds (8800000096 bytes allocated, 42.97% gc time)
3.1416662
julia> @time findpi(1000000000)
elapsed time: 95.436185105 seconds (88000002112 bytes allocated, 43.14% gc time)
3.141605352

The parallel version is implemented using the @parallel macro, which takes a reduction operator (in this case +) as its first argument.

julia> function parallel_findpi(n)
       	inside =  @parallel (+) for i = 1:n
       		x, y = rand(2)
       		x^2 + y^2 <= 1 ? 1 : 0
       	end
       	4 * inside / n
       end
parallel_findpi (generic function with 1 method)

There is some significant overhead associated with setting up the parallel jobs, so that the parallel version actually performs worse for a small number of samples. But when you run sufficient samples the speedup becomes readily apparent.

julia> @time parallel_findpi(10000)
elapsed time: 0.45212316 seconds (9731736 bytes allocated)
3.1724
julia> @time parallel_findpi(100000000)
elapsed time: 3.870065625 seconds (154696 bytes allocated)
3.14154744
julia> @time parallel_findpi(1000000000)
elapsed time: 39.029650365 seconds (151080 bytes allocated)
3.141653704

For reference, these results were achieved with 4 worker processes on a DELL laptop with the following CPU:

root@propane: #lshw | grep product | head -n 1
          product: Intel(R) Core(TM) i7-4600M CPU @ 2.90GHz

More information on parallel computing facilities in Julia can be found in the documentation. As usual the code for today’s Julia journey can be found on github.

The post #MonthOfJulia Day 12: Parallel Processing appeared first on Exegetic Analytics.