Well… that’s a good question. First of all, there are a lot of requirements. But one
of the most important ones is to learn how to work with data sets. And I am not
talking about playing with spreadsheets. I am talking about working with some
real programming language to get the job done with any datasets, no matter how
huge it is.
Introduction to DataFrames in Julia
In Julia, tablular data is handled using the DataFrames package. Other packages
are commonly used to read/write data into/from Julia such as CSV.
A data frame is created using the DataFrame()
function:
using DataFrames
foo = DataFrame();
foo
## 0×0 DataFrame
To use the functionalities of the package, let’s create some random data. I will
use the rand()
function to generate random numbers to create an array 100 x 10
and convert it to a data frame:
foo = DataFrame(rand(100, 10));
foo
## 100×10 DataFrame. Omitted printing of 4 columns
## │ Row │ x1 │ x2 │ x3 │ x4 │ x5 │ x6 │
## │ │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │
## ├─────┼────────────┼──────────┼───────────┼──────────┼───────────┼──────────┤
## │ 1 │ 0.0193136 │ 0.466228 │ 0.790475 │ 0.805074 │ 0.51182 │ 0.201707 │
## │ 2 │ 0.986082 │ 0.33719 │ 0.309992 │ 0.117098 │ 0.792606 │ 0.102682 │
## │ 3 │ 0.00366356 │ 0.323071 │ 0.685271 │ 0.596414 │ 0.847368 │ 0.105035 │
## │ 4 │ 0.297846 │ 0.136907 │ 0.726739 │ 0.569452 │ 0.922995 │ 0.846519 │
## │ 5 │ 0.73245 │ 0.208294 │ 0.353801 │ 0.448741 │ 0.185897 │ 0.496741 │
## │ 6 │ 0.209719 │ 0.114021 │ 0.0662264 │ 0.463682 │ 0.628582 │ 0.130653 │
## │ 7 │ 0.341692 │ 0.608349 │ 0.946541 │ 0.589161 │ 0.418321 │ 0.295541 │
## ⋮
## │ 93 │ 0.714495 │ 0.661317 │ 0.954527 │ 0.209581 │ 0.107941 │ 0.233787 │
## │ 94 │ 0.680497 │ 0.101874 │ 0.872371 │ 0.596457 │ 0.669133 │ 0.740674 │
## │ 95 │ 0.909319 │ 0.182776 │ 0.343387 │ 0.142707 │ 0.0140866 │ 0.791679 │
## │ 96 │ 0.642578 │ 0.949993 │ 0.380511 │ 0.96358 │ 0.878766 │ 0.270409 │
## │ 97 │ 0.605148 │ 0.240233 │ 0.144059 │ 0.545245 │ 0.0463105 │ 0.188397 │
## │ 98 │ 0.0907523 │ 0.334278 │ 0.288403 │ 0.519876 │ 0.267965 │ 0.552448 │
## │ 99 │ 0.751681 │ 0.289301 │ 0.488135 │ 0.382877 │ 0.320208 │ 0.999445 │
## │ 100 │ 0.856248 │ 0.577105 │ 0.588476 │ 0.435958 │ 0.0163749 │ 0.337817 │
Maybe you have noticed the “;” at the end of a command. It turns out that in Julia,
contrary to many other languages, everything is an expression, so it will return
a result. Hence, to turn off this return, we must include the “;” at the end of
each command.
To get the dimension of a data frame, we can use the size()
function. Also,
similarly to R programming language, nrow()
and ncol()
are available to
get the number of rows and columns, respectively:
size(foo)
## (100, 10)
nrow(foo)
## 100
ncol(foo)
## 10
Another basic task when working with datasets is to to get the names of each
variable contained in the table. We use the names()
function to get the column
names:
names(foo)
## 10-element Array{Symbol,1}:
## :x1
## :x2
## :x3
## :x4
## :x5
## :x6
## :x7
## :x8
## :x9
## :x10
To get a summary of the dataset in general, we can use the function describe()
:
describe(foo)
## 10×8 DataFrame. Omitted printing of 2 columns
## │ Row │ variable │ mean │ min │ median │ max │ nunique │
## │ │ Symbol │ Float64 │ Float64 │ Float64 │ Float64 │ Nothing │
## ├─────┼──────────┼──────────┼────────────┼──────────┼──────────┼─────────┤
## │ 1 │ x1 │ 0.502457 │ 0.00190391 │ 0.508102 │ 0.993014 │ │
## │ 2 │ x2 │ 0.461593 │ 0.0143797 │ 0.465052 │ 0.949993 │ │
## │ 3 │ x3 │ 0.4659 │ 0.0180212 │ 0.409124 │ 0.978917 │ │
## │ 4 │ x4 │ 0.503142 │ 0.0130052 │ 0.508707 │ 0.986293 │ │
## │ 5 │ x5 │ 0.518394 │ 0.00177395 │ 0.502389 │ 0.994104 │ │
## │ 6 │ x6 │ 0.486075 │ 0.00543681 │ 0.475648 │ 0.999445 │ │
## │ 7 │ x7 │ 0.490961 │ 0.00366989 │ 0.482302 │ 0.996092 │ │
## │ 8 │ x8 │ 0.503405 │ 0.0180501 │ 0.525201 │ 0.985918 │ │
## │ 9 │ x9 │ 0.507343 │ 0.0327247 │ 0.533176 │ 0.990731 │ │
## │ 10 │ x10 │ 0.468541 │ 0.00622055 │ 0.470003 │ 0.996703 │ │
Note that there is a message indicating the omission of some columns. This is the
default behavior of Julia. To avoid this feature, we use the show()
function
as follows:
show(describe(foo), allcols = true)
Manipulating Rows:
Subset rows in Julia can be a little odd in the beginning, but once you get used to, it becomes
more logical. For example, suppose we want the rows where x1 is above its average.
We could this as follows:
## Loading the Statistics package:
using Statistics
## Creating the conditional:
cond01 = foo[:x1] .>= mean(foo[:x1]);
## Subsetting the rows:
foo[cond01, :]
## 51×10 DataFrame. Omitted printing of 4 columns
## │ Row │ x1 │ x2 │ x3 │ x4 │ x5 │ x6 │
## │ │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │
## ├─────┼──────────┼──────────┼───────────┼──────────┼───────────┼───────────┤
## │ 1 │ 0.986082 │ 0.33719 │ 0.309992 │ 0.117098 │ 0.792606 │ 0.102682 │
## │ 2 │ 0.73245 │ 0.208294 │ 0.353801 │ 0.448741 │ 0.185897 │ 0.496741 │
## │ 3 │ 0.716057 │ 0.325789 │ 0.193415 │ 0.813209 │ 0.232703 │ 0.314502 │
## │ 4 │ 0.538082 │ 0.932279 │ 0.101212 │ 0.363205 │ 0.979265 │ 0.274936 │
## │ 5 │ 0.693567 │ 0.78976 │ 0.123106 │ 0.566847 │ 0.492958 │ 0.798202 │
## │ 6 │ 0.794447 │ 0.405418 │ 0.0521367 │ 0.587886 │ 0.922298 │ 0.211156 │
## │ 7 │ 0.664186 │ 0.432662 │ 0.0431839 │ 0.810072 │ 0.963643 │ 0.678182 │
## ⋮
## │ 44 │ 0.85821 │ 0.484308 │ 0.899559 │ 0.754818 │ 0.252699 │ 0.0590497 │
## │ 45 │ 0.714495 │ 0.661317 │ 0.954527 │ 0.209581 │ 0.107941 │ 0.233787 │
## │ 46 │ 0.680497 │ 0.101874 │ 0.872371 │ 0.596457 │ 0.669133 │ 0.740674 │
## │ 47 │ 0.909319 │ 0.182776 │ 0.343387 │ 0.142707 │ 0.0140866 │ 0.791679 │
## │ 48 │ 0.642578 │ 0.949993 │ 0.380511 │ 0.96358 │ 0.878766 │ 0.270409 │
## │ 49 │ 0.605148 │ 0.240233 │ 0.144059 │ 0.545245 │ 0.0463105 │ 0.188397 │
## │ 50 │ 0.751681 │ 0.289301 │ 0.488135 │ 0.382877 │ 0.320208 │ 0.999445 │
## │ 51 │ 0.856248 │ 0.577105 │ 0.588476 │ 0.435958 │ 0.0163749 │ 0.337817 │
What if we want two conditionals? For example, we want the same condition as before
and/or the rows where x2 is greater than or equal its average? Now things
become trickier. Let’s check how we could do this:
## Creating the second conditional:
cond02 = foo[:x2] .>= mean(foo[:x2]);
## Subsetting cond01 AND cond02:
foo[.&(cond01, cond02), :]
## 25×10 DataFrame. Omitted printing of 4 columns
## │ Row │ x1 │ x2 │ x3 │ x4 │ x5 │ x6 │
## │ │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │
## ├─────┼──────────┼──────────┼───────────┼───────────┼───────────┼───────────┤
## │ 1 │ 0.538082 │ 0.932279 │ 0.101212 │ 0.363205 │ 0.979265 │ 0.274936 │
## │ 2 │ 0.693567 │ 0.78976 │ 0.123106 │ 0.566847 │ 0.492958 │ 0.798202 │
## │ 3 │ 0.567098 │ 0.747233 │ 0.589314 │ 0.0677154 │ 0.630238 │ 0.357654 │
## │ 4 │ 0.976991 │ 0.648552 │ 0.32794 │ 0.36951 │ 0.846276 │ 0.117798 │
## │ 5 │ 0.553247 │ 0.615375 │ 0.122955 │ 0.440636 │ 0.283713 │ 0.734161 │
## │ 6 │ 0.849795 │ 0.703195 │ 0.232944 │ 0.668432 │ 0.686921 │ 0.788872 │
## │ 7 │ 0.530801 │ 0.825475 │ 0.644381 │ 0.15488 │ 0.669306 │ 0.151317 │
## ⋮
## │ 18 │ 0.665926 │ 0.943121 │ 0.438038 │ 0.921251 │ 0.82234 │ 0.761529 │
## │ 19 │ 0.987506 │ 0.946972 │ 0.0462434 │ 0.67867 │ 0.731762 │ 0.482322 │
## │ 20 │ 0.862284 │ 0.886346 │ 0.694874 │ 0.0166389 │ 0.386215 │ 0.527352 │
## │ 21 │ 0.855198 │ 0.650342 │ 0.0321678 │ 0.723076 │ 0.449779 │ 0.0364525 │
## │ 22 │ 0.85821 │ 0.484308 │ 0.899559 │ 0.754818 │ 0.252699 │ 0.0590497 │
## │ 23 │ 0.714495 │ 0.661317 │ 0.954527 │ 0.209581 │ 0.107941 │ 0.233787 │
## │ 24 │ 0.642578 │ 0.949993 │ 0.380511 │ 0.96358 │ 0.878766 │ 0.270409 │
## │ 25 │ 0.856248 │ 0.577105 │ 0.588476 │ 0.435958 │ 0.0163749 │ 0.337817 │
## Subsetting cond01 OR cond02:
foo[.|(cond01, cond02), :]
## 77×10 DataFrame. Omitted printing of 4 columns
## │ Row │ x1 │ x2 │ x3 │ x4 │ x5 │ x6 │
## │ │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │
## ├─────┼───────────┼──────────┼───────────┼──────────┼───────────┼───────────┤
## │ 1 │ 0.0193136 │ 0.466228 │ 0.790475 │ 0.805074 │ 0.51182 │ 0.201707 │
## │ 2 │ 0.986082 │ 0.33719 │ 0.309992 │ 0.117098 │ 0.792606 │ 0.102682 │
## │ 3 │ 0.73245 │ 0.208294 │ 0.353801 │ 0.448741 │ 0.185897 │ 0.496741 │
## │ 4 │ 0.341692 │ 0.608349 │ 0.946541 │ 0.589161 │ 0.418321 │ 0.295541 │
## │ 5 │ 0.413762 │ 0.644062 │ 0.495503 │ 0.96149 │ 0.249137 │ 0.592854 │
## │ 6 │ 0.129374 │ 0.663032 │ 0.0180212 │ 0.280431 │ 0.887136 │ 0.329406 │
## │ 7 │ 0.716057 │ 0.325789 │ 0.193415 │ 0.813209 │ 0.232703 │ 0.314502 │
## ⋮
## │ 70 │ 0.85821 │ 0.484308 │ 0.899559 │ 0.754818 │ 0.252699 │ 0.0590497 │
## │ 71 │ 0.714495 │ 0.661317 │ 0.954527 │ 0.209581 │ 0.107941 │ 0.233787 │
## │ 72 │ 0.680497 │ 0.101874 │ 0.872371 │ 0.596457 │ 0.669133 │ 0.740674 │
## │ 73 │ 0.909319 │ 0.182776 │ 0.343387 │ 0.142707 │ 0.0140866 │ 0.791679 │
## │ 74 │ 0.642578 │ 0.949993 │ 0.380511 │ 0.96358 │ 0.878766 │ 0.270409 │
## │ 75 │ 0.605148 │ 0.240233 │ 0.144059 │ 0.545245 │ 0.0463105 │ 0.188397 │
## │ 76 │ 0.751681 │ 0.289301 │ 0.488135 │ 0.382877 │ 0.320208 │ 0.999445 │
## │ 77 │ 0.856248 │ 0.577105 │ 0.588476 │ 0.435958 │ 0.0163749 │ 0.337817 │
In Julia, instead of the syntax condition1 & condition2, which is more common in
other programming languages, we use &(condition1, condition2)
or
|(condition1, condition2)
operators to perform multiple conditional
filtering.
Now, let’s say you have a DataFrame and you want to append rows to it.
There are a couple of ways of doing data. The first one is to use the [data1; data2]
syntax:
## Creating a DataFrame with 3 rows and 5 columns:
x = DataFrame(rand(3, 5));
## Let's add another line using [dataset1; dataset2] syntax:
[ x ; DataFrame(rand(1, 5)) ]
## 4×5 DataFrame
## │ Row │ x1 │ x2 │ x3 │ x4 │ x5 │
## │ │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │
## ├─────┼──────────┼───────────┼───────────┼──────────┼───────────┤
## │ 1 │ 0.722487 │ 0.0930212 │ 0.146 │ 0.486439 │ 0.0892853 │
## │ 2 │ 0.640469 │ 0.5902 │ 0.667832 │ 0.882527 │ 0.766987 │
## │ 3 │ 0.094589 │ 0.805257 │ 0.291809 │ 0.582878 │ 0.704144 │
## │ 4 │ 0.18066 │ 0.187027 │ 0.0440521 │ 0.077637 │ 0.884914 │
We could get the same result using the vcat()
function. According to the
documentation, vcat()
performs concatenation along dimension 1, which means
it will concatenate rows. The syntax would be:
## taking the first 2 lines and append with the third one:
vcat(x[1:2, :] , x[3, :])
Another way to do that is using the function append!()
. This function will append
a new row to the last row in a given DataFrame. Note that the column names must
match exactly.
## Column names matches
append!(x, DataFrame(rand(1, 5)))
## 4×5 DataFrame
## │ Row │ x1 │ x2 │ x3 │ x4 │ x5 │
## │ │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │
## ├─────┼──────────┼───────────┼───────────┼──────────┼───────────┤
## │ 1 │ 0.722487 │ 0.0930212 │ 0.146 │ 0.486439 │ 0.0892853 │
## │ 2 │ 0.640469 │ 0.5902 │ 0.667832 │ 0.882527 │ 0.766987 │
## │ 3 │ 0.094589 │ 0.805257 │ 0.291809 │ 0.582878 │ 0.704144 │
## │ 4 │ 0.492341 │ 0.823765 │ 0.0731187 │ 0.123074 │ 0.264452 │
Note that if the column names between two DataFrames do not match , the append!()
function is going to throw an error. Although this kind of behavior is important
when we want to control for possible side effects, we might also prefer to not worry about
this and “force” the append procedure. In order to do this we can make use of
the push!()
function.
## providing an Array:
push!(x, rand(ncol(x)))
## 5×5 DataFrame
## │ Row │ x1 │ x2 │ x3 │ x4 │ x5 │
## │ │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │
## ├─────┼──────────┼───────────┼───────────┼──────────┼───────────┤
## │ 1 │ 0.722487 │ 0.0930212 │ 0.146 │ 0.486439 │ 0.0892853 │
## │ 2 │ 0.640469 │ 0.5902 │ 0.667832 │ 0.882527 │ 0.766987 │
## │ 3 │ 0.094589 │ 0.805257 │ 0.291809 │ 0.582878 │ 0.704144 │
## │ 4 │ 0.492341 │ 0.823765 │ 0.0731187 │ 0.123074 │ 0.264452 │
## │ 5 │ 0.632829 │ 0.357564 │ 0.09631 │ 0.198201 │ 0.924137 │
## providing an dictionary:
push!(x, Dict(:x1 => rand(),
:x2 => rand(),
:x3 => rand(),
:x4 => rand(),
:x5 => rand()))
## 6×5 DataFrame
## │ Row │ x1 │ x2 │ x3 │ x4 │ x5 │
## │ │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │
## ├─────┼──────────┼───────────┼───────────┼──────────┼───────────┤
## │ 1 │ 0.722487 │ 0.0930212 │ 0.146 │ 0.486439 │ 0.0892853 │
## │ 2 │ 0.640469 │ 0.5902 │ 0.667832 │ 0.882527 │ 0.766987 │
## │ 3 │ 0.094589 │ 0.805257 │ 0.291809 │ 0.582878 │ 0.704144 │
## │ 4 │ 0.492341 │ 0.823765 │ 0.0731187 │ 0.123074 │ 0.264452 │
## │ 5 │ 0.632829 │ 0.357564 │ 0.09631 │ 0.198201 │ 0.924137 │
## │ 6 │ 0.234059 │ 0.530488 │ 0.0448796 │ 0.565734 │ 0.262909 │
As we can see, this function also accepts that we give a dictionary or an array
to append to a DataFrame.
So, there are at least 4 methods to add rows to a DataFrame. Which one to use?
Let’s see how fast it is each function:
using BenchmarkTools
@btime [x ; DataFrame(rand(1, 5))];
@btime vcat(x, DataFrame(rand(1, 5)));
@btime append!(x, DataFrame(rand(1, 5)));
@btime push!(x, rand(1, 5));
Manipulating Columns:
One of the first things we would want to do when working with a dataset is selecting
some columns. In Julia, the syntax of selecting columns in DataFrames is similar to the one
used in Matlab/Octave. For instance, we can make use of the “:” symbol to represent
that we want all columns (or all rows) and/or a sequence of them:
## Taking all rows of the first 2 columns:
foo[:, 1:2]
## 100×2 DataFrame
## │ Row │ x1 │ x2 │
## │ │ Float64 │ Float64 │
## ├─────┼────────────┼──────────┤
## │ 1 │ 0.0193136 │ 0.466228 │
## │ 2 │ 0.986082 │ 0.33719 │
## │ 3 │ 0.00366356 │ 0.323071 │
## │ 4 │ 0.297846 │ 0.136907 │
## │ 5 │ 0.73245 │ 0.208294 │
## │ 6 │ 0.209719 │ 0.114021 │
## │ 7 │ 0.341692 │ 0.608349 │
## ⋮
## │ 93 │ 0.714495 │ 0.661317 │
## │ 94 │ 0.680497 │ 0.101874 │
## │ 95 │ 0.909319 │ 0.182776 │
## │ 96 │ 0.642578 │ 0.949993 │
## │ 97 │ 0.605148 │ 0.240233 │
## │ 98 │ 0.0907523 │ 0.334278 │
## │ 99 │ 0.751681 │ 0.289301 │
## │ 100 │ 0.856248 │ 0.577105 │
## Taking the first 10 rows of all columns:
foo[1:10, :]
## 10×10 DataFrame. Omitted printing of 4 columns
## │ Row │ x1 │ x2 │ x3 │ x4 │ x5 │ x6 │
## │ │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │
## ├─────┼────────────┼──────────┼───────────┼──────────┼──────────┼──────────┤
## │ 1 │ 0.0193136 │ 0.466228 │ 0.790475 │ 0.805074 │ 0.51182 │ 0.201707 │
## │ 2 │ 0.986082 │ 0.33719 │ 0.309992 │ 0.117098 │ 0.792606 │ 0.102682 │
## │ 3 │ 0.00366356 │ 0.323071 │ 0.685271 │ 0.596414 │ 0.847368 │ 0.105035 │
## │ 4 │ 0.297846 │ 0.136907 │ 0.726739 │ 0.569452 │ 0.922995 │ 0.846519 │
## │ 5 │ 0.73245 │ 0.208294 │ 0.353801 │ 0.448741 │ 0.185897 │ 0.496741 │
## │ 6 │ 0.209719 │ 0.114021 │ 0.0662264 │ 0.463682 │ 0.628582 │ 0.130653 │
## │ 7 │ 0.341692 │ 0.608349 │ 0.946541 │ 0.589161 │ 0.418321 │ 0.295541 │
## │ 8 │ 0.308353 │ 0.428978 │ 0.914878 │ 0.84873 │ 0.440174 │ 0.310166 │
## │ 9 │ 0.413762 │ 0.644062 │ 0.495503 │ 0.96149 │ 0.249137 │ 0.592854 │
## │ 10 │ 0.129374 │ 0.663032 │ 0.0180212 │ 0.280431 │ 0.887136 │ 0.329406 │
Also, we can select a column by using its name as a symbol or using the “.” operator:
## take the column x1 using "." operator:
foo.x1
## 100-element Array{Float64,1}:
## 0.019313572390828426
## 0.9860824001880526
## 0.003663562628546835
## 0.2978463233159676
## 0.7324498468154668
## 0.2097185474768264
## 0.34169153867123714
## 0.30835315833846444
## 0.41376236563754887
## 0.1293737178707406
## ⋮
## 0.8582099391004219
## 0.7144949034554522
## 0.6804966837971145
## 0.9093192833587018
## 0.6425780404716646
## 0.6051475800989663
## 0.09075227070455938
## 0.7516814773623635
## 0.8562478916762768
## Take the column using "x1" as a symbol:
foo[:x1]
## 100-element Array{Float64,1}:
## 0.019313572390828426
## 0.9860824001880526
## 0.003663562628546835
## 0.2978463233159676
## 0.7324498468154668
## 0.2097185474768264
## 0.34169153867123714
## 0.30835315833846444
## 0.41376236563754887
## 0.1293737178707406
## ⋮
## 0.8582099391004219
## 0.7144949034554522
## 0.6804966837971145
## 0.9093192833587018
## 0.6425780404716646
## 0.6051475800989663
## 0.09075227070455938
## 0.7516814773623635
## 0.8562478916762768
Notice that the return will be an Array. To select one or more column and return
them as a DataFrame type, we use the double brackets syntax:
using DataFramesMeta
## take column x1 as DataFrame
@linq foo[[:x1]] |> first(5)
## 5×1 DataFrame
## │ Row │ x1 │
## │ │ Float64 │
## ├─────┼────────────┤
## │ 1 │ 0.0193136 │
## │ 2 │ 0.986082 │
## │ 3 │ 0.00366356 │
## │ 4 │ 0.297846 │
## │ 5 │ 0.73245 │
## Take column x1 an x2:
@linq foo[[:x1, :x2]] |> first(5)
## 5×2 DataFrame
## │ Row │ x1 │ x2 │
## │ │ Float64 │ Float64 │
## ├─────┼────────────┼──────────┤
## │ 1 │ 0.0193136 │ 0.466228 │
## │ 2 │ 0.986082 │ 0.33719 │
## │ 3 │ 0.00366356 │ 0.323071 │
## │ 4 │ 0.297846 │ 0.136907 │
## │ 5 │ 0.73245 │ 0.208294 │
There are some new things here. The first()
function aims to just show the first
lines of our dataset. Similarly, last()
performs the same, but showing us the last
lines. Also, you may have noticed the use of the “|>” operator. This is the
pipe symbol in Julia. If you are familiar with R programming language, it
works similarly to the “%>%” operator from magrittr package, but with some
limitations. For example, we can not pipe to a specific argument in a
subsequent function, so that’s why the use of @linq from DataFramesMeta
package. For now just take these commands for granted. In another post I will show
how to use the functionalities of the metaprogramming tools for DataFrames.
Another trivial task we can perform with column is to add or alter columns in a
DataFrame. For example, let’s create a new column which will be a sequence between
1 and until 100 by 0.5:
## To create a sequence, use the function range():
foo[:new_column] = range(1, step = 0.5, length = nrow(foo));
foo[:, :new_column]
## 100-element Array{Float64,1}:
## 1.0
## 1.5
## 2.0
## 2.5
## 3.0
## 3.5
## 4.0
## 4.5
## 5.0
## 5.5
## ⋮
## 46.5
## 47.0
## 47.5
## 48.0
## 48.5
## 49.0
## 49.5
## 50.0
## 50.5
We can also add column using the insertcols!()
function. The syntax allow us to
specify in which position we want to add the column in the DataFrame:
## syntax: insert!(dataset, position, column_name => array)
insertcols!(foo, 2, :new_colum2 => range(1, step = 0.5, length = nrow(foo)));
first(foo, 3)
## 3×12 DataFrame. Omitted printing of 6 columns
## │ Row │ x1 │ new_colum2 │ x2 │ x3 │ x4 │ x5 │
## │ │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │
## ├─────┼────────────┼────────────┼──────────┼──────────┼──────────┼──────────┤
## │ 1 │ 0.0193136 │ 1.0 │ 0.466228 │ 0.790475 │ 0.805074 │ 0.51182 │
## │ 2 │ 0.986082 │ 1.5 │ 0.33719 │ 0.309992 │ 0.117098 │ 0.792606 │
## │ 3 │ 0.00366356 │ 2.0 │ 0.323071 │ 0.685271 │ 0.596414 │ 0.847368 │
Note the use of the “!” in insertcols!()
function. This means that the function
is altering the object in memory rather than in a “virtual copy” that later needs
to be assigned to a new variable. This is a behavior that can be used in other function
as well.
Ok… But what if you want to do the opposite? that is, to remove a column?
Well… it is just as easy as to add it. Just use the deletecols!()
function:
deletecols!(foo, [:new_column, :new_colum2])
## 100×10 DataFrame. Omitted printing of 4 columns
## │ Row │ x1 │ x2 │ x3 │ x4 │ x5 │ x6 │
## │ │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │
## ├─────┼────────────┼──────────┼───────────┼──────────┼───────────┼──────────┤
## │ 1 │ 0.0193136 │ 0.466228 │ 0.790475 │ 0.805074 │ 0.51182 │ 0.201707 │
## │ 2 │ 0.986082 │ 0.33719 │ 0.309992 │ 0.117098 │ 0.792606 │ 0.102682 │
## │ 3 │ 0.00366356 │ 0.323071 │ 0.685271 │ 0.596414 │ 0.847368 │ 0.105035 │
## │ 4 │ 0.297846 │ 0.136907 │ 0.726739 │ 0.569452 │ 0.922995 │ 0.846519 │
## │ 5 │ 0.73245 │ 0.208294 │ 0.353801 │ 0.448741 │ 0.185897 │ 0.496741 │
## │ 6 │ 0.209719 │ 0.114021 │ 0.0662264 │ 0.463682 │ 0.628582 │ 0.130653 │
## │ 7 │ 0.341692 │ 0.608349 │ 0.946541 │ 0.589161 │ 0.418321 │ 0.295541 │
## ⋮
## │ 93 │ 0.714495 │ 0.661317 │ 0.954527 │ 0.209581 │ 0.107941 │ 0.233787 │
## │ 94 │ 0.680497 │ 0.101874 │ 0.872371 │ 0.596457 │ 0.669133 │ 0.740674 │
## │ 95 │ 0.909319 │ 0.182776 │ 0.343387 │ 0.142707 │ 0.0140866 │ 0.791679 │
## │ 96 │ 0.642578 │ 0.949993 │ 0.380511 │ 0.96358 │ 0.878766 │ 0.270409 │
## │ 97 │ 0.605148 │ 0.240233 │ 0.144059 │ 0.545245 │ 0.0463105 │ 0.188397 │
## │ 98 │ 0.0907523 │ 0.334278 │ 0.288403 │ 0.519876 │ 0.267965 │ 0.552448 │
## │ 99 │ 0.751681 │ 0.289301 │ 0.488135 │ 0.382877 │ 0.320208 │ 0.999445 │
## │ 100 │ 0.856248 │ 0.577105 │ 0.588476 │ 0.435958 │ 0.0163749 │ 0.337817 │
Now suppose that you do not want to delete a colum, but just change its name.
For this task, I am afraid there is a very difficult function to remember
the name: rename()
. The syntax is as follows:
## rename(dataFrame, :old_name => :new_name)
rename(foo, :x1 => :A1, :x2 => :A2)
## 100×10 DataFrame. Omitted printing of 4 columns
## │ Row │ A1 │ A2 │ x3 │ x4 │ x5 │ x6 │
## │ │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │
## ├─────┼────────────┼──────────┼───────────┼──────────┼───────────┼──────────┤
## │ 1 │ 0.0193136 │ 0.466228 │ 0.790475 │ 0.805074 │ 0.51182 │ 0.201707 │
## │ 2 │ 0.986082 │ 0.33719 │ 0.309992 │ 0.117098 │ 0.792606 │ 0.102682 │
## │ 3 │ 0.00366356 │ 0.323071 │ 0.685271 │ 0.596414 │ 0.847368 │ 0.105035 │
## │ 4 │ 0.297846 │ 0.136907 │ 0.726739 │ 0.569452 │ 0.922995 │ 0.846519 │
## │ 5 │ 0.73245 │ 0.208294 │ 0.353801 │ 0.448741 │ 0.185897 │ 0.496741 │
## │ 6 │ 0.209719 │ 0.114021 │ 0.0662264 │ 0.463682 │ 0.628582 │ 0.130653 │
## │ 7 │ 0.341692 │ 0.608349 │ 0.946541 │ 0.589161 │ 0.418321 │ 0.295541 │
## ⋮
## │ 93 │ 0.714495 │ 0.661317 │ 0.954527 │ 0.209581 │ 0.107941 │ 0.233787 │
## │ 94 │ 0.680497 │ 0.101874 │ 0.872371 │ 0.596457 │ 0.669133 │ 0.740674 │
## │ 95 │ 0.909319 │ 0.182776 │ 0.343387 │ 0.142707 │ 0.0140866 │ 0.791679 │
## │ 96 │ 0.642578 │ 0.949993 │ 0.380511 │ 0.96358 │ 0.878766 │ 0.270409 │
## │ 97 │ 0.605148 │ 0.240233 │ 0.144059 │ 0.545245 │ 0.0463105 │ 0.188397 │
## │ 98 │ 0.0907523 │ 0.334278 │ 0.288403 │ 0.519876 │ 0.267965 │ 0.552448 │
## │ 99 │ 0.751681 │ 0.289301 │ 0.488135 │ 0.382877 │ 0.320208 │ 0.999445 │
## │ 100 │ 0.856248 │ 0.577105 │ 0.588476 │ 0.435958 │ 0.0163749 │ 0.337817 │
We could also add the “!” to the rename()
function to alter the DataFrame
in memory.
Let’s talk about missing values:
Missing values are represented in Julia with missing value. When an array
contains missing values, it automatically creates an appropriate union type:
x = [1.0, 2.0, missing]
## 3-element Array{Union{Missing, Float64},1}:
## 1.0
## 2.0
## missing
typeof(x)
## Array{Union{Missing, Float64},1}
typeof.(x)
## 3-element Array{DataType,1}:
## Float64
## Float64
## Missing
To check if a particular element in an array is missing, we use the ismissing()
function:
ismissing.([1.0, 2.0, missing])
## 3-element BitArray{1}:
## false
## false
## true
It is important to notice that missing comparison produces missing as a result:
missing == missing
isequal and === can be used to produce the results of type Bool:
isequal(missing, missing)
## true
missing === missing
## true
Other functions are available to work with missing values. For instance, suppose
we want an array with only non-missing values, we use the skipmissing()
function:
x |> skipmissing |> collect
## 2-element Array{Float64,1}:
## 1.0
## 2.0
Here, we use the collect()
function as the skipmissing()
returns an iterator.
To replace the missing values with some other value we can use the
Missings.replace()
function. For example, suppose we want to change the missing
values by NaN:
Missings.replace(x, NaN) |> collect
## 3-element Array{Float64,1}:
## 1.0
## 2.0
## NaN
We also can use use other ways to perform the same operation:
## Using coalesce() function:
coalesce.(x, NaN)
## 3-element Array{Float64,1}:
## 1.0
## 2.0
## NaN
## Using recode() function:
recode(x, missing => NaN)
## 3-element Array{Float64,1}:
## 1.0
## 2.0
## NaN
Until now, we have only talked about missing values in arrays. But what about missing
values in DataFrames? To start, let’s create a DataFrame with some missing values:
x = DataFrame(A = [1, missing, 3, 4], B = ["A", "B", missing, "C"])
## 4×2 DataFrame
## │ Row │ A │ B │
## │ │ Int64⍰ │ String⍰ │
## ├─────┼─────────┼─────────┤
## │ 1 │ 1 │ A │
## │ 2 │ missing │ B │
## │ 3 │ 3 │ missing │
## │ 4 │ 4 │ C │
For some analysis, we would want only the rows with non-missing values. One way
to achieve this is making use of the completecases()
function:
x[completecases(x), :]
## 2×2 DataFrame
## │ Row │ A │ B │
## │ │ Int64⍰ │ String⍰ │
## ├─────┼────────┼─────────┤
## │ 1 │ 1 │ A │
## │ 2 │ 4 │ C │
The completecases()
function returns an boolean array with value true for
rows that have non-missing values and false otherwise. For those who are familiar
with R, this is the same behavior as the complete.cases() function from stats package.
Another option to return the rows with non-missing values of a DataFrame in Julia
is to use the dropmissing()
function:
dropmissing(x)
## 2×2 DataFrame
## │ Row │ A │ B │
## │ │ Int64 │ String │
## ├─────┼───────┼────────┤
## │ 1 │ 1 │ A │
## │ 2 │ 4 │ C │
and again, for R users is the same behavior as na.omit() function.
Merging DataFrames:
Often, we need to combine two or more DataFrames together based on some common
column(s) among them. For example, suppose we have two DataFrames:
df1 = DataFrame(x = 1:3, y = 4:6)
## 3×2 DataFrame
## │ Row │ x │ y │
## │ │ Int64 │ Int64 │
## ├─────┼───────┼───────┤
## │ 1 │ 1 │ 4 │
## │ 2 │ 2 │ 5 │
## │ 3 │ 3 │ 6 │
df2 = DataFrame(x = 1:3, z = 'd':'f', new = 11:13)
## 3×3 DataFrame
## │ Row │ x │ z │ new │
## │ │ Int64 │ Char │ Int64 │
## ├─────┼───────┼──────┼───────┤
## │ 1 │ 1 │ 'd' │ 11 │
## │ 2 │ 2 │ 'e' │ 12 │
## │ 3 │ 3 │ 'f' │ 13 │
which have the column x in common. To merge these two tables, we use the
join()
function:
join(df1, df2, on = :x)
## 3×4 DataFrame
## │ Row │ x │ y │ z │ new │
## │ │ Int64 │ Int64 │ Char │ Int64 │
## ├─────┼───────┼───────┼──────┼───────┤
## │ 1 │ 1 │ 4 │ 'd' │ 11 │
## │ 2 │ 2 │ 5 │ 'e' │ 12 │
## │ 3 │ 3 │ 6 │ 'f' │ 13 │
That’s it!! We merge our DataFrames altogether. But that’s the default behavior of
the function. There is more to explore. Essentially, join()
takes 4 arguments:
- DataFrame 1
- DataFrame 2
- on = the column(s) to be the key in merging;
- kind = type of the merge (left, right, inner, outer, …)
The kind argument specifies the type of join we are interested in performing.
The definition of each one is as follows:
-
Inner: The output contains rows for values of the key that exist
in BOTH the first (left) and second (right) arguments to
join;
-
Left: The output contains rows for values of the key that exist in
the first (left) argument to join, whether or not that value
exists in the second (right) argument;
-
Right: The output contains rows for values of the key that exist in
the second (right) argument to join, whether or not that
value exists in the first (left) argument;
-
Outer: The output contains rows for values of the key that exist in
the first (left) OR second (right) argument to join;
and here are the “strange” ones:
-
Semi: Like an inner join, but output is restricted to columns from
the first (left) argument to join;
-
Anti: The output contains rows for values of the key that exist in
the first (left) but NOT in the second (right) argument to
join. As with semi joins, output is restricted to columns
from the first (left) argument.
If you are familiar with SQL or with the join functions from dplyr package in R,
it is the same concept.
To illustrate how the different kind of joins work, let’s create more DataFrames
to demonstrate each type of join:
Names = DataFrame(ID = [20, 40], Name = ["John Doe", "Jane Doe"])
## 2×2 DataFrame
## │ Row │ ID │ Name │
## │ │ Int64 │ String │
## ├─────┼───────┼──────────┤
## │ 1 │ 20 │ John Doe │
## │ 2 │ 40 │ Jane Doe │
jobs = DataFrame(ID = [20, 60], Job = ["Lawyer", "Astronaut"])
## 2×2 DataFrame
## │ Row │ ID │ Job │
## │ │ Int64 │ String │
## ├─────┼───────┼───────────┤
## │ 1 │ 20 │ Lawyer │
## │ 2 │ 60 │ Astronaut │
In the Names and jobs DataFrame, we have the ID column as the key to perform the
join. But notice that the ID values are not equal between the DataFrames. Now
let’s perform the joins:
join(Names, jobs, on = :ID, kind = :inner)
## 1×3 DataFrame
## │ Row │ ID │ Name │ Job │
## │ │ Int64 │ String │ String │
## ├─────┼───────┼──────────┼────────┤
## │ 1 │ 20 │ John Doe │ Lawyer │
join(Names, jobs, on = :ID, kind = :left)
## 2×3 DataFrame
## │ Row │ ID │ Name │ Job │
## │ │ Int64 │ String │ String⍰ │
## ├─────┼───────┼──────────┼─────────┤
## │ 1 │ 20 │ John Doe │ Lawyer │
## │ 2 │ 40 │ Jane Doe │ missing │
join(Names, jobs, on = :ID, kind = :right)
## 2×3 DataFrame
## │ Row │ ID │ Name │ Job │
## │ │ Int64 │ String⍰ │ String │
## ├─────┼───────┼──────────┼───────────┤
## │ 1 │ 20 │ John Doe │ Lawyer │
## │ 2 │ 60 │ missing │ Astronaut │
join(Names, jobs, on = :ID, kind = :outer)
## 3×3 DataFrame
## │ Row │ ID │ Name │ Job │
## │ │ Int64 │ String⍰ │ String⍰ │
## ├─────┼───────┼──────────┼───────────┤
## │ 1 │ 20 │ John Doe │ Lawyer │
## │ 2 │ 40 │ Jane Doe │ missing │
## │ 3 │ 60 │ missing │ Astronaut │
Semi and anti join have a more uncommon behavior. Semi join returns the rows
from the left which DO MATCH with the ID from the right:
join(Names, jobs, on = :ID, kind = :semi)
## 1×2 DataFrame
## │ Row │ ID │ Name │
## │ │ Int64 │ String │
## ├─────┼───────┼──────────┤
## │ 1 │ 20 │ John Doe │
Anti join returns the rows from the left which DO NOT MATCH with
the ID from the right
join(Names, jobs, on = :ID, kind = :anti)
## 1×2 DataFrame
## │ Row │ ID │ Name │
## │ │ Int64 │ String │
## ├─────┼───────┼──────────┤
## │ 1 │ 40 │ Jane Doe │
Split-Apply-Combine:
Some common tasks involve splitting the data into groups, applying some function
to each of these groups and gathering the results to analyze later on. This is the
split-apply-combine strategy described in the paper “The Split-Apply-Combine Strategy for Data analysis” written by Hadley Wickham, creator
of many R packages, including ggplot2 and dplyr.
The DataFrames package in Julia supports the Split-Apply-Combine strategy
through the by()
function, which takes three arguments:
- DataFrame;
- one or more column names to split on;
- a function or expression to apply to each subset;
To illustrate its usage, let’s make use of the RDatasets package, which gives
access to some preloaded well known datasets from R packages.
using RDatasets
foo = dataset("datasets", "iris");
first(foo, 5)
## 5×5 DataFrame
## │ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species │
## │ │ Float64 │ Float64 │ Float64 │ Float64 │ Categorical… │
## ├─────┼─────────────┼────────────┼─────────────┼────────────┼──────────────┤
## │ 1 │ 5.1 │ 3.5 │ 1.4 │ 0.2 │ setosa │
## │ 2 │ 4.9 │ 3.0 │ 1.4 │ 0.2 │ setosa │
## │ 3 │ 4.7 │ 3.2 │ 1.3 │ 0.2 │ setosa │
## │ 4 │ 4.6 │ 3.1 │ 1.5 │ 0.2 │ setosa │
## │ 5 │ 5.0 │ 3.6 │ 1.4 │ 0.2 │ setosa │
A trivial task is to find how many of each “Species” there are in the
dataset. One way to do this is to apply the Split-Apply-Combine strategy: split
the data into the Species column, apply the nrow()
function to this
splitted dataset, and combine the results:
## Syntax: by(dataset, :name_column_to_split, name_function)
by(foo, :Species, nrow)
## 3×2 DataFrame
## │ Row │ Species │ x1 │
## │ │ Categorical… │ Int64 │
## ├─────┼──────────────┼───────┤
## │ 1 │ setosa │ 50 │
## │ 2 │ versicolor │ 50 │
## │ 3 │ virginica │ 50 │
We can also make use of anonymous function:
by(foo, :Species, x -> DataFrame(N = nrow(x)))
## 3×2 DataFrame
## │ Row │ Species │ N │
## │ │ Categorical… │ Int64 │
## ├─────┼──────────────┼───────┤
## │ 1 │ setosa │ 50 │
## │ 2 │ versicolor │ 50 │
## │ 3 │ virginica │ 50 │
One of the advantages of using anonymous function inside the by()
function is
that we can format the resulted output and apply as many function as we want:
## Applying the count, mean and standard deviation function:
by(foo, :Species, x -> DataFrame(N = nrow(x),
avg_PetalLength = mean(x[:PetalLength]),
std_PetalWidth = std(x[:PetalWidth])))
## 3×4 DataFrame
## │ Row │ Species │ N │ avg_PetalLength │ std_PetalWidth │
## │ │ Categorical… │ Int64 │ Float64 │ Float64 │
## ├─────┼──────────────┼───────┼─────────────────┼────────────────┤
## │ 1 │ setosa │ 50 │ 1.462 │ 0.105386 │
## │ 2 │ versicolor │ 50 │ 4.26 │ 0.197753 │
## │ 3 │ virginica │ 50 │ 5.552 │ 0.27465 │
Another way to use the Split-Apply-Combine strategy is implementing the
aggregate()
function, which also takes three arguments:
- DataFrame;
- one or more column names to split on;
- one or more function to be applied ON THE COLUMNS NOT USED TO SPLIT.
The difference between by()
and aggregate()
function is that in the
latter, the function(s) will be applied to each column not used in
the split part.
For instance, let’s say you want the average of each colum for each Species.
Instead of using by()
with an anonymous function and writing the name of all columns
we can do:
aggregate(foo, :Species, [mean])
## 3×5 DataFrame. Omitted printing of 1 columns
## │ Row │ Species │ SepalLength_mean │ SepalWidth_mean │ PetalLength_mean │
## │ │ Categorical… │ Float64 │ Float64 │ Float64 │
## ├─────┼──────────────┼──────────────────┼─────────────────┼──────────────────┤
## │ 1 │ setosa │ 5.006 │ 3.428 │ 1.462 │
## │ 2 │ versicolor │ 5.936 │ 2.77 │ 4.26 │
## │ 3 │ virginica │ 6.588 │ 2.974 │ 5.552 │
Note that Julia only display output that fits the screen. Pay
attention to the message “Omitted printing of 1 columns”. To
overcome this, use the show()
as advised before.