By: Joel Nelson
Re-posted from: https://blog.glcs.io/julia-dataframes
When doing any sort of development one will often find themselves in need of working with data in atabular format. This is especially true for those of us in data science, or data analysis, fields.In the Julia programming language one of the more popular libraries for this type of datawrangling is DataFrames.jl. In this blog post we’ll explore the beginnings of working with thispackage.
Introduction
The great thing about a package like Dataframes.jl
is that it bridges the gap between traditionalprogramming and SQL (Structured Query Language). Databases are great tools for easily gaining insightsinto your data by joining, filtering, aggregating, sorting, etc… Dataframes.jl
brings those goodies rightinto your hands by simply adding the package into your julia session. So, lets get started!
Getting Started
Adding the package is a few simple steps.
julia> using Pkgjulia> Pkg.add("DataFrames")julia> using DataFrames
The constructor for a DataFrame
provides flexibility to create from arrays, tuples, constants, or files. The documentation covers all these, but for this post we’ll just explore one of the more common ways.
julia> df = DataFrame(a = 1:4, b = rand(4), c = "My first DataFrame")43 DataFrame Row a b c Int64 Float64 String 1 1 0.141874 My first DataFrame 2 2 0.432084 My first DataFrame 3 3 0.47098 My first DataFrame 4 4 0.414639 My first DataFrame
You’ll notice in the code above we use a mix of datatypes including range, array, and scalar. The underlying vectors must be of the same sizeand the scalar gets broadcasted, or repeated, for each row. Also, pay attention that the types of each column are inferredbased on the arrays passed into the constructor.
Now, to access a column of a DataFrame
there are also a few different possibilities. Here are a few examples of accessing thefirst column “a”.
julia> df.a4-element Vector{Int64}: 1 2 3 4julia> df."a"4-element Vector{Int64}: 1 2 3 4julia> df[!, "a"]4-element Vector{Int64}: 1 2 3 4julia> df[!, :a]4-element Vector{Int64}: 1 2 3 4julia> df[:, :a]4-element Vector{Int64}: 1 2 3 4
In these examples columns can be access directly with literals such as df.a
, or more dynamically using brackets (since variables could be substituted.) You may also findyourself wondering the difference between !
and :
, which is an important distinction!
The !
returns the underlying vector and :
returns a copy. This can be showcased in anexample where we will attempt to change the description of the second value in column c
to “I love Julia!”
julia> df[:, :c][2] = "I love Julia!"3julia> df43 DataFrame Row a b c Int64 Float64 String 1 1 0.394165 My first DataFrame 2 2 0.809883 My first DataFrame 3 3 0.124035 My first DataFrame 4 4 0.886781 My first DataFramejulia> df[!, :c][2] = "I love Julia!"3julia> df43 DataFrame Row a b c Int64 Float64 String 1 1 0.394165 My first DataFrame 2 3 0.809883 I love Julia! 3 3 0.124035 My first DataFrame 4 4 0.886781 My first DataFrame
Notice how the change will only persist to df
when we access the column with !
.
There is often a tradeoff between returning copies versus the actual underlying vectors. Returning a copy is generally considered safer since if the copy is later mutated the underlyingDataFrame remains unchanged. However, with very large DataFrames copying every column access willresult in an increase in memory. It is best to weigh those considerations and figure out whatapproach will work best for a given program.
Data Wrangling
Import / Export
Another great feature of the Julia programming language is that many different packages will interact wellwhen used together. For instance, DataFrames.jl
and CSV.jl
can be used to very easily import and exportdata.
First, we can save the DataFrame
from above to CSV.
julia> using CSVjulia> path = joinpath(homedir(), "my_df.csv")julia> CSV.write(path, df)
And, reading in the DataFrame
from file is just as easy!
julia> CSV.read(path, DataFrame)43 DataFrame Row a b c Int64 Float64 String31 1 1 0.601361 My first DataFrame 2 2 0.178065 My first DataFrame 3 3 0.729591 My first DataFrame 4 4 0.280314 My first DataFrame
There are many keyword arguments to explore when handling csv files and the documentation is best forcovering all of these CSV.jl.
DataFrames.jl
also supports writing and reading to multiple files types such as Arrow, JSON, Parquet, and others.
Joins
A join is a way to merge data from two DataFrames into a single DataFrame. There are several typesand they generally mimic the same types that a database would support.
innerjoin
leftjoin
rightjoin
outerjoin
semijoin
antijoin
crossjoin
Definitions of each can be found in either the documentation, or docstrings, but lets take a look at a fewexamples. Say we have the following DataFrame
sets containing information from a school.
julia> student_df = DataFrame(student_id = 1:10, student_name = ["Joe", "Sally", "Jim", "Sandy", "Beth", "Alex", "Tom", "Liz", "Bill", "Carl"], teacher_id = repeat([1,2],5))103 DataFrame Row student_id student_name teacher_id Int64 String Int64 1 1 Joe 1 2 2 Sally 2 3 3 Jim 1 4 4 Sandy 2 5 5 Beth 1 6 6 Alex 2 7 7 Tom 1 8 8 Liz 2 9 9 Bill 1 10 10 Carl 2julia> teacher_df = DataFrame(teacher_id = 1:2, teacher_name = ["Mr. Jackson", "Ms. Smith"])22 DataFrame Row teacher_id teacher_name Int64 String 1 1 Mr. Jackson 2 2 Ms. Smithjulia> grade_df = DataFrame(exam_id = 1, student_id = vcat(1:3, 5:10), grade = [0.95, 0.93, 0.81, 0.85, 0.73, 0.88, 0.77, 0.75, 0.93])93 DataFrame Row exam_id student_id grade Int64 Int64 Float64 1 1 1 0.95 2 1 2 0.93 3 1 3 0.81 4 1 5 0.85 5 1 6 0.73 6 1 7 0.88 7 1 8 0.77 8 1 9 0.75 9 1 10 0.93
If we look at the grade_df
we can see there are 9 results, but in the student_df
we have 10 students.So, someone must have missed the exam! Let’s find out who that way we can alert the teacher to schedulea makeup.
Let’s do a leftjoin
, which means every row will persist from the first DataFrame
regardless if there isa match to the second DataFrame
. The leftjoin
function also takes an on
keyword argumentto signify what column needs to be used to find matches.
julia> student_grade_df = leftjoin(student_df, grade_df, on=:student_id)105 DataFrame Row student_id student_name teacher_id exam_id grade Int64 String Int64 Int64? Float64? 1 1 Joe 1 1 0.95 2 2 Sally 2 1 0.93 3 3 Jim 1 1 0.81 4 5 Beth 1 1 0.85 5 6 Alex 2 1 0.73 6 7 Tom 1 1 0.88 7 8 Liz 2 1 0.77 8 9 Bill 1 1 0.75 9 10 Carl 2 1 0.93 10 4 Sandy 2 missing missing
We notice Sandy has a missing
value for both the exam_id
and grade
fields. missing
is a special datatype in Julia that is similar to a null
value in databases. This would signify tous that there was no match in the grade_df
meaning Sandy missed the exam. We can add one more join to get the respective teacher’s name.
julia> result_df = innerjoin(student_grade_df, teacher_df, on=:teacher_id)106 DataFrame Row student_id student_name teacher_id exam_id grade teacher_name Int64 String Int64 Int64? Float64? String 1 1 Joe 1 1 0.95 Mr. Jackson 2 2 Sally 2 1 0.93 Ms. Smith 3 3 Jim 1 1 0.81 Mr. Jackson 4 5 Beth 1 1 0.85 Mr. Jackson 5 6 Alex 2 1 0.73 Ms. Smith 6 7 Tom 1 1 0.88 Mr. Jackson 7 8 Liz 2 1 0.77 Ms. Smith 8 9 Bill 1 1 0.75 Mr. Jackson 9 10 Carl 2 1 0.93 Ms. Smith 10 4 Sandy 2 missing missing Ms. Smith
We used an innerjoin
this time since we know that every student would have a teacher assigned.Now, we can let Ms. Smith know that she needs to reach out to Sandy to re-schedule her exam.
Sorting
Another helpful function for analysis is sort
. Let’s sort our result_df
by the grade
column.
julia> sort(result_df, [:grade])106 DataFrame Row student_id student_name teacher_id exam_id grade teacher_name Int64 String Int64 Int64? Float64? String 1 6 Alex 2 1 0.73 Ms. Smith 2 9 Bill 1 1 0.75 Mr. Jackson 3 8 Liz 2 1 0.77 Ms. Smith 4 3 Jim 1 1 0.81 Mr. Jackson 5 5 Beth 1 1 0.85 Mr. Jackson 6 7 Tom 1 1 0.88 Mr. Jackson 7 2 Sally 2 1 0.93 Ms. Smith 8 10 Carl 2 1 0.93 Ms. Smith 9 1 Joe 1 1 0.95 Mr. Jackson 10 4 Sandy 2 missing missing Ms. Smith
The function takes the DataFrame
and an array of columns to sort on. Our result of sort
is putting the lowestgrade first, but if we wanted it descending we can pass a rev
keyword argument.
julia> sort(result_df, [:grade], rev=true)106 DataFrame Row student_id student_name teacher_id exam_id grade teacher_name Int64 String Int64 Int64? Float64? String 1 4 Sandy 2 missing missing Ms. Smith 2 1 Joe 1 1 0.95 Mr. Jackson 3 2 Sally 2 1 0.93 Ms. Smith 4 10 Carl 2 1 0.93 Ms. Smith 5 7 Tom 1 1 0.88 Mr. Jackson 6 5 Beth 1 1 0.85 Mr. Jackson 7 3 Jim 1 1 0.81 Mr. Jackson 8 8 Liz 2 1 0.77 Ms. Smith 9 9 Bill 1 1 0.75 Mr. Jackson 10 6 Alex 2 1 0.73 Ms. Smith
In both these cases a copy of the DataFrame
is returned and the result_df
is left unchanged. But, if we wanted tosort in-place we can also use the sort!
function that will update the passed DataFrame
.
julia> sort!(result_df, [:grade], rev=true)106 DataFrame Row student_id student_name teacher_id exam_id grade teacher_name Int64 String Int64 Int64? Float64? String 1 4 Sandy 2 missing missing Ms. Smith 2 1 Joe 1 1 0.95 Mr. Jackson 3 2 Sally 2 1 0.93 Ms. Smith 4 10 Carl 2 1 0.93 Ms. Smith 5 7 Tom 1 1 0.88 Mr. Jackson 6 5 Beth 1 1 0.85 Mr. Jackson 7 3 Jim 1 1 0.81 Mr. Jackson 8 8 Liz 2 1 0.77 Ms. Smith 9 9 Bill 1 1 0.75 Mr. Jackson 10 6 Alex 2 1 0.73 Ms. Smith
Split-apply-combine
Now that we have some basics down, it’s time to dive into aggregating results. In DataFrames.jl
this isreferred to as a split-apply-combine strategy. It is a bit of a mouthful, but let’s walk through whatexactly this is referring to.
Split is simply breaking the DataFrame
into groups using the groupby
function. In our example lets splitour DataFrame
by the teacher_name
column.
julia> grouped_df = groupby(result_df, :teacher_name)GroupedDataFrame with 2 groups based on key: teacher_nameFirst Group (5 rows): teacher_name = "Mr. Jackson" Row student_id student_name teacher_id exam_id grade teacher_name Int64 String Int64 Int64? Float64? String 1 1 Joe 1 1 0.95 Mr. Jackson 2 3 Jim 1 1 0.81 Mr. Jackson 3 5 Beth 1 1 0.85 Mr. Jackson 4 7 Tom 1 1 0.88 Mr. Jackson 5 9 Bill 1 1 0.75 Mr. JacksonLast Group (5 rows): teacher_name = "Ms. Smith" Row student_id student_name teacher_id exam_id grade teacher_name Int64 String Int64 Int64? Float64? String 1 2 Sally 2 1 0.93 Ms. Smith 2 6 Alex 2 1 0.73 Ms. Smith 3 8 Liz 2 1 0.77 Ms. Smith 4 10 Carl 2 1 0.93 Ms. Smith 5 4 Sandy 2 missing missing Ms. Smith
The result of calling groupby
is of type GroupedDataFrame
, which is basically a wrapperaround one, or many, groups of a DataFrame
. In our example we have two teachers and so the resultGroupedDataFrame
has two groups.
Now, lets try to get an average exam grade for our two teachers. This will introduce the combine
function that takes a GroupedDataFrame
and any number of aggregation functions. Let’s also addthe Statistics.jl
package, so we can take advantage of the mean
function.
julia> using Statisticsjulia> combine(grouped_df, :grade => mean)22 DataFrame Row teacher_name grade_mean String Float64? 1 Mr. Jackson 0.848 2 Ms. Smith missing
The result is a DataFrame
where the first column(s) will match our GroupedDataFrame
key(s) and the subsequent column(s) will match the function(s) we pass for aggregation. However, Ms. Smith has a grade_mean
of missing
!?
In our earlier discussion we found that Sandy missed the exam, so her grade was set to missing
. A missing
valuebehaves differently than normal numbers, which is problematic in our aggregation function. Take a look at a very simpleexample.
julia> 1 + missingmissing
We notice that adding a value of 1 to missing
equals missing
. This is a necessary evil and you may be wonderingwhy don’t we just treat it as 0? Let’s see what happens to our results if we replace missing
with 0.
julia> combine(grouped_df, :grade => (x -> mean(coalesce.(x, 0))))22 DataFrame Row teacher_name grade_function String Float64 1 Mr. Jackson 0.848 2 Ms. Smith 0.672
In the above example, instead of just passing mean
as the function we create an anonymous function. This allows us to get a littlemore clever with adding a coalesce
to replace the missing
values with 0. We see from the results that Ms. Smith has a much lowerscoring average than Mr. Jackson. But, if we think about it the results are getting incorrectly skewed. We know Sandy didn’t actuallyscore a 0, but rather didn’t take the test at all. Treating her result as a 0 is skewing the average much lower than it should be.
In some cases replacing with a 0 would make sense, but not in this scenario. Here are a few better options:
We could just drop the rows that contain missing
values prior to aggregation. DataFrames.jl
provides a dropmissing
functionspecifically for this.
julia> result_no_missing_df = dropmissing(result_df)96 DataFrame Row student_id student_name teacher_id exam_id grade teacher_name Int64 String Int64 Int64 Float64 String 1 1 Joe 1 1 0.95 Mr. Jackson 2 2 Sally 2 1 0.93 Ms. Smith 3 3 Jim 1 1 0.81 Mr. Jackson 4 5 Beth 1 1 0.85 Mr. Jackson 5 6 Alex 2 1 0.73 Ms. Smith 6 7 Tom 1 1 0.88 Mr. Jackson 7 8 Liz 2 1 0.77 Ms. Smith 8 9 Bill 1 1 0.75 Mr. Jackson 9 10 Carl 2 1 0.93 Ms. Smithjulia> grouped_no_missing_df = groupby(result_no_missing_df, :teacher_name)GroupedDataFrame with 2 groups based on key: teacher_nameFirst Group (5 rows): teacher_name = "Mr. Jackson" Row student_id student_name teacher_id exam_id grade teacher_name Int64 String Int64 Int64 Float64 String 1 1 Joe 1 1 0.95 Mr. Jackson 2 3 Jim 1 1 0.81 Mr. Jackson 3 5 Beth 1 1 0.85 Mr. Jackson 4 7 Tom 1 1 0.88 Mr. Jackson 5 9 Bill 1 1 0.75 Mr. JacksonLast Group (4 rows): teacher_name = "Ms. Smith" Row student_id student_name teacher_id exam_id grade teacher_name Int64 String Int64 Int64 Float64 String 1 2 Sally 2 1 0.93 Ms. Smith 2 6 Alex 2 1 0.73 Ms. Smith 3 8 Liz 2 1 0.77 Ms. Smith 4 10 Carl 2 1 0.93 Ms. Smithjulia> combine(grouped_no_missing_df, :grade => mean)22 DataFrame Row teacher_name grade_mean String Float64 1 Mr. Jackson 0.848 2 Ms. Smith 0.84
We now see that the two teachers average test scores are very similar. This approach would work wellif we never again needed the rows containing missing
values.
But, if we wanted to keep those rows around and rather just exclude them from certain calculations. We can make use of another function, skipmissing
, which will simply skip over the missing
values.
julia> combine(grouped_df, :grade => (x -> mean(skipmissing(x))))22 DataFrame Row teacher_name grade_function String Float64 1 Mr. Jackson 0.848 2 Ms. Smith 0.84
One last thing to note on missing
values is that it is easy to identify if one, or more, of your DataFrame
columns contains missing
values. We talked earlier that DataFrames.jl
infers the types for eachcolumn and displays them in the output. You’ll notice in result_df
that the column teacher_id
is of datatypeInt64
and exam_id
is of Int64?
. Here the ?
denotes that missing
values were found, so be careful!
Conclusion
We’ve touched on some of the topics that makes DataFrames.jl
such a great general purpose package. It is a helpfultool to quickly interact for data exploration, or to be used in production code to manipulate tabular data. I hopeyou’ve enjoyed today’s reading and be sure to check out the rest of our blog posts on blog.glcs.io!