By: Jeff Dixon
Re-posted from: https://blog.glcs.io/healthcare-dataframes-mempool-case-study
Enhance healthcare revenue forecasting with DataFrames.jl. Discover design solutions, performance boosts, and Julia devoper job opportunities.
By: Jeff Dixon
Re-posted from: https://blog.glcs.io/healthcare-dataframes-mempool-case-study
Enhance healthcare revenue forecasting with DataFrames.jl. Discover design solutions, performance boosts, and Julia devoper job opportunities.
By: DSB
Re-posted from: https://medium.com/coffee-in-a-klein-bottle/diagramming-data-visualization-with-julia-37147ce63168?source=rss-8bd6ec95ab58------2
Data visualization and diagramming are usually treated as distinct subjects. Yet, for the keen observer, they are quite similar. In fact, one could say that data visualization is a subset of diagramming, where data is used in a structured manner to generate a diagram drawing. This idea is explored in the Julia package Vizagrams.jl, which implements a diagramming Domain Specific Language (DSL) with a data visualization grammar built on top of it.
In this article, we show how to use Vizagrams.jl both for diagramming and data visualization. I highly recommend using a notebook such as Jupyter or Pluto in order to follow along.
The package is registered in the Julia general repository, hence, it can be installed by simply doing:
julia>]
pkg> add Vizagrams
Vizagrams implements a diagramming DSL inspired in the great Haskell library Diagrams. We can think of a diagram as simply a collection of primitive graphical objects (circles, rectangles, lines…), where the final drawing simply renders each object one after the other, much like an SVG.
We start with the most basic example, a single circle:
using Vizagrams
# My first Diagram
d = Circle()
draw(d)
By default, our circle is drawn as black. We can modify its color by applying a style transformation.
d = S(:fill=>:red)*Circle()
draw(d)
We can make things more interesting by adding another object to our drawing. How can we do this? Well, just add it:
d = S(:fill=>:red)*Circle() + Square()
draw(d)
Note that the order of the summation matters. By adding a square after the circle, Vizagrams renders the square above the circle.
We have used the S(:fill=>:red) to apply the fill color red to the circle. Besides stylistic transformations, we can also apply geometric transformations, such as translation T , rotation R and scaling U .
d = S(:fill=>:red)*Circle() +
S(:stroke=>:blue,:fill=>:white)T(2,0)*Square() +
S(:fill=>:white)R(3.14/4)*Square()
draw(d)
Finally, we can combine also combine existing diagrams to form new ones.
d_2 = d + T(4,0) * d
draw(d_2)
Let us now go to plotting. Again, we start with the most basic example:
plt = plot(x=rand(100),y=rand(100),color=rand(["a","b"],100))
draw(plt)
As expected, the result is a simple scatter plot. Yet, there is something interesting going on. The variable plt is actually holding a diagram. Which means that we can manipulate it just like we did previously with other diagrams, and we can also combine it with other diagrams.
d = R(π/4)plt +
T(250,0)*plt +
Line([[180,250],[350,350],[350,200]]) +
T(350,350)Circle(r=10)
draw(d)
Although the example above was silly, it does illustrates the possibilities of what can be done. In fact, it is easy to see how one can combine diagramming operations to construct more useful visualizations, such as:
In Vizagrams, the diagramming DSL was used to build a data visualization grammar, i.e. a specification that can produce a variety of visualizations. The syntax for the grammar is based on Vega-Lite, which is a very popular grammar in the Data Visualization community (the Python package Altair is based on Vega-Lite, and there is also a Julia package called VegaLite.jl).
Explaining visualization grammars would take an article on its own. Yet, the specification style if fairly simple, so hopefully even those not familiar with Vega-Lite will understand what is going on.
# Importing DataFrames to store the data
using DataFrames
# VegaDatasets is used to load the dataset `cars`
using VegaDatasets
df = DataFrame(dataset("cars"));
df = dropmissing(df);
# Here is where Vizagrams plot specification actually starts
plt = Plot(
data=df,
encodings=(
x=(field=:Horsepower,),
y=(field=:Miles_per_Gallon,),
color=(field=:Origin,),
size=(field=:Acceleration,),
),
graphic=Circle()
)
draw(plt)
Note that we specified our plot by first passing on the dataset. Then, we defined the “encodings”, which were x , y , color and size. Each encoding variable has a parameter field which says which column in the dataset is mapped to it. Thus, in our example, we are mapping “Horsepower” to the x-axis, “Miles_per_Gallon” to the y-axis, the color is varying according to the “Origin” and the size according to “Acceleration”. At last, the “graphic” is specifying what is to be drawn in this plot. Since we passed Circle() , this means that we are drawing circles.
Here is where things get interesting. We can pass diagrams to this “graphic” parameter in the plot specification. First, let us create a diagram:
d = S(:fill=>:white,:stroke=>:black)*Circle(r=2) +
Circle() +
T(2,0)*Square()
draw(d)
Now, we place this diagram inside the plot specification, and…
plt = Plot(
data=df,
encodings=(
x=(field=:Horsepower,),
y=(field=:Miles_per_Gallon,),
color=(field=:Origin,),
size=(field=:Acceleration,),
),
graphic = Mark(d)
)
draw(plt)
Again, this example is not very useful, yet, it illustrates the sort of things that can be achieved. Here is perhaps a more “useful” example:
This article was a mere introduction to Vizagrams. There is much more to be explored, such as graphical marks creation, graphic expressions, stacking operations, and so on.
If you are interested in learning more about Vizagrams, check out the documentation. Again, I recommend using a notebook (Jupyter or Pluto), as one can quickly experiment with different designs.
Diagramming + Data Visualization with Julia was originally published in Coffee in a Klein Bottle on Medium, where people are continuing the conversation by highlighting and responding to this story.
By: Joel Nelson
Re-posted from: https://blog.glcs.io/julia-dataframes
When doing any sort of development one will often find themselves in need of working with data in atabular format. This is especially true for those of us in data science, or data analysis, fields.In the Julia programming language one of the more popular libraries for this type of datawrangling is DataFrames.jl. In this blog post we’ll explore the beginnings of working with thispackage.
The great thing about a package like Dataframes.jl
is that it bridges the gap between traditionalprogramming and SQL (Structured Query Language). Databases are great tools for easily gaining insightsinto your data by joining, filtering, aggregating, sorting, etc… Dataframes.jl
brings those goodies rightinto your hands by simply adding the package into your julia session. So, lets get started!
Adding the package is a few simple steps.
julia> using Pkgjulia> Pkg.add("DataFrames")julia> using DataFrames
The constructor for a DataFrame
provides flexibility to create from arrays, tuples, constants, or files. The documentation covers all these, but for this post we’ll just explore one of the more common ways.
julia> df = DataFrame(a = 1:4, b = rand(4), c = "My first DataFrame")43 DataFrame Row a b c Int64 Float64 String 1 1 0.141874 My first DataFrame 2 2 0.432084 My first DataFrame 3 3 0.47098 My first DataFrame 4 4 0.414639 My first DataFrame
You’ll notice in the code above we use a mix of datatypes including range, array, and scalar. The underlying vectors must be of the same sizeand the scalar gets broadcasted, or repeated, for each row. Also, pay attention that the types of each column are inferredbased on the arrays passed into the constructor.
Now, to access a column of a DataFrame
there are also a few different possibilities. Here are a few examples of accessing thefirst column “a”.
julia> df.a4-element Vector{Int64}: 1 2 3 4julia> df."a"4-element Vector{Int64}: 1 2 3 4julia> df[!, "a"]4-element Vector{Int64}: 1 2 3 4julia> df[!, :a]4-element Vector{Int64}: 1 2 3 4julia> df[:, :a]4-element Vector{Int64}: 1 2 3 4
In these examples columns can be access directly with literals such as df.a
, or more dynamically using brackets (since variables could be substituted.) You may also findyourself wondering the difference between !
and :
, which is an important distinction!
The !
returns the underlying vector and :
returns a copy. This can be showcased in anexample where we will attempt to change the description of the second value in column c
to “I love Julia!”
julia> df[:, :c][2] = "I love Julia!"3julia> df43 DataFrame Row a b c Int64 Float64 String 1 1 0.394165 My first DataFrame 2 2 0.809883 My first DataFrame 3 3 0.124035 My first DataFrame 4 4 0.886781 My first DataFramejulia> df[!, :c][2] = "I love Julia!"3julia> df43 DataFrame Row a b c Int64 Float64 String 1 1 0.394165 My first DataFrame 2 3 0.809883 I love Julia! 3 3 0.124035 My first DataFrame 4 4 0.886781 My first DataFrame
Notice how the change will only persist to df
when we access the column with !
.
There is often a tradeoff between returning copies versus the actual underlying vectors. Returning a copy is generally considered safer since if the copy is later mutated the underlyingDataFrame remains unchanged. However, with very large DataFrames copying every column access willresult in an increase in memory. It is best to weigh those considerations and figure out whatapproach will work best for a given program.
Another great feature of the Julia programming language is that many different packages will interact wellwhen used together. For instance, DataFrames.jl
and CSV.jl
can be used to very easily import and exportdata.
First, we can save the DataFrame
from above to CSV.
julia> using CSVjulia> path = joinpath(homedir(), "my_df.csv")julia> CSV.write(path, df)
And, reading in the DataFrame
from file is just as easy!
julia> CSV.read(path, DataFrame)43 DataFrame Row a b c Int64 Float64 String31 1 1 0.601361 My first DataFrame 2 2 0.178065 My first DataFrame 3 3 0.729591 My first DataFrame 4 4 0.280314 My first DataFrame
There are many keyword arguments to explore when handling csv files and the documentation is best forcovering all of these CSV.jl.
DataFrames.jl
also supports writing and reading to multiple files types such as Arrow, JSON, Parquet, and others.
A join is a way to merge data from two DataFrames into a single DataFrame. There are several typesand they generally mimic the same types that a database would support.
innerjoin
leftjoin
rightjoin
outerjoin
semijoin
antijoin
crossjoin
Definitions of each can be found in either the documentation, or docstrings, but lets take a look at a fewexamples. Say we have the following DataFrame
sets containing information from a school.
julia> student_df = DataFrame(student_id = 1:10, student_name = ["Joe", "Sally", "Jim", "Sandy", "Beth", "Alex", "Tom", "Liz", "Bill", "Carl"], teacher_id = repeat([1,2],5))103 DataFrame Row student_id student_name teacher_id Int64 String Int64 1 1 Joe 1 2 2 Sally 2 3 3 Jim 1 4 4 Sandy 2 5 5 Beth 1 6 6 Alex 2 7 7 Tom 1 8 8 Liz 2 9 9 Bill 1 10 10 Carl 2julia> teacher_df = DataFrame(teacher_id = 1:2, teacher_name = ["Mr. Jackson", "Ms. Smith"])22 DataFrame Row teacher_id teacher_name Int64 String 1 1 Mr. Jackson 2 2 Ms. Smithjulia> grade_df = DataFrame(exam_id = 1, student_id = vcat(1:3, 5:10), grade = [0.95, 0.93, 0.81, 0.85, 0.73, 0.88, 0.77, 0.75, 0.93])93 DataFrame Row exam_id student_id grade Int64 Int64 Float64 1 1 1 0.95 2 1 2 0.93 3 1 3 0.81 4 1 5 0.85 5 1 6 0.73 6 1 7 0.88 7 1 8 0.77 8 1 9 0.75 9 1 10 0.93
If we look at the grade_df
we can see there are 9 results, but in the student_df
we have 10 students.So, someone must have missed the exam! Let’s find out who that way we can alert the teacher to schedulea makeup.
Let’s do a leftjoin
, which means every row will persist from the first DataFrame
regardless if there isa match to the second DataFrame
. The leftjoin
function also takes an on
keyword argumentto signify what column needs to be used to find matches.
julia> student_grade_df = leftjoin(student_df, grade_df, on=:student_id)105 DataFrame Row student_id student_name teacher_id exam_id grade Int64 String Int64 Int64? Float64? 1 1 Joe 1 1 0.95 2 2 Sally 2 1 0.93 3 3 Jim 1 1 0.81 4 5 Beth 1 1 0.85 5 6 Alex 2 1 0.73 6 7 Tom 1 1 0.88 7 8 Liz 2 1 0.77 8 9 Bill 1 1 0.75 9 10 Carl 2 1 0.93 10 4 Sandy 2 missing missing
We notice Sandy has a missing
value for both the exam_id
and grade
fields. missing
is a special datatype in Julia that is similar to a null
value in databases. This would signify tous that there was no match in the grade_df
meaning Sandy missed the exam. We can add one more join to get the respective teacher’s name.
julia> result_df = innerjoin(student_grade_df, teacher_df, on=:teacher_id)106 DataFrame Row student_id student_name teacher_id exam_id grade teacher_name Int64 String Int64 Int64? Float64? String 1 1 Joe 1 1 0.95 Mr. Jackson 2 2 Sally 2 1 0.93 Ms. Smith 3 3 Jim 1 1 0.81 Mr. Jackson 4 5 Beth 1 1 0.85 Mr. Jackson 5 6 Alex 2 1 0.73 Ms. Smith 6 7 Tom 1 1 0.88 Mr. Jackson 7 8 Liz 2 1 0.77 Ms. Smith 8 9 Bill 1 1 0.75 Mr. Jackson 9 10 Carl 2 1 0.93 Ms. Smith 10 4 Sandy 2 missing missing Ms. Smith
We used an innerjoin
this time since we know that every student would have a teacher assigned.Now, we can let Ms. Smith know that she needs to reach out to Sandy to re-schedule her exam.
Another helpful function for analysis is sort
. Let’s sort our result_df
by the grade
column.
julia> sort(result_df, [:grade])106 DataFrame Row student_id student_name teacher_id exam_id grade teacher_name Int64 String Int64 Int64? Float64? String 1 6 Alex 2 1 0.73 Ms. Smith 2 9 Bill 1 1 0.75 Mr. Jackson 3 8 Liz 2 1 0.77 Ms. Smith 4 3 Jim 1 1 0.81 Mr. Jackson 5 5 Beth 1 1 0.85 Mr. Jackson 6 7 Tom 1 1 0.88 Mr. Jackson 7 2 Sally 2 1 0.93 Ms. Smith 8 10 Carl 2 1 0.93 Ms. Smith 9 1 Joe 1 1 0.95 Mr. Jackson 10 4 Sandy 2 missing missing Ms. Smith
The function takes the DataFrame
and an array of columns to sort on. Our result of sort
is putting the lowestgrade first, but if we wanted it descending we can pass a rev
keyword argument.
julia> sort(result_df, [:grade], rev=true)106 DataFrame Row student_id student_name teacher_id exam_id grade teacher_name Int64 String Int64 Int64? Float64? String 1 4 Sandy 2 missing missing Ms. Smith 2 1 Joe 1 1 0.95 Mr. Jackson 3 2 Sally 2 1 0.93 Ms. Smith 4 10 Carl 2 1 0.93 Ms. Smith 5 7 Tom 1 1 0.88 Mr. Jackson 6 5 Beth 1 1 0.85 Mr. Jackson 7 3 Jim 1 1 0.81 Mr. Jackson 8 8 Liz 2 1 0.77 Ms. Smith 9 9 Bill 1 1 0.75 Mr. Jackson 10 6 Alex 2 1 0.73 Ms. Smith
In both these cases a copy of the DataFrame
is returned and the result_df
is left unchanged. But, if we wanted tosort in-place we can also use the sort!
function that will update the passed DataFrame
.
julia> sort!(result_df, [:grade], rev=true)106 DataFrame Row student_id student_name teacher_id exam_id grade teacher_name Int64 String Int64 Int64? Float64? String 1 4 Sandy 2 missing missing Ms. Smith 2 1 Joe 1 1 0.95 Mr. Jackson 3 2 Sally 2 1 0.93 Ms. Smith 4 10 Carl 2 1 0.93 Ms. Smith 5 7 Tom 1 1 0.88 Mr. Jackson 6 5 Beth 1 1 0.85 Mr. Jackson 7 3 Jim 1 1 0.81 Mr. Jackson 8 8 Liz 2 1 0.77 Ms. Smith 9 9 Bill 1 1 0.75 Mr. Jackson 10 6 Alex 2 1 0.73 Ms. Smith
Now that we have some basics down, it’s time to dive into aggregating results. In DataFrames.jl
this isreferred to as a split-apply-combine strategy. It is a bit of a mouthful, but let’s walk through whatexactly this is referring to.
Split is simply breaking the DataFrame
into groups using the groupby
function. In our example lets splitour DataFrame
by the teacher_name
column.
julia> grouped_df = groupby(result_df, :teacher_name)GroupedDataFrame with 2 groups based on key: teacher_nameFirst Group (5 rows): teacher_name = "Mr. Jackson" Row student_id student_name teacher_id exam_id grade teacher_name Int64 String Int64 Int64? Float64? String 1 1 Joe 1 1 0.95 Mr. Jackson 2 3 Jim 1 1 0.81 Mr. Jackson 3 5 Beth 1 1 0.85 Mr. Jackson 4 7 Tom 1 1 0.88 Mr. Jackson 5 9 Bill 1 1 0.75 Mr. JacksonLast Group (5 rows): teacher_name = "Ms. Smith" Row student_id student_name teacher_id exam_id grade teacher_name Int64 String Int64 Int64? Float64? String 1 2 Sally 2 1 0.93 Ms. Smith 2 6 Alex 2 1 0.73 Ms. Smith 3 8 Liz 2 1 0.77 Ms. Smith 4 10 Carl 2 1 0.93 Ms. Smith 5 4 Sandy 2 missing missing Ms. Smith
The result of calling groupby
is of type GroupedDataFrame
, which is basically a wrapperaround one, or many, groups of a DataFrame
. In our example we have two teachers and so the resultGroupedDataFrame
has two groups.
Now, lets try to get an average exam grade for our two teachers. This will introduce the combine
function that takes a GroupedDataFrame
and any number of aggregation functions. Let’s also addthe Statistics.jl
package, so we can take advantage of the mean
function.
julia> using Statisticsjulia> combine(grouped_df, :grade => mean)22 DataFrame Row teacher_name grade_mean String Float64? 1 Mr. Jackson 0.848 2 Ms. Smith missing
The result is a DataFrame
where the first column(s) will match our GroupedDataFrame
key(s) and the subsequent column(s) will match the function(s) we pass for aggregation. However, Ms. Smith has a grade_mean
of missing
!?
In our earlier discussion we found that Sandy missed the exam, so her grade was set to missing
. A missing
valuebehaves differently than normal numbers, which is problematic in our aggregation function. Take a look at a very simpleexample.
julia> 1 + missingmissing
We notice that adding a value of 1 to missing
equals missing
. This is a necessary evil and you may be wonderingwhy don’t we just treat it as 0? Let’s see what happens to our results if we replace missing
with 0.
julia> combine(grouped_df, :grade => (x -> mean(coalesce.(x, 0))))22 DataFrame Row teacher_name grade_function String Float64 1 Mr. Jackson 0.848 2 Ms. Smith 0.672
In the above example, instead of just passing mean
as the function we create an anonymous function. This allows us to get a littlemore clever with adding a coalesce
to replace the missing
values with 0. We see from the results that Ms. Smith has a much lowerscoring average than Mr. Jackson. But, if we think about it the results are getting incorrectly skewed. We know Sandy didn’t actuallyscore a 0, but rather didn’t take the test at all. Treating her result as a 0 is skewing the average much lower than it should be.
In some cases replacing with a 0 would make sense, but not in this scenario. Here are a few better options:
We could just drop the rows that contain missing
values prior to aggregation. DataFrames.jl
provides a dropmissing
functionspecifically for this.
julia> result_no_missing_df = dropmissing(result_df)96 DataFrame Row student_id student_name teacher_id exam_id grade teacher_name Int64 String Int64 Int64 Float64 String 1 1 Joe 1 1 0.95 Mr. Jackson 2 2 Sally 2 1 0.93 Ms. Smith 3 3 Jim 1 1 0.81 Mr. Jackson 4 5 Beth 1 1 0.85 Mr. Jackson 5 6 Alex 2 1 0.73 Ms. Smith 6 7 Tom 1 1 0.88 Mr. Jackson 7 8 Liz 2 1 0.77 Ms. Smith 8 9 Bill 1 1 0.75 Mr. Jackson 9 10 Carl 2 1 0.93 Ms. Smithjulia> grouped_no_missing_df = groupby(result_no_missing_df, :teacher_name)GroupedDataFrame with 2 groups based on key: teacher_nameFirst Group (5 rows): teacher_name = "Mr. Jackson" Row student_id student_name teacher_id exam_id grade teacher_name Int64 String Int64 Int64 Float64 String 1 1 Joe 1 1 0.95 Mr. Jackson 2 3 Jim 1 1 0.81 Mr. Jackson 3 5 Beth 1 1 0.85 Mr. Jackson 4 7 Tom 1 1 0.88 Mr. Jackson 5 9 Bill 1 1 0.75 Mr. JacksonLast Group (4 rows): teacher_name = "Ms. Smith" Row student_id student_name teacher_id exam_id grade teacher_name Int64 String Int64 Int64 Float64 String 1 2 Sally 2 1 0.93 Ms. Smith 2 6 Alex 2 1 0.73 Ms. Smith 3 8 Liz 2 1 0.77 Ms. Smith 4 10 Carl 2 1 0.93 Ms. Smithjulia> combine(grouped_no_missing_df, :grade => mean)22 DataFrame Row teacher_name grade_mean String Float64 1 Mr. Jackson 0.848 2 Ms. Smith 0.84
We now see that the two teachers average test scores are very similar. This approach would work wellif we never again needed the rows containing missing
values.
But, if we wanted to keep those rows around and rather just exclude them from certain calculations. We can make use of another function, skipmissing
, which will simply skip over the missing
values.
julia> combine(grouped_df, :grade => (x -> mean(skipmissing(x))))22 DataFrame Row teacher_name grade_function String Float64 1 Mr. Jackson 0.848 2 Ms. Smith 0.84
One last thing to note on missing
values is that it is easy to identify if one, or more, of your DataFrame
columns contains missing
values. We talked earlier that DataFrames.jl
infers the types for eachcolumn and displays them in the output. You’ll notice in result_df
that the column teacher_id
is of datatypeInt64
and exam_id
is of Int64?
. Here the ?
denotes that missing
values were found, so be careful!
We’ve touched on some of the topics that makes DataFrames.jl
such a great general purpose package. It is a helpfultool to quickly interact for data exploration, or to be used in production code to manipulate tabular data. I hopeyou’ve enjoyed today’s reading and be sure to check out the rest of our blog posts on blog.glcs.io!