Tag Archives: Julia

Diagramming + Data Visualization with Julia

By: DSB

Re-posted from: https://medium.com/coffee-in-a-klein-bottle/diagramming-data-visualization-with-julia-37147ce63168?source=rss-8bd6ec95ab58------2

A new approach to Data Visualization using Vizagrams.jl

Data visualization and diagramming are usually treated as distinct subjects. Yet, for the keen observer, they are quite similar. In fact, one could say that data visualization is a subset of diagramming, where data is used in a structured manner to generate a diagram drawing. This idea is explored in the Julia package Vizagrams.jl, which implements a diagramming Domain Specific Language (DSL) with a data visualization grammar built on top of it.

In this article, we show how to use Vizagrams.jl both for diagramming and data visualization. I highly recommend using a notebook such as Jupyter or Pluto in order to follow along.

Installation

The package is registered in the Julia general repository, hence, it can be installed by simply doing:

julia>]
pkg> add Vizagrams

The Basics of Diagramming

Vizagrams implements a diagramming DSL inspired in the great Haskell library Diagrams. We can think of a diagram as simply a collection of primitive graphical objects (circles, rectangles, lines…), where the final drawing simply renders each object one after the other, much like an SVG.

We start with the most basic example, a single circle:

using Vizagrams
# My first Diagram
d = Circle()
draw(d)

By default, our circle is drawn as black. We can modify its color by applying a style transformation.

d = S(:fill=>:red)*Circle()
draw(d)

We can make things more interesting by adding another object to our drawing. How can we do this? Well, just add it:

d = S(:fill=>:red)*Circle() + Square()
draw(d)

Note that the order of the summation matters. By adding a square after the circle, Vizagrams renders the square above the circle.

We have used the S(:fill=>:red) to apply the fill color red to the circle. Besides stylistic transformations, we can also apply geometric transformations, such as translation T , rotation R and scaling U .

d = S(:fill=>:red)*Circle() +
S(:stroke=>:blue,:fill=>:white)T(2,0)*Square() +
S(:fill=>:white)R(3.14/4)*Square()

draw(d)

Finally, we can combine also combine existing diagrams to form new ones.

d_2 = d + T(4,0) * d
draw(d_2)

Constructing Data Visualizations

Let us now go to plotting. Again, we start with the most basic example:

plt = plot(x=rand(100),y=rand(100),color=rand(["a","b"],100))
draw(plt)

As expected, the result is a simple scatter plot. Yet, there is something interesting going on. The variable plt is actually holding a diagram. Which means that we can manipulate it just like we did previously with other diagrams, and we can also combine it with other diagrams.

d = R(π/4)plt +
T(250,0)*plt +
Line([[180,250],[350,350],[350,200]]) +
T(350,350)Circle(r=10)

draw(d)

Although the example above was silly, it does illustrates the possibilities of what can be done. In fact, it is easy to see how one can combine diagramming operations to construct more useful visualizations, such as:

Scatter Plot with a PCA fit over the data. The histogram represents the error distribution in the PCA fit. This plot was draw using Vizagrams.jl.

Visualization Grammars with Diagrams

In Vizagrams, the diagramming DSL was used to build a data visualization grammar, i.e. a specification that can produce a variety of visualizations. The syntax for the grammar is based on Vega-Lite, which is a very popular grammar in the Data Visualization community (the Python package Altair is based on Vega-Lite, and there is also a Julia package called VegaLite.jl).

Explaining visualization grammars would take an article on its own. Yet, the specification style if fairly simple, so hopefully even those not familiar with Vega-Lite will understand what is going on.

# Importing DataFrames to store the data
using DataFrames

# VegaDatasets is used to load the dataset `cars`
using VegaDatasets
df = DataFrame(dataset("cars"));
df = dropmissing(df);

# Here is where Vizagrams plot specification actually starts
plt = Plot(
data=df,
encodings=(
x=(field=:Horsepower,),
y=(field=:Miles_per_Gallon,),
color=(field=:Origin,),
size=(field=:Acceleration,),
),
graphic=Circle()
)
draw(plt)

Note that we specified our plot by first passing on the dataset. Then, we defined the “encodings”, which were x , y , color and size. Each encoding variable has a parameter field which says which column in the dataset is mapped to it. Thus, in our example, we are mapping “Horsepower” to the x-axis, “Miles_per_Gallon” to the y-axis, the color is varying according to the “Origin” and the size according to “Acceleration”. At last, the “graphic” is specifying what is to be drawn in this plot. Since we passed Circle() , this means that we are drawing circles.

Here is where things get interesting. We can pass diagrams to this “graphic” parameter in the plot specification. First, let us create a diagram:

d = S(:fill=>:white,:stroke=>:black)*Circle(r=2) +
Circle() +
T(2,0)*Square()
draw(d)

Now, we place this diagram inside the plot specification, and…

plt = Plot(
data=df,
encodings=(
x=(field=:Horsepower,),
y=(field=:Miles_per_Gallon,),
color=(field=:Origin,),
size=(field=:Acceleration,),
),
graphic = Mark(d)
)

draw(plt)

Again, this example is not very useful, yet, it illustrates the sort of things that can be achieved. Here is perhaps a more “useful” example:

Scatter plot using Penguin mark for the famous Palmer Penguis dataset. The complete example can be found in Vizagrams documentation.

Some Final Words

This article was a mere introduction to Vizagrams. There is much more to be explored, such as graphical marks creation, graphic expressions, stacking operations, and so on.

If you are interested in learning more about Vizagrams, check out the documentation. Again, I recommend using a notebook (Jupyter or Pluto), as one can quickly experiment with different designs.


Diagramming + Data Visualization with Julia was originally published in Coffee in a Klein Bottle on Medium, where people are continuing the conversation by highlighting and responding to this story.

Getting Started with DataFrames.jl: A Beginner’s Guide

By: Joel Nelson

Re-posted from: https://blog.glcs.io/julia-dataframes

When doing any sort of development one will often find themselves in need of working with data in atabular format. This is especially true for those of us in data science, or data analysis, fields.In the Julia programming language one of the more popular libraries for this type of datawrangling is DataFrames.jl. In this blog post we’ll explore the beginnings of working with thispackage.

Introduction

The great thing about a package like Dataframes.jl is that it bridges the gap between traditionalprogramming and SQL (Structured Query Language). Databases are great tools for easily gaining insightsinto your data by joining, filtering, aggregating, sorting, etc… Dataframes.jl brings those goodies rightinto your hands by simply adding the package into your julia session. So, lets get started!

Getting Started

Adding the package is a few simple steps.

julia> using Pkgjulia> Pkg.add("DataFrames")julia> using DataFrames

The constructor for a DataFrame provides flexibility to create from arrays, tuples, constants, or files. The documentation covers all these, but for this post we’ll just explore one of the more common ways.

julia> df = DataFrame(a = 1:4, b = rand(4), c = "My first DataFrame")43 DataFrame Row  a      b         c                        Int64  Float64   String                1      1  0.141874  My first DataFrame   2      2  0.432084  My first DataFrame   3      3  0.47098   My first DataFrame   4      4  0.414639  My first DataFrame

You’ll notice in the code above we use a mix of datatypes including range, array, and scalar. The underlying vectors must be of the same sizeand the scalar gets broadcasted, or repeated, for each row. Also, pay attention that the types of each column are inferredbased on the arrays passed into the constructor.

Now, to access a column of a DataFrame there are also a few different possibilities. Here are a few examples of accessing thefirst column “a”.

julia> df.a4-element Vector{Int64}: 1 2 3 4julia> df."a"4-element Vector{Int64}: 1 2 3 4julia> df[!, "a"]4-element Vector{Int64}: 1 2 3 4julia> df[!, :a]4-element Vector{Int64}: 1 2 3 4julia> df[:, :a]4-element Vector{Int64}: 1 2 3 4

In these examples columns can be access directly with literals such as df.a, or more dynamically using brackets (since variables could be substituted.) You may also findyourself wondering the difference between ! and :, which is an important distinction!

The ! returns the underlying vector and : returns a copy. This can be showcased in anexample where we will attempt to change the description of the second value in column cto “I love Julia!”

julia> df[:, :c][2] = "I love Julia!"3julia> df43 DataFrame Row  a      b         c                        Int64  Float64   String                1      1  0.394165  My first DataFrame   2      2  0.809883  My first DataFrame   3      3  0.124035  My first DataFrame   4      4  0.886781  My first DataFramejulia> df[!, :c][2] = "I love Julia!"3julia> df43 DataFrame Row  a      b         c                        Int64  Float64   String                1      1  0.394165  My first DataFrame   2      3  0.809883  I love Julia!   3      3  0.124035  My first DataFrame   4      4  0.886781  My first DataFrame

Notice how the change will only persist to df when we access the column with !.

There is often a tradeoff between returning copies versus the actual underlying vectors. Returning a copy is generally considered safer since if the copy is later mutated the underlyingDataFrame remains unchanged. However, with very large DataFrames copying every column access willresult in an increase in memory. It is best to weigh those considerations and figure out whatapproach will work best for a given program.

Data Wrangling

Import / Export

Another great feature of the Julia programming language is that many different packages will interact wellwhen used together. For instance, DataFrames.jl and CSV.jl can be used to very easily import and exportdata.

First, we can save the DataFrame from above to CSV.

julia> using CSVjulia> path = joinpath(homedir(), "my_df.csv")julia> CSV.write(path, df)

And, reading in the DataFrame from file is just as easy!

julia> CSV.read(path, DataFrame)43 DataFrame Row  a      b         c                        Int64  Float64   String31              1      1  0.601361  My first DataFrame   2      2  0.178065  My first DataFrame   3      3  0.729591  My first DataFrame   4      4  0.280314  My first DataFrame

There are many keyword arguments to explore when handling csv files and the documentation is best forcovering all of these CSV.jl.

DataFrames.jl also supports writing and reading to multiple files types such as Arrow, JSON, Parquet, and others.

Joins

A join is a way to merge data from two DataFrames into a single DataFrame. There are several typesand they generally mimic the same types that a database would support.

  • innerjoin
  • leftjoin
  • rightjoin
  • outerjoin
  • semijoin
  • antijoin
  • crossjoin

Definitions of each can be found in either the documentation, or docstrings, but lets take a look at a fewexamples. Say we have the following DataFrame sets containing information from a school.

julia> student_df = DataFrame(student_id = 1:10, student_name = ["Joe", "Sally", "Jim", "Sandy", "Beth", "Alex", "Tom", "Liz", "Bill", "Carl"], teacher_id = repeat([1,2],5))103 DataFrame Row  student_id  student_name  teacher_id       Int64       String        Int64         1           1  Joe                    1   2           2  Sally                  2   3           3  Jim                    1   4           4  Sandy                  2   5           5  Beth                   1   6           6  Alex                   2   7           7  Tom                    1   8           8  Liz                    2   9           9  Bill                   1  10          10  Carl                   2julia> teacher_df = DataFrame(teacher_id = 1:2, teacher_name = ["Mr. Jackson", "Ms. Smith"])22 DataFrame Row  teacher_id  teacher_name       Int64       String          1           1  Mr. Jackson   2           2  Ms. Smithjulia> grade_df = DataFrame(exam_id = 1, student_id = vcat(1:3, 5:10), grade = [0.95, 0.93, 0.81, 0.85, 0.73, 0.88, 0.77, 0.75, 0.93])93 DataFrame Row  exam_id  student_id  grade         Int64    Int64       Float64    1        1           1     0.95   2        1           2     0.93   3        1           3     0.81   4        1           5     0.85   5        1           6     0.73   6        1           7     0.88   7        1           8     0.77   8        1           9     0.75   9        1          10     0.93

If we look at the grade_df we can see there are 9 results, but in the student_df we have 10 students.So, someone must have missed the exam! Let’s find out who that way we can alert the teacher to schedulea makeup.

Let’s do a leftjoin, which means every row will persist from the first DataFrame regardless if there isa match to the second DataFrame. The leftjoin function also takes an on keyword argumentto signify what column needs to be used to find matches.

julia> student_grade_df = leftjoin(student_df, grade_df, on=:student_id)105 DataFrame Row  student_id  student_name  teacher_id  exam_id  grade            Int64       String        Int64       Int64?   Float64?      1           1  Joe                    1        1        0.95   2           2  Sally                  2        1        0.93   3           3  Jim                    1        1        0.81   4           5  Beth                   1        1        0.85   5           6  Alex                   2        1        0.73   6           7  Tom                    1        1        0.88   7           8  Liz                    2        1        0.77   8           9  Bill                   1        1        0.75   9          10  Carl                   2        1        0.93  10           4  Sandy                  2  missing  missing

We notice Sandy has a missing value for both the exam_id and grade fields. missing is a special datatype in Julia that is similar to a null value in databases. This would signify tous that there was no match in the grade_df meaning Sandy missed the exam. We can add one more join to get the respective teacher’s name.

julia> result_df = innerjoin(student_grade_df, teacher_df, on=:teacher_id)106 DataFrame Row  student_id  student_name  teacher_id  exam_id  grade       teacher_name       Int64       String        Int64       Int64?   Float64?    String          1           1  Joe                    1        1        0.95  Mr. Jackson   2           2  Sally                  2        1        0.93  Ms. Smith   3           3  Jim                    1        1        0.81  Mr. Jackson   4           5  Beth                   1        1        0.85  Mr. Jackson   5           6  Alex                   2        1        0.73  Ms. Smith   6           7  Tom                    1        1        0.88  Mr. Jackson   7           8  Liz                    2        1        0.77  Ms. Smith   8           9  Bill                   1        1        0.75  Mr. Jackson   9          10  Carl                   2        1        0.93  Ms. Smith  10           4  Sandy                  2  missing  missing     Ms. Smith

We used an innerjoin this time since we know that every student would have a teacher assigned.Now, we can let Ms. Smith know that she needs to reach out to Sandy to re-schedule her exam.

Sorting

Another helpful function for analysis is sort. Let’s sort our result_df by the grade column.

julia> sort(result_df, [:grade])106 DataFrame Row  student_id  student_name  teacher_id  exam_id  grade       teacher_name       Int64       String        Int64       Int64?   Float64?    String          1           6  Alex                   2        1        0.73  Ms. Smith   2           9  Bill                   1        1        0.75  Mr. Jackson   3           8  Liz                    2        1        0.77  Ms. Smith   4           3  Jim                    1        1        0.81  Mr. Jackson   5           5  Beth                   1        1        0.85  Mr. Jackson   6           7  Tom                    1        1        0.88  Mr. Jackson   7           2  Sally                  2        1        0.93  Ms. Smith   8          10  Carl                   2        1        0.93  Ms. Smith   9           1  Joe                    1        1        0.95  Mr. Jackson  10           4  Sandy                  2  missing  missing     Ms. Smith

The function takes the DataFrame and an array of columns to sort on. Our result of sort is putting the lowestgrade first, but if we wanted it descending we can pass a rev keyword argument.

julia> sort(result_df, [:grade], rev=true)106 DataFrame Row  student_id  student_name  teacher_id  exam_id  grade       teacher_name       Int64       String        Int64       Int64?   Float64?    String          1           4  Sandy                  2  missing  missing     Ms. Smith   2           1  Joe                    1        1        0.95  Mr. Jackson   3           2  Sally                  2        1        0.93  Ms. Smith   4          10  Carl                   2        1        0.93  Ms. Smith   5           7  Tom                    1        1        0.88  Mr. Jackson   6           5  Beth                   1        1        0.85  Mr. Jackson   7           3  Jim                    1        1        0.81  Mr. Jackson   8           8  Liz                    2        1        0.77  Ms. Smith   9           9  Bill                   1        1        0.75  Mr. Jackson  10           6  Alex                   2        1        0.73  Ms. Smith

In both these cases a copy of the DataFrame is returned and the result_df is left unchanged. But, if we wanted tosort in-place we can also use the sort! function that will update the passed DataFrame.

julia> sort!(result_df, [:grade], rev=true)106 DataFrame Row  student_id  student_name  teacher_id  exam_id  grade       teacher_name       Int64       String        Int64       Int64?   Float64?    String          1           4  Sandy                  2  missing  missing     Ms. Smith   2           1  Joe                    1        1        0.95  Mr. Jackson   3           2  Sally                  2        1        0.93  Ms. Smith   4          10  Carl                   2        1        0.93  Ms. Smith   5           7  Tom                    1        1        0.88  Mr. Jackson   6           5  Beth                   1        1        0.85  Mr. Jackson   7           3  Jim                    1        1        0.81  Mr. Jackson   8           8  Liz                    2        1        0.77  Ms. Smith   9           9  Bill                   1        1        0.75  Mr. Jackson  10           6  Alex                   2        1        0.73  Ms. Smith

Split-apply-combine

Now that we have some basics down, it’s time to dive into aggregating results. In DataFrames.jl this isreferred to as a split-apply-combine strategy. It is a bit of a mouthful, but let’s walk through whatexactly this is referring to.

Split is simply breaking the DataFrame into groups using the groupby function. In our example lets splitour DataFrame by the teacher_name column.

julia> grouped_df = groupby(result_df, :teacher_name)GroupedDataFrame with 2 groups based on key: teacher_nameFirst Group (5 rows): teacher_name = "Mr. Jackson" Row  student_id  student_name  teacher_id  exam_id  grade     teacher_name       Int64       String        Int64       Int64?   Float64?  String          1           1  Joe                    1        1      0.95  Mr. Jackson   2           3  Jim                    1        1      0.81  Mr. Jackson   3           5  Beth                   1        1      0.85  Mr. Jackson   4           7  Tom                    1        1      0.88  Mr. Jackson   5           9  Bill                   1        1      0.75  Mr. JacksonLast Group (5 rows): teacher_name = "Ms. Smith" Row  student_id  student_name  teacher_id  exam_id  grade       teacher_name       Int64       String        Int64       Int64?   Float64?    String          1           2  Sally                  2        1        0.93  Ms. Smith   2           6  Alex                   2        1        0.73  Ms. Smith   3           8  Liz                    2        1        0.77  Ms. Smith   4          10  Carl                   2        1        0.93  Ms. Smith   5           4  Sandy                  2  missing  missing     Ms. Smith

The result of calling groupby is of type GroupedDataFrame, which is basically a wrapperaround one, or many, groups of a DataFrame. In our example we have two teachers and so the resultGroupedDataFrame has two groups.

Now, lets try to get an average exam grade for our two teachers. This will introduce the combinefunction that takes a GroupedDataFrame and any number of aggregation functions. Let’s also addthe Statistics.jl package, so we can take advantage of the mean function.

julia> using Statisticsjulia> combine(grouped_df, :grade => mean)22 DataFrame Row  teacher_name  grade_mean        String        Float64?       1  Mr. Jackson         0.848   2  Ms. Smith     missing

The result is a DataFrame where the first column(s) will match our GroupedDataFrame key(s) and the subsequent column(s) will match the function(s) we pass for aggregation. However, Ms. Smith has a grade_mean of missing!?

In our earlier discussion we found that Sandy missed the exam, so her grade was set to missing. A missing valuebehaves differently than normal numbers, which is problematic in our aggregation function. Take a look at a very simpleexample.

julia> 1 + missingmissing

We notice that adding a value of 1 to missing equals missing. This is a necessary evil and you may be wonderingwhy don’t we just treat it as 0? Let’s see what happens to our results if we replace missing with 0.

julia> combine(grouped_df, :grade => (x -> mean(coalesce.(x, 0))))22 DataFrame Row  teacher_name  grade_function       String        Float64           1  Mr. Jackson            0.848   2  Ms. Smith              0.672

In the above example, instead of just passing mean as the function we create an anonymous function. This allows us to get a littlemore clever with adding a coalesce to replace the missing values with 0. We see from the results that Ms. Smith has a much lowerscoring average than Mr. Jackson. But, if we think about it the results are getting incorrectly skewed. We know Sandy didn’t actuallyscore a 0, but rather didn’t take the test at all. Treating her result as a 0 is skewing the average much lower than it should be.

In some cases replacing with a 0 would make sense, but not in this scenario. Here are a few better options:

We could just drop the rows that contain missing values prior to aggregation. DataFrames.jl provides a dropmissing functionspecifically for this.

julia> result_no_missing_df = dropmissing(result_df)96 DataFrame Row  student_id  student_name  teacher_id  exam_id  grade    teacher_name       Int64       String        Int64       Int64    Float64  String          1           1  Joe                    1        1     0.95  Mr. Jackson   2           2  Sally                  2        1     0.93  Ms. Smith   3           3  Jim                    1        1     0.81  Mr. Jackson   4           5  Beth                   1        1     0.85  Mr. Jackson   5           6  Alex                   2        1     0.73  Ms. Smith   6           7  Tom                    1        1     0.88  Mr. Jackson   7           8  Liz                    2        1     0.77  Ms. Smith   8           9  Bill                   1        1     0.75  Mr. Jackson   9          10  Carl                   2        1     0.93  Ms. Smithjulia> grouped_no_missing_df = groupby(result_no_missing_df, :teacher_name)GroupedDataFrame with 2 groups based on key: teacher_nameFirst Group (5 rows): teacher_name = "Mr. Jackson" Row  student_id  student_name  teacher_id  exam_id  grade    teacher_name       Int64       String        Int64       Int64    Float64  String          1           1  Joe                    1        1     0.95  Mr. Jackson   2           3  Jim                    1        1     0.81  Mr. Jackson   3           5  Beth                   1        1     0.85  Mr. Jackson   4           7  Tom                    1        1     0.88  Mr. Jackson   5           9  Bill                   1        1     0.75  Mr. JacksonLast Group (4 rows): teacher_name = "Ms. Smith" Row  student_id  student_name  teacher_id  exam_id  grade    teacher_name       Int64       String        Int64       Int64    Float64  String          1           2  Sally                  2        1     0.93  Ms. Smith   2           6  Alex                   2        1     0.73  Ms. Smith   3           8  Liz                    2        1     0.77  Ms. Smith   4          10  Carl                   2        1     0.93  Ms. Smithjulia> combine(grouped_no_missing_df, :grade => mean)22 DataFrame Row  teacher_name  grade_mean       String        Float64       1  Mr. Jackson        0.848   2  Ms. Smith          0.84

We now see that the two teachers average test scores are very similar. This approach would work wellif we never again needed the rows containing missing values.

But, if we wanted to keep those rows around and rather just exclude them from certain calculations. We can make use of another function, skipmissing, which will simply skip over the missing values.

julia> combine(grouped_df, :grade => (x -> mean(skipmissing(x))))22 DataFrame Row  teacher_name  grade_function       String        Float64           1  Mr. Jackson            0.848   2  Ms. Smith              0.84

One last thing to note on missing values is that it is easy to identify if one, or more, of your DataFramecolumns contains missing values. We talked earlier that DataFrames.jl infers the types for eachcolumn and displays them in the output. You’ll notice in result_df that the column teacher_id is of datatypeInt64 and exam_id is of Int64?. Here the ? denotes that missing values were found, so be careful!

Conclusion

We’ve touched on some of the topics that makes DataFrames.jl such a great general purpose package. It is a helpfultool to quickly interact for data exploration, or to be used in production code to manipulate tabular data. I hopeyou’ve enjoyed today’s reading and be sure to check out the rest of our blog posts on blog.glcs.io!