Author Archives: Josh Day

First Steps #4: Digging Into DataFrames

By: Josh Day

Re-posted from: https://www.juliafordatascience.com/first-steps-4-dataframes/


DataFrames.jl provides the most widely used tabular data structure in Julia.  In this post we'll explore DataFrames using sample data from RDatasets.jl (and we'll plot stuff using StatsPlots).

A rather timely event: DataFrames.jl has reached version 1.0!

⚙️ Setup

First, install DataFrames and RDatasets via Pkg Mode (]) in the REPL:

(@v1.6) pkg> add DataFrames RDatasets

Now load both packages along with the diamonds dataset from R's ggplot2 package.  The diamonds data contains price/size/quality information on 53,940 different diamonds.  

julia> using DataFrames, RDatasets

julia> df = dataset("ggplot2", "diamonds")
53940×10 DataFrame
   Row │ Carat    Cut        Color  Clarity  Depth    Tabl ⋯
       │ Float64  Cat…       Cat…   Cat…     Float64  Floa ⋯
───────┼────────────────────────────────────────────────────
     1 │    0.23  Ideal      E      SI2         61.5     5 ⋯
     2 │    0.21  Premium    E      SI1         59.8     6
     3 │    0.23  Good       E      VS1         56.9     6
     4 │    0.29  Premium    I      VS2         62.4     5
     5 │    0.31  Good       J      SI2         63.3     5 ⋯
     6 │    0.24  Very Good  J      VVS2        62.8     5
     7 │    0.24  Very Good  I      VVS1        62.3     5
     8 │    0.26  Very Good  H      SI1         61.9     5
   ⋮   │    ⋮         ⋮        ⋮       ⋮        ⋮        ⋮ ⋱

🚀 DataFrames Quickstart

  • Variables (columns) of a DataFrame can be referenced either by strings or symbols, e.g. "I am a string" and :I_am_a_symbol.

Make a Copy of a Column

df[:, "Carat"]

df[:, :Carat]

Extract a Column

  • These commands retrieve the exact data held in the DataFrame (no copy is made).  Warning!  Changing the extracted data also changes the values in the DataFrame.
df.Carat

df[!, "Carat"]
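
To make the copy-versus-reference distinction concrete, here is a minimal sketch on a small made-up table. The `toy` DataFrame and its values are assumptions for illustration only, not part of the diamonds data.

```julia
using DataFrames

# Hypothetical three-row table standing in for the diamonds data
toy = DataFrame(Carat = [0.23, 0.21, 0.29], Cut = ["Ideal", "Premium", "Good"])

copied = toy[:, :Carat]    # `:` as the row selector returns a fresh copy
copied[1] = 9.99           # changes only the copy
@show toy.Carat[1]         # the DataFrame still holds 0.23

shared = toy[!, :Carat]    # `!` as the row selector returns the stored vector itself
shared[1] = 9.99           # writes through to the DataFrame
@show toy.Carat[1]         # the DataFrame now holds 9.99
```

If you plan to modify a column without touching the original table, the copying form is the safer default.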

Selecting a Subset of Columns

select(df, "Carat")

select(df, ["Carat", "Cut"])

Filtering a Subset of Rows

The syntax x -> do something with x defines an anonymous function (sometimes called a lambda expression).  The filter function applies that function to each row and returns a DataFrame containing only the rows for which it returned true.

filter(row -> row.Carat > 1, df)
  • We can also use indexing (with broadcasting) rather than filter:
df[df.Carat .> 1, :]
  • For functions that accept a function as their first argument, Julia's do-block syntax can help you clean up your code.  Here we use &&, the logical "and" operator, to combine multiple filter conditions.
filter(x -> x.Carat > 1 && x.Cut == "Premium" && x.Color == "J" && 5000 <= x.Price <= 6000, df)

# Same as above, but with do-block
filter(df) do x 
    x.Carat > 1 && 
        x.Cut == "Premium" && 
        x.Color == "J" &&
        5000 <= x.Price <= 6000
end
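
The two approaches above (filter with an anonymous function, and indexing with broadcasting) should select the same rows. The sketch below checks that on a made-up four-row table; `toy`, `is_match`, and the values are assumptions chosen so the result is easy to verify by eye.

```julia
using DataFrames

# Hypothetical toy rows; column names mirror the diamonds data
toy = DataFrame(
    Carat = [0.9, 1.2, 1.5, 2.0],
    Cut   = ["Premium", "Premium", "Ideal", "Premium"],
    Price = [4000, 5500, 5800, 9000],
)

# An anonymous function bound to a name so it can be reused
is_match = row -> row.Carat > 1 && row.Cut == "Premium" && 5000 <= row.Price <= 6000

by_filter = filter(is_match, toy)

# The same conditions via broadcasting: element-wise `.&` replaces `&&`
mask = (toy.Carat .> 1) .& (toy.Cut .== "Premium") .&
       (toy.Price .>= 5000) .& (toy.Price .<= 6000)
by_index = toy[mask, :]

by_filter == by_index   # true: both keep only the second row
```

Note that `&&` works on single Bool values, so the broadcasting version needs the element-wise `.&` with each condition wrapped in parentheses.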

You can now do several essential DataFrame tasks:

  • Get a single column
  • Choose a subset of columns
  • Choose a subset of rows

Next we'll use groupby and combine to apply functions across groups of data.

🤔 How does Price relate to Cut?

We are big on learning by example, so let's start by answering this relatively simple question.  First things first: What do the Price and Cut variables look like?

julia> df.Price
53940-element Vector{Int32}:
  326
  326
    ⋮
 2757
 2757

julia> df.Cut
53940-element CategoricalArrays.CategoricalArray{String,1,UInt8}:
 "Ideal"
 "Premium"
 ⋮
 "Premium"
 "Ideal"
  • Price: The cost in US Dollars.
  • Cut: The rating of cut quality.  In order (best-to-worst): "Ideal", "Premium", "Very Good", "Good", and "Fair".  Side note: The data is stored in a CategoricalArray, which uses less memory than storing each element as a separate String.

Using groupby

We can use the groupby function to group our data by the "Cut" variable.

gdf = groupby(df, :Cut)

Using our grouped DataFrame, we can then apply a function to a variable in each group using combine.  Let's get the average Price for each level of Cut:

julia> using Statistics # for `mean`

julia> combine(gdf, :Price => mean)
5×2 DataFrame
 Row │ Cut        Price_mean
     │ Cat…       Float64
─────┼───────────────────────
   1 │ Fair          4358.76
   2 │ Good          3928.86
   3 │ Very Good     3981.76
   4 │ Premium       4584.26
   5 │ Ideal         3457.54
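
To see exactly what groupby and combine are doing, it can help to run them on data small enough to check by hand. This is a sketch on a made-up table (`toy` and its prices are assumptions); `sort = true` is passed only to make the group order predictable.

```julia
using DataFrames, Statistics

# Hypothetical four-row table: two "Ideal" diamonds and two "Fair" ones
toy = DataFrame(Cut = ["Ideal", "Fair", "Ideal", "Fair"], Price = [100, 200, 300, 600])

gdf = groupby(toy, :Cut; sort = true)   # groups ordered as Fair, Ideal

# One output row per group: the mean Price plus the number of rows in the group
stats = combine(gdf, :Price => mean, nrow)

stats.Price_mean   # [400.0, 200.0], i.e. (200 + 600) / 2 and (100 + 300) / 2
stats.nrow         # [2, 2]
```

The output column name Price_mean is generated from the input column and the function name, which matches the result shown above for the diamonds data.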

Now we know the center (mean) of the Price distribution for each Cut, but what about the spread and shape?

📊 Using StatsPlots

The StatsPlots package adds functionality and plot recipes to Plots.jl.  We'll use it to do the grouping for us so that we don't need groupby.  First, add StatsPlots:

(@v1.6) pkg> add StatsPlots

Next, use the @df <dataframe> <plot command> syntax to create a violin plot overlaid with a box plot for each level of Cut.

julia> @df df violin(string.(:Cut), :Price, lab="")

julia> @df df boxplot!(string.(:Cut), :Price, alpha=.4, lab="")
Price vs. Cut

Things to note in the code/plot above:

  • The @df macro will replace Symbols with the associated DataFrame columns.
  • We must use string.(:Cut) because Plots/StatsPlots doesn't know how to work with CategoricalArrays directly.
  • We use boxplot! (instead of boxplot) to add a new series to the existing plot.
  • We set lab (shorthand for label) to "" to avoid adding an entry to the plot legend.  If all legend entries are blank, the legend will not appear.
  • We use alpha=.4 to set the opacity of the boxplot so that it doesn't cover up the violin in the layer beneath it.

From our plot, we can see the distributions are all similarly skewed with a long right tail.  Some Cuts (Good, Premium, and Very Good) are bimodal (they have two "peaks").  However, we are ignoring some important factors (such as how Carat and Color affect the price!), so we shouldn't draw any conclusions based solely on this plot.

🚀 That's It!

You now know how to do a little bit of data wrangling with DataFrames.  What do you want to learn about next?

Enjoying Julia For Data Science?  Please share us with a friend and follow us on Twitter at @JuliaForDataSci.


Julia Quickstart

By: Josh Day

Re-posted from: https://www.juliafordatascience.com/quickstart/




This post is something between a FAQ and lightning-fast introduction to Julia.  Think of it as "First Steps #0: I've heard of Julia. What's it Like to Code in It?".  After you've read this, check out our First Steps series to keep on learning!  

This page was last updated June 10, 2021.


🤔 I'm Stuck.  Where Can I Find Help?

1. Try the Julia REPL's help mode.

2. Help mode didn't answer your question?

3. Still stuck? Time to ask for help! 🙋

  • Do you think other people have the same question?
    • Yes: Please post your question on Julia Discourse for posterity! Slack messages disappear after a time and we'd love to keep our shared knowledge searchable.
    • No: Ask on the Julia Slack.

The Julia community is full of people who like to help! We'll note that it's beneficial for everyone if you ask good questions.


Working with Arrays

Creating Vectors

x = [1, 2, 3, 4]

# A "Range" doesn't store the values between 1 and 4.
y = 1:4  

# `1:4` -> `[1, 2, 3, 4]`
collect(y)  

# 1 to 100 with a step size of 3: [1, 4, 7, ..., 94, 97, 100]
1:3:100
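
As a quick sketch of what "doesn't store the values" means in practice, a range computes its elements on demand from its start, step, and stop. The variable name `r` is just for illustration.

```julia
r = 1:3:100

length(r)     # 34 values: 1, 4, 7, ..., 100
r[2]          # 4, computed from the start and step rather than looked up
last(r)       # 100
7 in r        # true

v = collect(r)       # materialize all 34 values into a Vector
v isa Vector{Int}    # true
```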

Creating Matrices

# Row-vector (1 x 4)
[1 2 3 4] 

# Matrix (2 x 3)
[1 2 3 ; 3 4 5]

# Matrix (100 x 3) of random Normal(0, 1) samples
randn(100, 3)

Indexing (1-Based)

If someone tells you a language is unusable because it uses 1 (or 0)-based indexing, they are just plain wrong.

1-based indexing is a big deal for same reason most other fad topics are a big deal: it’s such a simple idea that everyone can have an opinion on it, and everyone seems to think they can “help” by telling their personal experience about how this arbitrary choice has affected them at one time in their life.

Chris Rackauckas via Julia Discourse

x = rand(100, 2)

x[3, 2]  # retrieve 3rd row of column 2

Arrays are Column-Major

This means that data in a matrix is stored in computer memory with column elements next to each other.

x = rand(100, 2)

x[105] == x[5, 2]
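
A smaller matrix makes the column-major layout easier to see. In this sketch (the matrix `A` is a made-up example), flattening walks down each column before moving to the next one.

```julia
A = [1 2 3; 4 5 6]     # 2 x 3 matrix

vec(A)                 # [1, 4, 2, 5, 3, 6]: column 1, then column 2, then column 3

A[4] == A[2, 2]        # true: the 4th element in memory is row 2 of column 2

# Convert between the two kinds of index
LinearIndices(A)[2, 2]     # 4
CartesianIndices(A)[4]     # CartesianIndex(2, 2)
```

One practical consequence is that loops over a matrix tend to be faster when the row index varies in the innermost loop, since that visits memory in order.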

Working With Strings

  • A big difference from some languages is that " is different from '.
  • Strings are made "like this".
  • Character literals are made like this: 's'.
  • String concatenation is achieved via *:
julia> "Hello, " * "World!"
"Hello, World!"
Hello, World!
  • String interpolation is achieved through $.
julia> x = "World!"
"World!"

julia> "Hello, $x"
"Hello, World!"
Hello Again
  • String macros change the interpretation of a string:
julia> r"[a-z]"  # I'm a regular expression!
r"[a-z]"

julia> html"<div>I'm html</div>"  # I'm HTML!
HTML{String}("<div>I'm html</div>")
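
The points above can be tried together in a few lines. This is a small sketch using only Base Julia; the variable names are illustrative.

```julia
typeof("s")    # String
typeof('s')    # Char

n = 3
msg = "n squared is $(n^2)"     # parentheses interpolate a whole expression

occursin(r"[a-z]", "ABC d")     # true: the regex matches the lowercase 'd'
occursin(r"[a-z]", "ABC")       # false: no lowercase letters

joined = "ab" * 'c'             # "abc": * also joins a String and a Char
```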

📦 How do I Find/Install/Load Packages?

Finding Packages

JuliaHub is a great resource for discovering packages.  We find it's a bit easier to find stuff compared to Googling.

It's hard to know which Julia packages are "the good ones" at first glance.  However, good packages tend to have similar characteristics:

  • Active development.  GitHub's pulse feature shows a summary of package activity.
  • Quality documentation.  It's a good sign when the docs are both understandable and thorough, as they are for DataFrames.
  • Other people are interested in it.  On GitHub, the Watch number is how many people receive notifications for activity, the Star number is how many people have "liked" it, and the Fork number is how many people have created their own copy of the package to potentially make changes to it.  It's typically a good sign when these numbers are large.
"Interest" in DataFrames.jl

Installing Packages

The simplest way to add packages is to use Pkg Mode in the REPL by pressing ].  You'll notice the prompt will change to (current environment) pkg>

(@v1.6) pkg> add DataFrames StatsBase

Loading Packages

using DataFrames, StatsBase

# Only bring certain names into the namespace
using StatsBase: countmap, zscore

Using Environments

Julia lets you use different environments that use different collections of packages/package versions.  The default environment is v1.6 (note the Pkg Mode prompt above).  You can activate a new environment with:

] activate <dir>

If you make changes (e.g. add a package) to an environment, two files will be created: Project.toml and Manifest.toml.

  • What's Project.toml?  How the user tells Julia what they want installed.  Version bounds for packages go here.
  • What's Manifest.toml?  How Julia tells the user what is installed.  
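
Environments can also be activated from regular Julia code through the Pkg API, which is handy in scripts. The sketch below assumes a throwaway temporary directory (`env_dir` is a made-up name) so that it leaves nothing behind.

```julia
using Pkg

# Hypothetical throwaway directory, used only for this sketch
env_dir = mktempdir()

Pkg.activate(env_dir)        # same effect as `] activate <dir>` in Pkg Mode

# Path of the Project.toml that Julia will now read from and write to
project_file = Base.active_project()

# Project.toml and Manifest.toml are written once the environment changes, e.g. after
# Pkg.add("DataFrames")      # left commented out because it needs network access
```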

What are Types?

  • Everything in Julia has a type.
julia> typeof(1)
Int64
  • Types can be parameterized by another type.  For example, an Array is parameterized by the type of its elements and number of dimensions.  Therefore, a vector of 64-bit integers is an Array{Int64, 1}.
julia> typeof([1,2,3])
Vector{Int64} (alias for Array{Int64, 1})
  • If we follow Int64 "up the type tree" we'll eventually run into Any, the top level abstract type.  
julia> supertype(Int64)
Signed

julia> supertype(Signed)
Integer

julia> supertype(Integer)
Real

julia> supertype(Real)
Number

julia> supertype(Number)
Any
  • Abstract types "don't exist", but they define a set of concrete types (things that exist).  For example, you can create an instance of Int64, but not Real.  Inside the set of all Real numbers, Int64 is one of many concrete types.
Real Numbers
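
The walk "up the type tree" can be automated, and Julia can tell you whether a type is abstract or concrete. In this sketch, `type_chain` is a made-up helper, not a built-in function.

```julia
# `type_chain` is a hypothetical helper that collects each supertype up to Any
function type_chain(T::Type)
    chain = Type[T]
    while T !== Any
        T = supertype(T)
        push!(chain, T)
    end
    return chain
end

type_chain(Int64)        # Int64, Signed, Integer, Real, Number, Any

isconcretetype(Int64)    # true: you can create an Int64
isabstracttype(Real)     # true: Real only groups other types
Int64 <: Real            # true: Int64 sits inside the set of Real types
1 isa Real               # true: so every Int64 value counts as a Real
```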

🎉 What is Multiple Dispatch? 🎉

  • Multiple dispatch is a major part of why people love Julia.  The gist of it is that you can write functions so that different/specialized code is called depending on the types of the arguments.
julia> f(x::Int) = 1
f (generic function with 1 method)

julia> f(x::Float64) = 2
f (generic function with 2 methods)

julia> f(1)
1

julia> f(1.0)
2
Super Simple Multiple Dispatch Example
  • Above we used ::Type to add a type annotation.  Since we only added methods for ::Int and ::Float64, our function f can only be called on Ints and Float64s.  However, type annotations are not necessary for good performance.
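
As a slightly larger sketch of the same idea, a method without any annotation acts as a catch-all, and dispatch looks at every argument rather than only the first. The function names `kind` and `pair_kind` are made up for this example.

```julia
kind(x::Int)     = "Int"
kind(x::Float64) = "Float64"
kind(x)          = "something else"   # no annotation is the same as x::Any

kind(1)       # "Int"
kind(1.0)     # "Float64"
kind("one")   # "something else": the most specific matching method wins

# Dispatch considers all arguments together
pair_kind(x::Number, y::Number)         = "two numbers"
pair_kind(x::Number, y::AbstractString) = "a number and a string"

pair_kind(1, 2.0)     # "two numbers"
pair_kind(1, "two")   # "a number and a string"
```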

Automatic Specialized Code

  • Julia uses a just-in-time (JIT) compiler: the first time you call a function with a new combination of argument types, Julia compiles a specialized version of the code for exactly those types.  Thus the following two functions will have the same performance!
function f(x::Type1, y::Type2, z::Type3)
    # big computation
end

function f(x, y, z)
    # big computation
end
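
Here is a runnable version of that comparison, as a sketch. The two functions (`sumsq_annotated` and `sumsq_generic` are made-up names) do the same work; the example checks that they agree and that the generic one accepts more input types. It does not measure speed. For timing, a benchmarking package such as BenchmarkTools is the usual choice.

```julia
# Annotated version: only accepts a Vector{Float64}
function sumsq_annotated(x::Vector{Float64})
    s = 0.0
    for v in x
        s += v^2
    end
    return s
end

# Generic version: no annotations, specialized by the compiler per input type
function sumsq_generic(x)
    s = zero(eltype(x))
    for v in x
        s += v^2
    end
    return s
end

data = [1.0, 2.0, 3.0]
sumsq_annotated(data) == sumsq_generic(data)   # true, both give 14.0

sumsq_generic([1, 2, 3])   # 14, computed with Int arithmetic
sumsq_generic(1:3)         # 14, works on a range too
```

Annotations are still useful for dispatch and for documenting intent; they are just not needed to get specialized code.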

What is Broadcasting?

Broadcasting is a way of applying a function to multiple inputs at once.  

  • For example, there is no mathematical definition for calling the sine function on a vector, but many languages will automatically apply sine to each element.  In Julia, you must explicitly broadcast a function over multiple inputs by adding a dot (.).
julia> sin([1,2,3])
ERROR: MethodError: no method matching sin(::Vector{Int64})

julia> sin.([1,2,3])
3-element Vector{Float64}:
 0.8414709848078965
 0.9092974268256817
 0.1411200080598672
  • You can even fuse broadcasted computations, which removes the need to create temporary vectors:
julia> x = [1,2,3];

julia> y = [4,5,6];

julia> z = [7,8,9];

julia> x .+ (y .* sin.(z))
3-element Vector{Float64}:
 3.6279463948751562
 6.946791233116909
 5.472710911450539
Broadcast Fusion
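
A few related conveniences are worth a quick sketch: the `@.` macro adds the dots for you, dots work on functions you define yourself, and `.=` writes a fused result into an existing array. The function `halve` and the array `out` are made up for this example.

```julia
x = [1, 2, 3]
y = [4, 5, 6]
z = [7, 8, 9]

a = x .+ (y .* sin.(z))
b = @. x + y * sin(z)      # @. puts a dot on every call and operator
a == b                     # true

# Dots work for your own functions too
halve(v) = v / 2
halve.(x)                  # [0.5, 1.0, 1.5]

# `.=` stores the fused result in existing memory instead of allocating a new vector
out = zeros(3)
out .= x .+ (y .* sin.(z))
out == a                   # true
```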

How do I Code in Julia?

According to the 2020 Julia User & Developer Survey (PDF), Julia programmers use the following editors/IDEs "frequently":

A new coding environment on the scene is Pluto.jl, which we love!  If you are new to Julia or programming in general, we recommend starting with Pluto 🎈.


What are Macros?

Macros (names that start with @) are functions of expressions.  They let you change an expression before it gets run.  For example, @time will record both the time elapsed and allocations generated from an expression.

julia> @time begin
          sleep(1)
          sleep(2)
       end
  3.008873 seconds (8 allocations: 256 bytes)

Metaprogramming (writing code that writes other code) is a pretty advanced topic.  It's also a super powerful tool.
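
To make "functions of expressions" a little more concrete, here is a minimal sketch. It first builds an expression by hand, then defines a tiny macro; `@twice` is a made-up example, not something that ships with Julia.

```julia
ex = :(1 + 2)      # quoting builds an expression instead of running it
typeof(ex)         # Expr
ex.args            # Any[:+, 1, 2]
eval(ex)           # 3

# `@twice` receives an expression and returns a new expression
# that runs the original two times.
macro twice(ex)
    return quote
        $(esc(ex))
        $(esc(ex))
    end
end

counter = Ref(0)
@twice counter[] += 1
counter[]          # 2
```

The call to `esc` tells Julia to resolve names like `counter` in the caller's scope rather than inside the macro's own module.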


That's it!

Did you like this post?  Have a question?  Did we miss something important?

Ping us on Twitter at @JuliaForDataSci 🚀


First Steps #3: A Primer on Plots

By: Josh Day

Re-posted from: https://www.juliafordatascience.com/first-steps-3-primer-on-plots/


Visualizing data is an essential skill for a data scientist.  Unlike R, Julia does not ship with plotting functionality built-in.  If you search for ways to make plots in Julia, you'll discover a lot of options.  So what should you use?

📊 Plots.jl

We recommend the Plots package (especially for beginners).

Plots is a unified interface for creating visualizations with different backends (such as GR, Plotly.js, and UnicodePlots).  It's great for both beginners and power users, and it's designed such that a lot of things you try will "just work".

💻 Install Plots

In the Julia REPL, add the Plots package if you haven't already done so.  Recall that you enter Pkg Mode by pressing ]:

(@v1.6) pkg> add Plots

📈 Create Your First Plot

Back in Julia mode (press backspace, labeled delete on a Mac keyboard, to leave Pkg Mode), enter:

julia> using Plots

julia> plot(randn(10), title="My First Plot")

🎉 Congrats!  You made your first plot 📈!  You created it using:

  1. randn(10): A Vector of 10 random samples from a Normal(0,1) distribution.
  2. The GR backend (Plots' default).

✨ Core Principles

The main function you'll use, as you may have guessed, is

plot(args...; kw...)

Here args... means any number of positional arguments and kw... is any number of keyword arguments.  Look back at the first plot we created and notice that we provided data randn(10) as a positional argument and the title title="My First Plot" as a keyword argument.  Another function you'll use is

plot!(args...; kw...)

In Julia, ! is used as a convention to identify functions that mutate at least one of the arguments.  With Plots, this lets you make changes or additions to a plot.
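
Both ideas can be tried without Plots. In this sketch, `show_call` is a made-up function that simply reports what it received, and `sort`/`sort!` from Base illustrate the `!` convention.

```julia
# `show_call` is a hypothetical function with the same signature shape as plot
function show_call(args...; kw...)
    return (positional = length(args), keywords = collect(keys(kw)))
end

result = show_call(randn(10); title = "My First Plot")
# (positional = 1, keywords = [:title])

v = [3, 1, 2]
sorted = sort(v)   # returns a new vector; v is unchanged
v                  # still [3, 1, 2]
sort!(v)           # mutates v in place
v == sorted        # true
```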


Now that we know the functions we are using, let's look at the core principles:

Principle #1: Everything You Plot is a Series

When you give data to the plot function (like randn(10) above), the seriestype determines how Plots will interpret the data.  By default this is :path.  

plot(1:10, seriestype = :path, label = "Series 1")

plot!(rand(1:10,10), seriestype = :scatter, label = "Series 2")

Principle #2: Plot Attributes have Aliases

Plot attributes are passed by keyword arguments.   Because of aliases, you can often guess at the name of an attribute and Plots will interpret it correctly.  For example, the following commands are equivalent:

plot(randn(10), seriestype = :scatter)

plot(randn(10), st = :scatter)

scatter(randn(10))

Principle #3: Columns are Mapped to Series

For both data and attributes, the columns of matrices are mapped to individual series.  In this example, we create two series by providing a 10 x 2 matrix.  Now look at the difference between p1 and p2.  If the st (seriestype) attribute is a vector, the whole vector is applied to every series, so each series is drawn as both :scatter and :line.  If the st attribute is a matrix, the attribute in the i-th column is mapped to the i-th series.  This is a very succinct way of providing attributes to series.

x = randn(10, 2)

# Series 1 --> :scatter & :line
# Series 2 --> :scatter & :line
p1 = plot(x, st=[:scatter, :line])  

# Series 1 --> :scatter
# Series 2 --> :line
p2 = plot(x, st=[:scatter :line]) 

plot(p1, p2)
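
The only difference between p1 and p2 is a comma, which is easy to miss. This sketch uses plain Julia (no plotting) to show that commas build a Vector while spaces build a one-row Matrix; the names `col` and `row` are illustrative.

```julia
col = [:scatter, :line]   # commas: a 2-element Vector
row = [:scatter :line]    # spaces: a 1 x 2 Matrix

size(col)   # (2,)
size(row)   # (1, 2)

# Column i of `row` lines up with column i of the 10 x 2 data matrix
x = randn(10, 2)
size(x, 2) == size(row, 2)   # true
```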

Principle #4:  Some Attributes are Magic 🪄

Some attributes can be provided with multiple values all at once and Plots will figure out what to do with them.  For example, using m=(10, .5, "blue") will set the marker size to 10, the marker alpha (opacity) to 0.5, and the marker color to "blue".

plot(randn(10), m = (10, .5, "blue"))
Plot Created with Magic

Principle #5: Many Types have Plot Recipes

This is best seen through example.  Let's add the RDatasets and OnlineStats packages via Pkg Mode in the REPL:

(@v1.6) pkg> add OnlineStats RDatasets

Now load the packages and retrieve the diamonds dataset that comes packaged with R's ggplot2.  The diamonds data is a collection of variables on diamond price and quality.

using RDatasets, OnlineStats

df = dataset("ggplot2", "diamonds")

Suppose the first thing we want to see is the distribution of the :Cut variable in our diamonds data.  We'll use OnlineStats.CountMap to count the number of occurrences for each unique value in the :Cut column.  

When we plot the CountMap, a recipe is invoked to turn it into data that Plots knows how to display.  The advantage of a recipe over, say, a dedicated plot_countmap function is that it hooks into plot attributes just as if you were plotting raw numbers.

o = CountMap(String)

fit!(o, string.(df.Cut))

plot(o, title="Neat!")

Try This!

Use a Different Backend

The backends of Plots can be changed interactively.  Try typing

plotly()

to switch to the interactive JavaScript library Plotly.js.  Then rerun the above examples.

That's It!

Now you know Plots' core principles.  Time to try a few things on your own!

