First Steps #5: Pluto.jl

Re-posted from: https://www.juliafordatascience.com/first-steps-5-pluto/

What's Pluto?

Notebook environments (e.g. Jupyter and Observable) have become extremely popular in the last decade. They give programmers a way to intersperse code with markup, add interactive UI elements, and show off code in a format more interesting than text files. People love them (well, not everyone).

Pluto.jl is a newcomer (PlutoCon 2021 was just held to celebrate its one-year anniversary!) to the world of notebook environments. It provides a reactive environment specific to Julia. People are doing some very cool things with Pluto. Check out MIT's Introduction to Computational Thinking course for some fantastic public lectures with Pluto.

Pluto Quickstart

Installing Pluto:

] add Pluto

Starting the Pluto Server:

using Pluto

Pluto.run()

The above command will open up the following page.

[Screenshot: Pluto welcome screen]

To get back to this page from an opened notebook, click the Pluto.jl icon in the top navbar.
For a deeper introduction to Pluto, go through the sample notebooks (we highly recommend them!).
Press Ctrl + ? to view the keyboard shortcuts.

Key Points about Pluto

1. Your Code is Reactive.

When you change a variable, that change gets propagated through all cells which reference that variable.
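As a rough sketch of what that means, picture each assignment below living in its own Pluto cell (the variable names are made up for illustration). Editing the first cell makes Pluto re-run the cells that depend on it, without you re-executing anything by hand.

```julia
# Cell 1: change this value and Pluto re-runs every cell that uses `radius`
radius = 2.0

# Cell 2: depends on `radius`, so it updates automatically
area = pi * radius^2

# Cell 3: depends on `area`, so it updates after Cell 2
area_label = "Area: $(round(area, digits = 2))"
```

Run as a plain script, the same lines simply execute once from top to bottom. The automatic re-running only happens inside Pluto.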

2. Returned Values will Render as HTML.

That means things like the markdown string macro (md) will look nice. Note that output gets displayed above the code.
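Here is a small sketch of the idea that also runs outside Pluto. The variable names are ours, and asking for the text/html representation with repr is only an approximation of what a notebook does when it displays a value.

```julia
using Markdown  # Pluto cells can use md"..." without this line

note = md"Pluto renders **bold** text and `inline code` nicely."

# Ask for the HTML representation, which is roughly what a notebook displays
note_html = repr(MIME("text/html"), note)
```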

3. Code can be Hidden.

Click the eye icon on the top left of a cell to hide the code. It only appears if your cursor is hovering over the cell.

4. You can `@bind` HTML Inputs to Julia Variables.

Here we are using Pluto.@bind along with the html string macro to create a simple text input and bind it to a Julia variable my_input. The @bind macro works with any HTML input type.
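The original screenshot isn't reproduced here, so the following is a minimal sketch of what such cells could look like. It only works inside a Pluto notebook, where @bind is available, and the second cell is our own addition to show the binding in use.

```julia
# Run this inside a Pluto notebook; `@bind` is provided by Pluto itself.

# Cell 1: a text box whose current contents are stored in `my_input`
@bind my_input html"<input type='text'>"

# Cell 2: re-runs whenever the text box changes
"You typed: $(my_input)"
```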

5. You can Avoid Writing HTML by using PlutoUI.

First:

] add PlutoUI

Then load it with using PlutoUI in a notebook cell and pass its widgets to @bind.

To see all of the UI options in PlutoUI, open the PlutoUI.jl sample notebook.
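As a hedged sketch (Pluto only, with PlutoUI installed), a few cells using PlutoUI widgets might look like this. Slider and TextField are PlutoUI widgets; the variable names n and name are just for illustration.

```julia
# Run this inside a Pluto notebook with PlutoUI installed.

# Cell 1
using PlutoUI

# Cell 2: a slider bound to the variable `n`
@bind n Slider(1:10)

# Cell 3: a text field bound to the variable `name`
@bind name TextField()

# Cell 4: reacts to both widgets
"$(name) picked $(n)"
```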

Notes, Tips, and Tricks

Multiple Expressions

Pluto will try to get you to split multiple expressions into multiple cells (You can also put multiple expressions in a begin–end block). This helps Pluto manage the dependencies between cells and avoids unnecessary re-running of code that "reacts" to something it doesn't need to.
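For example, a single cell holding two related expressions could be written like this (the data is made up for illustration):

```julia
# One Pluto cell can hold several expressions when they are wrapped in begin ... end
begin
    prices = [326, 334, 2757]
    average_price = sum(prices) / length(prices)
end
```

The trade-off is that the block runs as one unit, so a change to prices also recomputes everything else inside it.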

Custom Display Methods

If you want something to make use of Pluto's rich HTML display, you need to define your own Base.show method for the text/html MIME type.
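A minimal sketch of such a method, using a hypothetical Badge type of our own (it does not escape HTML in the text, which real code probably should):

```julia
# A hypothetical type used only for this illustration
struct Badge
    text::String
end

# Tell Julia how to write a Badge as HTML; Pluto picks this method up for display
function Base.show(io::IO, ::MIME"text/html", badge::Badge)
    print(io, "<span style=\"color: green;\">", badge.text, "</span>")
end

# Outside a notebook, `repr` lets us inspect the HTML that would be rendered
badge_html = repr(MIME("text/html"), Badge("passed"))
```

In a Pluto cell, simply returning Badge("passed") should then display the styled text rather than the default struct printout.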

Interpolating UI Elements into Markdown

You can use Julia's string interpolation syntax to interpolate values into a markdown block that will then get rendered as HTML. This includes html strings and PlutoUI elements! You can even define the UI element somewhere else to keep your markdown block looking cleaner.

x_ui = @bind x Slider(1:10)

md"My UI Element: $x_ui"

# Provides the same result: 

md"My UI Element: $(@bind x Slider(1:10))"

Final Thoughts

On a personal note, I've found Pluto particularly useful for making:

Lightweight user interfaces for customers without strong Julia skills. I simply teach the customer to run Pluto.run() and then I don't need to deal with the overhead of developing a full web app. The downside is that Pluto notebooks can't (yet) be deployed as a web app.
Interactive presentations. Pluto works great for demonstrating code and more. A huge benefit is that thanks to reactivity, you'll never get in an awkward state with cells run out of order!
Data Visualization. Data visualization is often an iterative process that takes many incremental changes to get the plot you want. The reactivity of Pluto provides instant feedback and greatly speeds up this process.

That's It!

You now know how to do some really cool stuff with Pluto. What will you build with it?

Enjoying Julia For Data Science? Please share us with a friend and follow us on Twitter at @JuliaForDataSci.

Additional Resources

https://github.com/fonsp/Pluto.jl
PlutoCon 2021
Introduction to Computational Thinking – MIT
Pluto's example notebooks!

First Steps #4: Digging Into DataFrames

By: Josh Day

Re-posted from: https://www.juliafordatascience.com/first-steps-4-dataframes/

DataFrames.jl provides the most widely used tabular data structure in Julia. In this post we'll explore DataFrames using sample data from RDatasets.jl (and we'll plot stuff using StatsPlots).

A rather timely event: DataFrames.jl has reached version 1.0!

Setup

First, install DataFrames and RDatasets via Pkg Mode (]) in the REPL:

(@v1.6) pkg> add DataFrames RDatasets

Now load both packages along with the diamonds dataset from R's ggplot2 package. The diamonds data contains price/size/quality information on 53,940 different diamonds.

julia> using DataFrames, RDatasets

julia> df = dataset("ggplot2", "diamonds")
53940×10 DataFrame
   Row │ Carat    Cut        Color  Clarity  Depth    Tabl ⋯
       │ Float64  Cat…       Cat…   Cat…     Float64  Floa ⋯
───────┼────────────────────────────────────────────────────
     1 │    0.23  Ideal      E      SI2         61.5     5 ⋯
     2 │    0.21  Premium    E      SI1         59.8     6
     3 │    0.23  Good       E      VS1         56.9     6
     4 │    0.29  Premium    I      VS2         62.4     5
     5 │    0.31  Good       J      SI2         63.3     5 ⋯
     6 │    0.24  Very Good  J      VVS2        62.8     5
     7 │    0.24  Very Good  I      VVS1        62.3     5
     8 │    0.26  Very Good  H      SI1         61.9     5
   ⋮   │    ⋮         ⋮        ⋮       ⋮        ⋮        ⋮ ⋱

DataFrames Quickstart

Variables (columns) of a DataFrame can be referenced either by strings or symbols, e.g. "I am a string" and :I_am_a_symbol.

Make a Copy of a Column

df[:, "Carat"]

df[:, :Carat]

Extract a Column

These commands retrieve the exact data held in the DataFrame. Warning! Making a change to the extracted data will change the values in the DataFrame.

df.Carat

df[!, "Carat"]
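To make the copy-versus-extract difference concrete, here is a small sketch on a made-up three-row DataFrame (toy and the other variable names are ours, not part of the diamonds example):

```julia
using DataFrames

toy = DataFrame(Carat = [0.23, 0.21, 0.29])

# `:` makes a copy, so editing it leaves the DataFrame alone
carat_copy = toy[:, :Carat]
carat_copy[1] = 9.99
after_copy_edit = toy.Carat[1]     # the DataFrame still holds the original value

# `!` (like `toy.Carat`) returns the stored vector itself
carat_stored = toy[!, :Carat]
carat_stored[1] = 9.99
after_stored_edit = toy.Carat[1]   # the DataFrame has changed
```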

Selecting a Subset of Columns

select(df, "Carat")

select(df, ["Carat", "Cut"])

Filtering a Subset of Rows

The syntax x -> do something with x defines an anonymous function (sometimes called a lambda expression). The filter function applies it to each row and returns a DataFrame containing the rows for which it returned true.

filter(row -> row.Carat > 1, df)

We can also use indexing (with broadcasting) rather than filter:

df[df.Carat .> 1, :]

For functions that accept a function as their first argument, Julia's do-block syntax can help you clean up your code. Here we are using &&, the logical "and" operator, to combine multiple filter conditions.

filter(x -> x.Carat > 1 && x.Cut == "Premium" && x.Color == "J" && 5000 <= x.Price <= 6000, df)

# Same as above, but with do-block
filter(df) do x 
    x.Carat > 1 && 
        x.Cut == "Premium" && 
        x.Color == "J" &&
        5000 <= x.Price <= 6000
end
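If you'd like to convince yourself that the three styles agree, here is a quick check on a made-up DataFrame (the data and variable names are ours). Note that the indexing version needs the broadcasted .& and parentheses instead of &&.

```julia
using DataFrames

toy = DataFrame(Carat = [0.9, 1.2, 1.5], Cut = ["Ideal", "Premium", "Premium"])

# Anonymous function
via_lambda = filter(row -> row.Carat > 1 && row.Cut == "Premium", toy)

# do-block: the block becomes the first argument of `filter`
via_do = filter(toy) do row
    row.Carat > 1 && row.Cut == "Premium"
end

# Boolean indexing with broadcasting
via_index = toy[(toy.Carat .> 1) .& (toy.Cut .== "Premium"), :]
```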

You can now do several essential DataFrame tasks:

Get a single column
Choose a subset of columns
Choose a subset of rows

Next we'll use groupby and combine to apply functions across groups of data.

How does Price relate to Cut?

We are big on learning by example, so let's start by answering this relatively simple question. First things first: What do the Price and Cut variables look like?

julia> df.Price
53940-element Vector{Int32}:
  326
  326
    ⋮
 2757
 2757

julia> df.Cut
53940-element CategoricalArrays.CategoricalArray{String,1,UInt8}:
 "Ideal"
 "Premium"
 ⋮
 "Premium"
 "Ideal"

Price: The cost in US Dollars.
Cut: The rating of cut quality. In order (best-to-worst): "Ideal", "Premium", "Very Good", "Good", and "Fair". Side note: The data is stored in a CategoricalArray, which uses less memory than storing each element as a separate String.

Using `groupby`

We can use the groupby function to group our data by the "Cut" variable.

gdf = groupby(df, :Cut)

Using our grouped DataFrame, we can then apply a function to a variable in each group using combine. Let's get the average Price for each level of Cut:

julia> using Statistics # for `mean`

julia> combine(gdf, :Price => mean)
5×2 DataFrame
 Row │ Cut        Price_mean
     │ Cat…       Float64
─────┼───────────────────────
   1 │ Fair          4358.76
   2 │ Good          3928.86
   3 │ Very Good     3981.76
   4 │ Premium       4584.26
   5 │ Ideal         3457.54

Now we know what the distribution center is for each Cut, but what about the spread and shape?
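Before turning to plots, note that combine accepts several column => function pairs at once, so a numeric measure of spread can sit next to the mean. A small sketch on made-up data (toy and price_summary are our own names):

```julia
using DataFrames, Statistics

toy = DataFrame(Cut = ["Fair", "Fair", "Ideal", "Ideal"],
                Price = [100, 300, 200, 400])

# One output column per `column => function` pair
price_summary = combine(groupby(toy, :Cut), :Price => mean, :Price => std)
```

The result has one row per Cut with the columns Price_mean and Price_std. A single number still says little about shape, which is where plots help.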

Using StatsPlots

The StatsPlots package adds functionality and plot recipes to Plots.jl. We'll use it to do the grouping for us so that we don't need groupby. First, add StatsPlots:

(@v1.6) pkg> add StatsPlots

Next, use the @df <dataframe> <plot command> syntax to create a violin plot overlaid with a box plot for each level of Cut.

julia> @df df violin(string.(:Cut), :Price, lab="")

julia> @df df boxplot!(string.(:Cut), :Price, alpha=.4, lab="")

[Figure: Price vs. Cut]

Things to note in the code/plot above:

The @df macro will replace Symbols with the associated DataFrame columns.
We must use string.(:Cut) because Plots/StatsPlots doesn't know how to work with CategoricalArrays directly.
We use boxplot! (instead of boxplot) to add a new series to the existing plot.
We set lab (shorthand for label) to "" to avoid adding an entry to the plot legend. If all legend entries are blank, the legend will not appear.
We use alpha=.4 to set the opacity of the boxplot so that it doesn't cover up the violin in the layer beneath it.

From our plot, we can see the distributions are all similarly skewed with a long right tail. Some Cuts (Good, Premium, and Very Good) are bimodal (they have two "peaks"). However, we are ignoring some important factors (such as how Carat and Color affect the price!), so we shouldn't draw any conclusions based solely on this plot.

That's It!

You now know how to do a little bit of data wrangling with DataFrames. What do you want to learn about next?

First Steps #3: A Primer on Plots

By: Josh Day

Re-posted from: https://www.juliafordatascience.com/first-steps-3-primer-on-plots/

Visualizing data is an essential skill for a data scientist. Unlike R, Julia does not ship with plotting functionality built-in. If you search for ways to make plots in Julia, you'll discover a lot of options. So what should you use?

Plots.jl

We recommend the Plots package (especially for beginners).

Plots is a unified interface for creating visualizations with different backends (such as GR, Plotly.js, and UnicodePlots). It's great for both beginners and power users, and it's designed such that a lot of things you try will "just work".

Install Plots

In the Julia REPL, add the Plots package if you haven't already done so. Recall that you enter Pkg Mode by pressing ]:

(@v1.6) pkg> add Plots

Create Your First Plot

Back in Julia mode (by pressing backspace, labeled delete on Mac keyboards), enter:

julia> using Plots

julia> plot(randn(10), title="My First Plot")

Congrats! You made your first plot! You created it using:

randn(10): A Vector of 10 random samples from a Normal(0,1) distribution.
The GR backend (Plots' default).

Core Principles

The main function you'll use, as you may have guessed, is

plot(args...; kw...)

Here args... means any number of positional arguments and kw... is any number of keyword arguments. Look back at the first plot we created and notice that we provided data randn(10) as a positional argument and the title title="My First Plot" as a keyword argument. Another function you'll use is

plot!(args...; kw...)

In Julia, ! is used as a convention to identify functions that mutate at least one of the arguments. With Plots, this lets you make changes or additions to a plot.
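The same convention shows up throughout Base Julia. As a quick illustration that doesn't need Plots at all (the variable names are ours), compare sort with sort!:

```julia
nums = [3, 1, 2]

sorted_copy = sort(nums)       # returns a new, sorted vector
nums_after_sort = copy(nums)   # snapshot showing `nums` was left untouched

sort!(nums)                    # sorts `nums` itself, in place
```

plot and plot! relate in a similar way: plot creates a new plot, while plot! modifies an existing one.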

Now that we know the functions we are using, let's look at the core principles:

Principle #1: Every Thing You Plot is a Series

When you give data to the plot function (like randn(10) above), the seriestype determines how Plots will interpret the data. By default this is :path.

plot(1:10, seriestype = :path, label = "Series 1")

plot!(rand(1:10,10), seriestype = :scatter, label = "Series 2")

Principle #2: Plot Attributes have Aliases

Plot attributes are passed by keyword arguments. Because of aliases, you can often guess at the name of an attribute and Plots will interpret it correctly. For example, the following commands are equivalent:

plot(randn(10), seriestype = :scatter)

plot(randn(10), st = :scatter)

scatter(randn(10))

Principle #3: Columns are Mapped to Series

For both data and attributes, the columns of matrices will be mapped to individual series. In this example, we create two series by providing a 10 x 2 matrix. Now look at the difference between p1 and p2. If the st (seriestype) attribute is a vector, the provided attributes will loop through the available series. If the st attribute is a matrix, the attributes in the i-th column will be mapped to the i-th series. This provides a very succinct way of providing attributes to series.

x = randn(10, 2)

# Series 1 --> :scatter & :line
# Series 2 --> :scatter & :line
p1 = plot(x, st=[:scatter, :line])  

# Series 1 --> :scatter
# Series 2 --> :line
p2 = plot(x, st=[:scatter :line]) 

plot(p1, p2)
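The difference between p1 and p2 comes down to Julia's array literal syntax, which you can check without Plots (the variable names below are ours):

```julia
as_vector = [:scatter, :line]   # commas: a 2-element Vector
as_row    = [:scatter :line]    # spaces: a 1×2 Matrix, one column per series

size(as_vector)  # (2,)
size(as_row)     # (1, 2)
```

Only the second form has columns, which is why Plots can map its entries to series one by one.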

Principle #4: Some Attributes are Magic

Some attributes can be provided with multiple values all at once and Plots will figure out what to do with them. For example, using m=(10, .5, "blue") will set the marker size to 10, the marker alpha (opacity) to 0.5, and the marker color to "blue".

plot(randn(10), m = (10, .5, "blue"))

Principle #5: Many Types have Plot Recipes

This is best seen through example. Let's add the RDatasets and OnlineStats packages via Pkg Mode in the REPL:

(@v1.6) pkg> add OnlineStats RDatasets

Now load the packages and retrieve the diamonds dataset that comes packaged with R's ggplot2. The diamonds data is a collection of variables on diamond price and quality.

using RDatasets, OnlineStats

df = dataset("ggplot2", "diamonds")

Suppose the first thing we want to see is the distribution of the :Cut variable in our diamonds data. We'll use OnlineStats.CountMap to count the number of occurrences for each unique value in the :Cut column.

When we plot the CountMap, a recipe is invoked to turn it into data that Plots knows how to display. What recipes provide, as opposed to, say, a dedicated plot_countmap function, is the ability to hook into plot attributes just as if you were plotting raw numbers.

o = CountMap(String)

fit!(o, string.(df.Cut))

plot(o, title="Neat!")
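If you're wondering what is being counted, the following plain-Julia sketch tallies occurrences with a Dict. It is only an illustration of the idea on made-up values, not how OnlineStats implements CountMap.

```julia
cuts = ["Ideal", "Premium", "Good", "Premium", "Ideal", "Ideal"]

# Map each unique value to the number of times it has been seen so far
tally = Dict{String,Int}()
for cut in cuts
    tally[cut] = get(tally, cut, 0) + 1
end
```

The recipe's job is then to turn a structure like this into bars that Plots can draw.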

Try This!

Use a Different Backend

The backends of Plots can be changed interactively. Try typing

plotly()

to switch to the interactive JavaScript library Plotly.js. Then rerun the above examples.

That's It!

Now you know Plots' core principles. Time to try a few things on your own!

Resources

Plots Documentation
