Tag Archives: julialang

Julia for Data Analysis Strikes Back

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/05/10/jda.html

Introduction

This week I got a nice little surprise in my office. A year after my Julia for Data Analysis
book has been published I got a package with a set of printed versions of its Korean translation
데이터 분석을 위한 줄리아. It was really a nice experience and I hope that Julia users from Korea will like it.

데이터 분석을 위한 줄리아

Therefore, for today, I decided to discuss a functionality that is little known, but often quite useful.
It is related to adding conditional columns to a data frame.

The post was written under Julia 1.10.1, DataFrames.jl 1.6.1, and DataFramesMeta.jl 0.15.2.

The problem

Assume you have the following data frame:

julia> using DataFrames

julia> df = DataFrame(x=-2.0:0.5:2.0)
9×1 DataFrame
 Row │ x
     │ Float64
─────┼─────────
   1 │    -2.0
   2 │    -1.5
   3 │    -1.0
   4 │    -0.5
   5 │     0.0
   6 │     0.5
   7 │     1.0
   8 │     1.5
   9 │     2.0

Now we want to add a second column to this data frame that contains a square root of column "x".

A basic approach fails:

julia> df.sqrtx = sqrt.(df.x)
ERROR: DomainError with -2.0:
sqrt was called with a negative real argument but will only return a complex result if called with a complex argument. Try sqrt(Complex(x)).

The reason is that we cannot normally take a square root of a negative number.

We can perform a conditional processing for example like this:

julia> df.sqrtx = (x -> x < 0.0 ? missing : sqrt(x)).(df.x)
9-element Vector{Union{Missing, Float64}}:
  missing
  missing
  missing
  missing
 0.0
 0.7071067811865476
 1.0
 1.224744871391589
 1.4142135623730951

julia> df
9×2 DataFrame
 Row │ x        sqrtx
     │ Float64  Float64?
─────┼─────────────────────────
   1 │    -2.0  missing
   2 │    -1.5  missing
   3 │    -1.0  missing
   4 │    -0.5  missing
   5 │     0.0        0.0
   6 │     0.5        0.707107
   7 │     1.0        1.0
   8 │     1.5        1.22474
   9 │     2.0        1.41421

but I do not find this approach very readable (especially from the perspective of a beginner).

The alternative that I prefer is to work with a view of the source data frame. Let us first create such a view that contains all columns of the original data frame, but only rows in which column "x" is non-negative:

julia> dfv = filter(:x => >=(0.0), df, view=true)
5×2 SubDataFrame
 Row │ x        sqrtx
     │ Float64  Float64?
─────┼───────────────────
   1 │     0.0  0.0
   2 │     0.5  0.707107
   3 │     1.0  1.0
   4 │     1.5  1.22474
   5 │     2.0  1.41421

Now, we can add a column to such a view by using a plain sqrt function without any decorations:

julia> dfv.sqrtx2 = sqrt.(dfv.x)
5-element Vector{Float64}:
 0.0
 0.7071067811865476
 1.0
 1.224744871391589
 1.4142135623730951

julia> dfv
5×3 SubDataFrame
 Row │ x        sqrtx     sqrtx2
     │ Float64  Float64?  Float64?
─────┼─────────────────────────────
   1 │     0.0  0.0       0.0
   2 │     0.5  0.707107  0.707107
   3 │     1.0  1.0       1.0
   4 │     1.5  1.22474   1.22474
   5 │     2.0  1.41421   1.41421

julia> df
9×3 DataFrame
 Row │ x        sqrtx           sqrtx2
     │ Float64  Float64?        Float64?
─────┼─────────────────────────────────────────
   1 │    -2.0  missing         missing
   2 │    -1.5  missing         missing
   3 │    -1.0  missing         missing
   4 │    -0.5  missing         missing
   5 │     0.0        0.0             0.0
   6 │     0.5        0.707107        0.707107
   7 │     1.0        1.0             1.0
   8 │     1.5        1.22474         1.22474
   9 │     2.0        1.41421         1.41421

Note that both dfv and df are updated as expected. The filtered-out rows get missing values.

It is important to highlight that this functionality works if the view (SubDataFrame) was created using all columns of the source data frame (like is done in the case of our filter call above).
The reason for this restriction is that if view contained some subset of columns the operation of adding a column would be unsafe (there would be a risk of accidental and unwanted overwrite of a column present in the source data frame that was not included in the view).

This functionality is especially nice in combination with DataFramesMeta.jl, just have a look:

julia> @chain df begin
           @rsubset(:x >= 0; view=true)
           @rtransform!(:sqrtx3 = sqrt(:x))
           parent
       end
9×4 DataFrame
 Row │ x        sqrtx           sqrtx2          sqrtx3
     │ Float64  Float64?        Float64?        Float64?
─────┼─────────────────────────────────────────────────────────
   1 │    -2.0  missing         missing         missing
   2 │    -1.5  missing         missing         missing
   3 │    -1.0  missing         missing         missing
   4 │    -0.5  missing         missing         missing
   5 │     0.0        0.0             0.0             0.0
   6 │     0.5        0.707107        0.707107        0.707107
   7 │     1.0        1.0             1.0             1.0
   8 │     1.5        1.22474         1.22474         1.22474
   9 │     2.0        1.41421         1.41421         1.41421

In the code above I used parent in the last step to recover the source df.

As a final comment note that an alternative in DataFramesMeta.jl is to just use a plain @rtransform! macro:

julia> @rtransform!(df, :sqrtx4 = :x < 0 ? missing : sqrt(:x))
9×5 DataFrame
 Row │ x        sqrtx           sqrtx2          sqrtx3          sqrtx4
     │ Float64  Float64?        Float64?        Float64?        Float64?
─────┼─────────────────────────────────────────────────────────────────────────
   1 │    -2.0  missing         missing         missing         missing
   2 │    -1.5  missing         missing         missing         missing
   3 │    -1.0  missing         missing         missing         missing
   4 │    -0.5  missing         missing         missing         missing
   5 │     0.0        0.0             0.0             0.0             0.0
   6 │     0.5        0.707107        0.707107        0.707107        0.707107
   7 │     1.0        1.0             1.0             1.0             1.0
   8 │     1.5        1.22474         1.22474         1.22474         1.22474
   9 │     2.0        1.41421         1.41421         1.41421         1.41421

In this case it also quite clean.

Conclusions

I am really happy that we have a Korean version of Julia for Data Analysis.

I hope that the example transformations I have shown today were useful and improved your knowledge of DataFrames.jl and DataFramesMeta.jl packages.

Breaking a passcode with Julia

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/05/03/pe79.html

Introduction

This week it is a holiday period in Poland so I decided to solve a puzzle.
I liked the code as it can be used to show some basic features of the Julia language.

The examples were written under Julia 1.10.1, HTTP.jl 1.10.6, and Graphs.jl 1.10.0.

The problem

I decided to use my favorite Project Euler puzzle set. This time I chose Problem 79.

Here is its statement (taken from the Project Euler website):

A common security method used for online banking is to ask the user for three random characters from a passcode. For example, if the passcode was 531278, they may ask for the 2nd, 3rd, and 5th characters; the expected reply would be: 317.
The text file, keylog.txt, contains fifty successful login attempts.
Given that the three characters are always asked for in order, analyse the file so as to determine the shortest possible secret passcode of unknown length.

The keylog.txt file can be found under this link: https://projecteuler.net/resources/documents/0079_keylog.txt.

Let us try solving the puzzle.

The solution

First we use the HTTP.jl package to get the data and pre-process it.
Start by storing the file as a string:

julia> using HTTP

julia> url = "https://projecteuler.net/resources/documents/0079_keylog.txt"
"https://projecteuler.net/resources/documents/0079_keylog.txt"

julia> str = String(HTTP.get(url).body)
"319\n680\n180\n690\n129\n620\n762\n689\n762\n318\n368\n710\n720\n710\n629\n168\n160\n689\n716\n731\n736\n729\n316\n729\n729\n710\n769\n290\n719\n680\n318\n389\n162\n289\n162\n718\n729\n319\n790\n680\n890\n362\n319\n760\n316\n729\n380\n319\n728\n716\n"

Now we want to process this string into a vector of vectors containing the digits verified by the user.
First we split the string by newlines using the split function. Next We process each line by transforming it into a vector of numbers. We use two features of Julia here. The first is the collect function, which when passed a string returns a vector of characters. The second is broadcasting. By broadcasted substraction of '0' from a vector of characters we get a vector of integers. Here is the code:

julia> v = [collect(x) .- '0' for x in split(str)]
50-element Vector{Vector{Int64}}:
 [3, 1, 9]
 [6, 8, 0]
 [1, 8, 0]
 ⋮
 [3, 1, 9]
 [7, 2, 8]
 [7, 1, 6]

Now we are ready to analyze the data. We will use a directed graph to represent it.
The directed graph will have 10 nodes. Each representing a digit. Because Julia uses
1-based indexing, node number of digit x will be x+1.
Here is the code creating the directed graph:

julia> using Graphs

julia> gr = DiGraph(10, 0)
{10, 0} directed simple Int64 graph

julia> for x in v
           add_edge!(gr, x[1] + 1, x[2] + 1)
           add_edge!(gr, x[2] + 1, x[3] + 1)
       end

julia> gr
{10, 23} directed simple Int64 graph

Note that we have 23 relationships constraining the sequence of the numbers in the unknown password.
Let us check, for each number the number of times it is the preceeding or a following in our graph:

julia> [outdegree(gr) indegree(gr)]
10×2 Matrix{Int64}:
 0  5
 5  2
 3  3
 3  1
 0  0
 0  0
 4  3
 5  0
 2  4
 1  5

From this summary we see that the first node (representing digit 0) is never a source, so it can be a last digit in a pass code. Similarly eighth node (representing 7) is never a destination, so it can be a first digit. Finally, digits 4 and 5 are never neither a source or a destination, so they can be dropped.

How can we programattically find the list of nodes that can be dropped? We can simply find all nodes whose total degree is 0:

julia> to_drop = findall(==(0), degree(gr)) .- 1
2-element Vector{Int64}:
 4
 5

Now we are ready for a final move. Let us assume that our directed graph does not have cycles (this is a simple case, as then we can assume that each number is present exactly once in the code). In this case we can use the topological sorting to find the shortest sequence of numbers consistent with the observed data. In our case to get the topological sorting of nodes in the graph we can write:

julia> ts = topological_sort(gr)
10-element Vector{Int64}:
  8
  6
  5
  4
  2
  7
  3
  9
 10
  1

We did not get an error, which means that our directed graph did not have any cycles, so we are done.

What is left to get a solution is to correct the node-numbering (as we start numbering with 1 and the smallest digit is 0) and remove the numbers that are never used. As usual, I leave the final solution un-evaluated, to encourage you to run the code yourself:

setdiff(ts .- 1, to_drop)

Conclusions

I hope you enjoyed the puzzle and the solution!

Annotating columns of a data frame with DataFramesMeta.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/04/26/labels.html

Introduction

Today I want to discuss a functionality that was recently added to DataFramesMeta.jl.
These utility macros and functions make it easy to add custom labels and notes to columns
of a data frame. This functionality is especially useful when working with wide data frames,
as is often the case when e.g. analyzing economic data.

This post is written under Julia 1.10.1, DataFrames.jl 1.6.1, and DataFramesMeta.jl 0.15.2.

Column labels

A column label is a short description of the contents of a column.
When using DataFramesMeta.jl you can use the following basic commands to work with them:

  • @label! attaches a label to a column;
  • label allows you to retrieve column label;
  • printlabels presents you labels of all annotated columns in a data frame.

Here is a simple example:

julia> using DataFramesMeta

julia> df = DataFrame(year=[2000, 2001], rev=[12, 17])
2×2 DataFrame
 Row │ year   rev
     │ Int64  Int64
─────┼──────────────
   1 │  2000     12
   2 │  2001     17

julia> @label!(df, :rev = "Revenue (USD)")
2×2 DataFrame
 Row │ year   rev
     │ Int64  Int64
─────┼──────────────
   1 │  2000     12
   2 │  2001     17

julia> label(df, :rev)
"Revenue (USD)"

julia> printlabels(df)
┌────────┬───────────────┐
│ Column │         Label │
├────────┼───────────────┤
│   year │          year │
│    rev │ Revenue (USD) │
└────────┴───────────────┘

Note that if some column did not get an explicit label (like :year in our example)
by default its name is its label.

Column notes

Column notes are meant to give more detailed information about a column in a data frame.
You can use the following basic commands to work with them:

  • @note! attaches a note to a column;
  • note allows you to retrieve column note;
  • printnotes presents you notes of all columns in a data frame.
julia> @note!(df, :rev = "Total revenue of a company in in a calendar year in nominal USD")
2×2 DataFrame
 Row │ year   rev
     │ Int64  Int64
─────┼──────────────
   1 │  2000     12
   2 │  2001     17

julia> note(df, :rev)
"Total revenue of a company in in a calendar year in nominal USD"

julia> printnotes(df)
Column: rev
───────────
Total revenue of a company in in a calendar year in nominal USD

julia> @note!(df, :year = "Calendar year")
2×2 DataFrame
 Row │ year   rev
     │ Int64  Int64
─────┼──────────────
   1 │  2000     12
   2 │  2001     17

julia> printnotes(df)
Column: year
────────────
Calendar year

Column: rev
───────────
Total revenue of a company in in a calendar year in nominal USD

Observe that printnotes only prints notes that were actually added to
a column (as opposed to printlabels which prints labels of all columns,
using the default fallback to column name).

Conclusions

Today I covered the basic functions allowing to work with column
metadata of data frames. If you are interested in learning more
advanced functionalities please refer to DataFrames.jl
and TableMetadataTools.jl documentations.

I hope that you will find the metadata functionality provided by
DataFramesMeta.jl useful in your work.