Author Archives: Blog by Bogumił Kamiński

Breaking a passcode with Julia

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/05/03/pe79.html

Introduction

This week it is a holiday period in Poland so I decided to solve a puzzle.
I liked the code as it can be used to show some basic features of the Julia language.

The examples were written under Julia 1.10.1, HTTP.jl 1.10.6, and Graphs.jl 1.10.0.

The problem

I decided to use my favorite Project Euler puzzle set. This time I chose Problem 79.

Here is its statement (taken from the Project Euler website):

A common security method used for online banking is to ask the user for three random characters from a passcode. For example, if the passcode was 531278, they may ask for the 2nd, 3rd, and 5th characters; the expected reply would be: 317.
The text file, keylog.txt, contains fifty successful login attempts.
Given that the three characters are always asked for in order, analyse the file so as to determine the shortest possible secret passcode of unknown length.

The keylog.txt file can be found under this link: https://projecteuler.net/resources/documents/0079_keylog.txt.

Let us try solving the puzzle.

The solution

First we use the HTTP.jl package to get the data and pre-process it.
Start by storing the file as a string:

julia> using HTTP

julia> url = "https://projecteuler.net/resources/documents/0079_keylog.txt"
"https://projecteuler.net/resources/documents/0079_keylog.txt"

julia> str = String(HTTP.get(url).body)
"319\n680\n180\n690\n129\n620\n762\n689\n762\n318\n368\n710\n720\n710\n629\n168\n160\n689\n716\n731\n736\n729\n316\n729\n729\n710\n769\n290\n719\n680\n318\n389\n162\n289\n162\n718\n729\n319\n790\n680\n890\n362\n319\n760\n316\n729\n380\n319\n728\n716\n"

Now we want to process this string into a vector of vectors containing the digits verified by the user.
First we split the string on whitespace (here, newlines) using the split function. Next we process each line by transforming it into a vector of numbers. We use two features of Julia here. The first is the collect function, which, when passed a string, returns a vector of characters. The second is broadcasting. By broadcasted subtraction of '0' from a vector of characters we get a vector of integers. Here is the code:

julia> v = [collect(x) .- '0' for x in split(str)]
50-element Vector{Vector{Int64}}:
 [3, 1, 9]
 [6, 8, 0]
 [1, 8, 0]
 ⋮
 [3, 1, 9]
 [7, 2, 8]
 [7, 1, 6]
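
To see the two features in isolation, here is a minimal sketch on a single hard-coded line (the variable names line, chars, and digits_vec are mine, chosen only for illustration):

```julia
line = "319"

# collect turns a string into a vector of its characters
chars = collect(line)

# subtracting the character '0' from each character gives the digit as an integer
digits_vec = chars .- '0'
```

Here chars is ['3', '1', '9'] and digits_vec is [3, 1, 9].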

Now we are ready to analyze the data. We will use a directed graph to represent it.
The directed graph will have 10 nodes, each representing a digit. Because Julia uses
1-based indexing, the node number for digit x will be x+1.
Here is the code creating the directed graph:

julia> using Graphs

julia> gr = DiGraph(10, 0)
{10, 0} directed simple Int64 graph

julia> for x in v
           add_edge!(gr, x[1] + 1, x[2] + 1)
           add_edge!(gr, x[2] + 1, x[3] + 1)
       end

julia> gr
{10, 23} directed simple Int64 graph
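
The loop calls add_edge! 100 times, yet the graph ends up with only 23 edges, because a simple graph keeps each ordered pair once. A rough sketch of the same idea in plain Julia, using just the first three logins from the file and a Set of tuples in place of the graph (an illustration only, not how Graphs.jl stores edges; logins and edge_set are names I made up):

```julia
# first three login attempts from keylog.txt
logins = [[3, 1, 9], [6, 8, 0], [1, 8, 0]]

# each login contributes two "comes before" pairs; a Set drops repeats
edge_set = Set{Tuple{Int,Int}}()
for x in logins
    push!(edge_set, (x[1], x[2]))
    push!(edge_set, (x[2], x[3]))
end
```

Six pairs are pushed, but (8, 0) appears twice, so edge_set holds five distinct pairs.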

Note that we have 23 relationships constraining the order of the digits in the unknown passcode.
Let us check, for each digit, how many digits it directly precedes (outdegree) and how many it directly follows (indegree) in our graph:

julia> [outdegree(gr) indegree(gr)]
10×2 Matrix{Int64}:
 0  5
 5  2
 3  3
 3  1
 0  0
 0  0
 4  3
 5  0
 2  4
 1  5

From this summary we see that the first node (representing digit 0) is never a source, so it can be the last digit in the passcode. Similarly, the eighth node (representing 7) is never a destination, so it can be the first digit. Finally, digits 4 and 5 are neither a source nor a destination, so they can be dropped.

How can we programmatically find the list of nodes that can be dropped? We can simply find all nodes whose total degree is 0:

julia> to_drop = findall(==(0), degree(gr)) .- 1
2-element Vector{Int64}:
 4
 5
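
The same reasoning can be replayed without the graph object by copying the two columns of the degree matrix shown earlier into plain vectors. This is only a sketch; the names outdeg, indeg, unused, and first_candidates are mine:

```julia
# columns copied from the [outdegree(gr) indegree(gr)] output; index i is digit i - 1
outdeg = [0, 5, 3, 3, 0, 0, 4, 5, 2, 1]
indeg  = [5, 2, 3, 1, 0, 0, 3, 0, 4, 5]

# every edge has one source and one destination, so both sums equal the edge count
@assert sum(outdeg) == sum(indeg) == 23

# digits that never appear in any login
unused = findall(==(0), outdeg .+ indeg) .- 1

# used digits that are never a destination: candidates for the first position
first_candidates = setdiff(findall(==(0), indeg) .- 1, unused)
```

With these numbers unused is [4, 5] and first_candidates is [7], matching what we read off the matrix by eye.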

Now we are ready for the final move. Let us assume that our directed graph does not have cycles (this is the simple case, as then we can assume that each digit is present exactly once in the code). In this case we can use topological sorting to find the shortest sequence of digits consistent with the observed data. To get a topological ordering of the nodes in the graph we can write:

julia> ts = topological_sort(gr)
10-element Vector{Int64}:
  8
  6
  5
  4
  2
  7
  3
  9
 10
  1

We did not get an error, which means that our directed graph did not have any cycles, so we are done.
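
If you are curious what a topological sort does, below is a minimal sketch of one classic approach, Kahn's algorithm, in plain Julia. I am not claiming this is what Graphs.jl does internally; kahn_order and the toy data are made up for illustration. The toy edges come from the single reply 317 used in the problem statement.

```julia
# Kahn's algorithm: repeatedly output a node that has no remaining predecessors
function kahn_order(nodes, edges)
    indeg = Dict(n => 0 for n in nodes)       # number of unprocessed predecessors
    succ = Dict(n => Int[] for n in nodes)    # successors of each node
    for (a, b) in edges
        push!(succ[a], b)
        indeg[b] += 1
    end
    ready = sort([n for n in nodes if indeg[n] == 0])
    order = Int[]
    while !isempty(ready)
        n = popfirst!(ready)
        push!(order, n)
        for m in succ[n]
            indeg[m] -= 1
            indeg[m] == 0 && push!(ready, m)
        end
    end
    # if some node was never released, the graph must contain a cycle
    length(order) == length(nodes) || error("graph has a cycle")
    return order
end

toy_order = kahn_order([1, 3, 7], [(3, 1), (1, 7)])
```

For the toy data the result is [3, 1, 7]. If the edges formed a cycle, the function would throw an error instead of returning an ordering.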

What is left to get a solution is to correct the node-numbering (as we start numbering with 1 and the smallest digit is 0) and remove the numbers that are never used. As usual, I leave the final solution un-evaluated, to encourage you to run the code yourself:

setdiff(ts .- 1, to_drop)
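
Whatever candidate you get, it is easy to check it against the data: every login attempt must be a subsequence of the passcode. Here is a small sketch of such a check. The helper is_subsequence is a name I made up, and the example uses the passcode 531278 from the problem statement rather than the actual answer:

```julia
# true if the digits of attempt appear in code in the same order (gaps allowed)
function is_subsequence(attempt, code)
    pos = 0
    for d in attempt
        # look for d strictly after the previously matched position
        pos = findnext(==(d), code, pos + 1)
        pos === nothing && return false
    end
    return true
end

example_code = [5, 3, 1, 2, 7, 8]
ok = is_subsequence([3, 1, 7], example_code)
bad = is_subsequence([7, 1, 3], example_code)
```

Here ok is true and bad is false. With the real data, a check along the lines of all(x -> is_subsequence(x, candidate), v) should return true, where candidate stands for the vector of digits you computed.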

Conclusions

I hope you enjoyed the puzzle and the solution!

Annotating columns of a data frame with DataFramesMeta.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/04/26/labels.html

Introduction

Today I want to discuss a functionality that was recently added to DataFramesMeta.jl.
These utility macros and functions make it easy to add custom labels and notes to columns
of a data frame. This functionality is especially useful when working with wide data frames,
as is often the case when e.g. analyzing economic data.

This post is written under Julia 1.10.1, DataFrames.jl 1.6.1, and DataFramesMeta.jl 0.15.2.

Column labels

A column label is a short description of the contents of a column.
When using DataFramesMeta.jl you can use the following basic commands to work with them:

  • @label! attaches a label to a column;
  • label retrieves the label of a column;
  • printlabels prints the labels of all columns in a data frame.

Here is a simple example:

julia> using DataFramesMeta

julia> df = DataFrame(year=[2000, 2001], rev=[12, 17])
2×2 DataFrame
 Row │ year   rev
     │ Int64  Int64
─────┼──────────────
   1 │  2000     12
   2 │  2001     17

julia> @label!(df, :rev = "Revenue (USD)")
2×2 DataFrame
 Row │ year   rev
     │ Int64  Int64
─────┼──────────────
   1 │  2000     12
   2 │  2001     17

julia> label(df, :rev)
"Revenue (USD)"

julia> printlabels(df)
┌────────┬───────────────┐
│ Column │         Label │
├────────┼───────────────┤
│   year │          year │
│    rev │ Revenue (USD) │
└────────┴───────────────┘

Note that if a column has no explicit label (like :year in our example),
its name is used as its label by default.
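
Labels are a form of column-level metadata. As a rough sketch, and assuming that the label is stored under the "label" key (this is my assumption; please check the DataFramesMeta.jl and TableMetadataTools.jl documentation), the fallback behaviour can be mimicked with the colmetadata functions from DataFrames.jl. The data frame name sales is made up to avoid clashing with df above:

```julia
using DataFrames

sales = DataFrame(year=[2000, 2001], rev=[12, 17])

# attach metadata to the :rev column under the assumed "label" key
colmetadata!(sales, :rev, "label", "Revenue (USD)", style=:note)

# the fourth argument is a default returned when the key is missing
rev_label = colmetadata(sales, :rev, "label", "rev")
year_label = colmetadata(sales, :year, "label", "year")
```

Here rev_label is "Revenue (USD)", while year_label falls back to "year" because no label was set for that column.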

Column notes

Column notes are meant to give more detailed information about a column in a data frame.
You can use the following basic commands to work with them:

  • @note! attaches a note to a column;
  • note retrieves the note of a column;
  • printnotes prints the notes of all columns that have one.

Here is an example:

julia> @note!(df, :rev = "Total revenue of a company in a calendar year in nominal USD")
2×2 DataFrame
 Row │ year   rev
     │ Int64  Int64
─────┼──────────────
   1 │  2000     12
   2 │  2001     17

julia> note(df, :rev)
"Total revenue of a company in a calendar year in nominal USD"

julia> printnotes(df)
Column: rev
───────────
Total revenue of a company in a calendar year in nominal USD

julia> @note!(df, :year = "Calendar year")
2×2 DataFrame
 Row │ year   rev
     │ Int64  Int64
─────┼──────────────
   1 │  2000     12
   2 │  2001     17

julia> printnotes(df)
Column: year
────────────
Calendar year

Column: rev
───────────
Total revenue of a company in a calendar year in nominal USD

Observe that printnotes only prints notes that were actually added to
a column (as opposed to printlabels which prints labels of all columns,
using the default fallback to column name).

Conclusions

Today I covered the basic functions that allow you to work with column
metadata of data frames. If you are interested in learning about more
advanced functionality, please refer to the DataFrames.jl
and TableMetadataTools.jl documentation.

I hope that you will find the metadata functionality provided by
DataFramesMeta.jl useful in your work.

Onboarding DataFrames.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/04/19/starting.html

Introduction

Working with data frames is one of the basic needs of any data scientist.
In the Julia ecosystem DataFrames.jl is a package providing support
for these operations. It was designed to be efficient and flexible.

Sometimes, however, novice users can be overwhelmed by the syntax due to its flexibility.
Therefore data scientists often find it useful to use the
packages that make it easier to do transformations of data frames.

Interestingly, these packages use metaprogramming, which might sound
scary to novices, while in reality it is the opposite: metaprogramming
is used to make them easier to use.

Today I want to do a quick review of the main
metaprogramming packages that are available in the ecosystem.
I will not go into the details of the functionality and syntax of the packages, but rather just
present them briefly and give my personal (opinionated) view of their status.

This post is written under Julia 1.10.1, DataFrames.jl 1.6.1, Chain.jl 0.5.0, DataFramesMeta.jl 0.15.2,
DataFrameMacros.jl 0.4.1, and TidierData.jl 0.15.1.

A basic example

Let us start with a basic example of DataFrames.jl syntax, which we will later rewrite using metaprogramming:

julia> using Statistics

julia> using DataFrames

julia> df = DataFrame(id=[1, 2, 1, 2], v=1:4)
4×2 DataFrame
 Row │ id     v
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     1      3
   4 │     2      4

julia> transform(groupby(df, :id), :v => (x -> x .- mean(x)) => :v100)
4×3 DataFrame
 Row │ id     v      v100
     │ Int64  Int64  Float64
─────┼───────────────────────
   1 │     1      1     -1.0
   2 │     2      2     -1.0
   3 │     1      3      1.0
   4 │     2      4      1.0

The syntax looks complex and might be scary. Let us see if we can make it simpler.
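
Before simplifying, it may help to see what the call above computes. The following is a plain-loop sketch of group-wise demeaning on the same numbers, written without DataFrames.jl; it is only meant to show the arithmetic, not how transform is implemented (the names id, v, and v100 mirror the columns, rows is mine):

```julia
using Statistics

id = [1, 2, 1, 2]
v = [1, 2, 3, 4]

# result column; Float64 because subtracting a mean gives floating-point values
v100 = similar(v, Float64)

for g in unique(id)
    rows = findall(==(g), id)                 # positions belonging to group g
    v100[rows] = v[rows] .- mean(v[rows])     # subtract the group mean
end
```

After the loop v100 is [-1.0, -1.0, 1.0, 1.0], the same values as in the v100 column above.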

Chain.jl

The first functionality we might want to use is to put the operations in a pipe. This is achieved with the Chain.jl package:

julia> using Chain

julia> @chain df begin
           groupby(:id)
           transform(:v => (x -> x .- mean(x)) => :v100)
       end
4×3 DataFrame
 Row │ id     v      v100
     │ Int64  Int64  Float64
─────┼───────────────────────
   1 │     1      1     -1.0
   2 │     2      2     -1.0
   3 │     1      3      1.0
   4 │     2      4      1.0

We have achieved the benefit of a better visual separation of operations. In my opinion Chain.jl can be considered
the currently most widely accepted approach to piping operations in Julia (there are alternatives in the ecosystem
but as far as I can tell they have a lower level of adoption).

DataFramesMeta.jl

Still the transform(:v => (x -> x .- mean(x)) => :v100) part looks verbose. Let us start by showing
how it can be made simpler using DataFramesMeta.jl:

julia> using DataFramesMeta

julia> @chain df begin
           groupby(:id)
           @transform(:v100 = :v .- mean(:v))
       end
4×3 DataFrame
 Row │ id     v      v100
     │ Int64  Int64  Float64
─────┼───────────────────────
   1 │     1      1     -1.0
   2 │     2      2     -1.0
   3 │     1      3      1.0
   4 │     2      4      1.0

In my opinion the code is now really easy to read.

Here is the status of DataFramesMeta.jl:

  • It is actively maintained.
  • Its syntax is close to DataFrames.jl.
  • It uses : to signal that some name is a column of a data frame.

DataFrameMacros.jl

DataFrameMacros.jl is another package that is closely tied to DataFrames.jl. Let us see how we can use it.
Note that you need to restart the Julia session before running the code, as its macro names overlap with those of DataFramesMeta.jl:

julia> using DataFrameMacros

julia> @chain df begin
           groupby(:id)
           @transform(:v100 = @bycol :v .- mean(:v))
       end
4×3 DataFrame
 Row │ id     v      v100
     │ Int64  Int64  Float64
─────┼───────────────────────
   1 │     1      1     -1.0
   2 │     2      2     -1.0
   3 │     1      3      1.0
   4 │     2      4      1.0

Note the difference with the @bycol expression. It is needed because in DataFrameMacros.jl @transform by default vectorizes operations.
This is often more convenient for users, but sometimes (like in this case), one wants to suppress vectorization.
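
To illustrate why the distinction matters here, consider this plain Julia sketch for the group with id equal to 1. It only mimics the idea of row-wise versus column-wise evaluation and is not the code the macro generates (by_column and by_row are names I made up):

```julia
using Statistics

v = [1, 3]   # values of column v in the group with id 1

# column-wise: mean sees the whole vector, which is what @bycol asks for
by_column = v .- mean(v)

# row-wise: the expression is evaluated for one value at a time,
# so mean sees a single number and the difference is always zero
by_row = [x - mean(x) for x in v]
```

Here by_column is [-1.0, 1.0], while by_row is [0.0, 0.0], which is not what we want in this example.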

What is the status of DataFrameMacros.jl?

  • It is maintained but less actively developed than DataFramesMeta.jl.
  • Its syntax is close to DataFrames.jl, but several macros, for user convenience, vectorize operations by default (as opposed to Base Julia).
  • It uses : to signal that some text is a column of a data frame.

TidierData.jl

Now let us see the TidierData.jl package that is designed to follow dplyr from R:

julia> using TidierData

julia> @chain df begin
           @group_by(id)
           @mutate(v100 = v - mean(v))
           @ungroup
       end
4×3 DataFrame
 Row │ id     v      v100
     │ Int64  Int64  Float64
─────┼───────────────────────
   1 │     1      1     -1.0
   2 │     1      3      1.0
   3 │     2      2     -1.0
   4 │     2      4      1.0

If you know dplyr you should be at home with this syntax.

What is the status of TidierData.jl?

  • It is actively maintained.
  • It tries to guess as much as possible; the package automatically decides which functions should be vectorized (in our example the - operator was vectorized but mean was not).
  • You do not need a : prefix in column names; the package uses scoping similar to R to resolve variable names.

As you can see, the R-style syntax is designed for maximum convenience, at the expense of control (a lot of “magic” happens behind the scenes;
admittedly most of the time this magic is what novice users would want).

Conclusions

Here is a recap of what we have discussed:

  • Meta-packages are here to make life easier for users. There is no need to be afraid of them.
  • For piping I recommend using Chain.jl.
  • Use plain DataFrames.jl if you are a die-hard Julia user and want all your code to be valid Julia syntax (I prefer it when writing production stuff).
  • Use DataFramesMeta.jl if you want an experience most consistent with Base Julia (this is my personal preference for interactive sessions, but it requires the most knowledge of Julia).
  • DataFrameMacros.jl is an in-between package; it adds some more convenience (e.g. vectorization by default), but does not push it to the extreme
    (it also has a super convenient {} notation which you might find useful; I decided to skip it to keep the post simple to follow).
  • TidierData.jl goes for maximum convenience. It follows R-style and tries to guess what you most likely wanted to do. Users familiar with dplyr should be able to start using it immediately.