A little exercise in CSV.jl and DataFrames.jl

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2024/01/19/puzzles.html

Introduction

This week I have discussed with my colleague the Lichess puzzle dataset
that I use in my Julia for Data Analysis book.

The dataset contains a list of puzzles along with information about them,
such as puzzle difficulty, puzzle solution, and tags describing puzzle type.

We were discussing if tags assigned to puzzles in this dataset are accurate.
In this post I give you an example how one can check it
(and practice a bit CSV.jl and DataFrames.jl).

The post was written under Julia 1.10.0, CSV.jl 0.10.12, and DataFrames.jl 1.6.1.

Getting the data

In this post I show you a relatively brief code. Therefore I assume that first
you download the file with the puzzle dataset and unpack it manually.
(In the book I show how to do it using Julia. You can find the source code on
GitHub repository of the book.)

Assuming you downloaded and unpacked the dataset into the puzzles.csv file
we read it in. We are interested only in columns 3 and 8 of this file,
so I use the following commands:

julia> using CSV

julia> using DataFrames

julia> df = CSV.read("puzzles.csv", DataFrame; select=[3, 8], header=false)
2132989×2 DataFrame
     Row │ Column3                            Column8
         │ String                             String
─────────┼──────────────────────────────────────────────────────────────────────
       1 │ f2g3 e6e7 b2b1 b3c1 b1c1 h6c1      crushing hangingPiece long middl…
       2 │ d3d6 f8d8 d6d8 f6d8                advantage endgame short
       3 │ b6c5 e2g4 h3g4 d1g4                advantage middlegame short
       4 │ g5e7 a5c3 b2c3 c6e7                advantage master middlegame short
       5 │ e8f7 e2e6 f7f8 e6f7                mate mateIn2 middlegame short
       6 │ a6a5 e5c7 a5b4 c7d8                crushing endgame fork short
       7 │ d4b6 f6e4 h1g1 e4f2                crushing endgame short trappedPi…
       8 │ d8f6 d1h5 h7h6 h5c5                advantage middlegame short
    ⋮    │                 ⋮                                  ⋮
 2132982 │ d2c2 c5d3 c2d3 c4d3                crushing fork middlegame short
 2132983 │ b8d7 c3b5 d6b8 a1c1 e8g8 b5c7      crushing long middlegame quietMo…
 2132984 │ g7g6 d5c6 c5c4 b3c4 b4c4 c6d6      crushing defensiveMove endgame l…
 2132985 │ g1h1 e3e1 f7f1 e1f1                endgame mate mateIn2 short
 2132986 │ g5c1 d5d6 d7f6 h7h8                advantage middlegame short
 2132987 │ d2f3 d8a5 c1d2 a5b5                advantage fork opening short
 2132988 │ f7f2 b2c2 c1b1 e2d1                endgame mate mateIn2 queensideAt…
 2132989 │ c6d4 f1e1 e8d8 b1c3 d4f3 g2f3      advantage long opening
                                                            2132973 rows omitted

julia> rename!(df, ["moves", "tags"])
2132989×2 DataFrame
     Row │ moves                              tags
         │ String                             String
─────────┼──────────────────────────────────────────────────────────────────────
       1 │ f2g3 e6e7 b2b1 b3c1 b1c1 h6c1      crushing hangingPiece long middl…
       2 │ d3d6 f8d8 d6d8 f6d8                advantage endgame short
       3 │ b6c5 e2g4 h3g4 d1g4                advantage middlegame short
       4 │ g5e7 a5c3 b2c3 c6e7                advantage master middlegame short
       5 │ e8f7 e2e6 f7f8 e6f7                mate mateIn2 middlegame short
       6 │ a6a5 e5c7 a5b4 c7d8                crushing endgame fork short
       7 │ d4b6 f6e4 h1g1 e4f2                crushing endgame short trappedPi…
       8 │ d8f6 d1h5 h7h6 h5c5                advantage middlegame short
    ⋮    │                 ⋮                                  ⋮
 2132982 │ d2c2 c5d3 c2d3 c4d3                crushing fork middlegame short
 2132983 │ b8d7 c3b5 d6b8 a1c1 e8g8 b5c7      crushing long middlegame quietMo…
 2132984 │ g7g6 d5c6 c5c4 b3c4 b4c4 c6d6      crushing defensiveMove endgame l…
 2132985 │ g1h1 e3e1 f7f1 e1f1                endgame mate mateIn2 short
 2132986 │ g5c1 d5d6 d7f6 h7h8                advantage middlegame short
 2132987 │ d2f3 d8a5 c1d2 a5b5                advantage fork opening short
 2132988 │ f7f2 b2c2 c1b1 e2d1                endgame mate mateIn2 queensideAt…
 2132989 │ c6d4 f1e1 e8d8 b1c3 d4f3 g2f3      advantage long opening
                                                            2132973 rows omitted

Note that the file does not have a header so when reading it we passed header=false
and then manually named the columns using rename!.

The task

I wanted only these two columns since today I want to check if the tags related
to mating are accurate. You can notice in the above printout that in the "tags"
column we have a tag "mateIn2". It indicates that the puzzle is mate in two moves.
This is the case for example for rows 5, 2132985, and 2132988.
In the matching "moves" column we see that we have 4 corresponding moves.
The reason is that we have two players making the move (and 2 + 2 = 4).

What we want to check if these "mateInX" tags are correct. I will check the
values of X from 1 to 5 (as only these five options are present in tags,
I leave it to you as an exercise to verify).

When should we call the tags correct. There are two conditions:

  • there is no duplicate tagging (e.g. a puzzle cannot be "mateIn1" and "mateIn2" at the same time);
  • the number of moves in a puzzle matches the tag.

Let us check it.

The solution

As a first step we (in place, i.e. modifying our df data frame) transform the original columns
into more convenient form. Instead of the raw "moves" I want the "nmoves" column that gives me
a number of moves in the puzzle. Similarly instead of "tags" I want indicator columns "mateInX"
for X ranging from 1 to 5 showing me the puzzle type. Here is how you can achieve this:

julia> select!(df,
               "moves" => ByRow(length∘split) => "nmoves",
               ["tags" => ByRow(contains("mateIn$i")) => "mateIn$i" for i in 1:5])
2132989×6 DataFrame
     Row │ nmoves  mateIn1  mateIn2  mateIn3  mateIn4  mateIn5
         │ Int64   Bool     Bool     Bool     Bool     Bool
─────────┼─────────────────────────────────────────────────────
       1 │      6    false    false    false    false    false
       2 │      4    false    false    false    false    false
       3 │      4    false    false    false    false    false
       4 │      4    false    false    false    false    false
       5 │      4    false     true    false    false    false
       6 │      4    false    false    false    false    false
       7 │      4    false    false    false    false    false
       8 │      4    false    false    false    false    false
    ⋮    │   ⋮        ⋮        ⋮        ⋮        ⋮        ⋮
 2132982 │      4    false    false    false    false    false
 2132983 │      6    false    false    false    false    false
 2132984 │      6    false    false    false    false    false
 2132985 │      4    false     true    false    false    false
 2132986 │      4    false    false    false    false    false
 2132987 │      4    false    false    false    false    false
 2132988 │      4    false     true    false    false    false
 2132989 │      6    false    false    false    false    false
                                           2132973 rows omitted

Now we see that some of the rows are not tagged as "mateInX". Let us filter them out,
to have only tagged rows left (again, we do the operation in-place):

julia> filter!(row -> any(row[Not("nmoves")]), df)
491743×6 DataFrame
    Row │ nmoves  mateIn1  mateIn2  mateIn3  mateIn4  mateIn5
        │ Int64   Bool     Bool     Bool     Bool     Bool
────────┼─────────────────────────────────────────────────────
      1 │      4    false     true    false    false    false
      2 │      4    false     true    false    false    false
      3 │      2     true    false    false    false    false
      4 │      4    false     true    false    false    false
      5 │      2     true    false    false    false    false
      6 │      4    false     true    false    false    false
      7 │      4    false     true    false    false    false
      8 │      2     true    false    false    false    false
   ⋮    │   ⋮        ⋮        ⋮        ⋮        ⋮        ⋮
 491736 │      6    false    false     true    false    false
 491737 │      4    false     true    false    false    false
 491738 │      2     true    false    false    false    false
 491739 │      4    false     true    false    false    false
 491740 │      2     true    false    false    false    false
 491741 │      2     true    false    false    false    false
 491742 │      4    false     true    false    false    false
 491743 │      4    false     true    false    false    false
                                           491727 rows omitted

Note that in the condition I used the row[Not("nmoves")] selector, as I wanted to check all columns except the "nmoves".

Now we are ready to check the correctness of tags:

julia> combine(groupby(df, "nmoves"), Not("nmoves") .=> sum)
10×6 DataFrame
 Row │ nmoves  mateIn1_sum  mateIn2_sum  mateIn3_sum  mateIn4_sum  mateIn5_sum
     │ Int64   Int64        Int64        Int64        Int64        Int64
─────┼─────────────────────────────────────────────────────────────────────────
   1 │      2       136843            0            0            0            0
   2 │      4            0       274135            0            0            0
   3 │      6            0            0        68623            0            0
   4 │      8            0            0            0         9924            0
   5 │     10            0            0            0            0         1691
   6 │     12            0            0            0            0          367
   7 │     14            0            0            0            0          127
   8 │     16            0            0            0            0           25
   9 │     18            0            0            0            0            7
  10 │     20            0            0            0            0            1

The table reads as follows:

  • There are no duplicates in tags.
  • Tags "mateInX" for X in 1 to 4 range are correct.
    The "mateIn5" tag actually means a situation where there are five or more moves.

So the verdict is that tagging is correct, but we need to know the interpretation of
"mateIn5" column as it is actually five or more moves. We could rename the column to
e.g. "mateIn5+" to reflect that or add a metadata to our df table where we would store
this information (I leave this to you as an exercise).

Conclusions

I hope that CSV.jl and DataFrames.jl users found the examples that I gave today useful and interesting. Enjoy!