Author Archives: Blog by Bogumił Kamiński

What does it mean that a data frame is a collection of rows?

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/02/24/dfrows.html

Introduction

In my recent post I discussed the interfaces that Julia defines for
working with containers. Today I want to take a closer look at data frame
objects that are defined in DataFrames.jl.

Before I move forward I want to make a small announcement. On my blog I have
recently added a Learning section, where I collect a list of learning
materials that I find useful for doing data science with Julia. If you would
like to have an item added to this list please contact me.

This post was written under Julia 1.9.0-beta4 and DataFrames.jl 1.5.0.

Interfaces refresher

Let me start by recalling the discussion we had in that post about
data frame design:

  1. A data frame is not iterable. Instead, if you want to iterate its rows
    use the eachrow wrapper, and if you want to iterate its columns use the
    eachcol wrapper.
  2. You can index a data frame like a matrix, but you are always required to
    pass both row and column indices (in other words: linear indexing is not
    supported).
  3. In broadcasting a data frame behaves as a matrix (a two-dimensional
    container).
  4. You can get columns of a data frame by their name using property access.
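
The four points above can be sketched as follows. This is only a minimal
illustration, assuming DataFrames.jl is installed; the data frame df and its
columns a and b are made up for this example.

```julia
using DataFrames

# a small example data frame (hypothetical data)
df = DataFrame(a=1:3, b=4:6)

# 1. iteration requires an explicit wrapper
row_sums = [row.a + row.b for row in eachrow(df)]  # one value per row
col_sums = [sum(col) for col in eachcol(df)]       # one value per column

# 2. indexing always takes both a row and a column index
x = df[2, :b]

# 3. broadcasting treats the data frame as a two-dimensional container
df2 = df .+ 1

# 4. property access returns a column
col = df.a
```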

For this reason users are often surprised when they read that a data frame
is considered to be a collection of rows. From the way it supports the
standard interfaces it does not seem to be the case.

What is missing from this picture is that the four interfaces above are
related to only a limited number of functions:

  1. Iteration: iterate.
  2. Indexing: getindex, setindex!, firstindex, lastindex, size.
  3. Broadcasting: axes, broadcastable, BroadcastStyle, similar,
    copy, copyto!.
  4. Property access: getproperty and setproperty!.

In general, DataFrames.jl exposes over 120 functions that work on data frame
objects, and 38 of them are methods extending functions that are defined in
Base Julia and work on collections. All these functions consider a data frame
to be a collection of rows.

In what follows I will go through all of them so that DataFrames.jl users have
an easy reference to them in one place.

Row sorting and reordering

We support sort, sort!, sortperm, issorted, permute!, invpermute!,
reverse!, reverse, shuffle!, and shuffle functions that work on data
frame rows.

Here let me remark that in particular shuffling functions are often quite handy
when preparing data to be passed to various statistical models.
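
A short sketch of these functions at work is given below. It assumes
DataFrames.jl is installed; the data frame and the seed passed to the random
number generator are arbitrary choices made only for this illustration.

```julia
using DataFrames, Random

# a small example data frame (hypothetical data)
df = DataFrame(id=[3, 1, 2], x=["c", "a", "b"])

sorted = sort(df, :id)        # new data frame with rows ordered by :id
perm = sortperm(df, :id)      # row permutation that sorts df by :id
rev = reverse(df)             # new data frame with rows in reverse order

# shuffle rows; the seeded generator is used here only for reproducibility
shuffled = shuffle(MersenneTwister(1234), df)
```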

Dropping rows

We support deleteat!, keepat!, empty!, empty, filter!, filter,
first, last, and resize!.

Let me mention that resize! allows you not only to drop rows from a data frame
but also to add them (although it is not often used).
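
Here is a minimal sketch of several of these functions, assuming DataFrames.jl
is installed. The data is made up, and the comments track the content of
column a after each in-place operation.

```julia
using DataFrames

df = DataFrame(a=1:5)

head = first(df, 2)                 # first two rows (new data frame)
tail = last(df, 2)                  # last two rows (new data frame)
odd = filter(:a => isodd, df)       # rows for which the predicate holds

deleteat!(df, 1)                    # drop row 1 in place; a is now 2, 3, 4, 5
keepat!(df, [1, 3])                 # keep only rows 1 and 3; a is now 2, 4
resize!(df, 1)                      # shrink to one row; a is now 2
remaining = copy(df.a)

resize!(df, 3)                      # grow to three rows; the values in the
                                    # added rows are not initialized
```

Because the rows added by resize! hold arbitrary values, you would normally
fill them right after resizing.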

In addition there is an isempty function that checks if a data frame has zero
rows. It is important to remember that the data frame is not required to have
zero columns:

julia> using DataFrames

julia> df = DataFrame(a=1, b=2)
1×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      2

julia> isempty(df)
false

julia> empty!(df)
0×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┴──────────────

julia> isempty(df)
true

Remember that if the DataFrames.jl documentation says that a data frame is
empty it means that it has zero rows (but it does not say anything about the
number of columns).

Adding rows

You can add single rows to a data frame using push!, pushfirst!, and
insert! or collections of rows (in general Tables.jl tables) using
append! and prepend!.

Related functions are repeat! and repeat to repeat rows in a data frame.
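
The sketch below goes through these functions one by one, assuming
DataFrames.jl is installed. The rows are made up, and the vector of named
tuples passed to prepend! is just one example of a Tables.jl table.

```julia
using DataFrames

df = DataFrame(a=[2], b=["y"])

push!(df, (a=3, b="z"))                  # add a row at the end
pushfirst!(df, (a=1, b="x"))             # add a row at the beginning
insert!(df, 2, (a=0, b="w"))             # add a row at position 2
append!(df, DataFrame(a=[4], b=["u"]))   # add rows of another data frame at the end
prepend!(df, [(a=-1, b="v")])            # add rows of a Tables.jl table at the beginning

doubled = repeat(df, 2)                  # new data frame with all rows repeated twice
```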

Row extraction

There are four functions that allow you to extract one row from a data frame:
only, pop!, popat!, and popfirst!.

I often find the only function useful, when I want to explicitly verify
the contract that some operation returned a data frame with only one row.
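
A minimal sketch of the four functions follows, assuming DataFrames.jl is
installed. The data is made up, and the try-catch block is only there to show
that only rejects a data frame that does not have exactly one row.

```julia
using DataFrames

df = DataFrame(a=1:4)

r1 = popfirst!(df)      # remove and return the first row; a is now 2, 3, 4
r2 = pop!(df)           # remove and return the last row; a is now 2, 3
r3 = popat!(df, 2)      # remove and return row 2; a is now 2

row = only(df)          # df has exactly one row, so this succeeds

# only throws an error if the data frame does not have exactly one row
only_failed = try
    only(DataFrame(a=1:2))
    false
catch
    true
end
```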

Identification of missing values in rows

You can use completecases to find which rows do not contain missing values,
and dropmissing! and dropmissing to drop the rows that do contain them.
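
For example (a small sketch on made-up data, assuming DataFrames.jl is
installed):

```julia
using DataFrames

df = DataFrame(a=[1, missing, 3], b=["x", "y", missing])

complete = completecases(df)        # true for rows without any missing value
no_missing = dropmissing(df)        # keep only the complete rows
no_missing_a = dropmissing(df, :a)  # check only column :a for missing values
```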

Identification of unique rows

There are unique and unique! functions that return unique rows in a
source data frame. To get an indicator vector showing which rows are
non-unique use the nonunique function, and allunique checks if all rows in a
data frame are unique.

Here let me show an example of nonunique functionality that allows you to
choose which duplicates are highlighted (it was added in the 1.5 release):

julia> df = DataFrame(a=[1, 2, 1, 3, 1])
5×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     1
   4 │     3
   5 │     1

julia> nonunique(df) # by default first duplicate is kept
5-element Vector{Bool}:
 0
 0
 1
 0
 1

julia> nonunique(df, keep=:last) # keep last duplicate
5-element Vector{Bool}:
 1
 0
 1
 0
 0

julia> nonunique(df, keep=:noduplicates) # do not keep any duplicates
5-element Vector{Bool}:
 1
 0
 1
 0
 1

The keep keyword argument name is used because false in the returned vector
often indicates which rows should later be kept in a data frame
(the same keyword argument name is consistently used in unique and unique!).
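
As a sketch of that consistency, here is the same data frame passed to unique
with each of the three keep values, assuming DataFrames.jl 1.5 or later. The
last line shows how the indicator vector from nonunique gives the same result
as the default unique call.

```julia
using DataFrames

df = DataFrame(a=[1, 2, 1, 3, 1])

u_first = unique(df)                       # keep the first occurrence (default)
u_last = unique(df, keep=:last)            # keep the last occurrence
u_none = unique(df, keep=:noduplicates)    # keep only rows that have no duplicates

# the same result as unique(df), obtained from the indicator vector
manual = df[.!nonunique(df), :]
```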

Conclusions

As you could see in this post there are many functions in Base Julia that
support working with collections. In DataFrames.jl we wanted users to be able
to reuse these functions when working with data frames. Therefore all of them
are supported and they consider data frames to be collections of rows.

Sometimes it is useful to have an iterable and indexable collection of,
respectively, rows and columns of a data frame. For this reason we provide the
eachrow and eachcol wrappers that provide this functionality. As a
consequence, for clarity and to minimize the risk of error on the user’s side,
an unwrapped data frame is not iterable and behaves like a matrix in
indexing and broadcasting.

Julia and Python better together

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/02/17/python.html

Introduction

Many data scientists I talk with tell me that they like Julia, but
there are some functionalities in Python that they like and would want to
keep using.

In this post I want to show that if you are in this situation the answer is:
it is fine – you can work with Julia and keep using Python packages you like
as a part of your Julia workflow.

The post is tested under Julia 1.8.5, PyCall.jl 1.95.1,
Conda.jl 1.8.0, PyPlot.jl 2.11.0, and GLM.jl 1.8.1.

Level 1: Popular Python packages have an interface in Julia

Many (if not the majority of) Python packages are written in other languages
(like C++) and Python is only a wrapper around them. Most Python users do not
care (or even think) about it – they focus on getting their projects delivered.

The same approach can be used in Julia. Specifically, a Julia package can be
just a wrapper around some Python package. One popular example of such a
package is PyPlot.jl, a wrapper around Matplotlib that is used in the example
below.

This is an easy scenario as all you need to do is install a Julia package
and you are ready to go and use your favorite Python package.

I will give here one example from Matplotlib documentation. I want to reproduce
the Python code given here:

# Python code
import matplotlib.pyplot as plt
import numpy as np
fig, ax = plt.subplots(figsize=(5, 2.7))
t = np.arange(0.0, 5.0, 0.01)
s = np.cos(2 * np.pi * t)
ax.plot(t, s, lw=2)
ax.annotate('local max', xy=(2, 1), xytext=(3, 1.5),
            arrowprops=dict(facecolor='black', shrink=0.05))
ax.set_ylim(-2, 2)

Now let us do the same in Julia using PyPlot.jl:

# Julia code
using PyPlot
fig, ax = plt.subplots(figsize=(5, 2.7))
t = 0.0:0.01:4.99
s = cos.(2 * π * t)
ax.plot(t, s, lw=2)
ax.annotate("local max", xy=(2, 1), xytext=(3, 1.5),
            arrowprops=Dict("facecolor" => "black", "shrink" => 0.05))
ax.set_ylim(-2, 2)

As you can see the two code snippets are almost identical. The only differences
are related to syntax. For example ' is replaced by " and dict by Dict.

Both snippets produce the following plot:

Example plot

Level 2: You can use any Python package from Julia

Not all Python packages have Julia wrappers. Also, in some cases you might
want to work with a different version or configuration of the Python package
than provided by the wrapper. Is this a problem? No, this is not a problem at
all:

You can load and use any Python package from Julia.

There are two Julia packages that allow for this: PyCall.jl and
PythonCall.jl. Both packages provide the ability to directly call and
fully interoperate with Python from the Julia language. There are some
technical differences between them which are described here so that
you can decide which one you prefer.

Below I will give you an example of using PyCall.jl.

Assume you like using the statsmodels package from Python and want to
reproduce this example from its documentation:

# Python code
import numpy as np
import statsmodels.api as sm
spector_data = sm.datasets.spector.load()
spector_data.exog = sm.add_constant(spector_data.exog, prepend=False)
mod = sm.OLS(spector_data.endog, spector_data.exog)
res = mod.fit()
print(res.summary())

Let me show you how to reproduce it step-by-step in Julia.

We start with loading the packages:

using PyCall
sm = pyimport("statsmodels.api")

Note that using PyCall.jl we can load any Python package using the
pyimport function. The second line might give you an error like this:

julia> sm = pyimport("statsmodels.api")
ERROR: PyError (PyImport_ImportModule
...

This is not a problem. It just informs you that the statsmodels package
is not installed under Python. In this case you can easily install it from
Julia. Just do:

using Conda
Conda.add("statsmodels")

and you are ready to go. Now sm = pyimport("statsmodels.api") will work.

We are ready to build the model in Julia using statsmodels:

# Julia code
spector_data = sm.datasets.spector.load()
spector_data["exog"] = sm.add_constant(spector_data["exog"], prepend=false)
mod = sm.OLS(spector_data["endog"], spector_data["exog"])
res = mod.fit()
res.summary()

and you get the output that is the same as in statsmodels documentation:

PyObject <class 'statsmodels.iolib.summary.Summary'>
"""
                            OLS Regression Results
==============================================================================
Dep. Variable:                  GRADE   R-squared:                       0.416
Model:                            OLS   Adj. R-squared:                  0.353
Method:                 Least Squares   F-statistic:                     6.646
Date:                Fri, 17 Feb 2023   Prob (F-statistic):            0.00157
Time:                        14:41:35   Log-Likelihood:                -12.978
No. Observations:                  32   AIC:                             33.96
Df Residuals:                      28   BIC:                             39.82
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
GPA            0.4639      0.162      2.864      0.008       0.132       0.796
TUCE           0.0105      0.019      0.539      0.594      -0.029       0.050
PSI            0.3786      0.139      2.720      0.011       0.093       0.664
const         -1.4980      0.524     -2.859      0.008      -2.571      -0.425
==============================================================================
Omnibus:                        0.176   Durbin-Watson:                   2.346
Prob(Omnibus):                  0.916   Jarque-Bera (JB):                0.167
Skew:                           0.141   Prob(JB):                        0.920
Kurtosis:                       2.786   Cond. No.                         176.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors
is correctly specified.

Note that the differences between the two snippets are minimal. Again I needed
to adjust to Julia syntax by changing False to false and doing dictionary access
using square brackets like in spector_data["exog"]. All else is identical (and
for this reason I put a comment on top showing which language is used as the
two could be easily confused).

You might, however, ask if it is possible to use data from Python in Julia (or
data from Julia in Python). Yes – this is also supported. The only thing to
remember is that automatic conversion of values between Python and Julia is done
for a predefined list of most common types (like arrays, dictionaries).
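
A minimal sketch of such automatic conversions is shown below. It assumes
PyCall.jl is installed and configured with a working Python; the Python math
module and the built-in len function are used only because they are always
available, and they are not part of the statsmodels example.

```julia
using PyCall

# a Python float returned by math.sqrt is converted to a Julia Float64
math = pyimport("math")
root = math.sqrt(16.0)

# a Julia Vector is converted automatically when passed to a Python function
pylen = pybuiltin("len")
n = pylen([1, 2, 3])

# a Python dict produced by a py"..." expression is converted to a Julia Dict
d = py"{'a': 1, 'b': 2}"
```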

Let me give an example of how this is done by estimating the same regression
using GLM.jl. What I will do is transport the data as arrays from Python to
Julia (I chose this case as it is most commonly used in my experience).

using GLM
lm(spector_data["exog"].to_numpy(), spector_data["endog"].to_numpy())

And you get the output:

LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64,
LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}:

Coefficients:
───────────────────────────────────────────────────────────────────
         Coef.  Std. Error      t  Pr(>|t|)   Lower 95%   Upper 95%
───────────────────────────────────────────────────────────────────
x1   0.463852    0.161956    2.86    0.0078   0.132099    0.795604
x2   0.0104951   0.0194829   0.54    0.5944  -0.0294137   0.0504039
x3   0.378555    0.139173    2.72    0.0111   0.0934724   0.663637
x4  -1.49802     0.523889   -2.86    0.0079  -2.57115    -0.42488
───────────────────────────────────────────────────────────────────

As you can see the results are the same, except that we have lost column
names. The reason is that we have used arrays to transport data from Python to
Julia (column names also could be transported, but I did not want to
complicate the example).

Conclusions

Today the conclusion is short (but for me extremely powerful):

From Julia you have all Julia and all Python packages available to use in
your projects.

If you know some package in Python and want to keep using it in Julia it is
easy. In most cases you can just copy-paste your Python code to Julia and do
minor syntax adjustments and you are done.

I find this interoperability of Julia and Python really amazing.

What is ∈ in Julia?

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/02/10/in.html

Introduction

Today I decided to discuss the in function, which is a basic topic that,
from my teaching experience, often surprises people learning Julia.
I will cover several concrete cases that are worth knowing, as you might either
use them yourself or encounter them in the code that you read.

The post is tested under Julia 1.8.5.

The basic syntax of in

in is a function in Julia. It is used to determine whether an item is in the
given collection.

Since in is a function you can invoke it using the standard function call
syntax:

julia> in(1, [1, 2, 3])
true

However, this operation is so common that there are two other ways to perform
it:

julia> 1 in [1, 2, 3]
true

julia> 1 ∈ [1, 2, 3]
true

If you wonder how to type ∈ then you can check it in Julia’s help:

help?> ∈
"∈" can be typed by \in<tab>

Since ∈ is the same as in you can also write:

julia> ∈(1, [1, 2, 3])
true

although this is likely not the most readable way to do it.

Finally there is an accompanying syntax that has the order of arguments
reversed:

julia> ∋([1, 2, 3], 1)
true

julia> [1, 2, 3] ∋ 1
true

help?> ∋
"∋" can be typed by \ni<tab>

Negating in

Often you want to check if some element is not in a collection. Here are the
standard ways you can do it (you could similarly negate ∈ and ∋):

julia> !in(1, [1, 2, 3])
false

julia> !(1 in [1, 2, 3])
false

However, there are also the convenience operators ∉ and ∌:

julia> 1 ∉ [1, 2, 3]
false

julia> [1, 2, 3] ∌ 1
false

help?> ∉
"∉" can be typed by \notin<tab>

help?> ∌
"∌" can be typed by \nni<tab>

Higher-order function

In all cases of in, ∈, ∋, ∉, and ∌ you can easily create a function
taking only one argument by fixing the second argument of the operation.

For example writing in([1, 2, 3]) is equivalent to creating the anonymous
function e -> e in [1, 2, 3]. Let us show this syntax at work:

julia> in([1, 2, 3])(1)
true

julia> ∋(1)([1, 2, 3])
true

julia> ∉([1, 2, 3])(1)
false

This syntax is particularly useful when working with higher-order functions:

julia> map(in(Set([1, 2, 3])), [-1, 1, 3, 5])
4-element Vector{Bool}:
 0
 1
 1
 0
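
The same one-argument form can be passed to other higher-order functions from
Base. The sketch below uses made-up data; the names allowed and xs are chosen
only for this illustration.

```julia
allowed = Set([1, 2, 3])
xs = [-1, 1, 3, 5]

kept = filter(in(allowed), xs)         # elements of xs that are in allowed
dropped = filter(∉(allowed), xs)       # elements of xs that are not in allowed
n_hits = count(in(allowed), xs)        # number of elements found in allowed
positions = findall(in(allowed), xs)   # indices of elements found in allowed
has_unknown = any(∉(allowed), xs)      # is any element outside allowed?
```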

Performance

In the last example above you probably noticed that I used Set instead of a
vector for lookup. This is an important pattern:

  • lookup in a vector does not have any preprocessing cost, but the execution
    time of each later in call is, on average, linear in the size of the vector
    (advanced tip: if the vector is sorted you can use the insorted function
    instead and it will be faster);
  • lookup in a set has the cost of creating it, but the execution time of each
    later in call does not grow with the size of the collection.

In summary, if you have a large collection in which you want to perform lookup
many times then make sure to convert this collection to a set (timings are after
compilation):

julia> v = rand(1:1_000_000, 10_000);

julia> @time count(in(v), 1)
  0.000022 seconds (2 allocations: 48 bytes)
0

julia> @time count(in(Set(v)), 1)
  0.000199 seconds (10 allocations: 144.648 KiB)
0

julia> @time count(in(v), 1:1_000_000)
  6.104646 seconds (5 allocations: 112 bytes)
9941

julia> @time count(in(Set(v)), 1:1_000_000)
  0.017825 seconds (13 allocations: 144.711 KiB)
9941

Note that when we made one lookup the cost of creating the Set was significant,
but when we made one million lookups creating the Set was crucial to ensure
good performance of the operation.
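
The insorted tip and the Set pattern can be sketched on a deterministic
example. The sorted vector of odd numbers below is made up for this
illustration, and insorted requires Julia 1.6 or later.

```julia
# a sorted vector of odd numbers (hypothetical data)
v = collect(1:2:1_000_001)

# linear scan; works for any vector
found_linear = 999_999 in v

# binary search; valid only because v is sorted
found_sorted = insorted(999_999, v)

# build the Set once and reuse it for many lookups
s = Set(v)
hits = count(in(s), 1:1_000_000)
```

Note that insorted does not check that the vector is sorted, so it is your
responsibility to ensure it.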

Broadcasting

It is tempting to run the operation:

julia> map(in(Set([1, 2, 3])), [-1, 1, 3, 5])
4-element Vector{Bool}:
 0
 1
 1
 0

using broadcasting like this:

julia> in.([-1, 1, 3, 5], Set([1, 2, 3]))
ERROR: DimensionMismatch: arrays could not be broadcast to a common size; got a dimension with lengths 4 and 3

However, this fails, because broadcasting iterates both arguments of the in
function: [-1, 1, 3, 5] and Set([1, 2, 3]). There are two ways to fix it.
The first is to protect the collection in which you want to perform
lookup using Ref:

julia> in.([-1, 1, 3, 5], Ref(Set([1, 2, 3])))
4-element BitVector:
 0
 1
 1
 0

julia> [-1, 1, 3, 5] .∈ Ref(Set([1, 2, 3]))
4-element BitVector:
 0
 1
 1
 0

The other is to use the higher-order function approach:

julia> in(Set([1, 2, 3])).([-1, 1, 3, 5])
4-element BitVector:
 0
 1
 1
 0
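
The sketch below puts these variants side by side. The one-element tuple is an
extra option that is not discussed above; I add it on the assumption that you
may meet it in other people's code. The last line shows why the protection
matters: if both arguments happen to have the same length no error is thrown
and the elements are paired one by one.

```julia
lookup = Set([1, 2, 3])
xs = [-1, 1, 3, 5]

r1 = in.(xs, Ref(lookup))      # Ref protects lookup from being iterated
r2 = in.(xs, (lookup,))        # a one-element tuple works the same way
r3 = in(lookup).(xs)           # broadcast the one-argument function
r4 = map(in(lookup), xs)       # the map version shown earlier

# without protection, equal-length vectors are silently paired element by element
paired = in.([1, 5, 3], [1, 2, 3])   # computes 1 in 1, 5 in 2, 3 in 3
```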

How does in lookup work?

The final issue is related to the definition of in. It states that in checks
if an item is in the given collection. But what does it mean exactly?

First you need to understand how the collections are iterated. If you do not
know much about this topic you can find a description of the iteration interface
in my recent post.

A typical example that is tricky is Dict lookup. Since Dict iterates
key-value pairs the following is incorrect:

julia> 1 in Dict(1 => "a", 2 => "b")
ERROR: AbstractDict collections only contain Pairs;

Instead you most likely wanted:

julia> 1 in keys(Dict(1 => "a", 2 => "b"))
true
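
The different lookups that a Dict supports can be sketched as follows (the
dictionary is the same as above; the variable names are chosen only for this
illustration):

```julia
d = Dict(1 => "a", 2 => "b")

key_found = 1 in keys(d)            # lookup among keys
key_found2 = haskey(d, 1)           # dedicated function for key lookup
value_found = "a" in values(d)      # lookup among values
pair_found = (1 => "a") in d        # a Dict itself contains key-value pairs
pair_missing = (1 => "b") in d      # the key exists, but with another value
```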

The second issue is how in checks for equality between an item and
elements of the collection. This issue is particularly tricky. Normally the
== function is used, but for Set and Dict the isequal function is used.

Here are some examples showing you the difference:

julia> v = [1.0, missing, -0.0]
3-element Vector{Union{Missing, Float64}}:
  1.0
   missing
 -0.0

julia> s = Set(v)
Set{Union{Missing, Float64}} with 3 elements:
  missing
  -0.0
  1.0

julia> d = Dict(v .=> 'a':'c')
Dict{Union{Missing, Float64}, Char} with 3 entries:
  missing => 'b'
  -0.0    => 'c'
  1.0     => 'a'

julia> missing in v
missing

julia> missing in s
true

julia> missing in keys(d)
true

julia> (missing => 'b') in d
true

julia> 0.0 in v
true

julia> -0.0 in s
true

julia> 0.0 in s
false

julia> 0.0 in keys(d)
false

julia> -0.0 in keys(d)
true

julia> (0.0 => 'c') in d
false

julia> (-0.0 => 'c') in d
true

The reason for these results is:

julia> missing == missing
missing

julia> isequal(missing, missing)
true

julia> 0.0 == -0.0
true

julia> isequal(0.0, -0.0)
false
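
The practical consequence can be sketched as follows. The NaN case is an extra
example in which == and isequal differ in the opposite direction, and
isequal_in is a hypothetical helper (it is not a Base function) that applies
the isequal rule to any collection.

```julia
v = [1.0, missing, -0.0]
s = Set(v)

# the same query gives different answers for a vector and for a Set
zero_in_vector = 0.0 in v        # true, because 0.0 == -0.0
zero_in_set = 0.0 in s           # false, because isequal(0.0, -0.0) is false

# NaN behaves in the opposite way
nan_in_vector = NaN in [NaN]     # false, because NaN == NaN is false
nan_in_set = NaN in Set([NaN])   # true, because isequal(NaN, NaN) is true

# hypothetical helper: isequal-based lookup in any collection,
# following the same rule that Set uses
isequal_in(x, collection) = any(isequal(x), collection)
```

With such a helper a lookup in a vector gives the same answers as a lookup in
a Set built from it, at the cost of a linear scan.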

Conclusions

As you can see the in function has several non-obvious behaviors in terms
of:

  • syntax: you can use five different operations: in, ∈, ∋, ∉, and ∌;
  • performance: be careful to avoid performance trap of doing many lookups in a
    vector;
  • lookup rule: Set and Dict use isequal test, while normally == is used;
    this is especially relevant in combination with performance recommendation –
    you might get a different result of your operations if you switch from vector
    to Set because you wanted to speed-up your computations.

All topics I discussed today are documented in the Julia Manual. However,
I hope that having them presented by example in a single place in this post is
useful for you.