
Using DataFrames and PyPlot in Julia

By: Jaafar Ballout

Re-posted from: https://www.supplychaindataanalytics.com/using-dataframes-and-pyplot-in-julia/

Julia includes many packages that can be used in operations research and optimization in general. This article serves as a brief introduction to different packages available in Julia’s ecosystem. I will focus on two packages that are stepping-stones for future work: DataFrames.jl and PyPlot. DataFrames.jl provides a set of tools for working with tabular data, similar to pandas in Python. The PyPlot module provides a Julia interface to the Matplotlib plotting library from Python. Although PyPlot relies on Python and Matplotlib, Julia can install a private Python distribution that can’t be accessed outside Julia’s environment.

Adding the necessary packages

Open a new terminal window and run Julia. Use the code below to add the packages required for this article.

julia> using Pkg
julia> Pkg.add("DataFrames")
julia> Pkg.add("CSV")
julia> Pkg.add("Arrow")

Open a new terminal window and run Julia. Initialize the PYTHON environment variable; setting it to an empty string makes Julia install its own private Python distribution for PyPlot:

julia> ENV["PYTHON"] = ""
""

Install PyPlot:

julia> using Pkg
julia> Pkg.add("PyPlot")

After adding all the required packages in the Julia REPL, the following code imports them in the Jupyter editor (i.e. a Jupyter notebook).

using DataFrames
using CSV
using Arrow
using PyPlot

Using DataFrames in Julia

DataFrames.jl relies on companion packages such as CSV.jl and Arrow.jl to read and write tabular data, which is why these packages were added in the previous section.
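As a quick illustration of how these packages fit together, here is a minimal sketch (not part of the covid-19 showcase below; the file names are arbitrary examples) that builds a small DataFrame and round-trips it through CSV and Arrow:

using DataFrames
using CSV
using Arrow

df_demo = DataFrame(date = ["2/8/2020", "2/9/2020"], cases = [10, 25]) # a small example table

CSV.write("demo.csv", df_demo)                        # write to a csv file
df_from_csv = CSV.File("demo.csv") |> DataFrame       # read it back as a DataFrame

Arrow.write("demo.arrow", df_demo)                    # write to the Arrow binary format
df_from_arrow = DataFrame(Arrow.Table("demo.arrow"))  # read it back as a DataFrame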

Using PyPlot in Julia

Because PyPlot is an interface to Matplotlib in Julia, all the documentation is available on Matplotlib’s main page.
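The Matplotlib syntax carries over almost verbatim. As a minimal sketch (not part of the showcase below), a simple sine curve can be plotted like this:

using PyPlot

x = 0:0.1:2π
plt.figure()
plt.plot(x, sin.(x), label="sin(x)")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()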

Covid-19 showcase using Julia

The first covid-19 case was detected on the 17th of November, 2019. Although two years have passed since the pandemic started, the virus persists and cases are still increasing exponentially. Thus, data analysis is important for understanding the growth of cases around the world. I will be using a .csv file containing data about the cases in country X. The file is available in a public repository on my GitHub page, so one option is to copy it into an Excel sheet and save it in the directory where the Jupyter notebook is located. Alternatively, the data can be imported from the web using Excel by attaching the link to the raw format of the dataset on GitHub. In both cases, saving the file as time.csv is necessary to match the code in later stages.

Here is the dataset, in csv format, opened in Excel on my desktop.

I will read the csv file in the Jupyter notebook using the code block below:

df  = CSV.File("time.csv") |> DataFrame; # read the csv file using the CSV package and convert it to a DataFrame using the pipe operator |>
df[1:5,:] # output the first five rows

Importing data files is smoother using DataFrames. Some of the operations in the code blocks below are explained in a previous post introducing Julia. Plotting is another important tool for understanding and visualizing data. Matplotlib was introduced on this blog before, but in Python.
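Before plotting, a few basic DataFrames.jl calls are useful for inspecting the imported table (a small sketch, assuming the df loaded above):

size(df)     # (number of rows, number of columns)
names(df)    # column names
describe(df) # summary statistics for each column
first(df, 5) # first five rows, equivalent to df[1:5,:]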

The code block below creates a bar chart showing the number of cumulative tests and cumulative negative cases over a period of six days from the DataFrame, df, imported above.

y1 = df[20:25,2]; # column 2: cumulative tests
y2 = df[20:25,3]; # column 3: cumulative negative cases
x = df[20:25,1];  # column 1: date

fig = plt.figure() 
ax = plt.subplot() 
ax.bar(x, y1, label="cumulative tests",color="black")
ax.bar(x, y2, label="cumulative negative cases",color="grey")
ax.set_title("Covid-19 data in Country X",fontsize=18,color="green")
ax.set_xlabel("date",fontsize=14,color="red")
ax.set_ylabel("number of cases",fontsize=14,color="red")
ax.legend(fontsize=10)
ax.grid(b=1,color="blue",alpha=0.1)
plt.show()

The bar chart above could be improved by allocating a bar for each category. This is done using the code block below.

barWidth = 0.25
br1 = 1:1:length(x)               # bar positions for the first series
br2 = [b + barWidth for b in br1] # positions shifted by one bar width for the second series

fig = plt.figure()
ax = plt.subplot() 
ax.bar(br1, y1,color="r", width=barWidth, edgecolor ="grey",label="cumulative tests")
ax.bar(br2, y2,color ="g",width=barWidth, edgecolor ="grey", label="cumulative negative cases")
ax.set_title("Covid-19 data in Country X",fontsize=18,color="green")
ax.set_xlabel("date", fontweight ="bold",fontsize=14)
ax.set_ylabel("number of cases", fontweight ="bold",fontsize=14)
ax.legend(fontsize=10)
ax.grid(b=1,color="blue",alpha=0.1)
plt.xticks([r + barWidth for r in br1],  ["2/8/2020", "2/9/2020", "2/10/2020", "2/11/2020", "2/12/2020","2/13/2020"])
plt.legend()
plt.show()

Understanding the exponential growth associated with covid-19 requires plotting the whole dataset, which spans 44 days. Therefore, the next code block shows the covid cases of the complete dataset.

days = 1:1:44
days_array = collect(days)
confirmed_cases = df[:,4] # column 4: cumulative confirmed cases

fig = plt.figure(figsize=(10,10))
ax = plt.subplot()
ax.bar(days_array,confirmed_cases , color ="blue",width = 0.4)
ax.set_title("A bar chart showing cumulative confirmed covid-19 cases in country X",fontsize=22,color="darkgreen")
ax.set_xlabel("day number",fontsize=16,color="darkgreen")
ax.set_ylabel("number of cases",fontsize=16,color="darkgreen")
ax.xaxis.set_ticks_position("none")
ax.yaxis.set_ticks_position("none")
ax.xaxis.set_tick_params(pad = 5)
ax.yaxis.set_tick_params(pad = 10)
ax.grid(b = 1, color ="grey",linestyle ="-.", linewidth = 0.5,alpha = 0.2)
fig.text(0.3, 0.8, "SCDA-JaafarBallout", fontsize = 10,color ="grey", ha ="right", va ="bottom",alpha = 0.6)
plt.show()

Even though bar charts are powerful in our case, it is still nice to observe the exponential curve itself. For that, the code block below plots the growth curve of confirmed cases.

fig = plt.figure(figsize=(10,10))
ax = plt.subplot()
ax.plot(days_array,confirmed_cases , color ="blue",marker="o",markersize=6, linewidth=2, linestyle ="--")
ax.set_title("A line chart showing cumulative confirmed covid-19 cases in country X",fontsize=22,color="darkgreen")
ax.set_xlabel("day number",fontsize=16,color="darkgreen")
ax.set_ylabel("number of cases",fontsize=16,color="darkgreen")
ax.grid(b = 1, color ="grey",linestyle ="-.", linewidth = 0.5,alpha = 0.2)
fig.text(0.3, 0.8, "SCDA-JaafarBallout", fontsize = 10,color ="grey", ha ="right", va ="bottom",alpha = 0.6)
plt.xticks(size=16, color ="black")
plt.yticks(size=16, color ="black")
plt.show()

Realizing a meme! (with Julia)

Recently, this meme went viral, so it is nice to figure out the missing functions by plotting them.

fig, axs = plt.subplots(2, 2);
fig.tight_layout(pad=4);

x1 = range(-10, 10, length=1000);
y1 = range(1, 1,length =1000);

axs[1].plot(x1, y1, color="blue", linewidth=2.0, linestyle="-");
axs[1].set_xlabel("x1");
axs[1].set_ylabel("y1");
axs[1].set_title("Constant");

x2 = range(-10, 10, length=1000);
y2 = x2.^3;

axs[2].plot(x2, y2, color="blue", linewidth=2.0, linestyle="-");
axs[2].set_xlabel("x2");
axs[2].set_ylabel("y2");
axs[2].set_title("Y = X^(3)");

x3 = range(-10, 10, length=1000);
y3 = x3.^2;

axs[3].plot(x3, y3, color="blue", linewidth=2.0, linestyle="-");
axs[3].set_xlabel("x3");
axs[3].set_ylabel("y3");
axs[3].set_title("Y = X^(2)");

x4 = range(-5, 5, length=1000);
y4 = cos.(x4);

axs[4].plot(x4, y4, color="blue", linewidth=2.0, linestyle="-");
axs[4].set_xlabel("x4");
axs[4].set_ylabel("y4");
axs[4].set_title("Y = cos(X)");

Graphing networks by hand or with traditional software seems exhausting. In future posts, I will demonstrate how to draw complicated networks using Julia and its dependencies. Then, I will solve the network problem using the JuMP package.


DataFrames.jl Style Linear Relations

By: Philip Zucker

Re-posted from: https://www.philipzucker.com/dataframes-jl-style-linear-relations/

DataFrames.jl is a Julia library to store, retrieve and manipulate tabular data. It is the analog of pandas for Python or related tools in R. It implements an interface that is a mish-mash of numpy-like slicing and SQL-like queries and can be used as a lightweight, flexible relational database. It has proven to be a popular and intuitive interface.
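As a small sketch of those two interface flavors (the tables and column names here are made up for illustration):

using DataFrames

people = DataFrame(id = [1, 2, 3], name = ["Ada", "Grace", "Alan"])
jobs   = DataFrame(id = [1, 2, 3], job  = ["math", "navy", "logic"])

people[1:2, [:name]]              # numpy-like slicing of rows and columns
innerjoin(people, jobs, on = :id) # SQL-like join on a key column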

Relational Databases are computer implementations of the mathematical concept of finite relations, but there are other classes of relations. This blog post is about using (abusing?) the DataFrame.jl interface to implement linear relations. Linear relations are useful for describing physical systems whose behavior is constrained by linear equations, such as electrical circuits.

It all comes out quite cute.

Databases and Relations

A finite relation over sets A,B,C,D is a subset of A\times B \times C \times D . In the database, each of these sets corresponds to a column of the database, and each element of the subset corresponds to a row of the database.

Much of the time, we think of storing discrete data in databases, things like names, ids, occupations, ages. However, DataFrames.jl is certainly also capable of storing floats. In this case, the DataFrame can be interpreted as storing a finite number of points in the continuous space R^n , one dimension for each column, one point for each row.
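For instance (a made-up illustration), three points of R^2 stored as a DataFrame, one column per coordinate and one row per point:

using DataFrames

points = DataFrame(x = [0.0, 1.5, -2.0], y = [1.0, 0.5, 3.0])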

In continuous space, it is desirable to describe not only discrete sets of points, but also blobs. Conceptually, a blob is just a subset of R^n, and is still a mathematical relation. In some sense a blob could be thought of as the contents of a database holding an uncountable number of rows, but then the enumerative storage schemes of an actual physical database don’t cut it anymore. We need an extremely compressed storage scheme.

Describing Linear Subspaces

Trying to store arbitrary blobs is a bit too unruly, but if we constrain ourselves to certain classes of blobs, we can still make headway. In this case, we are going to stick to linear blobs, i.e. hyperplanes.

One can concretely describe a hyperplane of R^n by two main representations. First one can list a set of vectors \{v_i \} that span the hyperplane. From this list, we are implicitly describing any linear combination of these vectors v = \sum \lambda_i v_i . A second distinct representation is to list the coefficient vectors \{c_i^T \} of a set of linear equations \{ c_i^T v = 0 \} describing the hyperplane. I will call these the span and constraint representations.

An alternative name for these two representations might be the range and nullspace representation, the generator and equation representation, or by analogy to polyhedral computation, the v-rep and h-rep.

These two representations are dual in some sense. One way in which they feel dual is that adding a new independent vector to the span increases the dimension of the linear subspace by 1. Adding a new independent linear equation to the constraint rep decreases the dimension of the subspace by 1.

It is important to have both representations because different operations on linear subspaces are easy in each. The intersection of two subspaces is easy in the constraint representation. It corresponds to just concatenating two constraint sets. Projection is easy in the span representation where it corresponds to just taking a slice out of each individual generator vector.

As more evidence for the two representation’s duality, two other important operations that are easy in a particular representation are the image and preimage operations under linear maps. The image is easy to compute in the span representation (matrix multiply {Av_i}) and the preimage is easy in the constraint representation (matrix multiply maybe with a transpose thrown in there {c_i^T A}).
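Before the DataFrames version, here is the same idea with plain matrices (a sketch; the particular plane is just an example). The rows of V span a plane in R^3, the rows of C are constraint coefficients, and nullspace converts one representation into the other:

using LinearAlgebra

V = [1.0 1.0 0.0;  # span rep: each row is a vector spanning the plane
     0.0 1.0 1.0]

C = nullspace(V)'  # constraint rep: each row is orthogonal to every span vector
@assert maximum(abs.(V * C')) < 1e-10

V2 = nullspace(C)' # back to a span rep (a different basis for the same plane)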

The Julia Implementation

The interface of DataFrames still makes sense for linear relations, and we can slightly abuse its existing machinery to make fast headway.

First, the constructors for the two representations. As a slight hack, to differentiate between the two representations, I will tag a special blank column recording which representation we are in. Otherwise I forward all arguments to the ordinary DataFrames constructor, so you can build these as you would any DataFrame. Each row of the dataframe is either one vector from the span set or one constraint from the constraint set. Columns are the components of these vectors.

using DataFrames
using LinearAlgebra

# representing linear relations via a spanning set of vectors
function DataFrameSpan( x...; kwargs...)
    df = DataFrame(x...; kwargs...)
    df[!, :spanrep] .= 0 # tag column marking the span representation
    df
end

# representing linear relations via constraints
function DataFrameCons( x...; kwargs...)
    df = DataFrame(x...; kwargs...)
    df[!, :consrep] .= 0 # tag column marking the constraint representation
    df
end

Because each individual vector in the span obeys the constraint equations, we can numerically convert between them by calling the Julia function nullspace. There is a pleasing symmetry between the two conversion directions.

# nullconvert turns span rep to constraint rep or vice versa
function nullconvert(df)
    cols = names(df)
    A = nullspace(Matrix(df)) # columns of A are orthogonal to every row of the DataFrame

    # each column of A (one basis vector of the other representation) becomes a row
    # of the new DataFrame, keeping the original column names
    DataFrame(mapslices(x->[x], A, dims=2)[:] , cols )
end

#https://discourse.julialang.org/t/converting-a-matrix-into-an-array-of-arrays/17038/2


# We tag somewhat inelegantly whether we're in the span rep or the constraint rep via a column tag
isspanrep(x) = "spanrep" ∈ names(x)
isconsrep(x) = "consrep" ∈ names(x)


function spanrep(x) 
    if isspanrep(x)
        return x
    else
        df = select(x, Not(:consrep))
        df = nullconvert(df)
        df[!, :spanrep] .= 0 # add tag
        return df
    end
end

function consrep(x) 
    if isconsrep(x)
        return x
    else
        df = select(x, Not(:spanrep)) #remove tag
        df = nullconvert(df)
        df[!, :consrep] .= 0 # add tag
        return df
    end
end

The ability to name columns and the analog of the join operation from databases give us a way to compositionally construct systems of linear equations. We also define a projection operation which in some sense solves these equations.

function linearjoin(df,dg)  #should we add makefresh flag?
        df = consrep(df)
        dg = consrep(dg)
        coalesce.(vcat(df,dg, cols=:union), 0)
end


function linearproject(df, cols)
        df = spanrep(df)
        df = df[:, [cols ; :spanrep]] # keep only the requested columns (plus the tag)
end

And shockingly, that’s about it. A very elegant implementation compared to my previous efforts.

Here, as an example, we join two resistors in series. We then project out the inner voltage, which returns a relation describing the series combination as a single effective resistor.

resistor(R, i , v1, v2) = DataFrameCons( [i => [R], v1 => [1], v2 => [-1] ] )
vsource(V, v1, v2) =  DataFrameCons( [:one => [V], v1 => [1], v2 => [-1] ] )
isource(I, i) =  DataFrameCons( [:one => [I], i => [-1] ] )

r1 = resistor(10, :i1, :v1, :v2)
r2 = resistor(10, :i1, :v2, :v3)


sys = consrep(linearproject(linearjoin(r1,r2), [:i1, :v1, :v3]) ) 
1×4 DataFrame
│ Row │ i1       │ v1        │ v3         │ consrep │
│     │ Float64  │ Float64   │ Float64    │ Int64   │
├─────┼──────────┼───────────┼────────────┼─────────┤
│ 1   │ 0.997509 │ 0.0498755 │ -0.0498755 │ 0       │

This solution is proportional to [20, 1, -1], hence representing the linear relation of a 20 ohm resistor.
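As a quick sanity check (a sketch that assumes the sys DataFrame computed above), normalizing the constraint row by its v1 entry recovers that proportionality:

row = collect(sys[1, [:i1, :v1, :v3]]) # the single constraint row as a plain vector
row ./ row[2]                          # ≈ [20.0, 1.0, -1.0]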

For more examples of usage of linear relations, check out these blog posts.

That one could use pandas or DataFrames.jl in this way is something I realized about a year ago when I was writing these posts, but I didn’t realize how pleasantly it would all fall out when I actually sat down to do it.

Bits and Bobbles

  • One often wants not only linear subspaces, but affine subspaces. The convenient and standard trick to convert affine subspaces to linear subspaces is the method of homogeneous coordinates. We can extend our linear space with an extra dimension called :one. Whenever a scalar occurs in an equation, it now goes in front of this new variable. In the span representation, it is convenient to normalize this entry to 1 by scaling each span vector. Then the span representation of the affine subspace is \sum \lambda_i v_i with the constraint that \sum \lambda_i = 1.
  • Linear relations are not the only sensible relations. Other classes of relations that have computational teeth include
  1. Polyhedral relations https://github.com/JuliaPolyhedra/Polyhedra.jl
  2. Polynomial Relations (aka algebraic or semialgebraic relations) https://www.juliahomotopycontinuation.org/
  3. Convex relations https://juliareach.github.io/LazySets.jl/dev/

It seems reasonable that each of these could also use the DataFrames interface as a way of modelling themselves.

  • Linear relations elegantly express operations I would ordinarily phrase in terms of Schur complements, one of my favorite pieces of linear algebra. Projecting a linear relation is finding the effective or reduced system of equations.

In this manner, linear relations are related to Gaussian integrals, which are the canonical example of “integrating out” dynamics in physics. Multidimensional Gaussian integrals are ultimately basically linear algebra engines, giving a way to formulate the Schur complement operation. I think linear relations would form a good core for an “exact” Gaussian integral library.

Quadratic Optimization with linear equality constraints is reducible to a linear relation. I used this for example in my LQR control example above. This is also related to Gaussians, since the method of steepest descent approximates an integration problem by an optimization problem. Gaussian integrals in the presence of linear delta functions \delta(Ax-b) (which are important) convert under the steepest descent method to a quadratic optimization problem with linear equality constraints. All of this is completely ubiquitous in the connection between classical and quantum mechanics, and between statistical mechanics and thermodynamics. The Feynman integration over all paths becomes the principle of least action. The partition function integrals become free energy minimizations.

  • In the past, I’ve modeled relations in a Categorical style. I still think this is an excellent internal representation, but perhaps not the best user-facing interface.

Query optimization is a big topic in database theory. The ordering of the joins and projections and the chosen points at which to change representations or indices can greatly change the time of the query. It is intriguing that such disparate seeming calculations may benefit from a unified relational query optimizer. I suggest that a categorical IR and the tools of Catlab https://github.com/AlgebraicJulia/Catlab.jl might be a way to do this.

  • Using sparse matrices would be a boon to linear relations (they tend to be very sparse), but we may have to roll our own DataFrame implementation to do so.
  • Should we allow columns to have width? Then variables could represent blocks of the matrix rather than just a single component. It feels like this has nice modelling power, but it was also not convenient to do so with the stock DataFrame implementation.