Author Archives: Uwe

Why am I using Julia?

By: Uwe

Re-posted from: https://ufechner7.github.io/2022/08/13/why-julia.html

Introduction

I started programming in 1976 using the KIM 1 and the 6502 machine language. Since then I learned and used many languages, Assembler, Pascal, Delphi, Basic, Foxpro, Java, C, C++, Bash, Matlab, Python and more…

KIM 1
KIM 1

Why do I think that Julia is superior to all of them in many (not all) applications?

What is so great about Julia?

Reason no. one: speed

When you need to process large amounts of data speed is important. Julia is a just-in-time compiled language using the mature LLVM compiler infrastructure, so you get fast machine code for different processors. In contrast to Python also multi threading is well supported. I wrote a simulation package in Python+Numba, and re-wrote it in Julia.

Performance advantage of Julia vs Python+Numba
Performance advantage of Julia vs Python+Numba

Depending on the test case Julia was 7 to 70 times faster than Python+Numba. And writing the Julia code was much easier because mixing Python with Numba is a real pain because they look similar, but behave very differently.

If you only use Python with algorithms for which C++ or Fortran libraries are available, then you will not see much of an advantage from using Julia. But if you write new programs or libraries using new algorithms Julia is much faster (if you follow the Performance Tips).

Reason no. two: simplicity and elegance

You can operate with vectors as you would do it in mathematical notation, you can use × for the cross product and for the dot product. And functions can return tuples (multiple values), which is very useful.

# pos:        vector of tether particle positions
# v_apparent: apparent wind speed vector
function kite_ref_frame(pos, v_apparent)
    c = pos[end-1] - pos[end]
    z = normalize(c)
    y = normalize(v_apparent × c)
    x = normalize(y × c)
    x, y, z
end

And you can use greek letters in the code:

using Distributions, Random

T = [84.5, 83.6, 83.6] # temperature measurements
μ = mean(T)            # average cpu temperature
σ = 2.5                # 95.45 % of the measurements are within ± 5 deg
d = Normal(μ, σ)
good = cdf(d, 88.0)    # this percentage of the cpus will not start to throttle

And you can write arrays cleanly:

julia> A = [1 2π
            3 4π]
2×2 Matrix{Float64}:
 1.0   6.28319
 3.0  12.5664

And any scalar function incl. user defined functions can be applied on a vector using the dot syntax:

julia> sin.([1,2,3])
3-element Vector{Float64}:
 0.8414709848078965
 0.9092974268256817
 0.1411200080598672

Reason no. three: Great REPL (Read, Evaluate and Print Loop)

Normally, compiled languages don’t have any REPL, but if you ever used it you do not want to miss it again:

Julia REPL
Julia REPL

You can do basic calculations, but also try out any function you want to use. You get help by typing ? and press <Enter>, or help for a specific function by typing for example:

?sin

You can run your scripts using the include function, for example:

include("hello_world.jl")

It has four modes of operation, the julia mode for executing julia code, the help modus, the shell modus (which you reach by pressing ;) and the package manager mode explained in the next section. I always develop code in the REPL and when a line of code (or a function) works I copy it into the script. If you wounder how to get the character π: just type \pi and then the TAB key.

Reason no. four: Built-in package manager

In Python it can be very difficult to install packages. There is pip that might work, but not good for binary dependencies. There is easy_install, there is conda… Many package managers, but all of them only work sometimes, for some packages on some operating systems… On Julia is only one package manager and it works fine with all packages on all supported operating systems (Windows, Mac, Linux…) on ARM and X86.

You can reach the prompt of the package manager by typing ]. It looks like this:

(@v1.8) pkg> st
Status `~/.julia/environments/v1.8/Project.toml`
  [e1d33f9a] DLMReader v0.4.5
  [31c24e10] Distributions v0.25.66
  [5c01b14b] InMemoryDatasets v0.7.7
 [14b8a8f1] PkgTemplates v0.7.26
 [91a5bcdd] Plots v1.29.0
  [0c614874] TerminalPager v0.3.1
Info Packages marked with  have new versions available

The prompt tells you which environment is active, and you can add, update and delete packages, but also display statistics or download the source code of a package for development. Each Julia package has a clear definition of the versions of the dependencies it works with in a file called Project.toml.

Therefore it is easy to write code that will never brake in the future. From my point of view thats a killer feature. If required, the package manager also installs pre-compiled C++ or Fortran libraries which works very well.

If you are looking for a specific package, JuliaHub will help you to find it.

Reason no. five: the composability of packages

Julia packages that are written independently often (not always) work together surprisingly well. Example: We have designed a control system, but now we want to do a monte-carlo analysis to find out how robust it is in the presence of parameter uncertainties.

using ControlSystems, MonteCarloMeasurements, Plots

ω = 1 ± 0.1  # create an uncertain Gaussian parameter
ζ = 0.3..0.4 # create an uncertain uniform parameter

G = tf(ω^2, [1, 2ζ*ω, ω^2]) # systems accept uncertain parameters

Output:

TransferFunction{Continuous, ControlSystems.SisoRational{Particles{Float64, 2000}}}
            1.01 ± 0.2
----------------------------------
1.0s^2 + 0.7 ± 0.092s + 1.01 ± 0.2

Continuous-time transfer function model

The function tf that was originally written for real parameters only, when supplied with uncertain parameters it calculates and prints the correct result. Magic! 😄

Now lets plot the result for the frequency range 0.01 to 100 rad/s on a logarithmic scale:

w = exp10.(-2:0.02:2)
bodeplot(G, w)

Bodeplot

Plotting of diagrams with uncertainty works, even though the packages Plot.jl and MonteCarloMeasurements.jl were developed independently by different authors. You can also combine calculations with physical units and get plots that show the correct physical units. Just magic! 😄

This is so much better than Python or C++ where every library comes with its own implementation of types like vectors that are not compatible with each other. In Julia you write a generic library, and then it can work with StaticArrays.jl (stack allocated) or normal arrays (heap allocated) or sparse arrays or whatever and it just works.

Reason no. six: an awesome community

If you ask a question on discourse you usually get an answer within 15 minutes. And half of the people that reply are scientists, they know what they are talking about. So you get very qualified answers quickly, often some extra infos that might help you to improve your approach.

State-of-the-art solvers for differential equations

If you are creating models of physical, chemical or biological systems, you will often need to solve differential equations. The package DifferentialEquations.jl provides the world best set of solvers for these kind of problems.

State-of-the-art optimizers

If you want to optimize a system, for example minimize the costs or the time or other properties of a system you need a mathematical optimizer. JuMP provides a domain-specific modeling language for mathematical optimization embedded in Julia. This is easy to use, fast and powerful. As example here a solution of the Rosenbrock function:

using JuMP
import Ipopt

model = Model(Ipopt.Optimizer)

@variable(model, x)
@variable(model, y)
@NLobjective(model, Min, (1 - x)^2 + 100 * (y - x^2)^2)

optimize!(model)

println("x: ", value(x))
println("y: ", value(y))

The Scientific Machine Learning ecosystem

The SciML ecosystem provides a unified interface to libraries for scientific computing and machine learning. So far I mainly used NonlinearSolve and Optimization.jl for root-finding and for optimization problems where JuMP is not the best choice. The unified interfaces make it easy to try different solvers to find the best one for your problem.

If you are modelling real world problems like the Corona disease then combining differential equations and machine learning can give superior results.

Limitations

Limitations of language/ compiler and tooling

Compilation and load times

Loading packages takes some time, and also the just-in-time compilation kicks in when you make you first function call. This is known as the time-to-first-plot problem (TTFP). Simple example:

using Plots
p = plot(rand(2,2))
display(p)

If we now save this code in the file ttfp.jl run the program with the command:

@time include("ttfp.jl")

we will see that it takes about 9 seconds on a computer with an i7-7700K CPU. This is quite a lot of time for
such a small program. There a different approaches to mitigate this issue. What I usually do is to create a system image with all packages I use for a curtain project and do an ahead-of-time compilation of these packages.
I will explain in a separate blog post how to do that. If you are curious already, have a look at PackageCompiler.jl.

There is ongoing effort to use binary caches of the compiled code so that you need to recompile only when the source code has changed, but it will take some time until this will work without any user interaction.

For use on embedded systems

Currently, with Julia 1.8 the first issue is the amount of RAM that is needed: Using Julia
with less than 4GB RAM is not very enjoyable. A cross compiler is missing. For these
reasons writing code for smaller embedded systems is not feasible. The smallest system
I would use with Julia is a Raspberry Pi 4 and a 64 bit OS.

I know that the Julia developers work hard to mitigate these issues, so in a few years
Julia will hopefully also be usable for smaller embedded systems.

Working with modules

While working with modules is supported by the language, using modules that are NOT packages is not well supported by the tooling. The VSCode Julia plugin will NOT find symbols of modules in the LOAD_PATH, see this bug report. Workarounds exist, but I find them ugly. As a consequence I use only one module per app/ package and split the program using includes. This is not so good because a file is no longer a stand-alone unit of code which you can read on its own and requires some adaptation of your habits if you are an experienced C++ or Python programmer.

Limitations of the ecosystem

  • The first limitation I see is that there is no easy to use, well integrated GUI library. Yes, you can use QML.jl or GTK.jl and many others, but they all have their issues, one of the issues is the lack of Julia specific documentation.

  • My preferred solver for stiff differential algebraic equations is the Radau5DAE solver, which is available in Fortran and Python, but not yet wrapped or re-written in Julia. There is an open issue for this.

Luckily you can use most Python and R libraries from within Julia, so the lack of a specific library is usually no show-stopper. Have a look at PythonCall.jl and RCall.jl. Also Fortran and C code can be called directly from Julia, for C++ you need an interface library like CxxWrap.

Conclusions

If you want to analyze data, model physical systems, design control systems or create a web portal with dynamic data Julia is a good choice for you. Also if you do number crunching on graphic cards or clusters. So far it is not a good choice for mobile apps or small embedded systems. It can also be used for machine learning, but I am not qualified to say how well that works (I guess it depends…)

So give Julia a try if you are now using Python, Matlab, Fortran or C++ for one of the mentioned tasks. Your life will become happier…

Acknowledgements

Thanks to Fredrik Bagge Carlson for providing the example for composability .

Filtering and browsing datasets

By: Uwe

Re-posted from: https://ufechner7.github.io/2022/08/12/filtering-and-browsing-datasets.html

Introduction

The basic work flow of data analysis is filtering a large data set to extract the interesting aspects and to browse and/or plot the result. In this post I will give a basic example for filtering and browsing of data, plotting deserves a blog post on its own.

Creating a test project

For trying out a new package and/or example it is always good to create a new project first.
When you followed my first post you only have to add the package TerminalPager.

For example using the following commands:

mkdir can
cd can
julia --project="."

And then install the required packages:

julia> using pkg
julia> pkg"add InMemoryDatasets"
julia> pkg"add TerminalPager"

Now quit julia with and restart it with:

julia --project -t auto

This uses the set of packages we just installed and starts julia using all available threads. This is useful when handling large data sets (millions of messages).

Creating a sample data set

To get a realistic impression of the performance we now create a sample data set with 100k entries. We use again a CAN bus log file as example, this time with m=8 data columns, one byte per column. For the address column, which uses two bytes we use a range of 46 different, randomly choosen values, starting with 0x101.

using InMemoryDatasets, DLMReader, Printf

function demo_data()
    n = 100000
    m = 8
    time = 0.1:0.1:n*0.1
    addr = rand(Int16(0x101):Int16(0x12e), n)
    data = (Symbol("d", i) => rand(UInt8, n) for i in 1:m)
    ds = Dataset(;time, addr, data...)
    ds.addr[5] = missing
    ds
end
ds = demo_data()

If we now print the data set in the julia console, only the first and the last values are shown:

julia> ds = demo_data()
100000×10 Dataset
    Row │ time      addr      d1        d2        d3        d4        d5        d6        d7        d8       
        │ identity  identity  identity  identity  identity  identity  identity  identity  identity  identity 
        │ Float64?  Int16?    UInt8?    UInt8?    UInt8?    UInt8?    UInt8?    UInt8?    UInt8?    UInt8?   
────────┼────────────────────────────────────────────────────────────────────────────────────────────────────
      1 │      0.1       264       127         5       247       185       212       216       171       147
      2 │      0.2       278       127       239       183        48       127        98        27        44
      3 │      0.3       301         6        16       159       201       225        95       196        51
   ⋮    │    ⋮         ⋮         ⋮         ⋮         ⋮         ⋮         ⋮         ⋮         ⋮         ⋮
  99999 │   9999.9       294        95       185        15       205       178        64        13        31
 100000 │  10000.0       285        86       224        79       214       248        39        34       147
                                                                                           99995 rows omitted

Formating the data

For CAN bus messages usually the hexadecimal data format is used. To see the numbers in hex format we need to define two functions, one that is formating the data bytes with two hex digits, called hex2(n) and one that formats the address column with four hex digits, called hex4(n). To save space we want to display missing values with the string “–”.

function hex2(n)
    if ismissing(n) return "--" end
    string(n, base=16, pad=2)
end

function hex4(n)
    if ismissing(n) return "--" end
    string(n, base=16, pad=4)
end

The setformat! function can be used to set the format of a column of a dataset. The exclamation mark indicates that this function modifies the first argument. The function Symbol can be used to create symbols, composed of a letter and a number to be used in a loop.

setformat!(ds, :addr => hex4)
for i in 1:8
    setformat!(ds, Symbol("d", i) => hex2)
end

Filtering the data on one column

Let us assume that the devices with the address 0x103, 0x106 and 0x109 are output devices, and we want to filter all the messages sent to the output devices. For complex filter conditions a function that returns true or false is needed. Missing values must be handled correctly.

# CAN bus devices with a given set of addresses are output devices
function isoutput(addr)
    if ismissing(addr) return false end
    addr in [0x103, 0x106, 0x109]
end

Finally we can apply the filter function, which requires three parameters, the input dataset, the column for applying the filter function and – as named parameter – the filter function.

# create the dataset msg_out with all messages sent to output devices
msg_out = filter(ds, :addr, by=isoutput)

Filtering the data on two columns

Lets assume that in addition to apply a filter on the column addr we are only interested in the data from t=1000s to t=3000s. Then we need a second filter function:

function ininterval(time)
    if ismissing(time) return false end
    time >= 1000.0 && time <= 3000.0
end

and pass arrays of the column names and of the column filter functions to the dataset filter function.

msg_out_ininterval = filter(ds, [:time, :addr], by=[ininterval, isoutput])

Browsing through the output data

The filtered data set msg_out is still too big to print it on one screen, so we need a way to browse through the result set. I am using the following function to do that:

function browse(ds)
    io = IOContext(IOBuffer(), :color => true);
    show(io, ds, show_row_number=false, eltypes=false)
    pager(String(take!(io.io)), frozen_rows=3)
end

The option :color => true allows the use of formatting, e.g. the header will be in bold.
The options show_row_number=false and eltypes=false suppress the row numbers and the row with the column types.

You can now browse through the output messages with the command:

browse(msg_out)

It will look like this (well, with the header in bold):

6475×10 Dataset
 time    addr  d1  d2  d3  d4  d5  d6  d7  d8 
──────────────────────────────────────────────
    8.2  0109  17  d6  13  e2  60  14  a4  8f
    9.9  0103  45  14  61  a3  6f  18  e7  35
   12.1  0109  7b  16  43  65  e9  f2  87  99
   12.8  0106  87  c0  46  f1  4e  49  ad  74
   13.8  0106  bc  88  f1  0a  a9  c4  99  ef
   14.1  0106  63  bb  83  96  8f  56  9a  99
   14.8  0106  f7  94  2b  01  b3  45  12  3a
   15.3  0109  3e  74  04  16  15  93  5f  0c
   15.6  0103  7b  e6  53  c7  0d  89  6b  f3
   15.7  0109  8e  db  d2  30  c1  b0  32  fa
   16.0  0106  5e  52  da  97  74  81  e6  b8
:  

You can scroll through the dataset with the cursor keys and the page up and page down keys. You can also search for a string with the slash key, then typing the search term and then <ENTER>. To finish browsing press q.

As you can see, all the data is nicely formatted as hexadecimal numbers.

Further reading

To learn more about the filtering options, have a look at Filter observations.

Acknowledgements

Thanks to James D Foster for providing the browse() function and the developers of InMemoryDataframes for their nice package.

Exporting formatted datasets

By: Uwe

Re-posted from: https://ufechner7.github.io/2022/08/07/exporting-formatted-datasets.html

Introduction

For analysing CAN bus log files I am exporting the data sets with CAN messages to .csv and then import them in LibreOffice spread sheets. Finally I create Excel files that I can give to my collegues.

The CAN bus is used in cars, wind turbines, electric chargers, UAVs and many other industrial devices for the communication between intelligent sensors, actuators, controllers and user interface devices.

One log file can easily contain millions of messages, therefore Julia is a good choice for statistics, error analysis and graphical presentation of relevant aspects of the CAN bus traffic due to the simplicity, power and performance Julia provides.

But CAN messages are usually hex encoded. So how can we export a dataset with some hex
encoded columns?

Creating a test project

For trying out a new package and/or example it is always good to create a new project first.
For example using the following commands:

mkdir can
cd can
julia --project="."

And then install the required packages:

julia> using pkg
julia> pkg"add InMemoryDatasets"
julia> pkg"add DLMReader"
julia> pkg"add Printf"

Now quit julia with and restart it with:

julia --project -t auto

This uses the set of packages we just installed and starts julia using all available threads. This is useful when handling large data sets (millions of messages).

Creating a sample data set

using InMemoryDatasets, DLMReader, Printf

ds = Dataset(time=0.0, d1=10, d2=20)
time = 0.1
for i in 1:9
    global time
    ds2 = Dataset(time=time, d1=10+1, d2=20+i)
    append!(ds, ds2)
    time += 0.1
end
ds.d1[4] = missing

If you run this code the dataset should look like this:

julia> ds
10×3 Dataset
 Row  time      d1        d2       
      identity  identity  identity 
      Float64?  Int64?    Int64?   
─────┼──────────────────────────────
   1       0.0        10        20
   2       0.1        11        21
   3       0.2        12        22
   4       0.3        13        23
   5       0.4   missing        24
   6       0.5        15        25
   7       0.6        16        26
   8       0.7        17        27
   9       0.8        18        28
  10       0.9        19        29

Formatting the output

For formatting the columns d1 and d2 in hex we need the following lines of code:

function round6(value)
    @sprintf("%12.6f", value)
end

function hex(n)
    if ismissing(n) return "--" end
    string(n, base=16, pad=2)
end

setformat!(ds, :time => round6)
setformat!(ds, :d1 => hex)
setformat!(ds, :d2 => hex)

Because our dataset can contain missing values we need to handle the special case
that n is missing. The round6 function is not strictly required, but for easy readability
of the csv output I wanted to have a fixed number of digits for the time stamp.

If we now print the dataset in the REPL we get:

julia> show(ds, show_row_number=false, eltypes=false)
10×3 Dataset
 time          d1  d2 
──────────────────────
     0.000000  0a  14
     0.100000  0b  15
     0.200000  0c  16
     0.300000  0d  17
     0.400000  --  18
     0.500000  0f  19
     0.600000  10  1a
     0.700000  11  1b
     0.800000  12  1c
     0.900000  13  1d

Now all columns are nicely formatted. I am using here the keyword parameters show_row_number=false
andeltypes=false to suppress the output of the column types and the row numbers.

Saving this as .csv file is now easy:

filewriter("output.csv", ds, mapformats=true)

The trick is to use the named parameter mapformat=true, if you do that the formatting function
is applied on the .csv output. The resulting file looks like this:

shell> cat output.csv
time,d1,d2
    0.000000,0a,14
    0.100000,0b,15
    0.200000,0c,16
    0.300000,0d,17
    0.400000,--,18
    0.500000,0f,19
    0.600000,10,1a
    0.700000,11,1b
    0.800000,12,1c
    0.900000,13,1d

Importing this with LibreOffice

You can just double click on the file output.csv and a dialog box will open. Just make sure to select the column type text for colum d1 and d2.

Dialog

When you click on OK you have a well formatted table which you can save as .odf spreadsheet or in Excel format
for further analysis and distribution.