Filtering and browsing datasets

By: Uwe

Re-posted from: https://ufechner7.github.io/2022/08/12/filtering-and-browsing-datasets.html

Introduction

The basic work flow of data analysis is filtering a large data set to extract the interesting aspects and to browse and/or plot the result. In this post I will give a basic example for filtering and browsing of data, plotting deserves a blog post on its own.

Creating a test project

For trying out a new package and/or example it is always good to create a new project first.
When you followed my first post you only have to add the package TerminalPager.

For example using the following commands:

mkdir can
cd can
julia --project="."

And then install the required packages:

julia> using pkg
julia> pkg"add InMemoryDatasets"
julia> pkg"add TerminalPager"

Now quit julia with and restart it with:

julia --project -t auto

This uses the set of packages we just installed and starts julia using all available threads. This is useful when handling large data sets (millions of messages).

Creating a sample data set

To get a realistic impression of the performance we now create a sample data set with 100k entries. We use again a CAN bus log file as example, this time with m=8 data columns, one byte per column. For the address column, which uses two bytes we use a range of 46 different, randomly choosen values, starting with 0x101.

using InMemoryDatasets, DLMReader, Printf

function demo_data()
    n = 100000
    m = 8
    time = 0.1:0.1:n*0.1
    addr = rand(Int16(0x101):Int16(0x12e), n)
    data = (Symbol("d", i) => rand(UInt8, n) for i in 1:m)
    ds = Dataset(;time, addr, data...)
    ds.addr[5] = missing
    ds
end
ds = demo_data()

If we now print the data set in the julia console, only the first and the last values are shown:

julia> ds = demo_data()
100000×10 Dataset
    Row │ time      addr      d1        d2        d3        d4        d5        d6        d7        d8       
        │ identity  identity  identity  identity  identity  identity  identity  identity  identity  identity 
        │ Float64?  Int16?    UInt8?    UInt8?    UInt8?    UInt8?    UInt8?    UInt8?    UInt8?    UInt8?   
────────┼────────────────────────────────────────────────────────────────────────────────────────────────────
      1 │      0.1       264       127         5       247       185       212       216       171       147
      2 │      0.2       278       127       239       183        48       127        98        27        44
      3 │      0.3       301         6        16       159       201       225        95       196        51
   ⋮    │    ⋮         ⋮         ⋮         ⋮         ⋮         ⋮         ⋮         ⋮         ⋮         ⋮
  99999 │   9999.9       294        95       185        15       205       178        64        13        31
 100000 │  10000.0       285        86       224        79       214       248        39        34       147
                                                                                           99995 rows omitted

Formating the data

For CAN bus messages usually the hexadecimal data format is used. To see the numbers in hex format we need to define two functions, one that is formating the data bytes with two hex digits, called hex2(n) and one that formats the address column with four hex digits, called hex4(n). To save space we want to display missing values with the string “–”.

function hex2(n)
    if ismissing(n) return "--" end
    string(n, base=16, pad=2)
end

function hex4(n)
    if ismissing(n) return "--" end
    string(n, base=16, pad=4)
end

The setformat! function can be used to set the format of a column of a dataset. The exclamation mark indicates that this function modifies the first argument. The function Symbol can be used to create symbols, composed of a letter and a number to be used in a loop.

setformat!(ds, :addr => hex4)
for i in 1:8
    setformat!(ds, Symbol("d", i) => hex2)
end

Filtering the data on one column

Let us assume that the devices with the address 0x103, 0x106 and 0x109 are output devices, and we want to filter all the messages sent to the output devices. For complex filter conditions a function that returns true or false is needed. Missing values must be handled correctly.

# CAN bus devices with a given set of addresses are output devices
function isoutput(addr)
    if ismissing(addr) return false end
    addr in [0x103, 0x106, 0x109]
end

Finally we can apply the filter function, which requires three parameters, the input dataset, the column for applying the filter function and – as named parameter – the filter function.

# create the dataset msg_out with all messages sent to output devices
msg_out = filter(ds, :addr, by=isoutput)

Filtering the data on two columns

Lets assume that in addition to apply a filter on the column addr we are only interested in the data from t=1000s to t=3000s. Then we need a second filter function:

function ininterval(time)
    if ismissing(time) return false end
    time >= 1000.0 && time <= 3000.0
end

and pass arrays of the column names and of the column filter functions to the dataset filter function.

msg_out_ininterval = filter(ds, [:time, :addr], by=[ininterval, isoutput])

Browsing through the output data

The filtered data set msg_out is still too big to print it on one screen, so we need a way to browse through the result set. I am using the following function to do that:

function browse(ds)
    io = IOContext(IOBuffer(), :color => true);
    show(io, ds, show_row_number=false, eltypes=false)
    pager(String(take!(io.io)), frozen_rows=3)
end

The option :color => true allows the use of formatting, e.g. the header will be in bold.
The options show_row_number=false and eltypes=false suppress the row numbers and the row with the column types.

You can now browse through the output messages with the command:

browse(msg_out)

It will look like this (well, with the header in bold):

6475×10 Dataset
 time    addr  d1  d2  d3  d4  d5  d6  d7  d8 
──────────────────────────────────────────────
    8.2  0109  17  d6  13  e2  60  14  a4  8f
    9.9  0103  45  14  61  a3  6f  18  e7  35
   12.1  0109  7b  16  43  65  e9  f2  87  99
   12.8  0106  87  c0  46  f1  4e  49  ad  74
   13.8  0106  bc  88  f1  0a  a9  c4  99  ef
   14.1  0106  63  bb  83  96  8f  56  9a  99
   14.8  0106  f7  94  2b  01  b3  45  12  3a
   15.3  0109  3e  74  04  16  15  93  5f  0c
   15.6  0103  7b  e6  53  c7  0d  89  6b  f3
   15.7  0109  8e  db  d2  30  c1  b0  32  fa
   16.0  0106  5e  52  da  97  74  81  e6  b8
:  

You can scroll through the dataset with the cursor keys and the page up and page down keys. You can also search for a string with the slash key, then typing the search term and then <ENTER>. To finish browsing press q.

As you can see, all the data is nicely formatted as hexadecimal numbers.

Further reading

To learn more about the filtering options, have a look at Filter observations.

Acknowledgements

Thanks to James D Foster for providing the browse() function and the developers of InMemoryDataframes for their nice package.