Julia: Data Wrangling using JuliaDB.jl and JuliaDBMeta.jl

By: Estadistika -- Julia

Re-posted from: https://estadistika.github.io//data/analyses/wrangling/julia/programming/packages/2018/06/08/Julia-Introduction-to-Data-Wrangling.html

I’m a heavy user of Python’s pandas and R’s dplyr both at work and when I was taking my master’s degree. Hands down, both of these tools are very good at handling the data. So what about Julia? It’s a fairly new programming language that’s been around for almost 6 years already with a very active community. If you have no idea, I encourage you to visit Julialang.org. In summary, it’s a programming language that walks like a Python, but runs like a C.

For data wrangling, there are two packages that we can use, and these are DataFrames.jl and JuliaDB.jl. Let me reserve a separate post for DataFrames.jl, and instead focus on JuliaDB.jl and JuliaDBMeta.jl (an alternative for querying the data, like that of R’s dplyr) packages.

Package Installation

By default, the libraries I mentioned above are not built-in in Julia, and hence we need to install it:

Pkg.add("JuliaDB")
Pkg.add("JuliaDBMeta")
view raw code-2.jl hosted with ❤ by GitHub

Data: nycflights13

In order to compare Julia’s capability on data wrangling with that of R’s dplyr, we’ll reproduce the example in this site. It uses all 336,776 flights that departed from New York City in 2013. I have a copy of it on github, and the following will download and load the data:

using JuliaDB
using JuliaDBMeta
data_url = "https://raw.githubusercontent.com/estadistika/assets/master/data/nycflights13.csv";
down_dir = joinpath(homedir(), "Downloads", "nycflights13.csv");
download(data_url, down_dir);
# Load the csv file
nycflights = loadtable(down_dir);
nycflights
#> Table with 336776 rows, 20 columns:
#> Columns:
#> # colname type
#> ───────────────────────────────────────────────
#> 1 Int64
#> 2 year Int64
#> 3 month Int64
#> 4 day Int64
#> 5 dep_time DataValues.DataValue{Int64}
#> 6 sched_dep_time Int64
#> 7 dep_delay DataValues.DataValue{Int64}
#> 8 arr_time DataValues.DataValue{Int64}
#> 9 sched_arr_time Int64
#> 10 arr_delay DataValues.DataValue{Int64}
#> 11 carrier String
#> 12 flight Int64
#> 13 tailnum String
#> 14 origin String
#> 15 dest String
#> 16 air_time DataValues.DataValue{Int64}
#> 17 distance Int64
#> 18 hour Int64
#> 19 minute Int64
#> 20 time_hour DateTime
view raw code-3.jl hosted with ❤ by GitHub

The rows of the data are not displayed as we execute nycflights in line 7, that’s because we have a lot of columns, and by default JuliaDB.jl will not print all these unless you have a big display (unfortunately, I’m using my 13 inch laptop screen, and that’s why). Hence, for the rest of the article, we’ll be using selected columns only:
# Filter the columns
cols = (:year, :month, :day, :dep_time, :dep_delay, :arr_time, :arr_delay, :tailnum, :air_time, :dest, :distance);
flights = select(nycflights, cols);
# Access the dimension of the data
length(rows(flights)) #> 336776
length(columns(flights)) #> 11
# Accessing the Head of the Data
select(flights, cols)[1:6]
#> Table with 6 rows, 11 columns:
#> year month day dep_time dep_delay arr_time arr_delay tailnum air_time dest distance
#> ───────────────────────────────────────────────────────────────────────────────────────────────
#> 2013 1 1 517 2 830 11 "N14228" 227 "IAH" 1400
#> 2013 1 1 533 4 850 20 "N24211" 227 "IAH" 1416
#> 2013 1 1 542 2 923 33 "N619AA" 160 "MIA" 1089
#> 2013 1 1 544 -1 1004 -18 "N804JB" 183 "BQN" 1576
#> 2013 1 1 554 -6 812 -25 "N668DN" 116 "ATL" 762
#> 2013 1 1 554 -4 740 12 "N39463" 150 "ORD" 719
# Accessing the Tail of the Data
select(flights, cols)[end - 5:end]
#> Table with 6 rows, 11 columns:
#> year month day dep_time dep_delay arr_time arr_delay tailnum air_time dest distance
#> ───────────────────────────────────────────────────────────────────────────────────────────────
#> 2013 9 30 missing missing missing missing "N740EV" missing "BNA" 764
#> 2013 9 30 missing missing missing missing "NA" missing "DCA" 213
#> 2013 9 30 missing missing missing missing "NA" missing "SYR" 198
#> 2013 9 30 missing missing missing missing "N535MQ" missing "BNA" 764
#> 2013 9 30 missing missing missing missing "N511MQ" missing "CLE" 419
#> 2013 9 30 missing missing missing missing "N839MQ" missing "RDU" 431
view raw code-4.jl hosted with ❤ by GitHub

Filter Rows

Filtering is a row-wise operation and is done using the Base.filter function with extended method for JuliaDB.IndexedTables.
Therefore, to filter the data for month equal to 1 (January) and day equal to 1 (first day of the month), is done as follows:

# Using JuliaDB
filter((:month => x -> x .== 1, :day => x -> x .== 1), flights);
# Using JuliaDBMeta
@filter flights :month .== 1 && :day .== 1
#> Table with 842 rows, 11 columns:
#> year month day dep_time dep_delay arr_time arr_delay tailnum air_time dest distance
#> ───────────────────────────────────────────────────────────────────────────────────────────────
#> 2013 1 1 517 2 830 11 "N14228" 227 "IAH" 1400
#> 2013 1 1 533 4 850 20 "N24211" 227 "IAH" 1416
#> 2013 1 1 542 2 923 33 "N619AA" 160 "MIA" 1089
#> 2013 1 1 544 -1 1004 -18 "N804JB" 183 "BQN" 1576
#> ⋮
#> 2013 1 1 missing missing missing missing "N3EHAA" missing "DFW" 1389
#> 2013 1 1 missing missing missing missing "N3EVAA" missing "MIA" 1096
#> 2013 1 1 missing missing missing missing "N618JB" missing "FLL" 1069
view raw code-5.jl hosted with ❤ by GitHub

To see the output for line 2 using Base.filter, simply remove the semicolon and you’ll have the same output as that of line 5 (using JuliaDBMeta.@filter).

Arrange Rows

To arrange the rows of the columns, use Base.sort function:

sort(flights, (:year, :month, :day)) # for ascending order
#> Table with 336776 rows, 11 columns:
#> year month day dep_time dep_delay arr_time arr_delay tailnum air_time dest distance
#> ───────────────────────────────────────────────────────────────────────────────────────────────
#> 2013 1 1 517 2 830 11 "N14228" 227 "IAH" 1400
#> 2013 1 1 533 4 850 20 "N24211" 227 "IAH" 1416
#> 2013 1 1 542 2 923 33 "N619AA" 160 "MIA" 1089
#> 2013 1 1 544 -1 1004 -18 "N804JB" 183 "BQN" 1576
#> ⋮
#> 2013 12 31 missing missing missing missing "N844MQ" missing "RDU" 431
#> 2013 12 31 missing missing missing missing "NA" missing "ORD" 719
#> 2013 12 31 missing missing missing missing "NA" missing "LAX" 2475
sort(flights, (:year, :month, :day), rev = true) # for descending order
#> Table with 336776 rows, 11 columns:
#> year month day dep_time dep_delay arr_time arr_delay tailnum air_time dest distance
#> ───────────────────────────────────────────────────────────────────────────────────────────────
#> 2013 12 31 13 14 439 2 "N566JB" 189 "BQN" 1576
#> 2013 12 31 18 19 449 5 "N713TW" 192 "SJU" 1598
#> 2013 12 31 26 101 129 96 "N374JB" 50 "PWM" 273
#> 2013 12 31 459 -1 655 4 "N557UW" 95 "CLT" 529
#> ⋮
#> 2013 1 1 missing missing missing missing "N3EHAA" missing "DFW" 1389
#> 2013 1 1 missing missing missing missing "N3EVAA" missing "MIA" 1096
#> 2013 1 1 missing missing missing missing "N618JB" missing "FLL" 1069
view raw code-6.jl hosted with ❤ by GitHub

Select Columns

We’ve seen above how to select the columns, but we can also use ranges of columns for selection.

select(flights, (:year, :month, :day))
#> Table with 336776 rows, 3 columns:
#> year month day
#> ────────────────
#> 2013 1 1
#> 2013 1 1
#> 2013 1 1
#> 2013 1 1
#> ⋮
#> 2013 9 30
#> 2013 9 30
# select columns between year and day (inclusive)
select(flights, Between(:year, :day))
#> Table with 336776 rows, 3 columns:
#> year month day
#> ────────────────
#> 2013 1 1
#> 2013 1 1
#> 2013 1 1
#> 2013 1 1
#> ⋮
#> 2013 9 30
#> 2013 9 30
# select all columns except those from year to day (inclusive)
select(flights, Not(Between(:year, :day)))
#> Table with 336776 rows, 8 columns:
#> dep_time dep_delay arr_time arr_delay tailnum air_time dest distance
#> ─────────────────────────────────────────────────────────────────────────────
#> 517 2 830 11 "N14228" 227 "IAH" 1400
#> 533 4 850 20 "N24211" 227 "IAH" 1416
#> 542 2 923 33 "N619AA" 160 "MIA" 1089
#> 544 -1 1004 -18 "N804JB" 183 "BQN" 1576
#> ⋮
#> missing missing missing missing "N535MQ" missing "BNA" 764
#> missing missing missing missing "N511MQ" missing "CLE" 419
#> missing missing missing missing "N839MQ" missing "RDU" 431
view raw code-7.jl hosted with ❤ by GitHub

Rename Column

To rename the column, use JuliaDB.renamecol function:

renamecol(flights, :tailnum, :tail_num)
#> Table with 336776 rows, 11 columns:
#> year month day dep_time dep_delay arr_time arr_delay tail_num air_time dest distance
#> ───────────────────────────────────────────────────────────────────────────────────────────────
#> 2013 1 1 517 2 830 11 "N14228" 227 "IAH" 1400
#> 2013 1 1 533 4 850 20 "N24211" 227 "IAH" 1416
#> 2013 1 1 542 2 923 33 "N619AA" 160 "MIA" 1089
#> 2013 1 1 544 -1 1004 -18 "N804JB" 183 "BQN" 1576
#> ⋮
#> 2013 9 30 missing missing missing missing "N535MQ" missing "BNA" 764
#> 2013 9 30 missing missing missing missing "N511MQ" missing "CLE" 419
#> 2013 9 30 missing missing missing missing "N839MQ" missing "RDU" 431
view raw code-8.jl hosted with ❤ by GitHub

Add New Column

To add a new column, use insertcol, insertcolafter and insertcolbefore of the JuliaDB.jl.

gain = map(x -> x.arr_delay - x.dep_delay, flights, select = (:arr_delay, :dep_delay));
speed = map(x -> x.distance / x.air_time * 60, flights, select = (:distance, :air_time));
# insert the column gain
insertcolafter(flights, :distance, :gain, gain)
#> Table with 336776 rows, 12 columns:
#> year month day dep_time dep_delay arr_time arr_delay tailnum air_time dest distance gain
#> ────────────────────────────────────────────────────────────────────────────────────────────────────────
#> 2013 1 1 517 2 830 11 "N14228" 227 "IAH" 1400 9
#> 2013 1 1 533 4 850 20 "N24211" 227 "IAH" 1416 16
#> 2013 1 1 542 2 923 33 "N619AA" 160 "MIA" 1089 31
#> 2013 1 1 544 -1 1004 -18 "N804JB" 183 "BQN" 1576 -17
#> ⋮
#> 2013 9 30 missing missing missing missing "N535MQ" missing "BNA" 764 missing
#> 2013 9 30 missing missing missing missing "N511MQ" missing "CLE" 419 missing
#> 2013 9 30 missing missing missing missing "N839MQ" missing "RDU" 431 missing
# insert the column speed
insertcolafter(flights, :distance, :speed, speed)
#> Table with 336776 rows, 12 columns:
#> year month day dep_time dep_delay arr_time arr_delay tailnum air_time dest distance speed
#> ────────────────────────────────────────────────────────────────────────────────────────────────────────
#> 2013 1 1 517 2 830 11 "N14228" 227 "IAH" 1400 370.044
#> 2013 1 1 533 4 850 20 "N24211" 227 "IAH" 1416 374.273
#> 2013 1 1 542 2 923 33 "N619AA" 160 "MIA" 1089 408.375
#> 2013 1 1 544 -1 1004 -18 "N804JB" 183 "BQN" 1576 516.721
#> ⋮
#> 2013 9 30 missing missing missing missing "N535MQ" missing "BNA" 764 missing
#> 2013 9 30 missing missing missing missing "N511MQ" missing "CLE" 419 missing
#> 2013 9 30 missing missing missing missing "N839MQ" missing "RDU" 431 missing
view raw code-9.jl hosted with ❤ by GitHub

or use the @transform macro of the JuliaDBMeta.jl:
@transform flights {gain = :arr_delay - :dep_delay, speed = :distance / :air_time * 60}
#> Table with 336776 rows, 13 columns:
#> year month day dep_time dep_delay arr_time arr_delay tailnum air_time dest distance gain speed
#> ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> 2013 1 1 517 2 830 11 "N14228" 227 "IAH" 1400 9 370.044
#> 2013 1 1 533 4 850 20 "N24211" 227 "IAH" 1416 16 374.273
#> 2013 1 1 542 2 923 33 "N619AA" 160 "MIA" 1089 31 408.375
#> 2013 1 1 544 -1 1004 -18 "N804JB" 183 "BQN" 1576 -17 516.721
#> ⋮
#> 2013 9 30 missing missing missing missing "N535MQ" missing "BNA" 764 missing missing
#> 2013 9 30 missing missing missing missing "N511MQ" missing "CLE" 419 missing missing
#> 2013 9 30 missing missing missing missing "N839MQ" missing "RDU" 431 missing missing

Summarize Data

The data can be summarized using the JuliaDB.summarize function

using Statistics
summarize(mean, dropmissing(flights), select = :dep_delay) #> (mean = 12.555155706805643)
@with dropmissing(flights) mean(:dep_delay) #> 12.555155706805643

@with macro is an alternative from JuliaDBMeta.jl.

Grouped Operations

For grouped operations, we can use the JuliaDB.groupby function or the JuliaDBMeta.@groupby:

# Using JuliaDB
delay = groupby(
(
count = length,
dist = :distance => x -> mean(skipmissing(x)),
delay = :arr_delay => x -> mean(skipmissing(x))
),
flights,
:tailnum
); # remove ; to see result
# Using JuliaDBMeta
delay = @groupby flights :tailnum {
count = length(_),
dist = mean(skipmissing(:distance)),
delay = mean(skipmissing(:arr_delay))
}
#> Table with 4044 rows, 4 columns:
#> tailnum count dist delay
#> ─────────────────────────────────
#> "D942DN" 4 854.5 31.5
#> "N0EGMQ" 371 676.189 9.98295
#> "N10156" 153 757.948 12.7172
#> "N102UW" 48 535.875 2.9375
#> ⋮
#> "N9EAMQ" 248 674.665 9.23529
#> "NA" 2512 710.258 NaN

We’ll use the summarized data above and plot the flight delay in relation to the distance travelled. We’ll use the Gadfly.jl package for plotting and DataFrames.jl for converting JuliaDB.jl’s IndexedTable objects to DataFrames.DataFrame object, that’s because Gadfly.plot has no direct method for JuliaDB.IndexedTables.
Pkg.add("Gadfly")
Pkg.add("DataFrames")

To plot, run the following:
using Gadfly
Gadfly.push_theme(:dark)
using DataFrames
delay = filter((:count => x -> x .> 20, :dist => x -> x .< 2000), delay);
plot(
DataFrame(filter(:delay => x -> !isnan(x), delay)),
layer(
x = :dist,
y = :delay,
Geom.smooth,
style(default_color = colorant"red", line_width = 2pt)
),
layer(
x = :dist,
y = :delay,
color = :count,
size = :count,
Geom.point,
style(default_color = colorant"orange", highlight_width = 0pt)
)
)


To find the number of planes and the number of flights that go to each possible destination, run:
destinations = groupby(
(
planes = :tailnum => x -> length(unique(x)),
flights = length
),
flights,
:dest
); # remove ; to see result
# Using JuliaDBMeta
destinations = @groupby flights :dest {
planes = length(unique(:tailnum)),
flights = length(_)
}
#> Table with 105 rows, 3 columns:
#> dest planes flights
#> ──────────────────────
#> "ABQ" 108 254
#> "ACK" 58 265
#> "ALB" 172 439
#> "ANC" 6 8
#> ⋮
#> "TYS" 273 631
#> "XNA" 176 1036

Piping Multiple Operations

For multiple operations, it is convenient to use piping and that is the reason why we have tools like JuliaDBMeta.jl. The following example using R’s dplyr:

flights %>%
group_by(year, month, day) %>%
select(arr_delay, dep_delay) %>%
summarise(
arr = mean(arr_delay, na.rm = TRUE),
dep = mean(dep_delay, na.rm = TRUE)
) %>%
filter(arr > 30 | dep > 30)
view raw code-14.R hosted with ❤ by GitHub

is equivalent to the following Julia code using JuliaDBMeta.jl:
@apply dropmissing(flights) begin
@groupby (:year, :month, :day) {
arr = mean(:arr_delay),
dep = mean(:dep_delay)
}
@filter :arr .> 30 || :dep .> 30
end
#> Table with 49 rows, 5 columns:
#> year month day arr dep
#> ──────────────────────────────────
#> 2013 1 16 34.2474 24.4549
#> 2013 1 31 32.6029 28.717
#> 2013 2 11 36.2901 39.0755
#> 2013 2 27 31.2525 37.6146
#> ⋮
#> 2013 12 17 55.8719 40.7713
#> 2013 12 23 32.226 32.3927

Conclusion

I’ve demonstrated how easy it is to use Julia for doing data wrangling, and I love it. In fact, there is a library that can query any table-like data structure in Julia, and is called Query.jl (will definitely write a separate article for this in the future).

For more on JuliaDB.jl, watch the Youtube tutorial.