Re-posted from: https://estadistika.github.io//data/analyses/wrangling/julia/programming/packages/2018/06/08/Julia-Introduction-to-Data-Wrangling.html
I’m a heavy user of Python’s pandas and R’s dplyr both at work and when I was taking my master’s degree. Hands down, both of these tools are very good at handling the data. So what about Julia? It’s a fairly new programming language that’s been around for almost 6 years already with a very active community. If you have no idea, I encourage you to visit Julialang.org. In summary, it’s a programming language that walks like a Python, but runs like a C.
For data wrangling, there are two packages that we can use, and these are DataFrames.jl and JuliaDB.jl. Let me reserve a separate post for DataFrames.jl, and instead focus on JuliaDB.jl and JuliaDBMeta.jl (an alternative for querying the data, like that of R’s dplyr) packages.
Package Installation
By default, the libraries I mentioned above are not built-in in Julia, and hence we need to install it:
Pkg.add("JuliaDB") | |
Pkg.add("JuliaDBMeta") |
Data: nycflights13
In order to compare Julia’s capability on data wrangling with that of R’s dplyr, we’ll reproduce the example in this site. It uses all 336,776 flights that departed from New York City in 2013. I have a copy of it on github, and the following will download and load the data:
using JuliaDB | |
using JuliaDBMeta | |
data_url = "https://raw.githubusercontent.com/estadistika/assets/master/data/nycflights13.csv"; | |
down_dir = joinpath(homedir(), "Downloads", "nycflights13.csv"); | |
download(data_url, down_dir); | |
# Load the csv file | |
nycflights = loadtable(down_dir); | |
nycflights | |
#> Table with 336776 rows, 20 columns: | |
#> Columns: | |
#> # colname type | |
#> ─────────────────────────────────────────────── | |
#> 1 Int64 | |
#> 2 year Int64 | |
#> 3 month Int64 | |
#> 4 day Int64 | |
#> 5 dep_time DataValues.DataValue{Int64} | |
#> 6 sched_dep_time Int64 | |
#> 7 dep_delay DataValues.DataValue{Int64} | |
#> 8 arr_time DataValues.DataValue{Int64} | |
#> 9 sched_arr_time Int64 | |
#> 10 arr_delay DataValues.DataValue{Int64} | |
#> 11 carrier String | |
#> 12 flight Int64 | |
#> 13 tailnum String | |
#> 14 origin String | |
#> 15 dest String | |
#> 16 air_time DataValues.DataValue{Int64} | |
#> 17 distance Int64 | |
#> 18 hour Int64 | |
#> 19 minute Int64 | |
#> 20 time_hour DateTime |
The rows of the data are not displayed as we execute
nycflights
in line 7, that’s because we have a lot of columns, and by default JuliaDB.jl will not print all these unless you have a big display (unfortunately, I’m using my 13 inch laptop screen, and that’s why). Hence, for the rest of the article, we’ll be using selected columns only:# Filter the columns | |
cols = (:year, :month, :day, :dep_time, :dep_delay, :arr_time, :arr_delay, :tailnum, :air_time, :dest, :distance); | |
flights = select(nycflights, cols); | |
# Access the dimension of the data | |
length(rows(flights)) #> 336776 | |
length(columns(flights)) #> 11 | |
# Accessing the Head of the Data | |
select(flights, cols)[1:6] | |
#> Table with 6 rows, 11 columns: | |
#> year month day dep_time dep_delay arr_time arr_delay tailnum air_time dest distance | |
#> ─────────────────────────────────────────────────────────────────────────────────────────────── | |
#> 2013 1 1 517 2 830 11 "N14228" 227 "IAH" 1400 | |
#> 2013 1 1 533 4 850 20 "N24211" 227 "IAH" 1416 | |
#> 2013 1 1 542 2 923 33 "N619AA" 160 "MIA" 1089 | |
#> 2013 1 1 544 -1 1004 -18 "N804JB" 183 "BQN" 1576 | |
#> 2013 1 1 554 -6 812 -25 "N668DN" 116 "ATL" 762 | |
#> 2013 1 1 554 -4 740 12 "N39463" 150 "ORD" 719 | |
# Accessing the Tail of the Data | |
select(flights, cols)[end - 5:end] | |
#> Table with 6 rows, 11 columns: | |
#> year month day dep_time dep_delay arr_time arr_delay tailnum air_time dest distance | |
#> ─────────────────────────────────────────────────────────────────────────────────────────────── | |
#> 2013 9 30 missing missing missing missing "N740EV" missing "BNA" 764 | |
#> 2013 9 30 missing missing missing missing "NA" missing "DCA" 213 | |
#> 2013 9 30 missing missing missing missing "NA" missing "SYR" 198 | |
#> 2013 9 30 missing missing missing missing "N535MQ" missing "BNA" 764 | |
#> 2013 9 30 missing missing missing missing "N511MQ" missing "CLE" 419 | |
#> 2013 9 30 missing missing missing missing "N839MQ" missing "RDU" 431 |
Filter Rows
Filtering is a row-wise operation and is done using the Base.filter
function with extended method for JuliaDB.IndexedTables
.
Therefore, to filter the data for month equal to 1 (January) and day equal to 1 (first day of the month), is done as follows:
# Using JuliaDB | |
filter((:month => x -> x .== 1, :day => x -> x .== 1), flights); | |
# Using JuliaDBMeta | |
@filter flights :month .== 1 && :day .== 1 | |
#> Table with 842 rows, 11 columns: | |
#> year month day dep_time dep_delay arr_time arr_delay tailnum air_time dest distance | |
#> ─────────────────────────────────────────────────────────────────────────────────────────────── | |
#> 2013 1 1 517 2 830 11 "N14228" 227 "IAH" 1400 | |
#> 2013 1 1 533 4 850 20 "N24211" 227 "IAH" 1416 | |
#> 2013 1 1 542 2 923 33 "N619AA" 160 "MIA" 1089 | |
#> 2013 1 1 544 -1 1004 -18 "N804JB" 183 "BQN" 1576 | |
#> ⋮ | |
#> 2013 1 1 missing missing missing missing "N3EHAA" missing "DFW" 1389 | |
#> 2013 1 1 missing missing missing missing "N3EVAA" missing "MIA" 1096 | |
#> 2013 1 1 missing missing missing missing "N618JB" missing "FLL" 1069 |
To see the output for line 2 using
Base.filter
, simply remove the semicolon and you’ll have the same output as that of line 5 (using JuliaDBMeta.@filter
).
Arrange Rows
To arrange the rows of the columns, use Base.sort
function:
sort(flights, (:year, :month, :day)) # for ascending order | |
#> Table with 336776 rows, 11 columns: | |
#> year month day dep_time dep_delay arr_time arr_delay tailnum air_time dest distance | |
#> ─────────────────────────────────────────────────────────────────────────────────────────────── | |
#> 2013 1 1 517 2 830 11 "N14228" 227 "IAH" 1400 | |
#> 2013 1 1 533 4 850 20 "N24211" 227 "IAH" 1416 | |
#> 2013 1 1 542 2 923 33 "N619AA" 160 "MIA" 1089 | |
#> 2013 1 1 544 -1 1004 -18 "N804JB" 183 "BQN" 1576 | |
#> ⋮ | |
#> 2013 12 31 missing missing missing missing "N844MQ" missing "RDU" 431 | |
#> 2013 12 31 missing missing missing missing "NA" missing "ORD" 719 | |
#> 2013 12 31 missing missing missing missing "NA" missing "LAX" 2475 | |
sort(flights, (:year, :month, :day), rev = true) # for descending order | |
#> Table with 336776 rows, 11 columns: | |
#> year month day dep_time dep_delay arr_time arr_delay tailnum air_time dest distance | |
#> ─────────────────────────────────────────────────────────────────────────────────────────────── | |
#> 2013 12 31 13 14 439 2 "N566JB" 189 "BQN" 1576 | |
#> 2013 12 31 18 19 449 5 "N713TW" 192 "SJU" 1598 | |
#> 2013 12 31 26 101 129 96 "N374JB" 50 "PWM" 273 | |
#> 2013 12 31 459 -1 655 4 "N557UW" 95 "CLT" 529 | |
#> ⋮ | |
#> 2013 1 1 missing missing missing missing "N3EHAA" missing "DFW" 1389 | |
#> 2013 1 1 missing missing missing missing "N3EVAA" missing "MIA" 1096 | |
#> 2013 1 1 missing missing missing missing "N618JB" missing "FLL" 1069 |
Select Columns
We’ve seen above how to select the columns, but we can also use ranges of columns for selection.
select(flights, (:year, :month, :day)) | |
#> Table with 336776 rows, 3 columns: | |
#> year month day | |
#> ──────────────── | |
#> 2013 1 1 | |
#> 2013 1 1 | |
#> 2013 1 1 | |
#> 2013 1 1 | |
#> ⋮ | |
#> 2013 9 30 | |
#> 2013 9 30 | |
# select columns between year and day (inclusive) | |
select(flights, Between(:year, :day)) | |
#> Table with 336776 rows, 3 columns: | |
#> year month day | |
#> ──────────────── | |
#> 2013 1 1 | |
#> 2013 1 1 | |
#> 2013 1 1 | |
#> 2013 1 1 | |
#> ⋮ | |
#> 2013 9 30 | |
#> 2013 9 30 | |
# select all columns except those from year to day (inclusive) | |
select(flights, Not(Between(:year, :day))) | |
#> Table with 336776 rows, 8 columns: | |
#> dep_time dep_delay arr_time arr_delay tailnum air_time dest distance | |
#> ───────────────────────────────────────────────────────────────────────────── | |
#> 517 2 830 11 "N14228" 227 "IAH" 1400 | |
#> 533 4 850 20 "N24211" 227 "IAH" 1416 | |
#> 542 2 923 33 "N619AA" 160 "MIA" 1089 | |
#> 544 -1 1004 -18 "N804JB" 183 "BQN" 1576 | |
#> ⋮ | |
#> missing missing missing missing "N535MQ" missing "BNA" 764 | |
#> missing missing missing missing "N511MQ" missing "CLE" 419 | |
#> missing missing missing missing "N839MQ" missing "RDU" 431 |
Rename Column
To rename the column, use JuliaDB.renamecol
function:
renamecol(flights, :tailnum, :tail_num) | |
#> Table with 336776 rows, 11 columns: | |
#> year month day dep_time dep_delay arr_time arr_delay tail_num air_time dest distance | |
#> ─────────────────────────────────────────────────────────────────────────────────────────────── | |
#> 2013 1 1 517 2 830 11 "N14228" 227 "IAH" 1400 | |
#> 2013 1 1 533 4 850 20 "N24211" 227 "IAH" 1416 | |
#> 2013 1 1 542 2 923 33 "N619AA" 160 "MIA" 1089 | |
#> 2013 1 1 544 -1 1004 -18 "N804JB" 183 "BQN" 1576 | |
#> ⋮ | |
#> 2013 9 30 missing missing missing missing "N535MQ" missing "BNA" 764 | |
#> 2013 9 30 missing missing missing missing "N511MQ" missing "CLE" 419 | |
#> 2013 9 30 missing missing missing missing "N839MQ" missing "RDU" 431 |
Add New Column
To add a new column, use insertcol
, insertcolafter
and insertcolbefore
of the JuliaDB.jl.
gain = map(x -> x.arr_delay - x.dep_delay, flights, select = (:arr_delay, :dep_delay)); | |
speed = map(x -> x.distance / x.air_time * 60, flights, select = (:distance, :air_time)); | |
# insert the column gain | |
insertcolafter(flights, :distance, :gain, gain) | |
#> Table with 336776 rows, 12 columns: | |
#> year month day dep_time dep_delay arr_time arr_delay tailnum air_time dest distance gain | |
#> ──────────────────────────────────────────────────────────────────────────────────────────────────────── | |
#> 2013 1 1 517 2 830 11 "N14228" 227 "IAH" 1400 9 | |
#> 2013 1 1 533 4 850 20 "N24211" 227 "IAH" 1416 16 | |
#> 2013 1 1 542 2 923 33 "N619AA" 160 "MIA" 1089 31 | |
#> 2013 1 1 544 -1 1004 -18 "N804JB" 183 "BQN" 1576 -17 | |
#> ⋮ | |
#> 2013 9 30 missing missing missing missing "N535MQ" missing "BNA" 764 missing | |
#> 2013 9 30 missing missing missing missing "N511MQ" missing "CLE" 419 missing | |
#> 2013 9 30 missing missing missing missing "N839MQ" missing "RDU" 431 missing | |
# insert the column speed | |
insertcolafter(flights, :distance, :speed, speed) | |
#> Table with 336776 rows, 12 columns: | |
#> year month day dep_time dep_delay arr_time arr_delay tailnum air_time dest distance speed | |
#> ──────────────────────────────────────────────────────────────────────────────────────────────────────── | |
#> 2013 1 1 517 2 830 11 "N14228" 227 "IAH" 1400 370.044 | |
#> 2013 1 1 533 4 850 20 "N24211" 227 "IAH" 1416 374.273 | |
#> 2013 1 1 542 2 923 33 "N619AA" 160 "MIA" 1089 408.375 | |
#> 2013 1 1 544 -1 1004 -18 "N804JB" 183 "BQN" 1576 516.721 | |
#> ⋮ | |
#> 2013 9 30 missing missing missing missing "N535MQ" missing "BNA" 764 missing | |
#> 2013 9 30 missing missing missing missing "N511MQ" missing "CLE" 419 missing | |
#> 2013 9 30 missing missing missing missing "N839MQ" missing "RDU" 431 missing |
or use the
@transform
macro of the JuliaDBMeta.jl:@transform flights {gain = :arr_delay - :dep_delay, speed = :distance / :air_time * 60} | |
#> Table with 336776 rows, 13 columns: | |
#> year month day dep_time dep_delay arr_time arr_delay tailnum air_time dest distance gain speed | |
#> ───────────────────────────────────────────────────────────────────────────────────────────────────────────────── | |
#> 2013 1 1 517 2 830 11 "N14228" 227 "IAH" 1400 9 370.044 | |
#> 2013 1 1 533 4 850 20 "N24211" 227 "IAH" 1416 16 374.273 | |
#> 2013 1 1 542 2 923 33 "N619AA" 160 "MIA" 1089 31 408.375 | |
#> 2013 1 1 544 -1 1004 -18 "N804JB" 183 "BQN" 1576 -17 516.721 | |
#> ⋮ | |
#> 2013 9 30 missing missing missing missing "N535MQ" missing "BNA" 764 missing missing | |
#> 2013 9 30 missing missing missing missing "N511MQ" missing "CLE" 419 missing missing | |
#> 2013 9 30 missing missing missing missing "N839MQ" missing "RDU" 431 missing missing |
Summarize Data
The data can be summarized using the JuliaDB.summarize
function
using Statistics | |
summarize(mean, dropmissing(flights), select = :dep_delay) #> (mean = 12.555155706805643) | |
@with dropmissing(flights) mean(:dep_delay) #> 12.555155706805643 |
@with
macro is an alternative from JuliaDBMeta.jl.
Grouped Operations
For grouped operations, we can use the JuliaDB.groupby
function or the JuliaDBMeta.@groupby
:
# Using JuliaDB | |
delay = groupby( | |
( | |
count = length, | |
dist = :distance => x -> mean(skipmissing(x)), | |
delay = :arr_delay => x -> mean(skipmissing(x)) | |
), | |
flights, | |
:tailnum | |
); # remove ; to see result | |
# Using JuliaDBMeta | |
delay = @groupby flights :tailnum { | |
count = length(_), | |
dist = mean(skipmissing(:distance)), | |
delay = mean(skipmissing(:arr_delay)) | |
} | |
#> Table with 4044 rows, 4 columns: | |
#> tailnum count dist delay | |
#> ───────────────────────────────── | |
#> "D942DN" 4 854.5 31.5 | |
#> "N0EGMQ" 371 676.189 9.98295 | |
#> "N10156" 153 757.948 12.7172 | |
#> "N102UW" 48 535.875 2.9375 | |
#> ⋮ | |
#> "N9EAMQ" 248 674.665 9.23529 | |
#> "NA" 2512 710.258 NaN |
We’ll use the summarized data above and plot the flight delay in relation to the distance travelled. We’ll use the Gadfly.jl package for plotting and DataFrames.jl for converting JuliaDB.jl’s IndexedTable objects to DataFrames.DataFrame object, that’s because Gadfly.plot has no direct method for JuliaDB.IndexedTables.
Pkg.add("Gadfly") | |
Pkg.add("DataFrames") |
To plot, run the following:
using Gadfly | |
Gadfly.push_theme(:dark) | |
using DataFrames | |
delay = filter((:count => x -> x .> 20, :dist => x -> x .< 2000), delay); | |
plot( | |
DataFrame(filter(:delay => x -> !isnan(x), delay)), | |
layer( | |
x = :dist, | |
y = :delay, | |
Geom.smooth, | |
style(default_color = colorant"red", line_width = 2pt) | |
), | |
layer( | |
x = :dist, | |
y = :delay, | |
color = :count, | |
size = :count, | |
Geom.point, | |
style(default_color = colorant"orange", highlight_width = 0pt) | |
) | |
) |
To find the number of planes and the number of flights that go to each possible destination, run:
destinations = groupby( | |
( | |
planes = :tailnum => x -> length(unique(x)), | |
flights = length | |
), | |
flights, | |
:dest | |
); # remove ; to see result | |
# Using JuliaDBMeta | |
destinations = @groupby flights :dest { | |
planes = length(unique(:tailnum)), | |
flights = length(_) | |
} | |
#> Table with 105 rows, 3 columns: | |
#> dest planes flights | |
#> ────────────────────── | |
#> "ABQ" 108 254 | |
#> "ACK" 58 265 | |
#> "ALB" 172 439 | |
#> "ANC" 6 8 | |
#> ⋮ | |
#> "TYS" 273 631 | |
#> "XNA" 176 1036 |
Piping Multiple Operations
For multiple operations, it is convenient to use piping and that is the reason why we have tools like JuliaDBMeta.jl. The following example using R’s dplyr:
flights %>% | |
group_by(year, month, day) %>% | |
select(arr_delay, dep_delay) %>% | |
summarise( | |
arr = mean(arr_delay, na.rm = TRUE), | |
dep = mean(dep_delay, na.rm = TRUE) | |
) %>% | |
filter(arr > 30 | dep > 30) |
is equivalent to the following Julia code using JuliaDBMeta.jl:
@apply dropmissing(flights) begin | |
@groupby (:year, :month, :day) { | |
arr = mean(:arr_delay), | |
dep = mean(:dep_delay) | |
} | |
@filter :arr .> 30 || :dep .> 30 | |
end | |
#> Table with 49 rows, 5 columns: | |
#> year month day arr dep | |
#> ────────────────────────────────── | |
#> 2013 1 16 34.2474 24.4549 | |
#> 2013 1 31 32.6029 28.717 | |
#> 2013 2 11 36.2901 39.0755 | |
#> 2013 2 27 31.2525 37.6146 | |
#> ⋮ | |
#> 2013 12 17 55.8719 40.7713 | |
#> 2013 12 23 32.226 32.3927 |
Conclusion
I’ve demonstrated how easy it is to use Julia for doing data wrangling, and I love it. In fact, there is a library that can query any table-like data structure in Julia, and is called Query.jl (will definitely write a separate article for this in the future).
For more on JuliaDB.jl, watch the Youtube tutorial.