Re-posted from: https://bkamins.github.io/julialang/2023/01/13/tmt.html
Introduction
Before I begin let me make a small announcement. In my last post I
have advertised an introductory Julia course for data science that will start
next week at MIT. If someone is interested in the material we are going to
cover you can find it here. There is also an option to attend the course
remotely. The content will be streamed via MS Teams. If you would like to join
please register yourself here.
Now, back to the usual business. This week I have registered the
TableMetadataTools.jl package. It aims to make common operations
on table metadata more convenient.
Recently I have written about how one can work with metadata in
this post. Today, I want to give another example of working with
table metadata and discuss the major functionalities that
TableMetadataTools.jl offers. The key difference is that this time
we will use JSON source.
The post was written under Julia 1.8.5, DataFrames.jl 1.4.4, JSON3.jl 1.12.0,
TableMetadataTools.jl 0.1.0, and Parquet2.jl 0.2.5.
The data
This time I chose the Electric Vehicle Population Data data to
analyze. This dataset shows the Battery Electric Vehicles (BEVs) and Plug-in
Hybrid Electric Vehicles (PHEVs) that are currently registered through
Washington State Department of Licensing (DOL).
We will want with the JSON file rows.json that can be downloaded from
that website, so before proceeding please download it to your working folder.
The analysis I present is done for the file that I downloaded on Jan 12, 2023.
In the post we will:
- Read the JSON file and load both data and metadata contained in it into a
data frame. - Perform some simple analysis of the data.
- Save the data as Parquet file (since it supports metadata).
(note – some of the output in this post will be wide, so in some displays
it might not look very nice; however, it looks OK if you try it in the terminal
or Jupyter Notebook)
Data ingest
First load the required packages:
julia> using DataFrames
julia> using JSON3
julia> using Parquet2
julia> using Statistics
julia> using TableMetadataTools
Next we read the JSON file to a dictionary-like structure:
julia> data_json = open(JSON3.read, "rows.json")
JSON3.Object{Base.CodeUnits{UInt8, String}, Vector{UInt64}} with 2 entries: :meta => {…
:data => JSON3.Array[Any["row-briu~kmbv-8jap", "00000000-0000-0000-6480-…
It is instructive to investigate the structure of this object, as it is complex:
julia> function traverse(obj; level=0)
if level > 4 || !(obj isa AbstractDict)
print(" ...")
return
end
for key in keys(obj)
print("\n", " " ^ level, key, ":")
traverse(obj[key], level=level+2)
end
end
traverse (generic function with 2 methods)
julia> traverse(data_json)
meta:
view:
id: ...
name: ...
assetType: ...
attribution: ...
averageRating: ...
category: ...
createdAt: ...
description: ...
displayType: ...
downloadCount: ...
hideFromCatalog: ...
hideFromDataJson: ...
newBackend: ...
numberOfComments: ...
oid: ...
provenance: ...
publicationAppendEnabled: ...
publicationDate: ...
publicationGroup: ...
publicationStage: ...
rowsUpdatedAt: ...
rowsUpdatedBy: ...
tableId: ...
totalTimesRated: ...
viewCount: ...
viewLastModified: ...
viewType: ...
approvals: ...
clientContext: ...
columns: ...
grants: ...
metadata: ...
owner: ...
query: ...
rights: ...
tableAuthor: ...
tags: ...
flags: ...
data: ...
As you can see we have descended only two levels in the hierarchy and already
there is a lot of information.
Actual data is stored in :data
field. The rest is metadata,
of which interesting for us will be :name
, :description
, and :columns
.
Before move forward, let us peek at data_json.meta.view.columns
vector:
julia> data_json.meta.view.columns[[1; end]]
2-element Vector{JSON3.Object}:
{
"id": -1,
"name": "sid",
"dataTypeName": "meta_data",
"fieldName": ":sid",
"position": 0,
"renderTypeName": "meta_data",
"format": {},
"flags": [
"hidden"
]
}
{
"id": 561974359,
"name": "WAOFM - GIS - Legislative District Boundary",
"dataTypeName": "number",
"description": "",
"fieldName": ":@computed_region_8ddd_yn5v",
"position": 21,
"renderTypeName": "number",
"tableColumnId": 87270518,
"computationStrategy": {
"source_columns": [
"geocoded_column"
],
"type": "georegion_match_on_point",
"parameters": {
"region": "_8ddd-yn5v",
"primary_key": "_feature_id"
}
},
"format": {}
}
(I have displayed only first and last element of the vector contained in this
metadata entry as the rest are similar). This vector stores column-level
metadata. The interesting fields are :name
, :fieldName
, and :description
(the last is not always present).
Finally, let us investigate data_json.data
object. Again, we pick only
the first and last element:
julia> data_json.data[[1; end]]
2-element Vector{JSON3.Array}:
Any["row-briu~kmbv-8jap", "00000000-0000-0000-6480-436D376BD73A", 0,
1670541195, nothing, 1670541238, nothing, "{ }", "JTMEB3FV6N",
"Monroe" … "42", "0", nothing, "198968248",
"POINT (-81.80023 24.5545)", nothing, "12087972100", "51", nothing, nothing]
Any["row-acyi~zx2n-aq3x", "00000000-0000-0000-BA64-2573B5CAA8F7", 0,
1670541195, nothing, 1670541258, nothing, "{ }", "YV4BR0CL8N",
"King" … "18", "0", "47", "194673692",
"POINT (-122.09124 47.33778)", "PUGET SOUND ENERGY INC||CITY OF TACOMA - (WA)",
"53033032005", "3009", "8", "46"]
We see that the data is stored in rows.
Loading data to a data frame
Having this knowledge let us load the data into a data frame.
First we extract the column-level metadata into a data frame for convenience:
julia> data_colmeta = DataFrame(name=getproperty.(cols, :fieldName),
label=getproperty.(cols, :name),
note=get.(cols, :description, nothing))
28×3 DataFrame
Row │ name label note
│ String String Union…
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────
1 │ :sid sid
2 │ :id id
3 │ :position position
4 │ :created_at created_at
5 │ :created_meta created_meta
6 │ :updated_at updated_at
7 │ :updated_meta updated_meta
8 │ :meta meta
9 │ vin_1_10 VIN (1-10) The 1st 10 characters of each ve…
10 │ county County The county in which the register…
11 │ city City The city in which the registered…
12 │ state State The state in which the registere…
13 │ zip_code Postal Code The 5 digit zip code in which th…
14 │ model_year Model Year The model year of the vehicle, d…
15 │ make Make The manufacturer of the vehicle,…
16 │ model Model The model of the vehicle, determ…
17 │ ev_type Electric Vehicle Type This distinguishes the vehicle a…
18 │ cafv_type Clean Alternative Fuel Vehicle (… This categorizes vehicle as Clea…
19 │ electric_range Electric Range Describes how far a vehicle can …
20 │ base_msrp Base MSRP This is the lowest Manufacturer'…
21 │ legislative_district Legislative District The specific section of Washingt…
22 │ dol_vehicle_id DOL Vehicle ID Unique number assigned to each v…
23 │ geocoded_column Vehicle Location The center of the ZIP Code for t…
24 │ electric_utility Electric Utility This is the electric power retai…
25 │ _2020_census_tract 2020 Census Tract The census tract identifier is a…
26 │ :@computed_region_x4ys_rtnd Counties
27 │ :@computed_region_fny7_vc3j Congressional Districts
28 │ :@computed_region_8ddd_yn5v WAOFM - GIS - Legislative Distri…
We now, in a structured way, see that the metadata contains: column name,
column descriptive label, and, optionally, note about column contents.
Assume that from all the available columns we want to keep only few most
interesting. The DATA_COLSKEEP
vector keeps column numbers we want to store
in a data frame.
julia> const DATA_COLSKEEP = [9, 12, 14, 15, 17, 18, 19];
julia> keepat!(data_colmeta, DATA_COLSKEEP)
7×3 DataFrame
Row │ name label note
│ String String Union…
─────┼──────────────────────────────────────────────────────────────────────────────────────
1 │ vin_1_10 VIN (1-10) The 1st 10 characters of each ve…
2 │ state State The state in which the registere…
3 │ model_year Model Year The model year of the vehicle, d…
4 │ make Make The manufacturer of the vehicle,…
5 │ ev_type Electric Vehicle Type This distinguishes the vehicle a…
6 │ cafv_type Clean Alternative Fuel Vehicle (… This categorizes vehicle as Clea…
7 │ electric_range Electric Range Describes how far a vehicle can …
Now we are ready to create a data frame with our data. First create its schema
and populate the metadata:
julia> data_df = DataFrame(data_colmeta.name .=> Ref(String[]))
0×7 DataFrame
Row │ vin_1_10 state model_year make ev_type cafv_type electric_range
│ String String String String String String String
─────┴──────────────────────────────────────────────────────────────────────────
julia> caption!(data_df, data_json.meta.view.name);
julia> note!(data_df, data_json.meta.view.description);
julia> for row in eachrow(data_colmeta)
label!(data_df, row.name, row.label)
note!(data_df, row.name, row.note)
end
Let us check if indeed metadata got properly stored. We can either extract
dictionaries:
julia> metadata(data_df)
Dict{String, String} with 2 entries:
"note" => "This dataset shows the Battery Electric Vehicles (BEVs) and Plug-in Hybrid Electric Vehicles (PHEVs) that are currently registered through Washington State Department of Licensing (DOL)."
"caption" => "Electric Vehicle Population Data"
julia> colmetadata(data_df)
Dict{Symbol, Dict{String, String}} with 7 entries:
:ev_type => Dict("label"=>"Electric Vehicle Type", "note"=>"This distinguishes the vehicle as all electric or a plug-in hybrid.")
:make => Dict("label"=>"Make", "note"=>"The manufacturer of the vehicle, determined by decoding the Vehicle Identification Number (VIN).")
:cafv_type => Dict("label"=>"Clean Alternative Fuel Vehicle (CAFV) Eligibility", "note"=>"This categorizes vehicle as Clean Alternative Fuel Vehicles (CAFVs) based on the fuel requirement and electric-o…
:state => Dict("label"=>"State", "note"=>"The state in which the registered owner resides.")
:vin_1_10 => Dict("label"=>"VIN (1-10)", "note"=>"The 1st 10 characters of each vehicle's Vehicle Identification Number (VIN).")
:electric_range => Dict("label"=>"Electric Range", "note"=>"Describes how far a vehicle can travel purely on its electric charge.")
:model_year => Dict("label"=>"Model Year", "note"=>"The model year of the vehicle, determined by decoding the Vehicle Identification Number (VIN).")
Or get a TOML representation of metadata (the first entry of TOML output means
that we did not store metadata style information):
julia> println(meta2toml(data_df, style=false))
style = false
[colmetadata.cafv_type]
label = "Clean Alternative Fuel Vehicle (CAFV) Eligibility"
note = "This categorizes vehicle as Clean Alternative Fuel Vehicles (CAFVs) based on the fuel requirement and electric-only range requirement in House Bill 2042 as passed in the 2019 legislative session."
[colmetadata.electric_range]
label = "Electric Range"
note = "Describes how far a vehicle can travel purely on its electric charge."
[colmetadata.ev_type]
label = "Electric Vehicle Type"
note = "This distinguishes the vehicle as all electric or a plug-in hybrid."
[colmetadata.make]
label = "Make"
note = "The manufacturer of the vehicle, determined by decoding the Vehicle Identification Number (VIN)."
[colmetadata.model_year]
label = "Model Year"
note = "The model year of the vehicle, determined by decoding the Vehicle Identification Number (VIN)."
[colmetadata.state]
label = "State"
note = "The state in which the registered owner resides."
[colmetadata.vin_1_10]
label = "VIN (1-10)"
note = "The 1st 10 characters of each vehicle's Vehicle Identification Number (VIN)."
[metadata]
caption = "Electric Vehicle Population Data"
note = "This dataset shows the Battery Electric Vehicles (BEVs) and Plug-in Hybrid Electric Vehicles (PHEVs) that are currently registered through Washington State Department of Licensing (DOL)."
All looks good so far.
Additionally you can easily extract all column labels using the labels
function:
julia> labels(data_df)
7-element Vector{String}:
"VIN (1-10)"
"State"
"Model Year"
"Make"
"Electric Vehicle Type"
"Clean Alternative Fuel Vehicle (CAFV) Eligibility"
"Electric Range"
Now we store the data in our data frame:
julia> foreach(row -> push!(data_df, row[DATA_COLSKEEP]), data_json.data)
julia> show(data_df, truncate=10)
112634×7 DataFrame
Row │ vin_1_10 state model_year make ev_type cafv_type electric_range
│ String String String String String String String
────────┼───────────────────────────────────────────────────────────────────────────────────────
1 │ JTMEB3FV6N FL 2022 TOYOTA Plug-in Hy… Clean Alte… 42
2 │ 1G1RD6E45D NV 2013 CHEVROLET Plug-in Hy… Clean Alte… 38
3 │ JN1AZ0CP8B WA 2011 NISSAN Battery El… Clean Alte… 73
4 │ 1G1FW6S08H WA 2017 CHEVROLET Battery El… Clean Alte… 238
5 │ 3FA6P0SU1K WA 2019 FORD Plug-in Hy… Not eligib… 26
6 │ 5YJ3E1EB5J WA 2018 TESLA Battery El… Clean Alte… 215
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
112630 │ 7SAYGDEF2N WA 2022 TESLA Battery El… Eligibilit… 0
112631 │ 1N4BZ1CP7K WA 2019 NISSAN Battery El… Clean Alte… 150
112632 │ 1FMCU0KZ4N WA 2022 FORD Plug-in Hy… Clean Alte… 38
112633 │ KNDCD3LD4J WA 2018 KIA Plug-in Hy… Not eligib… 26
112634 │ YV4BR0CL8N WA 2022 VOLVO Plug-in Hy… Not eligib… 18
112623 rows omitted
julia> describe(data_df, :min, :max, :nmissing)
7×4 DataFrame
Row │ variable min max nmissing
│ Symbol String String Int64
─────┼────────────────────────────────────────────────────────────────────────────────────────────────
1 │ vin_1_10 1C4JJXP60M YV4H60DZ9N 0
2 │ state AK WY 0
3 │ model_year 1997 2023 0
4 │ make AUDI VOLVO 0
5 │ ev_type Battery Electric Vehicle (BEV) Plug-in Hybrid Electric Vehicle … 0
6 │ cafv_type Clean Alternative Fuel Vehicle E… Not eligible due to low battery … 0
7 │ electric_range 0 97 0
We are almost done. It looks like :model_year
and :electric_range
are
integer, so we perform an appropriate conversion:
julia> transform!(data_df,
[:model_year, :electric_range] .=> ByRow(x -> parse(Int, x)),
renamecols=false);
Here one important comment has to be made. Since we keep the old column names
for the transformed columns their "label"
and "note"
metadata is kept.
The reason is that this metadata originally had :note
style. Let us check
that indeed metadata is kept:
julia> colmetadata(data_df, :model_year)
Dict{String, String} with 2 entries:
"label" => "Model Year"
"note" => "The model year of the vehicle, determined by decoding the Vehicle Identification Number (VIN)."
julia> colmetadata(data_df, :electric_range)
Dict{String, String} with 2 entries:
"label" => "Electric Range"
"note" => "Describes how far a vehicle can travel purely on its electric charge."
It is an important rule when working with :note
-style metadata to keep old
column name only if we do not change the meaning of the column (like in our
case, where we just parsed it).
Doing some analysis of the data
Let us check how many observations we have for combination of :ev_type
and :cafv_type
columns:
julia> data_gdf = groupby(data_df, [:ev_type, :cafv_type], sort=true)
julia> data_agg1 = combine(data_gdf, nrow, :electric_range .=> extrema)
5×4 DataFrame
Row │ ev_type cafv_type nrow electric_range_extrema
│ String String Int64 Tuple{Int64, Int64}
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ Battery Electric Vehicle (BEV) Clean Alternative Fuel Vehicle E… 46799 (39, 337)
2 │ Battery Electric Vehicle (BEV) Eligibility unknown as battery r… 39236 (0, 0)
3 │ Battery Electric Vehicle (BEV) Not eligible due to low battery … 9 (29, 29)
4 │ Plug-in Hybrid Electric Vehicle … Clean Alternative Fuel Vehicle E… 11840 (30, 153)
5 │ Plug-in Hybrid Electric Vehicle … Not eligible due to low battery … 14750 (6, 29)
The results seem interesting, but we most likely cannot immediately understand
them. Let us check the column metadata:
julia> colmetadata(data_agg1)
Dict{Symbol, Dict{String, String}} with 2 entries:
:ev_type => Dict("label"=>"Electric Vehicle Type", "note"=>"This distinguishes the vehicle as all electric or a plug-in hybrid.")
:cafv_type => Dict("label"=>"Clean Alternative Fuel Vehicle (CAFV) Eligibility", "note"=>"This categorizes vehicle as Clean Alternative Fuel Vehicles (CAFVs) based on the fuel requirement and electric-only r…
We can see that only :ev_type
and :cafv_type
columns have metadata. This is
intended. These columns were not transformed so their metadata is kept because
it had :note
style. On the other hand the :electric_range
column was
transformed, by taking its extrema so its metadata was not propagated.
Now we can more easily understand the results. The :ev_type
column tells us
if the vehicle is electric only or hybrid. On the other hand :cafv_type
column tells us if the vehicle meets the requirements of Clean Alternative
Fuel Vehicles.
Interestingly, for some BEV cars it is not known if they meet CAFV condition,
and this is indicated by 0
in their :electric_range
. For other cases we see
that 30
is a minimal range that allows to classify vehicle as meeting CAFV.
Surprisingly six BEV do not meet it, as their battery range is too low.
Let us check how does the frequency of electric range equal to 0
behaves:
julia> data_agg2 = sort(unstack(data_df,
:model_year,
:ev_type,
:electric_range,
combine = x -> mean(x .> 0)))
20×3 DataFrame
Row │ model_year Plug-in Hybrid Electric Vehicle (PHEV) Battery Electric Vehicle (BEV)
│ Int64 Union{Missing, Float64} Union{Missing, Float64}
─────┼────────────────────────────────────────────────────────────────────────────────────
1 │ 1997 missing 1.0
2 │ 1998 missing 1.0
3 │ 1999 missing 1.0
4 │ 2000 missing 1.0
5 │ 2002 missing 1.0
6 │ 2008 missing 1.0
7 │ 2010 missing 1.0
8 │ 2011 1.0 1.0
9 │ 2012 1.0 1.0
10 │ 2013 1.0 1.0
11 │ 2014 1.0 1.0
12 │ 2015 1.0 1.0
13 │ 2016 1.0 1.0
14 │ 2017 1.0 1.0
15 │ 2018 1.0 1.0
16 │ 2019 1.0 0.999763
17 │ 2020 1.0 0.994388
18 │ 2021 1.0 0.0327439
19 │ 2022 1.0 0.0
20 │ 2023 1.0 0.0
We see that indeed the newest BEV do not have the range data (because it was not
collected yet). Also we learn that hybrids were introduced only in year 2011.
As above we can check that column metadata is only kept for :model_year
column
since we did not transform it:
julia> colmetadata(data_agg2)
Dict{Symbol, Dict{String, String}} with 1 entry:
:model_year => Dict("label"=>"Model Year", "note"=>"The model year of the vehicle, determined by decoding the Vehicle Identification Number (VIN).")
Saving the data frame while keeping metadata
The meta2toml
function was designed to allow for easy saving of metadata
to an independent file. However, I find it more convenient to save metadata
along with data. Here Parquet2.jl comes handy.
Let us work with data_agg2
data frame:
julia> Parquet2.writefile("data_agg2.parquet", data_agg2)
✏ Parquet2.FileWriter{IOStream}(data_agg2.parquet)
julia> data_pq = Parquet2.Dataset("data_agg2.parquet")
≔ Parquet2.Dataset (928 bytes)
1. "model_year": Int64
2. "Plug-in Hybrid Electric Vehicle (PHEV)": Union{Missing, Float64}
3. "Battery Electric Vehicle (BEV)": Union{Missing, Float64}
julia> data_pq_df = DataFrame(data_pq);
We can check that both data_pq
and data_pq_df
have metadata information:
julia> println(meta2toml(data_pq))
style = true
[colmetadata."Battery Electric Vehicle (BEV)"]
[colmetadata."Plug-in Hybrid Electric Vehicle (PHEV)"]
[colmetadata.model_year]
label = ["Model Year", "default"]
note = ["The model year of the vehicle, determined by decoding the Vehicle Identification Number (VIN).", "default"]
[metadata]
caption = ["Electric Vehicle Population Data", "default"]
note = ["This dataset shows the Battery Electric Vehicles (BEVs) and Plug-in Hybrid Electric Vehicles (PHEVs) that are currently registered through Washington State Department of Licensing (DOL).", "default"]
julia> println(meta2toml(data_pq_df))
style = true
[colmetadata.model_year]
label = ["Model Year", "default"]
note = ["The model year of the vehicle, determined by decoding the Vehicle Identification Number (VIN).", "default"]
[metadata]
caption = ["Electric Vehicle Population Data", "default"]
note = ["This dataset shows the Battery Electric Vehicles (BEVs) and Plug-in Hybrid Electric Vehicles (PHEVs) that are currently registered through Washington State Department of Licensing (DOL).", "default"]
We note that the metadata has :default
style, because Parquet does not
support metadata styles. Therefore, if we wanted to turn all metadata to :note
style, so that it would be propagated we can write e.g.:
julia> setallmetadatastyle!(data_pq_df);
julia> print(meta2toml(data_pq_df))
style = true
[colmetadata.model_year]
label = ["Model Year", "note"]
note = ["The model year of the vehicle, determined by decoding the Vehicle Identification Number (VIN).", "note"]
[metadata]
caption = ["Electric Vehicle Population Data", "note"]
note = ["This dataset shows the Battery Electric Vehicles (BEVs) and Plug-in Hybrid Electric Vehicles (PHEVs) that are currently registered through Washington State Department of Licensing (DOL).", "note"]
What features does TableMetadataTools.jl provide?
Here is the list of functionalities that you can find in
TableMetadataTools.jl that are added on top of standard metadata
management functions defined in DataAPI.jl:
- the
label
,label!
,labels
, andfindlabels
functions for
convenient work with column label metadata; - the
caption
andcaption!
functions for convenient work with
table caption metadata; note
andnote!
functions for convenient work with note metadata both
on table and column level;unit
(re-exported from Unitful.jl),unit!
, andunits
functions for
convenient work with column unit metadata;setmetadatastyle!
,setcolmetadatastyle!
,setallmetadatastyle!
for group
setting style for keys matching a passed pattern; usually needed when working
with metadata that initially has:default
style set and one wants it to have
:note
style (common when reading metadata from storage formats that do not
support metadata style information);meta2toml
andtoml2meta!
for storing and loading metadata in TOML format;dict2metadata!
,dict2colmetadata!
for setting table and column level
metadata stored in a dictionary (e.g. earlier retrieved from some storage
format or by using metadata or colmetadata functions);- the
@track
macro for tracking operations applied to table and thetracklog
function for convenient printing of tracking metadata.
Conclusions
I hope you will find this post useful as a tutorial to both working with
metadata and ingesting data from a complex JSON file into a data frame.
My general feeling is that, while metadata is commonly used by Stata or SAS
users, it had less adoption in other ecosystems. The reason is likely that
there was no decent support for storing and working with tabular metadata.
Now, I hope the combination of DataFrames.jl + Parquet2.jl helps to solve
this issue.
Admittedly in many cases you do not need table metadata. However, I am convinced
that if you work with tables that have hundreds or thousands of columns or that
you want to share your tables between teams having proper metadata attached
to data is essential.