Basic DataFrames.jl: getting the data

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2021/07/23/getdata.html

Introduction

During JuliaCon 2021 I got some reqests to show some more basic
functionalities of DataFrames.jl. I started writing such a post.
It seems it ended up not so basic, but I hope it is useful.

What we will want to do is to create a plot of PLN/USD buy and sell exchange
rates for the year 2020. The key challenge is getting the data. I fetch it from
a Web API provided by NBP (Narodowy Bank Polski).

Getting ready

Before writing our solutoin let us play a bit with the NBP Web API.

Here is a sample a query that you can submit to get information about bid
and ask rates of PLN/USD on 2020-01-03:

https://api.nbp.pl/api/exchangerates/rates/c/usd/2020-01-03/?format=json

(you can copy-paste it into your browser to see the rendered result)

We have to be careful though as sometimes the prices are not available, and
we get the dreaded 404 error. It happens e.g. for 2020-01-01:

https://api.nbp.pl/api/exchangerates/rates/c/usd/2020-01-01/?format=json

Before we start let us learn how to handle such queries in Julia.

The first step is getting the query result as a string:

julia> using HTTP

julia> q = "https://api.nbp.pl/api/exchangerates/rates/c/usd/2020-01-03/?format=json"
"https://api.nbp.pl/api/exchangerates/rates/c/usd/2020-01-03/?format=json"

julia> r = HTTP.get(q)
HTTP.Messages.Response:
"""
HTTP/1.1 200 OK
Date: Fri, 23 Jul 2021 17:04:48 GMT
Cache-Control: no-cache
Pragma: no-cache
Content-Length: 147
Content-Type: application/json; charset=utf-8
Expires: -1
ETag: "ahlzfBWI/hBaFs4mrcmnRvKkGdL1gH6hS2LIHIfppJI="
Set-Cookie: ee3la5eizeiY4Eix=ud5ahSho; path=/

{"table":"C","currency":"dolar amerykański","code":"USD","rates":[{"no":"002/C/NBP/2020","effectiveDate":"2020-01-03","bid":3.7667,"ask":3.8427}]}"""

julia> s = String(r.body)
"{\"table\":\"C\",\"currency\":\"dolar amerykański\",\"code\":\"USD\",\"rates\":[{\"no\":\"002/C/NBP/2020\",\"effectiveDate\":\"2020-01-03\",\"bid\":3.7667,\"ask\":3.8427}]}"

We can see that the result of the request is a JSON file, so let us parse it and inspect a bit:

julia> using JSON3

julia> j = JSON3.read(s)
JSON3.Object{Base.CodeUnits{UInt8, String}, Vector{UInt64}} with 4 entries:
  :table    => "C"
  :currency => "dolar amerykański"
  :code     => "USD"
  :rates    => JSON3.Object[{…

julia> j.code
"USD"

julia> j.rates
1-element JSON3.Array{JSON3.Object, Base.CodeUnits{UInt8, String}, SubArray{UInt64, 1, Vector{UInt64}, Tuple{UnitRange{Int64}}, true}}:
 {
              "no": "002/C/NBP/2020",
   "effectiveDate": "2020-01-03",
             "bid": 3.7667,
             "ask": 3.8427
}

julia> j.rates[1].effectiveDate
"2020-01-03"

julia> j.rates[1].bid
3.7667

julia> j.rates[1].ask
3.8427

Now we know enough about the structure of the response to write a full solution.

Before moving forward let us just check what happens if we try to fetch data
in a day for which there are no quotations:

julia> q = "https://api.nbp.pl/api/exchangerates/rates/c/usd/2020-01-01/?format=json"
"https://api.nbp.pl/api/exchangerates/rates/c/usd/2020-01-01/?format=json"

julia> r = HTTP.get(q)
ERROR: HTTP.ExceptionRequest.StatusError(404, "GET", "/api/exchangerates/rates/c/usd/2020-01-01/?format=json", HTTP.Messages.Response:
"""
HTTP/1.1 404 Not Found
Date: Fri, 23 Jul 2021 17:09:07 GMT
Cache-Control: no-cache
Pragma: no-cache
Content-Length: 38
Content-Type: text/plain; charset=utf-8
Expires: -1
Set-Cookie: ee3la5eizeiY4Eix=Naew5Ohp; path=/

404 NotFound - Not Found - Brak danych""")

We can see that in this case we get a 404 error thrown as
HTTP.ExceptionRequest.StatusError exception.

We are ready to write the solution we wanted.

Julia in action

First define a function that returns a NamedTuple with the date and bid and
ask rates. If the rates are not available it should return them as missing.

Here it goes:

julia> using Chain

julia> using Dates

julia> function get_rate(date::Date)
           q = "https://api.nbp.pl/api/exchangerates/rates/c/usd/$date/?format=json"

           try
               return @chain q begin
                   HTTP.get
                   String(_.body)
                   JSON3.read
                   @aside @assert _.code == "USD"
                   _.rates[1]
                   @aside @assert date == Date(_.effectiveDate)
                   (date=date, bid=_.bid, ask=_.ask)
               end
           catch e
               if e isa HTTP.ExceptionRequest.StatusError
                   return (date=date, bid=missing, ask=missing)
               else
                   rethrow(e)
               end
           end
       end
get_rate (generic function with 1 method)

Let us note a few things:

  1. In order to streamline processing I used @chain macro from the Chain.jl
    package (a lot of people are asking for more examples how it can be used in
    practice).
  2. In particular I used the @aside statement twice to perform validation of
    the result. Note that this is a useful way to inject into a chain of commands
    an operation whose result should not be stored but it is still useful to be
    performed.
  3. I had to use trycatch block to handle the 404 error. Note though, that
    if for some strange reason we got an exception of type other than
    HTTP.ExceptionRequest.StatusError I rethrow it to avoid covering such error
    (and many bad things can happen in the wild e.g. we could have
    run out of memory).

Let us test our function before moving forward:

julia> get_rate(Date("2020-01-03"))
(date = Date("2020-01-03"), bid = 3.7667, ask = 3.8427)

julia> get_rate(Date("2020-01-01"))
(date = Date("2020-01-01"), bid = missing, ask = missing)

I am sure everyone is waiting now to get to working with DataFrames.jl. Here
it goes, we collect the data from the whole year to a DataFrame using the
push! function:

julia> using DataFrames

julia> usd_pln = DataFrame()
0×0 DataFrame

julia> for date in Date("2020-01-01"):Day(1):Date("2020-12-31")
           push!(usd_pln, get_rate(date), promote=true)
       end

julia> usd_pln
366×3 DataFrame
 Row │ date        bid           ask
     │ Date        Float64?      Float64?
─────┼────────────────────────────────────────
   1 │ 2020-01-01  missing       missing
   2 │ 2020-01-02        3.7597        3.8357
   3 │ 2020-01-03        3.7667        3.8427
   4 │ 2020-01-04  missing       missing
   5 │ 2020-01-05  missing       missing
   6 │ 2020-01-06  missing       missing
   7 │ 2020-01-07        3.7679        3.8441
  ⋮  │     ⋮            ⋮             ⋮
 360 │ 2020-12-25  missing       missing
 361 │ 2020-12-26  missing       missing
 362 │ 2020-12-27  missing       missing
 363 │ 2020-12-28        3.6611        3.7351
 364 │ 2020-12-29  missing       missing
 365 │ 2020-12-30        3.6833        3.7577
 366 │ 2020-12-31        3.6714        3.7456
                              352 rows omitted

For me all worked cleanly. Note that I am using the promote=true keyword
argument, as some of the rows of :bid and :ask columns contain missing
values while other contain Float64 values.

We are ready to produce the plot we wanted:

julia> using Plots
julia> plot(usd_pln.date, [usd_pln.bid usd_pln.ask], label=["bid" "ask"],
            xlabel="date", ylabel="PLN/USD", xrotation=45);

And here is the plot:

PLN/USD bid and ask rates

Now feel free to perform any analyses that you find interesting on this data.

From my side let me just show you how to compute the fraction of days in which
the bid data is missing by day of the week.

julia> using Statistics

julia> @chain usd_pln begin
           transform(:date => ByRow(dayofweek))
           groupby(:date_dayofweek)
           combine(:bid => (x -> ismissing.(x) |> mean) => :fraction_missing)
       end
7×2 DataFrame
 Row │ date_dayofweek  fraction_missing
     │ Int64           Float64
─────┼──────────────────────────────────
   1 │              1         0.0769231
   2 │              2         0.0769231
   3 │              3         0.113208
   4 │              4         0.0754717
   5 │              5         0.0961538
   6 │              6         1.0
   7 │              7         1.0

As you probably might have expected on Saturdays and on Sundays we never
have any information on bid and ask rates.

Conclusion

I hope you enjoyed this post. I have chosen the topic because, at least in my
experience, these days fetching JSON data from the Internet is bread and butter
of virtually every data scientist, so I think it is well worth learning how to
do it.