Basic DataFrames.jl: getting the data
Introduction
During JuliaCon 2021 I got some requests to show some more basic functionalities of DataFrames.jl. I started writing such a post. It seems it ended up not so basic, but I hope it is useful.
What we will want to do is to create a plot of PLN/USD buy and sell exchange rates for the year 2020. The key challenge is getting the data. I fetch it from a Web API provided by NBP (Narodowy Bank Polski).
This post was tested under Julia 1.6.1, HTTP.jl 0.9.12, JSON3.jl 1.9.0, Chain.jl 0.4.7, DataFrames.jl 1.2.1, and Plots.jl 1.19.3.
Getting ready
Before writing our solution let us play a bit with the NBP Web API.
Here is a sample query that you can submit to get information about bid and ask rates of PLN/USD on 2020-01-03:
https://api.nbp.pl/api/exchangerates/rates/c/usd/2020-01-03/?format=json
(you can copy-paste it into your browser to see the rendered result)
We have to be careful though as sometimes the prices are not available, and we get the dreaded 404 error. It happens e.g. for 2020-01-01:
https://api.nbp.pl/api/exchangerates/rates/c/usd/2020-01-01/?format=json
Before we start let us learn how to handle such queries in Julia.
The first step is getting the query result as a string:
julia> using HTTP
julia> q = "https://api.nbp.pl/api/exchangerates/rates/c/usd/2020-01-03/?format=json"
"https://api.nbp.pl/api/exchangerates/rates/c/usd/2020-01-03/?format=json"
julia> r = HTTP.get(q)
HTTP.Messages.Response:
"""
HTTP/1.1 200 OK
Date: Fri, 23 Jul 2021 17:04:48 GMT
Cache-Control: no-cache
Pragma: no-cache
Content-Length: 147
Content-Type: application/json; charset=utf-8
Expires: -1
ETag: "ahlzfBWI/hBaFs4mrcmnRvKkGdL1gH6hS2LIHIfppJI="
Set-Cookie: ee3la5eizeiY4Eix=ud5ahSho; path=/
{"table":"C","currency":"dolar amerykański","code":"USD","rates":[{"no":"002/C/NBP/2020","effectiveDate":"2020-01-03","bid":3.7667,"ask":3.8427}]}"""
julia> s = String(r.body)
"{\"table\":\"C\",\"currency\":\"dolar amerykański\",\"code\":\"USD\",\"rates\":[{\"no\":\"002/C/NBP/2020\",\"effectiveDate\":\"2020-01-03\",\"bid\":3.7667,\"ask\":3.8427}]}"
We can see that the result of the request is a JSON file, so let us parse it and inspect a bit:
julia> using JSON3
julia> j = JSON3.read(s)
JSON3.Object{Base.CodeUnits{UInt8, String}, Vector{UInt64}} with 4 entries:
:table => "C"
:currency => "dolar amerykański"
:code => "USD"
:rates => JSON3.Object[{…
julia> j.code
"USD"
julia> j.rates
1-element JSON3.Array{JSON3.Object, Base.CodeUnits{UInt8, String}, SubArray{UInt64, 1, Vector{UInt64}, Tuple{UnitRange{Int64}}, true}}:
{
"no": "002/C/NBP/2020",
"effectiveDate": "2020-01-03",
"bid": 3.7667,
"ask": 3.8427
}
julia> j.rates[1].effectiveDate
"2020-01-03"
julia> j.rates[1].bid
3.7667
julia> j.rates[1].ask
3.8427
Now we know enough about the structure of the response to write a full solution.
Before moving forward let us just check what happens if we try to fetch data in a day for which there are no quotations:
julia> q = "https://api.nbp.pl/api/exchangerates/rates/c/usd/2020-01-01/?format=json"
"https://api.nbp.pl/api/exchangerates/rates/c/usd/2020-01-01/?format=json"
julia> r = HTTP.get(q)
ERROR: HTTP.ExceptionRequest.StatusError(404, "GET", "/api/exchangerates/rates/c/usd/2020-01-01/?format=json", HTTP.Messages.Response:
"""
HTTP/1.1 404 Not Found
Date: Fri, 23 Jul 2021 17:09:07 GMT
Cache-Control: no-cache
Pragma: no-cache
Content-Length: 38
Content-Type: text/plain; charset=utf-8
Expires: -1
Set-Cookie: ee3la5eizeiY4Eix=Naew5Ohp; path=/
404 NotFound - Not Found - Brak danych""")
We can see that in this case we get a 404 error thrown as
HTTP.ExceptionRequest.StatusError
exception.
We are ready to write the solution we wanted.
Julia in action
First define a function that returns a NamedTuple
with the date and bid and
ask rates. If the rates are not available it should return them as missing
.
Here it goes:
julia> using Chain
julia> using Dates
julia> function get_rate(date::Date)
q = "https://api.nbp.pl/api/exchangerates/rates/c/usd/$date/?format=json"
try
return @chain q begin
HTTP.get
String(_.body)
JSON3.read
@aside @assert _.code == "USD"
_.rates[1]
@aside @assert date == Date(_.effectiveDate)
(date=date, bid=_.bid, ask=_.ask)
end
catch e
if e isa HTTP.ExceptionRequest.StatusError
return (date=date, bid=missing, ask=missing)
else
rethrow(e)
end
end
end
get_rate (generic function with 1 method)
Let us note a few things:
- In order to streamline processing I used
@chain
macro from the Chain.jl package (a lot of people are asking for more examples how it can be used in practice). - In particular I used the
@aside
statement twice to perform validation of the result. Note that this is a useful way to inject into a chain of commands an operation whose result should not be stored but it is still useful to be performed. - I had to use
try
-catch
block to handle the 404 error. Note though, that if for some strange reason we got an exception of type other thanHTTP.ExceptionRequest.StatusError
Irethrow
it to avoid covering such error (and many bad things can happen in the wild e.g. we could have run out of memory).
Let us test our function before moving forward:
julia> get_rate(Date("2020-01-03"))
(date = Date("2020-01-03"), bid = 3.7667, ask = 3.8427)
julia> get_rate(Date("2020-01-01"))
(date = Date("2020-01-01"), bid = missing, ask = missing)
I am sure everyone is waiting now to get to working with DataFrames.jl. Here
it goes, we collect the data from the whole year to a DataFrame
using the
push!
function:
julia> using DataFrames
julia> usd_pln = DataFrame()
0×0 DataFrame
julia> for date in Date("2020-01-01"):Day(1):Date("2020-12-31")
push!(usd_pln, get_rate(date), promote=true)
end
julia> usd_pln
366×3 DataFrame
Row │ date bid ask
│ Date Float64? Float64?
─────┼────────────────────────────────────────
1 │ 2020-01-01 missing missing
2 │ 2020-01-02 3.7597 3.8357
3 │ 2020-01-03 3.7667 3.8427
4 │ 2020-01-04 missing missing
5 │ 2020-01-05 missing missing
6 │ 2020-01-06 missing missing
7 │ 2020-01-07 3.7679 3.8441
⋮ │ ⋮ ⋮ ⋮
360 │ 2020-12-25 missing missing
361 │ 2020-12-26 missing missing
362 │ 2020-12-27 missing missing
363 │ 2020-12-28 3.6611 3.7351
364 │ 2020-12-29 missing missing
365 │ 2020-12-30 3.6833 3.7577
366 │ 2020-12-31 3.6714 3.7456
352 rows omitted
For me all worked cleanly. Note that I am using the promote=true
keyword
argument, as some of the rows of :bid
and :ask
columns contain missing
values while other contain Float64
values.
We are ready to produce the plot we wanted:
julia> using Plots
julia> plot(usd_pln.date, [usd_pln.ask usd_pln.bid], label=["ask" "bid"],
xlabel="date", ylabel="PLN/USD", xrotation=45)
And here is the plot:
Now feel free to perform any analyses that you find interesting on this data.
From my side let me just show you how to compute the fraction of days in which the bid data is missing by day of the week.
julia> using Statistics
julia> @chain usd_pln begin
transform(:date => ByRow(dayofweek))
groupby(:date_dayofweek)
combine(:bid => (x -> ismissing.(x) |> mean) => :fraction_missing)
end
7×2 DataFrame
Row │ date_dayofweek fraction_missing
│ Int64 Float64
─────┼──────────────────────────────────
1 │ 1 0.0769231
2 │ 2 0.0769231
3 │ 3 0.113208
4 │ 4 0.0754717
5 │ 5 0.0961538
6 │ 6 1.0
7 │ 7 1.0
As you probably might have expected on Saturdays and on Sundays we never have any information on bid and ask rates.
Conclusion
I hope you enjoyed this post. I have chosen the topic because, at least in my experience, these days fetching JSON data from the Internet is bread and butter of virtually every data scientist, so I think it is well worth learning how to do it.