
Economic Indicators from AlphaVantage

By: Dean Markwick's Blog -- Julia

Re-posted from: https://dm13450.github.io/2021/11/08/AlphaVantage-Economic-Indicators.html

AlphaVantage has added endpoints for the FRED data repository and I’ve extended the Julia package AlphaVantage.jl to use them. This gives you an easy way to include some economic data in your models. This blog post will detail the new functions and I’ll be dusting off my AS-Level in economics to try and explain what they all mean.

What AlphaVantage has done here is nothing new; you can get the FRED data directly from the source at https://fred.stlouisfed.org, both through an API and by downloading CSVs. But having another way to get this economic data into a Julia environment is always a bonus.


Enjoy these types of posts? Then you should sign up for my newsletter. It’s a short monthly recap of anything and everything I’ve found interesting recently plus
any posts I’ve written. So sign up and stay informed!

Make sure you’ve upgraded AlphaVantage.jl to version 0.4.1. I’m
running Julia 1.6.
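
If you need to upgrade, something like the following should do it (a quick sketch using Julia’s package manager; adjust the version as required):

using Pkg
Pkg.add(name="AlphaVantage", version="0.4.1")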

using AlphaVantage
using Plots, DataFrames, DataFramesMeta, Dates

GDP

Gross Domestic Product (GDP) is the overall output of a country. It comprises both goods and services, so both things that are made (goods) and things that are provided (services). You can think of it as a country’s overall revenue, and it summarises how well a country is doing: good if it is increasing, bad if it is decreasing.

AlphaVantage gives the ability to pull both quarterly and annual values.

realgdp = AlphaVantage.real_gdp("annual") |> DataFrame
realgdp[!, :timestamp] = Date.(realgdp[!, :timestamp])

quartgdp = AlphaVantage.real_gdp("quarterly") |> DataFrame
quartgdp[!, :timestamp] = Date.(quartgdp[!, :timestamp]);
a_tks = minimum(realgdp.timestamp):Year(15):maximum(realgdp.timestamp)
a_tkslbl = Dates.format.(a_tks, "yyyy")

q_tks = minimum(quartgdp.timestamp):Year(4):maximum(quartgdp.timestamp)
q_tkslbl = Dates.format.(q_tks, "yyyy")

aGDP = plot(realgdp[!, :timestamp], realgdp[!, :value], label=:none, title="Annual GDP", xticks = (a_tks, a_tkslbl))
qGDP = plot(quartgdp[!, :timestamp], quartgdp[!, :value], label = :none, title = "Quarterly GDP", xticks = (q_tks, q_tkslbl))

plot(aGDP, qGDP)

[Plot: Annual GDP and Quarterly GDP]

There are very few periods where GDP has decreased, although it did
recently because of COVID. The effects of COVID will crop up
quite a bit in this post.

Real GDP per Capita

The problem with GDP is that it doesn’t take into account how big the country is. If you have more people in your economy then you can probably generate more money. Likewise, to compare your current GDP with historical values it is probably wise to divide by the population size, which gives a general indication of overall quality of life.

gdpPerCapita = AlphaVantage.real_gdp_per_capita() |> DataFrame
gdpPerCapita[!, :timestamp] = Date.(gdpPerCapita[!, :timestamp])

plot(gdpPerCapita.timestamp, gdpPerCapita.value, label=:none)

[Plot: Real GDP per capita]

Again, another drop because of COVID, but it is getting close to reverting to trend.

Treasury Yield

The treasury yield represents the percentage return you get for lending money to the US government. As the US government is continuously issuing new debt, you can choose from lots of different lengths of time to lend the money for; the longer you lend for, the higher your rate of return, because you are taking on more risk. FRED provides four different tenors (lengths of time) and the average yield your money would have earned if you had bought on that day.

yields = AlphaVantage.treasury_yield.("daily", ["3month", "5year", "10year", "30year"]);

We take advantage of broadcasting to pull the data for each tenor before joining it all into one big DataFrame.

yields = DataFrame.(yields)

tenor = [3, 5*12, 10*12, 30*12]

allyields = mapreduce(i -> begin 
        yields[i][!, :Tenor] .= tenor[i]
        yields[i][!, :timestamp] = Date.(yields[i][!, :timestamp])
        return yields[i]
        end, vcat, 1:4)

allyields = @subset(allyields, :value .!= ".")
allyields[!, :value] = convert.(Float64, allyields[!, :value]);
plot(allyields[!, :timestamp], allyields[!, :value], group = allyields[!, :Tenor], ylabel = "Yield (%)")

[Plot: Treasury yields (%) by tenor]

Rates have continuously fallen since the peaks in the 1980s.
What happens, though, when the short-term yields are higher than the
long-term rates? This is when the yield curve ‘inverts’ and the market
believes the short-term risk is higher than the long-term risk. It is very rare; as we can see above, the blue line has only crossed the highest line a few times. When was the last time? It was during the great financial crisis. On the 26th of February 2007 the 3-month rate crossed 5% whilst the 30-year was still below 5%. The very next day there was a market crash and the stock market had one of the largest falls in its history.
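
As a rough check, and a sketch of how you might find these days yourself, we can pivot allyields so each tenor becomes a column (the new column names are the string form of the Tenor values in months) and look for days where the 3-month yield exceeds the 30-year yield:

# Pivot to one column per tenor, then find the inverted days.
wide = unstack(allyields, :timestamp, :Tenor, :value)
dropmissing!(wide)
inverted_days = wide[wide[!, "3"] .> wide[!, "360"], :timestamp]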

creditcrunch = @subset(allyields, in.(:timestamp, [[Date("2007-02-26"), 
                                                    Date("2008-02-26"), 
                                                    Date("2009-02-26"),
                                                    Date("2010-02-26")]]))

plot(creditcrunch.Tenor, 
    creditcrunch.value, 
    group = creditcrunch.timestamp, marker = :d,
    title = "Yield Curve", xlabel = "Tenor (months)", legend = :bottomright, ylabel = "Yield (%)")

[Plot: Yield Curve by tenor, 26 February of 2007–2010]

This is the yield curve on that same day across four consecutive years. We can see that it is usually increasing, but on the day before the market crash it flipped, and there was little difference between the other rates. So next time you hear about the yield curve inverting, you can join everyone else in getting nervous.

Federal Funds Rate

This is the rate set by the FOMC at its scheduled meetings.

It is the rate at which lending institutions lend to each other overnight, and it is uncollateralised, which means they don’t have to put down any type of collateral for the loan. It gives an indication of the overall interest rate in the American economy. Our (the UK) equivalent is the Bank of England rate; in Europe it is the ECB rate. This is essentially what you could use as r, the risk-free rate, in the Black-Scholes model.

fedFundRate = AlphaVantage.federal_fund_rate("monthly") |> DataFrame

fedFundRate[!, :timestamp] = Date.(fedFundRate[!, :timestamp])
fedFundRate[!, :value] = convert.(Float64, fedFundRate[!, :value])

plot(fedFundRate.timestamp, fedFundRate.value, label=:none, title="Federal Fund Rate")

[Plot: Federal Funds Rate]

Again, much like the treasury rates, it has fallen steadily since the 1980s. You might be wondering what the difference is between this rate and the treasury yields above. The Federal Funds Rate is set by the FOMC and covers bank-to-bank lending, whereas anyone can buy a treasury and receive that return. Essentially, the Federal Funds Rate is the overall driver of treasury yields.

So, why would the FOMC change the Federal Funds Rate? One of the reasons would be down to inflation and how prices are changing for the average person.

Consumer Price Index (CPI)

The consumer price index represents the price of a basket of goods and how it has changed over time. This provides some measure of inflation and tells us how prices have changed.

AlphaVantage provides this both on a monthly and a semiannual basis.

cpi = AlphaVantage.cpi("monthly") |> DataFrame

cpi[!, :timestamp] = Date.(cpi[!, :timestamp])
cpi[!, :value] = convert.(Float64, cpi[!, :value])

plot(cpi.timestamp, cpi.value, title = "CPI", label = :none)

[Plot: CPI]

Prices have been consistently increasing, which indicates inflation, but quoting the CPI value isn’t all that intuitive. Instead, what we need is the change in prices to truly reflect how prices have increased or decreased.
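
As a rough illustration (a sketch using the monthly cpi DataFrame from above), you could compute the year-on-year percentage change yourself:

# Year-on-year change of the monthly CPI values, oldest first.
sort!(cpi, :timestamp)
yoy = 100 .* (cpi.value[13:end] .- cpi.value[1:end-12]) ./ cpi.value[1:end-12]
plot(cpi.timestamp[13:end], yoy, label=:none, title="CPI Year-on-Year Change (%)")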

Inflation and Inflation Expectation

Inflation is the complement to the above CPI measure and provides a percentage to show how prices have changed over some period of time.

Inflation is also funny as people will change their behaviour based on
what they think inflation is, rather than what it actually might
be. This is where the inflation expectation comes in handy. If there
is an expectation of high future inflation people might save more to
prepare for higher prices, or they might spend more now to get in front of higher prices. Likewise, if a bank is trying to price a mortgage, higher inflation in the future would reduce the value of the future repayments, so they would adjust the interest rate accordingly.

AlphaVantage provides both from the FRED data source: inflation (yearly) and inflation_expectation (monthly).

inflation = AlphaVantage.inflation() |> DataFrame
expInflation = AlphaVantage.inflation_expectation() |> DataFrame

inflation[!, :Label] .= "Actual"
expInflation[!, :Label] .= "Expectation"
inflation = vcat(inflation, expInflation)

inflation[!, :timestamp] = Date.(inflation[!, :timestamp])
inflation[!, :value] = convert.(Float64, inflation[!, :value])

plot(inflation.timestamp, inflation.value, group=inflation.Label, title="Inflation")

[Plot: Inflation, actual vs expectation]

Since the GFC inflation expectation has been consistently higher than the actual value of inflation. Expectations have also seen a large increase recently. Inflation is becoming an increasing concern in this current economy.

Consumer Sentiment

Consumer sentiment comes from a survey of around 500 people in
America. They are asked how they feel about the economy and their general outlook on what is happening. This is then condensed down into a single number which we can view as an indicator of how people feel. Again, like the inflation expectation, it can sometimes be more important to focus on people’s thoughts than on their actions. Take the petrol crisis here in the UK: I imagine everyone believes they are not the ones panic buying; however, if no-one was panic buying, there would still be petrol! Likewise, if everyone is talking negatively about the economy but not changing behaviour, it could still have a negative overall effect.

sentiment = AlphaVantage.consumer_sentiment() |> DataFrame

sentiment = @subset(sentiment, :value .!= ".")
sentiment.timestamp = Date.(sentiment.timestamp)

plot(sentiment.timestamp, sentiment.value, label=:none, title="Consumer Sentiment")

[Plot: Consumer Sentiment]

Consumer sentiment has varied throughout the years, with overall sentiment peaking in the 2000s and at its worst in the 1980s. Understandably, COVID had a major effect, causing a fall that had been recovering but has since reversed, I imagine on the back of inflation fears.

Retail Sales and Durable Goods

Retail sales and durable goods are all about what is being bought in the economy. Retail sales consist of things like eating at restaurants, buying clothes, and similar goods. Think of it as doing your weekly shop and how that can vary week on week. Sometimes you might be stocking up on cleaning products, other times you might be buying more food. All of those will be counted in the retail sales survey.

Durable goods, meanwhile, are your big-ticket purchases: things that you use more than once and that have some lasting use. Cars, ovens, and refrigerators are good examples. Something you’ll save up for and buy at a specialist store rather than at Tesco.

So these two measures can give a good idea of how people are acting in the economy. Are the weekly shops decreasing at the same time as durable sales because people are spending less across the board? Or is there a sudden increase in durable goods while retail sales remain constant because people suddenly have access to more money to buy a car, etc.? There are all sorts of ways you can interpret the numbers.

retails = AlphaVantage.retail_sales() |> DataFrame
goods = AlphaVantage.durables() |> DataFrame;
retails[!, :Label] .= "Retail Sales"
goods[!, :Label] .= "Durable Goods"
retails = vcat(retails, goods)
retails.timestamp = Date.(retails.timestamp)

plot(retails.timestamp, retails.value, group=retails.Label, legend=:topleft)

[Plot: Retail Sales and Durable Goods]

We can see that they are very seasonal, with large variations
throughout the year. Durable goods took a hit over the COVID crisis,
whereas retail sales have continued to increase and regained their highs even after a COVID decrease.

Unemployment and Non-Farm Payrolls

Finally, we have the unemployment figures. This includes the explicit unemployment rate, expressed as a percentage, and also the Non-Farm Payrolls (NFP) number. NFP is the flip side of the unemployment rate and indicates the current number of people employed. Both of these numbers are monthly figures.

unemployment = AlphaVantage.unemployment() |> DataFrame
nfp = AlphaVantage.nonfarm_payroll() |> DataFrame

unemployment[!, :label] .= "Unemployment"
nfp[!, :label] .= "NFP"

nfp.timestamp = Date.(nfp.timestamp)
unemployment.timestamp = Date.(unemployment.timestamp)

utks = minimum(unemployment.timestamp):Year(12):maximum(unemployment.timestamp)
utkslabel = Dates.format.(utks, "yyyy")

ntks = minimum(nfp.timestamp):Year(13):maximum(nfp.timestamp)
ntkslabel = Dates.format.(ntks, "yyyy")

unemPlot = plot(unemployment.timestamp, unemployment.value, title = "Unemployment", label=:none, xticks = (utks, utkslabel))
nfpPlot = plot(nfp.timestamp, nfp.value, title = "NFP", label=:none, xticks = (ntks , ntkslabel))

plot(unemPlot, nfpPlot)

[Plot: Unemployment and NFP]

Unemployment went very high briefly after COVID before coming back down, so that crisis seems to have been averted. NFP numbers have also been progressing upwards since the COVID disruption.

Conclusion

Well done on making it this far. Quite a few words and also graphs that all appear to look very similar. Hopefully, you’ve learned something new, or you are about to correct me on something by leaving a comment below!

QuestDB Part 2 – High Frequency Finance (again!)

By: Dean Markwick's Blog -- Julia

Re-posted from: https://dm13450.github.io/2021/08/12/questdb-part2.html

Last time I showed you how to set up a producer/consumer model to
build a database of BTCUSD trades using the CoinbasePro WebSocket
feed. Now I’ll show you how you can connect to the same database to
pull out the data, use some specific timeseries database queries and
hopefully show where this type of database is useful by improving some
of my old calculations. You should read one of my older blog posts on
high frequency finance (here) as I’m going to repeat some of the calculations using more data this time.


Enjoy these types of posts? Then you should sign up for my newsletter. It’s a short monthly recap of anything and everything I’ve found interesting recently plus
any posts I’ve written. So sign up and stay informed!

I ingested just over 24 hours’ worth of data over the 24th and 25th of July. I completely missed the massive rally though, which is just my luck; that would have been interesting to look at! Oh well.

Julia can connect to the database via the LibPQ.jl package and execute queries using all of its functions. This is very handy as we don’t have to worry about database drivers or connection methods; we can just connect and go.

using LibPQ
using DataFrames, DataFramesMeta
using Plots
using Statistics, StatsBase
using CategoricalArrays

The default connection details are used to connect to the database.

conn = LibPQ.Connection("""
             dbname=qdb
             host=127.0.0.1
             password=quest
             port=8812
             user=admin""")
PostgreSQL connection (CONNECTION_OK) with parameters:
  user = admin
  password = ********************
  dbname = qdb
  host = 127.0.0.1
  port = 8812
  client_encoding = UTF8
  options = -c DateStyle=ISO,YMD -c IntervalStyle=iso_8601 -c TimeZone=UTC
  application_name = LibPQ.jl
  sslmode = prefer
  sslcompression = 0
  gssencmode = disable
  krbsrvname = postgres
  target_session_attrs = any

Very easy; Julia just thinks that it is a Postgres database. We can
quickly move on to working with the data.

I start with simply getting all the trades out of the database.

@time trades = execute(conn, "SELECT * FROM coinbase_trades") |> DataFrame
dropmissing!(trades);
nrow(trades)
  4.828067 seconds (9.25 M allocations: 335.378 MiB, 1.64% gc time)

210217

It took about 5 seconds to pull 210 thousand rows into the notebook.

plot(trades.timestamp, trades.price, label=:none, fmt=:png)

[Plot: BTCUSD trade prices]

Like I said in my last post, I missed the sudden rally on Sunday the
25th, which was a bit unlucky. Side note: Plots.jl does struggle with
formatting the x-axis on a timeseries plot.

Now to move onto updating my previous graphs with this new dataset.

Order Sign Correlation

The correlation between buys and sells follows a power law. Last time
I only had 1000 trades to work with after pulling them using the REST API. Now I’ve got 200x more, which should reduce the uncertainty around the previous values.

ac = autocor(trades.side)
acplot = plot(1:length(ac), ac, seriestype=:scatter, label = :none, xlab="Lag", ylab = "Correlation")
aclogplot = plot(log.(1:length(ac)), log.(ac), seriestype=:scatter, label=:none, xlab= "log(Lag)", ylab="log(Correlation)")
plot(acplot, aclogplot, fmt=:png)

[Plot: order sign autocorrelation, raw and log-log]

In the log-log plot we can see a nice straight line, to which we fit a linear model.

using GLM

sideModel = lm(@formula(log(AC) ~ log(Lag)), DataFrame(AC=ac, Lag=1:length(ac)))
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}

:(log(AC)) ~ 1 + :(log(Lag))

Coefficients:
──────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error       t  Pr(>|t|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)  -0.439012   0.049596    -8.85    <1e-11  -0.538534  -0.339491
log(Lag)     -0.70571    0.0156489  -45.10    <1e-42  -0.737112  -0.674308
──────────────────────────────────────────────────────────────────────────

This time we’ve got a γ value of 0.7 with more certainty.

plot(log.(1:length(ac)), log.(ac), seriestype=:scatter, label=:none)
plot!(log.(1:length(ac)), coef(sideModel)[1] .+ coef(sideModel)[2] .* log.(1:length(ac)), 
      label="Model", xlab= "log(Lag)", ylab="log(Correlation)", fmt=:png)

[Plot: log-log autocorrelation with fitted model]

This lines up nicely with the data, better than the previous attempt
with just 1000 trades. γ is less than one, which means it is
a ‘long memory’ process: trades in the past affect trades in the
future for a long time. This is usually explained as the effect of
people breaking up large trades into slices and executing them bit by bit.
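
For a rough feel of that decay, the fitted log-log model implies the autocorrelation follows exp(intercept) * lag^slope (a sketch using the coefficients estimated above):

# Implied autocorrelation at a given lag from the fitted power law.
implied_ac(lag) = exp(coef(sideModel)[1]) * lag^coef(sideModel)[2]
implied_ac.([1, 10, 100])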

Size Distribution

Again, the size of each trade follows a power law distribution too. We use a slightly different method to estimate the exponent, and last time, with just 1000 trades, we struggled to get a stable value. Now, with so much more data, we can have another crack.

uSizes = minimum(trades.size):0.05:maximum(trades.size)

empF = ecdf(trades.size)

tradesSizePlot = plot((uSizes), (1 .- empF(uSizes)), seriestype=:scatter, label="P(V > x)", xlabel="Trade Size", ylabel="Probability")
tradesSizeLogPlot = plot(log.(uSizes), log.(1 .- empF(uSizes)), seriestype=:scatter, label="P(V > x)", xlabel = "log(Trade Size)", ylabel="log(Probability)")

plot(tradesSizePlot, tradesSizeLogPlot, fmt=:png)

[Plot: trade size distribution P(V > x), raw and log-log]

Using the same Hill estimator as last time

function hill_estimator(sizes_sort, k)
    #sizes_sort = sort(sizes)
    N = length(sizes_sort)
    res = log.(sizes_sort[(N-k+1):N] / sizes_sort[N-k])
    k*(1/sum(res))
end
sizes = trades.size
sizes_sort = sort(sizes)
bds = 2:100:(length(sizes)-1000-1)


alphak = [hill_estimator(sizes_sort, k) for k in bds]
plot(bds, alphak, xlabel="k", ylabel="Alpha", label=:none, fmt=:png)

[Plot: Hill estimator alpha against k]

It is still hard to judge whether it is converging to a value or
not. It always appears to be decreasing no matter the sample
size. Maybe I still need more data, or maybe I need a better
understanding of the Hill estimator!

Market Impact

I’ve not been using QuestDB to its full potential; repeating all my
previous graphs hasn’t fully exploited the available features. One of those features is the
ability to group by the timestamp across a bucket size (1 second, 5
minutes etc.) and aggregate the data. We will use that to
try and come up with a better model of market impact than I had in my
previous post.

We aggregate the trades into 1 minute buckets and calculate the total volume traded, the total signed volume (sell trades count as negative), the last price and also the number of trades in each bucket.

@time marketimpact = execute(conn, 
    "SELECT timestamp, sum(size) as TotalVolume, 
            sum(size*side) as SignedVolume,
            last(price) as Close,
            count(*) as NTrades
     FROM coinbase_trades 
    SAMPLE by 1m") |> DataFrame
dropmissing!(marketimpact)
marketimpact[1:3, :]
  0.223987 seconds (167.29 k allocations: 8.708 MiB, 56.20% compilation time)

3 rows × 5 columns

     timestamp                TotalVolume  SignedVolume  Close    NTrades
     DateTime…                Float64      Float64       Float64  Int64
 1   2021-07-24T08:50:34.365  1.75836      -0.331599     33649.0  52
 2   2021-07-24T08:51:34.365  4.18169      -3.01704      33625.2  67
 3   2021-07-24T08:52:34.365  0.572115     -0.325788     33620.1  46

This took less than a second and is a really easy line of code to write.

Now for the market impact calculation: we calculate the return from
bucket to bucket and normalise the signed volume by the total volume
traded to give a value between -1 and 1. This is taken from
equation 26 of https://arxiv.org/pdf/1206.0682.pdf.

marketimpact[!, :returns] .= 1e4.*[NaN; diff(log.(marketimpact.Close))]
marketimpact[!, :NormVolume] .= marketimpact[!, :SignedVolume] ./ marketimpact[!, :TotalVolume]

miModel = lm(@formula(returns ~ NormVolume + 0), marketimpact[2:end, :])
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}

returns ~ 0 + NormVolume

Coefficients:
──────────────────────────────────────────────────────────────────────
              Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────
NormVolume  4.55478    0.290869  15.66    <1e-51     3.9843    5.12526
──────────────────────────────────────────────────────────────────────

Here we can see that there is a positive coefficient (θ in
the paper), as expected, and we can interpret it as how much the price moves after buying or selling. Specifically, in these minute buckets, those that contained only buy trades moved the market up by 4.5bps, and the same for sells in the opposite direction.
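
As a sketch of what that means in practice, you can read a predicted move straight off the fitted coefficient (positive normalised volume for net buying, negative for net selling):

# Predicted move in bps for a bucket with a given normalised volume.
predicted_impact(nv) = coef(miModel)[1] * nv
predicted_impact(1.0), predicted_impact(-0.5)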

plot(marketimpact.NormVolume, marketimpact.returns, seriestype=:scatter, label=:none, 
     xlab="Normalised Volume", ylab="Market Impact (bps)")
plot!(-1:0.1:1, coef(miModel)[1] .* collect(-1:0.1:1), label="Model", linewidth=3, legend=:topleft, fmt=:png)

[Plot: market impact against normalised volume with fitted model]

You can see how the model lines up with the data and that there is a very
slight trend being picked up. So overall, a better, if still very simple,
model of market impact.

Trades with Top of Book

Now I’ve saved down the best bid and offer using the same process as Part 1 of this series. Over the same time period, the best bid and offer data has 17 million rows. So quite a bit more.

I use this best bid-offer data to do an ASOF join. This takes two tables with timestamps and joins them such that the timestamps align or the previous observation is used. In our case, we take the trades and join them with the best bid and offer table to get where the mid price was at the time of each trade.

@time trades2 = execute(conn, 
    "SELECT *
     FROM coinbase_trades 
    ASOF JOIN coinbase_bbo") |> DataFrame
dropmissing!(trades2);
  9.745210 seconds (18.49 M allocations: 671.544 MiB, 1.84% gc time)

This took about 10 seconds, but it was all done in the database, so
there is no issue with blowing out the memory by pulling it all into your
Julia session. A normal join in Julia would only match
timestamps exactly, whereas we want the last observed bid/offer price
at or before each trade, which makes the ASOF join very useful.
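
For intuition, here is a sketch of what the ASOF logic boils down to in plain Julia; this is illustrative only, as the database does it far more efficiently:

# For each trade timestamp, find the index of the most recent quote at or
# before it. Assumes both timestamp vectors are sorted; an index of 0 means
# no quote precedes that trade.
asof_indices(trade_ts, quote_ts) = [searchsortedlast(quote_ts, t) for t in trade_ts]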

We now go through and calculate a mid price, how far the traded price was from the mid price and also add an indicator for what quantile the trade size landed in. We then group by this quantile indicator and calculate the average trade size and average distance from the mid price.

trades2[!, :Mid] .= (trades2.bid .+ trades2.ask)./2;
trades2[!, :Cost] .= 1e4 .* trades2.side .* ((trades2.price .- trades2.Mid) ./ (trades2.Mid))
trades2[!, :SizeBucket] .= cut(trades2[!, :size], [quantile(trades2[!, :size], 0:0.1:1); Inf])
gdata = groupby(@where(trades2, :Cost .> 0), :SizeBucket)
costData = @combine(gdata, MeanSize = mean(:size), MeanCost = mean(:Cost))

logCostPlot = plot(log.(costData.MeanSize), 
                   log.(costData.MeanCost), seriestype=:scatter, 
                   label=:none, 
                   xlab="log(Size)", 
                   ylab="log(Cost)", fmt=:png)

[Plot: log(Cost) against log(Size)]

Unsurprisingly, we can see that larger trades are further away from
the mid price when they execute. This is because they are eating
through the posted liquidity.

This is very similar to the sweep-the-book graph on
https://cryptoliquiditymetrics.com/, which estimates the cost of
eating liquidity. The graph above shows the actual cost of eating
liquidity for real trades that have happened on Coinbase.

We can fit a model to this plot; this is commonly referred to as the square-root law of market impact. We ignore the smaller trade sizes, as they aren’t following the nice linear log-log relationship.

costModel = lm(@formula(log(MeanCost) ~ log(MeanSize)),
                     @where(costData, :MeanSize .> exp(-7)))
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}

:(log(MeanCost)) ~ 1 + :(log(MeanSize))

Coefficients:
───────────────────────────────────────────────────────────────────────────
                   Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────────
(Intercept)    -0.534863   0.0683721  -7.82    0.0001  -0.696537  -0.373189
log(MeanSize)   0.259424   0.0154468  16.79    <1e-06   0.222898   0.29595
───────────────────────────────────────────────────────────────────────────

The γ value of 0.25 is pretty low compared to other assets,
which we would expect to be around 0.5. But we haven’t included the usual volatility calculation which is in front of the volume component.

plot(log.(costData.MeanSize), 
     log.(costData.MeanCost), seriestype=:scatter, 
     label=:none, 
     xlab="log(Size)", 
     ylab="log(Cost)")
plot!(-8:0.1:3, coef(costModel)[1] .+ coef(costModel)[2] .* (-8:0.1:3), 
      label="Model", legend=:topleft, fmt=:png)

[Plot: log(Cost) against log(Size) with fitted model]

Apart from the small trades, the model lines up well with the
increasing trade size.

Using this model you can start to estimate how much a strategy
might cost to implement. At the end of the day, the outcome of your
strategy is unknown, but your trading costs are known. If it costs you
1bp to enter and exit a trade (round trip) but you only think the
price will change by 0.5bps, then you’re at a loss even if you were 100%
right on the price direction!
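
As a sketch (using the fitted coefficients from above and ignoring spreads, fees and volatility), you could back out a rough round-trip cost for a hypothetical trade size:

# Rough cost in bps for a given trade size from the fitted model,
# then doubled for a round trip.
est_cost_bps(size) = exp(coef(costModel)[1]) * size^coef(costModel)[2]
2 * est_cost_bps(0.5)   # hypothetical 0.5 BTC position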

Summary

QuestDB makes working with this data incredibly easy, both aggregating
the data using SAMPLE BY and joining two datasets using ASOF
JOIN. Connecting to the database is a doddle using LibPQ.jl, so you
can get up and running without any issues. It is very rare that these things
work straight out of the box.

Then, by using this data, I’ve ramped up the sample sizes and all my
plots and models start to look better. Again, it is all free data and, hopefully, very minimal
technical difficulty. As someone who usually finds themselves
drowning in CSVs, QuestDB has shown how much more efficient things can
be when you use a database.

Using QuestDB to Build a Crypto Trade Database in Julia

By: Dean Markwick's Blog -- Julia

Re-posted from: https://dm13450.github.io/2021/08/05/questdb-part-1.html

QuestDB is a timeseries database that is well suited for financial
data. It is built with timestamps in mind both when storing the data
and also when getting the data out. This makes it the ideal candidate
as a storage system for crypto trades.

You might have heard of KDB (or q, the language used to work in KDB),
which is a popular commercial timeseries database. You’ll see it being
used all over the finance industry. Unfortunately, KDB
costs money if you go beyond its personal licence terms, whereas
QuestDB is free and also open source.

So this is a blog post on how to get started with QuestDB. We will be
connecting to Coinbase WebSockets in Julia to download trades that
will be stored in a QuestDB table.


Enjoy these types of posts? Then you should sign up for my newsletter. It’s a short monthly recap of anything and everything I’ve found interesting recently plus
any posts I’ve written. So sign up and stay informed!

I’ve written about getting data using Coinbase’s REST API
here, and
this WebSocket method is an alternative way to get the same
data. Using a WebSocket has other benefits though. With the REST API it is impolite to constantly query the endpoint to pull data in realtime. Plus, you don’t know when something has changed: you pull the data, compare it to what you know and then see if there has been an update. With a WebSocket, it just tells you when something has changed, like when a new trade has occurred. This makes it great for building up a database; you can just connect to the firehose and save down all the trades that come through. Open the flood gates rather than always asking if something has changed.

Before we get started though you need to download and install QuestDB. As I’m on a Mac I downloaded it using Homebrew and started the process with questdb start.
You can get more information on downloading QuestDB here for your system.

The rest of the blog post will walk you through the following steps.

  • The Producer/Consumer pattern
  • Using WebSockets in Julia
  • Putting data into QuestDB

In part two, which I will post sometime next week, I will show you how
we get data out of the database and use some of the specialised
features for timeseries data.

The Producer/Consumer Pattern

I am setting the processes up using a programming pattern called
‘producer/consumer’. This means building one process that produces data and a separate process that consumes data. Keeping the two functions separate allows for better scalability if you want to add in more exchanges or workers. It also means that there is a degree of independence between the two processes, and reducing the coupling should make for an easier development experience.

To set this up in Julia we need to create a RemoteChannel, which is how the producer and consumer processes will communicate. It will be filled up with the type Trade that we will also create.

struct Trade
    id::String
    time::String
    price::Float64
    size::Float64
    side::Int64
    exchange::String
end

Trade() = Trade("", "", NaN, NaN, 0, "")

Base.isempty(x::Trade) = x.id == ""

After creating the struct we also add in the null constructor and a method for checking whether a trade is empty. Overall it is a simple type that just contains the relevant information for each trade found.

The RemoteChannel comes from the Distributed package.

using Distributed

const trades = RemoteChannel(()->Channel{Trade}(500));

It can store 500 trades and we will fill it up by connecting to the CoinbasePro WebSocket feed. Any of the producer processes would be able to add to this trades channel if needed.

WebSockets in Julia

A WebSocket needs a url and a callback function to run on the connection. In our case we want to connect to the WebSocket, subscribe to the market data and parse the incoming messages.

Parsing the message is simple. As it is a JSON object it gets converted to a Julia dictionary, so we can just pull out the appropriate fields and parse them to numbers if needed.

This can be accomplished like so:

using JSON
using Dates
using WebSockets

coinbase_url = "wss://ws-feed.pro.coinbase.com"
coinbase_subscribe_string = JSON.json(Dict(:type=>"subscribe", 
                         :product_ids=>["BTC-USD"], 
                         :channels=>["ticker", "heartbeat"]))

function parse_coinbase_data(x)
    if (get(x, "type", "") == "heartbeat") || (haskey(x, "channels"))
        println("Worker $(myid()): Coinbase Heartbeat")
        return Trade()
    end
    
    ts = get(x, "time", "")
    
    side = get(x, "side", "")
    tradedprice = parse(Float64, get(x, "price", "NaN"))
    size = parse(Float64, get(x, "last_size", "NaN"))
    id = get(x, "trade_id", "")
    
    Trade(string(id), ts, tradedprice, size, lowercase(side) == "buy" ? 1 : -1, "Coinbase")
end

function save_coinbase_trades(coinbase_url, coinbase_subscribe_string)

    WebSockets.open(coinbase_url) do ws
        write(ws, coinbase_subscribe_string)
        data, success = readguarded(ws)
        println("Entering Loop")
        while true
            data, success = readguarded(ws)
            jdata = JSON.parse(String(data))
            clean_data = parse_coinbase_data(jdata)
            if !isempty(clean_data)
              put!(trades, clean_data)
            end
        end
    end
    
end

We subscribe to the ticker channel, which gives us trades as they occur and we also use the heartbeat channel to keep the WebSocket alive.

Once the message has been parsed and we have created a Trade object, we can then add it to the queue for the database writer to pick up and save down.

This finishes the producer part. We can now move on to the consumer process.

Getting Julia to Talk to QuestDB

We’ve connected to the WebSocket and our RemoteChannel is filling up. How do we get this into a database? QuestDB exposes a socket (a normal socket, not a WebSocket!) that Julia can connect to. So we simply connect to that exposed port and can send data to QuestDB.

QuestDB uses the
InfluxDB line protocol
to ingest data. This is as easy as sending a string down the
connection and QuestDB does the parsing to place it into the database
table. This string needs to take on a specific format:

table, string_column=value, numeric_column_1=value, numeric_column_2=value timestamp
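
For example, a single trade might end up serialised as something like this (a hypothetical line, in the shape produced by the build_payload function below):

coinbase_trades2,exchange=Coinbase, id=12345,price=33649.0,size=0.01,side=1 1627116634365000000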

We build this using an IOBuffer to incrementally add to the payload string.

The timestamp is the number of nanoseconds since the UNIX epoch. The timestamp of the trade from Coinbase does have this precision, but unfortunately Julia’s DateTime does not support nanoseconds, although the Time type does. So we have to be a bit creative.

The timestamp looks like 2021-01-01T12:00:00.123456. I split on the . to get the datetime up to seconds and the fractional part. The datetime gets easily parsed into epoch time, which we convert to nanoseconds since the epoch by multiplying by 1e9. For the fractional part, we right-pad it with zeros to make sure it is 9 digits long and can then convert it to nanoseconds. Then it is as simple as adding the two values together and using @sprintf to print the full integer without scientific notation.

using Printf

function parse_timestamp(ts::String)
    
    p1, p2 = split(ts, ".")
    
    ut = datetime2unix(DateTime(p1)) * 1e9
    ns = Nanosecond(rpad(chop(String(p2), tail=1), 9, "0"))
    
    @sprintf "%.0f" ut + ns.value 
end

function build_payload(x::Trade)
    buff = IOBuffer()
    write(buff, "coinbase_trades2,")
    write(buff, "exchange=$(getfield(x, :exchange)), ")
    for field in [:id, :price, :size]
        val = getfield(x, field)
        write(buff, "$(field)=$(val),")
    end
    write(buff, "side=$(getfield(x, :side)) ")
    
    tspretty = parse_timestamp(getfield(x, :time))
    
    write(buff, tspretty)
    write(buff, "\n")
    String(take!(buff))
end

The build_payload function takes in a trade and outputs a string to
write to QuestDB. We connect to port 9009 and continuously take trades
from the trades RemoteChannel and write them to the database.

using Sockets
function save_trades_quest(trades)
    cs = connect("localhost", 9009)
    while true
        payload = build_payload(take!(trades))
        write(cs, (payload))
    end
    close(cs)
end

All pretty simple. The annoying thing is that it won’t give any
indication as to whether the write was successful or
not. You can check by querying the database through the QuestDB web
interface and seeing if your values appear, or you can check the
QuestDB logs to see if any errors have occurred. You’ll find the logs at /usr/local/var/questdb on a Mac.

Now we’ve got everything sorted we just need to get both processes
running. We will kick off the WebSocket process asynchronously so that it runs in the background, and likewise the save_trades_quest function, so that it doesn’t lock your computer up if you are running the code alongside it. With scaling in mind, you could run both of these processes on different threads or cores if needed. But in this case, both processes are light enough to be run asynchronously on the main thread.

@async save_coinbase_trades(coinbase_url, coinbase_subscribe_string)
@async save_trades_quest(trades)

This is now saving into the database every time a new trade is seen. If you go to localhost:9000 you can query the data and see how it is evolving. QuestDB uses a SQL-like language, so you can write things like

select * from coinbase_trades

and it will show you all the trades saved down so far. Or

select min(timestamp), max(timestamp) from coinbase_trades

to see the earliest and latest timestamps in traditional SQL. Or, using the timeseries database features:

select * from coinbase_trades
latest by exchange

which will pull out the most recent row for each exchange.

Summary

That’s the end of part 1. You should hopefully have QuestDB installed and slowly filling up with trades from Coinbase. In the next part, I’ll show you how to connect to the database and pull out the data to do some analysis.

If you want something extra to do, why not try extending the program to pull in other cryptocurrencies? You’ll want to edit the subscribe string and how each message is parsed, as sketched below.
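
A minimal sketch of that change, keeping the same channels as before (the downstream parsing would also need to handle the extra products, e.g. by storing the product id on each Trade):

coinbase_subscribe_string = JSON.json(Dict(:type => "subscribe",
                         :product_ids => ["BTC-USD", "ETH-USD", "ETH-BTC"],
                         :channels => ["ticker", "heartbeat"]))
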
You can also connect to QuestDB using Grafana and build out some dashboards to monitor the trades without needing any other code.

Questions? Feedback? Comments? Let me know below!

Version Info

  • QuestDB 6.0.4
  • Julia 1.6