Author Archives: Dean Markwick's Blog -- Julia

Fitting Mixed Effects Models – Python, Julia or R?

By: Dean Markwick's Blog -- Julia

Re-posted from: https://dm13450.github.io/2022/01/06/Mixed-Models-Benchmarking.html

I’m benchmarking how long it takes to fit a mixed
effects model using lme4 in R, statsmodels in Python, plus
showing how MixedModels.jl in Julia is also a viable option.


Enjoy these types of posts? Then you should sign up for my newsletter. It’s a short monthly recap of anything and everything I’ve found interesting recently plus
any posts I’ve written. So sign up and stay informed!






Data science is always up for debating whether R or Python is the better
language when it comes to analysing some data. Julia has been
making its case as a viable alternative as well. In most cases you can
perform a task in all three with a little bit of syntax
adjustment, so no need for any real commitment. Yet, I’ve
recently been having performance issues in R with a mixed model.

I have a dataset with 1604 groups for a random effect that has been
grinding to a halt when fitting in R using lme4. The team at lme4
have a vignette titled
performance tips
which at the bottom suggests using Julia to speed things up. So I’ve
taken it upon myself to benchmark the basic model-fitting performances
to see if there is a measurable difference. You can use this post as
an example of fitting a mixed effects model in Python, R and Julia.

The Setup

In our first experiment, I am using the palmerspenguins dataset to
fit a basic linear model. I’ve followed the most basic method in all
three languages, using what the first thing in Google
displays.

The dataset has 333 observations with 3 groups for the random effects
parameter.

I’ll make sure all the parameters are close across the three
languages before benchmarking the performance, again, using what
Google says is the best approach to time some code.

I’m running everything on a Late 2013 Macbook. 2.6GHz i5 with 16GB of
RAM. I’m more than happy to repeat this on an M1 Macbook if someone is
willing to sponsor the purchase!

Now onto the code.

Mixed Effects Models in R with lme4

We will start with R as that is where the dataset comes from. Loading
up the palmerspenguins package and filtering out the NaN in the
relevant columns provide a consistent dataset for the other
languages.

  • R – 4.1.0
  • lme4 – 1.1-27.1
require(palmerpenguins)
require(lme4)
require(microbenchmark)

testData <- palmerpenguins::penguins %>%
                    drop_na(sex, species, island, bill_length_mm)
lmer(bill_length_mm ~ (1 | species) + sex + 1,  testData)
  • Intercept – 43.211
  • sex:male – 3.694
  • species variance – 29.55

This testData gets saved as a csv for the rest of the languages
to read and use in the benchmarking.

To benchmark the function I use the
microbenchmark
package. It is an excellent and lightweight way of quickly working out
how long a function takes to run.

microbenchmark(lmer(bill_length_mm ~ (1 | species) + sex + 1,  testData), times = 1000)

This outputs the relevant quantities of the benchmarking times.

Mixed Effects Models Python with statsmodels

  • Python 3.9.4
  • statmodels 0.13.1
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import timeit as tt

modelData = pd.read_csv("~/Downloads/penguins.csv")

md = smf.mixedlm("bill_length_mm  ~ sex + 1", modelData, groups=modelData["species"])
mdf = md.fit(method=["lbfgs"])

Which gives us parameter values:

  • Intercept – 43.211
  • sex:male – 3.694
  • Species variance: 29.496

When we benchmark the code, we define a specific function and
repeatedly run it 10000 times. This is all contained in the timeit
module, part of the Python standard library.

def run():
    md = smf.mixedlm("bill_length_mm  ~ sex + 1", modelData, groups=modelData["species"])
    mdf = md.fit(method=["lbfgs"])

times = tt.repeat('run()', repeat = 10000, setup = "from __main__ import run", number = 1)

We’ll be taking the mean, median and, range of the times array.

Mixed Effects Models in Julia with MixedModels.jl

  • Julia 1.6.0
  • MixedModels 4.5.0

Julia follows the R syntax very closely, so this needs little
explanation.

using DataFrames, DataFramesMeta, CSV, MixedModels
using BenchmarkTools

modelData = CSV.read("/Users/deanmarkwick/Downloads/penguins.csv",
                                     DataFrame)

m1 = fit(MixedModel, @formula(bill_length_mm ~ 1 + (1|species) + sex), modelData)
  • Intercept – 43.2
  • sex:male coefficient – 3.694
  • group variance of – 19.68751

We use the
BenchmarkTools.jl
package to run the function 10,000 times.

@benchmark fit(MixedModel, @formula(bill_length_mm ~ 1 + (1|species) + sex), $modelData)

As a side note, if you run this benchmarking code in a Jupyter
notebook, you get this beautiful output from the BenchmarkTools
package. Gives you a lovely overview of all the different metrics and
the distribution on the running times.

Julia benchmarking screenshot

Timing Results

All the parameters are close enough, how about the running times?

In milliseconds:

Language Mean Median Min Max
Julia 0.482 0.374 0.320 34
Python 340 260 19 1400
R 29.5 24.5 20.45 467

Julia blows both Python and R out of the water. About 60 times
faster.

I don’t think Python is that slow in practice, I think it is more of
an artefact of the benchmarking code that doesn’t behave in the
same way as Julia and R.

What About Bigger Data and More Groups?

What if we increased the scale of the problem and also the number of
groups in the random effects parameters?

I’ll now fit a Poisson mixed model to some football data. I’ll
be modeling the goals scored by each team as a Poisson variable, with
a fixed effect of whether the team played at home or not and random
effects for the team and another random effect of the opponent.

This new data set is from
football-data.co.uk and has 98,242
rows with 151 groups in the random effects. Much bigger than the
Palmer Penguins dataset.

Now, poking around the statsmodels documentation, there doesn’t
appear to be a way to fit this model in a frequentist
way. The closest is the PoissonBayesMixedGLM, which isn’t comparable
to the R/Julia methods. So in this case we will be dropping Python
from the analysis. If I’m wrong, please let me know in the comments
below and I’ll add it to the benchmarking.

With generalised linear models in both R and Julia, there are
additional parameters to help speed up the fitting but at the expense of
the parameter accuracy. I’ll be testing these parameters to judge how
much of a tradeoff there is between speed and accuracy.

R

The basic mixed-effects generalised linear model doesn’t change much from the
above in R.

glmer(Goals ~ Home + (1 | Team) + (1 | Opponent), data=modelData, family="poisson")

The documentation states that you can pass nAGQ=0 to speed up the
fitting process but might lose some accuracy. So our fast version of
this model is simply:

glmer(Goals ~ Home + (1 | Team) + (1 | Opponent), data=modelData, family="poisson", nAGQ = 0)

Julia

Likewise for Julia hardly any difference in fitting this type of Poisson model.

fit(MixedModel, @formula(Goals ~ Home + (1 | Team) + (1 | Opponent)), footballData, Poisson())

And even mode simply, there is a fast parameter to use which speeds
up the fitting.

fit(MixedModel, @formula(Goals ~ Home + (1 | Team) + (1 | Opponent)), footballData, Poisson(), fast = true)

Big Data Results

Let’s check the fitted coefficients.

Method Intercept Home σTeam σOpponent
R slow 0.1345 0.2426 0.2110 0.2304
R fast 0.1369 0.2426 0.2110 0.2304
Julia slow 0.13455 0.242624 0.211030 0.230415
Julia fast 0.136924 0.242625 0.211030 0.230422

The parameters are all very similar, showing that for this parameter
specification the different speed flags do not change the coefficient
results, which is good. But for any specific model, you should verify
on a subsample at least to make sure the flags don’t change anything.

Now, what about speed.

Language Additional Parameter Mean Median Min Max
Julia 11.151 10.966 9.963 16.150
Julia fast=true 5.94 5.924 4.98 8.15
R 35.4 33.12 24.33 66.48
R nAGQ=0 8.06 7.99 7.37 9.56

So setting fast=true gives a 2x speed boost in Julia which is
nice. Likewise, setting nAGQ=0 in R improves the speed by almost 3x over
the default. Julia set to fast = true is the quickest, but I’m
surprised that R can get close with its speed-up parameter.

Conclusion

If you are fitting a large mixed-effects model with lots of groups
hopefully, this convinces you that Julia is the way forward. The syntax
for fitting the model is equivalent, so you can do all your
preparation in R before importing the data into Julia to do the model
fitting.

Ecomonic Indicators from AlphaVantage

By: Dean Markwick's Blog -- Julia

Re-posted from: https://dm13450.github.io/2021/11/08/AlphaVantage-Economic-Indicators.html

AlphaVantage has added endpoints to
the FRED data repository and I’ve extended the Julia package
AlphaVantage.jl to use them. This gives you an easy way to include some economic data into your models. This blog post will detail the new functions and I’ll be dusting off my AS-Level in economics to try and explain what they all mean.

What AlphaVantage has done here is nothing new, and you can get the
FRED data directly from source https://fred.stlouisfed.org both
through an API and also just downloading csvs. But having another a way to get this economic data into a Julia environment is always a bonus.


Enjoy these types of posts? Then you should sign up for my newsletter. It’s a short monthly recap of anything and everything I’ve found interesting recently plus
any posts I’ve written. So sign up and stay informed!






Make sure you’ve upgraded your AlphaVantage.jl to version to 0.4.1. I’m
running Julia 1.6.

using AlphaVantage
using Plots, DataFrames, DataFramesMeta, Dates

GDP

Gross Domestic Product (GDP) is the overall output of a country. It comprises of both goods and services, so both things that are made (goods) and things that are provided (services). You can think of it as countries overall revenue and summarise how well a country is doing. Good if it is increasing, bad if it is decreasing.

AlphaVantage gives the ability to pull both quarterly and annual values.

realgdp = AlphaVantage.real_gdp("annual") |> DataFrame
realgdp[!, :timestamp] = Date.(realgdp[!, :timestamp])

quartgdp = AlphaVantage.real_gdp("quarterly") |> DataFrame
quartgdp[!, :timestamp] = Date.(quartgdp[!, :timestamp]);
a_tks = minimum(realgdp.timestamp):Year(15):maximum(realgdp.timestamp)
a_tkslbl = Dates.format.(a_tks, "yyyy")

q_tks = minimum(quartgdp.timestamp):Year(4):maximum(quartgdp.timestamp)
q_tkslbl = Dates.format.(q_tks, "yyyy")

aGDP = plot(realgdp[!, :timestamp], realgdp[!, :value], label=:none, title="Annual GDP", xticks = (a_tks, a_tkslbl))
qGDP = plot(quartgdp[!, :timestamp], quartgdp[!, :value], label = :none, title = "Quarterly GDP", xticks = (q_tks, q_tkslbl))

plot(aGDP, qGDP)

svg

There are very few periods where GDP has decreased, although it has
recently been because of COVID. The effects of COVID will crop up
quite a bit in this post.

Real GDP per Capita

The problem with GDP is that it doesn’t take into account how big the country is. If you have more people in your economy then you can probably generate more money. Likewise, to compare your current GDP with historical values it is probably wise to divide by the population size, which gives a general indication of overal quality of life.

gdpPerCapita = AlphaVantage.real_gdp_per_capita() |> DataFrame
gdpPerCapita[!, :timestamp] = Date.(gdpPerCapita[!, :timestamp])

plot(gdpPerCapita.timestamp, gdpPerCapita.value, label=:none)

svg

Again, another drop because of COVID but getting close to reverting on trend

Treasury Yield

The treasury yield represents what percentage return you get for lending money to the US government. As the US government is continuously issuing new debt you could choose lots of different lengths of times to lend the money, the longer you lend money for, the higher your rate of return because you are taking on more risk. FRED provides four different tenors (lengths of time) and what the average yield on your money would be if you bought on that day.

yields = AlphaVantage.treasury_yield.("daily", ["3month", "5year", "10year", "30year"]);

We take advantage of broadcasting to pull the data of each tenor before joining all the data into one big dataframe.

yields = DataFrame.(yields)

tenor = [3, 5*12, 10*12, 30*12]

allyields = mapreduce(i -> begin 
        yields[i][!, :Tenor] .= tenor[i]
        yields[i][!, :timestamp] = Date.(yields[i][!, :timestamp])
        return yields[i]
        end, vcat, 1:4)

allyields = @subset(allyields, :value .!= ".")
allyields[!, :value] = convert.(Float64, allyields[!, :value]);
plot(allyields[!, :timestamp], allyields[!, :value], group = allyields[!, :Tenor], ylabel = "Yield (%)")

svg

Rates have continuously fallen since the peaks in the 1980s. You can also

What happens though when the short-term yields are higher than the
long-term rates? This is when the yield curve ‘inverts’ and the market
believes the short-term risk is higher than the long-term risk. It is very rare, as we can see above, the blue line has only crossed the highest a few times. When was the last time? It was over the great financial crisis. On the 26th of Feb 2007, this happened with the 3-month rate crossing 5% whilst the 30 year was still less than 5%. The very next day there was a market crash and the stock market has one of the largest falls in history.

creditcrunch = @subset(allyields, in.(:timestamp, [[Date("2007-02-26"), 
                                                    Date("2008-02-26"), 
                                                    Date("2009-02-26"),
                                                    Date("2010-02-26")]]))

plot(creditcrunch.Tenor, 
    creditcrunch.value, 
    group = creditcrunch.timestamp, marker = :d,
    legendposition = :bottomright,
     title = "Yield Curve", xlabel="Tenor (days)", legend=:bottomright, ylabel = "Yield (%)")

svg

This is the yield curve throughout the years on that same day. We can see that usually, it is increasing, but on the day before the market crash it flipped and there was little difference in the other rates. So next time you hear about the yield curve inverting, you can join everyone else in getting nervous.

Federal Fund Rate

This is what is decided by the FOMC on Thursday afternoons.

The rate at which lending institutions lend overnight and is
uncollateralised, which means that don’t have to put down any type of
collateral for the loan. Gives an indication of the overall interest
rate in the American economy. Our (the UK) equivalent is the Bank of England rate, or in Europe the ECB right. This is essentially what you could put as r, the risk-free rate in the Black Scholes model.

fedFundRate = AlphaVantage.federal_fund_rate("monthly") |> DataFrame

fedFundRate[!, :timestamp] = Date.(fedFundRate[!, :timestamp])
fedFundRate[!, :value] = convert.(Float64, fedFundRate[!, :value])

plot(fedFundRate.timestamp, fedFundRate.value, label=:none, title="Federal Fund Rate")

svg

Again, much like the treasury rates, it has fallen steadily since the 1980s. You might be wondering what the difference is between this rate and the above treasury interest rates. This Federal Fund Rate is set by the FOMC and represents bank to bank lending, whereas anyone can buy a treasury and receive that return. Essentially, the Federal Fund Rate is the overall driver of the treasuries.

So, why would the FOMC change the Federal Fund Rate? One of the reasons would be down to inflation and how prices are changing for the average person.

Consumer Price Index (CPI)

This is the consumer price index and represents the price of goods in a basket and how it has changed over time. This provides some measure of inflation and tells us how prices have changed.

AlphaVantage provides this both on a monthly and a semiannual basis.

cpi = AlphaVantage.cpi("monthly") |> DataFrame

cpi[!, :timestamp] = Date.(cpi[!, :timestamp])
cpi[!, :value] = convert.(Float64, cpi[!, :value])

plot(cpi.timestamp, cpi.value, title = "CPI", label = :none)

svg

Prices have been consistently increasing which indicates inflation, but quoting the CPI value isn’t all that intuitive. Instead, what we need is the change in prices to truly reflect how prices have increased, or decreased.

Inflation and Inflation Expectation

Inflation is the compliment to the above CPI measure and provides a percentage to understand how prices have changed over some time.

Inflation is also funny as people will change their behaviour based on
what they think inflation is, rather than what it actually might
be. This is where the inflation expectation comes in handy. If there
is an expectation of high future inflation people might save more to
prepare for higher prices, or they might spend more now to get in front of higher prices. Likewise, if a bank is trying to price a mortgage, higher inflation in the future would reduce the value of the future repayments, so they would adjust the interest rate accordingly.

AlphaVantage provides both from the FRED Datasource inflation (yearly) and inflation_expectation (monthly).

inflation = AlphaVantage.inflation() |> DataFrame
expInflation = AlphaVantage.inflation_expectation() |> DataFrame

inflation[!, :Label] .= "Actual"
expInflation[!, :Label] .= "Expectation"
inflation = vcat(inflation, expInflation)

inflation[!, :timestamp] = Date.(inflation[!, :timestamp])
inflation[!, :value] = convert.(Float64, inflation[!, :value])

plot(inflation.timestamp, inflation.value, group=inflation.Label, title="Inflation")

svg

Since the GFC inflation expectation has been consistently higher than the actual value of inflation. Expectations have also seen a large increase recently. Inflation is becoming an increasing concern in this current economy.

Consumer Sentiment

Consumer sentiment comes from a survey of around 500 people in
America. They are asked how they feel about the economy and their general outlook on what is happening. This is then condensed down into a number which we can view as an indicator of how people feel. Again, like the inflation expectation, it can sometimes be more important to focus on people’s thoughts vs everyone’s actions. Take the petrol crisis here in the UK, I imagine everyone believes they are not the ones panic buying, however, if no-one was panic buying, there would still be petrol! Likewise, if everyone is talking negatively about the economy but not changing behaviour, then it could still have a negative overall effect.

sentiment = AlphaVantage.consumer_sentiment() |> DataFrame

sentiment = @subset(sentiment, :value .!= ".")
sentiment.timestamp = Date.(sentiment.timestamp)

plot(sentiment.timestamp, sentiment.value, label=:none, title="Consumer Sentiment")

svg

Consumer sentiment has been consistent throughout the years, with overall sentiment peaking in the 2000s and at its worse in the 1980s. Understandably, COVID had a major effect causing a fall that had been recovering but has since reversed I imagine based on inflation fears.

Retail Sales and Durable Goods

Retail sales and durable goods are all about what is being bought in the economy. Retail sales consist of things like eating at restaurants, buying clothes, and similar goods. Think of it as doing your weekly shop and how that can vary week on week. Sometimes you might be stocking up on cleaning products, other times you might be buying more food. All of those will be counted in the retail sales survey.

Whereas durable goods are your big-ticket purchases, things that you use more than once and have sort of further use. Cars, ovens, and refrigerators are good examples. Something you’ll save up for and buy at a special store rather than at Tesco.

So these two measures can give a good idea of how people are acting in the economy, are the weekly shops decreasing at the same time as the durable sales because people are spending less across the board? Or is there a sudden increase in durable goods as retail sales remain constant because people suddenly have access to more money to buy a car etc. All sorts of ways you can interpret the numbers.

retails = AlphaVantage.retail_sales() |> DataFrame
goods = AlphaVantage.durables() |> DataFrame;
retails[!, :Label] .= "Retail Sales"
goods[!, :Label] .= "Durable Goods"
retails = vcat(retails, goods)
retails.timestamp = Date.(retails.timestamp)

plot(retails.timestamp, retails.value, group=retails.Label, legend=:topleft)

svg

We can see that they are very seasonal, with large variations
throughout the year. Durable goods took a hit over the COVID crisis,
whereas retail sales have continued to increase and regained their highs even after a COVID decrease.

Unemployment and Non-Farm Payrolls

Finally, we have the unemployment figures. This includes the explicit unemployment rate, expressed as a percentage and also the Non-Farm Payrolls (NFP) number. This is the opposite of an unemployment rate and indicates the current number of people employed. Both of these numbers are monthly figures.

unemployment = AlphaVantage.unemployment() |> DataFrame
nfp = AlphaVantage.nonfarm_payroll() |> DataFrame

unemployment[!, :label] .= "Unemployment"
nfp[!, :label] .= "NFP"

nfp.timestamp = Date.(nfp.timestamp)
unemployment.timestamp = Date.(unemployment.timestamp)

utks = minimum(unemployment.timestamp):Year(12):maximum(unemployment.timestamp)
utkslabel = Dates.format.(utks, "yyyy")

ntks = minimum(nfp.timestamp):Year(13):maximum(nfp.timestamp)
ntkslabel = Dates.format.(ntks, "yyyy")

unemPlot = plot(unemployment.timestamp, unemployment.value, title = "Unemployment", label=:none, xticks = (utks, utkslabel))
nfpPlot = plot(nfp.timestamp, nfp.value, title = "NFP", label=:none, xticks = (ntks , ntkslabel))

plot(unemPlot, nfpPlot)

svg

Unemployment went very high briefly after COVID before coming back down, so seems to have adverted that crisis. NFP numbers are also progressing upwards since the COVID disruption.

Conclusion

Well done on making it this far. Quite a few words and also graphs that all appear to look very similar. Hopefully, you’ve learned something new, or you are about to correct me on something by leaving a comment below!

QuestDB Part 2 – High Frequency Finance (again!)

By: Dean Markwick's Blog -- Julia

Re-posted from: https://dm13450.github.io/2021/08/12/questdb-part2.html

Last time I showed you how to set up a producer/consumer model to
build a database of BTCUSD trades using the CoinbasePro WebSocket
feed. Now I’ll show you how you can connect to the same database to
pull out the data, use some specific timeseries database queries and
hopefully show where this type of database is useful by improving some
of my old calculations. You should read one of my older blog posts on
high frequency finance (here) as I’m going to repeat some of the calculations using more data this time.


Enjoy these types of post? Then you should sign up to my newsletter. It’s a short monthly recap of anything and everything I’ve found interesting recently plus
any posts I’ve written. So sign up and stay informed!






I ingested just over 24 hours worth of data over the 24th and 25th of July. Completely missed the massive rally though, which is just my luck, that would have been interesting to look at! Oh well.

Julia can connect to the database of the LibPQ.jl package and execute queries using all their functions. This is very handy as we don’t have to worry about database drivers or connection methods, we can just connect and go.

using LibPQ
using DataFrames, DataFramesMeta
using Plots
using Statistics, StatsBase
using CategoricalArrays

Default connection details to the database are used to connect to the database.

conn = LibPQ.Connection("""
             dbname=qdb
             host=127.0.0.1
             password=quest
             port=8812
             user=admin""")
PostgreSQL connection (CONNECTION_OK) with parameters:
  user = admin
  password = ********************
  dbname = qdb
  host = 127.0.0.1
  port = 8812
  client_encoding = UTF8
  options = -c DateStyle=ISO,YMD -c IntervalStyle=iso_8601 -c TimeZone=UTC
  application_name = LibPQ.jl
  sslmode = prefer
  sslcompression = 0
  gssencmode = disable
  krbsrvname = postgres
  target_session_attrs = any

Very easy, Julia just thinks that it is a Postgres database. We can
quickly move onto working with the data.

I start with simply getting all the trades out of the database.

@time trades = execute(conn, "SELECT * FROM coinbase_trades") |> DataFrame
dropmissing!(trades);
nrow(trades)
  4.828067 seconds (9.25 M allocations: 335.378 MiB, 1.64% gc time)

210217

It took about 5 seconds to pull 210 thousand rows into the notebook.

plot(trades.timestamp, trades.price, label=:none, fmt=:png)

png

Like I said in my last post, I missed the sudden rally on Sunday 25th
which was a bit unlucky. Side note, Plots.jl does struggle with
formatting the x axis with a timeseries plot.

Now to move onto updating my previous graphs with this new dataset.

Order Sign Correlation

The correlation between buys and sells follows a power law. Last time,
I only had 1000 trades to work after pulling them using the REST API. Now I’ve got 200x more, which should improve the uncertainty around the previous values.

ac = autocor(trades.side)
acplot = plot(1:length(ac), ac, seriestype=:scatter, label = :none, xlab="Lag", ylab = "Correlation")
aclogplot = plot(log.(1:length(ac)), log.(ac), seriestype=:scatter, label=:none, xlab= "log(Lag)", ylab="log(Correlation)")
plot(acplot, aclogplot, fmt=:png)

png

In the log-log plot we can see a nice straight line which we fit a linear model on.

using GLM

sideModel = lm(@formula(log(AC) ~ log(Lag)), DataFrame(AC=ac, Lag=1:length(ac)))
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}

:(log(AC)) ~ 1 + :(log(Lag))

Coefficients:
──────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error       t  Pr(>|t|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)  -0.439012   0.049596    -8.85    <1e-11  -0.538534  -0.339491
log(Lag)     -0.70571    0.0156489  -45.10    <1e-42  -0.737112  -0.674308
──────────────────────────────────────────────────────────────────────────

This time we’ve got a γ value of 0.7 with more certainty.

plot(log.(1:length(ac)), log.(ac), seriestype=:scatter, label=:none)
plot!(log.(1:length(ac)), coef(sideModel)[1] .+ coef(sideModel)[2] .* log.(1:length(ac)), 
      label="Model", xlab= "log(Lag)", ylab="log(Correlation)", fmt=:png)

png

Lines up nicely with the data and better than the previous attempt
with just 1000 trades. γ is less than one which means it is
a ‘long memory’ process, so trades in the past effect trades in the
future for a long time. This is usually explained as the effect of
people breaking up large trades into slices and executing them bit by bit.

Size Distribution

Again, the size of each trade follows a power law distribution too. We use a slightly different method to estimate the exponent and last time with just 1000 trades we struggled to get a stable value. Now, with so much more data we can have another crack.

uSizes = minimum(trades.size):0.05:maximum(trades.size)

empF = ecdf(trades.size)

tradesSizePlot = plot((uSizes), (1 .- empF(uSizes)), seriestype=:scatter, label="P(V > x)", xlabel="Trade Size", ylabel="Probability")
tradesSizeLogPlot = plot(log.(uSizes), log.(1 .- empF(uSizes)), seriestype=:scatter, label="P(V > x)", xlabel = "log(Trade Size)", ylabel="log(Probability)")

plot(tradesSizePlot, tradesSizeLogPlot, fmt=:png)

png

Using the same Hill estimator as last time

function hill_estimator(sizes_sort, k)
    #sizes_sort = sort(sizes)
    N = length(sizes_sort)
    res = log.(sizes_sort[(N-k+1):N] / sizes_sort[N-k])
    k*(1/sum(res))
end
sizes = trades.size
sizes_sort = sort(sizes)
bds = 2:100:(length(sizes)-1000-1)


alphak = [hill_estimator(sizes_sort, k) for k in bds]
plot(bds, alphak, xlabel="k", ylabel="Alpha", label=:none, fmt=:png)

png

Still hard to make a judgement as to whether it is converging to a
value or not. It is always appears to be decreasing no mate the sample
size. Maybe I still need more data or maybe need a better
understanding of the Hill estimator!

Market Impact

I’ve not been using QuestDB to its full potential and repeating all my
previous graphs hasn’t fully exploited the available features. One of those features is the
ability to group by the timestamp across a bucket size (1 second, 5
minutes etc.) and aggregate the data. We will use that to
try and come up with a better model of market impact than I had in my
previous post.

We aggregate the trades into 1 minute buckets and calculate the total volume traded, the total signed volume (sell trades count as negative), the last price and also the number of trades in each bucket.

@time marketimpact = execute(conn, 
    "SELECT timestamp, sum(size) as TotalVolume, 
            sum(size*side) as SignedVolume,
            last(price) as Close,
            count(*) as NTrades
     FROM coinbase_trades 
    SAMPLE by 1m") |> DataFrame
dropmissing!(marketimpact)
marketimpact[1:3, :]
  0.223987 seconds (167.29 k allocations: 8.708 MiB, 56.20% compilation time)

3 rows × 5 columns

timestamp TotalVolume SignedVolume Close NTrades
DateTim… Float64 Float64 Float64 Int64
1 2021-07-24T08:50:34.365 1.75836 -0.331599 33649.0 52
2 2021-07-24T08:51:34.365 4.18169 -3.01704 33625.2 67
3 2021-07-24T08:52:34.365 0.572115 -0.325788 33620.1 46

This took less than a second and is a really easy line of code to write.

Now for the market impact calculation, we calculated the return bucket
to bucket and normalise the signed volume by the total volume traded
to give a value of between -1 and 1. This is taken from
https://arxiv.org/pdf/1206.0682.pdf and equation 26.

marketimpact[!, :returns] .= 1e4.*[NaN; diff(log.(marketimpact.Close))]
marketimpact[!, :NormVolume] .= marketimpact[!, :SignedVolume] ./ marketimpact[!, :TotalVolume]

miModel = lm(@formula(returns ~ NormVolume + 0), marketimpact[2:end, :])
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}

returns ~ 0 + NormVolume

Coefficients:
──────────────────────────────────────────────────────────────────────
              Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────
NormVolume  4.55478    0.290869  15.66    <1e-51     3.9843    5.12526
──────────────────────────────────────────────────────────────────────

Here we can see that there is a positive coefficient, θ in
the paper, as expected, and we can interpret this at how much the price moves after buying or selling. Specifically, in these minute buckets, those that contained only buy trades moved the market up by 4.5bps and the same for sells in the opposite direction.

plot(marketimpact.NormVolume, marketimpact.returns, seriestype=:scatter, label=:none, 
     xlab="Normalised Volume", ylab="Market Impact (log(bps))")
plot!(-1:0.1:1, coef(miModel)[1] .* collect(-1:0.1:1), label="Model", linewidth=3, legend=:topleft, fmt=:png)

png

You can see how the model lines of with the data and there is a very
slight trend that is picked. So overal, a better, if still very simple
model of market impact.

Trades with Top of Book

Now I’ve saved down the best bid and offer using the same process as Part 1 of this series. Over the same time period, the best bid and offer data has 17 million rows. So quite a bit more.

I use this best bid-offer data to do an ASOF join. This takes two tables with timestamps and joins them such that the timestamps align or the previous observation is used. In our case, we can take the trades, join it with the best bid and best offer table to get where the mid price was at the time of the trade.

@time trades2 = execute(conn, 
    "SELECT *
     FROM coinbase_trades 
    ASOF JOIN coinbase_bbo") |> DataFrame
dropmissing!(trades2);
  9.745210 seconds (18.49 M allocations: 671.544 MiB, 1.84% gc time)

This took 11 seconds, but was all done in the database, so no issue
with regards to blowing out the memory after pulling it into your
Julia session. Doing a normal join in Julia would only match
timestamps exactly, whereas we want the last observed bid/offer price
at least making the ASOF function very useful.

We now go through and calculate a mid price, how far the traded price was from the mid price and also add an indicator for what quantile the trade size landed in. We then group by this quantile indicator and calculate the average trade size and average distance from the mid price.

trades2[!, :Mid] .= (trades2.bid .+ trades2.ask)./2;
trades2[!, :Cost] .= 1e4 .* trades2.side .* ((trades2.price .- trades2.Mid) ./ (trades2.Mid))
trades2[!, :SizeBucket] .= cut(trades2[!, :size], [quantile(trades2[!, :size], 0:0.1:1); Inf])
gdata = groupby(@where(trades2, :Cost .> 0), :SizeBucket)
costData = @combine(gdata, MeanSize = mean(:size), MeanCost = mean(:Cost))

logCostPlot = plot(log.(costData.MeanSize), 
                   log.(costData.MeanCost), seriestype=:scatter, 
                   label=:none, 
                   xlab="log(Size)", 
                   ylab="log(Cost)", fmt=:png)

png

Unsurprisingly, we can see that larger trades are further away from
the mid price when they execute. This is because they are eating
through the posted liquidity.

This is very similar to my https://cryptoliquiditymetrics.com/ sweep
the book graph which is estimating the cost of eating liquidity. This graph above is showing
the actual cost of eating liquidity for real trades that have happened
on Coinbase.

We can fit a model to this plot and it is commonly referred to as the square root law of market impact. We ignore the smaller trade sizes, as they aren’t following the nice linear log-log plot.

costModel = lm(@formula(log(MeanCost) ~ log(MeanSize)),
                     @where(costData, :MeanSize .> exp(-7)))
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}

:(log(MeanCost)) ~ 1 + :(log(MeanSize))

Coefficients:
───────────────────────────────────────────────────────────────────────────
                   Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────────
(Intercept)    -0.534863   0.0683721  -7.82    0.0001  -0.696537  -0.373189
log(MeanSize)   0.259424   0.0154468  16.79    <1e-06   0.222898   0.29595
───────────────────────────────────────────────────────────────────────────

The γ value of 0.25 is pretty low compared to other assets,
which we would expect to be around 0.5. But we haven’t included the usual volatility calculation which is in front of the volume component.

plot(log.(costData.MeanSize), 
     log.(costData.MeanCost), seriestype=:scatter, 
     label=:none, 
     xlab="log(Size)", 
     ylab="log(Cost)")
plot!(-8:0.1:3, coef(costModel)[1] .+ coef(costModel)[2] .* (-8:0.1:3), 
      label="Model", legend=:topleft, fmt=:png)

png

Apart from the small trades, the model lines up well with the
increasing trade size.

Using this model you can start to estimate how much a strategy
might cost to implement. At the end of the day, the outcome of your
strategy is unknown, but your trading costs are known. If it costs you
1bp to enter and exit a trade (round trip) but you only think the
price will change by 0.5bps, then your at a loss even if you were 100%
right on the price direction!

Summary

QuestDB makes working with this data incredibly easy. Both aggregating
the data using SAMPLE BY and joining two datasets using AS
OF
. Connecting to the database is a doddle using LibPQ.jl, so you
can get up and running without any issues. Very rare that these things
happen straight out the box.

Then by using this data I’ve ramped up the sample sizes and all my
plots and models start to look better. Again, all free data and with hopefully, very minimal
technical difficulty. As someone that usually finds themselves
drowning in csvs QuestDB has shown how much more efficient things can
be when you use a database.