
Alpha Capture and Acquired

By: Dean Markwick's Blog -- Julia

Re-posted from: https://dm13450.github.io/2024/09/19/Alpha-Capture-and-Acquired.html

People are never short of a trade idea. There is a whole industry of
researchers, salespeople and amateurs coming up with trading ideas and
making big calls on which stock will go up, which country will cut
interest rates and what the price of gold will do next. Alpha capture
is about systematically assessing those ideas and working out who has
alpha and generates profitable ideas, and who is just making it up as
they go along.




Alpha capture started as a way of profiling brokers' stock
recommendations. If you have 50 people recommending you 50 different
ideas, how do you know who is good? You'll quickly run out of money if
you blindly follow all the recommendations that hit your
inbox. Instead, you need to profile each person's ideas and see
who, on average, makes good recommendations. Whoever is good at
picking stocks probably deserves more of your business.

It has since expanded: some hedge funds now have internal desks
doing a similar analysis on their portfolio managers (PMs) to double
down on profitable bets and mitigate the risk of all the PMs picking the
same stock. Picking stocks and managing a portfolio across many PMs
are two different skills and different departments at a modern
hedge fund.

A simple way to measure the alpha of a PM or broker recommendation
is to see if the price of a stock they buy (or recommend) goes up
after the day they suggest it. Those with alpha would see their
picks move higher on a large enough sample, while those without alpha
would average out to zero: some ideas would go higher, some lower,
the net result being zero alpha. If a PM has the opposite effect,
where every stock they buy goes down, they are a contrarian
indicator, so take their ideas and do the opposite!

Alpha capture markout graph
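
As a rough sketch of that scoring idea (not from the original post, and with made-up numbers), you could group recommendations by who made them and average the post-recommendation return:

using DataFrames, Statistics

# Hypothetical recommendation log: the source of each call and the stock's
# return over the days after the call (all values invented).
recs = DataFrame(source = ["A", "A", "A", "B", "B", "B"],
                 post_return = [0.021, 0.012, 0.005, -0.010, 0.002, -0.004])

# A crude alpha score: the average post-recommendation return per source.
combine(groupby(recs, :source), :post_return => mean => :avg_return)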

Alpha Capture Systems: Past, Present, and Future Directions goes
through the history of alpha capture and is a good short read that
inspired this blog post.

Basic Alpha Capture

What if we wanted to try our own Alpha Capture? We need some stock recommendations and a way of calculating what happens to the price after the recommendation. This is where the Acquired podcast comes in.

Acquired logo

Acquired tells the stories and strategies of great companies (taken from their website). It’s a pretty popular podcast and each episode gets close to a million listeners. So this makes it an ideal Alpha Capture study – when they release an episode about a company does the stock price of that company go higher or lower on average?
If it were to go higher then each time an episode is released call your broker and go long the stock!

They aren’t explicitly recommending a stock by talking about
it, as they say in their intro. So it’s just a toy exercise to see if
there is any correlation between the stock price and the release date
of an episode.

To systematically test this we need to get a list of the episodes and calculate a ‘markout’ from each episode.

Collecting Podcast Data

The internet is a wonderful thing and each episode of Acquired is
available as an XML feed from transistor.fm. With some fun XML
parsing I can get the full history of the podcast, with the date and
title of each episode.

using CSV, DataFrames, DataFramesMeta, Dates, HTTP, Plots, Statistics, TimeZones
using XML  # assumed to provide Node, tag, children and simplevalue used below

# Pull the publication date and title out of a single <item> node.
function parseEpisode(x)
  rawDate = first(simplevalue.(x[tag.(x) .== "pubDate"]))
  date = ZonedDateTime(rawDate, dateformat"eee, dd uuu yyyy HH:MM:ss z")

  Dict("title" => first(simplevalue.(x[tag.(x) .== "title"])),
       "date" => date)
end

# Strip the time component off an ISO timestamp string.
function parse_date(t)
   Date(string(split(t, "T")[1]))
end

url = "https://feeds.transistor.fm/acquired"

# Download the RSS feed and parse it into an XML document.
data = parse(Node, String(HTTP.get(url).body))

# Each episode is an <item> element inside the feed's channel.
episodes = children(data[3][1])
filter!(x -> tag(x) == "item", episodes)
episodes = children.(episodes)

episodeData = parseEpisode.(episodes)

episodeFrame = vcat(DataFrame.(episodeData)...)
CSV.write("episodeRaw.csv", episodeFrame)

After writing the data to a CSV I need to somehow parse the episode
title into a stock ticker. This is a tricky task as the episode names
are human friendly, not computer friendly. So it's time for our LLM
overlords to lend a hand and do the heavy lifting. I drop the CSV into
Perplexity and prompt it to add the relevant stock ticker to the
file. I then reread the CSV into my notebook.

episodeFrame = CSV.read("episodeTicker.csv", DataFrame)
episodeFrame.date = ZonedDateTime.(String.(episodeFrame.date), dateformat"yyyy-mm-ddTHH:MM:SS.sss-z")

vcat(first(@subset(episodeFrame, :stock_ticker .!= "-"), 4),
        last(@subset(episodeFrame, :stock_ticker .!= "-"), 4))
date                           title                                stock_ticker  sector_etf
2024-03-17T17:54:00.400+07:00  Renaissance Technologies             RNR           PSI
2024-02-19T17:56:00.410+08:00  Hermès                               RMS.PA        GXLU
2024-01-21T17:59:00.450+08:00  Novo Nordisk (Ozempic)               NOVO-B.CO     IHE
2023-11-26T16:24:00.250+08:00  Visa                                 V             IPAY
2018-09-23T18:28:00.550+07:00  Season 3, Episode 5: Alibaba         BABA          KWEB
2018-08-20T09:20:00.370+07:00  Season 3, Episode 3: The Sonos IPO   SONO          GAMR
2018-08-05T18:15:00.030+07:00  Season 3, Episode 2: The Xiaomi IPO  XIACF         KWEB
2018-07-16T21:40:00.560+07:00  Season 3, Episode 1: Tesla           TSLA          TSLA

It's done an ok job. Most of the episodes seem to correspond to the
right ticker, but we can see it has hallucinated the RenTech stock
ticker as RNR. RenTech is a private company with no stock ticker and
instead Perplexity has decided that RNR (a reinsurance company) is the
correct one. So it's not 100% accurate. Still, it has saved me a
good chunk of time and we can move on to getting the stock price data.

We want to measure the average price move of a stock after an episode is released. If Acquired had stock-picking skill, you would expect the price to increase after the release of an episode, as they generally speak positively about the companies they cover.

So using AlpacaMarkets.jl we get the stock price for the days before and after the episode. As AlpacaMarkets only has US stock data, only some of the episodes end up with a full dataset.

What is a Markout?

We calculate the percentage change relative to the episode date and then aggregate all the stock tickers together.

\[\text{Markout} = \frac{p - p_{\text{episode released}}}{p_{\text{episode released}}}\]
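
As a quick worked example with made-up numbers: if the stock trades at 100 when the episode is released and at 102 ten days later, then

\[\text{Markout} = \frac{102 - 100}{100} = 0.02,\]

which the code below reports as 200 basis points via its \(10^4\) scaling.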

Acquired is about great companies, so they choose to speak favourably about a company; therefore I think it's a reasonable assumption that the stock price should increase after everyone gets round to listening to an episode.
Once we aggregate all the episodes we should hopefully have enough data to decide if this is true.

using AlpacaMarkets

# Daily bars from a month before the episode onwards; vw is the volume-weighted price.
function getStockData(stock, startDate)
  prices = AlpacaMarkets.stock_bars(stock, "1Day", startTime=startDate - Month(1), limit=10000)[1]
  prices.date .= startDate
  prices.t = parse_date.(prices.t)
  prices[:, [:t, :symbol, :vw, :date]]
end

# Markout in basis points relative to the last price on or before the episode date.
function calcMarkout(data)
   arrivalInd = findlast(data.t .<= data.date)
   arrivalPrice = data[arrivalInd, :vw]
   data.arrivalPrice .= arrivalPrice
   data.ts = [x.value for x in (data.t .- data.date)]
   data.markout = 1e4*(data.vw .- data.arrivalPrice) ./ data.arrivalPrice
   data
end

res = []

# Loop over each episode, skipping tickers AlpacaMarkets can't return data for.
for row in eachrow(episodeFrame)
    
    try 
        stockData = getStockData(row.stock_ticker, Date(row.date))
        stockData = calcMarkout(stockData)
        append!(res, [stockData])
    catch e
        println(row.stock_ticker)
    end
end

res = vcat(res...)

With the data pulled we now aggregate by each day before and after the episode.

markoutRes = @combine(groupby(res, :ts), :n = length(:markout), 
                                         :avgMarkout = mean(:markout),
                                         :devMarkout = std(:markout))
markoutRes = @transform(markoutRes, :errMarkout = :devMarkout ./sqrt.(:n))

We always need error bars as this data is noisy.


markoutResSub = @subset(markoutRes, :ts .<= 60, :n .>= 10)
plot(markoutResSub.ts, markoutResSub.avgMarkout, yerr=markoutResSub.errMarkout, 
     xlabel = "Days", ylabel = "Markout", title = "Acquired Alpha Capture", label = :none)
hline!([0], ls = :dash, color = "grey", label = :none)
vline!([0], ls = :dash, color = "grey", label = :none)

Average markout

Not really a pattern. The majority of the error bars intersect zero after the podcast is released.
If you squint a little there seems to be a slight downward trend post-episode, which would suggest they talk about a company at the peak of its stock price.

Beforehand there is a bit of positive momentum, again suggesting that
they release the podcast at the peak of the stock price. This is
even more of a stretch given there is only one podcast a month and it
takes more than 20 days to prepare an episode (I imagine!), so it is
more noise than signal.

markoutIndRes = @combine(groupby(res, [:symbol, :ts]), :n = length(:markout), 
                                         :avgMarkout = mean(:markout),
                                         :devMarkout = std(:markout))
markoutIndRes = @transform(markoutIndRes, :errMarkout = :devMarkout ./sqrt.(:n))

p = plot()
hline!(p, [0], ls = :dash, color = "grey", label = :none)
vline!(p, [0], ls = :dash, color = "grey", label = :none)
for sym in ["TSLA", "V", "META"]
   markoutResSub = sort(@subset(markoutIndRes, :symbol .== sym, :ts .<= 60, :n .>= 1), :ts)
    plot!(p, markoutResSub.ts, markoutResSub.avgMarkout, yerr=markoutResSub.errMarkout, 
     xlabel = "Days", ylabel = "Markout", title = "Acquired Alpha Capture", label = sym, lw =2) 
end
p

Individual markouts

When we pull out 3 examples of episodes we can see the randomness and specifically the volatility of TSLA here.

Conclusion

From this, we would not put any specific weight on the stock
performance after an episode is released. There doesn’t appear to be
any statistical pattern to exploit. No alpha means no alpha
capture. It is a nice exercise though and has hopefully explained the
concept of a markout.

Calibrating an Ornstein–Uhlenbeck Process

By: Dean Markwick's Blog -- Julia

Re-posted from: https://dm13450.github.io/2024/03/09/Calibrating-an-Ornstein-Uhlenbeck-Process.html

Read enough quant finance papers or books and you’ll come across the
Ornstein–Uhlenbeck (OU) process. This is a post that explores the OU
process, the equations, how we can simulate such a process and then estimate the parameters.




I've briefly touched on mean reversion and OU processes before in my
Stat Arb – An Easy Walkthrough
blog post where we modelled the spread between an asset and its
respective ETF. The whole concept of 'mean reversion' comes up
frequently in finance and at different time scales. The OU process
can be thought of as the first basic extension of Brownian motion:
instead of things moving purely randomly there is now some structure,
with the process oscillating around a constant value.

The Hudson Thames group have a similar post on OU processes (Mean-Reverting Spread Modeling: Caveats in Calibrating the OU Process) and
my post should be a nice complement with code and some extensions.

The Ornstein-Uhlenbeck Equation

As a continuous process, we write the change in \(X_t\) as an increment in time and some noise

\[\mathrm{d}X_t = \theta (\mu - X_t) \mathrm{d}t + \sigma \mathrm{d}W_t\]

The amount it changes in time depends on the previous value \(X_t\) and two free parameters \(\mu\) and \(\theta\).

  • The \(\mu\) is the long-term mean that the process reverts towards
  • The \(\theta\) is the mean reversion or momentum parameter depending on the sign.

If \(\theta\) is 0 we can see the equation collapses down to a simple random walk.

If we assume \(\mu = 0\), so the long-term average is 0, then a positive value of \(\theta\) means we see mean reversion. Large values of \(X\) mean the next change is likely to have a negative sign, leading to a smaller value in \(X\).

A negative value of \(\theta\) means the opposite: a large value of \(X\) generates a further large change in the same direction and the process explodes.

If we discretise the process we can simulate some samples with different parameters to illustrate these two modes.

\[X_{t+1} - X_t = \theta (\mu - X_t) \Delta t + \sigma \sqrt{\Delta t} W_t\]

where \(W_t \sim N(0,1)\).

This is easy to write out in Julia. We can save some time by drawing the random values first and then just summing everything together.

using Distributions, Plots

# Simulate a discretised OU path: theta is the reversion speed, mu the long-term
# mean, sigma the volatility, dt the time step and maxT the horizon.
function simulate_os(theta, mu, sigma, dt, maxT, initial)
    p = Array{Float64}(undef, length(0:dt:maxT))
    p[1] = initial
    # Pre-draw all the noise increments.
    w = sigma * rand(Normal(), length(p)) * sqrt(dt)
    for i in 1:(length(p)-1)
        p[i+1] = p[i] + theta*(mu-p[i])*dt + w[i]
    end
    return p
end

We have two classes of OU processes we want to simulate, a mean
reverting \(\theta > 0\) and a momentum version (\(\theta < 0\)) and
we also want to simulate a random walk at the same time, so \(\theta =
0\). We will assume \(\mu = 0\) which keeps the pictures simple.

maxT = 5
dt = 1/(60*60)
vol = 0.005

initial = 0.00*rand(Normal())

p1 = simulate_os(-0.5, 0, vol, dt, maxT, initial)
p2 = simulate_os(0.5, 0, vol, dt, maxT, initial)
p3 = simulate_os(0, 0, vol, dt, maxT, initial)

plot(0:dt:maxT, p1, label = "Momentum")
plot!(0:dt:maxT, p2, label = "Mean Reversion")
plot!(0:dt:maxT, p3, label = "Random Walk")

The different shapes an OU process can take

The mean reversion path (orange) hasn't moved away from the long-term average (\(\mu=0\)) and the momentum path has diverged the furthest from the starting point, which lines up with the name. The random walk sits in between the two, as we would expect.

Now that we have successfully simulated the process, we want to try and
estimate the \(\theta\) parameter from the simulation. We have two
slightly different (but related) methods to achieve this.

OLS Calibration of an OU Process

When we look at the generating equation we can simply rearrange it into a linear equation.

\[\Delta X = \theta \mu \Delta t - \theta \Delta t X_t + \epsilon\]

and the usual OLS equation

\[y = \alpha + \beta X + \epsilon\]

such that

\[\alpha = \theta \mu \Delta t\]

\[\beta = -\theta \Delta t\]

where \(\epsilon\) is the noise. So we just need a DataFrame with the difference between subsequent observations and the previous observation to regress against. Just a diff and a shift.

using DataFrames, DataFramesMeta
momData = DataFrame(y=p1)
momData = @transform(momData, :diffY = [NaN; diff(:y)], :prevY = [NaN; :y[1:(end-1)]])

Then using the standard OLS process from the GLM package.

using GLM

mdl = lm(@formula(diffY ~ prevY), momData[2:end, :])
alpha, beta = coef(mdl)

theta = -beta / dt
mu = alpha / (theta * dt)

Which gives us \(\mu = 0.0075, \theta = -0.3989\), so close to zero
for the drift and the reversion parameter has the correct sign.

Doing the same for the mean reversion data.

# Build the equivalent DataFrame for the mean-reverting sample p2.
revData = DataFrame(y=p2)
revData = @transform(revData, :diffY = [NaN; diff(:y)], :prevY = [NaN; :y[1:(end-1)]])

mdl = lm(@formula(diffY ~ prevY), revData[2:end, :])
alpha, beta = coef(mdl)

theta = -beta / dt
mu = alpha / (theta * dt)

This time \(\mu = 0.001\) and \(\theta = 1.2797\). So a little wrong
compared to the true values, but at least the correct sign.

Does Bootstrapping Help?

It could be that we need more data, so we use the bootstrap to randomly sample from the population to give us pseudo-new draws. We use the DataFrames again and pull random rows with replacement to build out the data set. We do this sampling 1000 times.

using StatsBase  # provides sample for the bootstrap resampling

res = zeros(1000)
for i in 1:1000
    mdl = lm(@formula(diffY ~ prevY + 0), momData[sample(2:nrow(momData), nrow(momData), replace=true), :])
    res[i] = -first(coef(mdl)/dt)
end

bootMom = histogram(res, label = :none, title = "Momentum", color = "#7570b3")
bootMom = vline!(bootMom, [-0.5], label = "Truth", lw = 2)
bootMom = vline!(bootMom, [0.0], label = :none, color = "black")

We then do the same for the reversion data.

res = zeros(1000)
for i in 1:1000
    mdl = lm(@formula(diffY ~ prevY + 0), revData[sample(2:nrow(revData), nrow(revData), replace=true), :])
    res[i] = first(-coef(mdl)/dt)
end

bootRev = histogram(res, label = :none, title = "Reversion", color = "#1b9e77")
bootRev = vline!(bootRev, [0.5], label = "Truth", lw = 2)
bootRev = vline!(bootRev, [0.0], label = :none, color = "black")

Then combining both the graphs into one plot.

plot(bootMom, bootRev, 
  layout=(2,1),dpi=900, size=(800, 300),
  background_color=:transparent, foreground_color=:black,
     link=:all)

Bootstrapping an OU process

The momentum bootstrap has worked and centred around the correct
value, but the same cannot be said for the reversion plot. However, it
has correctly guessed the sign.

AR(1) Calibration of an OU Process

If we continue assuming that \(\mu = 0\) then we can simplify the OLS
to a 1-parameter regression – OLS without an intercept. From the
generating process, we can see that this is an AR(1) process – each
observation depends on the previous observation by some amount.

\[\phi = \frac{\sum _i X_i X_{i-1}}{\sum _i X_{i-1}^2}\]

then, since the exact discretisation of an OU process is an AR(1) with coefficient \(\phi = e^{-\theta \Delta t}\), the reversion parameter is calculated as

\[\theta = - \frac{\log \phi}{\Delta t}\]

This gives us a simple equation to calculate \(\theta\) now.

For the momentum sample:

phi = sum(p1[2:end] .* p1[1:(end-1)]) / sum(p1[1:(end-1)] .^2)
-log(phi)/dt

Gives \(\theta = -0.50184\), so very close to the true value.

For the reversion sample

phi = sum(p2[2:end] .* p2[1:(end-1)]) / sum(p2[1:(end-1)] .^2)
-log(phi)/dt

Gives \(\theta = 1.26\), so correct sign, but quite a way off.

Finally, for the random walk

phi = sum(p3[2:end] .* p3[1:(end-1)]) / sum(p3[1:(end-1)] .^2)
-log(phi)/dt

Produces \(\theta = -0.027\), so quite close to zero.

Again, values are similar to what we expect, so our estimation process
appears to be working.

Using Multiple Samples for Calibrating an OU Process

If you aren’t convinced I don’t blame you. Those point estimates above are nowhere near the actual values that simulated the data so it’s hard to believe the estimation method is working. Instead, what we need to do is repeat the process and generate many more price paths and estimate the parameters of each one.

To make things a bit more manageable code-wise I'm going to
introduce a struct that contains the parameters and allows us to
simulate and estimate in a more contained manner.

struct OUProcess
    theta
    mu 
    sigma
    dt
    maxT
    initial
end

We now write specific functions for this object and this allows us to
simplify the code slightly.

# Simulate a single path from the stored parameters.
function simulate(ou::OUProcess)
    simulate_os(ou.theta, ou.mu, ou.sigma, ou.dt, ou.maxT, ou.initial)
end

# Estimate theta from one simulated path using the AR(1) method.
function estimate(ou::OUProcess)
   p = simulate(ou)
   phi =  sum(p[2:end] .* p[1:(end-1)]) / sum(p[1:(end-1)] .^2)
   -log(phi)/ou.dt
end

# Repeat the simulate-and-estimate step N times.
function estimate(ou::OUProcess, N)
    res = zeros(N)
    for i in 1:N
        res[i] = estimate(ou)
    end
    res
end

We use these new functions to draw from the process 1,000 times and
sample the parameters for each one, collecting the results as an
array.

ou = OUProcess(0.5, 0.0, vol, dt, maxT, initial)
revPlot = histogram(estimate(ou, 1000), label = :none, title = "Reversion")
vline!(revPlot, [0.5], label = :none);

And the same for the momentum OU process

ou = OUProcess(-0.5, 0.0, vol, dt, maxT, initial)
momPlot = histogram(estimate(ou, 1000), label = :none, title = "Momentum")
vline!(momPlot, [-0.5], label = :none);

Plotting the distribution of the results gives us a decent
understanding of how varied the samples can be.

plot(revPlot, momPlot, layout = (2,1), link=:all)

Multiple sample estimation of an OU process

We can see the heavy-tailed nature of the estimation process, but
thankfully the histograms are centred around the correct number. This
goes to show how difficult it is to estimate the mean reversion
parameter even in this simple setup. So for a real dataset, you need to
work out how to collect more samples or radically adjust how accurate
you think your estimate is.
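
As a sketch (not from the original post), a quick way to quantify that spread is to reuse the estimate function defined above and summarise the draws:

using Statistics

# Re-draw 1,000 estimates from the process defined above and summarise
# the spread; the standard deviation tells you how much to trust any
# single fit.
thetas = estimate(ou, 1000)
mean(thetas), std(thetas)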

Summary

We have progressed from simulating an Ornstein-Uhlenbeck process to
estimating its parameters using various methods. We attempted to
enhance the accuracy of the estimates through bootstrapping, but we
discovered that the best approach to improve the estimation is to have
multiple samples.

So if you are trying to fit this type of process on some real world
data, be it the spread between two stocks
(Statistical Arbitrage in the U.S. Equities Market),
client flow (Unwinding Stochastic Order Flow: When to Warehouse Trades) or anything
else you believe might be mean reverting, then understand how much
data you might need to accurately model the process.

Exploring Causal Regularisation

By: Dean Markwick's Blog -- Julia

Re-posted from: https://dm13450.github.io/2023/12/28/Exploring-Causal-Regularisation.html

A good prediction model isn't necessarily a good causal model. You could be missing a key variable in your dataset that is driving the underlying behaviour, so you end up with a good predictive model but not the correct explanation as to why things behave that way. Taking a causal approach is a tougher problem and needs an understanding of whether we have access to the right variables and whether we are making the right link between the variables and an outcome. Causal regularisation is a method that uses machine learning techniques (regularisation!) to try and produce models that can be interpreted causally.




Regularisation is normally taught as a method to reduce overfitting: you have a big model and you make it smaller by shrinking some of the factors. Work by Janzing (papers below) argues that this can help produce better causal models too, and in this blog post I will work through two papers to try and understand the process better.

I’ll work off two main papers for causal regularisation:

  1. Causal Regularisation
  2. Detecting non-causal artifacts in multivariate linear regression models

In truth, I am working backward. I first encountered causal regularisation in Better AB testing via Causal Regularisation, which uses causal regularisation to produce better estimates by combining a biased and an unbiased dataset. I want to take a step back and understand causal regularisation from the original papers. Using free data from the UCI Machine Learning Repository we can attempt to replicate the methods from the papers and see how causal regularisation works to produce better causal models.

As ever, I’m in Julia (1.9), so fire up that notebook and follow along.

using CSV, DataFrames, DataFramesMeta
using Plots
using GLM, Statistics

Wine Tasting Data

The wine-quality dataset from the UCI repository provides measurements of the chemical properties of wine and a quality rating from someone drinking the wine. It’s a simple CSV file that you can download (winequality) and load with minimal data wrangling needed.

We will be working with the red wine data set as that’s what both Janzing papers use.

rawData = CSV.read("wine+quality/winequality-red.csv", DataFrame)
first(rawData)

APD! Always Plotting the Data to make sure the values are something you expect. Sometimes you need a visual confirmation that things line up with what you believe.

plot(scatter(rawData.alcohol, rawData.quality, title = "Alcohol", label = :none, color="#eac435"),
     scatter(rawData.pH, rawData.quality, title = "pH", label = :none, color="#345995"),
     scatter(rawData.sulphates, rawData.quality, title= "Sulphates", label = :none, color="#E40066"),
     scatter(rawData.density, rawData.quality, title = "Density", label = :none, color="#03CEA4"), ylabel = "Quality")

Wine quality variable relationships

By choosing four of the variables randomly we can see that some are correlated with quality and some are not.

A loose goal is to come up with a causal model that can explain the quality of the wine using the provided factors. We will change the data slightly to highlight how causal regularisation helps, but for now, let’s start with the simple OLS model.

In the paper they normalise the variables to be unit variance, so we divide by the standard deviation.
We then model the quality of the wine using all the available variables.

vars = names(rawData, Not(:quality))

cleanData = deepcopy(rawData)

for var in filter(!isequal("White"), vars)
    cleanData[!, var] = cleanData[!, var] ./ std(cleanData[!, var])
end

cleanData[!, :quality] .= Float64.(cleanData[!, :quality])

ols = lm(term(:quality) ~ sum(term.(Symbol.(vars))), cleanData)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

quality ~ 1 + fixed acidity + volatile acidity + citric acid + residual sugar + chlorides + free sulfur dioxide + total sulfur dioxide + density + pH + sulphates + alcohol

Coefficients:
────────────────────────────────────────────────────────────────────────────────────────
                           Coef.  Std. Error      t  Pr(>|t|)     Lower 95%    Upper 95%
────────────────────────────────────────────────────────────────────────────────────────
(Intercept)           21.9652     21.1946      1.04    0.3002  -19.6071      63.5375
fixed acidity          0.043511    0.0451788   0.96    0.3357   -0.0451055    0.132127
volatile acidity      -0.194027    0.0216844  -8.95    <1e-18   -0.23656     -0.151494
citric acid           -0.0355637   0.0286701  -1.24    0.2150   -0.0917989    0.0206716
residual sugar         0.0230259   0.0211519   1.09    0.2765   -0.0184626    0.0645145
chlorides             -0.088211    0.0197337  -4.47    <1e-05   -0.126918    -0.0495041
free sulfur dioxide    0.0456202   0.0227121   2.01    0.0447    0.00107145   0.090169
total sulfur dioxide  -0.107389    0.0239718  -4.48    <1e-05   -0.154409    -0.0603698
density               -0.0337477   0.0408289  -0.83    0.4086   -0.113832     0.0463365
pH                    -0.0638624   0.02958    -2.16    0.0310   -0.121883    -0.00584239
sulphates              0.155325    0.019381    8.01    <1e-14    0.11731      0.19334
alcohol                0.294335    0.0282227  10.43    <1e-23    0.238977     0.349693
────────────────────────────────────────────────────────────────────────────────────────

The dominant factor is the alcohol content, the strongest variable in predicting the quality, i.e. higher quality wines have a higher alcohol content. We also note that 5 out of the 12 coefficients are deemed insignificant at the 5% level. We save these parameters and then look at the regression without the alcohol variable.

olsParams = DataFrame(Dict(zip(vars, coef(ols)[2:end])))
olsParams[!, :Model] .= "OLS"
olsParams
1×12 DataFrame (Model = OLS):
  alcohol = 0.294335, chlorides = -0.088211, citric acid = -0.0355637, density = -0.0337477,
  fixed acidity = 0.043511, free sulfur dioxide = 0.0456202, pH = -0.0638624,
  residual sugar = 0.0230259, sulphates = 0.155325, total sulfur dioxide = -0.107389,
  volatile acidity = -0.194027

cleanDataConfounded = select(cleanData, Not(:alcohol))
vars = names(cleanDataConfounded, Not(:quality))

confoundOLS = lm(term(:quality) ~ sum(term.(Symbol.(vars))), cleanDataConfounded)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

quality ~ 1 + fixed acidity + volatile acidity + citric acid + residual sugar + chlorides + free sulfur dioxide + total sulfur dioxide + density + pH + sulphates

Coefficients:
───────────────────────────────────────────────────────────────────────────────────────────
                             Coef.  Std. Error       t  Pr(>|t|)     Lower 95%    Upper 95%
───────────────────────────────────────────────────────────────────────────────────────────
(Intercept)           189.679       14.2665      13.30    <1e-37  161.696       217.662
fixed acidity           0.299551     0.0391918    7.64    <1e-13    0.222678      0.376424
volatile acidity       -0.176182     0.0223382   -7.89    <1e-14   -0.219997     -0.132366
citric acid             0.00912711   0.0292941    0.31    0.7554   -0.0483321     0.0665863
residual sugar          0.133781     0.0189031    7.08    <1e-11    0.0967031     0.170858
chlorides              -0.107215     0.0203052   -5.28    <1e-06   -0.147043     -0.0673877
free sulfur dioxide     0.0394281    0.023462     1.68    0.0931   -0.00659172    0.0854479
total sulfur dioxide   -0.128248     0.0246854   -5.20    <1e-06   -0.176668     -0.0798287
density                -0.355576     0.0276265  -12.87    <1e-35   -0.409765     -0.301388
pH                      0.0965662    0.0261087    3.70    0.0002    0.0453551     0.147777
sulphates               0.213697     0.0191745   11.14    <1e-27    0.176087      0.251307
───────────────────────────────────────────────────────────────────────────────────────────

citric acid and free sulfur dioxide are now the only insignificant variables; the rest are believed to contribute to the quality. This means we are experiencing confounding: alcohol is the better explanatory variable, but its effect is now being picked up by these other variables.

Confounding – When a variable influences other variables and the outcome at the same time leading to an incorrect view on the correlation between the variables and outcomes.

This regression after dropping the alcohol variable is incorrect and provides the wrong causal conclusion. So can we do better and get closer to the true regression coefficients using some regularisation methods?

For now, we save these incorrect parameters and explore the causal regularisation methods.

olsParamsConf = DataFrame(Dict(zip(vars, coef(confoundOLS)[2:end])))
olsParamsConf[!, :Model] .= "OLS No Alcohol"
olsParamsConf[!, :alcohol] .= NaN

olsParamsConf
1×12 DataFrame (Model = OLS No Alcohol):
  chlorides = -0.107215, citric acid = 0.00912711, density = -0.355576, fixed acidity = 0.299551,
  free sulfur dioxide = 0.0394281, pH = 0.0965662, residual sugar = 0.133781, sulphates = 0.213697,
  total sulfur dioxide = -0.128248, volatile acidity = -0.176182, alcohol = NaN

Regularisation and Regression

Some maths. Regression is taking our variables \(X\) and finding the parameters \(a\) that get us closest to \(Y\).

\[Y = a X\]

\(X\) is a matrix, and \(a\) is a vector. When we fit this to some data, the values of \(a\) are free to converge to whatever values get us closest to the outcome variable. This means we are minimising the difference between \(Y\) and \(a X\)

\[||(Y - a X)|| ^2.\]

Regularisation is the act of restricting the values \(a\) can take.

For example, we can constrain the sum of the absolute values of the \(a\)'s to a constant (\(L_1\) regularisation), or the sum of the squares of the \(a\) values to a constant (\(L_2\) regularisation).
In simpler terms, if we want to increase the coefficient of one parameter, we need to reduce the coefficient of a different term. Think of there being a finite amount of mass that we can allocate to the parameters: they can't take on whatever values they like, but instead have to share the mass amongst themselves. This helps reduce overfitting as it constrains how much influence any one parameter can have, and the final result should converge to a model that doesn't overfit.

In ridge regression we penalise the \(L_2\) norm, restricting the sum of the squares of the \(a\)'s while at the same time minimising the original OLS objective.

\[||(Y - a X)|| ^2 + \lambda || a || ^2.\]

So we can see how regularisation is an additional component of OLS regression. \(\lambda\) is a hyperparameter that is just a number and controls how much restriction we place on the \(a\) values.
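
To make the effect of \(\lambda\) concrete, here is a minimal sketch (not the MLJ implementation used below) of the closed-form ridge solution \(a = (X^\top X + \lambda I)^{-1} X^\top Y\); larger \(\lambda\) shrinks the coefficients towards zero.

using LinearAlgebra

# A toy ridge solver: solve (X'X + lambda*I) a = X'Y directly.
# X is the design matrix, Y the outcome and lambda the penalty strength.
function ridge_coefficients(X, Y, lambda)
    p = size(X, 2)
    (X' * X + lambda * I(p)) \ (X' * Y)
end

# e.g. ridge_coefficients(randn(100, 3), randn(100), 0.1)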

To do ridge regression in Julia I’ll be leaning on the MLJ.jl framework and using that to build out the learning machines.

using MLJ

@load RidgeRegressor pkg=MLJLinearModels

We will take the confounded dataset (so the data where the alcohol column is deleted), partition it into train and test sets, and get started with some regularisation.

y, X = unpack(cleanDataConfounded, ==(:quality); rng=123);

train, test = partition(eachindex(y), 0.7, shuffle=true)

mdl = MLJLinearModels.RidgeRegressor()
RidgeRegressor(
  lambda = 1.0, 
  fit_intercept = true, 
  penalize_intercept = false, 
  scale_penalty_with_samples = true, 
  solver = nothing)

We can see the hyperparameter lambda is initialised to 1.

Basic Ridge Regression

We want to know the optimal \(\lambda\) value, so we will use cross-validation to train the model on one set of data and verify it on a hold-out set before repeating. This is all simple in MLJ.jl: we define a grid of penalisations between 0 and 1 and fit the regression using cross-validation across the different lambdas, optimising for the best \(R^2\) value.

lambda_range = range(mdl, :lambda, lower = 0, upper = 1)

lmTuneModel = TunedModel(model=mdl,
                          resampling = CV(nfolds=6, shuffle=true),
                          tuning = Grid(resolution=200),
                          range = [lambda_range],
                          measures=[rsq]);

lmTunedMachine = machine(lmTuneModel, X, y);

fit!(lmTunedMachine, rows=train, verbosity=0)
report(lmTunedMachine).best_model
RidgeRegressor(
  lambda = 0.020100502512562814, 
  fit_intercept = true, 
  penalize_intercept = false, 
  scale_penalty_with_samples = true, 
  solver = nothing)

The best value of \(\lambda\) is 0.0201. When we plot the \(R^2\) vs the \(\lambda\) values there isn't much of a change, just a minor inflection around the small values.

plot(lmTunedMachine)

R2 and lambda for basic ridge regression

Let’s save those parameters. This will be our basic ridge regression result that the other technique builds off.

res = fitted_params(lmTunedMachine).best_fitted_params.coefs

ridgeParams = DataFrame(res)
ridgeParams = hcat(ridgeParams, DataFrame(Model = "Ridge", alcohol=NaN))
ridgeParams
1×12 DataFrame (Model = Ridge):
  fixed acidity = 0.190892, volatile acidity = -0.157286, citric acid = 0.0410523,
  residual sugar = 0.117846, chlorides = -0.142458, free sulfur dioxide = 0.0374597,
  total sulfur dioxide = -0.153419, density = -0.29919, pH = 0.0375852, sulphates = 0.232461,
  alcohol = NaN

Implementing Causal Regularisation

The main result from the paper is that we first need to estimate the confounding effect \(\beta\) and then choose a penalisation factor \(\lambda\) whose ridge coefficients satisfy

\[|| a_{\text{ridge}} || ^2 \approx (1-\beta) || a_{\text{OLS}} || ^2\]

So the \(L_2\) norm of the ridge parameters can only be so large. In the 2nd paper, they estimate \(\beta\) to be 0.8. For us, we can use the above grid search, calculate the norm of the parameters, and find which ones satisfy those criteria.

So iterate through the above results of the grid search, and calculate the L2 norm of the parameters.

mdls = report(lmTunedMachine).history

l = zeros(length(mdls))
a = zeros(length(mdls))

# For every lambda tried in the grid search, refit the machine and record
# the squared L2 norm of the fitted coefficients.
for (i, mdl) in enumerate(mdls)
    l[i] = mdl.model.lambda
    a[i] = sum(map( x-> x[2], fitted_params(fit!(machine(mdl.model, X, y))).coefs) .^2)
end

Plotting the results gives us a visual idea of how the penalisation works. Larger values of \(\lambda\) mean the model parameters are more and more restricted.

inds = sortperm(l)
l = l[inds]
a = a[inds]

mdlsSorted = report(lmTunedMachine).history[inds]

scatter(l, a, label = :none)
hline!([(1-0.8) * sum(coef(confoundOLS)[2:end] .^ 2)], label = "Target Length", xlabel = "Lambda", ylabel = "a Length")

R2 and lambda for basic ridge regression with target length

We search the lengths for the one closest to the target length and save those parameters.

targetLength = (1-0.8) * sum(coef(confoundOLS)[2:end] .^ 2)
ind = findfirst(x-> x < targetLength, a)

res = fitted_params(fit!(machine(mdlsSorted[ind].model, X, y))).coefs

finalParams = DataFrame(res)
finalParams = hcat(finalParams, DataFrame(Model = "With Beta", alcohol=NaN))
finalParams
1×12 DataFrame (Model = With Beta):
  fixed acidity = 0.0521908, volatile acidity = -0.139099, citric acid = 0.0598797,
  residual sugar = 0.0377729, chlorides = -0.0786037, free sulfur dioxide = 0.00654776,
  total sulfur dioxide = -0.0856938, density = -0.124057, pH = 0.00682623, sulphates = 0.11735,
  alcohol = NaN

What if we don’t want to calculate the confounding effect?

Now the code to calculate \(\beta\) isn't the easiest or most straightforward to implement (hence why I took their estimate). Instead, we could take the approach from Better AB Testing via Causal Regularisation and use the test set to optimise the penalisation parameter \(\lambda\) and then use that value when training the model on the train set.

Applying this method to the wine dataset isn't a true replication of their paper, as their test and train datasets are really two separate datasets, one biased and one unbiased, like you might observe from an AB test. So it's more of a demonstration of the method than a direct comparison to the Janzing method.

Again, MLJ makes this simple: we just fit the machine using the test rows to produce the best-fitting model.

lambda_range = range(mdl, :lambda, lower = 0, upper = 1)

lmTuneModel = TunedModel(model=mdl,
                          resampling = CV(nfolds=6, shuffle=true),
                          tuning = Grid(resolution=200),
                          range = [lambda_range],
                          measures=[rsq]);

lmTunedMachine = machine(lmTuneModel, X, y);

fit!(lmTunedMachine, rows=test, verbosity=0)
plot(lmTunedMachine)

R2 and lambda for basic ridge regression on the test set

report(lmTunedMachine).best_model
RidgeRegressor(
  lambda = 0.010050251256281407, 
  fit_intercept = true, 
  penalize_intercept = false, 
  scale_penalty_with_samples = true, 
  solver = nothing)

Our best \(\lambda\) is 0.01 so we retrain the same machine, this time using the training rows.

res2 = fit!(machine(report(lmTunedMachine).best_model, X, y), rows=train)

Saving these parameters down leaves us with five sets of parameters to compare: the two OLS baselines and the three regularisation approaches.

finalParams2 = DataFrame(fitted_params(res2).coefs)
finalParams2 = hcat(finalParams2, DataFrame(Model = "No Beta", alcohol=NaN))

allParams = vcat([olsParams, olsParamsConf, ridgeParams, finalParams, finalParams2]...)
allParams
5×12 DataFrame, shown with one column per model:

Variable               OLS          OLS No Alcohol   Ridge        With Beta     No Beta
alcohol                 0.294335    NaN              NaN           NaN           NaN
chlorides              -0.088211    -0.107215        -0.142458     -0.0786037    -0.141766
citric acid            -0.0355637    0.00912711       0.0410523     0.0598797     0.031528
density                -0.0337477   -0.355576        -0.29919      -0.124057     -0.323596
fixed acidity           0.043511     0.299551         0.190892      0.0521908     0.222812
free sulfur dioxide     0.0456202    0.0394281        0.0374597     0.00654776    0.03869
pH                     -0.0638624    0.0965662        0.0375852     0.00682623    0.048907
residual sugar          0.0230259    0.133781         0.117846      0.0377729     0.127026
sulphates               0.155325     0.213697         0.232461      0.11735       0.23961
total sulfur dioxide   -0.107389    -0.128248        -0.153419     -0.0856938    -0.153488
volatile acidity       -0.194027    -0.176182        -0.157286     -0.139099     -0.157603

Which method has done the best at uncovering the confounded relationship?

Relative Squared Error

We have our different estimates of the parameters of the model; we now want to compare these to the 'true' unconfounded parameters and see whether we have recovered the correct values. To do this we calculate the squared difference and normalise by the \(L_2\) norm of the fitted parameters.

In practice, this just means we are comparing how far the fitted parameters are from the true (unconfounded) model parameters.
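
In symbols, and matching the code below, the error for a model with fitted coefficients \(a_i\) and unconfounded OLS coefficients \(a_i^{\text{OLS}}\) is

\[e = \frac{\sum_i \left( a_i^{\text{OLS}} - a_i \right)^2}{\sum_i a_i^2}.\]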

allParamsLong = stack(allParams, Not(:Model))
trueParams = select(@subset(allParamsLong, :Model .== "OLS"), Not(:Model))
rename!(trueParams, ["variable", "truth"])
allParamsLong = leftjoin(allParamsLong, trueParams, on = :variable)
errorRes = @combine(groupby(@subset(allParamsLong, :variable .!= "alcohol"), :Model), 
         :a = sum((:truth .- :value) .^2),
         :a2 = sum(:value .^ 2))
errorRes = @transform(errorRes, :e = :a ./ :a2)
sort(errorRes, :e)
5×4 DataFrame
Row  Model           a          a2         e
1    OLS             0.0        0.0920729  0.0
2    With Beta       0.0291038  0.0698576  0.416616
3    Ridge           0.129761   0.266952   0.486085
4    No Beta         0.157667   0.301286   0.523314
5    OLS No Alcohol  0.213692   0.349675   0.611116

Using the \(\beta\) estimation method gives the best model (smallest \(e\)), which lines up with the paper, and the magnitude of the error is also in line with the paper (they had 0.35 and 0.45 for Lasso/ridge regression respectively).
The ridge regression and No Beta methods also improved on the naive 'OLS No Alcohol' approach, which indicates that there is some improvement from using these methods. The No Beta method is not a faithful reproduction of the Better AB testing paper because it requires the 'test' dataset to come from an AB test scenario, which we don't have here, so that might explain why the values don't quite line up.

All methods improve on the naive ‘OLS No Alcohol’ parameters though, which shows this approach to causal regularisation can uncover better models if you have underlying confounding in your data.

Summary

We are always stuck with the data we are given and most of the time can't collect more to try and uncover more relationships. Causal regularisation gives us a chance to use normal machine learning techniques to build better causal models by guiding what the regularisation parameters should be and using that to restrict the overall parameters. When we can estimate the expected confounding value \(\beta\) we get the best results, but regular ridge regression and the Webster-Westray method also provide an improvement on just doing a naive regression.
So whilst reducing overfitting is the main motivation for regularisation, it also brings some causal benefits and lets you get closer to the true relationships between the variables.

Another Causal Post

I’ve written about causal analysis techniques before with Double Machine Learning – An Easy Introduction. This is another way of building causal models.