Author Archives: Dean Markwick's Blog -- Julia

Free Finance Data Sets for the Quants

By: Dean Markwick's Blog -- Julia

Re-posted from: https://dm13450.github.io/2023/11/25/Free-Finance-Datasets-for-Quants.html

Now and then I am asked how to get started in quant finance and
my advice has always been to just get hold of some data and play about
with different models. The first step is to get some data and this post takes you
through several different sources and hopefully gives you the
launchpad to start poking around with financial data.


Enjoy these types of posts? Then you should sign up for my newsletter.


I’ve tried to cover different assets and frequencies to hopefully
inspire the various types of quant finance
out there.

High-Frequency FX Market Data

My day-to-day job is in FX so naturally, that’s where I think all the
best data can be found. TrueFX provides
tick-by-tick in milliseconds, so high-frequency data is available for free and across lots of different currencies.
So if you are interested in working out how to deal with large amounts
of data (1 month of EURUSD is 600MB) efficiently, this source is a
good place to start.

As a demo, I’ve downloaded the USDJPY October dataset.

using CSV, DataFrames, DataFramesMeta, Dates, Statistics
using Plots

It’s a big CSV file, so this isn’t the best way to store the data,
instead, stick it into a database like QuestDB
that are made for time series data.

usdjpy = CSV.read("USDJPY-2023-10.csv", DataFrame,
                 header = ["Ccy", "Time", "Bid", "Ask"])
usdjpy.Time = DateTime.(usdjpy.Time, dateformat"yyyymmdd HH:MM:SS.sss")
first(usdjpy, 4)
4×4 DataFrame
Row Ccy Time Bid Ask
String7 DateTime Float64 Float64
1 USD/JPY 2023-10-01T21:04:56.931 149.298 149.612
2 USD/JPY 2023-10-01T21:04:56.962 149.298 149.782
3 USD/JPY 2023-10-01T21:04:57.040 149.589 149.782
4 USD/JPY 2023-10-01T21:04:58.201 149.608 149.782

It’s simple data, just a bid and ask price with a time stamp.

usdjpy = @transform(usdjpy, :Spread = :Ask .- :Bid, 
                            :Mid = 0.5*(:Ask .+ :Bid), 
                            :Hour = round.(:Time, Minute(10)))

usdjpyHourly = @combine(groupby(usdjpy, :Hour), :open = first(:Mid), :close = last(:Mid), :avg_spread = mean(:Spread))
usdjpyHourly.Time = Time.(usdjpyHourly.Hour)

plot(usdjpyHourly.Hour, usdjpyHourly.open, lw =1, label = :none, title = "USDJPY Price Over October")

Looking at the hourly price over the month gives you flat periods
over the weekend.

USDJPY October price chart

Let’s look at the average spread (ask – bid) throughout the day.

hourlyAvgSpread = sort(@combine(groupby(usdjpyHourly, :Time), :avg_spread = mean(:avg_spread)), :Time)

plot(hourlyAvgSpread.Time, hourlyAvgSpread.avg_spread, lw =2, title = "USDJPY Intraday Spread", label = :none)

USDJPY average intraday spread

We see a big spike at 10 pm because of the day roll and the
secondary markets go offline briefly, which pollutes the data
bit. Looking at just midnight to 8 pm gives a more indicative picture.

plot(hourlyAvgSpread[hourlyAvgSpread.Time .<= Time("20:00:00"), :].Time, 
     hourlyAvgSpread[hourlyAvgSpread.Time .<= Time("20:00:00"), :].avg_spread, label = :none, lw=2,
     title = "USDJPY Intraday Spread")

USDJPY average intraday spread zoomed

In October spreads have generally been wider in the later part of the
day compared to the morning.

There is much more that can be done with this data across the
different currencies though. For example:

  1. How stable are correlations across currencies at different time frequencies?
  2. Can you replicate my microstructure noise post? How does the microstructure noise change between currencies
  3. Price updates are irregular, what are some statistical properties?

Daily Futures Market Data

Let’s zoom out a little bit now, decrease the frequency, and widen the
asset pool. Futures cover many asset classes, oil, coal, currencies,
metals, agriculture, stocks, bonds, interest rates, and probably
something else I’ve missed. This data is daily and roll adjusted, so
you have a continuous time series of an asset for many years. This means you can look at the classic momentum/mean reversion portfolio models and have a real stab at long-term trends.

The data is part of the Nasdaq data link product (formerly Quandl)
and once you sign up for an account you have access to the free
data. This futures dataset is
Wiki Continuous Futures
and after about 50 clicks and logging in, re-logging in, 2FA codes
you can view the pages.

To get the data you can go through one of the API packages in
your favourite language. In Julia, this means the QuandlAccess.jl
package which keeps things simple.

using QuandlAccess

futuresMeta = CSV.read("continuous.csv", DataFrame)
futuresCodes = futuresMeta[!, "Quandl Code"] .* "1"

quandl = Quandl("QUANDL_KEY")

function get_data(code)
    futuresData = quandl(TimeSeries(code))
    futuresData.Code .= code
    futuresData
end
futureData = get_data.(rand(futuresCodes, 4));

We have an array of all the available contracts futuresCodes and
sample 4 of them randomly to see what the data looks like.

p = []
for df in futureData
    append!(p, plot(df.Date, df.Settle, label = df.Code[1]))
end

plot(plot.(p)..., layout = 4)

Futures examples

  • ABY – WTI Brent Bullet – Spread between two oil futures on different
    exchanges.
  • TZ6 – Transco Zone 6 Non-N.Y. Natural Gas (Platts IFERC) Basis – Spread between
    two different natural gas contracts
  • PG – PG&E Citygate Natural Gas (Platts IFERC) Basis – Again, spread between
    two different natural gas contracts
  • FMJP – MSCI Japan Index – Index containing Japanese stocks

I’ve managed to randomly select 3 energy futures and one stock
index.

Project ideas with this data:

  1. Cross-asset momentum and mean reversion.
  2. Cross-asset correlations, does the price of oil drive some equity indexes?
  3. Macro regimes, can you pick out commonalities of market factors over
    the years?

Equity Order Book Data

Out there in the wild is the FI2010 dataset which is essentially a
sample of the full order book for five different
stocks on the Nordic stock exchange for 10 days. You have 10 levels of
prices and volumes and so can reconstruct the order book throughout the
day. It is the benchmark dataset for limit order book prediction and you will see it referenced
in papers that are trying to implement new prediction models. For
example Benchmark Dataset for Mid-Price Forecasting of Limit
Order Book Data with Machine Learning Methods

references some basic methods on the dataset and how they perform when
predicting the mid-price.

I found the dataset (as a Python package) here
https://github.com/simaki/fi2010 but it’s just stored as a CSV which
you can lift easily.

fi2010 = CSV.read(download("https://raw.githubusercontent.com/simaki/fi2010/main/data/data.csv"),DataFrame);

Update on 7/01/2024

Since posting this the above link has gone offline and the user has
deleted their Github account! Instead the data set can be found here:
https://etsin.fairdata.fi/dataset/73eb48d7-4dbc-4a10-a52a-da745b47a649/data
. I’ve not verified if its in the same format, so there might be some
additional work going from the raw data to how this blog post sets it
up. Thank’s to the commentators below pointing this out.

The data is wide (each column is a depth level of the price and
volume) so I turn each into a long data set and add the level, side
and variable as a new column.

fi2010Long = stack(fi2010, 4:48, [:Column1, :STOCK, :DAY])
fi2010Long = @transform(fi2010Long, :a = collect.(eachsplit.(:variable, "_")))
fi2010Long = @transform(fi2010Long, :var = first.(:a), :level = last.(:a), :side = map(x->x[2], :a))
fi2010Long = @transform(groupby(fi2010Long, [:STOCK, :DAY]), :Time = collect(1:length(:Column1)))
first(fi2010Long, 4)

The ‘book depth’ is the sum of the liquidity available at all the
levels and indicates how easy it is to trade the stock. As a
quick example, we can take the average of each stock per day and use
that as a proxy for the ease of trading these stocks.

intraDayDepth = @combine(groupby(fi2010Long, [:STOCK, :DAY, :var]), :avgDepth = mean(:value))
intraDayDepth = @subset(intraDayDepth, :var .== "VOLUME");
plot(intraDayDepth.DAY, intraDayDepth.avgDepth, group=intraDayDepth.STOCK, 
     marker = :circle, title = "Avg Daily Book Depth - FI2010")

FI2010 Intraday book depth

Stock 3 and 4 have the highest average depth, so most likely the
easier to trade, whereas Stock 1 has the thinnest depth. Stock 2 has
an interesting switch between liquid and not liquid.

So if you want to look beyond top-of-book data, this dataset provides
the extra level information needed and is closer to what a
professional shop is using. Better than trying to predict daily Yahoo
finance mid-prices with neural nets at least.

Build Your Own Crypto Datasets

If you want to take a further step back then being able to build the
tools that take in streaming data directly from the exchanges and
save that into a database is another way you can build out your
technical capabilities. This means you have full control over what you
download and save. Do you want just the top of book every update, the
full depth of the book, or just the reported trades?
I’ve written about this before, Getting Started with High Frequency Finance using Crypto Data and Julia, and learned a lot in the
process. Doing things this way means you have full control over the entire
process and can fully understand the data you are saving and any
additional quirks around the process.

Conclusion

Plenty to get stuck into and learn from. Being able to get the data
and loading it into an environment is always the first challenge and
learning how to do that with all these different types of data should
help you understand what these types of jobs entail.

Stat Arb – An Easy Walkthrough

By: Dean Markwick's Blog -- Julia

Re-posted from: https://dm13450.github.io/2023/07/15/Stat-Arb-Walkthrough.html

Statistical arbitrage (stat arb) is a pillar of quantitate trading
that relies on mean reversion to predict the future returns of an
asset. Mean reversion believes that if a stock has risen higher it’s
more likely to revert in the short term which is the opposite of a
momentum strategy that believes if a stock has been rising it will
continue to rise. This blog post will walk you the ‘the’ statistical
arbitrage paper
Statistical Arbitrage in the US Equities Market
apply it to a stock/ETF pair and then look at an intraday crypto stat
arb strategy.


Enjoy these types of posts? Then you should sign up for my newsletter.


I’m using Julia 1.9 and my AlpacaMarkets.jl package gets all the data we
need.

using AlpacaMarkets
using DataFrames, DataFramesMeta
using Dates
using Plots
using RollingFunctions, Statistics
using GLM

To start with we simply want the daily prices of JPM, XLF, and SPY. JPM
is the stock we think will go through mean reversion, XLF is the
financial sector ETF and SPY is the general SPY ETF.

We this that if JPM rises higher than XLF then it will soon revert and
trade lower shortly. Likewise, if JPM falls lower than XLF
then we think it will soon trade higher. Our mean reversion is all
about JPM around XLF. We’ve chosen XLF as it represents the general
financial sector landscape, so will represent the general sector outlook
more consistently than JPM on its own.

jpm = AlpacaMarkets.stock_bars("JPM", "1Day"; startTime = Date("2017-01-01"), limit = 10000, adjustment="all")[1]
xlf = AlpacaMarkets.stock_bars("XLF", "1Day"; startTime = Date("2017-01-01"), limit = 10000, adjustment="all")[1];
spy = AlpacaMarkets.stock_bars("SPY", "1Day"; startTime = Date("2017-01-01"), limit = 10000, adjustment="all")[1];

We want to clean the data to format the date correctly and select the
close and open columns.

function parse_date(t)
   Date(string(((split(t, "T")))[1]))
end

function clean(df, x) 
    df = @transform(df, :Date = parse_date.(:t), :Ticker = x, :NextOpen = [:o[2:end]; NaN])
   @select(df, :Date, :c, :o, :Ticker, :NextOpen)
end

Now we calculate the close-to-close log returns and format the data
into a column for each asset.

jpm = clean(jpm, "JPM")
xlf = clean(xlf, "XLF")
spy = clean(spy, "SPY")
allPrices = vcat(jpm, xlf, spy)
allPrices = sort(allPrices, :Date)

allPrices = @transform(groupby(allPrices, :Ticker), 
                      :Return = [NaN; diff(log.(:c))], 
                      :ReturnO = [NaN; diff(log.(:o))],
                      :ReturnTC = [NaN; diff(log.(:NextOpen))]);

modelData = unstack(@select(allPrices, :Date, :Ticker, :Return), :Date, :Ticker, :Return)
modelData = modelData[2:end, :];

last(modelData, 4)

4 rows × 4 columns

Date JPM XLF SPY
Date Float64? Float64? Float64?
1 2023-06-30 0.0138731 0.00864001 0.0117316
2 2023-07-03 0.00799894 0.00562049 0.00114985
3 2023-07-05 -0.00661524 -0.00206703 -0.0014883
4 2023-07-06 -0.00993581 -0.00860923 -0.00786148

Looking at the actual returns we can see that all three move in sync

plot(modelData.Date, cumsum(modelData.JPM), label = "JPM")
plot!(modelData.Date, cumsum(modelData.XLF), label = "XLF")
plot!(modelData.Date, cumsum(modelData.SPY), label = "SPY", legend = :left)

Stock returns

The key point is that they are moving in sync with each other. Given XLF has JPM included in it, this is expected but it also presents the opportunity to trade around any dispersion between the ETF and the individual name.

The Stat Arb Modelling Process

  • https://math.stackexchange.com/questions/345773/how-the-ornstein-uhlenbeck-process-can-be-considered-as-the-continuous-time-anal

Let’s think simply about pairs trading. We have two securities that we want to trade if their prices change too much, so our variable of interest is

\[e = P_1 – P_2\]

and we will enter a trade if \(e\) becomes large enough in both the positive and negative directions.

To translate that into a statistical problem we have two steps.

  1. Work out the difference between the two securities
  2. Model how the difference changes over time.

Step 1 is a simple regression of the stock vs the ETF we are trading against. Step 2 needs a bit more thought, but is still only a simple regression.

The Macro Regression – Stock vs ETF

In our data, we have the daily returns of JPM, the XLF ETF, and the SPY ETF. To work out the interdependence, it’s just a case of simple linear regression.

regModel = lm(@formula(JPM ~ XLF + SPY), modelData)
JPM ~ 1 + XLF + SPY

Coefficients:
──────────────────────────────────────────────────────────────────────────────────
                    Coef.   Std. Error       t  Pr(>|t|)   Lower 95%     Upper 95%
──────────────────────────────────────────────────────────────────────────────────
(Intercept)   0.000188758  0.000162973    1.16    0.2469  -0.0001309   0.000508417
XLF           1.35986      0.0203485     66.83    <1e-99   1.31995     1.39977
SPY          -0.363187     0.0260825    -13.92    <1e-41  -0.414345   -0.312028
──────────────────────────────────────────────────────────────────────────────────

From the slope of the model, we can see that JPM = 1.36XLF – 0.36SPY,
so JPM has a \(\beta\) of 1.36 to the XLF index and a \(\beta\) of
-0.36 to the SPY ETF, or general market. So each day, we can
approximate JPMs return by multiplying the XLF returns and SPY
returns.

This is our economic factor model, which describes from a
‘big picture’ kind of way how the stock trades vs the general market (SPY)
and its sector-specific market (XLF).

What we need to do next is look at what this model doesn’t explain
and try and describe that.

The Reversion Regression

Any difference around this model can be explained by the summation of
the residuals over time. In the paper the sum of the residuals
over time is called the ‘auxiliary process’ and this is the data behind
the second regression.

plot(scatter(modelData.Date, residuals(regModel), label = "Residuals"),
       plot(modelData.Date,cumsum(residuals(regModel)),
       label = "Aux Process"),
	  layout = (2,1))

Auxiliary process

We believe the auxiliary process (cumulative sum of the residuals)
can be modeled using a
Ornstein-Uhlenbeck
(OU) process.

An OU process is a type of differential equation that displays mean
reversion behaviour. If the process falls away from its average level
then it will be forced back.

\[dX = \kappa (m – X(t))dt + \sigma \mathrm{d} W\]

\(\kappa\) represents how quickly the mean reversion occurs.

To fit this type of process we need to recognise that the above differential form of an OU process can be discretised to become a simple AR(1) model where the model parameters can be transformed to get the OU parameters.

We now fit the OU process onto the cumulative sum of the residuals from the first model. If the residuals have some sort of structure/pattern then this means our original model was missing some variable that explains the difference.

X = cumsum(residuals(regModel))
xDF = DataFrame(y=X[2:end], x = X[1:end-1])
arModel = lm(@formula(y~x), xDF)
y ~ 1 + x

Coefficients:
─────────────────────────────────────────────────────────────────────────────────
                  Coef.   Std. Error       t  Pr(>|t|)     Lower 95%    Upper 95%
─────────────────────────────────────────────────────────────────────────────────
(Intercept)  4.41618e-6  0.000162655    0.03    0.9783  -0.000314618  0.000323451
x            0.997147    0.00186733   534.00    <1e-99   0.993484     1.00081
─────────────────────────────────────────────────────────────────────────────────

We take these coefficients and transform them into the parameters from the paper.

varEta = var(residuals(arModel))
a, b = coef(arModel)
k = -log(b)*252
m = a/(1-b)
sigma = sqrt((varEta * 2 * k) / (1-b^2))
sigma_eq = sqrt(varEta / (1-b^2))
[m, sigma_eq]
2-element Vector{Float64}:
 0.0015477568390823153
 0.08709971423424319

So \(m\) gives us the average level and \(\sigma_{\text{eq}}\) the
appropriate scale.

Now to build the mean reversion signal. We still have \(X\) as our
auxiliary process which we believe is mean reverting. We now have the
estimated parameters on the scale of this mean reversion so we can
transform the auxiliary process by these parameters and use this to see when the process is higher or lower than the model suggests it should be.

modelData.Score = (X .- m)./sigma_eq;

plot(modelData.Date, modelData.Score, label = "s")
hline!([-1.25], label = "Long JPM, Short XLF", color = "red")
hline!([-0.5], label = "Close Long Position", color = "red", ls=:dash)

hline!([1.25], label = "Short JPM, Long XLF", color = "purple")
hline!([0.75], label = "Close Short Position", color = "purple", ls = :dash, legend=:topleft)

Stock signal

The red lines indicate when JPM has diverged from XLF on the negative side, i.e. we expect JPM to move higher and XLF to move lower. We enter the position if s < -1.25 (solid red line) and exit the position when s > -0.5 (dashed red line).

  • Buy to open if \(s < -s_{bo}\) (< -1.25) Buy 1 JPM, sell Beta XLF
  • Close long if \(s > -s_{c}\) (-0.5)

The purple line is the same but in the opposite direction.

  • Sell to open if \(s > s_{so}\) (>1.25) Sell 1 JPM, buy Beta XLF
  • Close short if \(s < s_{bc}\) (<0.75)

That’s the modeling part done. We model how the stock moves based on
the overall market and then any differences to this we use the OU
process to come up with the mean reversion parameters.

So, does it make money?

Backtesting the Stat Arb Strategy

To backtest this type of model we have to roll through time and
calculate both regressions to construct the signal.

A couple of new additions too

  • We shift and scale the returns when doing the macro regression.
  • The auxiliary process on the last day is always 0, which makes
    calculating the signal simple.
paramsRes = Array{DataFrame}(undef, length(90:(nrow(modelData) - 90)))

for (j, i) in enumerate(90:(nrow(modelData) - 90))
    modelDataSub = modelData[i:(i+90), :]
    modelDataSub.JPM = (modelDataSub.JPM .- mean(modelDataSub.JPM)) ./ std(modelDataSub.JPM)
    modelDataSub.XLF = (modelDataSub.XLF .- mean(modelDataSub.XLF)) ./ std(modelDataSub.XLF)
    modelDataSub.SPY = (modelDataSub.SPY .- mean(modelDataSub.SPY)) ./ std(modelDataSub.SPY)
    
    macroRegr = lm(@formula(JPM ~ XLF + SPY), modelDataSub)
    auxData = cumsum(residuals(macroRegr))
    ouRegr = lm(@formula(y~x), DataFrame(x=auxData[1:end-1], y=auxData[2:end]))
    
    varEta = var(residuals(ouRegr))
    a, b = coef(ouRegr)
    k = -log(b)*252
    m = a/(1-b)
    sigma = sqrt((varEta * 2 * k) / (1-b^2))
    sigma_eq = sqrt(varEta / (1-b^2))
    
    
    paramsRes[j] = DataFrame(Date= modelDataSub.Date[end], 
                             MacroBeta_XLF = coef(macroRegr)[2], MacroBeta_SPY = coef(macroRegr)[3], MacroAlpha = coef(macroRegr)[1],
                             VarEta = varEta, OUA = a, OUB = b, OUK = k, Sigma = sigma, SigmaEQ=sigma_eq,
                             Score = -m/sigma_eq)
    
end

paramsRes = vcat(paramsRes...)
last(paramsRes, 4)

4 rows × 11 columns (omitted printing of 4 columns)

Date MacroBeta_XLF MacroBeta_SPY MacroAlpha VarEta OUA OUB
Date Float64 Float64 Float64 Float64 Float64 Float64
1 2023-06-30 0.974615 -0.230273 1.10933e-17 0.331745 0.175358 0.830417
2 2023-07-03 0.96943 -0.228741 -5.73883e-17 0.331222 0.198176 0.826816
3 2023-07-05 0.971319 -0.230438 2.38846e-17 0.335844 0.242754 0.841018
4 2023-07-06 0.974721 -0.232765 5.09875e-17 0.331695 0.256579 0.823822

The benefit of doing it this way also means we can see how each
\(\beta\) in the macro regression evolves.

plot(paramsRes.Date, paramsRes.MacroBeta_XLF, label = "XLF Beta")
plot!(paramsRes.Date, paramsRes.MacroBeta_SPY, label = "SPY Beta")

Stock betas

Good to see they are consistent in their signs and generally don’t
vary a great deal.

In the OU process, we are also interested in the speed of the mean
reversion as we don’t want to take a position that is very slow to
revert to the mean level.

kplot = plot(paramsRes.Date, paramsRes.OUK, label = :none)
kplot = hline!([252/45], label = "K Threshold")

In the paper, they suggest making sure the reversion happens with half
of the estimation period. As we are using 90 days, that means the
horizontal line shows when \(k\) is above this value.

svg

Plotting the score function also shows how the model wants to go
long/short the different components over time.

splot = plot(paramsRes.Date, paramsRes.Score, label = "Score")
hline!([-1.25], label = "Long JPM, Short XLF", color = "red")
hline!([-0.5], label = "Close Long Position", color = "red", ls=:dash)

hline!([1.25], label = "Short JPM, Long XLF", color = "purple")
hline!([0.75], label = "Close Short Position", color = "purple", ls = :dash)

Stock position

We run through the allocation procedure and label whether we are long
(+1) or short (-\(\beta\)) an amount of either the stock or ETFs.

paramsRes.JPM_Pos .= 0.0
paramsRes.XLF_Pos .= 0.0
paramsRes.SPY_Pos .= 0.0

for i in 2:nrow(paramsRes)
    
    if paramsRes.OUK[i] > 252/45
    
        if paramsRes.Score[i] >= 1.25
            paramsRes.JPM_Pos[i] = -1
            paramsRes.XLF_Pos[i] = paramsRes.MacroBeta_XLF[i]
            paramsRes.SPY_Pos[i] = paramsRes.MacroBeta_SPY[i]
        elseif paramsRes.Score[i] >= 0.75 && paramsRes.JPM_Pos[i-1] == -1
            paramsRes.JPM_Pos[i] = -1
            paramsRes.XLF_Pos[i] = paramsRes.MacroBeta_XLF[i]    
            paramsRes.SPY_Pos[i] = paramsRes.MacroBeta_SPY[i]
        end

        if paramsRes.Score[i] <= -1.25
            paramsRes.JPM_Pos[i] = 1
            paramsRes.XLF_Pos[i] = -paramsRes.MacroBeta_XLF[i]   
            paramsRes.SPY_Pos[i] = -paramsRes.MacroBeta_SPY[i]
        elseif paramsRes.Score[i] <= -0.5 && paramsRes.JPM_Pos[i-1] == 1
            paramsRes.JPM_Pos[i] = 1
            paramsRes.XLF_Pos[i] = -paramsRes.MacroBeta_XLF[i] 
            paramsRes.SPY_Pos[i] = -paramsRes.MacroBeta_SPY[i]
        end
    end
        
end

To make sure we use the right price return we lead the return columns
by one so that we enter the position and get the next return.

modelData = @transform(modelData, :NextJPM= lead(:JPM, 1), 
                                   :NextXLF = lead(:XLF, 1),
                                   :NextSPY = lead(:SPY, 1))

paramsRes = leftjoin(paramsRes, modelData[:, [:Date, :NextJPM, :NextXLF, :NextSPY]], on=:Date)

portRes = @combine(groupby(paramsRes, :Date), :Return = :NextJPM .* :JPM_Pos .+ :NextXLF .* :XLF_Pos .+ :NextSPY .* :SPY_Pos);

plot(portRes.Date, cumsum(portRes.Return), label = "Stat Arb Return")

Stock portfolio return

Sad trombone noise. This is not a great result as we’ve ended up
negative over the period. However, given the paper is 15 years old it
would be very rare to still be able to make money this way
after everyone knows how to do it. Plus, I’ve only used one stock vs
the ETF portfolio, you typically want to diversify out and use all the
stocks in the ETF to be long and short multiple single names and use
the ETF as a minimal hedge,

The good thing about it being a negative result means that we don’t have
to start considering transaction costs or other annoying things like
that.

When we break out the components of the strategy we can see that
it appears to pick out the right times to short/long JPM and
SPY, its the hedging with the XLF ETF that is bringing the portfolio
down.

plot(paramsRes.Date, cumsum(paramsRes.NextJPM .* paramsRes.JPM_Pos), label = "JPM Component")
plot!(paramsRes.Date, cumsum(paramsRes.NextXLF .* paramsRes.XLF_Pos), label = "XLF Component")
plot!(paramsRes.Date, cumsum(paramsRes.NextSPY .* paramsRes.SPY_Pos), label = "SPY Component")
plot!(portRes.Date, cumsum(portRes.Return), label = "Stat Arb Portfolio")

Stock portfolio components

So whilst naively trying to trade the stat arb portfolio is probably
a loss maker, there might be some value in using the model as a signal
input or overlay to another strategy.

What about if we up the frequency and look at intraday stat arb?

Intraday Stat Arb in Crypto – ETH and BTC

Crypto markets are open 24 hours a day 7 days a week and so gives that
much more opportunity to build out a continuous trading model. We look
back since the last year and repeat the backtesting process to see if
this bares any fruit.

Once again AlpacaMarkets gives us an easy way to pull the hourly bar
data for both ETH and BTC.

btcRaw = AlpacaMarkets.crypto_bars("BTC/USD", "1Hour"; startTime = now() - Year(1), limit = 10000)[1]
ethRaw = AlpacaMarkets.crypto_bars("ETH/USD", "1Hour"; startTime = now() - Year(1), limit = 10000)[1];

btc = @transform(btcRaw, :ts = DateTime.(chop.(:t)), :Ticker = "BTC")
eth = @transform(ethRaw, :ts = DateTime.(chop.(:t)), :Ticker = "ETH")

btc = btc[:, [:ts, :Ticker, :c]]
eth = eth[:, [:ts, :Ticker, :c]]

allPrices = vcat(btc, eth)
allPrices = sort(allPrices, :ts)

allPrices = @transform(groupby(allPrices, :Ticker), 
                      :Return = [NaN; diff(log.(:c))]);

modelData = unstack(@select(allPrices, :ts, :Ticker, :Return), :ts, :Ticker, :Return);
modelData = @subset(modelData, .! isnan.(:ETH .+ :BTC))

Plotting out the returns we can see they are loosely related just like
the stock example.

plot(modelData.ts, cumsum(modelData.BTC), label = "BTC")
plot!(modelData.ts, cumsum(modelData.ETH), label = "ETH")

Crypto returns

We will be using BTC as the ‘index’ and see how ETH is related.

regModel = lm(@formula(ETH ~ BTC), modelData)
ETH ~ 1 + BTC

Coefficients:
─────────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error       t  Pr(>|t|)    Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────────────────
(Intercept)  7.72396e-6  3.64797e-5    0.21    0.8323  -6.37847e-5  7.92327e-5
BTC          1.115       0.00673766  165.49    <1e-99   1.10179     1.12821
─────────────────────────────────────────────────────────────────────────────

Fairly high beta for ETH and against BTC. We use a 90-hour rolling window now instead of a 90 day.

window = 90

paramsRes = Array{DataFrame}(undef, length(window:(nrow(modelData) - window)))

for (j, i) in enumerate(window:(nrow(modelData) - window))
    modelDataSub = modelData[i:(i+window), :]
    modelDataSub.ETH = (modelDataSub.ETH .- mean(modelDataSub.ETH)) ./ std(modelDataSub.ETH)
    modelDataSub.BTC = (modelDataSub.BTC .- mean(modelDataSub.BTC)) ./ std(modelDataSub.BTC)
    
    macroRegr = lm(@formula(ETH ~ BTC), modelDataSub)
    auxData = cumsum(residuals(macroRegr))
    ouRegr = lm(@formula(y~x), DataFrame(x=auxData[1:end-1], y=auxData[2:end]))
    varEta = var(residuals(ouRegr))
    a, b = coef(ouRegr)
    k = -log(b)/((1/24)/252)
    m = a/(1-b)
    sigma = sqrt((varEta * 2 * k) / (1-b^2))
    sigma_eq = sqrt(varEta / (1-b^2))
    
    
    paramsRes[j] = DataFrame(ts= modelDataSub.ts[end], MacroBeta = coef(macroRegr)[2], MacroAlpha = coef(macroRegr)[1],
                             VarEta = varEta, OUA = a, OUB = b, OUK = k, Sigma = sigma, SigmaEQ=sigma_eq,
                             Score = -m/sigma_eq)
    
end

paramsRes = vcat(paramsRes...)

Again, looking at \(\beta\) overtime we see there has been a sudden
shift

plot(plot(paramsRes.ts, paramsRes.MacroBeta, label = "Macro Beta", legend = :left), 
     plot(paramsRes.ts, paramsRes.OUK, label = "K"), layout = (2,1))

Crypto params

Interesting that there has been a big change in \(\beta\) between ETH and BTC
recently that has suddenly reverted. Ok, onto the backtesting again.

paramsRes.ETH_Pos .= 0.0
paramsRes.BTC_Pos .= 0.0

for i in 2:nrow(paramsRes)
    
    if paramsRes.OUK[i] > (252/(1/24)/45)
    
        if paramsRes.Score[i] >= 1.25
            paramsRes.ETH_Pos[i] = -1
            paramsRes.BTC_Pos[i] = paramsRes.MacroBeta[i]   
        elseif paramsRes.Score[i] >= 0.75 && paramsRes.ETH_Pos[i-1] == -1
            paramsRes.ETH_Pos[i] = -1
            paramsRes.BTC_Pos[i] = paramsRes.MacroBeta[i]     
        end

        if paramsRes.Score[i] <= -1.25
            paramsRes.ETH_Pos[i] = 1
            paramsRes.BTC_Pos[i] = -paramsRes.MacroBeta[i]   
        elseif paramsRes.Score[i] <= -0.5 && paramsRes.ETH_Pos[i-1] == 1
            paramsRes.ETH_Pos[i] = 1
            paramsRes.BTC_Pos[i] = -paramsRes.MacroBeta[i]     
        end
    end
        
end


modelData = @transform(modelData, :NextETH= lead(:ETH, 1), :NextBTC = lead(:BTC, 1))

paramsRes = leftjoin(paramsRes, modelData[:, [:ts, :NextETH, :NextBTC]], on=:ts)

portRes = @combine(groupby(paramsRes, :ts), :Return = :NextETH .* :ETH_Pos .+ :NextBTC .* :BTC_Pos);

plot(portRes.ts, cumsum(portRes.Return))

Crypto stat arb returns

This looks slightly better. At least it is positive at the end of the
testing period.

plot(paramsRes.ts, cumsum(paramsRes.NextETH .* paramsRes.ETH_Pos), label = "ETH Component")
plot!(paramsRes.ts, cumsum(paramsRes.NextBTC .* paramsRes.BTC_Pos), label = "BTC Component")
plot!(portRes.ts, cumsum(portRes.Return), label = "Stat Arb Portfolio", legend=:topleft)

Crypto components

Again, the components of the portfolio seem to be ok in the ETH case
but generally, this is from the overall long bias. Unlike the JPM/XLF
example, there isn’t much more diversification we can add anything that might
help. We could add in more crypto assets, or an equity/gold angle, but
it becomes more of an asset class arb than something truly
statistical.

Conclusion

The original paper is one of those that all quants get recommended to
read and statistical arbitrage is a concept that you probably
understand in theory but practically doing is another
question. Hopefully, this blog post gets you up to speed with the
basic concepts and how to implement them.
It can be boiled down to two steps.

  1. Model as much as you can with a simple regression
  2. Model what’s left over as an OU process.

It can work with both high-frequency and low-frequency data, so have a
look at different combinations or assets and see if you have more luck
then I did backtesting.

If you do end up seeing something positive, make sure you are
backtesting properly!

Predicting a Successful Mt Everest Climb

By: Dean Markwick's Blog -- Julia

Re-posted from: https://dm13450.github.io/2023/04/27/Predicting-a-Successful-Mt-Everest-Climb.html

Climbing Mount Everest is a true test of human endurance with a real risk of death. The Himalayan Database is a data repository, available for free, that records various details about the peaks, people, and expeditions to climb the different Nepalese Himalayan mountains and provides the data for this analysis. In this blog post, I’ll show you how to load the database and explore some of the features before building a model that tries to predict how you can successfully climb Mount Everest.


Enjoy these types of posts? Then you should sign up for my newsletter.


Over the past few months, I’ve been training for a marathon and have been trying to understand the best way to train and maximise my performance. This means extensive research and reading to get an idea of what the science says. Endure by Alex Hutchinson is a book I recommend and it takes a look at the way the human body functions over long distances/extreme tasks – such as climbing Mount Everest with no oxygen or ultra, ultra marathoners with an overarching reference to the Breaking2 project by Nike.

In one section the book references something called the Himalayan Database which is a database of expeditions to Mount Everest and other mountains in the Himalayas. As a data lover, this piqued my interest as an interesting data source and something a bit different from my usual data explorations around finance/sports. So I downloaded the database, worked out how to load it, and had a poke around the data.

If you go to the website, himalayandatabase, you can download the data yourself and follow along.

The database is distributed in the DBF format and the website itself is a bit of a blast from the past. It expects you to download a custom data viewer program to look at the data, but thankfully there are people in the R world that demonstrated how to load the raw DBF files. I’ve taken inspiration from this, downloaded the DBF files, loaded up DBFTables.jl and loaded the data into Julia.

using DataFrames, DataFramesMeta
using Plots, StatsPlots
using Statistics
using Dates

I hit a roadblock straight away and had to patch DBFTables.jl with a new datatype that the Himalayan database uses that isn’t in the original spec. Pull request here if you are interested: DBFTables.jl – Add M datatype. Another feather to my open-source contributions hat!

using DBFTables

There are 6 tables in the database but we are only interested in 3 of them:

  • exped details the expeditions. So each trip up a mountain by one or more people.
  • peaks has the details on the mountains in the mountains in the Himalayas.
  • members which has information on each person that has attempted to climb one of the mountains.
function load_dbf(fn)
    dbf = DBFTables.Table(fn)
    DataFrame(dbf)
end

exped = load_dbf("exped.DBF")
peaks = load_dbf("peaks.DBF")
members = load_dbf("members.DBF");

Taking a look at the mountains with the most entries.

first(sort(@combine(groupby(exped, :PEAKID), :N = length(:YEAR)),
       :N, rev=true), 3)
Row PEAKID N
String Int64
1 EVER 2191
2 AMAD 1456
3 CHOY 1325

Unsurprisingly Mount Everest is the most attempted mountain with Ama Dablam in second and Cho Oyu in third place.

Exploring the Himalayas Data

We start with some basic groupings to look at how the data is distributed per year.

expSummary = @combine(groupby(@subset(members, :CALCAGE .> 0), :EXPID),  
         :N = length(:CALCAGE),
         :YoungestAge=minimum(:CALCAGE), 
         :AvgAge = mean(:CALCAGE),
         :NFemale = sum(:SEX .== "F"))

expSummary = leftjoin(expSummary, 
              @select(exped, :EXPID, :PEAKID, :BCDATE, :SMTDATE, :MDEATHS, :HDEATHS, :SUCCESS1), on = :EXPID)
expSummary = leftjoin(expSummary, @select(peaks, :PEAKID, :PKNAME), on = :PEAKID)

everest = dropmissing(@subset(expSummary, :PKNAME .== "Everest"))
everest = @transform(everest, :DeathRate = (:MDEATHS .+ :HDEATHS) ./ :N, :Year = floor.(:BCDATE, Dates.Year))
everestYearly = @combine(groupby(everest, :Year), :N = sum(:N), 
                          :Deaths = sum(:MDEATHS + :HDEATHS),
                        :Success = sum(:SUCCESS1))
everestYearly = @transform(everestYearly, :DeathRate = :Deaths ./ :N, :SuccessRate = :Success ./ :N)
everestYearly = @transform(everestYearly, 
                            :DeathRateErr = sqrt.(:DeathRate .* (1 .- :DeathRate)./:N),
                            :SuccessRateErr = sqrt.(:SuccessRate .* (1 .- :SuccessRate)./:N));

What is the average age of those who climb Mount Everest?

scatter(everest.SMTDATE, everest.AvgAge, label = "Average Age of Attempting Everest")

Average age of climbing Mount Everest

By eye, it looks like the average age has been steadily increasing. Generally, your expedition’s average age needs to be at least 30. Given it costs a small fortune to climb Everest this is probably more of a ‘need money’ rather than a look at the overall fitness of a 30-year-old.

When we look at the number of attempts yearly and the annual death rate:

plot(bar(everestYearly.Year, everestYearly.N, label = "Number of Attempts in a Year"),
    scatter(everestYearly.Year, everestYearly.DeathRate, yerr=everestYearly.DeathRateErr, 
            label = "Yearly Death Rate"),
     layout = (2,1))

Yearly death rate on Mount Everest

scatter(everestYearly[everestYearly.Year .> Date("2000-01-01"), :].Year, 
        everestYearly[everestYearly.Year .> Date("2000-01-01"), :].DeathRate, 
        yerr=everestYearly[everestYearly.Year .> Date("2000-01-01"), :].DeathRateErr, 
        label = "Yearly Death Rate")

20th century death rate

But how ‘easy’ has it been to conquer Everest over the years? Looking at the success rate at best 10% of attempted expeditions are completed, which highlights how tough it is. Given some of the photos of people queueing to reach the summit, you’d think it would be much easier, but out of the 400 expeditions, less than 100 will make it.

scatter(everestYearly.Year, everestYearly.SuccessRate, yerr=everestYearly.SuccessRateErr, 
        label = "Mt. Everest Success Rate")

Mount Everest success rate

A couple of interesting points from this graph:

  • 2014 was an outlier due to an avalanche that lead to Mount Everest being closed from April until the rest of the year.
  • No one successfully climbed Mt Everest in 2015 because of the earthquake.
  • Only 1 success in 2020 before the pandemic closed everything.

So a decent amount of variation in what can happen in a given year on Mt Everest.

Predicting Success

The data has some interesting quirks and we now turn to our next step, trying to build a model. Endurance was about what it takes to complete impressive human feats. So let’s do that here, can we use the database to predict and explain what leads to success?

We will be using the MLJ.jl package again to fit some machine learning models easily.

using MLJ, LossFunctions

To start with we are going to pull out the relevant factors that we think will help climb a mountain. Not specifically Everest, but any of the Himalayan peaks from the database.

modelData = members[:, ["MSUCCESS", "PEAKID","MYEAR", 
                        "MSEASON", "SEX", "CALCAGE", "CITIZEN", "STATUS", 
                         "MROUTE1", "MO2USED"]]
modelData = @subset(modelData, :PEAKID .== "EVER")
modelData.MROUTE1 = modelData.PEAKID .* "_" .* string.(modelData.MROUTE1)
modelData = dropmissing(modelData)
modelData.MYEAR = parse.(Int, modelData.MYEAR)
modelData = @subset(modelData, :CALCAGE .> 0)

print(size(modelData))
(22583, 10)
first(modelData, 4)
4×10 DataFrame
Row MSUCCESS PEAKID MYEAR MSEASON SEX CALCAGE CITIZEN STATUS MROUTE1 MO2USED
Bool String Int64 Int64 String Int64 String String String Bool
1 false EVER 1963 1 M 36 USA Climber EVER_2 true
2 true EVER 1963 1 M 31 USA Climber EVER_1 true
3 false EVER 1963 1 M 27 USA Climber EVER_1 false
4 false EVER 1963 1 M 26 USA Climber EVER_2 true

Just over 22k rows and 10 columns, so plenty of data to sink our teeth into. MLJ needs us to define the Multiclass type of the factor variables and we also want to split out the predictor and predictors then split out into the test/train sets.

modelData2 = coerce(modelData,
                    :MSUCCESS => OrderedFactor,
                    :MSEASON => Multiclass,
                    :SEX => Multiclass,
                    :CITIZEN => Multiclass,
                    :STATUS => Multiclass, 
                    :MROUTE1 => Multiclass,
                    :MO2USED => OrderedFactor);
y, X = unpack(modelData2, ==(:MSUCCESS), colname -> true; rng=123);

train, test = partition(eachindex(y), 0.7, shuffle=true);

All these multi-class features need to be one-hot encoded, so we use the continuous encoder. The workflow is:

  • Create the encoder/standardizer.
  • Train on the data
  • Transform the data

This gives confidence that you aren’t leaking the training data into the test data.

encoder = ContinuousEncoder()
encMach = machine(encoder, X) |> fit!
X_encoded = MLJ.transform(encMach, X);

X_encoded.MO2USED = X_encoded.MO2USED .- 1;
standardizer = @load Standardizer pkg=MLJModels
stanMach = fit!(machine(
                 standardizer(features = [:CALCAGE]),X_encoded); 
                rows=train)
X_trans = MLJ.transform(stanMach, X_encoded);
X_trans.MYEAR = X_trans.MYEAR .- minimum(X_trans.MYEAR);
plot(
    histogram(X_trans.CALCAGE, label = "Age"),
    histogram(X_trans.MYEAR, label = "Year"),
    histogram(X_trans.MO2USED, label = "02 Used")
    )

Variable distribution

Looking at the distribution of the transformed data gives a good indication of how varied these variables change post-transformation.

Model Fitting using MLJ.jl

I’ll now explore some different models using the MLJ.jl workflow similar to my previous post on Machine Learning Property Loans for Fun and Profit. MLJ.jl gives you a common interface to fit a variety of different models and evaluate their performance all from one package, so handy here when we want to look at a simple linear model and also an XGBoost model.

Let’s start with our null model to get the baseline.

constantModel = @load ConstantClassifier pkg=MLJModels

constMachine = machine(constantModel(), X_trans, y)

evaluate!(constMachine,
        rows=train,
         resampling=CV(shuffle=true),
         operation = predict_mode,
         measures=[accuracy, balanced_accuracy, kappa],
         verbosity=0)
Model Accuracy Kappa
Null 0.512 0.0

For classification tasks, the null model is essentially tossing a coin, so the accuracy will be around 50% and the \(\kappa\) is zero.

Next we move on to the simple linear model using all the features.

logisticClassifier = @load LogisticClassifier pkg=MLJLinearModels verbosity=0

lmMachine = machine(logisticClassifier(lambda=0), X_trans, y)

fit!(lmMachine, rows=train, verbosity=0)

evaluate!(lmMachine, 
          rows=train, 
          resampling=CV(shuffle=true), 
          operation = predict_mode,
          measures=[accuracy, balanced_accuracy, kappa], verbosity = 0)
Model Accuracy Kappa
Null 0.512 0.0
Linear Regression 0.884 0.769

This gives a good improvement over the null model, so indicates our included features have some sort of information useful in predicting success.

Inspecting the parameters indicates how strong each variable is. Route 0 leads to a large reduction in the probability of success whereas using oxygen increases the probability of success. Climbing in the Autumn or Winter also looks like it reduces your chance of success.

params = mapreduce(x-> DataFrame(Param=collect(x)[1], Value = collect(x)[2]), 
                   vcat, fitted_params(lmMachine).coefs)
params = sort(params, :Value)

vcat(first(params, 5), last(params, 5))
10×2 DataFrame
Row Param Value
Symbol Float64
1 MROUTE1__EVER_0 -4.87433
2 SEX__F -1.97957
3 SEX__M -1.94353
4 MSEASON__3 -1.39251
5 MSEASON__4 -1.1516
6 MROUTE1__EVER_2 0.334305
7 CITIZEN__USSR 0.43336
8 CITIZEN__Russia 0.518197
9 MROUTE1__EVER_1 0.697601
10 MO2USED 3.85578

XGBoost Time

What’s a model if we’ve not tried xgboost to squeeze the most performance out of all the data? Easy to fit using MLJ and without having to do any special lifting.

xgboostModel = @load XGBoostClassifier pkg=XGBoost verbosity = 0

xgboostmodel = xgboostModel()

xgbMachine = machine(xgboostmodel, X_trans, y)

evaluate!(xgbMachine,
        rows=train,
         resampling=CV(nfolds = 6, shuffle=true),
         measures=[accuracy,balanced_accuracy, kappa],
         verbosity=0)
Model Accuracy Kappa
Null 0.512 0.0
Linear Regression 0.884 0.769
XGBoost 0.889 0.778

We get 88.9% accuracy compared to the linear regression 88.4% and a \(\kappa\) increase too, so looking like a good model.

How Do I Succeed in the Climbing Mount Everest?

The whole point of these models is to try and work out what combination of these parameters gets us the highest probability of success on a mountain. We want some idea of feature importance that can direct us to the optimal approach to a mountain. Should I be an Austrian Doctor or is there an easier route that should be taken?

With xgboost we can use the feature_importances function to do exactly what it says on the tin and look at what features are most important in the model.

fi = feature_importances(xgbMachine)
fi = mapreduce(x-> DataFrame(Param=collect(x)[1], Value = collect(x)[2]), 
                   vcat, fi)
first(fi, 5)
5×2 DataFrame
Row Param Value
Symbol Float32
1 MO2USED 388.585
2 MROUTE1__EVER_0 46.3129
3 STATUS__H-A Worker 15.0079
4 CITIZEN__Nepal 11.6299
5 CITIZEN__UK 4.25651

So using oxygen, taking the 0th route up, being an H-A Worker, and either being from Nepal or a UK citizen appears to have the greatest impact on being successful. Using oxygen is an obvious benefit/cannot be avoided and I don’t think anyone believes that their chance of success would be higher without oxygen. Being Nepalese is the one I would struggle with.

How does the model perform on the hold-out set? We’ve got 30% of the data that hasn’t been used in the fitting that can also validate how well the model performs.

modelNames = ["Null", "LM", "XGBoost"]
modelMachines = [constMachine, 
               lmMachine, 
               xgbMachine]


aucRes = DataFrame(Model = modelNames,
    AUC = map(x->auc(MLJ.predict(x,rows=test), y[test]), 
                modelMachines))
kappaRes = DataFrame(Kappa = map(x->kappa(MLJ.predict_mode(x,rows=test), y[test]), modelMachines),
                     Accuracy = map(x->accuracy(MLJ.predict_mode(x,rows=test), y[test]), modelMachines),
          Model = modelNames)
evalRes = leftjoin(aucRes, kappaRes, on =:Model)
3×4 DataFrame
Row Model AUC Kappa Accuracy
String Float64 Float64? Float64?
1 Null 0.5 0.0 0.512768
2 LM 0.937092 0.768528 0.883838
3 XGBoost 0.939845 0.775408 0.88738

On the test set, the XGBoost model is only slightly better than the linear model in terms of \(\kappa\) and accuracy. It’s worse when measuring the AUC, so this is setting alarm bells ringing that the model isn’t quite there yet.

How Does Oxygen Change the Probability of Success?

X_trans2 = copy(X_trans[1:2, :])
X_trans2.MO2USED  = 1 .- X_trans2.MO2USED


predict(xgbMachine, vcat(X_trans[1:2, :], X_trans2))
4-element CategoricalDistributions.UnivariateFiniteVector{OrderedFactor{2}, Bool, UInt32, Float32}:
 UnivariateFinite{OrderedFactor{2}}(false=>0.245, true=>0.755)
 UnivariateFinite{OrderedFactor{2}}(false=>1.0, true=>0.000227)
 UnivariateFinite{OrderedFactor{2}}(false=>0.901, true=>0.0989)
 UnivariateFinite{OrderedFactor{2}}(false=>1.0, true=>0.000401)

By taking the first two entries and switching whether they used oxygen or not we can see how the outputted probability of success changes. In each case, it provides a dramatic shift in the probabilities. Again, from the feature importance output, we know this is the most important variable but it does seem to be a bit dominating in terms of what happens with and without oxygen.

Probability Calibration

Finally, let’s look at the calibration of the models.

using CategoricalArrays

modelData.Prediction = pdf.(predict(xgbMachine, X_trans), 1)

lData = @transform(modelData, :prob = cut(:Prediction, (0:0.1:1.1)))
gData = groupby(lData, :prob)
calibData = @combine(gData, :N = length(:MSUCCESS), 
                            :SuccessRate = mean(:MSUCCESS), 
                            :PredictedProb = mean(:Prediction))

calibData = @transform(calibData, :Err = 1.96 .* sqrt.((:PredictedProb .* (1 .- :PredictedProb)) ./ :N))




p = plot(calibData[:, :PredictedProb], 
         calibData[:, :SuccessRate], 
         yerr = calibData[:, :Err],
    seriestype=:scatter, label = "XGBoost Calibration")

p = plot!(p, 0:0.1:1, 0:0.1:1, label = :none)
h = histogram(modelData.Prediction, normalize=:pdf, label = "Prediction Distribution")

plot(p, h, layout = (2,1))

Calibration plot

To say the model is poorly calibrated is an understatement. There is no association of an increased success rate with the increase in model probability and from the distribution of predictions we can see it’s quite binary, there isn’t an even distribution to the output.
So whilst the evaluation metrics look better than a null model, the reality is that the model isn’t doing anything. With all the different factors in the model matrix, there is likely some degeneracy in the data, such that a single occurrence of a variable ends up predicting success or not. There is potentially an issue with using the member’s table instead of the expedition table, as whether the expedition was successful or not will lead to multiple members being successful.

Conclusion

Overall it’s an interesting data set even if it does take a little work to get it loaded into Julia. There is a wealth of different features in the data that lead to some nice graphs, but using these features to predict whether you will be successful or not in climbing Mount Everest doesn’t lead to a useful model.

Similar Posts