Author Archives: Chris G

Germany most likely to win Euro 2016

By: Chris G

Re-posted from: https://grollchristian.wordpress.com/2016/06/13/germany-most-likely-to-win-euro-2016/

After World Cup 2014, we are finally facing the next spectacular football event: Euro 2016. With billions of fans spread all over the world, football still seems to be the single most popular sport. That might have something to do with the fact that football is a game of underdogs: David could beat Goliath any day. Just take a look at the marvelous story of underdog Leicester City in this year’s Premier League season. It is this high uncertainty about future match outcomes that keeps everybody excited and puzzled about the one question: who is going to win?

A question, of course, that feels like a perfectly designed challenge for data science, given the ever-increasing wealth of football match statistics and other related data that is freely available nowadays. It comes as no surprise, then, that Euro 2016 also puts statistics and data mining into the spotlight. Next to “who is going to win?”, the second question is: “who is going to make the best forecast?”

Besides the big players of the industry, whose predictions traditionally get most of the media attention, there is also a lesser-known academic research group that has already had remarkable success in forecasting. Andreas Groll and Gunther Schauberger from Ludwig-Maximilians-University, together with Thomas Kneib from Georg-August-University, have again set out to forecast the next Euro champion, after correctly predicting the champions of Euro 2012 and World Cup 2014.

Based on publicly available data and the gamlls R-package, they built a model to forecast the probabilities of win, tie and loss for any game of Euro 2016. (Actually, they even get probabilities at a more precise level, for the exact number of goals of both teams. For more details on the model, take a look at their preliminary technical report.)
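To make the step from exact scores to win, tie and loss probabilities concrete, here is a minimal sketch. It assumes that both teams' goal counts are independent Poisson variables, which is my simplification and not necessarily what the report does; the function names and the expected goal values are made up for illustration.

```python
from math import exp, factorial


def poisson_pmf(k, rate):
    """Probability of exactly k goals under a Poisson distribution."""
    return exp(-rate) * rate ** k / factorial(k)


def outcome_probabilities(rate_a, rate_b, max_goals=10):
    """Collapse exact-score probabilities into (win, tie, loss) for team A.

    Assumption: both goal counts are independent Poisson variables.
    Scores above max_goals are cut off, so we renormalise at the end.
    """
    win = tie = loss = 0.0
    for goals_a in range(max_goals + 1):
        for goals_b in range(max_goals + 1):
            p = poisson_pmf(goals_a, rate_a) * poisson_pmf(goals_b, rate_b)
            if goals_a > goals_b:
                win += p
            elif goals_a == goals_b:
                tie += p
            else:
                loss += p
    total = win + tie + loss
    return win / total, tie / total, loss / total


if __name__ == "__main__":
    # Hypothetical expected goals, not taken from the report.
    win, tie, loss = outcome_probabilities(1.6, 1.1)
    print(f"win {win:.3f}, tie {tie:.3f}, loss {loss:.3f}")
```

The point is only that a model on the level of exact scores gives you the coarser win, tie and loss probabilities for free, simply by summing over the matching scorelines.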

This is what their model deems the most likely tournament evolution this time:

[Images: em_results_group, em_results_tree]

The model takes into account not only individual team strengths, but also the randomly drawn group constellations, which influence the tournament’s outcome as well. These are the probabilities the model predicts for each team and each possible achievement in the tournament:

[Image: em_results]

So good news for all Germans: after already winning World Cup 2014, “Die Mannschaft” seems to be getting its hands on the next big international football title!

Well, do they? Let’s see…

Mainstream media usually only picks up on the prediction of the Euro champion, the “point forecast”, so to speak. Keep in mind that although this single outcome may well be the most likely one, it is still quite unlikely in itself, with a probability of only 21.1%. So from a statistical point of view, you should not judge the model only on whether it is able to predict the champion again, as this would require a good portion of luck, too. Just imagine the probability of the most likely champion were 30% each time: getting it right three times in a row would then have a probability of merely (0.3)³=0.027 or 2.7%. So in order to really evaluate the quality of the model, you need to check its forecasting power on a number of games and see whether it consistently does a good job, or even outperforms bookmakers’ odds. Although the report does not list the probabilities for each individual game, you can still get a good feeling for the quality of the model, for example, by looking at the predicted group standings and playoff participants. Just compare them to what you would have guessed yourself: who’s the better football expert?
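The arithmetic above, and the idea of judging a model on many games instead of a single champion pick, can be sketched in a few lines. The Brier score used here is just one common way to score probability forecasts; it is my choice for illustration, not something taken from the report, and the forecasts and outcomes are invented.

```python
def streak_probability(p, n):
    """Probability of getting n independent forecasts right in a row,
    if each one is correct with probability p."""
    return p ** n


def brier_score(forecasts, outcomes):
    """Mean multi-class Brier score over a list of games; lower is better.

    forecasts: list of dicts mapping "win"/"tie"/"loss" to probabilities.
    outcomes:  list of the observed results, one of the dict keys per game.
    """
    total = 0.0
    for probs, observed in zip(forecasts, outcomes):
        total += sum(
            (p - (1.0 if result == observed else 0.0)) ** 2
            for result, p in probs.items()
        )
    return total / len(forecasts)


if __name__ == "__main__":
    print(streak_probability(0.3, 3))

    # Hypothetical forecasts for three games, purely for illustration.
    model = [
        {"win": 0.6, "tie": 0.25, "loss": 0.15},
        {"win": 0.3, "tie": 0.3, "loss": 0.4},
        {"win": 0.5, "tie": 0.3, "loss": 0.2},
    ]
    observed = ["win", "loss", "tie"]
    print(brier_score(model, observed))
```

Computing the same score for probabilities derived from bookmakers’ odds on the same games would give a direct comparison between the model and the market.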

 

Filed under: R, science Tagged: Rbloggers

Prediction model for the FIFA World Cup 2014

By: Chris G

Re-posted from: https://grollchristian.wordpress.com/2014/06/12/world-cup-2014-prediction/

Like a last minute goal, so to speak, Andreas Groll and Gunther Schauberger of Ludwig-Maximilians-University Munich announced their predictions for the FIFA World Cup 2014 in Brazil – just hours before the opening game.

Andreas Groll, already experienced in this field thanks to his successful prediction of the European Championship 2012, and Gunther Schauberger set out to predict the 2014 World Cup champion based on statistical modeling techniques and R.

Somewhat surprisingly, Germany is estimated to have the highest probability of winning the trophy (28.80%), only marginally ahead of Brazil (27.65%), the favorite according to most bookmakers. You can find all estimated probabilities, compared to the respective odds from a German bookmaker, in the graphic on their homepage (http://www.statistik.lmu.de/~schauberger/research.html), together with the most likely world cup evolution simulated from their model. The evolution also shows the neck-and-neck race between Germany and Brazil: they are predicted to meet in the semi-finals, where Germany’s probability of winning the game is a hair’s breadth above 50%. Although there is no detailed technical report on the results yet, you can still get some insight into the model and the data used through a preliminary summary pdf on their homepage (http://www.statistik.lmu.de/~schauberger/WMGrollSchauberger.pdf).

[Images: probs-001-001.jpg, tree-001-001.jpg]

Last week, I had the chance to attend a presentation of their preliminary results at the research seminar of the Department of Statistics (a home game for both), where they presented an already solid first predictive model based on the glmmLasso R package. They kept refining the model until the last minute, however, and it has now received its final touch with the publication of the predictions on their homepage.

As they pointed out, statistical prediction of the world cup champion builds on two separate components. First, you need to determine the individual team strengths: “who is best?”, so to speak. Afterwards, you need to simulate the evolution of the championship, given the actual world cup group draw. This accounts for the fact that even quite capable teams might still miss the playoffs if they were drawn into a group of hard competitors.
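The second component is easy to illustrate with a toy simulation. The sketch below is not their simulation: the strengths, the fixed tie probability and the rule that win chances are proportional to strength are all assumptions I made up to keep it short. It only shows how the same team can have very different chances of advancing depending on the group it was drawn into.

```python
import random
from itertools import combinations


def simulate_group(strengths, rng, p_tie=0.25):
    """Play one round robin and return the two teams that advance.

    Hypothetical outcome model: a fixed tie probability, and the remaining
    probability split between the teams in proportion to their strengths.
    """
    points = {team: 0 for team in strengths}
    for a, b in combinations(strengths, 2):
        p_a = (1.0 - p_tie) * strengths[a] / (strengths[a] + strengths[b])
        u = rng.random()
        if u < p_a:
            points[a] += 3
        elif u < p_a + p_tie:
            points[a] += 1
            points[b] += 1
        else:
            points[b] += 3
    # Equal points are broken at random (real tie-break rules are ignored).
    ranking = sorted(points, key=lambda t: (points[t], rng.random()), reverse=True)
    return ranking[:2]


def advance_probabilities(strengths, n_sims=10000, seed=0):
    """Estimate each team's chance of reaching the playoffs by simulation."""
    rng = random.Random(seed)
    counts = {team: 0 for team in strengths}
    for _ in range(n_sims):
        for team in simulate_group(strengths, rng):
            counts[team] += 1
    return {team: counts[team] / n_sims for team in strengths}


if __name__ == "__main__":
    # Same team X, once in an easy and once in a hard (hypothetical) group.
    easy = advance_probabilities({"X": 2.0, "A": 1.0, "B": 1.0, "C": 1.0})
    hard = advance_probabilities({"X": 2.0, "A": 4.0, "B": 4.0, "C": 4.0})
    print(easy["X"], hard["X"])
```

A full tournament simulation would continue in the same way through the knockout rounds and count how often each team ends up as champion.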

Determining the team strengths turns out to be the hard part of the problem, as there is no simple linear ranking of teams from best to worst. A team that wins more games on average could still have problems with a less successful team, simply because it fails to adjust to the opponent’s style of play. In other words: tough tackles and fouls could be the skillful players’ death.

Hence, Andreas Groll and Gunther Schauberger chose a quite complex approach: they determine the odds of a game through the number of goals that each team is going to score. Here, again, the likelihood of scoring more goals than the opponent depends on much more than just a single measure of team strength. First, the number of goals you score depends on both teams’ capabilities: your own, as well as those of your opponent. As a mediocre team, you score more goals against underdogs than against title aspirants. And second, your strength might be unevenly distributed across different parts of the team: your defense might be more competitive than your offense, or the other way round. As an example, although Switzerland’s overall strength is not within reach of the most capable teams, their defense during the last world cup was still so insurmountable that they did not concede a single goal (penalty shoot-out excluded).
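As a rough sketch of the two points just made, expected goals can be written as a function of your own offense and the opponent’s defense. The log-linear form and all numbers below are assumptions of mine for illustration, not estimates or formulas from their model.

```python
from math import exp

# Hypothetical strength parameters on a log scale; not estimates from the model.
TEAMS = {
    "favorite": {"attack": 0.5, "defense": 0.4},
    "mediocre": {"attack": 0.1, "defense": 0.1},
    "underdog": {"attack": -0.3, "defense": -0.4},
    "defensive": {"attack": -0.2, "defense": 0.6},
}


def expected_goals(team, opponent, baseline=0.2):
    """Expected goals of `team` against `opponent`.

    Assumed form: exp(baseline + own attack - opponent's defense), so a
    stronger opposing defense lowers the number of goals you can expect.
    """
    return exp(baseline + TEAMS[team]["attack"] - TEAMS[opponent]["defense"])


if __name__ == "__main__":
    # A mediocre team scores more against the underdog than against the favorite.
    print(expected_goals("mediocre", "underdog"))
    print(expected_goals("mediocre", "favorite"))
    # A team with a weak offense but a strong defense still concedes little.
    print(expected_goals("favorite", "defensive"))
```

With separate attack and defense parameters, a team like the “defensive” one can be hard to score against without being a title aspirant overall.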

The first preliminary model shown in the research seminar already seemed to do a great job in determining overall team strength. However, subtleties such as the differentiation between offense and defense were not included yet. The final version, in contrast, now allows for such a distinction. Furthermore, the previous random effects model built its prediction mainly on the data of past results itself, relying on explanatory covariates only to a minor extent. Although this in no way indicates any prediction inaccuracies, one would still prefer models to have a more interpretable structure: not only knowing WHICH teams are best, but also WHY. Hence, instead of directly estimating team strengths from past results, it is much nicer to have them estimated as the sum of two components: the strength predicted by covariates like FIFA rank, odds, etc., plus a small deviation found by the model from past results. As a side effect, the model should also become more robust against structural breaks this way: a team with very poor performance in the past could now still be classified as good if indicators of current team strength (like the number of Champions League players or the current odds) hint at higher team strength.
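In code, this two-component structure is nothing more than a weighted sum of covariates plus a team-specific deviation. The covariate names, coefficients and values below are invented for illustration and have nothing to do with the estimated model.

```python
def team_strength(covariates, coefficients, deviation=0.0):
    """Strength = part explained by covariates + team-specific deviation.

    covariates:   dict of covariate name -> value for one team
    coefficients: dict of covariate name -> weight (hypothetical here)
    deviation:    small team effect learned from past results
    """
    covariate_part = sum(
        coefficients[name] * value for name, value in covariates.items()
    )
    return covariate_part + deviation


if __name__ == "__main__":
    # Hypothetical weights for two indicators of current team strength.
    coefficients = {"fifa_points": 0.001, "champions_league_players": 0.05}

    # A team with poor past results (negative deviation) but a strong squad.
    rebuilt_team = {"fifa_points": 900, "champions_league_players": 8}
    print(team_strength(rebuilt_team, coefficients, deviation=-0.1))

    # A team with decent past results but weaker current indicators.
    aging_team = {"fifa_points": 700, "champions_league_players": 2}
    print(team_strength(aging_team, coefficients, deviation=0.1))
```

Because the deviation is kept small, the covariates dominate, which is exactly what makes the structure both interpretable and less dependent on results from long ago.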

When building on explanatory variables, however, the real challenge is to efficiently identify the variables with true explanatory power out of a large set of candidates. Hence, instead of throwing in all variables at once, their regularization approach allows the model to be extended gradually, each time incorporating the variable with the best explanatory power among all variables not yet included. This variable selection seems to me to be the big selling point of their statistical model, and with both Andreas Groll and Gunther Schauberger having prior publications in the field already, they most likely know what they are doing.
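To give a feeling for this kind of stepwise inclusion, here is a deliberately simplified stand-in: a greedy forward selection for a plain linear model, which at each step adds the variable whose one-variable fit to the current residual reduces the squared error most. This is not the L1-penalized algorithm behind glmmLasso, and the covariate names and data are made up.

```python
def center(values):
    """Subtract the mean from a list of numbers."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def greedy_forward_selection(columns, y, n_steps):
    """Add variables one at a time, best explanatory power first.

    columns: dict of variable name -> list of values
    y:       list of responses
    Returns the order of inclusion and the fitted one-variable coefficients.
    """
    residual = center(y)
    centered = {name: center(col) for name, col in columns.items()}
    order, coefs = [], {}
    for _ in range(n_steps):
        best_name, best_gain, best_beta = None, 0.0, 0.0
        for name, col in centered.items():
            if name in coefs:
                continue
            sum_sq = sum(v * v for v in col)
            if sum_sq == 0.0:
                continue  # constant column carries no information
            dot = sum(v * r for v, r in zip(col, residual))
            gain = dot * dot / sum_sq  # drop in residual sum of squares
            if gain > best_gain:
                best_name, best_gain, best_beta = name, gain, dot / sum_sq
        if best_name is None:
            break
        order.append(best_name)
        coefs[best_name] = best_beta
        residual = [r - best_beta * v for r, v in zip(residual, centered[best_name])]
    return order, coefs


if __name__ == "__main__":
    # Hypothetical covariates and response, purely for illustration.
    columns = {"odds_based_rating": [1, 2, 3, 4, 5, 6], "avg_age": [3, 1, 4, 1, 5, 9]}
    y = [2.1, 3.9, 6.05, 7.95, 10.1, 11.9]
    print(greedy_forward_selection(columns, y, n_steps=2))
```

Stopping after a few steps leaves the remaining variables out of the model entirely, which is the practical effect a regularized fit is after.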

From what I have heard, I think we can expect a technical report with a more detailed analysis within the next weeks. I’m already quite excited to find out how large the estimated distinction between offense and defense actually turns out to be in their model. Hopefully, we will get these results at a still early stage of the running world cup. The problem, however, is that some explanatory variables within their model could only be determined completely once all teams’ actual squads were known, and hence they could start their analysis only very shortly before the beginning of the world cup. Although this obviously caused some delay, it made sure that even possible changes of team strength due to injuries could be taken into account. I am quite sure, however, that they will catch up on the delay during the next days, as I think that they are quite big football fans themselves, and hence are most likely as curious about the detailed results as we are…

Filed under: R Tagged: 2014 world cup, football, prediction, Rbloggers

spotted elsewhere: SlideRule

By: Chris G

Re-posted from: https://grollchristian.wordpress.com/2014/04/25/spotted-elsewhere-sliderule/

Having been a big fan of Massive Open Online Courses (MOOCs) and Coursera for quite some time, I stumbled upon another internet platform just today that promises to bring video education to you: SlideRule. It searches several online course providers and “helps you discover the world’s best online courses in every subject”. In addition, there is also iversity, which is not yet searched by SlideRule. Have fun studying!

Filed under: science Tagged: education, mooc, sliderule, spotted elsewhere