All of these mathematical endeavours begin with the presumption that Blizzard has a secret formula it uses to compute the amount of mana a card ought to cost. The value on the card will be different on the card for one of three reasons - the influence of mechanics, the effects of rounding (because a card costs 1 or 2 mana, not 1.5), and tweaking based on how it plays (variance introduced through human evaluation).
What we're ultimatley trying to build (for the base card value) is a model of the basic relationships between the values on the card (Attack, Health, and Mana), and these other effects, to 'reverse-engineer' the basic formula.
A standard lm model uses a family of functions for analyzing the variance of a dataset based on the assumption that there's a gaussian distribution in your data, and is primarily concerned with producing a binary predictor for those values.
As always, model fitting is part art, part science, and part throwing shit at the wall to see what works. Or, it is when I do it, at any rate.
Given our belief that there's a direct, additive relationship between the base card value and the attack/defense values on the card, in theory, a glm on a poisson family should get you a better fit than the gaussian; it better represents the expected behaviour of a count variable. But is it true?
First, the original, gaussian LM fit from last time.
Call: lm(formula = Mana ~ Attack + Health, data = dataset, subset = CardType == 1 & CardText == "" & Mana > 0) Residuals: Min 1Q Median 3Q Max -1.9940 -0.2844 0.1968 0.2218 0.7611 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -0.16376 0.14999 -1.092 0.283 Attack 0.50626 0.06172 8.202 2.28e-09 _*_ ## Health 0.43566 0.06107 7.134 4.27e-08 _*_ Signif. codes: 0 ‘**_’ 0.001 ‘_**_’ 0.01 ‘_’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.4999 on 32 degrees of freedom Multiple R-squared: 0.9359, Adjusted R-squared: 0.9319 F-statistic: 233.6 on 2 and 32 DF, p-value: < 2.2e-16
Now let's change tack.
If we presume that our outcome value is "count"-like - additive in nature from the base values on the card - we can switch to the generalized linear model, and switch to a poisson distribution - with intriguing results.
glm(formula = Mana ~ Attack + Health, family = family, data = dataset, subset = CardType == 1 & CardText == "" & Mana > 0) Deviance Residuals: Min 1Q Median 3Q Max -1.0093 -0.2435 -0.1382 0.2194 0.9077 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -0.17849 0.23141 -0.771 0.4405 Attack 0.14891 0.06234 2.389 0.0169 * Health 0.16466 0.06622 2.487 0.0129 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for poisson family taken to be 1) Null deviance: 44.9140 on 34 degrees of freedom Residual deviance: 4.4248 on 32 degrees of freedom AIC: 101.1 Number of Fisher Scoring iterations: 4
Note the differences in the two sets of graphs:
Residuals are healthier. The residuals on the poisson distribution have better deviations; LM range was -1.9940 to 0.7611 (~2.75); our Poisson GLM is -1.0093 to 0.9077 (~1.9).
Residual QQ is better. The two graphs are day and night; you now see a nice, clean, quantized Q-Q for the residuals, showing the stair-stepping you'd expect to see when you know that whatever magic formula exists has the effects of rounding (because mana is an integer) applied.
Cook's Distance improved. We go from having some pretty strange outliers to being within 0.06 on all modelled values.
Scale-Location and Residuals Vs Leverage also see huge improvements over their normal counterpart.
In short, the poisson family appears to do a much better job of estimating the base value of the card than the normal family.