Information theory

Comparing hypotheses

When comparing hypotheses, we want to know:

“What is the empirical evidence for hypothesis \(x\) relative to other possible hypotheses?”

or

“Which hypothesis/model is closest to reality?”

Null hypothesis statistical tests

In contrast, Null Hypothesis Statistical Tests (NHST) are designed to answer the question:

“What is the probability of observing my data if the null hypothesis were true?”

or

“How likely are my data, given the null hypothesis?”

So NHST starts with the null hypothesis, and checks whether our field data are consistent with it

This is counter-intuitive! We would prefer to find out the strength of evidence for real (alternative) hypotheses that we’re actually interested in

A more natural approach is:

“How likely are my hypotheses, given the data?”

Information theory

The Information Theoretic approach starts with our data, and evaluates the strength of evidence in support of each hypothesis

Information Theory is based on Kullback-Leibler Information, and allows us to:

  • Compare our models, each of which represents a discrete hypothesis
  • Select the best model, if one exists, and
  • Average the results of all models, if there is no single best model

Akaike’s Information Criterion (AIC)

Information Theory’s fundamental statistic is Akaike’s Information Criterion (AIC)

AIC is a numerical value representing the scientific evidence for a model

Information Theory offers a simple and compelling approach:

  1. Compute an AIC value for each model (hypothesis)
  2. Compare models using their AIC values
  3. The model with the smallest AIC value has the strongest support from your field data
  4. Calculate measures of the relative strength of evidence for each hypothesis
  5. Infer the importance of predictor variables using all models

This enables us to evaluate the likelihood of our proposed hypothesis being correct, given what we observe in the field

AIC definition

\[ AIC = -2 \log\left(L(\hat{\theta} \mid data)\right) + 2K \]

Where:

  • \(\hat{\theta}\) = the parameter estimates from your model
  • \(L(\hat{\theta} \mid data)\) = the likelihood of your model, given the data
  • \(K\) = the number of parameters
  • The minus sign means that the AIC value decreases as the likelihood of your model increases
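
To make the formula concrete, here is a minimal sketch in R that calculates AIC by hand for a simple illustrative linear model (the built-in cars data, not the water deer data) and checks it against R's built-in AIC() function:

# Illustrative model only - not the water deer case study
fit <- lm(dist ~ speed, data = cars)

logL <- as.numeric(logLik(fit))  # log-likelihood of the model, given the data
K <- attr(logLik(fit), "df")     # number of estimated parameters

-2 * logL + 2 * K                # AIC by hand
AIC(fit)                         # matches R's built-in value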

How is AIC helpful?

Imagine AIC as:

  • The distance from the model to full reality (you want to minimise this distance), or
  • The amount of information lost by using that model to approximate reality

In both cases, the best model is the one:

  • Closest to reality (small AIC), or
  • That loses the smallest amount of information (small AIC)

AIC in our water deer case-study

Here’s the output from our model of constant density and detectability:

hn_Null # show distance model output

Call:
distsamp(formula = ~1 ~ 1, data = distUMF, keyfun = "halfnorm", 
    output = "density", unitsOut = "ha")

Density:
 Estimate    SE     z   P(>|z|)
    -2.95 0.102 -28.9 9.56e-184

Detection:
 Estimate     SE    z P(>|z|)
     4.72 0.0617 76.5       0

AIC: 381.835 

The AIC value is 381.8349836

AIC is for comparing models

This AIC value is only meaningful when compared with the AIC values of other models tested on the same data

AIC is only useful for comparing models

We cannot draw any conclusions from the AIC value of a single model in isolation, or from AIC values for models tested on different data

The value of AIC depends on the data, and so it’s only valid to compare models using AIC if they have been run on the same data
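
As a sketch of how such a comparison might look in R with the unmarked package: suppose we had also fitted a second distsamp model in which density varies with a hypothetical site covariate, habitat, stored in distUMF (the covariate name and model are assumptions for illustration only). The fitList() and modSel() functions then rank the candidate models by AIC, all fitted to the same data:

# Hypothetical second model: detection constant, density varies with habitat
hn_Habitat <- distsamp(~1 ~ habitat, data = distUMF, keyfun = "halfnorm",
                       output = "density", unitsOut = "ha")

# Gather the candidate models fitted to the same data...
fits <- fitList(fits = list(Null = hn_Null, Habitat = hn_Habitat))

# ...and rank them by AIC (smallest = best-supported model)
modSel(fits)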

Which model is best?

There is no simple rule to determine which model is best based on AIC values

We need to interpret the evidence for each model, and decide whether:

  1. There is strong enough support for a single model, or
  2. You should draw conclusions based on your entire model set

The materials in this module and the next take you through this judgement process

AIC is more objective than NHST p-values:

  • There are no arbitrary choices about what level of \(\alpha\) to compare your \(p\)-value with
  • Testing multiple hypotheses doesn’t increase your chance of getting a spurious significant result

However, the Information Theoretic approach still requires thought when selecting hypotheses to compare!

AIC and parsimony

\[ AIC = -2 \log\left(L(\hat{\theta} \mid data)\right) + 2K \]

All else being equal, AIC decreases when you include fewer parameters in your model, because the \(+2K\) penalty term is smaller

Parsimony - use the simplest model capable of representing the information in our data

If you have two models that explain the same amount of variation in your field data, but one is simpler (fewer parameters), you should prefer the simpler model

This is because:

  1. It’s easier to interpret simpler models
  2. You’re less likely to be over-fitting your model by trying to explain every fine-scale pattern in the data
  3. The precision of your parameter estimates will be higher

Without this ‘penalty’ for each parameter, more complex models would always appear better, because each extra parameter explains a little more of the residual noise (increasing the likelihood and lowering the AIC), even if the extra parameters are not biologically informative
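
For example, suppose adding one extra parameter improves the maximised log-likelihood by only 0.5 units. The \(-2\log(L)\) part of AIC falls by 1, but the \(2K\) penalty rises by 2, so overall:

\[ \Delta AIC = -2(0.5) + 2(1) = +1 \]

The AIC increases, telling us the extra complexity is not justified by the small improvement in fit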

AIC small sample correction (AICc)

AICc is the AIC value corrected for small sample sizes

AICc is a better estimator when the number of parameters is large compared to the sample size

\[ AIC_c = -2 \log\left(L(\hat{\theta} \mid data)\right) + 2K \left(\frac{n}{n - K - 1}\right) \]

Where:

  • \(n\) = the sample size of your field data

The formula for AICc increases the parameter penalty \(2K\) by a bias correction factor of \(\frac{n}{n - K - 1}\)

As the sample size increases relative to the number of parameters \(K\), this correction factor approaches 1, causing AICc to converge on AIC

Anderson (2008) recommends that you always use AICc, not AIC

We will use a separate R package to calculate AICc, but it can easily be calculated by hand
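
As an illustration, here is a minimal sketch of that by-hand calculation. The null water deer model has \(K = 2\) parameters (one density intercept and one detection intercept), so its log-likelihood can be recovered from its AIC as \((2K - AIC)/2 \approx -188.92\); the sample sizes below are hypothetical, chosen only to show the correction factor converging on 1:

# AICc by hand: the 2K penalty is inflated by n / (n - K - 1)
aicc <- function(logL, K, n) {
  -2 * logL + 2 * K * (n / (n - K - 1))
}

aicc(logL = -188.92, K = 2, n = 30)    # ~382.3: small-sample correction is noticeable
aicc(logL = -188.92, K = 2, n = 1000)  # ~381.85: converges on the AIC value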