ANOVA: Analysis of Variance
Equality of means of multiple groups
- There are more than two samples/groups, and you want to test for equality of population means among all the groups.
- Example 1: there are four social groups: SC, ST, OBC and Others
- And you want to test if the average wage of workers belonging to these groups is the same.
- Example 2: There are four different Covid vaccines that reduce the severity of a Covid infection's impact on patients (say, measured in terms of the drop in blood oxygen levels)
- And you want to test if all the vaccines are equally effective
- Example 3: Three states implemented three different social security programmes
- And you want to test if all the three interventions were equally effective (say, in terms of average income/consumption of beneficiaries)
Intuition
- The greater the variability between the sample means of the groups, and the smaller the variability within the groups, the stronger the evidence that the population means differ
- We create a test statistic that is the ratio of estimates of the “between group variance” and the “within group variance”
- The estimator of the within group variance is an unbiased estimator of the population variance irrespective of whether H0 is true or not.
- The between group estimator is unbiased only if H0 is true, and in that case it yields a value close to the within group estimate.
- So we expect the ratio to be close to 1 if H0 is true, and it will tend to be larger than 1 if H0 is false.
Model description
- \(I\): number of treatments/groups
- \(J\): number of observations in each group (let us assume equal sample sizes for the sake of simplicity)
- \(Y_{i,j}\): the \(j^{th}\) observation of the \(i^{th}\) treatment
The model:
\(Y_{i,j}= \mu+\alpha_{i}+\epsilon_{i,j}\)
- \(\mu\): overall mean
- \(\alpha_{i}\): differential effect of the \(i^{th}\) treatment
- The treatment effects are normalised so that \(\displaystyle\sum_{i=1}^{I}{\alpha_{i}}=0\)
- \(\epsilon_{i,j}\): random errors, assumed to be independent and normally distributed with zero mean and variance \(\sigma^{2}\)
- The expected response to the \(i^{th}\) treatment: \(E(Y_{i,j})= \mu+\alpha_{i}\)
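As a concrete illustration, here is a minimal Python sketch that simulates data from this model (the values of \(\mu\), \(\alpha_{i}\), \(\sigma\), \(I\) and \(J\) are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

I, J = 4, 25                                  # number of treatments, observations per treatment
mu = 10.0                                     # overall mean
alpha = np.array([1.5, -0.5, 0.0, -1.0])      # treatment effects; note they sum to zero
sigma = 2.0                                   # error standard deviation

# Y[i, j] = mu + alpha_i + eps_ij, with eps_ij ~ N(0, sigma^2), independent
eps = rng.normal(0.0, sigma, size=(I, J))
Y = mu + alpha[:, None] + eps                 # shape (I, J): row i holds treatment i
```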
Basic identity
\(\displaystyle\sum_{i=1}^{I}\sum_{j=1}^{J}{(Y_{ij}-\overline{Y})^{2}}=\sum_{i=1}^{I}\sum_{j=1}^{J}{(Y_{ij}-\overline{Y_{i}})^{2}}+J\sum_{i=1}^{I}{(\overline{Y}_{i}-\overline{Y})^{2}}\)
\(SS_{Tot}=SS_{W}+SS_{B}\)
where \(\displaystyle\overline{Y_{i}}=\frac{1}{J}\sum_{j=1}^{J}{Y_{ij}}\) and \(\displaystyle\overline{Y}=\frac{1}{IJ}\sum_{i=1}^{I}\sum_{j=1}^{J}Y_{ij}\)
Basic identity (Proof)
\(\displaystyle\sum_{i=1}^{I}\sum_{j=1}^{J}{(Y_{ij}-\overline{Y})^{2}}=\sum_{i=1}^{I}\sum_{j=1}^{J}{[(Y_{ij}-\overline{Y_{i}})+(\overline{Y_{i}}-\overline{Y})]^{2}}\)
\(\displaystyle = \sum_{i=1}^{I}\sum_{j=1}^{J}{(Y_{ij}-\overline{Y_{i}})^{2}}+\sum_{i=1}^{I}\sum_{j=1}^{J}{(\overline{Y_{i}}-\overline{Y})^{2}}+2\sum_{i=1}^{I}\sum_{j=1}^{J}{(Y_{ij}-\overline{Y_{i}})(\overline{Y_{i}}-\overline{Y})}\)
\(\displaystyle = \sum_{i=1}^{I}\sum_{j=1}^{J}{(Y_{ij}-\overline{Y_{i}})^{2}}+\sum_{i=1}^{I}\sum_{j=1}^{J}{(\overline{Y_{i}}-\overline{Y})^{2}}+2\sum_{i=1}^{I}(\overline{Y_{i}}-\overline{Y})\sum_{j=1}^{J}{(Y_{ij}-\overline{Y_{i}})}\)
\(\displaystyle \because \sum_{j=1}^{J}{(Y_{ij}-\overline{Y_{i}})}=0\)
\(\displaystyle \therefore \sum_{i=1}^{I}\sum_{j=1}^{J}{(Y_{ij}-\overline{Y})^{2}}=\sum_{i=1}^{I}\sum_{j=1}^{J}{(Y_{ij}-\overline{Y_{i}})^{2}}+J\sum_{i=1}^{I}{(\overline{Y_{i}}-\overline{Y})^{2}}\)
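The identity is also easy to check numerically. A small sketch on simulated data (a self-contained version of the earlier simulation; all names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
I, J = 4, 25
alpha = np.array([1.5, -0.5, 0.0, -1.0])
Y = 10.0 + alpha[:, None] + rng.normal(0.0, 2.0, size=(I, J))

Ybar_i = Y.mean(axis=1)                       # group means
Ybar = Y.mean()                               # grand mean

SS_tot = ((Y - Ybar) ** 2).sum()
SS_w = ((Y - Ybar_i[:, None]) ** 2).sum()
SS_b = J * ((Ybar_i - Ybar) ** 2).sum()

print(np.isclose(SS_tot, SS_w + SS_b))        # True: SS_Tot = SS_W + SS_B
```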
Expectations …
Let \(X_{i}\), \(i=1,\ldots,n\), be independent random variables with
\(E[X_{i}]=\mu_{i}\), and \(Var(X_{i})=\sigma^{2}\)
It can be shown that:
\(E(X_{i}-\overline{X})^{2}=(\mu_{i}-\overline\mu)^{2}+\frac{n-1}{n}\sigma^{2}\)
where
\(\displaystyle \overline{\mu}=\frac{1}{n}\sum_{i=1}^{n}{\mu_{i}}\)
\(E(U^{2})=[E(U)]^{2}+Var(U)\) for any random variable U with a finite variance
\([E(X_{i}-\overline{X})]^{2}=[E(X_{i})-E(\overline{X})]^{2}\) \(=(\mu_{i}-\overline{\mu})^{2}\)
\(Var(X_{i}-\overline{X})=Var(X_{i})+Var(\overline{X})-2Cov(X_{i},\overline{X})\)
- \(Var(X_{i})=\sigma^{2}\)
- \(Var(\overline{X})=\frac{1}{n}\sigma^{2}\)
- \(Cov(X_{i},\overline{X})=\frac{1}{n}\sigma^{2}\)
\(Var(X_{i}-\overline{X})=\frac{n-1}{n}\sigma^{2}\)
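A quick Monte Carlo check of this result (a sketch; the values of \(\mu_{i}\), \(\sigma\) and \(n\) below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, 3.0, 5.0, 7.0])           # the mu_i
sigma, n = 2.0, len(mu)

# Many replications of (X_1, ..., X_n); estimate E[(X_1 - Xbar)^2] for i = 1
X = rng.normal(mu, sigma, size=(200_000, n))
lhs = ((X[:, 0] - X.mean(axis=1)) ** 2).mean()
rhs = (mu[0] - mu.mean()) ** 2 + (n - 1) / n * sigma ** 2

print(lhs, rhs)                               # the two values agree up to simulation error
```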
Expectation of Within Sums of Squares
- Using \(E(X_{i}-\overline{X})^{2}=(\mu_{i}-\overline\mu)^{2}+\frac{n-1}{n}\sigma^{2}\)
- We can show that \(\displaystyle E(SS_{W})=I(J-1)\sigma^{2}\)
\(\displaystyle E(SS_{W})=\sum_{i=1}^{I}\sum_{j=1}^{J}E(Y_{ij}-\overline{Y_{i}})^{2}\)
\(\because E(Y_{ij})=E(\overline{Y_{i}})=\mu+\alpha_{i}\), within each group the squared-mean term vanishes and only the variance term remains:
\(\displaystyle E(SS_{W})=\sum_{i=1}^{I}\sum_{j=1}^{J}{\frac{J-1}{J}\sigma^{2}}\)
\(=I(J-1)\sigma^{2}\)
Pooled variance (\(s_{p}^{2}\)) is an unbiased estimator of population variance.
\(\displaystyle s_{p}^{2}=\frac{SS_{W}}{I(J-1)}\) is an unbiased estimator of \(\sigma^{2}\)
\(\displaystyle SS_{W}=\sum_{i=1}^{I}(J-1)s_{i}^{2}\), where \(s_{i}^{2}\) is the sample variance of the \(i^{th}\) group
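The relation \(SS_{W}=\sum_{i}(J-1)s_{i}^{2}\) and the behaviour of \(s_{p}^{2}\) can be checked numerically; a sketch on data simulated under \(H_{0}\) (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
I, J, sigma = 4, 25, 2.0
Y = rng.normal(10.0, sigma, size=(I, J))      # data generated under H0: all alpha_i = 0

s2_i = Y.var(axis=1, ddof=1)                  # per-group sample variances s_i^2
SS_w = ((Y - Y.mean(axis=1, keepdims=True)) ** 2).sum()

print(np.isclose(SS_w, ((J - 1) * s2_i).sum()))   # SS_W = sum_i (J - 1) s_i^2
print(SS_w / (I * (J - 1)))                       # pooled variance s_p^2, roughly sigma^2 = 4
```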
Expectation of Between Sums of Squares
\(\displaystyle E(SS_{B})=J\sum_{i=1}^{I}{\alpha_{i}^{2}}+(I-1)\sigma^{2}\)
\(\displaystyle E(SS_{B})=J\sum_{i=1}^{I}{E(\overline{Y_{i}}-\overline{Y})^{2}}\)
- Using \(E(X_{i}-\overline{X})^{2}=(\mu_{i}-\overline\mu)^{2}+\frac{n-1}{n}\sigma^{2}\), applied to the group means \(\overline{Y_{i}}\) (each with variance \(\sigma^{2}/J\)) with \(n=I\)
\(\displaystyle E(\overline{Y_{i}}-\overline{Y})^{2}=\Big[E(\overline{Y_{i}})-\frac{1}{I}\sum_{i=1}^{I}{E(\overline{Y_{i}})}\Big]^{2}+\frac{I-1}{I}\frac{\sigma^{2}}{J}\)
\(\displaystyle = [\mu+\alpha_{i}-\frac{1}{I}\sum_{i=1}^{I}{(\mu+\alpha_{i})}]^{2}+\frac{I-1}{I}\frac{\sigma^{2}}{J}\)
\(\displaystyle = [\mu+\alpha_{i}-\frac{1}{I}I\mu-\frac{1}{I}\sum_{i=1}^{I}{\alpha_{i}}]^{2}+\frac{I-1}{I}\frac{\sigma^{2}}{J}\)
\(\displaystyle = \alpha_{i}^{2}+\frac{(I-1)\sigma^{2}}{IJ}\), since \(\displaystyle \sum_{i=1}^{I}{\alpha_{i}}=0\) and the \(\mu\) terms cancel out
\(\displaystyle \therefore E(SS_{B})=J\sum_{i=1}^{I}\Big[\alpha_{i}^{2}+\frac{(I-1)\sigma^{2}}{IJ}\Big]\)
\(\displaystyle =J\sum_{i=1}^{I}{\alpha_{i}^{2}}+(I-1)\sigma^{2}\)
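Both expectation results can be verified by simulation. A sketch with arbitrary \(\alpha_{i}\) and \(\sigma\) (here the targets are \(J\sum\alpha_{i}^{2}+(I-1)\sigma^{2}=47\) for \(SS_{B}\) and \(I(J-1)\sigma^{2}=144\) for \(SS_{W}\)):

```python
import numpy as np

rng = np.random.default_rng(3)
I, J, sigma = 4, 10, 2.0
alpha = np.array([1.5, -0.5, 0.0, -1.0])      # treatment effects, sum to zero
reps = 20_000                                 # number of simulated experiments

Y = 10.0 + alpha[:, None] + rng.normal(0.0, sigma, size=(reps, I, J))
Ybar_i = Y.mean(axis=2)                       # group means, shape (reps, I)
Ybar = Y.mean(axis=(1, 2))                    # grand means, shape (reps,)

SS_b = J * ((Ybar_i - Ybar[:, None]) ** 2).sum(axis=1)
SS_w = ((Y - Ybar_i[:, :, None]) ** 2).sum(axis=(1, 2))

print(SS_b.mean(), J * (alpha ** 2).sum() + (I - 1) * sigma ** 2)   # both ~ 47
print(SS_w.mean(), I * (J - 1) * sigma ** 2)                        # both ~ 144
```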
Estimating \(\sigma^{2}\)
The pooled variance (\(s_{p}^{2}\)) of all the samples is an unbiased estimator of the population variance.
\(\displaystyle s_{p}^{2}=\frac{SS_{W}}{I(J-1)}\) is an unbiased estimator of \(\sigma^{2}\)
\(\displaystyle SS_{W}=\sum_{i=1}^{I}(J-1)s_{i}^{2}\), where \(s_{i}^{2}\) is the sample variance of the \(i^{th}\) group
If all the \(\alpha_{i}=0\), then \(\displaystyle \frac{E(SS_{B})}{I-1}=\sigma^{2}\)
Thus, if \(H_{0}\) is true, the two estimates should be approximately equal:
\(\displaystyle \frac{SS_{W}}{I(J-1)} \approx \frac{SS_{B}}{I-1}\)
Since \(\displaystyle E(SS_{B})=J\sum_{i=1}^{I}{\alpha_{i}^{2}}+(I-1)\sigma^{2}\), if some of the \(\alpha_{i} \ne 0\), then \(SS_{B}\) will be inflated.
The test statistic
\(H_{0}: \alpha_{1} = \alpha_{2} = \alpha_{3} = \cdots = \alpha_{I} = 0\); \(H_{a}\): at least one \(\alpha_{i} \ne 0\)
\(F=\frac{SS_{B}/(I-1)}{SS_{W}/[I(J-1)]}\)
follows, under \(H_{0}\), an F distribution with degrees of freedom \(I-1\) and \(I(J-1)\)
If the null hypothesis is true, the F statistic should be close to 1.
If the null hypothesis is false, the F statistic would be inflated.
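A minimal sketch of the test on simulated data, computing \(F\) by hand and cross-checking against scipy.stats.f_oneway (the group effects and sizes below are hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
I, J = 4, 25
alpha = np.array([1.5, -0.5, 0.0, -1.0])
Y = 10.0 + alpha[:, None] + rng.normal(0.0, 2.0, size=(I, J))

Ybar_i = Y.mean(axis=1)
SS_b = J * ((Ybar_i - Y.mean()) ** 2).sum()
SS_w = ((Y - Ybar_i[:, None]) ** 2).sum()

F = (SS_b / (I - 1)) / (SS_w / (I * (J - 1)))
p = stats.f.sf(F, I - 1, I * (J - 1))         # upper-tail p-value

print(F, p)
print(stats.f_oneway(*Y))                     # same F statistic and p-value
```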
If sample sizes for all treatments are not equal
\(\displaystyle F=\frac{SS_{B}/(I-1)}{SS_{W}/\displaystyle \sum_{i=1}^{I}{(J_{i}-1)}}\) follows, under \(H_{0}\), an F distribution with degrees of freedom \(I-1\) and \(\displaystyle \sum_{i=1}^{I}{(J_{i}-1)}\)
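A sketch of the unequal-sample-size case, where \(SS_{B}=\sum_{i}J_{i}(\overline{Y_{i}}-\overline{Y})^{2}\) in the standard one-way ANOVA (group sizes and effects below are hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
sizes = [12, 20, 15, 30]                      # unequal group sizes J_i
alphas = [1.5, -0.5, 0.0, -1.0]
groups = [10.0 + a + rng.normal(0.0, 2.0, size=n) for a, n in zip(alphas, sizes)]

I, N = len(groups), sum(sizes)
grand = np.concatenate(groups).mean()
SS_b = sum(n * (g.mean() - grand) ** 2 for g, n in zip(groups, sizes))
SS_w = sum(((g - g.mean()) ** 2).sum() for g in groups)

F = (SS_b / (I - 1)) / (SS_w / (N - I))       # N - I = sum_i (J_i - 1)
p = stats.f.sf(F, I - 1, N - I)

print(F, p)
print(stats.f_oneway(*groups))                # matches
```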