Discovering Statistics: The Blog: August 2012

.... I didn't grow a pair of breasts. If you didn't read my last blog that comment won't make sense, but it turns out that people like breasts so I thought I'd mention them again. I haven't written a lot of blogs, but my frivolous blog about growing breasts as a side effect of some pills was (by quite a large margin) my most viewed blog. It's also the one that took me the least time to write and that I put the least thought into. I think the causal factor might be the breasts.

This blog isn't about breasts, it's about normality. Admittedly the normal distribution looks a bit like a nipple-less breast, but it's not one: I'm very happy that my wife does not sport two normal distributions upon her lovely chest. I like stats, but not that much ...

Assumptions

Anyway, I recently stumbled across this paper. The authors sent a sample of postgrads (with at least 2 years’ research experience) a bunch of data analysis scenarios and asked them how they would analyze the data. They were interested in whether, and how, these people checked the assumptions of the tests they chose to use. The good news was that they chose the correct test (although given that all of the scenarios basically required a general linear model of some variety, that wasn’t hard). However, not many of them checked assumptions. The conclusion was that people don’t understand assumptions or how to test them.

I get asked about assumptions a lot. I also have to admit to hating the chapter on assumptions in my SPSS and R books. Well, hate is a strong word, but I think it toes a very conservative and traditional line. In my recent update of the SPSS book (out early next year before you ask) I completely re-wrote this chapter. It takes a very different approach to thinking about assumptions.

Most of the models we fit to data sets are based on the general linear model (GLM), which means that any assumption that applies to the GLM (i.e., regression) applies to virtually everything else. You don’t really need to memorize a list of different assumptions for different tests: if it’s a GLM (e.g., ANOVA, regression etc.) then you need to think about the assumptions of regression. The most important ones are:

Linearity
Normality (of residuals)
Homoscedasticity (aka homogeneity of variance)
Independence of errors.

What Does Normality Affect?

For this post I’ll discuss normality. If you’re thinking about normality, then you need to think about 3 things that rely on normality:

Parameter estimates: That could be an estimate of the mean, or a b in regression (and a b in regression can represent differences between means). Models have error (i.e., residuals), and if these residuals are normally distributed in the population then using the method of least squares to estimate the parameters (the bs) will produce better estimates than other methods.
Confidence intervals: whenever you have a parameter, you usually want to compute a confidence interval (CI) because it’ll give you some idea of what the population value of the parameter is. We use values of the standard normal distribution to compute the confidence interval: using values of the standard normal distribution makes sense only if the parameter estimates actually come from one.
Significance tests: we often test parameters against a null value (usually we’re testing whether b is different from 0). For this process to work, we assume that the parameter estimates have a normal distribution. We assume this because the test statistics that we use (such as the t, F and chi-square) have distributions related to the normal. If parameter estimates don’t have a normal distribution then p-values won’t be accurate.
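
To make the confidence interval point concrete, here is a minimal sketch of a normal-theory 95% interval around a mean. The function name normal_ci and the scores are made up for illustration, and the sketch uses the standard normal critical value, as described above, rather than a t-based value.

```python
import math
import statistics
from statistics import NormalDist

def normal_ci(sample, level=0.95):
    """Normal-theory confidence interval for the mean of `sample`."""
    n = len(sample)
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + level / 2)     # about 1.96 for a 95% interval
    return mean - z * se, mean + z * se

# Made-up scores, purely for illustration
scores = [4.1, 5.3, 6.0, 4.8, 5.5, 5.9, 4.4, 5.1, 6.2, 5.0]
lower, upper = normal_ci(scores)
print(f"95% CI: [{lower:.2f}, {upper:.2f}]")
```

The interval is only as trustworthy as the assumption that the estimate (here, the mean) comes from a normal sampling distribution, which is exactly the point of this section.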

What Does The Assumption Mean?

People often think that your data need to be normally distributed, and that’s what many people test. However, that’s not the case. What matters is that the residuals in the population are normal, and the sampling distribution of parameters is normal. However, we don’t have access to the sampling distribution of parameters or population residuals; therefore, we have to guess at what might be going on by testing the data instead.
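
Here is a rough sketch of why the raw data can mislead you. It simulates a hypothetical two-group design (the group means, sample sizes and seed are all invented) in which the errors really are normal, then compares the shape of the raw outcome with the shape of the residuals, using excess kurtosis as a crude index (it is about 0 for a normal distribution).

```python
import random
import statistics

random.seed(42)

# Hypothetical design: two groups whose means differ a lot, with normal errors.
group = [0] * 500 + [1] * 500
outcome = [10 + 8 * g + random.gauss(0, 1) for g in group]

# With a single dummy predictor, least squares reduces to the two group means.
mean0 = statistics.fmean(y for y, g in zip(outcome, group) if g == 0)
mean1 = statistics.fmean(y for y, g in zip(outcome, group) if g == 1)
residuals = [y - (mean1 if g else mean0) for y, g in zip(outcome, group)]

def excess_kurtosis(xs):
    """Fourth standardized moment minus 3 (roughly 0 for normal data)."""
    m = statistics.fmean(xs)
    var = statistics.fmean((x - m) ** 2 for x in xs)
    return statistics.fmean((x - m) ** 4 for x in xs) / var ** 2 - 3

print("raw outcome:", round(excess_kurtosis(outcome), 2))
print("residuals:  ", round(excess_kurtosis(residuals), 2))
```

The raw outcome is bimodal, so its excess kurtosis should come out strongly negative, whereas the residuals should sit close to zero. Testing the raw scores here would suggest a problem that the model does not actually have.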

When Does The Assumption Matter?

However, the central limit theorem tells us that no matter what distribution things have, the sampling distribution will be normal if the sample is large enough. How large is large enough is another matter entirely and depends a bit on what test statistic you want to use. So bear that in mind. However, oversimplifying things a bit, we could say:

Confidence intervals: For confidence intervals around a parameter estimate to be accurate, that estimate must come from a normal distribution. The central limit theorem tells us that in large samples, the estimate will have come from a normal distribution regardless of what the sample or population data look like. Therefore, if we are interested in computing confidence intervals then we don’t need to worry about the assumption of normality if our sample is large enough. (There is still the question of how large is large enough though.) You can easily construct bootstrap confidence intervals these days, so if your interest is confidence intervals then why not stop worrying about normality and use bootstrapping instead?
Significance tests: For significance tests of models to be accurate the sampling distribution of what’s being tested must be normal. Again, the central limit theorem tells us that in large samples this will be true no matter what the shape of the population. Therefore, the shape of our data shouldn’t affect significance tests provided our sample is large enough. (How large is large enough depends on the test statistic and the type of non-normality. Kurtosis for example tends to screw things up quite a bit.) You can make a similar argument for using bootstrapping to get a robust p if p is your thing.
Parameter Estimates: The method of least squares will always give you an estimate of the model parameters that minimizes error, so in that sense you don’t need to assume normality of anything to fit a linear model and estimate the parameters that define it (Gelman & Hill, 2007). However, there are other methods for estimating model parameters, and if you happen to have normally distributed errors then the estimates that you obtained using the method of least squares will have less error than the estimates you would have got using any of these other methods.
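
The bootstrapping suggestion above can be sketched in a few lines. This is a simple percentile bootstrap, which is only one of several bootstrap variants; the function name bootstrap_ci, the number of resamples and the skewed data are all assumptions for illustration.

```python
import random
import statistics

def bootstrap_ci(sample, stat=statistics.fmean, n_boot=2000, level=0.95, seed=1):
    """Percentile bootstrap confidence interval for `stat` of `sample`."""
    rng = random.Random(seed)
    n = len(sample)
    # Resample with replacement, recompute the statistic, and sort the results.
    estimates = sorted(stat(rng.choices(sample, k=n)) for _ in range(n_boot))
    alpha = (1 - level) / 2
    lower = estimates[int(alpha * n_boot)]
    upper = estimates[int((1 - alpha) * n_boot) - 1]
    return lower, upper

# Made-up, positively skewed data (exponential), purely for illustration
data_rng = random.Random(7)
data = [data_rng.expovariate(1.0) for _ in range(50)]

ci_lower, ci_upper = bootstrap_ci(data)
print(f"bootstrap 95% CI for the mean: [{ci_lower:.2f}, {ci_upper:.2f}]")
```

The appeal is that the interval comes from the empirical distribution of the resampled estimates, so nothing in it leans on the estimate having a normal sampling distribution.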

Summary

If all you want to do is estimate the parameters of your model then normality doesn’t really matter. If you want to construct confidence intervals around those parameters, or compute significance tests relating to those parameters, then the assumption of normality matters in small samples, but because of the central limit theorem we don’t really need to worry about this assumption in larger samples. The question of how large is large enough is a complex issue, but at least you know now what parts of your analysis will go screwy if the normality assumption is broken.
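
If you want to see the central limit theorem doing its thing, a small simulation helps. This sketch draws repeated samples from a skewed (exponential) population and looks at the skewness of the resulting sample means; the population, the sample sizes and the number of repetitions are arbitrary choices for illustration.

```python
import random
import statistics

def skewness(xs):
    """Third standardized moment (0 for a symmetric distribution)."""
    m = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return statistics.fmean(((x - m) / sd) ** 3 for x in xs)

def sample_means(n, reps=5000, seed=0):
    """Means of `reps` samples of size `n` from a skewed (exponential) population."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(reps)
    ]

for n in (2, 10, 100):
    print(f"n = {n:3d}: skew of sampling distribution = {skewness(sample_means(n)):.2f}")
```

The skew of the sampling distribution should shrink towards zero as n grows, even though the population itself is nothing like normal. How quickly it shrinks depends on the population, which is the 'how large is large enough' problem in miniature.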

This blog is based on excerpts from the forthcoming 4th edition of ‘Discovering Statistics Using SPSS: and sex and drugs and rock ‘n’ roll’.

I know I have been a bit rubbish with blogs recently, but I’m massively behind with the Discovering Statistics Using SPSS update, and these things fall by the wayside. Also, I can so rarely find anything remotely interesting to say, let alone blog about. If it were a blog about music then I could write all day. Anyway …

So, while writing the DSUS update I was unwell for a couple of months. It turned out to (probably) be stress related (updating a book involves a lot of long days, late nights, and pressure). Unlike women, who sensibly go to the doctor when they feel ill, men do not. However, I did eventually do the un-manly thing and go to my doctor. She prescribed some pills. In one of my other blogs I talked about key statistical skills that we should try to teach undergrads, and as I read the instructions of these pills it occurred to me that this is a good example of where the world would be a better place if people left university understanding statistics a bit better, and providing useful statistical information therefore became the norm.

Like a diligent patient, I read the instruction leaflet with the pills. Like most instruction leaflets with pills it had an un-amusing list of possible side effects. These side effects were helpfully listed as common, uncommon and rare. Common ones included headache, stomach aches and feeling sick (OK, I can handle that), uncommon ones were dizziness, liver disease which might make my eyes yellow, rash, sleepiness or trouble sleeping (but not both). The rare ones included liver failure resulting in brain damage, bleeding at the lips, eyes, mouth, nose and genitals and development of breasts in men.

Excuse me? Did it say ‘development of breasts in men’?

Yes it did.

Here’s a photo to prove it.

Side effects

I’ll admit that I don’t know much about human anatomy, but based on the little I do know, it seems intuitive that my immune system, if reacting badly to something like a drug, might overload my liver and make it explode, or give me kidney failure. I also know that feeling sick and having flu-like symptoms is part and parcel of your immune system kicking into action. But why on earth would my body respond to a nasty drug by sprouting breasts? Perhaps because having them would make me more likely to visit my doctor.

Anyway, back to the tenuous link to stats. Whenever I read this sort of thing (which fortunately isn’t often) I usually feel that I’d rather put up with whatever it is that’s bothering me than run the risk of, for example, bleeding from my penis or getting brain damage. I might feel differently if I had enough information to assess the risk. What do they mean by ‘uncommon’ or ‘rare’: 1/100, 1/1,000, 1/billion? Wouldn’t it be nice if we could have a bit more information, maybe even an odds ratio – that way I could know, for example, that if I take the pill I’d be 1.2 times more likely to grow breasts than if I don’t. That way we could better assess the likelihood of these adverse events, which, if you’re as neurotic as me, would be very helpful.
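
For what it’s worth, here is roughly how such a number would be computed. The counts below are entirely made up (the leaflet gives none), and the function names are just for illustration. Strictly speaking, ‘times more likely’ describes a relative risk rather than an odds ratio, but for rare events the two are almost identical, as the sketch shows.

```python
def odds_ratio(events_treated, n_treated, events_control, n_control):
    """Odds of the event on the pill divided by the odds without it."""
    odds_treated = events_treated / (n_treated - events_treated)
    odds_control = events_control / (n_control - events_control)
    return odds_treated / odds_control

def relative_risk(events_treated, n_treated, events_control, n_control):
    """Probability of the event on the pill divided by the probability without it."""
    return (events_treated / n_treated) / (events_control / n_control)

# Entirely hypothetical counts: 12 cases in 10,000 pill-takers,
# 10 cases in 10,000 people not taking the pill.
or_value = odds_ratio(12, 10_000, 10, 10_000)
rr_value = relative_risk(12, 10_000, 10, 10_000)
print(f"odds ratio = {or_value:.3f}, relative risk = {rr_value:.3f}")
```

Either number on the leaflet, alongside the baseline rate, would tell me far more than the word ‘rare’.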

The campaign for more stats on drug instruction leaflets starts here.

Anyway, after all that I took the pill, went to sleep and dreamt of the lovely new breasts that I’d have in the morning …

Monday, August 6, 2012

Assumptions Part 1: Normality