Wednesday, November 14, 2012

FAQ #1: K-S Tests in SPSS

I decided to start a series of blogs on questions that I get asked a lot. When I say a series I'm probably raising expectation unfairly: anyone who follows this blog will realise that I'm completely crap at writing blogs. Life gets busy. Sometimes I need to sleep. But only sometimes.

Anyway, I do get asked a lot about why there are two ways to do the Kolmogorov-Smirnov (K-S) test in SPSS. In fact, I got an email only this morning. I knew I'd answered this question many times before, but I couldn't remember where I might have saved a response, so I figured that if I just blog about it I'd have a better idea of where I'd written one. So, here it is. Notwithstanding my reservations about using the K-S test (you'll have to wait until edition 4 of the SPSS book), there are three ways to get one from SPSS:

  1. Analyze > Explore > Plots > Normality plots with tests
  2. Nonparametric Tests > One Sample ... (or Legacy Dialogs > 1-Sample K-S)
  3. Tickle SPSS under the chin and whisper sweet nothings into its ear
These methods give different results. Why is that? Essentially (I think) if you use method 1 then SPSS applies Lilliefors' correction, but if you use method 2 it doesn't. If you use method 3 then you just look like a weirdo.

So, is it better to use Lilliefors' correction or not? In the additional website material for my SPSS book, which no-one ever reads (the web material, not the book ...), I wrote (self-plagiarism alert):

"If you want to test whether a model is a good fit of your data you can use a goodness-of-fit test (you can read about these in the chapter on categorical data analysis in the book), which has a chi-square test statistic (with the associated distribution). One problem with this test is that it needs a certain sample size to be accurate. The K–S test was developed as a test of whether a distribution of scores matches a hypothesized distribution (Massey, 1951). One good thing about the test is that the distribution of the K–S test statistic does not depend on the hypothesized distribution (in other words, the hypothesized distribution doesn’t have to be a particular distribution). It is also what is known as an exact test, which means that it can be used on small samples. It also appears to have more power to detect deviations from the hypothesized distribution than the chi-square test (Lilliefors, 1967). However, one major limitation of the K–S test is that if location (i.e. the mean) and shape parameters (i.e. the standard deviation) are estimated from the data then the K–S test is very conservative, which means it fails to detect deviations from the distribution of interest (i.e. normal). What Lilliefors did was to adjust the critical values for significance for the K–S test to make it less conservative (Lilliefors, 1967) using Monte Carlo simulations (these new values were about two thirds the size of the standard values). He also reported that this test was more powerful than a standard chi-square test (and obviously the standard K–S test).

Another test you’ll use to test normality is the Shapiro-Wilk test (Shapiro & Wilk, 1965) which was developed specifically to test whether a distribution is normal (whereas the K–S test can be used to test against other distributions than normal). They concluded that their test was ‘comparatively quite sensitive to a wide range of non-normality, even with samples as small as n = 20. It seems to be especially sensitive to asymmetry, long-tailedness and to some degree to short-tailedness.’ (p. 608). To test the power of these tests they applied them to several samples (n = 20) from various non-normal distributions. In each case they took 500 samples which allowed them to see how many times (in 500) the test correctly identified a deviation from normality (this is the power of the test). They show in these simulations (see table 7 in their paper) that the S-W test is considerably more powerful to detect deviations from normality than the K–S test. They verified this general conclusion in a much more extensive set of simulations as well (Shapiro, Wilk, & Chen, 1968)." 

So there you go. More people have probably read that now than when it was on the additional materials for the book. It looks like Lilliefors' correction is a good thing (power-wise), but you probably don't want to be using K-S tests anyway; if you do, interpret them within the context of your sample size and look at graphical displays of your scores too.
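If you want to see the difference for yourself outside SPSS, here's a minimal R sketch. It's only an illustration: it assumes you have the nortest package installed for the Lilliefors-corrected test, and the correspondence to the SPSS dialogs is approximate.

```r
# Simulate some mildly skewed scores to play with
set.seed(666)
scores <- rexp(50, rate = 1)

# K-S test against a normal with mean and SD estimated from the data,
# with no correction (roughly what the one-sample K-S dialog gives you)
ks.test(scores, "pnorm", mean = mean(scores), sd = sd(scores))

# K-S test with Lilliefors' correction
# (roughly what Explore > Plots > Normality plots with tests reports)
library(nortest)
lillie.test(scores)

# Shapiro-Wilk test, which tends to have more power to detect non-normality
shapiro.test(scores)
```

The corrected and uncorrected p-values can differ quite a bit, which is exactly the discrepancy that prompts the emails.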

Wednesday, October 3, 2012

You Can't Trust Your PhD Supervisor :-)

My ex-PhD supervisor Graham Davey posted a blog this morning about 10 ways to create false knowledge in psychology. It's a tongue-in-cheek look at various things that academics do for various reasons. Two of his 10 points have a statistical theme, and they raise some issues in my mind. I could walk the 3 metres between my office and Graham's to discuss these, or rib him gently about it next time I see him in the pub, but I thought it would be much more entertaining to write my own blog about it. A blog about a blog, if you will. Perhaps he'll reply with a blog about a blog about a blog, and I can reply with a blog about a blog about a blog about a blog, and then we can end up in some kind of blog-related feedback loop that wipes us both from existence so that we'd never written the blogs in the first place. But that would be a paradox. Anyway, I digress.

Before I continue, let me be clear that during my PhD Graham taught me everything I consider worth knowing about the process of science, theory, psychopathology and academic life. So, I tend to take what he says (unless it's about marriage or statistics) very seriously. The two points I want to focus on are:

2.  Do an experiment but make up or severely massage the data to fit your hypothesis. This is an obvious one, but is something that has surfaced in psychological research a good deal recently (http://bit.ly/QqF3cZ;http://nyti.ms/P4w43q).

Clearly the number of high-profile retractions/sackings in recent times suggests that there is a lot of this about (not just in psychology). However, I think there is a more widespread problem than deliberate manipulation of data. For example, I remember reading somewhere about (I think) the Dirk Smeesters case, or it might have been the Stapel one (see, I'm very scientific and precise); in any case, the person in question (perhaps it was someone entirely different) had claimed that they didn't think they were doing anything wrong when applying the particular brand of massage therapy that they had applied to their data. So, although there are high-profile cases of fraud that have been delved into, I think there is a wider problem of people simply doing the wrong thing with their data because they don't know any better. I remind you of Hoekstra, Kiers and Johnson's recent study, which asked postgraduate researchers about the assumptions of the analyses they ran and showed (sort of) that they don't seem to check them. I'd be very surprised if it's just postgraduates. I would bet that assumptions, what they mean, when they matter and what to do about them are all concepts that are poorly understood amongst many very experienced researchers (not just within psychology). My suspicions are largely founded on the fact that I have only relatively recently really started to understand why and when these things matter, and I'm a geek who takes an interest in these things. I'd also bet that the misunderstandings about assumptions and robustness of tests stem from being taught by people who poorly understand these things. I'm reminded of Haller and Kraus' (2002) study showing that statistics teachers misunderstood p-values. The fourth edition of my SPSS book (plug: out early 2013) is the first book in which I really feel that I have handled the teaching of assumptions adequately, so I'm not at all blameless in all of this mess. (See also my two recent blogs on normality and homogeneity.)

I'd really like to do a study looking at more experienced researchers' basic understanding of assumptions (a follow-up to Hoekstra's study with a more experienced sample and more probing questions), just to see whether my suspicions are correct. Maybe I should email Hoekstra and see if they're interested because, left to my own devices, I'll probably never get around to it.

Anyway, my point is that I think it's not just deliberate fraud that creates false knowledge, there is also a problem of well-intentioned and honest folk simply not understanding what to do, or when to do it.

3.  Convince yourself that a significant effect at p=.055 is real. How many times have psychologists tested a prediction only to find that the critical comparison just misses the crucial p=.05 value? How many times have psychologists then had another look at the data to see if it might just be possible that with a few outliers removed this predicted effect might be significant? Strangely enough, many published psychology papers are just creeping past the p=.05 value – and many more than would be expected by chance! Just how many false psychology facts has that created? (http://t.co/6qdsJ4Pm).


This is a massive over-simplification because an effect with p = .055 is 'real' and might very well be 'meaningful'. Conversely, an effect with p < .001 might very well be meaningless. To my mind it probably matters very little whether a p is .055 or .049. I'm not suggesting I approve of massaging your data, but really this point illustrates how wedded psychologists are to the idea of effects somehow magically becoming 'real' or 'meaningful' once p drifts below .05. There are a few points to make here:

First, all effects are 'real'. There should never be a decision, made by anyone, about whether an effect is real or not real: they're all real; it's just that some are large and some are small. There is a decision about whether an effect is meaningful, and that decision should be made within the context of the research question.

Second, I think an equally valid way to create 'false knowledge' is to publish studies based on huge samples reporting loads of small and meaningless effects that are highly significant. Imagine you look at the relationship between statistical knowledge and eating curry. You test 1 million people and find that there is a highly significant negative relationship, r = -.002, p < .001. You conclude that eating curry is a 'real' effect - it is meaningfully related to poorer statistical knowledge. There are two issues here: (1) in a sample of 1 million people the effect size estimate will be very precise, and the confidence interval very narrow. So we know the true effect in the population is going to be very close indeed to -.002. In other words, there is basically no effect in the population - eating curry and statistical knowledge have such a weak relationship that you may as well forget about it. (2) anyone trying to replicate this effect in a sample substantially smaller than 1 million is highly unlikely to get a significant result. You've basically published an effect that is 'real' if you use p < .05 to define your reality, but is utterly meaningless and won't replicate (in terms of p) in small samples.
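If you don't believe me about the precision, here's a rough R sketch of the confidence interval you'd get in this (entirely made-up) curry scenario, using Fisher's z transformation:

```r
# Hypothetical correlation and sample size from the curry example
r <- -.002
n <- 1e6

# Fisher's z transformation gives an approximate 95% CI for r
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1.96, 1.96) * se)
round(ci, 4)   # roughly -0.004 to 0.000: very precise, and essentially zero
```

The interval hugs zero so tightly that the only honest conclusion is that the population effect is trivially small, whatever the p-value says.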

Third, there is a wider problem than people massaging their ps. You have to ask why people massage their ps. The answer is that psychology is so hung up on p-values. More than 10 years after the APA published its report on statistical reporting (Wilkinson, 1999), there has been no change in the practice of applying the all-or-nothing thinking of accepting results as 'real' if p < .05. It's true that Wilkinson's report has had a massive impact on the frequency with which effect sizes and confidence intervals are reported, but (in my experience, which is perhaps not representative) these effect sizes are rarely interpreted with any substance, and it is still the p-value that drives decisions made by reviewers and editors.

This whole problem would go away if the 'meaning' and 'substance' of effects were treated not as a dichotomous decision, but as a point along a continuum. You quantify your effect, you construct a confidence interval around it, and you interpret it within the context of the precision that your sample size allows. This way, studies with large samples could no longer focus on meaningless but significant effects; instead, the researcher could say (given the high level of precision they have) that the effects in the population (the true effects, if you like) are likely to be about the size they observed, and interpret accordingly. In small studies, rather than throwing the baby out with the bathwater, large effects could be given some credibility, but with the caveat that the estimates in the study lack precision. This is where replication is useful. No need to massage data: researchers just give it to the reader as it is, interpret it and apply the appropriate caveats. One consequence of this might be that rather than publishing a single small study with massaged data to get p < .05, researchers might be encouraged to replicate their own study a few times and report them all in a more substantial paper. Doing so would mean that across a few studies you could show (regardless of p) the likely size of the effect in the population.
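As a toy illustration of that last point, here's one way you might pool the estimates from a few small replications in R. This is just a sketch: the correlations and sample sizes are invented, and the pooling is a simple fixed-effect average of Fisher-z-transformed rs weighted by n - 3 (a proper meta-analysis would do more).

```r
# Hypothetical results from three small replications of the same study
r <- c(.42, .31, .38)
n <- c(28, 35, 30)

# Fisher-z transform each r, weight by n - 3, and average
z <- atanh(r)
w <- n - 3
z_pooled  <- sum(w * z) / sum(w)
se_pooled <- 1 / sqrt(sum(w))

# Pooled estimate and its 95% CI, back-transformed to the r metric
tanh(z_pooled)
tanh(z_pooled + c(-1.96, 1.96) * se_pooled)
```

Whatever the individual p-values did, reporting something like this across replications tells the reader the likely size of the population effect and how precisely it has been estimated.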

That turned into a bigger rant than I was intending ....

References


Haller, H., & Kraus, S. (2002). Misinterpretations of Significance: A Problem Students Share with Their Teachers? MPR-Online, 7(1), 1-20. 
Wilkinson, L. (1999). Statistical Methods in Psychology Journals: Guidelines and Explanations. American Psychologist, 54(8), 594-604. 


Thursday, September 13, 2012

Assumptions Part 2: Homogeneity of Variance/Homoscedasticity

My last blog was about the assumption of normality, and this one continues the theme by looking at homogeneity of variance (or homoscedasticity, to give it its even more tongue-twisting name). Just to remind you, I’m writing about assumptions because this paper showed (sort of) that recent postgraduate researchers don’t seem to check them. Also, as I mentioned before, I get asked about assumptions a lot. Before I get hauled up before a court for self-plagiarism I will be up front and say that this is an edited extract from the new edition of my Discovering Statistics book. If making edited extracts of my book available for free makes me a bad and nefarious person then so be it.

Assumptions: A reminder

Now, I’m even going to self-plagiarize my last blog to remind you that most of the models we fit to data sets are based on the general linear model (GLM). This fact means that any assumption that applies to the GLM (i.e., regression) applies to virtually everything else. You don’t really need to memorize a list of different assumptions for different tests: if it’s a GLM (e.g., ANOVA, regression, etc.) then you need to think about the assumptions of regression. The most important ones are:
  • Linearity
  • Normality (of residuals) 
  • Homoscedasticity (aka homogeneity of variance) 
  • Independence of errors. 

What Does Homoscedasticity Affect?

Like normality, if you’re thinking about homoscedasticity, then you need to think about 3 things:
  1. Parameter estimates: That could be an estimate of the mean, or a b in regression (and a b in regression can represent differences between means). If we assume equality of variance then the estimates we get using the method of least squares will be optimal. 
  2. Confidence intervals: whenever you have a parameter, you usually want to compute a confidence interval (CI) because it’ll give you some idea of what the population value of the parameter is. 
  3. Significance tests: we often test parameters against a null value (usually we’re testing whether b is different from 0). For this process to work, we assume that the parameter estimates have a normal distribution. 

When Does The Assumption Matter?

With reference to the three things above, let’s look at the effect of heterogeneity of variance/heteroscedasticity:
  1. Parameter estimates: If variances for the outcome variable differ along the predictor variable then the estimates of the parameters within the model will not be optimal. The method of least squares (known as ordinary least squares, OLS), which we normally use, will produce ‘unbiased’ estimates of parameters even when homogeneity of variance can't be assumed, but better estimates can be achieved using different methods, for example, by using weighted least squares (WLS), in which each case is weighted by a function of its variance. Therefore, if all you care about is estimating the parameters of the model in your sample then you don’t need to worry about homogeneity of variance in most cases: the method of least squares will produce unbiased estimates (Hayes & Cai, 2007). However, if you want even better estimates, then use weighted least squares regression to estimate the parameters. 
  2. Confidence intervals: unequal variances/heteroscedasticity creates a bias and inconsistency in the estimate of the standard error associated with the parameter estimates in your model (Hayes & Cai, 2007). As such, your confidence intervals and significance tests for the parameter estimates will be biased, because they are computed using the standard error. Confidence intervals can be ‘extremely inaccurate’ when homogeneity of variance/homoscedasticity cannot be assumed (Wilcox, 2010). 
  3. Significance tests: same as above. 

Summary

If all you want to do is estimate the parameters of your model then homoscedasticity doesn’t really matter: if you have heteroscedasticity then using weighted least squares to estimate the parameters will give you better estimates, but the estimates from ordinary least squares will be ‘unbiased’ (although not as good as WLS). 
If you’re interested in confidence intervals around the parameter estimates (bs), or significance tests of the parameter estimates, then homoscedasticity does matter. However, many tests have variants to cope with these situations; for example, Welch's version of the t-test, the Brown-Forsythe and Welch adjustments in ANOVA, and numerous robust variants described by Wilcox (2010) and explained, for R, in my book (Field, Miles, & Field, 2012). There's a small sketch of the WLS and robust standard error approaches below.
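This is just an illustration, not a recipe: the data are simulated to be heteroscedastic, the heteroscedasticity-consistent standard errors use the sandwich and lmtest packages (in the spirit of Hayes & Cai, 2007), and the WLS weights cheat by using the known variance function.

```r
library(sandwich)
library(lmtest)

# Simulate a predictor and an outcome whose error variance grows with x
set.seed(123)
x <- runif(200, 1, 10)
y <- 3 + 0.5 * x + rnorm(200, sd = 0.5 * x)

# Ordinary least squares: the bs are unbiased, but the default standard errors
# (and so the CIs and p-values) can't be trusted here
ols <- lm(y ~ x)
summary(ols)

# The same model with heteroscedasticity-consistent (HC) standard errors
coeftest(ols, vcov = vcovHC(ols, type = "HC3"))

# Weighted least squares, weighting each case by the inverse of its error variance
# (here we know the variance is proportional to x^2; in practice you'd estimate it)
wls <- lm(y ~ x, weights = 1 / x^2)
summary(wls)
```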

Declaration


This blog is based on excerpts from the forthcoming 4th edition of ‘Discovering Statistics Using SPSS: and sex and drugs and rock ‘n’ roll’.

References

  • Field, A. P., Miles, J. N. V., & Field, Z. C. (2012). Discovering statistics using R: And sex and drugs and rock 'n' roll. London: Sage. 
  • Hayes, A. F., & Cai, L. (2007). Using heteroskedasticity-consistent standard error estimators in OLS regression: An introduction and software implementation. Behavior Research Methods, 39(4), 709-722. 
  • Wilcox, R. R. (2010). Fundamentals of modern statistical methods: substantially improving power and accuracy. New York: Springer.

Monday, August 6, 2012

Assumptions Part 1: Normality


.... I didn't grow a pair of breasts. If you didn't read my last blog that comment won't make sense, but it turns out that people like breasts so I thought I'd mention them again. I haven't written a lot of blogs, but my frivolous blog about growing breasts as a side effect of some pills was (by quite a large margin) my most viewed blog. It's also the one that took me the least time to write and that I put the least thought into. I think the causal factor might be the breasts.

This blog isn't about breasts, it's about normality. Admittedly the normal distribution looks a bit like a nipple-less breast, but it's not one: I'm very happy that my wife does not sport two normal distributions upon her lovely chest. I like stats, but not that much ...


Assumptions


Anyway, I recently stumbled across this paper. The authors sent a sample of postgrads (with at least 2 years' research experience) a bunch of data analysis scenarios and asked them how they would analyze the data. They were interested in whether, and how, these people checked the assumptions of the tests they chose to use. The good news was that they chose the correct test (although given that all of the scenarios basically required a general linear model of some variety, that wasn’t hard). However, not many of them checked assumptions. The conclusion was that people don’t understand assumptions or how to test them.

I get asked about assumptions a lot. I also have to admit to hating the chapter on assumptions in my SPSS and R books. Well, hate is a strong word, but I think it toes a very conservative and traditional line. In my recent update of the SPSS book (out early next year before you ask) I completely re-wrote this chapter. It takes a very different approach to thinking about assumptions.

Most of the models we fit to data sets are based on the general linear model (GLM), which means that any assumption that applies to the GLM (i.e., regression) applies to virtually everything else. You don’t really need to memorize a list of different assumptions for different tests: if it’s a GLM (e.g., ANOVA, regression, etc.) then you need to think about the assumptions of regression. The most important ones are:

  • Linearity
  • Normality (of residuals)
  • Homoscedasticity (aka homogeneity of variance)
  • Independence of errors.

What Does Normality Affect?

For this post I’ll discuss normality. If you’re thinking about normality, then you need to think about 3 things that rely on normality:

  1. Parameter estimates: That could be an estimate of the mean, or a b in regression (and a b in regression can represent differences between means). Models have error (i.e., residuals), and if these residuals are normally distributed in the population then using the method of least squares to estimate the parameters (the bs) will produce better estimates than other methods.
  2. Confidence intervals: whenever you have a parameter, you usually want to compute a confidence interval (CI) because it’ll give you some idea of what the population value of the parameter is. We use values of the standard normal distribution to compute the confidence interval: using values of the standard normal distribution makes sense only if the parameter estimates actually come from one.
  3. Significance tests: we often test parameters against a null value (usually we’re testing whether b is different from 0). For this process to work, we assume that the parameter estimates have a normal distribution. We assume this because the test statistics that we use (such as the t, F and chi-square), have distributions related to the normal. If parameter estimates don’t have a normal distribution then p-values won’t be accurate. 

What Does The Assumption Mean?


People often think that your data need to be normally distributed, and that’s what many people test. However, that’s not the case. What matters is that the residuals in the population are normal, and the sampling distribution of parameters is normal. However, we don’t have access to the sampling distribution of parameters or population residuals; therefore, we have to guess at what might be going on by testing the data instead.

When Does The Assumption Matter?


However, the central limit theorem tells us that no matter what distribution things have, the sampling distribution will be normal if the sample is large enough. How large is large enough is another matter entirely and depends a bit on what test statistic you want to use. So bear that in mind. Oversimplifying things a bit, we could say:


  1. Confidence intervals: For confidence intervals around a parameter estimate to be accurate, that estimate must come from a normal distribution. The central limit theorem tells us that in large samples, the estimate will have come from a normal distribution regardless of what the sample or population data look like. Therefore, if we are interested in computing confidence intervals then we don’t need to worry about the assumption of normality if our sample is large enough. (There is still the question of how large is large enough though.) You can easily construct bootstrap confidence intervals these days, so if your interest is confidence intervals then why not stop worrying about normality and use bootstrapping instead?
  2. Significance tests: For significance tests of models to be accurate the sampling distribution of what’s being tested must be normal. Again, the central limit theorem tells us that in large samples this will be true no matter what the shape of the population. Therefore, the shape of our data shouldn’t affect significance tests provided our sample is large enough. (How large is large enough depends on the test statistic and the type of non-normality. Kurtosis, for example, tends to screw things up quite a bit.) You can make a similar argument for using bootstrapping to get a robust p, if p is your thing (there's a sketch of the bootstrap approach after this list).
  3. Parameter Estimates: The method of least squares will always give you an estimate of the model parameters that minimizes error, so in that sense you don’t need to assume normality of anything to fit a linear model and estimate the parameters that define it (Gelman & Hill, 2007). However, there are other methods for estimating model parameters, and if you happen to have normally distributed errors then the estimates that you obtained using the method of least squares will have less error than the estimates you would have got using any of these other methods. 
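To make the bootstrapping suggestion concrete, here's a minimal sketch using R's boot package to get a percentile bootstrap confidence interval for a regression b. The data are simulated with deliberately skewed errors; it's an illustration, not a prescription.

```r
library(boot)

# Simulate a small sample with skewed (chi-square) errors
set.seed(42)
dat <- data.frame(x = runif(40))
dat$y <- 2 + 1.5 * dat$x + rchisq(40, df = 2)

# Statistic to bootstrap: the slope from refitting the model to a resampled data set
boot_b <- function(data, i) coef(lm(y ~ x, data = data[i, ]))["x"]

# 2000 bootstrap resamples and a percentile CI for the slope
b_boot <- boot(dat, boot_b, R = 2000)
boot.ci(b_boot, type = "perc")
```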

Summary

If all you want to do is estimate the parameters of your model then normality doesn’t really matter. If you want to construct confidence intervals around those parameters, or compute significance tests relating to those parameters, then the assumption of normality matters in small samples; but because of the central limit theorem we don’t really need to worry about this assumption in larger samples. The question of how large is large enough is a complex issue, but at least you now know what parts of your analysis will go screwy if the normality assumption is broken.

This blog is based on excerpts from the forthcoming 4th edition of ‘Discovering Statistics Using SPSS: and sex and drugs and rock ‘n’ roll’.

Wednesday, August 1, 2012

Side effects


I know I have been a bit rubbish with blogs recently, but I’m massively behind with the Discovering Statistics Using SPSS update, and these things fall by the wayside. Also, I can so rarely find anything remotely interesting to say, let alone blog about. If it were a blog about music then I could write all day. Anyway …

So, while writing the DSUS update I was unwell for a couple of months. It turned out to (probably) be stress related (updating a book involves a lot of long days, late nights, and pressure). Unlike women, who sensibly go to the doctor when they feel ill, men do not. However, I did eventually do the un-manly thing and go to my doctor. She prescribed some pills. In one of my other blogs I talked about key statistical skills that we should try to teach undergrads, and as I read the instructions for these pills it occurred to me that this is a good example of where the world would be a better place if people left university understanding statistics a bit better, and if providing useful statistical information therefore became the norm.

Like a diligent patient, I read the instruction leaflet that came with the pills. Like most pill leaflets it had an un-amusing list of possible side effects. These side effects were helpfully listed as common, uncommon and rare. Common ones included headache, stomach aches and feeling sick (OK, I can handle that); uncommon ones were dizziness, liver disease which might make my eyes yellow, rash, and sleepiness or trouble sleeping (but not both). The rare ones included liver failure resulting in brain damage, bleeding at the lips, eyes, mouth, nose and genitals, and development of breasts in men.

Excuse me? Did it say ‘development of breasts in men’?
Yes it did.
Here’s a photo to prove it.

I’ll admit that I don’t know much about human anatomy, but based on the little I do know, it seems intuitive that my immune system, if reacting badly to something like a drug, might overload my liver and make it explode, or give me kidney failure. I also know that feeling sick and having flu-like symptoms is part and parcel of your immune system kicking into action. But why on earth would my body respond to a nasty drug by sprouting breasts? Perhaps because having them would make me more likely to visit my doctor.

Anyway, back to the tenuous link to stats. Whenever I read this sort of thing (which fortunately isn’t often) I usually feel that I’d rather put up with whatever it is that’s bothering me than run the risk of, for example, bleeding from my penis or getting brain damage. I might feel differently if I had enough information to assess the risk. What do they mean by ‘uncommon’ or ‘rare’: 1/100, 1/1,000, 1/billion? Wouldn’t it be nice if we could have a bit more information, maybe even an odds ratio or risk ratio – that way I could know, for example, that if I take the pill I’d be 1.2 times more likely to grow breasts than if I don’t. That way we could better assess the likelihood of these adverse events, which, if you’re as neurotic as me, would be very helpful.

The campaign for more stats on drug instruction leaflets starts here.

Anyway, after all that I took the pill, went to sleep and dreamt of the lovely new breasts that I’d have in the morning … 

Saturday, July 21, 2012

Bonferroni correcting lots of correlations


Someone posed me this question:
Some of my research, if not all of it (:-S) will use multiple correlations. I'm now only considering those correlations that are less than .001. However, having looked at bonferroni corrections today - testing 49 correlations require an alpha level of something lower than 0.001. So essentially meaning that correlations have to be significant at .000. Am I correct on this? The calculator that I am using from the internet says that with 49 correlational tests, with an alpha level of 0.001 - there is chance of finding a significant result in approximately 5% of the time.
Some people have said to me that in personality psychology this is okay - but I personally feel wary about publishing results that could essentially be regarded as meaningless. Knowing that you probably get hammered every day for answers to stats question, I can appreciate that you might not get back to me. However - if you can, could you give me your opinion on using multiple correlations? Just seems a clunky method for finding stuff out.
It seemed like the perfect opportunity for a rant, so here goes. My views on this might differ a bit from conventional wisdom, so might not get you published, but this is my take on it:
  1. Null hypothesis significance testing (i.e. looking at p-values) is a deeply flawed process. Stats people know it's flawed, but everyone does it anyway. I won't go into the whys and wherefores of it being flawed, but I touch on a few things here and, to a lesser extent, here. Basically, the whole idea of determining 'significance' based on an arbitrary cut-off for a p-value is stupid. Fisher didn't think it was a good idea, Neyman and Pearson didn't think it was a good idea, and the whole thing dates back to prehistoric times when we didn't have computers to compute exact p-values for us.
  2. Because of the above, Bonferroni correcting when you've done a billion tests is even more ridiculous because your alpha level will be so small that you will almost certainly make Type II errors and lots of them. Psychologists are so scared of Type I errors, that they forget about Type II errors.
  3. Correlation coefficients are effect sizes. We don't need a p-value to interpret them. The p-value adds precisely nothing of value to a correlation coefficient other than to potentially fool you into thinking that a small effect is meaningful or that a large effect is not (depending on your sample size). I don't care how small your p-value is, an r = .02 or something is crap. If your sample size is fairly big then the correlation should be a precise estimate of the population effect (bigger sample = more precise). What does add value is a confidence interval for r, because it gives you limits within which the true (population) value is likely to lie.

So, in a nutshell, I would (personally) not even bother with p-values in this situation because, at best, they add nothing of any value and, at worst, they will mislead you. I would, however, get confidence intervals for your many correlations (and if you bootstrap the CIs, which you can in SPSS, then all the better). I would then interpret effects based on the size of r and the likely size of the population effect (which the confidence interval tells you).
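If you happen to be in R rather than SPSS, a minimal sketch of both the ordinary and bootstrapped CIs for a correlation looks like this (the two variables are made up):

```r
library(boot)

# Two made-up variables with a modest correlation
set.seed(7)
knowledge <- rnorm(60)
curry <- 0.3 * knowledge + rnorm(60)
dat <- data.frame(knowledge, curry)

# Ordinary 95% CI for r (based on Fisher's z)
cor.test(dat$knowledge, dat$curry)$conf.int

# Bootstrapped (bias-corrected and accelerated) CI for r
boot_r <- function(data, i) cor(data$knowledge[i], data$curry[i])
r_boot <- boot(dat, boot_r, R = 2000)
boot.ci(r_boot, type = "bca")
```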
Of course reviewers and PhD examiners might disagree with me on this, but they're wrong :-)
Ahhhhhh, that feels better.

SPSS is not dead


This blog was published recently showing that the use of R continues to grow in academia. One of the graphs (Figure 1) showed citations (using google scholar) of different statistical packages in academic papers (to which I have added annotations).
Figure 1: Citations of stats packages (from http://blog.revolutionanalytics.com/2012/04/rs-continued-growth-in-...)

At face value, this graph implies a very rapid decline in SPSS use since 2005. I sent a tongue-in-cheek tweet about this graph, and this perhaps got interpreted as me thinking that SPSS use was on the decline. So, I thought I’d write this blog. The thing about this graph is that it deals with citations in academic papers. The majority of people do not cite the package they use to analyse their data, so this might just reflect a decline in people stating that they used SPSS in papers. Also, it might be that users of software such as R are becoming more inclined to cite the package to encourage others to use it (stats package preference does, for some people, mimic the kind of religious fervour that causes untold war and misery; most packages have their pros and cons, and some people should get a grip). Also, looking at my annotations on Figure 1 you can see that the decline in SPSS is in no way matched by an upsurge in the use of R/Stata/Systat. This gap implies some mysterious ghost package that everyone is suddenly using but that is not included on this graph. Or perhaps people are just ditching SPSS for qualitative analysis, or doing it by hand :-)
If you really want to look at the decline/increase of package use then there are other metrics you could use. This article details lots of them. For example you could look at how much people talk about packages online (Figure 2).
Figure 2: online talk of stats packages (Image from http://r4stats.com/popularity)

Based on this, R seems very popular and SPSS less so. However, the trend for SPSS is completely stable between 2005 and 2010 (the period of decline in Figure 1). Discussion of R is on the increase though. Again, though, you can’t really compare R and SPSS here because R is more difficult to use than SPSS (I doubt that this is simply my opinion; I reckon you could demonstrate empirically that the average user prefers the SPSS GUI to R’s command interface, if you could be bothered). People are, therefore, more likely to seek help on discussion groups for R than they are for SPSS. It’s perhaps not an index of popularity so much as usability. 
There are various other interesting metrics discussed in the aforementioned article. Perhaps the closest we can get to an answer to package popularity (but not decline in use) is survey data on what tools people use for data mining. Figure 3 shows that people most frequently report R, SPSS and SAS. Of course this is a snapshot and doesn’t tell us about usage change. However, it shows that SPSS is still up there. I’m not sure what types of people were surveyed for this figure, but I suspect it was professional statisticians/business analysts rather than academics (who would probably not describe their main purpose as data mining). This would also explain the popularity of R, which is very popular amongst people who crunch numbers for a living.
Figure 3: Data mining/analytic tools reported in use on Rexer Analytics survey during 2009 (from http://r4stats.com/popularity).
To look at the decline or not of SPSS in academia what we really need is data about campus licences over the past few years. There were mumblings about universities switching from SPSS after IBM took over and botched the campus agreement, but I’m not sure how real those rumours were. In any case, the teething problems from the IBM takeover seem to be over (at least most people have stopped moaning about them). Of course, we can’t get data on campus licences because it’s sensitive data that IBM would be silly to put in the public domain. I strongly suspect campus agreements have not declined though. If they have, IBM will be doing all that they can (and they are an enormously successful company) to restore them, because campus agreements are a huge part of SPSS’s business.
Also, I doubt campus agreements have declined because they would stop for two main reasons: (1) SPSS isn’t used by anyone any more, or (2) the cost becomes prohibitive. These two reasons are related, obviously – the point at which universities stop the agreement will be a function of cost and campus usage. In terms of campus usage, if you grew up using SPSS as an undergraduate or postgraduate, you’re unlikely to switch software later in your academic career (unless you’re a geek like me who ‘enjoys’ learning R). So, I suspect the demand is still there. In terms of cost, as I said, I doubt IBM are daft enough to price themselves out of the market.
So, despite my tongue-in-cheek tweet, I very much doubt that there is a mass exodus from SPSS. Why would there be? Although some people tend to be a bit snooty about SPSS, it's a very good bit of software: a lot of what it does, it does very well. There are things I don’t like about it (graphs, lack of robust methods, the insistence on moving towards automated analysis), but there are things I don’t like about R too. Nothing is perfect, but SPSS's user-friendly interface allows thousands of people who are terrified of stats to get into it and analyse data and, in my book, that's a very good thing.

One-Tailed Tests


I’ve been thinking about writing a blog on one-tailed tests for a while. The reason is that one of the changes I’m making in my re-write of DSUS4 is to alter the way I talk about one-tailed tests. You might wonder why I would want to alter something like that – surely if it was good enough for the third edition then it’s good enough for the fourth? Textbook writing is quite an interesting process because when I wrote the first edition I was very much younger, and to some extent the content was driven by what I saw in other textbooks. Even as the book has evolved over the editions, the publishers get feedback from lecturers who use the book, I get emails from people who use the book, and so, again, content gets driven a bit by what people who use the book want and expect to see. People expect to learn about one-tailed tests in an introductory statistics book and I haven’t wanted to disappoint them. However, as you get older you also get more confident about having an opinion on things. So, although I have happily entertained one-tailed tests in the past, in more recent years I have come to feel that they are one of the worst aspects of hypothesis testing and should probably be discouraged.
Yesterday I got the following question landing in my inbox, which was the perfect motivator to write this blog and explain why I’m trying to deal with one-tailed tests very differently in the new edition of DSUS:
Question: “I need some advice and thought you may be able to help. I have a one-tailed hypothesis, ego depletion will increase response times on a Stroop task. The data is parametric and I am using a related T-Test.
Before depletion the Stroop performance mean is 70.66 (12.36)
After depletion the Stroop performance mean is 61.95 (10.36)
The t-test is, t (138) = 2.07, p = .02 (one-tailed)
Although the t-test comes out significant, it goes against what I have hypothesised. That Stroop performance decreased rather than increased after depletion. So it goes in the other direction. How do I acknowledge this in a report?
I have done this so far. Is it correct?
Although the graph suggests there was a decrease in Stroop performance times after ego-depletion. Before ego-depletion (M=70.66, SD=12.36) after ego-depletion (M= 61.95, SD=10.36), a t-test showed there was a significance between Stroop performance phase one and two t (138) = 10.94, p <.001 (one-tailed).”
This question illustrates perfectly the confusion people have about one-tailed tests. The author quite rightly wants to acknowledge that the effect was in the opposite direction, but quite wrongly still wants to report the effect … and why not: effects in the opposite direction are interesting and intriguing, and any good scientist wants to explain interesting findings.
The trouble is that my answer to the question of what to do when you get a significant one-tailed p-value but the effect is in the opposite direction to what you predicted is (and I quote my re-written chapter 2 here): “if you do a one-tailed test and the results turn out to be in the opposite direction to what you predicted you must ignore them, resist all temptation to interpret them, and accept (no matter how much it pains you) the null hypothesis. If you don’t do this, then you have done a two-tailed test using a different level of significance from the one you set out to use”
[Quoting some edited highlights of the new section I wrote on one-tailed tests]:
One-tailed tests are problematic for three reasons:
  1. As the question I was sent illustrates, when scientists see interesting and unexpected findings their natural instinct is to want to explain them. Therefore, one-tailed tests are dangerous because like a nice piece of chocolate cake when you’re on a diet, they waft the smell of temptation under your nose. You know you shouldn’t eat the cake, but it smells so nice, and looks so tasty that you shovel it down your throat. Many a scientist’s throat has a one-tailed effect in the opposite direction to that predicted wedged in it, turning their face red (with embarrassment).
  2. One-tailed tests are appropriate only if a result in the opposite direction to the expected direction would result in exactly the same action as a non-significant result (Lombardi & Hurlbert, 2009; Ruxton & Neuhaeuser, 2010). This can happen, for example, if a result in the opposite direction would be theoretically meaningless or impossible to explain even if you wanted to (Kimmel, 1957). Another situation would be if, for example, you’re testing a new drug to treat depression. You predict it will be better than existing drugs. If it is not better than existing drugs (non-significant p) you would not approve the drug; however, if it was significantly worse than existing drugs (significant p but in the opposite direction) you would also not approve the drug. In both situations, the drug is not approved.
  3. One-tailed tests encourage cheating. If you do a two-tailed test and find that your p is .06, then you would conclude that your results were not significant (because .06 is bigger than the critical value of .05). Had you done this test one-tailed, however, the p you would get would be half of the two-tailed value (.03); this one-tailed value would be significant at the conventional level (the arithmetic is sketched below). Therefore, if a scientist finds a two-tailed p that is just non-significant, they might be tempted to pretend that they’d always intended to do a one-tailed test, halve the p-value to make it significant, and report that significant value. Partly this problem exists because of journals’ obsession with p-values, which rewards significance. This reward might be enough of a temptation for some people to halve their p-value just to get a significant effect. This practice is cheating (for reasons explained in one of the Jane Superbrain boxes in Chapter 2 of my SPSS/SAS/R books). Of course, I’d never suggest that scientists would halve their p-values just so that they become significant, but it is interesting that two recent surveys of practice in ecology journals concluded that “all uses of one-tailed tests in the journals surveyed seemed invalid” (Lombardi & Hurlbert, 2009), and that only 1 in 17 papers using one-tailed tests were justified in doing so (Ruxton & Neuhaeuser, 2010).
For these reasons, DSUS4 is going to discourage the use of one-tailed tests unless there's a very good reason to use one (e.g., point 2 above). 
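To make the arithmetic in point 3 concrete, here's a trivial R sketch with an invented t statistic and degrees of freedom:

```r
# A hypothetical t statistic that just misses two-tailed significance
t_obs <- 1.94
dfs   <- 60

# Two-tailed p: probability of a t at least this extreme in either direction
p_two <- 2 * pt(abs(t_obs), dfs, lower.tail = FALSE)
p_two   # about .06: 'not significant'

# One-tailed p (if the effect is in the predicted direction): exactly half
p_one <- pt(abs(t_obs), dfs, lower.tail = FALSE)
p_one   # about .03: suddenly 'significant'
```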
PS Thanks to Shane Lindsay who, a while back now, sent me the Lombardi and Ruxton papers.

References

  • Kimmel, H. D. (1957). Three criteria for the use of one-tailed tests. Psychological Bulletin, 54(4), 351-353. doi: 10.1037/h0046737
  • Lombardi, C. M., & Hurlbert, S. H. (2009). Misprescription and misuse of one-tailed tests. Austral Ecology, 34(4), 447-468. doi: 10.1111/j.1442-9993.2009.01946.x
  • Ruxton, G. D., & Neuhaeuser, M. (2010). When should we use one-tailed hypothesis testing? Methods in Ecology and Evolution, 1(2), 114-117. doi: 10.1111/j.2041-210X.2010.00014.x

Rock makes you Racist ... Apparently


Like buses, you don’t get a blog for weeks and then two come at once. I saw this headline today: Does listening to rock make you racist? Seven minutes of Bruce Spri... in the Daily Mail online. They also included a helpful picture of Scott Weiland wearing a pseudo-Nazi outfit (well, it was a black shirt with a bit of a poor choice of peaked cap) to ‘reflect the association between rock and white people’. ‘The association between rock and white people’ – bugger me, it’s as though Bad Brains, Living Colour, 24-7 Spyz, Animals as Leaders, Body Count (shall I go on?) or those collaborations between Public Enemy and Anthrax had never happened. In the world of the Daily Mail, rock makes you a racist, simple as. Now they’ve got the science to back it up. Mothers and fathers everywhere, protect your children from this evil and rancid puff of Satan’s anal smoke that pervades society in the form of ‘rock music’; it will infect their brains and make them racists. I’d have thought this would be a good thing as far as the Daily Mail are concerned given this, and this, and this, and, well, every other article they publish.
Anyway, enough about the Daily Mail. The point is, this piece of research has been seized on by many a website, including the NME, who have for years been trying to find a good reason to justify looking down their self-important noses at rock and heavy metal. Now they have one: it makes us all racist. Or does it?
It’s based on Helen LaMarre’s doctoral thesis. I don’t want to get into bashing this study because I suspect that, like most scientists who find their studies spreading like wildfire across the internet, the authors at no point said that listening to Bruce Springsteen makes you a racist. It’s easy to bash any study – nothing is perfect. My issue here is with the way the study is presented by the media.
Essentially, in this study they took 148 undergrads (all Caucasian otherwise it doesn’t really make sense), and sat them in a waiting room for 7 minutes during which one of three types of music was played:
  • Mainstream rock: The White Stripes, Bon Jovi, Bruce Springsteen, Van Morrison, Foo Fighters (2 songs), Radiohead
  • Radical white power rock (i.e. racist dickhead rock): Prussian Blue (2 songs), Screwdriver, Bound for Glory, Max Resist (2 songs)
  • Top 40 Pop: Justin Timberlake (3 songs), Fergie and Akon (2 songs), Fergie (without Akon), Gwen Stefani (with Akon, who gets about a bit), Gwen Stefani (2 songs), Rihanna.
At the end of this they were asked to allocate $500,000, as percentage chunks, to four student groups based on descriptions of those groups. The descriptions depicted White American, African American, Arab American and Latino American student groups. So, for example, if you wanted to make equal allocations, you would respond 25%, 25%, 25%, 25%. They found that when listening to pop music the allocations were fairly even (means of 24.02, 25.49, 24.02, 24.76); after rock music they allocated more to the White American student group (M = 35) compared to all of the others (all Ms around 21). After listening to right-wing music, allocations were higher to White American students (M = 39.47) than to African American (M = 16.09), Arab American (M = 14.58) and Latino American (M = 25.58) students.
Statistically speaking these are pretty decent sized effects (huge in some cases). However, a few things to consider in making your own mind up about whether this shows that 7 minutes of Bruce Springsteen makes you a racist:
  1. Is a control group of pop music appropriate? A no music control group (just being in the waiting room) would give you a better baseline of people’s natural responses. The pop music (I’m not really familiar with it, but judging by song titles) was quite love oriented, so it’s possible that hearing songs about love etc. puts you in a good mood, and in a good mood you make more balanced allocations of the funds. I don’t know this to be true, it’s a hypothesis. However, I think a no music control group is a better baseline than any other form of music, because you can then assess whether a particular genre changes things compared to nothing at all. We could then see whether rock music affects allocations negatively, or pop music affects them positively. As it stands we just know the genres differ, but we don’t know whether pop makes you fairer or rock makes you unfair, or both.
  2. Is it the music that matters? This kind of research is very difficult to do because you’re not just manipulating the genre of music, you’re manipulating all sorts of other confounds that systematically vary with your independent variable. One example in this study is (arguably) aggression (rock is arguably more aggressive than pop; right-wing rock is undoubtedly more aggressive than lots of other things). So here, you have a pattern of the rockier the music, the more money allocated to White American students, but is it just because of a mood induction? Is it that the more of a negative mood you’re in, the more biased you are towards your own race? (It would be an interesting finding in itself that people show a same-race bias when they’re in a bad mood, but it would undermine the conclusion that rock music per se causes a same-race bias, because there are lots of things that might put you in a bad mood other than rock music. Reading the Daily Mail, for example.) The problem here is that rock wasn’t pitted against, say, hardcore hip-hop, or better still some Minor Threat or Fugazi, who are very aggressive but promote very liberal themes in their lyrics. No measures of mood were taken, so we don’t know whether there was a mood effect at all, and we certainly don’t know whether it’s the genre that matters, the lyrics, or the tone of the music. As I said, it’s really hard to match all of the variables that you might want to match, but the press portray the research in very simplistic terms and it’s not that simple.
  3. What about individual differences? When asked what music they listened to, the most common response was pop (the details of this questionnaire are sketchy so I’m not entirely sure what question was asked). So, in effect, you’ve got a bunch of people who probably don’t listen to rock much, who are played rock in a waiting room. Some other people were played music that they ‘prefer’ (pop), and they were subsequently more fair-minded and nice than those played less familiar and less preferred music (rock). You’d really need some kind of measure of people’s preference and then look for an interaction between genre and preference. Maybe it’s simply that when you’re subjected to music that you don’t particularly like, you show a same-race bias? This goes back to the mood effects problem. Again, what’s needed here is a bit more research that delves into how you’re affected by the familiarity of the music and whether it’s music you actually like: by having a wider range of genres (not just rock and pop) and different groups of people with different tastes (and from different racial backgrounds), we might be able to pick apart some of these potential confounds.
  4. The money allocation task: arguably the money allocation task magnifies the effect. You have 100% to allocate over 4 boxes. You have to allocate exactly 100%. So, let’s imagine you’re fair-minded and allocate across the boxes as 25%, 25%, 25%, 25%. Job done. Let’s say you change your mind and decide that you want to give box 1 an extra percent: 26%, 25%, 25%, 25%. You’ve now allocated 101% and that’s not allowed. So, you’d have to remove 1% from another box to complete the task as requested. So perhaps you decide box 2 is your least favourite and you now allocate: 26%, 24%, 25%, 25%. You have allocated 100% and you have completed the task as requested. My point is that a small preference for box 1 (you wanted to add 1%) gets doubled, because to do this you have to subtract some from one or more of the other boxes: a 1% difference between boxes 1 and 2 becomes a 2% difference. I’m not saying that this means the results are nonsense or anything like that, but I am saying that it has probably magnified the effects reported, because a slight preference for one group will be magnified simply because to increase funds to that group you have to take them away from another.
These are just a few points off the top of my head. Of course, I’m a huge rock and metal fan and I have my own biases: years of listening to Slayer have not made me a Satanist any more than years of listening to Public Enemy made me anti-white (although it did give me an enlightening new perspective on many things). I’m prepared to be proved wrong, but on the basis of this study I’m not concerned that I’ll wake up tomorrow as a raving racist. So, like I said, this blog is more about how the press portray what is actually a very complex research question in a completely idiotic way. I always like reading studies about music preferences, and this one, like many I have read, poses interesting questions about the effect that music has on us and how we study it. There are lots of methodological issues that arise in trying to control the appropriate confounds if you’re trying to make statements about genres of music. There are also lots of interesting questions about what aspects of music affect people (digging below the rather arbitrary classifications of rock, pop, rap or indie) and how these characteristics interact with the personality types of the people who listen to them to affect cognition and emotion.
Right, I’m off to listen to some Devin Townsend, after which I’m going to start a campaign to shut down all bad coffee outlets. Ziltoid ……..

TwitterPanic, NHST, Wizards, and the cult of significance again


****Warning, some bad language used: don't read if you're offended by that sort of thing****
I haven’t done a blog in a while, so I figured I ought to. Having joined Twitter a while back, I now find myself suffering from TwitterPanic™, an anxiety disorder (which I fully anticipate will be part of DSM-V) characterised by a profound fear that people will unfollow you unless you keep posting things to remind them of why it’s great to follow you. In the past few weeks I have posted a video of a bat fellating himself and a video of my cat stopping me writing my textbook. These might keep the animal ecologists happy, but most people probably follow me because they think I’m going to write interesting things about statistics, and not because they wanted to see a fellating bat. Perhaps I’m wrong, and if so please tell me, because I find it much easier to get ideas for things to put online that rhyme with stats (like bats and cats) than I do about stats itself.
Anyway, I need to get over my TwitterPanic, so I’m writing a blog that’s actually about stats. A few blogs back I discussed whether I should buy the book ‘the Cult of Statistical .... I did buy it, and read it. Well, when I say I read it, I started reading it, but if I’m honest I got a bit bored and stopped before the end. I’m the last person in the world who could ever criticise anyone for labouring points but I felt they did. To be fair to the authors I think the problem was more that they were essentially discussing things that I already knew, and it’s always difficult to keep focus when you’re not having ‘wow, I didn’t know that’ moments. I think if you’re a newbie to this debate then it’s an excellent book and easy to follow.
The Fields on Honeymoon
In the book, the authors argue the case for abandoning null hypothesis significance testing, NHST (and I agree with most of what they say – see this), but they frame the whole debate a bit like a war between them (and people like them) and ‘the sizeless scientists’ (that’s the people who practise NHST). The ‘sizeless scientists’ are depicted (possibly not intentionally) like a bunch of stubborn, self-important, bearded, cape-wearing, fuckwitted wizards who sit around in their wizardy rooms atop the tallest ivory tower in the kingdom of elephant tusks, hanging onto notions of significance testing for the sole purpose of annoying the authors with their fuckwizardry. I suspect the authors have had their research papers reviewed by these fuckwizards. I can empathise with the seeds of bile that experience might have sown in the authors’ bellies; however, I wonder whether writing things like ‘perhaps they [the sizeless scientists] don’t know what a confidence interval is’ is the first step towards thinking that the blue material with stars on that you’ve just seen would look quite fetching as a hat.
I don’t believe that people who have PhDs and do research are anything other than very clever people, and I think the vast majority want to do the right thing when it comes to stats and data analysis (am I naïve here?). The tone of most of the emails I get suggests that people are very keen indeed not to mess up their stats. So, why is NHST so pervasive? I think we can look at a few sources:
  1. Scientists in most disciplines are expected to be international experts in their discipline, which includes being theoretical leaders, research experts, and drivers of policy and practice. On top of this they’re also expected to have a PhD in applied statistics. This situation is crazy, really. So, people tend to think (not unreasonably) that what they were taught in university about statistics is probably still true. They don’t have much time to update their knowledge. NHST is appealing because it’s a very recipe-book approach to things, and recipes are easy to follow.
  2. Some of the people above will be given the task of teaching research methods/statistics to undergraduates/postgraduates. Your natural instinct is to teach what you know. If you were taught NHST, then that’s what you’ll teach. You might also be teaching a course that forms part of a wider curriculum, and that will affect what you teach. For example, I teach second-year statistics, and by the time I get these students they have had a year of NHST, so it seems to me that it would be enormously confusing for them if I suddenly said ‘oh, all that stuff you were taught last year, well, I think it’s bollocks, learn this instead’. Instead, I weave in some arguments against NHST, but in a fairly low-key way so that I don’t send half of the year into mass confusion and panic. Statistics is confusing enough for them without me undermining a year of their hard work.
  3. Even if you wanted to remove NHST from your curriculum, you might be doing your students a great disservice, because reviewers of research will likely be familiar with NHST and expect to see it. It might not be ideal that this is the case, but that is the world as we currently know it. When I write up research papers I would often love to abandon p-values, but I know that if I do then I am effectively hand-carving a beautiful but knobbly stick, attaching it to my manuscript, and asking the editor if he or she would be so kind as to send the aforementioned stick to the reviewers so that they can beat my manuscript with it. If your students don’t know anything about NHST, are you making their research careers trickier to negotiate?
  4. Textbooks. As I might have mentioned a few million times, I’m updating Discovering Statistics Using SPSS (DSUS as I like to call it). This book is centred around NHST, not because I’m particularly a fan of it, but because it’s what teachers and people who adopt the book expect to see in it. If they don’t see it, they will probably use a different book. I’m aware that this might come across as me completely whoring my principles to sell my book, and perhaps I am, but I also feel that you have to appreciate where other people are coming from. If you were taught NHST, if that’s what you’ve done for 10 or 20 years, and if that’s what you teach because that’s what you genuinely believe is the right way to do things, then the last thing you need is a pompous little arse from Brighton telling you to change everything. It’s much better to have that pompous little arse try to stealth-brainwash you into change: yes, with each edition I feel that I can do a bit more to promote approaches other than NHST. Subvert from within and all that.
So, I think the cult of significance will change, but it will take time, and rather than seeing it as a war between rival factions, perhaps we should pretend it’s Christmas Day, get out of the trenches, play a nice game of football/soccer, compliment each other on our pointy hats, and walk away with a better understanding of each other. It’d be nice if we didn’t go back to shooting each other on Boxing Day, though.
The APA guidelines of over 10 years ago and the increased use of meta-analysis have, I think, had a positive impact on practice. However, we’re still in a sort of hybrid wilderness where everyone does significance tests and, if you’re lucky, people report effect sizes too. I think perhaps one day NHST will be abandoned completely, but it will take time, and by the time it has we’ll probably have found a reason why confidence intervals and effect sizes are as comedic as sticking a leech on your testicles to cure a headache.
I’ve completely lost track of what the point of this blog was now. It started off with me planning to have a rant about one-tailed tests (I’ll save that for another day) because I thought that might ease my TwitterPanic. However, I got sidetracked by thinking about the cult of significance book. I now feel a bit bad, because I might have been a bit critical of it, and I don’t like it when people criticise my books so I probably shouldn’t criticise others’. I stuck a sweet wizard-hat-related honeymoon picture in to hopefully soften the authors’ attitude towards me in the unlikely event that they ever read this and decide to despise me. I then took some therapy for dealing with worrying too much about what other people think. It didn’t work. Once I’d thought about that book I remembered that I’d wanted to tell anyone who might be interested that I thought the authors had been a bit harsh on people who use NHST. I think that sidetrack was driven by a subconscious desire to use the word ‘fuckwizardry’, because it made me laugh when I thought of it and Sage will never let me put that in DSUS4. The end result is a blog about nothing, and that’s making my TwitterPanic worse …

Definitions

  • Fuckwizard: someone who does some complicated/impressive task in a fuckwitted manner but with absolute confidence that they are doing it correctly.
  • Fuckwizardry: doing a complicated or impressive task in a fuckwitted manner but with absolute confidence that you are doing it correctly.