Thursday, September 29, 2016

StinkFiske

StinkFiske[1]


Many of you will have seen former APS president Professor Susan Fiske’s recently leaked opinion piece in the APS observer and the outcry it has caused.

I’m late in on this, but in my defence I have a 6-week-old child to help keep alive and I’m on shared parental leave, so I’m writing this instead of, you know, enjoying one of the rare moments I get to myself. I really enjoyed (and agreed with) Andrew Gelman’s blog post about it, and there are other good pieces by, amongst others, Chris Chambers. I won’t pretend to have read other ones, so forgive me if I’m repeating someone else. It wouldn’t be the first time.

The Gelman and Chambers blogs will give you the background of why Fiske’s piece has caused such waves. I don’t want to retread old ground, so the short version is that she has basically accused a bunch of people (who she refuses to name) of being ‘methodological terrorists’ who go around ruining people’s careers by posting critiques on social media. She goes on to argue that this sort of criticism is better conducted in private behind the (let’s face it) wholly flawed peer review system. She cites anecdotal examples of students leaving science for fear that (shock horror) someone might criticise their work.

The situation hasn’t been helped by her choice of evocative language. Let’s be clear here, I don’t agree with anything Professor Fiske wrote, or the way she wrote it. However, I think it has been interesting and useful in getting people to think about why she believes what she believes. I particularly recommend Gelman’s post for some insight into the whole history of the situation and his take on where Fiske might be coming from[2].

I follow a lot of methodologists on Twitter and the ensuing carnage has been informative and thought-provoking. The reaction has tended to focus on the lack of evidence for the claims she makes, and counterarguments against her view. However much you might disagree with her view, I think it plausibly represents the views of a great number of psychologists/scientists. As the days have gone on since I read Fiske’s piece I find myself less and less focussed on her individual view and more and more asking myself why people might share these views and what we need to do to douse the flames with white poppies.


What science should be


I want to start with an anecdote. My PhD and first couple of years of postdoc were spent failing to replicate a bunch of studies that showed Evaluative Conditioning (essentially preference learning through association). I did a shit-tonne of experiments, and they all failed to replicate the basic phenomenon. The original studies were by a group at KU Leuven. I tried to get them published; that didn’t go too well[3]. I emailed the lead author (Frank Baeyens) throughout my PhD and he was always very helpful, constructive and open to discussing my failures - even after I published a paper suggesting that their results might have been an artefact of their methodology. The upshot was that they invited me (expenses paid) to Belgium to discuss things. Which we did. They then tried to kill me in the most merciful way they could think of: Belgian beer. My point is, they cared about science and about working out what was going on. We could sit down with a beer and forge long-standing friendships over our disagreements. It wasn’t personal - everyone just wanted to understand better the phenomenon we were trying (as best we both could) to capture.

That’s how science should be: it’s about updating your beliefs as new evidence emerges and it’s not about the people doing it. Why is it that scientists feel so threatened by failed replications and re-analysis of their data? I’m going to brain dump some thoughts.


Tone


I consider myself at least aligned with (and possibly a fully fledged member of) the “self-appointed data police”, but I have at times (the minority of times I hasten to add) found discussions of some work a bit ‘witch-hunty’. I have some sympathy for some people feeling attacked. However, as someone who keeps a fairly close eye on methodological stuff and follows quite a lot of people who I suspect Fiske was directing her polemic at, on the whole people are civil and really just want to make science better[4]. I really believe that the data police have their hearts in the right place. Yes, statisticians have hearts.


Non-selective criticism


I think one reason why people might share Fiske’s views is that critiques tend to garner more interest when they change the conclusions of the study negatively, and because of the well-known selection bias for significant studies this invariably entails ‘look, the significant effect is not significant when the data police do things properly’. Wouldn’t it be fun, just for a change, to have a critique along the lines of ‘I did something the authors didn’t think of to improve the analysis, and the original conclusions stand up’. Let’s really let our imaginations wander to scenarios along the lines of ‘We re-examined this study of null results using some reasonable subjective priors to obtain Bayesian estimates of the model parameters and there’s greater evidence than the authors thought that the substantive effect under investigation could be big enough to warrant further investigation.’ In this last case, no-one would be doubting the integrity of the original authors but, other things being equal, there’s no difference in any of these situations: there’s data, there’s one analysis of it and some conclusions, then another analysis of it and some other conclusions.
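
To make that last scenario concrete, here is a toy sketch (my own made-up numbers, not from any real study) of the kind of Bayesian re-analysis I have in mind: take a reported effect estimate and its standard error, combine them with a reasonable subjective prior using a simple normal-normal conjugate update, and see whether the resulting credible interval leaves room for an effect big enough to be worth following up.

  # Toy sketch with made-up numbers: a 'non-significant' effect estimate and
  # its standard error from a hypothetical original paper
  obs_b  <- 0.15
  obs_se <- 0.12

  # A subjective prior: smallish effects are plausible, but we're fairly uncertain
  prior_mean <- 0.20
  prior_sd   <- 0.25

  # Normal-normal conjugate update (a precision-weighted average)
  post_var  <- 1/(1/prior_sd^2 + 1/obs_se^2)
  post_mean <- post_var*(prior_mean/prior_sd^2 + obs_b/obs_se^2)

  # Posterior mean, SD and a 95% credible interval for the effect
  round(c(mean = post_mean, sd = sqrt(post_var)), 3)
  round(post_mean + c(-1.96, 1.96)*sqrt(post_var), 3)

If that interval comfortably includes values big enough to matter, the ‘null’ paper becomes a reason to run a better-powered follow-up rather than a dead end, and no-one’s integrity is in question.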

My point is that researchers are likely to feel less defensive if the focus of re-analysis within our discipline broadens beyond debunking. Re-analysis that reaches the same conclusions as the original paper has just as much value as debunking, but either the data police don’t do that sort of thing, or when we do no-one pays it any attention. We can’t control hits and retweets (and a good debunking story is always going to generate interest) but we can affect the broader culture of critique and make it more neutral.

We also need to get away from this notion of doing analysis the correct way. Of course there are ways to do things incorrectly, but there is rarely one correct way. In a recent study by Silberzahn et al., 61 data analysts were given the same dataset and research question, and they fitted a wide variety of different models to the same data. We can usefully re-analyse data/conclusions without falling into the judgemental terminology of correct/incorrect.

We need ‘re-analysing data’ to become a norm rather than the current exception that gets widely publicised because it challenges the conclusions of the paper in a bad way. To make it a norm, we really need a whole new system because, as click-baity as retractions are, science is about updating beliefs in the light of new evidence. Surely that ethos is best served by papers that are open to re-analysis, commentary and debate. Some hurdles are how to stop that debate getting buried, and how to reward/incentivise that debate (i.e. give people credit for the time they spend contributing to these ongoing discussions of theories). It feels like the traditional system of peer-reviewed ‘articles’ that stand as static documents of ‘truth’ poorly serves these aims. Quite how we create the sea change necessary to think of, present, and cite ‘journal articles’ as dynamic pieces of knowledge in a constant state of flux is another matter. Of course, there is also the question of how we re-invent CVs, because we all know how much academics love their CVs and all of the incentive structures currently favour lists of ‘static’ knowledge.


Honest mistakes or mistaken honesty


The second reason why I think some people might have sympathy with Fiske’s views is that many scientists find it difficult to disentangle critique of their work from accusations of dishonesty. It is understandable that emotions run high: academia is not a job, it’s a way of life, and for most of us the line between home and work is completely blurred. We invest emotionally in what we do, and criticism of your work can feel like criticism of you[5]. In psychology at least, the situation isn’t helped by the selective nature of methodological critique in recent years (see above) and the very public cases of actual misconduct unearthed through methodological critique (insert your own example here, but Diederik Stapel is possibly the most famous[6]). I think we could all benefit from accepting that being a scientist is an ongoing learning curve. If we knew everything, there would be no point in us doing our jobs.

Let me give you a personal example. I am regarded by some as a statistics ‘expert’ (at least within Psychology), which of course is a joke because I have no formal training in statistics. Nevertheless, I like statistics more than I like psychology, and I enjoy learning about it. My textbooks are a document of my learning. If I could create a black hole that would suck editions 1 to 3 of my SPSS book into it, I happily would, because they contain some fairly embarrassing things that reflect what I thought at the time was the ‘truth’. I didn’t know any better, but I do now. Give me a dataset now and I’ll do a much better job of analysing it than I would have in 1998 when I started writing the SPSS book. Three years ago I didn’t have a clue what Bayesian statistics was; these days I still don’t, but I get the gist and have some vague sense of how to apply it in a way that I think (hopefully) isn’t wrong. Perhaps I should be embarrassed that I needed to ask Richard Morey to critique the Bayesian bits of my last textbook, and that he found areas where my understanding was off, but I learnt a lot from it. Likewise, someone reanalysing my data will, I hope, teach me something. Andrew Gelman makes a similar point. Let’s not see re-analysis as judgement of competence, because we are all at the mercy of our knowledge base at any given time. My knowledge base of statistics in 2016 is different to what it was in 1998, so let re-analysis be about helping people to improve how they do their job.

If we accept that scientists are on a learning curve then they will make mistakes. I don’t believe that most scientists are dishonest, but I do believe that they make honest mistakes that are perpetuated by (1) poor education, and (2) the wrong incentive structures.


They know not what they do


Anecdotally, I get hundreds of emails a year from people aspiring to publish their research (not just in psychology) asking statistics questions. None of them seem dishonest, but some of them certainly harbour some misconceptions about data analysis and what’s appropriate. Hoekstra et al. (2012) provide some evidence that researchers don’t routinely check the assumptions of their models, but again I think this more likely reflects perceived norms or poor education than it does malpractice.

It is of course ridiculous that we are expected to be both expert theoreticians in some specialist area of a discipline and simultaneously remain at the cutting edge of increasingly complex statistical tools. It’s bonkers, it really is. I’ve reached the point where I spend so much time thinking/reading/doing statistics that I barely have room in my head for psychology. Within this context, I am certain that people are trying their best with their data, but let’s be clear - they are up against it for many reasons and there will always be some a-hole like me who has abandoned psychology for a life of nitpicking everyone else’s analyses.

One major obstacle is the perpetuation of poor practice. The problem of education boils down to the fact that training in psychological research methods and data analysis tends to be quite rule-based. I did a report for the HEA a few years back on statistics teaching in UK psychology degree programmes. There is amazing consistency in what statistical methods are taught to undergraduate psychologists in the UK, and it won’t surprise anyone that it is very much based on Null Hypothesis Significance Testing (NHST), p values etc. Relatively few institutions delve into effect sizes, or Bayesian approaches. There’s nothing necessarily wrong with teaching NHST (I say this mainly to wind up the Bayesians on Twitter …) because it is so widely used, but it is important to also teach its limitations and to offer alternative approaches. It’s not clear how much this is done, but I think awareness of the issues has radically increased compared to when I was a postgraduate in 1763.

One problem that teaching NHST does create is that it is very easy to be recipe-book about it: if you don’t understand what you’re doing, just look at p and follow the rule. Of course, I’m absolutely guilty of this in my textbooks because it is such an easy trap to fall into[7]. The nuances of NHST are tricky, so for students who struggle with statistics the line of least resistance is ‘follow this simple rule’. For those that then go on to PhDs, and are supervised by people who also blindly follow the rules (because that’s what they were taught too), and who are then incentivised to get ‘significant results’ (see later), you have, frankly, a recipe for disaster.

There are many reasons why NHST is so prevalent in teaching and research. I wrote a blog in 2012 that is strangely relevant here (and until I’m proved otherwise, I believe it to be the first recorded use of the word ‘fuckwizardy’, which I’m disappointed hasn’t caught on, so give it a read and insert that word liberally into conversation from now on - thanks). In it, I gave a few ideas about why NHST is so prevalent in psychological science, and why that will be slow to change. The take-home points were: (1) researchers don’t have time to be experts on statistics and their research topic (see above); (2) people tend to teach what they know, modules have to fit in with other modules/faculty expertise, so deviating from decades of established research/teaching practice is difficult; (3) as long as reviewers/journals don’t update their own statistics knowledge we’re stuck between a rock and a hard place if we start deviating from the norm; (4) textbooks tend to toe a conservative line (ahem!).

A problem I didn’t mention in that blog is that some teachers don’t themselves understand what they’re teaching. Haller and Krauss (2002) (and Oakes (1986) before them) showed that 80% of methodology instructors and 90% of active researchers held at least one misconception about the p value. Similarly, a study by Belia et al. (2005)[8] showed that researchers have difficulty interpreting confidence intervals. So, poor education perpetuates poor education. Of course, we need to try to improve training for future generations of researchers, but for those for whom it’s too late, open and constructive critique offers a way to help them not keep making the same mistakes. However, critique needs to be more ‘there are a multitude of reasons why you probably did it your way, but let me show you an alternative’ and a bit less ‘you are stupid for not using Bayes’.

In the long term though, improving our statistical literacy/training will result in better-informed reviewers and editors in the future. Bad practice will wither as the ‘norms’ progress beyond the recipe book.


Incentive structures


Another reason why people might ‘in good faith’ make poor data-based decisions is because the incentive structures in academia are completely screwy: individuals are incentivised, good science is not. Promotions are based on publications and grants, grants are based (to some extent) on likely success and track record (which of course is indexed by publications), and publications are - as is well known - hugely skewed towards significant results. Of course, academics are supposed to be great teachers, engage with the community and all that, but ask anyone in an academic job what matters when it comes to promotion and it’ll be grants and publications[9].

Scientists are rewarded for publishable results, and publishable results invariably mean significant results. Mix this with poor training (i.e. a lack of awareness of things like p-hacking) and you can see how easily (even with the best intentions) researcher degrees of freedom can filter into data analysis. This is why registered reports are such a brilliant idea: they do a decent job of incentivising ideas/methods above the results, and they offer an opportunity to correct well-intentioned but poor data-analysis practice before data are collected and analysed.
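
To see how cheaply a ‘publishable’ result can emerge from this sort of well-intentioned flexibility, here is a quick simulation (mine, not from any paper) of a single researcher degree of freedom: measure three outcome variables when there is no true effect at all, and report whichever one comes out significant.

  # One researcher degree of freedom: three outcome measures, two groups of 30,
  # and no true effect anywhere
  set.seed(666)
  hacked <- replicate(10000, {
    p_vals <- replicate(3, t.test(rnorm(30), rnorm(30))$p.value)
    min(p_vals) < .05   # 'significant' if any of the three outcomes is
  })
  mean(hacked)          # long-run false positive rate is ~.14, not the nominal .05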

I actually think incentive structures in academia need a massive overhaul to put science as the priority, but that’s a whole other stream of consciousness …


Rant over


This has ended up as a much more directionless rant than I planned, and it’s now time to go and get my 2-year-old from nursery so I need to wrap up. I think my main point would be that open critique of science is essential, not because people are dishonest and we need to flush out that dishonesty, but because many scientists are doing the best they can, using what they’ve been taught. In many cases, they won’t even realise the mistakes they’re making; public conversation can help them, but it should be in the spirit of improvement. Second, let’s change the incentive structures in science away from the individual and towards the collective. Finally, everyone should practice open science because it’s awesome.



  1. I thought Tool fans would appreciate the title …  ↩
  2. In general, Gelman’s blogs are well worth reading if you’re interested in statistics.  ↩
  3. I did eventually get my 12 experiments published in the Netherlands Journal of Psychology in 2008, 10 years after completing my PhD, where it sank without trace.  ↩
  4. The exceptions are the daily spats between some frequentists and Bayesians who seem to thrive on being rude to each other.  ↩
  5. Should you ever need a case study then come up to me and slag off one of my recent textbooks (old editions are fair game, even I think they’re crap), I will probably cry.  ↩
  6. English readers can enjoy a translation of his autobiography thanks to Nick Brown.  ↩
  7. In my defence, I have over the years tried hard to lace my books with a healthy dose of critique of blindly following p value based decision rules, but even so …  ↩
  8. Belia, S., Fidler, F., Williams, J., & Cumming, G. (2005). Researchers misunderstand confidence intervals and standard error bars. Psychological Methods, 10, 389–396.  ↩
  9. In 2010 when I was promoted to professor I had to go through an interview process as the final formality. The research side of my CV was as you would expect to get a chair; however, unlike comparable applications I had a lot more teaching stuff including my textbooks and a National Teaching fellowship (I was one of only 4–5 people in the entire university to have one of those at the time). During my interview my teaching was not mentioned once - it was all about grants, research leadership etc.  ↩

Thursday, April 28, 2016

"If you're not doing something different, you're not doing anything at all."

Yesterday was the official launch of my new textbook An Adventure in Statistics: The Reality Enigma. Although a few ‘print to order’ copies are floating about, the ‘proper’ hi-res print copies won’t be available for a few more weeks, but I thought it was a good opportunity to blog something about the book and perhaps textbook writing more generally. I’m going to start by telling you something about the book. Then I will try to give you an idea of the timeline and some rough statistics that probably don’t do justice to the emotional and physical investment that goes into a textbook.

A history of ‘an adventure in statistics’

Visualization guru (and sculptor) Edward Tufte apparently has a small sign taped to his computer screen that says "If you're not doing something different, you're not doing anything at all." It’s a note that I don’t have taped to my monitor, but I probably should because I like ‘different’, and I strive for ‘different’ - not always in a good way.

In 2008 I was in Rotterdam updating my SPSS book (third edition) and like all of my books I had a long list of things from the previous edition that I hated and wanted to change.  It would be easy to just change the SPSS screenshots and slap a new cover on the front, but I wanted to do something different. After all "If you're not doing something different, you're not doing anything at all."

I thought it would be interesting to try to embed the academic content of the book within a fictional story. I didn’t have a story though, and I had only 6 months to update the book. It would be impossible. So I copped out: I book-ended each chapter with an anecdote from the only story I had to hand – my life. Some thought it was different, but to me it was a poor imitation of what I could have done.

A couple of years later I was approached to write a stats book for the ‘for dummies’ range. I was tempted. I spoke to my editor at SAGE (who publish my statistics books) because of potential overlap with the SPSS book. This led to a conversation with Ziyad Marar, who runs the London office of SAGE. I’ve known Ziyad a long time – he signed my first book – but trust me, he rarely phones me. That’s true of most people because I go to great lengths to tell everyone how uncomfortable telephones make me, but a call from Ziyad is a particularly rare and beautiful thing. The gist of that conversation was that Ziyad convinced me to write a new book for SAGE instead. He said something to the effect of:

‘Why not write that book for us? We will let you do whatever you like - express yourself fully.’
“What?” I asked, “You’d give me complete control even after ‘the incident’?”
“Yes,” he replied (after what I like to mis-remember as a dramatic pause).

Ziyad was offering me the opportunity to set my imagination free, to go somewhere that perhaps other publishers would not let me go, to try something without needing to justify it with research, or pedagogy. An opportunity to follow my heart and not my head, but what did my heart want to do? I briefly considered whether it was possible to put even more penis jokes into a statistics textbook, but I’d been there, done that, worn the phallus and "If you're not doing something different, you're not doing anything at all."

I thought back to 2008, to the idea of writing a fictional story through which a student learns statistics through a shared adventure with the main character. I thought about collaborating with a graphic novel illustrator to bring the story to life. I didn’t know anything about writing fiction, but then I didn’t know anything about logistic regression and multilevel models before I wrote 60-page chapters about them. Not knowing something should never be an obstacle to writing about it.

I got on board a badass illustrator, James Iles, to create graphic novel strips to bring the story to life. There have been a few pivotal moments during the book’s life but none more than the moment that James replied to my advert on freelancer.com. He fought off about 80 other applicants to get the gig, and although I deluded myself that the choice of illustrator was a complex, make-or-break decision, my gut instinct always pointed to James. He’d done storyboarding for Doctor Who, and I fucking love Doctor Who. If James was good enough for Doctor Who, he was certainly going to be good enough for me. Unknown to me at the time, I hadn’t just found an exceptionally talented artist, but I’d also found someone who would put as much passion and care into the book as I would.

What is ‘an adventure in statistics’ all about?

An adventure in statistics is set in a future in which the invention of the reality prism, a kind of hat that splits reality into the subjective and objective, has brought society to collapse by showing everyone the truth. Without blind belief, no-one tried anymore. In the wake of this ‘reality revolution’ society fragmented into people who held onto the pre-technological past (the Clocktorians) and those who embraced the ever-accelerating technology of the new world (the chippers). Society had become a mix of the ultra-modern and the old fashioned.

Into this world I put Zach, a rock musician, and his girlfriend Dr. Alice Nightingale. They are part of the first generation since the revolution to believe that they can change the world. Zach through his music, and Alice through her research. Then Alice suddenly disappears leaving Zach with a broken heart, a song playing on repeat and a scientific report that makes no sense to him. Fearing the worst, he sets out to find her. Strange things happen: people collapse and lose their memories, he gets messages from someone called Milton, and the word JIG:SAW haunts him. Zach feels that something is terribly wrong and that Alice is in danger, but her vanishing triggers an even worse thought: that after 10 years they have drifted apart.

At a simple level ‘an adventure in statistics’ is a story about Zach searching for Alice, and seeking the truth, but it’s also about the unlikely friendship he develops with a sarcastic cat, it’s about him facing his fear of science and numbers, it’s about him learning to believe in himself. It’s a story about love, about not forgetting who you are. It’s about searching for the heartbeats that hide in the gaps between you and the people you love. It’s about having faith in others.

Of course, it’s also about fitting models, robust methods, classical and Bayesian estimation, significance testing and whole bunch of other tedious statistical things, but hopefully you’ll be so engrossed in the story that you won’t notice them. Or they might be a welcome relief from the terrible fiction. Time will tell.

What goes into creating a textbook?

What does writing a textbook involve? That is hard to explain. For an adventure in statistics I really enjoyed the writing (especially the fictional story), on the whole it has been the most rewarding experience of my academic career. However, rest assured that if you decide to write a textbook, you will hit some motivational dark times. Very dark times.

The timeline

I had the initial idea in 2008, I wrote the proposal in January 2011 (final version March 2011). The final contract with SAGE was signed in April 2011. Around this time, I started discussing with SAGE my idea to have graphic novel elements and a story. I started making notes about a potential story and characters in a black book and using Scrivener. I started writing in January 2014. By this point James Iles had just come on board.  (SAGE are doing some videos where James and I discuss how we worked, so I won’t say more on that.) At the point that I started writing I had a lot of ideas, most of the characters in place and a decent idea of what would happen at the beginning and end of the story, and some bits in the middle.  A lot of the story developed as I wrote. (One thing I learned in writing this book is that even though I thought I'd done a lot of planning, I should have done an awful lot more before writing the first word!) June 2014 my wife and I had our first child. I took unpaid paternity leave and did quite a long stretch of writing (4 months) where I’d spend the day doing dad stuff until about 3-4pm and then start work, writing until 1-3am. I generally work better at night. The first draft was finished around April 2015. We had feedback from a fiction editor (Gillian Stern) on the story which came to me May 2015. I did a re-draft of the entire book based on that, which I finished around August 2015. I then had a bunch more feedback on the story from Robin, my development editor at SAGE, and on the statistics stuff and story from my wife. I did a third and final draft which was submitted October 2015. January 2016 I received the copy editor’s comments for the entire book for approval (or not). March 2016 I received proofs of the entire book, which I spent 2-3 weeks reading/correcting working well into the night most nights. April 2016 I received the corrected proofs to approve. In a sense then, it’s consumed 8 years of my life (as an ambition), but really it’s more like 4 years of work, 2 of them intensive.

The anatomy of ‘an adventure in statistics’


  • I don’t know exactly how many hours I spent on the book, but I spent probably 2 years casually collecting ideas and thoughts, and developing ideas for the structure and so on. I spent another 21 months pretty much doing not a lot else but writing or re-drafting the book. I had my university job to do as well, so it’s impossible to really know how many hours it took to create, but it’s probably somewhere in the region of 4000 hours. That’s just to the point of submitting the manuscript.
  • I wrote 297,676 words, ~1.6 million characters, 13,421 paragraphs and 28,768 lines. In terms of word length that’s about 3-4 psychology PhD theses, or if you assume the average research paper is about 5000 words then it’s about 60 research papers. In 2 years. I will get precisely no credit in the REF for this activity. [I’m not saying I should, I’m just making the point that you really are putting your research career on hold and investing a lot of creativity/energy into something that isn’t valued by the system that universities value. I am fortunate to be able to do this but I think this is a really tough balancing act for early-career scientists who want to write books.]
  • Given the book had three drafts, and I had to read proofs, I have read at least 1.19 million of my own words. It’ll be a lot more than that because of stuff you write and then delete.
  • I used Scrivener to plan the story. My project document in which I gathered ideas (e.g., plot ideas, character outlines, descriptions of locations, venues, objects, concepts, artwork ideas etc.) contains another 87,204 words and quite a few images – in addition to the 297,676 words in the book itself.
  • I created 603 diagrams. [Not all of them are in the book because this includes old versions of diagrams, and image files that I used in diagrams – for example, an image of a normal curve that I drop into a finished diagram]. I used Omnigraffle incidentally for my diagrams, and graphs and stat-y stuff would have been created using R, most often with ggplot2.
  • I created 185 data-related files (data files, R-scripts etc.)
  • I wrote ~4000 lines of R-code (to generate data, create graphs, run analyses etc.).
  • At some point I will have to produce a bunch of online stuff – powerpoint presentations, handouts, answers to tasks in the book etc.
  • Basically, it was a lot of fucking work.

The beginning and the end

James Iles (Right) and I (Left) at the book launch
Yesterday the book was launched: it is both a beginning and an end. Beginnings can be exciting. It is the beginning of the public life of ‘an adventure in statistics’. It might be the beginning of it being a well-received book? The beginning of it inspiring young scientists? The beginning of people thinking differently about teaching statistics? That’d be nice but my excitement is laced with fear because beginnings can be scary too: today could be the beginning of seeing the book through a reality prism that shows me the objective truth in the form of scathing reviews, poor sales, sympathetic looks, and five wasted years.

Yesterday was also an end. Primarily an end to my work on the book (well, apart from a bunch of online materials …). I have never liked endings. When I was a child and people would come to stay, I always felt hollow when they left. For over 2 years, the characters in this book – especially Zach, Alice, Milton and Celia – have been the houseguests of my mind. We’ve had a lot of fun times. We’ve worked hard and played hard. We’ve had lots of late night conversations, we’ve shared our deepest feelings, we’ve discussed life, and they’ve helped me to see the world through different eyes. Yesterday they left me to find their own way in the world and I’m going to miss them. I feel a little hollow. I never thought I’d miss writing a statistics textbook.

It’s a scary time. I am proud of and excited about the book, and of what James and I have created. I'm also a little terrified that no-one else will share my enthusiasm - after all, it’s different to other statistics textbooks. People don’t always like ‘different’. Tufte’s words are a comfort though because if it’s true that “If you're not doing something different, you're not doing anything at all." then I feel that, with ‘an adventure in statistics’ I have at least done something.

Andy
[Some of this blog is adapted from a speech I gave at the launch, which you can watch here]

Links

Download the preface and chapter 1.
Read the article in the Times Higher Education Supplement about the book.
The book can be ordered direct from SAGE, or from your local Amazon or other retailer.

Tuesday, March 29, 2016

Max or No Max?

There's been a recent spat between the heavy metal bands Sepultura and Soulfly. For those unaware of the history, 50% of Sepultura used to be the Cavalera brothers (Max and Igor) until Max (the frontman and guitarist) left the band in 1996 and formed Soulfly. The full story is here. There's a lot of bad blood even 20 years later, and according to a recent story on metal sucks, Soulfly's manager (and Max's wife) Gloria Cavalera recently posted a fairly pointed post on her Facebook page. This got picked up by my favourite podcast (the metal sucks podcast). What has this got to do with me, or statistics? Well, one of the presenters of the metal sucks podcast asked me this over Twitter:



After a very brief comment about needing to operationalise 'better', I decided that rather than reading book proofs I'd do a tongue-in-cheek analysis of what is better: Max or no Max. Here it is.

First we need to operationalise 'better'. I have done this by accepting subjective opinion as determining 'better', and specifically ratings of albums on amazon.com (although I am English, metal sucks is US based, so I thought I'd pander to them and take ratings from the US site). Our question then becomes 'is Max or no Max rated higher by the sorts of people who leave reviews on Amazon?'. We have operationalised our question and turned it into a scientific statement, which we can test with data. [There are all sorts of problems with using these ratings, not least of which is that they tend to be positively biased, they likely reflect a certain type of person who reviews, reviews often reflect things other than the music (e.g., arrived quickly 5*), and so on ... but fuck it, this is not serious science, just a bit of a laugh.]

Post Sepultura: Max or No Max

The first question is whether post-max Sepultura or Soulfly are rated higher. Figure 1 shows that the data are hideously skewed with people tending to give positive reviews and 4-5* ratings. Figure 2 shows the mean ratings by year post Max's departure (note they released albums in different years so the dots are out of synch, but it's a useful timeline). Figure 2 seems to suggest that after the first couple of albums, both bands are rated fairly similarly: the Soulfly line is higher but error bars overlap a lot for all but the first albums.

Figure 1: Histograms of all ratings for Soulfly and (Non-Max Era) Sepultura


Figure 2: Mean ratings of Soulfly and (Non-Max Era) Sepultura by year of album release







There are a lot of ways you could look at these data. The first thing to note is the skew. That messes up estimates of confidence intervals and significance tests ... but our sample is likely big enough that we can rely on the central limit theorem to do its magic and let us assume that the sampling distribution is normal (beautifully explained in my new book!).
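
If you want to convince yourself of that, here's a rough sketch (simulated 1-5 star ratings with made-up proportions, not the actual Amazon data) showing that even with ratings this skewed, the sampling distribution of the mean looks pretty normal for samples of a few hundred reviews.

  # Simulated 1-5 star ratings with a heavy negative skew (made-up proportions)
  set.seed(1996)
  rating_probs <- c(.05, .05, .10, .20, .60)
  sample_means <- replicate(5000, mean(sample(1:5, 300, replace = TRUE, prob = rating_probs)))

  hist(sample_means, breaks = 50, main = "Sampling distribution of the mean")
  qqnorm(sample_means); qqline(sample_means)   # close to a straight line = close to normal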

I'm going to fit three models. The first is an intercept-only model (a baseline with no predictors), the second allows intercepts to vary across albums (which allows ratings to vary by album, which seems like a sensible thing to do because albums will vary in quality), and the third predicts ratings from the band (Sepultura vs Soulfly).

  library(nlme)   # for gls() and lme()
  maxModel2a<-gls(Rating ~ 1, data = sepvssoul, method = "ML")
  maxModel2b<-lme(Rating ~ 1, random = ~1|Album, data = sepvssoul, method = "ML")
  maxModel2c<-update(maxModel2b, .~. + Band)
  anova(maxModel2a, maxModel2b, maxModel2c)

By comparing models we can see:

           Model df      AIC      BIC    logLik   Test  L.Ratio p-value
maxModel2a     1  2 2889.412 2899.013 -1442.706                        
maxModel2b     2  3 2853.747 2868.148 -1423.873 1 vs 2 37.66536  <.0001
maxModel2c     3  4 2854.309 2873.510 -1423.155 2 vs 3  1.43806  0.2305

We can see that album ratings varied significantly (not surprising; the p-value is < .0001), but that band did not significantly predict ratings overall (p = .231). If you like you can look at the summary of the model by executing:
  
  summary(maxModel2c)

Which gives us this output:

Linear mixed-effects model fit by maximum likelihood
 Data: sepvssoul 
       AIC     BIC    logLik
  2854.309 2873.51 -1423.154

Random effects:
 Formula: ~1 | Album
        (Intercept) Residual
StdDev:   0.2705842 1.166457

Fixed effects: Rating ~ Band 
               Value Std.Error  DF   t-value p-value
(Intercept) 4.078740 0.1311196 882 31.107015  0.0000
BandSoulfly 0.204237 0.1650047  14  1.237765  0.2362
 Correlation: 
            (Intr)
BandSoulfly -0.795

Standardized Within-Group Residuals:
       Min         Q1        Med         Q3        Max 
-2.9717684 -0.3367275  0.4998698  0.6230082  1.2686186 

Number of Observations: 898
Number of Groups: 16 

The difference in ratings between Sepultura and Soulfly was b = 0.20. Ratings for Soulfly were higher, but not significantly so (if we allow ratings to vary over albums; if you take that random effect out you'll get a very different picture because that variability will go into the fixed effect of 'band').
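
If you want to see that for yourself, you can fit the same fixed effect without the random intercept for album (this reuses the sepvssoul data built in the R code at the end of this post): the album-to-album variability then has nowhere to go except the band effect and the residual, and the picture for 'band' changes accordingly.

  # Same fixed effect, but no random intercept for Album
  maxModel2d<-gls(Rating ~ Band, data = sepvssoul, method = "ML")
  summary(maxModel2d)
  # Compare the BandSoulfly estimate and p-value with maxModel2c above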

Max or No Max

Just because this isn't fun enough, we could also just look at whether either Sepultura (post 1996) or Soulfly can compete with the Max-era-Sepultura heyday.

I'm going to fit three models but this time including the early Sepultura albums (with max). The models are the same as before except that the fixed effect of band now has three levels: Sepultura Max, Sepultura no-Max and Soulfly:

  maxModela<-gls(Rating ~ 1, data = maxvsnomax, method = "ML")
  maxModelb<-lme(Rating ~ 1, random = ~1|Album, data = maxvsnomax, method = "ML")
  maxModelc<-update(maxModelb, .~. + Band)
  anova(maxModela, maxModelb, maxModelc)

By comparing models we can see:

          Model df      AIC      BIC    logLik   Test   L.Ratio p-value
maxModela     1  2 4686.930 4697.601 -2341.465                         
maxModelb     2  3 4583.966 4599.973 -2288.983 1 vs 2 104.96454  <.0001
maxModelc     3  5 4581.436 4608.114 -2285.718 2 vs 3   6.52947  0.0382

We can see that album ratings varied significantly (not surprising; the p-value is < .0001), and that band did significantly predict ratings overall (p = .038). If you like you can look at the summary of the model by executing:
  
  summary(maxModelc)

Which gives us this output:

Linear mixed-effects model fit by maximum likelihood
 Data: maxvsnomax 
       AIC      BIC    logLik
  4581.436 4608.114 -2285.718

Random effects:
 Formula: ~1 | Album
        (Intercept) Residual
StdDev:     0.25458 1.062036

Fixed effects: Rating ~ Band 
                         Value Std.Error   DF  t-value p-value
(Intercept)           4.545918 0.1136968 1512 39.98281  0.0000
BandSepultura No Max -0.465626 0.1669412   19 -2.78916  0.0117
BandSoulfly          -0.262609 0.1471749   19 -1.78433  0.0903
 Correlation: 
                     (Intr) BndSNM
BandSepultura No Max -0.681       
BandSoulfly          -0.773  0.526

Standardized Within-Group Residuals:
       Min         Q1        Med         Q3        Max 
-3.3954974 -0.3147123  0.3708523  0.6268751  1.3987442 

Number of Observations: 1534
Number of Groups: 22  

The difference in ratings between Sepultura without Max compared to with him was b = -0.47 and significant at p = .012 (ratings for post-Max Sepultura are significantly worse than for Max-era Sepultura). The difference in ratings between Soulfly compared to Max-era Sepultura was b = -0.26 and not significant (p = .09) (ratings for Soulfly are not significantly worse than for Max-era Sepultura). A couple of points here: p-values are silly, so don't read too much into them, but the parameter (the b) that quantifies the effect is a bit smaller for Soulfly.

Confidence Intervals

Interestingly if you write yourself a little bootstrap routine to get some robust confidence intervals around the parameters:
  
  library(boot)   # for boot() and boot.ci()

  boot.lme <- function(data, indices){
    data <- data[indices,] # select obs. in bootstrap sample
    model <- lme(Rating ~ Band, random = ~1|Album, data = data, method = "ML")
    fixef(model) # return coefficient vector
  }

  maxModel.boot<-boot(maxvsnomax, boot.lme, 1000)
  maxModel.boot
  boot.ci(maxModel.boot, index = 1, type = "perc")
  boot.ci(maxModel.boot, index = 2, type = "perc")
  boot.ci(maxModel.boot, index = 3, type = "perc")


Then you find these confidence intervals for the three betas (intercept, Post-Max Sepultura vs. Max Era-Sepultura, Soulfly vs. Max-Era-Sepultura):

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = maxModel.boot, type = "perc", index = 1)

Intervals : 
Level     Percentile     
95%   ( 4.468,  4.620 )  
Calculations and Intervals on Original Scale

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = maxModel.boot, type = "perc", index = 2)

Intervals : 
Level     Percentile     
95%   (-0.6153, -0.3100 )  
Calculations and Intervals on Original Scale

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = maxModel.boot, type = "perc", index = 3)

Intervals : 
Level     Percentile     
95%   (-0.3861, -0.1503 )  
Calculations and Intervals on Original Scale

The difference in ratings between Sepultura without Max compared to with him was b = -0.47 [-0.62, -0.31]. The difference in ratings between Soulfly compared to Max-era Sepultura was b = -0.26 [-0.39, -0.15]. This suggests that both Soulfly and post-Max Sepultura yield negative parameters that reflect (to the degree that you believe that a confidence interval tells you about the population parameter ....) a negative effect in the population. In other words, both bands are rated worse than Max-era Sepultura.

Summary

Look, this is just a bit of fun and an excuse to show you how to use a bootstrap on a multilevel model, and how you can use data to try to answer pointless questions thrown at you on Twitter. Based on this hastily thrown together analysis that makes a lot of assumptions about a lot of things, my 120 character twitter response will be: Sepultura Max better than everything, but post 1996 Max is no better than No Max;-)

R Code

Data

You can generate the data using this R code (data correct as of today):

morbid<-c(rep(1,2), rep(2, 4), rep(3, 3), rep(4, 8), rep(5, 36))
  Schizo<-c(rep(1,1), rep(2, 2), rep(3, 4), rep(4, 10), rep(5, 33))
  remains<-c(2, rep(3, 5), rep(4, 9), rep(5, 104))
  Arise<-c(rep(2, 2), rep(4, 16), rep(5, 89))
  Chaos<-c(rep(1,4), rep(2, 2), rep(3, 9), rep(4, 20), rep(5, 120))
  Roots<-c(rep(1,9), rep(2, 8), rep(3, 17), rep(4, 24), rep(5, 94))
  Against<-c(rep(1,16), rep(2, 14), rep(3, 11), rep(4, 20), rep(5, 32))
  Nation<-c(rep(1,3), rep(2, 7), rep(3, 6), rep(4, 22), rep(5, 19))
  Roorback<-c(rep(1,6), rep(2, 6), rep(3, 5), rep(4, 13), rep(5, 20))
  Dante<-c(rep(1,1), rep(2, 3), rep(3, 4), rep(4, 8), rep(5, 30))
  Alex<-c(rep(1,1), rep(2, 1), rep(3, 3), rep(4, 6), rep(5, 18))
  Kairos<-c(rep(1,3), rep(2, 2), rep(3, 2), rep(4, 6), rep(5, 33))
  Mediator<- c(rep(1,0), rep(2, 3), rep(3, 4), rep(4, 6), rep(5, 21))
  
  
  morbid<-data.frame(rep("Morbid", length(morbid)), rep(1986, length(morbid)), morbid)
  Schizo<-data.frame(rep("Schizo", length(Schizo)), rep(1987, length(Schizo)), Schizo)
  Remains<-data.frame(rep("remains", length(remains)), rep(1989, length(remains)), remains)
  Arise<-data.frame(rep("Arise", length(Arise)), rep(1991, length(Arise)), Arise)
  Chaos<-data.frame(rep("Chaos", length(Chaos)), rep(1993, length(Chaos)), Chaos)
  Roots<-data.frame(rep("Roots", length(Roots)), rep(1996, length(Roots)), Roots)
  Against<-data.frame(rep("Against", length(Against)), rep(1998, length(Against)), Against)
  Nation<-data.frame(rep("Nation", length(Nation)), rep(2001, length(Nation)), Nation)
  Roorback<-data.frame(rep("Roorback", length(Roorback)), rep(2003, length(Roorback)), Roorback)
  Dante<-data.frame(rep("Dante", length(Dante)), rep(2006, length(Dante)), Dante)
  Alex<-data.frame(rep("Alex", length(Alex)), rep(2009, length(Alex)), Alex)
  Kairos<-data.frame(rep("Kairos", length(Kairos)), rep(2011, length(Kairos)), Kairos)
  Mediator<-data.frame(rep("Mediator", length(Mediator)), rep(2013, length(Mediator)), Mediator)
  
  
  names(morbid)<-c("Album", "Year", "Rating")
  names(Schizo)<-c("Album", "Year", "Rating")
  names(Remains)<-c("Album", "Year", "Rating")
  names(Arise)<-c("Album", "Year", "Rating")
  names(Chaos)<-c("Album", "Year", "Rating")
  names(Roots)<-c("Album", "Year", "Rating")
  names(Against)<-c("Album", "Year", "Rating")
  names(Nation)<-c("Album", "Year", "Rating")
  names(Dante)<-c("Album", "Year", "Rating")
  names(Alex)<-c("Album", "Year", "Rating")
  names(Kairos)<-c("Album", "Year", "Rating")
  names(Mediator)<-c("Album", "Year", "Rating")

  SepMax<-rbind(morbid, Schizo, Remains, Arise, Chaos, Roots)
  SepMax$Band<-"Sepultura Max"
  SepMax$Max<-"Max"
  
  # Note: the Roorback data are created above but not combined into the post-Max data here
  SepNoMax<-rbind(Against, Nation, Dante, Alex, Kairos, Mediator)
  SepNoMax$Band<-"Sepultura No Max"
  SepNoMax$Max<-"No Max"
  
  
  
  soulfly<-c(rep(1,8), rep(2, 9), rep(3, 4), rep(4, 16), rep(5, 89))
  primitive<-c(rep(1,11), rep(2, 5), rep(3, 5), rep(4, 19), rep(5, 53))
  three<-c(rep(1,1), rep(2, 10), rep(3, 12), rep(4, 7), rep(5, 19))
  prophecy<-c(rep(1,2), rep(2, 5), rep(3, 5), rep(4, 25), rep(5, 42))
  darkages<-c(rep(1,1), rep(2, 1), rep(3, 5), rep(4, 18), rep(5, 36))
  conquer<-c(rep(1,1), rep(2, 0), rep(3, 5), rep(4, 5), rep(5, 31))
  omen<-c(rep(1,0), rep(2, 2), rep(3, 1), rep(4, 6), rep(5, 17))
  enslaved<-c(rep(1,1), rep(2,1), rep(3, 4), rep(4, 2), rep(5, 30))
  savages<-c(rep(1,0), rep(2, 2), rep(3, 3), rep(4, 10), rep(5, 27))
  archangel<-c(rep(1,3), rep(2, 2), rep(3, 4), rep(4, 7), rep(5, 21))

  
  soulfly<-data.frame(rep("Soulfly", length(soulfly)), rep(1998, length(soulfly)), soulfly)
  primitive<-data.frame(rep("Primitive", length(primitive)), rep(2000, length(primitive)), primitive)
  three<-data.frame(rep("Three", length(three)), rep(2002, length(three)), three)
  prophecy<-data.frame(rep("Prophecy", length(prophecy)), rep(2004, length(prophecy)), prophecy)
  darkages<-data.frame(rep("Darkages", length(darkages)), rep(2005, length(darkages)), darkages)
  conquer<-data.frame(rep("Conquer", length(conquer)), rep(2008, length(conquer)), conquer)
  omen<-data.frame(rep("Omen", length(omen)), rep(2010, length(omen)), omen)
  enslaved<-data.frame(rep("Enslaved", length(enslaved)), rep(2012, length(enslaved)), enslaved)
  savages<-data.frame(rep("Savages", length(savages)), rep(2013, length(savages)), savages)
  archangel<-data.frame(rep("Archangel", length(archangel)), rep(2015, length(archangel)), archangel)
  
  names(soulfly)<-c("Album", "Year", "Rating")
  names(primitive)<-c("Album", "Year", "Rating")
  names(three)<-c("Album", "Year", "Rating")
  names(prophecy)<-c("Album", "Year", "Rating")
  names(darkages)<-c("Album", "Year", "Rating")
  names(conquer)<-c("Album", "Year", "Rating")
  names(omen)<-c("Album", "Year", "Rating")
  names(enslaved)<-c("Album", "Year", "Rating")
  names(savages)<-c("Album", "Year", "Rating")
  names(archangel)<-c("Album", "Year", "Rating")

  
  
  Soulfly<-rbind(soulfly, primitive, three, prophecy, darkages, conquer, omen, enslaved, savages, archangel)
  Soulfly$Band<-"Soulfly"
  Soulfly$Max<-"Max"
  
  
  maxvsnomax<-rbind(SepMax, SepNoMax, Soulfly)
  maxvsnomax$Band<-factor(maxvsnomax$Band)
  maxvsnomax$Max<-factor(maxvsnomax$Max)
  maxvsnomax$Album<-factor(maxvsnomax$Album)
  
  sepvssoul<-subset(maxvsnomax, Band != "Sepultura Max")
  sepvssoul$Band<-factor(sepvssoul$Band)
  sepvssoul$Album<-factor(sepvssoul$Album)

Graphs in this post

library(ggplot2)

m <- ggplot(sepvssoul, aes(x = Rating, fill = Band, colour = Band)) 
  m + geom_histogram(binwidth = 1, alpha = .2, position="identity") + coord_cartesian(xlim = c(0, 6)) + scale_y_continuous(breaks = seq(0, 400, 50)) + labs(x = "Rating", y = "Frequency", colour = "Band", fill = "Band") + theme_bw()
  
  
  line <- ggplot(sepvssoul,  aes(Year, Rating, colour = Band))
  line + stat_summary(fun.y = mean, geom = "point") + stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width = 0.2) + labs(x = "Year", y = "Mean Rating on Amazon.com") + coord_cartesian(ylim = c(1, 5)) + scale_x_continuous(breaks=seq(1998, 2015, 1))+ theme_bw() + stat_summary(fun.y = mean, geom = "line", aes(group=Band))

Wednesday, February 17, 2016

To retract or to not retract, that is the question ...

Amazingly I haven't written a blog since September last year, it's almost as though I have better things to do (or have no ideas about topics ... or have nothing interesting to say). It's more the lack of ideas/interesting things to say, oh, and other people are so much better at statistics blogging than I am (see Dan Lakens for example). Still, twitter provided me with inspiration this morning as reports filtered through of the latest in a long line of Psychological Science retractions. This particular one is for an article by Fisher et al. (2015) in which they showed (according to retraction watch) that 'women prefer to wear makeup when there is more testosterone present in their saliva'.  The full details of the retraction are also helpfully described by retraction watch if you're interested. The main reason for the retraction though as described by the authors was as follows (see here):

"Our article reported linear mixed models showing interactive effects of testosterone level and perceived makeup attractiveness on women’s makeup preferences. These models did not include random slopes for the term perceived makeup attractiveness, and we have now learned that the Type 1 error rate can be inflated when by-subject random slopes are not included (Barr, Levy, Scheepers, & Tily, 2013). Because the interactions were not significant in reanalyses that addressed this issue, we are retracting this article from the journal."

The purpose of this blog is to explain why I believe (other things being equal) the authors should have published a correction and not retracted the article. Much of what I think isn't specific to this example, it just happens to have been triggered by it.

To retract ...

I assume that the authors' decision to retract is motivated by a desire to rid the world of false knowledge. By retracting, the original paper is removed from the universe thus reducing the risk of 'false knowledge' on this topic spreading. A correction would not minimise the risk that the original article was cited or followed up by other researchers unless it was sort of tagged onto the end of the paper. If a correction appears as a separate paper then it may well be overlooked. However, I think this is largely a pragmatic issue for the publishers to sort out: just make it impossible for someone to get the original paper without also getting the correction. Job done.

To not retract ...

If you read the full account of the retraction, the authors fitted a model, published the details of that model in the supplementary information with the paper and then posted their data the Open Science Framework for others to use. They have been completely transparent. Someone else re-analysed the data and included the aforementioned random slope, and alerted the authors to the differences in the model (notably this crucial interaction term). The authors retracted the paper. I would argue that a correction would be better for the following reasons.

Undeserved reputational damage

One of the things that really bugs me about science these days (especially psychology) is the witch-hunt-y-ness of it (yes, that's not a real word). Scientists happily going about their business with good intentions make bad decisions, and suddenly everyone is after their heads. This is evidenced by the editor feeling the need to make this comment in the retraction: "I would like to add an explicit statement that there is every indication that this retraction is entirely due to an honest mistake on the part of the authors." The editor is attempting damage limitation for the authors.

The trouble is that retractions come with baggage ranging from 'the scientists don't know what they're doing' (at best) to hints that they have deliberately misled everyone for their own gain. This baggage is unnecessary. Don't get me wrong, I've seen people do terrible things with data (in the vast majority of cases out of ignorance, not deceit) and I'm convinced that the incentive structures in academia are all wrong (quantity is valued over quality), but deep down I still like to believe that scientists care about science/knowledge. Given how open they have been with their analysis and data, these scientists strike me as being people who care about science. They are to be applauded for their openness, and not burdened with the baggage of retraction. A correction would have better reflected their honesty and integrity.

Retraction implies there is one correct way to model the data

Retracting the paper implies 'we did it wrong'. Did the authors analyse their data incorrectly though? Here's some food for thought. Raphael Silberzahn and colleagues published a paper in which they gave the same research question and the same data set to 29 research teams and examined how they addressed the question (there is an overview of the paper here, and the paper itself is available here). Essentially they found a lot of variability in what statistical models were applied to answer the question, including tobit regression, logistic regression (sometimes multilevel, sometimes not), Poisson regression (sometimes multilevel, sometimes not), Spearman's correlation, OLS regression, WLS regression, and Bayesian logistic regression (sometimes multilevel, sometimes not). You get the gist. The resulting odds ratios for the effect ranged from 0.89 to 2.93 (although all but 2 were > 1). The widths of the confidence intervals for these odds ratios also varied quite a lot. The positive thing was that if you look at Figure 1 in the paper, despite variation in the models applied, there was a fair bit of consensus in the odds ratios and confidence intervals produced (about half of the point estimates/CIs - the ones from team 26 to team 9 - line up pretty well despite the varying models applied). However, it goes to show that if you give a data set and a question to 29 research teams, they will analyse it differently. Is there one correct model? Are 28 teams wrong and 1 team correct? No, data analysis is always about decisions, and although there can be unequivocally wrong decisions, there is rarely only one correct decision.

So, Fisher and colleagues didn't include a random slope; someone else did. This change in model specification affected the model parameters and p-values. Is the inclusion of the random slope any more correct than its exclusion? That's somewhat a matter of opinion. Of course, its exclusion could have led to a Type I error (if you fixate on p-values), but the more interesting question is why it changes the model, how it changes it and what the implications are moving forward. The message (for me) from the Silberzahn paper is that if any of us let other scientists loose with our data, they would probably do different things with it that would affect the model parameters and p-values. Just as has happened here. The logic of this particular retraction is that every scientist should retract every paper they have ever published on the grounds that there were probably other models that could have been fit, and if they had been then the parameter estimates in the paper would be different. A correction (rather than retraction) would have allowed readers and researchers in this field to consider the findings in the light of the difference that the random slope makes to the model.
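
For readers who want to see what that change in specification looks like, here is a sketch using nlme with hypothetical variable and data frame names (this is not the authors' code or data): the original model had random intercepts only; the re-analysis added a by-subject random slope for perceived attractiveness.

  library(nlme)
  # 'makeup' is a hypothetical data frame standing in for the real one
  # Random intercepts only - the kind of model originally reported
  m_intercept <- lme(Preference ~ Testosterone*Attractiveness,
                     random = ~1|Subject, data = makeup, method = "ML")
  # Add a by-subject random slope for Attractiveness - the re-analysis
  m_slope <- lme(Preference ~ Testosterone*Attractiveness,
                 random = ~Attractiveness|Subject, data = makeup, method = "ML")
  anova(m_intercept, m_slope)   # does the random slope improve the fit?
  summary(m_slope)              # and what happens to the interaction term?

A correction could have reported exactly this kind of comparison, so that readers could see how much the interaction parameter (and not just its p-value) actually moved.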

Retraction devalues the original study

Here's how science works. People generate theories, they transform them into testable hypotheses, they collect data, they evaluate the evidence for their hypothesis. Then other people get interested in the same theory and collect more data and this adds to the collective knowledge on the topic. Sometimes the new data contradicts the old data, in which case people update their beliefs. We do not, however, retract all of the old papers because this new one has thrown up different evidence. That would be silly, and yet I think that is all that has happened here with Fisher et al.'s paper. They fitted one model to the data, drew some conclusions, then someone else moved forward with the data and found something different. Retraction implies that the original study was of no value whatsoever and must be hidden away never to be seen. Regardless of how you analyse the data, if the study was methodologically sound (I don't know if it was - I can't access it because it's been retracted) then it adds value to the research question irrespective of the significance of an interaction in the model. A retraction removes this knowledge from the world; it becomes a file drawer paper rather than information that is out in the open. We are deprived of the evidence within the paper (including how that evidence changes depending on what model you fit to the data). A correction allows this evidence to remain public, and better still updates that evidence in the light of new analysis in useful ways ...

Retraction provides the least useful information about the research question

By retracting this study we are none the wiser about the hypothesis. All we know is that a p-value that was below .05 flipped to the other side of that arbitrary threshold when a random slope was included in the model. It could have changed from .049 to .051 for all we know, in which case the associated parameter has most likely not changed much at all. It might have changed from .00000000001 to .99, in which case the impact has been more dramatic. A retraction deprives us of this information. In a correction, the authors could present the new model, its parameters and confidence intervals (incidentally, on that topic I recommend Richard Morey's recent paper) and we could see how things have changed as a result of including the random slope. A correction provides us with specific and detailed evidence with which we can update our beliefs from the original paper. A correction allows the reader to determine what they believe. A retraction provides minimal and (I'd argue) unhelpful information about how the model changed, and about how to update our beliefs about the research question. All we are left with is to throw the baby out with the bathwater and pretend, like the authors, that the study never happened. If the methods were sound, then the study is valuable, and the new analysis is not damning but simply sheds new light on the hypotheses being tested. A retraction tells us little of any use.

Where to go ...

What this example highlights to me is how science needs to change, and how the publication process also needs to change. Science moves forward through debate, through people challenging ideas, and this is a good example. If the paper were in a PLoS style journal that encourages debate/comment then a retraction would not have been necessary. Instead, the models and conclusions could have been updated for all to see, the authors could have updated their conclusions based on these new analyses, and knowledge on the topic would have advanced. You'd end up with a healthy debate instead of evidence being buried. One of the challenges of open science and the OSF is to convince scientists that by making their data public they are not going to end up in these situations where they are being witch hunted, or pressured into retractions. Instead, we need to embrace systems that allow us to present different angles on the same data, to debate conclusions, and to strive for truth by looking at the same data from different perspectives ... and for none of that to be perceived as a bad thing. Science will be better for it.

References

Fisher, C. I., Hahn, A. C., DeBruine, L. M., & Jones, B. C. (2015). Women’s preference for attractive makeup tracks changes in their salivary testosterone. Psychological Science, 26, 1958–1964. doi:10.1177/0956797615609900