11 November 2017

Don't stiff people who live from tips

I don't normally blog just because a tweet annoyed me (otherwise I'd be writing several dozen blog posts per day), but this tweet and its ensuing thread touched a nerve.
From his profile and recent tweets, the author seems like the kind of person with whom I probably share a very sizeable percentage of my political and social attitudes.  He follows me on Twitter; I would follow him back, except that I ration my follows simply to try to slow down the firehose of information.  In short, I'm sure he's a nice guy with progressive attitudes.  But it seems that he and some of his followers have a rather different attitude to tipping from mine.

Consider the situation.  You're(*) standing outside a restaurant at a US airport, looking at the menu.  The airline has messed up your connection, so maybe a nice meal will help you feel a little better.  You fancy half a dozen oysters (maybe $15) and then perhaps the steak (maybe $28) and a beer ($7), so that'll be $50 in total.  Plus you'll need to add $5 for tax and $10 for a 20% tip.  So that will cost a total of $65.  Can you afford that?  Yes?  OK, let's go.  Those of us who are lucky enough to be able to afford to eat in restaurants might make similar decisions many times per year (give or take the tax and tip calculations, depending on where we spend most of our time.)

Then the meal happens.  I encourage you to read the Twitter thread (it's quite short) to see what happened.  The situation is not entirely black and white, and the details of the author's experience are not especially relevant to my point here, but to sum up, he was not very impressed with the overall level of the service he received. That's OK; disappointment is a normal part of the human condition and experience, especially in consumer and travel situations.

After a while, the check arrives.  It's for $50 plus $5 tax, and it may or may not have "Thank you" written on it by hand, perhaps even with a little smiley face, because scientific research that was totally not underpowered or p-hacked in any way has shown that when female (but not male, as hypothesised in advance by the remarkably prescient authors of that study) servers do this, they get bigger tips.  Remember, you have already budgeted $10 for the tip, but because you were unhappy with the service, you are thinking twice about whether to give it.  For support in that decision, you go on Twitter to ask people how much of that amount they think you should give or withhold.  And, because Twitter solidarity with your friends while they're travelling is a genuinely rather nice thing about the 21st century, within just a few minutes you have several replies:
So, the consensus was that the author should tip 10%, instead of the now-conventional (for the US) 20%.  A couple of people even suggested that he tip 0%, but he settled on 10%.  Under the reasonable (I think) assumptions about his meal that I made earlier, that means he left the waitress about $5 instead of $10.

Had I been watching the proceedings during the 17 minutes between the first tweet and the verdict, my answer would have been: you should withhold nothing.  Tip the 20%.  Give the waitress the full $10 that you presumably budgeted for from the start.  After all, once you decided to walk into the restaurant it was basically a sunk cost anyway.  You can't know all of the reasons why the service was slow, and even if you could somehow establish that it was entirely her personal fault (rather than that of other staff, or the restaurant, none of whom will be affected by your tipping decision), it doesn't matter anyway. Within five minutes of leaving the restaurant the bad service you got will be forgotten forever (not least because you will shortly be waiting in line at the boarding gate, or back on the phone to American Airlines about your connection, dealing with people who do not have the incentive of possibly losing a tip to encourage them to give you better service, ha ha).

There seems to be a pervasive idea in certain parts of the world (mostly North America and the UK) that serving in a restaurant is like being one of those street entertainers who juggle things in front of the cars at red lights.  Indeed, something to reassure you that it's OK to think that way is usually written on the menu in some form: "We do not impose a service charge, as we believe that our customers have the right to reward good service personally".  Well, I've got news for all you people who like to imagine that you are normally skeptical of capitalism: that is pure marketing bullshit.  What it means is, "If our menu prices were 10/15/20% higher, people would be less likely to come inside.  So we make the posted prices lower in order to entice you in, at no risk to us, and we let the staff play a kind of roulette with their income based, essentially, on your mood".  (However, for parties of 6 or more, the restaurant has to add a service charge because waitstaff know that groups are terrible tippers and so would otherwise try to avoid having them seated in their area of the restaurant.)

Think about this: if you were eating at a restaurant in a country where the pricing structure of restaurants is such that tipping is not expected (in some cases, it might even be regarded as slightly offensive), you would probably not go to the manager and request, say, an 8% reduction in the bill because the service was a bit sloppy.  And a big part of the reason why you wouldn't do that is because it would require you to actively do something to justify your claim, versus the far simpler act of not placing the second $5 bill on the plate.  As a result of this (entirely natural) behaviour by customers, waitstaff in countries with a tip-based wage model are essentially incentivised to be both happy-looking and efficient, every minute of their working day.  That is, frankly, an inhuman requirement (try it in your office job for, say, fifteen minutes).

Actually, I can think of a certain kind of person who I would expect to stiff people in these circumstances.  The current archetype of this kind of person has strange blond hair and a fake tan and plays a lot of golf and mouths off a lot about how bad almost everyone else in the world is.  If we were to learn that he tweeted his buddies and told them how much he was going to stiff a waitress, we wouldn't bother spending the energy on rolling our eyes.  There are endless stories about how this individual didn't pay bills sent to him or his companies, because he decided he didn't like the service he received.

Don't be that person.  Don't, in effect, put working people on piecework rates ("$0.30 per smile") by deciding how much you will tip them based on how perfectly they do their not-particularly-desirable job, simply because the formal rules say that you legally can because the tip is optional.  Be a mensch, as I believe the expression goes.  Eat your oysters, add the going rate for the tip, pay the bill, get on your plane, and don't punish the waitress for working in a messed-up system that pits her against both you and her employer.  If you are the kind of person who can afford to dine on oysters at an airport restaurant prior to getting on a plane, then pretty much by definition $5 means more to the waitress than it does to you.

Here's my personal benchmark (your mileage may vary): I wouldn't withhold a tip unless the situation was sufficiently serious that I would be prepared to complain to the restaurant manager about it.  (For what it's worth, I have never been in a restaurant situation that was so bad that I felt the need to complain to the manager.)  If the server were to, say, cough violently into my food and then carry on in the hope that I didn't notice, then that's not a tipping matter.  But I don't like the idea of micro-managing the ups and downs of other people's workdays through small (to me) sums of money.  It just doesn't feel like something we ought to be doing on the way to building a nicer society.

If you still have a few minutes, please watch this video, where someone a lot more erudite than me does a far better job of explaining the point I wanted to make here. If you're in a hurry, skip to 09:30.





(*) All references to "you" are intended to be to a generic restaurant customer, although obviously the example from the quoted tweets will be salient. I hope Malcolm von Schantz will forgive me for choosing the occasion of his Twitter thread as a reminder that this issue has bothered me for some time and was in a far corner of my "blog ideas back burner".

23 September 2017

Problems in Cornell Food and Brand Lab's replacement "Can Branding Improve School Lunches?" article

Back in February 2017, I wrote this post about an article in JAMA Pediatrics from the Cornell Food and Brand Lab entitled "Can Branding Improve School Lunches?". That article has now been "retracted and replaced", which is a thing that JAMA does when it considers that an article is too badly flawed to simply issue a correction, but that these flaws were not the result of malpractice.  This operation was covered by Retraction Watch here.

Here is the retraction notice (including the replacement article), and here is the supplementary information, which consists of a data file in Excel, some SPSS syntax to analyze it, and a PDF file with the SPSS output.  It might help to get the last of these if you want to follow along; you will need to unzip it.


Minor stuff to warm up with


There are already some inconsistencies in the replacement article, notably in the table.

First, note (d) says that the condition code for Elmo-branded cookies was 2, and note (e) says that the condition code for Elmo-branded cookies was 4.  Additionally, the caption on the rightmost column of the table says that "Condition 4" was a "branded cookie". (I'll assume that means an "Elmo-branded cookie"; in my previous blog post I noted that "branded" could also mean "branded with the 'unknown character' sticker" that was mentioned in the first article, but since that sticker is now simply called the "unknown sticker", "unfamiliar sticker", or "control sticker", I'll assume that the only branding is Elmo.)  As far as I can tell (e.g., from the dataset and syntax), note (d) is correct, and note (e) and the column caption are incorrect.

Second, although the final sample size (complete cases) is reported as 615, the number of exclusions in the table is 420, and the overall sample was reported as 1,040.  So it looks like five cases are missing (1040 - 420 = 620).  On closer inspection, though, we can see that 10 cases were eliminated because they could not be tied to a school, so the calculation is (1030 - 420 = 610) and we seem to have five cases too many.  After trawling through the dataset for an hour or so, I identified five cases that had been excluded for more than one reason, but which had been counted once under each of those reasons, thus inflating the exclusion total.  (Yes, this was the authors' job, not mine.  I occasionally fantasise about one day billing Cornell for all the time I've spent trying to clean up their mess, and I suspect my colleagues feel the same way.)
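
(For anyone who likes to see the arithmetic laid out, here it is as a few lines of R; all of the numbers are the ones quoted above.)

overall   <- 1040   # overall sample
no_school <- 10     # cases that could not be tied to a school
excluded  <- 420    # exclusions listed in the table
final     <- 615    # complete cases analysed
(overall - no_school) - excluded             # 610 cases expected in the final sample
final - ((overall - no_school) - excluded)   # 5 surplus cases: each excluded for two reasons, counted twice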

I'll skip over the fact that, in reporting their only statistically significant results, the authors reported percentages of children choosing an apple that were not the actual percentages observed, but the estimated marginal means from their generalized estimating equations model (which had a more impressive gap between the Elmo-branded apple and the control condition: 13.1 percentage points versus 9.0).  If those numbers were ever cited in a claim about how many more children actually took an apple in the experiment, there would be a problem, but for the moment they are just sitting there.

Now, so far I've maybe just been nitpicking about the kind of elementary mistakes that are easy to make (and fail to spot during proofreading) entirely inadvertently when submitting an article to a journal with an impact factor of over 10, and which can be fixed with a simple correction.  Isn't there something better to write about?

Well, it's funny you should ask, because yes, there is.


The method doesn't match the data


Recall that the design of the study is such that each school (numbered 1 to 7) runs each condition of the experiment (numbered 1 to 6, with 4 not used) on one day of the week (numbered 1 to 5).  All schools are supposed to run condition 5 on day 1 (Monday) and condition 6 on day 5 (Friday), but apart from that, not every school runs, say, condition 2 on Tuesday.  However, for each school X, condition Y (and only condition Y) is run on day Z (and only on day Z).  If we have 34 choice records from school X for condition Y, these records should all have the same number for the day.

To check this, we can code up the school/condition/day combination as a single number, by multiplying the school number by 100, the condition by 10, and the day by 1, and adding them together.  For school 1, this might give us numbers such as 112, 123, 134, 151, and 165.  (The last two numbers definitely ought to be 151 and 165, because condition 5, the pre-test, was run on day 1 and condition 6, the post-test, was run on day 5.)  For school 2, we might find 214, 222, 233, 251, and 265.  There are five possible numbers per school, one per day (or, if you prefer, one per condition).  There were seven schools with data included in the analysis, so we would expect to see a maximum of 35 different three-digit school/condition/day numbers in a summary of the data, although there may well be fewer because in some cases a school didn't contribute a valid case on one or more days.  (I have made an Excel sheet with these data available here, along with SPSS and CSV versions of the same data, but I encourage you to check my working starting from the data supplied by the authors; detailed instructions can be found in the Appendix below.)
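
If you want to reproduce this coding step in R rather than Excel, here is a minimal sketch.  The column names (School, Condition, Day) are my assumptions and may not match the authors' file exactly; for an exact match with the table discussed below, you would first reduce the file to the 615 analysed cases, as described in the Appendix.

library(readxl)

# Read the authors' master worksheet (file and sheet names are from the Appendix below)
d <- read_excel("ApamJamaBrandingElmo_Dataset_09-21-17.xlsx", sheet = "Masterfile")

# Combine school, condition and day into one three-digit code: school 1, condition 5, day 1 becomes 151
d$scd <- as.numeric(d$School) * 100 + as.numeric(d$Condition) * 10 + as.numeric(d$Day)

table(d$scd)   # how many cases fall into each school/condition/day combination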

Let's have a look at the resulting pivot table showing the frequencies of each school/condition/day number.  The column on the left is our three-digit number that combines school, condition, and day, while the column on the right is the number of times it appears.  Remember, in the left-hand column, we are expecting up to five numbers starting with 1, five starting with 2, etc.


Oh.

You can see from the Grand Total line that we have accounted for all 615 cases in here.  And for schools 5 and 7, we see the pattern we expected.  For example, school 7 apparently ran condition 1 on day 3, condition 2 on day 4, condition 3 on day 2, and (as expected) condition 5 on day 1 and condition 6 on day 5.  School 6 only had a few cases.  (I'm trying to work out if it's meaningful to add them to the model; since we only have cases from one condition, we can have no idea what the effect of the Elmo sticker was in school 6.  But that's a minor point for the statisticians.)

School 3 isn't too bad: the numbers opposite 324 and 325 suggest that condition 2 was run on day 4 with 6 cases, and on day 5 for 1 case.  Maybe that 325 is just a typo.

However, schools 1, 2, and 4 --- which, between them, account for 73.3% of the cases in the dataset --- are a disaster.  Let's start at the top with school 1: 19 cases for condition 1 on day 2, 12 cases for condition 1 on day 3.  Conditions 2, 3, and 5 appear to have been run on three different days in that school.  Cases from condition 1 appear on four different days in school 4.  Here is a simple illustration from the authors' own Excel data file:


Neither the retracted article, nor the replacement, makes any mention of multiple conditions being run on any given day.  Had this been an intended feature of the design, presumably the day might have been included as a predictor in the model (although it would probably have been quite strongly correlated with condition).  You can check the description of the design in the replacement article, or in the retracted article in both its published and draft forms, which I have made available here.  So apparently the authors have either (again) inadvertently failed to describe their method correctly, or they have inadvertently used data that is highly inconsistent with their method as it was reported in both the retracted article and the replacement.

Perhaps we can salvage something here.  By examining the counts of each combination of school/condition/day, I was able to eliminate what appeared to be the spurious multiple-days-per-condition cases while keeping the maximum possible number of cases for each school.  In most cases, I was able to make out the "dominant" condition for any given day; for example, for school 1 on day 2, there are cases for 19 participants in condition 1, five in condition 2, nine in condition 3, and two in condition 5, so we can work (speculatively, of course) on the basis that condition 1 is the correct one, and delete the others.  After doing this for all the affected schools, 478 cases were left, which represents a loss of 22.3% of the data.  Then I re-ran the authors' analyses on these cases (using SPSS, since the authors provided SPSS syntax and I didn't want to risk translating a GEE specification to R):


Following the instructions in the supplement, we can calculate the percentages of children who chose an apple in the control condition (5) and the "Elmo-branded apples" condition (1) by subtracting the estimated marginal means from 100%, giving us an increase from (1.00 - .757) = 24.3% to (1.00 - .694) = 30.6%, or 6.3 percentage points instead of the 13.1 reported in the replacement article.  The test statistic shows that this is not statistically significant, Wald χ2 = 2.086, p = .149.  Of course, standard disclaimers about the relevance or otherwise of p values apply here; however, since the authors themselves seemed to believe that a p = .10 result is evidence of no association when they reported that "apple selection was not significantly associated ... in the posttest condition (Wald χ2 = 2.661, P = .10)", I think we can assume that they would not have considered the result here to be evidence that "Elmo-branded apples were associated with an increase in a child’s selection of an apple over a cookie".
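
For those who want to try this at home, here is a rough sketch in R of the clean-up rule I applied (keep, for each school and day, whichever condition has the most cases).  It continues from the data frame d of analysed cases in the earlier sketch, with my assumed column names, and it is an illustration rather than a definitive recipe; ties, for instance, are not resolved.

library(dplyr)

cleaned <- d %>%
  group_by(School, Day, Condition) %>%
  mutate(n_cases = n()) %>%             # cases sharing this school/day/condition
  group_by(School, Day) %>%
  filter(n_cases == max(n_cases)) %>%   # keep only the "dominant" condition for each school/day
  ungroup() %>%
  select(-n_cases)

nrow(cleaned)   # as noted above, 478 cases were left after this kind of clean-up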


The dataset contains duplicates

 

There is one more strange thing in this dataset, which appears to provide more evidence that the authors ought to go back to the drawing board because something about their design, data collection methods, or coding practices is seriously messed up.  I coded up participants and conditions into a single number by multiplying condition by 1000 and adding the participant ID, giving, for example, 4987 for participant 987 in condition 4.  Since each participant should have been exposed to each condition exactly once, we would expect to see each of these numbers only once.  Here's the pivot table.  The column on the left is our four-digit number that combines condition and participant number, while the column on the right is the number of cases.  Remember, we are expecting 1s, and only 1s, for the number of cases in every row:

Oh. Again. (The table is quite long, but below what's shown here, it's all ones in the Total column.)
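
For completeness, here is the same check sketched in R, again continuing from the data frame d of analysed cases built earlier; the name of the participant identifier column (ID) is my assumption and may differ in the authors' file.

# Combine condition and participant ID into one code: participant 987 in condition 4 becomes 4987
d$cond_id <- as.numeric(d$Condition) * 1000 + as.numeric(d$ID)
dup_counts <- table(d$cond_id)
dup_counts[dup_counts > 1]   # any condition/participant combination appearing more than once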

Eleven participants were apparently exposed to the same condition twice. One participant (number 26) was assigned to condition 1 on three days (3, 4, and 5).  Here are those duplicate cases in the authors' original Excel data file:


(Incidentally, it appears that this child took the cookie on the first two days in the "Elmo-branded apple" condition, and then the apple on the last day.  Perhaps that's a demonstration of the longitudinal persuasive power of Elmo stickers to get kids to eat fruit.)


Conclusion: Action required


Not to put too fine a point on it, this dataset appears to be a complete mess (other adjectives are available, but some may not be suitable for a family-friendly blog such as this one aspires to be).

What should happen now?  This seems like too big a problem to be fixable by a correction.  Maybe the authors should retract their "retract and replace" replacement, and replace it with a re-replacement.  (I wonder if an article has ever been retracted twice?)  This could, of course, go on for ever, with "retract and replace" becoming "rinse and repeat".  But perhaps at some point the journal will step in and put this piece of research out of its misery.



Appendix: Step-by-step instructions for reproducing the dataset


As mentioned above, the dataset file provided by the authors has two tabs ("worksheets", in Excel parlance).  One contains all of the records that were coded; the other contains only the complete and valid records that were used in the analyses.  Of these, only the first contains the day on which the data were collected.  So the problem is to generate a dataset with records from the first tab (including "Day") that correspond to the records in the second tab.

To do this:
1. Open the file ApamJamaBrandingElmo_Dataset_09-21-17.xlsx (from the supplementary information here).  Select the "Masterfile" tab (worksheet).
2. Sort by "Condition" in ascending order.
3. Go to row 411 and observe that Condition is equal to 4. Delete rows 411 through 415 (the records with Condition equal to 4).
4. Go to row 671 and observe that Condition is equal to "2,3".  Delete all rows from here to the end of the file (including those where Condition is blank).
5. Sort by "Choice" in ascending order.
6. Go to row 2. Delete rows 2 and 3 (the records with "Choice" coded as 0 and 0.5).
7. Go to row 617 and observe that the following "Choice" values are 3, 5, 5, "didn't eat a snack", and then a series of blanks.  Delete all rows from 617 through the end of the file.
8. You now have the same 615 records as in the second tab of the file, "SPSS-615 cases", with the addition of the "Day" field.  You can verify this by sorting both on the same fields and pasting the columns of one alongside the other.
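
Alternatively, if you would rather not click around in Excel, the following R sketch should mirror the same deletions.  The codes to drop are taken from the steps above, and the column types are guesses, so treat this as an approximation to be checked against the 615 rows in the "SPSS-615 cases" tab.

library(readxl)

m <- read_excel("ApamJamaBrandingElmo_Dataset_09-21-17.xlsx", sheet = "Masterfile")

# Keep only conditions 1, 2, 3, 5, and 6; this drops condition 4, the "2,3" entries, and blanks
m$Condition <- as.character(m$Condition)
m <- m[!is.na(m$Condition) & m$Condition %in% c("1", "2", "3", "5", "6"), ]

# Drop the invalid "Choice" codes listed in steps 6 and 7, plus blanks
m$Choice <- as.character(m$Choice)
m <- m[!is.na(m$Choice) & !(m$Choice %in% c("0", "0.5", "3", "5", "didn't eat a snack")), ]

nrow(m)   # should be 615 if this matches the authors' second tab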

13 September 2017

Now even second-hand books are fake

First off, my apologies for the shameless plug in this post.  I co-edited a book and it's due to appear this week:


Actually, I should say it "was" due to appear this week, because as you can see, Amazon thought it would be available yesterday (Tuesday 12 September).  My colleagues and I have submitted the final proof corrections and been told that it's in production, but I guess these things can always slip a bit.  So Amazon's robots have detected that the book is officially available, but they don't have any copies in the warehouse.  Hence the (presumably automatically-generated) message saying "Temporarily out of stock".

As of today the book costs £140 at Amazon UK, $187.42 at Amazon.com, €202.16 at Amazon.de, €158.04 at Amazon.es, and Amazon.fr don't have a price for it.  This is a lot of money and I can't really recommend that the average student buy a personal copy, although I hope that anyone who is even peripherally interested in positive psychology will pester their librarian to acquire several!  (That said, I've recently reviewed some academic books that are about one-quarter the size and which cost £80-100.  So perhaps our book isn't too bad value in relative terms.  Plus, at 590 pages, you can probably use it to fend off attackers, although obviously this is not professional security advice and you should definitely call the police first.)

But all of that is a bit academic (ha ha) today, because the book is out of stock.  Or is it?  Look at that picture again: "1 Used from £255.22".

Now, I guess it makes sense that a used book would sometimes be more expensive than the new one, if the latter is out of stock.  Maybe the seller has a copy and hopes that someone really, really needs the book quickly and is prepared to pay a premium for that (assuming that a certain web site *cough* doesn't have the full PDF yet).  Wonderful though I obviously think our book is, though, I can't imagine that anyone has been assigned it as a core text for their course just yet (hint hint).  But I guess this stuff is all driven by algorithms, which presumably have to cope with books like this and the latest from J. K. Rowling, so maybe that's OK.

However, alert readers may already have spotted the bigger problem here.  There is no way that the seller of the used book can actually have a copy of it in stock, because the book does not exist yet.  I clicked on the link and got this:


So not only is the book allegedly used, but it's in the third-best possible condition ("Good", below "Like New" and "Very Good").

The seller, Tundra Books, operates out of what appears to be a residential address in Sevilla, Spain.  Of course, plenty of people work from home, but it doesn't look like you could fit a great deal of stock in one of those maisonettes.  I wonder what magical powers they possess that enable them to beam slightly dog-eared academic handbooks back from the future?  Or is it just possible that if I order this book for over £100 more than the list price, I will receive it in about four weeks or so: roughly the time it takes to order a new one from Amazon.es, ruffle the pages a bit, and send it on to me?


Credits:
- My attention was first drawn to the "out of stock" issue by my brother-in-law, Tony Douglas, who also painted the wonderful picture on the cover.  Seriously, even if you don't open the book it's worth the price for the cover alone.  (And remember, "it's not nepotism if the work is really good".)
- Matti Heino spotted the "1 Used from £255.22" in the screen shot.


07 June 2017

Exploring John Carlisle's "bombshell" article about fabricated data in RCTs

For the past couple of days, this article by John Carlisle has been causing a bit of a stir on Twitter. The author claims that he has found statistical evidence that a surprisingly high proportion of randomised controlled trials (RCTs) contain data patterns that cannot have arisen by chance.  Given that he has previously been instrumental in uncovering high-profile fraud cases, and also that he used data from articles that are known to be fraudulent (because they have been retracted) to calibrate his method, the implication is that some percentage of these impossible numbers are the result of fraud.  The title of the article is provocative, too: "Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals".  So yes, there are other reasons, but the implication is clear (and has been picked up by the media): There is a bit / some / a lot of data fabrication going on.

Because I anticipate that Carlisle's article is going to have quite an impact once more of the mainstream media decide to run with it, I thought I'd spend some time trying to understand exactly what Carlisle did.  This post is a summary of what I've found out so far.  I offer it in the hope that it may help some people to develop their own understanding of this interesting forensic technique, and perhaps as part of the ongoing debate about the limitations of such "post publication analysis" techniques (which also include things such as GRIM and statcheck).

[[Update 2017-06-12 19:35 UTC: There is now a much better post about this study by F. Perry Wilson here.]]

How Carlisle identified unusual patterns in the articles

Carlisle's analysis method was relatively simple.  He examined the baseline differences between the groups in the trial on (most of) the reported continuous variables.  These could be straightforward demographics like age and height, or they could be some kind of physiological measure taken at the start of the trial.  Because participants have been randomised into these groups, any difference between them is (by definition) due to chance.  Thus, we would expect a certain distribution of the p values associated with the statistical tests used to compare the groups; specifically, we would expect to see a uniform distribution (all p values are equally likely when the null hypothesis is true).
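
As a quick illustration of that premise (this is just a sanity check of my own, not part of Carlisle's analysis), simulating lots of baseline comparisons between two groups drawn from the same population gives p values that are spread evenly between 0 and 1:

set.seed(1)
pvals <- replicate(10000, {
  a <- rnorm(100, mean = 50, sd = 10)   # "treatment" group baseline variable
  b <- rnorm(100, mean = 50, sd = 10)   # "control" group drawn from the same population
  t.test(a, b)$p.value
})
hist(pvals, breaks = 20)   # roughly flat: every p value is about equally likely under the null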

Not all of the RCTs report test statistics and/or p values for the difference between groups at baseline (it is not clear what a p value would mean, given that the null hypothesis is known to be true), but they can usually be calculated from the reported means and standard deviations.  In his article, Carlisle gives a list of the R packages and functions that he used to reconstruct the test statistics and perform his other analyses.

Carlisle's idea is that, if the results have been fabricated (for example, in an extreme case, if the entire RCT never actually took place), then the fakers probably didn't pay too much attention to the p values of the baseline comparisons.  After all, the main reason for presenting these statistics in the article is to show the reader that your randomisation worked and that there were no differences between the groups on any obvious confounders.  So most people will just look at, say, the ages of the participants, and see that in the experimental condition the mean was 43.31 with an SD of 8.71, and in the control condition it was 42.74 with an SD of 8.52, and think "that looks pretty much the same".  With 100 people in each condition, the p value for this difference is about .64, but we don't normally worry about that very much; indeed, as noted above, many authors wouldn't even provide a p value here.
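
You can reconstruct that .64 yourself from nothing more than the reported means, SDs, and group sizes.  Here is one way to do it, with a pooled two-sample t test (Carlisle's own calculations may differ in detail):

m1 <- 43.31; s1 <- 8.71; n1 <- 100   # experimental condition, as in the example above
m2 <- 42.74; s2 <- 8.52; n2 <- 100   # control condition

sp <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))   # pooled SD
t  <- (m1 - m2) / (sp * sqrt(1/n1 + 1/n2))
2 * pt(-abs(t), df = n1 + n2 - 2)    # two-tailed p, about .64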

Now consider what happens when you have ten baseline statistics, all of them fabricated.  People are not very good at making up random numbers, and the fakers here probably won't even realise that as well as inventing means and SDs, they are also making up p values that ought to be randomly distributed.  So it is quite possible that they will make up mean/SD combinations that imply differences between groups that are either collectively too small (giving large p values) or too large (giving small p values).

Reproducing Carlisle's analyses

In order to better understand exactly what Carlisle did, I decided to reproduce a few of his analyses.  I downloaded the online supporting information (two Excel files which I'll refer to as S1 and S2, plus a very short descriptive document) here.  The Excel files have one worksheet per journal with the worksheet named NEJM (corresponding to articles published in the New England Journal of Medicine) being on top when you open the file, so I started there.

Carlisle's principal technique is to take the p values from the various baseline comparisons and combine them.  His main way of doing this is with Stouffer's formula, which is what I've used in this post.  Here's how that works:
1. Convert each p value into a z score.
2. Sum the z scores.
3. If there are k scores, divide the sum of the z scores from step 2 by the square root of k.
4. Calculate the one-tailed p value associated with the overall z score from step 3.

In R, that looks like this.  Note that you just have to create the vector with your p values (first line) and then you can just copy/paste the second line, which implements the entire Stouffer formula.

plist = c(.95, .84, .43, .75, .92)                          # the p values from the baseline comparisons
1 - pnorm(sum(sapply(plist, qnorm))/sqrt(length(plist)))    # Stouffer's formula: convert to z, sum, scale, convert back
[1] 0.02110381


That is, the p value associated with the test that these five p values arose by chance is .02.  Now if we start suggesting that something is untoward based on the conventional significance threshold of .05 we're going to have a lot of unhappy innocent people, as Carlisle notes in his article (more than 1% of the articles he examined had a p value < .00001), so we can probably move on from this example quite rapidly.  On the other hand, if you have a pattern like this in your baseline t tests:

plist = c(.95, .84, .43, .75, .92, .87, .83, .79, .78, .84)
1 - pnorm(sum(sapply(plist, qnorm))/sqrt(length(plist)))
[1] 0.00181833

then things are starting to get interesting.  Remember that the p values should be uniformly distributed from 0 to 1, so we might wonder why all but one of them are above .50.

In Carlisle's model, suspicious distributions are typically those with too many high p values (above 0.50) or too many low ones, which give overall p values that are close to either 0 or 1, respectively.  For example, if you subtract all five of the p values in the first list I gave above from 1, you get this:

plist = c(.05, .16, .57, .25, .08)
1 - pnorm(sum(sapply(plist, qnorm))/sqrt(length(plist)))
[1] 0.9788962


and if you subtract that final p value from 1, you get the value of 0.0211038 that appears above.

To reduce the amount of work I had to do, I chose three articles that were near the top of the NEJM worksheet in the S1 file (in principle, the higher up the worksheet the study is, the bigger the problems) and that had not too many variables in them.  I have not examined any other articles at this time, so what follows is a report on a very small sample and may not be representative.

Article 1

The first article I chose was by Jonas et al. (2002), "Clinical Trial of Lamivudine in Children with Chronic Hepatitis B", which is on line 8 of the NEJM worksheet in the S1 file.  The trial number (cell B8 of the NEJM worksheet in the S1 file) is given as 121, so I next looked for this number in column A of the NEJM worksheet of the S2 file and found it on lines 2069 through 2074.  Those lines allow us to see exactly which means and SDs were extracted from the article and used as the basis for the calculations in the S1 file.  (The degree to which Carlisle has documented his analyses is extremely impressive.)  In this case, the means and SDs correspond to the three baseline variables reported in Table 1 of Jonas et al.'s article:
By combining the p values from these variables, Carlisle arrived at an overall inverted (see p. 4 of his article) p value of .99997992. This needs to be subtracted from 1 to give a conventional p value, which in this case is .00002.  That would appear to be very strong evidence against the null hypothesis that these numbers are the product of chance. However, there are a couple of problems here.

First, Carlisle made the following note in the S1 file (cell A8):
Patients were randomly assigned. Variables correlated (histologic scores). Supplementary baseline data reported as median (range). p values noted by authors.
But in cells O8, P8, and Q8, he gives different p values from those in the article, which I presume he calculated himself.  After subtracting these p values from 1 (to take into account the inversion of the input p values that Carlisle describes on p. 4 of his article), we can see that the third p value in cell Q8 is rather different (.035 has become approximately .10).  This is presumably because the original p values were derived from a non-parametric test, which would be impossible to reproduce without the data, so Carlisle will have assumed a parametric model (for example, p values can be calculated for a t test from the mean, SD, and sample sizes of the two groups).  Note that in this case, the difference in p values actually works against a false positive, but the general point is that not all p value analyses can be reproduced from the summary statistics.

Second, and much more importantly, the three baseline variables here are clearly not independent.  The first set of numbers ("Total score") is merely the sum of the other two, and arguably these other two measures of liver deficiencies are quite likely to be related to one another.  Even if we ignore that last point and only remove "Total score", considering the other two variables to be completely independent, the overall p value for this RCT would change from .00002 to .001.

plist = c(0.997916597, 0.998464969, 0.900933333)
1 - pnorm(sum(sapply(plist, qnorm))/sqrt(length(plist)))
[1] 2.007969e-05
 
plist = c(0.998464969, 0.900933333)     #omitting "Total score"
1 - pnorm(sum(sapply(plist, qnorm))/sqrt(length(plist)))
[1] 0.001334679
 
Carlisle discusses the general issue of non-independence on p. 7 of his article, and in the quote above he actually noted that the liver function scores in Jonas et al. were correlated.  That makes it slightly unfortunate that he didn't take some kind of measure to compensate for the correlation.  Leaving the raw numbers in the S1 file as if the scores were uncorrelated meant that Jonas et al.'s article appeared to be the seventh most severely problematic article in NEJM.

(* Update 2017-06-08 10:07 UTC: In the first version of this post, I wrote "it is slightly unfortunate that [Carlisle] apparently didn't spot the problem in this case".  This was unfair of me, as the quote from the S1 file shows that Carlisle did indeed spot that the variables were correlated.)

Article 2

Looking further down file S1 for NEJM articles with only a few variables, I came across Glauser et al. (2010), "Ethosuximide, Valproic Acid, and Lamotrigine in Childhood Absence Epilepsy" on line 12.  This has just two baseline variables for participants' IQs measured with two different instruments.  The trial number is 557, which leads us to lines 10280 through 10285 of the NEJM worksheet in the S2 file.  Each of the two variables has three values, corresponding to the three treatment groups.

Carlisle notes in his article (p. 7) that the p value for the one-way ANOVA comparing the groups for the second variable is misreported.  The authors stated that this value is .07, whereas Carlisle calculates (and I concur, using ind.oneway.second() from the rpsychi package in R) that this should be around .0000007.  Combining this p value with the .47 from the first variable, Carlisle obtains an overall (conventional) p value of .0004 to test the null hypothesis that these group differences are the result of chance.



But it seems to me that there may be a more parsimonious explanation for these problems.  The two baseline variables are both measures of IQ, and one would expect them to be correlated.  Inspection of the group means in Glauser et al.'s Table 2 (a truncated version of which is shown above) suggests that the value for the Lamotrigine group on the WPPSI measure is a lot lower than might be expected, given that this group scored slightly higher on the WISC measure.  Indeed, when I replaced the value of 92.0 with 103.0, I obtained a p value for the one-way ANOVA of almost exactly .07.  Of course, there is no direct evidence that 92.0 was the result of a finger slip (or, perhaps, copying the wrong number from a printed table), but it certainly seems like a reasonable possibility.  A value of 96.0 instead of 92.0 would also give a p value close to .07.

library(rpsychi)            # provides ind.oneway.second(), a one-way ANOVA from summary statistics
xsd = c(16.6, 14.5, 14.8)   # group SDs
xn = c(155, 149, 147)       # group sizes
 
an1 = ind.oneway.second(m=c(99.1, 92.0, 100.0), sd=xsd, n=xn)
an1$anova.table$F[1]
[1] 12.163 
1 - pf(q=12.163, df1=2, df2=448)
[1] 7.179598e-06




an2 = ind.oneway.second(m=c(99.1, 103.0, 100.0), sd=xsd, n=xn)
an2$anova.table$F[1]
[1] 2.668
1 - pf(q=2.668, df1=2, df2=448)
[1] 0.0704934
 
It also seems slightly strange that someone who was fabricating data would choose to make up this particular pattern.  After having invented the numbers for the WISC measure, one would presumably not add a few IQ points to two of the values and subtract a few from the other, thus inevitably pushing the groups further apart, given that the whole point of fabricating baseline IQ data would be to show that the randomisation had succeeded; to do the opposite would seem to be very sloppy.  (However, it may well be the case that people who feel the need to fabricate data are typically not especially intelligent and/or statistically literate; the reason that there are very few "criminal masterminds" is that most masterminds find a more honest way to earn a living.)

Article 3

Continuing down the NEJM worksheet in the S1 file, I came to Heydendael et al. (2003) "Methotrexate versus Cyclosporine in Moderate-to-Severe Chronic Plaque Psoriasis" on line 33.  Here there are three baseline variables, described on lines 3152 through 3157 of the NEJM worksheet in the S2 file.  These variables turn out to be the patient's age at the start of the study, the extent of their psoriasis, and the age at which psoriasis first developed, as shown in Table 1, which I've reproduced here.

Carlisle apparently took the authors at their word that the numbers after the ± symbol were standard errors, as he seems to have converted them to standard deviations by multiplying them by the square root of the sample size (cells F3152 through F3157 in the S2 file).  However, it seems clear that, at least in the case of the ages, the "SEs" can only have been SDs.  The values calculated by Carlisle for the SDs are around 80, which is an absurd standard deviation for human ages; in contrast, the values ranging from 12.4 through 14.5 in the table shown above are quite reasonable as SDs.  It is not clear whether the "SE" values in the table for the psoriasis area-and-severity index are in fact likely to be SDs, or whether Carlisle's calculated SD values (23.6 and 42.8, respectively) are more likely to be correct.
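
Here is that conversion made concrete, using the two baseline age values (13.0 and 12.4, which also appear in the code below):

n  <- c(43, 42)        # group sizes from the article
se <- c(13.0, 12.4)    # the values after the ± symbol for baseline age
se * sqrt(n)           # about 85 and 80: the SDs implied if the ± values really were SEs, which is absurd for human ages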

Carlisle calculated an overall p value of .005959229 for this study.  Assuming that the SDs for the age variables are in fact the numbers listed as SEs in the above table, I get an overall p value of around .79 (with a little margin for error due to rounding error on the means and SDs, which are given to only one decimal place).

xn = c(43, 42)
 


an1 = ind.oneway.second(m=c(41.6, 38.3), sd=c(13.0, 12.4), n=xn)
an1$anova.table$F[1]
[1] 1.433
1 - pf(q=1.433, df1=1, df2=83)
[1] 0.2346833 
 
an2 = ind.oneway.second(m=c(25.1, 24.3), sd=c(14.5, 13.3), n=xn)
an2$anova.table$F[1]
[1] 0.07
1 - pf(q=0.07, df1=1, df2=83)
[1] 0.7919927
 
# Keep value for psoriasis from cell P33 of S1 file
plist = c(0.2346833, 0.066833333, 0.7919927)
1 - pnorm(sum(sapply(plist, qnorm))/sqrt(length(plist)))
[1] 0.7921881
 

The fact that Carlisle apparently did not spot this issue is slightly ironic given that he wrote about the general problem of confusion between standard deviations and standard errors in his article (pp. 5-6) and also included comments about possible mislabelling by authors of SDs and SEs in several of the notes in column A of the S1 spreadsheet file.

Conclusion

The above analyses show how easy it can be to misinterpret published articles when conducting systematic forensic analyses. I can't know what was going through Carlisle's mind when he was reading the articles that I selected to check, but having myself been through the exercise of reading several hundred articles over the course of a few evenings looking for GRIM problems, I can imagine that obtaining a full understanding of the relations between each of the baseline variables may not always have been possible.

I want to make it very clear that this post is not intended as a "debunking" or "takedown" of Carlisle's article, for several reasons.  First, I could have misunderstood something about his procedure (my description of it in this post is guaranteed to be incomplete).  Second, Carlisle has clearly put a phenomenal amount of effort (thousands of hours, I would guess) into these analyses, for which he deserves a vast amount of credit (and does not deserve to be the subject of nitpicking).  Third, Carlisle himself noted in his article (p. 8) that it was inevitable that he had made a certain number of mistakes.  Fourth, I am currently in a very similar line of business myself at least part of the time, with GRIM and the Cornell Food and Brand Lab saga, and I know that I have made multiple errors, sometimes in public, where I was convinced that I had found a problem and someone gently pointed out that I had missed something (and that something was usually pretty straightforward).  I should also point out that the quotes around the word "bombshell" in the title of this post are not meant to belittle the results of Carlisle's article, but merely to indicate that this is how some media outlets will probably refer to it (using a word that I try to avoid like the plague).

If I had a takeaway message, I think it would be that this technique of examining the distribution of p values from baseline variable comparisons is likely to be less reliable as a predictor of genuine problems (such as fraud) when the number of variables is small.  In theory the overall probability that the results are legitimate and correctly reported is completely taken care of by the p values and Stouffer's formula for combining them, but in practice when there are only a few variables it only takes a small issue—such as a typo, or some unforeseen non-independence—to distort the results and make it appear as if there is something untoward when there probably isn't.

I would also suggest that when looking for fabrication, clusters of small p values—particularly those below .05—may not be as good an indication as clusters of large p values.  This is just a continuation of my argument about the p value of .07 (or .0000007) from Article 2, above.  I think that Carlisle's technique is very clever and will surely catch many people who do not realise that their "boring" numbers showing no difference will produce p values that need to follow a certain distribution, but I question whether many people are fabricating data that (even accidentally) shows a significant baseline difference between groups, when such differences might be likely to attract the attention of the reviewers.

To conclude: One of the reasons that science is hard is that it requires a lot of attention to detail, which humans are not always very good at.  Even people who are obviously phenomenally good at it (including John Carlisle!) make mistakes.  We learned when writing our GRIM article what an error-prone process the collection and analysis of data can be, whether this be empirical data gathered from subjects (some of the stories about how their data were collected or curated that were volunteered by the authors whom we contacted to ask for their datasets were both amusing and slightly terrifying) or data extracted from published articles for the purposes of meta-analysis or forensic investigation.  I have a back-burner project to develop a "data hygiene" course, and hope to get round to actually developing and giving it one day!

27 April 2017

An open letter to Dr. Todd Shackelford

To the editor of Evolutionary Psychological Science:

Dear Dr. Shackelford,

On April 24, 2017, in your capacity as editor of Evolutionary Psychological Science, you issued an Editorial Note [PDF] that referenced the article "Eating Heavily: Men Eat More in the Company of Women," by Kevin M. Kniffin, Ozge Sigirci, and Brian Wansink (Evolutionary Psychological Science, 2016, Vol. 2, No. 1, pp. 38–46).

The key point of the note is that the "authors report that the units of measurement for pizza and salad consumption were self-reported in response to a basic prompt 'how many pieces of pizza did you eat?' and, for salad, a 13-point continuous rating scale."

For comparison, here is the description of the data collection method from the article (p. 41):
Consistent with other behavioral studies of eating in naturalistic environments (e.g., Wansink et al. 2012), the number of slices of pizza that diners consumed was unobtrusively observed by research assistants and appropriate subtractions for uneaten pizza were calculated after waitstaff cleaned the tables outside of the view of the customers. In the case of salad, customers used a uniformly small bowl to self-serve themselves and, again, research assistants were able to observe how many bowls were filled and, upon cleaning by the waitstaff, make appropriate subtractions for any uneaten or half-eaten bowls at a location outside of the view of the customers.
It is clear that this description was, to say the least, not an accurate representation of the research record.  Nobody observed the number of slices of pizza.  Nobody counted partial uneaten slices when the plates were bussed.  Nobody made any surreptitious observations of salad either.  All consumption was self-reported.  It is difficult to imagine how this 100-plus word description could have accidentally slipped into an article.

Even if we ignore what appears to have been a deliberately misleading description of the method, there is a further very substantial problem now that the true method is known.  That is, the entire study would seem to depend on the amounts of food consumed having been accurately and objectively measured.  Hence, the use of self-report measures of food consumption (which are subject to obvious biases, including questions around social desirability), when the entire focus of the article is on how much food people actually (and perhaps unconsciously, due to the influence of evolutionarily-determined forces) consumed in various social situations, would seem to cast severe doubt on the validity of the study.  The methods described in the Editorial Note and the article itself are thus contradictory, as they describe substantially different methodologies.  The difference between real-time unobtrusive observation by others and post hoc self-reports is both practically and theoretically significant in this case.

Hence, we are surprised that you apparently considered that issuing an "Editorial Note" was the appropriate response to the disclosure by the authors that they had given an incorrect description of their methods in the article.  Anyone who downloads the article today will be unaware that the study simply did not take place as described, nor that the results are probably confounded by the inevitable limitations of self-reporting.

Your note also fails to address a number of other discrepancies between the article and the dataset.  These include: (1) The data collection period, which the article reports as two weeks, but which the cover page for the dataset states was seven weeks; (2) The number of participants excluded for dining alone, which is reported as eight in the article but which appears to be six in the dataset; (3) The overall number of participants, which the article reports as 105, a number that is incompatible with the denominator degrees of freedom reported on five F tests on pp. 41–42 (109, 109, 109, 115, and 112).

In view of these problems, we believe that the only reasonable course of action in this case is to retract the article, and to invite the authors, if they wish, to submit a new manuscript with an accurate description of the methods used, including a discussion of the consequences of their use of self-report measures for the validity of their study.

Please note that we have chosen to publish this e-mail as an open letter here.   If you do not wish your reply to be published there, please let us know, and we will, of course, respect your wishes.

Sincerely,

Nicholas J. L. Brown
Jordan Anaya
Tim van der Zee
James A. J. Heathers
Chris Chambers


12 April 2017

The final (maybe?) two articles from the Food and Brand Lab

It's been just over a week since Cornell University, and the Food and Brand Lab in particular, finally started to accept in public that there was something majorly wrong with the research output of that lab.  I don't propose to go into that in much detail here; it's already been covered by Retraction Watch and by Andrew Gelman on his blog.  As my quote in the Retraction Watch piece says, I'm glad that the many hours of hard, detailed, insanely boring work that my colleagues and I have put into this are starting to result in corrections to the scientific record.

The statement by Dr. Wansink contained a link to a list of articles for which he states that he has "reached out to the six journals involved to alert the editors to the situation".  When I clicked on that list, I was surprised to see two articles that neither my colleagues nor I had looked at yet.  I don't know whether Dr. Wansink decided to report these articles to the journals by himself, or perhaps someone else did some sleuthing and contacted him.  In any case, I thought that for completeness (and, of course, to oblige Tim van der Zee to update his uberpost yet again) I would have a look at what might be causing a problem with these two articles.

Wansink, B. (1994). Antecedents and mediators of eating bouts. Family and Consumer Sciences Research Journal, 23, 166-182. http://dx.doi.org/10.1177/1077727X94232005

Wansink, B. (1994). Bet you can’t eat just one: What stimulates eating bouts. Journal of Food Products Marketing, 1(4), 3-24. http://dx.doi.org/10.1300/J038v01n04_02

First up, there is a considerable overlap in the text of these two articles.  I estimate that 35–40% of the text from "Antecedents" had been recycled verbatim into "Bet", as shown in this image of the two articles side by side (I apologise for the small size of the page images from "Bet"):



The two articles present what appears to be the same study, from two different viewpoints (especially in the concluding sections, which as you can see above do not have any overlapping text) and with a somewhat different set of results reported. In "Antecedents", the theme is about education: broadly speaking, getting people to understand why they embark on phases of eating the same food, and the implications for dietary education.  In "Bet", by contrast, the emphasis is placed on food marketers; the aim is to get them to understand how they can encourage people to consume more of their product.  I suppose that, like the arms export policy of a country that sells arms to both sides in the same conflict, this could be viewed as hypocrisy or blissful neutrality.

The Method and Results sections show some curious discrepancies.  I assume the two articles must be describing the same study since the basic (212) and final (178) sample sizes are the same, and where the same item responses are reported in both articles, the numbers are generally identical, with one exception that I will mention below.  Yet some details differ for no obvious reason.  Thus, in "Antecedents", participants typically took 35 minutes to fill out a 19-page booklet, whereas in "Bet" they took 25 minutes to fill out an 11-page booklet.  In "Antecedents", the reported split between the kinds of food that participants discussed eating was 41% sweet, 29% salty, 16% dairy, and 14% "other".  In "Bet" the split was 52% sweet, 36% salty, and 12% "other".  The Cronbach's alpha reported for coder agreement was .87 in "Antecedents" but .94 in "Bet".

There are further inconsistencies in the main tables of results (Table 2 in "Antecedents", Table 1 in "Bet").  The principal measured variable changes from consumption intensity (i.e., the amount of the "eating bout" food that was consumed) to consumption frequency (the number of occasions on which the food was consumed), although the numbers remain the same.  The ratings given in response to the item "I enjoyed the food" are 0.8 lower in both conditions in "Bet" compared to "Antecedents".  On p. 14 of "Bet", the author reuses some text from "Antecedents" to describe the mean correlation between nutritiousness and consumption frequency, but inexplicably manages to copy the two correlations incorrectly from Table 2 and then calculate their mean incorrectly.

Finally, the F statistics and associated p values on p. 175 of "Antecedents" and pp. 12–13 of "Bet" have incorrectly reported degrees of freedom (177 should be 176) and in several cases, the p value is not, as claimed in the article, below .05.

Is this interesting?  Well, less than six months ago it would have been major news.  But so much has changed today that I don't expect many people to want to read a story saying "Cornell professor blatantly recycled sole-authored empirical article", just as you can't get many people to click on "President of the United States says something really weird".  Even so, I think this is important.  It shows, as did James Heathers' post from a couple of weeks ago, that the same problems we've been finding in the output of the Cornell Food and Brand Lab go back more than 20 years, past the period when that lab was headquartered at UIUC (1997–2005), through its brief period at Penn (1995–1997), to Dr. Wansink's time at Dartmouth.  When Tim gets round to updating his summary of our findings, we will be up to 44 articles and book chapters with problems, over 23 years.  That's a fairly large problem for science, I think.

You can find annotated versions of the articles discussed in this post here.