This past weekend, the New York Times published an article on a study from Stanford University, whose authors apparently found a benefit from acupuncture in pregnant women with major depression. Given the track record of acupuncture (which features a resounding lack of evidence that it works), my skeptical antennae started twitching. I ferreted out the original study in the Obstetrics and Gynecology journal (link to full text here), and read it through thoroughly. This report - of a single randomized clinical trial (RCT) with fewer than 150 subjects - claimed that an acupuncture regimen, specifically designed for a particular individual, could significantly reduce depression in that individual. As I suspected, the paper made a whole lot of science-y sounding, but nonetheless vacuous, arguments; their predominant talking point seemed to be that multiple exploratory analyses were done on the observed outcome. This assertion is always suspect; an RCT shouldn't need so many exploratory analyses at the study stage, and the outcome measures should have been determined prior to the initiation of the study. As a friend of mine pointed out, "exploratory analyses" frequently means "fishing expedition", which this paper seems to have done in plenty. Unfortunately, the mainstream media coverage of this single study has been far from ideal; the news report has been worded to make it seem like a breakthrough or a major milestone in research, which is the impression the general public is left with - eventually to their detriment.
Steven Novella, a prominent clinical neurologist and blogger, has commented in great detail on the gaping lacunae in this paper over at the Science-based Medicine blog. One of the primary points that he has raised is that of prior plausibility. The study evaluated an acupuncture method designed specifically to treat depression - the method being tailored to the individuals according to a manual of traditional Chinese medicine (TCM). The TCM principles have no prior plausibility; the theory of Qi (life force) has no evidentiary backing or empirical foundation in science, and is reminiscent of the vitalist theories that were discarded as outdated over a century ago. Therefore, since acupuncture starts from a position of extremely low prior probability of efficacy, it can be validated only with a large, rigorous and reproducible study. As Carl Sagan once said, "Extraordinary claims require extraordinary evidence." Sadly, this Ob-Gyn paper wasn't it. I urge you to read Dr. Novella's takedown of it.
I came across a different set of problems. First, I had a problem with the differentiation of acupuncture into specific and non-specific for depression. If I assume that sticking needles into the body actually does something (i.e. if I consciously set aside Steven's prior plausibility criterion), it is not clear to me how the study ensured that needles stuck in "non-specifically" did not evoke the same physiological responses that "specific" needling produced. This concern seems to be corroborated by the rather modest outcome difference in the Hamilton scale, ~(-)11.5 in the specific, and ~(-)9.0 in the non-specific, or for that matter, ~(-)9.5 in the massage group, as well as by a Cohen's d of 0.39 (which is in the range of a 'small' change) between specific acupuncture and combined controls. Yes, the authors have shown a statistically significant difference between specific and combined controls, but statistical significance is a function of sample size as much as of the size of the effect. The question is: is that significance biologically relevant?
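To illustrate why a p-value alone says little about biological relevance, here is a minimal sketch that holds the effect size fixed at the Cohen's d of 0.39 quoted above and varies only the number of subjects. The function name, the equal group sizes and the normal approximation to the two-sample test are my own simplifying assumptions, not anything taken from the paper, and the group sizes in the loop are hypothetical.

```python
import math

def two_sided_p_from_d(d, n_per_group):
    """Approximate two-sided p-value for a standardized mean difference d.

    Assumes two groups of equal size and uses a normal approximation
    (a simplification; a t-test would give slightly larger p-values
    for small samples).
    """
    # Standard error of the difference, in SD units, is sqrt(2 / n).
    z = abs(d) * math.sqrt(n_per_group / 2.0)
    # erfc(z / sqrt(2)) equals 2 * (1 - Phi(z)), the two-sided tail area.
    return math.erfc(z / math.sqrt(2.0))

d = 0.39  # effect size quoted in the text above
for n in (20, 50, 100, 200):  # hypothetical group sizes
    print(f"n per group = {n:3d}  ->  approximate p = {two_sided_p_from_d(d, n):.4f}")
```

The same 'small' effect moves from non-significant to highly significant purely by adding subjects, which is why the magnitude of the difference has to be judged on its own merits.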
Secondly, their practice of combining the control groups (non-specific acupuncture with massage) is dubious. The modality of the massage has nothing in common with that of the acupuncture; why lump those together as a group and subject them to statistical tests that depend upon the sample size, unless both of the controls are expected to be completely ineffective and thereby provide a sharp contrast to the treatment group? That expectation seems to be corroborated by the lack of any difference in outcome between the controls at week 4 and a very slight difference at week 8.
Thirdly, the results section of the article was very poorly reviewed and edited, leaving very ambiguous statements that seem to mean something quite different from what the authors intended. Two examples are:
"Exploratory mixed model analyses revealed a greater reduction in Hamilton Rating Scale for Depression scores in those receiving acupuncture specific for depression than in those receiving acupuncture not specific for depression (P<.05; Cohen's d=0.46, 95% CI 0.01-0.92) but no difference from those receiving prenatal massage (P=.13; Cohen's d=0.33; 95% CI (-)0.10-0.76)."
This seems to indicate that the specific group had no difference with the massage group (Freudian slip?)
"Exploratory analysis revealed that the group receiving acupuncture specific for depression had a greater response rate than the group receiving acupuncture not specific for depression (P<.05; number needed to treat 3.9; 95% CI 2.2-19.8) but was not different from the group receiving acupuncture not specific for depression and prenatal massage (P=.20; number needed to treat 7.7)."
Look at the underlined group descriptions; does it make sense!!
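For readers unfamiliar with the number needed to treat (NNT): it is the reciprocal of the absolute difference in response rates between two groups. The small sketch below simply inverts the NNT figures quoted above to show what difference in response rates each one implies. The function name is mine, and the labels refer to the comparisons only by their reported P values, since the group descriptions themselves are ambiguous.

```python
def implied_rate_difference(nnt):
    """Absolute difference in response rates implied by a given NNT.

    Uses the standard definition NNT = 1 / (rate_treated - rate_control).
    """
    if nnt <= 0:
        raise ValueError("NNT must be positive")
    return 1.0 / nnt

# NNT values as quoted from the paper above.
quoted = {
    "NNT 3.9 (P<.05 comparison)": 3.9,
    "  lower CI bound 2.2": 2.2,
    "  upper CI bound 19.8": 19.8,
    "NNT 7.7 (P=.20 comparison)": 7.7,
}

for label, nnt in quoted.items():
    diff = implied_rate_difference(nnt)
    print(f"{label}: about {diff * 100:.1f} percentage points")
```

Note how wide the quoted confidence interval is once it is translated back into response rates: an NNT between 2.2 and 19.8 spans a difference of roughly 45 percentage points down to roughly 5.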
The authors indicate that the remission rates were not significantly different between the treatment and control groups. Besides, if one goes by the numbers in Table 3, the 'specific' acupuncture group reported many more side effects than the other groups. I didn't see any effect size statistics on that!
I was surprised (and oddly pleased) to see the use of effect size as a statistic in this study. Effect size is a descriptive statistic that measures the magnitude of the relationship between two variables as a sample-based estimate, without making a statement about whether that relationship holds in the population. While it effectively complements inferential statistics, such as p-values, and is useful in exploratory studies (which is what this study was billed as) and in meta-analyses, it does not prima facie indicate whether the observations are generalizable to the population or not. Standardized effect size measures, such as Cohen's d (the difference of two group means divided by the pooled standard deviation; used in this study), may not have any biological significance when used in individual studies. Besides, the authors provided no justification for setting their study standard to a moderate/medium effect (defined as a Cohen's d of 0.5). For effect size measures, I would have liked to see odds ratio and/or relative risk measures, which are standard for case-control studies and RCTs.
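For concreteness, here is a minimal sketch of the effect size measures mentioned above: Cohen's d as the difference of two group means divided by the pooled standard deviation, plus relative risk and odds ratio computed from event counts. The function names and all the numbers are made up for illustration; none of them come from the paper.

```python
import math
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Difference of group means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(group_a) +
                  (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)

def relative_risk(events_t, n_t, events_c, n_c):
    """Ratio of the event rate in the treated group to that in controls."""
    return (events_t / n_t) / (events_c / n_c)

def odds_ratio(events_t, n_t, events_c, n_c):
    """Ratio of the odds of the event in the treated group to that in controls."""
    odds_t = events_t / (n_t - events_t)
    odds_c = events_c / (n_c - events_c)
    return odds_t / odds_c

# Hypothetical scores for two tiny groups (illustration only).
print(cohens_d([2, 4, 6], [1, 3, 5]))

# Hypothetical counts: 30 of 50 responders versus 20 of 50.
print(relative_risk(30, 50, 20, 50))
print(odds_ratio(30, 50, 20, 50))
```

Unlike Cohen's d, relative risk and odds ratio are stated directly in terms of how many patients respond, which makes them easier to interpret clinically.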
In summary, despite a lot of fancy statistics (what a friend of mine aptly described as statistical legerdemain), the study suffers from several inadequacies (for instance, insufficient blinding), and more devastatingly, the absence of a strong and reliable outcome. Sadly, though, this study will be touted as proof positive of the efficacy of acupuncture as an intervention by hordes of pseudoscience-worshippers.