When To Trust Research Findings

Many published studies reach incorrect conclusions. By one famous estimate (1), most published research findings are false, and in projects designed to directly replicate landmark studies, replication rates of positive findings are often below 50% (2, 3, 4).

In this article, I want to explain why incorrect research findings are so common, and then discuss criteria you can use to predict the likelihood that the findings of a paper are accurate.

There are several reasons why incorrect results frequently get published.

This article is from a previous issue of Monthly Applications in Strength Sport (MASS), my monthly research review with Eric Helms and Mike Zourdos. Each issue of MASS contains at least 9 pieces of content.

Why Incorrect Results Get Published

Publication Bias

I think publication bias is the primary reason why the literature contains a disproportionate number of incorrect findings. Journals are much more likely to publish “statistically significant” results than non-significant results (5), because significant results are often seen as sexier and more exciting.

Often, if an experiment doesn’t turn up significant results, a scientist won’t even bother submitting it for publication, either because they view the experiment as a failure (even though null results are important results too!), or because they know they’ll deal with multiple rounds of submissions, revisions, reformatting, further submissions, further revisions, further reformatting, etc., before eventually landing the study in a low-impact journal that probably won’t have much of an effect on their career prospects. Trying to get a non-significant result published can just seem like a huge undertaking with very little payoff.

More often, however, publication bias is driven by the journals rejecting perfectly good science that just didn’t happen to get statistically significant results.

Publication bias is insidious due to sheer probability. If all research is performed in good faith (i.e. there’s no shady data analysis going on), a certain percentage of study results will be false positives purely due to chance. If all true positives and all false positives get published, while most true negatives and most false negatives wind up in a file drawer, the rate of published false positive results will be much higher than the rate of actual false positive results. I made a spreadsheet that helps illustrate this point; you can find the spreadsheet and instructions for how to use it here.
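The arithmetic behind that point can be sketched in a few lines of Python. This is not the spreadsheet mentioned above, just a minimal illustration. The function name and every input (the share of tested hypotheses that are true, the power, the alpha level, and the two publication rates) are assumptions chosen for the example, not estimates from the literature.

```python
def false_positive_shares(prior_true, power, alpha, publish_sig, publish_nonsig):
    """Return (share of all studies run, share of published studies)
    that are false positives."""
    # Split all studies run into the four possible outcomes (fractions sum to 1).
    true_pos = prior_true * power
    false_neg = prior_true * (1 - power)
    false_pos = (1 - prior_true) * alpha
    true_neg = (1 - prior_true) * (1 - alpha)

    # Apply a different publication rate to significant and non-significant results.
    published = (publish_sig * (true_pos + false_pos)
                 + publish_nonsig * (false_neg + true_neg))
    return false_pos, publish_sig * false_pos / published


# Hypothetical inputs: 20% of tested hypotheses are true, 50% power, alpha = 0.05,
# every significant result is published, 20% of non-significant results are.
run, pub = false_positive_shares(0.2, 0.5, 0.05, publish_sig=1.0, publish_nonsig=0.2)
print(f"False positives among all studies run: {run:.1%}")
print(f"False positives among published studies: {pub:.1%}")
```

With these made-up inputs, false positives make up 4% of the studies that were run but roughly 13% of the studies that were published. Lowering the publication rate for non-significant results pushes the published share higher still.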

As I mentioned, I think publication bias is the primary reason why the scientific literature is littered with incorrect, non-replicable research findings. This is not primarily the fault of the scientists – it’s a matter of journals and publishers responding to market incentives. Positive results are exciting, exciting results get cited, citations drive impact factors, and impact factors drive subscriptions and revenue. Thus, publication bias is primarily driven by journals, but is also driven by university press departments (who hype significant findings more than non-significant findings), lay-press science writers (who create awareness for exciting, significant findings), funding agencies (who also like to see exciting, significant findings before opening their wallets to fund big projects building on that research), and the scientific community as a whole (which still places way more weight on journal impact factors than is warranted).

It’s worth noting that publication bias doesn’t always work in favor of significant findings. It works in favor of “exciting” findings, which are usually significant findings. However, a null result that runs counter to some well-supported orthodoxy may also be viewed as exciting, and thus be likely to get published. For example, you’d expect lower energy intake to lead to greater weight loss, if all other variables are controlled. If a metabolic ward study found no significant difference in weight loss with two dramatically different levels of energy intake, that would be a very exciting null finding and would have absolutely no problem getting published (assuming the study’s methodology wasn’t horrible). For the most part, though, “exciting” findings tend to be significant results.

P-hacking

A second factor contributing to the publication of incorrect research findings is p-hacking. P-hacking is probably at least somewhat motivated by publication bias, but while journals bear most of the responsibility for publication bias, scientists bear most of the responsibility for p-hacking.

P-hacking describes a wide variety of tools and approaches for finding “statistically significant” results in a dataset after you fail to find the significant effect you were actually looking for. This can be accomplished several different ways.

One common method of p-hacking (or ensuring you have a dataset you can p-hack) is simply to collect a load of outcome variables: the more variables you have to analyze, the greater your chances of having at least one false positive. For example, in this famous “sting,” researchers intentionally set up a study to demonstrate the ease of p-hacking by testing the effects of chocolate consumption on a whole host of measures associated with health. Sure enough, purely by chance, they got some “significant” results showing that chocolate led to weight loss and improved cholesterol levels. Their study got a ton of press (people like to hear that chocolate is good for them) before the researchers confessed to the game they were playing.

Now, simply collecting a ton of data isn’t necessarily insidious. The issue is with the way the data are reported and analyzed. If you have 30 outcome measures, and you report the results for all variables and use the correct statistical procedures to adjust your false positive risk (making it harder to attain significance for any single variable), collecting more data is a good thing. However, if you only report the significant results and don’t use statistical procedures to adjust your false positive risk, you can wind up with a lot of “significant” findings due to chance alone.
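To see how quickly unadjusted testing adds up, here is a small Python sketch of the chance of getting at least one false positive across several outcome measures. It assumes every null hypothesis is true and that the tests are independent, which real outcome measures rarely are, so treat the numbers as a rough guide. The function name is mine.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across n_tests independent tests
    when there is no true effect on any of them."""
    return 1 - (1 - alpha) ** n_tests


# How the risk grows with the number of unadjusted outcome measures.
for n_tests in (1, 5, 10, 30):
    rate = familywise_error_rate(n_tests)
    print(f"{n_tests:>2} outcomes: {rate:.0%} chance of at least one false positive")
```

Under those assumptions, 30 unadjusted outcome measures give you roughly a 79% chance of at least one "significant" result when nothing real is going on.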

Another method of p-hacking is the use of sub-analyses. For example, if you compare two training approaches in a large, heterogeneous sample, you may not find any significant differences with the full sample. However, you can then isolate the analysis to just men, or just women, or just older subjects, or just younger subjects, or just subjects that trained in the morning, or just subjects that ate pancakes for breakfast, or just Geminis, etc. If you have a dataset that allows for a lot of sub-analyses, some of them are likely to give you significant results due to chance alone, unless you perform statistical adjustments to control your false positive risk. While statistical adjustments for multiple outcome measures are common, statistical adjustments for multiple sub-analyses are much less common. And again, authors can also choose whether to report all sub-analyses they ran, or to just report the ones that gave them significant results. The presence of sub-analyses doesn’t always indicate that p-hacking occurred – the authors may have set up the experiment with clear hypotheses about the entire cohort, and about a particular subsection of the cohort – but running a slew of sub-analyses is a common tool for squeezing false positives out of noisy datasets.

Another method of p-hacking simply involves statistical shenanigans. There are often multiple ways you can analyze a dataset. Some approaches are conservative, with low risk of false positives but greater risk of failing to detect an actual significant effect. Other approaches are more liberal, detecting almost all actual significant effects, along with more false positives. There are situations when more conservative approaches are preferable and situations when more liberal approaches are preferable. However, that’s a decision that should be made before data analysis actually starts. If a researcher decides to use a conservative statistical approach when planning a study, doesn’t find significant results once the data are collected, and switches to a more liberal approach in order to find significance, that’s p-hacking.

Finally, one very blunt way to p-hack is simply to drop the results of one or two participants who are holding you back from attaining statistical significance. In a training study, the participants in group A may have mostly made larger strength gains than the participants in group B, but one or two members of group B made huge gains, which keep the results from being significant. If you had a predefined plan for dealing with outliers (and if these subjects meet some objective criterion to qualify as outliers), then maybe you could legitimately remove their results. But if you didn’t have a predefined plan for dealing with outliers, and you make the decision to toss out their data after all the results are in, that’s p-hacking.

I think that some p-hacking is conducted maliciously, but I think most instances are simply due to people not knowing any better. When you get a dataset, it’s fun to poke around and see what sort of relationships you can find. If you don’t have a defined data analysis plan, you may not realize how much you actually poked around, and how many of the significant results you found are likely due to chance. However, regardless of intent, p-hacking increases the number of incorrect research findings, compounding the effects of publication bias.

General sloppiness

Publication bias and p-hacking are exciting (at least to nerds like me) because they represent systemic problems within the entire system and may even arise from malicious intent in some cases. Sloppiness, on the other hand, is much more mundane. However, sloppiness can also contribute to incorrect research findings.

P-hacking and “creative” data analysis rely on finding false positives. When data aren’t collected cleanly, that introduces more noise into the dataset. Especially when statistical power is low, all it can take are a few erroneously low data points in one group and a few erroneously high data points in another group for a “significant” effect to materialize. Low power is almost a fact of life in exercise science, but low power shouldn’t increase false positive risk if the data are collected cleanly (in fact, low power has the opposite effect, decreasing your odds of detecting true positive effects). However, additional noise from sloppy data collection can dramatically increase the odds of finding erroneous significant effects in an underpowered study.

Another domain where sloppiness can rear its head is in the planning phase of a study. Some studies simply use data from assessments that are inadequate for answering their research questions, or use statistical methods that are improper for the data collected. Both of those issues could be avoided by better attention to detail when designing the experiment. In a perfect world, scientists would always consult with methodologists and statisticians when designing a research project to make sure the design of the study and the statistical analysis plan are appropriate, but that rarely happens, at least in our field.

Data peeking/lack of clearly defined endpoint

When you run a study, you should have a clearly defined endpoint. Typically, that endpoint is defined by the number of subjects recruited. As the study rolls along, you may enter the data as you collect it, but you shouldn’t start analyzing the data until data collection is finished. However, if you don’t have a clearly defined endpoint, and you analyze your data as you collect it, you can inflate the odds of finding false positives. During the process of data collection, you may just hit a random run of subjects that all have results leaning in one direction. If you’re peeking at your data as you collect it, you may notice that your results have attained significance, and stop data collection there. However, if you ran more subjects through the study, those abnormal results would wash out, leaving you with no significant findings.

Peeking at your data and analyzing as you go can dramatically increase the risk of false positives, especially if your study doesn’t have a clearly defined endpoint. If you planned to recruit 40 subjects, instead of just worrying about your risk of false positives once all data is collected, you also have to deal with the risk of false positives after 10 subjects have been through the study, after 11 subjects have been through the study, 12, 13, 14, etc. When you combine data peeking with p-hacking and a dash of sloppiness, it would be incredibly unlikely that you wouldn’t end up with at least one false positive.
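A quick simulation makes the cost of peeking concrete. The Python sketch below is a deliberately simplified setup of my own, not the analysis from any of the cited papers: each simulated study measures one outcome per subject with no true effect, the standard deviation is assumed to be known (so a simple z-test applies), and the "peeking" researcher tests after every subject from the 10th to the 40th and stops at the first significant result.

```python
import math
import random


def two_sided_p(sample):
    """z-test p-value for 'mean equals 0', assuming a known standard deviation of 1."""
    z = sum(sample) / math.sqrt(len(sample))
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rates(n_sims=2000, start=10, stop=40, alpha=0.05, seed=1):
    """Return (rate with one test at the planned endpoint, rate with peeking)."""
    rng = random.Random(seed)
    fixed_hits = peek_hits = 0
    for _ in range(n_sims):
        data = [rng.gauss(0, 1) for _ in range(stop)]  # no true effect
        # Fixed endpoint: a single test once all subjects are in.
        if two_sided_p(data) < alpha:
            fixed_hits += 1
        # Peeking: test after every subject from `start` on; any hit counts.
        if any(two_sided_p(data[:n]) < alpha for n in range(start, stop + 1)):
            peek_hits += 1
    return fixed_hits / n_sims, peek_hits / n_sims


fixed_rate, peek_rate = false_positive_rates()
print(f"Single test at 40 subjects: {fixed_rate:.1%} false positives")
print(f"Testing after every subject from 10 to 40: {peek_rate:.1%} false positives")
```

The fixed-endpoint rate should land near the nominal 5%, while the peeking rate comes out well above it. The exact figures depend on the seed and on how often you look.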

This paper explains in more detail why this is a problem (6).

HARKing

HARKing (7) stands for “hypothesizing after the results are known.” In other words, instead of designing a study to investigate a clear hypothesis, you run an experiment, see what significant results you can find, and write your manuscript as if the study was designed to investigate the specific variable(s) where the outcome was significant.

HARKing doesn’t exist in a vacuum. It’s primarily just a method to make p-hacking more powerful and convincing. For example, if you run a study, collect 30 variables and analyze them all with separate t-tests (without any correction for multiple comparisons, which would rein in the false positive risk), report all outcomes, and don’t clearly state a hypothesis in your paper, it’ll be clear to most people that you p-hacked your study to smithereens. However, if you instead identify the significant results, then craft a hypothesis to make it look like the study was designed to investigate those specific variables, and don’t report results for the measures that didn’t yield significant results, no one would know that you should have statistically adjusted for the multitude of unreported outcome measures. On the surface, your paper looks solid, but in reality, you got away with statistical murder, and there’s a pretty good chance your results are false positives.

HARKing is probably the most common in cross-sectional and epidemiological research, which often deals with huge datasets and a multitude of variables. For example, if you download the data from the National Health and Nutrition Examination Survey (NHANES), it wouldn’t surprise me if you could find hundreds (or even thousands) of significant effects. Once you have your result, you just need to come up with a reason why you expected to find that result, use that made-up reason to justify a hypothesis, and ignore the dozen analyses you ran that didn’t find significant results.

Low Power

The other issues primarily increase risk of false positive findings. However, low statistical power is a double whammy – it increases the risk of false positives and false negatives (8).

Statistical power is the likelihood of detecting a true effect if one exists. In our field, we’re supposed to aim for 80% power (detecting true effects 80% of the time), though actual power is likely much lower. On the surface, this means that if you do everything right (you don’t peek at your data, you don’t p-hack, etc.) but underpower your study, you’ll be more likely to “miss” true, positive effects.

However, if you combine low power with unethical research practices, you also inflate the chance of finding false positives. In a scenario where there’s truly no effect, large sample sizes should converge on the “true” effect size of 0. However, the observed effect in a sample can fluctuate substantially before eventually approaching zero. It’s a lot easier to wind up with 8 atypical people in an experimental group than 50.
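The "8 versus 50" intuition can be put into numbers with the standard error of a difference between two group means. The Python sketch below assumes two equal-sized groups, outcomes expressed in standard deviation units, and a normal approximation for the significance cutoff (a t-test would be slightly stricter with only 8 per group). The function names are mine.

```python
import math


def se_of_difference(n_per_group, sd=1.0):
    """Standard error of the difference between two group means."""
    return sd * math.sqrt(2 / n_per_group)


def smallest_significant_difference(n_per_group, sd=1.0, z_crit=1.96):
    """Smallest observed difference that would reach p < 0.05 (z approximation)."""
    return z_crit * se_of_difference(n_per_group, sd)


for n in (8, 50):
    print(f"n = {n:>2} per group: "
          f"SE = {se_of_difference(n):.2f} SD, "
          f"smallest significant difference = {smallest_significant_difference(n):.2f} SD")
```

With 8 subjects per group, chance alone routinely produces differences of half a standard deviation, and a difference has to be close to a full standard deviation to reach significance. With 50 per group, both numbers shrink to 0.2 and about 0.39, so a significant result from a tiny study is, by necessity, a large observed effect.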

Fraud

Outright fraud is the most serious transgression on this list, but probably the least common. Fraud exists on a continuum, such that aggressive enough p-hacking or HARKing probably counts as fraud (especially if you know what you’re doing and aren’t simply engaging in those practices because you don’t know any better), while wholesale data manipulation or fabrication is much more serious fraud. It’s impossible to know how much research is truly fraudulent, but my hunch is that it’s a fairly small minority. However, it does happen, and it does contribute to the rate of incorrect research findings.

Analyzing Findings

If you made it this far, congratulations! You now know more about creative ways to mishandle data (and why they’re problematic) than most researchers. While I consider myself an advocate for science, I think it’s important that people are also informed about its dark side. Unless you know about the problems, you don’t know how to protect yourself. Also note, I’m certainly not saying that science is worthless; I recognize that there are serious flaws in the way it’s often practiced, but I also think it’s the best process we have for discovering truths about the world around us. The march of scientific and technological progress over the past 300 years should be evidence enough that, in spite of its flaws, the scientific process is ultimately effective.

With that in mind, here are some things to look out for when reading research that will help you judge the likelihood that the findings of an individual paper are accurate.

Biological Plausibility

One problem in our field is that we often rush to applied research before doing mechanistic research. In a field like medicine, the molecular effects of a drug are generally thoroughly researched before testing the efficacy of that drug in animals or humans. In exercise science, we often investigate “does this work” before investigating “mechanistically, why should we expect that this would work?”

However, when there are mechanistic studies, you can compare theory and outcome to see if results of a study “make sense.” For example, we know that beta-alanine increases muscle carnosine levels, and we know that carnosine is an important biological buffer. Thus, before running any experiments, we’d expect that beta-alanine would boost performance in situations when metabolic acidosis would limit performance, but not in situations when metabolic acidosis isn’t likely to limit performance. When we compare theory to outcomes, that’s exactly what we see: For short-duration and long-duration performance (which likely aren’t limited by metabolic acidosis), beta-alanine doesn’t seem to have much of an effect, while it does seem to improve moderate-duration performance (9). If a new study finds that beta-alanine improves 800m run times, we know that’s a plausible and expected outcome, since 800m running is at least partially limited by metabolic acidosis. However, if a new study finds that acute beta-alanine supplementation increases 1RM deadlift strength, we’d have every right to be skeptical, as there’s no clear mechanistic reason to expect such a result.

If a paper is well-written (and if mechanistic research exists on the subject), it should discuss possible mechanistic reasons for the observed results. If a paper doesn’t discuss possible mechanisms, or if it notes that the results are the opposite of what would be expected based on known mechanisms, then you have every right to be a bit more skeptical of the results.

Comparisons to past research

If a study investigates a completely novel hypothesis, you won’t have anything to compare the results to. However, most studies investigate hypotheses that are similar to those of previous research, while perhaps using a slightly different population or experimental setup. If the results of a study are similar to those of previous studies, you can have more confidence in the results. However, if a study reports results that are markedly different from similar research, it deserves more skepticism. If this occurs, a well-written paper will discuss possible reasons for its differing results. If a paper doesn’t discuss possible reasons for its divergent results, or if the authors seem to be grasping at straws, the results deserve even more skepticism.

If a study tackles a research question you’re really passionate about, it’s worth conducting your own literature search to see how that study fits into the literature. Oftentimes, authors will discuss a majority of the research findings that agree with their results but fail to mention several studies that had different outcomes. And, of course, older studies won’t cite more recent research, and the weight of the evidence may have shifted between the time a study was published and the time that you’re reading it.

Financial incentives and funding sources

I’m not going to harp on financial incentives and funding sources (10) too much, because they seem to be one of the few things most people are already aware of when critically reading research, and discussions of financial incentives or conflicts can often venture into tinfoil-hat territory.

However, the mere presence of a financial incentive (funding source or affiliation of the authors) doesn’t necessarily mean the results of a study are incorrect, and it certainly doesn’t mean the results are fraudulent. Financial incentives probably primarily influence study design (using a design that maximizes the odds of a positive result), suppression of negative or null results (magnifying the effect of publication bias), and interpretation (casting positive results in an even more laudatory light). So, when reading a study with a financial incentive, pay extra close attention to the study protocol (i.e. was the protocol specifically designed to show an effect?), and read the interpretation with a bit more skepticism, but don’t necessarily assume the results are incorrect. Additional skepticism is certainly warranted, however.

Finally, in studies where there’s a clear financial incentive, pay attention to blinding. If the person analyzing the data isn’t blinded to the subjects’ group allocation, the risk of biased data collection and biased analysis increases. Blinding increases the trustworthiness of any study, but it’s especially important for studies where the authors have clear financial incentives or conflicts of interest.

Reported p-values

I won’t belabor this point, because I know it would likely make many people’s eyes glaze over. However, most research in our field determines statistical significance based on p-value thresholds, and the most common threshold is p<0.05.

The issue with a threshold of p<0.05 is that it’s not too hard to p-hack your way to a p-value of 0.049. However, it is hard to p-hack your way to a p-value of 0.01, and even harder to p-hack your way to a p-value of 0.001. If you see a lot of p-values in a paper between 0.04 and 0.05, that doesn’t necessarily mean the authors p-hacked or HARKed their way to significant results, but it is certainly a cause for concern and skepticism. On the other hand, it’s much more likely that very low p-values are indicative of true positive results. A very low p-value doesn’t guarantee that a finding is legit, but it certainly increases the likelihood.

If a paper doesn’t report p-values (i.e. it just reports p<0.05, but not p=0.021), but it does report sample sizes and t-values, r-values, or F-ratios, you can use online calculators to convert those numbers to p-values.
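If you would rather do the conversion yourself, here is a minimal Python sketch that uses only the standard library. It covers t-values and Pearson r-values, not F-ratios, and it gets the p-value by numerically integrating the t distribution rather than calling a statistics package. The function names and the number of integration steps are my own choices, and because it relies on math.gamma it only works for degrees of freedom up to a few hundred.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p_from_t(t, df, steps=2000):
    """Two-tailed p-value, using Simpson's rule on the density from 0 to |t|."""
    t = abs(t)
    if t == 0:
        return 1.0
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3  # P(0 < T < |t|)
    return max(0.0, 1 - 2 * central_area)


def t_from_r(r, n):
    """t statistic for a Pearson correlation r from n pairs (df = n - 2)."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Example: a paper reports t = 2.5 from an independent-samples t-test
# with 12 subjects per group, so df = 12 + 12 - 2 = 22.
print(f"p = {two_tailed_p_from_t(2.5, 22):.3f}")

# Example: a paper reports r = 0.5 with n = 27 subjects, so df = 25.
print(f"p = {two_tailed_p_from_t(t_from_r(0.5, 27), 25):.3f}")
```

The degrees of freedom depend on the test that was run (for instance, n - 1 for a paired t-test), so check the paper's methods before plugging numbers in.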

Effect sizes

Effect sizes give you additional information that helps you determine the importance of a significant result. They also help you ascertain the likelihood of false positives in large datasets.

If a result is statistically significant but only associated with a small or trivial effect size, it may not actually be very important for coaches or athletes, even if the p-value is really low. In exercise science, that doesn’t happen too often since most studies don’t use large samples (you generally need a medium or large effect to achieve statistical significance if your sample is small), but it’s at least worth keeping in mind.

Furthermore, in large samples, effect sizes help you determine whether a significant finding is likely to be a false positive. If a large study uses true random sampling, then the data should be accurate and unbiased. However, true random sampling is extremely rare, meaning that large studies can still have biased samples. With hundreds or thousands of participants, almost every comparison or association will be significant, simply because statistical power is so high. When you combine enormous statistical power with non-random sampling, you have a high risk of false positive results. However, statistically significant comparisons or associations with moderate-to-large effect sizes are more likely to be true positives than comparisons or associations with trivial-to-small effect sizes, simply because greater sampling bias is required to produce erroneous large effects than erroneous small effects.
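To see why sample size and effect size need to be read together, the Python sketch below approximates the p-value you would get for a given standardized mean difference (Cohen's d) at different sample sizes. It assumes two equal-sized groups and uses a normal approximation, so it's a rough guide rather than a replacement for the actual test. The function name and the example values are mine.

```python
import math


def approx_p_for_effect(d, n_per_group):
    """Approximate two-tailed p-value for a standardized mean difference d
    between two equal-sized groups (normal approximation)."""
    z = abs(d) * math.sqrt(n_per_group / 2)
    return math.erfc(z / math.sqrt(2))


# A trivial effect and a large effect, each at a small and a very large sample size.
for d in (0.05, 0.8):
    for n in (20, 5000):
        print(f"d = {d:<4} n = {n:>4} per group: p ~ {approx_p_for_effect(d, n):.3g}")
```

A trivial effect of d = 0.05 is nowhere near significant with 20 subjects per group (p of about 0.87), but it comes out at p of about 0.012 with 5,000 per group. A small amount of sampling bias is enough to manufacture an effect that size, which is why a significant but tiny effect in a huge dataset deserves extra caution.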

Pre-registration

If you pre-register a study (11), you tell the world exactly how you’re going to conduct the study and analyze the data before initiating data collection. If a study is pre-registered, and the published methods and statistical analyses of the study match the pre-registered methods and analyses, risk of p-hacking, HARKing, etc., is dramatically reduced. You can be much more confident in its results.

However, if a study is pre-registered, and the methods or analyses differ from the pre-registered account, you can be quite confident that some sort of shenanigans are afoot. This is pretty rare, though, because the researchers know that they’re leaving a paper trail that will expose them.

Pre-registration is great, and more people should do it.

Ease of conducting a study

This is a factor you wouldn’t consider unless you’ve conducted research.

Some studies don’t take much time or effort to conduct (primarily cross-sectional studies that are easy to recruit for and don’t use labor-intensive measures), and some studies are colossal undertakings. Specifically, in our field, training studies are a HUGE pain in the ass, and can easily require 500+ man hours for all of the training visits.

I can’t pretend like I have data on this, but I think that there’s a strong relationship between ease of conducting a study and risk of publication bias.

If you didn’t have to sink much time into a study, and the study doesn’t turn up any significant results, it’s not a huge loss if the study doesn’t get published. You may not even bother submitting it, or you may just cut your losses after one or two rejections. If, on the other hand, you spent months of your life and hundreds of hours carrying out a study, you’re going to be willing to wade through hell or high water to make sure it gets published. If you don’t have any significant results, it may not get accepted in a high-impact journal, but you’re going to keep submitting it until it finds a home. As a result, I’d expect the published record of labor-intensive studies to be less distorted by publication bias than the published record of quick, easy studies.

Evidence of p-hacking or HARKing

P-hacking and HARKing are dangerous because, if someone knows what they’re doing, they can completely hide the fact that any chicanery took place. However, many authors aren’t that slick.

For example, if a study reports multiple barely significant outcome measures, and it’s pretty obvious that the researchers had other outcome measures that weren’t reported, there’s a pretty good chance that the study was p-hacked.

To illustrate, if someone uses bar kinetics for their outcome measures, and they report significant differences in mean power and peak velocity with p-values between ~0.03 and 0.05, but they don’t report any other measures of bar kinetics, they probably p-hacked. If they report mean power and peak velocity, you know they also have data on peak power and mean velocity. Since they have power data (which has a force component), they also have data on mean and peak force. Since power and velocity both have time components, they also have data on impulse. Thus, we know that they have at least seven outcome measures (impulse, mean and peak power, mean and peak velocity, and mean and peak force), but they only reported two outcomes, and both outcomes were barely significant. If they reported all outcomes and used correct statistical procedures to control their false positive rate, their mean power and peak velocity findings would not have been significant anymore (12).
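As a rough check on that claim, here is a Python sketch that applies a Bonferroni adjustment for seven outcome measures to two barely significant p-values. The two p-values are hypothetical stand-ins for "between ~0.03 and 0.05", and Bonferroni is just one of several valid procedures (it's also the most conservative of the common ones), so this illustrates the logic rather than re-analyzing any particular study.

```python
def bonferroni_adjust(p_values, n_tests):
    """Bonferroni-adjusted p-values: multiply by the number of tests run, cap at 1."""
    return [min(1.0, p * n_tests) for p in p_values]


# Hypothetical reported p-values for the two outcomes that made it into the paper.
reported = {"mean power": 0.03, "peak velocity": 0.045}
n_outcomes = 7  # impulse, mean/peak power, mean/peak velocity, mean/peak force

adjusted = bonferroni_adjust(list(reported.values()), n_outcomes)
for name, p_adj in zip(reported, adjusted):
    verdict = "significant" if p_adj < 0.05 else "not significant"
    print(f"{name}: adjusted p = {p_adj:.3f} ({verdict})")
```

Put another way, with seven outcomes, a single result would need an unadjusted p-value below roughly 0.007 to survive this correction.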

A method of p-hacking that’s easier to spot (if you’re at all conversant in statistics) is the use of non-standard, inappropriate statistical procedures. If you’re not conversant in stats, and you see a study use something other than t-tests, ANOVAs, ANCOVAs, Bonferroni or Tukey’s post-hoc tests, Pearson correlation, Spearman correlation, chi-squared tests, multiple regression, logistic regression, or random effects meta-analysis (the most common procedures in our field), ask a friend who knows stats whether the statistical procedures used were valid.

Prediction

One interesting finding in recent years is that scientists generally have a pretty good idea of which studies will replicate and which are probably false positives (4, 13). Researchers set up betting markets, and scientists can place bets on the studies they think are most likely to successfully replicate. These betting markets have done a surprisingly good job of predicting which studies won’t replicate and which will.

Obviously, you won’t have the time to set up a betting market for every new study that’s published, but I think it can be helpful to ask experts about the results of studies that seem fishy to you. If multiple experts aren’t surprised by the results, then there’s a decent chance the results are accurate. However, if researchers in the field are also surprised at the result, or if they also think it seems fishy, there’s a decent chance that its results are incorrect (either as a false positive or a false negative).

Note: I’m not making an appeal to authority. I’m not saying that expert opinion about a study is inherently correct. I’m saying that we have good evidence that experts are able to predict whether a particular research finding will replicate with a pretty high degree of accuracy, meaning that if they think a particular finding is legit, that should rationally increase your perceived odds of it being legit.

Measures with minimal room for subjectivity

In unblinded research, there’s still room for researcher bias to skew results, even if the researchers are trying to be as objective and impartial as possible. The choice of measurements can make a big difference. For example, if strength is assessed via a squat, the researchers may be slightly stricter on depth requirements (either intentionally or unintentionally) in the group they don’t expect to do as well; on the other hand, if strength is assessed via bench press, there’s little subjectivity in observing whether the bar touched the chest, and whether the lifter locked the rep out. When assessing body composition, DEXA scans leave little room for user error (assuming you’re enforcing the same pre-testing guidelines for all subjects), while there’s more subjectivity when taking skinfold measurements using calipers.

If the research is unblinded, consider whether the measurements used in the study allow room for the researchers to either purposefully or inadvertently “tip the scales.” If the measurements contain less room for bias to influence the outcome, you can be a bit more confident in the results. If the measurements contain more room for bias to influence the outcome, you need to be a bit more wary.
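If you want a feel for how much a small measurement bias can matter, here’s a rough simulation sketch. Everything in it is an assumption made for illustration: strength gains averaging 10 kg with an 8 kg standard deviation, 20 subjects per group, a 2.5 kg nudge in favor of one group, and an approximate t cutoff of 2.02 for p < 0.05. Both groups are drawn from the same distribution, so any “significant” difference is a false positive.

```python
import random
import statistics


def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    standard_error = (statistics.variance(a) / len(a)
                      + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


def false_positive_rate(bias_kg, n_per_group=20, n_studies=2000, seed=0):
    """Share of simulated studies that reach |t| > cutoff with no true effect.

    bias_kg is a hypothetical systematic measurement error added to every
    subject in the group the researchers expect to do better.
    """
    rng = random.Random(seed)
    t_cutoff = 2.02  # approximate two-sided 5% cutoff for about 38 df
    significant = 0
    for _ in range(n_studies):
        # Same true distribution for both groups: the real effect is zero.
        favored = [rng.gauss(10, 8) + bias_kg for _ in range(n_per_group)]
        control = [rng.gauss(10, 8) for _ in range(n_per_group)]
        if abs(welch_t(favored, control)) > t_cutoff:
            significant += 1
    return significant / n_studies


unbiased = false_positive_rate(bias_kg=0.0)
biased = false_positive_rate(bias_kg=2.5)
print(f"False positive rate with no measurement bias: {unbiased:.1%}")
print(f"False positive rate with a 2.5 kg bias:       {biased:.1%}")
```

With no bias, the false positive rate should land near the nominal 5%. With the bias added, it should come out noticeably higher, even though the treatment itself does nothing in either scenario.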

Conclusion

So, to wrap things up, the findings of individual studies aren’t nearly as trustworthy as most people would like to believe. There are a variety of ways to wind up with false positive results, and the publishing system more heavily rewards positive results, which means a disproportionate number of false positive results wind up in the literature. However, if you know what to look for, you can roughly appraise individual studies to determine the likelihood that their results are incorrect. The simplest and most straightforward things to look for are large effects, small p-values (below 0.01, not 0.05), results that can be explained by known mechanisms, and results that jibe with prior research. If you want to get more sophisticated, you can see whether the study was pre-registered, look at financial incentives, determine if the study would be time-consuming to conduct, look for evidence of p-hacking or HARKing, see whether the measurements allow substantial room for bias, and ask experts if they would predict that the findings would replicate.

Ultimately, you can never be sure about the accuracy of any single study, but if you know how people can cheat the system, and you know the hallmarks of solid research findings, you’ll be able to more accurately appraise and interpret research.

The latest research – interpreted and delivered every month

As you can see, there’s a lot that goes into analyzing a study beyond simply being able to read and understand the text. It’s a skill set that takes years to develop, and even professional researchers often miss red flags. For example, in a recent high-profile case, it was discovered that a prominent food scientist had published a lot of suspicious research over the years, including suspect papers that have been cited hundreds of times. In most cases, a close, critical reading of those papers would have revealed obvious methodological and statistical issues, but those issues went unnoticed by professionals for years. If you want to stay up-to-date on the research pertinent to strength and physique athletes and coaches, but you don’t have the time or desire to develop the skill set to critically analyze research, you can sign up for Monthly Applications in Strength Sport (MASS), the research review I put out every month, along with Dr. Eric Helms and Dr. Mike Zourdos.

References

1. Ioannidis JP. Why most published research findings are false. PLoS Med. 2005 Aug;2(8):e124.
2. Open Science Collaboration. PSYCHOLOGY. Estimating the reproducibility of psychological science. Science. 2015 Aug 28;349(6251):aac4716.
3. Begley CG, Ellis LM. Drug development: Raise standards for preclinical cancer research. Nature. 2012 Mar 28;483(7391):531-3.
4. Camerer CF et al. Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour. 2018 Aug 27; 2:637-44.
5. Mlinarić A, Horvat M, Šupak Smolčić V. Dealing with the positive publication bias: Why you should really publish your negative results. Biochem Med (Zagreb). 2017 Oct 15;27(3):030201.
6. Wood J, Freemantle N, King M, Nazareth I. Trap of trends to statistical significance: likelihood of near significant P value becoming more significant with extra data. BMJ. 2014 Mar 31;348:g2215.
7. Kerr NL. HARKing: hypothesizing after the results are known. Pers Soc Psychol Rev. 1998;2(3):196-217.
8. Dumas-Mallet E, Button KS, Boraud T, Gonon F, Munafò MR. Low statistical power in biomedical science: a review of three human research domains. R Soc Open Sci. 2017 Feb 1;4(2):160254.
9. Saunders B, Elliott-Sale K, Artioli GG, Swinton PA, Dolan E, Roschel H, Sale C, Gualano B. β-alanine supplementation to improve exercise capacity and performance: a systematic review and meta-analysis. Br J Sports Med. 2017 Apr;51(8):658-669.
10. Lexchin J. Sponsorship bias in clinical research. Int J Risk Saf Med. 2012;24(4):233-42.
11. Nosek BA, Ebersole CR, DeHaven AC, Mellor DT. The preregistration revolution. Proc Natl Acad Sci U S A. 2018 Mar 13;115(11):2600-2606.
12. Obviously, you need to be reasonable when judging whether it’s likely that variables were examined but weren’t reported on. Using the example of a linear position transducer, you could also get data on rate of force development over half a dozen time windows, eccentric as well as concentric measures for force, power, velocity, and impulse, and some LPTs even tell you about bar path. You could also subdivide all of these analyses by lift phase (for example, in the squat, you could also look at power, velocity, and force up to the point of minimum bar velocity, and after the point of minimum bar velocity). You could come up with 30 different outcome measures pretty easily, but most of them wouldn’t be variables that most researchers would typically look at. Also keep the aim of the study in mind. For example, a study designed to examine load-velocity profiles really only needs to report velocity; the researchers have data on a ton of other variables, but velocity is the only variable that’s relevant to their research question and is probably the only variable they actually analyzed.
13. Dreber A, Pfeiffer T, Almenberg J, Isaksson S, Wilson B, Chen Y, Nosek BA, Johannesson M. Using prediction markets to estimate the reproducibility of scientific research. Proc Natl Acad Sci U S A. 2015 Dec 15;112(50):15343-7.