I used this search query to find sources: https://www.ncbi.nlm.nih.gov/pubmed/?term=((((gender+or+sex))+AND+(strength+training+or+resistance+training+or+powerlifting))+AND+(strength+or+hypertrophy+or+1RM))+NOT+concurrent+NOT+children+NOT+adolescents+NOT+disease+NOT+cardiovascular+NOT+disease+NOT+review+NOT+supplement (thanks to Brandon Roberts for this)
There were 662 results. From these results, I identified 54 relevant studies. From reference lists of those studies, I identified 24 more studies, bringing the total up to 78. Several of these studies were the result of multiple papers being published from the same experiment (e.g. one paper reporting the hypertrophy results, and one paper reporting the strength results). After combining these papers, there were 61 unique research projects included in this analysis. Many of these studies made comparisons in two different age groups (i.e. they studied the effects of both age and sex on resistance training adaptations, comparing young men, young women, older men, and older women). I treated the young people and older people as two separate cohorts, bringing the total number of groups compared in this analysis to 72, with a combined total of 3,811 participants.
I separated the results from these studies into four broad categories: direct measures of muscle hypertrophy (CSAs, mean fiber area, muscle thicknesses, etc.), indirect estimates of hypertrophy (LBM, appendicular lean mass, etc.), upper body strength gains, and lower body strength gains.
When a paper reported both dynamic and isometric strength, I only went with dynamic strength measures, since increases in dynamic and isometric strength aren’t interchangeable, and most lifters are only interested in dynamic strength. When a paper only reported isometric strength changes, I did include them in my analysis since isometric force is a valid measure, and I wanted to give the broadest overview possible. When multiple isokinetic force measures were provided, I only went with the slowest contraction speed (e.g. if given isokinetic force at 60, 120, and 240 degrees per second, I’d go with 60) since that would be the best proxy for maximal strength. I did include one paper outside of these general rules (Carlsson, 2017), which used an atypical strength assessment (maximum reps completed for dips and pull-ups), simply because it was one of the few studies on trained subjects. To make sure its inclusion didn’t meaningfully affect the analysis, I checked to see how much its exclusion would affect the pooled effect size estimate. Excluding it decreased the estimate by 0.004 points. In other words, it ultimately didn’t matter. When making comparisons, I pooled similar measures within studies when multiple similar measurements were taken (so a study with 5 strength measures wouldn’t be given 5 times as much weight as a similar study with one strength measure, and a study with 3 direct measures of hypertrophy wouldn’t be given 3 times as much weight as a study making one comparison).
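As a rough sketch of the within-study pooling described above: each study's similar measures are averaged so the study contributes a single value. The function name and the numbers are hypothetical, and a plain arithmetic mean is an assumption about how the pooling was done.

```python
from statistics import mean

def pool_within_study(pct_changes):
    """Collapse several similar measures from one study into a single value.

    Assumption: a simple arithmetic mean of the percent changes.
    """
    return mean(pct_changes)

# Hypothetical percent strength gains: one study with 5 measures, one with 1.
study_a = [12.0, 18.0, 15.0, 20.0, 10.0]
study_b = [14.0]

# Each study now contributes exactly one value to the cross-study comparison.
pooled = [pool_within_study(study_a), pool_within_study(study_b)]
print(pooled)  # [15.0, 14.0]
```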
When a paper only reported changes in CSA of each individual muscle fiber type, I didn’t include that in the analysis unless it also reported changes in the relative proportion of each fiber type so I could take weighted averages to calculate the mean change in fiber CSA.
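A minimal sketch of that weighted average, assuming each fiber type's percent change in CSA is weighted by that fiber type's relative proportion. The fiber-type labels, the numbers, and the function name are all hypothetical.

```python
def mean_fiber_csa_change(pct_change_by_type, proportion_by_type):
    """Proportion-weighted mean percent change in fiber CSA.

    Assumption: weights are the relative proportions of each fiber type;
    they are normalized here in case they do not sum to exactly 1.
    """
    total = sum(proportion_by_type[t] for t in pct_change_by_type)
    weighted = sum(
        pct_change_by_type[t] * proportion_by_type[t] for t in pct_change_by_type
    )
    return weighted / total

# Hypothetical values for illustration only.
changes = {"I": 10.0, "IIa": 20.0, "IIx": 15.0}  # percent change in CSA
proportions = {"I": 0.5, "IIa": 0.4, "IIx": 0.1}  # share of fibers

print(round(mean_fiber_csa_change(changes, proportions), 2))  # 14.5
```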
Results for all analyses were weighted by the number of subjects in the study, such that if one study had 30 people per group while another had 10 people per group, the first study would count 3x as much in the analysis. I spot-checked the overall strength analysis and the analysis of direct measures of hypertrophy to ensure that weighting didn’t affect things too much; it didn’t (it changed the male/female difference in strength gains per week by 0.07%, and the difference in weekly hypertrophy by 0.01%, which were essentially meaningless changes).
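To make the weighting concrete, here is a small sketch comparing a subject-weighted mean with an unweighted one. The two studies and their values are invented for illustration; only the 30-vs-10 subjects-per-group ratio comes from the example above.

```python
def weighted_mean(values, weights):
    """Mean of `values`, with each value counted in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical male-minus-female differences in percent strength gain.
differences = [2.0, 6.0]
subjects_per_group = [30, 10]  # first study counts 3x as much as the second

print(weighted_mean(differences, subjects_per_group))  # 3.0
print(sum(differences) / len(differences))             # 4.0 (unweighted)
```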
I ran into some issues when trying to figure out how to analyze the data. Typically, you’d divide the change in each group by a pooled standard deviation (typically using the SD of both groups’ pre-training strength/muscle size/etc.) to calculate an effect size for each comparison in each study, and then roll from there. That works when the assumption is that the SD should be the same in both groups because you’re drawing them from the same population. For example, if you’re comparing squatting with high vs. low loads in college-aged men, and the pre-training SD for 1RM squat strength is 10kg in one group and 5kg in the other, you can pool those two standard deviations because both groups are being drawn from the same population, and you assume the difference between the groups’ SDs is just due to randomness. In this meta-analysis, however, men and women are clearly two different populations, so SDs shouldn’t be pooled. However, it became clear to me that using within-group effect sizes wasn’t always appropriate either, since women (for whatever reason) tended to be relatively less variable than men, which tends to inflate their effect sizes.
Here’s a (slightly) exaggerated illustration of how this complicates things:
Let’s say there’s a group of men who bench 100 ± 15kg, and a group of women who bench 50 ± 5kg. Let’s assume the men increase their bench by 10kg, while the women increase their bench by 5kg. The proportional change is the same (10% in both groups). However, the pooled standard deviation would be around 10kg, giving the men an effect size of d=1.0 (10/10), and the women an effect size of d=0.5 (5/10). Using within-group effect sizes instead doesn’t help matters, though. The men would have an effect size of d=0.67 (10/15), while the women would have an effect size of d=1.0 (5/5). In this scenario, relative change in strength is the same, but both methods of calculating effect sizes substantially favor one group or the other.
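The arithmetic in that illustration can be checked with a few lines. The pooled SD here is taken as the simple average of the two SDs (10kg), matching the "around 10kg" figure; a root-mean-square pooled SD would come out slightly higher (about 11.2kg), which is an aside of mine and not a number from the example.

```python
# Values from the illustration above (kg).
men = {"mean": 100.0, "sd": 15.0, "gain": 10.0}
women = {"mean": 50.0, "sd": 5.0, "gain": 5.0}

# Assumption: pooled SD approximated as the simple average of the two SDs.
pooled_sd = (men["sd"] + women["sd"]) / 2

for label, group in (("men", men), ("women", women)):
    pct_change = 100 * group["gain"] / group["mean"]
    d_pooled = group["gain"] / pooled_sd    # effect size using the pooled SD
    d_within = group["gain"] / group["sd"]  # effect size using the group's own SD
    print(f"{label}: {pct_change:.0f}% change, "
          f"pooled d={d_pooled:.2f}, within-group d={d_within:.2f}")
```

Both groups show the same 10% change, while the pooled SD favors the men and the within-group SD favors the women.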
So, I went with a much simpler approach. I calculated mean percent change for men and women in each measure in each study, and just used subject-weighted paired-samples t-tests. I did this both for total percent changes, and for weekly percent changes. This would generally be considered a pretty liberal statistical approach (i.e. increased risk of false positive), but I’m not too concerned about it here, because none of the significant findings were anywhere particularly close to the typical significance cutoffs (p<0.05). The significant differences were all super significant (p<0.001, and generally p<0.0001).
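For readers who want to see the mechanics, here is one way a subject-weighted paired-samples t-test could be set up. This is a sketch under assumptions: the exact weighting formula used in the analysis isn't stated, so this uses a common convention for a weighted mean and its standard error that reduces to the ordinary paired t-test when all weights are equal. The function name and the data are hypothetical.

```python
import math

def weighted_paired_t(men_pct, women_pct, weights):
    """Weighted paired t statistic on per-study percent changes.

    Assumption: weights (e.g. subjects per study) are treated as reliability
    weights. Returns (weighted mean difference, t statistic, degrees of freedom).
    """
    diffs = [m - w for m, w in zip(men_pct, women_pct)]
    v1 = sum(weights)
    v2 = sum(w * w for w in weights)
    mean_diff = sum(w * d for w, d in zip(weights, diffs)) / v1
    # Unbiased weighted variance of the paired differences.
    var = sum(w * (d - mean_diff) ** 2 for w, d in zip(weights, diffs)) / (v1 - v2 / v1)
    # Standard error of the weighted mean.
    se = math.sqrt(var * v2) / v1
    return mean_diff, mean_diff / se, len(diffs) - 1

# Hypothetical per-study mean percent strength gains and subjects per study.
men_pct = [20.0, 25.0, 30.0, 18.0]
women_pct = [24.0, 27.0, 35.0, 25.0]
subjects = [30, 10, 20, 15]

mean_diff, t_stat, df = weighted_paired_t(men_pct, women_pct, subjects)
print(mean_diff, t_stat, df)
```

The p-value would then come from a t distribution with `df` degrees of freedom, using whatever statistics package is at hand.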
I ran six analyses looking at sex differences in strength gains:
1) A main analysis of all studies looking at strength gains, with all measures of strength gains pooled in each study.
2-3) Subanalyses looking at strength gains in just young people (average age <35) and older people (average age >40, but generally 60+).
4-5) Subanalyses looking at upper body and lower body strength gains independently.
6) Strength gains just in studies lasting longer than 20 weeks.
I only ran two analyses looking at sex differences in hypertrophy:
1) An analysis of studies reporting direct measures of muscle hypertrophy.
2) An analysis of studies reporting indirect measures of muscle hypertrophy (i.e. lean mass).
For the two analyses that I considered the most important (overall strength gains, and direct measures of hypertrophy), I constructed what basically amount to funnel plots to assess bias, with the average number of subjects per group on the y-axis (in place of standard error), and the percent difference in favor of men or women on the x-axis (in place of between-group effect size). With a funnel plot, you expect the studies with the lowest risk of bias and random error (studies with the most subjects) to cluster right around the average effect, while you expect studies with higher risk of bias and more random error (studies with fewer subjects) to “fan out” and have more variable results, with roughly an equal number skewing in both directions. Both funnel plots look about how you’d expect them to and are quite symmetrical, suggesting low risk of bias.
For the significant differences, I also calculated how large of a study with null findings would be necessary to make the difference non-significant to further assess how solid the difference was. In all cases, getting below the significance threshold would require a study with 3,000+ participants and null findings, with both groups gaining strength in line with the overall average for both sexes in the extant literature. Finally, I also removed studies from the analysis one by one to see if their removal significantly changed the results – the removal of any particular study failed to materially change the results.
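The leave-one-out check can be sketched as follows. For brevity, this version recomputes only the subject-weighted mean difference each time a study is dropped, not the full significance test, and the data and function names are hypothetical.

```python
def weighted_mean(values, weights):
    """Mean of `values`, with each value counted in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def leave_one_out(values, weights):
    """Weighted mean recomputed with each study removed in turn."""
    results = []
    for i in range(len(values)):
        kept_values = values[:i] + values[i + 1:]
        kept_weights = weights[:i] + weights[i + 1:]
        results.append(weighted_mean(kept_values, kept_weights))
    return results

# Hypothetical per-study differences in percent gains and subjects per study.
differences = [2.0, 6.0, 4.0]
subjects = [30, 10, 20]

print(weighted_mean(differences, subjects))  # estimate with all studies
print(leave_one_out(differences, subjects))  # estimate with each study dropped
```

If no single removal moves the estimate much, no one study is driving the result.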