Why Sex Differences Don’t Always Measure Up

Every so often, I read an article quoting some claim arising from research on sex differences. Typically, scientists have found a sex difference that is statistically significant, and the difference is reported with much fanfare and various claims about what it means. Unfortunately, a statistically significant difference is not necessarily a useful difference in practice, so many of the ways people interpret the original narrow scientific claim are simply wrong.

A naïve perspective on a statistical difference often goes like this: “I heard that girls aren’t as good at spatial rotation as boys are, so I guess that explains why I get lost so easily.” Or, for other differences, someone might say, “Apparently, they’ve found that gay men’s index fingers are longer than straight men’s, so I measured mine and my partner’s. I passed the test, but my partner failed it; I guess he is a bit straight acting.” Or, “Researchers have found that transsexual brains are different from cissexual brains. If only they could have tested me when I was a child, I could have been diagnosed back then and been saved a lot of pain.”

The problem here is a fundamental misunderstanding of what statistics tell us. Statistics tell us about properties of samples taken from populations. They don’t necessarily tell us about individuals. To understand why, we’ll look at three sex differences: height, finger length, and brains. Tasty, tasty brains.

Measuring Up the Sexes

Let’s start with height, an “obvious” sex difference. In most human populations, men are noticeably taller than women. For our discussion, we’ll use demographic data from the USA. Men have a median height of 5′ 8.5″ (174 cm), whereas women have a median height of 5′ 3.5″ (162 cm). Thus, we have a tangible difference of five inches between the sexes. (I used the median as my “average” here, rather than the mean; see Note 1 at the bottom.)

So, we have a statistic, and it meshes well with our daily observations of life, but what do these numbers actually tell us? Can we use height as a predictor for sex, and, if so, just how good a predictor is it?

To try to answer that question, let’s arbitrarily pick someone from the US population who is 5′ 6.5″ (169 cm) tall, making them 3 inches taller than the median for women, and two inches shorter than the median for men. Naïvely, you might expect that this individual is more likely to be a man than a woman; after all, the person’s height is closer to the male median than the female one. But you’d be wrong. In fact, statistically, it is slightly more likely that the person we have picked is a woman.

It isn’t enough to have a sense of the height of the average man or the average woman. We also need to know something about the distribution of heights in the population. If we measured one million Americans, chosen at random, we would expect to get statistics that look like the ones below:

Height (cm)      Women       Men   P(Woman)    P(Man)
75–80              146         0    100.00%     0.00%
80–85              907        66     93.22%     6.78%
85–90            1,911     2,115     47.47%    52.53%
90–95            3,915     4,708     45.40%    54.60%
95–100           4,805     4,727     50.41%    49.59%
100–105          5,050     4,015     55.71%    44.29%
105–110          5,303     3,941     57.37%    42.63%
110–115          4,949     5,462     47.54%    52.46%
115–120          4,461     6,651     40.15%    59.85%
120–125          5,347     7,371     42.04%    57.96%
125–130          5,728     6,404     47.21%    52.79%
130–135          6,199     5,475     53.10%    46.90%
135–140          5,265     5,056     51.01%    48.99%
140–145          6,461     7,699     45.63%    54.37%
145–150         20,368     6,669     75.33%    24.67%
150–155         55,291     7,245     88.41%    11.59%
155–160         99,716    10,198     90.72%     9.28%
160–165        122,115    24,369     83.36%    16.64%
165–170        100,480    57,832     63.47%    36.53%
170–175         41,443    94,289     30.53%    69.47%
175–180          9,660   102,024      8.65%    91.35%
180–185          1,521    74,355      2.00%    98.00%
185–190             54    33,282      0.16%    99.84%
190–195              0    12,338      0.00%   100.00%
195–200              0     2,212      0.00%   100.00%
200–205              0       402      0.00%   100.00%
Total          511,095   488,905     51.11%    48.89%
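The two probability columns are simply each sex’s share of the people in a given height band. Here is a minimal Python sketch of that arithmetic, using three rows copied from the table; the function name share_of_women is just an illustrative choice, not anything from the original data source.

```python
def share_of_women(women: int, men: int) -> float:
    """Return the fraction of people in one height band who are women."""
    return women / (women + men)


# (women, men) counts copied from three rows of the height table
bands = {
    "160-165": (122_115, 24_369),
    "165-170": (100_480, 57_832),
    "170-175": (41_443, 94_289),
}

for band, (women, men) in bands.items():
    p_woman = share_of_women(women, men)
    # P(Man) is whatever is left over, since each person is counted once
    print(f"{band} cm: P(Woman) = {p_woman:.2%}, P(Man) = {1 - p_woman:.2%}")
```

Note how quickly the odds flip: one band leans heavily toward women, the next is far less certain, and the one after that leans toward men.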

One of the interesting points to note from this data is that our sample has more women than men, because there are more women than men in the US population (men are more likely to die). But it is also the case that men occupy a greater range of heights, so they are necessarily spread more thinly.

If we plotted these counts, they would look like the following:

Graph showing the distribution of heights for each sex.

The graph shows two distributions, one for men and one for women (each roughly following the classic bell-curve shape of a statistical normal distribution). At a little before 5′ 7″ (169.5 cm), they cross: anyone taller than that appears to be statistically more likely to be a man, and anyone shorter is more likely to be a woman. It might seem that we could use this value as a threshold between “female” heights and “male” heights. But there are a lot of people “on the wrong side” of the cutoff point. More than 1 in 9 women (11.6%) are taller than 169.5 cm (the shaded pink section of the graph), putting them on the side we might have described as “more likely to be male”, but there are even more men who have “girly” heights: almost exactly one third of men (33.4%) are shorter than 169.5 cm.

So, the idea that there is a height threshold we could use to reliably partition women from men is false. And our daily experience backs that up. There is a lot of overlap between the range of heights for men and the range of heights for women. Statistically, it may be the case that someone who is 5′ 6″ (167 cm) is more likely to be a woman, but 1 in 3 people of that height are men, which means that, as a test to sort men and women, a height threshold would be wrong quite often. Of course, in places where the distributions have less overlap, things are more clear cut; for example, only 1 in 10 people 5′ 3″ (159.5 cm) tall are men, and only 1 in 10 people 5′ 9.5″ (176.5 cm) tall are women. And there are very few women taller than 6′ 1″ (185 cm).

It’s tempting to imagine we might be able to salvage the simple threshold idea by saying that people taller than 176.5 cm are usually male and those shorter than 159.5 cm are usually female, and abandoning everyone in the middle (more than half of our sample!) as living in a gray area. But even that doesn’t work. Once we get to people shorter than 4′ 9″ (145 cm), there is no reliable gender difference. There are 60,447 women less than 145 cm tall, and 63,690 men, making it pretty much a wash. But because there are fewer men than women overall, there are proportionately more particularly short men (13% of all men) than short women (11.8% of all women).

So even though height is a sex difference that is fairly visible in the world, and the difference between the height of men and women is statistically significant, it’s difficult to put this information to good use to make predictions about individuals. If all we know is that men have heights centered around 5′ 8.5″ and women have heights centered around 5′ 3.5″, that’s not enough information to make any reasonable predictions at all about individuals, and certainly not enough to use someone’s height to predict whether they are male or female. Even with a better sense of how height is distributed, we can only predict gender with at least 90% accuracy for 34.7% of people, mostly people who are really tall (but also people between 5′ 1″ and 5′ 3″). In other words, even though knowing the statistical distributions for the heights of men and women can tell us something some of the time, for most people, knowing their height is useless for making reliable predictions about their sex.
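To see where a figure like 34.7% comes from, here is a rough Python sketch that works directly from the height table. It assumes the simplest possible rule: for each 5 cm band, guess whichever sex is more common there, and count the band as “reliable” only if that guess is right at least 90% of the time. The names BANDS and confident_share are illustrative.

```python
# (band start in cm, women, men) for each 5 cm band of the height table;
# a band with no one of a given sex is recorded as 0.
BANDS = [
    (75, 146, 0), (80, 907, 66), (85, 1_911, 2_115), (90, 3_915, 4_708),
    (95, 4_805, 4_727), (100, 5_050, 4_015), (105, 5_303, 3_941),
    (110, 4_949, 5_462), (115, 4_461, 6_651), (120, 5_347, 7_371),
    (125, 5_728, 6_404), (130, 6_199, 5_475), (135, 5_265, 5_056),
    (140, 6_461, 7_699), (145, 20_368, 6_669), (150, 55_291, 7_245),
    (155, 99_716, 10_198), (160, 122_115, 24_369), (165, 100_480, 57_832),
    (170, 41_443, 94_289), (175, 9_660, 102_024), (180, 1_521, 74_355),
    (185, 54, 33_282), (190, 0, 12_338), (195, 0, 2_212), (200, 0, 402),
]


def confident_share(bands, min_accuracy=0.90):
    """Fraction of all people who fall in a band where guessing the
    majority sex is correct at least `min_accuracy` of the time."""
    total = sum(women + men for _, women, men in bands)
    confident = sum(
        women + men
        for _, women, men in bands
        if max(women, men) / (women + men) >= min_accuracy
    )
    return confident / total


print(f"{confident_share(BANDS):.1%} of people can be sexed by height "
      "with at least 90% accuracy")
```

Lowering min_accuracy to 0.75 or raising it to 0.95 shows how sensitive the “usable” share of the population is to how much error we are willing to tolerate.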

Furthermore, we have not even begun to consider other factors that influence height, such as race and ethnicity. For example, an average woman from Norway is 5′ 6.5″, but an average man from rural India is only 5′ 3.5″. So, any generalizations we make about height are “all other things being equal”, but in real life, those other factors are not being held constant.

And that was height—what about other sex differences? After all, you’re hardly likely to be considered to have made a major research discovery if you announce that on average men are taller than women.  Typically, research on sex differences focuses on more subtle, less obvious differences. These differences are good for headlines, but at least some of the time, the differences that are uncovered, while “statistically significant”, are even less practically significant than height when applied to individuals.

Pull My Finger

With height under our belt, let’s move on to looking at a more subtle sex difference, finger length, specifically the 2D:4D ratio. Here’s what Wikipedia says about it (at the time of writing):

The digit ratio is the ratio of the lengths of different digits or fingers typically measured from the bottom crease where the finger joins the hand to the tip of the finger. It has been suggested by some scientists that the ratio of two digits in particular, the 2nd (index finger) and 4th (ring finger), is affected by exposure to androgens e.g. testosterone while in the uterus and that this 2D:4D ratio can be considered a crude measure for prenatal androgen exposure, with lower 2D:4D ratios pointing to higher androgen exposure. The 2D:4D ratio is calculated by measuring the index finger of the right hand, then the ring finger, and dividing the former by the latter. A longer ring finger will result in a ratio of less than 1, a longer index finger will result in a ratio higher than 1.

The 2D:4D digit ratio is sexually dimorphic: in males, the second digit tends to be shorter than the fourth, and in females the second tends to be the same size or slightly longer than the fourth.

A number of studies have shown a correlation between the 2D:4D digit ratio and various physical and behavioral traits.

It seems to say, then, that we can set a threshold of one for the ratio, with men on one side and women on the other. If you have a ratio of less than one (longer ring finger), you have “boy fingers”, and if you have a ratio of greater than one (longer index finger), you have “girl fingers”. It’s a simple rule that’s easy to remember. But we ought to be suspicious. We had a threshold for height (169.5 cm), but a full third of men fell into the “girly height” category. And we haven’t been told anything about the average ratios for men and women, or their distribution. At the time of writing, Wikipedia gives no such information.

So, let’s dive into a paper on the topic. For convenience I’m going to pick just one paper that has a moderately good sample size, The Visible Hand: Finger Ratio (2D:4D) and Competitive Behavior, by Matthew Pearson and Burkhard C. Schipper. Here’s their data (see Note 2):

Race       Sex      Count   Average   Std Dev   Min     Max
White      Male        35     0.960    0.026    0.899   1.022
Asian      Male        47     0.944    0.026    0.882   1.000
Hispanic   Male        10     0.954    0.025    0.913   1.002
Black      Male         2     0.951    0.015    0.941   0.962
Others     Male         6     0.973    0.025    0.938   0.998
All        Male       100     0.952    0.0272   0.882   1.033
White      Female      20     0.959    0.030    0.898   0.999
Asian      Female      65     0.963    0.026    0.912   1.033
Hispanic   Female       5     0.948    0.043    0.898   0.996
Black      Female       1     0.917    n/a      0.917   0.917
Others     Female       7     0.978    0.034    0.942   1.033
All        Female      98     0.962    0.0293   0.898   1.033

The first intriguing detail is that no group has an average greater than one. Women, in aggregate, do not match our intuition of a “girly” finger length ratio at all, averaging 0.962. Also, while the paper does claim to have found a statistically significant difference between men and women in general, for white women, they failed to find any significant difference at all, which is just as well, because their white women had more mannish hands than their white male counterparts. Oops!

The authors of the paper don’t give us the distribution for their data, but they do give the standard deviation, and so it is reasonable to assume that we can approximate it with a normal (bell-curve) distribution. The graph below shows what the two distributions look like:

Graph showing the distribution of 2D:4D ratio in men and women.

As you can see, there is a lot of overlap. If we used finger length as a sex test, it would be right only 56.7% of the time. It’s only better than 75% accurate for people with finger length ratios of 1.02 and above, and only 1.6% of the population fall into that category. It’s only 90% accurate for people with a finger ratio of 1.06, which is a tiny 0.026% of the population.

Also, while the test can accurately identify a very small number of women, it can never accurately identify men. It is at its most clear-cut at a ratio of 0.89; 3 out of 5 people with that ratio (60%) are men.
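For readers who want to check numbers like these, here is a hedged Python sketch of the calculation. It assumes, as above, that both sexes’ ratios follow normal distributions with the “All Male” and “All Female” means and standard deviations from the table, and it additionally assumes the population is 51.1% women and 48.9% men (the split from the height sample). The function names are illustrative, and the results should land close to, though not necessarily exactly on, the figures quoted.

```python
from statistics import NormalDist

# Normal approximations built from the "All Male" / "All Female" rows
men = NormalDist(mu=0.952, sigma=0.0272)
women = NormalDist(mu=0.962, sigma=0.0293)

# Assumed population split, borrowed from the height sample
P_WOMAN, P_MAN = 0.511, 0.489


def p_woman_given_ratio(ratio: float) -> float:
    """Bayes' rule: probability that someone with this 2D:4D ratio is a woman."""
    w = P_WOMAN * women.pdf(ratio)
    m = P_MAN * men.pdf(ratio)
    return w / (w + m)


def best_guess_accuracy(lo: float = 0.7, hi: float = 1.3, steps: int = 6_000) -> float:
    """Accuracy of always guessing the likelier sex, by midpoint-rule integration."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        total += max(P_WOMAN * women.pdf(x), P_MAN * men.pdf(x)) * dx
    return total


for ratio in (0.89, 1.02, 1.06):
    print(f"ratio {ratio:.2f}: P(woman) = {p_woman_given_ratio(ratio):.1%}")
print(f"overall accuracy of the best possible guess: {best_guess_accuracy():.1%}")
```

Because the two curves sit almost on top of each other, even the best possible guessing rule barely beats a coin flip.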

So while the researchers for this paper did find an actual “statistically significant” difference between the finger ratios of men and women, in practice the difference is not one we can usefully apply to individuals.

Other researchers have examined finger-length ratios of smaller groups, including gays and lesbians and transsexuals. They, too, have found “statistically significant” differences, but we have no reason to expect that they will be any more useful, especially as the sample sizes for these groups are smaller and the differences observed more subtle, as we will see in our final sex difference.

Brains, Bring Me Brains

Brains are a favorite choice of sex-differences researchers, so let’s pick one random study of brains, namely Male-to-Female Transsexuals Have Female Neuron Numbers in a Limbic Nucleus, by Kruijver, et al. Here’s a summary of their results (see Note 3):

Group                     Subjects   Mean (× 10³)   Std Dev (× 10³)
Cissexual Gay Men                9           34.6             10.20
Cissexual Straight Men           9           32.9              9.00
Transsexual Women                6           19.6              8.08
Cissexual Women                 10           19.2              7.91

A naïve view of this data is that transsexual women’s brains look a lot like cissexual women’s brains, and unlike the brains of cissexual men. We might also naïvely suppose from these results that we could have a “brain femininity” test and use it to detect transsexual women (provided that we found killing them and dissecting their brains to make the measurement to be a good trade-off!).

You might be concerned that someone is making a generalization about the world’s hundreds of thousands of transsexuals (we can estimate at least 350,000 in the USA and Europe alone) from comparing six transsexual brains, but let’s forget the issue of tiny sample sizes for now. Instead, we’ll assume that the average and standard deviation values are accurate, and come from data with a normal distribution.

If we examine the distributions, their probabilities look like the following:

Graph showing the distributions of BSTc neuron numbers in cissexual women, transsexual women, cissexual men, and cissexual gay men.

Here, we can see that despite the mean BSTc count of the women’s brains being almost half that of men, there is still considerable overlap. If we just compare cissexual men with cissexual women, and say that BSTc counts above 26.5 × 10³ are male and those below female, we find that 23.9% of men have “female” brains, and 17.8% of women have “male” brains.
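Those two percentages are tail areas of the assumed normal curves. A small Python sketch, using the means and standard deviations from the results table (in thousands of neurons) and the 26.5 threshold, shows the arithmetic; the variable names are illustrative.

```python
from statistics import NormalDist

# BSTc neuron numbers in thousands, from the summary table
straight_cis_men = NormalDist(mu=32.9, sigma=9.00)
cis_women = NormalDist(mu=19.2, sigma=7.91)

THRESHOLD = 26.5  # counts above this are called "male", below it "female"

# Men who fall below the threshold get a "female" reading
men_below = straight_cis_men.cdf(THRESHOLD)
# Women who fall above the threshold get a "male" reading
women_above = 1 - cis_women.cdf(THRESHOLD)

print(f"men with 'female' brains: {men_below:.1%}")
print(f"women with 'male' brains: {women_above:.1%}")
```

So even with a gap between the means this large, roughly one person in five ends up on the “wrong” side of the line.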

But the picture changes a lot when you look at the distributions in a context where everyone comes from the same population. As before, we’ll use an imagined sample of one million people from the general population of the USA. In that sample, we’d expect to see about 510,000 cissexual women, 444,000 straight cissexual men, 45,000 cissexual gay men, and a mere 500 transsexual women (plus 500 transsexual men). When we scale the curves proportionately, the bell curve for gay men (about 1 in 10 of the population) drops a good deal, but the curve for transsexual women (about 1 in 1000) flatlines. (I didn’t color the x-axis red; that’s the line for transsexual women.)

Graph showing the same distributions as the previous graph, but scaled to reflect the proportions of these groups in the general population.

Now we can see why a test for transsexualism based on measuring BSTc is not possible (other than that annoying brain-dissection requirement). If we have someone who was born physiologically male who comes to us, survives a brain examination, and appears to have a very “girly” BSTc count in the 14 × 10³ range, there will be about 2175 straight cissexual men with a similar value, against only 19 transsexual women. In other words, if you used BSTc measurements to test for transsexualism, you’d only be right on someone with a brain this “girly” a mere 0.76% of the time. Even worse, fewer than 25% of transsexual women’s brains would score this (or more) “girly”; those with less girly brains are even tougher to correctly identify with our putative transsexuality test.
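The scaling step can be sketched in a few lines of Python. This is only an approximation of the reasoning above: it assumes normal distributions, uses the group sizes from the imagined sample of one million, and compares just straight cissexual men with transsexual women, so the exact percentage it prints depends on which groups are put in the denominator. The names GROUPS and expected_density are illustrative.

```python
from statistics import NormalDist

# (people per million, assumed normal distribution of BSTc count in thousands)
GROUPS = {
    "straight cis men": (444_000, NormalDist(mu=32.9, sigma=9.00)),
    "trans women": (500, NormalDist(mu=19.6, sigma=8.08)),
}


def expected_density(group: str, value: float) -> float:
    """Expected number of people in `group` per unit of BSTc count at `value`."""
    size, dist = GROUPS[group]
    return size * dist.pdf(value)


VALUE = 14.0  # a very "girly" BSTc count

cis = expected_density("straight cis men", VALUE)
trans = expected_density("trans women", VALUE)
p_trans = trans / (cis + trans)

# Share of trans women whose count is this low or lower
share_this_girly = GROUPS["trans women"][1].cdf(VALUE)

print(f"straight cis men near {VALUE}: about {cis:.0f}")
print(f"trans women near {VALUE}: about {trans:.0f}")
print(f"chance such a person is a trans woman: {p_trans:.2%}")
print(f"trans women at or below {VALUE}: {share_this_girly:.1%}")
```

However the denominator is chosen, the answer stays under one percent: the sheer size of the cissexual group swamps the difference in the shapes of the curves.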

So, even if these researchers have found a statistically significant difference in the brains of transsexual women from their tiny sample of six women, in practice, it is of little use to individual transsexuals.

What Would a Useful Sex Difference Look Like?

If you’re hoping for a sex difference that might be helpful for some kind of test applied to individuals, let me give you some rules of thumb. For gay people (or any group that is about 1 in 10 of the population), you want a difference where there isn’t too much overlap between the two distributions. Since the larger the standard deviation, the greater the overlap, we can set a rule for the maximum standard deviation that will avoid excessive false positives. Find the distance between the average values for men and women (or whatever groups we’re distinguishing between), and divide it by 4. The standard deviation should be no larger than the result. For example, if men average 25 and women average 33, the groups are 8 apart and the maximum workable standard deviation is 2. If it is worse than that, you’ll have excessive false positives (more than about 25%) from opposite-sexed heterosexuals in the tail of their distribution.

For a sex-difference based test for transsexualism (or any group that is about 1 in 1000 of the population), the sex difference needs to have an even smaller standard deviation. To get the rough value you need, divide the distance between the two averages by 7 instead of 4. But the truth is, that would give you two distributions that barely touch at all. Very, very few sex differences are going to be that clear cut. For that reason, you can probably count on there never being a useful and reliable test for transsexualism based on sex differences.
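Both rules of thumb are easy to play with in code. The Python sketch below computes the maximum workable standard deviation and then estimates the false-positive share under some simplifying assumptions of my own: both groups are normal with the same standard deviation, the small group is centered on one average and everyone else on the other, and the test threshold sits at the midpoint. The function names are illustrative.

```python
from statistics import NormalDist


def max_workable_sd(mean_a: float, mean_b: float, divisor: float) -> float:
    """Rule of thumb: distance between the averages divided by 4 (or 7)."""
    return abs(mean_a - mean_b) / divisor


def false_positive_share(mean_major: float, mean_minor: float,
                         sd: float, minor_rate: float) -> float:
    """Share of positive results that come from the majority group.

    Assumes equal-SD normal curves and a threshold at the midpoint
    between the two averages (a simplification, not the only choice).
    """
    midpoint = (mean_major + mean_minor) / 2
    major = NormalDist(mean_major, sd)
    minor = NormalDist(mean_minor, sd)
    if mean_minor > mean_major:
        major_hits = 1 - major.cdf(midpoint)
        minor_hits = 1 - minor.cdf(midpoint)
    else:
        major_hits = major.cdf(midpoint)
        minor_hits = minor.cdf(midpoint)
    false_pos = (1 - minor_rate) * major_hits
    true_pos = minor_rate * minor_hits
    return false_pos / (false_pos + true_pos)


# The worked example from the text: averages of 25 and 33
sd_1_in_10 = max_workable_sd(25, 33, 4)
sd_1_in_1000 = max_workable_sd(25, 33, 7)
print(f"1 in 10 group:   SD <= {sd_1_in_10:.2f}, "
      f"false positives {false_positive_share(25, 33, sd_1_in_10, 0.10):.1%}")
print(f"1 in 1000 group: SD <= {sd_1_in_1000:.2f}, "
      f"false positives {false_positive_share(25, 33, sd_1_in_1000, 0.001):.1%}")

# For comparison: the BSTc averages are 13.7 apart, so the rule asks for an
# SD of about 2, while the reported SDs are roughly 8 to 10.
print(f"BSTc would need SD <= {max_workable_sd(32.9, 19.2, 7):.2f}")
```

Doubling the standard deviation in either call pushes the false-positive share well past the 25% mark, which is the point of the rule.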

A Test That Works

Despite all that I have said, there is a pretty accurate test for transsexualism, based on sex differences. Simply tell your would-be transsexual about this sex difference, and watch them. If you see them frantically measuring their fingers, or wishing they could scan their brains, they’re probably transsexual. It’s probably not that accurate, but it’s better than we’d do from actually scanning their brains or measuring their fingers.


  1. I used the median for the height data because the average and standard deviations are skewed by the long tail at the left of the distribution. The tail is presumably caused by the various disorders that can cause stunted growth.
  2. In their original paper, Pearson & Schipper seem to come up with a different total, and, more importantly, their standard deviation lacks precision. By using some tricks with the data they do give, I calculated a standard deviation with three digits of precision rather than two.
  3. The paper presents SEM values (standard error of the mean), rather than standard deviation, but we can convert between the two using the formula stddev = SEM × sqrt(sampleSize).
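
The conversion in Note 3 is a one-liner. As a small Python sketch (the input values here are hypothetical, chosen only so the arithmetic is easy to follow):

```python
import math


def sem_to_sd(sem: float, sample_size: int) -> float:
    """Convert a standard error of the mean to a standard deviation."""
    return sem * math.sqrt(sample_size)


# Hypothetical example: an SEM of 3.0 from 9 subjects gives an SD of 9.0
print(sem_to_sd(3.0, 9))
```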

15 comments so far

  1. Diana_W on

    This is a really wonderful and useful post. There’s a strain of anatomical measurement over-enthusiasm in the trans sphere which would do well to mind the warning suggested here.

    On the other hand, just as this post is helpful because it debunks misrepresentations of statistical evidence, it might trend just a tad into neo-luddite sensibilities in your conclusions. Evidence can be “useful” even when it affects hypotheses you may not be considering. The very statistics you are comparing fly in the face of earlier assertions that would imply a statistical distribution undifferentiated from the cis norm. That’s interesting and useful even if it doesn’t offer any simple “test” of transness.

    I like your post’s suggestion that statistical evidence needs to be used properly. But that warning should cut both directions.

  2. Joanne Hook on

    According to Dr. Ann Vitale, there is one very good litmus test for transsexualism vs other forms of sexual identity variance. A 40 day test on hormone replacement therapy. As the psychological and sexual dysfunction occur long before any significant physiological changes occur, those who are appalled by loss of sexual function are definitely not transsexual. However for those who are transsexual, gender dysphoria rapidly turns into gender euphoria, despite the length of fingers, height of the individual and other statistical representations between the sexes.

    Statistics can be a good tool for population dynamics, but fall very short of the mark when dealing with individuals.

  3. Nebulous Persona on

    On the trans-brains front, a similar study to the one mentioned in the article was breathlessly reported in a NewScientist article as follows:

    Differences in the brain’s white matter that clash with a person’s genetic sex may hold the key to identifying transsexual people before puberty. Doctors could use this information to make a case for delaying puberty to improve the success of a sex change later. […] Antonio Guillamon’s team at the National University of Distance Education in Madrid, Spain, think they have found a better way to spot a transsexual brain.

  4. Nebulous Persona on

    FWIW, although the article uses a normal distribution, for some of the data, it would have been more appropriate to use a log-normal distribution. The aforelinked wikipedia article shows how to set the distribution parameters to match a given average and standard deviation in the collected data. (That said, I reran the calculations from the article, and it doesn’t make a huge difference. The largest difference is in the trans brains one because the standard deviation is large compared to the magnitude of the values, and even there it’s fairly minor.)

  5. Nebulous Persona on

    One hole in the article is what happens if you have multiple tests, rather than one test — could we test for something difficult like trans brains then? The answer is yes, but it’s still a pain and still unlikely to be useful in practice.

    If we had fifteen tests like the BSTc test, and said that you had a “girly” brain if your brain was on the girls’ side of BSTc numbers (i.e., not scaling things by group sizes), we could conclude that someone was a male to female transsexual if they got a “girly” reading on at least nine of their fifteen tests. That scheme would correctly identify 94.1% of transsexuals, or in one million people, it would correctly identify 471 and miss 29. It would also falsely claim that 0.003% of eligible cis people (15 guys out of our million people) were transsexuals when they weren’t. In other words, 5.9% of transsexuals aren’t identified (false negatives), and 3.1% of identified transsexuals aren’t actually transsexual (false positives).

    But all this supposes that we could find fifteen completely statistically unrelated (i.e., “independent”) kinds of girliness to measure in brains, and that is really, really unlikely. If we try to make do with fewer, we either have a poor false positive rate, or a poor false negative rate. If we’re okay with a 12.5% false positive rate, we could go with 8 passes in 14 tests, or we can raise it to 9 passes in 14 tests but get an 11.5% false negative rate.

    So, even if we’re allowed combine different brain tests, it’s still a hard problem.

  6. […] averages. The problem is that the deviation is probably massive from person to person. A blogger (here) elucidates better than I can the problems of taking cool findings to crappy conclusions. The […]

  7. valeriekeefe on

    Your transition incidence is off by an order of magnitude. Look at the most recent survey with the largest sample size:


  8. […] For all the Americans who refuse to learn metric, 140 centimeters is approximately 4’8”, 170 centimeters is around 5’7”, and 200 centimeters is about 6’7”. The blog this image comes from is definitely worth the read: https://sugarandslugs.wordpress.com/2011/02/13/sex-differences […]

  9. Brett Taylor on

    The height stats here are way off. And I mean waaaaaay off. According to the stats listed, 109,977 out of 1 million, (roughly 11%) American adults are under 4’7″ (140cm), with 63,132 (6.3%) of them being under 3’11″…. No they aren’t. How the writer can look at these numbers & chalk it up to “various disorders that cause stunted growth” instead of realizing something is off with the stats is beyond me. The MOST COMMON cause of dwarfism, Achondroplasia, occurs in less than 1 in 10,000 people.

    How about this as a possible explanation for the mysterious long tail… Did you use a study that included children’s heights alongside adult heights? You’d have kind of a cool visual there, if the numbers weren’t so screwed up.

  10. […] charts from https://sugarandslugs.wordpress.com/2011/02/13/sex-differences/ which is worth reading in its entirety, unlike the articles I’m responding to […]

  11. Wesley Pollins on

    One BIG problem with this study, the US man median height is 5’9.3″-5’9.5″, no longer in the 5’8 range. Where do people still get 5’8 from for American men?

  12. Bernie on

    What is the source of the data for your range of heights in the graph comparing men to women? Did you just make this up? Thanks!
