Lies, Damned Lies and Statistics

Change of plan for this post after receiving a mailshot from a local estate agent…

Statistics are at the heart of the scientific method. They help us to prove or disprove hypotheses (to a certain level of confidence) and so make the discussion about facts rather than opinion. They have huge power – both to reveal the truth and to hide it when used wrongly.

When I was 16, I received, as a present, the book “How to Lie with Statistics” by Darrell Huff. OK, so perhaps I was rather an odd teenager as I thought this book was fantastic. I am pleased to see it is still available from good bookstores. It has stood me in good stead for many years as I always go to graphs in any article I read and I always wonder what the author is trying to show (or hide). So when I received a mailshot through the door recently from an estate agent full of pretty graphs I was impressed to see that they were able to demonstrate many of Huff’s observations in one glossy sheet of paper.

The first graph the mailshot has is what Huff calls a “gee whiz graph”. It’s the one below. They state that they have done some “spatial interpolation of property price data, also known as number crunching!” They go on to explain helpfully that “for every 0.25km you live closer to the station, the average property price rose by £2700.” Do you believe them?

Huff describes this use of statistics as “statisticulation”, which I rather like. Of course, what they have done is “suppress zero” on the y-axis without any warning – cutting off 89% of the bar on the left and 98% of the bar on the right. The bar on the left is nine times the height of the bar on the right even though the numerical difference is just 10%. But, of course, the graph raises many more questions – such as what sort of average is shown? (see Huff’s “The Well-Chosen Average”) Is the difference statistically significant? (see Huff’s “Much Ado about Practically Nothing”) How many properties are included in the figures? Is the mix of properties the same within both radii? And what if I tell you that within 5km of the particular train station they are talking about, there are actually another 9 train stations – including at least one that has many more commuter trains stopping regularly? And there is a well-regarded school nearby that many parents are known to want to live near in order to increase the chance of their child attending. Could that be a factor?
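The arithmetic behind a suppressed zero can be sketched in a few lines of Python. The figures below are hypothetical, not the mailshot's actual data: two values 10% apart and an assumed axis starting point, chosen only to show how a small real difference can become a ninefold visual one.

```python
def visible_height(value, axis_start):
    """Height of a bar as drawn when the y-axis starts at axis_start instead of zero."""
    return value - axis_start


def visual_ratio(left, right, axis_start):
    """How many times taller the left bar looks than the right bar."""
    return visible_height(left, axis_start) / visible_height(right, axis_start)


# Hypothetical values (assumptions for illustration only):
left_price = 110.0   # left bar, 10% higher than the right bar
right_price = 100.0  # right bar
axis_start = 98.75   # assumed point where the truncated y-axis begins

print(f"Ratio of the actual values:  {left_price / right_price:.2f}")
print(f"Ratio of the drawn heights:  {visual_ratio(left_price, right_price, axis_start):.1f}")
print(f"Share of left bar hidden:    {axis_start / left_price:.1%}")
print(f"Share of right bar hidden:   {axis_start / right_price:.1%}")
```

With the axis starting at zero, the same function gives a ratio of 1.1, which is what the eye ought to see.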

Of course, even if they were able to prove a correlation between distance from station and property price (which they certainly haven’t with the data above), we know that “correlation does not imply causation”, which Huff describes in his chapter “Post Hoc Rides Again”. It reminds me of the annual story of how living near Waitrose (a top-end UK supermarket) can increase the value of your home. Could it be that wealthier people tend to shop at high-end supermarkets and so high-end supermarkets locate where wealthier people live (in more expensive properties)?

Another of the graphs is shown below, along with the text “People are at lots of different stages of their lives. The largest number of people are Retired which accounts for 20.5% of the total. This is 0.4% lower than the national average.” Is this what you take as the most interesting feature of the graph?

When I look at the graph (I am assuming the data is accurate, and they claim it comes from the Office for National Statistics so I think that’s OK), the tiny difference in the red and grey bars for Retired is not what strikes me. I would say it looks as though this area has more families and Empty Nesters than the average. But, of course, I don’t really know because I don’t know whether the differences are statistically significant (see Huff’s “Much Ado about Practically Nothing”). We can be reasonably confident that the larger differences are likely to be significant because the sample is large. But could we really say that there are 0.4% fewer Retired households than the national average? I think it likely this is within the range of error and that we can’t really say whether there is any difference – but I don’t know, of course, because there are no numbers shown for the samples. We only have percentages. It also starts me wondering about how the data is collected. What about a house with grown-up children where one of those grown-up children has had a child (i.e. three generations in the house)? Which category does that fall into? And a couple without children – are they a Young Family? What if they are older but not retired? Or a split family where one parent looks after the children one week and the other the next week? And how does the Office for National Statistics know what type of family is living in each property? After all, people are moving all the time, whether buying or selling, moving in with others or moving out.
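To see why the missing sample sizes matter, here is a rough sketch of the margin of error around the 20.5% figure. It rests on several assumptions that are not in the mailshot: the sample sizes are invented, “0.4% lower” is read as 0.4 percentage points (so a national figure of 20.9%), the national figure is treated as exact, and a simple normal approximation at 95% confidence is used.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion p from a simple random sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)


local_share = 0.205     # Retired share quoted in the mailshot
national_share = 0.209  # assumption: "0.4% lower" means 0.4 percentage points
gap = national_share - local_share

# Hypothetical sample sizes: the mailshot gives none.
for n in (1_000, 10_000, 100_000):
    moe = margin_of_error(local_share, n)
    verdict = "within" if gap <= moe else "outside"
    print(f"n = {n:>7,}: margin of error = {moe:.2%}, gap of {gap:.2%} is {verdict} it")
```

Under these assumptions the gap only stands out from the noise once the sample runs to many tens of thousands of households, which is exactly the information the graph leaves out.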

You get the point.

Statistics and data can tell us so much. They are the bedrock of the scientific method. But we must always be sceptical and question them. Who is telling us? How do they know? What’s missing? Does it make sense? Or, as Huff puts it, “talk back to a statistic”!

In my next post I will go back to the DIGR® method of root cause analysis and look in more detail at the G of DIGR®: how using process maps can really help everyone involved to Go step by step and start to see where a process might fail.

 

Text © 2017 Dorricott MPI Ltd. All rights reserved.

DIGR® is a registered trademark of Dorricott MPI Ltd.