Of all the areas of mathematics that we teach at school, statistics must be the one that is most easily related to the world around our students. They encounter data and are subject to the whims of algorithms, so having a handle on statistics is crucial. That means that as educators it is equally crucial that the ethical and moral aspects of statistics are taught right from the start.
One of the most important concepts to challenge is the superiority of the data. Whilst the phrase "Computer says no" is intended to lampoon this idea, it still pervades data use and needs to be confronted. The earlier this can be done, the better. In particular, waiting until students do full data collection and analysis is quite late: by then they will have been taught at least some of the analysis skills, and traces of this insidious idea can be found even there.
In this post, I'll take a look at three places where this idea surfaces and the problems that it provokes. These are:
Choosing an average,
Dealing with outliers,
Conducting a hypothesis test.
Before getting started, I'd like to recommend this Chalkdust article for an introduction to the murky background of statistics.
2 Choose Your Average
A common explanation of choosing the median over the mean is that it is less sensitive to outliers, the classic example being that when Bill Gates walks into a bar then the mean wage increases vastly but the median hardly flinches. Another example is the age of people in a class: including the teacher changes the mean but not the median.
This is a case of the data driving the story and this is wrong.
Rather, it is the question that is asked of the data that should determine which average should be used.
Let's give an example to illustrate this. Imagine I've started a glove manufacturing business. I've only just started, so I can only afford one machine and it makes one size of glove. I can't afford fancy materials that make it stretchy, so if that glove fits you then great; otherwise you're going to have to go elsewhere. I can set the size of the glove at the outset, but reconfiguring the machine is time-consuming so it's best to leave it at one setting.
What should that setting be?
Clearly, I should go out and gather some data on people's hand sizes. Let's assume I do so, measuring, say, the distance around the hand just above the thumb to the nearest centimetre. This gives me some data that I can use to figure out the best setting. I obviously want to sell as many gloves as I can so I want to know the measurement that occurs most often.
That's the mode.
Fast forward a little and, bizarrely, I've yet to go bankrupt. I can now afford two machines, and slightly better quality material that can stretch a bit. So I can offer gloves in two sizes: "small" and "large" (or "regular" and "medium" as my local coffee shop calls them). For peak efficiency[1], it's best if the machines produce the same number of gloves each day so I want to label the "small" and "large" so that I can expect to sell the same number of each. Therefore, I want to put the dividing line so that it divides the data into two equal parts.
[1] Yes, I know – I'm deliberately setting the scenario to get the answer that I want, but note that I'm not having to be too contrived.
That's the median.
I'm now doing so well with my glove business that I'm looking to add a luxury glove to my portfolio. I'm going on Dragons' Den and will need to give a cost per glove. The luxury part is a gold thread embroidery, and the amount of thread obviously varies with the size of the glove. I'm making gloves of all sizes now, so to come up with a figure for "gold per glove" I need to imagine sharing all the gold used evenly between the gloves, as if each one used the same amount.
That's the mean.
My point is that I've used the same data to answer each question. At no stage did I look at the data and ask "what type of average is best for this data?"; rather, I asked "what type of average is best for this question?". Now it is true that not all averages can be applied to all data. For the mean to make sense, the data has to be summable (and scalable). For the median to make sense, the data has to be comparable. For the mode to make sense, the data has to be accumulable[2]. But that simply sets limits; it doesn't inform best practice.
[2] If you thought the mode always makes sense then imagine being able to measure, say, lengths with arbitrary precision and what the mode would look like for such a data set.
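To make the "same data, three questions" point concrete, here is a minimal Python sketch. The hand measurements are invented purely for illustration (they are not real data), and the variable names are just labels for the three questions in the glove story.

```python
from statistics import mean, median, mode

# Hypothetical hand circumferences in cm, made up for this example.
hand_sizes_cm = [17, 18, 18, 18, 19, 20, 21, 21, 22, 23, 25]

# One machine, one size: which measurement occurs most often?
single_glove_size = mode(hand_sizes_cm)

# Two machines, equal output: where does the data split into two equal parts?
small_large_boundary = median(hand_sizes_cm)

# Gold thread: share the total evenly, as if every glove were the same size.
even_share_per_glove = mean(hand_sizes_cm)

print("mode:  ", single_glove_size)
print("median:", small_large_boundary)
print("mean:  ", round(even_share_per_glove, 2))
```

Nothing in the code inspects the data to pick an average; each line is chosen by the question being asked.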
For best practice, you need to think about the type of question you are asking of your data:
Do you want the most typical data point?
Do you want the data point that divides the data in equal halves?
Do you want to imagine the values spread out evenly amongst the data points?
Another way to make the decision is to think about the values of the data points.
Do the values that the data takes simply exist to differentiate between them?
If you can envision replacing the values by labels, then you probably want the mode.
Do the values exist to allow you to put the data in order?
If you can envision replacing each data value by something else that keeps the same order but distorts the values, such as squaring them, then you probably want the median.
Do the values themselves actually matter?
If replacing the values by something like their squares would fundamentally change the answer, then you probably want the mean.
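These three tests can be tried out directly. The sketch below uses invented numbers and labels (none of them come from real data): relabelling leaves the mode pointing at the same category, squaring positive values leaves the median pointing at the same data point, and squaring changes what the mean is telling you.

```python
from statistics import mean, median, mode

# Hypothetical sizes and an arbitrary relabelling of them.
sizes = [7, 8, 8, 9]
labels = {7: "S", 8: "M", 9: "L"}
mode_of_labels = mode([labels[s] for s in sizes])
label_of_mode = labels[mode(sizes)]

# Hypothetical positive measurements; an odd count keeps the median a data point.
values = [1, 2, 3, 4, 10]
squares = [v * v for v in values]

# Squaring positive numbers preserves their order, so the middle point stays put.
square_of_median = median(values) ** 2
median_of_squares = median(squares)

# The mean uses the values themselves, so distorting them changes the answer.
square_of_mean = mean(values) ** 2
mean_of_squares = mean(squares)

print(mode_of_labels, label_of_mode)
print(square_of_median, median_of_squares)
print(square_of_mean, mean_of_squares)
```

With an even number of data points the median is usually taken halfway between the two middle values, so the match under squaring is no longer exact, although the same two data points are still the ones in the middle.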
In a Twitter conversation, one participant opened by dismissing the mode; when scenarios were suggested in which the mode was the "best" average, the response was to ask for one that wasn't the "shoe shop" scenario. The problem with this is twofold. First, every scenario for a particular average can be loosely shown to be equivalent to a standard one, since each average answers a very specific question. Second, and more importantly, there are an awful lot of shoe shops. So while the "shoe shop" scenario might be the "only" case where the mode is useful, that still makes it extremely useful to a lot of cobblers.
In fact, let's pursue that. If we really wanted to know which average was the most useful then we could look at all the different data sets gathered in, say, a month and ask "which average was most useful for the person who gathered that data?". That would generate some data on which average was most useful. To find the overall most useful average we would then have to look at the average that was useful …
… the most number of times.
I'll just leave that there.
3 Is it an Outlier or an Anomaly?
One of the most frequent (hah!) arguments for the median over the mean is that the median is less sensitive to outliers. Underlying this is a sense that outliers are somehow wrong. A frequent problem in school statistics is to identify and remove outliers. The common illustration of why they are "wrong" is the average age of people in a classroom – including the teacher "skews" the mean (but not the median, hence its "superiority").
But in that scenario the teacher is a person in the classroom, so if the data is "ages of people in the classroom" then the teacher's age is a valid data point and should be reflected in any summary of the data. Of course, the median might be a better average than the mean for answering a given question, but – as I've argued above – that depends on the question that's asked. Why did we gather that information about people's ages in the first place?
If we wanted to get a sample of students registered in the school to find their ages then obviously including the teacher was incorrect and identifying the outlier can help to indicate that something went wrong. But then the outlier is not really an outlier but actually an anomaly and therefore that data point should be removed. Equally, though, so should the exchange student's age.
Here's a peculiar data set. In school statistics, a common way to identify outliers is that they are more than 1.5 times the interquartile range more extreme than the corresponding quartile. Now, the way of choosing the quartiles varies from exam board to exam board (and Wikipedia is not set on a particular choice) so let's settle on a convention. The precise convention doesn't matter so long as we have one. I'm choosing the simplest from a programmer's point of view!
We'll focus on the upper quartile; we want this to represent three quarters of the way from the first data point to the last, so it is at "position" , and if this is not a whole number then we take the corresponding interpolation between the data points on either side. We use a similar approach for the lower quartile.
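One possible reading of this convention in Python is sketched below. It assumes positions are counted from zero, so that "three quarters of the way from the first data point to the last" in a list of n points is position 0.75 × (n − 1); that indexing, the function name `quantile_by_position`, and the sample data are all my assumptions for illustration.

```python
def quantile_by_position(sorted_data, fraction):
    """Value `fraction` of the way from the first to the last data point.

    Assumes `sorted_data` is already in ascending order and uses
    zero-based positions with linear interpolation between neighbours.
    """
    if not sorted_data:
        raise ValueError("need at least one data point")
    position = fraction * (len(sorted_data) - 1)  # may be fractional
    lower_index = int(position)                   # data point at or below it
    remainder = position - lower_index
    if remainder == 0:
        return sorted_data[lower_index]
    left = sorted_data[lower_index]
    right = sorted_data[lower_index + 1]
    return left + remainder * (right - left)


# Hypothetical, already sorted data.
data = [2, 4, 4, 5, 7, 9, 10, 12]
lower_quartile = quantile_by_position(data, 0.25)
upper_quartile = quantile_by_position(data, 0.75)
print(lower_quartile, upper_quartile)
```

Under this reading the rule agrees with the default linear interpolation of NumPy's `percentile`, though other conventions would give slightly different quartiles.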
So consider this data:
Using the convention above, and identifying outliers as data points more than 1.5 times the IQR above the upper quartile, then this sets the boundary at . Quelle surprise! The last data point is an outlier, so let's remove it.
Now, we should redo our calculation because our data set has changed. With the last data point removed, the boundary for determining outliers is now .
Ah! Another outlier! Let's remove it.
I'm sure you can see where this is going. It turns out that this procedure continues down until we end up with just the data points . These are clearly the True Data and all others were Outliers.
(Incidentally, I used a little Python program to create the rest of the data set. At time of writing, this sequence is not in the OEIS.)
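The "remove, recalculate, repeat" procedure is easy to automate, which is rather the problem. The sketch below is not the program mentioned above and does not use the data set from this post. It assumes the usual multiplier of 1.5, zero-based interpolated quartiles, and an invented data set of powers of five; the function names are hypothetical.

```python
def quantile_by_position(sorted_data, fraction):
    """Interpolated value `fraction` of the way along ascending data."""
    position = fraction * (len(sorted_data) - 1)
    lower_index = int(position)
    remainder = position - lower_index
    if remainder == 0:
        return sorted_data[lower_index]
    left = sorted_data[lower_index]
    right = sorted_data[lower_index + 1]
    return left + remainder * (right - left)


def peel_upper_outliers(data, multiplier=1.5):
    """Repeatedly drop the largest value while it counts as an upper outlier.

    Returns (remaining, removed); `removed` is in order of removal.
    """
    remaining = sorted(data)
    removed = []
    while len(remaining) > 1:
        q1 = quantile_by_position(remaining, 0.25)
        q3 = quantile_by_position(remaining, 0.75)
        boundary = q3 + multiplier * (q3 - q1)
        if remaining[-1] <= boundary:
            break  # the largest value is within bounds, so stop peeling
        removed.append(remaining.pop())
    return remaining, removed


# Hypothetical data: each removal exposes the next "outlier".
kept, removed = peel_upper_outliers([1, 5, 25, 125, 625])
print("kept:   ", kept)     # [1, 5, 25]
print("removed:", removed)  # [625, 125]
```

Only upper outliers are handled here, matching the focus on the upper quartile; the loop happily discards valid data for as long as the rule of thumb lets it.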
So automatic removal of outliers is problematic. Using the outlier rule-of-thumb to identify interesting data values is useful, but those values should only be removed if they genuinely shouldn't have been in the data set in the first place.
4 Conducting a Hypothesis Test
Hypothesis testing is a core part of the statistics module in A level (the mathematics exam in the UK for 18 year olds who choose to study mathematics at that level).
Here are two ways of conducting a hypothesis test that are taught.
Use the significance level and the model to find the critical region. If the test statistic lies in the critical region, reject the null hypothesis; otherwise, don't.
Use the model to compute the p-value of the test statistic. If it is lower than the significance level then reject the null hypothesis; otherwise, don't.
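The two procedures above can be sketched for a one-tailed binomial test. Everything specific here is an assumption made for illustration: the model is X ~ B(20, 0.5) tested against a larger probability, the significance level is 5%, and both methods use a strict "lower than" comparison. The function names are hypothetical.

```python
from math import comb


def upper_tail_probability(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def critical_value(n, p, significance_level):
    """Smallest k with P(X >= k) below the significance level (None if none)."""
    for k in range(n + 1):
        if upper_tail_probability(n, p, k) < significance_level:
            return k
    return None


def reject_by_critical_region(n, p, significance_level, observed):
    """Method 1: fix the critical region first, then look at the statistic."""
    k = critical_value(n, p, significance_level)
    return k is not None and observed >= k


def reject_by_p_value(n, p, significance_level, observed):
    """Method 2: compute the p-value of the observed statistic and compare."""
    return upper_tail_probability(n, p, observed) < significance_level


# Hypothetical test: 20 trials, null hypothesis p = 0.5, 5% significance level.
n, p, alpha = 20, 0.5, 0.05
print("critical region: X >=", critical_value(n, p, alpha))
for observed in (14, 15):
    print(observed,
          reject_by_critical_region(n, p, alpha, observed),
          reject_by_p_value(n, p, alpha, observed))
```

Note that `critical_value` never needs the observed statistic, so the critical region can be written down before any data is collected.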
These methods will always reach the same conclusion: they are functionally equivalent. The second is slightly quicker than the first, particularly in exam questions when all the information – in particular the test statistic – is given to you at the outset. So if they are equivalent, and the second is quicker, why do you suspect that I have a problem with it?
In an exam scenario, I have no problem with it. But "teaching to the test" is not what I'm about. I want to help my students when they meet data out in the wild, and the issue with calculating p-values is that it puts the focus on "how significant is this data?" rather than the actual meaning of the significance level, which is "how wrong am I prepared to be?"[3].
[3] Of course, it only encodes one of the ways you could be wrong, but that's a matter for Further Maths.
The issue is that if the significance level has been set at and the p-value is then the temptation is to change the significance level to, say, so that the result is significant, or more important – publishable[4].
[4] Actually, the results should be published whichever way they go.
Or, even worse, the temptation is to not set a significance level at all and to simply report the p-value as if that had some intrinsic meaning.
Using the critical region method says "I'm defining the parameters of my test before I collect the data". It therefore guards – a little – against letting the data lead the story.
Of course, there are many other issues with this type of hypothesis testing; see this post on the n-Category Café for deeper discussion. But as it is on the syllabus I have to teach it. However, I still have a responsibility to teach it ethically[5].
[5] And don't get me started on the exam questions that see the significance level as a target rather than a limit.
Statistics is a crucial part of the mathematics experience of students in school. It is where there is the most opportunity to develop skills that they won't otherwise have the chance to build. There are many ethical issues surrounding the use, and abuse, of statistics. One thing we should be doing is equipping our students to see the difference between doing a calculation and deciding what calculation to do.
The story goes that a company hired an expert to fix a problem with its manufacturing line. The expert came in, looked at what was going on, flipped one switch, and that was it. When the – large – invoice came in, the company queried it, saying "You only flipped one switch!". The amended invoice came back with two items:
Flipping a switch: .
Knowing which switch to flip: .
A statistician's invoice should look similar:
Calculating an average: .
Knowing which average to calculate: .
Or, to quote from the ultimate "Science gone wrong" movie:
"Yeah, but your scientists were so preoccupied with whether or not they could, they didn't stop to think if they should."
Dr. Ian Malcolm, Jurassic Park