What can data scientists learn from the election polling disaster?

Since the election, there has been a lot of criticism of the use of polls and of data-driven decision making. It has been suggested that statistics are unreliable and that making data-driven decisions instead of relying on intuition is a mistake. At uSwitch we collect and analyse big data on a daily basis. We even sometimes carry out opinion polls among users to look for ways we can continue to improve. We try to ensure that all our decisions, from the tiniest format change to huge structural changes, are made with statistically significant support from our data. Is this reliance on data a mistake? I believe it is a huge asset to our business, and I would like to discuss what was wrong with the election polling and what we can learn from those mistakes when making data-driven decisions.

The election polls were not ‘wrong’, because they can never be ‘right’. The polls told us how 1,000 people out of a population of almost 65 million intended to vote. This needed to be communicated with the degree of uncertainty that is clearly involved in such a tiny poll of a huge population. In this case, even the last and most inaccurate YouGov poll put Labour and the Conservatives neck and neck on 34% each. The 99% confidence interval put that proportion between 30% and 38% - a range the actual vote share fell comfortably within. So what was wrong with the polling, and what can we learn from it?
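For readers who want to check the arithmetic, the interval quoted above follows from the standard normal approximation for a polled proportion. A minimal sketch - the 34% figure and the sample size of 1,000 come from the poll described above; everything else is textbook formula:

```python
import math

# Normal-approximation confidence interval for a polled proportion.
# Figures from the text: 34% support in a simple random sample of 1,000.
p, n = 0.34, 1000
z = 2.576  # z-score for a 99% confidence level

se = math.sqrt(p * (1 - p) / n)           # standard error of the proportion
low, high = p - z * se, p + z * se

print(f"99% CI: {low:.1%} to {high:.1%}")  # prints 30.1% to 37.9%
```

A roughly four-point margin either side of 34% - which is why reporting the headline number without the interval is so misleading.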

1) Polls and polling

An odd phenomenon in the growth of big data and its analysis is an increased public enthusiasm for any data analysis, even if it is drawn from really very little data indeed. Polling - the surveying of opinions from a carefully selected subsection of the population - is a 20th century 'science'. It is susceptible to sampling bias if the poll does not consist of a cross section of the population, to bias in the questions asked, and to bias from the motivation of the interviewee. I call it a 20th century science because the most obvious thing about it is how small the data is. If I were handed a data set of 1,000 results professionally, I'd think about how I could gather at least 10 times that amount before taking it seriously. I can do that in the 21st century because we have the storage and processing capabilities to analyse enormous data sets quickly. As recently as 10 years ago, this kind of processing power was beyond a single user and forced us to sample from any large data set.

All data is not big data, yet people today (particularly those who are not directly involved in data analysis) often treat every data study the same, irrespective of its size. That is not to say that big data is entirely trustworthy either. It is still a sample, susceptible to bias however large it is, but it is far less likely to have the enormous error bounds that small data has. This election saw more polls than ever before; our newspapers are full of medical studies with tiny sample sizes saying contradictory things, or social science surveys 'proving' that everyone thinks a particular way because 1,000 people were asked their opinion. In the age of big data there is a huge public interest in and appetite for data-driven studies, but all data is not equal, and maybe we in the big data world should be more vocal about that (bigger data IS better!).
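The 'smallness' of poll data can be made concrete: the margin of error of an estimated proportion shrinks only with the square root of the sample size. A quick sketch (the sample sizes are illustrative; p = 0.5 gives the worst case):

```python
import math

# Worst-case margin of error (p = 0.5) at 95% confidence for various
# sample sizes: tenfold more data only cuts the error by ~sqrt(10).
def margin_of_error(n, z=1.96, p=0.5):
    return z * math.sqrt(p * (1 - p) / n)

for n in (1_000, 10_000, 100_000):
    print(f"n = {n:>7}: +/- {margin_of_error(n):.1%}")
# n =    1000: +/- 3.1%
# n =   10000: +/- 1.0%
# n =  100000: +/- 0.3%
```

A poll of 1,000 carries a roughly three-point margin either way; only at big-data scale do those bounds become negligible.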

2) Respect for opinion polls and the established pollsters

It appears to be an accepted opinion in the commercial world that if a survey comes from one of the established brand opinion pollsters, it must be true and trustworthy. This is odd. The opinion pollsters are particularly poor at communicating uncertainty (e.g. "Labour have moved up 1 percentage point" - is that statistically significant?). They conduct all sorts of hocus pocus to 'weight' their surveys by geographic spread, or for the mythical 'shy Tory' (why there are hypothesised to be more shy Tories in 2015 than at any previous time is unsupported by any data). These weights are often empirical rather than theoretical, added to account for previous errors, and thus potentially lead to extrapolation issues and new biases. In addition, their respondents are to a certain degree self-selecting. In an interesting interview that I came across recently, Dave Goldberg, the CEO of SurveyMonkey who tragically died last week, explained that when they collect random surveys for clients, they tell respondents that they will give £10 to charity for every completed survey, rather than giving the money to the respondent (as YouGov and other traditional survey companies do). He believed that without a personal monetary reward the respondent was more likely to return a true answer than if they were offered personal gain, where they could subconsciously veer towards the response the questioner wanted to hear (à la Yes Minister). Conducting surveys is probably the hardest way of gathering truly representative data, as they are susceptible to bias at every turn - the phrasing, choice and order of questions, the people who are asked and the way the results are weighted. Anyone can get them wrong, even the most reputable of polling companies, who still rely on the old-fashioned use of telephones, despite that becoming an increasingly outmoded form of communication which creates its own selection bias.
It is why we in the big data world treat them with extreme caution.
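To make the 'weighting' point concrete, here is a deliberately simplified post-stratification sketch - reweighting respondents so the sample's age mix matches the population's. All figures are invented for illustration, and real pollsters' weighting schemes are far more elaborate (and, as argued above, far more fragile):

```python
# Illustrative post-stratification: the sample over-represents the young,
# so each age group is reweighted to the assumed population mix.
# All numbers below are invented for this sketch.
sample = {          # age group -> (share of respondents, support for party X)
    "18-34": (0.45, 0.40),
    "35-54": (0.35, 0.33),
    "55+":   (0.20, 0.30),
}
population = {"18-34": 0.28, "35-54": 0.34, "55+": 0.38}  # assumed census mix

raw = sum(share * support for share, support in sample.values())
weighted = sum(population[g] * support for g, (_, support) in sample.items())

print(f"raw: {raw:.1%}, weighted: {weighted:.1%}")
```

Even with honest data, the headline figure moves by more than a point purely through the choice of weights - which is exactly why empirical, error-correcting weights can quietly introduce new biases.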

3) What happened had nothing to do with polls because the polls were not measuring the actual event

The oddest thing about this election campaign was the near-religious reliance on country-wide polls, which measured each party's popularity across the country, to predict the outcome of an election that, thanks to a complicated ancient system, is decided not by what proportion of the country supports each party but by where those people live. In this election Labour actually increased their vote share marginally across the country: in 2010 they got a 29% share, this time 30.4%, yet they lost 26 seats in the process. The numbers in the polls were not wrong; the use of the polls at all was wrong, because they were measuring the wrong thing. The reason the exit poll got it right was not improved sampling methods, but simply that it interviewed 20,000 people across all constituencies and calculated results on a seat-by-seat level, rather than speaking to 1,000 people across the country irrespective of whether they lived in a safe seat or a swing seat. I am aware that Lord Ashcroft conducted polls at a seat level which were not wholly accurate, which raises another issue: trusting the data collected by an individual with a clear bias, just because it is 'data', and in the data age, data is good. It is so important for those of us who work in data analysis to have the confidence to communicate exactly what our analysis represents and to reject interpretations that stray from the data.
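The disconnect between national vote share and seats is easy to demonstrate. In this toy first-past-the-post tally (constituency figures invented for illustration), one party wins a clear majority of the national vote but a minority of the seats, because its support is piled up in a few safe seats:

```python
# Toy first-past-the-post tally: seats depend on where the votes fall,
# not on the national share. All constituency figures are invented.
constituencies = [
    {"A": 5100, "B": 4900},   # A wins narrowly
    {"A": 5200, "B": 4800},   # A wins narrowly
    {"A": 5050, "B": 4950},   # A wins narrowly
    {"A": 2000, "B": 8000},   # B piles up votes in a safe seat
    {"A": 1500, "B": 8500},   # B piles up votes in a safe seat
]

totals = {"A": 0, "B": 0}
seats = {"A": 0, "B": 0}
for c in constituencies:
    for party, votes in c.items():
        totals[party] += votes
    seats[max(c, key=c.get)] += 1  # winner takes the seat

print("votes:", totals)  # B has ~62% of the national vote...
print("seats:", seats)   # ...but A takes 3 of the 5 seats
```

A national poll measures the `totals` line; the election is decided by the `seats` line.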

4) Social media

An interesting 'noise' in this election turned out to be social media. It was assumed that if things were trending on Twitter or Facebook - be it the #milifandom, #cameronettes or Russell Brand - that must indicate a national mood. But that was to assume that big data must be unbiased data, when in fact, no matter how big a data set is, it can still carry an inherent bias. Social media still very strongly represents a certain age and social group. We then choose to bias it further by picking our friends and followers, ensuring we only ever see one world view - but we see so much of it that we assume it must be universal. The danger of gathering opinions from social media has never been more apparent. Here we have big, biased data.

5) There was a narrative and the data was forced to illustrate that narrative

"Do not put faith in what statistics say until you have carefully considered what they do not say." -- William W. Watt

Since 2010, political scientists and journalists have believed that, since no party had done very much to change and people were generally dissatisfied, the result would be another hung parliament. With this initial hypothesis, the data was squashed and squeezed to fit the narrative. There are now stories coming out that there were polls which put the Conservatives with a larger lead over Labour, but they were not published because they did not fit the narrative and were assumed to be 'rogue' polls (the concept of a 'rogue' poll is also a misnomer: such polls simply sit further towards the error bound in the range of uncertainty). It was indeed true that Labour and the Conservatives remained neck and neck over the last five years. But that hid the fact that it was not the same people in those two groups. Both Labour and the Conservatives gained many voters from the Liberal Democrats, who lost 15.2% of their 2010 vote share; both lost voters to UKIP; and Labour lost a lot of voters to the SNP and some to the Greens. The same polls that saw the Conservatives and Labour consistently neck and neck over five years also showed the growth in UKIP and SNP voters and the demise of Liberal Democrat voters. Unless pundits went with the highly unlikely assumption that all Liberal Democrats were now voting UKIP or SNP, it had to be recognised that there was a lot of movement between all parties, which meant that the narrative of stagnation since 2010 was unsustainable.

Traditionally we respect the 'expert' - the retired politician, the political journalist, the constitutional academic - for their experience, even though theirs is not an evidence-based pursuit. We celebrate the data as an oracle when it is right (e.g. Nate Silver calling the last two US elections: he was called 'modest' when he tried to explain the concept of uncertainty and that he got lucky twice) and berate the empirical approach when it is 'wrong'.
But that is to misunderstand the art and science of data analysis. What is really important is not just the collection of the data, but how the data is interpreted and communicated - or, as was the case in this election, constantly misinterpreted by people who do not understand the nuance and skill required to present the uncertainty or to separate the signal from the noise.
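The 'rogue' poll point bears out in simulation: even when every poll is conducted perfectly, roughly 1 in 20 will land outside its own 95% confidence band by chance alone. A minimal sketch (true support of 34% and samples of 1,000 are assumed for illustration):

```python
import random

random.seed(0)

# Simulate many perfectly-conducted polls of 1,000 people drawn from a
# population whose true support is 34%. About 1 in 20 will fall outside
# the 95% band purely by chance -- the so-called 'rogue' polls.
true_p, n, trials = 0.34, 1000, 2_000
z = 1.96
se = (true_p * (1 - true_p) / n) ** 0.5

outside = 0
for _ in range(trials):
    estimate = sum(random.random() < true_p for _ in range(n)) / n
    if abs(estimate - true_p) > z * se:
        outside += 1

print(f"{outside / trials:.1%} of simulated polls fall outside the 95% band")
```

Discarding those polls as 'rogue' is discarding information that the model of a perfectly stable race was wrong.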

Finally, I think the most important thing to remember is that only a small part of statistics is about predicting the future - the most valuable contribution of our work is reporting on the hidden details of the present. By being too concerned with making 'predictions', the pollsters and journalists missed what was happening under their noses over the last few years and where the real stories were: as it turned out, Scotland going to the SNP, the South West deserting the Lib Dems for the Conservatives, and Labour losing some voters to UKIP in key marginal seats. All of this should have been observable if pollsters had paid more attention to the nuance of the here and now, rather than trying to predict an overall result. This cautionary tale is very important for all data scientists, analysts and anyone involved in data-driven decision making. There is a place for opinion polls, but they must be constructed and conducted with extreme care and independence, and communicated with their uncertainty. Even when working with big data that has small error bounds, we are still subject to biased data and to problems of misinterpretation. Data-driven decision making has revolutionised business, but for it to succeed, it requires stringent and exact application and interpretation. There are only lies, damned lies and bad statistics.