Track Record: Euros 2019

This page was first posted on 31 May 2019.

This page looks at our regression-based predictions for the European elections on 23 May 2019.

Electoral Calculus was commissioned by Remain United to predict the European elections, particularly to provide information about the relative strength of the pro-Remain parties (defined as the Lib Dems, Greens, Change UK, SNP and Plaid Cymru) in the 11 electoral regions.

To do this, we worked with leading pollster ComRes to conduct two waves of fieldwork, each surveying around 4,000 people. We also used regression-based techniques known as MRP (Multi-level Regression and Post-stratification) or RPP (Regularized Prediction and Post-stratification). The choice of RPP techniques was made for two reasons. Firstly, backtesting over past elections by Electoral Calculus has shown them to be as accurate as classical polling, and often more accurate. Secondly, RPP methods are well suited to producing the regional estimates of vote share which the project required.
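For readers interested in the mechanics, the sketch below illustrates the general RPP idea: fit a regularised regression of vote choice on demographic and geographical variates from survey responses, then post-stratify the fitted probabilities over population cells to obtain regional vote shares. It uses synthetic data and an off-the-shelf logistic regression rather than the project's actual model, variates or software; everything named in it is an illustrative assumption.

```python
# Minimal sketch of regularised prediction and post-stratification (RPP).
# Synthetic data throughout -- illustrative only, not the project's actual model.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_regions, n_parties = 11, 4            # hypothetical: 11 regions, 4 parties

# Survey respondents: a few encoded demographic variates plus region and stated vote.
n_resp = 4000
resp_demog = rng.integers(0, 5, size=(n_resp, 3)).astype(float)
resp_region = rng.integers(0, n_regions, size=n_resp)
resp_vote = rng.integers(0, n_parties, size=n_resp)
X_resp = np.column_stack([resp_demog, resp_region])

# Population frame: demographic cells with known population counts in each region.
n_cells = 500
cell_demog = rng.integers(0, 5, size=(n_cells, 3)).astype(float)
cell_region = rng.integers(0, n_regions, size=n_cells)
cell_pop = rng.integers(100, 5000, size=n_cells)
X_cell = np.column_stack([cell_demog, cell_region])

# 1. Regularised regression of vote choice on the variates (a real multilevel model
#    would treat region and other groupings hierarchically, not as a raw number).
model = LogisticRegression(C=1.0, max_iter=1000).fit(X_resp, resp_vote)

# 2. Post-stratification: predict every cell, weight by cell population, sum by region.
cell_probs = model.predict_proba(X_cell)          # shape (n_cells, n_parties)
for r in range(n_regions):
    m = cell_region == r
    weights = cell_pop[m] / cell_pop[m].sum()
    shares = weights @ cell_probs[m]
    print(f"region {r}: {np.round(100 * shares, 1)} pc")
```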

The results of the final (second) wave of polling were published on 20 May 2019, with fieldwork conducted from 13-17 May. The final published tables are available here. Remain United also published seat projections and tactical voting advice derived from the polling.

The summary predictions and outcome are given in the first table:

Measure                                             | CON | LAB | LIB | Brexit | Green | SNP/PC | UKIP | ChUK
Predicted vote share, assuming tactical voting (pc) |  11 |  24 |  17 |     32 |     6 |      4 |    2 |    2
Actual vote share (pc)                              |   9 |  14 |  20 |     32 |    12 |      5 |    3 |    3
Predicted seats won, assuming tactical voting       |   6 |  18 |  14 |     26 |     2 |      4 |    0 |    0
Actual seats won                                    |   4 |  10 |  16 |     29 |     7 |      4 |    0 |    0

There were several positive features of the predictions and a couple of negative ones.

On the positive side, the project:

In terms of the regional predictions, the following regions were broadly correctly predicted:

On the negative side, the collapse of the Labour vote was under-estimated. The Labour vote share was around 10pc lower than predicted, with the Lib Dems and Greens benefiting. This was a common problem across most pollsters (apart from YouGov), whether they used regression methods or not. In regional terms, the Greens won unpredicted seats in London, the Eastern region and the North West because of the Labour collapse.

How good was the regression method?

There are two different things to look at. One is the overall predicted vote share, so let's start with that. Ignoring the tactical voting prediction, the basic vote share from the polling was very similar between the regression analysis and the classic polling analysis:

Measure                                   | CON | LAB | LIB | Brexit | Green | SNP/PC | UKIP | ChUK
Classic polling predicted vote share (pc) |  12 |  22 |  14 |     32 |     7 |      4 |    3 |    5
RPP polling predicted vote share (pc)     |  11 |  24 |  15 |     32 |     6 |      4 |    2 |    4
Actual vote share (pc)                    |   9 |  14 |  20 |     32 |    12 |      5 |    3 |    3

The two methods are broadly as accurate as each other. Classic polling is a bit closer for Labour, Green and UKIP, but RPP is slightly better for Conservative, Lib Dem and Change UK. The difference between them is only 2pc for Labour, and no more than 1pc for any other party. The average absolute error for both methods was around 3pc. The main error was in the predicted Labour vote, and that error does not seem to have been caused by the choice of polling analysis methodology.
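As a quick check, those error figures can be reproduced directly from the table above (party order CON, LAB, LIB, Brexit, Green, SNP/PC, UKIP, ChUK):

```python
# Average absolute error of each method against the actual national vote shares.
import numpy as np

actual  = np.array([ 9, 14, 20, 32, 12, 5, 3, 3])
classic = np.array([12, 22, 14, 32,  7, 4, 3, 5])
rpp     = np.array([11, 24, 15, 32,  6, 4, 2, 4])

print(np.abs(classic - actual).mean())   # about 3.1pc
print(np.abs(rpp - actual).mean())       # about 3.3pc
```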

The other key thing to examine is how well the two methodologies perform at predicting the individual regions. To do this, we take the raw regional predictions from each method and adjust them in a UNS (uniform national swing) fashion to correct for the overall polling error. This preserves the "shape" of each method's prediction, but adjusts the mean to match reality.
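A minimal sketch of this adjust-and-compare step, for a single party, is given below. The arrays are hypothetical placeholders rather than the actual regional figures, and the national mean is taken as a simple unweighted average of regions.

```python
# Sketch of the UNS-style correction and regional error calculation for one party.
# Hypothetical placeholder numbers -- not the actual regional predictions or results.
import numpy as np

predicted_by_region = np.array([20.0, 24.0, 18.0, 22.0, 26.0])   # raw regional prediction (pc)
actual_by_region    = np.array([16.0, 21.0, 15.0, 17.0, 21.0])   # actual regional result (pc)

# Shift every region by the overall polling error, preserving the prediction's "shape".
overall_error = predicted_by_region.mean() - actual_by_region.mean()
adjusted = predicted_by_region - overall_error

# Average absolute regional error after the adjustment.
print(np.abs(adjusted - actual_by_region).mean())
```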

We can then take the difference between the predictions and reality to get an error for each party in each region. We can do this for three separate cases: (a) simple case where we use the national average vote share as the prediction for each region, (b) classic polling prediction for each region, and (c) regression-based prediction for each region. The table below shows the average absolute error across regions for each party for all three methods.

Average error (pc)     | CON | LAB | LIB | Brexit | Green | SNP/PC | UKIP | ChUK | Average (ex SNP/PC)
(a) National average   | 1.6 | 4.9 | 4.4 |    6.6 |   2.4 |    n/a |  1.0 |  0.7 |                 3.1
(b) Classic polling    | 2.5 | 4.7 | 3.3 |    2.7 |   3.5 |    1.3 |  1.7 |  1.4 |                 2.8
(c) Regression polling | 1.8 | 2.6 | 2.1 |    2.8 |   2.3 |    0.8 |  1.4 |  1.0 |                 2.0

We see that the regression method is generally better at predicting the regional result than classic polling. On average the regression method has an average error of 2.0pc across parties and regions, while classic polling has a larger average error of 2.8pc. Classic polling is only slightly better than assuming the whole country is homogeneous.

Another way of quantifying this is to look at the variance of the actual results against the three methods. Ignoring the SNP/PC (whose high variance distorts the comparison), the total variances across parties and regions are (a) 15.1pc, (b) 12.0pc, and (c) 5.2pc. We can then create a crude version of "coefficient of determination" or R-squared, which is 20pc for classic polling and 66pc for regression analysis. This again suggests that the regression analysis is a better fit than classic polling for regional prediction.
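Those R-squared figures are consistent with taking one minus the ratio of each method's variance to the national-average baseline variance:

```python
# Crude R-squared from the variances quoted above.
var_national, var_classic, var_rpp = 15.1, 12.0, 5.2
print(100 * (1 - var_classic / var_national))   # about 20 (pc), for classic polling
print(100 * (1 - var_rpp / var_national))       # about 66 (pc), for regression
```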

In conclusion, the European election prediction went quite well with many accurate predictions. The main error was the understatement of Labour's collapse, which was common across many pollsters and methodologies. In terms of predicting the regional results, the regression-based approach appears superior to using classic polling.

Technical note: MRP and Machine Learning

As part of the technical communication around the use of regression methods for this election, Electoral Calculus had described these methods as machine learning. Several statisticians were quick to comment that regression, particularly MRP, is a statistical technique and is not "machine learning".

The MRP process as a whole is indeed driven manually, but at the centre of it is a logistic regression which determines the linkages between political attitudes and various demographic, geographical and political variates. The extent of those linkages, and the choice of variates themselves, is determined by an automatic algorithm and is not subject to manual intervention. That is very similar to the "supervised machine learning" practised in, say, natural language processing, where logistic regression has played an important part in the subject's development. The final post-stratification part of the process is clearly separate and distinct from machine learning.
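To make the "automatic" part concrete, the toy example below fits an L1-regularised logistic regression to synthetic data: the penalty shrinks the coefficients of uninformative variates towards zero, so the effective choice of variates falls out of the fit rather than being made by hand. This is a generic illustration, not the project's actual model, regularisation scheme or variates.

```python
# L1-regularised logistic regression on synthetic data: the penalty drives the
# coefficients of irrelevant variates towards zero, automating variate selection.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 6))              # six candidate variates
logit = 1.5 * X[:, 0] - 1.0 * X[:, 1]       # only the first two actually matter
y = (rng.random(2000) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print(np.round(model.coef_, 2))             # irrelevant coefficients come out near zero
```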

The name MRP was originally coined by the creator of the method, Prof Andrew Gelman of Columbia University. But he has recently proposed the improved and more general term RPP (Regularized Prediction and Post-stratification), and Electoral Calculus is happy to use this more modern term for these general prediction methods.

