Back in early June, I got tired of waiting for Nate Silver to release his own election model, so I started building my own. I don’t do this for a living: I have a new baby, a full-time job, a part-time second job, and hence limited time. However, using the 80/20 principle, I first found a good intuitive way to use aggregated polls to estimate each candidate’s chance of winning each of a dozen swing states, then expanded to include all states.

The model is informed only by polling data (taken from RCP) and by how far away we are from the election. Attempting to include economic or demographic data would mean a lot more work without much added value, since that information is already captured in the polls.

I use a logistic model to translate aggregated polling data into a likelihood of winning for each state. The shape of the logistic curve is informed both by how far we are from election day and by how much polling has gone on in a state (the curve gets sharper with more polling data and as we get closer to election day).

**State Poll Aggregation:**

I calculate the poll average by weighting each poll by age and sample size: older, smaller polls are less meaningful. At the state level, sample-size weights work linearly; a 1,000-person poll is weighted twice as much as a 500-person poll. Age weights work on a half-life of 30 days (this may need to change as the election draws near): a poll from today counts twice as much as a poll from 30 days ago, which in turn counts twice as much as a poll from 60 days ago. There is also a distinction between likely-voter and registered-voter polls. Early on they were treated roughly the same, but as the election draws near the registered-voter polls are weighted less and less, until on election day they are not included at all.
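The weighting scheme above can be sketched in a few lines. This is a minimal illustration, not the author’s actual implementation: the function names, the tuple layout, and the linear fade-out for registered-voter polls are all my assumptions (the post only says RV polls are "weighted less and less").

```python
def poll_weight(sample_size, age_days, is_registered_voter=False,
                days_to_election=150, half_life=30.0):
    """Weight a poll linearly by sample size and by age on a 30-day half-life."""
    w = sample_size * 0.5 ** (age_days / half_life)
    if is_registered_voter:
        # One possible reading of "weighted less and less": fade RV polls
        # linearly to zero weight by election day. (Assumed, not stated.)
        w *= min(1.0, days_to_election / 150.0)
    return w

def weighted_average(polls):
    """polls: list of (margin, sample_size, age_days) tuples."""
    total = sum(poll_weight(n, age) for _, n, age in polls)
    return sum(m * poll_weight(n, age) for m, n, age in polls) / total
```

With these weights, a 1,000-person poll from today counts exactly twice as much as a 500-person poll from today, and twice as much as a 1,000-person poll from 30 days ago, matching the rules described above.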

The logistic that translates a polling edge into a state win percentage is sensitive both to sample size and to time until the election. For example, a state with little polling might have a logistic shaped like this:

[figure: shallow logistic curve for a lightly polled state]

While a more heavily polled state might look like this:

[figure: sharper logistic curve for a heavily polled state]

The logistic function is calibrated as follows. Polls currently conducted have two main sources of error:

1) The poll sample doesn’t necessarily represent the population (sample size)

2) The poll doesn’t know what will happen between now and election day (cone of uncertainty)

Error source #1 can be calculated. I discount polls against what I have defined as a full sample (10k) using the square root of the ratio of the sample size to 10,000 (quite a bit of thought went into selecting the square-root ratio over a log ratio or a straight ratio, if anyone is interested).
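The discount is a one-liner; a quick sketch (the function name is mine):

```python
import math

FULL_SAMPLE = 10_000  # the "full sample" benchmark defined in the text

def sample_discount(n):
    """Discount factor relative to a full 10k sample: sqrt(n / 10,000)."""
    return math.sqrt(n / FULL_SAMPLE)
```

So a 2,500-person poll carries half the evidential weight of a full 10k sample, rather than a quarter (straight ratio) or much less (log ratio), which is the practical difference the square root makes.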

Quantifying the error from source #2 is more difficult. The current model assumes a linear flow of information (we learn, on average, as much going from 60 to 59 days out as we do from 9 to 8 days out). We then calibrated the logistic using the assumption that a candidate who was 1 point down in a state, in a full 10k sample, when the poll was initially completed (5/29), had a 45% chance to win that state.
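The calibration anchor above pins down the slope of a standard logistic. A minimal sketch, assuming the simplest form P(win) = 1 / (1 + exp(-k * margin)) with the margin in points (the post doesn’t give the exact functional form, so this is one plausible reading):

```python
import math

def calibrate_slope(margin=-1.0, win_prob=0.45):
    """Solve 1 / (1 + exp(-k * margin)) = win_prob for the slope k."""
    # The log-odds of the win probability, divided by the margin, gives k.
    return math.log(win_prob / (1 - win_prob)) / margin

def win_probability(margin, k):
    """Map a polling margin (points) to a win probability via the logistic."""
    return 1.0 / (1.0 + math.exp(-k * margin))
```

By construction, a candidate down 1 point in a full sample gets a 45% win probability, and a tied race maps to 50%. In the full model, k would also scale up with the sample-size discount and as election day approaches, which is what sharpens the curve.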

