## WinPicks Reference Manual

## Section 11.2. How SureLock Works

*SureLock* produces prediction formulas using a multiple linear regression model. No knowledge of statistics is needed to use *SureLock*, but if the mathematics behind the method interests you, please read on.

This section describes the prediction model in mathematical detail, but is not a proof of its effectiveness. MicroBrothers welcomes any comments you may have about the methods described in this section.

### Prediction Formula Model

Many statistical problems require one or more independent variables. These variables are used to design a prediction model that determines future outcomes based on past results. Simple problems with just one independent variable are often solved with a technique called *linear regression*. However, more complex problems often have more than one independent variable. If the model is linear in its coefficients and uses more than one variable, it is called a *multiple linear regression* model. The general form of such a model used to estimate future outcomes can be expressed as:

y = b_{0} + b_{1}x_{1} + ... + b_{k}x_{k}

where

y is the estimated outcome of the model, x_{1} through x_{k} are data samples taken from previous measurements, and b_{0} through b_{k} are regression coefficients that are estimated using multiple linear regression.

PFA allows you to use 32 separate statistical categories when you create prediction formulas. These 32 categories are independent variables that serve as inputs to the model. Therefore, the value of k in the standard equation is 32.
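As a minimal sketch (not SureLock's actual code), the general model above can be evaluated like this; the coefficient and input values are illustrative, and k is reduced to 3 for brevity:

```python
def predict(b, x):
    """Evaluate y = b[0] + b[1]*x[0] + ... + b[k]*x[k-1]."""
    return b[0] + sum(bi * xi for bi, xi in zip(b[1:], x))

# Example with k = 3 (SureLock uses k = 32):
b = [2.5, 0.4, 0.1, 0.2]   # b0 (the constant term) plus three coefficients
x = [5, -3, 2]             # three input measurements
y = predict(b, x)          # 2.5 + 2.0 - 0.3 + 0.4 = 4.6
```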

### Inputs to the Model

Each input data measurement is one game. For each of the 32 statistics, i=1..32, the input measurement to the model is:

x_{i} = (R_{ho} - R_{vd}) + (R_{hd} - R_{vo})

where

- R_{ho} is the home team's league rank on offense
- R_{vd} is the visiting team's league rank on defense
- R_{hd} is the home team's league rank on defense
- R_{vo} is the visiting team's league rank on offense
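The input measurement for one statistic can be computed directly from the four ranks. A small sketch with hypothetical ranks:

```python
def input_measurement(r_ho, r_vd, r_hd, r_vo):
    """x_i = (R_ho - R_vd) + (R_hd - R_vo) for one statistical category."""
    return (r_ho - r_vd) + (r_hd - r_vo)

# Home team ranked 3rd on offense and 10th on defense; visitor ranked
# 7th on defense and 5th on offense (ranks are hypothetical):
x = input_measurement(3, 7, 10, 5)   # (3 - 7) + (10 - 5) = 1
```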

### Coefficients in the Model

The coefficients in the equation, b_{1} through b_{32}, are related to the prediction formula weights by a multiplier, and the constant term, b_{0}, is simply the home field advantage value in the formula. The actual percentage weights in the formula are called w_{1} through w_{32}; these weights are always positive and always add up to 100 (normalized). Another variable, called the *point spread multiplier* (PSM), is used to vary the magnitude of the predicted margin of victory as shown:

y = b_{0} + PSM ( w_{1}x_{1} + ... + w_{k}x_{k} )

From this relationship we see that the coefficients b_{1} through b_{32} are related to the prediction formula's percentage weights by:

b_{i} = PSM × w_{i}

and thus we can recover the percentage weights w_{1} through w_{32} by solving the standard regression model for b_{1} through b_{32} and then substituting back into the equation above.
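The back-substitution can be sketched as follows. Because the weights are normalized to sum to 100, PSM follows directly from the sum of the (positive) coefficients; this derivation is an assumption of the sketch, not SureLock's documented code:

```python
def weights_from_coefficients(b):
    """Recover percentage weights from coefficients b_i = PSM * w_i.

    Since the weights sum to 100, sum(b) = PSM * 100, so
    PSM = sum(b) / 100 and w_i = b_i / PSM.
    Assumes all coefficients are positive.
    """
    psm = sum(b) / 100.0
    return psm, [bi / psm for bi in b]

psm, w = weights_from_coefficients([1.2, 0.6, 0.2])
# sum(b) = 2.0, so PSM = 0.02 and w = [60.0, 30.0, 10.0]
```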

### Output from the Model

The output from our model, y, is the projected margin of victory of the home team over the visiting team.

To obtain an equation to estimate y, we collect data samples (x) from our sample of past games. The next step is solving for the coefficients b_{0} through b_{32} to produce an optimum formula.

### Solving the Model Using Regression

Solving the linear regression model for all 32 statistics is time consuming even on the fastest computers and does not necessarily produce a more accurate formula. In addition, as previously noted, the coefficients in a prediction formula are all positive. Therefore, a more effective solution is to use a subset of the 32 statistics, each with a positive coefficient, and then solve the model for that subset.

### Forward Selection

After we select a sample size, there are still two important questions to answer: how many statistics should we use, and which ones? Both can be answered using a method called *forward selection*. Forward selection is a stepwise procedure that requires several iterations to find an "optimum subset" of statistics to include in the model. The basic process is as follows:

1) Each statistical category (Points Scored, First Downs, and so on) is solved independently using linear regression. The category which produces the highest correlation to the actual game results is selected.

2) Each of the remaining categories is tried in addition to the category selected in step 1, and linear regression is used to solve the two-variable model. The category with the highest correlation to the actual game results, and whose coefficient is positive, is added to the model.

Step 2 is repeated until the number of categories in the model equals the number of statistics that you selected.
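The steps above can be sketched as a greedy loop. This is an illustrative reconstruction, not SureLock's implementation; it applies the positive-coefficient check at every step and uses ordinary least squares for each trial fit:

```python
import numpy as np

def forward_selection(X, y, n_select):
    """Greedy forward selection sketch.

    X is (games x categories), y is the actual margins of victory.
    At each step, try each remaining column alongside the already
    selected columns, fit by least squares, and keep the column whose
    fitted values correlate best with y, provided its coefficient is
    positive.
    """
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < n_select and remaining:
        best, best_r = None, -np.inf
        for j in remaining:
            cols = selected + [j]
            A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
            b, *_ = np.linalg.lstsq(A, y, rcond=None)
            if b[-1] <= 0:          # require a positive coefficient
                continue
            r = np.corrcoef(A @ b, y)[0, 1]
            if r > best_r:
                best, best_r = j, r
        if best is None:            # no candidate qualified
            break
        selected.append(best)
        remaining.remove(best)
    return selected
```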

### Ridge Regression

Once the statistics are selected, *SureLock* solves for the coefficients. One problem that occurs when solving for the coefficients is called *multi-collinearity*. This occurs when two or more statistics are highly correlated to each other. For example, points scored and first downs may correlate closely in a given game. Multi-collinearity causes the standard regression solution to produce coefficients with large variances, and sometimes a solution is not even possible.

Ridge regression is another form of multiple linear regression, but is known as a biased estimation technique. It allows a certain amount of bias in the coefficients in order to reduce their relative magnitudes. Another nice feature of ridge regression is that the solution of the coefficients yields positive values (most of the time), which is exactly what *SureLock* needs to create prediction formulas.

*SureLock* uses ridge regression to solve for the coefficients b_{0} through b_{32} in the prediction model. Ridge regression is fairly complex, and we lack the space to discuss it here. If you are interested, the topic is covered in several books on statistics and forecasting.
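In its simplest form, ridge regression adds a penalty term to the normal equations, trading a small amount of bias for smaller, more stable coefficients. The sketch below is a textbook formulation, not SureLock's actual solver; the ridge parameter `lam` and the choice to leave the constant term unpenalized are assumptions:

```python
import numpy as np

def ridge_solve(X, y, lam=1.0):
    """Ridge regression sketch: solve (A'A + lam*I) b = A'y,
    where A is X with a leading column of ones for the constant term.
    The constant term is left unpenalized."""
    A = np.column_stack([np.ones(len(y)), X])
    penalty = lam * np.eye(A.shape[1])
    penalty[0, 0] = 0.0              # do not bias the constant term
    return np.linalg.solve(A.T @ A + penalty, A.T @ y)
```

With `lam = 0` this reduces to ordinary least squares; increasing `lam` shrinks the coefficients toward zero, which is what tames multi-collinearity.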

### The Finished Formula

After using ridge regression, some coefficients may still have negative values due to precision errors in floating point arithmetic. These negative coefficients are very close to zero, and can be set to zero without degrading accuracy. *SureLock* then normalizes the set of coefficients so that the percentage weights sum to 100, and sets the point spread multiplier to an optimum value. The result is a formula that *Pro Football Analyst* can use to accurately forecast future games.
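The finishing step described above can be sketched as follows; clamping near-zero negatives and normalizing are from the text, while the specific values are illustrative:

```python
def finalize_weights(b):
    """Clamp small negative coefficients (floating point noise) to zero,
    then normalize so the percentage weights sum to 100."""
    clamped = [max(bi, 0.0) for bi in b]
    total = sum(clamped)
    return [100.0 * bi / total for bi in clamped]

w = finalize_weights([0.30, -0.0001, 0.50, 0.20])
# the negative coefficient is zeroed, and the weights sum to 100
```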