MLB Run Total Projection Methodology

When ASA re-launched last winter in the middle of the NBA season, there was never really much need for a smart way to project implied point totals for two reasons:

  1. NBA over/unders and spreads are widely accessible, and we have a simple linear methodology for converting these two metrics into an implied point expectation (see the quick sketch after this list).

  2. NBA point totals are fairly normally distributed, and the scoring floor is never a practical concern, since a team will never score anywhere near zero points.
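For reference, here is a minimal sketch of the kind of linear conversion we use on the NBA side. The function name and the exact split are illustrative, not our production code; the point is simply that the over/under and spread map directly onto implied team totals.

```python
def nba_implied_totals(over_under: float, favorite_spread: float) -> tuple[float, float]:
    """Split a game over/under into implied team totals using the spread.

    Illustrative only: favorite_spread is the favorite's spread as a
    negative number (e.g. -5.0 for a five-point favorite).
    """
    margin = abs(favorite_spread)
    return (over_under + margin) / 2, (over_under - margin) / 2

# nba_implied_totals(220, -5.0) -> (112.5, 107.5)
```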

However, neither of these facts holds for projecting MLB run totals. While game run totals are still widely accessible, baseball spreads are always +/- 1.5 runs, which is not truly indicative of the expected margin. Instead, we have moneylines, which better reflect expected game competitiveness but have no natural place in a simple linear model for predicting implied runs.

Because of this, we identified a need for smartly projecting team run totals for baseball games given the information we have at hand. As we try to do with as much of our content as possible, we set out to make a projection tool that wouldn’t estimate run totals as a median point projection, but rather would predict run totals probabilistically as a distribution of possible outcomes.

To project MLB run totals, we went back to the well of random forest modeling, which I discuss in our PGA Optimizer Methodology post. Our projection model predicts a team’s run total from the following set of independent variables:

  • Over/under

  • Win probability - derived from moneyline

  • Expected margin of victory - derived from win probability

  • Weather - temperature, wind, precipitation

  • Park factor

  • Team projected starting lineup

  • Opponent projected starting pitcher

  • Opponent bullpen

The astute statistician would point out that over/unders and moneylines (or win probability) should absorb, and are thus highly correlated with, the variables below them. But that is one of the great things about random forest modeling: it is quite adept at handling multicollinearity (read: problematic correlation among independent variables). Rather than trying to model out the correlation, random forests acknowledge it and lean most heavily on the most salient variables.
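As a rough illustration of what that looks like in practice, here is a minimal sketch of the random forest step using scikit-learn. The column names are hypothetical stand-ins for the variables listed above, and the square-root transform on the target (described further down) is omitted for brevity; this is not our production code.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical feature names standing in for the variable list above.
FEATURES = [
    "over_under", "win_prob", "expected_margin",
    "temperature", "wind_to_cf", "precip_score", "park_factor",
    "lineup_run_value", "opp_starter_run_value", "opp_bullpen_run_value",
]

def fit_run_total_forest(games: pd.DataFrame) -> RandomForestRegressor:
    """Fit a random forest on correlated game-level features.

    No attempt is made to decorrelate the inputs; the forest simply leans
    on whichever variables carry the most signal.
    """
    model = RandomForestRegressor(n_estimators=500, min_samples_leaf=5, random_state=0)
    model.fit(games[FEATURES], games["team_runs"])
    return model

# Feature importances show which of the (correlated) inputs the model leans on:
# sorted(zip(FEATURES, model.feature_importances_), key=lambda t: -t[1])
```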

To arrive at many of these variables, we had to do some internal variable transformation and modeling. I’ve found this to be a pretty effective approach with random forest modeling: creating a model that predicts off of variables that are themselves predictions (or aggregations of historical predictions) from internal models.

Over/under, park factor, and weather variables are more or less untransformed. Wind data comes as a speed and a direction, so we do some work to collapse these into a single numeric variable: the wind speed blowing straight out to center field. Precipitation comes as a categorical variable, so we do some work to transform it into a numeric bucket.
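A hedged sketch of those two weather transforms is below. The direction convention and the precipitation buckets are assumptions for illustration; the real feed and encodings differ.

```python
import math

def wind_toward_centerfield(wind_speed_mph: float,
                            wind_blowing_toward_deg: float,
                            centerfield_bearing_deg: float) -> float:
    """Component of the wind blowing straight out to center field.

    Positive values blow out toward center field, negative values blow in.
    Both directions are compass bearings in degrees (assumed convention).
    """
    angle = math.radians(wind_blowing_toward_deg - centerfield_bearing_deg)
    return wind_speed_mph * math.cos(angle)

# Map the categorical precipitation feed into a rough numeric bucket
# (placeholder categories and values).
PRECIP_SCORE = {"none": 0.0, "drizzle": 0.5, "rain": 1.0, "heavy rain": 2.0}
```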

Win probability is derived from the moneyline, which is an odds measure: the probability of outcome A divided by the probability of its complement. With this knowledge, we can derive the win probability as |ML|/(100 + |ML|) for favorites and 100/(100 + |ML|) for underdogs (note that “|ML|” indicates the absolute value of the moneyline). This approach assumes that moneylines are fair, i.e. that there is no house juice baked into their value. We know this is not true, but it is the approach we are using for now. In the future, we may explore adjusting the odds to account for house juice.
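Concretely, the conversion looks something like the sketch below (vig-free, as noted above).

```python
def win_prob_from_moneyline(ml: int) -> float:
    """Implied win probability from an American moneyline, assuming no juice.

    Favorites carry negative moneylines (e.g. -150), underdogs positive (e.g. +130).
    """
    if ml < 0:  # favorite
        return abs(ml) / (abs(ml) + 100)
    return 100 / (ml + 100)  # underdog

# win_prob_from_moneyline(-150) -> 0.600, win_prob_from_moneyline(130) -> ~0.435
```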

Calculating expected margin of victory is tough; I don’t think using the +/- 1.5 spread is a good approach at all. I’ve seen some recommendations for linear equations that consider over/under and win probability, but I’m less convinced by these, as run totals don’t appear to be a normally distributed linear metric. Our approach is to normalize run totals (which we know are not normally distributed) by taking, and predicting for, the square root of runs. With this, we have a dataset of near-normally distributed sqrt(run totals) with an assumed-constant variance that we can estimate. Given a game’s over/under, each team’s win probability, and the natural variance in run totals, we can set up a pair of normally distributed variables with some unknown margin separating them. Through repeated creation of random normal distributions (with parameterized means and variances), we can estimate the expected margin of victory associated with each win probability, expressed in square-root runs, which we then square to get an expected run margin. Now we have a metric that we can combine with the over/under to produce an equation for generating implied runs. But this isn’t the end of our approach; it is just an input variable into our random forest. Essentially, we acknowledge that this equation is one way of generating implied runs, but an imperfect one, so we consider its prediction alongside the other variables we create.
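Here is a simplified sketch of that simulation step. The sigma value, the grid of candidate margins, and the simulation counts are illustrative assumptions, not our production parameters.

```python
import numpy as np

def sqrt_run_margin(win_prob: float,
                    over_under: float,
                    sigma: float,
                    n_sims: int = 50_000,
                    seed: int = 0) -> float:
    """Estimate the sqrt-run margin implied by a favorite's win probability.

    Draws pairs of near-normal sqrt-run totals separated by a candidate
    margin and keeps the margin whose simulated win rate best matches the
    moneyline-implied probability. win_prob should be >= 0.5 (the favorite).
    """
    rng = np.random.default_rng(seed)
    base = np.sqrt(over_under / 2)        # rough sqrt-run mean for an even game
    best_margin, best_err = 0.0, 1.0
    for m in np.linspace(0.0, 1.5, 76):   # candidate sqrt-run margins
        fav = rng.normal(base + m / 2, sigma, n_sims)
        dog = rng.normal(base - m / 2, sigma, n_sims)
        err = abs((fav > dog).mean() - win_prob)
        if err < best_err:
            best_margin, best_err = m, err
    return best_margin  # mapped back to a run-scale margin as described above
```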

Lastly, we predict on the run-creating talent of a team’s projected starting hitters and the run-allowing (lack of) talent of the opponent’s projected starting pitcher and bullpen. In this step, we create a simple linear regression that estimates the incremental run value associated with the outcome of every plate appearance: regular outs, double plays, walks, hits of every base value, you name it, our regression model says “that outcome was worth this many incremental runs.” From this, hitter talent is measured as the average incremental runs created per plate appearance, and pitcher talent as the average incremental runs allowed per plate appearance.
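A bare-bones sketch of that regression is below. The outcome categories and the unit of aggregation (team-innings here) are illustrative assumptions.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

OUTCOMES = ["out", "double_play", "walk", "hbp",
            "single", "double", "triple", "home_run"]

def fit_run_values(innings: pd.DataFrame) -> pd.Series:
    """Regress runs scored on counts of each plate-appearance outcome.

    Each coefficient reads as the incremental runs associated with one
    occurrence of that outcome.
    """
    reg = LinearRegression().fit(innings[OUTCOMES], innings["runs_scored"])
    return pd.Series(reg.coef_, index=OUTCOMES)

def runs_per_pa(pa_log: pd.DataFrame, run_values: pd.Series) -> float:
    """Average incremental runs per plate appearance for one hitter or pitcher."""
    return pa_log["outcome"].map(run_values).mean()
```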

We then use this variable set to produce point predictions for the square root of runs we expect a team to score. This is a point projection, but keep in mind, it is the projection of a normally distributed variable. With these projections in hand, we create a simple linear model that captures the variance associated with each point projection. This is an important step and a key note in projecting team totals: higher implied totals inherently have higher variance, and the relationship between the expected value of a point/run outcome and the variance we can expect around it is roughly linear. With this information, we can produce a distribution of possible outcomes for this normally distributed variable, and then square the results of that distribution to return our projection to its most usable form: a projection distribution for runs scored.
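To tie the pieces together, here is a condensed sketch of that last step. The variance-model coefficients are placeholders; in practice they come from the simple linear fit described above.

```python
import numpy as np

def run_distribution(sqrt_runs_pred: float,
                     var_intercept: float,
                     var_slope: float,
                     n_sims: int = 10_000,
                     seed: int = 0) -> np.ndarray:
    """Turn a sqrt-run point projection into a distribution of run outcomes."""
    rng = np.random.default_rng(seed)
    # higher projections carry higher variance, roughly linearly
    sigma = np.sqrt(var_intercept + var_slope * sqrt_runs_pred)
    sqrt_samples = rng.normal(sqrt_runs_pred, sigma, n_sims)
    return np.clip(sqrt_samples, 0.0, None) ** 2  # square back to runs, floored at zero

# e.g. np.percentile(run_distribution(2.1, 0.02, 0.05), [10, 50, 90])
```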