PGA Optimizer Methodology

Our PGA optimizer is the first tool of its kind for ASA. To date, we’ve mainly focused on developing tools that add contextual information to your DFS strategy, and we have purposely avoided generating objective projections. There are a couple of reasons for this, but I felt it was about time we took our approach to understanding and quantifying relevant DFS factors and applied it to creating fantasy projections by weighting the various factors we are able to quantify.

I must say, I don’t follow golf at all, and until this year I was even less familiar with it in a daily fantasy context. However, after doing some research and exploring the data we have gathered, golf seemed like a convenient sport for our first foray into algorithmic projection. A few features of daily fantasy golf, I think, lend themselves to this approach:

  1. There are no positions; you just have to pick the 6 best players who fit under the salary cap (see the lineup sketch after this list).

  2. Players are more or less playing the same opponent on a given weekend: the course.

  3. All players have equal access to fantasy scoring opportunities; there is no internal projection needed around market share of minutes, touches, etc.
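
To make point 1 concrete, lineup construction is essentially a small knapsack-style problem. The sketch below is purely illustrative and is not our optimizer’s actual code; the player names, salaries, and projections are made up, and it assumes the open-source PuLP solver.

```python
# Hypothetical sketch: pick exactly 6 golfers that maximize projected points
# under a $50,000 salary cap, posed as a small integer program.
import pulp

players = {  # name: (salary, projected fantasy points) -- made-up numbers
    "GolferA": (11200, 82.0), "GolferB": (10400, 74.5), "GolferC": (9600, 69.0),
    "GolferD": (8800, 64.0),  "GolferE": (8000, 58.5),  "GolferF": (7300, 55.0),
    "GolferG": (6500, 48.0),  "GolferH": (5800, 41.5),
}
SALARY_CAP = 50000

prob = pulp.LpProblem("golf_lineup", pulp.LpMaximize)
pick = {name: pulp.LpVariable(name, cat="Binary") for name in players}

prob += pulp.lpSum(pick[n] * players[n][1] for n in players)                # maximize projected points
prob += pulp.lpSum(pick[n] * players[n][0] for n in players) <= SALARY_CAP  # salary constraint
prob += pulp.lpSum(pick.values()) == 6                                      # exactly six golfers

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([n for n in players if pick[n].value() == 1])
```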

The methodology behind our projections is random forest modeling. Random forests are a relatively lightweight machine learning method that takes a large set of input variables with many observations (in this case, the tournament performances of many golfers that we have collected over the last five years) and builds a “forest” of decision trees, splitting on salient variables in an attempt to make the most accurate predictions. The premise of a random forest is that while a single weak learner doesn’t make for a very accurate predictor, an ensemble of many, many weak learners creates a powerful and accurate predictive model.
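
For readers unfamiliar with the technique, here is a minimal, hypothetical sketch of fitting a random forest with scikit-learn; the file and column names are placeholders, not our actual dataset.

```python
# Minimal random forest sketch: many decision trees, each grown on a bootstrap
# sample with random feature subsets at each split, averaged into one prediction.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("tournament_history.csv")                      # hypothetical file
features = ["long_form_strokes_to_par", "yards_per_par_stroke",
            "historical_course_difficulty"]                      # placeholder columns
X, y = df[features], df["dk_fantasy_points"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

forest = RandomForestRegressor(n_estimators=500, min_samples_leaf=5, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))                              # R^2 on held-out rows
```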

Our approach is a multilayered random forest model, in which a set of random forest models generates predictions that are then used as inputs to a subsequent random forest. The first layer of this prediction model is a set of random forests that predict each player’s probability of making the cut, finishing in the top 20, top 10, and top 5, and winning each tournament outright. These models predict on player- and course-specific variables measured in terms of strokes to par. We try to avoid pure “course history” as a predictor, because those metrics are based on such small sample sizes. However, our model does consider how a specific player’s performance correlates with course length, par, and historical course difficulty (as measured by the aggregation of other players’ performances). In this way, we try to model “course fit”. We also model players’ historical performance over various time intervals, to capture both long-term production level and recent form.
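
As a rough sketch of that first layer (an assumed structure, not our production code), each finish outcome gets its own classifier whose predicted probability later becomes a feature. The names below are illustrative, and it assumes `X_train`, `train_labels`, and `X_upcoming` are prepared tables of the strokes-based features and outcome labels.

```python
# First-layer sketch: one random forest classifier per finish outcome,
# each emitting a per-player probability for an upcoming tournament.
from sklearn.ensemble import RandomForestClassifier

outcomes = ["made_cut", "top_20", "top_10", "top_5", "won"]     # hypothetical labels
first_layer = {}
for outcome in outcomes:
    clf = RandomForestClassifier(n_estimators=500, min_samples_leaf=10, random_state=0)
    clf.fit(X_train, train_labels[outcome])                     # strokes-to-par style features
    first_layer[outcome] = clf

# Probability of the positive class for each outcome, keyed like pred_prob_top_20.
layer_one_probs = {
    f"pred_prob_{outcome}": clf.predict_proba(X_upcoming)[:, 1]
    for outcome, clf in first_layer.items()
}
```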

Variable importance plot for top-level random forest, which predicts players’ probabilities of finishing in the Top 20 of each tournament field.

One shortcoming of random forests is that while they offer strong predictive power, they offer little explanatory power. Linear regression can be used to predict (less accurately), but that method’s greatest value comes from its ability to associate incremental expectation with increasing or decreasing levels of independent variables. Through regression coefficients, we might be able to say that Brooks Koepka’s expected fantasy output increases by y when a course’s yards per par stroke increase by x. We don’t have that explanatory element with random forests. However, random forests can shed light on which independent variables are most important in predicting an outcome. Above is a “variable importance plot” of a top-level random forest that predicts the probability of a player finishing in the top 20 of a tournament field. As the plot suggests, long-form variables (e.g., long-form strokes relative to par, or long-form frequency of top 20 finishes) tend to be more predictive than short-form factors.
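
For anyone wanting to produce a similar plot from their own fitted forest, scikit-learn exposes impurity-based importances directly; a generic sketch (continuing the hypothetical `forest` and `features` from the earlier snippet) might look like this:

```python
# Sketch: rank and plot feature importances from a fitted random forest.
import matplotlib.pyplot as plt
import numpy as np

importances = forest.feature_importances_        # one value per input column
order = np.argsort(importances)

plt.barh(np.array(features)[order], importances[order])
plt.xlabel("Mean decrease in impurity")
plt.title("Variable importance")
plt.tight_layout()
plt.show()
```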

Schematic of ASA’s random forest projection algorithm.

After generating random forest predictions for players’ probabilities of making the cut, finishing in the top 20, top 10, and top 5, and winning outright, we fold those strokes-based projections into a set of fantasy-based input variables, giving us an independent variable set robust enough to better model the variance we see in players’ fantasy performances. These predicted input variables prove to be highly salient in predicting fantasy performance.
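
A minimal sketch of that folding step, under the assumption that the first-layer probabilities are simply appended as columns alongside the fantasy-based variables before the final forest is fit (the DataFrame and column names here are hypothetical):

```python
# Second-layer sketch: first-layer probabilities join the fantasy-based
# features, and a final random forest regressor predicts fantasy points.
from sklearn.ensemble import RandomForestRegressor

stacked = fantasy_features.copy()                    # hypothetical fantasy-based DataFrame
for name, probs in layer_one_probs.items():          # e.g. pred_prob_top_20
    stacked[name] = probs                            # append each probability as a column

final_forest = RandomForestRegressor(n_estimators=500, min_samples_leaf=5, random_state=0)
final_forest.fit(stacked, dk_points)                 # DraftKings fantasy point target
projections = final_forest.predict(stacked_slate)    # same columns, upcoming slate
```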

Variable importance plot of the final random forest model, predicting for DraftKings fantasy points.

Variables pred-prob-made-cut, pred-prob-Top-20, etc. are the probability predictions generated by the top-level predictive layer. Other variables considered include average fantasy point allowance by course (courseallowance), how players relate to course difficulty (mlm-course-allowance-DK-Total-FP-pred), players’ relation to course length per par stroke (mlm-ypps-DK-Total-FP-pred), and long-form performance in terms of total fantasy points, strokes-generated fantasy points, finish-position fantasy points, and cut-making propensity.

All in all, I feel that we’ve put together a pretty robust model that considers many factors contributing to players’ fantasy expectation. At this point it feels like we’ve exhausted the manipulation of the independent variables available to us, and any further gain in performance is going to come either from a) employing additional model types (e.g., ridge regression, with which I’ve heard people have had projection success) or b) adding additional datasets. On the latter point, we don’t yet have a good process for collecting and cleaning strokes gained data. It is something we are working to develop, though, and if we get there, I am confident we can incorporate the added data in a meaningful way that should improve predictive accuracy.

Speaking of predictive accuracy, we are measuring this with “root mean squared error”, or “RMSE”. In our early modeling stages, we were at RMSEs of about 26 DKP and 30+ FDP. Since digging in and exploring new ways to engineer our independent variable set, we have reduced model RMSE to about 16 DKP and 21 FDP. It’s difficult for me to contextualize these prediction error levels, as I don’t know how accurately others are projecting, but I am encouraged by our ability to reduce error, and I’m hopeful that we can continue to work towards more accurate models as more data becomes available.
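
For reference, RMSE is just the square root of the average squared difference between projection and actual score; a quick sketch of the calculation with made-up numbers:

```python
# RMSE: square root of the mean squared projection error.
import numpy as np

def rmse(predicted, actual):
    predicted, actual = np.asarray(predicted, float), np.asarray(actual, float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))

# Toy example with invented DraftKings scores:
print(rmse([75.0, 62.5, 90.0], [70.0, 60.0, 104.0]))   # ~8.7 DKP
```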

If you have any specific questions about methodology, don’t hesitate to reach out to us on Twitter.