Formula One is far, far more random than fans (and adamant non-fans) care to admit.
Consistency in P1 can be deceiving. While Lewis Hamilton and Max Verstappen have won an absurd number of races between them, the placements through the rest of the field have been through a veritable blender. McLaren has ascended. Ferrari has fallen off a cliff. And Sergio Perez looks just bad enough to cost RBR the constructors' championship.
This randomness makes forecasting the outcome of a single event extremely difficult. Even if you know the placement of the top ten cars with certainty, there are still 3.6 million possible finishing orders for the remaining ten cars to parse through.
There’s no real point in trying to guess the full, final result of any given Grand Prix. I mean sure, have fun, take a swing. But with 2,432,902,008,176,640,000 possible orderings of 20 cars—futile doesn’t begin to cover it.
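If you want to check those numbers yourself, a couple of lines of Python will do it (20! for the full grid, 10! for the cars left unaccounted for):

```python
from math import factorial

# Every possible finishing order for a 20-car grid
print(factorial(20))  # 2,432,902,008,176,640,000

# Orders still possible when ten cars' placements are already known
print(factorial(10))  # 3,628,800 -- the "3.6 million" above
```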
2.4 quintillion possible outcomes is a problem for me. And not just existentially at “lights out”, when the result is most in flux and I’m wondering if today is the day Leclerc gets back on top of the podium. It’s a problem when I’m staring at my computer screen at 1:00 A.M., trying to build a portfolio of DFS lineups before the Chinese Grand Prix.
Formula One DFS is deceptively difficult. It’s not a particularly well-known discipline within DFS, but Formula One’s growing popularity will likely force more attention onto the event type. The parameters of the game are simple:
Select five drivers and one team while staying within salary limitations
Nominate one driver to be your captain (MVP, if you’re coming from FanDuel), eating a requisite increase in salary for that driver
Score points primarily on two bases: the driver’s finishing position, and the driver’s performance relative to their teammate
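For the curious, here's a minimal sketch of the salary math. The $50,000 cap, the 1.5x captain salary bump, and the salary figures are all illustrative assumptions, not the actual rules or prices of any particular DFS site:

```python
# Toy validity check for an F1 DFS lineup. The cap and the captain
# multiplier below are illustrative assumptions, not a real site's rules.
SALARY_CAP = 50_000
CAPTAIN_MULTIPLIER = 1.5

def lineup_is_valid(driver_salaries, captain, team_salary):
    """driver_salaries: {driver_name: salary} for the five drivers."""
    if len(driver_salaries) != 5 or captain not in driver_salaries:
        return False
    total = team_salary
    for driver, salary in driver_salaries.items():
        # The captain slot costs extra
        total += salary * CAPTAIN_MULTIPLIER if driver == captain else salary
    return total <= SALARY_CAP

# Example with made-up salaries: five drivers, VER as captain, one team
salaries = {"VER": 10_000, "NOR": 8_800, "ALO": 7_000, "OCO": 5_400, "GAS": 5_000}
print(lineup_is_valid(salaries, captain="VER", team_salary=7_500))  # True
```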
Knowing that Max Verstappen will be in contention to win any Grand Prix from tomorrow’s through the end of 2025 does not take a great degree of skill. In that sense, Formula One IS predictable. But divining whether Ocon or Gasly will prevail over the other at the Qatar Grand Prix takes a considerable amount of skill. That’s why I’m up at 1:00 A.M. I’m trying to develop that skill. While I’m not a tout or a professional, I’ve had some success in F1 DFS.1 Most of that success has come from being more willing than the field to respect the hand of randomness in GP outcomes.
That respect for randomness is the ultimate motivation for building a predictive model for Formula One. There’s simply no way to play the DFS game effectively without a sense of the distribution of probable outcomes for each GP. You can be sharp, you can understand the impact of safety cars, you can even track the upgrades teams bring to each track—but at the end of the day, you have to have some way to quantify the chances of Oscar Piastri finishing ahead of Lando Norris. Otherwise, you’ll have no way to know if you can afford to pay down for the No. 81, or if you should accept the cost to roster the No. 4.
For the first few entries in this article series, I want to illustrate the process of building a predictive Formula One model. I’ll eschew the code—it’s not very artful, anyway. Instead, the articles will focus on the basic concepts of modeling and forecasting within a Formula One context.
If you’re just interested in Formula One, not the DFS game, read on. There’ll be a lot of discussion about the patterns of GP outcomes, as well as Lando’s chances of beating Max (which is probably more exciting).
For now, I want to cover a couple of key concepts and lay out the first step in the journey of model-building: picking a target.
Forecasting: Key Concepts
Before outlining how I’m constructing my (very simple) Formula One model, there are a few ideas underpinning the whole shebang that a reader will benefit from grasping. They’re not dense; they just have odd names, and seem a bit scary if you don’t have pictures to orient you.
Distribution(s):
Distributions are just the set of all possible outcomes, along with how often each outcome occurs. In most contexts, you can actually think of a distribution as a “shape”, a shape that looks something like this:
[Figure: a histogram of Carlos Sainz’s possible finishing positions, peaking around P6 and P7]
The key to grasping this concept: the height of the bar at each possible finishing spot indicates the frequency of that outcome.
Notice how the shape is taller around P6 and P7. That means that a large chunk of Sainz’s possible outcomes involve him finishing 6th or 7th. Distributions allow us to chat about unknown information (like where a driver will finish in the Singapore Grand Prix) without tying that conversation to a specific number, like 1st, 5th, or 17th. For example, we can talk about Carlos Sainz’s distribution of outcomes and include both the possibility he’ll suffer a flubbed pit stop and the possibility he leads the event start-to-finish. All within the same word.
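To make the idea concrete, here's a minimal sketch that simulates a distribution of finishing spots. The probabilities are invented purely for illustration, clustered around P6 and P7 to match the shape described above:

```python
import random
from collections import Counter

# Invented, illustrative probabilities for a driver's finishing spot.
# The mass clusters around P6-P7, with thin tails for a win and for a
# back-of-the-field finish (flubbed pit stop, retirement, etc.).
positions = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20]
weights = [2, 3, 5, 8, 14, 22, 20, 12, 7, 4, 2, 1]

samples = random.choices(positions, weights=weights, k=10_000)
distribution = Counter(samples)

# Print a rough text histogram: one '#' per ~100 simulated finishes
for pos in sorted(distribution):
    print(f"P{pos:<3}{'#' * (distribution[pos] // 100)}")
```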
Encoded Information:
Information has a nasty habit of clumping together, and hiding itself throughout the means humans use to communicate. Spreadsheets, sentences, and images all conceal large quantities of information woven among their constituent parts. For example, consider this sentence:
Tom Brady has been named Super Bowl MVP five times.
It’s a really short one. But it’s packed with hidden, “encoded” information. Here’s a listing of the information that sentence contains.
Tom Brady played at least five seasons in the NFL
Tom Brady advanced to the playoffs at least five times in his NFL career
Tom Brady made it to the Super Bowl at least five times in his NFL career
Tom Brady very likely played quarterback
Tom Brady is likely a good quarterback
Tom Brady probably played more than five seasons in the NFL
Tom Brady was probably the best quarterback in the NFL for one or more seasons
Tom Brady is likelier than most quarterbacks to have played for a large number of NFL seasons
All of that, from ten words! The same is true of datapoints. Consider a Formula One race. You’ve been provided with the average lap time for each driver, and the number of laps each driver completed. From those two datapoints alone, you actually know the final result of the GP!
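Here's that deduction as a short sketch, with made-up numbers. F1 classification ranks drivers first by laps completed, then by total race time, which is just average lap time multiplied by laps:

```python
# Made-up example data: (driver, average lap time in seconds, laps completed)
race_data = [
    ("VER", 92.1, 57),
    ("LEC", 92.4, 57),
    ("SAI", 93.0, 55),  # retired two laps early
    ("HAM", 92.3, 57),
]

# More laps completed ranks first; among drivers with equal laps,
# lower total time (avg lap time * laps) wins.
finishing_order = sorted(race_data, key=lambda d: (-d[2], d[1] * d[2]))

for place, (driver, avg_lap, laps) in enumerate(finishing_order, start=1):
    print(f"P{place}: {driver}")  # VER, HAM, LEC, SAI
```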
Encoded information matters because gathering information is difficult. Effective forecasting models often rely on cleverly-engineered datapoints that encode critical information about the target event.
Learned Relationships:
A learned relationship is a softer way of saying “optimized function”. It’s the tool forecasting methods use to take some input data (say, average lap time and position) and transform it into a prediction (P4, for instance).
A machine can “learn” a relationship between the inputs and some target information—like finishing position.
Any number of methods can accomplish the learning piece. These methods vary in their complexity, computational cost, and accuracy. But for each method, the aim is the same: learn the links between inputs and the target. Because once that relationship is known, forecasters can take a set of inputs and deduce the target value. That deduced value is the prediction.
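As a rough sketch of what the learning step can look like in practice (this uses scikit-learn, and the tiny dataset is made up purely for illustration):

```python
from sklearn.ensemble import RandomForestClassifier

# Each row: [starting position, normalized avg lap time, average position].
# Targets are finishing positions. All values are invented for illustration.
X = [
    [1, 0.98, 1.5],
    [3, 1.00, 2.8],
    [10, 1.02, 9.1],
    [18, 1.05, 16.4],
]
y = [1, 3, 8, 17]

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)  # "learn" the input -> finishing-position relationship

# predict_proba returns a probability for each observed finishing
# position: a distribution, which is the kind of output we want
print(model.predict_proba([[5, 1.01, 6.0]]))
```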
The First Step: Isolating the Target
Despite the hand-waviness of the paragraph above, building a predictive model is not as simple as dumping some arbitrary inputs into a black box and running with the result.
The first item to hash out is the target. What is it you’re actually trying to predict? In our case, that’s actually a fairly easy task. We want to know the distribution of each driver’s finishing spot for a given Formula One race.
We don’t want to know how well a driver is likely to perform in a vacuum. Instead, it’s that driver’s performance relative to other drivers that interests us. We also don’t want some single placement prediction that we assume to be gospel; races are too random for an absolute prediction to be acceptable. A distribution of outcomes is key.
The second item to settle on is our input scheme. What information do we want to relate to the finishing order? There’s a heap of considerations here, but I’ll spoil the fun and just give you the answer: we want information about the race we’re currently trying to forecast. We want to know things like lap time averages, average position, the driver’s place on the starting grid, etc.
That might seem a bit odd. After all, predictions about a GP have the most value before the race, and you won’t know facts about a driver’s in-race performance before you slot them into a fantasy lineup. That’s true. But by starting from this position, it’s going to be substantially easier to get to a distribution of outcomes; I’ll explain why in a later article. And as already established, having a distribution is key.
So what specific information about the unfolding race is in the mix? I’m beginning the project by looking at three key variables. There’s no secret sauce here, none of that clever engineering mentioned earlier. At this stage, the critical element is building a workable, reasonable forecasting model. We’ll worry about using special tricks to tune up the accuracy down the line.
The three inputs are a driver’s starting position, the driver’s normalized lap time average, and the driver’s average position. That’s it. Why only three? Encoding! These three datapoints conceal loads of information about a driver’s on-track performance.
Average position combined with starting position contains information about how well the driver moved through (or away from) the field
Lap time averages contain information about the relative quality of the cars, drivers, and team decision making
Start position itself reveals information about the relative quality of the cars and the drivers before any on-track chaos occurs. It also tells us how much work a driver will have to do in order to get to the front
We (or a computer we sic on the task) should be able to discover a strong relationship between the distribution of finishing spots and those three pieces of information. While more information is always better, I find it’s much easier to diagnose problems with simple models, and build out the sophistication slowly. I’ve tried models with 2,000+ input datapoints—only to realize later I probably only needed around ten!
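Here's a minimal sketch of how those three inputs might be assembled for a single driver. The normalization scheme, dividing by the field's fastest average lap, is my own assumption, not a fixed convention:

```python
def build_features(start_pos, lap_times, positions_by_lap, field_best_avg):
    """Assemble the three model inputs for one driver.

    lap_times: the driver's lap times in seconds
    positions_by_lap: the driver's running position at each lap
    field_best_avg: fastest average lap in the field, used to
        normalize (an assumed scheme, not a fixed convention)
    """
    avg_lap = sum(lap_times) / len(lap_times)
    return {
        "start_position": start_pos,
        "normalized_avg_lap": avg_lap / field_best_avg,  # 1.0 = fastest car
        "average_position": sum(positions_by_lap) / len(positions_by_lap),
    }

# Illustrative, made-up numbers
print(build_features(4, [93.1, 92.8, 93.4], [4, 3, 3], 92.5))
```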
Lookahead
In the next article, we’ll examine how using data from a race to predict that same race’s outcome is a vital part of our strategy for building a distribution.
Enjoy the rest of the summer break!
1. Not enormous success either. I will still have a day job for the foreseeable future.