Guest snark: How accurate, and meaningful, were David Roche's WSER predictions?
A regular reader who has bagged at least one ultramarathon victory, has completed the Western States Endurance Run, enjoys genuine literature (as well as, sometimes, this newsletter), and has a strong inherent command of data and numbers has grown to detest David Roche’s Trail Runner articles for the same reason most of his detractors do: He says nothing of value, and he does so in language so indecipherable that it’s hard to tell whether this is mostly a feature (because a lack of clarity helps obfuscate Roche’s use of sham or bone-headed “math”) or mostly a bug (because some people are simply not constituted to effectively or efficiently convey thoughts and ideas in this medium, with the common trait being an aversion to entertaining any input that isn’t fawningly benevolent).
Despite being not so much triggered by trying to parse Roche’s output as cruelly yanked to the brink of brain-sizzling suicidal insanity, this reader has supplied me with tailored responses to a few of Roche’s Trail Runner articles in recent years, and I have posted these here with minimal formatting changes and other edits (example).
The article under scrutiny here is a June 20 offering, “The Best Performances Ever At The Western States 100, Plus 2023 Predictions.” The guest-rejoinder is written as if responding directly to David Roche himself, as Roche is known to enjoy being lavished with at least some flavors of personal attention.
As always, it’s tempting to ladle my own critical gravy atop the meaty barbs already supplied, but this would defeat the purpose of farming out the task of deconstructing Roche’s garbledyguck. Instead, I’ll limit myself to pointing out that Roche fundamentally misunderstands baseball’s Wins Above Replacement stat (very few Triple-A players possess the skills of a statistically average everyday Major Leaguer) and has no clue how to select or reject statistical outliers as either individual data points (e.g., lone racers) or group data points (e.g., entire Western States fields) when burps in trends present themselves. (Because he doesn’t consciously cheat to reach consistently inane conclusions, he should be working for the Food and Drug Administration or the Centers for Disease Control.)
And my favorite dumb line from the original article is “This ranking really shows the effect of temperature on men’s times.” Next up for hypothesis confirmation: “Ambient temperature really seems to play a role in how fast hockey players skate on outdoor rinks, with temps below freezing seemingly associated with possible faster speeds.”
Roche’s own words are in a regular or italicized font. The guest-lecturer’s are bolded.
I may not have gone to many school dances, but I could talk for 3 hours on the merits of Wins Above Replacement in baseball players before 1950.
Dude, you have worn this joke the fuck OUT.
Sabermetrics never did me much good in life… UNTIL TODAY. Because in this article, we’re going to analyze some advanced statistics for the Western States 100, with the help of some sabermetric principles along the way. My co-conspirator is Marshall Burke, associate professor of Global Environmental Policy at Stanford University, whose work on wildfire smoke you may have seen over the last few weeks in outlets like the New York Times. He’s back after last year’s time predictions, which were shockingly accurate. I didn’t ask him whether he went on dates in high school, but based on his statistics knowledge, I think we all know the answer.
Right, because specific interests and/or intelligence have a proven direct correlation with romantic success.
Let’s break down this year’s data fun, starting with the best women’s performances ever.
Ann Trason is the Babe Ruth of Ultrarunning
In 1920, Babe Ruth hit 54 home runs, at a time when no other team in the league hit more than 50. That wasn’t even his best season. The aforementioned Wins Above Replacement aims to quantify how many wins a player adds to their team relative to a solid player that the team could theoretically add from the minor leagues. In 1921, the Babe had 14.1 WAR. For comparison, in those years when Barry Bonds took steroids and broke the entire game of baseball (for context, his on base percentage was .609 in 2004, when the league average was .335), his highest WAR was 11.9.
Good thing there aren’t cheaters in ultrarunning *cough* *cough*
In ultrarunning, Ann Trason is Babe Ruth, but better.
(Also, fun fact: Barry Bonds is an avid cyclist on Strava. For some reason, this makes me so, so happy.)
Fascinating.
Here’s how Marshall conducted the thought experiment to determine the best performances ever. For each year starting in 1985, he built a model from all the other years but not that year, and predicted what would have happened in that year given the overall trend in times and the overall field average performance on that day. It’s akin to a simplified Wins Above Replacement, comparing the winner to the middle of the pack.
STARTING in 1985. Remember that. He’s not simply comparing the winner to the middle of the pack; he also built a graph or correlation linking the years 1985-2022.
Marshall’s rationale is that the overall gender-specific average is going to pick up the temperature effect and anything else that made that day fast or slow (snow, boats, mountain lions, Mercury in retrograde, etc.). And controlling for field average is probably better than controlling for the average time in the top 10 or top 20 since a fast leader might cause the elite times to be faster.
Right, having more data points is usually a good idea. I remember learning that in 4th grade. I’m not downplaying the rigor of Marshall’s analysis; I’m only stating that controlling for field average isn’t rocket science.
A model that considers the progression of performances over time and field average time explains greater than 75% of the variation in winning times. So even though this is just a thought experiment, it’s coming from Marshall’s brain, which could be hooked up to a turbine to power a medium-sized city.
I guess it depends on how you quantify “explaining the winning times”, but OK.
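One common way to put a number on "explains X% of the variation" is the coefficient of determination, R². Whether that is the statistic behind the article's 75% figure is an assumption; the article doesn't say. A minimal sketch, with an invented function name and invented numbers:

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Invented values, purely to show the calculation.
example = r_squared([1, 2, 3, 4], [1.5, 1.5, 3.5, 3.5])
```

Here `example` works out to 0.8: the residual sum of squares is 1 and the total sum of squares is 5. An R² above 0.75 on winning times would mean the model's misses are small relative to how much winning times vary from year to year, which is a narrower claim than "explains the winning times."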
There is one more problem, though. How do we deal with athletes who have won multiple editions of the race? For example, Jim Walmsley’s 2018 performance could theoretically make his 2019 performance seem less remarkable since he’s competing against himself in the analysis. Marshall dealt with the problem by creating two models and letting us decide. What a boss! It’ll be a shame when his cerebellum is used to power microwaves in Albuquerque.
Since the population of Albuquerque is over 500,000, technically it’s bigger than a medium-sized city (I did research; hint, hint).
The first model drops the winning times of every year that athlete won the race (“What would have happened in 2019 had Jim Walmsley not existed?”)
You’re not explaining this well. If Jim Walmsley finished #1 in 2018, but finished #4 in 2020 (I realize there was no race in 2020; I mean if he finished the race without winning), would his 4th place time remain in the analysis? You said he would drop the winning time every time an athlete WON a race, but also said “what would have happened in 2019 if Jim Walmsley had not existed”, implying any 4th (or 99th, or 223rd) place times would be wiped out as well.
The second model just drops the winning time in a single year (“What would have happened in 2019 had Jim missed that year?”).
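For readers who want to see the mechanics, here is a minimal sketch of the hold-out idea and the two exclusion variants. It assumes a plain linear model on year and field-average time, which the article never specifies, and it reads "never existed" as dropping every year that athlete won. Every number, athlete label, and function name in it is invented.

```python
# Illustrative only: every number and name below is invented, not WSER data.

def fit_ols(features, targets):
    """Least-squares coefficients for targets ~ b0 + b1*x1 + b2*x2."""
    X = [[1.0, *f] for f in features]
    k = len(X[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * y for x, y in zip(X, targets))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                factor = A[r][c] / A[c][c]
                A[r] = [a - factor * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

def features_of(rec):
    # Centered predictors keep the arithmetic well conditioned.
    return (rec["year"] - 1985, rec["field_avg"] - 26.0)

def held_out_residuals(records, drop_all_wins=False):
    """Actual minus predicted winning time (hours) for each year.

    drop_all_wins=False: fit on every other year ("missed that year").
    drop_all_wins=True: also drop every other year won by the same athlete
    ("never existed", as this sketch interprets it).
    """
    out = {}
    for held in records:
        train = [r for r in records
                 if r["year"] != held["year"]
                 and not (drop_all_wins and r["winner"] == held["winner"])]
        b0, b1, b2 = fit_ols([features_of(r) for r in train],
                             [r["win"] for r in train])
        x1, x2 = features_of(held)
        out[held["year"]] = held["win"] - (b0 + b1 * x1 + b2 * x2)
    return out

FIELD_AVG = [26.0, 25.5, 26.8, 25.9, 27.1, 26.2, 25.7, 26.5]  # hours, invented
WINNERS = ["B", "C", "D", "E", "A", "A", "F", "G"]            # invented labels
records = []
for i, (avg, who) in enumerate(zip(FIELD_AVG, WINNERS)):
    baseline = 20.0 - 0.05 * i + 0.5 * (avg - 26.0)
    # Athlete "A" beats the invented trend by a full hour in both wins.
    win = baseline - 1.0 if who == "A" else baseline
    records.append({"year": 1985 + i, "field_avg": avg,
                    "winner": who, "win": win})

single = held_out_residuals(records)
never_existed = held_out_residuals(records, drop_all_wins=True)
```

In this toy data, athlete A's two wins are built to sit exactly one hour under an otherwise perfectly linear trend, so `never_existed[1989]` and `never_existed[1990]` both come out at about -1.0 hours. In the `single` variant, A's other win stays in the training set and bends the fit, which is the self-competition problem the article describes. Note that the sketch only ever touches winning times, so it does not answer the question of what happens to a past winner's non-winning finishes.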
Now we have the context to think about the immortal Ann Trason. She won Western States 14 times, in bonkers performances that put her in the overall top-6 8 times. Let’s start with the model that predicts winning times assuming that the year’s winner never existed, since Marshall thinks it’s the fairest to outliers. We included the average finishing time and high temperature in Auburn for context. Prepare to have your mind blown.
KABOOM, there goes your mind! Ann was so far ahead of her time that she has run 13 of the best 14 performances ever based on the model. Here’s the second model, which assumes the winner didn’t compete in that year only.
Ann still has 12 of the top 20 performances ever, but she is dethroned at the top spot by legend Ellie Greenwood’s course record. While I am not a statistician, I actually think this model might be fairest because Ann won so many editions of the race that we would be tossing a lot of data. Ellie’s time is historic, and my gut tells me it’s the best run ever, maybe in the entire sport. I am biased toward the present, though.
Bias doesn’t belong in statistical analysis. Neither does your gut. If Ellie’s time was “the best run ever, maybe in the entire sport”, does that make Courtney’s time the best best run ever? You have named many performances as the “best ever” – Jim Walmsley, Grayson Murphy, Adam Peterman, Ellie Greenwood, etc. If everyone is the “best ever”, then no one is.
And all of that brings us back to baseball. If we take WAR at face value, it’s hard to argue against Babe Ruth being the greatest of all time. However, baseball has some unique considerations. Players back then had names like Rock Saw McGee, they threw fastballs that couldn’t break glass, and–disgustingly–the game was segregated. Babe Ruth would have hit fewer home runs if he had to face Smokey Joe Williams in 1930, or an exploding 100-mile-an-hour cut fastball in the modern era. He probably would have still been quite good, but no way he’s the clear-cut greatest.
There’s a difference between a semi-colon and a comma – just a lil’ FYI. Probably would have still been quite good? I’m not sure how you argue that he’s the greatest baseball player of all time, yet aren’t convinced he’d be excellent today (“probably still quite good” doesn’t sound like a GOAT athlete to me). Again, stats don’t care about your feelings. Whether Babe Ruth is the clear-cut greatest depends on how you quantify being the greatest. Babe Ruth may not have faced 100 mph fastballs (he did face fastballs up to the low 90s), but today’s pitchers will never face Babe Ruth, and Babe Ruth didn’t use the advanced engineered bats of today. Maybe Babe Ruth would have hit more home runs with faster fastballs – physics, ya know?! Have the balls to state Babe Ruth is the greatest and mean it.
Ultrarunning is a lot different, of course. Here, we’re talking the 1990s and early 2000s, not 100 years ago.
The first Western States 100 was in 1977. Just because you prefer the present, doesn’t mean the past doesn’t exist. Ann Trason, Kathy D’Onofrio, Chuck Jones, Brian Purcell, and Mark Brotherton all ran top-20 times (according to Marshall’s model) in the 1980s. Pay attention.
And the segregation piece is not directly relevant (though trail running has a long way to go with inclusion). But I do think that more recent athletes might be getting penalized for the advances of the sport more generally, as the average finishing time in the middle of the pack might have gone from someone who had never heard of a hill stride in 1994 to someone who reads everything about training theory today. Yes, in this formulation, I think hill strides are the most statistically significant variable. Is there anything hill strides can’t do?!
Does trail running really have a long way to go with inclusion? Compared to other sports, it’s quite inclusive. The WS100 even listed non-binary athletes’ results. Maybe hill strides can improve your writing? I had podium finishes in several ultras, and never ran hill strides. I’m not sure what you’re trying to say about recent athletes being penalized for the advances in sport. Are recent athletes penalized b/c other average Joes and average Janes have access to information, and therefore are better competition?
In addition to training theory, it’s possible that increased use of cooling, a larger talent pool, and better equipment could be driving down average times. But no matter how you slice it, Ann set records that will never be broken. 100 years from now, my great-great grandson’s coach will probably be writing an article on how Ann Trason is untouched historically.
I’ve always been a fan of Ann, but you don’t know her records will never be broken. At one time, no one thought a man could run a sub-4 minute mile. You forgot to mention that one must qualify to run Western States. Therefore, the “average person”, along with the slowest person in WS100, has already trained and reached a certain level of performance simply to make it to the starting line. Having more people reach the WS100 qualifying standards will certainly drive down average times – not a groundbreaking conclusion, Einstein. I ran WS100 in the early 2000s, and there were plenty of ice cubes and cooling. Maybe there are more advanced ice vests these days, but even those won’t get you that far when you have hours between aid stations at 90-100+ degrees.
Quickly, though, let’s highlight some of those modern performances. Every year since 2018, it has taken a historically stellar performance to win. Courtney, Clare, Beth, and Ruth, in chronological order, have beaten statistical expectations by 20-40 minutes. That informs my takeaway for this year for women. If you want to be in the top-5, you’re racing the competition. If you want to win, you’re racing history.
So beating the statistical expectations by 20-40 minutes = stellar. Got it. In 2023, only two women beat the 16:46 “stellar” time. Zero men beat the “stellar” time in 2023, even though 5 got top-20 times.
While we can’t run this model without an understanding of average finish times, we can use temperature and the same historical trends to predict what time the winner will run. The current high forecast in Auburn, CA is a downright temperate 78 degrees F. Marshall’s model would predict that the winning woman will finish in 17:06. If it’s 84 degrees, that time would be 17:18. It’s going to be snowy in the high country, so it’s possible that these predictions need to be tossed aside. But 20 minutes faster than 17:06 is 16:46.
Ellie Greenwood’s GOAT performance is 16:47. Get your popcorn popping.
What I need is a bottle of booze to figure out some of the points you’re trying to make.
Jim Walmsley is a Superhero
In last year’s analysis, we had to run predictions with Jim and without Jim. The problem? Jim breaks equations like he breaks records.
He doesn’t “break equations.” He’s an outlier.
We continued that trend for the time predictions, but for the best all-time performances, we don’t see the same effect as we saw with Ann. While Ann is like Babe Ruth, Jim might be like Pedro Martinez or Sandy Koufax, demonstrating absolute dominance, but at a time when the game was a bit more developed. Here are the best performances ever with all of that athlete’s performances omitted, which will be the only graph since the differences are marginal.
Jim’s 2021 win was his slowest time, but his best performance, and the best of all time by a ton. That year, it was hot and times were slower across the board. Except Jim, who ran an unthinkable time. And I love Mike Morton’s performance sneaking in with the 2nd spot! This ranking really shows the effect of temperature on men’s times, with some of the fastest times ever being relatively close to model predictions (as a reminder, every winning time is legendary in its own right, and all hate mail can go to Marshall and his big brain, P.O. Box 190 IQ Way).
Wow, that’s a lot to unpack. No offense to Jim, but if he ran his slowest time, why say “…times were slower across the board. Except Jim, who ran an unthinkable time”? He either ran a slow (relative to his ability) time or he didn’t. How does one quantify “best of all time by a ton”? Was Jim’s 2021 win superior to Courtney’s 2023 win? I’ll send the hate mail directly to you at 123 WTF Parkway. How did Mike Morton “sneak” into 2nd place? Didn’t he simply perform well?
When you take out Jim’s performances, most recent men’s races have aligned with model predictions, or even been slower, contrasting with what we see for women. In 2023, it’s safe to assume that the snow will lead to slower times than the model predicts, since it doesn’t account for snowpack in the high country.
If the high in Auburn is 78 F, the model predicts the men’s winning time will be 14:24. Holy shit. Let’s take Jim out of the stats entirely to give a more accurate prediction. Without Jim, we’re looking at 14:41. And based on what we see historically, it’s safe to assume the time might be a bit slower than that unless we see an all-time great run.
I guess it wasn’t safe to assume, as the winner ran 14:40.
At 84 degrees, those times go to 14:34 and 14:56. Even with snow, we’re probably looking at a barn burner this year.
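The quoted forecasts are easy to sanity-check with a little clock arithmetic. The sketch below uses only the times stated in the article and in the rejoinders above; the per-degree slope is simply the difference between the two quoted temperatures, which assumes the model is roughly linear over that range (the article does not say), and the helper names are invented.

```python
def to_minutes(hhmm):
    """Convert an 'H:MM' finish time to total minutes."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)

def to_hhmm(total_minutes):
    """Convert total minutes back to 'H:MM'."""
    return f"{total_minutes // 60}:{total_minutes % 60:02d}"

# (predicted time at 78 F, predicted time at 84 F), as quoted in the article.
QUOTED = {
    "women": ("17:06", "17:18"),
    "men": ("14:24", "14:34"),
    "men, Jim excluded": ("14:41", "14:56"),
}

def implied_slope(at_78, at_84):
    """Minutes added per degree F between the two quoted forecasts."""
    return (to_minutes(at_84) - to_minutes(at_78)) / (84 - 78)

slopes = {label: implied_slope(*times) for label, times in QUOTED.items()}

# "20 minutes faster than 17:06"
stellar_women = to_hhmm(to_minutes("17:06") - 20)

# Actual 2023 men's winning time (14:40) against the Jim-excluded forecast.
miss_minutes = to_minutes("14:40") - to_minutes("14:41")
```

The implied slopes are 2.0 minutes per degree for women, about 1.7 for men, and 2.5 for men with Jim excluded. `stellar_women` is `16:46`, and `miss_minutes` is -1: the Jim-excluded forecast missed the winner by a single minute, on the fast side.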
David Roche’s Predictions
The temps are going to be cool, the competition is going to be hot, and the sentence structure is going to be predictable. My bold prediction is that the women’s course record is broken, along with the best performance ever using this model. I don’t want to name names, but if you know, you know.
Maybe your sentence structure should be less predictable if you want to be a writer? If you’re making predictions, then why can’t you name names? Too chicken? Afraid to ruffle feathers? (Ha ha, OK, now I’m telling bad jokes.)
For men, I think the winning time will not break 15 hours or the top-20 performance list. Yes, I am going against Marshall’s model. He’s on vacation in Iceland, so what’s he going to do about it? Iceland doesn’t have baseball! I’m not even sure Iceland has internet!
Oops-a-daisy! The top male ran 14:40, and the top 5 males entered the top-20 performance list. The 6th man ran close to the previous top-20 times.
My rationale is that the men’s times are so fast, with a lower rate of improvement over time, and sub-15 pace in the first 30 snowy miles will eat athletes alive in the final 30 miles. So I think the men’s winner will come off a slightly more conservative pacing strategy than the model would predict, or will involve a fade from model-predicted times. But again: my brain is filled with Rod Carew batting averages and Fergie lyrics, so maybe that’s crowding out some useful predictive neurons.
I’m all for girl power, but you are greatly underestimating the men’s field if you think they ALL will be thrown by the snowy miles, when the women won’t. I would make fun of you for knowing Fergie lyrics, but I remember all the words to some pretty crappy music myself.
Let’s end with one more baseball stat: Fielding Independent Pitching, or FIP. I absolutely love the story behind this statistic, so here I am telling you about it in a trail running magazine. Believe in your dreams, dateless kids!
Ugh.
In the early 2000s, researcher Voros McCracken discovered a wild baseball quirk. Across seasons and massive datasets, the number of batted balls that became hits rarely showed correlations for individual athletes. In other words, the probability that a batted ball becomes a hit might be out of a pitcher’s control.
That seems wrong. Wouldn’t a great pitcher give up less solid contact, leading to fewer hits? And wouldn’t Tugboat McGee with the middle-school fastball have every pitch belted a billion miles per hour? Shockingly, it doesn’t seem like that’s the way it works–the league-average batting average for balls in play is relatively stable across seasons, and individual deviations from that mean might just be luck, rather than skill (with some exceptions for certain pitchers).
FIP isolates what the pitcher actually controls: strikeouts, walks, hit-by-pitches, and home runs. It seems like that’s such a small part of the game, but those numbers alone added a depth of understanding to pitching performance that informs how the game unfolds now. And I think FIP is a solid metaphor for how athletes can think about Western States.
Wow, that was NOT a good description of FIP. FIP comes down to the pitcher’s talent, essentially isolating him from other players making plays in the field when a ball is hit. If you pitch on a good defensive team, your ERA will be better than if you were on a bad defensive team. The FIP is a way to make the pitcher’s talent independent from his teammates.
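For reference, the commonly published form of FIP fits in a one-line function. The weights (13, 3, 2) are the standard ones; the additive constant is recalculated each season to put FIP on the same scale as ERA, so the 3.10 default here is only a placeholder assumption, and the function name and sample stat line are invented.

```python
def fip(hr, bb, hbp, k, ip, constant=3.10):
    """Fielding Independent Pitching.

    hr: home runs allowed, bb: walks, hbp: hit batters,
    k: strikeouts, ip: innings pitched.
    constant: season-specific scaling term (placeholder default).
    """
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant

# Invented stat line, purely to show the calculation.
sample = fip(hr=20, bb=50, hbp=5, k=200, ip=200)
```

The sample line works out to (260 + 165 - 400) / 200 + 3.10, or about 3.225. Nothing that happens after a ball is put in play, other than a home run, enters the formula, which is how the stat separates the pitcher from the defense behind him.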
There are things you control: training, cooling, logistics, and mindset. There are many more things you don’t control: health, temperature, snow, trail conditions, and all the other vagaries of race day. Think about what you control, and try to remember that the uncontrollable variables are what make ultrarunning so special.
One can only control cooling so much on a course like Western States.
When there are two outs and the bases are loaded, just like when there is a make-or-break moment in an ultra, it doesn’t matter how much good luck or bad luck led to that moment. What matters is the underlying fundamental elements of performance you can control.
True, but what also matters in that situation is whether or not you can strike out the batter if you’re on defense, or produce a hit if on offense. If you are the most talented pitcher but allow a home run, you’ve fucked up, and THAT is what matters.
There is a lot of luck involved in all of this stuff. But luck plays a lot smaller role if you take a deep breath, regroup, and strike the next hitter out.
All this stuff. Such literary gusto. Of course luck plays a smaller role if you strike the next hitter out. I’d win all my races if I simply took a deep breath, then outran all the other runners. Easy peasy.