Eric Ekholm - Crossfit Open ’24 Analysis

Now that the 2024 Open is over, I figured it might be fun to look at some of the data from this year’s top 100 finishers. There’s a lot we can look at here, and I’m sure I’m going to leave a few things out – feel free to drop me a line if there’s something important that I missed.

If you’re into R and want to do your own analysis, you can check out my work-in-progress crossfitgames package that has some tools for fetching and processing data from the CrossFit API. It’s in kind of a janky state right now, but it works fine for what I want to do today.

I’m including the code used to pull/clean data and create graphs here. It’ll be folded up by default, and I won’t really explain what it’s doing step-by-step, but if you’re into that sort of thing, you can take a look.

Code

library(crossfitgames)
library(tidyverse)
library(gt)
library(hrbrthemes)
library(lubridate)

#placeholder for now
theme_set(theme_ipsum_rc())

women_raw <- open_leaderboard(2024, "women", top_n = 100)
men_raw <- open_leaderboard(2024, "men", top_n = 100)

women_lb <- extract_final_leaderboard(women_raw) |>
  mutate(div = "women")

men_lb <- extract_final_leaderboard(men_raw) |>
  mutate(div = "men")

#combine the two dataframes
all_lb <- rbind(women_lb, men_lb)

women_workout_res <- extract_workout_results(women_raw) |>
  mutate(div = "women")

men_workout_res <- extract_workout_results(men_raw) |>
  mutate(div = "men")

all_workouts <- rbind(women_workout_res, men_workout_res)

Top Finishers

First, let’s look at the top finishers for the men and women. We’ll start with the women:

Code

women_lb |>
  slice_min(score, n = 10) |>
  select(-div) |> 
  gt() |> 
  tab_header(
    title = "Top 10 Women",
    subtitle = "2024 CF Open"
  )

Top 10 Women
2024 CF Open
rank	athlete	score
1	Grace Walton	35
2	Mirjam von Rohr	40
3	Anikha Greer	72
4	Arielle Loewen	77
5	Carolyne Prevost	91
6	Christina Agerbeck	97
7	Julia Blazejowska	113
8	Seher Kaya	114
8	Katrina DiGiacomo	114
10	Kara Saunders	115
10	Aimee Cringle	115
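
The table has 11 rows rather than 10 because slice_min() keeps ties by default (with_ties = TRUE), and two athletes share 10th place. Here's a rough base-R sketch of that rule, using the scores from the table plus one made-up 12th score (120) that is only there for illustration:

```r
# Scores from the table above; the final value (120) is a hypothetical
# 12th-place score added only to show what gets cut off.
scores <- c(35, 40, 72, 77, 91, 97, 113, 114, 114, 115, 115, 120)

# "min" ranking gives tied scores the same (lowest) rank,
# so both 114s are 8th and both 115s are 10th.
placing <- rank(scores, ties.method = "min")

# Keeping every score ranked 10th or better retains both 115s.
top_with_ties <- scores[placing <= 10]
length(top_with_ties)  # 11

# Ignoring ties means just taking the first 10 sorted values.
top_no_ties <- head(sort(scores), 10)
length(top_no_ties)  # 10
```

In the dplyr pipeline, passing with_ties = FALSE to slice_min() would trim the table to exactly 10 rows, at the cost of arbitrarily dropping one of the tied athletes.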

One thing to keep in mind here is that lower scores are better in the Open, since your score is the sum of your rankings for each workout. Since there are 3 workouts, the best score possible would be 3.
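
To make the scoring concrete, here's a small sketch using Laura Horvath's three workout placements (they show up in a table later in this post). The variable names are just for illustration and aren't part of the crossfitgames package:

```r
# Workout placements for one athlete: Laura Horvath's 24.1, 24.2, and 24.3
# finishes, taken from the table further down in this post.
placements <- c(395, 6, 7)

# The Open score is the sum of the per-workout placements.
open_score <- sum(placements)
open_score  # 408

# Winning every workout gives the lowest (best) possible score.
best_possible <- sum(rep(1, length(placements)))
best_possible  # 3
```

That 408 matches her overall score on the leaderboard.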

On the women’s side, the most recognizable names are Loewen and Saunders, and a lot of the bigger names (Tia, Laura, Emma Lawson) are absent. This shouldn’t be concerning, but it’s potentially interesting.

And moving to the men’s side:

Code

men_lb |>
  slice_min(score, n = 10) |>
  select(-div) |> 
  gt() |> 
  tab_header(
    title = "Top 10 Men",
    subtitle = "2024 CF Open"
  )

Top 10 Men
2024 CF Open
rank	athlete	score
1	Jonne Koski	30
2	Saxon Panchik	39
3	Jay Crouch	64
4	Luka Vunjak	81
5	Noah Ohlsen	82
6	Fabian Beneito	93
7	Cale Layman	95
8	Jeffrey Adler	103
9	Brandon Luckett	124
10	Patrick Vellner	139

The story is a little different here. There are lots of recognizable names that have historically done well at the Games: Vellner, Adler, Ohlsen, Koski, Panchik, and Crouch.

Men and Women Scores by Rank

Another way we might look at the data is to compare the scores of the men to those of the women. This can give us a sense of the “depth” or level of competition across the field.

For instance, we can plot each athlete’s rank (overall placement) against the total score for men and women:

Code

ggplot(all_lb, aes(y = rank, x = score, color = div)) +
  geom_point(alpha = .5) +
  scale_y_reverse() +
  labs(
    x = "Score",
    y = "Rank (Overall Place)",
    title = "Score by Overall Rank",
    subtitle = "Top 100 Men and Women"
  ) +
  scale_color_discrete(name = "Division")

In cases where, for any given rank, the women’s score is less than the men’s score, we can assume (sort of) the women outperformed the men. For example, the 100th place woman had considerably fewer points than the 100th place man. This might tell us that, relative to the rest of the field, the 100th place woman is “better” than the 100th place man. Of course it’s not that straightforward – we could just as reasonably conclude that the men’s field is deeper than the women’s field – but it’s a fun thought exercise.

What is interesting, though, is that the score gap between the 1st and 100th woman is much smaller than the gap between the 1st and 100th man, and that we see separation between the men and women starting at around 50th place.
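
As a quick check on that gap, here's a sketch that uses only the 1st- and 100th-place scores that appear in the tables in this post (the data frame and column names are made up for the example):

```r
# First- and 100th-place scores, read off the leaderboard tables in this post.
endpoints <- data.frame(
  div = c("women", "men"),
  first = c(35, 30),
  hundredth = c(605, 754)
)

# Gap between 1st and 100th place within each division.
endpoints$gap <- endpoints$hundredth - endpoints$first
endpoints
```

That works out to a 570-point gap for the women versus 724 for the men.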

Variance in Workouts – Women

Another interesting datapoint we can look at is the spread between an athlete’s best finish and their worst finish. Since the Open is only 3 events, this spread will explain a lot of their overall variance (at the Games, where there are more events, the best-to-worst spread is kinda less important). Obviously, all of these athletes finished top 100 overall, so nobody completely bombed anything.

Let’s start with the women:

Code

women_best_worst <- women_workout_res |>
  group_by(athlete) |>
  summarize(
    worst = max(workout_place),
    best = min(workout_place),
    spread = worst - best
  ) |>
  left_join(women_lb, by = "athlete")

women_best_worst |>
  ggplot(aes(y = rank, x = best)) +
  geom_segment(aes(xend = worst, yend = rank), color = "steelblue") +
  scale_y_reverse() +
  labs(
    x = "Finish Position (Spread)",
    y = "Rank (Overall Place)",
    title = "Best and Worst Finishes by Overall Rank",
    subtitle = "Women"
  )

Ok, so the way you interpret this plot:

The athlete’s overall rank is on the y-axis (the vertical axis)
The blue bar represents the spread of their performance. The left end is their best performance, the right end is their worst performance, and the width is the gap between best and worst

Keep in mind that having a small spread between your best and worst finish isn’t necessarily a good thing if even your best finish is a relatively poor placement (i.e. a big number).
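
Here's a minimal sketch of that point. Laura Horvath's placements are real (from the results in this post), while "Hypothetical Athlete" and their placements are invented purely to show a tight spread paired with a worse score:

```r
# One real athlete (Laura Horvath) and one invented athlete whose
# placements are made up for illustration only.
results <- data.frame(
  athlete = c(rep("Laura Horvath", 3), rep("Hypothetical Athlete", 3)),
  workout_place = c(395, 6, 7, 245, 250, 255)
)

# Group the placements by athlete, then summarise each group.
by_athlete <- split(results$workout_place, results$athlete)
best <- sapply(by_athlete, min)
worst <- sapply(by_athlete, max)
score <- sapply(by_athlete, sum)

summary_tbl <- data.frame(best, worst, spread = worst - best, score)
summary_tbl
```

The invented athlete has a spread of only 10 but a score of 750, while Laura's spread is 389 and her score is 408, so the tighter spread goes with the worse overall result.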

All in all, though, we tend to see bigger spreads the further down we get in the rankings, which kinda makes sense because a single bad event can crush your overall score with just 3 total events.

From here, we might want to look at which women had the largest best-to-worst event spreads:

Code

women_best_worst |>
  slice_max(spread, n = 10) |>
  select(-div) |> 
  gt() |> 
  tab_header(
    title = "Largest Differences between Best and Worst Event",
    subtitle = "Women, CF Open 2024"
  )

Largest Differences between Best and Worst Event
Women, CF Open 2024
athlete	worst	best	spread	rank	score
Emily Rolfe	504	22	482	99	596
Caitlin Bernardin	475	18	457	89	543
Tracy Johnson	450	39	411	100	605
Laura Horvath	395	6	389	62	408
Aizhan Zharasova	400	33	367	94	573
Sara Alicia Fernandez Costas	354	6	348	77	477
Baylee Rayl Christophel	380	39	341	80	499
Makenna Enslin	382	48	334	85	514
Addison DesRosiers	348	20	328	76	475
Linda Keesman	363	41	322	98	584

We see Laura on here, and I think she did (relatively) poorly on the first event compared to her other 2 event finishes, hence the large spread:

Code

women_workout_res |> 
  filter(athlete == "Laura Horvath") |> 
  select(-div) |> 
  gt() |> 
  tab_header(
    title = "Laura Horvath Open Finishes"
  )

Laura Horvath Open Finishes
athlete	workout_num	workout_place	points	score
Laura Horvath	1	395	395	7:03
Laura Horvath	2	6	6	911 reps
Laura Horvath	3	7	7	8:59

Right, so she finished 395th in event 1, but then 6th and 7th.

Variance in Workouts – Men

And we can do the same thing for the men:

Code

men_best_worst <- men_workout_res |>
  group_by(athlete) |>
  summarize(
    worst = max(workout_place),
    best = min(workout_place),
    spread = worst - best
  ) |>
  left_join(men_lb, by = "athlete")

men_best_worst |>
  ggplot(aes(y = rank, x = best)) +
  geom_segment(aes(xend = worst, yend = rank), color = "steelblue") +
  scale_y_reverse() +
  labs(
    x = "Finish Position (Spread)",
    y = "Rank (Overall Place)",
    title = "Best and Worst Finishes by Overall Rank",
    subtitle = "Men"
  )

So the point about having a small spread not necessarily being ideal is illustrated here – we see whoever finished in 98th has a teeny spread, but all of his finishes were around 250th.

Code

men_best_worst |>
  slice_max(spread, n = 10) |>
  gt() |> 
  tab_header(
    title = "Largest Differences between Best and Worst Event",
    subtitle = "Men, CF Open 2024"
  )

Largest Differences between Best and Worst Event
Men, CF Open 2024
athlete	worst	best	spread	rank	score	div
Colin Bosshard	605	3	602	86	659	men
Lazar Đukić	609	26	583	88	662	men
Henry Matthews	609	35	574	90	689	men
Taylor Self	565	2	563	84	650	men
Fernando Llaneza Pardillos	567	13	554	82	606	men
Bailey MacDonald	577	51	526	95	723	men
Carlos Ferrara Coloma	466	5	461	70	509	men
Tyler Christophel	522	82	440	96	725	men
Javier Gonzalez Fernandez	497	64	433	100	754	men
Chandler Smith	442	12	430	72	525	men

These spreads are wider than the ones we saw for the women, which again maybe suggests that the men’s field is more variable than the women’s field?

24.1

Now let’s look briefly at individual workout results. Obviously, we’ll start with 24.1. You can see the workout here.

Code

wk1 <- all_workouts |>
  filter(workout_num == 1) |>
  mutate(time_score = period_to_seconds(ms(score)))

wk1 |>
  ggplot(aes(x = time_score, y = workout_place, color = div)) +
  geom_point(alpha = .5) +
  scale_y_reverse() +
  scale_color_discrete(name = "Division") +
  labs(
    x = "Time (Seconds)",
    y = "Workout Place",
    title = "Time to Complete 24.1 by Workout Place",
    subtitle = "Top 100 Overall Men and Women"
  )

I think this is a pretty cool plot in that it shows the “shape” of the performances by these athletes. The waterfall shape suggests that there’s more separation (horizontal space) amongst the very top finishers, but the increasingly steep slope toward the right end of the graph suggests there’s less time between worse-finishing places. Which makes sense.

See that one pink dot all the way to the left – who crushed this workout that hard?

Code

wk1 |>
  slice_min(time_score, n = 1) |> 
  pull(athlete)

[1] "Colten Mertens"
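
For anyone curious about the time_score column used in this section: the code above leans on lubridate's ms() and period_to_seconds() to turn strings like "7:03" into seconds. Below is a rough base-R equivalent, assuming every score is a plain "m:ss" string (i.e. everyone finished under the time cap). The helper name to_seconds is made up for this sketch:

```r
# Convert "m:ss" strings into total seconds.
# Assumes every element has exactly one colon (minutes:seconds).
to_seconds <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(
    parts,
    function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]),
    numeric(1)
  )
}

# Laura Horvath's 24.1 and 24.3 times from the table earlier in this post.
to_seconds(c("7:03", "8:59"))  # 423 539
```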

24.2

We’ll do the same thing for 24.2. You can see the workout description here.

Code

wk2 <- all_workouts |>
  filter(workout_num == 2) |>
  mutate(n_reps = parse_number(score))

wk2 |>
  ggplot(aes(x = n_reps, y = workout_place, color = div)) +
  geom_point(alpha = .5) +
  scale_y_reverse() +
  scale_color_discrete(name = "Division") +
  labs(
    x = "Reps",
    y = "Workout Place",
    title = "Reps Completed in 24.2 by Workout Place",
    subtitle = "Top 100 Overall Men and Women"
  )

We see basically the inverse pattern here – the curve flattens toward the right (since more reps is better in this case). What’s interesting is that we see this flattening occur at around the top 100 finishers of the workout for women, but at maybe the top 30 for men. This might have something to do with the scoring (1 double under obviously isn’t equivalent to 1 deadlift), or it might be an indication that the top 100 women were considerably better than the rest of the field in this workout. Again, it’s kinda hard to tell.
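
One small note on the n_reps column: the code above uses readr's parse_number() to pull the number out of strings like "911 reps". A rough base-R sketch of the same idea, assuming the score is always a whole number followed by text:

```r
# A rep score as it appears in the results (Laura Horvath's 24.2 score).
reps_raw <- "911 reps"

# Strip everything that isn't a digit, then convert to a number.
n_reps <- as.numeric(gsub("[^0-9]", "", reps_raw))
n_reps  # 911
```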

24.3

And we’ll wrap up by looking at the same type of plot for 24.3. You can see the workout description here.

Code

wk3 <- all_workouts |>
  filter(workout_num == 3) |>
  mutate(time_score = period_to_seconds(ms(score)))

wk3 |>
  ggplot(aes(x = time_score, y = workout_place, color = div)) +
  geom_point(alpha = .5) +
  scale_y_reverse() +
  scale_color_discrete(name = "Division") +
  labs(
    x = "Time (Seconds)",
    y = "Workout Place",
    title = "Time to Complete 24.3 by Workout Place",
    subtitle = "Top 100 Overall Men and Women"
  )

Again, we see basically the same pattern we saw in 24.1.

I’ll probably do something like this again for the Quarterfinals and Semis, so if there’s anything else people would like to see, let me know.

Reuse

https://creativecommons.org/licenses/by-nc/4.0/

Citation

BibTeX citation:

@online{ekholm2024,
  author = {Ekholm, Eric},
  title = {Crossfit {Open} ’24 {Analysis}},
  date = {2024-03-19},
  url = {https://www.ericekholm.com/posts/cf-open-24},
  langid = {en}
}

For attribution, please cite this work as:

Ekholm, Eric. 2024. “Crossfit Open ’24 Analysis.” March 19, 2024. https://www.ericekholm.com/posts/cf-open-24.