Creative Ways to Slice the Tokyo Olympics (Total Medal Count is Boring!)

Oct 7, 2021

As I watched the Tokyo Olympics, all these analytics questions popped into my head. Sports and performance training has been one of the leading areas for transformation through analytics over the last 15 years. I wrote down a few questions to scope data integration and modeling needs:

Many independent disciplines at the Olympics share certain fundamental movements or skills. For example, ‘combat’, ‘team ball sports’, ‘shoot-stab’, and ‘spin-flip’. What does the aggregation of events and the medal count by country look like when I regroup Olympic events?
What are the differences in stature between Olympic Medalists and the average adult? How do medalists in various sports differ in height, weight, and age? How do medalists by sport differ from average adults and adults of similar age?
How do attributes of competitors in Olympic events map to the general population? Height and weight, and therefore BMI (weight relative to height squared, a rough proxy for body composition), have strong genetic components with some behavioral and environmental influence. What portion of the general population is born with no realistic chance to compete in certain Olympic events? Should we aim for a percentage of events for which an average person, with proper motivation and training, has a chance to compete?
Can I build reconciliation functions to check for and fix errors in available Olympic summaries via HTML screen scrapes and API calls to source data on athletes or sports?
Can I automate the creation of custom hierarchical plots, for example to regroup Olympic disciplines by my common-sense categories and build layers of aggregation? Plotly, show me the way.
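One way the hierarchical-plot question could be approached is with a Plotly sunburst, which takes parallel lists of labels, parents, and values. This is a minimal sketch, not the code from the repo: the sample rows and their counts are illustrative only, and the helper names `build_hierarchy` and `make_sunburst` are hypothetical.

```python
# Illustrative rows only: (group, discipline, medal events). Not the post's dataset.
SAMPLE_ROWS = [
    ("combat", "Boxing", 13),
    ("combat", "Judo", 15),
    ("teamballsports", "Basketball", 4),
    ("teamballsports", "Football", 2),
]


def build_hierarchy(rows, root="All events"):
    """Return parallel labels/parents/values lists for a sunburst or treemap."""
    group_totals = {}
    for group, _discipline, n_events in rows:
        group_totals[group] = group_totals.get(group, 0) + n_events

    # Root node first, then one node per group, then one per discipline.
    labels = [root]
    parents = [""]
    values = [sum(group_totals.values())]
    for group, total in group_totals.items():
        labels.append(group)
        parents.append(root)
        values.append(total)
    for group, discipline, n_events in rows:
        labels.append(discipline)
        parents.append(group)
        values.append(n_events)
    return labels, parents, values


def make_sunburst(rows):
    """Build the figure. Plotly is imported here so build_hierarchy has no dependencies."""
    import plotly.graph_objects as go

    labels, parents, values = build_hierarchy(rows)
    trace = go.Sunburst(labels=labels, parents=parents, values=values, branchvalues="total")
    return go.Figure(trace)


labels, parents, values = build_hierarchy(SAMPLE_ROWS)
```

Calling `make_sunburst(SAMPLE_ROWS).show()` would render the chart. Setting `branchvalues="total"` tells Plotly that each parent's value already includes its children, which is why the group and root totals are summed up front.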

As I built and ran my data reconciliation functions, I ran plots that I could reconcile easily with public charts or data, like medal count by country. Once I had my core datasets verified, I could expand to look at the more interesting questions I wanted to ask.

Next was to address the question of how to show the difference in physical stature (height and weight) across medalists in different disciplines and versus the average US adult woman and man.

I wrote Python code to source medalist data, scraping HTML with the Pandas and Beautiful Soup packages. I filled gaps with manual searches for athlete info. This data extraction and scrubbing work ended up being the 66–75% of total effort that I’ve learned to anticipate! I used NumPy to build histogram distributions for adult height and weight based on a detailed NCHS (National Center for Health Statistics, 2018) report I had for the United States.
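As a rough illustration of the NumPy step, here is one way a height distribution could be approximated when only summary statistics are at hand. It is a sketch under stated assumptions: the mean and standard deviation below are placeholders to be replaced with the figures from whichever report you use, and the normal approximation is my simplification, not necessarily how the original histograms were built.

```python
import numpy as np

# Placeholder summary statistics in inches; substitute the values from your source report.
ASSUMED_MEAN_IN = 63.5
ASSUMED_SD_IN = 2.7

# Simulate a population from the assumed normal distribution (seeded for repeatability).
rng = np.random.default_rng(seed=0)
simulated_heights = rng.normal(ASSUMED_MEAN_IN, ASSUMED_SD_IN, size=100_000)

# One-inch bins; density=True scales the bars so their total area is 1.
bin_edges = np.arange(54.0, 75.0, 1.0)
density, edges = np.histogram(simulated_heights, bins=bin_edges, density=True)

# Left edge of the tallest bar, a quick sanity check that the peak sits near the mean.
peak_bin_start = edges[np.argmax(density)]
```

The `density` and `edges` arrays can then be handed to a bar trace so that medalist points are drawn against the population curve.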

Height and Weight of Medalists vs. Average US Adult Women and Men

Using my go-to plotting tool, Plotly, I mapped out various medalist data against both averages for all adults and for adults in the 20–29 age group (a group in which most Olympic medalists fall). Here is a plot with Height in the vertical dimension vs. Weight in the horizontal:

Height and Weight of Olympic Medalists by Discipline versus US Adult Women and Men

Height differs from Weight in a few (obvious) ways. We have little or no control over height; it is genetic except as influenced by childhood illnesses or nutritional deficiencies. Weight is arguably more controllable, as is Body Mass Index (BMI), which relates weight to height and serves as a rough proxy for body composition (the proportion of fat to lean muscle), a key measure for an athlete. While genetic tendencies likely exist, they can be influenced with regular training.

Height clearly presents an advantage in certain sports (most notably in most “Team-Ball Sports” in the graph above). Weight does too but more as a proxy for performance-related measures like the ratio of lean muscle mass to fat. If we had more granularity behind the measure of weight, such as body-fat percentage (which BMI only approximates), then we could probably tell more of a story about the difference between Olympic medalists and the average adult!
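For reference, BMI itself needs nothing beyond the height and weight already collected: it is weight divided by height squared, with a factor of 703 when working in pounds and inches. The sketch below applies that standard formula to the men's soccer averages reported later in this post; the function name is hypothetical. Because BMI cannot separate fat from lean mass, it stays a coarse proxy for athletes.

```python
def bmi_from_imperial(weight_lb, height_in):
    """Body Mass Index from pounds and inches: 703 * weight / height^2."""
    return 703.0 * weight_lb / height_in ** 2


# Mean height and weight of men's soccer medalists, taken from this post.
men_outfield_bmi = bmi_from_imperial(157, 70.1)
men_keeper_bmi = bmi_from_imperial(179, 73.2)

print(f"outfield: {men_outfield_bmi:.1f}, goalkeepers: {men_keeper_bmi:.1f}")
```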

In the graph I split plots for adult women and men. For each gender I plot all US adults as well as those in the 20–29 age group, and I display the mean for each of those four groups. I also display the 75th percentile for each group, plus the 90th percentile for the male group because certain men’s Olympic events show extreme attributes. To distinguish the points on this plot, for both Medalists and ‘average Americans’ I use circles for women and triangles for men.
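If a report gives only a mean and standard deviation, reference points like these can be approximated from a normal distribution using Python's standard library. This is a sketch under assumptions: the mean and standard deviation are placeholders, the helper name `reference_points` is hypothetical, and a report's tabulated percentiles should be preferred where available.

```python
from statistics import NormalDist

# Placeholder parameters in inches for an adult male height distribution.
ASSUMED_MEN_HEIGHT = NormalDist(mu=69.0, sigma=2.9)


def reference_points(dist, percentiles=(0.75, 0.90)):
    """Return the mean plus the requested percentiles of a normal distribution."""
    points = {"mean": dist.mean}
    for p in percentiles:
        points[f"p{round(p * 100)}"] = dist.inv_cdf(p)
    return points


men_points = reference_points(ASSUMED_MEN_HEIGHT)
```

Here `men_points` holds the keys `mean`, `p75`, and `p90`, one value per marker to plot.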

Population height is not a single point but a range that approaches a normal distribution, so I charted medalists against a histogram representing the distribution of heights among US adult women and men, shown in the chart below:

How is this information actionable? In all of the “combat” sports (boxing, wrestling, judo, karate), competitions are split across weight classes. It would be interesting to consider height classes in sports in which we see a clear correlation between height and success. In combat, height is correlated with reach, whereas weight can slow an athlete down, so why segregate competitors on weight alone?

Also, when selecting the disciplines and events for an Olympic Games, what proportion of events should we target in which freak physical attributes are not the ante? Soccer, or “Football”, is a good example of a team sport in which the height of medalists is within 1 standard deviation of the population at large. I’m editing this post during the World Cup, and it reminds me that almost no kid is genetically excluded from trying to someday compete in a World Cup, if they’ve got the guts, drive, and discipline.

In the above chart, I plotted four points for Soccer. They are averages for Men and Women medalist teams and for outfield players versus goalkeepers on each. So, moving from left to right we first see the plot for Women’s Soccer, outfield players, Women’s goalkeepers, then the same for the Men. While male goalkeepers are on average taller than RugbySevens or FieldHockey medalists, male outfield players are the shortest of any of the athletes in my “Team Ball Sports” grouping.

This type of analysis can help us consider having a range of events selected for the Olympics which don’t require extreme physical stature as the entry point. I have loved basketball my whole life, and I won some 3-point or free-throw contests when I was young. I also had to admit that at 5'9" I am at a big disadvantage in the sport! Don’t get me wrong, basketball requires incredible athletic ability and skills in dribbling, shooting, and passing, but there is a clear height-bias. What if we were to split 3on3 basketball into two classes for men: one for 6'1" and above, and one for under that height? OK, maybe there wouldn’t be much interest in the latter, but it’s just like separating combat sports by more than weight- we’re in an intelligent, analytics future when we should try out some innovations to level and segment competition.

Soccer Truly is The World’s Game

I am also a lifelong Soccer fan, even more than hoops. I did some additional work with athlete data in soccer. I plotted the difference among medalist teams (Gold, Silver, Bronze) between soccer outfield players and soccer goalkeepers and came up with the following:

Men’s Soccer outfield: mean height: 70.1" (5 ft 10 in) mean weight: 157 lbs

Men’s Soccer goalkeepers: mean height: 73.2" (6 ft 1 in) mean weight: 179 lbs

Averages for soccer outfield players among medalist teams, both for men and women, were close to the US adult population average for their respective gender. By ‘close’ I mean less than 0.5 standard deviations.
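The "within 0.5 standard deviations" check is a z-score. A minimal sketch follows; the medalist mean comes from this post, while the population mean and standard deviation are placeholder assumptions that should be replaced with the values from your reference report.

```python
def z_score(value, pop_mean, pop_sd):
    """Number of population standard deviations between value and the population mean."""
    return (value - pop_mean) / pop_sd


# Placeholder population parameters in inches (assumed, not taken from the post).
ASSUMED_US_MEN_MEAN_IN = 69.0
ASSUMED_US_MEN_SD_IN = 2.9

# Mean height of men's soccer outfield medalists, from this post.
outfield_z = z_score(70.1, ASSUMED_US_MEN_MEAN_IN, ASSUMED_US_MEN_SD_IN)
is_close = abs(outfield_z) < 0.5
```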

It surprised me that medalist players averaged only 5'10" and an un-intimidating 157 lbs, in a sport with such intense competition throughout the world! In an event such as the Olympics, meant to bring the world together for the enjoyment of fair competition, Soccer is a sport where one is limited only by their skill and effort. Perhaps there are underlying genetic traits that give an advantage, but we don’t see them in height and weight.

The overlap between soccer medalist attributes and the distribution of the general population tells me billions of us have a chance at that podium. Only these select few actually made it, and it makes their character and drive that much more inspirational. That is in the spirit of the Olympic Games: out of an entire world of potential contestants, these few athletes have differentiated themselves… (cue the Olympic theme with those pounding drums).

The selection of Olympic sports should probably maintain a mix of athletes with extreme attributes and athletes who’ve made the most of attributes within the mainstream. I enjoy watching events like Rugby, Water Polo, or Shot-putting where clearly only a select few with the genetic predisposition plus drive can compete for the spotlight. Newer sports like Skateboarding have medalists with height and weight attributes that are even below the population mean. Not only is there no height or muscle mass bias, but there is perhaps an age bias toward pre-adults (the average age of women’s park skateboarding medalists was between 15 and 16). It was purely about skills and the poise to demonstrate those skills under lots of pressure.

Event Entry Restrictions: Women, Men, Mixed, and Open

Besides height and weight, another dimension for analysis of events is gender restriction- with four entry types: Women, Men, Open, and Mixed. Here is what my data showed on the breakdown of events at the Tokyo Olympics:

Men 165
Women 156
Mixed 12
Open 6
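A breakdown like this is easy to reconcile against the known total of 339 medal events. The sketch below uses only the counts from this post; the variable names are hypothetical.

```python
# Event counts by entry type, as listed in this post.
events_by_entry = {"Men": 165, "Women": 156, "Mixed": 12, "Open": 6}
EXPECTED_TOTAL = 339  # total medal events at Tokyo, per this post

total_events = sum(events_by_entry.values())
if total_events != EXPECTED_TOTAL:
    raise ValueError(f"Entry types sum to {total_events}, expected {EXPECTED_TOTAL}")

# Share of all medal events per entry type, as a percentage.
shares = {entry: round(100 * n / total_events, 1) for entry, n in events_by_entry.items()}
```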

I’ve seen posts and blogs pushing for more events with mixed and open entry. If that is a goal, how would we best do that? One way would be to use the concept of classes, which is already applied in combat competitions, and be creative about applying classes to more than just Weight. That may allow for more Open entry events. To stay relevant and interesting, the Olympics should always be thinking about new sports to include or new ways to organize events in existing sports to allow a greater diversity of competitors. Analytics can identify the right balance to keep competitions interesting, fair, and open to a wide range of competitors as well as interested fans.

Creating Higher-Level Groups for the 48 Disciplines and 339 Medal Events

My brain wanders while watching various Olympic disciplines, and I think about things like the fundamental movements or actions involved and what they have in common with other disciplines. There are team ball sports, there are one-on-one combat sports, there are sports where people shoot projectiles or stab, those where they spin and/or flip, and where contestants need to pedal, paddle or row. I thought it would be fun to create fundamental attributes on which to group all the events and see what pops out.

Once I created these attributes, I wanted to see the distribution of medal events by group. Next, I wanted to see the medal count by country within each group. Would I find that certain countries excelled within my common-sense groups in a way that differed from the medal counts shown on the news? I did find that!
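The regrouping step boils down to a lookup from discipline to group followed by a count. Here is a minimal sketch with plain Python objects; the mapping shows only a handful of disciplines and is my guess at how they would be assigned, and the event rows are hypothetical.

```python
from collections import Counter

# Partial, illustrative mapping from official discipline to a common-sense group.
DISCIPLINE_TO_GROUP = {
    "Boxing": "combat",
    "Judo": "combat",
    "Wrestling": "combat",
    "Karate": "combat",
    "Football": "teamballsports",
    "Basketball": "teamballsports",
}

# Hypothetical medal events, one dict per event.
medal_events = [
    {"discipline": "Judo", "event": "Event A"},
    {"discipline": "Boxing", "event": "Event B"},
    {"discipline": "Football", "event": "Event C"},
    {"discipline": "Surfing", "event": "Event D"},
]


def events_per_group(events, mapping, default="ungrouped"):
    """Count medal events per group; unmapped disciplines land in the default bucket."""
    return Counter(mapping.get(e["discipline"], default) for e in events)


group_counts = events_per_group(medal_events, DISCIPLINE_TO_GROUP)
```

Keeping an explicit "ungrouped" bucket makes it obvious when a discipline has not been assigned yet.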

The number of medal events in combat competitions far exceeds that of any other group. I knew Athletics and Swimming as formal Olympic ‘Disciplines’ had the most medal events, but when I grouped disciplines by common themes, my ‘rowpaddle’ and ‘freeride’ groups were not far behind Athletics or Swimming. Just behind them was the ‘spinflip’ group. It was a toss-up whether to put trampoline in ‘spinflip’ or ‘balance’; had I not opted for the latter, ‘spinflip’ would have moved up to 5th.

I created a report of the top x (default x=8) medal-winning countries within each common-sense grouping. It gives a different slice, showing which countries emphasize, or excel at, which types of sports. Below is the plot for the ‘combat’ group, representing the greatest count of medal events of any of my groupings: 74.
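A top-x report like this can be sketched with a Pandas groupby. The rows, country codes, and the function name `top_countries` are hypothetical; only the default of 8 comes from the post.

```python
import pandas as pd

# Hypothetical medal rows: one row per group, country, and medal tally.
medal_rows = [
    {"group": "combat", "country": "AAA", "medals": 3},
    {"group": "combat", "country": "BBB", "medals": 1},
    {"group": "combat", "country": "AAA", "medals": 2},
    {"group": "combat", "country": "CCC", "medals": 4},
    {"group": "teamballsports", "country": "BBB", "medals": 2},
    {"group": "teamballsports", "country": "CCC", "medals": 1},
]
medals_df = pd.DataFrame.from_records(medal_rows)


def top_countries(df, x=8):
    """Return the x countries with the most medals within each group."""
    totals = df.groupby(["group", "country"], as_index=False)["medals"].sum()
    totals = totals.sort_values(["group", "medals"], ascending=[True, False])
    return totals.groupby("group").head(x).reset_index(drop=True)


top2 = top_countries(medals_df, x=2)
```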

And here is the plot for another common-sense grouping: ‘teamballsports’:

With teamballsports the USA rises to the top, but it is interesting that in neither teamballsports nor combat do we see China in the top 8. Yet China was number two for total medal count. I’ve only shown charts for 2 of my 14 groups, but they are two of the biggest for medal events. Were I to show ‘spinflip’, we’d see that China dominated.

My Olympics Python App

I set out to create new ways to look at Olympic disciplines and events. I did not intend this to be a Python how-to article; it’s about the results, not the means of obtaining them. But I did want to conclude by explaining a bit about what I built and my approach, and by linking to my GitHub repos. I ended up writing over 1,300 lines of Python code for this app, which is actually less than a third of the length of other Python apps I’ve written recently.

Although Python has its basis as an interpreted scripting language, I build functions and classes and organize them into packages as I would for any project in any language. I started (reluctantly) in Python after taking a machine learning course 3 years ago in which Python was required. I was already using R for analytics and had invested time learning R and R Studio, and I had been certified as a Java J2EE developer earlier in my career, so my initial impression of Python was not that favorable. It seemed unstructured and inconsistent. I was kicking and screaming through my immersion in Python, but now I love it.

One reason for my conversion was my discovery of PyCharm Pro. It is a great dev workbench, giving me code windows, hover docstrings and library docs, console, terminal, and debugger: everything to build, test, debug, and run in a fluid rhythm. No, they’re not paying me to say that, I just appreciate well-designed products…

I use a variety of containers for data. Pandas DataFrames are highly flexible but not always efficient, so I often use list-of-dict or dict-of-dict structures. There is great efficiency and flexibility with these aggregations of core Python objects. The Pandas ‘from_records’ function allows me to read a list of dicts directly into a DataFrame, and ‘to_dict’ with the ‘records’ orientation converts a DataFrame back into a list of dicts.
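The round trip between core Python containers and a DataFrame can be sketched as follows. The athlete records are made up for illustration; `DataFrame.from_records`, `DataFrame.from_dict`, and `DataFrame.to_dict` are standard Pandas calls.

```python
import pandas as pd

# Hypothetical athlete records as a list of dicts.
athletes = [
    {"name": "Athlete A", "height_in": 70.0, "weight_lb": 155.0},
    {"name": "Athlete B", "height_in": 73.0, "weight_lb": 180.0},
]

# List of dicts -> DataFrame -> list of dicts.
athletes_df = pd.DataFrame.from_records(athletes)
round_trip = athletes_df.to_dict(orient="records")

# The dict-of-dict variant, keyed by athlete name, uses the "index" orientation.
by_name = {
    "Athlete A": {"height_in": 70.0, "weight_lb": 155.0},
    "Athlete B": {"height_in": 73.0, "weight_lb": 180.0},
}
by_name_df = pd.DataFrame.from_dict(by_name, orient="index")
by_name_round_trip = by_name_df.to_dict(orient="index")
```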

I adapted the core events_df DataFrame from a Kaggle download of a CSV file. Then I wrote validation functions to audit the data in this table and correct mistakes using data I sourced through Beautiful Soup scraping. I was happy to find that the Pandas read_html function works well for a lot of straightforward extracts of tables from HTML pages, which allowed me to simply augment it with Beautiful Soup for tricky details on certain pages. One design lesson I learned with this app is the value of building data validation or ‘audit’ functions to verify and correct externally sourced data.
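An audit function of this kind might look like the sketch below, which compares one field against a trusted reference table and corrects mismatches. It is not the code from the repo: the function name `audit_events`, the column names, and the sample rows are all assumptions, and it expects each key to appear at most once in the reference table.

```python
import pandas as pd


def audit_events(events_df, reference_df, key="event", field="gold_country"):
    """Compare `field` against a reference table; return (corrected copy, mismatch report)."""
    merged = events_df.merge(
        reference_df[[key, field]], on=key, how="left", suffixes=("", "_ref")
    )
    ref_col = f"{field}_ref"
    # A mismatch needs a reference value to exist and to differ from ours.
    mismatch = merged[ref_col].notna() & (merged[field] != merged[ref_col])

    corrected = events_df.copy()
    corrected.loc[mismatch.to_numpy(), field] = merged.loc[mismatch, ref_col].to_numpy()
    return corrected, merged.loc[mismatch, [key, field, ref_col]]


# Hypothetical data: Event B disagrees with the reference, Event C has no reference row.
events_df = pd.DataFrame(
    {"event": ["Event A", "Event B", "Event C"], "gold_country": ["AAA", "BBB", "CCC"]}
)
reference_df = pd.DataFrame({"event": ["Event A", "Event B"], "gold_country": ["AAA", "XXX"]})

corrected_df, mismatch_report = audit_events(events_df, reference_df)
```

Returning the mismatch report alongside the corrected table keeps a record of what was changed, which helps when the "trusted" source itself turns out to be wrong.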

In addition to event results, I also built functions to source detail on athletes, participants and medalists by country. Since it is “expensive” to go out and scrape this detail data each time I want to run the app and do a new analysis, I put options in the app to read from validated backup data versus going out and pulling source data from the Internet. Since the data is static, once I built the reconciliation routines to fix any incorrect records, I could simply write those files to storage once validated and use them the next time I ran the app.
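The read-from-backup option can be sketched with a small loader that only calls the expensive fetch when no saved file is available. The function name `load_records`, the JSON file format, and the stand-in fetch function are assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path


def load_records(cache_path, fetch, use_cache=True):
    """Return records from the local backup if allowed and present, else fetch and save."""
    cache_path = Path(cache_path)
    if use_cache and cache_path.exists():
        return json.loads(cache_path.read_text(encoding="utf-8"))
    records = fetch()  # the expensive step: scraping or API calls
    cache_path.write_text(json.dumps(records, indent=2), encoding="utf-8")
    return records


# Demo with a stand-in for the scraper that counts how often it is called.
calls = {"count": 0}


def fake_fetch():
    calls["count"] += 1
    return [{"athlete": "Athlete A", "discipline": "Judo"}]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "athletes.json"
    first = load_records(path, fake_fetch)   # no backup yet, so it fetches
    second = load_records(path, fake_fetch)  # served from the backup file
```

In a real run, the validation and correction routines would sit between the fetch and the write, so that only reconciled data is saved.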

As with my previous apps, I’ve posted all this data and Python code to my GitHub account. If you are doing something similar, or are able to leverage what I’ve built, I’d love to hear about it, and I am open to legitimate opportunities. https://github.com/briangalindoherbert/github_Tokyo2021.

All the Best,

Brian Herbert