The Story
Wikispeedia can be seen as a directed graph
In the context of Wikispeedia, difficulty measures how challenging it is for a player to navigate from a given start article to a target article. But “challenge” can mean different things: some paths may be long (structurally far apart), others may be conceptually tricky even if a few clicks away. We consider both the objective structure of the Wikipedia network and observed player behavior. To start finding an answer to what is the most difficult Wikispeedia game, we need to understand the logic of the game. A starting article, a target article and many links in between. Actually 18 millions of possible start and end articles pairs! This is quite dense information, a good way to try and understand it is to build a directed graph from it. Each node is an article and the links connect them together.
Missing Links
Before going any further let's check if any article is disconnected from the graph. These correspond to articles that neither link to any other article nor are linked to by any article in the graph.
First, we find 10 articles without any links directing or emerging from the them. These articles are not of interest to us and should be discarded! Now what about articles without any links directing to them ? These are what we call impossible targets, there are 456 of them. In fact, if I give it to you as a target you might wonder a very very long time before reaching it... However, they could be a great start article so let's keep them! Finally, we have the opposite case: articles without any links on their page. We call them dead starts. In fact, good luck leaving those 4 article... However, they could be great target articles so let's also keep them!
Symmetry
Symmetry dictates an important aspect of our graph because it is directed and not symmetric. A difficult pair might not even be doable if reversed, so let's keep this in mind! For example, the article 11th century can be reached from €2 commemorative coins in 2 clicks, while the other way round is not possible, as there is no possible path.
Strongly Connected Graph
Once this graph is built we can observe two important points. There are some islands, more precisely one! This island is completely disconnected from the main graph (look at the upper right of the plot below). These article should be discarded. There is not much interest in playing in this small island from Friend Directdebit to Directdebit... It also seems to be that these articles don't have any categories and have buggy HTML so they have also been discarded during data processing for those reasons too.
Now that we have only one island, let's define an important notion : a Strongly Connected Graph is a graph where each node is reachable from any other node. We can easily try and compute whether our graph is strongly connected or if it is divided in strongly connected subgraphs!
Interesting! Here we see that the graph is mainly a central graph of 4051 articles all strongly connected to each other. Then we also have a lot of peripheral articles with some aggregated in islands up to a size of 6 ! These can reach the main island or be reached from it but not both! What if I gave you a start article in this island of size 6 and had to reach an article in the main island ? That would probably be difficult but possible!
How can we measure the difficulty of a specific pair
Asteroid ➜ Viking
Now let's take our previous example, Asteroid as starting article and Viking as a target. Let's find a way to rate the difficulty of a pair of articles.
Shortest Path
The simplest metric we can calculate to answer this question is the shortest path between those articles. In other words, the minimum number of links we need to click on to reach a target article from a specific start article. However some articles are technically only two links apart, but those links can be conceptually difficult to find for the average person. For example, who would guess that parrot and cricket (the game) are just one Winston Churchill away? So we shouldn't only look at the length of the shortest path as two articles can have one single optimal path which is short, but that is very difficult to find. Would you have found Winston Churchill from parrot in one click ? We wouldn't have... Let's add the number of possibilites to optain the shortest paths as well. Another metric to rate the difficulty to find a shortest path is the minimum number of categories you have to cross to find the target article. The broad categories of Wikispeedia reflect the general knwoledge needed to exceel in this game, it would be easier for EPFL's student to navigate in the Science category only than to have to go through Art, Music and Business Studies additionaly.
With our first metrics, let's look again at our example.
We see that you can go from Asteroid to Viking in only 3 steps. There are 15 ways to do this and you would only need to cross 2 categories (Science and History categories). One way to find the shortest paths is to go through Earth and Middle Ages. Is this considered difficult in Wikispeedia?
Actually not so much. It is even on the easier part of our distribution with an average distance of 3.20 and an average minimum coverage of 2.50 categories. We see here that some articles are at best 9 links away to one another or require to cross 7 different categories (out of 15 possible)! The distributions are quite logarthimically normally distributed.
Switching Category
Let's look into our category coverage metrics. Articles can be divided depending on their category, subcategory, and subsubcategory. We calculated the minimum number of different categories we need to go through to go from start to end ? This tells us how much our shortest path actually travels through categories!
Asteroid is in the category Science and Viking is in History. Are those categories difficult to connect?
Each cell shows the average shortest path length between a given start and end category. Colors ranking (pink (shortest) -> blue -> yellow (longest)) indicate longer distances (higher structural difficulty). This heatmap helps identify which category transitions are the hardest. We can see that when having countries as an end category has an overall low shortest path with any start category. It's also interesting to note that going from IT to Design and Tehcnology or Science is not in average the games with the shortest path, this is a bit conterintuitive. Not all categories to themselves have the same shortest path distance, meaning there is a difficulty in the category itself, for example Mathematics to Mathematics is easier to reach than Science to Science, suggesting that subcategories have an impact.
Links
The difficulty of a game could also be influenced by the amount of links that are in the first article, and how many links in the whole game bring you to the target. If you don't have many options to leave the initial article, or only 3 ways to land on the target, you might struggle... Asteroid has 19 outgoing links, while Viking has 78 links that can help you land on the page. Is that considered difficult?
Well when looking at this distributions, we see that the number of incoming links to Viking are above average and the number of links within Asteroid is normal. In the middle, you have 19 options to leave the start article and 8 ways to land on the target. Overall, the plot illustrates that start articles have significantly more outgoing links than target articles have incoming links, revealing a structural asymmetry in how players can navigate the Wikispeedia graph.
We can also look at our link counts like this, it uses all possible paths in the network, each bin contains paths (not single articles). The x axis represents the number of links within the starting article and the y axis the number of links "bringing" you to the target article. The bins contain paths with a specific number of each. The color shows how many paths have that certain amount of ingoing and outgoing articles, so the darker it is, the less paths have this number of links (e.g. not many paths have 1 link in the start and 1000 going to the target). Yellow is a bin with lots of articles having the same amount of both types of links, the most common combination is having only 1 incoming link to the target but over 25 articles in the start article. Most articles have around 1-100 incoming links and 10-60 outgoing links.
Page Rank
One last metric could be the Page Rank score. It is a score computed in a directed graph of links to measure the overall influence of one node within the set. Basically, it reflects the overall influence of the article in the network. Asteroid and Viking have a PageRank score of 0.0004 and 0.0008 respectivly. Those Page Rank are way above the median (purple dashed line) of 0.000087 and the mean of 0.000218.
PCA
In this structural analysis, we have defined some metrics of difficulty based on the start article, target, and the paths connecting them, we can try to perform a PCA with three components to have an idea of the overall difficulty of a game.
By plotting the PCA, we can see that over 72% of the variability is explained by the three first principal components. We can estimate a rough difficulty rating based on these components. We get results that make sense, with easy games (level 1) such as Christianity -> United States or Japan -> Europe, and harder games Edward Jenner -> Malaspina Glacier or Scheme programming language -> John Bull (locomotive). I wouldn't want to have to play that last one... Our example game is considered relatively hard, at the border between level 4 and 3. So now with all this structural information about Asteroid and Viking, would you change your difficulty rating?
When interpreting the importance of features in the PCA, we distinguish two types of contributions:
- Pink features, which have a strong negative impact on the principal component (PC)
- Yellow features, which have a positive impact on the component.
PC1 is dominated by features related to the target articles, such as incoming link count and PageRank. These features contribute negatively: Well-connected target articles (high incoming links, high PageRank) push PC1 toward negative values. Distance, on the other hand, has a strong positive effect. The further apart two articles are, the larger (more positive) PC1 becomes. Overall, PC1 appears to be positively correlated with difficulty: Large positive PC1 indicates distant articles, poorly connected target article resulting in a difficult pair Large negative PC1 indicates close articles, well-connected target article resulting in an easy pair
PC2 is primarily influenced by characteristics of the starting articles, notably outgoing link count and PageRank, both with strong positive contributions. Thus, PC2 is negatively correlated with difficulty: Large positive PC2 indicates well-connected starting article with many outgoing links resulting in an easy pair
PC3 is shaped mainly by minimum number of categories crossed (negative impact) and number of optimal paths (positive impact) This again suggests a negative correlation with difficulty: Large positive PC3 indicates many optimal paths and few category transitions resulting in an easy pair
User Behaviour ; Do you actually just suck ?
After defining structural metrics that helps us understand what a "difficult" game theoretically looks like, we need to validate these metrics with actual player performance. The structure of the graph, through PCA and other metrics, only gives us a map of the game, i.e. which articles are central and which are isolated. But to build a true model of difficulty, based on player behavior, we need to observe the "hikers" walking the paths. To achieve this, we analyzed a dataset of over 74,000 recorded games played by users. While thismay sound like a lot of data, it really highlights the size of the Wikispeedia universe. With more than 18 million possible start-target pairs, our human data only covers less than 0.4% of all possible games. Furthermore, unlike a controlled experiment, users can play the same game multiple times, learning and improving. This sparsity and repetition make our task challenging: we are trying to generalize the difficulty of the entire "universe" based on a sample of explorers. To better analyse player behavior, it is crutial to limit our dataset to games that depict genuine navigational attempts, filtering out anomalies like long pauses or rapid random clicking, which brings our dataset down to 41,436 games. A dinstinction must also be made between finished and unfinished games. Finished games provide complete paths from start to target, allowing us to analyze success and efficiency, and allows defining more precise difficulty metrics. Unfinished games, where players gave up or surrendered, offer insights into frustration and perceived difficulty.
Winners ! Let's finish it
A game is classified as finished when the player successfully reached the target. There are 27'800 finished games out of 41'436 total games. Even here, players’ performances vary widely. To quantify behavior, we introduce several metrics.
Size matters : When short isn't easy (Path Length Ratio)
A game with a shortest path of 7 clicks might be intuitive and easy, while a game with a shortest path of just 1 click could be obscure and impossible to find. Structural distance tells us how close two articles are, but it doesn't tell us how hard the bridge is to cross. To capture this, we look at Efficiency. We defined the Path Length Ratio as:
This ratio is only defined for finished games, where the user successfully reached the target. A PLR of 1 means the user took the optimal path, while a PLR of 1/4 indicates a path that was 4 times longer than the shortest path. To visualize this, we computed the PLR for all finished games, regroupped all games by mission (a set of start and target articles) and averaged the PLR across all users who played that mission.
When we plot the Path Length Ratio against the Theoretical Shortest Path, we expect a linear difficulty curve. We assume that as the target gets further away (Shortest Path 6 or 7), users will make more mistakes and the ratio will decreases. The data tells a different story:
- Short Paths, Big Mistakes: Games that should be "easy" (1-3 clicks) are surprisingly unforgiving. Because the path is so short, a single mistake is statistically catastrophic. If a target is one click away and you miss it, taking just three extra clicks makes your PLR rocket to 0.25. Short games are a test of precision, where a moment of inattention is punished heavily.
- Long Paths, Small Mistakes: Games requiring at least 5 clicks appear suspiciously efficient, recovering higher ratios. Are humans actually better at complex navigation? Not necessarily, as we only gather data of the survivors in this section. A novice player given a 7-click path likely gets lost, frustrated, and quits. Their data never makes it into the "Finished" set. Patient and skilled players will have to select optimal paths to survive, leading to higher PLRs.
Paradoxically, users appear "clumsier" on easy games and "smarter" on hard games. This isn't because they play better, but because short games allow you to get lost and recover, while long games demand perfection to survive.
The cost of thinking (Time Per Page)
We can also define a "good" game not just by the number of clicks, but by the time it takes to make them. Does the time spent per page reflect the cognitive load of the mission?
When we plot Time Per Page (TPP) against the Shortest Path Length, we see a clear positive correlation. Unlike the Path Length Ratio—which deceptively suggested that long games were "efficient"—the TPP tells the true story. As the structural distance increases, users spend significantly more time on each page.
- Short Paths: Users rely on intuition and quick associations (low TPP).
- Long Paths: Users are forced to stop, read, and strategize (high TPP). The deeper the mission, the heavier the cognitive load.
But is TPP just a reflection of path length, or does it capture something deeper about the mission's structure? To explore this, we selected some structural metrics from our earlier analysis to see how they correlate with TPP.
The matrix above helps us visualize the relationship between TPP and the structure of the article itself. The maximal TPP is found in the top-left corner (Few Words, Many Links). These are typically "Hubs" or "List" pages. Users fly through them because they don't have to read; they just scan for the next keyword. As we move to the right (High Word Count), the TPP skyrockets. Content-heavy pages act as brakes. Regardless of the number of links or subtitles, if an article is dense, users get stuck reading it, drastically slowing down the game. The subtitle count is correlated with higher number of words and links so its effect is less clear. However, it seems that more subtitles slightly reduce TPP, possibly by breaking up content and making navigation easier.
Of course, if you randomly click very fast on a lot of articles, it doesn't describe if you played well, even with a small TPP. You would have to be very lucky in that scenario, dead-ends, backtracks and other mistake wouldn't caracterize those games as well played.
Backtracking refers to visiting an articles that has already been visited. It is the decision to return to a previous page. We identify specific articles and categories that trigger the most of these U-turns (backtracks). These are likely pages that looked promising but didn't contain useful outgoing links.
It was shown in the structural analysis that switching from one category to another resulted in different lenght of shortest path on average. Now, on a user percpective, which categories are switched to the most ? These categories are called the "Highways" of Wikipedia. Some categories act as central hubs, while others are traps. Countries and Geography are the easiest targets, it's the "Geography" Highway. They act as "global hubs", if you need to get to "London," you know exactly how to find it. Categories like Citizenship, Everyday Life, and IT are the hardest to navigate, they act as traps. These concepts are often abstract, lacking the clear hierarchical structure found in Geography or History.
Baddly, basicly and geniusly played sessions
In our dataset, each session is associated with a hash of the originating IP address. While this is useful for grouping repeated sessions from the same source, it does not identify a unique individual. A single IP address may correspond to one person, a shared family device, a school network, or a larger institutional gateway. As a result, any attempt to infer “player types” from this identifier would be like playing M. Hyde and Dr Jekyll.
To avoid attributing behaviors to individuals we cannot reliably distinguish, we restrict our analysis to session-level patterns rather than player-level traits. Each session is treated as its own independent exploration of the graph, with no assumptions about who is behind it or whether two sessions from the same IP come from the same person. Even if it were the same person, how could we know if they were at their peak or if they were drunk anyway ?
To better understand how the graph is explored, we performed a clustering analysis on all recorded sessions. The goal was to identify whether different styles of play emerged naturally from the behavioral data.
Using metrics such as path length ratio (how optimal the chosen path is), tpp (time per position), and number of backtracks, sessions were automatically grouped into three clusters representing distinct session behaviors.
- Efficient sessions: These sessions follow a nearly optimal path and reach the exit quickly. Characteristics include low path length ratio (close to the shortest path), low tpp (faster decision rhythm), very few backtracks.
- Casual sessions: They progress in a mostly efficient way but include moments of hesitation or exploration, with moderate path length ratio, average tpp.
- Lost sessions: These sessions take longer and include more wandering or revisiting of previous positions. They typically show high path length ratio, high tpp (slow decision-making).
These clusters help reveal how different session patterns emerge from the data. Even without identifying individual users, the analysis shows that the Wikispeedia graph encourages several distinct ways of being explored.
Do players know what's hard? (ratings)
Wikispeedia allows players to assign a difficulty rating at the end of a game. We have defined "Hard" mathematically (High TPP, High Ratio), but does that align with what players actually feel? We checked the 1-5 star ratings provided by users to see if they correlate with our behavioral metrics.
The plot above reveals a messy truth. While there is a slight trend—players generally rate "fast and efficient" games higher, the correlation is incredibly weak. Subjective interest often overrides objective efficiency, causing users to rate slow, wandering games highly if the content is engaging, while penalizing perfect runs simply because the topic is dull. We identify two major issues with using player ratings as a ground truth for difficulty:
- Sparsity: Only a tiny fraction of games are rated.
- Selection Bias: Users can only rate games they have finished and often only when they have a strong emotional reaction (love it or hate it).
Subjective ratings are too biased and sparse to serve as a ground truth for difficulty. We cannot rely on what players say; we must rely on what they do. To create a universal definition of difficulty, we need to build an objective model—one that predicts difficulty based on the graph itself, not user opinion.
Up to this point, we have explored difficulty through multiple lenses: structural properties of the graph, behavioral efficiency metrics, and subjective player ratings. The correlation matrix highlights an important result: no single feature explains difficulty on its own. Behavioral variables (TPP, path length ratio, backtracks) correlate weakly with structural features, while user ratings show only marginal alignment with either. This fragmentation suggests that difficulty is an emergent property of the system: it arises from the interaction between graph structure and human behavior, rather than from a single observable metric.
This matrix summarizes the monotonic relationships between all behavioral, structural, and outcome variables using Spearman correlation. Bright colors indicate strong correlations, dark colors weak or no correlation. Behavioral variables (TPP, PLR, backtracks) cluster together, wheras structural variables (path length, PageRank, link counts) form a separate block.
To move beyond descriptive analysis, we now ask a stronger question: can we predict difficulty directly from the structure of the Wikispeedia graph?
Can we rate the games ?
Players tell us what was hard, their clicks show us what actually was. Since subjective ratings are sparse and biased, we do not use them as a ground truth for difficulty. Instead, we define behavioral proxies that reflect how difficult a mission actually was to play:
- Path Length Ratio (PLR) → navigational efficiency
- Time Per Page (TPP) → cognitive load
- User Rating → perceived difficulty (used only for comparison)
For each proxy, we train a linear regression model using only structural features of the start–target pair, such as: shortest path length, number of shortest paths, category coverage, link counts, and PageRank. The goal is not to predict individual behavior perfectly, but to identify which structural properties consistently make a game harder.
The coefficient plots reveal which structural features consistently influence difficulty. Across all models, shortest path length emerges as the strongest predictor: longer minimal distances significantly increase cognitive load (TPP) and reduce navigational efficiency (PLR). The number of shortest paths plays a stabilizing role: when multiple optimal routes exist, users recover more easily from mistakes, leading to better performance. Interestingly, PageRank effects differ between start and target articles. A highly connected start article facilitates exploration, while a poorly connected target increases difficulty by reducing landing opportunities. Some features, such as outgoing link count or minimum category coverage, appear significant only in specific behavioral contexts, highlighting that difficulty is multi-dimensional rather than absolute.
The true versus predicted plots show substantial variance around the identity line. This is expected. Human navigation is inherently noisy: curiosity, fatigue, prior knowledge, and randomness all influence player behavior. A purely structural model cannot capture these effects. However, despite this noise, the models capture global trends: structurally distant pairs tend to induce higher TPP constrained graphs yield worse PLR Rather than predicting individual outcomes, the models are effective at ranking missions by expected difficulty, which is precisely what is required for level generation.
Game Over : What if you're a loser ?
So far, we only modeled games that were finished. But unfinished games tell a more fundamental story. A mission that players systematically abandon is, by definition, difficult — regardless of how short the theoretical path may be. We therefore trained a final model to predict the probability of finishing a game, again using only structural features.
A clear separation emerges between finished and abandoned missions. Predicting struggle is ok. Predicting surrender is better.
Measuring survival
So we stop measuring effort and start measuring survival. To answer our original question we went form difficulty to playability. So what will it be ? Can you finish the game or will you surrender ? If I give you Asteroid to Viking now that you have all the story... Will you make it to the end? Or will you give up? Do you think it will be in the high, mid or low probability of finishing ?
Let's play !
Now that you know everything about difficulty in Wikispeedia, it's time to test your skills. Click the button below to start a new game. Choose your difficulty wisely, as it will determine how challenging your mission will be. Good luck, and may your navigation be swift and successful!