In our previous posts, we explored Part I of the seminal book Reinforcement Learning by Sutton and Barto [1] (*). In that part, we delved into the three fundamental techniques underlying nearly every modern Reinforcement Learning (RL) algorithm: Dynamic Programming (DP), Monte Carlo methods (MC), and Temporal Difference Learning (TD). We not only discussed algorithms from each field in depth but also implemented each one in Python.
Part I of the book focuses on tabular solution methods — approaches suited for problems small enough to be represented in table form. For instance, with Q-learning, we can compute and store a complete Q-table containing every possible state-action pair. In contrast, Part II of Sutton’s book tackles approximate solution methods. When state and action spaces become too large — or even infinite — we must generalize. Consider the challenge of playing Atari games: the state space is simply too vast to model exhaustively. Instead, deep neural networks are used to compress the state into a latent vector, which then serves as the basis for an approximated value function [2].
While we’ll venture into Part II in upcoming posts, I’m excited to announce a new series: we will benchmark all the algorithms introduced in Part I against one another. This post serves both as a summary and an introduction to our benchmarking framework. We’ll evaluate each algorithm based on how quickly and effectively it can solve increasingly larger Gridworld environments. In future posts, we plan to extend our experiments to more challenging scenarios, such as two-player games, where the differences between these methods will be even more apparent.
To summarize, in this post we will:
- Introduce the benchmark task and discuss our comparison criteria.
- Provide a chapter-by-chapter summary of the methods introduced in Sutton’s book along with initial benchmarking results.
- Identify the best-performing methods from each group and deploy them in a larger-scale benchmarking experiment.
Introducing the Benchmark Task and Experiment Planning
In this post we’ll work with the Gymnasium [3] environment “Gridworld”. It’s essentially a maze-finding task, in which the agent has to get from the top-left corner of the maze to the bottom-right tile (the present) — without falling into any of the icy lakes:
The state is a number between 0 and N − 1, where N is the total number of tiles (16 in our case). In every step, the agent can execute one of four actions: going left, right, up or down. Reaching the goal yields a reward of 1, while falling into a lake ends the episode without any reward.
The nice thing about this environment is that one can generate random worlds of arbitrary size. Thus, for all methods, we will plot the number of steps / updates needed to solve the environment against the environment size. Sutton does this in some parts of the book, too, so we can compare our results with his.
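To make the setup concrete, here is a minimal sketch of how such a randomly generated Gridworld can be created with Gymnasium's FrozenLake implementation; the exact construction in the benchmark code may differ, and the non-slippery dynamics are an assumption of this sketch.

```python
import gymnasium as gym
from gymnasium.envs.toy_text.frozen_lake import generate_random_map

# Generate a random n x n map and build the Gridworld on top of it.
# Deterministic (non-slippery) dynamics are an assumption of this sketch.
n = 4
env = gym.make("FrozenLake-v1", desc=generate_random_map(size=n), is_slippery=False)

state, _ = env.reset()
next_state, reward, terminated, truncated, _ = env.step(env.action_space.sample())
```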
Preliminaries
I’d like to start with some general notes — hoping these will guide you through my thought process.
It is not easy to compare algorithms in a “fair” way. When implementing the algorithms I mainly looked for correctness, but also simplicity — that is, I wanted readers to be able to easily connect the Python code with the pseudocode from Sutton’s book. For “real” use cases, one would surely optimize the code more, and also use the large array of tricks common in RL, such as decaying exploration, optimistic initialization, and learning rate tuning. Further, one would take great care to tune the hyperparameters of the deployed algorithm.
Applied RL Tricks
Due to the large number of algorithms under investigation, I cannot do this here. Instead, I identified two important mechanisms, and measured their effectiveness on an algorithm known to work quite well: Q-learning. These are:
- Intermediate rewards: instead of only rewarding the agent for reaching the goal, we reward it for progress along the maze. We quantify this by the (normalized) difference in x and y coordinates between the current and previous state. This makes use of the fact that the goal in each Gridworld environment is always at the bottom right, and thus higher x / y coordinates usually are better (although one could still get “stuck” if an icy lake lies between the agent and the goal). Since this difference is normalized by the number of states, its contribution is small, so that it does not overshadow the reward of reaching the goal.
- Decaying exploration: throughout this post series, the exploration-exploitation dilemma came up frequently. It describes the trade-off between exploiting states / actions already known to be good and exploring lesser-known ones — potentially finding even better solutions, at the risk of wasting time in less promising areas. One common technique addressing this is to decay exploration: starting with a high amount of exploration early on, and slowly decaying it to lower levels. Concretely, we linearly decay ε from 1 (100% random exploration) to 0.05 over 1000 steps. A sketch combining both tricks with a Q-learning loop follows right after this list.
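Here is a compact sketch of how both tricks can be wrapped around a plain Q-learning loop. Helper names such as `shaped_reward` and the concrete constants are illustrative assumptions, not the exact benchmark code.

```python
import numpy as np

def shaped_reward(reward, state, prev_state, n):
    """Add a small bonus for progress towards the bottom-right corner.

    The bonus is the change in x / y coordinates, normalized by the number
    of states so that it never overshadows the goal reward of 1.
    """
    x, y = state % n, state // n
    px, py = prev_state % n, prev_state // n
    return reward + ((x - px) + (y - py)) / (n * n)

def epsilon(step, eps_start=1.0, eps_end=0.05, decay_steps=1000):
    """Linearly decay exploration from eps_start to eps_end."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def q_learning_improved(env, n, num_steps=100_000, alpha=0.1, gamma=0.99):
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    state, _ = env.reset()
    for t in range(num_steps):
        # ε-greedy action selection with decaying ε
        if np.random.rand() < epsilon(t):
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, terminated, truncated, _ = env.step(action)
        r = shaped_reward(reward, next_state, state, n)
        # Standard Q-learning update on the shaped reward
        target = r + gamma * np.max(Q[next_state]) * (not terminated)
        Q[state, action] += alpha * (target - Q[state, action])
        state = next_state
        if terminated or truncated:
            state, _ = env.reset()
    return Q
```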
Let’s have a look at how Q-learning performs with these techniques:

As we can see, in the baseline setup the number of steps needed grows quickly and reaches the maximal number of allowed steps (100,000) — meaning the algorithm did not solve the environment in the allotted number of steps. Decaying ε alone also did not contribute much. However, adding intermediate rewards proved to be extremely effective — and the combination of this and decaying ε performed best.
Thus, for most methods to come we start with the “naïve” environment, i.e. the baseline setup. Later we show results for the “improved” environment, which adds intermediate rewards and decaying exploration.
Comparison Criteria
As was visible in the previous section, I chose the number of steps needed until the found policy solves the Gridworld environment as the default way of comparing methods. This seems a bit fairer than just measuring elapsed time, since time depends on the concrete implementation (similar in spirit to Big O notation) — and, as mentioned above, I did not optimize for speed. However, it is important to note that steps can also be misleading: one step in DP methods contains a loop over all states, whereas one step in MC and TD methods is the generation of one episode (for TD methods one would usually count one step as one value update, i.e. generating an episode consists of multiple steps — however, I deliberately made this comparable to MC methods). Due to this, we also show elapsed time in some places.
Experiment Structure
To reduce variance, for each Gridworld size we run each method three times and report the lowest number of steps needed.
The code required to run all following benchmarking can be found on GitHub.
Recap and Benchmarking of All Algorithms
With the fundamentals out of the way, let’s properly get started. In this section we will recap all introduced algorithms from Part I of Sutton’s book. Further, we will benchmark them against each other on the previously introduced Gridworld task.
Dynamic Programming
We start with Chapter 4 of Sutton’s book, which describes DP methods. These can be applied to a wide variety of problems, always building on the principle of iteratively constructing larger solutions from smaller subproblems. In the context of RL, DP methods maintain a Q-table which is filled out incrementally. For this, we require a model of the environment, and then, using this model, update the expected value of states or state-action pairs depending on the possible successor states. Sutton introduces two methods we picked up in our corresponding post: policy iteration and value iteration.
Let’s start with policy iteration. It consists of two interleaved steps, namely policy evaluation and policy improvement. Policy evaluation uses DP to — as the name says — evaluate the current policy: we incrementally update the state estimates using the model and the policy. Next comes policy improvement, which employs a fundamental concept of RL: according to the policy improvement theorem, any policy gets better when we change the action chosen in one state to a better action. Following this, we construct the new policy from the Q-table in greedy fashion. This is repeated until the policy has converged.
The corresponding pseudocode looks as follows:

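As a complement to the pseudocode, here is a compact Python sketch of policy iteration. It assumes a tabular model in the format used by Gymnasium's toy-text environments, `P[s][a] = [(prob, next_state, reward, done), ...]`; the thresholds are illustrative.

```python
import numpy as np

def policy_iteration(P, n_states, n_actions, gamma=0.99, theta=1e-8):
    V = np.zeros(n_states)
    policy = np.zeros(n_states, dtype=int)

    def q_value(s, a):
        return sum(p * (r + gamma * V[s2] * (not done)) for p, s2, r, done in P[s][a])

    while True:
        # Policy evaluation: sweep until the value estimates stabilize.
        while True:
            delta = 0.0
            for s in range(n_states):
                v = q_value(s, policy[s])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                break
        # Policy improvement: act greedily w.r.t. the evaluated values.
        stable = True
        for s in range(n_states):
            best = int(np.argmax([q_value(s, a) for a in range(n_actions)]))
            if best != policy[s]:
                stable = False
            policy[s] = best
        if stable:
            return policy, V
```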
Let’s come to value iteration. This is very similar to policy iteration, but with a simple, yet crucial modification: in every loop, only one step of policy evaluation is run. It can be shown that this still converges to the optimal policy, and overall does so faster than policy iteration:

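A corresponding value-iteration sketch, under the same model assumptions, runs only a single greedy backup per state and sweep:

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.99, theta=1e-8):
    V = np.zeros(n_states)

    def q_values(s):
        return [sum(p * (r + gamma * V[s2] * (not done)) for p, s2, r, done in P[s][a])
                for a in range(n_actions)]

    while True:
        delta = 0.0
        for s in range(n_states):
            # One truncated evaluation step: back up the best action directly.
            best = max(q_values(s))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # Extract the greedy policy from the converged values.
    policy = np.array([int(np.argmax(q_values(s))) for s in range(n_states)])
    return policy, V
```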
For more details, here’s my corresponding post about DP methods.
Results
Now it’s time to see what these first two algorithms are really made of. We run the experiment sketched above, and get the following plot:

Both methods are able to solve all created Gridworld sizes in the minimal number of steps, 100. Surprising? Well, this actually shows both a strength and a weakness of DP methods, and simultaneously of our methodology: DP methods are “thorough” — they require a complete model of the world and then iterate over all states — yielding a good solution within just a few passes over all states. However, this means that all states need to be estimated until convergence — even though some of them might be much less interesting — and this scales quite badly with the environment size. In fact, one measured step here contains a full sweep over all states — indicating that for these methods time is the better measure.
For this, we get the following graph:

Now we can see the increase in compute needed with respect to the number of states. And we can also see that, as claimed, value iteration converges much faster and scales much better. Note that the x-axis labels denote n, with the Gridworld having a size of n x n.
Monte Carlo Methods
Next in our post series on RL we covered MC methods. These can learn from experience alone, i.e. one can run them in any kind of environment without having a model of it — which is a stunning realization, and very useful: often we don’t have such a model; other times, it would be too complex and impractical to use. Consider the game of Blackjack: while we can certainly model all possible outcomes and corresponding probabilities, it is a very tedious task — and learning to play simply by playing is a very tempting alternative. Due to not using a model, MC methods are unbiased — but on the downside, their estimates have high variance.
One issue when implementing these methods is making sure that all state-action pairs are continuously visited, and thus updated. Since we do not have a model, we cannot simply iterate over all possible combinations (as, e.g., DP methods do), but instead (in a way) randomly explore the environment. If, due to this, we missed some states entirely, we would lose the theoretical convergence guarantees, which would also show in practice.
One way of satisfying this is the exploring starts (ES) assumption: we start each episode in a random state and also choose the first action randomly. Apart from that, MC methods can be implemented rather simply: we play out full episodes and set the expected value of state-action pairs to the average return obtained.
MC with ES looks as follows:

To remove the assumption of ES, we can resort to two classes of algorithms: on- and off-policy methods. Let’s start with the on-policy one.
This is actually not too different from the ES algorithm: we simply use an ε-greedy policy for generating episodes. That is, we remove the ES assumption and use a “soft” instead of a “hard” policy for generating episodes: the policy used in every iteration is not fully greedy, but ε-greedy — which guarantees that in the limit we see all possible state-action pairs:

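A minimal sketch of on-policy MC control with an ε-greedy policy (an every-visit variant for brevity; the actual implementation in the repository may differ):

```python
import numpy as np
from collections import defaultdict

def on_policy_mc(env, n_actions, episodes=10_000, gamma=0.99, eps=0.1):
    Q = defaultdict(lambda: np.zeros(n_actions))
    visit_count = defaultdict(int)
    for _ in range(episodes):
        # Generate one episode with the current ε-greedy policy.
        episode, state, done = [], env.reset()[0], False
        while not done:
            if np.random.rand() < eps:
                action = env.action_space.sample()
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, action, reward))
            state, done = next_state, terminated or truncated
        # Walk the episode backwards and update the running averages.
        G = 0.0
        for s, a, r in reversed(episode):
            G = gamma * G + r
            visit_count[(s, a)] += 1
            Q[s][a] += (G - Q[s][a]) / visit_count[(s, a)]
    return Q
```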
Off-policy methods follow the idea of splitting exploration and learning in two policies. We maintain a policy π, which we want to optimize, and a behavior policy, b.
However, we can’t simply use b everywhere in our algorithm. When generating an episode and computing returns, we obtain:

That is, the resulting value is the expected value under b, not under π.
This is where importance sampling comes in. We can fix this expectation with the right ratio:

This ratio is defined by:

In our case, we obtain the following formula:

(Note that this uses weighted importance sampling, instead of “naïve” importance sampling.)
We could of course compute these ratios naively in every step. However, Sutton introduces a clever scheme that updates these values (denoted by W) incrementally, which is far more efficient. In fact, in my original post I also showed the naive version — I believe this helps with understanding. However, since here we mainly care about benchmarking, and the “naïve” and the “incremental” versions are identical up to performance, we only list the slightly more complex incremental version here.
In pseudocode the corresponding algorithm looks as follows:

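For reference, here is a small sketch of that incremental scheme: the cumulative weights C take the role of the denominators in the weighted importance sampling formula, and W is the running importance ratio. The ε-greedy behavior policy and the constants are assumptions of this sketch.

```python
import numpy as np

def off_policy_mc(env, n_states, n_actions, episodes=10_000, gamma=0.99, eps=0.1):
    Q = np.zeros((n_states, n_actions))
    C = np.zeros((n_states, n_actions))  # cumulative importance-sampling weights
    for _ in range(episodes):
        # Behavior policy b: ε-greedy w.r.t. the current Q-table.
        episode, state, done = [], env.reset()[0], False
        while not done:
            if np.random.rand() < eps:
                action = env.action_space.sample()
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, action, reward))
            state, done = next_state, terminated or truncated
        # Incremental weighted importance sampling; the target policy π is greedy.
        G, W = 0.0, 1.0
        for s, a, r in reversed(episode):
            G = gamma * G + r
            C[s, a] += W
            Q[s, a] += (W / C[s, a]) * (G - Q[s, a])
            if a != int(np.argmax(Q[s])):
                break  # π(a|s) = 0, so all earlier returns get zero weight
            W /= 1 - eps + eps / n_actions  # W *= π(a|s) / b(a|s) with π(a|s) = 1
    return Q
```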
Note that, as opposed to our initial post introducing these methods, where the behavior policy was simply random, here we pick a better one — namely an ε-greedy one w.r.t. the current Q-table.
For more details here’s my corresponding post on MC methods.
Results
With that, let’s compare these three algorithms on small Gridworld environments. Note that one step here denotes one full episode generated:

We observe that off-policy MC already times out at a Gridworld size of 5×5, and, even though MC with ES and on-policy MC perform better, they too start to struggle with larger sizes.
This might be somewhat surprising, and disappointing for MC fans. Don’t worry, we will manage to boost this — but it does show a weakness of these algorithms: in “large” environments with sparse rewards, MC methods basically have to hope to stumble upon the goal by chance — the probability of which decreases exponentially with the size of the environment.
Thus, let’s try to make the task easier for the agent, and use the previously introduced tricks empirically found to help Q-learning: adding intermediate rewards and ε-decay — our “improved” setup.
In fact, with this all methods fare much better:

However, now MC ES is causing problems. Thus, let’s put it aside and continue without it: ES was anyway a theoretical concept on the way to developing MC methods, and clunky to use / implement (some might remember how I implemented having the environment start in random states …):

Here we at least get close to the results of DP. Note that I capped the maximal number of steps at 100,000, so whenever this number shows up in the graph it means that the algorithm could not solve the environment within the given step limit. On-policy MC actually seems to perform really well — the number of steps needed barely increases — but off-policy MC performs worse.
Discussion
To me, MC methods perform surprisingly well — after all, they essentially stumble around the environment randomly in the beginning, hoping to find the goal by exploration alone. Of course, this is not fully true — their performance (speaking of on-policy MC) becomes really good only after enabling intermediate rewards, which guide the agent towards the goal. In this setup MC methods seem to perform really well — one potential reason being that they are unbiased, and less sensitive to hyperparameter tuning and the like.
Temporal-Difference Learning
Let’s come to TD methods. These can be seen as combining the strengths of both approaches previously introduced: similar to MC, they do not need a model of the environment — but they still build upon previous estimates, i.e. they bootstrap — as in DP.
Let’s recap the DP and MC update rules:
DP methods turn the Bellman equation into an update rule, and compute the value of a state based on the estimated values of its successor states:

MC methods, on the other hand, play out full episodes and then update their value estimates based on the observed return:

TD methods combine these two ideas. They play out episodes, but after every step they update value estimates using the observed reward and the previous estimate:

Some of the most fundamental RL algorithms stem from this field — and we will discuss them in the following.
Let’s begin with Sarsa. First, we modify the update rule introduced above to work with state-action pairs:

With this, Sarsa is actually introduced rather quickly: we play episodes, and update values following our current policy. The name comes from the tuples used in the updates:

In pseudocode this looks as follows:

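A minimal sketch of one Sarsa episode; the constants and helper structure are illustrative assumptions.

```python
import numpy as np

def sarsa_episode(env, Q, alpha=0.1, gamma=0.99, eps=0.1):
    """Run one episode and update the Q-table in place with the Sarsa rule."""
    def eps_greedy(s):
        if np.random.rand() < eps:
            return env.action_space.sample()
        return int(np.argmax(Q[s]))

    state, _ = env.reset()
    action = eps_greedy(state)
    done = False
    while not done:
        next_state, reward, terminated, truncated, _ = env.step(action)
        next_action = eps_greedy(next_state)
        # Sarsa: bootstrap on the action the policy actually selects next.
        target = reward + gamma * Q[next_state, next_action] * (not terminated)
        Q[state, action] += alpha * (target - Q[state, action])
        state, action = next_state, next_action
        done = terminated or truncated
```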
Next up we have Q-learning. This is very similar to Sarsa, with one key difference: it is an off-policy algorithm. Instead of simply following the executed transition during the update, we use the maximum Q-value over all actions in the successor state:

You can picture this as creating a behavior policy b, which is equal to π, except being greedy in the transitions under question.
The pseudocode looks like this:

Another algorithm is Expected Sarsa, which (you guessed it) is an extension of Sarsa. Instead of following the one transition executed by the policy, we account for all possible next actions and weight them by how likely they are under the current policy:

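The only change compared to Sarsa is the update target, which averages over all next actions; a small sketch, assuming an ε-greedy policy:

```python
import numpy as np

def expected_sarsa_target(Q, next_state, reward, terminated, gamma=0.99, eps=0.1):
    """Target that weighs all next actions by their ε-greedy probabilities."""
    n_actions = Q.shape[1]
    probs = np.full(n_actions, eps / n_actions)
    probs[int(np.argmax(Q[next_state]))] += 1 - eps
    expected_q = float(np.dot(probs, Q[next_state]))
    return reward + gamma * expected_q * (not terminated)
```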
The last algorithm in this chapter is an extension of Q-learning. Q-learning suffers from a problem known as maximization bias: since it uses a maximum over expected values, the resulting estimate can have positive bias. We can address this by using two Q-tables: for each update we use one for selecting a value-maximizing action, and the other for computing the update target. Which is used where is determined by a coin flip. The algorithm is known as Double Q-learning:

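The core of the update can be sketched in a few lines; the coin flip decides which of the two tables selects the action and which one evaluates it. The argument layout is an illustrative assumption.

```python
import numpy as np

def double_q_update(Q1, Q2, s, a, r, s_next, terminated, alpha=0.1, gamma=0.99):
    """One Double Q-learning update on two Q-tables (numpy arrays, updated in place)."""
    if np.random.rand() < 0.5:
        Q1, Q2 = Q2, Q1  # swap roles with probability 0.5
    # Q1 selects the maximizing action, Q2 evaluates it; decoupling selection
    # from evaluation is what reduces the maximization bias.
    best = int(np.argmax(Q1[s_next]))
    target = r + gamma * Q2[s_next, best] * (not terminated)
    Q1[s, a] += alpha * (target - Q1[s, a])
```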
Results
Let’s have a look at the results, starting with the naïve environment:

We can see that both Q-learning methods start to struggle with Gridworld sizes of 11 x 11.
Thus let’s apply our known tricks, yielding the “improved” setup:

All methods now find solutions significantly quicker — only Expected Sarsa falls behind. This is quite plausible: it is used significantly less than Q-learning or Sarsa, and is perhaps more of a theoretical concept.
Thus, let’s continue without this method and see how large a world size we can solve:

Q-learning can now also solve grid sizes of 25 x 25 without problems — but Sarsa and Double Q-learning start to degrade.
More details can be found in my introductory post about TD methods.
Discussion
In the improved setup, TD methods generally perform well. We only eliminated Expected Sarsa early, which is anyway not such a common algorithm.
“Simple” Sarsa and Double Q-learning struggle with larger environment sizes, while Q-learning performs well overall. The latter is somewhat surprising, since Double Q-learning should address some of the shortcomings of standard Q-learning, in particular its high variance. Potentially, we already reduce the variance by running each experiment three times. Another hypothesis could be that Double Q-learning takes longer to converge, since the number of parameters has also doubled — which would indicate that the power of Double Q-learning shows better for more complex problems given more time.
As mentioned, Q-learning performs better than Sarsa. This mirrors what we see in research and the literature, namely that Q-learning is significantly more popular. This can probably be explained by it being off-policy, which usually yields more powerful solution methods. Sarsa, on the other hand, performs better in stochastic or “dangerous” tasks: since Sarsa takes the actually chosen action into account in the value update, it better understands the effects of its actions, which comes in handy in stochastic environments and / or environments where one can, e.g., fall off a cliff. Although the latter is the case here, the environment is probably not complex or large enough for this effect to come into play.
TD-n
TD-n methods in a way marry classical TD learning and MC methods. As Sutton so nicely puts it, they “free us from the tyranny of the timestep” [1]. In MC methods, we are forced to wait for a full episode before making any updates. In TD methods, we update estimates at every step — but are also forced to look only one step into the future.
Thus, it makes sense to introduce n-step returns:

With that, we can simply introduce Sarsa-n:

We play episodes following the current policy, and then update the value estimates with the n-step return.
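The n-step return itself is easy to compute once the last n rewards are buffered; a small sketch, where the argument layout is an assumption and `bootstrap=False` covers episodes that terminate within the n steps:

```python
def n_step_return(rewards, Q, state_n, action_n, gamma=0.99, bootstrap=True):
    """n-step return G_{t:t+n} for Sarsa-n.

    rewards holds R_{t+1}, ..., R_{t+n}; (state_n, action_n) is (S_{t+n}, A_{t+n}).
    """
    G = gamma ** len(rewards) * Q[state_n, action_n] if bootstrap else 0.0
    for i, r in enumerate(rewards):
        G += gamma ** i * r
    return G
```

Sarsa-n then moves Q[S_t, A_t] towards this return with the usual step size α.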
In my corresponding post, we also introduced an off-policy version of this. However, to not blow up this post too much, and given the negative experience with off-policy MC methods, we focus on the “classics” — Sarsa-n, and n-step tree backup, which we introduce next.
n-step tree backup is an extension of the previously seen Expected Sarsa. When computing the n-step return, the corresponding transition tree looks as follows:

I.e., there is a single path down the tree corresponding to the actual action taken. Just as in Expected Sarsa, we now want to weigh actions according to their probability determined by the policy. But since now we have a tree of depth > 1, the cumulative value of later levels is weighted by the probability of the action taken to reach these levels:

The pseudocode looks as follows:

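Since the tree-backup target is the trickiest formula in this chapter, here is a backward-recursion sketch of it; the data layout and the `pi` callback are assumptions of this sketch.

```python
import numpy as np

def tree_backup_return(rewards, states, actions, Q, pi, gamma=0.99):
    """n-step tree-backup return G_{t:t+n}.

    rewards[k] = R_{t+k+1}, states[k] = S_{t+k+1}, actions[k] = A_{t+k+1};
    pi(s) returns the target policy's action probabilities in state s.
    The last action is not needed, as the leaf is fully expected over actions.
    """
    # Leaf of the tree: expected value of the last state under the target policy.
    G = rewards[-1] + gamma * float(np.dot(pi(states[-1]), Q[states[-1]]))
    # Walk back up: the branch actually taken carries the deeper return,
    # all other branches contribute their current Q-value estimates.
    for k in range(len(rewards) - 2, -1, -1):
        s, a = states[k], actions[k]
        probs = pi(s)
        expected_other = float(np.dot(probs, Q[s])) - probs[a] * Q[s, a]
        G = rewards[k] + gamma * (expected_other + probs[a] * G)
    return G
```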
Here’s my corresponding post on n-step TD methods.
Results
As usual, we start with the “naïve” setting, and obtain the following results:

Sarsa-n already starts to struggle with smaller Gridworld sizes. Let’s see if the improved setup changes this:

Now indeed Sarsa-n performs much better, but n-step tree backup does not.
Discussion
I found this discovery unexpected and somewhat hard to explain. I’d love to hear your thoughts on this — but in the meantime I was chatting with my chat agent of choice, and came to this hypothesis: intermediate rewards potentially confuse the tree algorithm, since it needs to learn a matching return distribution over all possible actions. Further, the more ε decays, the more the expected distribution might differ from the behavior policy.
Model-Based Reinforcement Learning / Planning
In the previous chapter we discussed the topic of “planning” — in the RL context this mainly refers to model-based methods. That is: we have (or build) a model of the environment, and use this model to explore further “virtually”, and in particular use these virtual explorations for more and better updates of the value function. The following image displays the integration of planning into learning very well:

In the top-right corner we see the “classical” RL training loop (also dubbed “direct” RL): starting with some value function / policy we act in the (real) environment, and use this experience to update our value function (or policy in the case of policy-gradient methods). When incorporating planning, we additionally also learn a model of the world from this experience, and then use this model to generate further (virtual) experience, and update our value or policy function from this.
This actually is exactly the Dyna-Q algorithm, which looks as follows in pseudocode:

Steps (a) — (d) are our classical Q-learning, whereas the rest of the algorithm adds the novel planning functionality, in particular the world model learning.
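A compact sketch of this loop; the deterministic dictionary model and the constants (e.g. `n_planning`) are illustrative assumptions.

```python
import random
import numpy as np

def dyna_q(env, n_states, n_actions, num_steps=50_000, n_planning=10,
           alpha=0.1, gamma=0.99, eps=0.1):
    Q = np.zeros((n_states, n_actions))
    model = {}  # (s, a) -> (reward, next_state, done), learned from real experience
    state, _ = env.reset()
    for _ in range(num_steps):
        # (a)-(d): one real ε-greedy step plus a direct Q-learning update.
        if random.random() < eps:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, terminated, truncated, _ = env.step(action)
        Q[state, action] += alpha * (
            reward + gamma * np.max(Q[next_state]) * (not terminated) - Q[state, action])
        # (e): update the (here: deterministic) world model.
        model[(state, action)] = (reward, next_state, terminated)
        # (f): planning, i.e. replaying remembered transitions through the model.
        for _ in range(n_planning):
            (s, a), (r, s2, done) = random.choice(list(model.items()))
            Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) * (not done) - Q[s, a])
        state = next_state
        if terminated or truncated:
            state, _ = env.reset()
    return Q
```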
Another related algorithm is Prioritized Sweeping, which changes how we sample states for the “planning loop”: we explore and play in the real environment while learning the model, and save state-action pairs with large expected value changes to a queue. Only from this queue do we then start the “planning loop”, i.e. something akin to steps (e) and (f) above:

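The queue handling can be sketched with a simple max-heap; the `model` and `predecessors` structures as well as the threshold `theta` are assumptions of this sketch.

```python
import heapq
import numpy as np

def push_if_significant(queue, Q, s, a, r, s_next, gamma=0.99, theta=1e-4):
    """Queue a state-action pair if its expected value change is large enough."""
    priority = abs(r + gamma * np.max(Q[s_next]) - Q[s, a])
    if priority > theta:
        heapq.heappush(queue, (-priority, (s, a)))  # max-heap via negated priority

def planning_sweep(queue, Q, model, predecessors, n_planning=10, alpha=0.1, gamma=0.99):
    """Process the most urgent pairs first, then enqueue their predecessors."""
    for _ in range(n_planning):
        if not queue:
            break
        _, (s, a) = heapq.heappop(queue)
        r, s_next = model[(s, a)]
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        # Any pair predicted to lead into s may now also have changed in value.
        for (s_prev, a_prev), r_prev in predecessors.get(s, []):
            push_if_significant(queue, Q, s_prev, a_prev, r_prev, s, gamma)
```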
More details can be found in my previous post on model-based RL methods.
Results
Let’s start with the naïve setting:

Dyna-Q performs reasonably well, while Prioritized Sweeping struggles early on.
In the improved setting we see a similar thing:

Discussion
Prioritized Sweeping already performed poorly in the corresponding introductory post — I suspect there is either some issue, or, more likely, this is simply a “tuning” thing, i.e. a suboptimal sampling distribution.
Dyna-Q yields solid results.
Benchmarking the Best Algorithms
We have now seen the performance of all algorithms from Part I of Sutton’s book by benchmarking them per chapter and on Gridworlds of up to size 25 x 25. Already here we saw better and worse performing algorithms, and in particular already discarded a few candidates not suited for larger environments.
Now we want to benchmark the remaining ones — the best ones from each chapter — against one another, on Gridworlds up to size 50 x 50.

These algorithms are:
- value iteration
- on-policy MC
- Q-learning
- Sarsa-n
- Dyna-Q
Results
Here’s how they perform on Gridworld, this time with a maximal step limit of 200,000:

Let’s also plot the corresponding time needed (note that I plot unsuccessful runs — runs reaching the maximal number of steps without generating a feasible policy — at 500s):

We can observe several interesting facts from these figures:
- The number of steps and the time needed are highly correlated.
- Value iteration performs exceptionally well, solving even Gridworlds of size 50 x 50 with ease, and doing so orders of magnitude faster than the next-best algorithm.
- The ranking for the remaining algorithms is (better to worse): On-policy MC, Dyna-Q, Q-learning, Sarsa-n.
In the next section we discuss these findings in more detail.
Discussion
1. Steps vs. Time
We started this post with a discussion of which metric to use — in particular, whether to use the number of steps or the time needed to solve the problem. Looking back, we can say that this discussion was not so relevant after all: somewhat surprisingly, these two numbers are highly correlated. That is despite the fact that, as initially described, one “step” can mean different things depending on the algorithm.
2. Value Iteration Dominates
Value Iteration performed remarkably well, solving even large Gridworlds (up to 50×50) with ease—outpacing all other algorithms by a wide margin. This might be surprising, considering that DP methods are often considered theoretical tools, rarely used in practice. Real-world applications tend to favor methods like Q-learning [2], PPO [4], or MCTS [5].
So why does such a “textbook” method dominate here? Because this environment is tailor-made for it:
- The model is fully known.
- The dynamics are simple and deterministic.
- The state space is relatively small.
These are exactly the conditions under which DP thrives. In contrast, model-free methods like Q-learning are designed for settings where such information is not available. Their strength lies in generality and scalability, not in exploiting small, well-defined problems. Q-learning incurs high variance and requires many episodes to converge—disadvantages that are magnified in small-scale environments. In short, there’s a clear trade-off between efficiency and generality. We’ll revisit this point in a future post when we introduce function approximation, where Q-learning has more room to shine.
3. A Ranking Emerges
Beyond Value Iteration, we observed the following performance ranking: On-policy MC > Dyna-Q > Q-learning > Sarsa-n
On-policy Monte Carlo emerged as the best-performing model-free algorithm. This fits with our earlier reasoning: MC methods are simple, unbiased, and well-suited to problems with deterministic goals—especially when episodes are relatively short. While not scalable to large or continuous problems, MC methods seem to be quite effective in small to medium-sized tasks like Gridworld.
Dyna-Q comes next. This result reinforces our expectations: Dyna-Q blends model-based planning with model-free learning. Although the model is learned (not given, as in Value Iteration), it’s still simple and deterministic here—making the learned model useful. This boosts performance substantially over pure model-free approaches.
Q-learning, while still powerful, underperforms in this context for the reasons discussed above: it’s a general-purpose algorithm that isn’t able to fully leverage the structure of simple environments.
Sarsa-n landed in last place. A likely explanation is the added bias introduced through bootstrapping in its multi-step updates. Unlike Monte Carlo methods, which estimate returns from full trajectories (unbiased), Sarsa-n uses bootstrapped estimates of future rewards. In small environments, this bias can outweigh the benefits of reduced variance.
Lastly, let’s compare our results with the ones from Sutton:

Note that Sutton lists the total number of states on the x-axis, whereas we list n, with the total number of states being n x n. For 376 states, Sutton reports ~100k steps before the optimal solution is found, whereas we report 75k steps for 400 states (20 x 20) with Dyna-Q. The numbers are highly comparable and provide a reassuring validation of our setup and implementation.
Conclusion
This post served both as a recap of our series on Part I of Sutton and Barto’s Reinforcement Learning [1] and as an extension beyond the book’s scope — by benchmarking all introduced algorithms on increasingly larger Gridworld environments.
We began by outlining our benchmarking setup, then revisited the core chapters of Part I: Dynamic Programming, Monte Carlo methods, Temporal-Difference learning, and Model-Based RL / Planning. In each section, we introduced key algorithms such as Q-learning, provided complete Python implementations, and evaluated their performance on Gridworlds up to size 25×25. The goal of this initial round was to identify top performers from each algorithmic family. Based on our experiments, the standouts were:
Value Iteration, On-policy MC, Q-learning, Sarsa-n, and Dyna-Q. Python code to reproduce these results, and in particular implementations of all discussed methods, is available on GitHub.
Next, we stress-tested these high-performers on larger environments (up to 50×50) and observed the following ranking:
Value Iteration > On-policy MC > Dyna-Q > Q-learning > Sarsa-n
While this result may be surprising—given the widespread use of Q-learning and the relatively rare application of Value Iteration and MC methods—it makes sense in context. Simple, fully-known, deterministic environments are ideal for Value Iteration and MC methods. In contrast, Q-learning is designed for more complex, unknown, and high-variance environments where function approximation becomes necessary. As we discussed, there’s a trade-off between efficiency in structured tasks and generality in complex ones.
That brings us to what’s next. In upcoming posts, we’ll push the boundaries further:
- First, by benchmarking these methods in more challenging environments such as two-player games, where direct competition will expose their differences more starkly.
- Then, we’ll dive into Part II of Sutton’s book, where function approximation is introduced. This unlocks the ability to scale reinforcement learning to environments far beyond what tabular methods can handle.
If you’ve made it this far—thank you for reading! I hope you enjoyed this deep dive, and I’d love to have you back for the next installment in the series.
References
[1] Sutton, R. S., and Barto, A. G., Reinforcement Learning: An Introduction (2nd ed.), http://incompleteideas.net/book/RLbook2020.pdf
[2] Mnih, V., et al., Playing Atari with Deep Reinforcement Learning, https://arxiv.org/abs/1312.5602
[3] Gymnasium documentation, https://gymnasium.farama.org/index.html
[4] Schulman, J., et al., Proximal Policy Optimization Algorithms, https://arxiv.org/abs/1707.06347
[5] Schrittwieser, J., et al., Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model, https://arxiv.org/abs/1911.08265
(*) Images from [1] used with permission from the authors.