We consider the exploration/exploitation problem in reinforcement learning (RL). The Bayesian approach to model-based RL offers an elegant solution to this problem, by considering a distribution over possible models and acting to maximize expected reward; unfortunately, the Bayesian solution is intractable for all but very restricted cases. In this paper we present a simple algorithm, and prove that with high probability it is able to perform epsilon-close to the true (intractable) optimal Bayesian policy after some small (polynomial in quantities describing the system) number of time steps. The algorithm and analysis are motivated by the so-called PAC-MDP approach, and extend such results into the setting of Bayesian RL. In this setting, we show that we are able to achieve lower sample complexity bounds than existing PAC-MDP algorithms, while using exploration strategies that are much greedier than the (extremely cautious) exploration strategies used by these existing algorithms.
Discussion
Hi Zico,
You had a great talk yesterday! I checked the paper to settle our discussion (is the 1/n bonus PAC-MDP or not?).
Your intepretation of PAC-MDP seems strange: you fix the exploration parameter first, and then pick the epsilon afterwards. This gives a very strong criterion, according to which neither Rmax nor MBIE would qualify as PAC-MDP (your counterexample would work there: e.g., for R-max, you fix a value for m (an (x,a) pair becomes known after m visits). Then you can always pick a counterexample-MDP and epsilon, where m visits are just too few.
This is actually great news
: the 1/n exploration bonus is PAC-MDP (in the sense Rmax is PAC-MDP). We have a proof of that in our last year ICML paper ( http://www.conflate.net/icml/paper/2008/490 , with the details in a Tech.rep: http://arxiv.org/abs/0904.3352 )
István
Hi Istvan,
I think there are a few points of confusion here, so I will do my best to clarify. The distinction between the 1/n and 1/sqrt(n) term is very important and it is _not_ the case that an exploration bonus of 1/n (or 1/n^p for any p > 1/2) is PAC-MDP, at least not under definitions that I'll try to explain better here. This response is quite long, but I want to address these concerns in detail, as I think the claims you make would decrease the value of certain elements of this paper (even if they didn't actually affect the main ideas behind the PAC-style bounds w.r.t. the Bayesian policy).
First, however, I want to highlight the high-level comment that the main point of Theorem 2 in the paper is to show that the Bayesian policy and BEB are not PAC-MDP. And in this it is perfectly sufficient: both these algorithms are of the form that the theorem states. The theorem is not saying anything about R-MAX or MBIE, and it is certainly not saying that these are somehow not PAC-MDP (I know that's not really what you meant by your comment, but I do want to stress this point in the discussion).
Now, I think the crux of the issue that you're having here is that in the formulation of the theorem, I don't allow beta to depend on epsilon (that is, the epsilon that is input to the algorithm). This is mainly for two reasons 1) as mentioned above, I'm primarily focused on the Bayesian policy here, and 2) for the two-armed bandit I'm considering here this is actually ok: an algorithm with constant beta (and exploration bonus greater than 1/sqrt(n)) could be PAC-MDP in this setting (you can use the Hoeffding bound to show that with high probability the running estimate of the second arm's reward will always be optimistic). Indeed, for p=1/2 you can see exactly where the proof of Theorem 2 will no longer go through: the epsilon_0 in the last equation on page 6 would be zero.
Now, there is one exception you could take with this, and I think this is what you are getting at, which is that there _are_ some PAC-MDP algorithms that do have a (very weak) dependence on epsilon in the exploration bonus term. However, R-MAX and MBIE are not the algorithms to consider here; they are not exploration bonus algorithms, and I don't see any reason to consider them as such. The algorithm to compare with here is MBIE-EB, which does have the beta/sqrt(n) bonus term. So let's look at the form of beta for this algorithm. In MBIE-EB,
beta = (1/(1-gamma)) sqrt(ln(2|S||A|m/delta)),
where m = O(1/epsilon^2), in addition to other terms. Therefore, the dependence on epsilon is approximately of the form (ignoring the square root around the log, as it's not important and just decreases the dependence on 1/epsilon)
beta = c_2 ln (1/epsilon).
Now, my claim is that this logarithmic dependence on 1/epsilon in beta won't affect the correctness of Theorem 2 at all. If beta depends on ln(1/epsilon), we can still find an epsilon_0 such that the empirical estimate of the reward, plus the bonus, may be pessimistic by epsilon_0 _and_ (this is the critical part, of course), we have epsilon_0 > epsilon. In other words, if we take MBIE-EB but use an exploration bonus of1/n(s,a)^p for p > 1/2 (and even if we increase the bonus by some constant factor), then there exists an epsilon small enough such that MBIE-EB will _not_ be able to guarantee the PAC-MDP condition for such an epsilon. I.e., MBIE-EB with this exploration bonus will not be PAC-MDP.
I'll briefly sketch the proof for this. I'll consider the case p=1 just to simplify the math a bit (discussion boards aren't great for math notation), but the same logic will hold for all p > 1/2. The proof proceeds entirely as before, until reaching the last line on page 6, where we define epsilon_0. In the case that beta = c_1 ln(1/epsilon), this means that
epsilon_0 = 3/(64 beta) = 3/ (64 c_1 ln (1/epsilon)) = c_2 / ln(1/epsilon)
where I'll just absorb all the constants into c_2. Now, since
epsilon < c_2 / ln(1/epsilon)
for small enough epsilon (regardless of the constant), we have that epsilon_0 > epsilon, so the algorithm will _not_ have the PAC-MDP guarantee. Thus, Theorem 2 (with this modification) applies exactly to an existing PAC-MDP algorithm, MBIE-EB: the algorithm is PAC-MDP for p=1/2 but not PAC-MDP if we use p > 1/2. As mentioned above, I wanted to present Theorem 2 in a simpler form, since it's main purpose was to provide the bounds for the Bayesian and BEB algorithms, but this is the distinction between the 1/n and 1/sqrt(n) terms that the paper makes.
Now, you could argue that we could introduce other terms that depend on epsilon directly in the bonus (for instance, make the bonus have a 1/epsilon term), but I think this violates the idea behind the exploration bonus algorithms. Essentially, by doing this we'd be saying that since we don't care at all about optimism of the algorithm below some threshold, so we'll just add some constant such that c/n > 1/sqrt(n) in the range that we care about. While this might provide the desired bound (and I'm not entirely sure it would, I'd have to think about this more), it looses the important, and very real, distinction between the 1/n and 1/sqrt(n) decay rates (and would also lead to much worse sample complexity, by multiplying by additional 1/epsilon factors). Indeed, the ln(1/epsilon) factor in MBIE-EB only comes around in a very roundabout manner … the true idea behind the exploration bonus, to some extent, is to just have a constant term that will decay with the number of samples. And this is the intuition that I was trying to capture in the formulation for Theorem 2.
Anyway, thank you for the comments, and I'm certainly happy to clarify these points. I know this was a rather long explanation, but hopefully it explains the basic ideas clearly enough.
Hi Zico,
Thanks a lot for the detailed answers! I think our statements are closer to each other than it seems, with many points that we agree on, and a few that we don't. I try my best to identify both groups. I'm sorry about the length of the comment that follows. To jump to the conclusions, I still have good resons to believe that the 1/n bonus can be PAC-MDP.
1. First of all, I agree that the issues we are discussing do not affect the main ideas of your paper. 2. I agree that Bayesian exploration is not PAC-MDP is general. I also agree that the beta/n bonus does not result in a PAC-MDP algorithm for any epsilon-independent constant beta 3a. I agree that Theorem 2 is true, and its proof is correct 3b. But this is _not_ the same as saying that the algorithm is non-PAC-MDP! If beta is allowed to depend on epsilon and delta, the result can be PAC-MDP, for example, by setting beta = O(Rmax^2/epsilon/(1-gamma)^3 * ugly logarithmic term). Intuitively, if you start with a high value, it will take a while until beta/n goes below constant/sqrt(n) (to the non-sufficiently-optimistic region). if you want a higher precision epsilon_0, you can choose a higher beta, so that the optimistic period lasts longer. 4. I think that allowing the algorithms to depend on epsilon is an important point (should they have bonuses or not): for MBIE and MBIE-EB, it is weak, but cannot be omitted, for Rmax, E3 and OIM and delayed Q-learning, it is stronger (for delayed Q, m is quadratic in 1/epsilon!). If the parameters of the algorithm are fixed, none of them will be PAC-MDP any more, counterexamples similar to yours can be constructed. 5. Your proof above: again, I agree completely that beta = c*ln(1/epsilon) is sufficient for MBIE-EB, but not for the 1/n bonus. But if you choose beta = constant*(1/epsilon), that will make the 1/n bonus PAC-MDP (your proof above is a great illustration: it carries through to 1/epsilon^z with z<1, but not to 1/epsilon. There is a rigorous proof in the full version of our last year's paper). I believe that beta=O(1/epsilon^(2p)) should be sufficient for 1/2 < p ⇐ 1, but I did not make the computations for this general case.
I have a few more comments that are not relevant to the main point of the discussion, but I feel that they would make the picture more complete.
6. My analogy tried to point out that R-max is PAC-MDP with a suitably defined m(epsilon,delta), but not with any constant m (independent of epsilon), and similarly, MBIE is PAC-MDP with bonus beta(epsilon,delta)/sqrt(n), but not with beta/sqrt(n) for any constant beta. Sorry if the analogy looked like a strawman argument, I really did not want it to look it that way.
7. On the other side, let me note that Rmax and MBIE _can_ be rewritten as exploration bonus methods. E.g., for Rmax, the bonus would be “Rmax-R_t(s,a), if number of visits to (s,a) is less than m(epsilon,delta), and 0 otherwise”. This is not a nice bonus, which is a good reason why it's generally not considered as such. But it is still a bonus, and can be dominated by any 1/n^p, so your theorem would apply.
8. I'm not so sure about the true idea behind exploration bonuses (people have defined many weird ones, like the TD-error based exploration bonus), but I feel it should be something like “we want to remain simple”. And from that respect, the 1/n bonus is special: you do not have to add the bonus explicitly to the equations - they will be automatically enforced by optimistic initialization.
9. Finally, I agree completely that the worst-case bounds will be much worse: O(S^3 A / epsilon^3 / (1-gamma)^4) vs. O(S^2 A / epsilon^3 / (1-gamma)^6), but the standard benchmarks and real-world examples are far from being worst case (we had the same experience as you described: the 1/n bonus beats confidently the 1/sqrt(n) bonus on all benchmarks)
Best regards, István
Hi Istvan,
I see your arguments, and I think that overall we're both presenting valid (or at least mathematically sound) viewpoints, but I do still think we are interpreting things differently at some level. I'll try to summarize my responses quickly, as we could probably continue this discussion indefinitely…
The argument that I make in the paper is that the O(1/sqrt(n)) bonus is “special” in regards to PAC-MDP learning, and I still hold this to be important. By this I just mean that any exploration bonus algorithm with no dependence on epsilon in the bonus term (or weak dependence, as in the ln(1/espilon) term described above), cannot be PAC-MDP when using a decay of O(1/n^p) p > 1/2. I think on this point we both agree. Philosophically I would also argue that this is the “correct” form for an exploration bonus, precisely because intuitively the actual bonus term _shouldn't_ depend on epsilon, we should just have a algorithm that works regardless of the parameters we want for its theoretical guarantees, and just takes longer to reach a stronger guarantee; of course, I'm certainly aware that the algorithms we have aren't actually of this form yet, but I think think this is the correct intuition. However, this is really a philosophical issue that would be somewhat foolish to debate, since there is no “right” answer.
We also agree that if we allow arbitrary dependence on epsilon allows a O(1/n) decay rate to be PAC-MDP. However, what I don't understand is why this rate is somehow “special,” as you are claiming. If we do allow beta to depend arbitrarily on epsilon, then _any_ O(1/n^p) for any p could be PAC-MDP, as long as we had beta = O(1/epsilon^q) for some q that depended on our choice of p, right? (and in my mind, this is somewhat of a “hack” that just makes 1/n^p > 1/sqrt(n) in the range we care about). If this is correct, then I don't see why the O(1/n) rate is important, as you claim, since we could make _any_ decay rate PAC-MDP. Thus, I think the only real distinction we could make is to consider what happens in the important case (and the case most comparable to the BEB algorithm) where beta has weak dependence on epsilon, and this is precisely where the 1/sqrt(n) break point is relevant. However, if I am missing something, and if O(1/n^p) for p > 1 cannot be PAC-MDP given arbitrary dependence on epsilon, for some reason that I don't see, then I'd certainly concede that both the 1/sqrt(n) and 1/n bonuses are interesting breakpoints in the discussion of PAC-MDP algorithms, but currently I only see the importance of the 1/sqrt(n) bonus.
Zico
Hi Zico,
I think we finally agree on everything
Let me summarize my take on the state of matters:
Seems like we managed to separate three different levels of PAC-MDPness: - “PAC-MDP with arbitrary epsilon-dependence” - all O(1/n^p)-bonus methods belong here (including BEB and OIM), as well as Rmax, E3, and the non-bonus version of MBIE, and delayed Q-learning - “PAC-MDP with weak epsilon-dependence” - MBIE and MBIE-EB belong here - “PAC-MDP with no epsilon-dependence” - no algorithms (yet!) belong here (though I think that the log-regret RL algorithm of Auer and Ortner could be transformed with little effort. Incidentally, it also has an 1/sqrt(n)-like bonus)
I also completely agree that the O(1/sqrt(n)) bonus is “special”, because it has the “correct” magnitude, all other algorithms should have worse bounds.
The O(1/n) bonus is “special” on a different coordinate axis: you can get it without ever calculating the bonus explicitly: optimistic initialization corresponds to an O(1/n) bonus term. In this way, the algorithm can be made simple to the extreme: no exploration is ever done, the algorithm is always greedy (with respect to an optimistically initialized model). So, the O(1/n) bonus seems to give the “simplest possible” algorithm that is still PAC-MDP (note: I am fully aware of the slackyness of “simplest possible”)
cheers, István
It looks like BEB improves on Delayed Q-learning by a factor of H^2/epsilon.
There are a few other cases I know where statements of the form “A Bayesian approach gives a better analysis than is possible with a more adversarial analysis” apply. Two that I've worked on are: (a) <a href=“http://hunch.net/~jl/projects/prediction_bounds/MDL/current.ps”>Suboptimal Behavior of Bayes and MDL in Classification under Misspecification</a> (which I worked on with Peter Grunwald). (b) Comparing the <a href=“http://www.cse.ucsd.edu/~yfreund/papers/QBC.pdf”>analysis of query by committee</a> to the more agnostic approaches covered in our <a href=“http://hunch.net/~active_learning”>active learning tutorial</a>.
The basic lesson seems to be that there is some genuine power in correctly specifying a Bayesian prior, but of course if you misspecify your prior you can fail basic consistency tests.
Yes, I believe that I misspoke in a rather unfortunate manner when answering your question at the talk, and said a “constant factor” improvement over Delayed Q-learning, when of course what I meant was a “constant improvement in the polynomial factors,” and should have obviously avoided the word “constant” altogether. Also, as we discussed afterwards, the improvement in the bound over Q-learning is H^2/epsilon^2, since BEB has a O(1/epsilon^2) term in the sample complexity.