Offline RL Without Off-Policy Evaluation
Abstract
Most prior approaches to offline reinforcement learning (RL) have taken an iterative actor-critic approach involving off-policy evaluation. In this paper we show that simply doing one step of constrained/regularized policy improvement using an on-policy Q estimate of the behavior policy performs surprisingly well. This one-step algorithm beats the previously reported results of iterative algorithms on a large portion of the D4RL benchmark. The simple one-step baseline achieves this strong performance without many of the tricks used by previously proposed iterative algorithms and is more robust to hyperparameters. We argue that the relatively poor performance of iterative approaches is a result of the high variance inherent in doing off-policy evaluation and magnified by the repeated optimization of policies against those high-variance estimates. In addition, we hypothesize that the strong performance of the one-step algorithm is due to a combination of favorable structure in the environment and behavior policy.
1 Introduction
An important step towards effective real-world RL is to improve sample efficiency. One avenue towards this goal is offline RL (also known as batch RL), where we attempt to learn a new policy from data collected by some other behavior policy without interacting with the environment. Recent work in offline RL is well summarized by levine2020offline.
In this paper, we challenge the dominant paradigm in the deep offline RL literature that primarily relies on actor-critic-style algorithms that alternate between policy evaluation and policy improvement [fujimoto2018off, fujimoto2019benchmarking, peng2019advantage, kumar2019stabilizing, kumar2020conservative, wang2020critic, wu2019behavior, kostrikov2021offline, jaques2019way, Siegel2020Keep, nachum2019algaedice]. All these algorithms rely heavily on off-policy evaluation to learn the critic. Instead, we find that a simple baseline which only performs one step of policy improvement using the behavior Q function often outperforms the more complicated iterative algorithms. Explicitly, we find that our one-step algorithm beats prior results of iterative algorithms on most of the gym-mujoco [brockman2016gym] and Adroit [rajeswaran2017learning] tasks in the D4RL benchmark suite [fu2020d4rl].
We then dive deeper to understand why such a simple baseline is effective. First, we examine what goes wrong for the iterative algorithms. When these algorithms struggle, it is often due to poor off-policy evaluation leading to inaccurate Q values. We attribute this to two causes: (1) distribution shift between the behavior policy and the policy to be evaluated, and (2) iterative error exploitation, whereby policy optimization introduces bias and dynamic programming propagates this bias across the state space. We show that empirically both issues exist in the benchmark tasks and that one way to avoid these issues is to simply avoid off-policy evaluation entirely.
Finally, we recognize that while the one-step algorithm is a strong baseline, it is not always the best choice. In the final section we provide some guidance about when iterative algorithms can perform better than the simple one-step baseline. Namely, when the dataset is large and the behavior policy has good coverage of the state-action space, off-policy evaluation can succeed and iterative algorithms can be effective. In contrast, if the behavior policy is already fairly good, but as a result does not have full coverage, then one-step algorithms are often preferable.
Our main contributions are:


A demonstration that a simple baseline of one step of policy improvement outperforms more complicated iterative algorithms on a broad set of offline RL problems.

An examination of failure modes of off-policy evaluation in iterative offline RL algorithms.

A description of when one-step algorithms are likely to outperform iterative approaches.
2 Setting and notation
We consider the following offline RL setup. Let $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma, \rho)$ be a discounted infinite-horizon MDP with state space $\mathcal{S}$, action space $\mathcal{A}$, transition kernel $P$, reward distribution $R$, discount factor $\gamma$, and initial state distribution $\rho$. In this work we focus on applications in continuous control, so we will generally assume that both $\mathcal{S}$ and $\mathcal{A}$ are continuous and bounded. We consider the offline setting where, rather than interacting with $\mathcal{M}$, we only have access to a dataset $D$ of tuples $(s, a, r, s')$ collected by some behavior policy $\beta$ with initial state distribution $\rho$. Let $r(s,a) = \mathbb{E}[R(s,a)]$ be the expected reward. Define the state-action value function for any policy $\pi$ by $Q^\pi(s,a) = \mathbb{E}_\pi\big[\sum_{t=0}^\infty \gamma^t r(s_t, a_t) \mid s_0 = s,\ a_0 = a\big]$. The objective is to maximize the expected return of the learned policy:
$J(\pi) = \mathbb{E}_{s \sim \rho,\ a \sim \pi(\cdot|s)}\big[Q^\pi(s, a)\big] \qquad (1)$
Following fu2020d4rl and others in this line of work, we allow access to the environment to tune a small (< 10) set of hyperparameters. See paine2020hyperparameter for a discussion of the active area of research on hyperparameter tuning for offline RL. We also discuss this further in Appendix C.
3 Related work
Iterative algorithms.
Most prior work on deep offline RL consists of iterative actor-critic algorithms. The primary innovation of each paper is to propose a different mechanism to ensure that the learned policy does not stray too far from the data generated by the behavior policy. Broadly, we group these methods into three camps: policy constraints/regularization, variants of imitation learning, and Q regularization:


The majority of prior work acts directly on the policy. Some authors have proposed explicit constraints on the learned policy to only select actions that have sufficient support under the data-generating distribution [fujimoto2018off, fujimoto2019benchmarking, laroche2019safe]. Another proposal is to regularize the learned policy towards the behavior policy [wu2019behavior], usually either with a KL divergence [jaques2019way] or MMD [kumar2019stabilizing]. This is a very straightforward way to stay close to the behavior, with a hyperparameter that determines just how close. All of these algorithms are iterative and rely on off-policy evaluation.

Siegel2020Keep, wang2020critic, chen2020bail all use algorithms that filter out datapoints with low Q values and then perform imitation learning. wang2018exponentially, peng2019advantage use a weighted imitation learning algorithm where the weights are determined by exponentiated Q values. These algorithms are iterative.

Another way to prevent the learned policy from choosing unknown actions is to incorporate some form of regularization to encourage staying near the behavior and being pessimistic about unknown state-action pairs [wu2019behavior, nachum2019algaedice, kumar2020conservative, kostrikov2021offline, gulcehre2021regularized]. However, properly quantifying uncertainty about unknown states is notoriously difficult when dealing with neural network value functions [buckman2020importance].
One-step algorithms.
Some recent work has also noted that optimizing policies based on the behavior value function can perform surprisingly well. As we do, goo2020you studies the continuous control tasks from the D4RL benchmark, but they examine a complicated algorithm involving ensembles, distributional Q functions, and a novel regularization technique. In contrast, we analyze a substantially simpler algorithm that performs better in our experiments and focus more of our contribution on explaining this result. gulcehre2021regularized studies the discrete-action setting and finds that a one-step algorithm (which they call "behavior value estimation") outperforms prior work on Atari games and other discrete-action tasks from the RL Unplugged benchmark [gulcehre2020rl]. They also introduce a novel regularizer for the evaluation step. In contrast, we consider the continuous control setting. This is a substantial difference in setting since continuous control requires actor-critic algorithms with parametric policies, while in the discrete setting the policy improvement step can be computed exactly from the Q function. Moreover, while gulcehre2021regularized attribute the poor performance of iterative algorithms to "overestimation", we define and separate the issues of distribution shift and iterative error exploitation, which can combine to cause overestimation. This separation helps to distinguish the fundamental limits of off-policy evaluation from the specific problems induced by iterative algorithms, and will hopefully be a useful distinction to inspire future work.
There are also important connections between the one-step algorithm and the literature on conservative policy improvement [kakade2002approximately, schulman2015trust, achiam2017constrained], which we discuss in more detail in Appendix B.
4 Defining the algorithms
In this section we provide a unified algorithmic template for offline RL algorithms as offline approximate modified policy iteration. We show how this template captures our one-step algorithm as well as multi-step policy iteration and iterative actor-critic algorithms. Any choice of policy evaluation and policy improvement operators then defines one-step, multi-step, and iterative variants.
4.1 Algorithmic template
We consider a generic offline approximate modified policy iteration (OAMPI) scheme, shown in Algorithm 1. Essentially, the algorithm alternates between two steps. First, there is a policy evaluation step where we estimate the Q function $\hat Q^{\pi_{i-1}}$ of the current policy $\pi_{i-1}$ using only the dataset $D$. Implementations also often use the prior estimate $\hat Q^{\pi_{i-2}}$ to warm-start the approximation process. Second, there is a policy improvement step. This step takes the estimated Q function $\hat Q^{\pi_{i-1}}$, the estimated behavior $\hat\beta$, and the dataset $D$ and produces a new policy $\pi_i$. Again, an algorithm may use $\pi_{i-1}$ to warm-start the optimization. Moreover, we expect this improvement step to be regularized or constrained to ensure that $\pi_i$ remains in the support of the behavior policy and the data. Choices for this regularization/constraint are discussed below. We now discuss a few ways to instantiate the template.
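The alternation just described can be sketched as a small generic loop. This is an illustrative skeleton under our own naming (the callables `evaluate`, `improve`, and the argument names are ours, not from any released implementation):

```python
def oampi(dataset, behavior, evaluate, improve, num_iterations):
    """Offline approximate modified policy iteration (a sketch).

    evaluate(policy, dataset, warm_start) -> Q estimate of `policy` from D only
    improve(q, behavior, dataset, warm_start) -> new regularized/constrained policy
    Setting num_iterations = 1 recovers the one-step algorithm.
    """
    pi, q = behavior, None
    for _ in range(num_iterations):
        q = evaluate(pi, dataset, warm_start=q)            # policy evaluation step
        pi = improve(q, behavior, dataset, warm_start=pi)  # policy improvement step
    return pi
```

The one-step, multi-step, and iterative variants below differ only in the number of iterations and in how much work each call to `evaluate` and `improve` does.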
One-step.
The simplest algorithm sets the number of iterations to $T = 1$. We train the policy evaluation step to estimate $\hat Q^\beta$, and then use one of the policy improvement operators discussed below to find the resulting $\pi_1$.
Multi-step.
The multi-step algorithm sets $T > 1$. The first evaluation step is on-policy since $D$ is collected by $\beta$, but subsequent evaluation steps must be off-policy since they require evaluating policies $\pi_{i-1} \neq \beta$. Each iteration is trained to convergence in both the estimation and improvement steps.
Iterative actorcritic.
An actor-critic approach looks somewhat like multi-step policy iteration, but does not attempt to train to convergence at each iteration. Instead, each iteration consists of one gradient step to update the Q estimate and one gradient step to improve the policy. Since all of the evaluation and improvement operators that we consider are gradient-based, this algorithm can adapt the same evaluation and improvement operators used by the multi-step algorithm. Most algorithms from the literature fall into this category [fujimoto2018off, kumar2019stabilizing, kumar2020conservative, wu2019behavior, wang2020critic, Siegel2020Keep].
4.2 Policy evaluation operators
Following prior work on continuous state and action problems, we always evaluate by simple fitted Q evaluation [fujimoto2018off, kumar2019stabilizing, Siegel2020Keep, wang2020critic, paine2020hyperparameter, wang2021instabilities]. Explicitly, the evaluation step for the one-step or multi-step algorithms looks like
$\hat Q^{\pi_{i-1}} = \arg\min_{Q}\ \mathbb{E}_{(s,a,r,s') \sim D}\Big[\Big(Q(s,a) - \big(r + \gamma\, \mathbb{E}_{a' \sim \pi_{i-1}(\cdot|s')}[Q(s',a')]\big)\Big)^2\Big] \qquad (2)$
where the right-hand side may depend on $\hat Q^{\pi_{i-2}}$ to warm-start optimization. In practice this is optimized by stochastic gradient descent with the use of a target network [mnih2015human]. For the iterative algorithm the $\arg\min$ is replaced by a single stochastic gradient step. We estimate the inner expectation over $a'$ by a single sample from $\pi_{i-1}$ (or from the dataset action in the case when $\pi_{i-1} = \beta$). See voloshin2019empirical, wang2021instabilities for more comprehensive examinations of this evaluation step.
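To make the evaluation step concrete, here is a tabular sketch of fitted Q evaluation. It is only an illustrative analogue of Eq. (2): the paper's setting uses neural networks on continuous spaces, whereas this toy version does exact regression per state-action pair with a frozen "target" copy of Q playing the role of the target network:

```python
import numpy as np

def fitted_q_evaluation(transitions, policy, n_states, n_actions,
                        gamma=0.99, n_sweeps=200):
    """Tabular fitted Q evaluation (sketch): regress Q(s, a) toward the
    one-sample Bellman target r + gamma * E_{a'~pi}[Q_target(s', a')].

    transitions: list of (s, a, r, s_next, done) tuples from the dataset D
    policy: array of shape (n_states, n_actions) with action probabilities
    """
    q = np.zeros((n_states, n_actions))
    for _ in range(n_sweeps):
        q_target = q.copy()   # frozen copy, standing in for a target network
        q_new = q.copy()
        for s, a, r, s_next, done in transitions:
            backup = r + (0.0 if done else gamma * policy[s_next] @ q_target[s_next])
            q_new[s, a] = backup   # exact regression in the tabular case
        q = q_new
    return q
```

On a one-state self-loop with reward 1 and $\gamma = 0.5$, the sweeps converge to the true value $1/(1-\gamma) = 2$.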
4.3 Policy improvement operators
To instantiate the template, we also need to choose a specific policy improvement operator. We consider the following improvement operators, selected from those discussed in the related work section. Each operator has a hyperparameter controlling deviation from the behavior policy.
Behavior cloning.
The simplest baseline worth including is to just return the estimated behavior $\hat\beta$ as the new policy $\pi_i$. Any policy improvement operator ought to perform at least as well as this baseline.
Constrained policy updates.
Algorithms like BCQ [fujimoto2018off] and SPIBB [laroche2019safe] constrain the policy updates to be within the support of the data/behavior. In favor of simplicity, we implement a simplified version of the BCQ algorithm, which we call Easy BCQ, that removes BCQ's learned perturbation network. We define a new policy by drawing $M$ samples from $\hat\beta$ and then executing the one with the highest value according to $\hat Q^{\pi_{i-1}}$. Explicitly:
$\hat\pi_i^{EBCQ}(s) = \arg\max_{a_j}\ \big\{\hat Q^{\pi_{i-1}}(s, a_j) \;:\; a_j \sim \hat\beta(\cdot \mid s),\ 1 \le j \le M \big\} \qquad (3)$
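The Easy BCQ action rule is simple enough to sketch directly. The function below is a hypothetical illustration (the callables `sample_behavior` and `q_value` are placeholders for the learned behavior density model and Q estimate):

```python
import numpy as np

def easy_bcq_action(state, sample_behavior, q_value, m=10):
    """Easy BCQ action selection (sketch): draw M candidate actions from
    the estimated behavior policy and execute the one with the highest
    estimated Q value, so the policy never leaves the behavior's support."""
    candidates = [sample_behavior(state) for _ in range(m)]
    scores = [q_value(state, a) for a in candidates]
    return candidates[int(np.argmax(scores))]
```

The hyperparameter $M$ controls deviation from the behavior: $M = 1$ reduces to sampling from $\hat\beta$, while large $M$ greedily exploits $\hat Q^{\pi_{i-1}}$ within its support.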
Regularized policy updates.
Another common idea proposed in the literature is to regularize towards the behavior policy [wu2019behavior, jaques2019way, kumar2019stabilizing, ma2019imitation]. For a divergence $D$ and regularization weight $\alpha$, we can define an algorithm that maximizes a regularized objective:
$\hat\pi_i = \arg\max_\pi\ \mathbb{E}_{s \sim D}\Big[\mathbb{E}_{a \sim \pi(\cdot|s)}\big[\hat Q^{\pi_{i-1}}(s,a)\big] - \alpha\, D\big(\pi(\cdot|s)\,\big\|\,\hat\beta(\cdot|s)\big)\Big] \qquad (4)$
A comprehensive review of different variants of this method can be found in wu2019behavior, which does not find dramatic differences across regularization techniques. In practice, we will use the reverse KL divergence, i.e. $D_{KL}\big(\pi(\cdot|s)\,\|\,\hat\beta(\cdot|s)\big)$. To compute the reverse KL, we draw samples from $\pi$ and use the density estimate $\hat\beta$ to compute the divergence. Intuitively, this regularization forces $\pi$ to remain within the support of $\hat\beta$ rather than incentivizing $\pi$ to cover $\hat\beta$.
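For intuition about the objective in Eq. (4), note that for a single state with a discrete action set it has a closed-form maximizer. This is only an illustrative discrete analogue (the paper's implementation uses parametric continuous policies and sampled KL estimates), but it makes the role of $\alpha$ transparent:

```python
import numpy as np

def kl_regularized_policy(q_values, beta_probs, alpha):
    """Exact maximizer of E_{a~pi}[Q(s,a)] - alpha * KL(pi || beta) for one
    state with discrete actions: pi(a) is proportional to
    beta(a) * exp(Q(s,a) / alpha).  Large alpha stays close to beta;
    small alpha concentrates on the highest-Q in-support action."""
    w = np.asarray(beta_probs, float) * np.exp(np.asarray(q_values, float) / alpha)
    return w / w.sum()
```

Because the weights multiply $\beta(a)$, actions with no behavior support get zero probability, mirroring the mode-seeking character of the reverse KL described above.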
Variants of imitation learning.
Another idea, proposed by [wang2018exponentially, Siegel2020Keep, wang2020critic, chen2020bail], is to modify an imitation learning algorithm, either by filtering or by weighting the observed actions, so as to obtain a policy improvement. The weighted version that we implement uses exponentiated advantage estimates to weight the observed actions:
$\hat\pi_i = \arg\max_\pi\ \mathbb{E}_{(s,a) \sim D}\Big[\log \pi(a \mid s)\, \exp\big(\hat A^{\pi_{i-1}}(s,a) / \tau\big)\Big] \qquad (5)$
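The corresponding training loss is just weighted behavior cloning. The sketch below is our own minimal rendering, assuming (as in our reconstruction of Eq. (5)) that the weight is $\exp(\hat A / \tau)$ with temperature $\tau$:

```python
import numpy as np

def weighted_bc_loss(log_probs, advantages, tau=1.0):
    """Exponentially weighted imitation loss (sketch): negative
    log-likelihood of the observed dataset actions, weighted by
    exp(advantage / tau).  High-advantage actions dominate the regression;
    tau controls how peaked the weights are."""
    w = np.exp(np.asarray(advantages, float) / tau)
    return float(-(w * np.asarray(log_probs, float)).mean())
```

Since every weight is positive and attached to an action that actually appears in $D$, the learned policy is pulled only toward in-dataset actions, never toward unseen ones.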
5 Benchmark Results
Our main empirical finding is that one step of policy improvement is sufficient to beat state-of-the-art results on much of the D4RL benchmark suite fu2020d4rl. This is striking since prior work focuses on iteratively estimating the Q function of the current policy iterate, but we only use one step of improvement derived from $\hat Q^\beta$. Results are shown in Table 1. Full experimental details are in Appendix C.
  Iterative  One-step
  fu2020d4rl  BC  Easy BCQ  Rev. KL Reg  Exp. Weight
halfcheetah-medium  46.3  41.9 ± 0.1  52.6 ± 0.2  55.2 ± 0.4  48.4 ± 0.1
walker2d-medium  81.1  68.6 ± 6.3  87.2 ± 1.3  85.9 ± 1.4  81.8 ± 2.2
hopper-medium  58.8  49.9 ± 3.1  74.5 ± 6.2  83.7 ± 4.5  59.6 ± 2.5
halfcheetah-medium-expert  64.7  61.1 ± 2.7  78.2 ± 1.6  93.8 ± 0.5  93.4 ± 1.6
walker2d-medium-expert  111.0  78.5 ± 22.4  112.2 ± 0.3  111.2 ± 0.2  113.0 ± 0.4
hopper-medium-expert  111.9  49.1 ± 4.3  85.1 ± 2.2  98.7 ± 7.5  103.3 ± 9.1
halfcheetah-medium-replay  47.7  34.6 ± 0.9  38.3 ± 0.3  41.9 ± 0.5  38.1 ± 1.3
walker2d-medium-replay  26.7  26.6 ± 3.4  69.1 ± 4.2  74.9 ± 6.6  49.5 ± 12.0
hopper-medium-replay  48.6  23.1 ± 2.7  78.4 ± 7.2  92.3 ± 1.1  97.5 ± 0.7
halfcheetah-random  35.4  2.2 ± 0.0  5.4 ± 0.3  8.8 ± 3.8  3.2 ± 0.1
walker2d-random  7.3  0.9 ± 0.1  3.7 ± 0.1  6.2 ± 0.7  5.6 ± 0.8
hopper-random  12.2  2.0 ± 0.1  6.6 ± 0.1  7.9 ± 0.7  7.5 ± 0.4
pen-cloned  56.9  46.9 ± 11.0  65.9 ± 3.6  57.4 ± 3.5  60.0 ± 4.1
hammer-cloned  2.1  0.4 ± 0.1  2.9 ± 0.5  0.2 ± 0.1  2.1 ± 0.7
relocate-cloned  0.1  0.1 ± 0.0  0.3 ± 0.2  0.2 ± 0.1  0.2 ± 0.1
door-cloned  0.4  0.0 ± 0.1  0.6 ± 0.6  0.2 ± 0.7  0.2 ± 0.3
As we can see in the table, the one-step algorithms usually outperform the best iterative algorithms tested by fu2020d4rl. The one notable exception is the case of random data (especially on halfcheetah), where iterative algorithms have a clear advantage. We will discuss potential causes of this further in Section 7.
To give a more direct comparison that controls for implementation details, we use our implementation of reverse KL regularization to create multi-step and iterative algorithms. We do not use algorithmic modifications such as Q ensembles, regularized Q values, or early stopping that have been used in prior work, yet our iterative algorithm recovers performance similar to prior regularized actor-critic approaches. These results are shown in Table 2.
  One-step  Multi-step  Iterative
halfcheetah-medium  55.2 ± 0.4  59.3 ± 0.7  51.2 ± 0.2
walker2d-medium  85.9 ± 1.4  74.5 ± 2.8  74.8 ± 0.7
hopper-medium  83.7 ± 4.5  54.8 ± 4.3  54.7 ± 1.9
halfcheetah-medium-expert  93.8 ± 0.5  94.2 ± 0.5  93.7 ± 0.6
walker2d-medium-expert  111.2 ± 0.2  109.8 ± 0.3  108.7 ± 0.6
hopper-medium-expert  98.7 ± 7.5  90.6 ± 18.8  94.5 ± 11.9
halfcheetah-random  8.8 ± 3.8  18.3 ± 6.5  21.2 ± 5.2
walker2d-random  6.2 ± 0.7  5.4 ± 0.2  5.4 ± 0.4
hopper-random  7.9 ± 0.7  21.9 ± 8.9  9.7 ± 0.4
Put together, these results immediately suggest some guidance to the practitioner: it is worthwhile to run the one-step algorithm as a baseline before trying something more elaborate. The one-step algorithm is substantially simpler than prior work, but usually achieves better performance.
6 What goes wrong for iterative algorithms?
The benchmark experiments show that one step of policy improvement often beats iterative and multi-step algorithms. In this section we dive deeper to understand why this happens. First, by examining the learning curves of each of the algorithms, we note that iterative algorithms require stronger regularization to avoid instability. Then we identify two causes of this instability: distribution shift and iterative error exploitation.
Distribution shift causes evaluation error by reducing the effective sample size in the fixed dataset for evaluating the current policy, and has been extensively considered in prior work as discussed below. Iterative error exploitation occurs when we repeatedly optimize policies against our Q estimates and exploit their errors. This introduces a bias towards overestimation at each step (much like the training error in supervised learning is biased to be lower than the test error). Moreover, by iteratively reusing the data and using prior Q estimates to warm-start training at each step, the errors from one step are amplified at the next. This type of error is particular to multi-step and iterative algorithms.
6.1 Learning curves and hyperparameter sensitivity
To begin to understand why iterative and multi-step algorithms can fail, it is instructive to look at the learning curves. As shown in Figure 2, we often observe that the iterative algorithm will begin to learn and then crash. Regularization can help to prevent this crash since strong enough regularization towards the behavior policy ensures that the evaluation is nearly on-policy.
In contrast, the one-step algorithm is more robust to the regularization hyperparameter. The rightmost panel of the figure shows this clearly. While iterative and multi-step algorithms can have their performance degrade very rapidly with the wrong setting of the hyperparameter, the one-step approach is more stable. Moreover, we usually find that the optimal setting of the regularization hyperparameter is lower for the one-step algorithm than for the iterative or multi-step approaches.
6.2 Distribution shift
Any algorithm that relies on off-policy evaluation will struggle with distribution shift in the evaluation step. Trying to evaluate a policy that is substantially different from the behavior reduces the effective sample size and increases the variance of the estimates. Explicitly, by distribution shift we mean the shift between the behavior distribution (the distribution over state-action pairs in the dataset) and the evaluation distribution (the distribution that would be induced by the policy we want to evaluate).
Prior work.
There is a substantial body of prior theoretical work suggesting that off-policy evaluation can be difficult and that this difficulty scales with some measure of distribution shift. wang2020statistical, Amortila2020AVO, zanette2021exponential give exponential (in horizon) lower bounds on sample complexity in the linear setting, even with good feature representations that can represent the desired Q function and assuming good data coverage. Upper bounds generally require very strong assumptions on both the representation and limits on the distribution shift [wang2021instabilities, duan2020minimax, chen2019information]. Moreover, the assumed bounds on distribution shift can be exponential in horizon in the worst case. On the empirical side, wang2021instabilities demonstrates issues with distribution shift when learning from pretrained features and provides a nice discussion of why distribution shift causes error amplification. fujimoto2018off raises a similar issue under the name "extrapolation error". Regularization and constraints are meant to reduce issues stemming from distribution shift, but they also reduce the potential for improvement over the behavior.
Empirical evidence.
Both the multi-step and iterative algorithms in our experiments rely on off-policy evaluation as a key subroutine. We examine how easy it is to evaluate the policies encountered along the learning trajectory. To control for issues of iterative error exploitation (discussed in the next subsection), we train Q estimators from scratch on a held-out evaluation dataset sampled from the behavior policy. We then evaluate these trained Q functions on rollouts from 1000 datapoints sampled from the replay buffer. Results are shown in Figure 3.
The results show a correlation between the KL divergence from the behavior policy and the MSE of the Q estimates. Moreover, we see that the MSE generally increases over training. One way to mitigate this, as seen in the figure, is to use a large value of the regularization parameter $\alpha$. We just cannot take a very large step before running into problems with distribution shift. But, when we take such a small step, the information from the on-policy $\hat Q^\beta$ is about as useful as the newly estimated off-policy Q function. This is seen, for example, in Figure 2, where we get very similar performance across algorithms at high levels of regularization.
6.3 Iterative error exploitation
The previous subsection identifies how any algorithm that uses off-policy evaluation is fundamentally limited by distribution shift, even if we were given fresh data and trained Q functions from scratch at every iteration. But, in practice, iterative algorithms repeatedly alternate between optimizing policies against estimated Q functions and re-estimating those Q functions, reusing the same data and warm-starting each re-estimation from the Q function of the previous step. This induces dependence between steps that causes a problem we call iterative error exploitation.
Intuition about the problem.
In short, iterative error exploitation happens because $\pi_i$ tends to choose overestimated actions in the policy improvement step, and then this overestimation propagates via dynamic programming in the policy evaluation step. To illustrate this issue more formally, consider the following: at each iteration $i$ we suffer some Bellman error $\epsilon_i$ based on our fixed dataset collected by $\beta$. Formally,
$\hat Q^{\pi_i}(s,a) = r(s,a) + \epsilon_i(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot|s,a)}\, \mathbb{E}_{a' \sim \pi_i(\cdot|s')}\big[\hat Q^{\pi_i}(s',a')\big] \qquad (6)$
Intuitively, $\epsilon_i$ will be larger at state-action pairs with less coverage in the dataset collected by $\beta$. Note that $\epsilon_i$ can absorb all noise due to our finite dataset as well as function approximation error.
All that is needed to cause iterative error exploitation is that the $\epsilon_i$ are highly correlated across different $i$, but for simplicity we will assume that $\epsilon_i$ is the same for all policies estimated from our fixed offline dataset and instead write $\epsilon$. Now that the errors do not depend on the policy, we can treat them as an auxiliary reward that obscures the true reward and see that
$\hat Q^{\pi_i}(s,a) = \mathbb{E}_{\pi_i}\Big[\sum_{t=0}^\infty \gamma^t\, \big(r(s_t,a_t) + \epsilon(s_t,a_t)\big) \,\Big|\, s_0 = s,\ a_0 = a\Big] \qquad (7)$
This assumption is somewhat reasonable since we expect the error to primarily depend on the data. And, when the prior Q function is used to warm-start the current one (as is generally the case in practice), the approximation errors are automatically passed between steps.
Now we can explain the problem. Recall that under our assumption the errors $\epsilon$ are fixed once we have a dataset and are likely to have larger magnitude the further we go from the support of the dataset. So, with each step, $\pi_i$ is able to better maximize the auxiliary reward $\epsilon$, thus moving further from $\beta$ and increasing the magnitude of the error term in Eq. (7) relative to the true-reward term. Even though $\hat Q^{\pi_i}$ may provide better signal than $\hat Q^\beta$, that signal can easily be drowned out by the accumulated error. In contrast, the error term in $\hat Q^\beta$ has small magnitude, so the one-step algorithm is robust to errors.[1]

[1] We should note that iterative error exploitation is similar to, but distinct from, the overestimation addressed by double Q learning [van2016deep, fujimoto2018addressing]. Since we are in the offline setting, the errors due to our finite dataset can be iteratively exploited more and more, while in the online setting considered by double Q learning, fresh data prevents this issue. We are also considering an algorithm based on policy iteration rather than value iteration.
An example.
Now we consider a simple gridworld example to illustrate iterative error exploitation. This example fits exactly into the setup outlined above since all errors are due to reward estimation, so the error $\epsilon$ is indeed constant across all policies. The gridworld we consider has one deterministic good state with reward 1 and many stochastic bad states with noisy rewards. We collect a dataset of 100 trajectories, each of length 100. One run of the multi-step offline regularized policy iteration algorithm is illustrated in Figure 4.
In the example, like in the D4RL benchmark, we see that one step outperforms multiple steps of improvement. Intuitively, when there are so many noisy states, it is likely that a few of them will be overestimated. Since the data is reused for each step, these overestimations persist and propagate across the state space due to iterative error exploitation. This property of having many bad but poorly estimated states likely also exists in the high-dimensional control problems encountered in the benchmark, where there are many ways for the robots to fall down that are not observed in the data for non-random behavior. Moreover, both settings have larger errors in areas where we have less data. So even though the errors in the gridworld are caused by noise in the rewards, while errors in D4RL are caused by function approximation, we think this is a useful mental model of the problem.
Empirical evidence.
In practice we cannot easily visualize the progression of errors. However, the dependence between steps still arises as overestimation of the Q values. We can track the overestimation of the Q values over training as a way to measure how much bias is being induced by optimizing against our dependent Q estimators. As a control, we can also train Q estimators from scratch on independently sampled evaluation data. These independently trained Q functions do not have the same overestimation bias, even though the squared error does tend to increase as the policy moves further from the behavior (as seen in Figure 3). Explicitly, we track 1000 state-action pairs from the replay buffer over training. For each checkpointed policy we perform 3 rollouts at each state to get an estimate of the true Q value and compare this to the estimated Q value. Results are shown in Figure 5.
7 When are multiple steps useful?
So far we have focused on why the one-step algorithm often works better than the multi-step and iterative algorithms. However, we do not want to give the impression that one-step is always better. Indeed, our own experiments in Section 5 show a clear advantage for the multi-step and iterative approaches when we have randomly collected data. While we cannot offer a precise delineation of when one-step will outperform multi-step, in this section we offer some intuition as to when we can expect to see benefits from multiple steps of policy improvement.
As seen in Section 6, multi-step and iterative algorithms have problems when they propagate estimation errors. This is especially problematic in noisy and/or high-dimensional environments. While the multi-step algorithms propagate this noise more widely than the one-step algorithm, they also propagate the signal. So, when we have sufficient coverage to reduce the magnitude of the noise, this increased propagation of signal can be beneficial. The D4RL experiments suggest that we are usually on the side of the tradeoff where the errors are large enough to make one-step preferable.
In Appendix A we illustrate a simple gridworld example where a slight modification of the behavior policy from Figure 4 makes multi-step dramatically outperform one-step. This modified behavior policy (1) has better coverage of the noisy states (which reduces error, helping multi-step), and (2) does a worse job propagating the reward from the good state (hurting one-step).
We can also test empirically how the behavior policy affects the tradeoff between error and signal propagation. To do this we construct a simple experiment where we mix data from the random behavior policy with data from the medium behavior policy. Explicitly, we construct a dataset $D_p$ out of the random and medium datasets such that each trajectory in $D_p$ comes from the medium dataset with probability $p$. So for $p = 0$ we have the random dataset and for $p = 1$ we have the medium dataset, and in between we have various mixtures. Results are shown in Figure 6. It takes surprisingly little data from the medium policy for one-step to outperform the iterative algorithm.
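The mixture construction is simple to sketch. This is a hypothetical rendering of the sampling scheme just described (the function name and trajectory representation are ours):

```python
import random

def mix_datasets(random_trajs, medium_trajs, p_medium, seed=0):
    """Build a mixture dataset (sketch): each trajectory slot is filled from
    the medium dataset with probability p_medium, otherwise from the random
    dataset.  p_medium = 0 recovers random data; p_medium = 1 medium data."""
    rng = random.Random(seed)
    n = min(len(random_trajs), len(medium_trajs))
    return [medium_trajs[i] if rng.random() < p_medium else random_trajs[i]
            for i in range(n)]
```

Sweeping `p_medium` over a grid then produces the family of datasets interpolating between the two behavior policies.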
8 Discussion, limitations, and future work
This paper presents the surprising effectiveness of a simple one-step baseline for offline RL. We examine the failure modes of iterative algorithms and the conditions where we might expect them to outperform the simple one-step baseline. This provides guidance to a practitioner that the simple one-step baseline is a good place to start when approaching an offline RL problem.
But we leave many questions unanswered. One main limitation is that we lack a clear theoretical characterization of which environments and behaviors guarantee that one-step outperforms multi-step, or vice versa. Such results will likely require strong assumptions, but could provide useful insight. We do not expect this to be easy, as it requires understanding policy iteration, which has been notoriously difficult to analyze, often converging much faster than the theory would suggest [sutton2018reinforcement, Agarwal2019ReinforcementLT]. Another limitation is that while only using one step is perhaps the simplest way to avoid the problems of off-policy evaluation, there may be other, more elaborate algorithmic solutions that we did not consider here. However, our strong empirical results suggest that the one-step algorithm is at least a strong baseline.
Acknowledgements
This work is partially supported by the Alfred P. Sloan Foundation, NSF RI1816753, NSF CAREER CIF 1845360, NSF CHS1901091, Samsung Electronics, and the Institute for Advanced Study. DB is supported by the Department of Defense (DoD) through the National Defense Science & Engineering Graduate Fellowship (NDSEG) Program.
References
Appendix A Gridworld example where multi-step outperforms one-step
As explained in the main text, this section presents an example that is only a slight modification of the one in Figure 4, but where a multi-step approach is clearly preferred over just one step. The data-generating and learning processes are exactly the same (100 trajectories of length 100, discount 0.9, reverse KL regularization). The only difference is that rather than using a behavior that is a mixture of the optimal policy and the uniform policy, we use a behavior that is a mixture of a maximally suboptimal policy and the uniform policy. The suboptimal policy always goes down and left in our gridworld, and the modified behavior mixes it with the uniform policy. Results are shown in Figure 7.
By being more likely to go to the noisy states, this behavior policy allows us to get lower-variance estimates of the rewards. Essentially, the coverage of the behavior policy in this example reduces the magnitude of the evaluation errors. This allows for more aggressive planning using multi-step methods. Moreover, since the behavior is less likely to go to the good state, the behavior Q function does not propagate the signal from the rewarding state as far, harming the one-step method.
Appendix B Connection to policy improvement guarantees
The regularized or constrained one-step algorithm performs an update that directly inherits guarantees from the literature on conservative policy improvement [kakade2002approximately, schulman2015trust, achiam2017constrained]. These original papers consider an online setting where more data is collected at each step, but the guarantee at each step applies to our one-step offline algorithm.
The key idea of this line of work begins with the performance difference lemma of kakade2002approximately, and then lower bounds the amount of improvement over the behavior policy. Define the discounted state visitation distribution for a policy $\pi$ by $d^{\pi}(s) = (1-\gamma)\sum_{t=0}^{\infty} \gamma^{t}\, P(s_t = s \mid \pi)$. We will also use the shorthand $A^{\pi}(s, \pi')$ to denote $\mathbb{E}_{a \sim \pi'(\cdot \mid s)}\big[A^{\pi}(s, a)\big]$. Then we have the performance difference lemma as follows.
Lemma 1 (Performance difference, kakade2002approximately).
For any two policies $\pi$ and $\beta$,
$$J(\pi) - J(\beta) = \frac{1}{1-\gamma}\,\mathbb{E}_{s \sim d^{\pi}}\big[A^{\beta}(s, \pi)\big]. \qquad (8)$$
Then, Corollary 1 from achiam2017constrained (reproduced below) gives a guarantee for the one-step algorithm. The key idea is that when $\pi$ is sufficiently close to $\beta$, we can use $d^{\beta}$ as an approximation to $d^{\pi}$.
Lemma 2 (Conservative Policy Improvement, achiam2017constrained).
For any two policies $\pi$ and $\beta$, let $\epsilon = \max_{s}\big|\mathbb{E}_{a \sim \pi(\cdot \mid s)}[A^{\beta}(s, a)]\big|$. Then,
$$J(\pi) - J(\beta) \;\geq\; \frac{1}{1-\gamma}\,\mathbb{E}_{s \sim d^{\beta}}\Big[A^{\beta}(s, \pi) - \frac{2\gamma\epsilon}{1-\gamma}\, D_{TV}\big(\pi(\cdot \mid s)\,\|\,\beta(\cdot \mid s)\big)\Big] \qquad (9)$$
where $D_{TV}$ denotes the total variation distance.
Replacing $A^{\beta}$ with our estimate $\widehat{A}^{\beta}$ and the TV distance by the KL, we get precisely the objective that we optimize in the one-step algorithm. This shows that the one-step algorithm indeed optimizes a lower bound on the performance difference. Of course, in practice we replace the potentially large multiplier on the divergence term with a hyperparameter, but this theory at least motivates the soundness of the approach.
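Collapsing the divergence multiplier from the bound into a single hyperparameter $\alpha$, and writing $\beta$ for the behavior policy, $\widehat{A}^{\beta}$ for the estimated advantage, and $d^{\beta}$ for the discounted state distribution, the resulting regularized one-step objective can be sketched as:

```latex
\max_{\pi}\;\; \mathbb{E}_{s \sim d^{\beta}}\Big[\,
    \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[\widehat{A}^{\beta}(s, a)\big]
    \;-\; \alpha\, D_{\mathrm{KL}}\big(\pi(\cdot \mid s)\,\|\,\beta(\cdot \mid s)\big)
\Big]
```

Here $\alpha$ stands in for the $\tfrac{2\gamma\epsilon}{1-\gamma}$ multiplier from the lemma, tuned as a hyperparameter.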
We are not aware of similar guarantees for the iterative or multi-step approaches that rely on off-policy evaluation.
Appendix C Experimental setup
C.1 Benchmark experiments (Tables 1 and 2, Figure 2)
Data.
We use the datasets from the D4RL benchmark [fu2020d4rl]. We use the latest versions, which are v2 for the MuJoCo datasets and v1 for the Adroit datasets.
Hyperparameter tuning.
Algorithm                                Hyperparameter set
Reverse KL (regularization weight)       {0.03, 0.1, 0.3, 1.0, 3.0, 10.0}
Easy BCQ (number of sampled actions)     {2, 5, 10, 20, 50, 100}
Exponentially weighted (temperature)     {0.1, 0.3, 1.0, 3.0, 10.0, 30.0}
We follow the practice of fu2020d4rl and tune a small set of hyperparameters by interacting with the simulator to estimate the value of the policies learned under each hyperparameter setting. The hyperparameter sets for each algorithm can be seen in Table 3.
This may initially seem like "cheating", but it can be a reasonable setup if we are considering applications like robotics where we can feasibly test a small number of trained policies on the real system. Moreover, since prior work has used this setup, adopting it makes our results easiest to compare. While beyond the scope of this work, we do think that better offline model selection procedures will be crucial to making offline RL more broadly applicable. A good primer on this topic can be found in paine2020hyperparameter.
Models.
All of our Q functions and policies are simple MLPs with ReLU activations and 2 hidden layers of width 1024. Our policies output a truncated normal distribution with diagonal covariance, from which we can get reparameterized samples by sampling from a uniform distribution and computing the differentiable inverse CDF [Burkhardt2014truncated]. We found this to be more stable than the tanh-squashed normal used by e.g. fu2020d4rl, while achieving similar performance when both are stable. We use these same models across all experiments.
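The inverse-CDF sampling trick described above can be sketched in plain Python using the standard library's `statistics.NormalDist`; the function name and plain-float interface are illustrative, since the actual implementation operates on reparameterized tensors:

```python
import statistics

def truncnorm_rsample(mu, sigma, low, high, u):
    """Sample from a Normal(mu, sigma) truncated to [low, high],
    given u ~ Uniform(0, 1).

    The inverse-CDF trick: map u into the CDF mass between the
    truncation bounds, then apply the standard normal inverse CDF.
    Because the sample is a deterministic, differentiable function
    of u, this supports the reparameterization gradient.
    """
    std = statistics.NormalDist()        # standard normal
    a = std.cdf((low - mu) / sigma)      # CDF at lower bound
    b = std.cdf((high - mu) / sigma)     # CDF at upper bound
    z = std.inv_cdf(a + u * (b - a))     # inverse CDF of rescaled u
    return mu + sigma * z
```

By symmetry, `u = 0.5` recovers the mean when the bounds are symmetric around it, and any `u` strictly inside (0, 1) yields a sample strictly inside the truncation interval.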
One-step training procedure.
For all of our one-step algorithms, we train our behavior estimate by imitation learning for 500k gradient steps using Adam [kingma2014adam] with learning rate 1e-4 and batch size 512. We train our estimate of the behavior Q function by fitted Q evaluation with a target network for 2 million gradient steps using Adam with learning rate 1e-4 and batch size 512. The target network is updated softly at every step with parameter 0.005. All policies are trained for 100k steps, again with Adam using learning rate 1e-4 and batch size 512.
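The soft target update used here is the standard Polyak averaging rule; a minimal sketch, with network parameters represented as plain lists of floats (names hypothetical):

```python
def soft_update(target_params, online_params, tau=0.005):
    """Polyak-average the target network toward the online network:
    target <- (1 - tau) * target + tau * online, elementwise.
    A real implementation would iterate over parameter tensors
    rather than flat lists of floats.
    """
    return [(1.0 - tau) * t + tau * o
            for t, o in zip(target_params, online_params)]
```

With a small `tau`, the target network tracks the online Q network slowly, which stabilizes the fitted Q evaluation targets.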
Easy BCQ does not require training a policy network and just uses the estimated behavior policy and Q function to define its policy. For the exponentially weighted algorithm, we clip the weights at 100 to prevent numerical instability. To estimate the reverse KL at some state, we use 10 samples from the current policy and the density defined by our estimated behavior policy.
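The Easy BCQ action rule can be sketched as follows; `sample_behavior` and `q_value` are hypothetical stand-ins for the learned behavior policy and Q estimator:

```python
def easy_bcq_action(state, sample_behavior, q_value, num_samples=10):
    """Select an action by drawing `num_samples` candidates from the
    estimated behavior policy and returning the one with the highest
    estimated Q value. No separate policy network is trained.
    """
    candidates = [sample_behavior(state) for _ in range(num_samples)]
    return max(candidates, key=lambda a: q_value(state, a))
```

The number of candidate samples is the algorithm's single hyperparameter, swept over the set in Table 3.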
Each random seed retrains all three models (behavior, Q, policy) from different initializations. We use three random seeds.
Multi-step training procedure.
For multi-step algorithms we use all the same hyperparameters as one-step. We initialize our policy and Q function from the same pretrained behavior and Q estimates as we use for the one-step algorithm, trained for 500k and 2 million steps respectively. Then we consider 5 policy steps. To ensure that we use the same number of gradient updates on the policy, each step consists of 20k gradient steps on the policy followed by 200k gradient steps on the Q function. Thus, we take the same 100k gradient steps on the policy network. Now the Q updates are off-policy, so the next action is sampled from the current policy rather than taken from the dataset.
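The alternation schedule above can be sketched as a simple loop; `policy_step` and `q_step` are illustrative stand-ins for single gradient updates on each network:

```python
def multi_step_schedule(policy_step, q_step, num_policy_iters=5,
                        policy_steps_per_iter=20_000,
                        q_steps_per_iter=200_000):
    """Alternate policy and Q updates: each of the `num_policy_iters`
    outer steps performs `policy_steps_per_iter` policy gradient steps
    followed by `q_steps_per_iter` off-policy Q gradient steps, so the
    total policy updates (5 * 20k = 100k) match the one-step run.
    """
    for _ in range(num_policy_iters):
        for _ in range(policy_steps_per_iter):
            policy_step()
        for _ in range(q_steps_per_iter):
            q_step()
```

The iterative variant is the limiting case of this schedule with many tiny outer steps (here, 2 Q steps per policy step).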
Iterative training procedure.
For iterative algorithms we again use all the same hyperparameters and initialize from the same pretrained behavior and Q estimates. We again take the same 100k gradient steps on the policy network. For each step on the policy network we take 2 off-policy gradient steps on the Q network.
Evaluation procedure.
To evaluate each policy we run 100 trajectories in the environment and compute the mean. We then report the mean and standard deviation over three training seeds.
C.2 MSE experiment (Figure 3)
Data.
To get an independently sampled dataset of the same size as the training set, we use the behavior-cloned policy to sample 1000 trajectories. The checkpointed policies are taken at intervals of 5000 gradient steps from each of the three training seeds.
Training procedure.
The training procedure is the same as before: Adam with learning rate 1e-4, batch size 512, and a target network with soft updates with parameter 0.005. We train for 1 million steps.
Evaluation procedure.
To evaluate MSE, we sample 1000 state-action pairs from the original training set and run 3 rollouts from each pair. We take the mean over the rollouts, compute the squared error at each state-action pair, and finally obtain the MSE by taking the mean over pairs. The reported reverse KL is evaluated from samples during training: at each state in a batch we take 10 samples to estimate the KL at that state, and then take the mean over the batch.
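The MSE computation can be sketched as follows, with `q_estimate` and `rollout_returns` as hypothetical stand-ins for the learned Q function and the collected Monte Carlo returns:

```python
def q_mse(q_estimate, rollout_returns):
    """Mean squared error between a Q estimate and Monte Carlo returns.

    `q_estimate` maps a (state, action) pair to a scalar, and
    `rollout_returns` maps each pair to its list of sampled discounted
    returns (3 rollouts per pair in the experiment). The rollout mean
    is the regression target at each pair.
    """
    errors = []
    for sa, returns in rollout_returns.items():
        target = sum(returns) / len(returns)    # mean over rollouts
        errors.append((q_estimate(sa) - target) ** 2)
    return sum(errors) / len(errors)            # mean over (s, a) pairs
```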
C.3 Gridworld experiment (Figure 4)
Environment.
The environment is a 15 x 15 gridworld with deterministic transitions. The rewards are deterministically 1 for all actions taken from the state in the top right corner and stochastic for all actions taken from states on the left or bottom walls. The initial state is uniformly random. The discount is 0.9.
Data.
We collect data from a behavior policy that is a mixture of the uniform policy (with probability 0.8) and an optimal policy (with probability 0.2). We collect 100 trajectories of length 100.
Training procedure.
We give the agent access to the deterministic transitions. The only thing for the agent to do is estimate the rewards from the data and then learn in the empirical MDP. We perform tabular Q evaluation by dynamic programming. We initialize with the empirical rewards and do 100 steps of dynamic programming with discount 0.9. Regularized policy updates are solved for exactly by setting $\pi(a \mid s) \propto \beta(a \mid s)\,\exp\big(Q(s, a)/\alpha\big)$.
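The tabular evaluation by dynamic programming can be sketched as follows; the data-structure layout is illustrative, since the agent only needs the empirical rewards, the known deterministic transitions, and the behavior policy's action probabilities:

```python
def tabular_q_evaluation(rewards, next_state, behavior, num_states,
                         num_actions, gamma=0.9, iters=100):
    """Tabular policy evaluation with known deterministic transitions.

    `rewards[s][a]` is the empirical mean reward, `next_state[s][a]`
    the deterministic successor state, and `behavior[s][a]` the
    behavior policy's probability of action a in state s. Q is
    initialized at zero and backed up `iters` times.
    """
    q = [[0.0] * num_actions for _ in range(num_states)]
    for _ in range(iters):
        new_q = [[0.0] * num_actions for _ in range(num_states)]
        for s in range(num_states):
            for a in range(num_actions):
                s2 = next_state[s][a]
                # Expected next value under the behavior policy.
                v_next = sum(behavior[s2][a2] * q[s2][a2]
                             for a2 in range(num_actions))
                new_q[s][a] = rewards[s][a] + gamma * v_next
        q = new_q
    return q
```

On a one-state, one-action self-loop with reward 1 and discount 0.9, 100 backups converge to within about 3e-5 of the fixed point 10 = 1/(1 - 0.9).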
C.4 Overestimation experiment (Figure 5)
This experiment uses the same setup as the MSE experiment. The main difference is that we also consider the Q functions learned during training and demonstrate their overestimation relative to the Q functions trained on the evaluation dataset, as in the MSE experiment.
C.5 Mixed data experiment (Figure 6)
We construct datasets by mixing the random and medium datasets from D4RL and then run the same training procedure as for the benchmark experiments. Each dataset has the same size, but a different proportion of trajectories from the medium policy.
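The mixing construction can be sketched as follows (function and argument names are hypothetical):

```python
import random

def mix_datasets(medium_trajs, random_trajs, proportion_medium,
                 size, seed=0):
    """Build a fixed-size dataset with a given proportion of
    trajectories from the medium dataset, filling the remainder
    with trajectories from the random dataset.
    """
    rng = random.Random(seed)
    n_medium = int(proportion_medium * size)
    mixed = (rng.sample(medium_trajs, n_medium)
             + rng.sample(random_trajs, size - n_medium))
    rng.shuffle(mixed)
    return mixed
```

Sweeping `proportion_medium` while holding `size` fixed isolates the effect of data quality from the effect of data quantity.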
Appendix D Learning curves
In this section we reproduce the learning curves and hyperparameter plots across the one-step, multi-step, and iterative algorithms with reverse KL regularization, as in Figure 2.