

















Review Feedback and Response 

 The
purpose of author feedback is to point out technical errors or
significant misunderstandings. it is not to dispute the opinions of the
reviewers, or to explain how criticisms can be addressed!
 Do not expect replies to your feedback
in the final reviews. Rest assured that feedback identifying true flaws
or misunderstandings will be taken into account in final decisions.

Please do not provide feedback explaining how you can address the
referees' concerns or criticisms. It is our assumption already that
concerns will be addressed in the final version.
 Feedback should be brief and specific.
You will see three reviews. Target your feedback to specific reviewer
comments. For example: "The review says crucial formula x+y on page 5
is wrong. Note the formula given on page 5 is actually x+z, not x+y."

Your job is to identify major misunderstandings, not to dispute the referees' criticisms or value judgements.

Feedback is strictly limited in length to 250 words per review, no exceptions


Paper ID 
75 
Paper authors 
Yao Heng shuai 
Paper title 
Preconditioned temporal difference learning 
Paper subtitle 

Preference 
Research Paper 
Keywords 
Acting and Decision Making::Reinforcement Learning 
Abstract 
We
extend LSTD and LSPE to a family of reinforcement learning algorithms
called the preconditioned temporal difference (PTD) learning. The PTD
family is general and includes four new algorithms besides an extended
version of LSTD, and LSPE. The pertimestep complexities of the four
new PTD algorithms are the same with those of RLSTD and LSPE, i.e.,
O(K2), where K is the number of features. We study the numerical
stability and data efficiency of the PTD algorithms on an extended
Boyan chain example. Experimental results show that, when the problem
becomes illconditioned, the new PTD algorithms remain stable while
LSTD(¸) become instable; for Boyan’s original experimental settings,
the new algorithms have similar data efficiency with LSTD, and converge
faster than linear TD. Experiment on balancing the cartpole shows that
the number of trials needed by the PTD policy evaluation to accomplish
the task is smaller than conventional AHC. 
Average overall recommendation 
7.27 
Download 





 
ReviewerID: 65535 
7 
10 
Comments to author(s)
HIGH LEVEL COMMENTS:
This is a wellwritten paper that explores the use of preconditioning
to improve leastsquares methods for solving prediction problems (such
as policy evaluation). The use of preconditioning is very standard in
the study of large systems of linear equations (MATLAB makes extensive
use of preconditioning).
This paper uses fairly conventional preconditioning methods, including
GaussSeidel, Jacobi and SOR., all of whose properties are well
documented in standard linear algebra texts (e.g Strang).
The basic idea is to solve Ax=b by splitting A into ST, and then
iteratively solving (ST)x = b as Sx = Tx + b. The differences between
preconditioners comes from how they chose to split into S and T. The
rate of convergence of these methods can be shown to depend on the
spectral properties of S^{1}T.
Jacobi uses the diagonal terms of A as S, and moves offdiagonal
elements to T. GaussSeidel uses the whole lower triangular part of A
as S, and moves the upper triangular part to T. SOR is a variant of
these.
One disappointing limitation of this paper is the lack of any
theoretical analysis on the convergence properties of these different
preconditioners. The author(s) claim that the standard results do not
apply since "the coefficients A_{t+1}, b_{t+1}, and C_{t+1} are
stochastic". I find this to be a specious argument, and urge them to
undertake a more complete review of the matrix preconditioning
literature (which is vast).
One is instead left with a set of numerical experiments on largely
simple problems: a discrete chain and a continuous cartpole task (with
highly coarse discretization). These results predictably show
improvements with preconditioning, which is hardly surprising given
what we know already about matrix preconditioning, but are nonetheless
nice to see. The graphs in my hardcopy were completely illegible, with
some graphs (e.g Figure 2 and 3) containing dozens of poorly separated
curves.
DETAILED COMMENTS:
1. It is somewhat strange to talk about LSTD as a "modelbased" solver,
since the A matrix could hardly be called a model in the conventional
sense (it does not provide a distribution of future states conditioned
on past states). Boyan himself admitted the character of A was unclear,
speculating that it could be viewed as a "compressed model" on the
feature space spanned by \phi. The paper adds to the confusion by
sometimes talking about models of A (e.g. in the description of LSPE),
which further clouds the issue. Why not simply characterize these as
lowrank approximations of A, which is really what LSTD and LSPE (and
all the dozen other variants) are doing.
2. Equation 6 should be preceded with a clearer description of
classical matrix preconditioning (such as the description I have made
above in terms of generically solving systems like Ax=b, before
introducing it in the RL context).
3. The computational complexity of the various preconditioning methods
is not specified. Table 1 should be expanded to include this
information.
4. Figure 2 middle plot is completely useless. Can you really expect
anyone to make sense of the confusing barrage of curves all lying one
on top of another?
5. Figure 3 plot 1 is again bewildering. No less than 10 methods are
compared, and no amount of colorization or labeling is going to make
this easier. Is it necessary to display all the variant?
6. The experiments show that GaussSeidel seems to perform the best.
Classically, this is hardly a surprising result since it is known that
GaussSeidel is faster than Jacobi (for example for positive definite
tridiagonal matrices). It would have been helpful to show results for
random matrices, since for nonPD matrices, Jacobi methods can be
faster sometimes.

Summary of review
This is a wellwritten paper that explores the use of preconditioning
to improve leastsquares methods for solving prediction problems (such
as policy evaluation). The use of preconditioning is very standard in
the study of large systems of linear equations (MATLAB makes extensive
use of preconditioning).
It is nice to see they help improve LSTD, although that is hardly
surprising. The lack of any theoretical analysis is a significant
omission.



Authors response by Yao Heng shuai (20070722 03:47:25) to review 


The
theoretical analysis is indeed important. However, we did not include
it in the current paper due to space limitations and we think
readers may be more interested in the ideas and effectiveness of the
proposed algorithms.
In fact,
(1)We have proven the convergence of the RobbinsMonro procedure
(Equation 8) for estimating $A,b$. The proof treated each entry of
$A_t,b_t$ as some compressed information about the transition
probability. The proof was much simpler and more understandable than
those by Tadic (2001) and Bertsekas (2003) for infinitehorizon
problems. We plan to mention these results when modifying our paper.
(2)The convergence and convergence rate of PTD algorithms are
determined by the spectral radius of $I+\alpha (C_t^{1}A_t)$. With
proper $\alpha$ for some specific $C_t$, we can ensure the spectral
radius smaller than $1$ w.p.1, which completes the convergence proof in
a similar way to Bertsekas (2004). It is a wellknown result in
iterative analysis literature that the magnitude of this spectral
radius is just the "w.p.1" convergence rate for PTD algorithms.

Review Evaluation: ++ outstanding review 
The reviewer has a good understanding of preconditioning, and the literature of iterative anlaysis.




 
ReviewerID: 982 
8 
10 
Comments to author(s)
I was really hoping for a paper like this which applies wellknown
concepts from linear algebra to recent RL algorithms that end up
solving linear systems. It is really interesting to see the connection
between LSTD and LSPE, both of them being variants of a general
preconditioning method. And indeed the proposed generic preconditioning
approach offers new opportunities for many variants.
It is somewhat disappointing that the authors chose to stress the
difference between the TD(lambda) and the PTD algorithms. It is a
wellknown fact that batch algorithms outperform incremental ones. What
I was hoping to see is whether there are any new PTD algorithms which
are better than LSTD and LSPE. My impression from what I see in the
paper is that LSTD is by far the best (contrary to the author's claim
that performance is similar to LSTD) and all other PTD algorithms
(including LSPE) are somewhere between LSTD and TD. Of course, this is
an issue that needs deeper study to draw a conclusion and comparison on
problems harder than the simple chain world. In addition, it would be
better to conduct such comparisons on pure RL prediction problems to
exclude the effects of an actor.
The presentation is not the best possible. There is a lot of room to
improve the use of language and make the text flow more naturally. The
graphs are very small and in some cases there is no way to distinguish
the different curves.
In the introduction, it is stated that "LSTD and LSPE build a model".
This statement is not so accurate. LSTD and LSPE build "models" in the
space of features which may be a lot different than the real state
space. Only when these two spaces coincide one can talk about model in
the typical sense. It would be much more accurate to say that "LSTD and
LSPE build a compressed model in the space of features".
The introduction of the step size alpha in equation (7) is confusing.
There is no such step size in equation (6) which is the base equation.
Is there a justification for that? It is also strange that later
(middle of Section 4) the PTD update is rewritten as a convex
combination of the old and the new weights. Which one is true?
There are multiple references to "optimal deltas" used for the
perturbation. How are these "optimal" values found? Are they just good
choices or truly optimal?
Some detailed comments:
Abstract: "called the preconditioned temporal difference" > "called preconditioned temporal difference"
Abstract: "instable" > "unstable" (there are many occurrences throughout the paper)
Abstract: What is AHC?
Section 1: Equation (1) is TD(lambda) and not just TD. Small
difference, but better be accurate. This should also be consistent
throughout the entire paper.
Section 2: "and publications on discussing their relations are few"
> "and there are only a few publications discussing their
relationship"
Section 2: "We follows with an overview" > "We continue with an overview"
Section 2: Some examples from the class of iterative methods for linear
systems should be given, similarly to the examples for the class of
direct methods.
"Good preconditioner can avoid problems such as numerical instability
suffered by matrixinversion approach, and more efficient than direct
approaches because of the iterative computation." > "Good
preconditioning can avoid numerical instabilities which are common in
matrixinversion approaches, and tends to be more efficient than direct
approaches because of the iterative nature of computation."
Section 4: "is variant of" > "variants of"
Section 5: "With the lookup table representation,"  It is not clear
what the features are in this case. It is better to say that indicator
basis functions are used.
Section 5: "guarantee convergence is because it satisfies" > "guarantee convergence because it satisfies"
Section 6: Why are there two graphs in Figure 5. All curves could be
plotted on a single graph with a stretched scale on the yaxis.

Summary of review
I think this is a very good paper which offers new perspective on many
recent RL algorithms. It will make a nice contribution to NIPS. There
is space for improvement and I hope that the authors will take the time
to improve their paper.



Authors response by Yao Heng shuai (20070722 03:56:41) to review 


Simulations
on Boyan's example were not meant to show that LSTD is advantageous
than other PTD algorithms. In the experiment, LSTD was best on Boyan's
example because a bad initialization of $w$ for all nonLSTD algorithms
was used. However, as pointed by Bertsekas(2004), LSPE (and certainly,
other PTD algorithms) can take advantage of a good initialization,
while LSTD cannot. This is especially helpful when the policy is
changed by policy mprovement. Moreover, it should be noted that the
computational complexity of LSTD is $O(K^3)$ while nonLSTD algorithms
are only $O(K^2)$, where K is the number of features.
Why we introduced the step size alpha in equation (7) is explained
in the last passage of Section 3. The introduction of another (convex)
form of PTD update in the middle of Section 4, is to relate it to
standard iteration, and avoid the computation of matrix inversion. One
can verify that this convex form is equivalent to the original PTD
update (Equation 7).
We appreciate the numerous comments which we'll use to improve our
paper presentation. We will modify the article carefully, make the
presentation more logic, and delete useless words and sentences. We
will spare no efforts in making the figures more clear and
constructive. We will magnify them, delete some useless curves, or
replace them with a table in case that many algorithms have to be
compared.

Review Evaluation: ++ outstanding review 




 
ReviewerID: 2451 
5 
2 
Comments to author(s)
This paper presents a new family of modelbased temporal difference
(TD) learning algorithms (preconditioned TD) that encompasses several
previously suggested algorithms within a single framework. The
algorithms use experience to estimate a model of the dynamics of the
environment, learning both the model and the state or action values
incrementally from experience. Instantiations of the algorithm are
tested on an MDP chain problem first explored by Boyan, and the
benchmark cartpole problem.
Unfortunately, although I am quite familiar with reinforcement learning
in general and TD in particular, I found this paper cryptic and almost
impossible to follow. It starts with an onslaught of 7(!) undefined
acronyms in the abstract, and continues with references and reliance on
previous work which prevent the paper from being understood on its own.
Even taking into consideration the space limitation on NIPS papers, I
think it is inexcusable that key ideas are not explained. Chief among
these is the whole idea of preconditioning: what is the role of
preconditioning, is it really done prior to value learning (from
equations (7,8) it seems as if both are carried out simultaneously?),
does it bear any relationship to preconditioning in animal experiments
(such as sensory preconditioning) or is this just an unfortunate
coincidence of names?
Importantly, the key constructs in the preconditioning algorithm are
not explained in an intuitive way: what does C *mean*, and what is its
function in learning? The fact that it can be set to the identity
matrix makes it seem like learning can do without it as well as with
it.. what functionality is assumed by different forms of C? What does
it mean to use A_{t+1} as a preconditioner?
Without such an explanation, it is pretty much impossible for me to
understand the merits of the framework, and what it does and does not
encompass. Why would I want to use it? And which form of C should I
choose? I appreciate that careful reading of all the previous related
work, and working out the equations (or writing simulations) myself
might clarify these issues, but I expect a paper presenting a new
method to try, at least, to explain its semantics rather than just lay
out the algorithm.
The simulations shown do not help clarify the paper. FIrst, they are so
small and crowded with curves, that they are practically ineligible.
Second, comparing a model based method to modelfree TD is hardly a
fair comparison!
Due to my poor understanding of the qualitative merits of the proposed
algorithm, I feel that I can not judge the quality, originality and
significance of the paper. However, if I am a minority and other
reviewers find this paper clear and understandable, my review should be
wholly ignored.

Summary of review
Please summarize your review here in 12 sentences.



Authors response by Yao Heng shuai (20070722 04:09:11) to review 


Preconditioning
presented in this paper is a classical technique from numerical
analysis or standard iterative analysis. It is not related to animal
experiments, at least in this paper. The technique
preconditions a linear system to make it more robust and easier to
solve than matrixinversion approach.
Although not presented in the submitted version, we have indeed
developed some intuitions behind PTD. Researchers have noted
experience replay can improve the data efficiency (Lin 1992). However,
Lin's methods have to maintain a data set, which requires a significant
amount of memory when a lot of experience is available. PTD algorithms,
however, provide another way to replay the experience without
maintaining a data set. These algorithms are thus more efficient than
Lin's methods.

Review Evaluation: 
We
feel sorry for not providing enough materials for preconditioning due
to space limitations. One of our research aims is to provide some new
perspectiver in viewing LSTD and LSPE. It is also a good future
direction to study the proposed points provided by the reviewer.











