banner
bild bild bild bild bild bild bild bild
bild bild bild bild bild bild bild bild bild bild bild bild
print print help help
 User: Yao Heng shuai (hs105)
Start page
Edit user data
Logout
bild
 Author
View own papers
bild
bild
Review Feedback and Response

  • The purpose of author feedback is to point out technical errors or significant misunderstandings. it is not to dispute the opinions of the reviewers, or to explain how criticisms can be addressed!
  • Do not expect replies to your feedback in the final reviews. Rest assured that feedback identifying true flaws or misunderstandings will be taken into account in final decisions.
  • Please do not provide feedback explaining how you can address the referees' concerns or criticisms. It is our assumption already that concerns will be addressed in the final version.
  • Feedback should be brief and specific. You will see three reviews. Target your feedback to specific reviewer comments. For example: "The review says crucial formula x+y on page 5 is wrong. Note the formula given on page 5 is actually x+z, not x+y."
  • Your job is to identify major misunderstandings, not to dispute the referees' criticisms or value judgements.
  • Feedback is strictly limited in length to 250 words per review, no exceptions

 Paper ID 75 
 Paper authors Yao Heng shuai 
 Paper title Preconditioned temporal difference learning 
 Paper subtitle  
 Preference Research Paper
 Keywords Acting and Decision Making::Reinforcement Learning
 Abstract We extend LSTD and LSPE to a family of reinforcement learning algorithms called the preconditioned temporal difference (PTD) learning. The PTD family is general and includes four new algorithms besides an extended version of LSTD, and LSPE. The per-time-step complexities of the four new PTD algorithms are the same with those of RLS-TD and LSPE, i.e., O(K2), where K is the number of features. We study the numerical stability and data efficiency of the PTD algorithms on an extended Boyan chain example. Experimental results show that, when the problem becomes ill-conditioned, the new PTD algorithms remain stable while LSTD() become instable; for Boyan’s original experimental settings, the new algorithms have similar data efficiency with LSTD, and converge faster than linear TD. Experiment on balancing the cart-pole shows that the number of trials needed by the PTD policy evaluation to accomplish the task is smaller than conventional AHC.
 Average overall recommendation 7.27 
 Download download file


bild
Criterion01Criterion02
 ReviewerID: 65535   7 10
Comments to author(s)
HIGH LEVEL COMMENTS:

This is a well-written paper that explores the use of preconditioning to improve least-squares methods for solving prediction problems (such as policy evaluation). The use of preconditioning is very standard in the study of large systems of linear equations (MATLAB makes extensive use of preconditioning).

This paper uses fairly conventional preconditioning methods, including Gauss-Seidel, Jacobi and SOR., all of whose properties are well documented in standard linear algebra texts (e.g Strang).

The basic idea is to solve Ax=b by splitting A into S-T, and then iteratively solving (S-T)x = b as Sx = Tx + b. The differences between preconditioners comes from how they chose to split into S and T. The rate of convergence of these methods can be shown to depend on the spectral properties of S^{-1}T.

Jacobi uses the diagonal terms of A as S, and moves off-diagonal elements to T. Gauss-Seidel uses the whole lower triangular part of A as S, and moves the upper triangular part to T. SOR is a variant of these.

One disappointing limitation of this paper is the lack of any theoretical analysis on the convergence properties of these different preconditioners. The author(s) claim that the standard results do not apply since "the coefficients A_{t+1}, b_{t+1}, and C_{t+1} are stochastic". I find this to be a specious argument, and urge them to undertake a more complete review of the matrix preconditioning literature (which is vast).

One is instead left with a set of numerical experiments on largely simple problems: a discrete chain and a continuous cart-pole task (with highly coarse discretization). These results predictably show improvements with preconditioning, which is hardly surprising given what we know already about matrix preconditioning, but are nonetheless nice to see. The graphs in my hardcopy were completely illegible, with some graphs (e.g Figure 2 and 3) containing dozens of poorly separated curves.

DETAILED COMMENTS:

1. It is somewhat strange to talk about LSTD as a "model-based" solver, since the A matrix could hardly be called a model in the conventional sense (it does not provide a distribution of future states conditioned on past states). Boyan himself admitted the character of A was unclear, speculating that it could be viewed as a "compressed model" on the feature space spanned by \phi. The paper adds to the confusion by sometimes talking about models of A (e.g. in the description of LSPE), which further clouds the issue. Why not simply characterize these as low-rank approximations of A, which is really what LSTD and LSPE (and all the dozen other variants) are doing.

2. Equation 6 should be preceded with a clearer description of classical matrix preconditioning (such as the description I have made above in terms of generically solving systems like Ax=b, before introducing it in the RL context).

3. The computational complexity of the various preconditioning methods is not specified. Table 1 should be expanded to include this information.

4. Figure 2 middle plot is completely useless. Can you really expect anyone to make sense of the confusing barrage of curves all lying one on top of another?

5. Figure 3 plot 1 is again bewildering. No less than 10 methods are compared, and no amount of colorization or labeling is going to make this easier. Is it necessary to display all the variant?

6. The experiments show that Gauss-Seidel seems to perform the best. Classically, this is hardly a surprising result since it is known that Gauss-Seidel is faster than Jacobi (for example for positive definite tridiagonal matrices). It would have been helpful to show results for random matrices, since for non-PD matrices, Jacobi methods can be faster sometimes.


Summary of review
This is a well-written paper that explores the use of preconditioning to improve least-squares methods for solving prediction problems (such as policy evaluation). The use of preconditioning is very standard in the study of large systems of linear equations (MATLAB makes extensive use of preconditioning).
It is nice to see they help improve LSTD, although that is hardly surprising. The lack of any theoretical analysis is a significant omission.




Authors response by Yao Heng shuai (2007-07-22 03:47:25) to review  

The theoretical analysis is indeed important. However, we did not include it in the current paper due to space limitations and we think readers may be more interested in the ideas and effectiveness of the proposed algorithms.

In fact, (1)We have proven the convergence of the Robbins-Monro procedure (Equation 8) for estimating $A,b$. The proof treated each entry of $A_t,b_t$ as some compressed information about the transition probability. The proof was much simpler and more understandable than those by Tadic (2001) and Bertsekas (2003) for infinite-horizon problems. We plan to mention these results when modifying our paper.

(2)The convergence and convergence rate of PTD algorithms are determined by the spectral radius of $I+\alpha (C_t^{-1}A_t)$. With proper $\alpha$ for some specific $C_t$, we can ensure the spectral radius smaller than $1$ w.p.1, which completes the convergence proof in a similar way to Bertsekas (2004). It is a well-known result in iterative analysis literature that the magnitude of this spectral radius is just the "w.p.1" convergence rate for PTD algorithms.

Review Evaluation: ++ outstanding review

The reviewer has a good understanding of preconditioning, and the literature of iterative anlaysis.


bild
Criterion01Criterion02
 ReviewerID: 982   8 10
Comments to author(s)
I was really hoping for a paper like this which applies well-known concepts from linear algebra to recent RL algorithms that end up solving linear systems. It is really interesting to see the connection between LSTD and LSPE, both of them being variants of a general preconditioning method. And indeed the proposed generic preconditioning approach offers new opportunities for many variants.

It is somewhat disappointing that the authors chose to stress the difference between the TD(lambda) and the PTD algorithms. It is a well-known fact that batch algorithms outperform incremental ones. What I was hoping to see is whether there are any new PTD algorithms which are better than LSTD and LSPE. My impression from what I see in the paper is that LSTD is by far the best (contrary to the author's claim that performance is similar to LSTD) and all other PTD algorithms (including LSPE) are somewhere between LSTD and TD. Of course, this is an issue that needs deeper study to draw a conclusion and comparison on problems harder than the simple chain world. In addition, it would be better to conduct such comparisons on pure RL prediction problems to exclude the effects of an actor.

The presentation is not the best possible. There is a lot of room to improve the use of language and make the text flow more naturally. The graphs are very small and in some cases there is no way to distinguish the different curves.

In the introduction, it is stated that "LSTD and LSPE build a model". This statement is not so accurate. LSTD and LSPE build "models" in the space of features which may be a lot different than the real state space. Only when these two spaces coincide one can talk about model in the typical sense. It would be much more accurate to say that "LSTD and LSPE build a compressed model in the space of features".

The introduction of the step size alpha in equation (7) is confusing. There is no such step size in equation (6) which is the base equation. Is there a justification for that? It is also strange that later (middle of Section 4) the PTD update is rewritten as a convex combination of the old and the new weights. Which one is true?

There are multiple references to "optimal deltas" used for the perturbation. How are these "optimal" values found? Are they just good choices or truly optimal?

Some detailed comments:

Abstract: "called the preconditioned temporal difference" -> "called preconditioned temporal difference"

Abstract: "instable" -> "unstable" (there are many occurrences throughout the paper)

Abstract: What is AHC?

Section 1: Equation (1) is TD(lambda) and not just TD. Small difference, but better be accurate. This should also be consistent throughout the entire paper.

Section 2: "and publications on discussing their relations are few" -> "and there are only a few publications discussing their relationship"

Section 2: "We follows with an overview" -> "We continue with an overview"

Section 2: Some examples from the class of iterative methods for linear systems should be given, similarly to the examples for the class of direct methods.

"Good preconditioner can avoid problems such as numerical instability suffered by matrix-inversion approach, and more efficient than direct approaches because of the iterative computation." -> "Good preconditioning can avoid numerical instabilities which are common in matrix-inversion approaches, and tends to be more efficient than direct approaches because of the iterative nature of computation."

Section 4: "is variant of" -> "variants of"

Section 5: "With the lookup table representation," - It is not clear what the features are in this case. It is better to say that indicator basis functions are used.

Section 5: "guarantee convergence is because it satisfies" -> "guarantee convergence because it satisfies"

Section 6: Why are there two graphs in Figure 5. All curves could be plotted on a single graph with a stretched scale on the y-axis.


Summary of review
I think this is a very good paper which offers new perspective on many recent RL algorithms. It will make a nice contribution to NIPS. There is space for improvement and I hope that the authors will take the time to improve their paper.



Authors response by Yao Heng shuai (2007-07-22 03:56:41) to review  

Simulations on Boyan's example were not meant to show that LSTD is advantageous than other PTD algorithms. In the experiment, LSTD was best on Boyan's example because a bad initialization of $w$ for all non-LSTD algorithms was used. However, as pointed by Bertsekas(2004), LSPE (and certainly, other PTD algorithms) can take advantage of a good initialization, while LSTD cannot. This is especially helpful when the policy is changed by policy mprovement. Moreover, it should be noted that the computational complexity of LSTD is $O(K^3)$ while non-LSTD algorithms are only $O(K^2)$, where K is the number of features.

Why we introduced the step size alpha in equation (7) is explained in the last passage of Section 3. The introduction of another (convex) form of PTD update in the middle of Section 4, is to relate it to standard iteration, and avoid the computation of matrix inversion. One can verify that this convex form is equivalent to the original PTD update (Equation 7).

We appreciate the numerous comments which we'll use to improve our paper presentation. We will modify the article carefully, make the presentation more logic, and delete useless words and sentences. We will spare no efforts in making the figures more clear and constructive. We will magnify them, delete some useless curves, or replace them with a table in case that many algorithms have to be compared.

Review Evaluation: ++ outstanding review

bild
Criterion01Criterion02
 ReviewerID: 2451   5 2
Comments to author(s)
This paper presents a new family of model-based temporal difference (TD) learning algorithms (preconditioned TD) that encompasses several previously suggested algorithms within a single framework. The algorithms use experience to estimate a model of the dynamics of the environment, learning both the model and the state or action values incrementally from experience. Instantiations of the algorithm are tested on an MDP chain problem first explored by Boyan, and the benchmark cart-pole problem.

Unfortunately, although I am quite familiar with reinforcement learning in general and TD in particular, I found this paper cryptic and almost impossible to follow. It starts with an onslaught of 7(!) undefined acronyms in the abstract, and continues with references and reliance on previous work which prevent the paper from being understood on its own.

Even taking into consideration the space limitation on NIPS papers, I think it is inexcusable that key ideas are not explained. Chief among these is the whole idea of preconditioning: what is the role of preconditioning, is it really done prior to value learning (from equations (7,8) it seems as if both are carried out simultaneously?), does it bear any relationship to preconditioning in animal experiments (such as sensory preconditioning) or is this just an unfortunate coincidence of names?

Importantly, the key constructs in the preconditioning algorithm are not explained in an intuitive way: what does C *mean*, and what is its function in learning? The fact that it can be set to the identity matrix makes it seem like learning can do without it as well as with it.. what functionality is assumed by different forms of C? What does it mean to use -A_{t+1} as a preconditioner?

Without such an explanation, it is pretty much impossible for me to understand the merits of the framework, and what it does and does not encompass. Why would I want to use it? And which form of C should I choose? I appreciate that careful reading of all the previous related work, and working out the equations (or writing simulations) myself might clarify these issues, but I expect a paper presenting a new method to try, at least, to explain its semantics rather than just lay out the algorithm.

The simulations shown do not help clarify the paper. FIrst, they are so small and crowded with curves, that they are practically ineligible. Second, comparing a model based method to model-free TD is hardly a fair comparison!

Due to my poor understanding of the qualitative merits of the proposed algorithm, I feel that I can not judge the quality, originality and significance of the paper. However, if I am a minority and other reviewers find this paper clear and understandable, my review should be wholly ignored.

Summary of review
Please summarize your review here in 1-2 sentences.



Authors response by Yao Heng shuai (2007-07-22 04:09:11) to review  

Preconditioning presented in this paper is a classical technique from numerical analysis or standard iterative analysis. It is not related to animal experiments, at least in this paper. The technique preconditions a linear system to make it more robust and easier to solve than matrix-inversion approach.

Although not presented in the submitted version, we have indeed developed some intuitions behind PTD. Researchers have noted experience replay can improve the data efficiency (Lin 1992). However, Lin's methods have to maintain a data set, which requires a significant amount of memory when a lot of experience is available. PTD algorithms, however, provide another way to replay the experience without maintaining a data set. These algorithms are thus more efficient than Lin's methods.

Review Evaluation:

We feel sorry for not providing enough materials for preconditioning due to space limitations. One of our research aims is to provide some new perspectiver in viewing LSTD and LSPE. It is also a good future direction to study the proposed points provided by the reviewer.

bild
ConfMaster.net The Conference Management System  Copyright (C) www.ConfMaster.net, 2002-2007.
bild