Intelligent Dynamics Lab | Publications (Jekyll feed, updated 2021-03-14T19:05:31-07:00) https://indylab.org/feed/publications.xml

CFR-DO: A Double Oracle Algorithm for Extensive-Form Games (2021-02-08) https://indylab.org/pub/McAleer2021CFRDO
<p>Policy Space Response Oracles (PSRO) is a deep reinforcement learning algorithm for two-player zero-sum games that has empirically found approximate Nash equilibria in large games. Although PSRO is guaranteed to converge to a Nash equilibrium, it may take an exponential number of iterations as the number of information states grows. We propose XDO, a new extensive-form double oracle algorithm that is guaranteed to converge to an approximate Nash equilibrium linearly in the number of infostates. Unlike PSRO, which mixes best responses at the root of the game, XDO mixes best responses at every infostate. We also introduce Neural XDO (NXDO), where the best response is learned through deep RL. In tabular experiments on Leduc poker, we find that XDO achieves an approximate Nash equilibrium in a number of iterations 1-2 orders of magnitude smaller than PSRO. In experiments on a modified Leduc poker game, we show that tabular XDO achieves over 11x lower exploitability than CFR and over 82x lower exploitability than PSRO and XFP in the same amount of time. We also show that NXDO beats PSRO and is competitive with NFSP on a large no-limit poker game.</p>

Pipeline PSRO: A Scalable Approach for Finding Approximate Nash Equilibria in Large Games (2020-12-07) https://indylab.org/pub/McAleer2020Pipeline
<p>Finding approximate Nash equilibria in zero-sum imperfect-information games is challenging when the number of information states is large. Policy Space Response Oracles (PSRO) is a deep reinforcement learning algorithm grounded in game theory that is guaranteed to converge to an approximate Nash equilibrium. However, PSRO requires training a reinforcement learning policy at each iteration, making it too slow for large games. We show through counterexamples and experiments that DCH and Rectified PSRO, two existing approaches to scaling up PSRO, fail to converge even in small games. We introduce Pipeline PSRO (P2SRO), the first scalable PSRO-based method for finding approximate Nash equilibria in large zero-sum imperfect-information games. P2SRO is able to parallelize PSRO with convergence guarantees by maintaining a hierarchical pipeline of reinforcement learning workers, each training against the policies generated by lower levels in the hierarchy.
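The double-oracle pattern that PSRO-style methods share can be sketched on a tiny normal-form game: repeatedly solve a restricted game over the strategies found so far, then ask an oracle for each player's best response in the full game, and stop when no new strategy is added. This is an illustrative sketch only, not any of these papers' implementations; the fictitious-play restricted-game solver below is a simple stand-in for the meta-Nash computation.

```python
# Double-oracle sketch on rock-paper-scissors (row player's payoff matrix).
A = [[0, -1, 1],
     [1, 0, -1],
     [-1, 1, 0]]

def solve_restricted(rows, cols, iters=3000):
    """Approximate Nash of the restricted game via fictitious play."""
    rc = {r: 1 for r in rows}   # empirical play counts, seeded at 1
    cc = {c: 1 for c in cols}
    for _ in range(iters):
        # each player best-responds to the opponent's empirical mixture
        br_r = max(rows, key=lambda r: sum(cc[c] * A[r][c] for c in cols))
        br_c = max(cols, key=lambda c: -sum(rc[r] * A[r][c] for r in rows))
        rc[br_r] += 1
        cc[br_c] += 1
    tr, tc = sum(rc.values()), sum(cc.values())
    return ({r: n / tr for r, n in rc.items()},
            {c: n / tc for c, n in cc.items()})

def double_oracle():
    rows, cols = [0], [0]   # start from one arbitrary pure strategy each
    while True:
        x, y = solve_restricted(rows, cols)
        # oracle step: best responses over the FULL strategy space
        br_r = max(range(3), key=lambda r: sum(p * A[r][c] for c, p in y.items()))
        br_c = max(range(3), key=lambda c: -sum(p * A[r][c] for r, p in x.items()))
        grew = False
        if br_r not in rows:
            rows.append(br_r); grew = True
        if br_c not in cols:
            cols.append(br_c); grew = True
        if not grew:        # no player gains a new best response: done
            return x, y
```

On rock-paper-scissors the loop discovers all three pure strategies and returns mixtures near uniform; the abstracts above differ precisely in where this mixing happens (at the game root for PSRO, per infostate for XDO) and in how the oracle and solver steps are parallelized (P2SRO's pipeline of workers).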
We show that unlike existing methods, P2SRO converges to an approximate Nash equilibrium, and does so faster as the number of parallel workers increases, across a variety of imperfect-information games. We also introduce an open-source environment for Barrage Stratego, a variant of Stratego with an approximate game tree complexity of 10<sup>50</sup>. P2SRO is able to achieve state-of-the-art performance on Barrage Stratego and beats all existing bots. Experiment code is available at https://github.com/JBLanier/pipeline-psro.</p>

Hierarchical Variational Imitation Learning of Control Programs (2019-12-29) https://indylab.org/pub/Fox2019Hierarchical
<p>Autonomous agents can learn by imitating teacher demonstrations of the intended behavior. Hierarchical control policies are ubiquitously useful for such learning, having the potential to break down structured tasks into simpler sub-tasks, thereby improving data efficiency and generalization.
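The idea of a policy structured as procedures invoking sub-procedures can be illustrated with a toy interpreter; the procedure names and step format below are invented for illustration and are far simpler than the PHP model in the paper.

```python
# Toy hierarchical control structure: a procedure is a list of steps,
# each either a primitive action ("act") or a call to a sub-procedure.
procedures = {
    "main": [("call", "pick"), ("call", "place")],
    "pick": [("act", "reach"), ("act", "grasp")],
    "place": [("act", "move"), ("act", "release")],
}

def run(procedures, name, trace=None):
    # depth-first execution mirrors a call stack of procedure invocations
    trace = [] if trace is None else trace
    for kind, arg in procedures[name]:
        if kind == "act":
            trace.append(arg)             # emit a primitive action
        else:
            run(procedures, arg, trace)   # descend into a sub-procedure
    return trace
```

Imitation learning in this setting amounts to inferring which latent procedure calls and terminations best explain flat action traces like `run(procedures, "main")`, which is what the variational method described next targets.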
In this paper, we propose a variational inference method for imitation learning of a control policy represented by parametrized hierarchical procedures (PHP), a program-like structure in which procedures can invoke sub-procedures to perform sub-tasks. Our method discovers the hierarchical structure in a dataset of observation-action traces of teacher demonstrations by learning an approximate posterior distribution over the latent sequence of procedure calls and terminations. Samples from this learned distribution then guide the training of the hierarchical control policy. We identify and demonstrate a novel benefit of variational inference in the context of hierarchical imitation learning: in decomposing the policy into simpler procedures, inference can leverage acausal information that is unused by other methods. Training PHP with variational inference outperforms LSTM baselines in terms of data efficiency and generalization, requiring less than half as much data to achieve a 24% error rate in executing the bubble sort algorithm, and to achieve no error in executing Karel programs.</p>

Toward Provably Unbiased Temporal-Difference Value Estimation (2019-12-14) https://indylab.org/pub/Fox2019Toward
<p>Temporal-difference learning algorithms, such as Q-learning, maintain and iteratively improve an estimate of the value that an agent can expect to gain in interaction with its environment. Unfortunately, the value updates in Q-learning induce a positive bias that causes it to overestimate this value. Several algorithms, such as Soft Q-learning, regularize the value updates to reduce this bias, but none provide a principled schedule for their regularizers, one in which updates are more agnostic early in the learning process and increasingly trust the value estimates as they become more certain during learning.
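As background for the entropy-regularized updates discussed here, this is a minimal tabular form of the soft (log-sum-exp) value backup used by Soft Q-learning; the fixed temperature `tau` below is exactly the kind of regularization coefficient a principled schedule would adapt. This is a generic sketch, not the paper's algorithm.

```python
import math

def soft_value(q_row, tau):
    # entropy-regularized value: V(s) = tau * log sum_a exp(Q(s, a) / tau),
    # computed stably; approaches max_a Q(s, a) as tau -> 0
    m = max(q_row)
    return m + tau * math.log(sum(math.exp((q - m) / tau) for q in q_row))

def soft_q_update(Q, s, a, r, s2, alpha=0.1, gamma=0.99, tau=1.0):
    # one temporal-difference update toward the soft Bellman target
    target = r + gamma * soft_value(Q[s2], tau)
    Q[s][a] += alpha * (target - Q[s][a])
```

Because the log-sum-exp is a smooth upper bound on the max, larger `tau` makes updates more agnostic among actions, while `tau -> 0` recovers the standard (overestimation-prone) max backup; the paper's contribution is a closed-form choice of this coefficient.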
In this paper, we present a closed-form expression for the regularization coefficient that completely eliminates bias in entropy-regularized value updates, and illustrate this theoretical analysis using a proof-of-concept algorithm that approximates the conditions for unbiased value estimation.</p>

AutoPandas: Neural-Backed Generators for Program Synthesis (2019-10-25) https://indylab.org/pub/Bavishi2019AutoPandas
<p>Developers nowadays have to contend with a growing number of APIs. While they are very useful to developers in the long term, many modern APIs, with their hundreds of functions handling many arguments, obscure documentation, and frequently changing semantics, have an incredibly steep learning curve. For APIs that perform data transformations, novices can often provide an I/O example demonstrating the desired transformation, but are stuck on how to translate it to the API. A programming-by-example synthesis engine that takes such I/O examples and directly produces programs in the target API could help such novices.
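Programming-by-example over an API can be illustrated with a toy enumerative engine over a hand-picked set of list transformations; the stand-in "API" and brute-force search below are invented for illustration and are far simpler than AutoPandas's neural-backed generators over pandas.

```python
from itertools import product

# Stand-in "API": a few unary list transformations.
API = {
    "reverse": lambda xs: list(reversed(xs)),
    "sort": sorted,
    "dedup": lambda xs: list(dict.fromkeys(xs)),
}

def synthesize(inp, out, max_len=2):
    # enumerate compositions of API calls up to max_len, returning the
    # first sequence whose output matches the given I/O example
    for n in range(1, max_len + 1):
        for names in product(API, repeat=n):
            val = inp
            for name in names:
                val = API[name](val)
            if val == out:
                return list(names)
    return None   # no program in the search space fits the example
```

Here the search is exhaustive over every decision point; the generator-based approach described next instead encodes constraints on the program space and lets learned operators guide which branch to try first.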
Such an engine presents unique challenges due to the breadth of real-world APIs and the often-complex constraints over function arguments. We present a generator-based synthesis approach to contend with these problems. This approach uses a program candidate generator, which encodes basic constraints on the space of programs. We introduce neural-backed operators which can be seamlessly integrated into the program generator. To improve the efficiency of the search, we use these operators at non-deterministic decision points instead of relying on domain-specific heuristics. We implement this technique for the Python pandas library in AutoPandas. AutoPandas supports 119 pandas dataframe transformation functions. We evaluate AutoPandas on 26 real-world benchmarks and find it solves 17 of them.</p>

Multi-Task Hierarchical Imitation Learning for Home Automation (2019-08-25) https://indylab.org/pub/Fox2019Multi
<p>Control policies for home automation robots can be learned from human demonstrations, and hierarchical control has the potential to reduce the required number of demonstrations.
When learning multiple policies for related tasks, demonstrations can be reused between the tasks to further reduce the number of demonstrations needed to learn each new policy. We present HIL-MT, a framework for Multi-Task Hierarchical Imitation Learning, involving a human teacher, a networked Toyota HSR robot, and a cloud-based server that stores demonstrations and trains models. In our experiments, HIL-MT learns a policy for clearing a table of dishes from 11.2 demonstrations on average. Learning to set the table requires 19 new demonstrations when training separately, but only 11.6 new demonstrations when also reusing demonstrations of clearing the table. HIL-MT learns policies for building 3- and 4-level pyramids of glass cups from 8.2 and 5 demonstrations, respectively, but reusing the 3-level demonstrations for learning a 4-level policy only requires 2.7 new demonstrations. These results suggest that learning hierarchical policies for structured domestic tasks can reuse existing demonstrations of related tasks to reduce the need for new demonstrations.</p>

Multi-Task Learning via Task Multi-Clustering (2019-06-15) https://indylab.org/pub/Yan2019Multi
<p>Multi-task learning has the potential to facilitate learning of shared representations between tasks, leading to better task performance. Some sets of tasks are related, and can share many features that are useful latent representations for these tasks. Other sets of tasks are less related, possibly sharing some features, but also competing for the representational resources of shared parameters. We propose to discover how to share parameters between related tasks and split parameters between conflicting tasks by learning a multi-clustering of the tasks. We present a mixture-of-experts model, where each cluster is an expert that extracts a feature vector from the input, and each task belongs to a set of clusters whose experts it can mix.
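The mixing scheme just described can be sketched as follows; the experts, task names, and cluster assignments are invented for illustration, with fixed random linear maps standing in for learned feature extractors.

```python
import random

def make_expert(seed, dim=4):
    # each cluster's expert is a fixed random linear feature extractor
    rng = random.Random(seed)
    w = [rng.uniform(-1, 1) for _ in range(dim)]
    return lambda x: [wi * x for wi in w]

experts = {c: make_expert(c) for c in range(3)}      # 3 clusters
task_clusters = {"taskA": [0, 1], "taskB": [1, 2]}   # a multi-clustering

def features(task, x, weights=None):
    # mix only the experts of the clusters this task belongs to
    cs = task_clusters[task]
    if weights is None:
        weights = [1.0 / len(cs)] * len(cs)   # uniform mixing by default
    outs = [experts[c](x) for c in cs]
    return [sum(w * o[i] for w, o in zip(weights, outs))
            for i in range(len(outs[0]))]
```

Here "taskA" and "taskB" share cluster 1's parameters while keeping clusters 0 and 2 private, so related tasks pool representational capacity and conflicting tasks split it.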
In experiments on the CIFAR-100 MTL domain, multi-clustering outperforms a model that mixes all experts in both accuracy and computation time. The results suggest that the performance of our method is robust to regularization that increases the model’s sparsity when sufficient data is available, and can benefit from sparser models as data becomes scarcer.</p>